100+ datasets found
  1. wikidata

    • huggingface.co
    Updated Apr 3, 2025
    Cite
    Philippe Saadé (2025). wikidata [Dataset]. https://huggingface.co/datasets/philippesaade/wikidata
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 3, 2025
    Authors
    Philippe Saadé
    License

    https://choosealicense.com/licenses/cc0-1.0/

    Description

    Wikidata Entities Connected to Wikipedia

    This dataset is a multilingual, JSON-formatted version of the Wikidata dump from September 18, 2024. It only includes Wikidata entities that are connected to a Wikipedia page in any language. A total of 112,467,802 entities are included in the original data dump, of which 30,072,707 are linked to a Wikipedia page (26.73% of all entities have at least one Wikipedia sitelink).

    Curated by: Jonathan Fraine & Philippe Saadé, Wikimedia Deutschland… See the full description on the dataset page: https://huggingface.co/datasets/philippesaade/wikidata.
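
    Since the full multilingual dump is large, one way to sample it is to stream it with the Hugging Face datasets library. This is a minimal sketch, not the dataset's documented usage: the "train" split name and the record layout are assumptions, so check the dataset page for the actual schema.

    from datasets import load_dataset

    # Stream records instead of downloading the whole dump up front.
    # (Assumes the default configuration exposes a "train" split.)
    ds = load_dataset("philippesaade/wikidata", streaming=True)
    first = next(iter(ds["train"]))
    print(first)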

  2. Wikidata Entities of Interest

    • opensanctions.org
    csv
    Updated Dec 6, 2024
    Cite
    Wikidata (2024). Wikidata Entities of Interest [Dataset]. https://www.opensanctions.org/datasets/wd_curated/
    Explore at:
    Available download formats: csv
    Dataset updated
    Dec 6, 2024
    Dataset authored and provided by
    Wikidata (https://wikidata.org/)
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Persons of interest profiles from Wikidata, the structured data version of Wikipedia.

  3. Wikidata

    • bioregistry.io
    Updated Nov 13, 2021
    Cite
    (2021). Wikidata [Dataset]. http://identifiers.org/biolink:WIKIDATA
    Explore at:
    Dataset updated
    Nov 13, 2021
    License

    https://bioregistry.io/spdx:CC0-1.0

    Description

    Wikidata is a collaboratively edited knowledge base operated by the Wikimedia Foundation. It is intended to provide a common source of certain types of data which can be used by Wikimedia projects such as Wikipedia. Wikidata functions as a document-oriented database, centred on individual items. Items represent topics, for which basic information is stored that identifies each topic.

  4. Wikidata PageRank

    • danker.s3.amazonaws.com
    Updated Jun 14, 2025
    Cite
    Andreas Thalhammer (2025). Wikidata PageRank [Dataset]. https://danker.s3.amazonaws.com/index.html
    Explore at:
    Available download formats: tsv, application/n-triples, application/vnd.hdt, ttl
    Dataset updated
    Jun 14, 2025
    Authors
    Andreas Thalhammer
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0), https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    Regularly published dataset of PageRank scores for Wikidata entities. The underlying link graph is formed by a union of all links across all Wikipedia language editions. Computation is performed by Andreas Thalhammer with 'danker', available at https://github.com/athalhammer/danker. If you find the downloads here useful, please feel free to leave a GitHub ⭐ on the repository and buy me a ☕: https://www.buymeacoffee.com/thalhamm
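
    A minimal sketch of loading one of the TSV downloads into a score lookup. The local filename and the two-column layout (entity identifier, PageRank score) are assumptions, so check the actual files on the bucket.

    import csv

    # Hypothetical local filename; assumes tab-separated rows of
    # (entity identifier, PageRank score).
    scores = {}
    with open("wikidata-pagerank.tsv", encoding="utf-8", newline="") as f:
        for row in csv.reader(f, delimiter="\t"):
            if len(row) < 2:
                continue
            scores[row[0]] = float(row[1])

    # Print the five highest-ranked entities.
    for entity, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:5]:
        print(entity, score)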

  5. wikidata-20220103-all.json.gz

    • academictorrents.com
    bittorrent
    Updated Jan 24, 2022
    Cite
    wikidata.org (2022). wikidata-20220103-all.json.gz [Dataset]. https://academictorrents.com/details/229cfeb2331ad43d4706efd435f6d78f40a3c438
    Explore at:
    Available download formats: bittorrent (109042925619)
    Dataset updated
    Jan 24, 2022
    Dataset provided by
    Wikidata (https://wikidata.org/)
    License

    CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    A BitTorrent file to download data with the title 'wikidata-20220103-all.json.gz'
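
    The full Wikidata JSON dump is a single JSON array with one entity object per line, so it can be processed in a stream without decompressing or loading it whole. A minimal sketch, assuming the torrent has been downloaded to a local file of the same name:

    import gzip
    import json

    def iter_entities(path):
        """Yield entity dicts from a wikidata-*-all.json.gz dump, one per line."""
        with gzip.open(path, "rt", encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line or line in ("[", "]"):
                    continue
                yield json.loads(line.rstrip(","))

    for entity in iter_entities("wikidata-20220103-all.json.gz"):
        label = entity.get("labels", {}).get("en", {}).get("value")
        print(entity["id"], label)
        break  # just inspect the first entity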

  6. wikidata-en-descriptions

    • huggingface.co
    Updated Aug 5, 2023
    + more versions
    Cite
    Daniel Erenrich (2023). wikidata-en-descriptions [Dataset]. https://huggingface.co/datasets/derenrich/wikidata-en-descriptions
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 5, 2023
    Authors
    Daniel Erenrich
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    derenrich/wikidata-en-descriptions dataset hosted on Hugging Face and contributed by the HF Datasets community

  7. Wikidata Politically Exposed Persons

    • opensanctions.org
    Updated Aug 9, 2025
    Cite
    Wikidata (2025). Wikidata Politically Exposed Persons [Dataset]. https://www.opensanctions.org/datasets/wd_peps/
    Explore at:
    Available download formats: application/json+ftm
    Dataset updated
    Aug 9, 2025
    Dataset authored and provided by
    Wikidata (https://wikidata.org/)
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Profiles of politically exposed persons from Wikidata, the structured data version of Wikipedia.

  8. Wikidata Persons in Relevant Categories

    • opensanctions.org
    Updated Aug 13, 2025
    Cite
    Wikidata (2025). Wikidata Persons in Relevant Categories [Dataset]. https://www.opensanctions.org/datasets/wd_categories/
    Explore at:
    Dataset updated
    Aug 13, 2025
    Dataset authored and provided by
    Wikidata (https://wikidata.org/)
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Category-based imports from Wikidata, the structured data version of Wikipedia.

  9. wikiData

    • huggingface.co
    Updated Jul 6, 2024
    + more versions
    Cite
    Aditya (2024). wikiData [Dataset]. https://huggingface.co/datasets/aditya998/wikiData
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 6, 2024
    Authors
    Aditya
    Description

    aditya998/wikiData dataset hosted on Hugging Face and contributed by the HF Datasets community

  10. Wikidata dump from 2018-12-17 in JSON

    • data.niaid.nih.gov
    Updated Jan 15, 2021
    Cite
    Škoda, Petr (2021). Wikidata dump from 2018-12-17 in JSON [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4436355
    Explore at:
    Dataset updated
    Jan 15, 2021
    Dataset provided by
    Škoda, Petr
    Klímek, Jakub
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is a dump from Wikidata from 2018-12-17 in JSON. This dump is no longer available from Wikidata. It was downloaded originally from https://dumps.wikimedia.org/other/wikidata/20181217.json.gz and recompressed to fit on Zenodo.

  11. Wikidata Companies Graph

    • data.hellenicdataservice.gr
    Updated Jun 20, 2019
    Cite
    (2019). Wikidata Companies Graph [Dataset]. https://data.hellenicdataservice.gr/dataset/f7341a62-a513-4931-99b8-0e302dc46d66
    Explore at:
    Dataset updated
    Jun 20, 2019
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains information about commercial organizations (companies) and their relations with other commercial organizations, persons, products, locations, groups and industries. The dataset has the form of a graph. It has been produced by the SmartDataLake project (https://smartdatalake.eu), using data collected from Wikidata (https://www.wikidata.org).

  12. wikidata-20240701-all.json.bz2

    • academictorrents.com
    bittorrent
    Updated Aug 30, 2024
    + more versions
    Cite
    Wikidata Contributors (2024). wikidata-20240701-all.json.bz2 [Dataset]. https://academictorrents.com/details/dc083577b9f773ef0d41a3eba21b8694d5a56e99
    Explore at:
    Available download formats: bittorrent (89940529332)
    Dataset updated
    Aug 30, 2024
    Dataset provided by
    Wikidata (https://wikidata.org/)
    Authors
    Wikidata Contributors
    License

    No license specified (https://academictorrents.com/nolicensespecified)

    Description

    A BitTorrent file to download data with the title 'wikidata-20240701-all.json.bz2'

  13. Wikidata item quality labels

    • figshare.com
    txt
    Updated May 31, 2023
    Cite
    Glorian Yapinus; Amir Sarabadani; Aaron Halfaker (2023). Wikidata item quality labels [Dataset]. http://doi.org/10.6084/m9.figshare.5035796.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    May 31, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Glorian Yapinus; Amir Sarabadani; Aaron Halfaker
    License

    CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset contains quality labels for 5000 Wikidata items applied by Wikidata editors. The labels correspond to the quality scale described at https://www.wikidata.org/wiki/Wikidata:Item_quality

    Each line is a JSON blob with the following fields:

    • item_quality: The labeled quality class (A-E)
    • rev_id: The revision identifier of the version of the item that was labeled
    • strata: The size of the item in bytes at the time it was sampled
    • page_len: The actual size of the item in bytes
    • page_title: The Qid of the item
    • claims: A dictionary including P31 "instance-of" values for filtering out certain types of items

    The number of observations by class is:

    • A class: 322
    • B class: 438
    • C class: 1773
    • D class: 997
    • E class: 1470
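
    Because each line is a standalone JSON object, the label file can be tallied with a few lines of Python. A minimal sketch; the local filename is hypothetical:

    import json
    from collections import Counter

    counts = Counter()
    with open("item_quality_labels.txt", encoding="utf-8") as f:  # hypothetical filename
        for line in f:
            record = json.loads(line)
            counts[record["item_quality"]] += 1

    # Expected to roughly match the class counts listed above
    # (A: 322, B: 438, C: 1773, D: 997, E: 1470).
    print(counts)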

  14. Wikidata Reference

    • figshare.com
    application/gzip
    Updated Mar 17, 2025
    + more versions
    Cite
    Sven Hertling; Nandana Mihindukulasooriya (2025). Wikidata Reference [Dataset]. http://doi.org/10.6084/m9.figshare.28602170.v2
    Explore at:
    Available download formats: application/gzip
    Dataset updated
    Mar 17, 2025
    Dataset provided by
    figshare
    Authors
    Sven Hertling; Nandana Mihindukulasooriya
    License

    MIT License, https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Summary

    The Triple-to-Text Alignment dataset aligns Knowledge Graph (KG) triples from Wikidata with diverse, real-world textual sources extracted from the web. Unlike previous datasets that rely primarily on Wikipedia text, this dataset provides a broader range of writing styles, tones, and structures by leveraging Wikidata references from various sources such as news articles, government reports, and scientific literature. Large language models (LLMs) were used to extract and validate text spans corresponding to KG triples, ensuring high-quality alignments. The dataset can be used for training and evaluating relation extraction (RE) and knowledge graph construction systems.

    Data Fields

    Each row in the dataset consists of the following fields:

    • subject (str): The subject entity of the knowledge graph triple.
    • rel (str): The relation that connects the subject and object.
    • object (str): The object entity of the knowledge graph triple.
    • text (str): A natural language sentence that entails the given triple.
    • validation (str): LLM-based validation results, including: Fluent Sentence(s): TRUE/FALSE; Subject mentioned in Text: TRUE/FALSE; Relation mentioned in Text: TRUE/FALSE; Object mentioned in Text: TRUE/FALSE; Fact Entailed By Text: TRUE/FALSE; Final Answer: TRUE/FALSE.
    • reference_url (str): URL of the web source from which the text was extracted.
    • subj_qid (str): Wikidata QID for the subject entity.
    • rel_id (str): Wikidata Property ID for the relation.
    • obj_qid (str): Wikidata QID for the object entity.

    Dataset Creation

    The dataset was created through the following process:

    1. Triple-Reference Sampling and Extraction: All relations from Wikidata were extracted using SPARQL queries. A sample of KG triples with associated reference URLs was collected for each relation.
    2. Domain Analysis and Web Scraping: URLs were grouped by domain, and sampled pages were analyzed to determine their primary language. English-language web pages were scraped and processed to extract plaintext content.
    3. LLM-Based Text Span Selection and Validation: LLMs were used to identify text spans from web content that correspond to KG triples. A Chain-of-Thought (CoT) prompting method was applied to validate whether the extracted text entailed the triple. The validation process included checking for fluency, subject mention, relation mention, object mention, and final entailment.
    4. Final Dataset Statistics: 12.5K Wikidata relations were analyzed, leading to 3.3M triple-reference pairs. After filtering for English content, 458K triple-web content pairs were processed with LLMs. 80.5K validated triple-text alignments were included in the final dataset.
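
    A minimal sketch of filtering the rows whose LLM validation reached a positive final answer. The download is distributed as application/gzip; that it unpacks to a single delimited table readable by pandas, and the local filename, are assumptions.

    import pandas as pd

    # Hypothetical local filename; the figshare download may be named differently.
    df = pd.read_csv("wikidata_reference.csv.gz", compression="gzip")

    # Keep only alignments whose validation string ends with "Final Answer: TRUE".
    validated = df[df["validation"].str.contains("Final Answer: TRUE", na=False)]
    print(len(validated), "validated triple-text alignments")
    print(validated[["subject", "rel", "object", "text"]].head())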

  15. wikidata-enwiki-categories-and-statements

    • huggingface.co
    Updated Mar 26, 2025
    Cite
    Daniel Erenrich (2025). wikidata-enwiki-categories-and-statements [Dataset]. https://huggingface.co/datasets/derenrich/wikidata-enwiki-categories-and-statements
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 26, 2025
    Authors
    Daniel Erenrich
    License

    Attribution 3.0 (CC BY 3.0), https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Description

    derenrich/wikidata-enwiki-categories-and-statements dataset hosted on Hugging Face and contributed by the HF Datasets community

  16. Wikidata Graph Pattern Benchmark (WGPB) for RDF/SPARQL

    • zenodo.org
    • data.niaid.nih.gov
    bz2, zip
    Updated Jan 11, 2021
    Cite
    Aidan Hogan; Cristian Riveros; Carlos Rojas; Adrián Soto (2021). Wikidata Graph Pattern Benchmark (WGPB) for RDF/SPARQL [Dataset]. http://doi.org/10.5281/zenodo.4035223
    Explore at:
    Available download formats: zip, bz2
    Dataset updated
    Jan 11, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Aidan Hogan; Cristian Riveros; Carlos Rojas; Adrián Soto
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Wikidata Graph Pattern Benchmark (WGPB) is a benchmark consisting of 50 instances of 17 different abstract query patterns giving a total of 850 SPARQL queries. The goal of the benchmark is to test the performance of query engines for more complex basic graph patterns. The benchmark was designed for evaluating worst-case optimal join algorithms but also serves as a general-purpose benchmark for evaluating (basic) graph patterns. The queries are provided in SPARQL syntax and all return at least one solution. We limit the number of results returned to a maximum of 1,000.

    Queries

    We provide an example of a "square" basic graph pattern (comments are added here for readability):

    SELECT * WHERE { 
     ?x1 <http://www.wikidata.org/prop/direct/P149> ?x2 . # architectural style
     ?x2 <http://www.wikidata.org/prop/direct/P1269> ?x3 . # facet of
     ?x3 <http://www.wikidata.org/prop/direct/P156> ?x4 . # followed by
     ?x1 <http://www.wikidata.org/prop/direct/P135> ?x4 . # movement
    } LIMIT 1000

    There are 49 other queries similar to this one in the dataset (replacing the predicates with other predicates), and 50 queries for 16 other abstract query patterns. For more details on these patterns, we refer to the publication mentioned below.

    Note that you can try the queries on the public Wikidata Query Service, though some might give a timeout.
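
    For example, the "square" query above can be sent to the public endpoint directly. This is a minimal sketch using plain HTTP, not any tooling shipped with the benchmark:

    import requests

    query = """
    SELECT * WHERE {
      ?x1 <http://www.wikidata.org/prop/direct/P149> ?x2 .
      ?x2 <http://www.wikidata.org/prop/direct/P1269> ?x3 .
      ?x3 <http://www.wikidata.org/prop/direct/P156> ?x4 .
      ?x1 <http://www.wikidata.org/prop/direct/P135> ?x4 .
    } LIMIT 1000
    """

    resp = requests.get(
        "https://query.wikidata.org/sparql",
        params={"query": query, "format": "json"},
        headers={"User-Agent": "wgpb-example/0.1"},  # WDQS asks for a descriptive User-Agent
        timeout=60,
    )
    resp.raise_for_status()
    for binding in resp.json()["results"]["bindings"][:5]:
        print({k: v["value"] for k, v in binding.items()})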

    Generation

    The queries were generated over a reduced version of the Wikidata truthy dump from November 15, 2018 that we call the Wikidata Core Graph (WCG). Specifically, in order to reduce the data volume, multilingual labels, comments, etc., were removed as they have limited use for evaluating joins (English labels were kept under schema:name). Thereafter, in order to facilitate the generation of the queries, triples with rare predicates appearing in fewer than 1,000 triples, and very common predicates appearing in more than 1,000,000 triples, were removed. The queries provided will generate the same results over both graphs.

    Files

    In this dataset, we then include three files:

    • wgpb-queries.zip The list of 850 queries
    • wikidata-wcg.nt.gz Wikidata truthy graph with English labels
    • wikidata-wcg-filtered.nt.bz2 Wikidata truthy graph with English labels filtering triples with rare (<1000 triples) and very common (>1000000) predicates

    Code

    We provide the code for generating the datasets, queries, etc., along with scripts and instructions on how to run these queries in a variety of SPARQL engines (Blazegraph, Jena, Virtuoso and our worst-case optimal variant of Jena).

    Publication

    The benchmark is proposed, described and used in the following paper, where you can find more details about how it was generated, the 17 abstract patterns that were used, and results for prominent SPARQL engines.

    • Aidan Hogan, Cristian Riveros, Carlos Rojas and Adrián Soto. "A Worst-Case Optimal Join Algorithm for SPARQL". In the Proceedings of the 18th International Semantic Web Conference (ISWC), Auckland, New Zealand, October 26–30, 2019.

  17. wikidata-20180813-all.json.bz2

    • data.niaid.nih.gov
    Updated Jan 24, 2020
    Cite
    Wikidata (2020). wikidata-20180813-all.json.bz2 [Dataset]. https://data.niaid.nih.gov/resources?id=ZENODO_3268724
    Explore at:
    Dataset updated
    Jan 24, 2020
    Dataset authored and provided by
    Wikidata (https://wikidata.org/)
    License

    CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    A copy of a dump which was available from Wikimedia: https://dumps.wikimedia.org/wikidatawiki/entities/

  18. Wikidata

    • web.archive.org
    full json dump +3
    Updated Oct 23, 2018
    Cite
    Wikimedia (2018). Wikidata [Dataset]. https://www.wikidata.org/wiki/Wikidata:Data_access
    Explore at:
    Available download formats: simplified ("truthy") RDF N-Triples dump, SPARQL endpoint, full JSON dump, full RDF Turtle dump
    Dataset updated
    Oct 23, 2018
    Dataset provided by
    Wikimedia Foundation (http://www.wikimedia.org/)
    License

    CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Wikidata offers a wide range of general data about our universe as well as links to other databases. The data is published under the CC0 "Public domain dedication" license. It can be edited by anyone and is maintained by Wikidata's editor community.
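
    Beyond the bulk dumps and the SPARQL endpoint, individual items can also be fetched as JSON over the linked-data interface. A minimal sketch retrieving one entity (Q42, Douglas Adams):

    import requests

    url = "https://www.wikidata.org/wiki/Special:EntityData/Q42.json"
    resp = requests.get(url, headers={"User-Agent": "wikidata-example/0.1"}, timeout=30)
    resp.raise_for_status()

    entity = resp.json()["entities"]["Q42"]
    print(entity["labels"]["en"]["value"])                    # "Douglas Adams"
    print(len(entity.get("claims", {})), "properties with statements")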

  19. Wikidata Dump english_dump

    • explore.openaire.eu
    Updated Dec 19, 2022
    + more versions
    Cite
    Benno Fünfstück (2022). Wikidata Dump english_dump [Dataset]. http://doi.org/10.5281/zenodo.7458652
    Explore at:
    Dataset updated
    Dec 19, 2022
    Authors
    Benno Fünfstück
    Description

    RDF dump of Wikidata produced with wdumper. Entity count: 99962315; statement count: 1424821118; triple count: 1924502485.

  20. Wikidata Causal Event Triple Data

    • data.niaid.nih.gov
    • zenodo.org
    Updated Feb 7, 2023
    Cite
    Sola (2023). Wikidata Causal Event Triple Data [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7196048
    Explore at:
    Dataset updated
    Feb 7, 2023
    Dataset provided by
    Oktie
    Debarun
    Sola
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains triples curated from Wikidata surrounding news events with causal relations, and is released as part of our WWW'23 paper, "Event Prediction using Case-Based Reasoning over Knowledge Graphs".

    Starting from a set of classes that we consider to be types of "events", we queried Wikidata to collect entities that were an instanceOf an event class and that were connected to another such event entity by a causal triple (https://www.wikidata.org/wiki/Wikidata:List_of_properties/causality). For all such cause-effect event pairs, we then collected a 3-hop neighborhood of outgoing triples.
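
    A minimal sketch of the kind of query involved, using "has effect" (P1542) as one illustrative causal property and "occurrence" (Q1190554) as a stand-in event class; the authors' actual event classes and property set (see the causality property list linked above) may differ, and broad queries like this can time out on the public endpoint.

    import requests

    query = """
    SELECT ?cause ?effect WHERE {
      ?cause wdt:P31/wdt:P279* wd:Q1190554 .
      ?effect wdt:P31/wdt:P279* wd:Q1190554 .
      ?cause wdt:P1542 ?effect .
    } LIMIT 100
    """

    resp = requests.get(
        "https://query.wikidata.org/sparql",
        params={"query": query, "format": "json"},
        headers={"User-Agent": "causal-triples-example/0.1"},
        timeout=60,
    )
    resp.raise_for_status()
    for row in resp.json()["results"]["bindings"]:
        print(row["cause"]["value"], "->", row["effect"]["value"])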
