100+ datasets found
  1. wikidata

    • huggingface.co
    Updated Apr 3, 2025
    Cite
    Philippe Saadé (2025). wikidata [Dataset]. https://huggingface.co/datasets/philippesaade/wikidata
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Apr 3, 2025
    Authors
    Philippe Saadé
    License

    CC0 1.0 (https://choosealicense.com/licenses/cc0-1.0/)

    Description

    Wikidata Entities Connected to Wikipedia

    This dataset is a multilingual, JSON-formatted version of the Wikidata dump from September 18, 2024. It only includes Wikidata entities that are connected to a Wikipedia page in any language. A total of 112,467,802 entities are included in the original data dump, of which 30,072,707 are linked to a Wikipedia page (26.73% of all entities have at least one Wikipedia sitelink).

    Curated by: Jonathan Fraine & Philippe Saadé, Wikimedia Deutschland… See the full description on the dataset page: https://huggingface.co/datasets/philippesaade/wikidata.
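
    If the dataset can be loaded through the standard Hugging Face datasets library, a minimal sketch for streaming records looks like the following; the split name "train" is an assumption, not confirmed by the listing here.

    ```python
    # Hedged sketch: stream entities from the Hugging Face dataset without
    # downloading the full dump. The split name "train" is an assumption.
    from datasets import load_dataset

    ds = load_dataset("philippesaade/wikidata", split="train", streaming=True)
    for record in ds:
        print(record)  # one JSON-formatted Wikidata entity per record
        break
    ```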

  2. Wikidata-Disamb Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Feb 5, 2021
    Cite
    Alberto Cetoli; Mohammad Akbari; Stefano Bragaglia; Andrew D. O'Harney; Marc Sloan (2021). Wikidata-Disamb Dataset [Dataset]. https://paperswithcode.com/dataset/wikidata-disamb
    Explore at:
    Dataset updated
    Feb 5, 2021
    Authors
    Alberto Cetoli; Mohammad Akbari; Stefano Bragaglia; Andrew D. O'Harney; Marc Sloan
    Description

    The Wikidata-Disamb dataset is intended to allow a clean and scalable evaluation of named entity disambiguation (NED) against Wikidata entries, and to serve as a reference in future research.

  3. wikidata-20220103-all.json.gz

    • academictorrents.com
    bittorrent
    Updated Jan 24, 2022
    Cite
    wikidata.org (2022). wikidata-20220103-all.json.gz [Dataset]. https://academictorrents.com/details/229cfeb2331ad43d4706efd435f6d78f40a3c438
    Explore at:
    bittorrent (109,042,925,619 bytes)
    Dataset updated
    Jan 24, 2022
    Dataset provided by
    Wikidata (https://wikidata.org/)
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    A BitTorrent file for downloading the full Wikidata JSON dump of January 3, 2022 ('wikidata-20220103-all.json.gz').
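
    The full JSON dumps are distributed as a single JSON array with one entity object per line, so they can be processed as a stream without loading the whole file into memory. A minimal sketch, assuming that layout:

    ```python
    # Stream-parse a Wikidata full JSON dump: the file is one large JSON
    # array, with one entity per line and a trailing comma on most lines.
    import gzip
    import json

    with gzip.open("wikidata-20220103-all.json.gz", "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line in ("[", "]"):
                continue  # skip the array delimiters
            entity = json.loads(line.rstrip(","))
            print(entity["id"], entity.get("labels", {}).get("en", {}).get("value"))
            break  # remove to process the full dump
    ```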

  4. Wikidata5M Dataset

    • paperswithcode.com
    Updated Jun 14, 2023
    + more versions
    Cite
    Xiaozhi Wang; Tianyu Gao; Zhaocheng Zhu; Zhengyan Zhang; Zhiyuan Liu; Juanzi Li; Jian Tang (2023). Wikidata5M Dataset [Dataset]. https://paperswithcode.com/dataset/wikidata5m
    Explore at:
    Dataset updated
    Jun 14, 2023
    Authors
    Xiaozhi Wang; Tianyu Gao; Zhaocheng Zhu; Zhengyan Zhang; Zhiyuan Liu; Juanzi Li; Jian Tang
    Description

    Wikidata5m is a million-scale knowledge graph dataset with aligned corpus. This dataset integrates the Wikidata knowledge graph and Wikipedia pages. Each entity in Wikidata5m is described by a corresponding Wikipedia page, which enables the evaluation of link prediction over unseen entities.

    The dataset is distributed as a knowledge graph, a corpus, and aliases. We provide both transductive and inductive data splits used in the original paper.
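
    A minimal sketch for reading the knowledge-graph files, assuming the commonly distributed layout of one tab-separated (head, relation, tail) triple per line; the filename below follows the original release and may differ:

    ```python
    # Hedged sketch: parse Wikidata5M triples from a tab-separated file.
    # The filename is an assumption based on the original release.
    triples = []
    with open("wikidata5m_transductive_train.txt", encoding="utf-8") as f:
        for line in f:
            head, relation, tail = line.rstrip("\n").split("\t")
            triples.append((head, relation, tail))

    print(len(triples), triples[0])  # e.g. a ('Q...', 'P...', 'Q...') tuple
    ```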

  5. Wikidata Dataset

    • paperswithcode.com
    Updated Dec 31, 2023
    + more versions
    Cite
    (2023). Wikidata Dataset [Dataset]. https://paperswithcode.com/dataset/wikidata
    Explore at:
    Dataset updated
    Dec 31, 2023
    Description

    Wikidata is a free and open knowledge base that can be read and edited by both humans and machines. It acts as central storage for the structured data of its Wikimedia sister projects including Wikipedia, Wikivoyage, Wiktionary, Wikisource, and others.

  6. Wikidata Entities of Interest

    • opensanctions.org
    csv
    Updated Dec 6, 2024
    Cite
    Wikidata (2024). Wikidata Entities of Interest [Dataset]. https://www.opensanctions.org/datasets/wd_curated/
    Explore at:
    csv
    Dataset updated
    Dec 6, 2024
    Dataset authored and provided by
    Wikidata (https://wikidata.org/)
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Persons of interest profiles from Wikidata, the structured data version of Wikipedia.

  7. Wikidata

    • huggingface.co
    Updated May 23, 2025
    Cite
    Yuqing Yang (2025). Wikidata [Dataset]. https://huggingface.co/datasets/ayyyq/Wikidata
    Explore at:
    Dataset updated
    May 23, 2025
    Authors
    Yuqing Yang
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This dataset accompanies the paper “When Do LLMs Admit Their Mistakes? Understanding the Role of Model Belief in Retraction”. It includes the original Wikidata questions used in the paper's experiments, with a train/test split. For a detailed explanation of the dataset construction and usage, please refer to the paper. Code: https://github.com/ayyyq/llm-retraction

      Citation
    

    @misc{yang2025llmsadmitmistakesunderstanding, title={When Do LLMs Admit Their Mistakes? Understanding the Role of… See the full description on the dataset page: https://huggingface.co/datasets/ayyyq/Wikidata.

  8. Wikidata PageRank

    • danker.s3.amazonaws.com
    Updated Jun 14, 2025
    Cite
    Andreas Thalhammer (2025). Wikidata PageRank [Dataset]. https://danker.s3.amazonaws.com/index.html
    Explore at:
    tsv, application/n-triples, application/vnd.hdt, ttl
    Dataset updated
    Jun 14, 2025
    Authors
    Andreas Thalhammer
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    Regularly published dataset of PageRank scores for Wikidata entities. The underlying link graph is formed by the union of all links across all Wikipedia language editions. Computation is performed by Andreas Thalhammer with 'danker', available at https://github.com/athalhammer/danker. If you find the downloads here useful, please feel free to leave a GitHub ⭐ at the repository and buy me a ☕: https://www.buymeacoffee.com/thalhamm
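
    A hedged sketch for loading the tsv distribution into a lookup table; the filename and the two-column (QID, score) layout are assumptions based on the formats listed above:

    ```python
    # Hypothetical sketch: read Wikidata PageRank scores from the TSV dump
    # into a dict mapping QID -> score. Filename and column layout assumed.
    import csv

    pagerank = {}
    with open("wikidata.allwiki.links.rank.tsv", encoding="utf-8") as f:  # hypothetical name
        for qid, score in csv.reader(f, delimiter="\t"):
            pagerank[qid] = float(score)

    print(pagerank.get("Q42"))  # PageRank of Douglas Adams, if present
    ```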

  9. Wikidata

    • bioregistry.io
    Updated Nov 13, 2021
    Cite
    (2021). Wikidata [Dataset]. http://identifiers.org/biolink:WIKIDATA
    Explore at:
    Dataset updated
    Nov 13, 2021
    License

    CC0-1.0 (https://bioregistry.io/spdx:CC0-1.0)

    Description

    Wikidata is a collaboratively edited knowledge base operated by the Wikimedia Foundation. It is intended to provide a common source of certain types of data which can be used by Wikimedia projects such as Wikipedia. Wikidata functions as a document-oriented database, centred on individual items. Items represent topics, for which basic information is stored that identifies each topic.

  10. A dataset of scholarly journals in wikidata: (selected) external identifiers

    • zenodo.org
    zip
    Updated Nov 22, 2022
    Cite
    Alexis-Michel Mugabushaka; Alexis-Michel Mugabushaka (2022). A dataset of scholarly journals in wikidata : (selected) external identifiers [Dataset]. http://doi.org/10.5281/zenodo.6347127
    Explore at:
    zip
    Dataset updated
    Nov 22, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Alexis-Michel Mugabushaka
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    For an updated list, see:

    Matching OpenAlex venues to Wikidata identifiers

    Motivation: the selective vs. inclusive approach in bibliometric databases

    An important difference between bibliometric databases is their “inclusion policy”.

    Some databases, like Web of Science and Scopus, select the sources they index, while others, like Dimensions and OpenAlex, are more inclusive (they index, for example, all data from a given source such as Crossref).

    WOS

    “Selectivity remained a hallmark of coverage because Garfield had decided early on to focus on internationally influential journals.” (…)

    SCOPUS

    Serial content (i.e., journals, conference proceedings, and book series) submitted for possible inclusion in Scopus by editors and publishers is reviewed and selected, based on criteria of scientific quality and rigor. This selection process is carried out by an external Content Selection and Advisory Board (CSAB) of editorially independent scientists, each of which are subject matter experts in their respective fields. This ensures that only high-quality curated content is indexed in the database and affirms the trustworthiness of Scopus

    Dimensions

    We have decided to take an “inclusive” approach to the publications we index in Dimensions. We believe that Dimensions should be a comprehensive data source, not a judgment call, and so we index as broad a swath of content as possible and have developed a number of features (e.g., the Dimensions API, journal list filters that limit search results to journals that appear in sources such as Pubmed or the 2015 Australian ERA6 journal list) that allow users to filter and select the data that is most relevant to their specific needs.



    Using Wikidata to enable the filtering of “venue subsets” in OpenAlex

    We are interested in creating subsets of venues in OpenAlex (for example for comparative analysis with inclusive databases or other use cases). This would require matching identifiers of OpenAlex venues to other identifiers.

    Thanks to WikiCite, a project to record and link scholarly data, Wikidata has a large collection of metadata related to scholarly journals. This repository provides a subset of the scholarly journals in Wikidata, focusing mainly on external identifiers.

    The dataset will be used to explore the extent to which Wikidata journal external identifiers can be used to select content in OpenAlex.

    (See here a list of openly available lists of journals.)

    Dataset creation & Documentation

    Some numbers:

    Journals in Wikidata: 113,797; with ISSN-L: 95,888; with OpenAlex venue ID: 29,150

    External identifiers (used in the SPARQL sketch after the indexing-services list below):

    https://www.wikidata.org/wiki/Property:P236 # ext_id_issn

    https://www.wikidata.org/wiki/Property:P7363 # ext_id_issn_l

    https://www.wikidata.org/wiki/Property:P8375 # ext_id_crossref_journal_id

    https://www.wikidata.org/wiki/Property:P1055 # ext_id_nlm_unique_id

    https://www.wikidata.org/wiki/Property:P1058 # ext_id_era_journal_id

    https://www.wikidata.org/wiki/Property:P1250 # ext_id_danish_bif_id

    https://www.wikidata.org/wiki/Property:P10283 #ext_id_openalex_id

    https://www.wikidata.org/wiki/Property:P1156 # ext_id_scopus_source_id


    Indexing services

    https://www.wikidata.org/wiki/Property:P8875

    https://www.wikidata.org/wiki/Q371467 # Scopus

    https://www.wikidata.org/wiki/Q104047209 # Science Citation Index Expanded

    https://www.wikidata.org/wiki/Q22908122 # Emerging Sources Citation Index

    https://www.wikidata.org/wiki/Q1090953 # Social Sciences Citation Index

    https://www.wikidata.org/wiki/Q713927 # Arts and Humanities Citation index
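
    As referenced above, a minimal sketch (not the author's code) that queries the public Wikidata SPARQL endpoint for journals carrying two of the listed identifiers; the class filter Q5633421 ("scientific journal") is an assumption:

    ```python
    # Hedged sketch: fetch journals with an ISSN-L (P7363) and an OpenAlex ID
    # (P10283) from the public Wikidata SPARQL endpoint.
    import requests

    query = """
    SELECT ?journal ?issn_l ?openalex WHERE {
      ?journal wdt:P31 wd:Q5633421 ;   # assumed class: scientific journal
               wdt:P7363 ?issn_l ;     # ISSN-L
               wdt:P10283 ?openalex .  # OpenAlex ID
    } LIMIT 10
    """
    r = requests.get(
        "https://query.wikidata.org/sparql",
        params={"query": query, "format": "json"},
        headers={"User-Agent": "journal-id-demo/0.1 (example)"},
    )
    for row in r.json()["results"]["bindings"]:
        print(row["journal"]["value"], row["issn_l"]["value"], row["openalex"]["value"])
    ```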

  11. Wikidata Companies Graph

    • data.hellenicdataservice.gr
    • zenodo.org
    Updated Jun 20, 2019
    Cite
    (2019). Wikidata Companies Graph [Dataset]. https://data.hellenicdataservice.gr/dataset/f7341a62-a513-4931-99b8-0e302dc46d66
    Explore at:
    Dataset updated
    Jun 20, 2019
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains information about commercial organizations (companies) and their relations with other commercial organizations, persons, products, locations, groups and industries. The dataset has the form of a graph. It has been produced by the SmartDataLake project (https://smartdatalake.eu), using data collected from Wikidata (https://www.wikidata.org).

  12. Wikidata Politically Exposed Persons

    • opensanctions.org
    Updated Jun 24, 2025
    Cite
    Wikidata (2025). Wikidata Politically Exposed Persons [Dataset]. https://www.opensanctions.org/datasets/wd_peps/
    Explore at:
    application/json+ftm
    Dataset updated
    Jun 24, 2025
    Dataset authored and provided by
    Wikidata (https://wikidata.org/)
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Profiles of politically exposed persons from Wikidata, the structured data version of Wikipedia.

  13. wikidata-20240701-all.json.bz2

    • academictorrents.com
    bittorrent
    Updated Aug 30, 2024
    Cite
    Wikidata Contributors (2024). wikidata-20240701-all.json.bz2 [Dataset]. https://academictorrents.com/details/dc083577b9f773ef0d41a3eba21b8694d5a56e99
    Explore at:
    bittorrent (89,940,529,332 bytes)
    Dataset updated
    Aug 30, 2024
    Dataset provided by
    Wikidata (https://wikidata.org/)
    Authors
    Wikidata Contributors
    License

    No license specified (https://academictorrents.com/nolicensespecified)

    Description

    A BitTorrent file for downloading the full Wikidata JSON dump of July 1, 2024 ('wikidata-20240701-all.json.bz2').

  14. Wikidata dump 2017-12-27

    • zenodo.org
    bz2
    Updated Jan 24, 2020
    Cite
    WikiData; WikiData (2020). Wikidata dump 2017-12-27 [Dataset]. http://doi.org/10.5281/zenodo.1211767
    Explore at:
    bz2
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    WikiData
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description
  15. Wikidata Reference

    • figshare.com
    application/gzip
    Updated Mar 17, 2025
    + more versions
    Cite
    Sven Hertling; Nandana Mihindukulasooriya (2025). Wikidata Reference [Dataset]. http://doi.org/10.6084/m9.figshare.28602170.v2
    Explore at:
    application/gzip
    Dataset updated
    Mar 17, 2025
    Dataset provided by
    figshare
    Authors
    Sven Hertling; Nandana Mihindukulasooriya
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Summary

    The Triple-to-Text Alignment dataset aligns Knowledge Graph (KG) triples from Wikidata with diverse, real-world textual sources extracted from the web. Unlike previous datasets that rely primarily on Wikipedia text, this dataset provides a broader range of writing styles, tones, and structures by leveraging Wikidata references from various sources such as news articles, government reports, and scientific literature. Large language models (LLMs) were used to extract and validate text spans corresponding to KG triples, ensuring high-quality alignments. The dataset can be used for training and evaluating relation extraction (RE) and knowledge graph construction systems.

    Data Fields

    Each row in the dataset consists of the following fields:

    - subject (str): The subject entity of the knowledge graph triple.
    - rel (str): The relation that connects the subject and object.
    - object (str): The object entity of the knowledge graph triple.
    - text (str): A natural language sentence that entails the given triple.
    - validation (str): LLM-based validation results, including: Fluent Sentence(s): TRUE/FALSE; Subject mentioned in Text: TRUE/FALSE; Relation mentioned in Text: TRUE/FALSE; Object mentioned in Text: TRUE/FALSE; Fact Entailed By Text: TRUE/FALSE; Final Answer: TRUE/FALSE.
    - reference_url (str): URL of the web source from which the text was extracted.
    - subj_qid (str): Wikidata QID for the subject entity.
    - rel_id (str): Wikidata Property ID for the relation.
    - obj_qid (str): Wikidata QID for the object entity.

    Dataset Creation

    The dataset was created through the following process:

    1. Triple-Reference Sampling and Extraction: All relations from Wikidata were extracted using SPARQL queries. A sample of KG triples with associated reference URLs was collected for each relation.
    2. Domain Analysis and Web Scraping: URLs were grouped by domain, and sampled pages were analyzed to determine their primary language. English-language web pages were scraped and processed to extract plaintext content.
    3. LLM-Based Text Span Selection and Validation: LLMs were used to identify text spans from web content that correspond to KG triples. A Chain-of-Thought (CoT) prompting method was applied to validate whether the extracted text entailed the triple. The validation process included checking for fluency, subject mention, relation mention, object mention, and final entailment.
    4. Final Dataset Statistics: 12.5K Wikidata relations were analyzed, leading to 3.3M triple-reference pairs. After filtering for English content, 458K triple-web content pairs were processed with LLMs. 80.5K validated triple-text alignments were included in the final dataset.
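
    A hedged sketch for keeping only the alignments that passed every validation check; the filename and JSON-lines layout are assumptions (the distribution is only listed as application/gzip above), while the field names follow the field list:

    ```python
    # Hypothetical sketch: filter rows whose LLM validation ended in
    # "Final Answer: TRUE". Filename and JSON-lines layout are assumed.
    import pandas as pd

    df = pd.read_json("wikidata_reference_alignments.jsonl.gz", lines=True)
    validated = df[df["validation"].str.contains("Final Answer: TRUE")]
    print(validated[["subject", "rel", "object", "text"]].head())
    ```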

  16. Wikidata-14M Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Jul 12, 2021
    Cite
    Kholoud Alghamdi; Miaojing Shi; Elena Simperl (2021). Wikidata-14M Dataset [Dataset]. https://paperswithcode.com/dataset/wikidata-14m
    Explore at:
    Dataset updated
    Jul 12, 2021
    Authors
    Kholoud Alghamdi; Miaojing Shi; Elena Simperl
    Description

    Wikidata-14M is a recommender system dataset for recommending items to Wikidata editors. It consists of 220,000 editors responsible for 14 million interactions with 4 million items.

  17. wikidata-en-descriptions-small

    • huggingface.co
    Updated Aug 5, 2023
    + more versions
    Cite
    Daniel Erenrich (2023). wikidata-en-descriptions-small [Dataset]. https://huggingface.co/datasets/derenrich/wikidata-en-descriptions-small
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Aug 5, 2023
    Authors
    Daniel Erenrich
    License

    Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Description

    The derenrich/wikidata-en-descriptions-small dataset, hosted on Hugging Face and contributed by the HF Datasets community.

  18. Wikidata item quality labels

    • figshare.com
    txt
    Updated May 31, 2023
    Cite
    Glorian Yapinus; Amir Sarabadani; Aaron Halfaker (2023). Wikidata item quality labels [Dataset]. http://doi.org/10.6084/m9.figshare.5035796.v1
    Explore at:
    txt
    Dataset updated
    May 31, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Glorian Yapinus; Amir Sarabadani; Aaron Halfaker
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    This dataset contains quality labels for 5000 Wikidata items applied by Wikidata editors. The labels correspond to the quality scale described at https://www.wikidata.org/wiki/Wikidata:Item_quality

    Each line is a JSON blob with the following fields:

    - item_quality: The labeled quality class (A-E)
    - rev_id: The revision identifier of the version of the item that was labeled
    - strata: The size of the item in bytes at the time it was sampled
    - page_len: The actual size of the item in bytes
    - page_title: The QID of the item
    - claims: A dictionary including P31 "instance-of" values for filtering out certain types of items

    The number of observations by class is: A: 322, B: 438, C: 1773, D: 997, E: 1470.
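
    A minimal sketch for tallying the quality classes from the JSON-lines file; the filename is hypothetical, while the fields follow the description above:

    ```python
    # Count labeled quality classes (A-E) in the JSON-lines label file.
    import json
    from collections import Counter

    counts = Counter()
    with open("item_quality_labels.jsonl", encoding="utf-8") as f:  # hypothetical name
        for line in f:
            counts[json.loads(line)["item_quality"]] += 1

    print(counts)  # expected: A: 322, B: 438, C: 1773, D: 997, E: 1470
    ```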

  19. Wikidata Dump wikidata

    • explore.openaire.eu
    Updated Jan 10, 2022
    + more versions
    Cite
    Benno Fünfstück (2022). Wikidata Dump wikidata [Dataset]. http://doi.org/10.5281/zenodo.5833973
    Explore at:
    Dataset updated
    Jan 10, 2022
    Authors
    Benno Fünfstück
    Description

    RDF dump of Wikidata produced with wdumps. View on wdumper. Entity count: 0; statement count: 0; triple count: 0.

  20. wikidata-enwiki-categories-and-statements

    • huggingface.co
    Updated Mar 26, 2025
    Cite
    Daniel Erenrich (2025). wikidata-enwiki-categories-and-statements [Dataset]. https://huggingface.co/datasets/derenrich/wikidata-enwiki-categories-and-statements
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Mar 26, 2025
    Authors
    Daniel Erenrich
    License

    Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Description

    The derenrich/wikidata-enwiki-categories-and-statements dataset, hosted on Hugging Face and contributed by the HF Datasets community.
