https://choosealicense.com/licenses/cc0-1.0/
Wikidata Entities Connected to Wikipedia
This dataset is a multilingual, JSON-formatted version of the Wikidata dump from September 18, 2024. It only includes Wikidata entities that are connected to a Wikipedia page in any language. A total of 112,467,802 entities are included in the original data dump, of which 30,072,707 are linked to a Wikipedia page (26.73% of all entities have at least one Wikipedia sitelink).
Curated by: Jonathan Fraine & Philippe Saadé, Wikimedia Deutschland… See the full description on the dataset page: https://huggingface.co/datasets/philippesaade/wikidata.
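For illustration, here is a minimal sketch of streaming this JSON-formatted dump with the Hugging Face datasets library; the split name ("train") and default configuration are assumptions, so check the dataset page for the exact ones.

```python
# A minimal sketch of streaming the dataset with the Hugging Face `datasets`
# library. The split name is an assumption; see the dataset page.
from datasets import load_dataset

# Streaming avoids downloading the full multilingual dump up front.
entities = load_dataset("philippesaade/wikidata", split="train", streaming=True)

for entity in entities:
    print(entity)  # one JSON-formatted Wikidata entity per record
    break
```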
The Wikidata-Disamb dataset is intended to allow a clean and scalable evaluation of named entity disambiguation (NED) with Wikidata entries, and to be used as a reference in future research.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
A BitTorrent file to download data with the title 'wikidata-20220103-all.json.gz'
Wikidata5m is a million-scale knowledge graph dataset with aligned corpus. This dataset integrates the Wikidata knowledge graph and Wikipedia pages. Each entity in Wikidata5m is described by a corresponding Wikipedia page, which enables the evaluation of link prediction over unseen entities.
The dataset is distributed as a knowledge graph, a corpus, and aliases. We provide both transductive and inductive data splits used in the original paper.
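As a usage sketch, the knowledge-graph splits can be read as plain triples; the file name and the tab-separated (head, relation, tail) layout shown here are assumptions based on the description above.

```python
# A minimal sketch of loading a Wikidata5m split. The file name and the
# tab-separated (head QID, relation PID, tail QID) layout are assumptions.
from pathlib import Path

def load_triples(path):
    """Yield (head, relation, tail) identifier triples from a split file."""
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            head, relation, tail = line.rstrip("\n").split("\t")
            yield head, relation, tail

triples = list(load_triples("wikidata5m_transductive_train.txt"))
print(f"{len(triples)} triples, e.g. {triples[0]}")
```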
Wikidata is a free and open knowledge base that can be read and edited by both humans and machines. It acts as central storage for the structured data of its Wikimedia sister projects including Wikipedia, Wikivoyage, Wiktionary, Wikisource, and others.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Persons of interest profiles from Wikidata, the structured data version of Wikipedia.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset accompanies the paper: When Do LLMs Admit Their Mistakes? Understanding the Role of Model Belief in Retraction. It includes the original Wikidata questions used in our experiments, with a train/test split. For a detailed explanation of the dataset construction and usage, please refer to the paper. Code: https://github.com/ayyyq/llm-retraction
Citation
@misc{yang2025llmsadmitmistakesunderstanding, title={When Do LLMs Admit Their Mistakes? Understanding the Role of… See the full description on the dataset page: https://huggingface.co/datasets/ayyyq/Wikidata.
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Regularly published dataset of PageRank scores for Wikidata entities. The underlying link graph is formed by a union of all links across all Wikipedia language editions. Computation is performed by Andreas Thalhammer with 'danker', available at https://github.com/athalhammer/danker. If you find the downloads here useful, please feel free to leave a GitHub ⭐ at the repository and buy me a ☕: https://www.buymeacoffee.com/thalhamm
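A small sketch of loading one of the published rank files follows; the tab-separated layout (Q-identifier, PageRank score per line) is an assumption, and the file name is hypothetical.

```python
# A minimal sketch of loading danker output, assuming each line holds a
# Wikidata Q-identifier and a PageRank score separated by a tab.
import csv

def load_pagerank(path):
    """Map Wikidata Q-identifiers to PageRank scores."""
    scores = {}
    with open(path, encoding="utf-8", newline="") as handle:
        for qid, score in csv.reader(handle, delimiter="\t"):
            scores[qid] = float(score)
    return scores

ranks = load_pagerank("allwiki.links.rank")  # hypothetical file name
print(ranks.get("Q42"))  # score of the item for Douglas Adams, if present
```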
https://bioregistry.io/spdx:CC0-1.0
Wikidata is a collaboratively edited knowledge base operated by the Wikimedia Foundation. It is intended to provide a common source of certain types of data which can be used by Wikimedia projects such as Wikipedia. Wikidata functions as a document-oriented database, centred on individual items. Items represent topics, for which basic information is stored that identifies each topic.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
For an updated list, see
Matching OpenAlex venues to Wikidata identifiers
Motivation: the selective/inclusive approach in bibliometric databases
An important difference between bibliometric databases is their “inclusion policy”.
Some databases like Web of Science and Scopus select the sources they index, while others like Dimensions and OpenAlex are more inclusive (they index, for example, all data from a given source such as Crossref).
“selectivity remained a hallmark of coverage because Garfield had decided early on to focus on internationally influential journals.”
“Serial content (i.e., journals, conference proceedings, and book series) submitted for possible inclusion in Scopus by editors and publishers is reviewed and selected, based on criteria of scientific quality and rigor. This selection process is carried out by an external Content Selection and Advisory Board (CSAB) of editorially independent scientists, each of which are subject matter experts in their respective fields. This ensures that only high-quality curated content is indexed in the database and affirms the trustworthiness of Scopus”
We have decided to take an “inclusive” approach to the publications we index in Dimensions. We believe that Dimensions should be a comprehensive data source, not a judgment call, and so we index as broad a swath of content as possible and have developed a number of features (e.g., the Dimensions API, journal list filters that limit search results to journals that appear in sources such as Pubmed or the 2015 Australian ERA6 journal list) that allow users to filter and select the data that is most relevant to their specific needs.
Using Wikidata to enable the filtering of “venue subsets” in OpenAlex
We are interested in creating subsets of venues in OpenAlex (for example for comparative analysis with inclusive databases or other use cases). This would require matching identifiers of OpenAlex venues to other identifiers.
Thanks to WikiCite, a project to record and link scholarly data, Wikidata has a large collection of metadata related to scholarly journals. This repository provides a subset of the scholarly journals in Wikidata, focusing mainly on external identifiers.
The dataset will be used to explore the extent to which Wikidata journal external identifiers can be used to select the content in OpenAlex.
(see here a list of openly available lists of journals)
Dataset creation & Documentation
Wikidata dump from 2022-02-21
Extract entities of the following types:
https://www.wikidata.org/wiki/Q5633421 # scientific journal (Q5633421)
https://www.wikidata.org/wiki/Q737498 # academic journal (Q737498)
Extract the properties related to (selected) external identifiers (a SPARQL sketch of an equivalent live query follows the property lists below)
Some numbers:
Number of journals in Wikidata: 113,797; with ISSN-L: 95,888; with OpenAlex venue ID: 29,150
external identifiers
https://www.wikidata.org/wiki/Property:P236 # ext_id_issn
https://www.wikidata.org/wiki/Property:P7363 # ext_id_issn_l
https://www.wikidata.org/wiki/Property:P8375 # ext_id_crossref_journal_id
https://www.wikidata.org/wiki/Property:P1055 # ext_id_nlm_unique_id
https://www.wikidata.org/wiki/Property:P1058 # ext_id_era_journal_id
https://www.wikidata.org/wiki/Property:P1250 # ext_id_danish_bif_id
https://www.wikidata.org/wiki/Property:P10283 # ext_id_openalex_id
https://www.wikidata.org/wiki/Property:P1156 # ext_id_scopus_source_id
Indexing services
https://www.wikidata.org/wiki/Property:P8875
https://www.wikidata.org/wiki/Q371467 # Scopus
https://www.wikidata.org/wiki/Q104047209 # Science Citation Index Expanded
https://www.wikidata.org/wiki/Q22908122 # Emerging Sources Citation Index
https://www.wikidata.org/wiki/Q1090953 # Social Sciences Citation Index
https://www.wikidata.org/wiki/Q713927 # Arts and Humanities Citation Index
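To illustrate the extraction step, here is a hedged sketch of an equivalent live query against the Wikidata SPARQL endpoint (the dataset itself was built from the 2022-02-21 dump, so results will differ). It assumes journals are linked to the two classes above via instance of (P31) and shows only two of the external identifiers listed.

```python
# A minimal sketch of querying journals and two of their external identifiers
# from the live Wikidata SPARQL endpoint. Linkage via instance of (P31) is an
# assumption; the dataset was built from the 2022-02-21 dump, not this endpoint.
import requests

QUERY = """
SELECT ?journal ?issnL ?openalexId WHERE {
  VALUES ?class { wd:Q5633421 wd:Q737498 }        # scientific / academic journal
  ?journal wdt:P31 ?class .
  OPTIONAL { ?journal wdt:P7363 ?issnL . }        # ISSN-L
  OPTIONAL { ?journal wdt:P10283 ?openalexId . }  # OpenAlex venue ID
}
LIMIT 100
"""

response = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "journal-subset-example/0.1 (example only)"},
)
response.raise_for_status()
for row in response.json()["results"]["bindings"]:
    print(row["journal"]["value"], row.get("issnL", {}).get("value"))
```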
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains information about commercial organizations (companies) and their relations with other commercial organizations, persons, products, locations, groups and industries. The dataset has the form of a graph. It has been produced by the SmartDataLake project (https://smartdatalake.eu), using data collected from Wikidata (https://www.wikidata.org).
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Profiles of politically exposed persons from Wikidata, the structured data version of Wikipedia.
https://academictorrents.com/nolicensespecified
A BitTorrent file to download data with the title 'wikidata-20240701-all.json.bz2'
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Wikidata dump retrieved from https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.bz2 on 27 Dec 2017
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Summary
The Triple-to-Text Alignment dataset aligns Knowledge Graph (KG) triples from Wikidata with diverse, real-world textual sources extracted from the web. Unlike previous datasets that rely primarily on Wikipedia text, this dataset provides a broader range of writing styles, tones, and structures by leveraging Wikidata references from various sources such as news articles, government reports, and scientific literature. Large language models (LLMs) were used to extract and validate text spans corresponding to KG triples, ensuring high-quality alignments. The dataset can be used for training and evaluating relation extraction (RE) and knowledge graph construction systems.
Data Fields
Each row in the dataset consists of the following fields:
- subject (str): The subject entity of the knowledge graph triple.
- rel (str): The relation that connects the subject and object.
- object (str): The object entity of the knowledge graph triple.
- text (str): A natural language sentence that entails the given triple.
- validation (str): LLM-based validation results, including: Fluent Sentence(s): TRUE/FALSE; Subject mentioned in Text: TRUE/FALSE; Relation mentioned in Text: TRUE/FALSE; Object mentioned in Text: TRUE/FALSE; Fact Entailed By Text: TRUE/FALSE; Final Answer: TRUE/FALSE
- reference_url (str): URL of the web source from which the text was extracted.
- subj_qid (str): Wikidata QID for the subject entity.
- rel_id (str): Wikidata Property ID for the relation.
- obj_qid (str): Wikidata QID for the object entity.
Dataset Creation
The dataset was created through the following process:
1. Triple-Reference Sampling and Extraction: All relations from Wikidata were extracted using SPARQL queries. A sample of KG triples with associated reference URLs was collected for each relation.
2. Domain Analysis and Web Scraping: URLs were grouped by domain, and sampled pages were analyzed to determine their primary language. English-language web pages were scraped and processed to extract plaintext content.
3. LLM-Based Text Span Selection and Validation: LLMs were used to identify text spans from web content that correspond to KG triples. A Chain-of-Thought (CoT) prompting method was applied to validate whether the extracted text entailed the triple. The validation process included checking for fluency, subject mention, relation mention, object mention, and final entailment.
4. Final Dataset Statistics: 12.5K Wikidata relations were analyzed, leading to 3.3M triple-reference pairs. After filtering for English content, 458K triple-web content pairs were processed with LLMs. 80.5K validated triple-text alignments were included in the final dataset.
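As a usage sketch (with a hypothetical dataset identifier, since the exact Hugging Face path is not given here), rows whose LLM validation concluded with "Final Answer: TRUE" could be filtered like this:

```python
# A minimal sketch of keeping only rows whose LLM validation concluded with
# "Final Answer: TRUE". The dataset identifier and the exact text layout of
# the `validation` field are assumptions.
from datasets import load_dataset

dataset = load_dataset("triple-to-text-alignment", split="train")  # hypothetical ID

def is_validated(row):
    # `validation` is described as a text blob of "Check: TRUE/FALSE" lines.
    return "Final Answer: TRUE" in row["validation"]

validated = dataset.filter(is_validated)
print(len(validated), "validated triple-text alignments")
```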
Wikidata-14M is a recommender system dataset for recommending items to Wikidata editors. It consists of 220,000 editors responsible for 14 million interactions with 4 million items.
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
derenrich/wikidata-en-descriptions-small dataset hosted on Hugging Face and contributed by the HF Datasets community
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains quality labels for 5000 Wikidata items applied by Wikidata editors. The labels correspond to the quality scale described at https://www.wikidata.org/wiki/Wikidata:Item_quality
Each line is a JSON blob with the following fields:
- item_quality: The labeled quality class (A-E)
- rev_id: The revision identifier of the version of the item that was labeled
- strata: The size of the item in bytes at the time it was sampled
- page_len: The actual size of the item in bytes
- page_title: The QID of the item
- claims: A dictionary including P31 "instance-of" values for filtering out certain types of items
The number of observations by class is:
- A class: 322
- B class: 438
- C class: 1773
- D class: 997
- E class: 1470
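For illustration, a minimal sketch of tallying the class distribution from such a file, assuming one JSON object per line as described above (the file name is hypothetical):

```python
# A minimal sketch of counting quality classes, assuming one JSON object per
# line with an "item_quality" field; the file name is hypothetical.
import json
from collections import Counter

counts = Counter()
with open("item_quality_labels.jsonl", encoding="utf-8") as handle:
    for line in handle:
        counts[json.loads(line)["item_quality"]] += 1

print(counts)  # expected to roughly match the per-class totals listed above
```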
RDF dump of Wikidata produced with wdumps. View on wdumper. Entity count: 0, statement count: 0, triple count: 0.
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
derenrich/wikidata-enwiki-categories-and-statements dataset hosted on Hugging Face and contributed by the HF Datasets community