License: CC0 1.0 (https://bioregistry.io/spdx:CC0-1.0)
Wikidata is a collaboratively edited knowledge base operated by the Wikimedia Foundation. It is intended to provide a common source of certain types of data which can be used by Wikimedia projects such as Wikipedia. Wikidata functions as a document-oriented database, centred on individual items. Items represent topics, for which basic information is stored that identifies each topic.
License: CC0 1.0 (https://choosealicense.com/licenses/cc0-1.0/)
Wikidata Extraction
This dataset contains all RDF triples extracted from the latest Wikidata, converted from the N-Triples format to Parquet. The data originates from Wikidata, a free and open knowledge base that acts as central storage for structured data used by Wikipedia and other Wikimedia projects. The source file is the "truthy" N-Triples dump (latest-truthy.nt.bz2), which contains only the current, non-deprecated statements. The code to extract this data is available at… See the full description on the dataset page: https://huggingface.co/datasets/piebro/wikidata-extraction.
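The first step of that conversion is splitting each "truthy" N-Triples line into its subject, predicate, and object. A minimal sketch of that step (a real dump needs a proper N-Triples parser, e.g. to handle escaped literals):

```python
# Hedged sketch: splitting one simple N-Triples line into its three terms.
# The QIDs/PIDs below are standard Wikidata URIs used for illustration.
def parse_ntriple(line: str):
    """Split an N-Triples line 'S P O .' into (subject, predicate, object)."""
    s, p, o = line.rstrip(" .\n").split(" ", 2)
    return s, p, o

line = ('<http://www.wikidata.org/entity/Q42> '
        '<http://www.wikidata.org/prop/direct/P31> '
        '<http://www.wikidata.org/entity/Q5> .\n')
print(parse_ntriple(line))
```

Rows produced this way can then be batched into columnar form and written out as Parquet.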
License: CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)
This dataset provides entity mappings between Freebase and Wikidata, enabling seamless integration between two large-scale knowledge graphs. It is based on the Wikidata data dump from October 28, 2013, and was originally published by Google under the CC0 (Public Domain) license.
The mappings are carefully filtered to ensure high reliability.
This strict filtering results in high-confidence entity alignments, making the dataset useful for research and real-world applications in knowledge graph systems.
License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This is a dump from Wikidata from 2018-12-17 in JSON. This one is no longer available from Wikidata. It was downloaded originally from https://dumps.wikimedia.org/other/wikidata/20181217.json.gz and recompressed to fit on Zenodo.
License: CC BY-NC 4.0 (https://creativecommons.org/licenses/by-nc/4.0/)
License information was derived automatically
Persons of interest profiles from Wikidata, the structured data version of Wikipedia.
License: Apache License 2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Mapping between Freebase and Wikidata entities
This dataset maps Freebase IDs to Wikidata IDs and labels. It is useful for visualising and better understanding datasets like FB15k-237.
How it was created:
1. Download the Freebase-Wikidata mapping from here. [compressed size: 21.2 MB]
2. Download the Wikidata entities data from here. [compressed size: 81 GB]
3. Align the labels with the Freebase and Wikidata IDs.
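The alignment step above can be sketched as a simple join of the two downloads. The in-memory dicts and sample values below are illustrative stand-ins for the actual mapping and entity files:

```python
# Hedged sketch: joining a Freebase-to-Wikidata ID mapping with entity labels.
# Sample IDs/labels are illustrative, not taken from the referenced files.
fb_to_wd = {          # from the Freebase-Wikidata mapping file
    "/m/02mjmr": "Q76",
    "/m/0d3k14": "Q9696",
}
wd_labels = {         # from the Wikidata entities data
    "Q76": "Barack Obama",
    "Q9696": "John F. Kennedy",
}

# Align each Freebase ID with its Wikidata ID and English label
aligned = {fb: (qid, wd_labels.get(qid)) for fb, qid in fb_to_wd.items()}
print(aligned["/m/02mjmr"])  # ('Q76', 'Barack Obama')
```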
License: MIT (https://opensource.org/licenses/MIT)
License information was derived automatically
Dataset Summary
The Triple-to-Text Alignment dataset aligns Knowledge Graph (KG) triples from Wikidata with diverse, real-world textual sources extracted from the web. Unlike previous datasets that rely primarily on Wikipedia text, this dataset provides a broader range of writing styles, tones, and structures by leveraging Wikidata references from various sources such as news articles, government reports, and scientific literature. Large language models (LLMs) were used to extract and validate text spans corresponding to KG triples, ensuring high-quality alignments. The dataset can be used for training and evaluating relation extraction (RE) and knowledge graph construction systems.

Data Fields
Each row in the dataset consists of the following fields:
- subject (str): The subject entity of the knowledge graph triple.
- rel (str): The relation that connects the subject and object.
- object (str): The object entity of the knowledge graph triple.
- text (str): A natural language sentence that entails the given triple.
- validation (str): LLM-based validation results, including:
  - Fluent Sentence(s): TRUE/FALSE
  - Subject mentioned in Text: TRUE/FALSE
  - Relation mentioned in Text: TRUE/FALSE
  - Object mentioned in Text: TRUE/FALSE
  - Fact Entailed By Text: TRUE/FALSE
  - Final Answer: TRUE/FALSE
- reference_url (str): URL of the web source from which the text was extracted.
- subj_qid (str): Wikidata QID for the subject entity.
- rel_id (str): Wikidata Property ID for the relation.
- obj_qid (str): Wikidata QID for the object entity.

Dataset Creation
The dataset was created through the following process:
1. Triple-Reference Sampling and Extraction: All relations from Wikidata were extracted using SPARQL queries. A sample of KG triples with associated reference URLs was collected for each relation.
2. Domain Analysis and Web Scraping: URLs were grouped by domain, and sampled pages were analyzed to determine their primary language. English-language web pages were scraped and processed to extract plaintext content.
3. LLM-Based Text Span Selection and Validation: LLMs were used to identify text spans from web content that correspond to KG triples. A Chain-of-Thought (CoT) prompting method was applied to validate whether the extracted text entailed the triple. The validation process included checking for fluency, subject mention, relation mention, object mention, and final entailment.
4. Final Dataset Statistics: 12.5K Wikidata relations were analyzed, leading to 3.3M triple-reference pairs. After filtering for English content, 458K triple-web content pairs were processed with LLMs. 80.5K validated triple-text alignments were included in the final dataset.
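A common first step for consumers of this dataset is keeping only rows whose LLM validation ended in a positive final verdict. The exact serialization of the validation field is an assumption here; this sketch just checks for the "Final Answer: TRUE" flag named in the field description:

```python
# Hedged sketch: filtering rows by the LLM validation verdict. The sample
# rows and the validation string layout are illustrative assumptions.
def is_validated(row: dict) -> bool:
    """Keep only rows whose validation ends with Final Answer: TRUE."""
    return "Final Answer: TRUE" in row.get("validation", "")

rows = [
    {"subject": "Douglas Adams", "rel": "educated at",
     "object": "St John's College",
     "text": "Adams studied at St John's College, Cambridge.",
     "validation": "Fluent Sentence(s): TRUE\nSubject mentioned in Text: TRUE\n"
                   "Relation mentioned in Text: TRUE\nObject mentioned in Text: TRUE\n"
                   "Fact Entailed By Text: TRUE\nFinal Answer: TRUE"},
    {"subject": "Q1", "rel": "P2", "object": "Q3",
     "text": "Unrelated sentence.",
     "validation": "Final Answer: FALSE"},
]

validated = [r for r in rows if is_validated(r)]
print(len(validated))  # 1
```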
With this feature, the user can extend CSV datasets with existing information in the Wikidata KG. The tool applies entity linking to all concepts in the same column and lets the user use the extracted entities to extend the dataset.
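The column-extension idea can be sketched with a toy link table standing in for the real entity-linking step against the Wikidata KG (the QIDs for Berlin and Paris are real; the linker itself is a placeholder):

```python
# Hedged sketch: extending a CSV column with Wikidata QIDs via a toy
# entity linker. A real tool would link labels against the live KG.
import csv
import io

linker = {"Berlin": "Q64", "Paris": "Q90"}  # toy link table: label -> QID

src = io.StringIO("city\nBerlin\nParis\n")
out = io.StringIO()
reader = csv.DictReader(src)
writer = csv.DictWriter(out, fieldnames=["city", "city_qid"])
writer.writeheader()
for row in reader:
    # Every cell in the column goes through the same linking step
    writer.writerow({"city": row["city"],
                     "city_qid": linker.get(row["city"], "")})

print(out.getvalue())
```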
License: CC0 1.0 (https://choosealicense.com/licenses/cc0-1.0/)
Wikidata Entity Embeddings 0.2
Dataset Summary
Wikidata Entity Embeddings is a dataset of embedding vectors for Wikidata entities. Each vector represents a Wikidata item (Q...) or property (P...) based on textual information extracted from Wikidata. The dataset is part of the Wikidata Embedding Project, an initiative led by Wikimedia Deutschland in collaboration with Jina AI and IBM DataStax. The project provides a publicly accessible Wikidata Vector Database to enable… See the full description on the dataset page: https://huggingface.co/datasets/philippesaade/Wikidata_Vectors_0.2.
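Entity vectors like these are typically compared by cosine similarity. A minimal sketch with made-up toy vectors (real vectors come from the published dataset):

```python
# Hedged sketch: cosine similarity between entity embedding vectors.
# The three 3-dimensional vectors are invented toy values for illustration.
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

emb = {
    "Q42": [0.12, 0.80, -0.05],   # an item vector (toy values)
    "Q5":  [0.10, 0.75, -0.02],   # another item vector (toy values)
    "P31": [-0.60, 0.10, 0.90],   # a property vector (toy values)
}

print(cosine_similarity(emb["Q42"], emb["Q5"]))   # close to 1.0
print(cosine_similarity(emb["Q42"], emb["P31"]))  # much lower
```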
License: CC BY-NC 4.0 (https://creativecommons.org/licenses/by-nc/4.0/)
License information was derived automatically
Category-based imports from Wikidata, the structured data version of Wikipedia.
License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This is a collection of gender indicators from Wikidata and Wikipedia for human biographies. Data is derived from the 2016-01-03 Wikidata snapshot.
Each file describes the humans in Wikidata aggregated by gender (Property:P21) and disaggregated by the following Wikidata properties:
- Date of Birth (P569)
- Date of Death (P570)
- Place of Birth (P19)
- Country of Citizenship (P27)
- Ethnic Group (P172)
- Field of Work (P101)
- Occupation (P106)
- Wikipedia Language ("Sitelinks")
Further aggregations of the data are:
- World Map (countries derived from place of birth and citizenship)
- World Cultures (Inglehart-Welzel Map applied to the World Map)
- Gender Co-Occurrence (humans with multiple genders)
Wikidata labels have been translated to English for convenience when possible. You may still see values with "QIDs", which means no English translation was available. Where there are multiple values, such as for occupation, the gender is counted as co-occurring with each occupation separately.
For more information: http://wigi.wmflabs.org/
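The per-gender aggregation and the multiple-value counting rule can be sketched on a few toy rows (real input would be the humans extracted from the snapshot):

```python
# Hedged sketch: aggregating humans by gender (P21), and counting a gender
# once per occupation when an item has multiple occupation values.
from collections import Counter

# (QID, gender label, occupation labels) -- invented sample rows
humans = [
    ("Q1000", "female", ["politician", "lawyer"]),
    ("Q1001", "male",   ["writer"]),
    ("Q1002", "female", ["writer"]),
]

# Aggregate by gender
by_gender = Counter(g for _, g, _ in humans)

# Gender x occupation: each occupation value contributes separately
by_gender_occ = Counter((g, occ) for _, g, occs in humans for occ in occs)

print(by_gender)
print(by_gender_occ[("female", "writer")])  # 1
```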
License: CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Wikidata dump retrieved from https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.bz2 on 27 Dec 2017
License: CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)
WikiSnap25 is a collection of analysis‑ready tables that integrate content, metadata, and readership information for English Wikipedia articles and Wikidata entities as of mid‑2025. The goal is to lower the engineering barrier for empirical Wikimedia research by abstracting away the work of parsing XML/JSON dumps and aggregating pageview data into compact, well‑documented Parquet datasets.
WikiSnap25 combines several official Wikimedia sources: the June 1, 2025 English Wikipedia pages‑articles‑multistream XML dump for article text and metadata, the June 2, 2025 full Wikidata JSON dump for entity‑level information, a decade of Wikimedia REST API monthly pageview data from July 2015 through June 2025, and additional dumps (e.g., stub meta‑history and redirect mappings) for revision context and canonicalization. A modular processing pipeline (XML parsing, wikitext processing, mwparserfromhell, and aggregation scripts) produces a consistent article‑level snapshot anchored at mid‑2025, enriched with long‑term readership and knowledge‑graph features.
The following datasets are included:
WikiSnap_2025_WP_Articles.parquet – Article‑level data for 7,011,415 English Wikipedia articles, including normalized titles, total pageviews (2015–2025), first two sentences of article text, outlink metrics (e.g., num_articles_linked_out), and mappings to Wikidata QIDs. This table supports studies of article popularity, connectivity, and entity types.
WikiSnap_2025_WP_Edges.parquet – 314,337,621 intra‑Wikipedia hyperlinks between articles, with from/to article IDs and titles, endpoint pageviews, and a duplicate‑link flag for graph cleaning. This edge list is suitable for large‑scale network analyses of the article link graph.
WikiSnap_2025_WP_Network_Metrics.parquet – Network metrics for articles with intra-links to other Wikipedia articles (7,008,375 rows), including centrality scores, PageRank, traffic‑weighted PageRank, HITS hub/authority, degrees, and Leiden community assignments computed from a deduplicated article link graph.
WikiSnap_2025_WP_Article_Metrics.parquet – Additional article‑level metrics (7,011,415 rows) such as creation timestamp, edit lifespan in days, total revisions, number of unique editors, and a Gini‑style editor inequality index across redirect clusters.
WikiSnap_2025_WD_Entities.parquet – Wikidata entity information for 116,183,072 items, including labels, instance‑of labels, temporal attributes (begin/end year), country of citizenship labels, and sitelinks to English Wikipedia article titles, with disambiguation pages filtered and truthy claims prioritized.
WikiSnap_2025_WD_Metrics.parquet – Wikidata metrics for the same 116M entities, including counts of claims, in/out links, references, qualifiers, notable properties, article‑quality badges, and external identifiers, enabling notability and completeness assessments.
WikiSnap_2025_WD_Humans.parquet – A subset for human entities (P31=Q5) with 12,371,744 rows, containing birth and death years, occupation labels, gender, citizenship, and notability metrics, tailored to demographic and biographical analysis.
These datasets are intended to support a wide range of tasks, including correlating network centrality with pageview patterns, assessing Wikidata notability via property richness and sitelinks, and exploring editorial activity and inequality, without requiring researchers to build their own large‑scale Wikimedia processing pipelines from scratch.
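A typical analysis joins the article table with its network metrics and correlates connectivity with readership. This sketch uses toy frames in place of the Parquet files; the join key (`article_id`) and column names other than those quoted above are assumptions, not the published schema:

```python
# Hedged sketch: the kind of join-and-correlate workflow WikiSnap25 enables.
# Real usage would pd.read_parquet("WikiSnap_2025_WP_Articles.parquet") etc.
import pandas as pd

articles = pd.DataFrame({          # stand-in for WP_Articles
    "article_id": [1, 2, 3],
    "title": ["A", "B", "C"],
    "total_pageviews": [1000, 250, 4000],
})
metrics = pd.DataFrame({           # stand-in for WP_Network_Metrics
    "article_id": [1, 2, 3],
    "pagerank": [0.002, 0.0005, 0.01],
})

merged = articles.merge(metrics, on="article_id", how="inner")
print(merged[["total_pageviews", "pagerank"]].corr().round(3))
```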
Limitations. The current release focuses on English Wikipedia main‑namespace articles as of June 1, 2025; Wikidata content as of early June 2025; and article‑level pageviews aggregated over July 2015–June 2025 without finer temporal granularity. Redirects, stubs, and full revision histories are not included in these tables.
All datasets and the full processing code are released to facilitate reproducible Wikimedia research and to enable cross‑study comparability on a shared, 2025‑anchored snapshot.
License and citation. All WikiSnap25 datasets and associated processing code archived in this record are dedicated to the public domain under the Creative Commons Zero v1.0 Universal (CC0 1.0) public domain dedication. Users are free to reuse, modify, and redistribute the materials without restriction.
License: CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)
This dataset, "Global Companies from Wikidata", is a curated collection of 3,579 global companies with information sourced from Wikidata.
Key Features of the Dataset:
Dataset Columns:
License: CC0 1.0 (https://choosealicense.com/licenses/cc0-1.0/)
introspector/wikidata dataset hosted on Hugging Face and contributed by the HF Datasets community
License: CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
This dataset is an export of the list of Wikidata elements with a dataset identifier data.gouv.fr (Property P6526).
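An export like this can be reproduced with a SPARQL query against the Wikidata Query Service; P6526 is the property named above, and the label-service language choice is an assumption:

```python
# Hedged sketch: building the SPARQL query that lists items carrying a
# data.gouv.fr dataset identifier (P6526). Sending it to
# https://query.wikidata.org/sparql (format=json) is left to the reader.
def p6526_export_query(limit: int = 100) -> str:
    return f"""
SELECT ?item ?itemLabel ?datasetId WHERE {{
  ?item wdt:P6526 ?datasetId .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en,fr". }}
}}
LIMIT {limit}
""".strip()

print(p6526_export_query(10))
```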
License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
The dataset comprises over 260,000 scientific references from Wikidata 2023, showcasing the academic impact of each referenced paper. It includes data extracted from the OpenAlex platform, such as DOI, publication year, citation count, domain categorization, and journal metrics like the H-index.
License: CC BY-NC 4.0 (https://creativecommons.org/licenses/by-nc/4.0/)
License information was derived automatically
Profiles of politically exposed persons from Wikidata, the structured data version of Wikipedia.
WikiWebQuestions: a high-quality question answering benchmark for Wikidata.
./training_data/best.json
For more detail see https://github.com/stanford-oval/wikidata-emnlp23.
License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This is the so-called "truthy" dump of Wikidata from on or about May 21, 2022, shared for usage in SemTab 2022 Challenge.
Downloaded from https://www.wikidata.org/wiki/Wikidata:Database_download
See the License section of the above page for license information.
THIS DATA IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.