License: https://creativecommons.org/publicdomain/zero/1.0/
This dataset contains 417,937 biographical records of notable individuals, extracted from Wikidata using SPARQL queries via the Wikidata Query Service.
country_of_birth. image_url is mandatory). occupation_groups.csv (Science & Academia, Arts & Culture, Public Figures, Sports, Business).

| Column | Description | Notes |
|---|---|---|
| wikidata_url | Unique Wikidata URL identifier for the entry | Mandatory |
| label | Primary name/label of the person (usually in English) | Mandatory |
| name_in_native_languages | Name(s) in the person’s native language(s) | ;-separated values |
| pseudonyms | Alternative names or aliases used by the person | ;-separated values |
| sex_or_gender | Gender information | Mandatory |
| date_of_birth | Birth date | Mandatory |
| place_of_birth | City or region of birth | |
| country_of_birth | Country of birth | Mandatory |
| date_of_death | Death date (if applicable) | |
| place_of_death | City or region of death (if applicable) | |
| country_of_death | Country of death (if applicable) | |
| citizenships | Nationalities or citizenships held | ;-separated values |
| occupations | Specific occupations or roles | Mandatory, ;-separated |
| occupation_groups | Broad occupational categories | Mandatory, ;-separated |
| awards | Awards, honors, or recognitions received | ;-separated values |
| signature_url | URL to an image of the person’s signature | |
| image_url | URL to the person's image/portrait | Mandatory |
| date_of_image | Date when the image was created (if available) | |
The data may contain inaccuracies inherited from inconsistencies or errors in the original Wikidata entries. These are most often visible in date fields, especially date_of_image.
The files referenced by image_url and signature_url are hosted on Wikimedia Commons and may carry their own licenses (e.g., CC BY-SA, Public Domain). Check the license terms on the source page before using them.
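Several columns in the table hold `;`-separated values. A minimal sketch of expanding them into lists while reading the CSV might look like the following. The helper names and the in-memory sample row are hypothetical and only stand in for the real file, whose name is not given here.

```python
import csv
import io

# Columns that the table marks as ";-separated".
MULTI_VALUE_COLUMNS = (
    "name_in_native_languages", "pseudonyms", "citizenships",
    "occupations", "occupation_groups", "awards",
)

def split_multi(value):
    """Split a ';'-separated cell into a list, dropping empty parts."""
    return [part.strip() for part in (value or "").split(";") if part.strip()]

def read_records(handle):
    """Yield one dict per CSV row with multi-value columns expanded to lists."""
    for row in csv.DictReader(handle):
        for column in MULTI_VALUE_COLUMNS:
            if column in row:
                row[column] = split_multi(row[column])
        yield row

# Hypothetical sample (not real data) with only three of the columns.
sample = io.StringIO(
    "label,occupations,awards\n"
    "Example Person,mathematician; writer,\n"
)
records = list(read_records(sample))
```

Empty optional cells, such as `awards` in the sample, become empty lists, so downstream code does not need a separate missing-value check.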

License: https://choosealicense.com/licenses/cc0-1.0/
Wikidata parallel descriptions en-ja
Parallel corpus for machine translation generated from the Wikidata dump (2024-05-06). Currently only the English/Japanese pair has been processed. The jsonl file is ready to train with the Hugging Face transformers trainer for translation tasks.
Dataset Details
https://www.wikidata.org/wiki/Wikidata:Database_download
Dataset Creation
As the Wikidata description field is not an exact direct translation, filtering is required for… See the full description on the dataset page: https://huggingface.co/datasets/Mitsua/wikidata-parallel-descriptions-en-ja.
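The description says the jsonl file can be used directly for translation training. The sketch below reads such a file on the assumption that each line follows the `{"translation": {"en": ..., "ja": ...}}` layout used by the Hugging Face translation example scripts. The actual field names in this dataset are not confirmed here, and the sample line is made up.

```python
import json

def iter_pairs(lines, src="en", tgt="ja"):
    """Yield (source, target) text pairs from JSON Lines input.

    Assumes one JSON object per line with a "translation" mapping
    keyed by language code; this layout is an assumption.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        pair = json.loads(line)["translation"]
        yield pair[src], pair[tgt]

# Hypothetical sample line, not taken from the dataset.
sample_lines = [
    json.dumps(
        {"translation": {"en": "Japanese novelist", "ja": "日本の小説家"}},
        ensure_ascii=False,
    ),
    "",
]
pairs = list(iter_pairs(sample_lines))
```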

License: Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
This dataset was created by ABEL BIHINDA
Released under Apache 2.0

License: MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Dataset Summary
The Triple-to-Text Alignment dataset aligns Knowledge Graph (KG) triples from Wikidata with diverse, real-world textual sources extracted from the web. Unlike previous datasets that rely primarily on Wikipedia text, this dataset provides a broader range of writing styles, tones, and structures by leveraging Wikidata references from various sources such as news articles, government reports, and scientific literature. Large language models (LLMs) were used to extract and validate text spans corresponding to KG triples, ensuring high-quality alignments. The dataset can be used for training and evaluating relation extraction (RE) and knowledge graph construction systems.

Data Fields
Each row in the dataset consists of the following fields:
- subject (str): The subject entity of the knowledge graph triple.
- rel (str): The relation that connects the subject and object.
- object (str): The object entity of the knowledge graph triple.
- text (str): A natural language sentence that entails the given triple.
- validation (str): LLM-based validation results, including:
  - Fluent Sentence(s): TRUE/FALSE
  - Subject mentioned in Text: TRUE/FALSE
  - Relation mentioned in Text: TRUE/FALSE
  - Object mentioned in Text: TRUE/FALSE
  - Fact Entailed By Text: TRUE/FALSE
  - Final Answer: TRUE/FALSE
- reference_url (str): URL of the web source from which the text was extracted.
- subj_qid (str): Wikidata QID for the subject entity.
- rel_id (str): Wikidata Property ID for the relation.
- obj_qid (str): Wikidata QID for the object entity.

Dataset Creation
The dataset was created through the following process:
1. Triple-Reference Sampling and Extraction
- All relations from Wikidata were extracted using SPARQL queries.
- A sample of KG triples with associated reference URLs was collected for each relation.
2. Domain Analysis and Web Scraping
- URLs were grouped by domain, and sampled pages were analyzed to determine their primary language.
- English-language web pages were scraped and processed to extract plaintext content.
3. LLM-Based Text Span Selection and Validation
- LLMs were used to identify text spans from web content that correspond to KG triples.
- A Chain-of-Thought (CoT) prompting method was applied to validate whether the extracted text entailed the triple.
- The validation process included checking for fluency, subject mention, relation mention, object mention, and final entailment.
4. Final Dataset Statistics
- 12.5K Wikidata relations were analyzed, leading to 3.3M triple-reference pairs.
- After filtering for English content, 458K triple-web content pairs were processed with LLMs.
- 80.5K validated triple-text alignments were included in the final dataset.
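The `validation` field bundles several TRUE/FALSE checks into one string. A minimal sketch of turning it into booleans is shown below, assuming each check sits on its own line as `Name: TRUE` or `Name: FALSE`. The exact serialization is not documented here, and the sample row is invented for illustration.

```python
def parse_validation(text):
    """Parse 'Name: TRUE/FALSE' lines into a dict of booleans.

    Assumes one check per line; lines that do not match are ignored.
    """
    checks = {}
    for line in text.splitlines():
        name, sep, value = line.rpartition(":")
        value = value.strip().upper()
        if sep and value in ("TRUE", "FALSE"):
            checks[name.strip()] = value == "TRUE"
    return checks

def is_validated(row):
    """Return True when the row's 'Final Answer' check is TRUE."""
    return parse_validation(row["validation"]).get("Final Answer", False)

# Hypothetical row using the documented field names.
row = {
    "subject": "Example City",
    "rel": "country",
    "object": "Example Country",
    "text": "Example City is a city in Example Country.",
    "validation": (
        "Fluent Sentence(s): TRUE\n"
        "Subject mentioned in Text: TRUE\n"
        "Relation mentioned in Text: TRUE\n"
        "Object mentioned in Text: TRUE\n"
        "Fact Entailed By Text: TRUE\n"
        "Final Answer: TRUE"
    ),
}
checks = parse_validation(row["validation"])
```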

License: Attribution 3.0 (CC BY 3.0) (https://creativecommons.org/licenses/by/3.0/)
License information was derived automatically
derenrich/wikidata-en-descriptions-small dataset hosted on Hugging Face and contributed by the HF Datasets community

License: CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Wikidata is a free and open knowledge base that can be read and edited by both humans and machines. Wikidata acts as central storage for the structured data of its Wikimedia sister projects including Wikipedia, Wikivoyage, Wiktionary, Wikisource, and others.

License: Attribution-ShareAlike 3.0 (CC BY-SA 3.0) (https://creativecommons.org/licenses/by-sa/3.0/)
License information was derived automatically
Regularly published dataset of PageRank scores for Wikidata entities. The underlying link graph is formed by a union of all links across all Wikipedia language editions. Computation is performed by Andreas Thalhammer with 'danker', available at https://github.com/athalhammer/danker . If you find the downloads here useful please feel free to leave a GitHub ⭐ at the repository and buy me a ☕ https://www.buymeacoffee.com/thalhamm
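To illustrate the idea of ranking over a union of link graphs, here is a small power-iteration PageRank sketch. It is not danker's implementation. The damping factor of 0.85, the iteration count, the handling of dangling nodes, and the toy link lists are all assumptions made for the example.

```python
def union_links(*editions):
    """Merge (source, target) link pairs from several editions into one set."""
    edges = set()
    for links in editions:
        edges.update(links)
    return edges

def pagerank(edges, damping=0.85, iterations=50):
    """Plain power-iteration PageRank over a set of directed edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread evenly.
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        base = (1.0 - damping) / len(nodes) + damping * dangling / len(nodes)
        new_rank = {node: base for node in nodes}
        for src in nodes:
            if out_links[src]:
                share = damping * rank[src] / len(out_links[src])
                for dst in out_links[src]:
                    new_rank[dst] += share
        rank = new_rank
    return rank

# Toy link lists standing in for two language editions (hypothetical).
edition_one = [("A", "B"), ("B", "C")]
edition_two = [("A", "B"), ("C", "B")]
scores = pagerank(union_links(edition_one, edition_two))
```

Because the edges are stored in a set, a link that appears in several editions is counted once in the merged graph.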

License: CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Wikidata dump retrieved from https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.bz2 on 27 Dec 2017
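A full JSON dump is too large to load at once, so it is usually streamed. The sketch below assumes the layout documented for Wikidata JSON dumps: one large JSON array with one entity per line, each line followed by a comma. The tiny temporary file is a stand-in written only for the example.

```python
import bz2
import json
import os
import tempfile

def iter_entities(path):
    """Stream entities from a bz2-compressed, line-per-entity JSON array."""
    with bz2.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip().rstrip(",")
            if line in ("", "[", "]"):
                continue  # skip the array brackets and blank lines
            yield json.loads(line)

# Write a two-entity stand-in file and read it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.json.bz2")
    with bz2.open(path, mode="wt", encoding="utf-8") as out:
        out.write('[\n{"id": "Q1", "type": "item"},\n{"id": "Q2", "type": "item"}\n]\n')
    ids = [entity["id"] for entity in iter_entities(path)]
```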

License: Attribution-NonCommercial 4.0 (CC BY-NC 4.0) (https://creativecommons.org/licenses/by-nc/4.0/)
License information was derived automatically
Profiles of politically exposed persons from Wikidata, the structured data version of Wikipedia.

With this feature the user is able to extend CSV datasets with existing information in the Wikidata KG. The tool applies entity linking to all concepts in the same column and enables the user to use the extracted entities to extend the dataset.

The Wikidata dataset was created by linking the English Wikipedia corpus to Wikidata. It includes sentences with multiple relations and has 353 unique relations, with 372,059 sentences for training and 360,334 for testing.

License: https://academictorrents.com/nolicensespecified
A BitTorrent file to download data with the title 'wikidata-20240701-all.json.bz2'

License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This is a collection of gender indicators from Wikidata and Wikipedia for human biographies. Data is derived from the 2016-01-03 Wikidata snapshot.
Each file describes the humans in Wikidata aggregated by gender (Property:P21) and disaggregated by the following Wikidata properties:
- Date of Birth (P569)
- Date of Death (P570)
- Place of Birth (P19)
- Country of Citizenship (P27)
- Ethnic Group (P172)
- Field of Work (P101)
- Occupation (P106)
- Wikipedia Language ("Sitelinks")
Further aggregations of the data are:
- World Map (countries derived from place of birth and citizenship)
- World Cultures (Inglehart-Welzel map applied to World Map)
- Gender Co-Occurrence (humans with multiple genders)
Wikidata labels have been translated to English for convenience when possible. You may still see values shown as QIDs, which means no English label was available. Where a property has multiple values, such as occupation, the gender is counted as co-occurring with each value separately.
For more information, see http://wigi.wmflabs.org/
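The counting rule for multi-valued properties can be sketched as follows: a person with two occupations contributes one count to each (gender, occupation) pair. The record layout and the sample values are hypothetical and are not the format of the published files.

```python
from collections import Counter

def count_gender_by_property(humans, prop):
    """Count each (gender, value) pair once per person and value."""
    counts = Counter()
    for human in humans:
        for gender in human.get("gender", []):
            for value in human.get(prop, []):
                counts[(gender, value)] += 1
    return counts

# Hypothetical records: lists allow multiple genders and occupations.
humans = [
    {"gender": ["female"], "occupation": ["writer", "politician"]},
    {"gender": ["male"], "occupation": ["writer"]},
    {"gender": ["female"], "occupation": []},
]
counts = count_gender_by_property(humans, "occupation")
```

A consequence of this rule is that the totals across occupations can exceed the number of people, since one person may appear under several values.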

License: Attribution 3.0 (CC BY 3.0) (https://creativecommons.org/licenses/by/3.0/)
License information was derived automatically
derenrich/wikidata-enwiki-categories-and-statements dataset hosted on Hugging Face and contributed by the HF Datasets community

License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This dataset contains information about commercial organizations (companies) and their relations with other commercial organizations, persons, products, locations, groups and industries. The dataset has the form of a graph. It has been produced by the SmartDataLake project (https://smartdatalake.eu), using data collected from Wikidata (https://www.wikidata.org).

License: Attribution-NonCommercial 4.0 (CC BY-NC 4.0) (https://creativecommons.org/licenses/by-nc/4.0/)
License information was derived automatically
Persons of interest profiles from Wikidata, the structured data version of Wikipedia.

The dataset used in the paper is Wikidata, which contains a large number of entities and their corresponding semantic types.

License: CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
This dataset contains quality labels for 5000 Wikidata items applied by Wikidata editors. The labels correspond to the quality scale described at https://www.wikidata.org/wiki/Wikidata:Item_quality
Each line is a JSON blob with the following fields:
- item_quality: The labeled quality class (A-E)
- rev_id: The revision identifier of the version of the item that was labeled
- strata: The size of the item in bytes at the time it was sampled
- page_len: The actual size of the item in bytes
- page_title: The QID of the item
- claims: A dictionary including P31 "instance-of" values for filtering out certain types of items
The number of observations by class is:
- A class: 322
- B class: 438
- C class: 1773
- D class: 997
- E class: 1470
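A minimal sketch of tallying labels per class from the JSON lines, with optional filtering on the P31 values, could look like this. The field names come from the list above, but the shape of `claims` (a mapping from "P31" to a QID string or a list of QIDs) is an assumption, and the sample lines are invented.

```python
import json
from collections import Counter

def class_counts(lines, exclude_instance_of=()):
    """Count records per item_quality class, skipping excluded P31 values."""
    excluded = set(exclude_instance_of)
    counts = Counter()
    for line in lines:
        if not line.strip():
            continue
        record = json.loads(line)
        values = record.get("claims", {}).get("P31", [])
        if isinstance(values, str):
            values = [values]  # tolerate a single QID given as a string
        if excluded.intersection(values):
            continue
        counts[record["item_quality"]] += 1
    return counts

# Hypothetical sample lines using the documented field names.
sample_lines = [
    json.dumps({"item_quality": "A", "page_title": "Q100", "claims": {"P31": ["Q5"]}}),
    json.dumps({"item_quality": "E", "page_title": "Q200", "claims": {"P31": "Q13442814"}}),
    json.dumps({"item_quality": "E", "page_title": "Q300", "claims": {}}),
]
all_counts = class_counts(sample_lines)
filtered = class_counts(sample_lines, exclude_instance_of=["Q13442814"])
```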

RDF dump of Wikidata produced with wdumps. Entities with affiliation Roman Catholic. Entity count: 0, statement count: 0, triple count: 0.

License: CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
RDF dump of wikidata produced with wdumps.
<p>
<br>
<a href="https://tools.wmflabs.org/wdumps/dump/1752">View on wdumper</a>
</p>
<p>
<b>entity count</b>: 0, <b>statement count</b>: 0, <b>triple count</b>: 0
</p>