100+ datasets found
  1. Notable People Dataset (Wikidata-based)

    • kaggle.com
    zip
    Updated Jun 5, 2025
    Cite
    Ekaterina Solovyeva (2025). Notable People Dataset (Wikidata-based) [Dataset]. https://www.kaggle.com/datasets/qqsolov/notable-people-dataset-wikidata-based
    Available download formats: zip (32237057 bytes)
    Dataset updated: Jun 5, 2025
    Authors: Ekaterina Solovyeva
    License: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset Overview

    This dataset contains 417,937 biographical records of notable individuals, extracted from Wikidata using SPARQL queries via the Wikidata Query Service.

    Key Selection Criteria:

    • Timeframe: Individuals born in the 20th or 21st century (1901–present).
    • Country of Birth: Entries must include the country_of_birth.
    • Photo Availability: Each entry includes an associated image (image_url is mandatory).
    • Profession Filter: Focused on individuals with occupations categorized in occupation_groups.csv (Science & Academia, Arts & Culture, Public Figures, Sports, Business).

    Column Descriptions

    Column | Description | Notes
    wikidata_url | Unique Wikidata URL identifier for the entry | Mandatory
    label | Primary name/label of the person (usually in English) | Mandatory
    name_in_native_languages | Name(s) in the person’s native language(s) | ;-separated values
    pseudonyms | Alternative names or aliases used by the person | ;-separated values
    sex_or_gender | Gender information | Mandatory
    date_of_birth | Birth date | Mandatory
    place_of_birth | City or region of birth |
    country_of_birth | Country of birth | Mandatory
    date_of_death | Death date (if applicable) |
    place_of_death | City or region of death (if applicable) |
    country_of_death | Country of death (if applicable) |
    citizenships | Nationalities or citizenships held | ;-separated values
    occupations | Specific occupations or roles | Mandatory, ;-separated
    occupation_groups | Broad occupational categories | Mandatory, ;-separated
    awards | Awards, honors, or recognitions received | ;-separated values
    signature_url | URL to an image of the person’s signature |
    image_url | URL to the person's image/portrait | Mandatory
    date_of_image | Date when the image was created (if available) |
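
The ;-separated columns listed above need to be split before they can be filtered or counted. A minimal Python sketch, assuming the archive holds a comma-delimited CSV whose header names match the column table (the file name and the sample row below are invented for illustration):

```python
import csv
import io

# Columns the description marks as ";-separated".
MULTI_VALUED = (
    "name_in_native_languages", "pseudonyms", "citizenships",
    "occupations", "occupation_groups", "awards",
)

def read_people(text):
    """Parse CSV text and turn each ;-separated column into a list of values."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        for col in MULTI_VALUED:
            raw = row.get(col) or ""  # a missing or empty column becomes []
            row[col] = [v.strip() for v in raw.split(";") if v.strip()]
        rows.append(row)
    return rows

# Invented sample row, not taken from the dataset.
SAMPLE = (
    "label,occupations,occupation_groups,awards\n"
    "Jane Doe,physicist;writer,Science & Academia;Arts & Culture,\n"
)
people = read_people(SAMPLE)
print(people[0]["occupations"])
```

Splitting at load time keeps later code free of string handling, for example when selecting everyone whose occupation_groups list contains "Sports".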

    Notes

    The data may contain inaccuracies caused by inconsistencies or errors in the original Wikidata entries. These are most visible in the date fields, especially date_of_image.
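
One way to surface such date problems is a simple plausibility check. The sketch below assumes the date columns hold ISO 8601 strings beginning with YYYY-MM-DD, which the description does not confirm; the function names are hypothetical.

```python
from datetime import date

def parse_iso_date(value):
    """Return a date built from the first 10 characters, or None if unparseable."""
    try:
        return date.fromisoformat((value or "")[:10])
    except ValueError:
        return None

def suspicious_image_date(row, today=None):
    """Flag an image dated in the future or before the person's birth."""
    today = today or date.today()
    born = parse_iso_date(row.get("date_of_birth"))
    taken = parse_iso_date(row.get("date_of_image"))
    if taken is None:
        return False  # nothing to check
    if taken > today:
        return True
    return born is not None and taken < born

# Invented rows, for illustration only.
row_bad = {"date_of_birth": "1950-03-01T00:00:00Z", "date_of_image": "1930-01-01"}
row_ok = {"date_of_birth": "1950-03-01", "date_of_image": "1980-05-05"}
print(suspicious_image_date(row_bad), suspicious_image_date(row_ok))
```

Flagged rows are candidates for manual review rather than proven errors.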

    Source and Licensing Notes

    • All data in this dataset was derived from Wikidata. Wikidata content is available under the CC0 1.0 license.
    • Images linked in image_url and signature_url are hosted on Wikimedia Commons and may have individual licenses (e.g., CC BY-SA, Public Domain). Please check the license terms on the source page before using.
  2. Wikidata dump 2017-12-27

    • zenodo.org
    bz2
    Updated Jan 24, 2020
    Cite
    WikiData; WikiData (2020). Wikidata dump 2017-12-27 [Dataset]. http://doi.org/10.5281/zenodo.1211767
    Available download formats: bz2
    Dataset updated: Jan 24, 2020
    Dataset provided by: Zenodo (http://zenodo.org/)
    Authors: WikiData; WikiData
    License: CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description
  3. wikidata

    • kaggle.com
    zip
    Updated Apr 16, 2025
    Cite
    ABEL BIHINDA (2025). wikidata [Dataset]. https://www.kaggle.com/datasets/abelbihinda/wikidata
    Available download formats: zip (5327 bytes)
    Dataset updated: Apr 16, 2025
    Authors: ABEL BIHINDA
    License: Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Dataset

    This dataset was created by ABEL BIHINDA

    Released under Apache 2.0


  4. Wikidata Reference

    • figshare.com
    application/gzip
    Updated Mar 17, 2025
    + more versions
    Cite
    Sven Hertling; Nandana Mihindukulasooriya (2025). Wikidata Reference [Dataset]. http://doi.org/10.6084/m9.figshare.28602170.v2
    Available download formats: application/gzip
    Dataset updated: Mar 17, 2025
    Dataset provided by: Figshare (http://figshare.com/)
    Authors: Sven Hertling; Nandana Mihindukulasooriya
    License: MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Dataset Summary

    The Triple-to-Text Alignment dataset aligns Knowledge Graph (KG) triples from Wikidata with diverse, real-world textual sources extracted from the web. Unlike previous datasets that rely primarily on Wikipedia text, this dataset provides a broader range of writing styles, tones, and structures by leveraging Wikidata references from various sources such as news articles, government reports, and scientific literature. Large language models (LLMs) were used to extract and validate text spans corresponding to KG triples, ensuring high-quality alignments. The dataset can be used for training and evaluating relation extraction (RE) and knowledge graph construction systems.

    Data Fields

    Each row in the dataset consists of the following fields:

    • subject (str): The subject entity of the knowledge graph triple.
    • rel (str): The relation that connects the subject and object.
    • object (str): The object entity of the knowledge graph triple.
    • text (str): A natural language sentence that entails the given triple.
    • validation (str): LLM-based validation results, including:
        • Fluent Sentence(s): TRUE/FALSE
        • Subject mentioned in Text: TRUE/FALSE
        • Relation mentioned in Text: TRUE/FALSE
        • Object mentioned in Text: TRUE/FALSE
        • Fact Entailed By Text: TRUE/FALSE
        • Final Answer: TRUE/FALSE
    • reference_url (str): URL of the web source from which the text was extracted.
    • subj_qid (str): Wikidata QID for the subject entity.
    • rel_id (str): Wikidata Property ID for the relation.
    • obj_qid (str): Wikidata QID for the object entity.

    Dataset Creation

    The dataset was created through the following process:

    1. Triple-Reference Sampling and Extraction
        • All relations from Wikidata were extracted using SPARQL queries.
        • A sample of KG triples with associated reference URLs was collected for each relation.
    2. Domain Analysis and Web Scraping
        • URLs were grouped by domain, and sampled pages were analyzed to determine their primary language.
        • English-language web pages were scraped and processed to extract plaintext content.
    3. LLM-Based Text Span Selection and Validation
        • LLMs were used to identify text spans from web content that correspond to KG triples.
        • A Chain-of-Thought (CoT) prompting method was applied to validate whether the extracted text entailed the triple.
        • The validation process included checking for fluency, subject mention, relation mention, object mention, and final entailment.
    4. Final Dataset Statistics
        • 12.5K Wikidata relations were analyzed, leading to 3.3M triple-reference pairs.
        • After filtering for English content, 458K triple-web content pairs were processed with LLMs.
        • 80.5K validated triple-text alignments were included in the final dataset.
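
The validation field described above is a string that bundles several TRUE/FALSE checks. A minimal Python sketch of reading it, assuming each check sits on its own "Label: TRUE" or "Label: FALSE" line (the exact serialization is not documented here, and the sample row is invented):

```python
def parse_validation(text):
    """Turn 'Label: TRUE/FALSE' lines into a dict of booleans."""
    checks = {}
    for line in text.splitlines():
        label, sep, value = line.rpartition(":")
        value = value.strip().upper()
        if sep and value in ("TRUE", "FALSE"):
            checks[label.strip()] = value == "TRUE"
    return checks

def is_validated(row):
    """Keep a row only if its 'Final Answer' check is TRUE."""
    return parse_validation(row["validation"]).get("Final Answer", False)

# Invented sample row using the field names from the description.
row = {
    "subject": "Douglas Adams",
    "rel": "occupation",
    "object": "writer",
    "text": "Douglas Adams was an English writer.",
    "validation": (
        "Fluent Sentence(s): TRUE\n"
        "Subject mentioned in Text: TRUE\n"
        "Relation mentioned in Text: FALSE\n"
        "Fact Entailed By Text: TRUE\n"
        "Final Answer: TRUE"
    ),
}
print(is_validated(row))
```

Parsing into a dict also makes it possible to filter on an individual check, such as requiring that the relation is mentioned explicitly.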

  5. Wikidata Entities of Interest

    • opensanctions.org
    csv
    Updated Jan 9, 2026
    Cite
    Wikidata (2026). Wikidata Entities of Interest [Dataset]. https://www.opensanctions.org/datasets/wd_curated/
    Available download formats: csv
    Dataset updated: Jan 9, 2026
    Dataset authored and provided by: Wikidata (//wikidata.org/)
    License: Attribution-NonCommercial 4.0 (CC BY-NC 4.0) (https://creativecommons.org/licenses/by-nc/4.0/)
    License information was derived automatically

    Description

    Persons of interest profiles from Wikidata, the structured data version of Wikipedia.

  6. freebase-wikidata-mapping

    • huggingface.co
    Updated Mar 28, 2024
    + more versions
    Cite
    Knowledge Discovery & Management Lab, DA-IICT (2024). freebase-wikidata-mapping [Dataset]. https://huggingface.co/datasets/kdm-daiict/freebase-wikidata-mapping
    Available formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated: Mar 28, 2024
    Dataset authored and provided by: Knowledge Discovery & Management Lab, DA-IICT
    License: Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    mapping between freebase and wikidata entities

    This dataset maps freebase ids to wikidata ids and labels. It is useful for visualising and better understanding when working with datasets like fb15k-237.

    How it was created:

    • Download freebase-wikidata mapping from here. [compressed size: 21.2 MB]
    • Download wikidata entities data from here. [compressed size: 81GB]
    • Align labels with the freebase, wikidata id
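
To illustrate how such a mapping helps with fb15k-237, the Python sketch below replaces Freebase ids in a triple with readable labels. It assumes a tab-separated layout of Freebase id, Wikidata id, and label, which the description does not specify; the two sample rows are for illustration.

```python
def load_mapping(lines):
    """Build {freebase_id: (wikidata_id, label)} from tab-separated lines."""
    mapping = {}
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) >= 3:
            fb_id, qid, label = parts[:3]
            mapping[fb_id] = (qid, label)
    return mapping

def readable_triple(triple, mapping):
    """Swap Freebase ids for labels; unknown ids are left unchanged."""
    head, relation, tail = triple

    def name(entity):
        return mapping.get(entity, (None, entity))[1]

    return (name(head), relation, name(tail))

# Illustrative rows in the assumed layout.
SAMPLE = [
    "/m/02mjmr\tQ76\tBarack Obama\n",
    "/m/09c7w0\tQ30\tUnited States of America\n",
]
mapping = load_mapping(SAMPLE)
triple = ("/m/02mjmr", "/people/person/nationality", "/m/09c7w0")
print(readable_triple(triple, mapping))
```

Leaving unmapped ids untouched keeps the output usable when the mapping covers only part of a knowledge graph.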

  7. Wikidata Explorer Feature - Dataset - LDM

    • service.tib.eu
    • resodate.org
    Updated Jul 16, 2024
    Cite
    (2024). Wikidata Explorer Feature - Dataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/wikidata-explorer-feature
    Dataset updated: Jul 16, 2024
    Description

    With this feature, the user can extend CSV datasets with existing information from the Wikidata KG. The tool applies entity linking to all concepts in the same column and enables the user to use the extracted entities to extend the dataset.

  8. Wikidata dataset - Dataset - LDM

    • service.tib.eu
    • resodate.org
    Updated Nov 25, 2024
    Cite
    (2024). Wikidata dataset - Dataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/wikidata-dataset
    Dataset updated: Nov 25, 2024
    Description

    The Wikidata dataset was created by linking the Wikipedia English Corpus to Wikidata. It includes sentences with multiple relations and has 353 unique relations, comprising 372,059 sentences for training and 360,334 for testing.

  9. Wikidata Persons in Relevant Categories

    • opensanctions.org
    Updated Jan 23, 2026
    Cite
    Wikidata (2026). Wikidata Persons in Relevant Categories [Dataset]. https://www.opensanctions.org/datasets/wd_categories/
    Dataset updated: Jan 23, 2026
    Dataset authored and provided by: Wikidata (//wikidata.org/)
    License: Attribution-NonCommercial 4.0 (CC BY-NC 4.0) (https://creativecommons.org/licenses/by-nc/4.0/)
    License information was derived automatically

    Description

    Category-based imports from Wikidata, the structured data version of Wikipedia.

  10. Wikidata item quality labels

    • figshare.com
    txt
    Updated May 31, 2023
    Cite
    Glorian Yapinus; Amir Sarabadani; Aaron Halfaker (2023). Wikidata item quality labels [Dataset]. http://doi.org/10.6084/m9.figshare.5035796.v1
    Available download formats: txt
    Dataset updated: May 31, 2023
    Dataset provided by: Figshare (http://figshare.com/)
    Authors: Glorian Yapinus; Amir Sarabadani; Aaron Halfaker
    License: CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    This dataset contains quality labels for 5000 Wikidata items applied by Wikidata editors. The labels correspond to the quality scale described at https://www.wikidata.org/wiki/Wikidata:Item_quality

    Each line is a JSON blob with the following fields:

    • item_quality: The labeled quality class (A-E)
    • rev_id: The revision identifier of the version of the item that was labeled
    • strata: The size of the item in bytes at the time it was sampled
    • page_len: The actual size of the item in bytes
    • page_title: The Qid of the item
    • claims: A dictionary including P31 "instance-of" values for filtering out certain types of items

    The number of observations by class is:

    • A class: 322
    • B class: 438
    • C class: 1773
    • D class: 997
    • E class: 1470
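
Because each line is a standalone JSON object, the file can be read line by line. The Python sketch below tallies the item_quality classes; the sample lines are invented, and the exact shape of the claims dictionary is an assumption.

```python
import json
from collections import Counter

def count_quality_classes(lines):
    """Count how many labeled items fall into each quality class."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            counts[json.loads(line)["item_quality"]] += 1
    return counts

# Invented sample lines using the field names from the description.
SAMPLE = [
    '{"item_quality": "A", "rev_id": 1, "strata": 900, "page_len": 950, "page_title": "Q1", "claims": {"P31": ["Q5"]}}',
    '{"item_quality": "E", "rev_id": 2, "strata": 120, "page_len": 130, "page_title": "Q2", "claims": {}}',
    '{"item_quality": "E", "rev_id": 3, "strata": 110, "page_len": 115, "page_title": "Q3", "claims": {}}',
]
counts = count_quality_classes(SAMPLE)
print(counts.most_common())

# Per-class counts as published in the description.
PUBLISHED = {"A": 322, "B": 438, "C": 1773, "D": 997, "E": 1470}
print(sum(PUBLISHED.values()))
```

The published per-class counts add up to the 5000 labeled items, so a tally over the full file can be compared against them as a quick integrity check.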

  11. wikidata-truthy

    • huggingface.co
    Updated Nov 26, 2025
    Cite
    CleverThis (2025). wikidata-truthy [Dataset]. https://huggingface.co/datasets/CleverThis/wikidata-truthy
    Dataset updated: Nov 26, 2025
    Dataset authored and provided by: CleverThis
    License: https://choosealicense.com/licenses/cc0-1.0/

    Description

    Wikidata Truthy

    Dataset Description

    Core facts from Wikidata (preferred statements only).
    Original Source: https://dumps.wikimedia.org/wikidatawiki/entities/latest-truthy.nt.bz2

    Dataset Summary

    This dataset contains RDF triples from Wikidata Truthy converted to HuggingFace dataset format for easy use in machine learning pipelines.

    • Format: Originally ntriples, converted to HuggingFace Dataset
    • Size: 100.0 GB (extracted)
    • Entities: ~100M
    • Triples: ~2B
    • Original…

    See the full description on the dataset page: https://huggingface.co/datasets/CleverThis/wikidata-truthy.
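
For readers unfamiliar with the source format, an N-Triples file holds one "subject predicate object ." statement per line. The Python sketch below is a deliberately simplified parser for the common case of IRI subjects and predicates; it is not a full implementation of the N-Triples grammar (blank nodes and escape sequences are not handled) and is not the conversion code used for this dataset.

```python
import re

# Simplified pattern: IRI subject, IRI predicate, then an IRI or literal object.
TRIPLE = re.compile(r'^<([^>]*)>\s+<([^>]*)>\s+(.+?)\s*\.\s*$')

def parse_ntriple(line):
    """Return (subject, predicate, object) or None if the line does not match."""
    match = TRIPLE.match(line)
    if match is None:
        return None
    subject, predicate, obj = match.groups()
    if obj.startswith("<") and obj.endswith(">"):
        obj = obj[1:-1]  # strip angle brackets from IRI objects
    return subject, predicate, obj

line = (
    "<http://www.wikidata.org/entity/Q42> "
    "<http://www.wikidata.org/prop/direct/P31> "
    "<http://www.wikidata.org/entity/Q5> ."
)
print(parse_ntriple(line))
```

Literal objects are returned verbatim, including quotes and any language tag, so downstream code can decide how to interpret them.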

  12. Wikidata dump from 2018-12-17 in JSON

    • zenodo.org
    • data.niaid.nih.gov
    • +1 more
    application/gzip
    Updated Jan 15, 2021
    Cite
    Jakub Klímek; Jakub Klímek; Petr Škoda; Petr Škoda (2021). Wikidata dump from 2018-12-17 in JSON [Dataset]. http://doi.org/10.5281/zenodo.4436356
    Available download formats: application/gzip
    Dataset updated: Jan 15, 2021
    Dataset provided by: Zenodo (http://zenodo.org/)
    Authors: Jakub Klímek; Jakub Klímek; Petr Škoda; Petr Škoda
    License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This is a Wikidata dump from 2018-12-17 in JSON. It is no longer available from Wikidata. It was originally downloaded from https://dumps.wikimedia.org/other/wikidata/20181217.json.gz and recompressed to fit on Zenodo.

  13. Wikidata - Dataset - LDM

    • service.tib.eu
    • resodate.org
    Updated Dec 3, 2024
    Cite
    (2024). Wikidata - Dataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/wikidata
    Dataset updated: Dec 3, 2024
    Description

    The dataset used in the paper is Wikidata, which contains a large number of entities and their corresponding semantic types.

  14. Wikidata PageRank

    • danker.s3.amazonaws.com
    • danker.s3-website.eu-central-1.amazonaws.com
    Updated Jan 6, 2026
    Cite
    Andreas Thalhammer (2026). Wikidata PageRank [Dataset]. https://danker.s3.amazonaws.com/index.html
    Available download formats: tsv, application/n-triples, application/vnd.hdt, ttl
    Dataset updated: Jan 6, 2026
    Authors: Andreas Thalhammer
    License: Attribution-ShareAlike 3.0 (CC BY-SA 3.0) (https://creativecommons.org/licenses/by-sa/3.0/)
    License information was derived automatically

    Description

    Regularly published dataset of PageRank scores for Wikidata entities. The underlying link graph is formed by a union of all links across all Wikipedia language editions. Computation is performed by Andreas Thalhammer with 'danker', available at https://github.com/athalhammer/danker . If you find the downloads here useful, please feel free to leave a GitHub ⭐ at the repository and buy me a ☕ https://www.buymeacoffee.com/thalhamm
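
As background on what a PageRank score over a link graph means, here is a textbook power-iteration sketch in Python. It is not danker's implementation; the damping factor, iteration count, and toy graph are illustrative assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Compute PageRank for a graph given as {source: [targets]}."""
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    if not nodes:
        return {}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread evenly.
        dangling = sum(rank[node] for node in nodes if not links.get(node))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for source, targets in links.items():
            if targets:
                share = damping * rank[source] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Toy link graph: A and B link to C, C links back to A, D has no links.
graph = {"A": ["C"], "B": ["C"], "C": ["A"], "D": []}
scores = pagerank(graph)
print(max(scores, key=scores.get))
```

In this toy graph C collects links from two nodes and ends up with the highest score, while the scores as a whole stay normalized to a total of 1.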

  15. Wikidata Companies Graph

    • data.hellenicdataservice.gr
    • data.europa.eu
    Updated Jun 20, 2019
    Cite
    (2019). Wikidata Companies Graph [Dataset]. https://data.hellenicdataservice.gr/dataset/f7341a62-a513-4931-99b8-0e302dc46d66
    Dataset updated: Jun 20, 2019
    License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This dataset contains information about commercial organizations (companies) and their relations with other commercial organizations, persons, products, locations, groups and industries. The dataset has the form of a graph. It has been produced by the SmartDataLake project (https://smartdatalake.eu), using data collected from Wikidata (https://www.wikidata.org).

  16. Wikidata Politically Exposed Persons

    • opensanctions.org
    Updated Jan 8, 2026
    Cite
    Wikidata (2026). Wikidata Politically Exposed Persons [Dataset]. https://www.opensanctions.org/datasets/wd_peps/
    Available download formats: application/json+ftm
    Dataset updated: Jan 8, 2026
    Dataset authored and provided by: Wikidata (//wikidata.org/)
    License: Attribution-NonCommercial 4.0 (CC BY-NC 4.0) (https://creativecommons.org/licenses/by-nc/4.0/)
    License information was derived automatically

    Description

    Profiles of politically exposed persons from Wikidata, the structured data version of Wikipedia.

  17. Wikidata

    • live.european-language-grid.eu
    json
    Updated Oct 28, 2012
    + more versions
    Cite
    (2012). Wikidata [Dataset]. https://live.european-language-grid.eu/catalogue/lcr/7268
    Available download formats: json
    Dataset updated: Oct 28, 2012
    License: CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    Wikidata is a free and open knowledge base that can be read and edited by both humans and machines. Wikidata acts as central storage for the structured data of its Wikimedia sister projects including Wikipedia, Wikivoyage, Wiktionary, Wikisource, and others.

  18. Drafttopic + Wikidata Exploration

    • figshare.com
    txt
    Updated Jun 2, 2023
    Cite
    Isaac Johnson (2023). Drafttopic + Wikidata Exploration [Dataset]. http://doi.org/10.6084/m9.figshare.9642617.v1
    Available download formats: txt
    Dataset updated: Jun 2, 2023
    Dataset provided by: Figshare (http://figshare.com/)
    Authors: Isaac Johnson
    License: CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    This dataset takes the English Wikipedia articles labeled with mid-level WikiProject categories [1] that were used to train the initial drafttopic [2] model and augments them with Wikidata claims.

  19. WikiData - Datasets - OpenData.eol.org

    • opendata.eol.org
    Updated Mar 22, 2017
    + more versions
    Cite
    eol.org (2017). WikiData - Datasets - OpenData.eol.org [Dataset]. https://opendata.eol.org/dataset/wikidata
    Dataset updated: Mar 22, 2017
    Dataset provided by: Encyclopedia of Life (http://eol.org/)
    License: Open Data Commons Attribution License (ODC-By) v1.0 (https://www.opendatacommons.org/licenses/by/1.0/)
    License information was derived automatically

    Description

    For questions or use cases calling for large, multi-use aggregate data files, please visit the EOL Services forum at http://discuss.eol.org/c/eol-services

  20. wikidata-extraction

    • huggingface.co
    Cite
    Piet, wikidata-extraction [Dataset]. https://huggingface.co/datasets/piebro/wikidata-extraction
    Authors: Piet
    License: https://choosealicense.com/licenses/cc0-1.0/

    Description

    Wikidata Extraction

    This dataset contains all RDF triples extracted from the latest Wikidata, converted from the N-Triples format to Parquet. The data originates from Wikidata, a free and open knowledge base that acts as central storage for structured data used by Wikipedia and other Wikimedia projects. The source file is the "truthy" N-Triples dump (latest-truthy.nt.bz2), which contains only the current, non-deprecated statements. The code to extract this data is available at… See the full description on the dataset page: https://huggingface.co/datasets/piebro/wikidata-extraction.
