Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Mapping between Freebase and Wikidata entities
This dataset maps Freebase IDs to Wikidata IDs and labels. It is useful for visualising entities and for better understanding when working with datasets like FB15k-237. How it was created:
1. Download the Freebase-Wikidata mapping from here. [compressed size: 21.2 MB]
2. Download the Wikidata entities data from here. [compressed size: 81 GB]
3. Align the labels with the Freebase/Wikidata ID pairs (a sketch of this step follows below).
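A minimal sketch of the alignment step, assuming the mapping has been extracted to a two-column TSV of Freebase ID and Wikidata ID, and the labels to a two-column TSV of Wikidata ID and English label; both file names and formats are assumptions, not part of the original release:

```python
# Hypothetical file layouts: fb2w_mapping.tsv (freebase_id <TAB> wikidata_id)
# and wikidata_labels.tsv (wikidata_id <TAB> english_label).
import csv

def load_tsv(path):
    """Load a two-column TSV into a dict."""
    with open(path, encoding="utf-8") as f:
        return dict(csv.reader(f, delimiter="\t"))

fb_to_wd = load_tsv("fb2w_mapping.tsv")      # e.g. "/m/02mjmr" -> "Q76"
wd_labels = load_tsv("wikidata_labels.tsv")  # e.g. "Q76" -> "Barack Obama"

# Join the two maps: freebase_id -> (wikidata_id, english_label)
aligned = {
    fb_id: (wd_id, wd_labels.get(wd_id, ""))
    for fb_id, wd_id in fb_to_wd.items()
}
```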
No license specified: https://academictorrents.com/nolicensespecified
A BitTorrent file to download data with the title 'wikidata-20240701-all.json.bz2'
Attribution 1.0 (CC BY 1.0): https://creativecommons.org/licenses/by/1.0/
License information was derived automatically
This data dump of Wikidata is published to allow fair and replicable evaluation of KGQA systems with the QALD-10 benchmark. QALD-10 is newly released and was used in the QALD-10 Challenge. Anyone interested in evaluating their KGQA systems with QALD-10 can download this dump and set up a local Wikidata endpoint in their server.
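Once the dump is loaded into a local triple store, the endpoint can be queried with the benchmark's gold SPARQL queries. A minimal sketch, assuming the SPARQLWrapper package and a placeholder local endpoint URL:

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Placeholder endpoint URL; use whatever address your local Wikidata service exposes.
sparql = SPARQLWrapper("http://localhost:9999/sparql")
sparql.setReturnFormat(JSON)

# Example query (not from QALD-10): fetch the English label of Q42.
sparql.setQuery("""
    PREFIX wd: <http://www.wikidata.org/entity/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?label WHERE {
      wd:Q42 rdfs:label ?label .
      FILTER(LANG(?label) = "en")
    }
""")

for binding in sparql.query().convert()["results"]["bindings"]:
    print(binding["label"]["value"])
```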
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Wikipedia, the free encyclopedia, and Wikidata, the free knowledge base, are crowd-sourced projects supported by the Wikimedia Foundation. Wikipedia is nearly 20 years old and recently added its six millionth article in English. Wikidata, its younger machine-readable sister project, was created in 2012 but has been growing rapidly and currently contains more than 75 million items.
These projects contribute to the Wikimedia Foundation's mission of empowering people to develop and disseminate educational content under a free license. They are also heavily utilized by computer science research groups, especially those interested in natural language processing (NLP). The Wikimedia Foundation periodically releases snapshots of the raw data backing these projects, but these are in a variety of formats and were not designed for use in NLP research. In the Kensho R&D group, we spend a lot of time downloading, parsing, and experimenting with this raw data. The Kensho Derived Wikimedia Dataset (KDWD) is a condensed subset of the raw Wikimedia data in a form that we find helpful for NLP work. The KDWD has a CC BY-SA 3.0 license, so feel free to use it in your work too.
This particular release consists of two main components - a link annotated corpus of English Wikipedia pages and a compact sample of the Wikidata knowledge base. We version the KDWD using the raw Wikimedia snapshot dates. The version string for this dataset is kdwd_enwiki_20191201_wikidata_20191202 indicating that this KDWD was built from the English Wikipedia snapshot from 2019 December 1 and the Wikidata snapshot from 2019 December 2. Below we describe these components in more detail.
Dive right in by checking out some of our example notebooks:
- page.csv (page metadata and Wikipedia-to-Wikidata mapping)
- link_annotated_text.jsonl (plain text of Wikipedia pages with link offsets)
- item.csv (item labels and descriptions in English)
- item_aliases.csv (item aliases in English)
- property.csv (property labels and descriptions in English)
- property_aliases.csv (property aliases in English)
- statements.csv (truthy qpq statements)

The KDWD is three connected layers of data. The base layer is a plain text English Wikipedia corpus, the middle layer annotates the corpus by indicating which text spans are links, and the top layer connects the link text spans to items in Wikidata. Below we'll describe these layers in more detail.
[Diagram: the three connected layers of the KDWD]
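A minimal sketch of connecting the top two layers, i.e. mapping Wikipedia pages to their Wikidata items and English labels, assuming pandas. The column names page_id, item_id, and en_label are assumptions; check the actual file headers or the example notebooks:

```python
import pandas as pd

pages = pd.read_csv("page.csv")  # page metadata + Wikipedia-to-Wikidata mapping
items = pd.read_csv("item.csv")  # item labels and descriptions in English

# Attach each page's Wikidata label by joining on the item identifier
# (join key and column names are assumed).
page_labels = pages.merge(items, on="item_id", how="left")
print(page_labels[["page_id", "item_id", "en_label"]].head())
```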
The first part of the KDWD is derived from Wikipedia. In order to create a corpus of mostly natural text, we restrict our English Wikipedia page sample to those that:
CC0 1.0: https://choosealicense.com/licenses/cc0-1.0/
Wikidata Label Maps 2025-08-20
Label maps extracted from the 2025-08-20 Wikidata dump. Use these to resolve Q and P identifiers to English labels quickly.
Files
- entity_map.parquet: columns id, label, description (Q items; 77.4M rows)
- prop_map.parquet: columns id, label, description, datatype (P items; 11,568 rows)
All files are Parquet with Zstandard compression.
Download Options
A) Hugging Face snapshot to a local folder
from huggingface_hub import…
See the full description on the dataset page: https://huggingface.co/datasets/yashkumaratri/wikidata-label-maps-20250820.
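A minimal sketch of option A, assuming the huggingface_hub and pandas (with pyarrow) packages are installed; the repo id is taken from the dataset page above and the local folder name is arbitrary:

```python
from huggingface_hub import snapshot_download
import pandas as pd

# Download the whole dataset repository to a local folder.
local_dir = snapshot_download(
    repo_id="yashkumaratri/wikidata-label-maps-20250820",
    repo_type="dataset",
    local_dir="wikidata-label-maps",
)

# Resolve a Q identifier (Q42 is just an example) to its English label.
entities = pd.read_parquet(f"{local_dir}/entity_map.parquet")
print(entities.loc[entities["id"] == "Q42", "label"].iloc[0])
```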
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
External References of English Wikipedia (ref-wiki-en) is a corpus of the plain-text content of 2,475,461 external webpages linked from the reference section of articles in English Wikipedia. Specifically:
- 32,329,989 external reference URLs were extracted from a 2018 HTML dump of English Wikipedia. Removing repeated and ill-formed URLs yielded 23,036,318 unique URLs.
- These URLs were filtered to remove file extensions for unsupported formats (videos, audio, etc.), yielding 17,781,974 downloadable URLs.
- The URLs were loaded into Apache Nutch and continuously downloaded from August 2019 to December 2019, resulting in 2,475,461 successfully downloaded URLs. Not all URLs could be accessed. The order in which URLs were accessed was determined by Nutch, which partitions URLs by host and then randomly chooses among the URLs for each host.
- The content of these webpages was indexed in Apache Solr by Nutch. From Solr we extracted a JSON dump of the content.
- Many URLs offer a redirect; unfortunately, Nutch does not index redirect information. This means that connecting the Wikipedia article (with the pre-redirect link) to the downloaded webpage (at the post-redirect link) was complicated. However, by inspecting the order of download in the Nutch log files, we managed to recover links for 2,058,896 documents (83%) to their original Wikipedia article(s).
- We further managed to associate 3,899,953 unique Wikidata items with at least one external reference webpage in the corpus.
The ref-wiki-en corpus is incomplete, i.e., we did not attempt to download all reference URLs for English Wikipedia. We therefore also collected a smaller, complete corpus of the external references of 5,000 Wikipedia articles (ref-wiki-en-5k). We sampled from 5 ranges of Wikidata items: Q1-10000, Q10001-100000, Q100001-1000000, Q1000001-10000000, and Q10000001-100000000. From each range we sampled 1,000 items. We then scraped the external reference URLs for the Wikipedia articles corresponding to these items and downloaded them. The resulting corpus contains 37,983 webpages.

Each line of the corpus (ref-wiki-en, ref-wiki-en-5k) encodes the webpage of an external reference in JSON format. Specifically, we provide:
- tstamp: when the webpage was accessed.
- host: the domain (FQDN, post-redirect) from which the webpage was retrieved.
- title: the title (meta) of the document.
- url: the URL (post-redirect) of the webpage.
- Q: the Q-code identifiers of the Wikidata items whose corresponding Wikipedia article is confirmed to link to this webpage.
- content: a plain-text encoding of the content of the webpage.
Below we provide an abbreviated example of a line from the corpus:

{"tstamp":"2019-09-26T01:22:43.621Z","host":"geology.isu.edu","title":"Digital Geology of Idaho - Basin And Range","url":"http://geology.isu.edu/Digital_Geology_Idaho/Module9/mod9.htm","Q":[810178],"content":"Digital Geology of Idaho - Basin And Range 1 - Idaho Basement Rock 2 - Belt Supergroup 3 - Rifting & Passive Margin 4 - Accreted Terranes 5 - Thrust Belt 6 - Idaho Batholith 7 - North Idaho & Mining 8 - Challis Volcanics 9 - Basin and Range 10 - Columbia River Basalts 11 - SRP & Yellowstone 12 - Pleistocene Glaciation 13 - Palouse & Lake Missoula 14 - Lake Bonneville Flood 15 - Snake River Plain Aquifer Basin and Range Province - Teritiary Extension General geology of the Basin and Range Province Mechanisms of Basin and Range faulting Idaho Basin and Range south of the Snake River Plain Idaho Basin and Range north of the Snake River Plain Local areas of active and recent Basin & Range faulting: Borah Peak PDF Slideshows: North of SRP , South of SRP , Borah Earthquake Flythroughs: Teton Valley , Henry's Fork , Big Lost River , Blackfoot , Portneuf , Raft River Valley , Bear River , Salmon Falls Creek , Snake River , Big Wood River Vocabulary Words thrust fault Basin and Range Snake River Plain half-graben transfer zone Fly-throughs General geology of the Basin and Range Province The Basin and Range Province generally includes most of eastern California, eastern Oregon, eastern Washington, Nevada, western Utah, southern and western Arizona, and southeastern Idaho. ..."}

A summary of the files we make available:
- ref-wiki-en.json.gz: 2,475,461 external reference webpages (JSON format)
- ref-wiki-en_urls.txt.gz: 23,036,318 unique raw links to external references (plain-text format)
- ref-wiki-en-5k.json.gz: 37,983 external reference webpages (JSON format)
- ref-wiki-en-5k_urls.json.gz: 70,375 unique raw links to external references (plain-text format)
- ref-wiki-en-5k_Q.txt.gz: 5,000 Wikidata Q identifiers forming the 5k dataset (plain-text format)
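A minimal sketch of streaming ref-wiki-en.json.gz, assuming each line is one JSON object with the fields described above:

```python
import gzip
import json

with gzip.open("ref-wiki-en.json.gz", "rt", encoding="utf-8") as f:
    for line in f:
        page = json.loads(line)
        # The "Q" field holds numeric Wikidata item ids confirmed to reference this page.
        for qid in page.get("Q", []):
            print(f"Q{qid}", page["url"], page["title"])
        break  # remove to process the whole corpus
```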
Further details can be found in the publication:
Suggesting References for Wikidata Claims based on Wikipedia's External References. Paolo Curotto, Aidan Hogan. Wikidata Workshop @ISWC 2020.
Further material relating to this publication (including code for a proof-of-concept interface) is also available.
Attribution-NoDerivs 4.0 (CC BY-ND 4.0): https://creativecommons.org/licenses/by-nd/4.0/
License information was derived automatically
This is a collection of pre-processed Wikidata JSON files that were used in the creation of the CSQA dataset (Ref: https://arxiv.org/abs/1801.10314).
Please refer to https://amritasaha1812.github.io/CSQA/download/ for more details.
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
A dataset for the NIL-detection and NIL-disambiguation tasks.
The NILK dataset has two main features: 1) it marks NIL-mentions for NIL-detection by extracting mentions that belong to newly added entities in Wikipedia text; 2) it provides an entity label for NIL-disambiguation by marking NIL-mentions with Wikidata IDs from the newer dump.
Dataset files contain JSON objects of the following structure:
{"mention":"Walter Damrosch", "offset":348, "length":15, "context":"...the conductor Walter Damrosch. He scored the piece for the standard instruments of the symphony orchestra plus celesta, saxophone, and automobile horns...", "wikipedia_page_id":"309", "wikidata_id":"Q725579", "nil":false}
The dataset contains both linked and unlinked mentions; one can distinguish between them by checking the "nil" flag. To obtain NIL-mentions, we compared two Wikidata dumps, from 2017 and 2021. NIL-mentions carry a Wikidata ID from the 2021 dump, which can be used to check whether such mentions refer to the same entity.
The dataset was designed with Wikidata 2017 as the target knowledge base in mind: https://archive.org/download/wikibase-wikidatawiki-20170213/wikidata-20170213-all.json.gz
nilk_03_2023.zip contains the same data with longer contexts (unsplit).
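A minimal sketch of separating NIL-mentions from linked mentions via the "nil" flag, assuming a newline-delimited JSON file with the structure shown above; the file name is a placeholder:

```python
import json

nil_mentions, linked_mentions = [], []
with open("nilk.jsonl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        # Route each mention by its "nil" flag.
        (nil_mentions if record["nil"] else linked_mentions).append(record)

print(len(nil_mentions), "NIL mentions;", len(linked_mentions), "linked mentions")
```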
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Relation extraction dataset with its knowledge graph.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset contains biographical information derived from articles on English Wikipedia as it stood in early June 2024. It was created as part of the Structured Contents initiative at Wikimedia Enterprise and is intended for evaluation and research use.
The beta sample dataset is a subset of the Structured Contents Snapshot focusing on people with infoboxes in English Wikipedia, provided as JSON files (compressed in tar.gz).
We warmly welcome any feedback you have. Please share your thoughts, suggestions, and any issues you encounter on the discussion page for this dataset here on Kaggle.
Noteworthy included fields:
- name: title of the article.
- identifier: ID of the article.
- image: main image representing the article's subject.
- description: one-sentence description of the article for quick reference.
- abstract: lead section, summarizing what the article is about.
- infoboxes: parsed information from the side panel (infobox) on the Wikipedia article.
- sections: parsed sections of the article, including links. Note: excludes other media/images, lists, tables and references or similar non-prose sections.
The Wikimedia Enterprise Data Dictionary explains all of the fields in this dataset.
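A minimal sketch of scanning the archive, assuming (hypothetically) that the tar.gz contains newline-delimited JSON files whose objects carry the fields listed above; the archive name is a placeholder and the exact layout should be checked against the Data Dictionary:

```python
import json
import tarfile

with tarfile.open("structured-contents-people.tar.gz", "r:gz") as tar:
    for member in tar:
        if not member.name.endswith(".json"):
            continue
        fileobj = tar.extractfile(member)
        if fileobj is None:
            continue
        # Assumed: one JSON object per line describing one article.
        for line in fileobj:
            article = json.loads(line)
            print(article["identifier"], article["name"], article.get("description"))
        break  # remove to scan every file in the archive
```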
Infoboxes: 2 GB compressed, 11 GB uncompressed.
Infoboxes + sections + short description: 4.12 GB compressed, 21.28 GB uncompressed.
Article analysis and filtering breakdown:
- Total # of articles analyzed: 6,940,949
- # people found with QID: 1,778,226
- # people found with Category: 158,996
- # people found with Biography Project: 76,150
- Total # of people articles found: 2,013,372
- Total # people articles with infoboxes: 1,559,985

End stats:
- Total number of people articles in this dataset: 1,559,985
  - that have a short description: 1,416,701
  - that have an infobox: 1,559,985
  - that have article sections: 1,559,921
This dataset includes 235,146 people articles that exist on Wikipedia but aren't yet tagged on Wikidata as instance of:human.
This dataset was originally extracted from the Wikimedia Enterprise APIs on June 5, 2024. The information in this dataset may therefore be out of date. This dataset isn't being actively updated or maintained, and has been shared for community use and feedback. If you'd like to retrieve up-to-date Wikipedia articles or data from other Wikiprojects, get started with Wikimedia Enterprise's APIs
The dataset is built from the Wikimedia Enterprise HTML “snapshots”: https://enterprise.wikimedia.com/docs/snapshot/ and focuses on the Wikipedia article namespace (namespace 0 (main)).
Wikipedia is a human-generated corpus of free knowledge, written, edited, and curated by a global community of editors since 2001. It is the largest and most accessed educational resource in history, accessed over 20 billion times by half a billion people each month. Wikipedia represents almost 25 years of work by its community: the creation, curation, and maintenance of millions of articles on distinct topics. This dataset includes the biographical contents of the English Wikipedia language edition (https://en.wikipedia.org/), written by the community.
Wikimedia Enterprise provides this dataset under the assumption that downstream users will adhere to the relevant free culture licenses when the data is reused. In situations where attribution is required, reusers should identify the Wikimedia project from which the content was retrieved as the source of the content. Any attribution should adhere to Wikimedia’s trademark policy (available at https://foundation.wikimedia.org/wiki/Trademark_policy) and visual identity guidelines (ava...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A dataset of German parliamentary debates covering 74 years of plenary protocols across all 16 state parliaments of Germany as well as the German Bundestag. The debates are separated into individual speeches, which are enriched with metadata identifying the speaker as a member of parliament (MP).
When using this data set, please cite the original paper "Lange, K.-R., Jentsch, C. (2023). SpeakGer: A meta-data enriched speech corpus of German state and federal parliaments. Proceedings of the 3rd Workshop on Computational Linguistics for Political Text Analysis@KONVENS 2023.".
The metadata is separated into two different types: time-specific metadata that contains information only for a legislative period and can change over time (e.g. the party or constituency of an MP), and metadata that is considered fixed, such as the birth date or the name of a speaker. The former is stored along with the speeches, as it is considered temporal information of that point in time, but is additionally stored in the file all_mps_mapping.csv if there is a need to double-check something. The rest of the metadata is stored in the file all_mps_meta.csv. The metadata from this file can be matched with a speech by comparing the speaker ID variable "MPID". The speeches of each parliament are saved in CSV format. Along with the speeches, they contain the following metadata:
The file all_mps_meta.csv contains the following meta information:
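A minimal sketch of the MPID join described above, assuming pandas; file names other than all_mps_meta.csv are placeholders:

```python
import pandas as pd

speeches = pd.read_csv("bundestag_speeches.csv")  # hypothetical per-parliament speech file
mps_meta = pd.read_csv("all_mps_meta.csv")

# Attach the fixed speaker metadata (e.g. name, birth date) to each speech
# via the speaker ID variable "MPID".
enriched = speeches.merge(mps_meta, on="MPID", how="left")
print(enriched.head())
```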
data.gov.tw license: https://data.gov.tw/license
This dataset has been adjusted to match the city government's official website revamp, replacing the existing "New Taipei City Government Event Information" on the platform and adding attachment downloads; it does not include HTML markup. For details, please refer to the latest news and instruction files.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Identify fastest-growing Digital Downloads keywords on WooCommerce. Analyze trending scores to identify the most relevant search terms and stay ahead of market trends for your store.
Maptitude license agreement: https://www.caliper.com/license/maptitude-license-agreement.htm
Address point data for use with GIS mapping software, databases, and web applications are from Caliper Corporation and contain a point layer of over 48 million addresses in 22 states and the District of Columbia.
Maptitude license agreement: https://www.caliper.com/license/maptitude-license-agreement.htm
Area layers of US, Australia, and Canada building footprints for use with GIS mapping software, databases, and web applications.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Identify fastest-growing Digital Downloads keywords on Magento. Analyze trending scores to identify the most relevant search terms and stay ahead of market trends for your store.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Discover top-performing keywords for Digital Downloads on TikTok Shop. Analyze monthly growth rate rankings to discover trending search terms and capitalize on emerging opportunities for your store.
GNU LGPL v3.0: http://www.gnu.org/licenses/lgpl-3.0.html
We obtained descriptions and categories for 130,406 movies from Wikipedia using 1) a local Wikidata dump to find movie names and 2) the wikipediaapi library to download the description and categories for each movie.
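A minimal sketch of step 2, assuming the wikipediaapi package (Wikipedia-API on PyPI); the movie title and user-agent string are placeholders:

```python
import wikipediaapi

# Recent versions of the library require a user agent string.
wiki = wikipediaapi.Wikipedia(user_agent="movie-dataset-example", language="en")

page = wiki.page("The Matrix")  # example movie title
if page.exists():
    description = page.summary              # lead-section description
    categories = list(page.categories.keys())  # category titles
    print(description[:200], categories[:5])
```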
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
5logos is a dataset for object detection tasks - it contains Objects annotations for 3,717 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
Maptitude license agreement: https://www.caliper.com/license/maptitude-license-agreement.htm
FREE layers of banking compliance data for the United States are now available for users of the current version of Maptitude. Three separate geographic files and one table are included in this download.