Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains information about commercial organizations (companies) and their relations with other commercial organizations, persons, products, locations, groups and industries. The dataset has the form of a graph. It has been produced by the SmartDataLake project (https://smartdatalake.eu), using data collected from Wikidata (https://www.wikidata.org).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains triples curated from Wikidata surrounding news events with causal relations, and is released as part of our WWW'23 paper, "Event Prediction using Case-Based Reasoning over Knowledge Graphs".
Starting from a set of classes that we consider to be types of "events", we queried Wikidata to collect entities that were an instanceOf an event class and that were connected to another such event entity by a causal triple (https://www.wikidata.org/wiki/Wikidata:List_of_properties/causality). For all such cause-effect event pairs, we then collected a 3-hop neighborhood of outgoing triples.
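For illustration, a query in this spirit can be issued against the public Wikidata Query Service. This is a minimal sketch, not the paper's exact pipeline: the event class wd:Q1190554 ("occurrence") and the causal property wdt:P1542 ("has effect") are illustrative choices, not necessarily the exact set used by the authors.

# A minimal sketch: retrieve cause-effect pairs of event entities from the
# public Wikidata Query Service. Class and property choices are illustrative.
import requests

QUERY = """
SELECT ?cause ?effect WHERE {
  ?cause wdt:P1542 ?effect .                # causal triple: "has effect"
  ?cause wdt:P31/wdt:P279* wd:Q1190554 .    # cause is an instance of an event class
  ?effect wdt:P31/wdt:P279* wd:Q1190554 .   # effect is too
}
LIMIT 100
"""

resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "causal-events-example/0.1 (example)"},
    timeout=120,
)
for row in resp.json()["results"]["bindings"]:
    print(row["cause"]["value"], "->", row["effect"]["value"])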
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
UMLS_Wikidata is a German biomedical entity linking knowledge base that provides good coverage for German entity linking datasets such as WikiMed-DE-BEL. The knowledge base was created by selecting the Wikidata items that carry a UMLS Concept Unique Identifier (CUI). Each entry in the knowledge base consists of the Wikidata QID, label, description, UMLS CUI and aliases. The resulting KB has 731,414 Wikidata QIDs, 599,330 unique CUIs and 671,797 unique (mention, CUI) pairs, where mentions include labels and aliases.
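As a sketch of the selection step, the query below retrieves items carrying a UMLS CUI via property P2892 ("UMLS CUI"), together with German labels, descriptions and aliases. Building the full 731k-item KB would presumably require a dump-based pipeline; this shows only the query shape, not the production process.

# A minimal sketch: Wikidata items with a UMLS CUI (P2892), with German
# label, description and aliases. Illustrative only; not the actual pipeline.
import requests

QUERY = """
SELECT ?item ?itemLabel ?itemDescription ?cui ?alias WHERE {
  ?item wdt:P2892 ?cui .
  OPTIONAL { ?item skos:altLabel ?alias . FILTER(LANG(?alias) = "de") }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "de". }
}
LIMIT 100
"""

resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "umls-kb-example/0.1 (example)"},
    timeout=120,
)
for row in resp.json()["results"]["bindings"]:
    print(row["item"]["value"], row["cui"]["value"])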
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Source file: GeneTaxon_wikidata-20190121-all.ttl.gz
More information: https://www.semantic-web-journal.net/content/wikidata-subsetting-approaches-tools-and-evaluation
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The derenrich/wikidata-en-descriptions dataset, hosted on Hugging Face and contributed by the HF Datasets community.
GNU General Public License v3.0: https://www.gnu.org/licenses/gpl-3.0.html
The work consists of tools for the interaction between Wikidata and the OBO Foundry, and source code for using the MeSH keywords of PubMed publications to enrich biomedical knowledge in Wikidata. This work is funded by the "Adapting Wikidata to support clinical practice using Data Science, Semantic Web and Machine Learning" project within the framework of the Wikimedia Foundation Research Fund.
To cite the work: Turki, H., Chebil, K., Dossou, B. F. P., Emezue, C. C., Owodunni, A. T., Hadj Taieb, M. A., & Ben Aouicha, M. (2024). A framework for integrating biomedical knowledge in Wikidata with open biomedical ontologies and MeSH keywords. Heliyon, 10(19), e38448. doi:10.1016/j.heliyon.2024.e38448.
Wikidata-OBO
tool1.py: A tool for the verification of the semantic alignment between Wikidata and OBO ontologies.
frame.py: The layout of Tool 1.
tool2.py: A tool for extracting Wikidata relations between OBO ontology items.
frame2.py: The layout of Tool 2.
tool3.py: A tool for extracting multilingual language data for OBO ontology items from Wikidata.
frame4.py: The layout of Tool 3.
Wikidata-MeSH
correct_mesh2matrix_dataset.py: Source code for turning MeSH2Matrix into a smaller dataset for biomedical relation classification based on the MeSH keywords of PubMed publications, named MiniMeSH2Matrix.
build_numpy_dataset.py: Source code for building the NumPy files for MiniMeSH2Matrix (relation type-based classification).
label_encoded.csv: A table for the conversion of Wikidata property IDs into MeSH2Matrix class IDs.
new_encoding.csv: A table for the conversion of Wikidata property IDs into MiniMeSH2Matrix class IDs.
super_classes_new_dataset_labels.npy: The NumPy file of the labels for the superclass-based classification.
new_dataset_labels.npy: The NumPy file of the labels for the relation type-based classification.
new_dataset_matrices.npy: The NumPy file of the MiniMeSH2Matrix matrices for biomedical relation classification.
first_level_new_data.json: The JSON file for the conversion of relation types to superclasses.
build_super_classes.py: Source code for building the NumPy files for MiniMeSH2Matrix (superclass-based classification).
FC_MeSH_Model_57_New_Data.ipynb: A Jupyter Notebook for training a dense model to perform the relation type-based classification.
FC_MeSH_Model_57_New_Data_SuperClasses.ipynb: A Jupyter Notebook for training a dense model to perform the superclass-based classification.
new_data_best_model_1: A stored edition of the best model for the relation type-based classification.
new_data_super_classes_best_model_1: A stored edition of the best model for the superclass-based classification.
MiniMeSH2Matrix_SuperClasses_Confusion_Matrix.ipynb: A Jupyter Notebook for generating the confusion matrix for the superclass-based supervised classification.
MiniMeSH2Matrix_Supervised_Classification_Agreement.ipynb: A Jupyter Notebook for generating the matrix of agreement between the accurate predictions for the superclass-based classification and those for the relation type-based classification.
Adding_References_to_Wikidata.ipynb: A Jupyter Notebook to identify the PubMed IDs of relevant references for unsupported Wikidata statements between MeSH terms.
MeSH_Statistics.xlsx: Statistical data about MeSH-based items and relations in Wikidata.
ref_for_unsupported_statements.csv: Retrieved relevant PubMed references for 1k unsupported Wikidata statements.
evaluate_pubmed_ref_assignment.ipynb: A Jupyter Notebook that generates statistics about reference assignment for a sample of 1k unsupported statements.
MeSH_Verification.xlsx: A list of inaccurate or duplicated MeSH IDs in Wikidata, as of August 8th, 2023.
WikiRelationsPMI.csv: A list of pointwise mutual information (PMI) values for the semantic relations between MeSH terms, as available in Wikidata.
WikiRelationsPMIDistribution.xlsx: Distribution of PMI values for all Wikidata relations and for specific Wikidata relation types.
WikiRelationsToVerify.xlsx: Wikidata relations needing attention because they involve Wikidata items with inaccurate MeSH IDs, they cannot be found in PubMed, or their PMI values are below the threshold of 2.
Mesh_part1.py: Python code that verifies the accuracy of the MeSH IDs of Wikidata items.
MeshWikiPart.py: Python code that computes the PMI values for Wikidata relations between MeSH keywords based on PubMed.
Demo.ipynb: A demo of the MeSH-based biomedical relation validation and classification in French.
Id_Term.json: A dict of Medical Subject Headings labels corresponding to MeSH descriptor IDs.
dict_mesh.json: Numbers of occurrences of MeSH keywords in PubMed.
finalmatrix.xlsx: Matrix of PMI values between the 5k most common MeSH keywords.
finalmatrixrev.pkl: Pickle-file edition of the PMI matrix.
pmi2.xlsx: List of significant PMI associations between the 5k most common MeSH keywords reaching a threshold of 2.
Generate5kMatrix.py: Python code that generates the PMI matrix.
clean_pmi2.py: Python code to remove the relations already available in Wikidata from pmi2.xlsx.
missing_rels.xlsx: The final list of significant PMI associations that do not exist in Wikidata.
item_category.json: A dict of MeSH tree categories corresponding to MeSH items.
item_categorization.py: Python code that generates a dict of MeSH tree categories corresponding to MeSH items.
classification.py: Python code for classifying PMI-generated semantic relations between the most common MeSH keywords.
results.xlsx: The output of the classification of the PMI-generated semantic relations between the most common MeSH keywords.
ClassificationStats.ipynb: A Jupyter Notebook for generating statistical data about the classification.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the so-called "truthy" dump of Wikidata, dated on or about May 21, 2022, shared for use in the SemTab 2022 challenge.
Downloaded from https://www.wikidata.org/wiki/Wikidata:Database_download
See the License section of the above page for license information.
THIS DATA IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset is an export of the list of Wikidata items that have a data.gouv.fr dataset identifier (property P6526). See also: data.gouv.fr organizations linked to Wikidata.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Listes de personnalités issues de Wikidata’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from http://data.europa.eu/88u/dataset/57cd85dec751df33bf97bae5 on 16 January 2022.
--- Dataset description provided by original source is as follows ---
CSV files containing lists of notable people as they existed in 2016, with links to Wikipedia and Wikidata.
These data come from crowdsourcing by contributors to the Wikidata project and are released under Creative Commons 0, described at https://creativecommons.org/publicdomain/zero/1.0/
--- Original source retains full ownership of the source dataset ---
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
RDF dump of Wikidata produced with wdumper.
entity count: 1,031,857; statement count: 9,458,674; triple count: 24,554,315
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
A simple tab-separated value (TSV) file, gzipped, listing Wikidata identifiers (Q numbers) and their associated DOIs. There are over 25 million Wikidata items with linked DOIs.
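Consuming the file is straightforward; in the sketch below the file name and the (QID, DOI) column order are assumptions based on the description above.

# A minimal sketch for reading the gzipped TSV. File name and column
# order (QID, then DOI) are assumptions based on the description.
import gzip

with gzip.open("wikidata-dois.tsv.gz", "rt", encoding="utf-8") as fh:
    for line in fh:
        qid, doi = line.rstrip("\n").split("\t", 1)
        print(qid, doi)
        break  # remove to stream all ~25M rows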
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Wikidata Graph Pattern Benchmark (WGPB) is a benchmark consisting of 50 instances of 17 different abstract query patterns giving a total of 850 SPARQL queries. The goal of the benchmark is to test the performance of query engines for more complex basic graph patterns. The benchmark was designed for evaluating worst-case optimal join algorithms but also serves as a general-purpose benchmark for evaluating (basic) graph patterns. The queries are provided in SPARQL syntax and all return at least one solution. We limit the number of results returned to a maximum of 1,000.
Queries
We provide an example of a "square" basic graph pattern (comments are added here for readability):
SELECT * WHERE {
  ?x1 <http://www.wikidata.org/prop/direct/P149> ?x2 .  # architectural style
  ?x2 <http://www.wikidata.org/prop/direct/P1269> ?x3 . # facet of
  ?x3 <http://www.wikidata.org/prop/direct/P156> ?x4 .  # followed by
  ?x1 <http://www.wikidata.org/prop/direct/P135> ?x4 .  # movement
} LIMIT 1000
There are 49 other queries similar to this one in the dataset (replacing the predicates with other predicates), and 50 queries for 16 other abstract query patterns. For more details on these patterns, we refer to the publication mentioned below.
Note that you can try the queries on the public Wikidata Query Service, though some might give a timeout.
Generation
The queries were generated over a reduced version of the Wikidata truthy dump from November 15, 2018 that we call the Wikidata Core Graph (WCG). Specifically, in order to reduce the data volume, multilingual labels, comments, etc., were removed as they have limited use for evaluating joins (English labels were kept under schema:name). Thereafter, in order to facilitate the generation of the queries, triples with rare predicates appearing in fewer than 1,000 triples, and very common predicates appearing in more than 1,000,000 triples, were removed. The queries provided will generate the same results over both graphs.
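The predicate-frequency filter can be sketched in a few lines. The two-pass approach below is illustrative, not the project's actual tooling, and assumes a plain uncompressed N-Triples file with one triple per line.

# A minimal two-pass sketch of the filtering step described above: drop
# triples whose predicate appears in fewer than 1,000 or more than
# 1,000,000 triples. File names are illustrative.
from collections import Counter

def predicate(line: str) -> str:
    # the predicate is the second whitespace-separated term of an N-Triples line
    return line.split(" ", 2)[1]

with open("wikidata-wcg.nt", encoding="utf-8") as fh:
    counts = Counter(predicate(line) for line in fh if line.strip())

with open("wikidata-wcg.nt", encoding="utf-8") as src, \
        open("wikidata-wcg-filtered.nt", "w", encoding="utf-8") as dst:
    for line in src:
        if line.strip() and 1_000 <= counts[predicate(line)] <= 1_000_000:
            dst.write(line)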
Files
This dataset includes three files:
wgpb-queries.zip: The list of 850 queries.
wikidata-wcg.nt.gz: Wikidata truthy graph with English labels.
wikidata-wcg-filtered.nt.bz2: Wikidata truthy graph with English labels, filtering out triples with rare (<1,000 triples) and very common (>1,000,000 triples) predicates.
Code
We provide the code for generating the datasets, queries, etc., along with scripts and instructions on how to run these queries in a variety of SPARQL engines (Blazegraph, Jena, Virtuoso, and our worst-case optimal variant of Jena).
Publication
The benchmark is proposed, described and used in the following paper, where you can find more details about how it was generated, the 17 abstract patterns used, and results for prominent SPARQL engines.
Aidan Hogan, Cristian Riveros, Carlos Rojas and Adrián Soto. "A Worst-Case Optimal Join Algorithm for SPARQL". In the Proceedings of the 18th International Semantic Web Conference (ISWC), Auckland, New Zealand, October 26–30, 2019.
This dataset was created by Daniele Santini
We present DaMuEL, a large Multilingual Dataset for Entity Linking containing data in 53 languages. DaMuEL consists of two components: a knowledge base that contains language-agnostic information about entities, including their claims from Wikidata and named entity types (PER, ORG, LOC, EVENT, BRAND, WORK_OF_ART, MANUFACTURED); and Wikipedia texts with entity mentions linked to the knowledge base, along with language-specific text from Wikidata such as labels, aliases, and descriptions, stored separately for each language. The Wikidata QID is used as a persistent, language-agnostic identifier, enabling the combination of the knowledge base with language-specific texts and information for each entity. Wikipedia documents deliberately annotate only a single mention for every entity present; we further automatically detect all mentions of named entities linked from each document. The dataset contains 27.9M named entities in the knowledge base and 12.3G tokens from Wikipedia texts. The dataset is published under the CC BY-SA licence.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
CSV file for submission of Wikidata entries with SMILES (canonical and isomeric) created with https://github.com/egonw/ons-wikidata/blob/master/PubChem/createSDF.groovy
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Using Wikipedia data to study AI ethics.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a dump of Wikidata from 2018-12-17 in JSON. It is no longer available from Wikidata. It was originally downloaded from https://dumps.wikimedia.org/other/wikidata/20181217.json.gz and recompressed to fit on Zenodo.
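For anyone processing the dump: Wikidata JSON dumps of this era are a single large JSON array with, in practice, one entity object per line, so they can be streamed line by line. The sketch below relies on that layout; the file name matches the original download.

# A minimal sketch for streaming the dump without loading tens of GB of
# JSON at once, relying on the one-entity-per-line layout of the array.
import gzip
import json

with gzip.open("20181217.json.gz", "rt", encoding="utf-8") as fh:
    for raw in fh:
        raw = raw.strip().rstrip(",")
        if raw in ("[", "]", ""):   # skip the array brackets
            continue
        entity = json.loads(raw)
        print(entity["id"])         # e.g. a QID
        break  # remove to stream the whole dump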
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Wikidata cropped URIs
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The subset has been obtained using wdsub version 0.0.33 and the schema:
PREFIX p: <http://www.wikidata.org/prop/>
PREFIX ps: <http://www.wikidata.org/prop/statement/>
PREFIX prov: <http://www.w3.org/ns/prov#>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
start = @
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Annotation of named entities on the existing Parallel Global Voices source, ces-eng language pair. The named entity annotations distinguish four classes: Person, Organization, Location, Misc. The annotation uses the IOB scheme (per-token annotation marking the beginning and the inside of multi-word annotations). The NEL annotation contains Wikidata QIDs.
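For illustration, a token-level view of such an annotation might look as follows; the sentence, tags and QIDs are invented for demonstration and are not taken from the corpus.

# An invented token-level illustration of IOB tags plus NEL QIDs; the
# sentence and QID values are for demonstration only, not from the corpus.
annotated = [
    ("Václav",  "B-PER", "Q36233"),  # beginning of a multi-word Person mention
    ("Havel",   "I-PER", "Q36233"),  # inside of the same mention
    ("visited", "O",     None),
    ("Prague",  "B-LOC", "Q1085"),
    (".",       "O",     None),
]
for token, tag, qid in annotated:
    print(f"{token}\t{tag}\t{qid or '-'}")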