WordNet is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('wordnet', split='train')
for ex in ds.take(4):
    print(ex)
See the guide for more information on tensorflow_datasets.
Lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations. The resulting network of meaningfully related words and concepts can be navigated with a browser.
This dataset was developed by using the WordNet dataset (https://www.kaggle.com/datasets/duketemon/wordnet-synonyms) and pairing words with possible synonyms.
The format of the dataset allows one to create a network of synonyms. Each row represents an edge of the network. This has been used for exploratory analysis of the English language such as finding the most commonly linked words and identifying communities of words.
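The edge-list idea above can be sketched in a few lines of Python. This is only an illustration: the sample rows are invented, the function names are hypothetical, and connected components are used as a crude stand-in for whatever community detection the original analysis applied.

```python
from collections import defaultdict

# Hypothetical sample rows: each (word, synonym) pair is one edge of the network.
edges = [
    ("happy", "glad"),
    ("happy", "joyful"),
    ("glad", "joyful"),
    ("big", "large"),
    ("large", "huge"),
]


def build_graph(edge_rows):
    """Build an undirected adjacency map from (word, synonym) rows."""
    graph = defaultdict(set)
    for word, synonym in edge_rows:
        graph[word].add(synonym)
        graph[synonym].add(word)
    return graph


def most_linked(graph, n=3):
    """Return the n words with the most synonym links (ties broken alphabetically)."""
    return sorted(graph, key=lambda w: (-len(graph[w]), w))[:n]


def communities(graph):
    """Group words into connected components, a simple proxy for communities."""
    seen = set()
    groups = []
    for start in sorted(graph):
        if start in seen:
            continue
        stack = [start]
        group = set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            group.add(node)
            stack.extend(graph[node] - seen)
        groups.append(sorted(group))
    return groups


graph = build_graph(edges)
print(most_linked(graph, 1))
print(communities(graph))
```

For a real analysis, a graph library with proper community detection would likely be a better fit than connected components.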
WordNet License
This license is available as the file LICENSE in any downloaded version of WordNet.
WordNet 3.0 license:
WordNet Release 3.0
This software and database is being provided to you, the LICENSEE, by Princeton University under the following license. By obtaining, using and/or copying this software and database, you agree that you have read, understood, and will comply with these terms and conditions:
Permission to use, copy, modify and distribute this software and database and its documentation for any purpose and without fee or royalty is hereby granted, provided that you agree to comply with the following copyright notice and statements, including the disclaimer, and that the same appear on ALL copies of the software, database and documentation, including modifications that you make for internal use or for distribution.
WordNet 3.0 Copyright 2006 by Princeton University. All rights reserved.
THIS SOFTWARE AND DATABASE IS PROVIDED "AS IS" AND PRINCETON UNIVERSITY MAKES NO REPRESENTATIONS OR WARRANTIES, EXPRESS OR IMPLIED. BY WAY OF EXAMPLE, BUT NOT LIMITATION, PRINCETON UNIVERSITY MAKES NO REPRESENTATIONS OR WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE OR THAT THE USE OF THE LICENSED SOFTWARE, DATABASE OR DOCUMENTATION WILL NOT INFRINGE ANY THIRD PARTY PATENTS, COPYRIGHTS, TRADEMARKS OR OTHER RIGHTS.
The name of Princeton University or Princeton may not be used in advertising or publicity pertaining to distribution of the software and/or database. Title to copyright in this software, database and any associated documentation shall at all times remain with Princeton University and LICENSEE agrees to preserve same.
Thesaurus of Modern Slovene is the largest automatically generated open-access collection of Slovene synonyms. It is sourced from the data in two principal language resources: The Oxford®-DZS Comprehensive English-Slovenian Dictionary and the Gigafida 1.0 corpus of written Slovene. The links identified between synonyms were additionally confirmed using the Dictionary of Standard Slovenian Language (SSKJ). The data extraction and structure for the Thesaurus were based on the frequency and manner in which words co-occur in translation strings of the Oxford-DZS Dictionary. This information is the basis for discriminating between ‘core’ and ‘near’ synonyms, with ‘core’ synonyms exhibiting a greater connection to the keyword. In the following step, an approach combining balanced co-occurrence graphs and the Personal PageRank algorithm automatically divides the synonyms into subgroups and ranks them according to the degree of semantic relatedness to the keyword, as well as their frequency in language use. For the creation methodology, see Krek et al. (2017) in the provided references. The database includes dictionary entries: single- and multiword headwords, their part-of-speech and other linguistic features, as well as automatically extracted synonyms, their type (core or near) and relevancy rank. In version 2.0, 4,544 manually revised antonyms were added to the database. Additionally, for a part of the database, synonyms were distributed under the corresponding word senses. Pertaining to how much lexicographic revision was involved in their preparation, database entries can have one of the following three statuses: (a) ssss-automatic (96,064 entries): no manual revision was conducted; (b) ssss-manual (3,421 entries): word senses and semantic indicators were prepared by lexicographers, and synonyms were manually distributed under each corresponding sense; (c) ssss-hybrid (1,352 entries): manually revised senses are combined with data compiled automatically. 
For novelties of v2.0, see Arhar Holdt et al. (2023) in the provided references.
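To give a feel for the ranking step mentioned above, here is a minimal sketch of a generic personalized PageRank computed by power iteration over a weighted co-occurrence graph. It is not the Thesaurus authors' implementation: the graph, the weights, the damping factor, and the function name are all assumptions made for illustration.

```python
def personalized_pagerank(graph, seed, damping=0.85, iterations=50):
    """Rank nodes by relatedness to `seed` using PageRank with restart at the seed.

    `graph` maps each node to a dict of {neighbour: edge weight}.
    """
    nodes = sorted(graph)
    rank = {node: 1.0 if node == seed else 0.0 for node in nodes}
    for _ in range(iterations):
        new_rank = {node: 0.0 for node in nodes}
        for node in nodes:
            neighbours = graph[node]
            total = sum(neighbours.values())
            if total == 0:
                # Dangling node: send its mass back to the seed.
                new_rank[seed] += damping * rank[node]
                continue
            for neighbour, weight in neighbours.items():
                new_rank[neighbour] += damping * rank[node] * weight / total
        # Restart step: the remaining probability mass always returns to the seed.
        new_rank[seed] += 1.0 - damping
        rank = new_rank
    return rank


# Hypothetical weighted co-occurrence graph around the keyword "fast".
cooccurrence = {
    "fast": {"quick": 5, "rapid": 3, "speedy": 1},
    "quick": {"fast": 5, "rapid": 2},
    "rapid": {"fast": 3, "quick": 2},
    "speedy": {"fast": 1},
}

scores = personalized_pagerank(cooccurrence, seed="fast")
ranked = sorted((w for w in scores if w != "fast"), key=scores.get, reverse=True)
print(ranked)
```

Candidates that co-occur more strongly with the keyword, directly or through other candidates, end up higher in `ranked`.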
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Datasets that can be used together with the Code in: https://github.com/JanKalo/RuleAlign
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A sample of manually evaluated synonymous predicates in DBpedia-2016-10.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The datasets for reproducing the results of the paper "Knowledge Graph Consolidation by Unifying Synonymous Relationships" published at ISWC 2019.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a gzipped three-column TSV file that has prefixes, identifiers, and synonyms for lots of biomedical entities drawn from the terminologies and ontologies in PyOBO. It was generated with the following code in the shell:
    pip install pyobo
    pyobo database synonyms
Any silly name suggestions that include literary references are welcome.
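A file in this shape could be read with the Python standard library roughly as follows. This is a sketch under assumptions: the column order (prefix, identifier, synonym) follows the description above, the sample rows are made up for the demonstration, and whether the real file carries a header row is not stated, so none is handled here.

```python
import csv
import gzip
import io
from collections import defaultdict


def read_synonyms(source):
    """Group synonyms by (prefix, identifier) from a gzipped three-column TSV.

    `source` may be a file path or a binary file object.
    """
    synonyms = defaultdict(list)
    with gzip.open(source, mode="rt", encoding="utf-8", newline="") as handle:
        # QUOTE_NONE keeps synonyms containing quote characters intact.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if len(row) != 3:
                continue  # skip malformed lines rather than guessing
            prefix, identifier, synonym = row
            synonyms[(prefix, identifier)].append(synonym)
    return synonyms


# Illustrative in-memory sample standing in for the real download.
sample = "chebi\t15377\twater\nchebi\t15377\tH2O\nmesh\tD014867\tWater\n"
buffer = io.BytesIO(gzip.compress(sample.encode("utf-8")))

result = read_synonyms(buffer)
print(result[("chebi", "15377")])
```

Reading row by row keeps memory use low if only one prefix is of interest; in that case the filter can be applied inside the loop instead of building the full mapping.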
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_EVALUATION.pdf
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
MultiWordNet is a multilingual lexical database including information about English and Italian words. It is an extension of WordNet 1.6, a lexical database for English developed at Princeton University. MultiWordNet contains information about the following aspects of the English and Italian lexica:
- lexical relations between words
- semantic relations between lexical concepts
- correspondences between Italian and English lexical concepts
- semantic fields
The basic lexical relationship in MultiWordNet is synonymy. Groups of synonyms are used to identify lexical concepts, which are also called synsets. Synsets are the most important unit in MultiWordNet. Information such as semantic fields and semantic relationships is attached to them.
MultiWordNet can be used for a variety of NLP tasks including:
- Information Retrieval: synonymy relations are used for query expansion to improve the recall of IR; cross-language correspondences between Italian and English synsets are used for Cross Language Information Retrieval.
- Semantic tagging: MultiWordNet constitutes a large-coverage sense inventory which is the basis for semantic tagging, i.e. texts are tagged with synset identifiers.
- Disambiguation: semantic relationships are used to measure the semantic distance between words, which can be used to disambiguate the meaning of words in texts. Semantic fields have also proved to be very useful for the disambiguation task.
- Ontologies: MultiWordNet can be seen as an ontology to be used for a variety of knowledge-based NLP tasks.
- Terminologies: MultiWordNet constitutes a robust framework supporting the development of specific structured terminologies.
Release 1.1 of MultiWordNet is currently available. It includes information about 51,000 Italian word meanings and 28,000 synsets (in correspondence with the English equivalents). It also includes a labelling of most WordNet 1.6 synsets with semantic field labels.
Work on MultiWordNet is ongoing. The next release will contain at least 10,000 new word meanings.
Data are contained in a specialized database server, which can be accessed by clients through a socket connection. The database server has been implemented in Lisp under the Unix and Windows environments. An application program interface and graphical browsing interface are provided with the database. A Java implementation of the database is planned for the next release.
For more information, visit: http://multiwordnet.itc.it
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
several software suites which support access to the STEDT database. These are written in Perl and PHP, and present different capabilities and dimensions of this linguistic data. This object is a compressed archive of the svn code repository for the project as of January 5, 2015. The active repository is now on GitHub at https://github.com/stedt-project/sss.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The synonyms of biological species names are shown to be an important component in comprehensive searches of electronic scientific literature databases, but they are not well leveraged within the major literature databases examined. For accepted or valid species names in the Integrated Taxonomic Information System (ITIS) which have synonyms in the system, and which are found in citations within PLoS, PMC, PubMed or Scopus, both the percentage of species for which citations will not be found if synonyms are not used and the percentage increase in the number of citations found by including synonyms are very often substantial. However, there is no correlation between the number of synonyms per species and the magnitude of the effect. Further, the number of citations found does not generally increase proportionally to the number of synonyms available. Users looking for literature on specific species across all of the resources investigated here are often missing large numbers of citations if they are not manually augmenting their searches with synonyms. Missing citations can have serious consequences by effectively hiding critical information. Literature searches should include synonym relationships. As a result of this study, a new web service in ITIS was developed, with examples of how to apply it to this issue, and is announced here to aid in this.
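The two measures named above can be illustrated with a short calculation. The formulas below are one plausible reading of the abstract, not the study's published method, and the species labels and citation counts are invented placeholders.

```python
def synonym_effect(counts):
    """Summarize how much synonyms add to a literature search.

    `counts` maps a species label to a pair:
    (citations found with the accepted name only,
     citations found with the accepted name plus its synonyms).
    Returns (percent of species missed without synonyms,
             percent increase in total citations from synonyms).
    """
    found_with_synonyms = [s for s, (_, with_syn) in counts.items() if with_syn > 0]
    missed = [s for s in found_with_synonyms if counts[s][0] == 0]
    pct_species_missed = 100.0 * len(missed) / len(found_with_synonyms)

    total_accepted = sum(accepted for accepted, _ in counts.values())
    total_with_synonyms = sum(with_syn for _, with_syn in counts.values())
    pct_increase = 100.0 * (total_with_synonyms - total_accepted) / total_accepted
    return pct_species_missed, pct_increase


# Hypothetical citation counts per species.
citation_counts = {
    "species A": (10, 15),
    "species B": (0, 4),
    "species C": (20, 20),
    "species D": (5, 6),
}

pct_missed, pct_increase = synonym_effect(citation_counts)
print(f"{pct_missed:.1f}% of species missed, {pct_increase:.1f}% more citations")
```

The sketch assumes at least one species has citations under its accepted name; otherwise the percent increase is undefined and the division would fail.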
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The RivFISH database aggregates the available data on freshwater-dependent fish presence in Europe, validated at the river basin level and considering taxonomical synonyms for species names, thus allowing for a maximization of data usage and robustness. This database also promotes interoperability with other datasets, including the IUCN Red List of Threatened Species, FishBase and the Catchment Characterisation and Modelling (CCM2) – River and Catchment Database v2.1. It is, as far as the authors know, the most up-to-date and comprehensive database on the presence of freshwater-dependent fish species for European river basins. The structure of the database is also prepared to deal with future alterations in species taxonomy, as well as new records of species occurrence in river basins.
Hazelnut (Corylus avellana L.) is one of the most important tree nut crops in Europe. Germplasm accessions are conserved in ex situ repositories, located in countries where hazelnut production occurs. In this work, we used ten simple sequence repeat (SSR) markers as the basis to establish a core collection representative of the hazelnut genetic diversity conserved in different European collections. A total of 480 accessions were used: 430 from ex situ collections and 50 landraces maintained on-farm. SSR analysis identified 181 genotypes that represented our whole hazelnut germplasm collection (WHGC). Four approaches (utilizing MSTRAT, Power Core, and Core Hunter’s single- and multi-strategy) based on the maximization (M) strategy were used to determine the best sampling method. Core Hunter’s multi-strategy, optimizing both allele coverage (Cv) and Cavalli-Sforza and Edwards (Dce) distance with equal weight, outperformed the others and was selected as the best approach. The final core c...
ChemIDplus, the Chemical Identification Plus Database, is no longer updated. These are the final files from February 22, 2023. All ChemIDplus data have been incorporated into PubChem. ChemIDplus was a dictionary of over 400,000 chemicals (names, synonyms, and structures). ChemIDplus includes links to NLM and other databases and resources, including links to federal, state and international agencies. NLM makes a subset of ChemIDplus data available for download. The ChemIDplus Subset does not include the structure or the toxicity data available from the NLM web versions of the database. The ChemIDplus Subset is updated monthly.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
A lexicographical project, whose aim is to digitize and align two Czech onomasiological dictionaries (Haller 1969–77; Klégr 2007) in order to create an integrated digital multi-purpose lexico-semantic database of Czech.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Percent Increase in Number of Citations Found Due to Synonyms.
This dataset is a thesaurus (MilkOligoThesaurus) gathering milk oligosaccharide names. It is a table of unique milk oligosaccharides (rows) and descriptors (columns), including the name of the molecule, abbreviation, chemical database ID if available, chemical information (monoisotopic mass, osidic composition, molecular formula), synonyms, isomer groups, and source scientific articles. The archive also includes two RDF serializations (in RDF/XML and TTL) of the dataset based on the W3C SKOS standard. The intermediate tabular file used to produce these serializations with SkosPlay!, a free online conversion tool, is included. A data paper has been published in Data in Brief to detail how the data were collected, described and transformed (https://doi.org/10.1016/j.dib.2024.110404). Language: English
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_EVALUATION.pdf
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
A. Available Wordnets
Following the announcement of the EuroWordNet databases in the last issue of the ELRA Newsletter (Vol.4 N.2), we are happy to announce that the list of EuroWordNet languages has grown. The following wordnets are now available via ELRA:
ELRA ref.  | Language                               | Synsets | Word Meanings | Language Internal Relations | Equivalence Relations
ELRA-M0015 | English (addition to English WordNet)  | 16361   | 40588         | 42140                       | 0
ELRA-M0016 | Dutch                                  | 44015   | 70201         | 111639                      | 53448
ELRA-M0017 | Spanish                                | 23370   | 50526         | 55163                       | 21236
ELRA-M0018 | Italian                                | 48529   | 48499         | 117068                      | 71789
ELRA-M0019 | German                                 | 15132   | 20453         | 34818                       | 16347
ELRA-M0020 | French                                 | 22745   | 32809         | 49494                       | 22730
ELRA-M0021 | Czech                                  | 12824   | 19949         | 26259                       | 12824
ELRA-M0022 | Estonian                               | 9317    | 13839         | 16318                       | 9004
B. LR(1) Common Components (All Foreground - Data of layer 1)
A. The Inter-Lingual-Index, which is a list of records (ILI-records), in the form of synsets mainly taken from WordNet1.5 or manually created. An ILI-record contains:
A.1 synset: set of synonymous words or phrases (mostly from WordNet1.5)
A.2 part-of-speech
A.3 one or more Top-Concept classifications (Optional)
A.4 one or more Domain labels (Optional)
A.5 a gloss in English (mostly from WordNet1.5)
A.6 a unique ID linking the synset to its source (mostly WordNet1.5)
B. Top-Ontology: an ontology of 63 basic semantic classes based on fundamental distinctions. By means of the Top-Ontology all the wordnets can be accessed using a single language-independent classification-scheme. Top-Concepts are only assigned to ILI-records.
C. Domain-ontology: an ontology of subject-domains optionally assigned to ILI-records.
D. A selection of ILI-records, the so-called Base-Concepts, which play a major role in the different wordnets. These Base-Concepts form the core of all the wordnets. All the Base-Concepts are classified in terms of the Top-Concepts that apply to them.
E. WordNet1.5 (91591 synsets; 168217 meanings; 126520 entry words) in EuroWordNet format.
C. LR(2) Language-Specific Components (Data of layer 2 - partly Foreground and partly Background)
Wordnets produced in the first project (LE2-4003):
F. Dutch wordnet
G. English wordnet (additional relations which are missing in WordNet1.5)
H. Italian wordnet
I. Spanish wordnet
After extension of the project (LE4-8328):
J. German wordnet
K. French wordnet
L. Czech wordnet
M. Estonian wordnet
The specific wordnets are language-internal structures, minimally containing:
o set of variants or synonyms making up the synset
o part-of-speech
o language-internal relations to other synsets
o equivalence relations with ILI-records
o a unique-id linking the synset to its source
Each wordnet will be distributed with LR1 and will include documentation on LR1 and the distributed wordnet. All the data will be distributed as text-files in the EuroWordNet import format and as Polaris database files (see below LR3). The EuroWordNet viewer (Periscope, see below LR3) can be used to access the database version. Polaris has to be licensed to modify and...
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Dataset Card for Dataset tla-demotic-v18-premium
This dataset contains Demotic sentences in transliteration, with lemmatization, POS glossing, and a German translation. The data come from the database of the Thesaurus Linguae Aegyptiae, corpus version 18, and contain only fully intact, unambiguously readable sentences (13,383 of 31,156 sentences), adjusted for philological and editorial markup.
Dataset Details
Dataset Description… See the full description on the dataset page: https://huggingface.co/datasets/thesaurus-linguae-aegyptiae/tla-demotic-v18-premium.
Development of the USGS Biocomplexity Thesaurus began in 2002-2003 through a partnership between the former USGS NBII Program and CSA, a worldwide information company with more than 30 years' experience as a leading bibliographic database provider. The original Biocomplexity Thesaurus, first made available online in 2003, was a merger of five individual thesauri:
- the CSA Aquatic Sciences and Fisheries Thesaurus
- the CSA Life Sciences Thesaurus
- the CSA Pollution Thesaurus
- the CSA Sociological Thesaurus
- the CERES/NBII Thesaurus
Additional thesauri, including fire-related terminologies, are in the process of being added. The CSA-NBII Biocomplexity Thesaurus is being used globally by USGS partners and other organizations in support of the classification and retrieval of biological data and information.