Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset covers more than 1000 common data science concepts across statistics, machine learning, and artificial intelligence. It has two columns: one holds questions or instructions, the other holds the responses to them. The dataset can be used for question answering and text generation.
The data set "National concept directory in National data catalogue" (Begrepskatalog i Felles datakatalog) contains all terms published in the National concept directory in the National data catalogue. Each term includes at least the recommended term, its definition and the source of the definition. If the owner of the concept has provided it, a term may also include: additional information about the meaning of the term that does not belong in the definition field; permitted and advised terms; examples of use of the term; the subject area the term belongs to; the area of application; legal categories or value ranges of the term; the date the term is valid from; the date the term applies until; and contact information by e-mail and telephone.
Objective: To make all concepts in the National concept directory in the National data catalogue available for download.
(1) The Chinese Concept Dictionary (CCD) provides Chinese equivalents for the English concepts in WordNet 1.6. The total number of concepts is close to 100,000 (the total number of words far exceeds 100,000), including 66,025 noun concepts, 12,127 verb concepts, 17,915 adjective concepts and 3,575 adverb concepts. The transfer of use rights to a number of research institutes and multinational corporations has promoted progress in Chinese-English semantic analysis.
(2) The Multilingual Concept Dictionary (MCD), based on CCD, Japanese WordNet, Korean WordNet and CoreNet, is built by automatic methods followed by manual expert checking. The multilingual dictionary currently holds 8,400 Japanese concepts and 9,700 Korean concepts (in both cases mainly mid-level and high-level concepts), each linked to CCD concepts. Within the WordNet framework, it gives a general description of the basic concepts of the East Asian languages (Chinese, Japanese and Korean).
(3) Please log in to download the data files.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The terminological dictionary was compiled within the framework of the project Development of Slovene in the Digital Environment. It is an example collection of 413 terms from the field of artificial intelligence, especially from the subfields of machine learning, computer vision, natural language processing, and fuzzy logic. Definitions, English equivalents, and possible synonyms are added to the terms. The dictionary is based on a conceptual approach, according to which terms are perceived as designations for concepts that are related to each other in the conceptual system of the subject field. Consequently, the terms are interrelated in the naming system of the subject field. The dictionary is distributed in XML using the TBX (TermBase eXchange) standard for representing and exchanging information from termbases.
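As a minimal sketch of how a TBX file like this might be read, the Python below parses a tiny TBX-style fragment with the standard library. The element names (`termEntry`, `langSet`, `tig`, `term`) follow common TBX usage, but the sample entry, its `id`, and the exact layout of this dictionary's file are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical TBX-style fragment; the real dictionary file may be laid out differently.
SAMPLE_TBX = """<martif type="TBX" xml:lang="sl">
  <text><body>
    <termEntry id="c1">
      <langSet xml:lang="sl"><tig><term>strojno učenje</term></tig></langSet>
      <langSet xml:lang="en"><tig><term>machine learning</term></tig></langSet>
    </termEntry>
  </body></text>
</martif>"""

# ElementTree exposes the predefined xml: prefix under this namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_terms(tbx_text):
    """Return {entry id: {language code: [terms]}} for every termEntry."""
    root = ET.fromstring(tbx_text)
    entries = {}
    for entry in root.iter("termEntry"):
        by_language = {}
        for lang_set in entry.iter("langSet"):
            language = lang_set.get(XML_LANG)
            by_language[language] = [term.text for term in lang_set.iter("term")]
        entries[entry.get("id")] = by_language
    return entries


terms = read_terms(SAMPLE_TBX)
```

Grouping terms by concept entry first, and by language second, mirrors the conceptual approach described above: the entry stands for the concept, and each term is one designation of it.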
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
ABSTRACT Terminological studies can draw on different perspectives, and diachrony is not always among the aspects they discuss. Diachronic Terminology (DT), an approach whose main characteristic is its focus on this aspect, has received more attention in recent decades in several works that share a terminological-diachronic approach. The diversity of this research is enormous, which, according to Dury and Picton (2009), generates vague and imprecise theoretical and methodological contours in Terminology. To help organize these contours in Brazilian Portuguese, since studies in this regard (especially theoretical ones) are still incipient in our country, this paper presents an overview of international and national research, highlighting its main characteristics in terms of contributions and basic conceptions. Based on this panorama, the paper discusses the use of some terms that refer to the phenomena analyzed in these studies, along with their theoretical and methodological implications. It is hoped that this work may arouse more interest in this approach and, going further, serve as an initial guide, as it discusses some paths that future investigations, especially in Brazil, can follow.
The dictionary includes about 1650 terms and concepts (in Armenian, Russian and English) used in the forestry and landscaping sectors, each with a brief explanation in Armenian.
Citation: J.H. Vardanyan, H.T. Sayadyan, Armenian-Russian-English Dictionary of Forest Terminology, Publishing House of the Institute of Botany of NAS RA, Yerevan, 2008.
This dataset contains all of the attribute data. This includes RxNorm-provided attributes, such as normalized 11-digit National Drug Codes (NDCs), UNII codes, and human or veterinary usage markers, and source-provided attributes, such as labeler, definition, and imprint information. Each attribute is stored as an 'Attribute Name' (ATN) and 'Attribute Value' (ATV) pair. For example, NDCs have an ATN of 'NDC' and an ATV of the actual NDC value.
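The ATN/ATV pairing can be illustrated with a short Python sketch that filters attribute rows by name. The three-field row layout, the concept identifiers, the `LABELER` attribute name, and all values below are hypothetical and chosen only for illustration; the real file has its own columns and attribute names.

```python
from collections import defaultdict

# Hypothetical rows: (concept id, attribute name (ATN), attribute value (ATV)).
ROWS = [
    ("1001", "NDC", "00000000001"),
    ("1001", "LABELER", "Example Labs"),
    ("1002", "NDC", "00000000002"),
]


def values_by_attribute(rows, atn):
    """Collect the ATVs of one attribute name, grouped by concept id."""
    grouped = defaultdict(list)
    for concept_id, name, value in rows:
        if name == atn:
            grouped[concept_id].append(value)
    return dict(grouped)


ndcs = values_by_attribute(ROWS, "NDC")
```

Because every attribute shares the same name/value shape, one filter function covers NDCs, UNII codes, and source-provided attributes alike.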
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset provides historical linguistic frequency data related to sex education discourse in British English and American English from 1922 to 2022. Frequencies were extracted from the Google Ngram Viewer (English-GB and English-US corpora, 2019 version) for terms systematically categorized into four distinct conceptual groups. This dataset aims to support research into the evolution of public discourse, pedagogical approaches, and cultural attitudes surrounding sex education over the past century.
The data in this dataset was extracted from the Google Ngram Viewer.
Ngram frequency data was programmatically extracted from the Google Ngram Viewer by accessing generated HTML pages, which contain embedded JSON data. A custom Python script was used to parse the HTML, extract the time-series frequency data for specific terms, and consolidate it into a structured CSV format. Ngram Viewer smoothing was uniformly set to 3 for all queries to mitigate year-to-year fluctuations.
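To make the smoothing setting concrete: in the Ngram Viewer, a smoothing of 3 replaces each year's value with the average of that year and up to three years on either side. The Python sketch below reproduces that idea on a plain list. It is not the dataset authors' script, and the handling of the series edges (averaging only the years that exist) is an assumption.

```python
def smooth(values, smoothing=3):
    """Centered moving average over each value and `smoothing` neighbors per side."""
    smoothed = []
    for i in range(len(values)):
        start = max(0, i - smoothing)
        stop = min(len(values), i + smoothing + 1)
        window = values[start:stop]  # shorter near the edges of the series
        smoothed.append(sum(window) / len(window))
    return smoothed


# A single spike in the middle year is spread across the seven-year window.
raw = [0.0, 0.0, 0.0, 7.0, 0.0, 0.0, 0.0]
result = smooth(raw)
```

A consequence worth keeping in mind when reading the CSV files is that sharp single-year changes appear flattened and spread over neighboring years.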
- English (GB) corpus
- English (US) corpus
Terms were carefully selected and grouped to analyze different facets of sex education discourse. Each group's terms were queried individually or as grouped queries where indicated (e.g., using (All) quantifier in Ngram Viewer).
These terms represent the central, overarching, and foundational concepts that define or are core to the public conversation surrounding sex education.
Terms: sex education, reproductive health, sexual health, contraception, abstinence, consent, STD, STI
This group includes vocabulary related to human anatomy, physiological processes, and biological aspects often discussed in the context of sex education.
Terms: puberty, menstruation, vagina, penis, reproduction, sperm, ovulation
These terms reflect contemporary understandings, progressive approaches, inclusivity, and specific modern public health concerns that have gained significant prominence in later decades of the discourse.
Terms: LGBTQ, gender identity, sexual orientation, body autonomy, safe sex, HIV prevention, AIDS education
This group contains vocabulary that was more prevalent in earlier periods, reflecting older approaches, euphemisms, or terms whose primary usage or connotations have significantly shifted over the past century.
Terms: venereal disease, chastity, morality, family planning, the pill, prophylactic
The dataset is organized as follows:
- sex_education_final_combined_dataset.csv: This file contains all Ngram frequency data for both British and American English, encompassing all terms from all four groups, consolidated into a single DataFrame.
- Sex_ED_UK/: Directory containing individual CSV files for each term group relevant to the British English corpus.
  - group01_Primary Discourse Terms.csv
  - group02_Biological & Reproductive Terms.csv
  - group03_Evolving Discourse Terms.csv
  - group04_Historical Terms.csv
- Sex_ED_USA/: Directory containing individual CSV files for each term group relevant to the American English corpus.
  - group01_Primary Discourse Terms.csv
  - group02_Biological & Reproductive Terms.csv
  - group03_Evolving Discourse Terms.csv
  - group04_Historical Terms.csv
- README.md: This metadata file.
All CSV files (individual and combined) share the following columns:
- Year: Integer - The year of publication of the texts from which the Ngram frequencies were calculated (ranging from 1922 to 2022).
- Term: String - The specific Ngram term or phrase for which the frequency is provided.
- Frequency: Float - The relative frequency of the Term in the Corpus for that Year. This is a proportion of the total number of Ngrams for that year.
- Corpus: String - The Google Ngram corpus from which the data was extracted (British English or American English).
- TermGroup: String - The conceptu...
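A short Python sketch of working with this column layout follows. It reads rows with the standard `csv` module and finds, for one term, the year of peak frequency in each corpus. The sample rows and their frequency values are made up for illustration, and the `TermGroup` value shown is an assumption based on the group file names.

```python
import csv
import io

# Made-up sample rows using the documented column names; not real Ngram values.
SAMPLE_CSV = """Year,Term,Frequency,Corpus,TermGroup
1922,consent,0.00002,British English,Primary Discourse Terms
1922,consent,0.00003,American English,Primary Discourse Terms
2022,consent,0.00004,British English,Primary Discourse Terms
2022,consent,0.00001,American English,Primary Discourse Terms
"""


def peak_year_by_corpus(csv_text, term):
    """Return {corpus: year with the highest Frequency} for one term."""
    best = {}  # corpus -> (year, frequency)
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["Term"] != term:
            continue
        frequency = float(row["Frequency"])
        corpus = row["Corpus"]
        if corpus not in best or frequency > best[corpus][1]:
            best[corpus] = (int(row["Year"]), frequency)
    return {corpus: year for corpus, (year, _) in best.items()}


peaks = peak_year_by_corpus(SAMPLE_CSV, "consent")
```

To run this against the real combined file, replace the `io.StringIO` wrapper with an opened file handle.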
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The resource contains several datasets containing domain-specific data in three languages, English, Slovenian and Croatian, which can be used for various knowledge extraction or knowledge modelling tasks. The resource represents knowledge for the domain of karstology, a subfield of geography studying karst and related phenomena. It contains:
- Definitions: Plain text files contain definitions of karst concepts from relevant glossaries and encyclopaedias, as well as definitions extracted from domain-specific corpora.
- Annotated definitions: Definitions were manually annotated and curated in the WebAnno tool. The annotations span several layers: definition elements, semantic relations following the frame-based theory of terminology (FBT), relation definitors that can be used for learning relation patterns, and semantic categories defined in the domain model.
- Terms, definitions and sources: The TermFrame knowledge base contains terms and their corresponding concept identifiers, definitions and definition sources.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset contains information about the contents of 100 Terms of Service (ToS) of online platforms. The documents were analyzed and evaluated from the point of view of European Union consumer law. The main results are presented in the table titled "Terms of Service Analysis and Evaluation_RESULTS." This table is accompanied by the instructions followed by the annotators, titled "Variables Definitions," which allow the assigned values to be interpreted. In addition, we provide the raw data (analyzed ToS, in the folder "Clear ToS") and the annotated documents (in the folder "Annotated ToS," further subdivided).
SAMPLE: The sample contains 100 contracts of digital platforms operating in sixteen market sectors: Cloud storage, Communication, Dating, Finance, Food, Gaming, Health, Music, Shopping, Social, Sports, Transportation, Travel, Video, Work, and Various. The selected companies' main headquarters span four legal environments: the US, the EU, Poland specifically, and Other jurisdictions. The chosen platforms include both privately held and publicly listed companies and offer both fee-based and free services. Although the sample cannot be treated as representative of all online platforms, it nevertheless accounts for the most popular consumer services in the analyzed sectors and forms a diverse and heterogeneous set.
CONTENT: Each ToS has been assigned the following information: 1. Metadata: 1.1. the name of the service; 1.2. the URL; 1.3. the effective date; 1.4. the language of the ToS; 1.5. the sector; 1.6. the number of words in the ToS; 1.7–1.8. the jurisdiction of the main headquarters; 1.9. whether the company is public or private; 1.10. whether the service is paid or free. 2. Evaluative Variables: remedy clauses (2.1–2.5); dispute resolution clauses (2.6–2.10); unilateral alteration clauses (2.11–2.15); rights to police the behavior of users (2.16–2.17); regulatory requirements (2.18–2.20); and various (2.21–2.25). 3. Count Variables: the number of clauses seen as unclear (3.1) and the number of other documents referred to by the ToS (3.2). 4. Pull-out Text Variables: rights and obligations of the parties (4.1) and descriptions of the service (4.2).
ACKNOWLEDGEMENT: The research leading to these results has received funding from the Norwegian Financial Mechanism 2014-2021, project no. 2020/37/K/HS5/02769, titled “Private Law of Data: Concepts, Practices, Principles & Politics.”
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The SSHOC Multilingual Data Stewardship Terminology is a multilingual terminology that collects terms specific to the domain of Data Stewardship, together with their definitions. A list of domain-specific terms was automatically extracted from a corpus pertaining to the domain of Data Stewardship and Curation, validated by domain experts, assigned a definition, and linked to other existing terminologies (Loterre Open Science Thesaurus, terms4FAIRskills, Linked Open Vocabularies, ISO terms and definitions). Each term-definition pair was then automatically translated into multiple languages (Dutch, French, German, Greek, Italian, Slovenian) using DeepL. The Multilingual Data Stewardship Terminology thus consists of 210 concepts available in Dutch, French, German, Greek, Italian and Slovenian. This resource was created within the SSHOC (Social Sciences and Humanities Open Cloud) project (H2020-INFRAEOSC-2018-2-823782). It is the result of the work of Task 3.1.2 "extraction of terminology from technical documentation about standards and interoperability", as described in D3.9, carried out jointly by ILC-CNR and CLARIN ERIC.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset reports on a survey on awareness amongst the European public of common data concepts, terms and principles.
Elements of this survey were used in the following publication:
O’Grady, M., Mangina, E. Citizen scientists—practices, observations, and experience. Humanit Soc Sci Commun 11, 469 (2024). https://doi.org/10.1057/s41599-024-02966-x
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The Slovene definition extraction evaluation dataset RSDO-def contains sentences extracted from the Corpus of term-annotated texts RSDO5 1.1 (http://hdl.handle.net/11356/1470), which contains texts with annotated terms from four domains: biomechanics, linguistics, chemistry, and veterinary science. The file and sentence identifiers are the same as in the original RSDO corpus.
The labels added to the sentences in the dataset denote:
0: Non-definition
1: Weak definition
2: Definition
The dataset consists of two parts:
1. RSDO-def-random was built by random sampling and contains 14 definitions, 98 weak definitions and 849 non-definitions.
2. RSDO-def-larger extends the random sample with sentences found by the pattern-based definition extraction presented in Pollak (2014). It contains 169 definitions, 214 weak definitions and 872 non-definitions.
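The counts above imply a strong class imbalance, which matters when choosing evaluation metrics. The Python sketch below, using only the figures stated in the description, computes the size of each part and the share of each label; the dictionary layout and function name are illustrative choices.

```python
# Label counts as stated in the dataset description.
PARTS = {
    "RSDO-def-random": {"definition": 14, "weak definition": 98, "non-definition": 849},
    "RSDO-def-larger": {"definition": 169, "weak definition": 214, "non-definition": 872},
}


def summarize(counts):
    """Return (total sentences, {label: share of total}) for one part."""
    total = sum(counts.values())
    shares = {label: count / total for label, count in counts.items()}
    return total, shares


summary = {name: summarize(counts) for name, counts in PARTS.items()}

for name, (total, shares) in summary.items():
    formatted = ", ".join(f"{label}: {share:.1%}" for label, share in shares.items())
    print(f"{name}: {total} sentences ({formatted})")
```

With full definitions making up only a small fraction of the random sample, plain accuracy would be dominated by the non-definition class, so per-class precision and recall are likely the more informative measures.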
Both parts were manually annotated by five terminographers. In case of discrepancies between annotators, a consensus was reached and the final label was confirmed by all five annotators. Duplicates were removed in both parts.
The criteria for annotation are based on the standard ISO 1087-1:2000 (E/F) Terminology Work - Vocabulary, Part 1, Theory and Application, which explains a definition as follows: "Representation of a concept by a descriptive statement which serves to differentiate it from related concepts". Weak definition labels were assigned if the extracted sentences contained a term and at least one delimiting feature without a superordinate concept, or sentences consisting of superordinate concepts without delimiting features but with some typical examples. Instances were labeled as Non-definition if the sentence with the extracted concept did not contain any information about the concept or its delimiting features.
The dataset is described in more detail in Tran et al. 2023, where it was used for evaluating definition extraction approaches. If you use this resource, please cite:
Tran, T.H.H., Podpečan, V., Jemec Tomazin, M., Pollak, S. (2023). Definition Extraction for Slovene: Patterns, Transformer Classifiers and ChatGPT. Proceedings of the ELEX 2023: Electronic lexicography in the 21st century. Invisible lexicography: everywhere lexical data is used without users realizing they make use of a “dictionary” (accepted).
Reference to the pattern-based definition extraction method used for creating RSDO-def-larger: Pollak, S. (2014). Extracting definition candidates from specialized corpora. Slovenščina 2.0: empirical, applied and interdisciplinary research, 2(1), pp. 1–40. https://doi.org/10.4312/slo2.0.2014.1.1-40
Related resources:
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
ClinSpEn
This repository contains the sample, test and background data for the ClinSpEn track.
ClinSpEn is part of the Biomedical WMT 2022 shared task. It aims to promote the development and evaluation of machine translation systems adapted to the medical domain through three highly relevant sub-tracks: clinical cases, medical controlled vocabularies/ontologies, and clinical terms and entities extracted from medical content.
Data Description
ClinSpEn proposes three different sub-tracks, each based on a different type of clinical data:
- Clinical Cases:
Parallel EN-ES COVID-19 clinical cases. The direction of this sub-track is EN>ES.
The dataset’s case reports were carefully selected to cover a wide range of aspects related to the disease: different types of patients (children, adults, elderly and pregnant people, babies), different comorbidities (cancer, mental health issues, immunosuppressed patients) and symptomatology (mild and severe presentations, dermatologic, immunologic and psychiatric manifestations, thrombosis, …). The reports were first translated from English to Spanish by a professional medical translator and then revised by a clinical expert.
The sample set is made up of parallel txt files, with the Spanish files having a “.es” extension and the English files a “.en” extension. The reports are aligned line by line, so a given line number holds the same sentence in both languages.
The test and background data is made up of a TSV file with three columns: document number, line number and English line. The clinical cases themselves include COVID-19 case reports as well as diverse content extracted from PubMed.
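Because the sample files are aligned by line number, sentence pairs can be recovered by reading both files and pairing lines at the same index. The Python sketch below shows one way to do this; the file names and sentences are invented for the demonstration, and UTF-8 encoding is an assumption.

```python
import tempfile
from pathlib import Path


def read_parallel(en_path, es_path):
    """Pair line i of the English file with line i of the Spanish file."""
    en_lines = Path(en_path).read_text(encoding="utf-8").splitlines()
    es_lines = Path(es_path).read_text(encoding="utf-8").splitlines()
    if len(en_lines) != len(es_lines):
        raise ValueError(
            f"line counts differ: {len(en_lines)} (en) vs {len(es_lines)} (es)"
        )
    return list(zip(en_lines, es_lines))


# Demonstration with two small, invented files written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    en_file = Path(tmp) / "case1.en"
    es_file = Path(tmp) / "case1.es"
    en_file.write_text("The patient had a fever.\nShe recovered.\n", encoding="utf-8")
    es_file.write_text("La paciente tenía fiebre.\nSe recuperó.\n", encoding="utf-8")
    pairs = read_parallel(en_file, es_file)
```

Raising an error on mismatched line counts is a deliberate choice: a silent `zip` would truncate the longer file and hide an alignment problem.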
- Clinical Terminology:
Parallel EN-ES clinical terms extracted from medical literature and clinical records, with particular focus on diseases, symptoms, findings, procedures and professions, translated and revised by professional medical translators. The direction of this sub-track is ES>EN.
The sample set contains 7 000 terms as a tab-separated file (TSV), with the first column corresponding to English terms and the second column to Spanish terms.
The test and background data is made up of a TSV file with two columns: term number and Spanish term.
- Ontology Concepts:
Parallel EN-ES concepts extracted from various open biomedical ontologies and taxonomies and then manually translated by a professional medical translator. The direction of this sub-track is EN>ES.
The sample data includes 400 concepts. The terms are presented as a tab-separated file (TSV), with the first column corresponding to English terms and the second column to Spanish terms. The third column gives the term’s origin ontology and its corresponding ID, while the fourth gives a link to the concept in the OBO Library.
The test and background data is made up of a TSV file with two columns: concept number and English concept.
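The four-column sample layout for ontology concepts can be read into named records as sketched below in Python. The field names, the sample row, its `EXAMPLE:0001` identifier and its URL are all hypothetical; only the column order comes from the description above, and the absence of a header row is an assumption.

```python
import csv
import io
from typing import NamedTuple


class Concept(NamedTuple):
    english: str
    spanish: str
    origin: str  # origin ontology and ID
    link: str    # link to the concept


# Hypothetical single-row sample in the documented column order.
SAMPLE_TSV = "heart\tcorazón\tEXAMPLE:0001\thttps://example.org/EXAMPLE_0001\n"


def read_concepts(tsv_text):
    """Parse four-column TSV rows into Concept records, skipping malformed rows."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return [Concept(*row) for row in reader if len(row) == 4]


concepts = read_concepts(SAMPLE_TSV)
```

`QUOTE_NONE` keeps any quotation marks inside terms as literal characters, which is usually the safer choice for terminology data.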
Related Links:
- Sub-track website with more information: https://temu.bsc.es/clinspen/
- WMT website: https://www.statmt.org/wmt22/
- CodaLab: https://codalab.lisn.upsaclay.fr/competitions/6696/
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
ABSTRACT Denominative variation in terminology, that is, the use of different names to designate the same concept or nuances of the same conceptual reality, is often considered as a mere stylistic resource or a strategy of thematic progression. It can present, however, distinct conceptual patterns and distinct intra-term relations, which means that the units are not always semantically equivalent. Thus, much more than a thematic progression mechanism, variants act as a discursive and cognitive resource to highlight different conceptual nuances of terminological units. In this sense, in this article, based on the assumptions of modern trends in Terminology, in particular on the Communicative Theory of Terminology (CABRÉ, 1999, 2005) and on the classification of conceptual specification patterns by Kageura (2002), we aim to analyze conceptual patterns and intra-term relations present in terminological variants of Economics. Through this analysis, we intend to show which conceptual information is highlighted in these terminological units and how they can influence the understanding and construction of specialized knowledge.
A glossary of genetic terms to help everyone understand the terms and concepts used in genetic research. In addition to definitions, specialists in the field of genetics share their descriptions of terms, and many terms include images, animation and links to related terms.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Conceptual novelty analysis data based on PubMed Medical Subject Headings
Created by Shubhanshu Mishra and Vetle I. Torvik on April 16th, 2018

## Introduction
This is a dataset created as part of the publication titled: Mishra S, Torvik VI. Quantifying Conceptual Novelty in the Biomedical Literature. D-Lib magazine : the magazine of the Digital Library Forum. 2016;22(9-10):10.1045/september2016-mishra. It contains final data generated as part of our experiments based on the MEDLINE 2015 baseline and the MeSH tree from 2015. The dataset is distributed in the form of the following tab-separated text files:
* PubMed2015_NoveltyData.tsv - Novelty scores for each paper in PubMed. The file contains 22,349,417 rows and 6 columns, as follows:
  - PMID: PubMed ID
  - Year: year of publication
  - TimeNovelty: time novelty score of the paper based on individual concepts (see paper)
  - VolumeNovelty: volume novelty score of the paper based on individual concepts (see paper)
  - PairTimeNovelty: time novelty score of the paper based on pair of concepts (see paper)
  - PairVolumeNovelty: volume novelty score of the paper based on pair of concepts (see paper)
* mesh_scores.tsv - Temporal profiles for each MeSH term for all years. The file contains 1,102,831 rows and 5 columns, as follows:
  - MeshTerm: Name of the MeSH term
  - Year: year
  - AbsVal: Total publications with that MeSH term in the given year
  - TimeNovelty: age (in years since first publication) of MeSH term in the given year
  - VolumeNovelty: age (in number of papers since first publication) of MeSH term in the given year
* meshpair_scores.txt.gz (36 GB uncompressed) - Temporal profiles for each pair of MeSH terms for all years
  - Mesh1: Name of the first MeSH term (alphabetically sorted)
  - Mesh2: Name of the second MeSH term (alphabetically sorted)
  - Year: year
  - AbsVal: Total publications with that MeSH pair in the given year
  - TimeNovelty: age (in years since first publication) of MeSH pair in the given year
  - VolumeNovelty: age (in number of papers since first publication) of MeSH pair in the given year
* README.txt file

## Dataset creation
This dataset was constructed using multiple datasets described in the following locations:
* MEDLINE 2015 baseline: https://www.nlm.nih.gov/bsd/licensee/2015_stats/baseline_doc.html
* MeSH tree 2015: ftp://nlmpubs.nlm.nih.gov/online/mesh/2015/meshtrees/
* Source code provided at: https://github.com/napsternxg/Novelty
Note: The dataset is based on a snapshot of PubMed (which includes Medline and PubMed-not-Medline records) taken in the first week of October, 2016. Check here for information to get PubMed/MEDLINE, and NLMs data Terms and Conditions: Additional data related updates can be found at: Torvik Research Group

## Acknowledgments
This work was made possible in part with funding to VIT from NIH grant P01AG039347 and NSF grant 1348742. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

## License
Conceptual novelty analysis data based on PubMed Medical Subject Headings by Shubhanshu Mishra and Vetle Torvik is licensed under a Creative Commons Attribution 4.0 International License.
Permissions beyond the scope of this license may be available at https://github.com/napsternxg/Novelty
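Since meshpair_scores.txt.gz is about 36 GB uncompressed, it is usually better to stream it row by row than to load it into memory. The Python sketch below shows one way to do that with the standard library. The column order follows the list above, but whether the real file has a header row is unknown; the filter used here happens to skip a header line either way. The sample rows are invented.

```python
import csv
import gzip
import io

# Column order as listed in the dataset description.
COLUMNS = ["Mesh1", "Mesh2", "Year", "AbsVal", "TimeNovelty", "VolumeNovelty"]


def iter_pairs(binary_stream, year):
    """Yield (Mesh1, Mesh2, AbsVal) for one year, decompressing lazily."""
    with gzip.open(binary_stream, mode="rt", encoding="utf-8", newline="") as text:
        reader = csv.DictReader(
            text, fieldnames=COLUMNS, delimiter="\t", quoting=csv.QUOTE_NONE
        )
        for row in reader:
            # A header line, if present, has Year == "Year" and is skipped here.
            if row["Year"] == str(year):
                yield row["Mesh1"], row["Mesh2"], int(row["AbsVal"])


# Invented two-row sample, compressed in memory to stand in for the real file.
sample = "Humans\tNeoplasms\t2014\t10\t40\t1000\nHumans\tNeoplasms\t2015\t12\t41\t1012\n"
payload = gzip.compress(sample.encode("utf-8"))
rows = list(iter_pairs(io.BytesIO(payload), 2015))
```

For the real file, pass a path or an opened binary file instead of the in-memory `BytesIO`; `gzip.open` accepts either.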
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The National Research Data Infrastructure (NFDI) strives to develop FAIR research data and data services for major scientific disciplines, using terminologies as a key factor for semantic annotations and semantic interoperability of data. Several NFDI consortia provide domain-specific terminologies through Terminology services or registries, offering access, search capabilities, visualization, and downloads. Prioritizing user-friendly access, terminology services seamlessly integrate semantic concepts into applications, often operating in the background to enable smooth semantic annotation and data interoperability. We present exemplary fields of application from selected disciplines and how terminology services support semantic search, user experience, annotation workflows, terminology curation and design. This presentation is connected to the following conference paper https://doi.org/10.52825/cordi.v1i.356
https://creativecommons.org/publicdomain/zero/1.0/
| Column Name | Description |
|---|---|
| tags | Tags associated with the glossary entry. |
| categories | Categories of glossary entries. |
| topics | Topics related to the glossary entry. |
| title | The title of the glossary entry. |
| es-title | The Spanish translation of the title. |
| url | The URL or link to the glossary entry. |
| bite | A brief description or explanation of the term in English. |
| es-bite | The Spanish translation of the term's description. |
| audience | The intended audience for the glossary entry. |
| segment | The specific segment this entry relates to. |
| insurance-status | Information related to insurance status. |
| state | The state to which the entry pertains. |
| condition | Any specific conditions associated with the entry. |
1. Understand Medical Insurance Terminology: Use the glossary to understand and explain common medical insurance terms and concepts.
2. Language Translation: If you're working in a bilingual setting or need translations of medical insurance terms, the Spanish translations provided in this dataset can be invaluable.
3. Educational Resources: Create educational resources, articles, or content related to medical insurance by using the glossary entries.
4. Data Enrichment: Enhance your medical insurance-related datasets or applications with standardized terminology using this glossary.
5. Reference for Medical Professionals: This glossary can serve as a reference for healthcare professionals, insurance agents, and researchers in the field.
The UTS API is intended for application developers who want to make Web service calls and retrieve UMLS data within their own applications. The UTS API provides the ability to search, retrieve, and filter terms, concepts, attributes, relations, metadata and more from over 160 vocabularies of the UMLS Metathesaurus, as well as the Semantic Network. Paging, sorting and filtering (PSF) capabilities allow users to customize the results of Web service calls in many ways: choose to include or exclude specific criteria, sort results by fields, or specify the number of results displayed per page. The documentation provides a suite of Web Services Description Language (WSDL) files, API installation instructions, and sample code.