Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data collection contains test Word Usage Graphs (WUGs) for English. Find a description of the data format, code to process the data and further datasets on the WUGsite.
The data is provided for testing purposes and thus contains specific data cases, which are sometimes artificially created, sometimes picked from existing data sets. The data contains the following cases:
Please find more information in the paper referenced below.
Version: 1.0.0, 05.05.2023.
Reference
Dominik Schlechtweg. 2023. Human and Computational Measurement of Lexical Semantic Change. PhD thesis. University of Stuttgart.
Attribution-NoDerivs 4.0 (CC BY-ND 4.0): https://creativecommons.org/licenses/by-nd/4.0/
License information was derived automatically
This data collection contains diachronic Word Usage Graphs (WUGs) for Spanish. Find a description of the data format, code to process the data and further datasets on the WUGsite.
Please find more information on the provided data in the paper referenced below.
The annotation was funded by
Version: 1.0.0, 7.3.2022. Development data.
Reference
Frank D. Zamora-Reina, Felipe Bravo-Marquez, Dominik Schlechtweg. 2022. LSCDiscovery: A shared task on semantic change discovery and detection in Spanish.
Abstract Meaning Representation (AMR) Annotation Release 3.0 was developed by the Linguistic Data Consortium (LDC), SDL/Language Weaver, Inc., the University of Colorado's Computational Language and Educational Research group and the Information Sciences Institute at the University of Southern California. It contains a sembank (semantic treebank) of over 59,255 English natural language sentences from broadcast conversations, newswire, weblogs, web discussion forums, fiction and web text. This release adds new data to, and updates material contained in, Abstract Meaning Representation 2.0 (LDC2017T10), specifically: more annotations on new and prior data, new or improved PropBank-style frames, enhanced quality control, and multi-sentence annotations. AMR captures "who is doing what to whom" in a sentence. Each sentence is paired with a graph that represents its whole-sentence meaning in a tree structure. AMR utilizes PropBank frames, non-core semantic roles, within-sentence coreference, named entity annotation, modality, negation, questions, quantities, and so on to represent the semantic structure of a sentence largely independently of its syntax. LDC also released Abstract Meaning Representation (AMR) Annotation Release 1.0 (LDC2014T12) and Abstract Meaning Representation (AMR) Annotation Release 2.0 (LDC2017T10).

Data
The source data includes discussion forums collected for the DARPA BOLT and DEFT programs, transcripts and English translations of Mandarin Chinese broadcast news programming from China Central TV, Wall Street Journal text, translated Xinhua news texts, various newswire data from NIST OpenMT evaluations and weblog data used in the DARPA GALE program. New source data in AMR 3.0 includes sentences from Aesop's Fables, parallel text and the situation frame data set developed by LDC for the DARPA LORELEI program, and lead sentences from Wikipedia articles about named entities.
The following table summarizes the number of training, dev, and test AMRs for each dataset in the release. Totals are also provided by partition and dataset:

Dataset                  Training  Dev   Test  Totals
BOLT DF MT               1061      133   133   1327
Broadcast conversation   214       0     0     214
Weblog and WSJ           0         100   100   200
BOLT DF English          7379      210   229   7818
DEFT DF English          32915     0     0     32915
Aesop fables             49        0     0     49
Guidelines AMRs          970       0     0     970
LORELEI                  4441      354   527   5322
2009 Open MT             204       0     0     204
Proxy reports            6603      826   823   8252
Weblog                   866       0     0     866
Wikipedia                192       0     0     192
Xinhua MT                741       99    86    926
Totals                   55635     1722  1898  59255

Data in the "split" directory contains 59,255 AMRs split roughly 93.9%/2.9%/3.2% into training/dev/test partitions, with most smaller datasets assigned to one of the splits as a whole. Note that splits observe document boundaries. The "unsplit" directory contains the same 59,255 AMRs with no train/dev/test partition.
Unfortunately, no README file was found for the datano extension, limiting the ability to provide a detailed and comprehensive description. The following description is therefore based on the extension name and general assumptions about data annotation tools within the CKAN ecosystem.

The datano extension for CKAN, presumably short for "data annotation," likely aims to enhance datasets with annotations, metadata enrichment, and quality control features directly within the CKAN environment. It potentially introduces functionalities for adding textual descriptions, classifications, or other forms of annotation to datasets to improve their discoverability, usability, and overall value. This extension could provide an interface for users to collaboratively annotate data, thereby enriching dataset descriptions and making the data more useful for various purposes.

Key Features (Assumed):
* Dataset Annotation Interface: Provides a user-friendly interface within CKAN for adding structured or unstructured annotations to datasets and associated resources. This allows for a richer understanding of the data's content, purpose, and usage.
* Collaborative Annotation: Supports multiple users collaboratively annotating datasets, fostering knowledge sharing and collective understanding of the data.
* Annotation Versioning: Maintains a history of annotations, enabling users to track changes and revert to previous versions if necessary.
* Annotation Search: Allows users to search for datasets based on annotations, enabling quick discovery of relevant data based on specific criteria.
* Metadata Enrichment: Integrates annotations with existing metadata, enhancing metadata schemas to support more detailed descriptions and contextual information.
* Quality Control Features: Includes options to rate, validate, or flag annotations to ensure they are accurate and relevant, improving overall data quality.

Use Cases (Assumed):
1. Data Discovery Improvement: Enables users to find specific datasets more easily by searching for datasets based on their annotations and enriched metadata.
2. Data Quality Enhancement: Allows data curators to improve the quality of datasets by adding annotations that clarify the data's meaning, provenance, and limitations.
3. Collaborative Data Projects: Facilitates collaborative data annotation efforts, wherein multiple users contribute to the enrichment of datasets with their knowledge and insights.

Technical Integration (Assumed): The datano extension would likely integrate with CKAN's existing plugin framework, adding new UI elements for annotation management and search. It could leverage CKAN's API for programmatic access to annotations and utilize CKAN's security model for managing access permissions.

Benefits & Impact (Assumed): By implementing the datano extension, CKAN users can improve data discoverability, quality, and collaborative potential. The enhancement can help data curators refine the understanding and management of data, making it easier to search and understand data and to promote data-driven decision-making.
Abstract Meaning Representation (AMR) Annotation Release 2.0 was developed by the Linguistic Data Consortium (LDC), SDL/Language Weaver, Inc., the University of Colorado's Computational Language and Educational Research group and the Information Sciences Institute at the University of Southern California. It contains a sembank (semantic treebank) of over 39,260 English natural language sentences from broadcast conversations, newswire, weblogs and web discussion forums. AMR captures "who is doing what to whom" in a sentence. Each sentence is paired with a graph that represents its whole-sentence meaning in a tree structure. AMR utilizes PropBank frames, non-core semantic roles, within-sentence coreference, named entity annotation, modality, negation, questions, quantities, and so on to represent the semantic structure of a sentence largely independently of its syntax. LDC also released Abstract Meaning Representation (AMR) Annotation Release 1.0 (LDC2014T12).

Data
The source data includes discussion forums collected for the DARPA BOLT and DEFT programs, transcripts and English translations of Mandarin Chinese broadcast news programming from China Central TV, Wall Street Journal text, translated Xinhua news texts, various newswire data from NIST OpenMT evaluations and weblog data used in the DARPA GALE program. The following table summarizes the number of training, dev, and test AMRs for each dataset in the release. Totals are also provided by partition and dataset:

Dataset                  Training  Dev   Test  Totals
BOLT DF MT               1061      133   133   1327
Broadcast conversation   214       0     0     214
Weblog and WSJ           0         100   100   200
BOLT DF English          6455      210   229   6894
DEFT DF English          19558     0     0     19558
Guidelines AMRs          819       0     0     819
2009 Open MT             204       0     0     204
Proxy reports            6603      826   823   8252
Weblog                   866       0     0     866
Xinhua MT                741       99    86    926
Totals                   36521     1368  1371  39260

For those interested in utilizing a standard/community partition for AMR research (for instance in development of semantic parsers), data in the "split" directory contains 39,260 AMRs split roughly 93%/3.5%/3.5% into training/dev/test partitions, with most smaller datasets assigned to one of the splits as a whole. Note that splits observe document boundaries. The "unsplit" directory contains the same 39,260 AMRs with no train/dev/test partition.
Attribution-NoDerivs 4.0 (CC BY-ND 4.0): https://creativecommons.org/licenses/by-nd/4.0/
License information was derived automatically
This data collection contains diachronic semantic relatedness judgments for German word usage pairs. Find a description of the data format, code to process the data and further datasets on the WUGsite.
We provide additional data under misc/:

testset: a semantic change test set with 22 German lexemes divided into two classes: (i) lexemes for which the authors found innovative or (ii) reductive meaning change occurring in Deutsches Textarchiv (DTA) in the 19th century. Note that for some lexemes the change is already observable slightly before 1800 and some lexemes occur more than once in the test set (see paper). The columns 'earlier' and 'later' contain the mean of all judgments for the respective word. The columns 'delta_later' and 'compare' contain the predictions of the annotation-based measures of semantic change developed in the paper.
Please find more information on the provided data in the paper referenced below.
Version: 2.0.0, 30.9.2021.
Reference
Dominik Schlechtweg, Sabine Schulte im Walde, Stefanie Eckmann. 2018. Diachronic Usage Relatedness (DURel): A Framework for the Annotation of Lexical Semantic Change. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT). New Orleans, Louisiana USA.
This data collection contains diachronic Word Usage Graphs (WUGs) for Swedish. Find a description of the data format, code to process the data and further datasets on the WUGsite: https://www.ims.uni-stuttgart.de/data/wugs
We provide additional data under misc/:
semeval: a larger list of words and (noisy) change scores assembled in the pre-annotation phase for SemEval-2020 Task 1.
Please find more information on the provided data in the paper referenced below.
Reference
Dominik Schlechtweg, Nina Tahmasebi, Simon Hengchen, Haim Dubossarsky, Barbara McGillivray. 2021. DWUG: A large Resource of Diachronic Word Usage Graphs in Four Languages. https://arxiv.org/abs/2104.08540
Abstract Meaning Representation (AMR) Annotation Release 2.0 was developed by the Linguistic Data Consortium (LDC), SDL/Language Weaver, Inc., the University of Colorado's Computational Language and Educational Research group and the Information Sciences Institute at the University of Southern California. It contains a sembank (semantic treebank) of over 39,260 English natural language sentences from broadcast conversations, newswire, weblogs and web discussion forums.
AMR captures “who is doing what to whom” in a sentence. Each sentence is paired with a graph that represents its whole-sentence meaning in a tree-structure. AMR utilizes PropBank frames, non-core semantic roles, within-sentence coreference, named entity annotation, modality, negation, questions, quantities, and so on to represent the semantic structure of a sentence largely independent of its syntax.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset consists of four years of technical language annotations from two paper machines in northern Sweden, structured as a Pandas dataframe. The same data is also available as a semicolon-separated .csv file. The data consists of two columns, where the first column corresponds to annotation note contents and the second column corresponds to annotation titles. The annotations are in Swedish, and processed so that all mentions of personal information are replaced with the string 'egennamn', meaning "personal name" in Swedish. Each row corresponds to one annotation with the corresponding title. The data can be accessed in Python with:

import pandas as pd
annotations_df = pd.read_pickle("Technical_Language_Annotations.pkl")
annotation_contents = annotations_df['noteComment']
annotation_titles = annotations_df['title']
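Since the collection is also distributed as a semicolon-separated .csv, it could presumably be loaded the same way with pandas. The sketch below reads illustrative sample rows from a string; the column names come from the pickle example above, but the sample rows themselves are invented:

```python
import io

import pandas as pd

# Illustrative sample mimicking the described two-column, semicolon-separated
# layout; the rows are invented, not taken from the real dataset.
sample = io.StringIO(
    "noteComment;title\n"
    "Byte av ventil, egennamn kontaktad;Ventilbyte\n"
    "Larm kvitterat av egennamn;Larmhantering\n"
)
annotations_df = pd.read_csv(sample, sep=";")
annotation_contents = annotations_df["noteComment"]
annotation_titles = annotations_df["title"]
```

For the real file, the `io.StringIO` object would be replaced with the .csv path.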
This notebook serves to showcase my problem solving ability, knowledge of the data analysis process, proficiency with Excel and its various tools and functions, as well as my strategic mindset and statistical prowess. This project consists of an auditing prompt provided by Hive Data, a raw Excel data set, a cleaned and audited version of the raw Excel data set, and a description of my thought process and the knowledge I used during completion of the project. The prompt can be found below:
The raw data that accompanies the prompt can be found below:
Hive Annotation Job Results - Raw Data
^ These are the tools I was given to complete my task. The rest of the work is entirely my own.
To summarize broadly, my task was to audit the dataset and summarize my process and results. Specifically, I was to create a method for identifying which "jobs" - explained in the prompt above - needed to be rerun based on a set of "background facts," or criteria. The description of my extensive thought process and results can be found below in the Content section.
Brendan Kelley April 23, 2021
Hive Data Audit Prompt Results
This paper explains the auditing process of the "Hive Annotation Job Results" data. It includes the preparation, analysis, visualization, and summary of the data. It is accompanied by the results of the audit in the Excel file "Hive Annotation Job Results – Audited".
Observation
The "Hive Annotation Job Results" data comes in the form of a single Excel sheet. It contains 7 columns and 5,001 rows, including column headers. The data includes "file", "object id", and the pseudonyms for five questions that each client was instructed to answer about their respective table: "tabular", "semantic", "definition list", "header row", and "header column". The "file" column includes non-unique numbers (that is, there are multiple instances of the same value in the column) separated by a dash. The "object id" column includes non-unique numbers ranging from 5 to 487539. The columns containing the answers to the five questions include Boolean values - TRUE or FALSE - which depend upon the yes/no worker judgement.
Use of the COUNTIF() function reveals that there are no values other than TRUE or FALSE in any of the five question columns. The VLOOKUP() function reveals that the data does not include any missing values in any of the cells.
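The same spreadsheet checks could be mirrored in pandas; a minimal sketch, assuming the column names from the observation above and a few invented rows:

```python
import pandas as pd

# Hypothetical reconstruction of the sheet's layout; the column names are
# taken from the description, the rows are invented for illustration.
df = pd.DataFrame({
    "file": ["1-1", "1-1", "2-7"],
    "object id": [5, 487539, 1200],
    "tabular": [True, True, False],
    "semantic": [True, False, False],
    "definition list": [False, False, False],
    "header row": [True, False, False],
    "header column": [False, False, False],
})

question_cols = ["tabular", "semantic", "definition list",
                 "header row", "header column"]

# Equivalent of the COUNTIF() check: every answer is strictly TRUE or FALSE.
only_booleans = df[question_cols].isin([True, False]).all().all()

# Equivalent of the missing-value check: no empty cells anywhere.
no_missing = df.notna().all().all()
```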
Assumptions
Based on the clean state of the data and the guidelines of the Hive Data Audit Prompt, the assumption is that duplicate values in the “file” column are acceptable and should not be removed. Similarly, duplicated values in the “object id” column are acceptable and should not be removed. The data is therefore clean and is ready for analysis/auditing.
Preparation
The purpose of the audit is to analyze the accuracy of the yes/no worker judgement of each question according to the guidelines of the background facts. The background facts are as follows:
• A table that is a definition list should automatically be tabular and also semantic
• Semantic tables should automatically be tabular
• If a table is NOT tabular, then it is definitely not semantic nor a definition list
• A tabular table that has a header row OR header column should definitely be semantic
These background facts serve as instructions for how the answers to the five questions should interact with one another. These facts can be re-written to establish criteria for each question:
For the tabular column:
- If the table is a definition list, it is also tabular
- If the table is semantic, it is also tabular

For the semantic column:
- If the table is a definition list, it is also semantic
- If the table is not tabular, it is not semantic
- If the table is tabular and has either a header row or a header column...
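The background facts amount to boolean implications over the five answer columns, so rows that violate any rule can be flagged mechanically. A minimal pandas sketch, assuming the column names from the observation section and invented rows:

```python
import pandas as pd

# Invented example rows; only the column names come from the description.
df = pd.DataFrame({
    "tabular":         [True,  False, True],
    "semantic":        [True,  True,  False],
    "definition list": [False, True,  False],
    "header row":      [True,  False, True],
    "header column":   [False, False, False],
})

def implies(a, b):
    # Elementwise logical implication a -> b.
    return ~a | b

# The "not tabular => not semantic, not definition list" fact is the
# contrapositive of the first two rules, so three checks suffice.
consistent = (
    implies(df["definition list"], df["tabular"] & df["semantic"])
    & implies(df["semantic"], df["tabular"])
    & implies(df["tabular"] & (df["header row"] | df["header column"]),
              df["semantic"])
)

# Rows violating any background fact would need to be rerun.
needs_rerun = df[~consistent]
```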
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Abstract: The Semantic Web aims to optimize document retrieval by enriching documents with semantic information, allowing both people and machines to understand the meaning of a piece of information. Semantic annotation of entities is the path to bringing semantics into documents. The objective of this paper is to outline the Semantic Web concepts that allow entities in the Lattes Curriculum to be annotated automatically based on Linked Open Data (LOD), which stores the meaning of terms and expressions. The problem addressed in this research is determining which Semantic Web concepts can contribute to the automatic semantic annotation of entities in the Lattes Curriculum using Linked Open Data. The literature review presents the concepts, tools and technologies related to the theme. The application of these concepts allowed the creation of the Semantic Web Lattes System. An empirical study was conducted with the objective of identifying the most effective entity extraction tool. The system imports XML curricula from the Lattes Platform, automatically annotates the available data using open databases, and allows semantic queries to be run.
Attribution-NoDerivs 4.0 (CC BY-ND 4.0): https://creativecommons.org/licenses/by-nd/4.0/
License information was derived automatically
This data collection contains diachronic Word Usage Graphs (WUGs) for English. Find a description of the data format, code to process the data and further datasets on the WUGsite.
See previous versions for additional testsets.
Please find more information on the provided data in the paper referenced below.
Version: 2.0.0, 15.12.2021. Important: extends previous versions with one more annotation round and new clusterings.
Reference
Dominik Schlechtweg, Nina Tahmasebi, Simon Hengchen, Haim Dubossarsky, Barbara McGillivray. 2021. DWUG: A large Resource of Diachronic Word Usage Graphs in Four Languages.
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
The data published here are supplementary to a paper to be published in Metaphor and the Social World (under revision).
Two debates organised and published by TVP and TVN were transcribed and annotated with the Metaphor Identification Procedure. We used eMargin software (a collaborative textual annotation tool; Kehoe and Gee 2013) and a slightly modified version of MIP (Pragglejaz 2007). Each lexical unit in the transcript was labelled as a metaphor related word (MRW) if its "contextual meaning was related to the more basic meaning by some form of similarity" (Steen 2007). The meanings were established with the Wielki Słownik Języka Polskiego (Great Dictionary of Polish, ed. Żmigrodzki 2019). In addition to MRWs, lexemes which create a metaphorical expression together with an MRW were tagged as metaphorical expression words (MEW). At least two words are needed to identify an actual metaphorical expression, since an MRW cannot appear without an MEW. The grammatical construction of a metaphor (Sullivan 2009) is asymmetrical: one word is conceptually autonomous and the other is conceptually dependent on the first. In construction grammar terms (Langacker 2008), the metaphor related word is elaborated by the metaphorical expression word, because the basic meaning of the MRW is elaborated and extended to a more figurative meaning only if it is used jointly with the MEW. Moreover, the meaning of the MEW is rather basic and concrete, as it remains unchanged in connection with the MRW. This can be clearly seen in an expression often used in our data: "Służba zdrowia jest w zapaści" ("Health service suffers from a collapse."), where the word "zapaść" ("collapse") is an example of an MRW and the words "służba zdrowia" ("health service") are labelled as MEW. The English translation of this expression needs a different verb: instead of "jest w zapaści" ("is in collapse") the unmarked English collocation is "suffers from a collapse", therefore the words "suffers from a collapse" are labelled as MRW.
The “collapse” could be caused by heart failure, such as cardiac arrest or any other life-threatening medical condition and “health service” is portrayed as if it could literally suffer from such a condition – a collapse.
The data are in csv tables exported from xml files downloaded from the eMargin site. Prior to annotation, the transcripts were divided into 40 parts, one for each annotator. MRW words are marked as MLN, MEW words as MLP, and functional words within a metaphorical expression as MLI; all other words are marked as 'noana', meaning no annotation needed.
https://choosealicense.com/licenses/odbl/
Date: 2022-07-10. Files: ner_dataset.csv. Source: Kaggle entity annotated corpus. Notes: The dataset only contains the tokens and NER tag labels. Labels are uppercase.
About Dataset
from Kaggle Datasets
Context
Annotated Corpus for Named Entity Recognition using the GMB (Groningen Meaning Bank) corpus for entity classification, with enhanced and popular Natural Language Processing features applied to the data set. Tip: Use a Pandas DataFrame to load the dataset if using Python for… See the full description on the dataset page: https://huggingface.co/datasets/rjac/kaggle-entity-annotated-corpus-ner-dataset.
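Following the tip above, the token/tag file could be loaded into a Pandas DataFrame. The sketch below uses an invented in-memory sample, since the real ner_dataset.csv on Kaggle may differ in column names and encoding:

```python
import io

import pandas as pd

# Illustrative stand-in for ner_dataset.csv; the column names "token" and
# "tag" and the rows are assumptions, not taken from the real file.
sample = io.StringIO(
    "token,tag\n"
    "John,B-PER\n"
    "lives,O\n"
    "in,O\n"
    "London,B-GEO\n"
)
ner_df = pd.read_csv(sample)

# Per the notes above, the tag labels are uppercase.
tag_counts = ner_df["tag"].value_counts()
```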
Attribution-NoDerivs 4.0 (CC BY-ND 4.0): https://creativecommons.org/licenses/by-nd/4.0/
License information was derived automatically
-------------------------------------
Siehe unten für die deutsche Version.
-------------------------------------
Diachronic Usage Relatedness (DURel) - Test Set and Annotation Data
This data collection supplementing the paper referenced below contains:
- a semantic change test set with 22 German lexemes divided into two classes: (i) lexemes for which the authors found innovative or (ii) reductive meaning change occurring in Deutsches Textarchiv (DTA) in the 19th century. (Note that for some lexemes the change is already observable slightly before 1800 and some lexemes occur more than once in the test set (see paper).) It comes as a tab-separated csv file where each line has the form
lemma POS type description earlier later delta_later compare frequency_1750-1800/1850-1900 source
The columns 'earlier' and 'later' contain the mean of all judgments for the respective word. The columns 'delta_later' and 'compare' contain the predictions of the annotation-based measures of semantic change developed in the paper;
- the full annotation table as annotators received it and a results table with rows in the same order. The result table comes in the form of a tab-separated csv file where each line has the form
lemma date1 date2 group annotator1 annotator2 annotator3 annotator4 annotator5 mean comments1 comments2 comments3 comments4 comments5
The columns 'date1' and 'date2' contain the date of the first and second use in the row. 'mean' contains the mean of all judgments for the use pair in this row without 0-judgments;
- the annotation guidelines in English and German;
- data visualization plots.
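The tab-separated results table described above can be read directly with pandas. The row below is invented for illustration; only the column layout comes from the description, and the 'mean' value follows the stated rule that 0-judgments are excluded:

```python
import io

import pandas as pd

# Column layout taken from the description of the results table.
columns = (
    ["lemma", "date1", "date2", "group"]
    + [f"annotator{i}" for i in range(1, 6)]
    + ["mean"]
    + [f"comments{i}" for i in range(1, 6)]
)

# Invented example row: judgments 4, 3, 0, 4, 2; the 0-judgment is
# excluded, so mean = (4 + 3 + 4 + 2) / 4 = 3.25.
row = "Donnerwetter\t1820\t1870\tEARLIER\t4\t3\t0\t4\t2\t3.25\t\t\t\t\t"
table = pd.read_csv(io.StringIO(row), sep="\t", names=columns)
```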
Find more information in
Dominik Schlechtweg, Sabine Schulte im Walde, Stefanie Eckmann. 2018. Diachronic Usage Relatedness (DURel): A Framework for the Annotation of Lexical Semantic Change. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT). New Orleans, Louisiana USA 2018.
The resources are freely available for education, research and other non-commercial purposes. More information can be requested via email to the authors.
-------
Deutsch
-------
Diachroner Wortverwendungsbezug (DURel) - Test Set und Annotationsdaten
Diese Datensammlung ergänzt den unten zitierten Artikel und enthält folgende Dateien:
- ein Test set für semantischen Wandel mit 22 deutschen Lexemen, die in zwei Klassen fallen: (i) Lexeme, für die die Autoren innovativen oder (ii) reduktiven Bedeutungswandel im Deutschen Textarchiv (DTA) für das 19. Jahrhundert festgestellt haben. (Für einige Lexeme ist der Wandel schon etwas vor 1800 zu beobachten und manche Lexeme kommen mehr als einmal im Test set vor (siehe Artikel).) Hierbei handelt es sich um eine tab-separierte CSV-Datei, in der jede Zeile folgende Form hat:
Lexem Wortart Klasse Beschreibung earlier later delta_later compare Frequenz_1750-1800/1850-1900 Quelle
Die Spalten 'earlier' und 'later' enthalten den Mittelwert der Bewertungen für das jeweilige Wort. Die Spalten 'delta_later' und 'compare' enthalten die Vorhersagen der annotationsbasierten Maße für semantischen Wandel, die im Artikel entwickelt werden;
- Die Annotationstabelle, wie sie die Annotatoren erhalten haben, und eine Ergebnistabelle mit Zeilen in derselben Reihenfolge. Die Ergebnistabelle ist eine tab-separierte CSV-Datei, in der jede Zeile folgende Form hat:
Lexem Datum1 Datum2 Gruppe Annotator1 Annotator2 Annotator3 Annotator4 Annotator5 Mittelwert Kommentar1 Kommentar2 Kommentar3 Kommentar4 Kommentar5
Die Spalten 'Datum1' und 'Datum2' enthalten das Datum der ersten bzw. der zweiten Wortverwendung in der Zeile. 'Mittelwert' enthält den Mittelwert aller Bewertungen für das Verwendungspaar dieser Zeile ohne 0-Bewertungen;
- die Annotationsrichtlinien auf Deutsch und Englisch;
- Visualisierungsplots der Daten.
Mehr Informationen finden Sie in
Dominik Schlechtweg, Sabine Schulte im Walde, Stefanie Eckmann. 2018. Diachronic Usage Relatedness (DURel): A Framework for the Annotation of Lexical Semantic Change. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT). New Orleans, Louisiana USA 2018.
Die Ressourcen sind frei verfügbar für Lehre, Forschung sowie andere nicht-kommerzielle Zwecke. Für weitere Informationen schreiben Sie bitte eine E-Mail an die Autoren.
http://www.opendefinition.org/licenses/cc-by-sa
Sentence-layer annotation represents the most coarse-grained annotation in this corpus. We adhere to definitions of objectivity and subjectivity introduced in (Wiebe et al., 2005). Additionally, we followed guidelines drawn from (Balahur & Steinberger, 2009). Their clarifications proved to be quite effective, raising inter-annotator agreement in a sentence-layer polarity annotation task from about 50% to >80%. All sentences were annotated in two dimensions.
The first dimension covers the factual nature of the sentence, i.e. whether it provides objective information or if it is intended to express an opinion, belief or subjective argument. Therefore, it is either objective or subjective. The second dimension covers the semantic orientation of the sentence, i.e. its polarity. Thus, it is either positive, negative or neutral.
In the second layer, we model the contextually interpreted sentiments on the levels of words and NP/PP phrases. That is, the annotation decisions are based on the meaning of the words in the context of the sentence.
Word sentiment markers: The sentiments on the level of individual words are expressed by single character markers added at the end of the words.
A word might be positive (+), negative (-), neutral (empty), a shifter (~), an intensifier (^), or a diminisher (%).
If a word ends with a hyphen (e.g., "auf beziehungs-_ bzw. partnerschaftliche Probleme-"), an underscore is added to the word in order to prevent misinterpretation of the hyphen as a negative marker.
Currently, only words that are part of an NP/PP are marked with sentiment markers. Annotated words are nouns, adjectives, negation particles, prepositions, adverbs.
The word-level annotation was done by three annotators individually. The individual results were harmonized into a single reference annotation.
Phrase level markers:
Each phrase is marked up textually by brackets, e.g. "[auf beziehungs-_ bzw. partnerschaftliche Probleme-]". The type of a phrase (NP/PP) is not written to the brackets. We follow largely the annotation model of TIGER for structuring embedded NPs and PPs.
Currently, the following limitations with regard to TIGER exist: (1) Adjectival phrases are not marked up. (2) Relative or infinitival clauses are not included in NPs/PPs if they appear at the end of a phrase or if they are discontiguous. We annotate not only the phrases which immediately contain words that are marked up as polar. Any dependent subphrase (NP/PP) is integrated into all of its dominating NPs/PPs, e.g. "[Die tieferen Ursachen [der Faszination+]]". Dependent subphrases without any polar words are also included; however, there is no internal bracketing for them, e.g. "[hohe+ Ansprüche an Qualität und Lage]".
At the level of phrases, we distinguish the following markers: positive (+), negative (-), neutral (0), bipolar (#). The category 'bipolar' is used mainly for coordinations where negative and positive sentiments of something are kept in balance by the writer. This is quite common for a lot of binomial constructions such as "Krieg und Frieden".
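The word-level marker scheme above can be read off mechanically: a trailing marker character carries the sentiment, and a trailing underscore guards a genuine word-final hyphen. A minimal sketch (the helper function and its name are illustrative, not part of the corpus tooling):

```python
# Marker inventory taken from the description above.
MARKERS = {"+": "positive", "-": "negative", "~": "shifter",
           "^": "intensifier", "%": "diminisher"}

def parse_word(token):
    # An underscore guards a genuine word-final hyphen, so "beziehungs-_"
    # is an unmarked hyphenated word, while "Probleme-" is negative.
    if token.endswith("_"):
        return token[:-1], "neutral"
    if token and token[-1] in MARKERS:
        return token[:-1], MARKERS[token[-1]]
    return token, "neutral"

# Example phrase from the corpus description.
phrase = "[auf beziehungs-_ bzw. partnerschaftliche Probleme-]"
tokens = phrase.strip("[]").split()
parsed = [parse_word(t) for t in tokens]
```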
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In the absence of detailed functional annotation for any livestock genome, we used comparative genomics to predict ovine regulatory elements using human data. Reciprocal liftOver was used to predict the ovine genome location of ENCODE promoters and enhancers, along with 12 chromatin states built using 127 diverse epigenomes. Here we make available the following files: a) Sheep_epigenome_predicted_features.tar.gz: contains the final reciprocal best alignments of the ENCODE proximal features as well as the chromHMM ROADMAP features, i.e. the result of reciprocal liftOver. b) liftOver_sheep_temporary_files.tar.gz: a new tar file with liftOver temporary files containing i) liftOver temporary files mapping human to sheep, ii) liftOver temporary files mapping sheep back to human and iii) dictionary files containing the link between human and sheep coordinates for the exact best-reciprocal files.
Lineage: Building a comparative sheep functional annotation. Our approach exploited the wealth of functional annotation data generated by the Epigenome Roadmap and ENCODE studies. We performed reciprocal liftOver (minMatch=0.1), meaning elements that mapped to sheep also needed to map in the reverse direction back to human with high quality. This bi-directional comparative mapping approach was applied to 12 chromatin states defined using 5 core histone modification marks, H3K4me3, H3K4me1, H3K36me3, H3K9me3, H3K27me3. Mapping success is given in Supplementary Table 9. The same approach was applied to ENCODE marks derived from 94 cell types (https://www.encodeproject.org/data/annotations/v2/) with DNase-seq and TF ChIP-seq.
Attribution-NoDerivs 4.0 (CC BY-ND 4.0): https://creativecommons.org/licenses/by-nd/4.0/
License information was derived automatically
This data collection contains diachronic Word Usage Graphs (WUGs) for German created with reference use sampling. Find a description of the data format, code to process the data and further datasets on the WUGsite.
Please find more information on the provided data in the paper referenced below.
Version: 1.0.0, 30.9.2021.
Reference
Dominik Schlechtweg and Sabine Schulte im Walde. submitted. Clustering Word Usage Graphs: A Flexible Framework to Measure Changes in Contextual Word Meaning.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The SloWIC dataset is a Slovenian dataset for the Word in Context task. Each example in the dataset contains a target word with multiple meanings and two sentences that both contain the target word. Each example is also annotated with a label that indicates whether both sentences use the same meaning of the target word. The dataset contains 1,808 manually annotated sentence pairs and an additional 13,150 automatically annotated pairs to help with training larger models. The dataset is stored in JSON format, following the format used in the SuperGLUE version of the Word in Context task (https://super.gluebenchmark.com/).
Each example contains the following data fields:
- word: The target word with multiple meanings
- sentence1: The first sentence containing the target word
- sentence2: The second sentence containing the target word
- idx: The index of the example in the dataset
- label: Label showing if the sentences contain the same meaning of the target word
- start1: Start of the target word in the first sentence
- start2: Start of the target word in the second sentence
- end1: End of the target word in the first sentence
- end2: End of the target word in the second sentence
- version: The version of the annotation
- manual_annotation: Boolean showing if the label was manually annotated
- group: The group of annotators that labelled the example
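Since the fields follow the SuperGLUE-style Word in Context layout, the start/end offsets recover the target word from each sentence. The record below is invented for illustration; its values are not taken from SloWIC:

```python
import json

# Invented example in the described field layout; values are illustrative.
record = json.loads("""{
  "word": "list",
  "sentence1": "Ladja je dobila list.",
  "sentence2": "Z drevesa je padel list.",
  "idx": 0, "label": false,
  "start1": 16, "end1": 20,
  "start2": 19, "end2": 23,
  "version": 1, "manual_annotation": true, "group": "A"
}""")

# The character offsets point at the target word in each sentence.
target1 = record["sentence1"][record["start1"]:record["end1"]]
target2 = record["sentence2"][record["start2"]:record["end2"]]
```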