41 datasets found
  1. TestWUG EN: Test Word Usage Graphs for English

    • zenodo.org
    zip
    Updated Jun 30, 2023
    + more versions
    Cite
    Dominik Schlechtweg (2023). TestWUG EN: Test Word Usage Graphs for English [Dataset]. http://doi.org/10.5281/zenodo.7900960
    Explore at:
    Available download formats: zip
    Dataset updated
    Jun 30, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Dominik Schlechtweg
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This data collection contains test Word Usage Graphs (WUGs) for English. Find a description of the data format, code to process the data and further datasets on the WUGsite.

    The data is provided for testing purposes and thus contains specific data cases, which are sometimes artificially created, sometimes picked from existing data sets. The data contains the following cases:

    • afternoon_nn: sampled from DWUG EN 2.0.1. 200 uses partly annotated by multiple annotators with 427 judgments. Has clear cluster structure with only one cluster, no graded change, no binary change, and medium agreement of 0.62 Krippendorff's alpha.
    • arm: standard textbook example for semantic proximity (see reference below). Fully connected graph with six word uses, annotated by the author.
    • plane_nn: sampled from DWUG EN 2.0.1. 200 uses partly annotated by multiple annotators with 1152 judgments. Has clear cluster structure, high graded change, binary change, and high agreement of 0.82 Krippendorff's alpha.
    • target: similar to arm, but with only two repeated sentences. Fully connected graph with six word uses, annotated by the author. Pairs with the same sentence (exactly the same string) are annotated with 4; pairs with different strings are annotated with 1.

    Please find more information in the paper referenced below.
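
    As an illustration of the structure the test cases above share, a WUG can be held as a weighted graph whose nodes are word uses and whose edge weights are relatedness judgments. A minimal sketch (the dict representation and the uniform weights are illustrative only, not the dataset's file format):

```python
from itertools import combinations

# Six uses of a word, as in the fully connected 'arm' and 'target' cases.
uses = ["use1", "use2", "use3", "use4", "use5", "use6"]

# One judged edge per unordered pair of uses; a weight of 4 marks closely
# related meaning and 1 marks unrelated meaning, as in the 'target' case.
edges = {pair: 4 for pair in combinations(uses, 2)}

print(len(edges))  # a complete graph on 6 nodes has 6*5/2 = 15 judged pairs
```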

    Version: 1.0.0, 05.05.2023.

    Reference

    Dominik Schlechtweg. 2023. Human and Computational Measurement of Lexical Semantic Change. PhD thesis. University of Stuttgart.

  2. DWUG ES: Diachronic Word Usage Graphs for Spanish

    • zenodo.org
    zip
    Updated Apr 16, 2025
    + more versions
    Cite
    Frank D. Zamora-Reina; Felipe Bravo-Marquez; Dominik Schlechtweg (2025). DWUG ES: Diachronic Word Usage Graphs for Spanish [Dataset]. http://doi.org/10.5281/zenodo.6300105
    Explore at:
    Available download formats: zip
    Dataset updated
    Apr 16, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Frank D. Zamora-Reina; Felipe Bravo-Marquez; Dominik Schlechtweg
    License

    Attribution-NoDerivs 4.0 (CC BY-ND 4.0): https://creativecommons.org/licenses/by-nd/4.0/
    License information was derived automatically

    Description

    This data collection contains diachronic Word Usage Graphs (WUGs) for Spanish. Find a description of the data format, code to process the data and further datasets on the WUGsite.

    Please find more information on the provided data in the paper referenced below.

    The annotation was funded by

    • ANID FONDECYT grant 11200290, U-Inicia VID Project UI-004/20,
    • ANID - Millennium Science Initiative Program - Code ICN17 002 and
    • SemRel Group (DFG Grants SCHU 2580/1 and SCHU 2580/2).

    Version: 1.0.0, 7.3.2022. Development data.

    Reference

    Frank D. Zamora-Reina, Felipe Bravo-Marquez, Dominik Schlechtweg. 2022. LSCDiscovery: A shared task on semantic change discovery and detection in Spanish.

  3. Abstract Meaning Representation (AMR) Annotation Release 3.0

    • abacus.library.ubc.ca
    iso, txt
    Updated Sep 3, 2021
    + more versions
    Cite
    Abacus Data Network (2021). Abstract Meaning Representation (AMR) Annotation Release 3.0 [Dataset]. https://abacus.library.ubc.ca/dataset.xhtml?persistentId=hdl%3A11272.1%2FAB2%2F82CVJF&version=&q=&fileAccess=Restricted&fileTag=%22Data%22&fileSortField=name&fileSortOrder=desc
    Explore at:
    Available download formats: iso (276281344), txt (1308)
    Dataset updated
    Sep 3, 2021
    Dataset provided by
    Abacus Data Network
    Description

    Abstract Meaning Representation (AMR) Annotation Release 3.0 was developed by the Linguistic Data Consortium (LDC), SDL/Language Weaver, Inc., the University of Colorado's Computational Language and Educational Research group and the Information Sciences Institute at the University of Southern California. It contains a sembank (semantic treebank) of over 59,255 English natural language sentences from broadcast conversations, newswire, weblogs, web discussion forums, fiction and web text. This release adds new data to, and updates material contained in, Abstract Meaning Representation 2.0 (LDC2017T10), specifically: more annotations on new and prior data, new or improved PropBank-style frames, enhanced quality control, and multi-sentence annotations. AMR captures "who is doing what to whom" in a sentence. Each sentence is paired with a graph that represents its whole-sentence meaning in a tree-structure. AMR utilizes PropBank frames, non-core semantic roles, within-sentence coreference, named entity annotation, modality, negation, questions, quantities, and so on to represent the semantic structure of a sentence largely independent of its syntax. LDC also released Abstract Meaning Representation (AMR) Annotation Release 1.0 (LDC2014T12) and Abstract Meaning Representation (AMR) Annotation Release 2.0 (LDC2017T10).

    Data

    The source data includes discussion forums collected for the DARPA BOLT and DEFT programs, transcripts and English translations of Mandarin Chinese broadcast news programming from China Central TV, Wall Street Journal text, translated Xinhua news texts, various newswire data from NIST OpenMT evaluations and weblog data used in the DARPA GALE program. New source data in AMR 3.0 includes sentences from Aesop's Fables, parallel text and the situation frame data set developed by LDC for the DARPA LORELEI program, and lead sentences from Wikipedia articles about named entities.
    The following table summarizes the number of training, dev, and test AMRs for each dataset in the release. Totals are also provided by partition and dataset:

    Dataset                  Training   Dev   Test   Totals
    BOLT DF MT                   1061   133    133     1327
    Broadcast conversation        214     0      0      214
    Weblog and WSJ                  0   100    100      200
    BOLT DF English              7379   210    229     7818
    DEFT DF English             32915     0      0    32915
    Aesop fables                   49     0      0       49
    Guidelines AMRs               970     0      0      970
    LORELEI                      4441   354    527     5322
    2009 Open MT                  204     0      0      204
    Proxy reports                6603   826    823     8252
    Weblog                        866     0      0      866
    Wikipedia                     192     0      0      192
    Xinhua MT                     741    99     86      926
    Totals                      55635  1722   1898    59255

    Data in the "split" directory contains 59,255 AMRs split roughly 93.9%/2.9%/3.2% into training/dev/test partitions, with most smaller datasets assigned to one of the splits as a whole. Note that splits observe document boundaries. The "unsplit" directory contains the same 59,255 AMRs with no train/dev/test partition.
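
    The quoted split proportions follow directly from the partition totals; a quick check in Python:

```python
# Partition totals for AMR 3.0 as stated in the release description.
total = 59255
train, dev, test = 55635, 1722, 1898

assert train + dev + test == total
shares = [round(100 * n / total, 1) for n in (train, dev, test)]
print(shares)  # [93.9, 2.9, 3.2], matching the stated training/dev/test split
```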

  4. ckanext-datano

    • catalog.civicdataecosystem.org
    Updated Jun 4, 2025
    Cite
    (2025). ckanext-datano [Dataset]. https://catalog.civicdataecosystem.org/dataset/ckanext-datano
    Explore at:
    Dataset updated
    Jun 4, 2025
    Description

    Unfortunately, no README file was found for the datano extension, limiting the ability to provide a detailed and comprehensive description. The following description is therefore based on the extension name and general assumptions about data annotation tools within the CKAN ecosystem. The datano extension for CKAN, presumably short for "data annotation," likely aims to enhance datasets with annotations, metadata enrichment, and quality control features directly within the CKAN environment. It potentially introduces functionality for adding textual descriptions, classifications, or other forms of annotation to datasets to improve their discoverability, usability, and overall value. This extension could provide an interface for users to collaboratively annotate data, thereby enriching dataset descriptions and making the data more useful for various purposes.

    Key Features (Assumed):

    • Dataset Annotation Interface: Provides a user-friendly interface within CKAN for adding structured or unstructured annotations to datasets and associated resources. This allows for a richer understanding of the data's content, purpose, and usage.
    • Collaborative Annotation: Supports multiple users collaboratively annotating datasets, fostering knowledge sharing and collective understanding of the data.
    • Annotation Versioning: Maintains a history of annotations, enabling users to track changes and revert to previous versions if necessary.
    • Annotation Search: Allows users to search for datasets based on annotations, enabling quick discovery of relevant data based on specific criteria.
    • Metadata Enrichment: Integrates annotations with existing metadata, enhancing metadata schemas to support more detailed descriptions and contextual information.
    • Quality Control Features: Includes options to rate, validate, or flag annotations to ensure they are accurate and relevant, improving overall data quality.

    Use Cases (Assumed):

    1. Data Discovery Improvement: Enables users to find specific datasets more easily by searching for datasets based on their annotations and enriched metadata.
    2. Data Quality Enhancement: Allows data curators to improve the quality of datasets by adding annotations that clarify the data's meaning, provenance, and limitations.
    3. Collaborative Data Projects: Facilitates collaborative data annotation efforts, wherein multiple users contribute to the enrichment of datasets with their knowledge and insights.

    Technical Integration (Assumed): The datano extension would likely integrate with CKAN's existing plugin framework, adding new UI elements for annotation management and search. It could leverage CKAN's API for programmatic access to annotations and utilize CKAN's security model for managing access permissions.

    Benefits & Impact (Assumed): By implementing the datano extension, CKAN users can improve data discoverability, quality, and collaborative potential. The extension can help data curators refine the understanding and management of data, making it easier to search and understand data and to promote data-driven decision-making.

  5. Abstract Meaning Representation (AMR) Annotation Release 2.0

    • abacus.library.ubc.ca
    iso, txt
    Updated Jun 15, 2017
    Cite
    Abacus Data Network (2017). Abstract Meaning Representation (AMR) Annotation Release 2.0 [Dataset]. https://abacus.library.ubc.ca/dataset.xhtml?persistentId=hdl:11272.1/AB2/8MN4GE
    Explore at:
    Available download formats: iso (157806592), txt (1308)
    Dataset updated
    Jun 15, 2017
    Dataset provided by
    Abacus Data Network
    Time period covered
    1997 - 2017
    Area covered
    Israel, Taiwan, Province of China, China, United States, France
    Dataset funded by
    National Science Foundation
    Defense Advanced Research Projects Agency
    Description

    Abstract Meaning Representation (AMR) Annotation Release 2.0 was developed by the Linguistic Data Consortium (LDC), SDL/Language Weaver, Inc., the University of Colorado's Computational Language and Educational Research group and the Information Sciences Institute at the University of Southern California. It contains a sembank (semantic treebank) of over 39,260 English natural language sentences from broadcast conversations, newswire, weblogs and web discussion forums. AMR captures "who is doing what to whom" in a sentence. Each sentence is paired with a graph that represents its whole-sentence meaning in a tree-structure. AMR utilizes PropBank frames, non-core semantic roles, within-sentence coreference, named entity annotation, modality, negation, questions, quantities, and so on to represent the semantic structure of a sentence largely independent of its syntax. LDC also released Abstract Meaning Representation (AMR) Annotation Release 1.0 (LDC2014T12).

    Data

    The source data includes discussion forums collected for the DARPA BOLT and DEFT programs, transcripts and English translations of Mandarin Chinese broadcast news programming from China Central TV, Wall Street Journal text, translated Xinhua news texts, various newswire data from NIST OpenMT evaluations and weblog data used in the DARPA GALE program. The following table summarizes the number of training, dev, and test AMRs for each dataset in the release.
    Totals are also provided by partition and dataset (the Xinhua MT total, 926, follows from its partition counts):

    Dataset                  Training   Dev   Test   Totals
    BOLT DF MT                   1061   133    133     1327
    Broadcast conversation        214     0      0      214
    Weblog and WSJ                  0   100    100      200
    BOLT DF English              6455   210    229     6894
    DEFT DF English             19558     0      0    19558
    Guidelines AMRs               819     0      0      819
    2009 Open MT                  204     0      0      204
    Proxy reports                6603   826    823     8252
    Weblog                        866     0      0      866
    Xinhua MT                     741    99     86      926
    Totals                      36521  1368   1371    39260

    For those interested in utilizing a standard/community partition for AMR research (for instance in development of semantic parsers), data in the "split" directory contains 39,260 AMRs split roughly 93%/3.5%/3.5% into training/dev/test partitions, with most smaller datasets assigned to one of the splits as a whole. Note that splits observe document boundaries. The "unsplit" directory contains the same 39,260 AMRs with no train/dev/test partition.

  6. DURel: Diachronic Usage Relatedness

    • zenodo.org
    zip
    Updated Apr 23, 2025
    + more versions
    Cite
    Dominik Schlechtweg; Sabine Schulte im Walde; Stefanie Eckmann (2025). DURel: Diachronic Usage Relatedness [Dataset]. http://doi.org/10.5281/zenodo.5541340
    Explore at:
    Available download formats: zip
    Dataset updated
    Apr 23, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Dominik Schlechtweg; Sabine Schulte im Walde; Stefanie Eckmann
    License

    Attribution-NoDerivs 4.0 (CC BY-ND 4.0): https://creativecommons.org/licenses/by-nd/4.0/
    License information was derived automatically

    Description

    This data collection contains diachronic semantic relatedness judgments for German word usage pairs. Find a description of the data format, code to process the data and further datasets on the WUGsite.

    We provide additional data under misc/:

    • testset: a semantic change test set with 22 German lexemes divided into two classes: lexemes for which the authors found

      1. innovative or
      2. reductive meaning change

      occurring in Deutsches Textarchiv (DTA) in the 19th century. Note that for some lexemes the change is already observable slightly before 1800 and some lexemes occur more than once in the test set (see paper). The columns 'earlier' and 'later' contain the mean of all judgments for the respective word. The columns 'delta_later' and 'compare' contain the predictions of the annotation-based measures of semantic change developed in the paper.

    • tables: the full annotation table as annotators received it and a results table with rows in the same order. The columns 'date1' and 'date2' contain the date of the first and second use in the row. 'mean' contains the mean of all judgments for the use pair in this row without 0-judgments.
    • plots: data visualization plots.
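
    The 'mean' column described above averages all judgments for a use pair while excluding 0-judgments; a minimal sketch of that computation (the helper name is ours, not from the dataset's code):

```python
def mean_without_zeros(judgments):
    """Mean of all judgments for a use pair, ignoring 0-judgments,
    as described for the 'mean' column of the results table."""
    kept = [j for j in judgments if j != 0]
    return sum(kept) / len(kept) if kept else None

print(mean_without_zeros([0, 3, 4]))  # 3.5
```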

    Please find more information on the provided data in the paper referenced below.

    Version: 2.0.0, 30.9.2021.

    Reference

    Dominik Schlechtweg, Sabine Schulte im Walde, Stefanie Eckmann. 2018. Diachronic Usage Relatedness (DURel): A Framework for the Annotation of Lexical Semantic Change. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT). New Orleans, Louisiana USA.

  7. DWUG SV: Diachronic Word Usage Graphs for Swedish

    • explore.openaire.eu
    • zenodo.org
    Updated Jul 11, 2021
    Cite
    Nina Tahmasebi; Simon Hengchen; Dominik Schlechtweg; Barbara McGillivray; Haim Dubossarsky (2021). DWUG SV: Diachronic Word Usage Graphs for Swedish [Dataset]. http://doi.org/10.5281/zenodo.14028906
    Explore at:
    Dataset updated
    Jul 11, 2021
    Authors
    Nina Tahmasebi; Simon Hengchen; Dominik Schlechtweg; Barbara McGillivray; Haim Dubossarsky
    Description

    This data collection contains diachronic Word Usage Graphs (WUGs) for Swedish. Find a description of the data format, code to process the data and further datasets on the WUGsite: https://www.ims.uni-stuttgart.de/data/wugs

    We provide additional data under misc/:

    • semeval: a larger list of words and (noisy) change scores assembled in the pre-annotation phase for SemEval-2020 Task 1.

    Please find more information on the provided data in the paper referenced below.

    Reference

    Dominik Schlechtweg, Nina Tahmasebi, Simon Hengchen, Haim Dubossarsky, Barbara McGillivray. 2021. DWUG: A large Resource of Diachronic Word Usage Graphs in Four Languages. https://arxiv.org/abs/2104.08540

  8. LDC2017T10 Dataset

    • paperswithcode.com
    Updated Oct 28, 2021
    Cite
    (2021). LDC2017T10 Dataset [Dataset]. https://paperswithcode.com/dataset/ldc2017t10
    Explore at:
    Dataset updated
    Oct 28, 2021
    Description

    Abstract Meaning Representation (AMR) Annotation Release 2.0 was developed by the Linguistic Data Consortium (LDC), SDL/Language Weaver, Inc., the University of Colorado's Computational Language and Educational Research group and the Information Sciences Institute at the University of Southern California. It contains a sembank (semantic treebank) of over 39,260 English natural language sentences from broadcast conversations, newswire, weblogs and web discussion forums.

    AMR captures “who is doing what to whom” in a sentence. Each sentence is paired with a graph that represents its whole-sentence meaning in a tree-structure. AMR utilizes PropBank frames, non-core semantic roles, within-sentence coreference, named entity annotation, modality, negation, questions, quantities, and so on to represent the semantic structure of a sentence largely independent of its syntax.

  9. Dataset with four years of condition monitoring technical language...

    • gimi9.com
    Updated Jan 8, 2024
    + more versions
    Cite
    (2024). Dataset with four years of condition monitoring technical language annotations from paper machine industries in northern Sweden | gimi9.com [Dataset]. https://gimi9.com/dataset/eu_https-doi-org-10-5878-hafd-ms27/
    Explore at:
    Dataset updated
    Jan 8, 2024
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Sweden
    Description

    This dataset consists of four years of technical language annotations from two paper machines in northern Sweden, structured as a Pandas dataframe. The same data is also available as a semicolon-separated .csv file. The data consists of two columns, where the first column corresponds to annotation note contents, and the second column corresponds to annotation titles. The annotations are in Swedish, and processed so that all mentions of personal information are replaced with the string 'egennamn', meaning "personal name" in Swedish. Each row corresponds to one annotation with the corresponding title. The data can be accessed in Python with:

        import pandas as pd
        annotations_df = pd.read_pickle("Technical_Language_Annotations.pkl")
        annotation_contents = annotations_df['noteComment']
        annotation_titles = annotations_df['title']

  10. Hive Annotation Job Results - Cleaned and Audited

    • kaggle.com
    Updated Apr 28, 2021
    Cite
    Brendan Kelley (2021). Hive Annotation Job Results - Cleaned and Audited [Dataset]. https://www.kaggle.com/brendankelley/hive-annotation-job-results-cleaned-and-audited/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 28, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Brendan Kelley
    Description

    Context

    This notebook serves to showcase my problem solving ability, knowledge of the data analysis process, proficiency with Excel and its various tools and functions, as well as my strategic mindset and statistical prowess. This project consists of an auditing prompt provided by Hive Data, a raw Excel data set, a cleaned and audited version of the raw Excel data set, and a description of my thought process and the knowledge used during completion of the project. The prompt can be found below:

    Hive Data Audit Prompt

    The raw data that accompanies the prompt can be found below:

    Hive Annotation Job Results - Raw Data

    ^ These are the tools I was given to complete my task. The rest of the work is entirely my own.

    To summarize broadly, my task was to audit the dataset and summarize my process and results. Specifically, I was to create a method for identifying which "jobs" - explained in the prompt above - needed to be rerun based on a set of "background facts," or criteria. The description of my extensive thought process and results can be found below in the Content section.

    Content

    Brendan Kelley April 23, 2021

    Hive Data Audit Prompt Results

    This paper explains the auditing process of the "Hive Annotation Job Results" data. It includes the preparation, analysis, visualization, and summary of the data. It is accompanied by the results of the audit in the Excel file "Hive Annotation Job Results – Audited".

    Observation

    The “Hive Annotation Job Results” data comes in the form of a single excel sheet. It contains 7 columns and 5,001 rows, including column headers. The data includes “file”, “object id”, and the pseudonym for five questions that each client was instructed to answer about their respective table: “tabular”, “semantic”, “definition list”, “header row”, and “header column”. The “file” column includes non-unique (that is, there are multiple instances of the same value in the column) numbers separated by a dash. The “object id” column includes non-unique numbers ranging from 5 to 487539. The columns containing the answers to the five questions include Boolean values - TRUE or FALSE – which depend upon the yes/no worker judgement.

    Use of the COUNTIF() function reveals that there are no values other than TRUE or FALSE in any of the five question columns. The VLOOKUP() function reveals that the data does not include any missing values in any of the cells.

    Assumptions

    Based on the clean state of the data and the guidelines of the Hive Data Audit Prompt, the assumption is that duplicate values in the “file” column are acceptable and should not be removed. Similarly, duplicated values in the “object id” column are acceptable and should not be removed. The data is therefore clean and is ready for analysis/auditing.

    Preparation

    The purpose of the audit is to analyze the accuracy of the yes/no worker judgement of each question according to the guidelines of the background facts. The background facts are as follows:

    • A table that is a definition list should automatically be tabular and also semantic
    • Semantic tables should automatically be tabular
    • If a table is NOT tabular, then it is definitely not semantic nor a definition list
    • A tabular table that has a header row OR header column should definitely be semantic

    These background facts serve as instructions for how the answers to the five questions should interact with one another. These facts can be re-written to establish criteria for each question:

    For the tabular column:
    - If the table is a definition list, it is also tabular
    - If the table is semantic, it is also tabular

    For the semantic column:
    - If the table is a definition list, it is also semantic
    - If the table is not tabular, it is not semantic
    - If the table is tabular and has either a header row or a header column...

  11. Data from: Semantic Annotation Automatic of Curriculum Lattes Using Linked...

    • scielo.figshare.com
    jpeg
    Updated May 31, 2023
    Cite
    Walison Dias da Silva; Fernando Silva Parreiras; Luiz Cláudio Gomes Maia; Wladmir Cardoso Brandão (2023). Semantic Annotation Automatic of Curriculum Lattes Using Linked Open Data [Dataset]. http://doi.org/10.6084/m9.figshare.20006418.v1
    Explore at:
    Available download formats: jpeg
    Dataset updated
    May 31, 2023
    Dataset provided by
    SciELO journals
    Authors
    Walison Dias da Silva; Fernando Silva Parreiras; Luiz Cláudio Gomes Maia; Wladmir Cardoso Brandão
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Abstract: The Semantic Web aims to optimize document retrieval by enriching documents with synonyms, allowing people and machines to understand the meaning of a piece of information. Semantic annotation of entities is the path to bringing semantics into documents. This paper aims to build an outline of the Semantic Web concepts that allow entities in the Lattes Curriculum to be annotated automatically based on Linked Open Data (LOD), which stores the meaning of terms and expressions. The problem addressed in this research is which Semantic Web concepts can contribute to the automatic semantic annotation of entities in the Lattes Curriculum using Linked Open Data. During the literature review, the concepts, tools and technologies related to the theme were presented. The application of these concepts allowed the creation of the Semantic Web Lattes System. An empirical study was conducted with the objective of identifying the most effective entity extraction tool. The system imports XML curricula from the Lattes Platform, automatically annotates the available data using open databases, and supports semantic queries.

  12. DWUG EN: Diachronic Word Usage Graphs for English

    • zenodo.org
    zip
    Updated Apr 17, 2025
    + more versions
    Cite
    Dominik Schlechtweg; Haim Dubossarsky; Simon Hengchen; Barbara McGillivray; Nina Tahmasebi (2025). DWUG EN: Diachronic Word Usage Graphs for English [Dataset]. http://doi.org/10.5281/zenodo.5796878
    Explore at:
    Available download formats: zip
    Dataset updated
    Apr 17, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Dominik Schlechtweg; Haim Dubossarsky; Simon Hengchen; Barbara McGillivray; Nina Tahmasebi
    License

    Attribution-NoDerivs 4.0 (CC BY-ND 4.0): https://creativecommons.org/licenses/by-nd/4.0/
    License information was derived automatically

    Description

    This data collection contains diachronic Word Usage Graphs (WUGs) for English. Find a description of the data format, code to process the data and further datasets on the WUGsite.

    See previous versions for additional testsets.

    Please find more information on the provided data in the paper referenced below.

    Version: 2.0.0, 15.12.2021. Important: extends previous versions with one more annotation round and new clusterings.

    Reference

    Dominik Schlechtweg, Nina Tahmasebi, Simon Hengchen, Haim Dubossarsky, Barbara McGillivray. 2021. DWUG: A large Resource of Diachronic Word Usage Graphs in Four Languages.

  13. Data from: Metaphor annotations in Polish political debates from 2020 (TVP...

    • live.european-language-grid.eu
    binary format
    Updated Jun 30, 2021
    Cite
    (2021). Metaphor annotations in Polish political debates from 2020 (TVP 2019-10-01 and TVN 2019-10-08) – presidential election [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/8682
    Explore at:
    Available download formats: binary format
    Dataset updated
    Jun 30, 2021
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    The data published here are supplementary material for a paper to be published in Metaphor and the Social World (under revision).

    Two debates organised and published by TVP and TVN were transcribed and annotated with the Metaphor Identification Method. We used the eMargin software (a collaborative textual annotation tool; Kehoe and Gee 2013) and a slightly modified version of MIP (Pragglejaz 2007). Each lexical unit in the transcript was labelled as a metaphor related word (MRW) if its "contextual meaning was related to the more basic meaning by some form of similarity" (Steen 2007). The meanings were established with the Wielki Słownik Języka Polskiego (Great Dictionary of Polish, ed. Żmigrodzki 2019). In addition to MRWs, lexemes which create a metaphorical expression together with an MRW were tagged as metaphor expression words (MEW). At least two words are needed to identify the actual metaphorical expression, since an MRW cannot appear without an MEW. The grammatical construction of the metaphor (Sullivan 2009) is asymmetrical: one word is conceptually autonomous and the other is conceptually dependent on the first. In construction grammar terms (Langacker 2008), the metaphor related word is elaborated by the metaphorical expression word, because the basic meaning of the MRW is elaborated and extended to a more figurative meaning only if it is used jointly with the MEW. Moreover, the meaning of the MEW is rather basic and concrete, as it remains unchanged in connection with the MRW. This can be clearly seen in an expression often used in our data: "Służba zdrowia jest w zapaści" ("Health service suffers from a collapse."), where the word "zapaść" ("collapse") is an example of an MRW and the words "służba zdrowia" ("health service") are labelled as MEW. The English translation of this expression needs a different verb: instead of "jest w zapaści" ("is in collapse"), the unmarked English collocation is "suffers from a collapse", therefore the words "suffers from a collapse" are labelled as MRW.
The “collapse” could be caused by heart failure, such as cardiac arrest or any other life-threatening medical condition and “health service” is portrayed as if it could literally suffer from such a condition – a collapse.

    The data are in CSV tables exported from XML files downloaded from the eMargin site. Prior to annotation, the transcripts were divided into 40 parts, one per annotator. MRW words are marked as MLN, MEWs are marked as MLP, functional words within a metaphorical expression are marked as MLI, and all other words are marked as noana, which means no annotation needed.
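    As a quick sanity check, the label distribution in one exported table can be tallied. A minimal sketch, assuming a column named "tag" holds the MLN/MLP/MLI/noana labels; the actual column names in the eMargin export may differ.

```python
import csv
from collections import Counter

def count_tags(path):
    """Tally annotation labels in one exported eMargin CSV.

    Assumes a 'tag' column holding MLN, MLP, MLI or noana
    (hypothetical column name -- adjust to the real export).
    """
    counts = Counter()
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            counts[row["tag"]] += 1
    return counts
```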

  14. kaggle-entity-annotated-corpus-ner-dataset

    • huggingface.co
    Updated Jul 10, 2022
    Cite
    Rafael Arias Calles (2022). kaggle-entity-annotated-corpus-ner-dataset [Dataset]. https://huggingface.co/datasets/rjac/kaggle-entity-annotated-corpus-ner-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Jul 10, 2022
    Authors
    Rafael Arias Calles
    License

    https://choosealicense.com/licenses/odbl/

    Description

    Date: 2022-07-10
    Files: ner_dataset.csv
    Source: Kaggle entity annotated corpus
    Notes: The dataset only contains the tokens and NER tag labels. Labels are uppercase.

      About Dataset
    

    from Kaggle Datasets

      Context
    

    Annotated Corpus for Named Entity Recognition using the GMB (Groningen Meaning Bank) corpus for entity classification, with enhanced and popular features produced by natural language processing applied to the data set. Tip: use a Pandas DataFrame to load the dataset if using Python for… See the full description on the dataset page: https://huggingface.co/datasets/rjac/kaggle-entity-annotated-corpus-ner-dataset.

  15. DURel: Diachronic Usage Relatedness

    • zenodo.org
    zip
    Updated Apr 23, 2025
    Cite
    Dominik Schlechtweg; Dominik Schlechtweg; Sabine Schulte im Walde; Sabine Schulte im Walde; Stefanie Eckmann; Stefanie Eckmann (2025). DURel: Diachronic Usage Relatedness [Dataset]. http://doi.org/10.5281/zenodo.5541275
    Explore at:
    zip (available download formats)
    Dataset updated
    Apr 23, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Dominik Schlechtweg; Dominik Schlechtweg; Sabine Schulte im Walde; Sabine Schulte im Walde; Stefanie Eckmann; Stefanie Eckmann
    License

    Attribution-NoDerivs 4.0 (CC BY-ND 4.0): https://creativecommons.org/licenses/by-nd/4.0/
    License information was derived automatically

    Description


    Diachronic Usage Relatedness (DURel) - Test Set and Annotation Data


    This data collection supplementing the paper referenced below contains:

    - a semantic change test set with 22 German lexemes divided into two classes: lexemes for which the authors found either (i) innovative or (ii) reductive meaning change occurring in the Deutsches Textarchiv (DTA) in the 19th century. (Note that for some lexemes the change is already observable slightly before 1800, and some lexemes occur more than once in the test set (see paper).) It comes as a tab-separated csv file where each line has the form

    lemma POS type description earlier later delta_later compare frequency_1750-1800/1850-1900 source

    The columns 'earlier' and 'later' contain the mean of all judgments for the respective word. The columns 'delta_later' and 'compare' contain the predictions of the annotation-based measures of semantic change developed in the paper;

    - the full annotation table as annotators received it and a results table with rows in the same order. The result table comes in the form of a tab-separated csv file where each line has the form

    lemma date1 date2 group annotator1 annotator2 annotator3 annotator4 annotator5 mean comments1 comments2 comments3 comments4 comments5

    The columns 'date1' and 'date2' contain the date of the first and second use in the row. 'mean' contains the mean of all judgments for the use pair in this row without 0-judgments;

    - the annotation guidelines in English and German;
    - data visualization plots.
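    The tab-separated test set described above can be loaded with a short sketch. The column names below mirror the layout listed in the description; whether the shipped file carries a header row is an assumption, so adjust if needed.

```python
import csv

# Column names mirror the layout listed in the description;
# whether the file has a header row is an assumption.
COLS = ["lemma", "POS", "type", "description", "earlier", "later",
        "delta_later", "compare", "frequency_1750-1800/1850-1900", "source"]

def load_testset(path):
    """Read the tab-separated DURel test set into a list of row dicts."""
    with open(path, newline="", encoding="utf-8") as f:
        return [dict(zip(COLS, row)) for row in csv.reader(f, delimiter="\t")]
```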

    Find more information in

    Dominik Schlechtweg, Sabine Schulte im Walde, Stefanie Eckmann. 2018. Diachronic Usage Relatedness (DURel): A Framework for the Annotation of Lexical Semantic Change. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT). New Orleans, Louisiana USA 2018.

    The resources are freely available for education, research and other non-commercial purposes. More information can be requested via email to the authors.


  16. Data from: MLSA - A Multi-layered Reference Corpus for German Sentiment Analysis

    • data.wu.ac.at
    api/sparql +3
    Updated Mar 18, 2015
    Cite
    AKSW (2015). MLSA - A Multi-layered Reference Corpus for German Sentiment Analysis [Dataset]. https://data.wu.ac.at/schema/datahub_io/MDQzYWQ0ZmQtYzgyNy00MjdkLWJmODktMjc3YWI3YTUzY2Vh
    Explore at:
    api/sparql (9514.0), example/turtle (1779.0), png (406789.0), n3 (212728.0) (available download formats)
    Dataset updated
    Mar 18, 2015
    Dataset provided by
    AKSW
    License

    http://www.opendefinition.org/licenses/cc-by-sa

    Description

    Sentence-layer annotation represents the most coarse-grained annotation in this corpus. We adhere to definitions of objectivity and subjectivity introduced in (Wiebe et al., 2005). Additionally, we followed guidelines drawn from (Balahur & Steinberger, 2009). Their clarifications proved to be quite effective, raising inter-annotator agreement in a sentence-layer polarity annotation task from about 50% to >80%. All sentences were annotated in two dimensions.

    The first dimension covers the factual nature of the sentence, i.e. whether it provides objective information or if it is intended to express an opinion, belief or subjective argument. Therefore, it is either objective or subjective. The second dimension covers the semantic orientation of the sentence, i.e. its polarity. Thus, it is either positive, negative or neutral.

    In the second layer, we model the contextually interpreted sentiments on the levels of words and NP/PP phrases. That is, the annotation decisions are based on the meaning of the words in the context of the sentence.

    Word sentiment markers: The sentiments on the level of individual words are expressed by single character markers added at the end of the words.

    A word might be positive (+), negative (-), neutral (empty), a shifter (~), an intensifier (^), or a diminisher (%).

    If a word ends with a hyphen (e.g., "auf beziehungs-_ bzw. partnerschaftliche Probleme-"), an underscore is added to the word in order to prevent misinterpretation of the hyphen as a negative marker.

    Currently, only words that are part of an NP/PP are marked with sentiment markers. Annotated words are nouns, adjectives, negation particles, prepositions, adverbs.

    The word-level annotation was done by three annotators individually. The individual results were harmonized into a single reference annotation.
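    The word-level marker scheme can be decoded mechanically. A minimal sketch, assuming a marker is always the final character of a token and that the underscore escape carries no marker of its own (the guidelines may allow combinations not covered here):

```python
# Sentiment markers appended to words, as listed above.
MARKERS = {"+": "positive", "-": "negative", "~": "shifter",
           "^": "intensifier", "%": "diminisher"}

def parse_token(token):
    """Split one annotated token into (word, sentiment label).

    A trailing underscore escapes a genuine word-final hyphen,
    so 'beziehungs-_' is the neutral word 'beziehungs-'.
    """
    if token.endswith("_"):
        return token[:-1], "neutral"  # escaped hyphen, not a marker
    if token and token[-1] in MARKERS:
        return token[:-1], MARKERS[token[-1]]
    return token, "neutral"
```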

    Phrase level markers:

    Each phrase is marked up textually by brackets, e.g. "[auf beziehungs-_ bzw. partnerschaftliche Probleme-]". The type of a phrase (NP/PP) is not recorded on the brackets. We largely follow the TIGER annotation model for structuring embedded NPs and PPs.

    Currently, the following limitations with regard to TIGER exist: (1) adjectival phrases are not marked up; (2) relative or infinitival clauses are not included in NPs/PPs if they appear at the end of a phrase or if they are discontiguous. We do not annotate only those phrases which immediately contain polar-marked words: any dependent subphrase (NP/PP) is integrated into all its dominating NPs/PPs, e.g. "[Die tieferen Ursachen [der Faszination+]]". Dependent subphrases without any polar words are also included; however, there is no internal bracketing for them, e.g. "[hohe+ Ansprüche an Qualität und Lage]".

    At the level of phrases, we distinguish the following markers: positive (+), negative (-), neutral (0), bipolar (#). The category 'bipolar' is used mainly for coordinations where the writer keeps negative and positive sentiments of something in balance. This is quite common for many binomial constructions such as "Krieg und Frieden".
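    The nested bracket markup can be recovered with a simple stack. A sketch only, assuming '[' and ']' occur solely as phrase brackets in the annotated text; it returns every bracketed phrase, innermost first.

```python
def phrase_spans(text):
    """Return all bracketed phrases via a stack, innermost first.

    Assumes '[' and ']' occur only as phrase brackets.
    """
    stack, spans = [], []
    for i, ch in enumerate(text):
        if ch == "[":
            stack.append(i)
        elif ch == "]":
            start = stack.pop()
            spans.append(text[start + 1 : i])
    return spans
```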

  17. Preliminary functional annotation of the sheep genome

    • data.csiro.au
    • researchdata.edu.au
    Updated Nov 9, 2017
    Cite
    Marina Naval Sanchez; Quan Nguyen; Sean McWilliam; Laercio Porto Neto; Toni Reverter-Gomez; James Kijas (2017). Preliminary functional annotation of the sheep genome [Dataset]. http://doi.org/10.4225/08/5a03a9c39a0ba
    Explore at:
    Dataset updated
    Nov 9, 2017
    Dataset provided by
    CSIRO (http://www.csiro.au/)
    Authors
    Marina Naval Sanchez; Quan Nguyen; Sean McWilliam; Laercio Porto Neto; Toni Reverter-Gomez; James Kijas
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jan 1, 2016 - Jan 1, 2017
    Dataset funded by
    CSIRO (http://www.csiro.au/)
    Description

    In the absence of detailed functional annotation for any livestock genome, we used comparative genomics to predict ovine regulatory elements from human data. Reciprocal liftOver was used to predict the ovine genome locations of ENCODE promoters and enhancers, along with 12 chromatin states built using 127 diverse epigenomes. Here we make available the following files: a) Sheep_epigenome_predicted_features.tar.gz: contains the final reciprocal best alignments from ENCODE proximal features as well as chromHMM ROADMAP features, the result of reciprocal liftOver. b) liftOver_sheep_temporary_files.tar.gz: a tar file with liftOver temporary files containing i) liftOver temporary files mapping human to sheep, ii) liftOver temporary files mapping sheep back to human, and iii) dictionary files containing the link between human and sheep coordinates for the exact best-reciprocal files.

    Lineage: Building a comparative sheep functional annotation. Our approach exploited the wealth of functional annotation data generated by the Epigenome Roadmap and ENCODE studies. We performed reciprocal liftOver (minMatch=0.1), meaning elements that mapped to sheep also needed to map in the reverse direction back to human with high quality. This bi-directional comparative mapping approach was applied to 12 chromatin states defined using 5 core histone modification marks, H3K4me3, H3K4me1, H3K36me3, H3K9me3, H3K27me3. Mapping success is given in Supplementary Table 9. The same approach was applied to ENCODE marks derived from 94 cell types (https://www.encodeproject.org/data/annotations/v2/) with DNase-seq and TF ChIP-seq.
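    The reciprocal-best filter itself reduces to a simple check: keep only elements whose forward mapping maps back to the same element. A sketch of that filter logic under the assumption that the forward and reverse liftOver results have already been reduced to one-to-one identifier maps (the real coordinate conversion is done by the UCSC liftOver tool, not shown here):

```python
def reciprocal_best(fwd, rev):
    """Keep elements whose human->sheep mapping maps back to the
    same human element.

    fwd: {human_id: sheep_id} from the forward liftOver
    rev: {sheep_id: human_id} from the reverse liftOver
    """
    return {h: s for h, s in fwd.items() if rev.get(s) == h}
```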

  18. RefWUG: Diachronic Reference Word Usage Graphs for German

    • zenodo.org
    zip
    Updated Apr 23, 2025
    + more versions
    Cite
    Dominik Schlechtweg; Dominik Schlechtweg; Sabine Schulte im Walde; Sabine Schulte im Walde (2025). RefWUG: Diachronic Reference Word Usage Graphs for German [Dataset]. http://doi.org/10.5281/zenodo.5544578
    Explore at:
    zip (available download formats)
    Dataset updated
    Apr 23, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Dominik Schlechtweg; Dominik Schlechtweg; Sabine Schulte im Walde; Sabine Schulte im Walde
    License

    Attribution-NoDerivs 4.0 (CC BY-ND 4.0): https://creativecommons.org/licenses/by-nd/4.0/
    License information was derived automatically

    Description

    This data collection contains diachronic Word Usage Graphs (WUGs) for German created with reference use sampling. Find a description of the data format, code to process the data and further datasets on the WUGsite.

    Please find more information on the provided data in the paper referenced below.

    Version: 1.0.0, 30.9.2021.

    Reference

    Dominik Schlechtweg and Sabine Schulte im Walde. submitted. Clustering Word Usage Graphs: A Flexible Framework to Measure Changes in Contextual Word Meaning.

  19. Dataset of Pairs of an Image and Tags for Cataloging Image-based Records

    • narcis.nl
    • data.mendeley.com
    Updated Apr 19, 2022
    Cite
    Suzuki, T (via Mendeley Data) (2022). Dataset of Pairs of an Image and Tags for Cataloging Image-based Records [Dataset]. http://doi.org/10.17632/msyc6mzvhg.2
    Explore at:
    Dataset updated
    Apr 19, 2022
    Dataset provided by
    Data Archiving and Networked Services (DANS)
    Authors
    Suzuki, T (via Mendeley Data)
    Description

    Brief Explanation

    This dataset is created to develop and evaluate a cataloging system which assigns appropriate metadata to an image record for database management in digital libraries. It is intended for evaluating a task in which, given an image and its assigned tags, an appropriate Wikipedia page is selected for each of the given tags. A main characteristic of the dataset is that it includes ambiguous tags, so the visual contents of images are not unique to their tags. For example, it includes the tag 'mouse', which can mean either the mammal or the computer pointing device. The annotations are the corresponding Wikipedia articles for the tags, judged to be the correct entities by humans. The dataset offers both data and programs that reproduce experiments for the above-mentioned task. Its data consist of image sources and annotations: the image sources are URLs of 420 images uploaded to Flickr, and the annotations are a total of 2,464 relevant Wikipedia pages manually judged for the tags of the images. The dataset also provides programs in a Jupyter notebook (scripts.ipynb) to run a series of baseline methods for the designated task and evaluate the results.

    Structure of the Dataset

    1. data directory
       1.1. image_URL.txt: lists the URLs of the image files.
       1.2. rels.txt: lists the correct Wikipedia pages for each topic in topics.txt.
       1.3. topics.txt: lists a target pair, called a topic in this dataset, of an image and a tag to be disambiguated.
       1.4. enwiki_20171001.xml: texts extracted from the title and body parts of English Wikipedia articles as of 1 October 2017; a modified version of the Wikipedia dump data (https://archive.org/download/enwiki-20171001).
    2. img directory: a placeholder directory into which image files are downloaded.
    3. results directory: a placeholder directory for storing result files for evaluation. It contains three baseline-method results in sub-directories; each sub-directory holds JSON files, one per topic, ready to be evaluated with the evaluation scripts in scripts.ipynb for reference of both usage and performance.
    4. scripts.ipynb: the Jupyter notebook containing the scripts for running the baseline methods and the evaluation.

  20. Data from: Slovenian Word in Context dataset SloWiC 1.0

    • clarin.si
    • live.european-language-grid.eu
    Updated Mar 23, 2023
    Cite
    Timotej Knez; Slavko Žitnik (2023). Slovenian Word in Context dataset SloWiC 1.0 [Dataset]. https://clarin.si/repository/xmlui/handle/11356/1781
    Explore at:
    Dataset updated
    Mar 23, 2023
    Authors
    Timotej Knez; Slavko Žitnik
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    The SloWiC dataset is a Slovenian dataset for the Word in Context task. Each example in the dataset contains a target word with multiple meanings and two sentences that both contain the target word. Each example is also annotated with a label that indicates whether both sentences use the same meaning of the target word. The dataset contains 1,808 manually annotated sentence pairs and an additional 13,150 automatically annotated pairs to help with training larger models. The dataset is stored in JSON format, following the format used in the SuperGLUE version of the Word in Context task (https://super.gluebenchmark.com/).

    Each example contains the following data fields:
    - word: the target word with multiple meanings
    - sentence1: the first sentence containing the target word
    - sentence2: the second sentence containing the target word
    - idx: the index of the example in the dataset
    - label: whether the two sentences use the same meaning of the target word
    - start1: start of the target word in the first sentence
    - start2: start of the target word in the second sentence
    - end1: end of the target word in the first sentence
    - end2: end of the target word in the second sentence
    - version: the version of the annotation
    - manual_annotation: whether the label was manually annotated
    - group: the group of annotators that labelled the example
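    The JSON layout can be exercised with a hand-made record. The values below are illustrative, not taken from the corpus, and the end offsets are assumed to be exclusive:

```python
import json

# A hand-made record in the SuperGLUE-style WiC layout described
# above (values are illustrative, not taken from the corpus).
record = json.loads("""{
  "word": "list",
  "sentence1": "Jesenski list je padel z drevesa.",
  "sentence2": "Podpisali so uradni list.",
  "idx": 0, "label": false,
  "start1": 9, "end1": 13, "start2": 20, "end2": 24,
  "version": 1, "manual_annotation": true, "group": "A"
}""")

def target_surface(rec, which):
    """Slice the target word out of sentence 1 or 2 using its offsets."""
    s = rec[f"sentence{which}"]
    return s[rec[f"start{which}"] : rec[f"end{which}"]]
```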

