59 datasets found
  1. DWUG ES: Diachronic Word Usage Graphs for Spanish

    • zenodo.org
    zip
    Updated Feb 27, 2024
    Cite
    Frank D. Zamora-Reina; Felipe Bravo-Marquez; Dominik Schlechtweg (2024). DWUG ES: Diachronic Word Usage Graphs for Spanish [Dataset]. http://doi.org/10.5281/zenodo.6433203
    Available download formats: zip
    Dataset updated
    Feb 27, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Frank D. Zamora-Reina; Felipe Bravo-Marquez; Dominik Schlechtweg
    License

    Attribution-NoDerivs 4.0 (CC BY-ND 4.0), https://creativecommons.org/licenses/by-nd/4.0/
    License information was derived automatically

    Description

    This data collection contains diachronic Word Usage Graphs (WUGs) for Spanish. Find a description of the data format, code to process the data and further datasets on the WUGsite.

    Please find more information on the provided data in the paper referenced below.

    The annotation was funded by

    • ANID FONDECYT grant 11200290, U-Inicia VID Project UI-004/20,
    • ANID - Millennium Science Initiative Program - Code ICN17_002, and
    • SemRel Group (DFG Grants SCHU 2580/1 and SCHU 2580/2).

    Version: 1.0.1, 9.4.2022. Development data.

    Reference

    Frank D. Zamora-Reina, Felipe Bravo-Marquez, Dominik Schlechtweg. 2022. LSCDiscovery: A shared task on semantic change discovery and detection in Spanish. In Proceedings of the 3rd International Workshop on Computational Approaches to Historical Language Change. Association for Computational Linguistics.

  2. Abstract Meaning Representation (AMR) Annotation Release 3.0

    • abacus.library.ubc.ca
    iso, txt
    Updated Sep 3, 2021
    Cite
    Abacus Data Network (2021). Abstract Meaning Representation (AMR) Annotation Release 3.0 [Dataset]. https://abacus.library.ubc.ca/dataset.xhtml?persistentId=hdl:11272.1/AB2/82CVJF
    Available download formats: txt (1308), iso (276281344)
    Dataset updated
    Sep 3, 2021
    Dataset provided by
    Abacus Data Network
    Description

    Introduction

    Abstract Meaning Representation (AMR) Annotation Release 3.0 was developed by the Linguistic Data Consortium (LDC), SDL/Language Weaver, Inc., the University of Colorado's Computational Language and Educational Research group, and the Information Sciences Institute at the University of Southern California. It contains a sembank (semantic treebank) of 59,255 English natural language sentences from broadcast conversations, newswire, weblogs, web discussion forums, fiction, and web text. This release adds new data to, and updates material contained in, Abstract Meaning Representation 2.0 (LDC2017T10), specifically: more annotations on new and prior data, new or improved PropBank-style frames, enhanced quality control, and multi-sentence annotations.

    AMR captures "who is doing what to whom" in a sentence. Each sentence is paired with a graph that represents its whole-sentence meaning in a tree structure. AMR utilizes PropBank frames, non-core semantic roles, within-sentence coreference, named entity annotation, modality, negation, questions, quantities, and so on to represent the semantic structure of a sentence largely independent of its syntax. LDC also released Abstract Meaning Representation (AMR) Annotation Release 1.0 (LDC2014T12) and Abstract Meaning Representation (AMR) Annotation Release 2.0 (LDC2017T10).

    Data

    The source data includes discussion forums collected for the DARPA BOLT and DEFT programs, transcripts and English translations of Mandarin Chinese broadcast news programming from China Central TV, Wall Street Journal text, translated Xinhua news texts, various newswire data from NIST OpenMT evaluations, and weblog data used in the DARPA GALE program. Source data new to AMR 3.0 includes sentences from Aesop's Fables, parallel text and the situation frame data set developed by LDC for the DARPA LORELEI program, and lead sentences from Wikipedia articles about named entities.
    The following table summarizes the number of training, dev, and test AMRs for each dataset in the release, with totals by partition and dataset:

    Dataset                  Training    Dev   Test   Total
    BOLT DF MT                   1061    133    133    1327
    Broadcast conversation        214      0      0     214
    Weblog and WSJ                  0    100    100     200
    BOLT DF English              7379    210    229    7818
    DEFT DF English             32915      0      0   32915
    Aesop fables                   49      0      0      49
    Guidelines AMRs               970      0      0     970
    LORELEI                      4441    354    527    5322
    2009 Open MT                  204      0      0     204
    Proxy reports                6603    826    823    8252
    Weblog                        866      0      0     866
    Wikipedia                     192      0      0     192
    Xinhua MT                     741     99     86     926
    Totals                      55635   1722   1898   59255

    Data in the "split" directory contains 59,255 AMRs split roughly 93.9%/2.9%/3.2% into training/dev/test partitions, with most smaller datasets assigned wholly to one split. Note that splits observe document boundaries. The "unsplit" directory contains the same 59,255 AMRs with no train/dev/test partition.
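    The split sizes above can be cross-checked with a few lines of arithmetic (a minimal sketch; the per-dataset counts are copied from the table, and the rounding of the partition shares is an assumption):

```python
# Per-dataset (train, dev, test) AMR counts from the release table above.
counts = {
    "BOLT DF MT": (1061, 133, 133),
    "Broadcast conversation": (214, 0, 0),
    "Weblog and WSJ": (0, 100, 100),
    "BOLT DF English": (7379, 210, 229),
    "DEFT DF English": (32915, 0, 0),
    "Aesop fables": (49, 0, 0),
    "Guidelines AMRs": (970, 0, 0),
    "LORELEI": (4441, 354, 527),
    "2009 Open MT": (204, 0, 0),
    "Proxy reports": (6603, 826, 823),
    "Weblog": (866, 0, 0),
    "Wikipedia": (192, 0, 0),
    "Xinhua MT": (741, 99, 86),
}

# Column sums must match the "Totals" row.
train, dev, test = (sum(c[i] for c in counts.values()) for i in range(3))
total = train + dev + test
assert (train, dev, test) == (55635, 1722, 1898)
assert total == 59255

# Partition shares, rounded to one decimal place: roughly 93.9/2.9/3.2.
shares = [round(100 * n / total, 1) for n in (train, dev, test)]
print(shares)  # [93.9, 2.9, 3.2]
```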

  3. Annotation Curricula to Implicitly Train Non-Expert Annotators

    • b2find.dkrz.de
    Updated Aug 29, 2023
    Cite
    (2023). Annotation Curricula to Implicitly Train Non-Expert Annotators - Dataset - B2FIND [Dataset]. https://b2find.dkrz.de/dataset/a5f6640f-4c4c-59be-b3e9-a53b79b57c97
    Dataset updated
    Aug 29, 2023
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Annotation studies often require annotators to familiarize themselves with the task, its annotation scheme, and the data domain. This can be overwhelming at the beginning, mentally taxing, and can introduce errors into the resulting annotations; especially in citizen science or crowdsourcing scenarios where domain expertise is not required and only annotation guidelines are provided. To alleviate these issues, we propose annotation curricula, a novel approach to implicitly train annotators. We gradually introduce annotators to the task by ordering the instances to be annotated according to a learning curriculum. To do so, we first formalize annotation curricula for sentence- and paragraph-level annotation tasks, define an ordering strategy, and identify well-performing heuristics and interactively trained models on three existing English datasets. We then conduct a user study with 40 voluntary participants who are asked to identify the most fitting misconception for English tweets about the Covid-19 pandemic. Our results show that using a simple heuristic to order instances can already significantly reduce the total annotation time while preserving high annotation quality. Annotation curricula thus provide a novel way to improve data collection. To facilitate future research, we further share our code and data, consisting of 2,400 annotations.

  4. TestWUG EN: Test Word Usage Graphs for English

    • zenodo.org
    zip
    Updated Jun 30, 2023
    Cite
    Dominik Schlechtweg (2023). TestWUG EN: Test Word Usage Graphs for English [Dataset]. http://doi.org/10.5281/zenodo.7900960
    Available download formats: zip
    Dataset updated
    Jun 30, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Dominik Schlechtweg
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This data collection contains test Word Usage Graphs (WUGs) for English. Find a description of the data format, code to process the data and further datasets on the WUGsite.

    The data is provided for testing purposes and thus contains specific data cases, which are sometimes artificially created, sometimes picked from existing data sets. The data contains the following cases:

    • afternoon_nn: sampled from DWUG EN 2.0.1. 200 uses partly annotated by multiple annotators, with 427 judgments. Has a clear cluster structure with only one cluster, no graded change, no binary change, and medium agreement of 0.62 Krippendorff's alpha.
    • arm: standard textbook example for semantic proximity (see reference below). Fully connected graph with six word uses, annotated by the author.
    • plane_nn: sampled from DWUG EN 2.0.1. 200 uses partly annotated by multiple annotators, with 1152 judgments. Has a clear cluster structure, high graded change, binary change, and high agreement of 0.82 Krippendorff's alpha.
    • target: similar to arm, but with only two repeated sentences. Fully connected graph with six word uses, annotated by the author. Pairs of the same sentence (exactly the same string) are annotated with 4; pairs of different strings are annotated with 1.

    Please find more information in the paper referenced below.

    Version: 1.0.0, 05.05.2023.

    Reference

    Dominik Schlechtweg. 2023. Human and Computational Measurement of Lexical Semantic Change. PhD thesis. University of Stuttgart.

  5. DWUG EN: Diachronic Word Usage Graphs for English

    • explore.openaire.eu
    Updated Sep 30, 2021
    Cite
    Dominik Schlechtweg; Haim Dubossarsky; Simon Hengchen; Barbara McGillivray; Nina Tahmasebi (2021). DWUG EN: Diachronic Word Usage Graphs for English [Dataset]. http://doi.org/10.5281/zenodo.14028531
    Dataset updated
    Sep 30, 2021
    Authors
    Dominik Schlechtweg; Haim Dubossarsky; Simon Hengchen; Barbara McGillivray; Nina Tahmasebi
    Description

    This data collection contains diachronic Word Usage Graphs (WUGs) for English. Find a description of the data format, code to process the data and further datasets on the WUGsite. See previous versions for additional testsets. This version extends previous versions with one more annotation round and new clusterings.

    Please find more information on the provided data in the papers referenced below.

    Reference

    Dominik Schlechtweg, Nina Tahmasebi, Simon Hengchen, Haim Dubossarsky, Barbara McGillivray. 2021. DWUG: A large Resource of Diachronic Word Usage Graphs in Four Languages. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing.

    Dominik Schlechtweg, Pierluigi Cassotti, Bill Noble, David Alfter, Sabine Schulte im Walde, Nina Tahmasebi. 2024. More DWUGs: Extending and Evaluating Word Usage Graph Datasets in Multiple Languages. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing.

  6. SURel: Synchronic Usage Relatedness

    • zenodo.org
    zip
    Updated Feb 27, 2024
    Cite
    Anna Hätty; Dominik Schlechtweg; Sabine Schulte im Walde (2024). SURel: Synchronic Usage Relatedness [Dataset]. http://doi.org/10.5281/zenodo.5543348
    Available download formats: zip
    Dataset updated
    Feb 27, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anna Hätty; Dominik Schlechtweg; Sabine Schulte im Walde
    License

    Attribution-NoDerivs 4.0 (CC BY-ND 4.0), https://creativecommons.org/licenses/by-nd/4.0/
    License information was derived automatically

    Description

    This data collection contains synchronic semantic relatedness judgments for German word usage pairs drawn from general language and the domain of cooking. Find a description of the data format, code to process the data and further datasets on the WUGsite.

    We provide additional data under misc/:

    • testset: a semantic meaning shift test set with 22 German lexemes exhibiting different degrees of meaning shifts from general language to the domain of cooking. The 'mean relatedness score' denotes the annotation-based measure of semantic shift described in the paper. 'frequency GEN' and 'frequency SPEC' list the frequencies of the target words in the general-language corpus (GEN) and the domain-specific cooking corpus (SPEC). 'translations' provides English translations across senses, illustrating possible meaning shifts. Note that further senses might exist.
    • tables: the annotated table of each annotator.
    • plots: data visualization plots.

    Please find more information on the provided data in the paper referenced below.

    Version: 2.0.0, 30.9.2021.

    Reference

    Anna Hätty, Dominik Schlechtweg, Sabine Schulte im Walde. 2019. SURel: A Gold Standard for Incorporating Meaning Shifts into Term Extraction. In Proceedings of the Eighth Joint Conference on Lexical and Computational Semantics (*SEM). Minneapolis, Minnesota, USA, 2019.

  7. Expert annotations for the Catalan Common Voice (v13)

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated May 2, 2024
    Cite
    Zenodo (2024). Expert annotations for the Catalan Common Voice (v13) [Dataset]. http://doi.org/10.5281/zenodo.11104388
    Available download formats: zip
    Dataset updated
    May 2, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Description

    - Homepage: https://projecteaina.cat/tech/
    - Point of Contact: langtech@bsc.es

    Dataset Summary

    These are the annotations made by a team of experts on the speakers with more than 1200 seconds recorded in the Catalan set of the Common Voice dataset (v13).

    The annotators were initially tasked with evaluating all recordings associated with the same individual. Following that, they were instructed to annotate the speaker's accent, gender, and the overall quality of the recordings.

    The accents and genders taken into account are the ones used until version 8 of the Common Voice corpus.

    See annotations for more details.

    Supported Tasks and Leaderboards

    Gender classification, Accent classification.

    Languages

    The dataset is in Catalan (ca).

    Dataset Structure

    Instances

    Two xlsx documents are published, one for each round of annotations.

    The following information is available in each of the documents:


    {
     'speaker ID': '1b7fc0c4e437188bdf1b03ed21d45b780b525fd0dc3900b9759d0755e34bc25e31d64e69c5bd547ed0eda67d104fc0d658b8ec78277810830167c53ef8ced24b',
     'idx': '31',
     'same speaker': {'AN1': 'SI', 'AN2': 'SI', 'AN3': 'SI',
                      'agreed': 'SI', 'percentage': '100'},
     'gender': {'AN1': 'H', 'AN2': 'H', 'AN3': 'H',
                'agreed': 'H', 'percentage': '100'},
     'accent': {'AN1': 'Central', 'AN2': 'Central', 'AN3': 'Central',
                'agreed': 'Central', 'percentage': '100'},
     'audio quality': {'AN1': '4.0', 'AN2': '3.0', 'AN3': '3.0',
                       'agreed': '3.0', 'percentage': '66',
                       'mean quality': '3.33', 'stdev quality': '0.58'},
     'comments': {'AN1': '',
                  'AN2': 'pujades i baixades de volum',
                  'AN3': "Deu ser d'alguna zona de transició amb el central, perquè no fa una reducció total vocàlica, però hi té molta tendència"},
    }

    We also publish the document Guia anotació parlants.pdf, with the guidelines the annotators received.

    Data Fields

    • speaker ID (string): ID of the client (voice) that made the recordings in the Common Voice corpus
    • idx (int): ID in this corpus
    • AN1 (string): Annotations from Annotator 1
    • AN2 (string): Annotations from Annotator 2
    • AN3 (string): Annotations from Annotator 3
    • agreed (string): Annotation given by the majority of the annotators
    • percentage (int): Percentage of annotators that agree with the agreed annotation
    • mean quality (float): Mean of the quality annotations
    • stdev quality (float): Standard deviation of the quality annotations
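    The agreed and percentage fields can be reproduced from the three annotators' columns with a simple majority vote. A minimal sketch (the function name is hypothetical, and the floored percentage is an assumption inferred from the '66' in the example record above):

```python
from collections import Counter

def aggregate(labels):
    """Return the majority label and the floored percentage of
    annotators who chose it, for one field of one speaker row.
    Ties fall to the first-seen label (an assumption)."""
    winner, votes = Counter(labels).most_common(1)[0]
    return winner, 100 * votes // len(labels)

print(aggregate(["SI", "SI", "SI"]))     # ('SI', 100)
print(aggregate(["4.0", "3.0", "3.0"]))  # ('3.0', 66)
```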

    Data Splits

    The corpus is not divided into splits, as it is not intended for training models.

    Dataset Creation

    Curation Rationale

    During 2022, a campaign was launched to promote the Common Voice corpus within the Catalan-speaking community, achieving remarkable success. However, not all participants provided their demographic details such as age, gender, and accent. Additionally, some individuals faced difficulty in self-defining their accent using the standard classifications established by specialists.

    In order to obtain a balanced corpus with reliable information, we saw the necessity of enlisting a group of experts from the University of Barcelona to provide accurate annotations.

    We release the complete annotations because transparency is fundamental to our project. Furthermore, we believe they hold philological value for studying dialectal and gender variants.

    Source Data

    The original data comes from the Catalan sentences of the Common Voice corpus (https://commonvoice.mozilla.org/en/datasets).

    Initial Data Collection and Normalization

    We have selected speakers who have recorded more than 1200 seconds of speech in the Catalan set of version 13 of the Common Voice corpus.

    Who are the source language producers?

    The original data comes from the Catalan sentences of the Common Voice corpus.

    Annotations

    Annotation process

    Starting with version 13 of the Common Voice corpus, we identified the 273 speakers who had recorded more than 1200 seconds of speech.

    A team of three annotators was tasked with annotating:

    • if all the recordings correspond to the same person
    • the gender of the speaker
    • the accent of the speaker
    • the quality of the recording

    They conducted an initial round of annotation, discussed their varying opinions, and subsequently conducted a second round.


    Who are the annotators?

    The annotation was entrusted to the CLiC (Centre de Llenguatge i Computació) team (https://clic.ub.edu/en/que-es-clic) from the University of Barcelona.
    They selected a group of three annotators (two men and one woman), who received a scholarship to do this work.

    The annotation team was composed of:

    • Annotator 1: 1 female annotator, aged 18-25, L1 Catalan, student in the Modern Languages and Literatures degree, with a focus on Catalan.
    • Annotators 2 & 3: 2 male annotators, aged 18-25, L1 Catalan, students in the Catalan Philology degree.
    • 1 female supervisor, aged 40-50, L1 Catalan, graduate in Physics and in Linguistics, Ph.D. in Signal Theory and Communications.

    To do the annotation, they used a Google Drive spreadsheet.

    Personal and Sensitive Information

    The Common Voice dataset consists of people who have donated their voice online. We do not share their voices here, only their gender and accent annotations.
    You agree not to attempt to determine the identity of speakers in the Common Voice dataset.

    Considerations for Using the Data

    Social Impact of Dataset

    The IDs come from the Common Voice dataset, which consists of people who have donated their voices online.


    The information from this corpus will allow us to train and evaluate well-balanced Catalan ASR models. Furthermore, we believe it holds philological value for studying dialectal and gender variants.

    Discussion of Biases

    Most of the voices in the Catalan Common Voice correspond to men between 40 and 60 years old with a central accent. The aim of this dataset is to provide information that allows these biases to be minimized.

    For the gender annotation, we have only considered "H" (male) and "D" (female).

    Other Known Limitations

    [N/A]

    Additional Information

    Dataset Curators

    Language Technologies Unit at the Barcelona Supercomputing Center (langtech@bsc.es)

    This work has been promoted and financed by the Generalitat de Catalunya through the Aina project.

    Licensing Information

    This dataset is licensed under a CC BY 4.0 license.

    It can be used for any purpose, whether academic or commercial, under the terms of the license.
    Give appropriate credit, provide a link to the license, and indicate if changes were made.

    Citation Information

    DOI: http://doi.org/10.5281/zenodo.11104388

    Contributions

    The annotation was entrusted to the STeL team from the University of Barcelona.

  8. LDC2017T10 Dataset

    • paperswithcode.com
    Updated Oct 28, 2021
    Cite
    (2021). LDC2017T10 Dataset [Dataset]. https://paperswithcode.com/dataset/ldc2017t10
    Dataset updated
    Oct 28, 2021
    Description

    Abstract Meaning Representation (AMR) Annotation Release 2.0 was developed by the Linguistic Data Consortium (LDC), SDL/Language Weaver, Inc., the University of Colorado's Computational Language and Educational Research group and the Information Sciences Institute at the University of Southern California. It contains a sembank (semantic treebank) of over 39,260 English natural language sentences from broadcast conversations, newswire, weblogs and web discussion forums.

    AMR captures “who is doing what to whom” in a sentence. Each sentence is paired with a graph that represents its whole-sentence meaning in a tree-structure. AMR utilizes PropBank frames, non-core semantic roles, within-sentence coreference, named entity annotation, modality, negation, questions, quantities, and so on to represent the semantic structure of a sentence largely independent of its syntax.

  9. DWUG SV: Diachronic Word Usage Graphs for Swedish

    • zenodo.org
    • explore.openaire.eu
    zip
    Updated Nov 3, 2024
    Cite
    Nina Tahmasebi; Simon Hengchen; Dominik Schlechtweg; Barbara McGillivray; Haim Dubossarsky (2024). DWUG SV: Diachronic Word Usage Graphs for Swedish [Dataset]. http://doi.org/10.5281/zenodo.14028906
    Available download formats: zip
    Dataset updated
    Nov 3, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Nina Tahmasebi; Simon Hengchen; Dominik Schlechtweg; Barbara McGillivray; Haim Dubossarsky
    License

    Attribution-NoDerivs 4.0 (CC BY-ND 4.0), https://creativecommons.org/licenses/by-nd/4.0/
    License information was derived automatically

    Description

    This data collection contains diachronic Word Usage Graphs (WUGs) for Swedish. Find a description of the data format, code to process the data and further datasets on the WUGsite.

    See previous versions for additional testsets.

    Please find more information on the provided data in the papers referenced below.

    Reference

    Dominik Schlechtweg, Nina Tahmasebi, Simon Hengchen, Haim Dubossarsky, Barbara McGillivray. 2021. DWUG: A large Resource of Diachronic Word Usage Graphs in Four Languages. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing.

    Dominik Schlechtweg, Pierluigi Cassotti, Bill Noble, David Alfter, Sabine Schulte im Walde, Nina Tahmasebi. 2024. More DWUGs: Extending and Evaluating Word Usage Graph Datasets in Multiple Languages. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing.

  10. PEARC20 submitted paper: "Scientific Data Annotation and Dissemination: Using the ‘Ike Wai Gateway to Manage Research Data"

    • hydroshare.org
    • beta.hydroshare.org
    • +1more
    zip
    Updated Jul 29, 2020
    Cite
    Sean Cleveland; Gwen Jacobs; Jennifer Geis (2020). PEARC20 submitted paper: "Scientific Data Annotation and Dissemination: Using the ‘Ike Wai Gateway to Manage Research Data" [Dataset]. http://doi.org/10.4211/hs.d66ef2686787403698bac5368a29b056
    Available download formats: zip (873 bytes)
    Dataset updated
    Jul 29, 2020
    Dataset provided by
    HydroShare
    Authors
    Sean Cleveland; Gwen Jacobs; Jennifer Geis
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Time period covered
    Jul 29, 2020
    Description

    Abstract: Granting agencies invest millions of dollars on the generation and analysis of data, making these products extremely valuable. However, without sufficient annotation of the methods used to collect and analyze the data, the ability to reproduce and reuse those products suffers. This lack of assurance of the quality and credibility of the data at the different stages in the research process essentially wastes much of the investment of time and funding and fails to drive research forward to the level of potential possible if everything was effectively annotated and disseminated to the wider research community. In order to address this issue for the Hawai’i Established Program to Stimulate Competitive Research (EPSCoR) project, a water science gateway was developed at the University of Hawai‘i (UH), called the ‘Ike Wai Gateway. In Hawaiian, ‘Ike means knowledge and Wai means water. The gateway supports research in hydrology and water management by providing tools to address questions of water sustainability in Hawai‘i. The gateway provides a framework for data acquisition, analysis, model integration, and display of data products. The gateway is intended to complement and integrate with the capabilities of the Consortium of Universities for the Advancement of Hydrologic Science’s (CUAHSI) Hydroshare by providing sound data and metadata management capabilities for multi-domain field observations, analytical lab actions, and modeling outputs. Functionality provided by the gateway is supported by a subset of the CUAHSI’s Observations Data Model (ODM) delivered as centralized web based user interfaces and APIs supporting multi-domain data management, computation, analysis, and visualization tools to support reproducible science, modeling, data discovery, and decision support for the Hawai’i EPSCoR ‘Ike Wai research team and wider Hawai‘i hydrology community. 
By leveraging the Tapis platform, UH has constructed a gateway that ties data and advanced computing resources together to support diverse research domains including microbiology, geochemistry, geophysics, economics, and humanities, coupled with computational and modeling workflows delivered in a user friendly web interface with workflows for effectively annotating the project data and products. Disseminating results for the ‘Ike Wai project through the ‘Ike Wai data gateway and Hydroshare makes the research products accessible and reusable.

  11. RefWUG: Diachronic Reference Word Usage Graphs for German

    • zenodo.org
    • explore.openaire.eu
    • +1more
    zip
    Updated Feb 27, 2024
    Cite
    Dominik Schlechtweg; Sabine Schulte im Walde (2024). RefWUG: Diachronic Reference Word Usage Graphs for German [Dataset]. http://doi.org/10.5281/zenodo.5791269
    Available download formats: zip
    Dataset updated
    Feb 27, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Dominik Schlechtweg; Sabine Schulte im Walde
    License

    Attribution-NoDerivs 4.0 (CC BY-ND 4.0), https://creativecommons.org/licenses/by-nd/4.0/
    License information was derived automatically

    Description

    This data collection contains diachronic Word Usage Graphs (WUGs) for German created with reference use sampling. Find a description of the data format, code to process the data and further datasets on the WUGsite.

    Please find more information on the provided data in the paper referenced below.

    Version: 1.1.0, 15.12.2021.

    Reference

    Dominik Schlechtweg and Sabine Schulte im Walde. submitted. Clustering Word Usage Graphs: A Flexible Framework to Measure Changes in Contextual Word Meaning.

  12. DURel: Diachronic Usage Relatedness

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Feb 27, 2024
    Cite
    Dominik Schlechtweg; Sabine Schulte im Walde; Stefanie Eckmann (2024). DURel: Diachronic Usage Relatedness [Dataset]. http://doi.org/10.5281/zenodo.5784453
    Available download formats: zip
    Dataset updated
    Feb 27, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Dominik Schlechtweg; Sabine Schulte im Walde; Stefanie Eckmann
    License

    Attribution-NoDerivs 4.0 (CC BY-ND 4.0), https://creativecommons.org/licenses/by-nd/4.0/
    License information was derived automatically

    Description

    This data collection contains diachronic semantic relatedness judgments for German word usage pairs. Find a description of the data format, code to process the data and further datasets on the WUGsite.

    Please find more information on the provided data in the paper referenced below.

    See previous versions for additional plots, tables and testsets.

    Version: 3.0.0, 15.12.2021.

    Reference

    Dominik Schlechtweg, Sabine Schulte im Walde, Stefanie Eckmann. 2018. Diachronic Usage Relatedness (DURel): A Framework for the Annotation of Lexical Semantic Change. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT). New Orleans, Louisiana, USA.

  13. Data from: CT-EBM-SP - Corpus of Clinical Trials for Evidence-Based-Medicine in Spanish

    • data.niaid.nih.gov
    • zenodo.org
    Updated Feb 13, 2022
    Cite
    Capllonch-Carrión, Adrián (2022). CT-EBM-SP - Corpus of Clinical Trials for Evidence-Based-Medicine in Spanish [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6059737
    Dataset updated
    Feb 13, 2022
    Dataset provided by
    Moreno-Sandoval, Antonio
    Campillos-Llanos, Leonardo
    Capllonch-Carrión, Adrián
    Valverde-Mateos, Ana
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A collection of 1200 texts (292 173 tokens) about clinical trial studies and clinical trial announcements in Spanish:

    • 500 abstracts from journals published under a Creative Commons license, e.g. available in PubMed or the Scientific Electronic Library Online (SciELO).
    • 700 clinical trials announcements published in the European Clinical Trials Register and Repositorio Español de Estudios Clínicos.

    Texts were annotated with entities from the Unified Medical Language System semantic groups: anatomy (ANAT), pharmacological and chemical substances (CHEM), pathologies (DISO), and lab tests, diagnostic or therapeutic procedures (PROC). 46 699 entities were annotated (13.98% are nested entities). 10% of the corpus was doubly annotated, and inter-annotator agreement (IAA) achieved a mean F-measure of 85.65% (±4.79, strict match) and a mean F-measure of 93.94% (±3.31, relaxed match).
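The strict vs. relaxed agreement figures can be illustrated with a small sketch. This is a minimal illustration, assuming a strict match requires identical boundaries and label while a relaxed match only requires overlapping boundaries with the same label; the exact matching criteria of the CT-EBM-SP evaluation may differ.

```python
# Entities as (start, end, label) character spans; the matching rules are
# assumptions for illustration, not the corpus's official evaluation.

def strict_match(g, p):
    return g == p  # identical boundaries and identical label

def relaxed_match(g, p):
    # same label, overlapping character spans
    return g[2] == p[2] and g[0] < p[1] and p[0] < g[1]

def f_measure(gold, pred, match):
    tp = sum(any(match(g, p) for g in gold) for p in pred)
    precision = tp / len(pred)
    recall = sum(any(match(g, p) for p in pred) for g in gold) / len(gold)
    return 2 * precision * recall / (precision + recall)

gold = [(0, 12, "DISO"), (20, 29, "PROC"), (35, 42, "CHEM")]
pred = [(0, 12, "DISO"), (18, 29, "PROC")]

print(round(f_measure(gold, pred, strict_match), 2))   # → 0.4
print(round(f_measure(gold, pred, relaxed_match), 2))  # → 0.8
```

The relaxed score is higher because the boundary-shifted PROC span still counts as a hit, which mirrors why the corpus reports a higher relaxed-match F-measure than strict-match.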

    The corpus is freely distributed for research and educational purposes under a Creative Commons Non-Commercial Attribution (CC-BY-NC-A) License.

  14. Data from: FluoroMatch 2.0-making automated and comprehensive non-targeted...

    • catalog.data.gov
    • gimi9.com
    Updated Feb 10, 2023
    Cite
    U.S. EPA Office of Research and Development (ORD) (2023). FluoroMatch 2.0-making automated and comprehensive non-targeted PFAS annotation a reality [Dataset]. https://catalog.data.gov/dataset/fluoromatch-2-0-making-automated-and-comprehensive-non-targeted-pfas-annotation-a-reality
    Explore at:
    Dataset updated
    Feb 10, 2023
    Dataset provided by
    United States Environmental Protection Agencyhttp://www.epa.gov/
    Description

    Data for "Koelmel JP, Stelben P, McDonough CA, Dukes DA, Aristizabal-Henao JJ, Nason SL, Li Y, Sternberg S, Lin E, Beckmann M, Williams AJ, Draper J, Finch JP, Munk JK, Deigl C, Rennie EE, Bowden JA, Godri Pollitt KJ. FluoroMatch 2.0-making automated and comprehensive non-targeted PFAS annotation a reality. Anal Bioanal Chem. 2022 Jan;414(3):1201-1215. doi: 10.1007/s00216-021-03392-7. Epub 2021 May 20. PMID: 34014358.". Portions of this dataset are inaccessible because: The link provided by UCSD doesn't seem to be working. They can be accessed through the following means: Contact Jeremy Koelmel at Yale University, jeremykoelmel@innovativeomics.com. Format: The final annotated excel sheets with feature intensities, annotations, homologous series groupings, etc., are available as a supplemental excel file with the online version of this manuscript. The raw Agilent “.d” files can be downloaded at: ftp://massive.ucsd.edu/MSV000086811/updates/2021-02-05_jeremykoelmel_e5b21166/raw/McDonough_AFFF_3M_ddMS2_Neg.zip (Note use Google Chrome or Firefox, Microsoft Edge and certain other browsers are unable to download from an ftp link). This dataset is associated with the following publication: Koelmel, J.P., P. Stelben, C.A. McDonough, D.A. Dukes, J.J. Aristizabal-Henao, S.L. Nason, Y. Li, S. Sternberg, E. Lin, M. Beckmann, A. Williams, J. Draper, J. Finch, J.K. Munk, C. Deigl, E. Rennie, J.A. Bowden, and K.J. Godri Pollitt. FluoroMatch 2.0—making automated and comprehensive non-targeted PFAS annotation a reality. Analytical and Bioanalytical Chemistry. Springer, New York, NY, USA, 414(3): 1201-1215, (2022).

  15. Amazon Mechanical Turk: Sentence annotation experiments

    • datacatalogue.cessda.eu
    Updated Mar 25, 2025
    + more versions
    Cite
    Lau, J; Lappin, S (2025). Amazon Mechanical Turk: Sentence annotation experiments [Dataset]. http://doi.org/10.5255/UKDA-SN-851337
    Explore at:
    Dataset updated
    Mar 25, 2025
    Dataset provided by
    King
    Authors
    Lau, J; Lappin, S
    Time period covered
    Oct 1, 2012 - Sep 30, 2015
    Area covered
    United Kingdom, United States
    Variables measured
    Individual
    Measurement technique
    Amazon Mechanical Turk crowd sourcing
    Description

    This data collection consists of two .csv files containing lists of sentences with individual and mean sentence ratings (crowd sourced judgements) on three modes of presentation.

    This research holds out the prospect of important impact in two areas. First, it can shed light on the relationship between the representation and acquisition of linguistic knowledge on the one hand, and learning and the encoding of knowledge in other cognitive domains on the other. This work can, in turn, help to clarify the respective roles of biologically conditioned learning biases and data-driven learning in human cognition.

    Second, this work can contribute to the development of more effective language technology by providing insight, from a computational perspective, into the way in which humans represent the syntactic properties of sentences in their language. To the extent that natural language processing systems take account of this class of representations they will provide more efficient tools for parsing and interpreting text and speech.

    In the past twenty-five years work in natural language technology has made impressive progress across a wide range of tasks, which include, among others, information retrieval and extraction, text interpretation and summarization, speech recognition, morphological analysis, syntactic parsing, word sense identification, and machine translation. Much of this progress has been due to the successful application of powerful techniques for probabilistic modeling and statistical analysis to large corpora of linguistic data. These methods have given rise to a set of engineering tools that are rapidly shaping the digital environment in which we access and process most of the information that we use.

    In recent work (Lappin and Shieber (2007), Clark and Lappin (2011a), Clark and Lappin (2011b)) my co-authors and I have argued that the machine learning methods that are driving the expansion of natural language technology are also directly relevant to understanding central features of human language acquisition. When these methods are used to construct carefully specified formal models and implementations of the grammar induction task, they yield striking insights into the limits and possibility of human learning on the basis of the primary linguistic data to which children are exposed. These models indicate that language learning can be achieved without the sorts of strong innate learning biases that have been posited by traditional theories of universal grammar. Weak biases, some derivable from non-linguistic cognitive domains, and domain general learning procedures are sufficient to support efficient data driven learning of plausible systems of grammatical representation.

    In the current research I am focussing on the problem of how to specify the class of representations that encode human knowledge of the syntax of natural languages. I am pursuing the hypothesis that a representation in this class is best expressed as an enriched statistical language model that assigns probability values to the sentences of a language. A central part of the enrichment of the model consists of a procedure for determining the acceptability (grammaticality) of a sentence as a graded value, relative to the properties of that sentence and the language of which it is a part. This procedure avoids the simple reduction of the grammaticality of a string to its estimated probability of occurrence, while still characterizing grammaticality in probabilistic terms. An enriched model of this kind will provide a straightforward explanation for the fact that individual native speakers generally judge the well formedness of sentences along a continuum, rather than through the imposition of a sharp boundary between acceptable and unacceptable sentences. The pervasiveness of gradedness in the linguistic knowledge of individual speakers poses a serious problem for classical theories of syntax, which partition strings of words into the grammatical sentences of a language and ill formed strings of words.
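The idea of a graded acceptability score that is probabilistic yet not a simple reduction to string probability can be made concrete with a length- and frequency-normalised measure such as SLOR (the syntactic log-odds ratio), one of the measures explored in this line of research. The sketch below is illustrative only, not the project's actual scoring procedure.

```python
def slor(logprob_sentence, unigram_logprobs):
    # (log P(sentence) - sum of word unigram log-probs) / sentence length:
    # discounts probability mass lost merely to rare words or sentence length.
    return (logprob_sentence - sum(unigram_logprobs)) / len(unigram_logprobs)

# Two three-word strings with the same model log-probability:
print(slor(-12.0, [-3.0, -3.0, -3.0]))  # → -1.0 (frequent words, low score)
print(slor(-12.0, [-6.0, -6.0, -6.0]))  # → 2.0 (rare words, higher score)
```

The second string scores higher: its low raw probability is fully explained by rare vocabulary, so the residual, normalised score treats it as more acceptable, illustrating how grammaticality can be graded without equating it with probability of occurrence.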

  16. Data from: FAPM: Functional annotation of proteins using multi-modal models...

    • search.dataone.org
    • data.niaid.nih.gov
    • +2more
    Updated Jul 17, 2024
    + more versions
    Cite
    Wenkai Xiang; Zhaoping Xiong; Mingyue Zheng; Huan Chen; Zunyun Fu; Wei Zhang; Jiacheng Xiong; Qian Shi; Bing Liu (2024). FAPM: Functional annotation of proteins using multi-modal models beyond structural modeling [Dataset]. https://search.dataone.org/view/sha256%3A3109cbde82aec70d1f9ebc5c91132588916b42fc29cbb08005dc7a0af3366692
    Explore at:
    Dataset updated
    Jul 17, 2024
    Dataset provided by
    Dryad Digital Repository
    Authors
    Wenkai Xiang; Zhaoping Xiong; Mingyue Zheng; Huan Chen; Zunyun Fu; Wei Zhang; Jiacheng Xiong; Qian Shi; Bing Liu
    Description

    Assigning accurate property labels to proteins, like functional terms and catalytic activity, is challenging, especially for proteins without homologs and “tail labels” with few known examples. Unlike previous methods that mainly focused on protein sequence features, we use a pretrained large natural language model to understand the semantic meaning of protein labels. Specifically, we introduce FAPM, a contrastive multi-modal model that links natural language with protein sequence language. This model combines a pretrained protein sequence model with a pretrained large language model to generate labels, such as Gene Ontology (GO) functional terms and catalytic activity predictions, in natural language. Our results show that FAPM excels in understanding protein properties, outperforming models based solely on protein sequences or structures. It achieves state-of-the-art performance on public benchmarks and in-house experimentally annotated phage proteins, which often have few known homol...

    https://doi.org/10.5061/dryad.m905qfv9p

    The online demo is at: https://huggingface.co/spaces/wenkai/FAPM_demo

    Description of the data and file structure

    The dataset includes:

    1. The information of GO (Gene Ontology). This is a system to describe the functions of proteins.

      -The basic version of the GO (file name: go1.4-basic.obo). Source: https://geneontology.org/docs/download-ontology/

      -The mapping between GO numbers and GO descriptions (file name: go_descriptions1.4.txt)

      -GO terms (file names: bp_terms.pkl; mf_terms.pkl; cc_terms.pkl)

    2. Manually annotated data derived from Uniprot database. These datasets are used to finetune the model.

      -File names:

      train_exp_prompt_bp.csv; train_exp_prompt_mf.csv; train_exp_prompt_cc.cs...
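The id-to-name mapping in go_descriptions1.4.txt can also be recovered directly from the OBO file. A minimal sketch of parsing [Term] stanzas into an id → name dictionary, assuming only the id: and name: fields matter (real OBO files carry many more fields per stanza):

```python
def parse_obo(text):
    """Collect GO id -> name pairs from [Term] stanzas of an OBO file."""
    terms, current = {}, {}
    for line in text.splitlines():
        line = line.strip()
        if line == "[Term]":
            current = {}                     # start a fresh stanza
        elif line.startswith("id:"):
            current["id"] = line[len("id:"):].strip()
        elif line.startswith("name:") and "id" in current:
            terms[current["id"]] = line[len("name:"):].strip()
    return terms

sample = """[Term]
id: GO:0003674
name: molecular_function

[Term]
id: GO:0008150
name: biological_process
"""
print(parse_obo(sample))
# → {'GO:0003674': 'molecular_function', 'GO:0008150': 'biological_process'}
```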

  17. Data from: Metaphor annotations in Polish political debates from 2020 (TVP...

    • live.european-language-grid.eu
    binary format
    Updated Jun 30, 2021
    + more versions
    Cite
    (2021). Metaphor annotations in Polish political debates from 2020 (TVP 2019-10-01 and TVN 2019-10-08) – presidential election [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/8682
    Explore at:
    binary formatAvailable download formats
    Dataset updated
    Jun 30, 2021
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0)https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    The data published here are supplementary material for a paper to be published in Metaphor and the Social World (under revision).

    Two debates organised and published by TVP and TVN were transcribed and annotated with the Metaphor Identification Procedure. We used eMargin software (a collaborative textual annotation tool; Kehoe and Gee 2013) and a slightly modified version of MIP (Pragglejaz 2007). Each lexical unit in the transcript was labelled as a metaphor related word (MRW) if its “contextual meaning was related to the more basic meaning by some form of similarity” (Steen 2007). The meanings were established with the Wielki Słownik Języka Polskiego (Great Dictionary of Polish, ed. Żmigrodzki 2019). In addition to MRW, lexemes which create a metaphorical expression together with an MRW were tagged as metaphor expression words (MEW). At least two words are needed to identify an actual metaphorical expression, since an MRW cannot appear without an MEW. The grammatical construction of a metaphor (Sullivan 2009) is asymmetrical: one word is conceptually autonomous and the other is conceptually dependent on the first. In construction grammar terms (Langacker 2008), the metaphor related word is elaborated by the metaphorical expression word, because the basic meaning of the MRW is elaborated and extended to a more figurative meaning only when it is used jointly with the MEW. Moreover, the meaning of the MEW is rather basic and concrete, as it remains unchanged in connection with the MRW. This can be clearly seen in an expression often used in our data: “Służba zdrowia jest w zapaści” (“Health service suffers from a collapse.”), where the word “zapaść” (“collapse”) is an example of an MRW and the words “służba zdrowia” (“health service”) are labelled as MEW. The English translation of this expression needs a different verb: instead of “jest w zapaści” (“is in collapse”), the English unmarked collocation is “suffers from a collapse”, so the words “suffers from a collapse” are labelled as MRW.
The “collapse” could be caused by heart failure, such as cardiac arrest or any other life-threatening medical condition and “health service” is portrayed as if it could literally suffer from such a condition – a collapse.

    The data are in csv tables exported from xml files downloaded from the eMargin site. Prior to annotation, the transcripts were divided into 40 parts, one for each annotator. MRW words are marked as MLN, MEW words are marked as MLP, and functional words within a metaphorical expression are marked as MLI; all other words are marked noana, which means no annotation needed.
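As a sketch, the exported tables can be grouped back into metaphorical expressions by label. The column names below ("token", "label") are assumptions for illustration, since the exact eMargin export schema is not specified here:

```python
import csv
import io

# Illustrative rows for "Służba zdrowia jest w zapaści"; column names assumed.
sample = io.StringIO(
    "token,label\n"
    "służba,MLP\n"
    "zdrowia,MLP\n"
    "jest,MLI\n"
    "w,MLI\n"
    "zapaści,MLN\n"
)

mrw, mew = [], []
for row in csv.DictReader(sample):
    if row["label"] == "MLN":    # metaphor related word
        mrw.append(row["token"])
    elif row["label"] == "MLP":  # metaphor expression word
        mew.append(row["token"])

print(mrw, mew)  # → ['zapaści'] ['służba', 'zdrowia']
```

Together, the MLN and MLP tokens (plus any MLI function words between them) reconstruct the full metaphorical expression described above.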

  18. Definition of concept coverage scores for ASSESS CT manual annotation.

    • plos.figshare.com
    xls
    Updated Jun 2, 2023
    Cite
    Jose Antonio Miñarro-Giménez; Catalina Martínez-Costa; Daniel Karlsson; Stefan Schulz; Kirstine Rosenbeck Gøeg (2023). Definition of concept coverage scores for ASSESS CT manual annotation. [Dataset]. http://doi.org/10.1371/journal.pone.0209547.t002
    Explore at:
    xlsAvailable download formats
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Jose Antonio Miñarro-Giménez; Catalina Martínez-Costa; Daniel Karlsson; Stefan Schulz; Kirstine Rosenbeck Gøeg
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Definition of concept coverage scores for ASSESS CT manual annotation.

  19. Annotating speaker stance in discourse: the Brexit Blog Corpus (BBC)

    • data.europa.eu
    • snd.se
    • +1more
    unknown
    Updated May 21, 2024
    + more versions
    Cite
    Linnéuniversitetet (2024). Annotating speaker stance in discourse: the Brexit Blog Corpus (BBC) [Dataset]. https://data.europa.eu/data/datasets/https-doi-org-10-5878-002925/embed
    Explore at:
    unknownAvailable download formats
    Dataset updated
    May 21, 2024
    Dataset authored and provided by
    Linnéuniversitetet
    Description

    In this study, we explore to what extent language users agree about what kind of stances are expressed in natural language use or whether their interpretations diverge. In order to perform this task, a comprehensive cognitive-functional framework of ten stance categories was developed based on previous work on speaker stance in the literature. A corpus of opinionated texts, where speakers take stance and position themselves, was compiled, the Brexit Blog Corpus (BBC). An analytical interface for the annotations was set up and the data were annotated independently by two annotators. The annotation procedure, the annotation agreement and the co-occurrence of more than one stance category in the utterances are described and discussed. The careful, analytical annotation process has by and large returned satisfactory inter- and intra-annotation agreement scores, resulting in a gold standard corpus, the final version of the BBC.

    Purpose:

    The aim of this study is to explore the possibility of identifying speaker stance in discourse, provide an analytical resource for it and an evaluation of the level of agreement across speakers in the area of stance-taking in discourse.

    The BBC is a collection of texts from blog sources. The corpus texts are thematically related to the 2016 UK referendum on whether the UK should remain a member of the European Union or not. The texts were extracted from the Internet from June to August 2015. With the Gavagai API (https://developer.gavagai.se), the texts were detected using seed words, such as Brexit, EU referendum, pro-Europe, europhiles, eurosceptics, United States of Europe, David Cameron, or Downing Street. The retrieved URLs were filtered so that only entries described as blogs in English were selected. Each downloaded document was split into sentential utterances, from which 2,200 utterances were randomly selected as the analysis data set. The final size of the corpus is 1,682 utterances, 35,492 words (169,762 characters without spaces). Each utterance contains from 3 to 40 words with a mean length of 21 words.

    For the data annotation process the Active Learning and Visual Analytics (ALVA) system (https://doi.org/10.1145/3132169 and https://doi.org/10.2312/eurp.20161139) was used. Two annotators, one who is a professional translator with a Licentiate degree in English Linguistics and the other one with a PhD in Computational Linguistics, carried out the annotations independently of one another.

    The data set can be downloaded in two different formats: a standard Microsoft Excel format and a raw data format (ZIP archive) which can be useful for analytical and machine learning purposes, for example, with the Python library scikit-learn. The Excel file includes one additional variable (utterance word length). The ZIP archive contains a set of directories (e.g., "contrariety" and "prediction") corresponding to the stance categories. Inside of each such directory, there are two nested directories corresponding to annotations which assign or not assign the respective category to utterances (e.g., inside the top-level category "prediction" there are two directories, "prediction" with utterances which were labeled with this category, and "no" with the rest of the utterances). Inside of the nested directories, there are textual files containing individual utterances.
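The raw-format layout described above (a directory per stance category, with nested "category" and "no" folders of one-utterance text files) maps directly onto (utterance, label) pairs. A minimal sketch, with illustrative file and folder names:

```python
import os
import tempfile

def load_stance(root, category):
    """Read (utterance, label) pairs from root/category/{category,no}/*.txt."""
    pairs = []
    for label in (category, "no"):
        folder = os.path.join(root, category, label)
        for name in sorted(os.listdir(folder)):
            with open(os.path.join(folder, name), encoding="utf-8") as f:
                pairs.append((f.read().strip(), label))
    return pairs

# Build a tiny mock of the archive layout described in the text.
root = tempfile.mkdtemp()
for label, text in [("prediction", "Brexit will happen."),
                    ("no", "The vote is over.")]:
    d = os.path.join(root, "prediction", label)
    os.makedirs(d)
    with open(os.path.join(d, "u1.txt"), "w", encoding="utf-8") as f:
        f.write(text)

print(load_stance(root, "prediction"))
```

This one-folder-per-class layout is also what scikit-learn's `load_files` loader expects for each category, which is presumably why the raw format is suggested for machine-learning use.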

    When using data from this study, the primary researcher wishes citation also to be made to the publication: Vasiliki Simaki, Carita Paradis, Maria Skeppstedt, Magnus Sahlgren, Kostiantyn Kucher, and Andreas Kerren. Annotating speaker stance in discourse: the Brexit Blog Corpus. In Corpus Linguistics and Linguistic Theory, 2017. De Gruyter, published electronically before print. https://doi.org/10.1515/cllt-2016-0060

  20. Practical Annotation and Exchange of Virtual Anatomy

    • simtk.org
    data/images/video
    Updated Jan 15, 2021
    Cite
    Ahmet Erdemir; Andinet Enquobahrie (2021). Practical Annotation and Exchange of Virtual Anatomy [Dataset]. https://simtk.org/frs/?group_id=1767
    Explore at:
    (0), data/images/video(74 MB)Available download formats
    Dataset updated
    Jan 15, 2021
    Dataset provided by
    Kitwarehttps://www.kitware.com/
    Cleveland Clinic, Lerner Research Institute
    Authors
    Ahmet Erdemir; Andinet Enquobahrie
    Description

    Representation of anatomy in a virtual form is at the heart of clinical decision making, biomedical research, and medical training. Virtual anatomy is not limited to description of geometry but also requires appropriate and efficient labeling of regions - to define spatial relationships and interactions between anatomical objects; effective strategies for pointwise operations - to define local properties, biological or otherwise; and support for diverse data formats and standards - to facilitate exchange between clinicians, scientists, engineers, and the general public. Development of aeva, a free and open source software package (library, user interfaces, extensions) capable of automated and interactive operations for virtual anatomy annotation and exchange, is in response to these currently unmet requirements. This site serves for aeva outreach, including dissemination of the software and use cases. The use cases drive design and testing of aeva features and demonstrate various workflows that rely on virtual anatomy.

    aeva downloads: Downloads (https://simtk.org/frs/?group_id=1767) Kitware data repository (https://data.kitware.com/#folder/5e7a4690af2e2eed356a17f2)

    aeva documentation: Guides and tutorials (https://aeva.readthedocs.io)

    aeva videos: Short instructions (https://www.youtube.com/channel/UCubfUe40LXvBs86UyKci0Fw)

    aeva source code: Kitware source code repository (https://gitlab.kitware.com/aeva)

    aeva forum: Forums (https://simtk.org/plugins/phpBB/indexPhpbb.php?group_id=1767)



    This project includes the following software/data packages:

    • aevaCMB : aeva (annotation and exchange of virtual anatomy) is a software suite designed to work with virtual anatomy in various forms. aevaCMB will be familiar to users of ParaView and Computational Model Builder. The interface is customized and new features have been added to support operations for import and export of anatomical representations and for annotation (template based and freeform, including a powerful set of region selection tools).
    • aevaSlicer : aeva (annotation and exchange of virtual anatomy) is a software suite designed to work with virtual anatomy in various forms. aevaSlicer will be familiar to users of Slicer. The interface is customized and new features have been added to accommodate a workflow amenable to generation of surface and volume meshes of anatomy from medical images.
    • aeva Tutorials : aeva (annotation and exchange of virtual anatomy) is a software suite designed to work with virtual anatomy in various forms. aeva Tutorials provide data used and content generated by aevaSlicer and aevaCMB.
