59 datasets found

DWUG ES: Diachronic Word Usage Graphs for Spanish
zenodo.org
zip
Updated Feb 27, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Frank D. Zamora-Reina; Frank D. Zamora-Reina; Felipe Bravo-Marquez; Felipe Bravo-Marquez; Dominik Schlechtweg; Dominik Schlechtweg (2024). DWUG ES: Diachronic Word Usage Graphs for Spanish [Dataset]. http://doi.org/10.5281/zenodo.6433203
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.6433203
Dataset updated
Feb 27, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Frank D. Zamora-Reina; Frank D. Zamora-Reina; Felipe Bravo-Marquez; Felipe Bravo-Marquez; Dominik Schlechtweg; Dominik Schlechtweg
License
Attribution-NoDerivs 4.0 (CC BY-ND 4.0)https://creativecommons.org/licenses/by-nd/4.0/
License information was derived automatically
Description
This data collection contains diachronic Word Usage Graphs (WUGs) for Spanish. Find a description of the data format, code to process the data and further datasets on the WUGsite.

Please find more information on the provided data in the paper referenced below.

The annotation was funded by

ANID FONDECYT grant 11200290, U-Inicia VID Project UI-004/20,

ANID - Millennium Science Initiative Program - Code ICN17 002 and

SemRel Group (DFG Grants SCHU 2580/1 and SCHU 2580/2).

Version: 1.0.1, 9.4.2022. Development data.

Reference

Frank D. Zamora-Reina, Felipe Bravo-Marquez, Dominik Schlechtweg. 2022. LSCDiscovery: A shared task on semantic change discovery and detection in Spanish. In Proceedings of the 3rd International Workshop on Computational Approaches to Historical Language Change. Association for Computational Linguistics.
A
Abstract Meaning Representation (AMR) Annotation Release 3.0
abacus.library.ubc.ca
iso, txt
Updated Sep 3, 2021
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Abacus Data Network (2021). Abstract Meaning Representation (AMR) Annotation Release 3.0 [Dataset]. https://abacus.library.ubc.ca/dataset.xhtml?persistentId=hdl:11272.1/AB2/82CVJF
Explore at:
txt(1308), iso(276281344)Available download formats
Dataset updated
Sep 3, 2021
Dataset provided by
Abacus Data Network
Description
AbstractIntroductionAbstract Meaning Representation (AMR) Annotation Release 3.0 was developed by the Linguistic Data Consortium (LDC), SDL/Language Weaver, Inc., the University of Colorado's Computational Language and Educational Research group and the Information Sciences Institute at the University of Southern California. It contains a sembank (semantic treebank) of over 59,255 English natural language sentences from broadcast conversations, newswire, weblogs, web discussion forums, fiction and web text. This release adds new data to, and updates material contained in, Abstract Meaning Representation 2.0 (LDC2017T10), specifically: more annotations on new and prior data, new or improved PropBank-style frames, enhanced quality control, and multi-sentence annotations. AMR captures "who is doing what to whom" in a sentence. Each sentence is paired with a graph that represents its whole-sentence meaning in a tree-structure. AMR utilizes PropBank frames, non-core semantic roles, within-sentence coreference, named entity annotation, modality, negation, questions, quantities, and so on to represent the semantic structure of a sentence largely independent of its syntax. LDC also released Abstract Meaning Representation (AMR) Annotation Release 1.0 (LDC2014T12), and Abstract Meaning Representation (AMR) Annotation Release 2.0 (LDC2017T10).DataThe source data includes discussion forums collected for the DARPA BOLT AND DEFT programs, transcripts and English translations of Mandarin Chinese broadcast news programming from China Central TV, Wall Street Journal text, translated Xinhua news texts, various newswire data from NIST OpenMT evaluations and weblog data used in the DARPA GALE program. New source data to AMR 3.0 includes sentences from Aesop's Fables, parallel text and the situation frame data set developed by LDC for the DARPA LORELEI program, and lead sentences from Wikipedia articles about named entities. The following table summarizes the number of training, dev, and test AMRs for each dataset in the release. Totals are also provided by partition and dataset: Dataset Training Dev Test Totals BOLT DF MT 1061 133 133 1327 Broadcast conversation 214 0 0 214 Weblog and WSJ 0 100 100 200 BOLT DF English 7379 210 229 7818 DEFT DF English 32915 0 0 32915 Aesop fables 49 0 0 49 Guidelines AMRs 970 0 0 970 LORELEI 4441 354 527 5322 2009 Open MT 204 0 0 204 Proxy reports 6603 826 823 8252 Weblog 866 0 0 866 Wikipedia 192 0 0 192 Xinhua MT 741 99 86 926 Totals 55635 1722 1898 59255 Data in the "split" directory contains 59,255 AMRs split roughly 93.9%/2.9%/3.2% into training/dev/test partitions, with most smaller datasets assigned to one of the splits as a whole. Note that splits observe document boundaries. The "unsplit" directory contains the same 59,255 AMRs with no train/dev/test partition.
d
Annotation Curricula to Implicitly Train Non-Expert Annotators - Dataset -...
b2find.dkrz.de
Updated Aug 29, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2023). Annotation Curricula to Implicitly Train Non-Expert Annotators - Dataset - B2FIND [Dataset]. https://b2find.dkrz.de/dataset/a5f6640f-4c4c-59be-b3e9-a53b79b57c97
Explore at:
Dataset updated
Aug 29, 2023
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Annotation studies often require annotators to familiarize themselves with the task, its annotation scheme, and the data domain. This can be overwhelming in the beginning, mentally taxing, and induce errors into the resulting annotations; especially in citizen science or crowd sourcing scenarios where domain expertise is not required and only annotation guidelines are provided. To alleviate these issues, we propose annotation curricula, a novel approach to implicitly train annotators. We gradually introduce annotators into the task by ordering instances that are annotated according to a learning curriculum. To do so, we first formalize annotation curricula for sentence- and paragraph-level annotation tasks, define an ordering strategy, and identify well-performing heuristics and interactively trained models on three existing English datasets. We then conduct a user study with 40 voluntary participants who are asked to identify the most fitting misconception for English tweets about the Covid-19 pandemic. Our results show that using a simple heuristic to order instances can already significantly reduce the total annotation time while preserving a high annotation quality. Annotation curricula thus can provide a novel way to improve data collection. To facilitate future research, we further share our code and data consisting of 2,400 annotations.
TestWUG EN: Test Word Usage Graphs for English
zenodo.org
zip
Updated Jun 30, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Dominik Schlechtweg; Dominik Schlechtweg (2023). TestWUG EN: Test Word Usage Graphs for English [Dataset]. http://doi.org/10.5281/zenodo.7900960
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.7900960
Dataset updated
Jun 30, 2023
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Dominik Schlechtweg; Dominik Schlechtweg
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This data collection contains test Word Usage Graphs (WUGs) for English. Find a description of the data format, code to process the data and further datasets on the WUGsite.

The data is provided for testing purposes and thus contains specific data cases, which are sometimes artificially created, sometimes picked from existing data sets. The data contains the following cases:

afternoon_nn: sampled from DWUG EN 2.0.1. 200 uses partly annotated by multiple annotators with 427 judgments. Has clear cluster structure with only one cluster, no graded change, no binary change, and medium agreement of 0.62 Krippendorff's alpha.

arm: standard textbook example for semantic proximity (see reference below). Fully connected graph with six words uses, annotated by author.

plane_nn: sampled from DWUG EN 2.0.1. 200 uses partly annotated by multiple annotators with 1152 judgments. Has clear cluster structure, high graded change, binary change, and high agreement of 0.82 Krippendorff's alpha.

target: similar to arm, but with only two repeated sentences. Fully connected graph with six words uses, annotated by author. Same sentence (exactly same string) is annotated with 4, different string is annotated with 1.

Please find more information in the paper referenced below.

Version: 1.0.0, 05.05.2023.

Reference

Dominik Schlechtweg. 2023. Human and Computational Measurement of Lexical Semantic Change. PhD thesis. University of Stuttgart.
o
DWUG EN: Diachronic Word Usage Graphs for English
explore.openaire.eu
Updated Sep 30, 2021
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Dominik Schlechtweg; Haim Dubossarsky; Simon Hengchen; Barbara McGillivray; Nina Tahmasebi (2021). DWUG EN: Diachronic Word Usage Graphs for English [Dataset]. http://doi.org/10.5281/zenodo.14028531
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.14028531
Dataset updated
Sep 30, 2021
Authors
Dominik Schlechtweg; Haim Dubossarsky; Simon Hengchen; Barbara McGillivray; Nina Tahmasebi
Description
This data collection contains diachronic Word Usage Graphs (WUGs) for English. Find a description of the data format, code to process the data and further datasets on the WUGsite. See previous versions for additional testsets. Please find more information on the provided data in the papers referenced below. Reference Dominik Schlechtweg, Nina Tahmasebi, Simon Hengchen, Haim Dubossarsky, Barbara McGillivray. 2021. DWUG: A large Resource of Diachronic Word Usage Graphs in Four Languages. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Dominik Schlechtweg, Pierluigi Cassotti, Bill Noble, David Alfter, Sabine Schulte im Walde, Nina Tahmasebi. More DWUGs: Extending and Evaluating Word Usage Graph Datasets in Multiple Languages. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. Extends previous versions with one more annotation round and new clusterings.
SURel: Synchronic Usage Relatedness
zenodo.org
zip
Updated Feb 27, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Anna Hätty; Dominik Schlechtweg; Dominik Schlechtweg; Sabine Schulte im Walde; Sabine Schulte im Walde; Anna Hätty (2024). SURel: Synchronic Usage Relatedness [Dataset]. http://doi.org/10.5281/zenodo.5543348
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.5543348
Dataset updated
Feb 27, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Anna Hätty; Dominik Schlechtweg; Dominik Schlechtweg; Sabine Schulte im Walde; Sabine Schulte im Walde; Anna Hätty
License
Attribution-NoDerivs 4.0 (CC BY-ND 4.0)https://creativecommons.org/licenses/by-nd/4.0/
License information was derived automatically
Description
This data collection contains synchronic semantic relatedness judgments for German word usage pairs drawn from general language and the domain of cooking. Find a description of the data format, code to process the data and further datasets on the WUGsite.

We provide additional data under misc/:

testset: a semantic meaning shift test set with 22 German lexemes exhibiting different degrees of meaning shifts from general language to the domain of cooking. The 'mean relatedness score' denotes the annotation-based measure of semantic shift described in the paper. 'frequency GEN' and 'frequency SPEC' list the frequencies of the target words in the general-language corpus (GEN) and the domain-specific cooking corpus (SPEC). 'translations' provides English translations across senses, illustrating possible meaning shifts. Note that further senses might exist.

tables: the annotated table of each annotator.

plots: data visualization plots.

Please find more information on the provided data in the paper referenced below.

Version: 2.0.0, 30.9.2021.

Reference

Anna Hätty, Dominik Schlechtweg, Sabine Schulte im Walde. 2019. SURel: A Gold Standard for Incorporating Meaning Shifts into Term Extraction. In Proceedings of the Eighth Joint Conference on Lexical and Computational Semantics (*SEM). Minneapolis, Minnesota, USA, 2019.
Expert annotations for the Catalan Common Voice (v13)
zenodo.org
data.niaid.nih.gov
zip
Updated May 2, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Zenodo (2024). Expert annotations for the Catalan Common Voice (v13) [Dataset]. http://doi.org/10.5281/zenodo.11104388
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.11104388
Dataset updated
May 2, 2024
Dataset provided by
Zenodohttp://zenodo.org/
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Dataset Description

- Homepage: https://projecteaina.cat/tech/]
- Point of Contact: langech@bsc.es

Dataset Summary

These are the annotations made by a team of experts on the speakers with more than 1200 seconds recorded in the Catalan set of the Common Voice dataset (v13).

The annotators were initially tasked with evaluating all recordings associated with the same individual. Following that, they were instructed to annotate the speaker's accent, gender, and the overall quality of the recordings.

The accents and genders taken into account are the ones used until version 8 of the Common Voice corpus.

See annotations for more details.

Supported Tasks and Leaderboards

Gender classification, Accent classification.

Languages

The dataset is in Catalan (ca).

Dataset Structure

Instances

Two xlsx documents are published, one for each round of annotations.

The following information is available in each of the documents:

{
'speaker ID': '1b7fc0c4e437188bdf1b03ed21d45b780b525fd0dc3900b9759d0755e34bc25e31d64e69c5bd547ed0eda67d104fc0d658b8ec78277810830167c53ef8ced24b',
'idx': '31',
'same speaker': {'AN1': 'SI',
'AN2': 'SI',
'AN3': 'SI',
'agreed': 'SI',
'percentage': '100'},
'gender': {'AN1': 'H',
'AN2': 'H',
'AN3': 'H',
'agreed': 'H',
'percentage': '100'},
'accent': {'AN1': 'Central',
'AN2': 'Central',
'AN3': 'Central',
'agreed': 'Central',
'percentage': '100'},
'audio quality': {'AN1': '4.0',
'AN2': '3.0',
'AN3': '3.0',
'agreed': '3.0',
'percentage': '66',
'mean quality': '3.33',
'stdev quality': '0.58'},
'comments': {'AN1': '',
'AN2': 'pujades i baixades de volum',
'AN3': 'Deu ser d'alguna zona de transició amb el central, perquè no fa una reducció total vocàlica, però hi té molta tendència'},
}

We also publish the document Guia anotació parlants.pdf, with the guidelines the annotators recieved.

Data Fields

speaker ID (string): An id for which client (voice) made the recording in the Common Voice corpus

idx (int): Id in this corpus

AN1 (string): Annotations from Annotator 1

AN2 (string): Annotations from Annotator 2

AN3 (string): Annotations from Annotator 3

agreed (string): Annotation from the majority of the annotators

percentage (int): Percentage of annotators that agree with the agreed annotation

mean quality (float): Mean of the quality annotation

stdev quality (float): Standard deviation of the mean quality

Data Splits

The corpus remains undivided into splits, as its purpose does not involve training models.

Dataset Creation

Curation Rationale

During 2022, a campaign was launched to promote the Common Voice corpus within the Catalan-speaking community, achieving remarkable success. However, not all participants provided their demographic details such as age, gender, and accent. Additionally, some individuals faced difficulty in self-defining their accent using the standard classifications established by specialists.

In order to obtain a balanced corpus with reliable information, we have seen the the necessity of enlisting a group of experts from the University of Barcelona to provide accurate annotations.

We release the complete annotations because transparency is fundamental to our project. Furthermore, we believe they hold philological value for studying dialectal and gender variants.

Source Data

The original data comes from the [Catalan sentences of the Common Voice corpus](https://commonvoice.mozilla.org/en/datasets).

Initial Data Collection and Normalization

We have selected speakers who have recorded more than 1200 seconds of speech in the Catalan set of the version 13 of the Common Voice corpus.

Who are the source language producers?

The original data comes from the Catalan sentences of the Common Voice corpus.

Annotations

Annotation process

Starting with version 13 of the Common Voice corpus we identified the speakers (273) who have recorded more than 1200 seconds of speech.

A team of three annotators was tasked with annotating:

if all the recordings correspond to the same person

the gender of the speaker

the accent of the speaker

the quality of the recording

They conducted an initial round of annotation, discussed their varying opinions, and subsequently conducted a second round.

We release the complete annotations because transparency is fundamental to our project. Furthermore, we believe they hold philological value for studying dialectal and gender variants.

Who are the annotators?

The annotation was entrusted to the [CLiC (Centre de Llenguatge i Computació)](https://clic.ub.edu/en/que-es-clic) team from the University of Barcelona.
They selected a group of three annotators (two men and one woman), who received a scholarship to do this work.

The annotation team was composed of:

Annotator 1: 1 female annotator, aged 18-25, L1 Catalan, student in the Modern Languages and Literatures degree, with a focus on Catalan.

Annotators 2 & 3: 2 male annotators, aged 18-25, L1 Catalan, students in the Catalan Philology degree.

1 female supervisor, aged 40-50, L1 Catalan, graduate in Physics and in Linguistics, Ph.D. in Signal Theory and Communications.

To do the annotation they used a Google Drive spreadsheet

Personal and Sensitive Information

The Common Voice dataset consists of people who have donated their voice online. We don't share here their voices, but their gender and accent.
You agree to not attempt to determine the identity of speakers in the Common Voice dataset.

Considerations for Using the Data

Social Impact of Dataset

The ID come from the Common Voice dataset, that consists of people who have donated their voice online.

You agree to not attempt to determine the identity of speakers in the Common Voice dataset.

The information from this corpus will allow us to train and evaluate well balanced Catalan ASR models. Furthermore, we believe they hold philological value for studying dialectal and gender variants.

Discussion of Biases

Most of the voices of the common voice in Catalan correspond to men with a central accent between 40 and 60 years old. The aim of this dataset is to provide information that allows to minimize the biases that this could cause.

For the gender annotation, we have only considered "H" (male) and "D" (female).

Other Known Limitations

[N/A]

Additional Information

Dataset Curators

Language Technologies Unit at the Barcelona Supercomputing Center (langtech@bsc.es)

This work has been promoted and financed by the Generalitat de Catalunya through the Aina project.

Licensing Information

This dataset is licensed under a CC BY 4.0 license.

It can be used for any purpose, whether academic or commercial, under the terms of the license.
Give appropriate credit, provide a link to the license, and indicate if changes were made.

Citation Information

DOI

Contributions

The annotation was entrusted to the STeL team from the University of Barcelona.
P
LDC2017T10 Dataset
paperswithcode.com
Updated Oct 28, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2021). LDC2017T10 Dataset [Dataset]. https://paperswithcode.com/dataset/ldc2017t10
Explore at:
Dataset updated
Oct 28, 2021
Description
Abstract Meaning Representation (AMR) Annotation Release 2.0 was developed by the Linguistic Data Consortium (LDC), SDL/Language Weaver, Inc., the University of Colorado's Computational Language and Educational Research group and the Information Sciences Institute at the University of Southern California. It contains a sembank (semantic treebank) of over 39,260 English natural language sentences from broadcast conversations, newswire, weblogs and web discussion forums.

AMR captures “who is doing what to whom” in a sentence. Each sentence is paired with a graph that represents its whole-sentence meaning in a tree-structure. AMR utilizes PropBank frames, non-core semantic roles, within-sentence coreference, named entity annotation, modality, negation, questions, quantities, and so on to represent the semantic structure of a sentence largely independent of its syntax.
DWUG SV: Diachronic Word Usage Graphs for Swedish
zenodo.org
explore.openaire.eu
zip
Updated Nov 3, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Nina Tahmasebi; Nina Tahmasebi; Simon Hengchen; Simon Hengchen; Dominik Schlechtweg; Dominik Schlechtweg; Barbara McGillivray; Barbara McGillivray; Haim Dubossarsky; Haim Dubossarsky (2024). DWUG SV: Diachronic Word Usage Graphs for Swedish [Dataset]. http://doi.org/10.5281/zenodo.14028906
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.14028906
Dataset updated
Nov 3, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Nina Tahmasebi; Nina Tahmasebi; Simon Hengchen; Simon Hengchen; Dominik Schlechtweg; Dominik Schlechtweg; Barbara McGillivray; Barbara McGillivray; Haim Dubossarsky; Haim Dubossarsky
License
Attribution-NoDerivs 4.0 (CC BY-ND 4.0)https://creativecommons.org/licenses/by-nd/4.0/
License information was derived automatically
Description
This data collection contains diachronic Word Usage Graphs (WUGs) for Swedish. Find a description of the data format, code to process the data and further datasets on the WUGsite.

See previous versions for additional testsets.

Please find more information on the provided data in the papers referenced below.

Reference

Dominik Schlechtweg, Nina Tahmasebi, Simon Hengchen, Haim Dubossarsky, Barbara McGillivray. 2021. DWUG: A large Resource of Diachronic Word Usage Graphs in Four Languages. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing.

Dominik Schlechtweg, Pierluigi Cassotti, Bill Noble, David Alfter, Sabine Schulte im Walde, Nina Tahmasebi. More DWUGs: Extending and Evaluating Word Usage Graph Datasets in Multiple Languages. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing.
H
PEARC20 submitted paper: "Scientific Data Annotation and Dissemination:...
hydroshare.org
beta.hydroshare.org
+1more
zip
Updated Jul 29, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sean Cleveland; Gwen Jacobs; Jennifer Geis (2020). PEARC20 submitted paper: "Scientific Data Annotation and Dissemination: Using the ‘Ike Wai Gateway to Manage Research Data" [Dataset]. http://doi.org/10.4211/hs.d66ef2686787403698bac5368a29b056
Explore at:
zip(873 bytes)Available download formats
Unique identifier
https://doi.org/10.4211/hs.d66ef2686787403698bac5368a29b056
Dataset updated
Jul 29, 2020
Dataset provided by
HydroShare
Authors
Sean Cleveland; Gwen Jacobs; Jennifer Geis
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Time period covered
Jul 29, 2020
Description
Abstract: Granting agencies invest millions of dollars on the generation and analysis of data, making these products extremely valuable. However, without sufficient annotation of the methods used to collect and analyze the data, the ability to reproduce and reuse those products suffers. This lack of assurance of the quality and credibility of the data at the different stages in the research process essentially wastes much of the investment of time and funding and fails to drive research forward to the level of potential possible if everything was effectively annotated and disseminated to the wider research community. In order to address this issue for the Hawai’i Established Program to Stimulate Competitive Research (EPSCoR) project, a water science gateway was developed at the University of Hawai‘i (UH), called the ‘Ike Wai Gateway. In Hawaiian, ‘Ike means knowledge and Wai means water. The gateway supports research in hydrology and water management by providing tools to address questions of water sustainability in Hawai‘i. The gateway provides a framework for data acquisition, analysis, model integration, and display of data products. The gateway is intended to complement and integrate with the capabilities of the Consortium of Universities for the Advancement of Hydrologic Science’s (CUAHSI) Hydroshare by providing sound data and metadata management capabilities for multi-domain field observations, analytical lab actions, and modeling outputs. Functionality provided by the gateway is supported by a subset of the CUAHSI’s Observations Data Model (ODM) delivered as centralized web based user interfaces and APIs supporting multi-domain data management, computation, analysis, and visualization tools to support reproducible science, modeling, data discovery, and decision support for the Hawai’i EPSCoR ‘Ike Wai research team and wider Hawai‘i hydrology community. By leveraging the Tapis platform, UH has constructed a gateway that ties data and advanced computing resources together to support diverse research domains including microbiology, geochemistry, geophysics, economics, and humanities, coupled with computational and modeling workflows delivered in a user friendly web interface with workflows for effectively annotating the project data and products. Disseminating results for the ‘Ike Wai project through the ‘Ike Wai data gateway and Hydroshare makes the research products accessible and reusable.
RefWUG: Diachronic Reference Word Usage Graphs for German
zenodo.org
explore.openaire.eu
+1more
zip
Updated Feb 27, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Dominik Schlechtweg; Dominik Schlechtweg; Sabine Schulte im Walde; Sabine Schulte im Walde (2024). RefWUG: Diachronic Reference Word Usage Graphs for German [Dataset]. http://doi.org/10.5281/zenodo.5791269
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.5791269
Dataset updated
Feb 27, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Dominik Schlechtweg; Dominik Schlechtweg; Sabine Schulte im Walde; Sabine Schulte im Walde
License
Attribution-NoDerivs 4.0 (CC BY-ND 4.0)https://creativecommons.org/licenses/by-nd/4.0/
License information was derived automatically
Description
This data collection contains diachronic Word Usage Graphs (WUGs) for German created with reference use sampling. Find a description of the data format, code to process the data and further datasets on the WUGsite.

Please find more information on the provided data in the paper referenced below.

Version: 1.1.0, 15.12.2021.

Reference

Dominik Schlechtweg and Sabine Schulte im Walde. submitted. Clustering Word Usage Graphs: A Flexible Framework to Measure Changes in Contextual Word Meaning.
DURel: Diachronic Usage Relatedness
zenodo.org
data.niaid.nih.gov
zip
Updated Feb 27, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Dominik Schlechtweg; Dominik Schlechtweg; Sabine Schulte im Walde; Sabine Schulte im Walde; Stefanie Eckmann; Stefanie Eckmann (2024). DURel: Diachronic Usage Relatedness [Dataset]. http://doi.org/10.5281/zenodo.5784453
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.5784453
Dataset updated
Feb 27, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Dominik Schlechtweg; Dominik Schlechtweg; Sabine Schulte im Walde; Sabine Schulte im Walde; Stefanie Eckmann; Stefanie Eckmann
License
Attribution-NoDerivs 4.0 (CC BY-ND 4.0)https://creativecommons.org/licenses/by-nd/4.0/
License information was derived automatically
Description
This data collection contains diachronic semantic relatedness judgments for German word usage pairs. Find a description of the data format, code to process the data and further datasets on the WUGsite.

Please find more information on the provided data in the paper referenced below.

See previous versions for additional plots, tables and testsets.

Version: 3.0.0, 15.12.2021.

Reference

Dominik Schlechtweg, Sabine Schulte im Walde, Stefanie Eckmann. 2018. Diachronic Usage Relatedness (DURel): A Framework for the Annotation of Lexical Semantic Change. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT). New Orleans, Louisiana USA.
Z
Data from: CT-EBM-SP - Corpus of Clinical Trials for Evidence-Based-Medicine...
data.niaid.nih.gov
zenodo.org
Updated Feb 13, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Capllonch-Carrión, Adrián (2022). CT-EBM-SP - Corpus of Clinical Trials for Evidence-Based-Medicine in Spanish [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6059737
Explore at:
Dataset updated
Feb 13, 2022
Dataset provided by
Moreno-Sandoval, Antonio
Campillos-Llanos, Leonardo
Capllonch-Carrión, Adrián
Valverde-Mateos, Ana
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
A collection of 1200 texts (292 173 tokens) about clinical trials studies and clinical trials announcements in Spanish:

500 abstracts from journals published under a Creative Commons license, e.g. available in PubMed or the Scientific Electronic Library Online (SciELO).

700 clinical trials announcements published in the European Clinical Trials Register and Repositorio Español de Estudios Clínicos.

Texts were annotated with entities from the Unified Medical Language System semantic groups: anatomy (ANAT), pharmacological and chemical substances (CHEM), pathologies (DISO), and lab tests, diagnostic or therapeutic procedures (PROC). 46 699 entities were annotated (13.98% are nested entities). 10% of the corpus was doubly annotated, and inter-annotator agreement (IAA) achieved a mean F-measure of 85.65% (±4.79, strict match) and a mean F-measure of 93.94% (±3.31, relaxed match).

The corpus is freely distributed for research and educational purposes under a Creative Commons Non-Commercial Attribution (CC-BY-NC-A) License.
Data from: FluoroMatch 2.0-making automated and comprehensive non-targeted...
catalog.data.gov
gimi9.com
Updated Feb 10, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
U.S. EPA Office of Research and Development (ORD) (2023). FluoroMatch 2.0-making automated and comprehensive non-targeted PFAS annotation a reality [Dataset]. https://catalog.data.gov/dataset/fluoromatch-2-0-making-automated-and-comprehensive-non-targeted-pfas-annotation-a-reality
Explore at:
Dataset updated
Feb 10, 2023
Dataset provided by
United States Environmental Protection Agencyhttp://www.epa.gov/
Description
Data for "Koelmel JP, Stelben P, McDonough CA, Dukes DA, Aristizabal-Henao JJ, Nason SL, Li Y, Sternberg S, Lin E, Beckmann M, Williams AJ, Draper J, Finch JP, Munk JK, Deigl C, Rennie EE, Bowden JA, Godri Pollitt KJ. FluoroMatch 2.0-making automated and comprehensive non-targeted PFAS annotation a reality. Anal Bioanal Chem. 2022 Jan;414(3):1201-1215. doi: 10.1007/s00216-021-03392-7. Epub 2021 May 20. PMID: 34014358.". Portions of this dataset are inaccessible because: The link provided by UCSD doesn't seem to be working. They can be accessed through the following means: Contact Jeremy Koelmel at Yale University, jeremykoelmel@innovativeomics.com. Format: The final annotated excel sheets with feature intensities, annotations, homologous series groupings, etc., are available as a supplemental excel file with the online version of this manuscript. The raw Agilent “.d” files can be downloaded at: ftp://massive.ucsd.edu/MSV000086811/updates/2021-02-05_jeremykoelmel_e5b21166/raw/McDonough_AFFF_3M_ddMS2_Neg.zip (Note use Google Chrome or Firefox, Microsoft Edge and certain other browsers are unable to download from an ftp link). This dataset is associated with the following publication: Koelmel, J.P., P. Stelben, C.A. McDonough, D.A. Dukes, J.J. Aristizabal-Henao, S.L. Nason, Y. Li, S. Sternberg, E. Lin, M. Beckmann, A. Williams, J. Draper, J. Finch, J.K. Munk, C. Deigl, E. Rennie, J.A. Bowden, and K.J. Godri Pollitt. FluoroMatch 2.0—making automated and comprehensive non-targeted PFAS annotation a reality. Analytical and Bioanalytical Chemistry. Springer, New York, NY, USA, 414(3): 1201-1215, (2022).
c
Amazon Mechanical Turk: Sentence annotation experiments
datacatalogue.cessda.eu
Updated Mar 25, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Lau, J; Lappin, S (2025). Amazon Mechanical Turk: Sentence annotation experiments [Dataset]. http://doi.org/10.5255/UKDA-SN-851337
Explore at:
Unique identifier
https://doi.org/10.5255/UKDA-SN-851337
Dataset updated
Mar 25, 2025
Dataset provided by
King
Authors
Lau, J; Lappin, S
Time period covered
Oct 1, 2012 - Sep 30, 2015
Area covered
United Kingdom, United States
Variables measured
Individual
Measurement technique
Amazon Mechanical Turk crowd sourcing
Description
This data collection consists of two .csv files containing lists of sentences with individual and mean sentence ratings (crowd sourced judgements) on three modes of presentation.

This research holds out the prospect of important impact in two areas. First, it can shed light on the relationship between the representation and acquisition of linguistic knowledge on one hand, and learning and the encoding of knowledge in other cognitive domains. This work can, in turn, help to clarify the respective roles of biologically conditioned learning biases and data driven learning in human cognition.

Second, this work can contribute to the development of more effective language technology by providing insight, from a computational perspective, into the way in which humans represent the syntactic properties of sentences in their language. To the extent that natural language processing systems take account of this class of representations they will provide more efficient tools for parsing and interpreting text and speech.
In the past twenty-five years work in natural language technology has made impressive progress across a wide range of tasks, which include, among others, information retrieval and extraction, text interpretation and summarization, speech recognition, morphological analysis, syntactic parsing, word sense identification, and machine translation. Much of this progress has been due to the successful application of powerful techniques for probabilistic modeling and statistical analysis to large corpora of linguistic data. These methods have given rise to a set of engineering tools that are rapidly shaping the digital environment in which we access and process most of the information that we use.

In recent work (Lappin and Shieber (2007), Clark and Lappin (2011a), Clark and Lappin (2011b)) my co-authors and I have argued that the machine learning methods that are driving the expansion of natural language technology are also directly relevant to understanding central features of human language acquisition. When these methods are used to construct carefully specified formal models and implementations of the grammar induction task, they yield striking insights into the limits and possibility of human learning on the basis of the primary linguistic data to which children are exposed. These models indicate that language learning can be achieved without the sorts of strong innate learning biases that have been posited by traditional theories of universal grammar. Weak biases, some derivable from non-linguistic cognitive domains, and domain general learning procedures are sufficient to support efficient data driven learning of plausible systems of grammatical representation.

In the current research I am focussing on the problem of how to specify the class of representations that encode human knowledge of the syntax of natural languages. I am pursuing the hypothesis that a representation in this class is best expressed as an enriched statistical language model that assigns probability values to the sentences of a language. A central part of the enrichment of the model consists of a procedure for determining the acceptability (grammaticality) of a sentence as a graded value, relative to the properties of that sentence and the language of which it is a part. This procedure avoids the simple reduction of the grammaticality of a string to its estimated probability of occurrence, while still characterizing grammaticality in probabilistic terms. An enriched model of this kind will provide a straightforward explanation for the fact that individual native speakers generally judge the well formedness of sentences along a continuum, rather than through the imposition of a sharp boundary between acceptable and unacceptable sentences. The pervasiveness of gradedness in the linguistic knowledge of individual speakers poses a serious problem for classical theories of syntax, which partition strings of words into the grammatical sentences of a language and ill formed strings of words.
d
Data from: FAPM: Functional annotation of proteins using multi-modal models...
search.dataone.org
data.niaid.nih.gov
+2more
Updated Jul 17, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Wenkai Xiang; Zhaoping Xiong; Mingyue Zheng; Huan Chen; Zunyun Fu; Wei Zhang; Jiacheng Xiong; Qian Shi; Bing Liu (2024). FAPM: Functional annotation of proteins using multi-modal models beyond structural modeling [Dataset]. https://search.dataone.org/view/sha256%3A3109cbde82aec70d1f9ebc5c91132588916b42fc29cbb08005dc7a0af3366692
Explore at:
Dataset updated
Jul 17, 2024
Dataset provided by
Dryad Digital Repository
Authors
Wenkai Xiang; Zhaoping Xiong; Mingyue Zheng; Huan Chen; Zunyun Fu; Wei Zhang; Jiacheng Xiong; Qian Shi; Bing Liu
Description
Assigning accurate property labels to proteins, like functional terms and catalytic activity, is challenging, especially for proteins without homologs and â€œtail labelsâ€ with few known examples. Unlike previous methods that mainly focused on protein sequence features, we use a pretrained large natural language model to understand the semantic meaning of protein labels. Specifically, we introduce FAPM, a contrastive multi-modal model that links natural language with protein sequence language. This model combines a pretrained protein sequence model with a pretrained large language model to generate labels, such as Gene Ontology (GO) functional terms and catalytic activity predictions, in natural language. Our results show that FAPM excels in understanding protein properties, outperforming models based solely on protein sequences or structures. It achieves state-of-the-art performance on public benchmarks and in-house experimentally annotated phage proteins, which often have few known homol..., , , # FAPM: Functional annotation of proteins using multi-modal models beyond structural modeling

https://doi.org/10.5061/dryad.m905qfv9p

The online demo is at: https://huggingface.co/spaces/wenkai/FAPM_demo

Description of the data and file structure

The dataset includes:

The information of GO (Gene Ontology). This is a system to describe the functions of proteins.Â

-The basic version of the GO (file name: go1.4-basic.obo).Â Source: https://geneontology.org/docs/download-ontology/

-The mapping between GO numbers andÂ GO descriptions (file name: go_descriptions1.4.txt)

-GO terms (file names: bp_terms.pkl; mf_terms.pkl; cc_terms.pkl)

Manually annotated data derived from Uniprot database. These datasets are used to finetune the model.

-File names:

train_exp_prompt_bp.csv; train_exp_prompt_mf.csv; train_exp_prompt_cc.cs...
E
Data from: Metaphor annotations in Polish political debates from 2020 (TVP...
live.european-language-grid.eu
binary format
Updated Jun 30, 2021
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2021). Metaphor annotations in Polish political debates from 2020 (TVP 2019-10-01 and TVN 2019-10-08) – presidential election [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/8682
Explore at:
binary formatAvailable download formats
Dataset updated
Jun 30, 2021
License
Attribution-ShareAlike 3.0 (CC BY-SA 3.0)https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Description
The data published here are a supplementary for a paper to be published in Metaphor and Social Words (under revision).

Two debates organised and published by TVP and TVN were transcribed and annotated with Metaphor Identification Method. We have used eMargin software (a collaborative textual annotation tool, (Kehoe and Gee 2013) and a slightly modified version of MIP (Pragglejaz 2007). Each lexical unit in the transcript was labelled as a metaphor related word (MRW) if its “contextual meaning was related to the more basic meaning by some form of similarity” (Steen 2007). The meanings were established with the Wielki Słownik Języka Polskiego (Great Dictionary of Polish, ed. (Żmigrodzki 2019). In addition to MRW, lexemes which create a metaphorical expression together with MRW were tagged as metaphor expression word (MEW). At least two words are needed to identify the actual metaphorical expression, since MRW cannot appear without MEW. Grammatical construction of the metaphor (Sullivan 2009) is asymmetrical: one word is conceptually autonomous and the other is conceptually dependent on the first. Within construction grammar terms (Langacker 2008), metaphor related word is elaborated with/by metaphorical expression word, because the basic meaning of MRW is elaborated and extended to more figurative meaning only if it is used jointly with MEW. Moreover, the meaning of the MEW is rather basic, concrete, as it remains unchanged in connection with the MRW. This can be clearly seen in the expression often used in our data: “Służba zdrowia jest w zapaści” (“Health service suffers from a collapse.”) where the word “zapaść” (“collapse”) is an example of MRW and words “służba zdrowia” (“health service”) are labeled as MEW. The English translation of this expression needs a different verb, instead of “jest w zapaści” (“is in collapse”) the English unmarked collocation is “suffers from a collapse”, therefore words “suffers from a collapse” are labeled as MRW. The “collapse” could be caused by heart failure, such as cardiac arrest or any other life-threatening medical condition and “health service” is portrayed as if it could literally suffer from such a condition – a collapse.

The data are in csv tables exported from xml files downloaded from eMargin site. Prior to annotation transcripts were divided to 40 parts, each for one annotator. MRW words are marked as MLN, MEW are marked as MLP and functional words within metaphorical expression are marked MLI, other words are marked just noana, which means no annotation needed.
f
Definition of concept coverage scores for ASSESS CT manual annotation.
plos.figshare.com
xls
Updated Jun 2, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jose Antonio Miñarro-Giménez; Catalina Martínez-Costa; Daniel Karlsson; Stefan Schulz; Kirstine Rosenbeck Gøeg (2023). Definition of concept coverage scores for ASSESS CT manual annotation. [Dataset]. http://doi.org/10.1371/journal.pone.0209547.t002
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0209547.t002
Dataset updated
Jun 2, 2023
Dataset provided by
PLOS ONE
Authors
Jose Antonio Miñarro-Giménez; Catalina Martínez-Costa; Daniel Karlsson; Stefan Schulz; Kirstine Rosenbeck Gøeg
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Definition of concept coverage scores for ASSESS CT manual annotation.
e
Annotating speaker stance in discourse: the Brexit Blog Corpus (BBC)
data.europa.eu
snd.se
+1more
unknown
Updated May 21, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Linnéuniversitetet (2024). Annotating speaker stance in discourse: the Brexit Blog Corpus (BBC) [Dataset]. https://data.europa.eu/data/datasets/https-doi-org-10-5878-002925/embed
Explore at:
unknownAvailable download formats
Dataset updated
May 21, 2024
Dataset authored and provided by
Linnéuniversitetet
Description
In this study, we explore to what extent language users agree about what kind of stances are expressed in natural language use or whether their interpretations diverge. In order to perform this task, a comprehensive cognitive-functional framework of ten stance categories was developed based on previous work on speaker stance in the literature. A corpus of opinionated texts, where speakers take stance and position themselves, was compiled, the Brexit Blog Corpus (BBC). An analytical interface for the annotations was set up and the data were annotated independently by two annotators. The annotation procedure, the annotation agreement and the co-occurrence of more than one stance category in the utterances are described and discussed. The careful, analytical annotation process has by and large returned satisfactory inter- and intra-annotation agreement scores, resulting in a gold standard corpus, the final version of the BBC.

Purpose:

The aim of this study is to explore the possibility of identifying speaker stance in discourse, provide an analytical resource for it and an evaluation of the level of agreement across speakers in the area of stance-taking in discourse.

The BBC is a collection of texts from blog sources. The corpus texts are thematically related to the 2016 UK referendum concerning whether the UK should remain members of the European Union or not. The texts were extracted from the Internet from June to August 2015. With the Gavagai API (https://developer.gavagai.se), the texts were detected using seed words, such as Brexit, EU referendum, pro-Europe, europhiles, eurosceptics, United States of Europe, David Cameron, or Downing Street. The retrieved URLs were filtered so that only entries described as blogs in English were selected. Each downloaded document was split into sentential utterances, from which 2,200 utterances were randomly selected as the analysis data set. The final size of the corpus is 1,682 utterances, 35,492 words (169,762 characters without spaces). Each utterance contains from 3 to 40 words with a mean length of 21 words.

For the data annotation process the Active Learning and Visual Analytics (ALVA) system (https://doi.org/10.1145/3132169 and https://doi.org/10.2312/eurp.20161139) was used. Two annotators, one who is a professional translator with a Licentiate degree in English Linguistics and the other one with a PhD in Computational Linguistics, carried out the annotations independently of one another.

The data set can be downloaded in two different formats: a standard Microsoft Excel format and a raw data format (ZIP archive) which can be useful for analytical and machine learning purposes, for example, with the Python library scikit-learn. The Excel file includes one additional variable (utterance word length). The ZIP archive contains a set of directories (e.g., "contrariety" and "prediction") corresponding to the stance categories. Inside of each such directory, there are two nested directories corresponding to annotations which assign or not assign the respective category to utterances (e.g., inside the top-level category "prediction" there are two directories, "prediction" with utterances which were labeled with this category, and "no" with the rest of the utterances). Inside of the nested directories, there are textual files containing individual utterances.

When using data from this study, the primary researcher wishes citation also to be made to the publication: Vasiliki Simaki, Carita Paradis, Maria Skeppstedt, Magnus Sahlgren, Kostiantyn Kucher, and Andreas Kerren. Annotating speaker stance in discourse: the Brexit Blog Corpus. In Corpus Linguistics and Linguistic Theory, 2017. De Gruyter, published electronically before print. https://doi.org/10.1515/cllt-2016-0060
Practical Annotation and Exchange of Virtual Anatomy
simtk.org
data/images/video
Updated Jan 15, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ahmet Erdemir; Andinet Enquobahrie (2021). Practical Annotation and Exchange of Virtual Anatomy [Dataset]. https://simtk.org/frs/?group_id=1767
Explore at:
(0), data/images/video(74 MB)Available download formats
Dataset updated
Jan 15, 2021
Dataset provided by
Kitwarehttps://www.kitware.com/
Cleveland Clinic, Lerner Research Institute
Authors
Ahmet Erdemir; Andinet Enquobahrie
Description
Representation of anatomy in a virtual form is at the heart of clinical decision making, biomedical research, and medical training. Virtual anatomy is not limited to description of geometry but also requires appropriate and efficient labeling of regions - to define spatial relationships and interactions between anatomical objects; effective strategies for pointwise operations - to define local properties, biological or otherwise; and support for diverse data formats and standards - to facilitate exchange between clinicians, scientists, engineers, and the general public. Development of aeva, a free and open source software package (library, user interfaces, extensions) capable of automated and interactive operations for virtual anatomy annotation and exchange, is in response to these currently unmet requirements. This site serves for aeva outreach, including dissemination the software and use cases. The use cases drive design and testing of aeva features and demonstrate various workflows that rely on virtual anatomy.

aeva downloads: Downloads (https://simtk.org/frs/?group_id=1767) Kitware data repository (https://data.kitware.com/#folder/5e7a4690af2e2eed356a17f2)

aeva documentation: Guides and tutorials (https://aeva.readthedocs.io)

aeva videos: Short instructions (https://www.youtube.com/channel/UCubfUe40LXvBs86UyKci0Fw)

aeva source code: Kitware source code repository (https://gitlab.kitware.com/aeva)

aeva forum: Forums (https://simtk.org/plugins/phpBB/indexPhpbb.php?group_id=1767 )

This project includes the following software/data packages:

aevaCMB : aeva (annotation and exhange of virtual anatomy) is a software suite designed to work with virtual anatomy in various forms. aevaCMB will be familiar to users of ParaView and Computational Model Builder. The interface is customized and new features have been added to support operations for import and export of anatomical representations and for annotation (template based and freeform, including a powerful set of region selection).

aevaSlicer : aeva (annotation and exhange of virtual anatomy) is a software suite designed to work with virtual anatomy in various forms. aevaSlicer will be familiar to users of Slicer. The interface is customized and new features have been added to accommodate a workflow amenable to generation of surface and volume meshes of anatomy from medical images.

aeva Tutorials : aeva (annotation and exhange of virtual anatomy) is a software suite designed to work with virtual anatomy in various forms. aeva Tutorials provide data used and content generated by aevaSlicer and aevaCMB.

Facebook

Twitter

Click to copy link

Link copied

Cite

Frank D. Zamora-Reina; Frank D. Zamora-Reina; Felipe Bravo-Marquez; Felipe Bravo-Marquez; Dominik Schlechtweg; Dominik Schlechtweg (2024). DWUG ES: Diachronic Word Usage Graphs for Spanish [Dataset]. http://doi.org/10.5281/zenodo.6433203

DWUG ES: Diachronic Word Usage Graphs for Spanish

Explore at:

zipAvailable download formats

Unique identifier

https://doi.org/10.5281/zenodo.6433203

Dataset updated

Feb 27, 2024

Dataset provided by

Zenodohttp://zenodo.org/

Authors

Frank D. Zamora-Reina; Frank D. Zamora-Reina; Felipe Bravo-Marquez; Felipe Bravo-Marquez; Dominik Schlechtweg; Dominik Schlechtweg

License

Attribution-NoDerivs 4.0 (CC BY-ND 4.0)https://creativecommons.org/licenses/by-nd/4.0/
License information was derived automatically

Description

This data collection contains diachronic Word Usage Graphs (WUGs) for Spanish. Find a description of the data format, code to process the data and further datasets on the WUGsite.

Please find more information on the provided data in the paper referenced below.

The annotation was funded by

ANID FONDECYT grant 11200290, U-Inicia VID Project UI-004/20,
ANID - Millennium Science Initiative Program - Code ICN17 002 and
SemRel Group (DFG Grants SCHU 2580/1 and SCHU 2580/2).

Version: 1.0.1, 9.4.2022. Development data.

Reference

Frank D. Zamora-Reina, Felipe Bravo-Marquez, Dominik Schlechtweg. 2022. LSCDiscovery: A shared task on semantic change discovery and detection in Spanish. In Proceedings of the 3rd International Workshop on Computational Approaches to Historical Language Change. Association for Computational Linguistics.

Clear search

Close search

Google apps

Main menu

DWUG ES: Diachronic Word Usage Graphs for Spanish

Abstract Meaning Representation (AMR) Annotation Release 3.0

Annotation Curricula to Implicitly Train Non-Expert Annotators - Dataset -...

TestWUG EN: Test Word Usage Graphs for English

DWUG EN: Diachronic Word Usage Graphs for English

SURel: Synchronic Usage Relatedness

Expert annotations for the Catalan Common Voice (v13)

Dataset Description

Dataset Summary

Supported Tasks and Leaderboards

Languages

Dataset Structure

Instances

Data Fields

Data Splits

Dataset Creation

Curation Rationale

Source Data

Annotations

Personal and Sensitive Information

Considerations for Using the Data

Social Impact of Dataset

Discussion of Biases

Other Known Limitations

Additional Information

Dataset Curators

Licensing Information

Citation Information

Contributions

LDC2017T10 Dataset

DWUG SV: Diachronic Word Usage Graphs for Swedish

Reference

PEARC20 submitted paper: "Scientific Data Annotation and Dissemination:...

RefWUG: Diachronic Reference Word Usage Graphs for German

DURel: Diachronic Usage Relatedness

Data from: CT-EBM-SP - Corpus of Clinical Trials for Evidence-Based-Medicine...

Data from: FluoroMatch 2.0-making automated and comprehensive non-targeted...

Amazon Mechanical Turk: Sentence annotation experiments

Data from: FAPM: Functional annotation of proteins using multi-modal models...

Description of the data and file structure

Data from: Metaphor annotations in Polish political debates from 2020 (TVP...

Definition of concept coverage scores for ASSESS CT manual annotation.

Annotating speaker stance in discourse: the Brexit Blog Corpus (BBC)

Practical Annotation and Exchange of Virtual Anatomy

DWUG ES: Diachronic Word Usage Graphs for SpanishSee More Versions

DWUG ES: Diachronic Word Usage Graphs for Spanish