https://www.marketresearchforecast.com/privacy-policy
The Data Annotation Tool market was valued at USD 3.9 billion in 2023 and is projected to reach USD 6.64 billion by 2032, an expected CAGR of 7.9% over the forecast period. A data annotation tool is software used to label data so that machine learning models can learn patterns from it. These tools support multiple data types, including images, text, audio, and video. Annotation subcategories include image annotation (bounding boxes, segmentation), text annotation (entity recognition, sentiment analysis), audio annotation (transcription, sound labeling), and video annotation (object tracking). Common features vary by use case but typically include labeling interfaces, collaboration, label suggestions, and quality assurance. Applications span the automotive industry (object detection for self-driving cars), text processing (text classification), healthcare (medical imaging), and retail (recommendation systems). These tools are used to build high-quality, accurately labeled datasets for the engineering of effective AI systems. Key drivers for this market are: Increasing Adoption of Cloud-based Managed Services to Drive Market Growth. Potential restraints include: Adverse Health Effect May Hamper Market Growth. Notable trends are: Growing Implementation of Touch-based and Voice-based Infotainment Systems to Increase Adoption of Intelligent Cars.
Introduction

Abstract Meaning Representation (AMR) Annotation Release 3.0 was developed by the Linguistic Data Consortium (LDC), SDL/Language Weaver, Inc., the University of Colorado's Computational Language and Educational Research group, and the Information Sciences Institute at the University of Southern California. It contains a sembank (semantic treebank) of over 59,255 English natural language sentences from broadcast conversations, newswire, weblogs, web discussion forums, fiction and web text. This release adds new data to, and updates material contained in, Abstract Meaning Representation 2.0 (LDC2017T10), specifically: more annotations on new and prior data, new or improved PropBank-style frames, enhanced quality control, and multi-sentence annotations.

AMR captures "who is doing what to whom" in a sentence. Each sentence is paired with a graph that represents its whole-sentence meaning in a tree structure. AMR utilizes PropBank frames, non-core semantic roles, within-sentence coreference, named entity annotation, modality, negation, questions, quantities, and so on to represent the semantic structure of a sentence largely independently of its syntax. LDC also released Abstract Meaning Representation (AMR) Annotation Release 1.0 (LDC2014T12) and Abstract Meaning Representation (AMR) Annotation Release 2.0 (LDC2017T10).

Data

The source data includes discussion forums collected for the DARPA BOLT and DEFT programs, transcripts and English translations of Mandarin Chinese broadcast news programming from China Central TV, Wall Street Journal text, translated Xinhua news texts, various newswire data from NIST OpenMT evaluations, and weblog data used in the DARPA GALE program. Source data new to AMR 3.0 includes sentences from Aesop's Fables, parallel text and the situation frame data set developed by LDC for the DARPA LORELEI program, and lead sentences from Wikipedia articles about named entities.

The following table summarizes the number of training, dev, and test AMRs for each dataset in the release, with totals by partition and dataset:

Dataset                  Training    Dev   Test   Totals
BOLT DF MT                   1061    133    133     1327
Broadcast conversation        214      0      0      214
Weblog and WSJ                  0    100    100      200
BOLT DF English              7379    210    229     7818
DEFT DF English             32915      0      0    32915
Aesop fables                   49      0      0       49
Guidelines AMRs               970      0      0      970
LORELEI                      4441    354    527     5322
2009 Open MT                  204      0      0      204
Proxy reports                6603    826    823     8252
Weblog                        866      0      0      866
Wikipedia                     192      0      0      192
Xinhua MT                     741     99     86      926
Totals                      55635   1722   1898    59255

Data in the "split" directory contains the 59,255 AMRs split roughly 93.9%/2.9%/3.2% into training/dev/test partitions, with most smaller datasets assigned to one of the splits as a whole. Note that splits observe document boundaries. The "unsplit" directory contains the same 59,255 AMRs with no train/dev/test partition.
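AMR graphs are conventionally written in PENMAN notation. Purely as a hedged illustration of inspecting such graphs programmatically, the sketch below uses the third-party penman Python library (not part of this release); the toy graph is the standard "the boy wants to go" example, not a sentence from this corpus.

```python
# Minimal sketch: decoding one AMR graph with the open-source "penman"
# library (pip install penman). The graph below is a textbook example,
# not taken from AMR 3.0.
import penman

amr_string = """
(w / want-01
   :ARG0 (b / boy)
   :ARG1 (g / go-02
            :ARG0 b))
"""

graph = penman.decode(amr_string)      # parse PENMAN notation into a Graph object
print(graph.top)                       # root variable: 'w'
for source, role, target in graph.triples:
    print(source, role, target)        # e.g. ('w', ':instance', 'want-01')
```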
This data package contains a comprehensive set of semantic annotations (URIs and labels) from datasets in the ecocomDP format published in EDI. The table of annotations, referred to as the ecocomDP Annotation Dictionary, can be viewed in RStudio using the view_annotation_dictionary function of the ecocomDP R package.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data collection contains test Word Usage Graphs (WUGs) for English. Find a description of the data format, code to process the data and further datasets on the WUGsite.
The data is provided for testing purposes and thus contains specific data cases, which are sometimes artificially created, sometimes picked from existing data sets. The data contains the following cases:
Please find more information in the paper referenced below.
Version: 1.0.0, 05.05.2023.
Reference
Dominik Schlechtweg. 2023. Human and Computational Measurement of Lexical Semantic Change. PhD thesis. University of Stuttgart.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
ELEXIS-WSD is a parallel sense-annotated corpus in which content words (nouns, adjectives, verbs, and adverbs) have been assigned senses. Version 1.1 contains sentences for 10 languages: Bulgarian, Danish, English, Spanish, Estonian, Hungarian, Italian, Dutch, Portuguese, and Slovene.
The corpus was compiled by automatically extracting a set of sentences from WikiMatrix (Schwenk et al., 2019), a large open-access collection of parallel sentences derived from Wikipedia, using an automatic approach based on multilingual sentence embeddings. The sentences were manually validated according to specific formal, lexical and semantic criteria (e.g. by removing incorrect punctuation, morphological errors, notes in square brackets and etymological information typically provided in Wikipedia pages). To obtain satisfactory semantic coverage, sentences with fewer than 5 words or fewer than 2 polysemous words were filtered out. Subsequently, in order to obtain datasets in the other nine target languages, the corresponding WikiMatrix translation of each selected English sentence into each of the other languages was retrieved. If no translation was available, the English sentence was translated manually. The resulting corpus comprises 2,024 sentences for each language.
The sentences were tokenized, lemmatized, and tagged with POS tags using UDPipe v2.6 (https://lindat.mff.cuni.cz/services/udpipe/). Senses were annotated using LexTag (https://elexis.babelscape.com/): each content word (noun, verb, adjective, and adverb) was assigned a sense from among the available senses from the sense inventory selected for the language (see below) or BabelNet. Sense inventories were also updated with new senses during annotation.
List of sense inventories BG: Dictionary of Bulgarian DA: DanNet – The Danish WordNet EN: Open English WordNet ES: Spanish Wiktionary ET: The EKI Combined Dictionary of Estonian HU: The Explanatory Dictionary of the Hungarian Language IT: PSC + Italian WordNet NL: Open Dutch WordNet PT: Portuguese Academy Dictionary (DACL) SL: Digital Dictionary Database of Slovene
The corpus is available in the CoNLL-U tab-separated format. In order, the columns contain the token ID, its form, its lemma, its UPOS-tag, five empty columns (reserved for e.g. dependency parsing, which is absent from this version), and the final MISC column containing the following: the token's whitespace information (whether the token is followed by a whitespace or not), the ID of the sense assigned to the token, and the index of the multiword expression (if the token is part of an annotated multiword expression).
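As a minimal reading sketch, the snippet below uses the third-party conllu package; the file name and the MISC keys ("SpaceAfter", "SenseID", "MWE") are hypothetical placeholders rather than the release's actual key names, which are documented in 00README.txt.

```python
# Minimal sketch, assuming hypothetical MISC key names; consult 00README.txt
# for the keys actually used in ELEXIS-WSD.
from conllu import parse_incr   # pip install conllu

with open("elexis-wsd-en.conllu", encoding="utf-8") as f:   # hypothetical file name
    for sentence in parse_incr(f):
        for token in sentence:
            misc = token["misc"] or {}
            sense_id = misc.get("SenseID")       # sense assigned to this content word, if any
            if sense_id is not None:
                print(token["form"], token["lemma"], sense_id)
```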
Each language has a separate sense inventory containing all the senses (and their definitions) used for annotation in the corpus. Not all the senses from the sense inventory are necessarily included in the corpus annotations: for instance, all occurrences of the English noun "bank" in the corpus might be annotated with the sense of "financial institution", but the sense inventory also contains the sense "edge of a river" as well as all other possible senses to disambiguate between.
For more information, please refer to 00README.txt.
Differences to version 1.0: - Several minor errors were fixed (e.g. a typo in one of the Slovene sense IDs). - The corpus was converted to the true CoNLL-U format (as opposed to the CoNLL-U-like format used in v1.0). - An error was fixed that resulted in missing UPOS tags in version 1.0. - The sentences in all corpora now follow the same order (from 1 to 2024).
Attribution-NoDerivs 4.0 (CC BY-ND 4.0) https://creativecommons.org/licenses/by-nd/4.0/
License information was derived automatically
This data collection contains diachronic Word Usage Graphs (WUGs) for Spanish. Find a description of the data format, code to process the data and further datasets on the WUGsite.
Please find more information on the provided data in the paper referenced below.
The annotation was funded by
Version: 1.0.1, 9.4.2022. Development data.
Reference
Frank D. Zamora-Reina, Felipe Bravo-Marquez, Dominik Schlechtweg. 2022. LSCDiscovery: A shared task on semantic change discovery and detection in Spanish. In Proceedings of the 3rd International Workshop on Computational Approaches to Historical Language Change. Association for Computational Linguistics.
This data collection contains diachronic Word Usage Graphs (WUGs) for English. Find a description of the data format, code to process the data and further datasets on the WUGsite. This version extends previous versions with one more annotation round and new clusterings. See previous versions for additional testsets.

Please find more information on the provided data in the papers referenced below.

Reference

Dominik Schlechtweg, Nina Tahmasebi, Simon Hengchen, Haim Dubossarsky, Barbara McGillivray. 2021. DWUG: A large Resource of Diachronic Word Usage Graphs in Four Languages. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing.

Dominik Schlechtweg, Pierluigi Cassotti, Bill Noble, David Alfter, Sabine Schulte im Walde, Nina Tahmasebi. 2024. More DWUGs: Extending and Evaluating Word Usage Graph Datasets in Multiple Languages. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing.
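The WUG releases come with their own data format and an official correlation-clustering pipeline (available via the WUGsite). Purely as an illustrative sketch of the underlying idea, and not that pipeline, the snippet below builds a toy usage graph from pairwise relatedness judgments and approximates sense clusters as connected components of edges judged related.

```python
# Illustrative only: not the official WUG clustering pipeline. Usage pairs are
# judged on the 1-4 DURel relatedness scale; edges >= 2.5 are kept as "related".
import networkx as nx

judgments = [                     # hypothetical toy judgments (use_a, use_b, score)
    ("use1", "use2", 4.0),
    ("use2", "use3", 3.5),
    ("use3", "use4", 1.0),
    ("use4", "use5", 4.0),
]

g = nx.Graph()
for u, v, score in judgments:
    g.add_nodes_from([u, v])
    if score >= 2.5:
        g.add_edge(u, v, weight=score)

clusters = list(nx.connected_components(g))
print(clusters)                   # e.g. [{'use1', 'use2', 'use3'}, {'use4', 'use5'}]
```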
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The resource contains several datasets with domain-specific data in three languages, English, Slovenian and Croatian, which can be used for various knowledge extraction or knowledge modelling tasks. The resource represents knowledge for the domain of karstology, a subfield of geography studying karst and related phenomena. It contains:
Definitions: Plain text files contain definitions of karst concepts from relevant glossaries and encyclopaedias, as well as definitions extracted from domain-specific corpora.
Annotated definitions: Definitions were manually annotated and curated in the WebAnno tool. Annotations comprise several layers: definition elements, semantic relations following the frame-based theory of terminology (FBT), relation definitors which can be used for learning relation patterns, and semantic categories defined in the domain model.
Terms, definitions and sources: The TermFrame knowledge base contains terms and their corresponding concept identifiers, definitions and definition sources.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The ReBeatICG database contains ICG (impedance cardiography) signals recorded during an experimental session of a virtual search and rescue mission with drones. It includes beat-to-beat annotations of the ICG characteristic points, made by a cardiologist, for the purpose of testing ICG delineation algorithms. Synchronized reference ECG signals are included to allow comparison and to mark cardiac events.
Raw data
The database includes 48 recordings of ICG and ECG signals from 24 healthy subjects during an experimental session of a virtual search and rescue mission with drones, described in [1]. Two 5-minute segments were selected from each subject: one corresponding to a baseline state (task BL) and one recorded during higher levels of cognitive workload (task CW). In total, the database contains 240 minutes of ICG signals.
During the experiment, various signals were recorded, but only the ICG and ECG data are provided here. Raw data was recorded at 2000 Hz using a Biopac system.
Data Preprocessing (filtering)
For the purpose of annotation by cardiologists, the data were first downsampled from 2000 Hz to 250 Hz and then filtered with an adaptive Savitzky-Golay filter of order 3. "Adaptive" refers to the adaptive selection of the filter length, which plays a major role in the efficacy of the filter. The filter length was selected based on the SNR level of the first 3 seconds of each signal recording, following the procedure described below.
Starting from a filter length of 3 (the minimum length allowed), the length is increased in steps of two until the signal SNR reaches 30 or the improvement falls below 1% (i.e., the SNR gain saturates with further increases in filter length). These values are a good compromise between reducing noise and over-smoothing the signal (and hence potentially losing valuable details), while a shorter filter length also reduces complexity. The SNR is calculated as the ratio between the 2-norms of the high- and low-frequency components of the signal, using 20 Hz as the cut-off frequency.
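A minimal sketch of this selection procedure is given below. It assumes the SNR is taken as the 2-norm of the sub-20 Hz component divided by the 2-norm of the supra-20 Hz component, and it starts from the smallest window length SciPy accepts for an order-3 filter (5 rather than 3); the exact constants and ratio orientation should be checked against the original processing scripts, which are not part of this record.

```python
# Hedged sketch of the adaptive Savitzky-Golay length selection described above.
import numpy as np
from scipy.signal import butter, filtfilt, savgol_filter

FS = 250        # sampling frequency after downsampling (Hz)
CUTOFF = 20     # cut-off between signal band and noise band (Hz)

def snr(x):
    """2-norm of the low-frequency part over 2-norm of the high-frequency part."""
    b_lo, a_lo = butter(4, CUTOFF, btype="low", fs=FS)
    b_hi, a_hi = butter(4, CUTOFF, btype="high", fs=FS)
    return np.linalg.norm(filtfilt(b_lo, a_lo, x)) / np.linalg.norm(filtfilt(b_hi, a_hi, x))

def adaptive_savgol_length(x, order=3, target_snr=30, min_gain=0.01):
    """Grow the window in steps of two until SNR reaches target_snr or the
    relative improvement drops below min_gain, using the first 3 s of signal."""
    head = x[: 3 * FS]
    length = 5                                   # smallest odd window > polyorder in SciPy
    best = snr(savgol_filter(head, length, order))
    while True:
        candidate = length + 2
        cand_snr = snr(savgol_filter(head, candidate, order))
        if cand_snr >= target_snr:
            return candidate
        if (cand_snr - best) / best < min_gain:
            return length
        length, best = candidate, cand_snr

# Demo on a synthetic signal (not real ICG data)
t = np.arange(0, 5, 1 / FS)
demo = np.sin(2 * np.pi * 1.2 * t) + 0.1 * np.random.randn(t.size)
print("selected window length:", adaptive_savgol_length(demo))
```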
Data Annotation
In order to assess the performance of the ICG delineation algorithms, a subset of the database was annotated by a cardiologist from Lausanne University Hospital (CHUV) in Switzerland.
The annotated subset consists of 4 randomly chosen signal segments of 10 beats each from each subject and task (i.e., 4 segments from the BL task and 4 from the CW task). Segments with artifacts or excessive noise were excluded when selecting the data for annotation; in such cases, 8 segments were chosen from the task with cleaner signals. In total, 1920 (80x24) beats were selected for annotation.
For each cardiac cycle, four characteristic points were annotated: B, C, X and O. The following definitions were used when annotating the data:
- C peak -- Defined as the peak with the greatest amplitude in one cardiac cycle that represents the maximum systolic flow.
- B point -- Indicates the onset of the final rapid upstroke toward the C point [3], expressed as the point of significant change in the slope of the ICG signal preceding the C point. It is related to the aortic valve opening. However, its identification can be difficult due to variations in ICG signal morphology. A decisional algorithm has been proposed to guide accurate and reproducible B point identification [4].
- X point -- Often defined as the minimum dZ/dt value in one cardiac cycle. However, this does not always hold true due to variations in dZ/dt waveform morphology [5]. Thus, the X point is defined as the onset of the steep rise in the ICG towards the O point. It represents the aortic valve closing, which occurs simultaneously with the end of the T wave on the ECG signal.
- O point -- The highest local maxima in the first half of the C-C interval. It represents the mitral valve opening.
Annotation was performed using open-access software (https://doi.org/10.5281/zenodo.4724843).
Annotated points are saved in separate files for each person and task, representing the location of points in the original signal.
Data structure
Data is organized in three folders, one for raw data (01_RawData), filtered data (02_FilteredData), and annotated points (03_ExpertAnnotations). In each folder, data is separated into files representing each subject and task (except in 03_ExpertAnnotations where 2 CW task files were not annotated due to an excessive amount of noise).
All files are Matlab .mat files.
Raw data and filtered data .mat files contain synchronized "ICG" and "ECG" data, as well as the sampling frequency "samplFreq". Filtered data files additionally provide the final chosen Savitzky-Golay filter length ("SGFiltLen").
Each annotated data .mat file contains only the matrix "annotPoints", in which each row represents one cardiac cycle and the columns give the positions of the B, C, X and O points, respectively. Positions are expressed as the number of samples from the beginning of the full database files (the signals in the 01_RawData and 02_FilteredData folders). In rare cases there are fewer than 40 (or 80) values per file, where the data was noisy and the cardiologist could not confidently annotate each cardiac cycle.
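As a minimal loading sketch (the variable names follow the description above, while the file names are hypothetical), the .mat files can be read in Python with scipy.io:

```python
from scipy.io import loadmat

filtered = loadmat("02_FilteredData/subject01_BL.mat")       # hypothetical file name
icg = filtered["ICG"].squeeze()
ecg = filtered["ECG"].squeeze()
fs = float(filtered["samplFreq"].squeeze())
sg_len = int(filtered["SGFiltLen"].squeeze())

annotations = loadmat("03_ExpertAnnotations/subject01_BL.mat")
points = annotations["annotPoints"]           # one row per cardiac cycle: B, C, X, O
b, c, x, o = points[0]                        # sample indices into the full signal
print(fs, sg_len, points.shape, (b, c, x, o))
```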
-------------------
References
[1] F. Dell’Agnola, “Cognitive Workload Monitoring in Virtual Reality Based Rescue Missions with Drones,” pp. 397–409, 2020, doi: 10.1007/978-3-030-49695-1_26.
[2] H. Yazdanian, A. Mahnam, M. Edrisi, and M. A. Esfahani, “Design and Implementation of a Portable Impedance Cardiography System for Noninvasive Stroke Volume Monitoring,” J. Med. Signals Sens., vol. 6, no. 1, pp. 47–56, Mar. 2016.
[3] A. Sherwood (Chair), M. T. Allen, J. Fahrenberg, R. M. Kelsey, W. R. Lovallo, and L. J. P. van Doornen, “Methodological Guidelines for Impedance Cardiography,” Psychophysiology, vol. 27, no. 1, pp. 1–23, 1990, doi: https://doi.org/10.1111/j.1469-8986.1990.tb02171.x.
[4] J. R. Árbol, P. Perakakis, A. Garrido, J. L. Mata, M. C. Fernández‐Santaella, and J. Vila, “Mathematical detection of aortic valve opening (B point) in impedance cardiography: A comparison of three popular algorithms,” Psychophysiology, vol. 54, no. 3, pp. 350–357, 2017, doi: https://doi.org/10.1111/psyp.12799.
[5] M. Nabian, Y. Yin, J. Wormwood, K. S. Quigley, L. F. Barrett, and S. Ostadabbas, “An Open-Source Feature Extraction Tool for the Analysis of Peripheral Physiological Data,” IEEE J. Transl. Eng. Health Med., vol. 6, p. 2800711, 2018, doi: 10.1109/JTEHM.2018.2878000.
Attribution-NoDerivs 4.0 (CC BY-ND 4.0) https://creativecommons.org/licenses/by-nd/4.0/
License information was derived automatically
This data collection contains diachronic semantic relatedness judgments for German word usage pairs. Find a description of the data format, code to process the data and further datasets on the WUGsite.
Please find more information on the provided data in the paper referenced below.
See previous versions for additional plots, tables and testsets.
Version: 3.0.0, 15.12.2021.
Reference
Dominik Schlechtweg, Sabine Schulte im Walde, Stefanie Eckmann. 2018. Diachronic Usage Relatedness (DURel): A Framework for the Annotation of Lexical Semantic Change. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT). New Orleans, Louisiana USA.
Representation of anatomy in a virtual form is at the heart of clinical decision making, biomedical research, and medical training. Virtual anatomy is not limited to the description of geometry; it also requires appropriate and efficient labeling of regions, to define spatial relationships and interactions between anatomical objects; effective strategies for pointwise operations, to define local properties, biological or otherwise; and support for diverse data formats and standards, to facilitate exchange between clinicians, scientists, engineers, and the general public. aeva, a free and open source software package (library, user interfaces, extensions) capable of automated and interactive operations for virtual anatomy annotation and exchange, was developed in response to these currently unmet requirements. This site serves aeva outreach, including dissemination of the software and use cases. The use cases drive the design and testing of aeva features and demonstrate various workflows that rely on virtual anatomy.
aeva downloads: Downloads (https://simtk.org/frs/?group_id=1767) Kitware data repository (https://data.kitware.com/#folder/5e7a4690af2e2eed356a17f2)
aeva documentation: Guides and tutorials (https://aeva.readthedocs.io)
aeva videos: Short instructions (https://www.youtube.com/channel/UCubfUe40LXvBs86UyKci0Fw)
aeva source code: Kitware source code repository (https://gitlab.kitware.com/aeva)
aeva forum: Forums (https://simtk.org/plugins/phpBB/indexPhpbb.php?group_id=1767)
This project includes the following software/data packages:
Abstract Meaning Representation (AMR) Annotation Release 2.0 was developed by the Linguistic Data Consortium (LDC), SDL/Language Weaver, Inc., the University of Colorado's Computational Language and Educational Research group and the Information Sciences Institute at the University of Southern California. It contains a sembank (semantic treebank) of over 39,260 English natural language sentences from broadcast conversations, newswire, weblogs and web discussion forums.
AMR captures “who is doing what to whom” in a sentence. Each sentence is paired with a graph that represents its whole-sentence meaning in a tree-structure. AMR utilizes PropBank frames, non-core semantic roles, within-sentence coreference, named entity annotation, modality, negation, questions, quantities, and so on to represent the semantic structure of a sentence largely independent of its syntax.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
SNOMED CT provides about 300,000 codes with fine-grained concept definitions to support interoperability of health data. Coding clinical texts with medical terminologies is not a trivial task and is prone to disagreements between coders. We conducted a qualitative analysis to identify sources of disagreement in an annotation experiment which used a subset of SNOMED CT with some restrictions. A corpus of 20 English clinical text fragments from diverse origins and languages was annotated independently by two medically trained annotators following a specific annotation guideline. Following this guideline, the annotators had to assign sets of SNOMED CT codes to noun phrases, together with concept and term coverage ratings. The annotations were then manually examined against a reference standard to determine sources of disagreement. Five categories were identified. In our results, the most frequent cause of inter-annotator disagreement was related to human issues. In several cases, disagreements revealed gaps in the annotation guidelines and a lack of annotator training. The remaining issues can be influenced by certain SNOMED CT features.
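Purely as an illustration of the kind of comparison involved (this is not the study's scoring procedure), the snippet below measures agreement between the SNOMED CT code sets two annotators assign to the same noun phrase using Jaccard overlap; the phrases and codes are placeholders.

```python
# Toy agreement check on per-phrase SNOMED CT code sets (placeholder codes).
def jaccard(a, b):
    a, b = set(a), set(b)
    return 1.0 if not a and not b else len(a & b) / len(a | b)

annotator_1 = {"chest pain": {"code-A"}, "shortness of breath": {"code-B", "code-C"}}
annotator_2 = {"chest pain": {"code-A", "code-D"}, "shortness of breath": {"code-B", "code-C"}}

for phrase in annotator_1:
    print(phrase, round(jaccard(annotator_1[phrase], annotator_2[phrase]), 2))
```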
Attribution-NoDerivs 4.0 (CC BY-ND 4.0) https://creativecommons.org/licenses/by-nd/4.0/
License information was derived automatically
This data collection contains diachronic Word Usage Graphs (WUGs) for German created with reference use sampling. Find a description of the data format, code to process the data and further datasets on the WUGsite.
Please find more information on the provided data in the paper referenced below.
Version: 1.1.0, 15.12.2021.
Reference
Dominik Schlechtweg and Sabine Schulte im Walde. submitted. Clustering Word Usage Graphs: A Flexible Framework to Measure Changes in Contextual Word Meaning.
U.S. Government Works https://www.usa.gov/government-works
License information was derived automatically
These data were used to examine grammatical structures and patterns within a set of geospatial glossary definitions. The objectives of our study were to analyze the semantic structure of the input definitions, use this information to build triple structures of RDF graph data, upload our lexicon to knowledge graph software, and perform SPARQL queries on the data. Upon completion of this study, SPARQL queries effectively retrieved graph triples that displayed semantic significance. These data represent and characterize the lexicon of our input text, which is used to form graph triples. These data were collected in 2024 by passing text through multiple Python programs utilizing spaCy (a natural language processing library) and its pre-trained English transformer pipeline. Before the data were processed by the Python programs, input definitions were first rewritten as natural language and formatted as tabular data. Passages were then tokenized and characterized by their part-of-speech ...
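A minimal sketch in the spirit of that workflow (not the original programs) is shown below: it POS-tags a definition with spaCy's English transformer pipeline, stores one triple with rdflib, and runs a SPARQL query over it. The pipeline name, namespace, predicate, and example definition are assumptions.

```python
# Requires: pip install spacy rdflib && python -m spacy download en_core_web_trf
import spacy
from rdflib import Graph, Literal, Namespace, URIRef

nlp = spacy.load("en_core_web_trf")            # pre-trained English transformer pipeline
EX = Namespace("http://example.org/geo#")      # hypothetical namespace

definition = "A watershed is an area of land that drains water to a common outlet."
doc = nlp(definition)
print([(t.text, t.pos_) for t in doc])         # token / part-of-speech pairs

g = Graph()
g.add((URIRef(EX["watershed"]), EX["definedAs"], Literal(definition)))

for row in g.query("SELECT ?s ?o WHERE { ?s ?p ?o }"):   # simple SPARQL query over the toy graph
    print(row.s, row.o)
```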
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Manual annotation at the i5k Workspace@NAL (https://i5k.nal.usda.gov) is the review and improvement of gene models derived from computational gene prediction. Community curators compare an existing gene model to evidence such as RNA-Seq or protein alignments from the same or closely related species and modify the structure or function of the gene accordingly, typically following the i5k Workspace@NAL manual annotation guidelines (https://i5k.nal.usda.gov/content/rules-web-apollo-annotation-i5k-pilot-project). If a gene model is missing, the annotator can also use this evidence to create a new gene model. Because manual annotation, by definition, improves or creates gene models where computational methods have failed, it can be a powerful tool to improve computational gene sets, which often serve as foundational datasets to facilitate research on a species.

Here, community curators used manual annotation at the i5k Workspace@NAL to improve computational gene predictions from the dataset Agrilus planipennis genome annotations v0.5.3. The i5k Workspace@NAL set up the Apollo v1 manual annotation software and multiple evidence tracks to facilitate manual annotation. From 2014-10-20 to 2018-07-12, five community curators updated 263 genes, including developmental genes, cytochrome P450s, cathepsin peptidases, cuticle proteins, glycoside hydrolases, and polysaccharide lyases. For this dataset, we used the program LiftOff v1.6.3 to map the manual annotations to the genome assembly GCF_000699045.2. We computed overlaps with annotations from the RefSeq database using gff3_merge from the GFF3toolkit software v2.1.0. FASTA sequences were generated using gff3_to_fasta from the same toolkit. These improvements should facilitate continued research on Agrilus planipennis, or emerald ash borer (EAB), which is an invasive insect pest.

While these manual annotations will not be integrated with other computational gene sets, they are available to view at the i5k Workspace@NAL (https://i5k.nal.usda.gov) to enhance future research on Agrilus planipennis.
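The overlap computation in this dataset was done with gff3_merge from GFF3toolkit; as a simplified stand-in that only illustrates the idea, the sketch below parses gene features from two GFF3 files with plain Python and reports coordinate overlaps on the same scaffold (the file names are hypothetical).

```python
# Simplified stand-in for the overlap step described above (the dataset itself
# used gff3_merge from GFF3toolkit): read gene features from two GFF3 files
# and report pairs on the same scaffold whose coordinates overlap.
def read_genes(path):
    genes = []
    with open(path) as fh:
        for line in fh:
            if line.startswith("#"):
                continue
            cols = line.rstrip("\n").split("\t")
            if len(cols) >= 9 and cols[2] == "gene":
                genes.append((cols[0], int(cols[3]), int(cols[4]), cols[8]))
    return genes

def overlaps(a, b):
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]

manual = read_genes("manual_annotations.gff3")     # hypothetical file names
refseq = read_genes("refseq_annotations.gff3")
for m in manual:
    for r in refseq:
        if overlaps(m, r):
            print(m[3], "overlaps", r[3])
```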
Attribution-ShareAlike 3.0 (CC BY-SA 3.0) https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
The data published here are supplementary material for a paper to be published in Metaphor and the Social World (under revision).
Two debates organised and published by TVP and TVN were transcribed and annotated with the Metaphor Identification Procedure. We used eMargin software (a collaborative textual annotation tool; Kehoe and Gee 2013) and a slightly modified version of MIP (Pragglejaz 2007). Each lexical unit in the transcript was labelled as a metaphor-related word (MRW) if its "contextual meaning was related to the more basic meaning by some form of similarity" (Steen 2007). The meanings were established with the Wielki Słownik Języka Polskiego (Great Dictionary of Polish, ed. Żmigrodzki 2019). In addition to MRWs, lexemes which create a metaphorical expression together with an MRW were tagged as metaphorical expression words (MEW). At least two words are needed to identify an actual metaphorical expression, since an MRW cannot appear without an MEW. The grammatical construction of the metaphor (Sullivan 2009) is asymmetrical: one word is conceptually autonomous and the other is conceptually dependent on the first. In construction grammar terms (Langacker 2008), the metaphor-related word is elaborated by the metaphorical expression word, because the basic meaning of the MRW is elaborated and extended to a more figurative meaning only if it is used jointly with the MEW. Moreover, the meaning of the MEW is rather basic and concrete, as it remains unchanged in connection with the MRW. This can be clearly seen in an expression often used in our data: "Służba zdrowia jest w zapaści" ("The health service suffers from a collapse"), where the word "zapaść" ("collapse") is an example of an MRW and the words "służba zdrowia" ("health service") are labelled as MEW. The English translation of this expression needs a different verb: instead of "jest w zapaści" ("is in collapse"), the unmarked English collocation is "suffers from a collapse", therefore the words "suffers from a collapse" are labelled as MRW. The "collapse" could be caused by heart failure, such as cardiac arrest, or any other life-threatening medical condition, and the "health service" is portrayed as if it could literally suffer from such a condition – a collapse.
The data are in CSV tables exported from XML files downloaded from the eMargin site. Prior to annotation, the transcripts were divided into 40 parts, one for each annotator. MRW words are marked as MLN, MEW words are marked as MLP, functional words within a metaphorical expression are marked as MLI, and other words are marked as noana, which means no annotation was needed.
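A minimal sketch for tallying these labels in one exported table is given below; the file name and the "label" column name are assumptions, since the exact CSV header is not documented here.

```python
# Count annotation labels (MLN, MLP, MLI, noana) in one exported table.
import csv
from collections import Counter

counts = Counter()
with open("debate_part_01.csv", newline="", encoding="utf-8") as f:   # hypothetical file name
    for row in csv.DictReader(f):
        counts[row["label"]] += 1          # "label" column name is an assumption

print(counts)   # e.g. Counter({'noana': 950, 'MLN': 40, 'MLP': 38, 'MLI': 12})
```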
Attribution-NoDerivs 4.0 (CC BY-ND 4.0) https://creativecommons.org/licenses/by-nd/4.0/
License information was derived automatically
This data collection contains synchronic semantic relatedness judgments for German word usage pairs drawn from general language and the domain of cooking. Find a description of the data format, code to process the data and further datasets on the WUGsite.
We provide additional data under misc/:
Please find more information on the provided data in the paper referenced below.
Version: 2.0.0, 30.9.2021.
Reference
Anna Hätty, Dominik Schlechtweg, Sabine Schulte im Walde. 2019. SURel: A Gold Standard for Incorporating Meaning Shifts into Term Extraction. In Proceedings of the Eighth Joint Conference on Lexical and Computational Semantics (*SEM). Minneapolis, Minnesota, USA, 2019.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is Part 2/2 of the ActiveHuman dataset! Part 1 can be found here.
Dataset Description
ActiveHuman was generated using Unity's Perception package.
It consists of 175,428 RGB images and their semantic segmentation counterparts, captured in different environments, lighting conditions, camera distances and angles. In total, the dataset contains images for 8 environments, 33 humans, 4 lighting conditions, 7 camera distances (1 m-4 m) and 36 camera angles (0-360 degrees at 10-degree intervals).
The dataset does not include images at every single combination of available camera distances and angles, since for some values the camera would collide with another object or go outside the confines of an environment. As a result, some combinations of camera distances and angles do not exist in the dataset.
Alongside each image, 2D Bounding Box, 3D Bounding Box and Keypoint ground truth annotations are also generated via the use of Labelers and are stored as a JSON-based dataset. These Labelers are scripts that are responsible for capturing ground truth annotations for each captured image or frame. Keypoint annotations follow the COCO format defined by the COCO keypoint annotation template offered in the perception package.
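Since the keypoint Labeler follows the COCO keypoint template, a COCO-style export could be read as sketched below; the file name and on-disk layout are assumptions, and the actual Perception JSON files may be organised differently (see the folder description that follows).

```python
# Illustrative sketch only: reads a generic COCO-format keypoints file, which is
# an assumption about how this dataset's keypoint annotations are stored.
import json

with open("keypoints.json", encoding="utf-8") as f:      # hypothetical file name
    coco = json.load(f)

images = {img["id"]: img["file_name"] for img in coco["images"]}
for ann in coco["annotations"]:
    kps = ann["keypoints"]                                # flat [x1, y1, v1, x2, y2, v2, ...]
    triples = list(zip(kps[0::3], kps[1::3], kps[2::3]))
    print(images[ann["image_id"]], len(triples), "keypoints")
```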
Folder configuration
The dataset consists of 3 folders:
Essential Terminology
Dataset Data
The dataset includes four types of JSON annotation files:
Most Labelers generate different annotation specifications in the spec key-value pair:
Each Labeler generates different annotation specifications in the values key-value pair:
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The Slovene definition extraction evaluation dataset RSDO-def contains sentences extracted from the Corpus of term-annotated texts RSDO5 1.1 (http://hdl.handle.net/11356/1470), which contains texts with annotated terms from four different domains: biomechanics, linguistics, chemistry, and veterinary science. The file and sentence identifiers are the same as in the original RSDO corpus.
The labels added to the sentences included in the dataset denote: 0: Non-definition 1: Weak definition 2: Definition
The dataset consists of two parts: 1. RSDO-def-random employed a random sampling strategy, yielding 14 definitions, 98 weak definitions and 849 non-definitions. 2. RSDO-def-larger added sentences to the random part using the pattern-based definition extraction presented in Pollak (2014). It contains 169 definitions, 214 weak definitions and 872 non-definitions.
Both parts were manually annotated by five terminographers. In case of discrepancies between annotators, a consensus was reached and the final label was confirmed by all five annotators. Duplicates were removed in both parts.
The criteria for annotation are based on the standard ISO 1087-1:2000 (E/F) Terminology Work - Vocabulary, Part 1, Theory and Application, which explains a definition as follows: "Representation of a concept by a descriptive statement which serves to differentiate it from related concepts". Weak definition labels were assigned if the extracted sentences contained a term and at least one delimiting feature without a superordinate concept, or sentences consisting of superordinate concepts without delimiting features but with some typical examples. Instances were labeled as Non-definition if the sentence with the extracted concept did not contain any information about the concept or its delimiting features.
The dataset is described in more detail in Tran et al. 2023, where it was used for evaluating definition extraction approaches. If you use this resource, please cite:
Tran, T.H.H., Podpečan, V., Jemec Tomazin, M., Pollak, Senja (2023). Definition Extraction for Slovene: Patterns, Transformer Classifiers and ChatGPT. Proceedings of the ELEX 2023: Electronic lexicography in the 21st century. Invisible lexicography: everywhere lexical data is used without users realizing they make use of a “dictionary” (accepted)
Reference to the pattern-based definition extraction method used for creating RSDO-def-larger: Pollak, S. (2014). Extracting definition candidates from specialized corpora. Slovenščina 2.0: empirical, applied and interdisciplinary research, 2(1), pp. 1–40. https://doi.org/10.4312/slo2.0.2014.1.1-40
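As a rough, English-language illustration of pattern-based candidate extraction in the spirit of Pollak (2014) (the actual method uses a richer pattern set over Slovene text), a single "X is a Y" pattern could be applied as follows:

```python
# Toy stand-in for pattern-based definition-candidate extraction: flag sentences
# matching a simple "TERM is/are (a|an|the) GENUS" pattern. Not the RSDO-def pipeline.
import re

PATTERN = re.compile(
    r"^(?P<term>[A-Z][\w\s-]+?)\s+(is|are)\s+(a|an|the)?\s*(?P<genus>[\w\s-]+)",
    re.IGNORECASE,
)

sentences = [
    "Biomechanics is the study of the mechanical laws relating to the movement of living organisms.",
    "The samples were stored at room temperature.",
]

for s in sentences:
    label = "candidate definition" if PATTERN.match(s) else "non-definition"
    print(f"{label}: {s}")
```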
Related resources: