License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains 35 of 39 taxonomies that were the result of a systematic review. The systematic review was conducted with the goal of identifying taxonomies suitable for semantically annotating research data. A special focus was set on research data from the hybrid societies domain.
The following taxonomies were identified as part of the systematic review:
| Filename | Taxonomy Title |
| acm_ccs | ACM Computing Classification System [1] |
| amec | A Taxonomy of Evaluation Towards Standards [2] |
| bibo | A BIBO Ontology Extension for Evaluation of Scientific Research Results [3] |
| cdt | Cross-Device Taxonomy [4] |
| cso | Computer Science Ontology [5] |
| ddbm | What Makes a Data-driven Business Model? A Consolidated Taxonomy [6] |
| ddi_am | DDI Aggregation Method [7] |
| ddi_moc | DDI Mode of Collection [8] |
| n/a | DemoVoc [9] |
| discretization | Building a New Taxonomy for Data Discretization Techniques [10] |
| dp | Demopaedia [11] |
| dsg | Data Science Glossary [12] |
| ease | A Taxonomy of Evaluation Approaches in Software Engineering [13] |
| eco | Evidence & Conclusion Ontology [14] |
| edam | EDAM: The Bioscientific Data Analysis Ontology [15] |
| n/a | European Language Social Science Thesaurus [16] |
| et | Evaluation Thesaurus [17] |
| glos_hci | The Glossary of Human Computer Interaction [18] |
| n/a | Humanities and Social Science Electronic Thesaurus [19] |
| hcio | A Core Ontology on the Human-Computer Interaction Phenomenon [20] |
| hft | Human-Factors Taxonomy [21] |
| hri | A Taxonomy to Structure and Analyze Human–Robot Interaction [22] |
| iim | A Taxonomy of Interaction for Instructional Multimedia [23] |
| interrogation | A Taxonomy of Interrogation Methods [24] |
| iot | Design Vocabulary for Human–IoT Systems Communication [25] |
| kinect | Understanding Movement and Interaction: An Ontology for Kinect-Based 3D Depth Sensors [26] |
| maco | Thesaurus Mass Communication [27] |
| n/a | Thesaurus Cognitive Psychology of Human Memory [28] |
| mixed_initiative | Mixed-Initiative Human-Robot Interaction: Definition, Taxonomy, and Survey [29] |
| qos_qoe | A Taxonomy of Quality of Service and Quality of Experience of Multimodal Human-Machine Interaction [30] |
| ro | The Research Object Ontology [31] |
| senses_sensors | A Human-Centered Taxonomy of Interaction Modalities and Devices [32] |
| sipat | A Taxonomy of Spatial Interaction Patterns and Techniques [33] |
| social_errors | A Taxonomy of Social Errors in Human-Robot Interaction [34] |
| sosa | Semantic Sensor Network Ontology [35] |
| swo | The Software Ontology [36] |
| tadirah | Taxonomy of Digital Research Activities in the Humanities [37] |
| vrs | Virtual Reality and the CAVE: Taxonomy, Interaction Challenges and Research Directions [38] |
| xdi | Cross-Device Interaction [39] |
We converted the taxonomies into a SKOS (Simple Knowledge Organisation System) representation. The following four taxonomies were already available in SKOS and were therefore not converted and are excluded from this dataset:
1) DemoVoc, cf. http://thesaurus.web.ined.fr/navigateur/ available at https://thesaurus.web.ined.fr/exports/demovoc/demovoc.rdf
2) European Language Social Science Thesaurus, cf. https://thesauri.cessda.eu/elsst/en/ available at https://zenodo.org/record/5506929
3) Humanities and Social Science Electronic Thesaurus, cf. https://hasset.ukdataservice.ac.uk/hasset/en/ available at https://zenodo.org/record/7568355
4) Thesaurus Cognitive Psychology of Human Memory, cf. https://www.loterre.fr/presentation/ available at https://skosmos.loterre.fr/P66/en/
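The SKOS conversion maps each taxonomy entry to a skos:Concept with a preferred label and, where the source taxonomy is hierarchical, broader/narrower links. Below is a minimal sketch that emits such a concept in Turtle syntax; the ex: namespace and the concept IDs are hypothetical, not the URIs used in the published SKOS files.

```python
# A minimal sketch of emitting one taxonomy entry as a skos:Concept in Turtle.
# The ex: namespace and concept IDs below are hypothetical examples, not the
# identifiers used in the dataset itself.

PREFIXES = (
    "@prefix skos: <http://www.w3.org/2004/02/skos/core#> .\n"
    "@prefix ex: <http://example.org/taxonomy/> .\n"
)

def skos_concept(concept_id, label, broader_id=None):
    """Render one taxonomy node as a skos:Concept in Turtle syntax."""
    lines = [
        f"ex:{concept_id} a skos:Concept ;",
        f'    skos:prefLabel "{label}"@en',
    ]
    if broader_id is not None:
        # Link the node to its parent concept in the taxonomy hierarchy.
        lines[-1] += " ;"
        lines.append(f"    skos:broader ex:{broader_id}")
    return "\n".join(lines) + " .\n"

turtle = PREFIXES + "\n" + skos_concept("hci", "Human-computer interaction") \
    + skos_concept("hri", "Human-robot interaction", broader_id="hci")
print(turtle)
```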
References
[1] “The 2012 ACM Computing Classification System,” ACM Digital Library, 2012. https://dl.acm.org/ccs (accessed May 08, 2023).
[2] AMEC, “A Taxonomy of Evaluation Towards Standards.” Aug. 31, 2016. Accessed: May 08, 2023. [Online]. Available: https://amecorg.com/amecframework/home/supporting-material/taxonomy/
[3] B. Dimić Surla, M. Segedinac, and D. Ivanović, “A BIBO ontology extension for evaluation of scientific research results,” in Proceedings of the Fifth Balkan Conference in Informatics, in BCI ’12. New York, NY, USA: Association for Computing Machinery, Sep. 2012, pp. 275–278. doi: 10.1145/2371316.2371376.
[4] F. Brudy et al., “Cross-Device Taxonomy: Survey, Opportunities and Challenges of Interactions Spanning Across Multiple Devices,” in Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, in CHI ’19. New York, NY, USA: Association for Computing Machinery, May 2019, pp. 1–28. doi: 10.1145/3290605.3300792.
[5] A. A. Salatino, T. Thanapalasingam, A. Mannocci, F. Osborne, and E. Motta, “The Computer Science Ontology: A Large-Scale Taxonomy of Research Areas,” in Lecture Notes in Computer Science 1137, D. Vrandečić, K. Bontcheva, M. C. Suárez-Figueroa, V. Presutti, I. Celino, M. Sabou, L.-A. Kaffee, and E. Simperl, Eds., Monterey, California, USA: Springer, Oct. 2018, pp. 187–205. Accessed: May 08, 2023. [Online]. Available: http://oro.open.ac.uk/55484/
[6] M. Dehnert, A. Gleiss, and F. Reiss, “What makes a data-driven business model? A consolidated taxonomy,” presented at the European Conference on Information Systems, 2021.
[7] DDI Alliance, “DDI Controlled Vocabulary for Aggregation Method,” 2014. https://ddialliance.org/Specification/DDI-CV/AggregationMethod_1.0.html (accessed May 08, 2023).
[8] DDI Alliance, “DDI Controlled Vocabulary for Mode Of Collection,” 2015. https://ddialliance.org/Specification/DDI-CV/ModeOfCollection_2.0.html (accessed May 08, 2023).
[9] INED - French Institute for Demographic Studies, “Thésaurus DemoVoc,” Feb. 26, 2020. https://thesaurus.web.ined.fr/navigateur/en/about (accessed May 08, 2023).
[10] A. A. Bakar, Z. A. Othman, and N. L. M. Shuib, “Building a new taxonomy for data discretization techniques,” in 2009 2nd Conference on Data Mining and Optimization, Oct. 2009, pp. 132–140. doi: 10.1109/DMO.2009.5341896.
[11] N. Brouard and C. Giudici, “Unified second edition of the Multilingual Demographic Dictionary (Demopaedia.org project),” presented at the 2017 International Population Conference, IUSSP, Oct. 2017. Accessed: May 08, 2023. [Online]. Available: https://iussp.confex.com/iussp/ipc2017/meetingapp.cgi/Paper/5713
[12] B. DuCharme, “Data Science Glossary.” https://www.datascienceglossary.org/ (accessed May 08, 2023).
[13] A. Chatzigeorgiou, T. Chaikalis, G. Paschalidou, N. Vesyropoulos, C. K. Georgiadis, and E. Stiakakis, “A Taxonomy of Evaluation Approaches in Software Engineering,” in Proceedings of the 7th Balkan Conference on Informatics Conference, in BCI ’15. New York, NY, USA: Association for Computing Machinery, Sep. 2015, pp. 1–8. doi: 10.1145/2801081.2801084.
[14] M. C. Chibucos, D. A. Siegele, J. C. Hu, and M. Giglio, “The Evidence and Conclusion Ontology (ECO): Supporting GO Annotations,” in The Gene Ontology Handbook, C. Dessimoz and N. Škunca, Eds., in Methods in Molecular Biology. New York, NY: Springer, 2017, pp. 245–259. doi: 10.1007/978-1-4939-3743-1_18.
[15] M. Black et al., “EDAM: the bioscientific data analysis ontology,” F1000Research, vol. 11, Jan. 2021, doi: 10.7490/f1000research.1118900.1.
[16] Council of European Social Science Data Archives (CESSDA), “European Language Social Science Thesaurus ELSST,” 2021. https://thesauri.cessda.eu/en/ (accessed May 08, 2023).
[17] M. Scriven, Evaluation Thesaurus, 3rd Edition. Edgepress, 1981. Accessed: May 08, 2023. [Online]. Available: https://us.sagepub.com/en-us/nam/evaluation-thesaurus/book3562
[18] B. Papantoniou et al., The Glossary of Human Computer Interaction. Interaction Design Foundation. Accessed: May 08, 2023. [Online]. Available: https://www.interaction-design.org/literature/book/the-glossary-of-human-computer-interaction
[19] “UK Data Service Vocabularies: HASSET Thesaurus.” https://hasset.ukdataservice.ac.uk/hasset/en/ (accessed May 08, 2023).
[20] S. D. Costa, M. P. Barcellos, R. de A. Falbo, T. Conte, and K. M. de Oliveira, “A core ontology on the Human–Computer Interaction phenomenon,” Data Knowl. Eng., vol. 138, p. 101977, Mar. 2022, doi: 10.1016/j.datak.2021.101977.
[21] V. J. Gawron et al., “Human Factors Taxonomy,” Proc. Hum. Factors Soc. Annu. Meet., vol. 35, no. 18, pp. 1284–1287, Sep. 1991, doi: 10.1177/154193129103501807.
[22] L. Onnasch and E. Roesler, “A Taxonomy to Structure and Analyze Human–Robot Interaction,” Int. J. Soc. Robot., vol. 13, no. 4, pp. 833–849, Jul. 2021, doi: 10.1007/s12369-020-00666-5.
[23] R. A. Schwier, “A Taxonomy of Interaction for Instructional Multimedia.” Sep. 28, 1992. Accessed: May 09, 2023. [Online]. Available: https://eric.ed.gov/?id=ED352044
[24] C. Kelly, J. Miller, A. Redlich, and S. Kleinman, “A Taxonomy of Interrogation Methods,”
License: Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
Abstract: Granting agencies invest millions of dollars in the generation and analysis of data, making these products extremely valuable. However, without sufficient annotation of the methods used to collect and analyze the data, the ability to reproduce and reuse those products suffers. This lack of assurance of the quality and credibility of the data at the different stages of the research process wastes much of the investment of time and funding, and fails to drive research forward to the level that would be possible if everything were effectively annotated and disseminated to the wider research community.

To address this issue for the Hawai’i Established Program to Stimulate Competitive Research (EPSCoR) project, a water science gateway called the ‘Ike Wai Gateway was developed at the University of Hawai‘i (UH). In Hawaiian, ‘Ike means knowledge and Wai means water. The gateway supports research in hydrology and water management by providing tools to address questions of water sustainability in Hawai‘i. It provides a framework for data acquisition, analysis, model integration, and display of data products, and is intended to complement and integrate with the capabilities of the Consortium of Universities for the Advancement of Hydrologic Science’s (CUAHSI) HydroShare by providing sound data and metadata management capabilities for multi-domain field observations, analytical lab actions, and modeling outputs. Functionality provided by the gateway is supported by a subset of CUAHSI’s Observations Data Model (ODM), delivered as centralized web-based user interfaces and APIs supporting multi-domain data management, computation, analysis, and visualization tools for reproducible science, modeling, data discovery, and decision support for the Hawai’i EPSCoR ‘Ike Wai research team and the wider Hawai‘i hydrology community.
By leveraging the Tapis platform, UH has constructed a gateway that ties data and advanced computing resources together to support diverse research domains, including microbiology, geochemistry, geophysics, economics, and the humanities, coupled with computational and modeling workflows delivered in a user-friendly web interface, with workflows for effectively annotating the project data and products. Disseminating results for the ‘Ike Wai project through the ‘Ike Wai data gateway and HydroShare makes the research products accessible and reusable.
License: MIT License, https://opensource.org/licenses/MIT
Here are a few use cases for this project:
Historical Weapon Classification: This computer vision model can be utilized by historians, archeologists, and museum curators to classify and catalog historical weapons and artifacts, including swords, arrows, guns, and knives, enabling them to better understand and contextualize the weapons' origins and usage throughout history.
Video Game Asset Management: Game developers can use the Data Annotate model to automatically tag and categorize in-game assets, such as weapons and visual effects, to streamline their development process and more easily manage game content.
Prop and Costume Design: The model can aid prop and costume designers in the film, theater, and cosplay industries by identifying and categorizing various weapons and related items, allowing them to find suitable props or inspirations for their designs more quickly.
Law Enforcement and Security: Data Annotate can be used by law enforcement agencies and security personnel to effectively detect weapons in surveillance footage or images, enabling them to respond more quickly to potential threats and uphold public safety.
Educational Applications: Teachers and educators can use the model to develop interactive and engaging learning materials in the fields of history, art, and technology. It can help students identify and understand the significance of various weapons and their roles in shaping human history and culture.
Background: Currently, most genome annotation is curated by centralized groups with limited resources. Efforts to share annotations transparently among multiple groups have not yet been satisfactory.
Results: Here we introduce a concept called the Distributed Annotation System (DAS). DAS allows sequence annotations to be decentralized among multiple third-party annotators and integrated on an as-needed basis by client-side software. The communication between client and servers in DAS is defined by the DAS XML specification. Annotations are displayed in layers, one per server. Any client or server adhering to the DAS XML specification can participate in the system; we describe a simple prototype client and server example.
Conclusions: The DAS specification is being used experimentally by Ensembl, WormBase, and the Berkeley Drosophila Genome Project. Continued success will depend on the readiness of the research community to adopt DAS and provide annotations. All components are freely available from the project website.
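The client-side layering idea can be sketched as follows: each server contributes one annotation layer for a requested sequence segment, and the client merges the layers on demand. The feature dictionaries and server callables below are invented stand-ins for HTTP requests returning DAS XML responses; they illustrate the integration concept only, not the actual DAS specification.

```python
# Illustrative sketch of DAS-style client-side integration: one annotation
# layer per server, fetched and overlaid on demand. Feature format and server
# interfaces are invented for illustration (the real system exchanges XML).

def merge_layers(segment_id, servers):
    """Collect one annotation layer per server for the requested segment."""
    layers = {}
    for name, fetch in servers.items():
        # `fetch` stands in for an HTTP request to a third-party DAS server.
        layers[name] = list(fetch(segment_id))
    return layers

# Two mock "servers" named after projects mentioned above; purely illustrative.
ensembl = lambda seg: [{"type": "gene", "start": 100, "stop": 900}]
wormbase = lambda seg: [{"type": "repeat", "start": 300, "stop": 350}]

view = merge_layers("chrI:1-1000", {"ensembl": ensembl, "wormbase": wormbase})
print(view)
```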
License: Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
INCEpTION is an open-source text annotation tool primarily designed to annotate text documents. It supports annotations of words and sentences as well as linking annotations to each other.
These features make INCEpTION a comprehensive solution for building and managing annotated corpora.
Privacy policy: https://www.wiseguyreports.com/pages/privacy-policy
| BASE YEAR | 2024 |
| HISTORICAL DATA | 2019 - 2023 |
| REGIONS COVERED | North America, Europe, APAC, South America, MEA |
| REPORT COVERAGE | Revenue Forecast, Competitive Landscape, Growth Factors, and Trends |
| MARKET SIZE 2024 | 2.88(USD Billion) |
| MARKET SIZE 2025 | 3.28(USD Billion) |
| MARKET SIZE 2035 | 12.0(USD Billion) |
| SEGMENTS COVERED | Application, Service Type, Industry, Deployment Model, Regional |
| COUNTRIES COVERED | US, Canada, Germany, UK, France, Russia, Italy, Spain, Rest of Europe, China, India, Japan, South Korea, Malaysia, Thailand, Indonesia, Rest of APAC, Brazil, Mexico, Argentina, Rest of South America, GCC, South Africa, Rest of MEA |
| KEY MARKET DYNAMICS | growing AI adoption, increasing demand for accuracy, rise in machine learning, cost optimization needs, regulatory compliance requirements |
| MARKET FORECAST UNITS | USD Billion |
| KEY COMPANIES PROFILED | Deep Vision, Amazon, Google, Scale AI, Microsoft, Defined.ai, Samhita, Samasource, Figure Eight, Cognitive Cloud, CloudFactory, Appen, Tegas, iMerit, Labelbox |
| MARKET FORECAST PERIOD | 2025 - 2035 |
| KEY MARKET OPPORTUNITIES | AI and machine learning growth, Increasing demand for annotated data, Expansion in autonomous vehicles, Healthcare data management needs, Real-time data processing requirements |
| COMPOUND ANNUAL GROWTH RATE (CAGR) | 13.9% (2025 - 2035) |
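The table's figures are internally consistent and can be sanity-checked: compounding the 2025 market size at the stated CAGR over the ten-year forecast period approximately reproduces the 2035 figure, and the 2024-to-2025 step implies the same growth rate.

```python
# Sanity check of the forecast figures in the table above.
base_2025 = 3.28   # USD billion (MARKET SIZE 2025)
cagr = 0.139       # 13.9% (stated CAGR for 2025-2035)
years = 10

projected_2035 = base_2025 * (1 + cagr) ** years
print(round(projected_2035, 2))  # 12.05, close to the reported 12.0 USD billion

# The 2024 -> 2025 step (2.88 -> 3.28) implies the same ~13.9% growth rate.
implied_rate = 3.28 / 2.88 - 1
print(round(implied_rate, 3))  # 0.139
```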
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Dataset Description
Dataset Summary
This dataset contains the annotations made by a team of experts on the speakers with more than 1200 seconds of recordings in the Catalan set of the Common Voice dataset (v13).
The annotators were initially tasked with evaluating all recordings associated with the same individual. Following that, they were instructed to annotate the speaker's accent, gender, and the overall quality of the recordings.
The accents and genders taken into account are the ones used until version 8 of the Common Voice corpus.
See annotations for more details.
Supported Tasks and Leaderboards
Gender classification, Accent classification.
Languages
The dataset is in Catalan (ca).
Dataset Structure
Instances
Two xlsx documents are published, one for each round of annotations.
The following information is available in each of the documents:
{
  "speaker ID": "1b7fc0c4e437188bdf1b03ed21d45b780b525fd0dc3900b9759d0755e34bc25e31d64e69c5bd547ed0eda67d104fc0d658b8ec78277810830167c53ef8ced24b",
  "idx": "31",
  "same speaker": {"AN1": "SI", "AN2": "SI", "AN3": "SI", "agreed": "SI", "percentage": "100"},
  "gender": {"AN1": "H", "AN2": "H", "AN3": "H", "agreed": "H", "percentage": "100"},
  "accent": {"AN1": "Central", "AN2": "Central", "AN3": "Central", "agreed": "Central", "percentage": "100"},
  "audio quality": {"AN1": "4.0", "AN2": "3.0", "AN3": "3.0", "agreed": "3.0", "percentage": "66", "mean quality": "3.33", "stdev quality": "0.58"},
  "comments": {"AN1": "", "AN2": "pujades i baixades de volum", "AN3": "Deu ser d'alguna zona de transició amb el central, perquè no fa una reducció total vocàlica, però hi té molta tendència"}
}
We also publish the document Guia anotació parlants.pdf, with the guidelines the annotators received.
Data Fields
speaker ID (string): An id for which client (voice) made the recording in the Common Voice corpus
idx (int): Id in this corpus
AN1 (string): Annotations from Annotator 1
AN2 (string): Annotations from Annotator 2
AN3 (string): Annotations from Annotator 3
agreed (string): Annotation from the majority of the annotators
percentage (int): Percentage of annotators that agree with the agreed annotation
mean quality (float): Mean of the quality annotation
stdev quality (float): Standard deviation of the mean quality
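The agreed and percentage fields follow from a simple majority vote over the three annotators' labels. A minimal sketch is shown below (this is not the annotators' actual tooling, which was a spreadsheet; note the dataset's percentages appear truncated to integers, e.g. 66 for two of three annotators):

```python
from collections import Counter

def majority(annotations):
    """Return (agreed_label, percent_agreement) for one annotated field.

    `annotations` is a dict like {"AN1": "H", "AN2": "H", "AN3": "D"}.
    The percentage is truncated to an integer, matching the dataset's "66".
    """
    counts = Counter(annotations.values())
    label, votes = counts.most_common(1)[0]
    return label, 100 * votes // len(annotations)

print(majority({"AN1": "H", "AN2": "H", "AN3": "H"}))        # ('H', 100)
print(majority({"AN1": "4.0", "AN2": "3.0", "AN3": "3.0"}))  # ('3.0', 66)
```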
Data Splits
The corpus is not divided into splits, as it is not intended for training models.
Dataset Creation
Curation Rationale
During 2022, a campaign was launched to promote the Common Voice corpus within the Catalan-speaking community, achieving remarkable success. However, not all participants provided their demographic details such as age, gender, and accent. Additionally, some individuals faced difficulty in self-defining their accent using the standard classifications established by specialists.
In order to obtain a balanced corpus with reliable information, we saw the necessity of enlisting a group of experts from the University of Barcelona to provide accurate annotations.
We release the complete annotations because transparency is fundamental to our project. Furthermore, we believe they hold philological value for studying dialectal and gender variants.
Source Data
The original data comes from the Catalan sentences of the Common Voice corpus.
Initial Data Collection and Normalization
We have selected speakers who have recorded more than 1200 seconds of speech in the Catalan set of the version 13 of the Common Voice corpus.
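Selecting those speakers amounts to summing recording durations per client ID and keeping the IDs above the threshold. The sketch below uses invented client IDs and durations; the corpus's actual metadata fields may differ.

```python
from collections import defaultdict

# Invented sample data: (client_id, recording duration in seconds).
RECORDINGS = [
    ("spk_a", 700.0), ("spk_a", 650.5), ("spk_b", 300.0), ("spk_b", 200.0),
]

def speakers_over(recordings, threshold=1200.0):
    """Return client IDs whose total recorded speech exceeds `threshold`."""
    totals = defaultdict(float)
    for client_id, seconds in recordings:
        totals[client_id] += seconds
    return sorted(cid for cid, total in totals.items() if total > threshold)

print(speakers_over(RECORDINGS))  # ['spk_a']
```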
Who are the source language producers?
The original data comes from the Catalan sentences of the Common Voice corpus.
Annotations
Annotation process
Starting from version 13 of the Common Voice corpus, we identified the 273 speakers who had recorded more than 1200 seconds of speech.
A team of three annotators was tasked with annotating:
if all the recordings correspond to the same person
the gender of the speaker
the accent of the speaker
the quality of the recording
They conducted an initial round of annotation, discussed their varying opinions, and subsequently conducted a second round.
We release the complete annotations because transparency is fundamental to our project. Furthermore, we believe they hold philological value for studying dialectal and gender variants.
Who are the annotators?
The annotation was entrusted to the CLiC (Centre de Llenguatge i Computació) team from the University of Barcelona. They selected a group of three annotators (two men and one woman), who received a scholarship to do this work.
The annotation team was composed of:
Annotator 1: 1 female annotator, aged 18-25, L1 Catalan, student in the Modern Languages and Literatures degree, with a focus on Catalan.
Annotators 2 & 3: 2 male annotators, aged 18-25, L1 Catalan, students in the Catalan Philology degree.
1 female supervisor, aged 40-50, L1 Catalan, graduate in Physics and in Linguistics, Ph.D. in Signal Theory and Communications.
To do the annotation, they used a Google Drive spreadsheet.
Personal and Sensitive Information
The Common Voice dataset consists of people who have donated their voice online. We do not share their voices here, only their gender and accent. You agree not to attempt to determine the identity of speakers in the Common Voice dataset.
Considerations for Using the Data
Social Impact of Dataset
The IDs come from the Common Voice dataset, which consists of people who have donated their voices online.
You agree to not attempt to determine the identity of speakers in the Common Voice dataset.
The information from this corpus will allow us to train and evaluate well balanced Catalan ASR models. Furthermore, we believe they hold philological value for studying dialectal and gender variants.
Discussion of Biases
Most of the voices in the Catalan Common Voice correspond to men aged 40 to 60 with a central accent. The aim of this dataset is to provide information that allows the biases this could cause to be minimized.
For the gender annotation, we have only considered "H" (male) and "D" (female).
Other Known Limitations
[N/A]
Additional Information
Dataset Curators
Language Technologies Unit at the Barcelona Supercomputing Center (langtech@bsc.es)
This work has been promoted and financed by the Generalitat de Catalunya through the Aina project.
Licensing Information
This dataset is licensed under a CC BY 4.0 license.
It can be used for any purpose, whether academic or commercial, under the terms of the license. Give appropriate credit, provide a link to the license, and indicate if changes were made.
Citation Information
DOI
Contributions
The annotation was entrusted to the STeL team from the University of Barcelona.
License: Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0), https://creativecommons.org/licenses/by-nc-nd/4.0/
The Dataset 2: annotated corpora by level of complexity for FR, PT and SP is a collection of texts categorized by complexity level and annotated for complexity features, presented in Excel format (.xlsx). These corpora were compiled and annotated under the scope of the project iRead4Skills – Intelligent Reading Improvement System for Fundamental and Transversal Skills Development, funded by the European Commission (grant number: 1010094837). The project aims to enhance reading skills within the adult population by creating an intelligent system that assesses text complexity and recommends suitable reading materials to adults with low literacy skills, contributing to reducing skills gaps and facilitating access to information and culture (https://iread4skills.com).
This dataset is the result of specifically devised classification and annotation tasks, in which selected texts were organized and distributed to trainers in Adult Learning (AL) and Vocational Education Training (VET) Centres, as well as to adult students in AL and VET centres. This task was conducted via the Qualtrics platform.
The Dataset 2: annotated corpora by level of complexity for FR, PT and SP is derived from the iRead4Skills Dataset 1: corpora by level of complexity for FR, PT and SP (https://doi.org/10.5281/zenodo.10055909), which comprises written texts of various genres and complexity levels. From this collection, a sample of texts was selected for classification and annotation. This classification and annotation task aimed to provide additional data and test sets for the complexity analysis systems for the three languages of the project: French, Portuguese, and Spanish. The sample texts in each of the language corpora were selected taking into account the diversity of topics/domains, genres, and the reading preferences of the target audience of the iRead4Skills project. The sample amounted to a total of 462 texts per language, divided by level of complexity as follows:
· 140 Very Easy texts
· 140 Easy texts
· 140 Plain texts
· 42 More Complex texts.
Trainers and students were asked to classify the texts according to the complexity levels of the project, here informally defined as:
· Very Easy (everyone can understand the text or most of the text).
· Easy (a person with less than the 9th year of schooling can understand the text or most of the text)
· Plain (a person with the 9th year of schooling can understand the text the first time he/she reads it)
· More complex (a person with the 9th year of schooling cannot understand the text the first time he/she reads it).
Annotators were also asked to mark the parts of the texts considered complex according to various types of features, at word level and at sentence level (e.g., word order, sentence composition, etc.). The full details regarding the students’ and trainers’ tasks, the qualitative and quantitative description of the data, and the inter-annotator agreement are described here: https://zenodo.org/records/14653180
The results are presented here in Excel format. For each language and each group (trainers and students), there is a pair of files, the annotation file and the classification file, resulting in four files per language and twelve files in total.
In all files, the data is organized as a matrix, with each row representing an ‘answer’ from a particular participant, and the columns containing various details about that specific input, as shown below:
| Column name | Data |
| Annotator's ID | The randomly generated ID code for each annotator, together with information on the dataset assigned to them. |
| Progress | Information on the completion of the task (for each text). |
| Duration (seconds) | Time used in the completion of the task (for each text). |
| File Name (N1 = Very Easy, N2 = Easy, N3 = Plain, N4 = More Complex) | File internal identification, providing its iRead4Skills classification. |
| Text | The content of the file, i.e. the text itself. |
| Annotated Level | Level assigned by the annotator (trainer). |
| Proficiency SubLevel (Likert scale, 1 to 5) | SubLevel assigned by the annotator (trainer) for FR data. |
| Corresponding CEFR Level | CEFR level closest to the iRead4Skills level. |
| Additional Info | Observations made by the trainers/students. |
| Annotated Term | Word or set of words selected for annotation. |
| Term Label | Annotation assigned to the Annotated Term (difficult word, word order, etc.). |
| Term Index | Position of the annotated term in the text. |
| Annotator's Proficiency Level | Level of AL/VET of the student. |
| Text adequate for user | Validation of the text by the students. |
The content of the “File Name” column is color-coded: a green shade indicates a text with a lower level of complexity, and a red one indicates a higher level of complexity.
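One natural use of the classification files is to compare each text's nominal iRead4Skills level (encoded in the File Name column) with the level assigned by the annotator. The rows below are invented samples mirroring the column layout described above, not data from the actual files.

```python
from collections import Counter

# Invented sample rows: (file-name level code, level assigned by annotator).
ROWS = [
    ("N1", "Very Easy"), ("N1", "Easy"), ("N2", "Easy"),
    ("N3", "Plain"), ("N4", "More Complex"),
]
# Mapping of file-name codes to levels, as given in the column table above.
LEVELS = {"N1": "Very Easy", "N2": "Easy", "N3": "Plain", "N4": "More Complex"}

# Count how often the annotated level matches the file's nominal level.
agreement = Counter(LEVELS[code] == annotated for code, annotated in ROWS)
print(f"matches: {agreement[True]}, mismatches: {agreement[False]}")
```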
The complete datasets are available under a CC BY-NC-ND 4.0 license.
License: Attribution-NoDerivs 4.0 (CC BY-ND 4.0), https://creativecommons.org/licenses/by-nd/4.0/
This data collection contains diachronic Word Usage Graphs (WUGs) for Spanish. Find a description of the data format, code to process the data and further datasets on the WUGsite.
Please find more information on the provided data in the papers referenced below.
The annotation was funded by
Version: 4.0.2, 7.1.2025. Full data. Quoting issues in uses resolved. Target word and target sentence indices corrected. One corrected context for word 'metro'. Judgments anonymized. Annotator 'gecsa' removed. Issues with special characters in filenames resolved. Additional removal of wrongly copied graphs.
Frank D. Zamora-Reina, Felipe Bravo-Marquez, Dominik Schlechtweg. 2022. LSCDiscovery: A shared task on semantic change discovery and detection in Spanish. In Proceedings of the 3rd International Workshop on Computational Approaches to Historical Language Change. Association for Computational Linguistics.
Dominik Schlechtweg, Tejaswi Choppa, Wei Zhao, Michael Roth. 2025. The CoMeDi Shared Task: Median Judgment Classification & Mean Disagreement Ranking with Ordinal Word-in-Context Judgments. In Proceedings of the 1st Workshop on Context and Meaning--Navigating Disagreements in NLP Annotations.
In this study, we explore to what extent language users agree about what kind of stances are expressed in natural language use or whether their interpretations diverge. In order to perform this task, a comprehensive cognitive-functional framework of ten stance categories was developed based on previous work on speaker stance in the literature. A corpus of opinionated texts, where speakers take stance and position themselves, was compiled, the Brexit Blog Corpus (BBC). An analytical interface for the annotations was set up and the data were annotated independently by two annotators. The annotation procedure, the annotation agreement and the co-occurrence of more than one stance category in the utterances are described and discussed. The careful, analytical annotation process has by and large returned satisfactory inter- and intra-annotation agreement scores, resulting in a gold standard corpus, the final version of the BBC.
Purpose:
The aim of this study is to explore the possibility of identifying speaker stance in discourse, provide an analytical resource for it and an evaluation of the level of agreement across speakers in the area of stance-taking in discourse.
The BBC is a collection of texts from blog sources. The corpus texts are thematically related to the 2016 UK referendum concerning whether the UK should remain a member of the European Union or not. The texts were extracted from the Internet from June to August 2015. With the Gavagai API (https://developer.gavagai.se), the texts were detected using seed words, such as Brexit, EU referendum, pro-Europe, europhiles, eurosceptics, United States of Europe, David Cameron, or Downing Street. The retrieved URLs were filtered so that only entries described as blogs in English were selected. Each downloaded document was split into sentential utterances, from which 2,200 utterances were randomly selected as the analysis data set. The final size of the corpus is 1,682 utterances, 35,492 words (169,762 characters without spaces). Each utterance contains from 3 to 40 words, with a mean length of 21 words.
For the data annotation process the Active Learning and Visual Analytics (ALVA) system (https://doi.org/10.1145/3132169 and https://doi.org/10.2312/eurp.20161139) was used. Two annotators, one who is a professional translator with a Licentiate degree in English Linguistics and the other one with a PhD in Computational Linguistics, carried out the annotations independently of one another.
The data set can be downloaded in two different formats: a standard Microsoft Excel format and a raw data format (ZIP archive) which can be useful for analytical and machine learning purposes, for example, with the Python library scikit-learn. The Excel file includes one additional variable (utterance word length). The ZIP archive contains a set of directories (e.g., "contrariety" and "prediction") corresponding to the stance categories. Inside of each such directory, there are two nested directories corresponding to annotations which assign or not assign the respective category to utterances (e.g., inside the top-level category "prediction" there are two directories, "prediction" with utterances which were labeled with this category, and "no" with the rest of the utterances). Inside of the nested directories, there are textual files containing individual utterances.
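The directory layout in the ZIP archive (one top-level directory per stance category, with a positive subdirectory and a "no" subdirectory each holding one text file per utterance) is the folder-per-class convention that tools such as scikit-learn's load_files expect. A stdlib-only sketch of reading one category back into (text, label) pairs, using invented file contents:

```python
import os
import tempfile

def read_binary_category(root):
    """Read utterances for one stance category from the nested layout:
    root/<category>/ holds utterances labeled with the category,
    root/no/ holds the rest. Returns a list of (text, label) pairs."""
    data = []
    for label in sorted(os.listdir(root)):
        subdir = os.path.join(root, label)
        for fname in sorted(os.listdir(subdir)):
            with open(os.path.join(subdir, fname), encoding="utf-8") as fh:
                data.append((fh.read(), label))
    return data

# Build a tiny mock of the "prediction" directory to exercise the reader.
with tempfile.TemporaryDirectory() as tmp:
    for label, text in [("prediction", "Brexit will happen."), ("no", "I live here.")]:
        os.makedirs(os.path.join(tmp, label), exist_ok=True)
        with open(os.path.join(tmp, label, "u1.txt"), "w", encoding="utf-8") as fh:
            fh.write(text)
    print(read_binary_category(tmp))
```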
When using data from this study, the primary researcher asks that you also cite the following publication: Vasiliki Simaki, Carita Paradis, Maria Skeppstedt, Magnus Sahlgren, Kostiantyn Kucher, and Andreas Kerren. Annotating speaker stance in discourse: the Brexit Blog Corpus. Corpus Linguistics and Linguistic Theory, De Gruyter, 2017 (published electronically ahead of print). https://doi.org/10.1515/cllt-2016-0060
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information was derived automatically)
A collection of 1,200 texts (292,173 tokens) about clinical trial studies and clinical trial announcements in Spanish:
- 500 abstracts from journals published under a Creative Commons license, e.g., available in PubMed or the Scientific Electronic Library Online (SciELO);
- 700 clinical trial announcements published in the European Clinical Trials Register and the Repositorio Español de Estudios Clínicos.
Texts were annotated with entities from the Unified Medical Language System semantic groups: anatomy (ANAT), pharmacological and chemical substances (CHEM), pathologies (DISO), and lab tests, diagnostic or therapeutic procedures (PROC). In total, 46,699 entities were annotated (13.98% of them nested entities). 10% of the corpus was doubly annotated, and inter-annotator agreement (IAA) reached a mean F-measure of 85.65% (±4.79, strict match) and 93.94% (±3.31, relaxed match). The corpus is freely distributed for research and educational purposes under a Creative Commons Non-Commercial Attribution (CC-BY-NC-A) license.
License: Attribution-NoDerivs 4.0 (CC BY-ND 4.0), https://creativecommons.org/licenses/by-nd/4.0/ (license information was derived automatically)
Synchronic Usage Relatedness (SURel) - Test Set and Annotation Data
This data collection, supplementing the papers referenced below, contains:
- a meaning shift test set with 22 German lexemes exhibiting different degrees of meaning shift from general language to the domain of cooking. It comes as a tab-separated CSV file where each line has the form
lemma POS translations mean relatedness score frequency GEN frequency SPEC
The 'mean relatedness score' denotes the annotation-based measure of semantic shift described in the paper. 'frequency GEN' and 'frequency SPEC' list the frequencies of the target words in the general language corpus (GEN) and the domain-specific cooking corpus (SPEC). 'translations' gives English translations for different senses, illustrating possible meaning shifts. Note that further senses might exist;
- the full annotation tables as the annotators received and filled them. The tables are tab-separated CSV files where each line has the form
sentence 1 rating comment sentence 2;
- the annotation guidelines in English and German (only the German version was used);
- data visualization plots.
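Since the test set is a plain tab-separated file, it can be read with Python's csv module. A minimal sketch based on the column layout above (the function name and the numeric types of the score and frequency columns are assumptions):

```python
import csv

def read_surel_testset(path):
    """Read the tab-separated SURel test set described above.

    Columns: lemma, POS, translations, mean relatedness score,
    frequency GEN, frequency SPEC.
    """
    rows = []
    with open(path, encoding="utf-8") as f:
        for lemma, pos, translations, score, freq_gen, freq_spec in csv.reader(f, delimiter="\t"):
            rows.append({
                "lemma": lemma,
                "pos": pos,
                "translations": translations,
                "mean_relatedness": float(score),
                "freq_gen": int(freq_gen),
                "freq_spec": int(freq_spec),
            })
    return rows
```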
Find more information in
Anna Hätty, Dominik Schlechtweg, Sabine Schulte im Walde. 2019. SURel: A Gold Standard for Incorporating Meaning Shifts into Term Extraction. In Proceedings of the Eighth Joint Conference on Lexical and Computational Semantics (*SEM). Minneapolis, Minnesota USA 2019.
Dominik Schlechtweg, Sabine Schulte im Walde, Stefanie Eckmann. 2018. Diachronic Usage Relatedness (DURel): A Framework for the Annotation of Lexical Semantic Change. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT). New Orleans, Louisiana USA 2018.
The resources are freely available for education, research and other non-commercial purposes. More information can be requested via email to the authors.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information was derived automatically)
This is Part 2/2 of the ActiveHuman dataset! Part 1 can be found here.
Dataset Description
ActiveHuman was generated using Unity's Perception package. It consists of 175,428 RGB images and their semantic segmentation counterparts, taken in different environments, lighting conditions, camera distances, and angles. In total, the dataset contains images for 8 environments, 33 humans, 4 lighting conditions, 7 camera distances (1 m to 4 m), and 36 camera angles (0 to 360 degrees at 10-degree intervals). The dataset does not include images for every combination of camera distance and angle, since for some values the camera would collide with another object or leave the confines of an environment. As a result, some combinations of camera distances and angles do not exist in the dataset. Alongside each image, 2D bounding box, 3D bounding box, and keypoint ground truth annotations are generated via Labelers and stored as a JSON-based dataset. Labelers are scripts responsible for capturing ground truth annotations for each captured image or frame. Keypoint annotations follow the COCO format defined by the COCO keypoint annotation template offered in the Perception package.
Folder configuration
The dataset consists of 3 folders:
- JSON Data: contains all the generated JSON files.
- RGB Images: contains the generated RGB images.
- Semantic Segmentation Images: contains the generated semantic segmentation images.
Essential Terminology
- Annotation: Recorded data describing a single capture.
- Capture: One completed rendering process of a Unity sensor which stores the rendered result to data files (e.g., PNG, JPG).
- Ego: Object or person to which a collection of sensors is attached (e.g., if a drone has a camera attached to it, the drone is the ego and the camera is the sensor).
- Ego coordinate system: Coordinates with respect to the ego.
- Global coordinate system: Coordinates with respect to the global origin in Unity.
- Sensor: Device that captures the dataset (in this instance, a camera).
- Sensor coordinate system: Coordinates with respect to the sensor.
- Sequence: Time-ordered series of captures; very useful for video capture, where the time-order relationship of two captures is vital.
- UUID: Universally Unique Identifier; a unique hexadecimal identifier that can represent an individual instance of a capture, ego, sensor, annotation, labeled object or keypoint, or keypoint template.
Dataset Data
The dataset includes 4 types of JSON annotation files:
annotation_definitions.json: Contains annotation definitions for all active Labelers of the simulation, stored in an array. Each entry is a collection of key-value pairs describing a particular type of annotation and specifying how its data should be mapped back to labels or objects in the scene. Each entry contains the following key-value pairs:
- id: Integer identifier of the annotation's definition.
- name: Annotation name (e.g., keypoints, bounding box, bounding box 3D, semantic segmentation).
- description: Description of the annotation's specifications.
- format: Format of the file containing the annotation specifications (e.g., json, PNG).
- spec: Format-specific specifications for the annotation values generated by each Labeler.
Most Labelers generate different annotation specifications in the spec key-value pair:
BoundingBox2DLabeler/BoundingBox3DLabeler:
- label_id: Integer identifier of a label.
- label_name: String identifier of a label.

KeypointLabeler:
- template_id: Keypoint template UUID.
- template_name: Name of the keypoint template.
- key_points: Array containing all the joints defined by the keypoint template. Each array entry includes the key-value pairs:
  - label: Joint label.
  - index: Joint index.
  - color: RGBA values of the keypoint.
  - color_code: Hex color code of the keypoint.
- skeleton: Array containing all the skeleton connections defined by the keypoint template. Each skeleton connection defines a connection between two different joints. Each array entry includes the key-value pairs:
  - label1: Label of the first joint.
  - label2: Label of the second joint.
  - joint1: Index of the first joint.
  - joint2: Index of the second joint.
  - color: RGBA values of the connection.
  - color_code: Hex color code of the connection.

SemanticSegmentationLabeler:
- label_name: String identifier of a label.
- pixel_value: RGBA values of the label.
- color_code: Hex color code of the label.
captures_xyz.json: Each of these files contains an array of ground truth annotations generated by each active Labeler for each capture, as well as extra metadata describing the state of each active sensor present in the scene. Each array entry contains the following key-value pairs:
- id: UUID of the capture.
- sequence_id: UUID of the sequence.
- step: Index of the capture within a sequence.
- timestamp: Timestamp (in ms) since the beginning of a sequence.
- sensor: Properties of the sensor. This entry contains a collection with the following key-value pairs:
  - sensor_id: Sensor UUID.
  - ego_id: Ego UUID.
  - modality: Modality of the sensor (e.g., camera, radar).
  - translation: 3D vector describing the sensor's position (in meters) with respect to the global coordinate system.
  - rotation: Quaternion describing the sensor's orientation with respect to the ego coordinate system.
  - camera_intrinsic: Matrix containing (if it exists) the camera's intrinsic calibration.
  - projection: Projection type used by the camera (e.g., orthographic, perspective).
- ego: Attributes of the ego. This entry contains a collection with the following key-value pairs:
  - ego_id: Ego UUID.
  - translation: 3D vector describing the ego's position (in meters) with respect to the global coordinate system.
  - rotation: Quaternion containing the ego's orientation.
  - velocity: 3D vector containing the ego's velocity (in meters per second).
  - acceleration: 3D vector containing the ego's acceleration (in meters per second squared).
- format: Format of the file captured by the sensor (e.g., PNG, JPG).
- annotations: Key-value pair collections, one for each active Labeler. These key-value pairs are as follows:
  - id: Annotation UUID.
  - annotation_definition: Integer identifier of the annotation's definition.
  - filename: Name of the file generated by the Labeler. This entry is only present for Labelers that generate an image.
  - values: List of key-value pairs containing annotation data for the current Labeler.
Each Labeler generates different annotation specifications in the values key-value pair:
BoundingBox2DLabeler:
- label_id: Integer identifier of a label.
- label_name: String identifier of a label.
- instance_id: UUID of one instance of an object. Each object with the same label that is visible in the same capture has a different instance_id value.
- x: Position of the 2D bounding box on the X axis.
- y: Position of the 2D bounding box on the Y axis.
- width: Width of the 2D bounding box.
- height: Height of the 2D bounding box.

BoundingBox3DLabeler:
- label_id: Integer identifier of a label.
- label_name: String identifier of a label.
- instance_id: UUID of one instance of an object. Each object with the same label that is visible in the same capture has a different instance_id value.
- translation: 3D vector containing the location of the center of the 3D bounding box with respect to the sensor coordinate system (in meters).
- size: 3D vector containing the size of the 3D bounding box (in meters).
- rotation: Quaternion containing the orientation of the 3D bounding box.
- velocity: 3D vector containing the velocity of the 3D bounding box (in meters per second).
- acceleration: 3D vector containing the acceleration of the 3D bounding box (in meters per second squared).

KeypointLabeler:
- label_id: Integer identifier of a label.
- instance_id: UUID of one instance of a joint. Keypoints with the same joint label that are visible in the same capture have different instance_id values.
- template_id: UUID of the keypoint template.
- pose: Pose label for that particular capture.
- keypoints: Array containing the properties of each keypoint. Each keypoint that exists in the keypoint template file is one element of the array, with the following contents:
  - index: Index of the keypoint in the keypoint template file.
  - x: Pixel coordinate of the keypoint on the X axis.
  - y: Pixel coordinate of the keypoint on the Y axis.
  - state: State of the keypoint.
The SemanticSegmentationLabeler does not contain a values list.
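As a sketch of how these files can be consumed (assuming, as in Unity Perception output, that the per-file annotation array sits under a top-level "captures" key; the function name is ours), the 2D bounding boxes of every capture can be collected like this:

```python
import json

def bounding_boxes_2d(capture_file, definition_id):
    """Collect all 2D bounding-box annotations from one captures_xyz.json.

    definition_id is the integer id assigned to the BoundingBox2DLabeler
    in annotation_definitions.json.
    """
    with open(capture_file, encoding="utf-8") as f:
        captures = json.load(f)["captures"]
    boxes = []
    for capture in captures:
        for annotation in capture["annotations"]:
            # Match annotations against the Labeler's definition id and
            # gather their per-object values (x, y, width, height, ...).
            if annotation["annotation_definition"] == definition_id:
                boxes.extend(annotation["values"])
    return boxes
```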
egos.json: Contains collections of key-value pairs for each ego. These include:
- id: UUID of the ego.
- description: Description of the ego.

sensors.json: Contains collections of key-value pairs for all sensors of the simulation. These include:
- id: UUID of the sensor.
- ego_id: UUID of the ego to which the sensor is attached.
- modality: Modality of the sensor (e.g., camera, radar, sonar).
- description: Description of the sensor.
Image names
The RGB and semantic segmentation images share the same naming convention; however, the semantic segmentation images additionally contain the string Semantic_ at the beginning of their filenames. Each RGB image is named "e_h_l_d_r.jpg", where:
- e denotes the id of the environment.
- h denotes the id of the person.
- l denotes the id of the lighting condition.
- d denotes the camera distance at which the image was captured.
- r denotes the camera angle at which the image was captured.
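The naming convention can be inverted programmatically. A minimal sketch (the concrete id values used in the test are made up; the parser relies only on the "e_h_l_d_r.jpg" pattern and the Semantic_ prefix described above):

```python
from pathlib import Path

def parse_image_name(filename):
    """Split an image name of the form "e_h_l_d_r.jpg" into its parts.

    Handles both RGB images and semantic segmentation images (the
    latter carry a leading "Semantic_" in their filenames).
    """
    name = Path(filename).name
    is_semantic = name.startswith("Semantic_")
    if is_semantic:
        name = name[len("Semantic_"):]
    # Path.stem drops only the final ".jpg" suffix, so distances
    # containing a decimal point (e.g. "2.5") survive intact.
    e, h, l, d, r = Path(name).stem.split("_")
    return {
        "environment": e,
        "human": h,
        "lighting": l,
        "distance": d,
        "angle": r,
        "semantic_segmentation": is_semantic,
    }
```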
License: Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/ (license information was derived automatically)
Currently, no consensus exists regarding criteria required to designate a protein within a proteomic data set as a cell surface protein. Most published proteomic studies rely on varied ontology annotations or computational predictions instead of experimental evidence when attributing protein localization. Consequently, standardized approaches for analyzing and reporting cell surface proteome data sets would increase confidence in localization claims and promote data use by other researchers. Recently, we developed Veneer, a web-based bioinformatic tool that analyzes results from cell surface N-glycocapture workflowsthe most popular cell surface proteomics method used to date that generates experimental evidence of subcellular location. Veneer assigns protein localization based on defined experimental and bioinformatic evidence. In this study, we updated the criteria and process for assigning protein localization and added new functionality to Veneer. Results of Veneer analysis of 587 cell surface N-glycocapture data sets from 32 published studies demonstrate the importance of applying defined criteria when analyzing cell surface proteomics data sets and exemplify how Veneer can be used to assess experimental quality and facilitate data extraction for informing future biological studies and annotating public repositories.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information was derived automatically)
Description
Disco-Annotation is a collection of training and test sets with manually annotated discourse relations for 8 discourse connectives in Europarl texts.
The 8 connectives with their annotated relations are:
although (contrast|concession)
as (prep|causal|temporal|comparison|concession)
however (contrast|concession)
meanwhile (contrast|temporal)
since (causal|temporal|temporal-causal)
though (contrast|concession)
while (contrast|concession|temporal|temporal-contrast|temporal-causal)
yet (adv|contrast|concession)
For each connective there is a training set and a test set. The relations were annotated by two trained annotators using a translation-spotting method. The division into training and test sets also allows for comparison if you train your own models.
If you need software for the latter, have a look at: https://github.com/idiap/DiscoConn-Classifier
Citation
Please cite the following papers if you make use of these datasets (they also describe the annotation method in more detail):
@INPROCEEDINGS{Popescu-Belis-LREC-2012,
  author    = {Popescu-Belis, Andrei and Meyer, Thomas and Liyanapathirana, Jeevanthi and Cartoni, Bruno and Zufferey, Sandrine},
  title     = {{D}iscourse-level {A}nnotation over {E}uroparl for {M}achine {T}ranslation: {C}onnectives and {P}ronouns},
  booktitle = {Proceedings of the eighth international conference on Language Resources and Evaluation ({LREC})},
  year      = {2012},
  address   = {Istanbul, Turkey}
}

@Article{Cartoni-DD-2013,
  author  = {Cartoni, Bruno and Zufferey, Sandrine and Meyer, Thomas},
  title   = {{Annotating the meaning of discourse connectives by looking at their translation: The translation-spotting technique}},
  journal = {Dialogue & Discourse},
  volume  = {4},
  number  = {2},
  pages   = {65--86},
  year    = {2013}
}

@ARTICLE{Meyer-TSLP-submitted,
  author  = {Meyer, Thomas and Hajlaoui, Najeh and Popescu-Belis, Andrei},
  title   = {{Disambiguating Discourse Connectives for Statistical Machine Translation in Several Languages}},
  journal = {IEEE/ACM Transactions of Audio, Speech, and Language Processing},
  year    = {submitted},
  volume  = {},
  pages   = {},
  number  = {}
}
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information was derived automatically)
Quarterly release of curated gene function data for Arabidopsis thaliana from The Arabidopsis Information Resource (www.arabidopsis.org)
The contents of the compressed archive include the following files which are described in detail in the included README file.
1. ATH_GO_GOSLIM.txt.gz: This is a tab-delimited file containing GO annotations for Arabidopsis genes annotated by TAIR and TIGR with terms from the Gene Ontology Consortium controlled vocabularies (see www.geneontology.org). This file includes an updated set of literature-based annotations and >40,000 electronic annotations based on matches to INTERPRO domains supplied by Nicola Mulder from SWISS-PROT/INTERPRO.
Please cite this paper when using TAIR's GO annotations in your research: Berardini, TZ, Mundodi, S, Reiser, L, Huala, E, Garcia-Hernandez, M, Zhang, P, Mueller, LM, Yoon, J, Doyle, A, Lander, G, Moseyko, N, Yoo, D, Xu, I, Zoeckler, B, Montoya, M, Miller, N, Weems, D, and Rhee, SY (2004) Functional annotation of the Arabidopsis genome using controlled vocabularies. Plant Physiol. 135(2):1-11.
2. gene_aliases_yyyymmdd.txt(.gz): This file lists alternative names for each gene.
3. Locus_Germplasm_Phenotype_yyyymmdd.txt.gz: This file contains links between loci, germplasms, and phenotypes.
4. Locus_Published_yyyymmdd.txt.gz: This file contains links between loci and publications.
5. po_temporal_gene_arabidopsis_tair.assoc.gz and po_anatomy_gene_arabidopsis_tair.assoc.gz: These two tab-delimited files each contain the set of literature-based annotations of Arabidopsis genes and loci annotated at TAIR with terms from the Plant Ontology developed by the Plant Ontology Consortium (POC, www.plantontology.org).
6. TAIR10 or ARAPORT11_functional_descriptions_yyyymmdd.txt(.gz): This file contains functional descriptions for gene models included in either the TAIR10 or, as of 20170630, the Araport11 genome release. TAIR10/Araport11 refers to the version of the genome annotation.
7. Araport11_GFF3_genes_transposons.[DATE].gff.gz: This GFF3 file describes gene and transposon features with the columns below.
Column: explanation
1. Name of the chromosome.
2. Source: name of the data source that generated this feature (Araport11).
3. Annotation type, e.g., gene, mRNA.
4. Start position of the annotation.
5. Stop position of the annotation.
6. Score: a floating point value.
7. Strand information, defined as + (forward) or - (reverse).
8. Frame: one of '0', '1' or '2'. '0' indicates that the first base of the feature is the first base of a codon, '1' that the second base is the first base of a codon, and so on.
9. Detailed annotation information: a semicolon-separated list of tag-value pairs providing additional information about each feature, such as transcript_id, gene_id, Note, curator summary, and computational description.
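The nine columns above follow the standard GFF3 layout, so each feature line can be parsed generically. A minimal sketch (the function name is ours; attributes are assumed to use the GFF3 tag=value form, separated by semicolons):

```python
def parse_gff3_line(line):
    """Parse one tab-separated, 9-column GFF3 feature line into a dict,
    using the column meanings described above."""
    (seqid, source, ftype, start, end,
     score, strand, frame, attrs) = line.rstrip("\n").split("\t")
    # Split the 9th column into a tag -> value mapping.
    attributes = dict(
        pair.split("=", 1) for pair in attrs.split(";") if pair
    )
    return {
        "seqid": seqid, "source": source, "type": ftype,
        "start": int(start), "end": int(end), "score": score,
        "strand": strand, "frame": frame, "attributes": attributes,
    }
```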
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information was derived automatically)
Table caption: performance obtained by the SVM and SMLR classifiers on lateral view images only, using both majority (maj) and minority (min) voting. For more details on majority and minority voting, see 'Materials and methods'. For each case, random partitions of the training and testing data sets are generated on the most popular annotation terms. Abbreviations of the anatomical annotations: AMP - anterior midgut primordium; BP - brain primordium; DEP - dorsal epidermis primordium; FP - foregut primordium; HMP - head mesoderm primordium; HPP - hindgut proper primordium; PMP - posterior midgut primordium; SMP - somatic muscle primordium; TMP - trunk mesoderm primordium; VNCP - ventral nerve cord primordium.
Data for "Koelmel JP, Stelben P, McDonough CA, Dukes DA, Aristizabal-Henao JJ, Nason SL, Li Y, Sternberg S, Lin E, Beckmann M, Williams AJ, Draper J, Finch JP, Munk JK, Deigl C, Rennie EE, Bowden JA, Godri Pollitt KJ. FluoroMatch 2.0-making automated and comprehensive non-targeted PFAS annotation a reality. Anal Bioanal Chem. 2022 Jan;414(3):1201-1215. doi: 10.1007/s00216-021-03392-7. Epub 2021 May 20. PMID: 34014358.". Portions of this dataset are inaccessible because: The link provided by UCSD doesn't seem to be working. They can be accessed through the following means: Contact Jeremy Koelmel at Yale University, jeremykoelmel@innovativeomics.com. Format: The final annotated excel sheets with feature intensities, annotations, homologous series groupings, etc., are available as a supplemental excel file with the online version of this manuscript. The raw Agilent ".d" files can be downloaded at: ftp://massive.ucsd.edu/MSV000086811/updates/2021-02-05_jeremykoelmel_e5b21166/raw/McDonough_AFFF_3M_ddMS2_Neg.zip (Note: use Google Chrome or Firefox; Microsoft Edge and certain other browsers are unable to download from an FTP link). This dataset is associated with the following publication: Koelmel, J.P., P. Stelben, C.A. McDonough, D.A. Dukes, J.J. Aristizabal-Henao, S.L. Nason, Y. Li, S. Sternberg, E. Lin, M. Beckmann, A. Williams, J. Draper, J. Finch, J.K. Munk, C. Deigl, E. Rennie, J.A. Bowden, and K.J. Godri Pollitt. FluoroMatch 2.0: making automated and comprehensive non-targeted PFAS annotation a reality. Analytical and Bioanalytical Chemistry. Springer, New York, NY, USA, 414(3): 1201-1215, (2022).
License: Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/ (license information was derived automatically)
Proteomics data-dependent acquisition data sets collected with high-resolution mass spectrometry (MS) can achieve very high-quality results, but nearly every analysis yields results that are thresholded at some accepted false discovery rate, meaning that a substantial number of results are incorrect. For study conclusions that rely on a small number of peptide-spectrum matches being correct, it is thus important to examine at least some crucial spectra to ensure that they are not among the incorrect identifications. We present Quetzal, a peptide fragment ion spectrum annotation tool to assist researchers in annotating and examining such spectra to ensure that they correctly support study conclusions. We describe how Quetzal annotates spectra using the new Human Proteome Organization (HUPO) Proteomics Standards Initiative (PSI) mzPAF standard for fragment ion peak annotation, including the Python-based code, a web-service endpoint that provides annotation services, and a web-based application for annotating spectra and producing publication-quality figures. We illustrate its functionality with several annotated spectra of varying complexity. Quetzal provides easily accessible functionality that can assist in the effort to ensure and demonstrate that crucial spectra support study conclusions. Quetzal is publicly available at https://proteomecentral.proteomexchange.org/quetzal/.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information was derived automatically)
Studies have identified false and non-actionable alarms as a factor in alarm fatigue in intensive care units.
To annotate patient alarms and analyse the alarm situation in intensive care units, we conceptualized and performed data mappings related to airway management and medication interventions. The mappings were based on information retrieved from the patient data management system (PDMS) and clinical expertise. For the airway management mappings, we used additional resources such as ISO 19223:2019 or ventilator instruction manuals. The mappings do not include patient data.
As the mappings are generic, they could be used in other contexts than alarm annotation and research.
General tables summarizing 1) the categories based on ISO 19223:2019 used to describe respiratory support therapies (RSTs), 2) the invasiveness level of an RST, and 3) the abbreviations used in the mappings
Tables including PDMS entries for airway devices (ADs), ventilation devices (VDs), and ventilation modes (VMs)
Mapping of AD entries (from the PDMS) to defined categories
Mapping of VDs, VMs, and ADs to defined RSTs, including information on invasiveness
Table specifying suitable ventilation parameters in the context of each RST
General tables providing information on physiological alarm conditions (PACs), interventions, routes, and techniques of administration of interest
Mapping of routes of administration to techniques of administration including PDMS entries
Mapping of active ingredients (including SNOMED CT Fully Specified Names and Identifiers), related PDMS information, and routes and techniques of administration to defined PAC and interventions
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information was derived automatically)
This dataset contains 35 of 39 taxonomies that were the result of a systematic review. The systematic review was conducted with the goal of identifying taxonomies suitable for semantically annotating research data. A special focus was set on research data from the hybrid societies domain.
The following taxonomies were identified as part of the systematic review:
Filename: Taxonomy Title
- acm_ccs: ACM Computing Classification System [1]
- amec: A Taxonomy of Evaluation Towards Standards [2]
- bibo: A BIBO Ontology Extension for Evaluation of Scientific Research Results [3]
- cdt: Cross-Device Taxonomy [4]
- cso: Computer Science Ontology [5]
- ddbm: What Makes a Data-driven Business Model? A Consolidated Taxonomy [6]
- ddi_am: DDI Aggregation Method [7]
- ddi_moc: DDI Mode of Collection [8]
- n/a: DemoVoc [9]
- discretization: Building a New Taxonomy for Data Discretization Techniques [10]
- dp: Demopaedia [11]
- dsg: Data Science Glossary [12]
- ease: A Taxonomy of Evaluation Approaches in Software Engineering [13]
- eco: Evidence & Conclusion Ontology [14]
- edam: EDAM: The Bioscientific Data Analysis Ontology [15]
- n/a: European Language Social Science Thesaurus [16]
- et: Evaluation Thesaurus [17]
- glos_hci: The Glossary of Human Computer Interaction [18]
- n/a: Humanities and Social Science Electronic Thesaurus [19]
- hcio: A Core Ontology on the Human-Computer Interaction Phenomenon [20]
- hft: Human-Factors Taxonomy [21]
- hri: A Taxonomy to Structure and Analyze Human–Robot Interaction [22]
- iim: A Taxonomy of Interaction for Instructional Multimedia [23]
- interrogation: A Taxonomy of Interrogation Methods [24]
- iot: Design Vocabulary for Human–IoT Systems Communication [25]
- kinect: Understanding Movement and Interaction: An Ontology for Kinect-Based 3D Depth Sensors [26]
- maco: Thesaurus Mass Communication [27]
- n/a: Thesaurus Cognitive Psychology of Human Memory [28]
- mixed_initiative: Mixed-Initiative Human-Robot Interaction: Definition, Taxonomy, and Survey [29]
- qos_qoe: A Taxonomy of Quality of Service and Quality of Experience of Multimodal Human-Machine Interaction [30]
- ro: The Research Object Ontology [31]
- senses_sensors: A Human-Centered Taxonomy of Interaction Modalities and Devices [32]
- sipat: A Taxonomy of Spatial Interaction Patterns and Techniques [33]
- social_errors: A Taxonomy of Social Errors in Human-Robot Interaction [34]
- sosa: Semantic Sensor Network Ontology [35]
- swo: The Software Ontology [36]
- tadirah: Taxonomy of Digital Research Activities in the Humanities [37]
- vrs: Virtual Reality and the CAVE: Taxonomy, Interaction Challenges and Research Directions [38]
- xdi: Cross-Device Interaction [39]
We converted the taxonomies into a SKOS (Simple Knowledge Organisation System) representation. The following 4 taxonomies were not converted because they were already available in SKOS, and they were therefore excluded from this dataset:
1) DemoVoc, cf. http://thesaurus.web.ined.fr/navigateur/ available at https://thesaurus.web.ined.fr/exports/demovoc/demovoc.rdf
2) European Language Social Science Thesaurus, cf. https://thesauri.cessda.eu/elsst/en/ available at https://zenodo.org/record/5506929
3) Humanities and Social Science Electronic Thesaurus, cf. https://hasset.ukdataservice.ac.uk/hasset/en/ available at https://zenodo.org/record/7568355
4) Thesaurus Cognitive Psychology of Human Memory, cf. https://www.loterre.fr/presentation/ available at https://skosmos.loterre.fr/P66/en/
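The core of such a conversion is mapping each taxonomy term to a skos:Concept with a skos:prefLabel, and each parent/child relation to skos:broader. The dataset description does not specify the tooling used, so the following is only a minimal illustrative sketch (stdlib-only Turtle serialization; the base URI and example terms are hypothetical):

```python
# Minimal sketch: serialize a flat taxonomy (term -> broader term) as SKOS Turtle.
# The base URI and the example terms below are hypothetical illustrations;
# the dataset's actual conversion pipeline is not described in the source.

BASE = "http://example.org/taxonomy/"

def slug(label: str) -> str:
    """Derive a simple URI-safe local name from a concept label."""
    return label.lower().replace(" ", "-")

def to_skos_turtle(terms: dict) -> str:
    """terms maps a concept label to its broader concept's label,
    or to None for top-level concepts."""
    lines = [
        "@prefix skos: <http://www.w3.org/2004/02/skos/core#> .",
        f"@prefix : <{BASE}> .",
        "",
    ]
    for label, broader in terms.items():
        lines.append(f":{slug(label)} a skos:Concept ;")
        # Close the statement here if there is no broader concept to attach.
        lines.append(f'    skos:prefLabel "{label}"@en' + (" ;" if broader else " ."))
        if broader:
            lines.append(f"    skos:broader :{slug(broader)} .")
    return "\n".join(lines)

example = {
    "Human-Robot Interaction": None,
    "Mixed-Initiative Interaction": "Human-Robot Interaction",
}
print(to_skos_turtle(example))
```

Real conversions would additionally carry skos:ConceptScheme membership (skos:inScheme, skos:topConceptOf) and alternative labels, but the concept/prefLabel/broader triple pattern above is the structural core shared by all 35 converted taxonomies.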
References
[1] “The 2012 ACM Computing Classification System,” ACM Digital Library, 2012. https://dl.acm.org/ccs (accessed May 08, 2023).
[2] AMEC, “A Taxonomy of Evaluation Towards Standards.” Aug. 31, 2016. Accessed: May 08, 2023. [Online]. Available: https://amecorg.com/amecframework/home/supporting-material/taxonomy/
[3] B. Dimić Surla, M. Segedinac, and D. Ivanović, “A BIBO ontology extension for evaluation of scientific research results,” in Proceedings of the Fifth Balkan Conference in Informatics, in BCI ’12. New York, NY, USA: Association for Computing Machinery, Sep. 2012, pp. 275–278. doi: 10.1145/2371316.2371376.
[4] F. Brudy et al., “Cross-Device Taxonomy: Survey, Opportunities and Challenges of Interactions Spanning Across Multiple Devices,” in Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, in CHI ’19. New York, NY, USA: Association for Computing Machinery, May 2019, pp. 1–28. doi: 10.1145/3290605.3300792.
[5] A. A. Salatino, T. Thanapalasingam, A. Mannocci, F. Osborne, and E. Motta, “The Computer Science Ontology: A Large-Scale Taxonomy of Research Areas,” in Lecture Notes in Computer Science 1137, D. Vrandečić, K. Bontcheva, M. C. Suárez-Figueroa, V. Presutti, I. Celino, M. Sabou, L.-A. Kaffee, and E. Simperl, Eds., Monterey, California, USA: Springer, Oct. 2018, pp. 187–205. Accessed: May 08, 2023. [Online]. Available: http://oro.open.ac.uk/55484/
[6] M. Dehnert, A. Gleiss, and F. Reiss, “What makes a data-driven business model? A consolidated taxonomy,” presented at the European Conference on Information Systems, 2021.
[7] DDI Alliance, “DDI Controlled Vocabulary for Aggregation Method,” 2014. https://ddialliance.org/Specification/DDI-CV/AggregationMethod_1.0.html (accessed May 08, 2023).
[8] DDI Alliance, “DDI Controlled Vocabulary for Mode Of Collection,” 2015. https://ddialliance.org/Specification/DDI-CV/ModeOfCollection_2.0.html (accessed May 08, 2023).
[9] INED - French Institute for Demographic Studies, “Thésaurus DemoVoc,” Feb. 26, 2020. https://thesaurus.web.ined.fr/navigateur/en/about (accessed May 08, 2023).
[10] A. A. Bakar, Z. A. Othman, and N. L. M. Shuib, “Building a new taxonomy for data discretization techniques,” in 2009 2nd Conference on Data Mining and Optimization, Oct. 2009, pp. 132–140. doi: 10.1109/DMO.2009.5341896.
[11] N. Brouard and C. Giudici, “Unified second edition of the Multilingual Demographic Dictionary (Demopaedia.org project),” presented at the 2017 International Population Conference, IUSSP, Oct. 2017. Accessed: May 08, 2023. [Online]. Available: https://iussp.confex.com/iussp/ipc2017/meetingapp.cgi/Paper/5713
[12] B. DuCharme, “Data Science Glossary.” https://www.datascienceglossary.org/ (accessed May 08, 2023).
[13] A. Chatzigeorgiou, T. Chaikalis, G. Paschalidou, N. Vesyropoulos, C. K. Georgiadis, and E. Stiakakis, “A Taxonomy of Evaluation Approaches in Software Engineering,” in Proceedings of the 7th Balkan Conference on Informatics Conference, in BCI ’15. New York, NY, USA: Association for Computing Machinery, Sep. 2015, pp. 1–8. doi: 10.1145/2801081.2801084.
[14] M. C. Chibucos, D. A. Siegele, J. C. Hu, and M. Giglio, “The Evidence and Conclusion Ontology (ECO): Supporting GO Annotations,” in The Gene Ontology Handbook, C. Dessimoz and N. Škunca, Eds., in Methods in Molecular Biology. New York, NY: Springer, 2017, pp. 245–259. doi: 10.1007/978-1-4939-3743-1_18.
[15] M. Black et al., “EDAM: the bioscientific data analysis ontology,” F1000Research, vol. 11, Jan. 2021, doi: 10.7490/f1000research.1118900.1.
[16] Council of European Social Science Data Archives (CESSDA), “European Language Social Science Thesaurus ELSST,” 2021. https://thesauri.cessda.eu/en/ (accessed May 08, 2023).
[17] M. Scriven, Evaluation Thesaurus, 3rd Edition. Edgepress, 1981. Accessed: May 08, 2023. [Online]. Available: https://us.sagepub.com/en-us/nam/evaluation-thesaurus/book3562
[18] B. Papantoniou et al., The Glossary of Human Computer Interaction. Interaction Design Foundation. Accessed: May 08, 2023. [Online]. Available: https://www.interaction-design.org/literature/book/the-glossary-of-human-computer-interaction
[19] “UK Data Service Vocabularies: HASSET Thesaurus.” https://hasset.ukdataservice.ac.uk/hasset/en/ (accessed May 08, 2023).
[20] S. D. Costa, M. P. Barcellos, R. de A. Falbo, T. Conte, and K. M. de Oliveira, “A core ontology on the Human–Computer Interaction phenomenon,” Data Knowl. Eng., vol. 138, p. 101977, Mar. 2022, doi: 10.1016/j.datak.2021.101977.
[21] V. J. Gawron et al., “Human Factors Taxonomy,” Proc. Hum. Factors Soc. Annu. Meet., vol. 35, no. 18, pp. 1284–1287, Sep. 1991, doi: 10.1177/154193129103501807.
[22] L. Onnasch and E. Roesler, “A Taxonomy to Structure and Analyze Human–Robot Interaction,” Int. J. Soc. Robot., vol. 13, no. 4, pp. 833–849, Jul. 2021, doi: 10.1007/s12369-020-00666-5.
[23] R. A. Schwier, “A Taxonomy of Interaction for Instructional Multimedia.” Sep. 28, 1992. Accessed: May 09, 2023. [Online]. Available: https://eric.ed.gov/?id=ED352044
[24] C. Kelly, J. Miller, A. Redlich, and S. Kleinman, “A Taxonomy of Interrogation Methods,”