Abstract: Granting agencies invest millions of dollars in the generation and analysis of data, making these products extremely valuable. However, without sufficient annotation of the methods used to collect and analyze the data, the ability to reproduce and reuse those products suffers. This lack of assurance of the quality and credibility of the data at the different stages of the research process wastes much of the investment of time and funding and fails to drive research forward to the level that would be possible if everything were effectively annotated and disseminated to the wider research community. To address this issue for the Hawai’i Established Program to Stimulate Competitive Research (EPSCoR) project, a water science gateway was developed at the University of Hawai‘i (UH), called the ‘Ike Wai Gateway. In Hawaiian, ‘Ike means knowledge and Wai means water. The gateway supports research in hydrology and water management by providing tools to address questions of water sustainability in Hawai‘i. The gateway provides a framework for data acquisition, analysis, model integration, and display of data products. It is intended to complement and integrate with the capabilities of the Consortium of Universities for the Advancement of Hydrologic Science’s (CUAHSI) HydroShare by providing sound data and metadata management capabilities for multi-domain field observations, analytical lab actions, and modeling outputs. Functionality provided by the gateway is supported by a subset of CUAHSI’s Observations Data Model (ODM), delivered as centralized web-based user interfaces and APIs supporting multi-domain data management, computation, analysis, and visualization tools to support reproducible science, modeling, data discovery, and decision support for the Hawai’i EPSCoR ‘Ike Wai research team and the wider Hawai‘i hydrology community. By leveraging the Tapis platform, UH has constructed a gateway that ties data and advanced computing resources together to support diverse research domains including microbiology, geochemistry, geophysics, economics, and humanities, coupled with computational and modeling workflows delivered in a user-friendly web interface with workflows for effectively annotating the project data and products. Disseminating results for the ‘Ike Wai project through the ‘Ike Wai data gateway and HydroShare makes the research products accessible and reusable.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This database presents the morphological annotation of Slovenian adjectives. It includes the 6,000 most frequent adjectives in Slovenian, extracted from the Gigafida 2.0 corpus (deduplicated) using the CQL [tag="P.*"] on a random sample of 10,000,000 lines in the NoSketch engine in March 2024.
Among the adjectives on the list, there are some homophonous items, and given that the corpus is not annotated for meaning, homophonous adjectives were counted as a single item. For example, premočen can either mean ‘soaked’ (from the verb premočiti ‘to drench’) or ‘too strong’ (from močen ‘strong’). The annotator decided which meaning they perceived as more salient and annotated the item as that specific adjective.
Proper names were not annotated as morphologically complex. For example, the possessive Gregorinov ‘Gregorin’s’ (Gregorin is a last name) is only marked for the possessive suffix -ov, even if the last name itself is probably decomposable to Gregor+in.
Column-by-column overview
We start by listing the columns in the database and showing what property is annotated in each of them.
Column A: ID
Adjectives are annotated with consecutive numbers, and this column contains a unique number assigned to each adjective.
Column B: Adjective
This column lists the citation form (lemma) of each adjective.
Column C: Frequency
This column provides the frequency of each individual adjective's lemma.
Column D: Included
This column distinguishes items we consider actual adjectives in the relevant sense from all other items. Words marked with 1 are included in the annotation, while items marked with a 0 are excluded. The reasons for exclusion are:
the item not being an adjective,
the item being misspelled, and
the item being a proper name or a part of a proper name.
Columns E–K: Suffix 1 to Suffix 7
These columns list the specific suffixes contained in each adjective. Suffix 1 is the one closest to the root, followed by Suffix 2, and so on.
The aim was to pursue maximal decomposition. Therefore, for instance, the possessive adjectival pronoun svoj ‘own’ was decomposed into s-v-oj, based on its relation to s-eb-e ‘oneself’ as well as to m-oj ‘my’ and t-v-oj ‘your’.
See Appendix for the specific decisions regarding the annotation.
Column L: Ending
If the adjective has a phonologically overt inflectional ending (e.g., slovensk-i ‘Slovenian’), this ending is listed in this column.
Column M: Prefixes
If the adjective has a prefix, the prefix is listed in this column.
Prefixes in loanwords are annotated in this column if the version without the prefix (or with some other prefix) also exists in Slovenian. E.g., iracionalen ‘irrational’ is annotated as having the prefix i- because racionalen ‘rational’ also exists. On the other hand, dis- in diskonten ‘discount’ is not given in this column, because *konten does not exist in Slovenian.
If the adjective has several prefixes, these are listed in the column and are separated by a plus sign. The rightmost prefix is the one closest to the root/base.
If the prefix is marked with an asterisk, this prefix modifies an existing adjective. In other cases, the prefix is a part of a non-adjectival base that got adjectivised. Compare:
predolg ‘too long’: prefix pre* (dolg ‘long’ is an adjective)
preminul ‘dead’: prefix marked as pre (the adjective is derived from the verb preminiti ‘to die’)
zavezniški ‘ally’: prefix marked as za (the adjective is derived from the noun zaveznik ‘ally’).
Items that could be taken to contain a prefix, but for which an unprefixed version of the base (or a version with a different prefix) is not attested, are given in brackets. For instance, zanikrn ‘sloppy’ has (za) in this column, since the annotator has the intuition that za is a prefix in this word, but *nikrn is not attested.
If it was unclear whether the item in question was a single prefix or could be further decomposed, a potential decomposition is provided. One such example is izpodbijan ‘contentious’ where prefixes iz- and pod- also exist (as do prepositions iz, pod and izpod), which is why prefixes iz+pod were annotated.
Column N: Non-derived adjective
Adjectives that are taken to be non-derived (i.e., in cases where we have no arguments to assume they are morphologically complex) get a 1 in this column (if not, they are assigned a 0). For instance, bled ‘pale’ has a 1 in this column, whereas mandlj-ev ‘made out of almonds’ has a 0.
Column O: Zero
Adjectives that contain a base from a different category or a compound base, but do not include an overt adjectivising morpheme, are assigned a 1 in this column (if not, they are assigned a 0). An example is drag-o-cen-Ø lit. expensive-o-price ‘invaluable’.
Column P: Compound base
Adjectives that have a compound base get a 1 in this column (if not, they are assigned a 0).
If an adjective is annotated with a 1, the right component of the compound is decomposed for suffixes only. For instance, drug-o-uvrščen ‘runner-up’ (literally second-o-classified) has the prefix u- in the right part, but this is not annotated separately.
Loan adjectives are marked as having a compound base if the components of the base are used in other contexts in Slovenian. E.g., the base of radiološki ‘radiological’ is radiolog, which contains radio, used as an independent word meaning ‘radio’ and -log, also attested in, e.g., psiholog ‘psychologist’, arheolog ‘archeologist’.
Finally, if an item is marked as a compound, it is not also marked as participial, even if it contains a deverbal participle. A case in point is drug-o-uvrščen ‘runner-up’, which contains the passive participle of the verb uvrstiti ‘classify’.
Column R: PTCP
If the adjective is a passive or active participle, it is assigned a 1 in this column. If not, it is assigned a 0.
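For readers who want to work with the database programmatically, here is a minimal sketch in Python/pandas; the file name is hypothetical, and it assumes the spreadsheet headers match the column labels described above.

import pandas as pd

# Hypothetical file name; headers are assumed to match the labels above.
df = pd.read_excel("slovenian_adjectives.xlsx")

# Keep only items marked as actual adjectives (Column D: Included).
adjectives = df[df["Included"] == 1]

# Pool the annotated suffixes (Columns E-K: Suffix 1 ... Suffix 7) into one frequency table.
suffix_cols = [f"Suffix {i}" for i in range(1, 8)]
suffix_counts = adjectives[suffix_cols].stack().value_counts()
print(suffix_counts.head(10))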
Appendix: Specific decisions for the annotation of suffixes in columns E–K
The general criterion for annotating an element as a suffix was its occurrence in multiple adjectives and/or in combination with other suffixes. Crucially, this means that we also attempted to decompose elements sometimes considered a single suffix. For example, -kast in siv-kast ‘gray-ish’ was annotated as siv+k+ast, since both -k and -ast are independently attested suffixes (kič-ast ‘kitsch-y’, ljub-(e)k ‘cute’).
Especially in the domain of borrowed words, in some cases, it was impossible to reconstruct the underlying representation of suffixes that only appear before palatalising suffixes. For instance, in sarkastičen ‘sarcastic’, the sequence -ič- can, in principle, be underlyingly -ik-, -ic-, or -ič-, as all these underlying representations could lead to the surface allomorph -ič-. In such cases, we opted for analogy with comparable words whose intermediate bases do surface as independent words. In this case, an analogy can be made with words like logističen ‘logistic’, with the base logistika ‘logistics’. As a consequence, sarkastičen was annotated as sarkast+ik+n.
Some nominal bases display so-called stem extensions, which occur throughout the paradigm of the noun (e.g., vrem-e ‘weather’ has the genitive singular vrem-en-a, dative singular vrem-en-u, etc.). Stem extensions like en were not annotated as derivational suffixes, so that, e.g., vrem-en-sk-i is annotated as having only the suffix sk.
Similarly, many nouns ending in -r in the nominative singular get an extra -j in other forms in the paradigm. Because -j is present in the declension of the noun, it was not annotated as a suffix. E.g., krompir ‘potato’ has the genitive singular krompir-j-a. The related adjective krompir-j-ev ‘related to potato’ is therefore annotated as having only the suffix ev.
Unfortunately, no README file was found for the datano extension, limiting the ability to provide a detailed and comprehensive description. Therefore, the following description is based on the extension name and general assumptions about data annotation tools within the CKAN ecosystem.
The datano extension for CKAN, presumably short for "data annotation," likely aims to enhance datasets with annotations, metadata enrichment, and quality control features directly within the CKAN environment. It potentially introduces functionality for adding textual descriptions, classifications, or other forms of annotation to datasets to improve their discoverability, usability, and overall value. This extension could provide an interface for users to collaboratively annotate data, thereby enriching dataset descriptions and making the data more useful for various purposes.
Key Features (Assumed):
- Dataset Annotation Interface: Provides a user-friendly interface within CKAN for adding structured or unstructured annotations to datasets and associated resources. This allows for a richer understanding of the data's content, purpose, and usage.
- Collaborative Annotation: Supports multiple users collaboratively annotating datasets, fostering knowledge sharing and collective understanding of the data.
- Annotation Versioning: Maintains a history of annotations, enabling users to track changes and revert to previous versions if necessary.
- Annotation Search: Allows users to search for datasets based on annotations, enabling quick discovery of relevant data based on specific criteria.
- Metadata Enrichment: Integrates annotations with existing metadata, enhancing metadata schemas to support more detailed descriptions and contextual information.
- Quality Control Features: Includes options to rate, validate, or flag annotations to ensure they are accurate and relevant, improving overall data quality.
Use Cases (Assumed):
1. Data Discovery Improvement: Enables users to find specific datasets more easily by searching for datasets based on their annotations and enriched metadata.
2. Data Quality Enhancement: Allows data curators to improve the quality of datasets by adding annotations that clarify the data's meaning, provenance, and limitations.
3. Collaborative Data Projects: Facilitates collaborative data annotation efforts, wherein multiple users contribute to the enrichment of datasets with their knowledge and insights.
Technical Integration (Assumed): The datano extension would likely integrate with CKAN's existing plugin framework, adding new UI elements for annotation management and search. It could leverage CKAN's API for programmatic access to annotations and utilize CKAN's security model for managing access permissions.
Benefits & Impact (Assumed): By implementing the datano extension, CKAN users can leverage improvements to data discoverability, quality, and collaborative potential. The enhancement can help data curators refine the understanding and management of data, making it easier to search, understand, and promote data-driven decision-making.
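Since no README is available, nothing below reflects the extension's real API; it is only a sketch of how an annotation could be attached to a dataset through CKAN's standard action API (here via the ckanapi client), which an extension like datano would presumably build on. The instance URL, API key, dataset id, and extra key are all placeholders.

from ckanapi import RemoteCKAN

# Placeholder CKAN instance and API key.
ckan = RemoteCKAN("https://demo.ckan.org", apikey="YOUR-API-KEY")

# Store a free-text annotation on a dataset as an "extra" key/value pair.
ckan.action.package_patch(
    id="example-dataset",  # placeholder dataset id
    extras=[{"key": "annotation:quality-note",
             "value": "Column 'age' contains imputed values for 2019."}],
)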
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Description
Dataset Summary
These are the annotations made by a team of experts on the speakers with more than 1200 seconds recorded in the Catalan set of the Common Voice dataset (v13).
The annotators were initially tasked with evaluating all recordings associated with the same individual. Following that, they were instructed to annotate the speaker's accent, gender, and the overall quality of the recordings.
The accents and genders taken into account are the ones used until version 8 of the Common Voice corpus.
See annotations for more details.
Supported Tasks and Leaderboards
Gender classification, Accent classification.
Languages
The dataset is in Catalan (ca).
Dataset Structure
Instances
Two xlsx documents are published, one for each round of annotations.
The following information is available in each of the documents:
{
  'speaker ID': '1b7fc0c4e437188bdf1b03ed21d45b780b525fd0dc3900b9759d0755e34bc25e31d64e69c5bd547ed0eda67d104fc0d658b8ec78277810830167c53ef8ced24b',
  'idx': '31',
  'same speaker': {'AN1': 'SI', 'AN2': 'SI', 'AN3': 'SI', 'agreed': 'SI', 'percentage': '100'},
  'gender': {'AN1': 'H', 'AN2': 'H', 'AN3': 'H', 'agreed': 'H', 'percentage': '100'},
  'accent': {'AN1': 'Central', 'AN2': 'Central', 'AN3': 'Central', 'agreed': 'Central', 'percentage': '100'},
  'audio quality': {'AN1': '4.0', 'AN2': '3.0', 'AN3': '3.0', 'agreed': '3.0', 'percentage': '66', 'mean quality': '3.33', 'stdev quality': '0.58'},
  'comments': {'AN1': '', 'AN2': 'pujades i baixades de volum', 'AN3': "Deu ser d'alguna zona de transició amb el central, perquè no fa una reducció total vocàlica, però hi té molta tendència"}
}
We also publish the document Guia anotació parlants.pdf, with the guidelines the annotators received.
Data Fields
speaker ID (string): The ID of the client (voice) that made the recording in the Common Voice corpus
idx (int): Id in this corpus
AN1 (string): Annotations from Annotator 1
AN2 (string): Annotations from Annotator 2
AN3 (string): Annotations from Annotator 3
agreed (string): Annotation from the majority of the annotators
percentage (int): Percentage of annotators that agree with the agreed annotation
mean quality (float): Mean of the quality annotation
stdev quality (float): Standard deviation of the quality annotations
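As an illustration of how the agreed and percentage fields relate to the three annotator columns, here is a minimal sketch; the helper function is hypothetical and not part of the dataset release.

from collections import Counter

def majority(an1, an2, an3):
    # Return the label chosen by most annotators and the share of annotators choosing it.
    votes = Counter([an1, an2, an3])
    label, count = votes.most_common(1)[0]
    return label, 100 * count // 3  # truncated, matching e.g. the 66 in the example instance above

print(majority("Central", "Central", "Nord-occidental"))  # ('Central', 66)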
Data Splits
The corpus is not divided into splits, as it is not intended for training models.
Dataset Creation
Curation Rationale
During 2022, a campaign was launched to promote the Common Voice corpus within the Catalan-speaking community, achieving remarkable success. However, not all participants provided their demographic details such as age, gender, and accent. Additionally, some individuals faced difficulty in self-defining their accent using the standard classifications established by specialists.
In order to obtain a balanced corpus with reliable information, we saw the necessity of enlisting a group of experts from the University of Barcelona to provide accurate annotations.
We release the complete annotations because transparency is fundamental to our project. Furthermore, we believe they hold philological value for studying dialectal and gender variants.
Source Data
The original data comes from the Catalan sentences of the Common Voice corpus.
Initial Data Collection and Normalization
We have selected speakers who have recorded more than 1200 seconds of speech in the Catalan set of the version 13 of the Common Voice corpus.
Who are the source language producers?
The original data comes from the Catalan sentences of the Common Voice corpus.
Annotations
Annotation process
Starting with version 13 of the Common Voice corpus we identified the speakers (273) who have recorded more than 1200 seconds of speech.
A team of three annotators was tasked with annotating:
if all the recordings correspond to the same person
the gender of the speaker
the accent of the speaker
the quality of the recording
They conducted an initial round of annotation, discussed their varying opinions, and subsequently conducted a second round.
We release the complete annotations because transparency is fundamental to our project. Furthermore, we believe they hold philological value for studying dialectal and gender variants.
Who are the annotators?
The annotation was entrusted to the CLiC (Centre de Llenguatge i Computació) team from the University of Barcelona. They selected a group of three annotators (two men and one woman), who received a scholarship to do this work.
The annotation team was composed of:
Annotator 1: 1 female annotator, aged 18-25, L1 Catalan, student in the Modern Languages and Literatures degree, with a focus on Catalan.
Annotators 2 & 3: 2 male annotators, aged 18-25, L1 Catalan, students in the Catalan Philology degree.
1 female supervisor, aged 40-50, L1 Catalan, graduate in Physics and in Linguistics, Ph.D. in Signal Theory and Communications.
To do the annotation, they used a Google Drive spreadsheet.
Personal and Sensitive Information
The Common Voice dataset consists of people who have donated their voice online. We don't share here their voices, but their gender and accent. You agree to not attempt to determine the identity of speakers in the Common Voice dataset.
Considerations for Using the Data
Social Impact of Dataset
The IDs come from the Common Voice dataset, which consists of people who have donated their voices online.
You agree to not attempt to determine the identity of speakers in the Common Voice dataset.
The information from this corpus will allow us to train and evaluate well balanced Catalan ASR models. Furthermore, we believe they hold philological value for studying dialectal and gender variants.
Discussion of Biases
Most of the voices in the Catalan Common Voice correspond to men between 40 and 60 years old with a central accent. The aim of this dataset is to provide information that helps minimize the biases this could cause.
For the gender annotation, we have only considered "H" (male) and "D" (female).
Other Known Limitations
[N/A]
Additional Information
Dataset Curators
Language Technologies Unit at the Barcelona Supercomputing Center (langtech@bsc.es)
This work has been promoted and financed by the Generalitat de Catalunya through the Aina project.
Licensing Information
This dataset is licensed under a CC BY 4.0 license.
It can be used for any purpose, whether academic or commercial, under the terms of the license. Give appropriate credit, provide a link to the license, and indicate if changes were made.
Citation Information
DOI
Contributions
The annotation was entrusted to the STeL team from the University of Barcelona.
In this study, we explore to what extent language users agree about what kind of stances are expressed in natural language use or whether their interpretations diverge. In order to perform this task, a comprehensive cognitive-functional framework of ten stance categories was developed based on previous work on speaker stance in the literature. A corpus of opinionated texts, where speakers take stance and position themselves, was compiled: the Brexit Blog Corpus (BBC). An analytical interface for the annotations was set up and the data were annotated independently by two annotators. The annotation procedure, the annotation agreement and the co-occurrence of more than one stance category in the utterances are described and discussed. The careful, analytical annotation process has by and large returned satisfactory inter- and intra-annotation agreement scores, resulting in a gold standard corpus, the final version of the BBC.
Purpose: The aim of this study is to explore the possibility of identifying speaker stance in discourse, provide an analytical resource for it and an evaluation of the level of agreement across speakers in the area of stance-taking in discourse.
The BBC is a collection of texts from blog sources. The corpus texts are thematically related to the 2016 UK referendum concerning whether the UK should remain a member of the European Union or not. The texts were extracted from the Internet from June to August 2015. With the Gavagai API (https://developer.gavagai.se), the texts were detected using seed words, such as Brexit, EU referendum, pro-Europe, europhiles, eurosceptics, United States of Europe, David Cameron, or Downing Street. The retrieved URLs were filtered so that only entries described as blogs in English were selected. Each downloaded document was split into sentential utterances, from which 2,200 utterances were randomly selected as the analysis data set. The final size of the corpus is 1,682 utterances, 35,492 words (169,762 characters without spaces). Each utterance contains from 3 to 40 words, with a mean length of 21 words.
For the data annotation process, the Active Learning and Visual Analytics (ALVA) system (https://doi.org/10.1145/3132169 and https://doi.org/10.2312/eurp.20161139) was used. Two annotators, one a professional translator with a Licentiate degree in English Linguistics and the other with a PhD in Computational Linguistics, carried out the annotations independently of one another.
The data set can be downloaded in two different formats: a standard Microsoft Excel format and a raw data format (ZIP archive) which can be useful for analytical and machine learning purposes, for example, with the Python library scikit-learn. The Excel file includes one additional variable (utterance word length). The ZIP archive contains a set of directories (e.g., "contrariety" and "prediction") corresponding to the stance categories. Inside of each such directory, there are two nested directories corresponding to annotations which assign or do not assign the respective category to utterances (e.g., inside the top-level category "prediction" there are two directories: "prediction", with utterances which were labeled with this category, and "no", with the rest of the utterances). Inside of the nested directories, there are textual files containing individual utterances.
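As a minimal sketch of how the ZIP layout can be consumed (assuming the archive has been extracted to a local directory named bbc_raw, which is a hypothetical path), scikit-learn's load_files reads the per-category directories directly:

from sklearn.datasets import load_files

# Each top-level category directory (e.g. "prediction") contains a positive
# subdirectory and a "no" subdirectory with one utterance per text file.
prediction = load_files("bbc_raw/prediction", encoding="utf-8")
print(len(prediction.data), "utterances")
print(prediction.target_names)  # e.g. ['no', 'prediction']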
When using data from this study, the primary researcher wishes citation also to be made to the publication: Vasiliki Simaki, Carita Paradis, Maria Skeppstedt, Magnus Sahlgren, Kostiantyn Kucher, and Andreas Kerren. Annotating speaker stance in discourse: the Brexit Blog Corpus. In Corpus Linguistics and Linguistic Theory, 2017. De Gruyter, published electronically before print. https://doi.org/10.1515/cllt-2016-0060
Dataset Title: A Gold Standard Corpus for Activity Information (GoSCAI)
Dataset Curators: The Epidemiology & Biostatistics Section of the NIH Clinical Center Rehabilitation Medicine Department
Dataset Version: 1.0 (May 16, 2025)
Dataset Citation and DOI: NIH CC RMD Epidemiology & Biostatistics Section. (2025). A Gold Standard Corpus for Activity Information (GoSCAI) [Data set]. Zenodo. doi: 10.5281/zenodo.15528545
This data statement is for a gold standard corpus of de-identified clinical notes that have been annotated for human functioning information based on the framework of the WHO's International Classification of Functioning, Disability and Health (ICF). The corpus includes 484 notes from a single institution within the United States written in English in a clinical setting. This dataset was curated for the purpose of training natural language processing models to automatically identify, extract, and classify information on human functioning at the whole-person, or activity, level.
This dataset is curated to be a publicly available resource for the development and evaluation of methods for the automatic extraction and classification of activity-level functioning information as defined in the ICF. The goals of data curation are to 1) create a corpus of a size that can be manually deidentified and annotated, 2) maximize the density and diversity of functioning information of interest, and 3) allow public dissemination of the data.
Language Region: en-US
Prose Description: English as written by native and bilingual English speakers in a clinical setting
The language users represented in this dataset are medical and clinical professionals who work in a research hospital setting. These individuals hold professional degrees corresponding to their respective specialties. Specific demographic characteristics of the language users such as age, gender, or race/ethnicity were not collected.
The annotator group consisted of five people, 33 to 76 years old, including four females and one male. Socioeconomically, they came from the middle and upper-middle income classes. Regarding first language, three annotators had English as their first language, one had Chinese, and one had Spanish. Proficiency in English, the language of the data being annotated, was native for three of the annotators and bilingual for the other two. The annotation team included clinical rehabilitation domain experts with backgrounds in occupational therapy, physical therapy, and individuals with public health and data science expertise. Prior to annotation, all annotators were trained on the specific annotation process using established guidelines for the given domain, and annotators were required to achieve a specified proficiency level prior to annotating notes in this corpus.
The notes in the dataset were written as part of clinical care within a U.S. research hospital between May 2008 and November 2019. These notes were written by health professionals asynchronously following the patient encounter to document the interaction and support continuity of care. The intended audience of these notes were clinicians involved in the patients' care. The included notes come from nine disciplines - neuropsychology, occupational therapy, physical medicine (physiatry), physical therapy, psychiatry, recreational therapy, social work, speech language pathology, and vocational rehabilitation. The notes were curated to support research on natural language processing for functioning information between 2018 and 2024.
The final corpus was derived from a set of clinical notes extracted from the hospital electronic medical record (EMR) for the purpose of clinical research. The original data consist of character-based digital content. We work in ASCII or Unicode encoding, and therefore part of our pre-processing includes running encoding detection and converting from encodings such as Windows-1252 or ISO-8859 to our preferred format.
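The exact pre-processing pipeline is not distributed with the corpus; the following is only an illustrative sketch of that kind of encoding normalization, using the chardet package.

import chardet

def to_utf8(path):
    # Detect the source encoding (e.g. Windows-1252 or ISO-8859-1) and rewrite the file as UTF-8.
    raw = open(path, "rb").read()
    guess = chardet.detect(raw)
    text = raw.decode(guess["encoding"] or "utf-8", errors="replace")
    with open(path, "w", encoding="utf-8") as out:
        out.write(text)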
On the larger corpus, we applied sampling to match our curation rationale. Given the resource constraints of manual annotation, we set out to create a dataset of 500 clinical notes, which would exclude notes over 10,000 characters in length.
To promote density and diversity, we used five note characteristics as sampling criteria. We used the text length as expressed in number of characters. Next, we considered the discipline group, which is derived from note type metadata and describes which discipline a note originated from: occupational and vocational therapy (OT/VOC), physical therapy (PT), recreation therapy (RT), speech and language pathology (SLP), social work (SW), or miscellaneous (MISC, including psychiatry, neurology and physiatry). These disciplines were selected for collecting the larger corpus because their notes are likely to include functioning information. Existing information extraction tools were used to obtain annotation counts in four areas of functioning and provided a note’s annotation count, annotation density (annotation count divided by text length), and domain count (number of domains with at least 1 annotation).
We used stratified sampling across the 6 discipline groups to ensure discipline diversity in the corpus. Because of low availability, 50 notes were sampled from SLP with relaxed criteria, and 90 notes each from the 5 other discipline groups with stricter criteria. Sampled SLP notes were those with the highest annotation density that had an annotation count of at least 5 and a domain count of at least 2. Other notes were sampled by highest annotation count and lowest text length, with a minimum annotation count of 15 and minimum domain count of 3.
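Schematically, the stratified sampling described above could look like the sketch below; the metadata file and column names are hypothetical, while the thresholds follow the text.

import pandas as pd

notes = pd.read_csv("note_metadata.csv")  # hypothetical per-note metadata table

def top_notes(group, n, min_count, min_domains):
    # Stricter criteria: highest annotation count, then shortest text.
    eligible = group[(group.annotation_count >= min_count) & (group.domain_count >= min_domains)]
    return eligible.sort_values(["annotation_count", "text_length"],
                                ascending=[False, True]).head(n)

parts = []
for name, group in notes.groupby("discipline_group"):
    if name == "SLP":
        # Relaxed criteria for SLP: sample by annotation density.
        eligible = group[(group.annotation_count >= 5) & (group.domain_count >= 2)]
        parts.append(eligible.sort_values("annotation_density", ascending=False).head(50))
    else:
        parts.append(top_notes(group, 90, 15, 3))

corpus_sample = pd.concat(parts)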
The notes in the resulting sample included certain types of PHI and PII. To prepare for public dissemination, all sensitive or potentially identifying information was manually annotated in the notes and replaced with substituted content to ensure readability and enough context needed for machine learning without exposing any sensitive information. This de-identification effort was manually reviewed to ensure no PII or PHI exposure and correct any resulting readability issues. Notes about pediatric patients were excluded. No intent was made to sample multiple notes from the same patient. No metadata is provided to group notes other than by note type, discipline, or discipline group. The dataset is not organized beyond the provided metadata, but publications about models trained on this dataset should include information on the train/test splits used.
All notes were sentence-segmented and tokenized using the spaCy en_core_web_lg model with additional rules for sentence segmentation customized to the dataset. Notes are stored in an XML format readable by the GATE annotation software (https://gate.ac.uk/family/developer.html), which stores annotations separately in annotation sets.
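The customized segmentation rules are specific to the project, but the general pattern in spaCy looks like the sketch below; the line-break rule shown is an invented example, not the project's actual rule set.

import spacy
from spacy.language import Language

@Language.component("custom_sentence_boundaries")
def custom_sentence_boundaries(doc):
    # Example rule: start a new sentence after any token containing a line break.
    for token in doc[:-1]:
        if "\n" in token.text:
            doc[token.i + 1].is_sent_start = True
    return doc

nlp = spacy.load("en_core_web_lg")
nlp.add_pipe("custom_sentence_boundaries", before="parser")

doc = nlp("Pt ambulates 50 ft with rolling walker.\nRequires min assist for transfers.")
for sent in doc.sents:
    print(sent.text)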
As the clinical notes were extracted directly from the EMR in text format, the capture quality was determined to be high. The clinical notes did not have to be converted from other data formats, which means this dataset is free from noise introduced by conversion processes such as optical character recognition.
Because of the effort required to manually deidentify and annotate notes, this corpus is limited in terms of size and representation. The curation decisions skewed note selection towards specific disciplines and note types to increase the likelihood of encountering information on functioning. Some subtypes of functioning occur infrequently in the data, or not at all. The deidentification of notes was done in a manner to preserve natural language as it would occur in the notes, but some information is lost, e.g. on rare diseases.
Information on the manual annotation process is provided in the annotation guidelines for each of the four domains:
- Communication & Cognition (https://zenodo.org/records/13910167)
- Mobility (https://zenodo.org/records/11074838)
- Self-Care & Domestic Life (SCDL) (https://zenodo.org/records/11210183)
- Interpersonal Interactions & Relationships (IPIR) (https://zenodo.org/records/13774684)
Inter-annotator agreement was established on development datasets described in the annotation guidelines prior to the annotation of this gold standard corpus.
The gold standard corpus consists of 484 documents, which include 35,147 sentences in total. The distribution of annotated information is provided in the table below.
Domain | Number of Annotated Sentences | % of All Sentences | Mean Number of Annotated Sentences per Document
Communication & Cognition | 6033 | 17.2% |
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
DESCRIPTION
For this task, we use a subset of the MIRFLICKR (http://mirflickr.liacs.nl) collection. The entire collection contains 1 million images from the social photo sharing website Flickr and was formed by downloading up to a thousand photos per day that were deemed to be the most interesting according to Flickr. All photos in this collection were released by their users under a Creative Commons license, allowing them to be freely used for research purposes. Of the entire collection, 25 thousand images were manually annotated with a limited number of concepts and many of these annotations have been further refined and expanded over the lifetime of the ImageCLEF photo annotation task. This year we used crowdsourcing to annotate all of these 25 thousand images with the concepts.
On this page we provide you with more information about the textual features, visual features and concept features we supply with each image in the collection we use for this year's task.
TEXTUAL FEATURES
All images are accompanied by the following textual features:
- Flickr user tags
These are the tags that the users assigned to the photos they uploaded to Flickr. The 'raw' tags are the original tags, while the 'clean' tags are those collapsed to lowercase and condensed to remove spaces.
- EXIF metadata
If available, the EXIF metadata contains information about the camera that took the photo and the parameters used. The 'raw' exif is the original camera data, while the 'clean' exif reduces the verbosity.
- User information and Creative Commons license information
This contains information about the user that took the photo and the license associated with it.
VISUAL FEATURES
Over the previous years of the photo annotation task we noticed that participants often use the same types of visual features; in particular, features based on interest points and bag-of-words are popular. To assist you, we have extracted several features that you may want to use, so you can focus on concept detection instead. We additionally give you some pointers to easy-to-use toolkits that will help you extract other features, or the same features with different default settings.
- SIFT, C-SIFT, RGB-SIFT, OPPONENT-SIFT
We used the ISIS Color Descriptors (http://www.colordescriptors.com) toolkit to extract these descriptors. This package provides you with many different types of features based on interest points, mostly using SIFT. It furthermore assists you with building codebooks for bag-of-words. The toolkit is available for Windows, Linux and Mac OS X.
- SURF
We used the OpenSURF (http://www.chrisevansdev.com/computer-vision-opensurf.html) toolkit to extract this descriptor. The open source code is available in C++, C#, Java and many more languages.
- TOP-SURF
We used the TOP-SURF (http://press.liacs.nl/researchdownloads/topsurf) toolkit to extract this descriptor, which represents images with SURF-based bag-of-words. The website provides codebooks of several different sizes that were created using a combination of images from the MIR-FLICKR collection and from the internet. The toolkit also offers the ability to create custom codebooks from your own image collection. The code is open source, written in C++ and available for Windows, Linux and Mac OS X.
- GIST
We used the LabelMe (http://labelme.csail.mit.edu) toolkit to extract this descriptor. The MATLAB-based library offers a comprehensive set of tools for annotating images.
For the interest point-based features above we used a Fast Hessian-based technique to detect the interest points in each image. This detector is built into the OpenSURF library. In comparison with the Hessian-Laplace technique built into the ColorDescriptors toolkit it detects fewer points, resulting in a considerably reduced memory footprint. We therefore also provide you with the interest point locations in each image that the Fast Hessian-based technique detected, so when you would like to recalculate some features you can use them as a starting point for the extraction. The ColorDescriptors toolkit for instance accepts these locations as a separate parameter. Please go to http://www.imageclef.org/2012/photo-flickr/descriptors for more information on the file format of the visual features and how you can extract them yourself if you want to change the default settings.
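If you want to recompute descriptors yourself rather than rely on the precomputed ones, a library such as OpenCV (not one of the toolkits listed above) gives a comparable starting point of interest points plus SIFT descriptors, ready for bag-of-words quantization; the image file name is a placeholder.

import cv2

img = cv2.imread("im1.jpg", cv2.IMREAD_GRAYSCALE)  # placeholder image file
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(img, None)
print(len(keypoints), "interest points;", descriptors.shape, "descriptor matrix")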
CONCEPT FEATURES
We have solicited the help of workers on the Amazon Mechanical Turk platform to perform the concept annotation for us. To ensure a high standard of annotation we used the CrowdFlower platform, which acts as a quality control layer by removing the judgments of workers that fail to annotate properly. We reused several concepts of last year's task, and for most of these we annotated the remaining photos of the MIRFLICKR-25K collection that had not been used in the previous task; for some concepts we reannotated all 25,000 images to boost their quality. For the new concepts we naturally had to annotate all of the images.
- Concepts
For each concept we indicate in which images it is present. The 'raw' concepts contain the judgments of all annotators for each image, where a '1' means an annotator indicated the concept was present whereas a '0' means the concept was not present, while the 'clean' concepts only contain the images for which the majority of annotators indicated the concept was present. Some images in the raw data for which we reused last year's annotations only have one judgment for a concept, whereas the other images have between three and five judgments; the single judgment does not mean only one annotator looked at it, as it is the result of a majority vote amongst last year's annotators.
- Annotations
For each image we indicate which concepts are present, so this is the reverse version of the data above. The 'raw' annotations contain the average agreement of the annotators on the presence of each concept, while the 'clean' annotations only include those for which there was a majority agreement amongst the annotators.
You will notice that the annotations are not perfect. Especially when the concepts are more subjective or abstract, the annotators tend to disagree more with each other. The raw versions of the concept annotations should help you get an understanding of the exact judgments given by the annotators.
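The majority-vote relation between the 'raw' and 'clean' annotations can be reproduced along the lines of the sketch below; the file name and tabular layout (rows = images, columns = annotators, values = 0/1 judgments for one concept) are assumptions, not the actual distribution format.

import pandas as pd

raw = pd.read_csv("concept_raw.csv", index_col="image_id")  # hypothetical layout

votes = raw.sum(axis=1)
n_judgments = raw.notna().sum(axis=1)   # between one and five judgments per image
clean = votes > n_judgments / 2         # present only if a strict majority agreed
print(int(clean.sum()), "images with the concept after majority voting")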
The HED Language library schema (HED LANG Schema) contains vocabulary for annotating language experiments in cognitive science. The schema allows for detailed annotation of neuroimaging experiments that involve language events. It is suitable for experiments using carefully controlled experiment stimuli to address specific questions in the domain of language processing, and for experiments using complex naturalistic paradigms involving written or spoken language. HED LANG allows for annotation of language stimuli on different levels through the orthogonal definition of Language-items and Language-item-properties. Full sentences can be annotated with sentence-level characteristics, and individual words can be associated with word-level characteristics. Annotation possibilities are extensive and cover characteristics found across languages to allow for comparisons between languages. The current release of the schema is primarily centered around written language and morphosyntactic word properties. The schema is open to extension. HED LANG 1.0.0 is partnered with version 8.3.0 of the standard schema. As a result, annotators may use tags from both schemas without extra prefixing.
Example annotations
You can find several example annotations of recent work in psycholinguistics in our preprint. Additionally, we have added annotations to several datasets which are publicly available on OpenNeuro. Their annotated versions (and links to the originals) can be found here:
ds001894
ds002155
ds002382
ds003126
Viewing HED LANG
The HED LANG library schema can be viewed using the HED Schema Browser.
NOTE: This is a minor release of the HED LANG library schema. It includes changes to the schema and tags, as well as updates to the documentation. The changes are backward compatible with previous versions of the schema. The annotations have been moved to a schema attribute to allow programmatic linking to other ontologies. The mediawiki and XML formats are now completely equivalent to the tsv version. The schema is partnered with HED schema version 8.4.0.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Musical prosody is characterized by the acoustic variations that make music expressive. However, few systematic and scalable studies exist on the function it serves or on effective tools to carry out such studies. To address this gap, we introduce a novel approach to capturing information about prosodic functions through a citizen science paradigm. In typical bottom-up approaches to studying musical prosody, acoustic properties in performed music and basic musical structures such as accents and phrases are mapped to prosodic functions, namely segmentation and prominence. In contrast, our top-down, human-centered method puts listener annotations of musical prosodic functions first, to analyze the connection between these functions, the underlying musical structures, and acoustic properties. The method is applied primarily to exploring segmentation and prominence in performed solo piano music. These prosodic functions are marked by means of four annotation types—boundaries, regions, note groups, and comments—in the CosmoNote web-based citizen science platform, which presents the music signal or MIDI data and related acoustic features in information layers that can be toggled on and off. Various annotation strategies are discussed and appraised: intuitive vs. analytical; real-time vs. retrospective; and audio-based vs. visual. The end-to-end process of the data collection is described, from providing prosodic examples, to structuring and formatting the annotation data for analysis, to techniques for preventing precision errors. The aim is to obtain reliable and coherent annotations that can be applied to theoretical and data-driven models of musical prosody. The outcomes include a growing library of prosodic examples with the goal of achieving an annotation convention for studying musical prosody in performed music.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Data
The dataset consists of 5,538 images of public spaces, annotated with steps, stairs, ramps and grab bars for stairs and ramps. The dataset has 3,564 annotations of steps, 1,492 of stairs, 143 of ramps and 922 of grab bars.
Each step annotation is attributed with an estimate of the height of the step, falling into one of three categories: less than 3cm, 3cm to 7cm or more than 7cm. Additionally, it is attributed with a 'type', with the possibilities 'doorstep', 'curb' or 'other'.
Stair annotations are attributed with the number of steps in the stair.
Ramps are attributed with an estimate of their width, also falling into three categories: less than 50cm, 50cm to 100cm and more than 100cm.
In order to preserve all additional attributes of the labels, the data is published in the CVAT XML format for images.
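A minimal sketch for reading the CVAT "for images" XML and filtering on the step-height attribute is shown below; the element and attribute layout follows the standard CVAT export format, but the file name and the exact label/attribute strings ('step', 'height', 'less than 3cm') are assumptions to be checked against the published files.

import xml.etree.ElementTree as ET

root = ET.parse("annotations.xml").getroot()  # hypothetical file name

tall_steps = []
for image in root.iter("image"):
    for box in image.iter("box"):
        if box.get("label") != "step":
            continue
        attrs = {a.get("name"): a.text for a in box.iter("attribute")}
        if attrs.get("height") != "less than 3cm":  # keep only steps estimated at 3cm or more
            tall_steps.append((image.get("name"), box.get("xtl"), box.get("ytl"),
                               box.get("xbr"), box.get("ybr")))

print(len(tall_steps), "step boxes estimated at 3cm or higher")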
Annotating Process
The labelling has been done using bounding boxes around the objects. This format is compatible with many popular object detection models, e.g. the YOLO object detection model. A bounding box is placed so it contains exactly the visible part of the respective object. This implies that only objects that are visible in the photo are annotated. In particular, this means that in a photo of a stair or step taken from above, where the object cannot be seen, the object has not been annotated, even when a human viewer could infer that there is a stair or a step from other features in the photo.
Steps
A step is annotated when there is a vertical increment that functions as a passage between two surface areas intended for human or vehicle traffic. This means that we have not included:
In particular, the bounding box of a step object contains exactly the incremental part of the step, but does not extend into the top or bottom horizontal surface any more than necessary to enclose the incremental part entirely. This was chosen for consistency reasons, as including parts of the horizontal surfaces would imply a non-trivial choice of how much to include, which we deemed would most likely lead to more inconsistent annotations.
The heights of the steps are estimated by the annotators and are therefore not guaranteed to be accurate.
The type of a step typically falls into the category 'doorstep' or 'curb'. Steps that are in a doorway, entrance or the like are attributed as doorsteps. We also include in this category steps that immediately lead to a doorway, within a proximity of 1-2m. Steps between different types of pathways, e.g. between streets and sidewalks, are annotated as curbs. Any other type of step is annotated with 'other'. Many of the 'other' steps are, for example, steps to terraces.
Stairs
The stair label is used whenever two or more steps directly follow each other in a consistent pattern. All vertical increments are enclosed in the bounding box, as well as the intermediate surfaces of the steps. However, the top and bottom surfaces are not included more than necessary, for the same reason as for steps, as described in the previous section.
The annotator counts the number of steps and attributes this number to the stair object label.
Ramps
Ramps have been annotated when a sloped passageway has been placed or built to connect two surface areas intended for human or vehicle traffic. This implies the same considerations as with steps. Likewise, only the sloped part of a ramp is annotated, not including the bottom or top surface area.
For each ramp, the annotator makes an assessment of the width of the ramp in three categories: less than 50cm, 50cm to 100cm and more than 100cm. This parameter is visually hard to assess, and sometimes impossible due to the view of the ramp.
Grab Bars
Grab bars are annotated for hand rails and similar objects that are in direct connection to a stair or a ramp. While horizontal grab bars could also have been included, this was omitted due to the implied ambiguities with fences and similar objects. As the grab bar was originally intended as attribute information for stairs and ramps, we chose to keep this focus. The bounding box encloses the part of the grab bar that functions as a hand rail for the stair or ramp.
Usage
As is often the case when annotating data, much information depends on the subjective assessment of the annotator. As each data point in this dataset has been annotated by only one person, caution should be taken when the data is applied.
Generally speaking, the mindset and usage guiding the annotations have been wheelchair accessibility. While we have strived to annotate at an object level, hopefully making the data more widely applicable than this, we state this explicitly as it may have swayed non-trivial annotation choices.
The attribute data, such as step height or ramp width, are highly subjective estimates. We still provide this data to give a post-hoc method for adjusting which annotations to use. E.g., for some purposes one may be interested in detecting only steps that are indeed more than 3cm high. The attribute data make it possible to filter out the steps less than 3cm, so a machine learning algorithm can be trained on a dataset more appropriate for that use case. We stress, however, that one cannot expect to train accurate machine learning algorithms to infer the attribute data, as this is not accurate data in the first place.
We hope this dataset will be a useful building block in the endeavours for automating barrier detection and documentation.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Proteomics data-dependent acquisition data sets collected with high-resolution mass-spectrometry (MS) can achieve very high-quality results, but nearly every analysis yields results that are thresholded at some accepted false discovery rate, meaning that a substantial number of results are incorrect. For study conclusions that rely on a small number of peptide-spectrum matches being correct, it is thus important to examine at least some crucial spectra to ensure that they are not one of the incorrect identifications. We present Quetzal, a peptide fragment ion spectrum annotation tool to assist researchers in annotating and examining such spectra to ensure that they correctly support study conclusions. We describe how Quetzal annotates spectra using the new Human Proteome Organization (HUPO) Proteomics Standards Initiative (PSI) mzPAF standard for fragment ion peak annotation, including the Python-based code, a web-service end point that provides annotation services, and a web-based application for annotating spectra and producing publication-quality figures. We illustrate its functionality with several annotated spectra of varying complexity. Quetzal provides easily accessible functionality that can assist in the effort to ensure and demonstrate that crucial spectra support study conclusions. Quetzal is publicly available at https://proteomecentral.proteomexchange.org/quetzal/.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Disco-Annotation is a collection of training and test sets with manually annotated discourse relations for 8 discourse connectives in Europarl texts.
The 8 connectives with their annotated relations are:
although (contrast|concession)
as (prep|causal|temporal|comparison|concession)
however (contrast|concession)
meanwhile (contrast|temporal)
since (causal|temporal|temporal-causal)
though (contrast|concession)
while (contrast|concession|temporal|temporal-contrast|temporal-causal)
yet (adv|contrast|concession)
For each connective there is a training set and a test set. The relations were annotated by two trained annotators with a translation spotting method. The division into training and test sets also allows for comparison if you train your own models.
If you need software for the latter, have a look at: https://github.com/idiap/DiscoConn-Classifier
Citation
Please cite the following papers if you make use of these datasets (and to know more about the annotation method):
@INPROCEEDINGS{Popescu-Belis-LREC-2012,
  author = {Popescu-Belis, Andrei and Meyer, Thomas and Liyanapathirana, Jeevanthi and Cartoni, Bruno and Zufferey, Sandrine},
  title = {{D}iscourse-level {A}nnotation over {E}uroparl for {M}achine {T}ranslation: {C}onnectives and {P}ronouns},
  booktitle = {Proceedings of the eighth international conference on Language Resources and Evaluation ({LREC})},
  year = {2012},
  address = {Istanbul, Turkey}
}

@ARTICLE{Cartoni-DD-2013,
  author = {Cartoni, Bruno and Zufferey, Sandrine and Meyer, Thomas},
  title = {{Annotating the meaning of discourse connectives by looking at their translation: The translation-spotting technique}},
  journal = {Dialogue & Discourse},
  volume = {4},
  number = {2},
  pages = {65--86},
  year = {2013}
}

@ARTICLE{Meyer-TSLP-submitted,
  author = {Meyer, Thomas and Hajlaoui, Najeh and Popescu-Belis, Andrei},
  title = {{Disambiguating Discourse Connectives for Statistical Machine Translation in Several Languages}},
  journal = {IEEE/ACM Transactions of Audio, Speech, and Language Processing},
  year = {submitted}
}
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We provide text metadata, image frames, and thumbnails of YouTube videos classified as harmful or harmless by domain experts, GPT-4-Turbo, and crowdworkers. Harmful videos are categorized into one or more of six harm categories: Information harms (IH), Hate and Harassment harms (HH), Clickbait harms (CB), Addictive harms (ADD), Sexual harms (SXL), and Physical harms (PH).
This repository includes the text metadata and a link to external cloud storage for the image data.
Folder | Subfolder | #Videos
Ground Truth | Harmful_full_agreement (classified as harmful by all three actors) | 5,109
Ground Truth | Harmful_subset_agreement (classified as harmful by at least two of the three actors) | 14,019
Domain Experts | Harmful | 15,115
Domain Experts | Harmless | 3,303
GPT-4-Turbo | Harmful | 10,495
GPT-4-Turbo | Harmless | 7,818
Crowdworkers (Workers from Amazon Mechanical Turk) | Harmful | 12,668
Crowdworkers (Workers from Amazon Mechanical Turk) | Harmless | 4,390
Unannotated large pool | - | 60,906
For details about the harm classification taxonomy and the performance comparison between crowdworkers, GPT-4-Turbo, and domain experts, please see https://arxiv.org/abs/2411.05854.
Data consists of annotations of music in terms of moods music may express and activities that music might fit. The data structures are related to different kinds of annotation tasks, which addressed these questions: 1) annotations of 9 activities that fit a wide range of moods related to music, 2) nominations of music tracks that best fit a particular mood and annotation of the activities that fit them, and 3) annotations of these nominated tracks in terms of mood and activities. Users are anonymised, but the background information (gender, music preferences, age, etc.) is also available. The dataset consists of a relational database that is linked together by means of common ids (tracks, users, activities, moods, genres, expertise, language skill).
Current approaches to the tagging of music in online databases predominantly rely on music genre and artist name, with music tags being often ambiguous and inexact. Yet, possibly the most salient feature of musical experiences is emotion. The few attempts so far undertaken to tag music for mood or emotion lack a scientific foundation in emotion research. The current project proposes to incorporate recent research on music-evoked emotion into the growing number of online musical databases and catalogues, notably the Geneva Emotional Music Scale (GEMS) - a rating measure for describing emotional effects of music recently developed by our group. Specifically, the aim here is to develop the GEMS into an innovative conceptual and technical tool for tagging of online musical content for emotion.
To this end, three studies are proposed. In study 1, we will examine whether the GEMS labels and their grouping hold up against a much wider range of musical genres than those that were originally used for its development. In Study 2, we will use advanced data reduction techniques to select the most recurrent and important labels for describing music-evoked emotion. In a third study we will examine the added benefit of the new GEMS compared to conventional approaches to the tagging of music.
The anticipated impact of the findings is threefold. First, the research to be described next will advance our understanding of the nature and structure of emotions evoked by music. Developing a valid model of music-evoked emotion is crucial for meaningful research in the social sciences and in the neurosciences. Second, music information organization and retrieval can benefit from a scientifically sound and parsimonious taxonomy for describing the emotional effects of music. Thus, searches of relevant online music databases need no longer be confined to genre or artist, but can also incorporate emotion as a key experiential dimension of music. Third, a valid tagging scheme for emotion can assist both researchers and professionals in the choice of music to induce specific emotions. For example, psychologists, behavioural economists, and neuroscientists often need to induce emotion in their experiments to understand how behaviour or performance is modulated by emotion. Music is an obvious choice for emotion induction in controlled settings because it is a universal language that lends itself to comparisons across cultures and because it is ethically unproblematic.
Data was collected using a crowdsourcing method executed on the Crowdflower platform. Participants completed the background information and then completed as many rounds of human annotation tasks as they wished.
One round contained 3 sub-tasks: (1) mood and activity tagging, (2) track search and tagging, and (3) track tagging for moods and activities. These were designed to map various moods and activities related to music. A description of the questions and the types of information obtained is given in further documentation (Questionnaire_Form.docx and Information_sheet.docx).
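Because the tables are linked only by shared ids, analysis typically starts with joins; the sketch below is purely schematic, and the file and column names are hypothetical rather than the actual schema of the released database.

import pandas as pd

annotations = pd.read_csv("annotations.csv")  # e.g. track_id, user_id, mood_id, activity_id
tracks = pd.read_csv("tracks.csv")            # e.g. track_id, title, genre_id
users = pd.read_csv("users.csv")              # e.g. user_id, gender, age, expertise

merged = annotations.merge(tracks, on="track_id").merge(users, on="user_id")
print(merged.groupby("mood_id").size().head())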
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data used for our paper "WormSwin: Instance Segmentation of C. elegans using Vision Transformer". This publication is divided into three parts:
CSB-1 Dataset
Synthetic Images Dataset
MD Dataset
The CSB-1 Dataset consists of frames extracted from videos of Caenorhabditis elegans (C. elegans) annotated with binary masks. Each C. elegans is separately annotated, providing accurate annotations even for overlapping instances. All annotations are provided in binary mask format and as COCO Annotation JSON files (see COCO website).
The videos are named after the following pattern:
<"worm age in hours"_"mutation"_"irradiated (binary)"_"video index (zero based)">
For mutation the following values are possible:
wild type
csb-1 mutant
csb-1 with rescue mutation
An example video name would be 24_1_1_2, meaning it shows 24 h old, irradiated C. elegans carrying the csb-1 mutation (video index 2).
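For convenience, such a video name can be split into its four fields with a small helper; the function below is an illustrative sketch (the function name is ours, and the integer coding of the three mutation values is not spelled out above):

def parse_video_name(name):
    # Pattern: <worm age in hours>_<mutation>_<irradiated (binary)>_<video index (zero based)>
    age_hours, mutation, irradiated, video_index = name.split("_")
    return {
        "age_hours": int(age_hours),
        "mutation": int(mutation),            # numeric code for wild type / csb-1 mutant / csb-1 with rescue
        "irradiated": bool(int(irradiated)),
        "video_index": int(video_index),
    }

print(parse_video_name("24_1_1_2"))
# {'age_hours': 24, 'mutation': 1, 'irradiated': True, 'video_index': 2}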
Video data was provided by M. Rieckher; Instance Segmentation Annotations were created under supervision of K. Bozek and M. Deserno.
The Synthetic Images Dataset was created by cutting out C. elegans (foreground objects) from the CSB-1 Dataset and placing them randomly on background images also taken from the CSB-1 Dataset. Foreground objects were flipped, rotated, and slightly blurred before being placed on the background images. The same was done with the binary mask annotations taken from the CSB-1 Dataset so that they match the foreground objects in the synthetic images. Additionally, we added rings of random color, size, thickness, and position to the background images to simulate petri-dish edges.
This synthetic dataset was generated by M. Deserno.
The Mating Dataset (MD) consists of 450 grayscale image patches of 1,012 x 1,012 px showing C. elegans with high overlap, crawling on a petri dish. We took the patches from a 10-minute video of size 3,036 x 3,036 px. The video was downsampled from 25 fps to 5 fps before selecting 50 random frames for annotation and patching. As in the other datasets, worms were annotated with binary masks, and annotations are provided as COCO Annotation JSON files.
The video data was provided by X.-L. Chu; Instance Segmentation Annotations were created under supervision of K. Bozek and M. Deserno.
Further details about the datasets can be found in our paper.
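Because all annotation files follow the COCO format, they can be inspected with standard COCO tooling. A minimal sketch using the pycocotools package (the file name below is a placeholder, not a file name from the datasets):

from pycocotools.coco import COCO

# Placeholder path; point this at one of the provided COCO annotation JSON files.
coco = COCO("annotations.json")

# Count the annotated worm instances per image for the first few images.
for img_id in coco.getImgIds()[:5]:
    ann_ids = coco.getAnnIds(imgIds=img_id)
    print(img_id, len(ann_ids), "annotated instances")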
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overview
3DHD CityScenes is the most comprehensive, large-scale high-definition (HD) map dataset to date, annotated in the three spatial dimensions of globally referenced, high-density LiDAR point clouds collected in urban domains. Our HD map covers 127 km of road sections of the inner city of Hamburg, Germany, including 467 km of individual lanes. In total, our map comprises 266,762 individual items.
Our corresponding paper (published at ITSC 2022) is available here. Further, we have applied 3DHD CityScenes to map deviation detection here.
Moreover, we release code to facilitate the application of our dataset and the reproducibility of our research. Specifically, our 3DHD_DevKit comprises:
Python tools to read, generate, and visualize the dataset,
3DHDNet deep learning pipeline (training, inference, evaluation) for map deviation detection and 3D object detection.
The DevKit is available here:
https://github.com/volkswagen/3DHD_devkit.
The dataset and DevKit have been created by Christopher Plachetka as project lead during his PhD period at Volkswagen Group, Germany.
When using our dataset, you are welcome to cite:
@INPROCEEDINGS{9921866,
  author={Plachetka, Christopher and Sertolli, Benjamin and Fricke, Jenny and Klingner, Marvin and Fingscheidt, Tim},
  booktitle={2022 IEEE 25th International Conference on Intelligent Transportation Systems (ITSC)},
  title={3DHD CityScenes: High-Definition Maps in High-Density Point Clouds},
  year={2022},
  pages={627-634}}
Acknowledgements
We thank the following interns for their exceptional contributions to our work.
Benjamin Sertolli: Major contributions to our DevKit during his master's thesis
Niels Maier: Measurement campaign for data collection and data preparation
The European large-scale project Hi-Drive (www.Hi-Drive.eu) supports the publication of 3DHD CityScenes and encourages the general publication of information and databases facilitating the development of automated driving technologies.
The Dataset
After downloading, the 3DHD_CityScenes folder provides five subdirectories, which are explained briefly in the following.
The Dataset subdirectory contains the training, validation, and test set definitions (train.json, val.json, test.json) used in our publications. These files contain samples that define the geolocation and orientation of the ego vehicle in global coordinates on the map.
During dataset generation (done by our DevKit), samples are used to take crops from the larger point cloud. Also, map elements within reach of a sample are collected. Both modalities can then be used, e.g., as input to a neural network such as our 3DHDNet.
To read any JSON-encoded data provided by 3DHD CityScenes in Python, you can use the following code snippet as an example.
import json

json_path = r"E:\3DHD_CityScenes\Dataset\train.json"
with open(json_path) as jf:
    data = json.load(jf)
print(data)
Map items are stored as lists of items in JSON format and can be inspected like any other JSON data (see the sketch after this list). In particular, we provide:
traffic signs,
traffic lights,
pole-like objects,
construction site locations,
construction site obstacles (point-like such as cones, and line-like such as fences),
line-shaped markings (solid, dashed, etc.),
polygon-shaped markings (arrows, stop lines, symbols, etc.),
lanes (ordinary and temporary),
relations between elements (only for construction sites, e.g., sign to lane association).
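As an illustrative sketch only (the directory and file names below are placeholders, since the exact layout is not restated here), the element lists can be loaded and counted like any other JSON data:

import json
from pathlib import Path

# Placeholder location of the JSON-encoded map element lists.
map_dir = Path(r"E:\3DHD_CityScenes\Map_Items")

for json_file in sorted(map_dir.glob("*.json")):
    with open(json_file) as jf:
        items = json.load(jf)  # each file holds a list of items of one element type
    print(json_file.name, len(items), "items")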
Our high-density point cloud used as the basis for annotating the HD map is split into 648 tiles. A further subdirectory contains the geolocation of each tile as a polygon on the map. You can view the respective tile definitions using QGIS. Alternatively, we also provide the same polygons as lists of UTM coordinates in JSON.
Files with the endings .dbf, .prj, .qpj, .shp, and .shx belong to the tile definition as a "shapefile" (commonly used in geodesy) that can be viewed using QGIS. The JSON file contains the same information in a different format used in our Python API.
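To load the tile polygons programmatically rather than in QGIS, the shapefile can be read with geopandas, for instance; this is a sketch with a placeholder file name:

import geopandas as gpd

# Placeholder name; point this at the provided tile-definition shapefile (.shp).
tiles = gpd.read_file(r"E:\3DHD_CityScenes\Tile_Definition\tiles.shp")
print(len(tiles), "tile polygons")
print(tiles.geometry.head())  # one polygon per point cloud tile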
The high-density point cloud tiles are provided in global UTM32N coordinates and are encoded in a proprietary binary format. The first 4 bytes (integer) encode the number of points contained in that file. Subsequently, all point cloud values are provided as arrays. First all x-values, then all y-values, and so on. Specifically, the arrays are encoded as follows.
x-coordinates: 4 byte integer
y-coordinates: 4 byte integer
z-coordinates: 4 byte integer
intensity of reflected beams: 2 byte unsigned integer
ground classification flag: 1 byte unsigned integer
After reading, the respective values have to be unnormalized. As an example, you can use the following code snippet to read the point cloud data. For visualization, you can use the pptk package, for instance.
import numpy as np
import pptk

file_path = r"E:\3DHD_CityScenes\HD_PointCloud_Tiles\HH_001.bin"
pc_dict = {}
key_list = ['x', 'y', 'z', 'intensity', 'is_ground']
# dtypes follow the encoding described above (4-byte int for x/y/z, 2-byte unsigned int
# for intensity, 1-byte unsigned int for the ground flag); little-endian byte order assumed.
type_list = ['<i4', '<i4', '<i4', '<u2', '<u1']
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Experimental data can broadly be divided into discrete or continuous data. Continuous data are obtained from measurements that are performed as a function of another quantitative variable, e.g., time, length, concentration, or wavelength. The results from these types of experiments are often used to generate plots that visualize the measured variable on a continuous, quantitative scale. To simplify state-of-the-art data visualization and annotation of data from such experiments, an open-source tool was created with R/shiny that does not require coding skills to operate. The freely available web app accepts wide (spreadsheet) and tidy data and offers a range of options to normalize the data. The data from individual objects can be shown in 3 different ways: (1) lines with unique colors, (2) small multiples, and (3) heatmap-style display. In addition, the mean can be displayed with a 95% confidence interval for the visual comparison of different conditions. Several color-blind-friendly palettes are available to label the data and/or statistics. The plots can be annotated with graphical features and/or text to indicate any perturbations that are relevant. All user-defined settings can be stored for reproducibility of the data visualization. The app is dubbed PlotTwist and runs locally or online: https://huygens.science.uva.nl/PlotTwist
SQUAD - Smart Qualitative Data: Methods and Community Tools for Data Mark-Up is a demonstrator project that will explore methodological and technical solutions for exposing digital qualitative data to make them fully shareable, exploitable, and archivable for the longer term. Such tools are required to exploit fully the potential of qualitative data for adventurous collaborative research using web-based and e-science systems; an example of the latter might be linking multiple data and information sources, such as text, statistics, and maps. Initially, the project deals with specifying and testing flexible means of storing and marking up, or annotating, qualitative data using universal standards and technologies, through eXtensible Mark-up Language (XML). A community standard, or schema, will be proposed that will be applicable to most kinds of qualitative data. The second strand investigates optimal requirements for describing or 'contextualising' research data (e.g. interview setting or interviewer characteristics), aiming to develop standards for data documentation. The third strand aims to use natural language processing technologies to develop and implement user-friendly tools for semi-automating the processes of preparing marked-up qualitative data. Finally, the project will investigate tools for publishing the enriched data and contextual information to web-based systems and for exporting to preservation formats, providing tools and technologies to explore new forms of sharing and disseminating qualitative data.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
The Prague Discourse Treebank 4.0 (PDiT 4.0; Synková et al., 2024) is an annotation of discourse relations marked by primary and secondary discourse connectives in the whole data of the Prague Dependency Treebank - Consolidated 2.0 (PDT-C 2.0; Hajič et al., 2024). With respect to the previous versions of PDiT, annotating discourse relations in the whole PDT-C 2.0 means a significant increase in the size of the annotated data.
This is a safety annotation set for ImageNet. It uses the LlavaGuard-13B model for annotating. The annotations comprise a safety category (image-category), an explanation (assessment), and a safety rating (decision). Furthermore, each entry contains the unique ImageNet id class_sampleId, e.g. n04542943_1754. These annotations allow you to train your model on safety-aligned data only. Moreover, you can define for yourself what safety-aligned means, e.g. discard all images where decision=="Review Needed" or… See the full description on the dataset page: https://huggingface.co/datasets/AIML-TUDA/imagenet_safety_annotated.
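As a minimal sketch of how such filtering could look with the Hugging Face datasets library (the split name is an assumption, and the field names are taken from the description above; the actual schema may differ):

from datasets import load_dataset

# Dataset id taken from the URL above; the split name is an assumption.
ds = load_dataset("AIML-TUDA/imagenet_safety_annotated", split="train")

# Keep only samples not flagged for review, i.e., one possible definition of "safety-aligned".
safe = ds.filter(lambda ex: ex["decision"] != "Review Needed")
print(len(safe), "safety-aligned samples")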