Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Abstract: Granting agencies invest millions of dollars in the generation and analysis of data, making these products extremely valuable. However, without sufficient annotation of the methods used to collect and analyze the data, the ability to reproduce and reuse those products suffers. This lack of assurance of the quality and credibility of the data at the different stages in the research process essentially wastes much of the investment of time and funding, and fails to drive research forward to the level that would be possible if everything were effectively annotated and disseminated to the wider research community. In order to address this issue for the Hawai‘i Established Program to Stimulate Competitive Research (EPSCoR) project, a water science gateway was developed at the University of Hawai‘i (UH), called the ‘Ike Wai Gateway. In Hawaiian, ‘Ike means knowledge and Wai means water. The gateway supports research in hydrology and water management by providing tools to address questions of water sustainability in Hawai‘i. The gateway provides a framework for data acquisition, analysis, model integration, and display of data products. The gateway is intended to complement and integrate with the capabilities of the Consortium of Universities for the Advancement of Hydrologic Science's (CUAHSI) HydroShare by providing sound data and metadata management capabilities for multi-domain field observations, analytical lab actions, and modeling outputs. Functionality provided by the gateway is supported by a subset of CUAHSI's Observations Data Model (ODM), delivered as centralized web-based user interfaces and APIs supporting multi-domain data management, computation, analysis, and visualization tools to support reproducible science, modeling, data discovery, and decision support for the Hawai‘i EPSCoR ‘Ike Wai research team and the wider Hawai‘i hydrology community. By leveraging the Tapis platform, UH has constructed a gateway that ties data and advanced computing resources together to support diverse research domains including microbiology, geochemistry, geophysics, economics, and humanities, coupled with computational and modeling workflows delivered in a user-friendly web interface with workflows for effectively annotating the project data and products. Disseminating results for the ‘Ike Wai project through the ‘Ike Wai data gateway and HydroShare makes the research products accessible and reusable.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Annotated corpora for the MESINESP2 shared task (Spanish BioASQ track, see https://temu.bsc.es/mesinesp2). BioASQ 2021 will be held at CLEF 2021 (scheduled for September in Bucharest, Romania): http://clef2021.clef-initiative.eu/
Introduction: These corpora contain the data for each of the subtracks of the MESINESP2 shared task:
[Subtrack 1] MESINESP - Medical indexing:
Training set: It contains all Spanish records from the LILACS and IBECS databases at the Virtual Health Library (VHL) with a non-empty abstract written in Spanish. We have filtered out empty and non-Spanish abstracts. We built the training dataset from data crawled on 01/29/2021. This means the data is a snapshot of that moment and may change over time, since LILACS and IBECS usually add or modify indexes after a record's first inclusion in the database. We distribute two different datasets:
Articles training set: This corpus contains the set of 237574 Spanish scientific papers in VHL that have at least one DeCS code assigned to them.
Full training set: This corpus contains the whole set of 249474 Spanish documents from VHL that have at least one DeCS code assigned to them.
Development set: We provide a development set manually indexed by expert annotators. This dataset includes 1065 articles annotated with DeCS by three expert indexers in this controlled vocabulary. The articles were initially indexed by 7 annotators; after analyzing the inter-annotator agreement among their annotations, we selected the 3 best performers and considered their annotations the valid ones for building this set. From those 1065 records:
213 articles were annotated by more than one annotator. We have selected the union of their annotations.
852 articles were annotated by only one of the three selected best-performing annotators.
Test set: To be published
[Subtrack 2] MESINESP - Clinical trials:
Training set: The training dataset contains records from the Registro Español de Estudios Clínicos (REEC). REEC does not provide documents with the title/abstract structure needed in BioASQ, so we built artificial abstracts from the content available in the data crawled using the REEC API. Clinical trials are not indexed with DeCS terminology, so we used as training data a set of 3560 clinical trials that were automatically annotated in the first edition of MESINESP and published as a Silver Standard outcome. Because the performance of the models used by the participants was variable, we only selected predictions from runs with a MiF higher than 0.41, which corresponds to the submission of the best team.
Development set: We provide a development set manually indexed by expert annotators. This dataset includes 147 clinical trials annotated with DeCS by seven expert indexers in this controlled vocabulary.
Test set: To be published
[Subtrack 3] MESINESP - Patents: To be published
Files structure:
Subtrack1-Scientific_Literature.zip contains the corpora generated for subtrack 1. Content:
Subtrack1:
Train
training_set_track1_all.json: Full training set for subtrack 1.
training_set_track1_only_articles.json: Articles training set for subtrack 1.
Development
development_set_subtrack1.json: Manually annotated development set for subtrack 1.
Subtrack2-Clinical_Trials.zip contains the corpora generated for subtrack 2. Content:
Subtrack2:
Train
training_set_subtrack2.json: Training set for subtrack 2.
Development
development_set_subtrack2.json: Manually annotated development set for subtrack 2.
DeCS2020.tsv contains a DeCS table with the following structure:
DeCS code
Preferred descriptor (the preferred label in the Latin Spanish DeCS 2020 set)
List of synonyms (the descriptors and synonyms from the Latin Spanish DeCS 2020 set, separated by pipes).
DeCS2020.obo contains the *.obo file with the hierarchical relationships between DeCS descriptors.
*Note: The obo and tsv files with DeCS2020 descriptors contain some additional COVID19 descriptors that will be included in future versions of DeCS. These items were provided by the Pan American Health Organization (PAHO), which has kindly shared this content to improve the results of the task by taking these descriptors into account.
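As a minimal sketch of how the DeCS table could be loaded in Python, the snippet below assumes the three-column structure described above (code, preferred descriptor, pipe-separated synonyms) and that the TSV has no header row; adjust the column handling if the actual file differs.

import pandas as pd

# Load the DeCS table; columns follow the structure described above (assumed: no header row).
decs = pd.read_csv("DeCS2020.tsv", sep="\t", header=None,
                   names=["decs_code", "preferred_descriptor", "synonyms"])

# Split the pipe-separated synonym list into Python lists (empty cells become empty lists).
decs["synonyms"] = decs["synonyms"].fillna("").apply(lambda s: [x for x in s.split("|") if x])

# Build a simple lookup from DeCS code to preferred descriptor.
code_to_label = dict(zip(decs["decs_code"], decs["preferred_descriptor"]))
print(len(code_to_label), "DeCS descriptors loaded")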
For further information, please visit https://temu.bsc.es/mesinesp2/ or email us at encargo-pln-life@bsc.es
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
The data published here are supplementary material for a paper to be published in Metaphor and the Social World (under revision).
Two debates organised and published by TVP and TVN were transcribed and annotated with the Metaphor Identification Procedure. We have used the eMargin software, a collaborative textual annotation tool (Kehoe and Gee 2013), and a slightly modified version of MIP (Pragglejaz 2007). Each lexical unit in the transcript was labelled as a metaphor related word (MRW) if its "contextual meaning was related to the more basic meaning by some form of similarity" (Steen 2007). The meanings were established with the Wielki Słownik Języka Polskiego (Great Dictionary of Polish, ed. Żmigrodzki 2019). In addition to MRWs, lexemes which create a metaphorical expression together with an MRW were tagged as metaphor expression words (MEW). At least two words are needed to identify an actual metaphorical expression, since an MRW cannot appear without an MEW. The grammatical construction of the metaphor (Sullivan 2009) is asymmetrical: one word is conceptually autonomous and the other is conceptually dependent on the first. In construction grammar terms (Langacker 2008), the metaphor related word is elaborated by the metaphorical expression word, because the basic meaning of the MRW is elaborated and extended to a more figurative meaning only if it is used jointly with the MEW. Moreover, the meaning of the MEW is rather basic and concrete, as it remains unchanged in connection with the MRW. This can be clearly seen in an expression often used in our data: "Służba zdrowia jest w zapaści" ("Health service suffers from a collapse"), where the word "zapaść" ("collapse") is an example of an MRW and the words "służba zdrowia" ("health service") are labelled as MEW. The English translation of this expression needs a different verb: instead of "jest w zapaści" ("is in collapse") the unmarked English collocation is "suffers from a collapse", therefore the words "suffers from a collapse" are labelled as MRW. The "collapse" could be caused by heart failure, such as cardiac arrest, or any other life-threatening medical condition, and "health service" is portrayed as if it could literally suffer from such a condition – a collapse.
The data are in CSV tables exported from XML files downloaded from the eMargin site. Prior to annotation, the transcripts were divided into 40 parts, one for each annotator. MRW words are marked as MLN, MEW words are marked as MLP, functional words within a metaphorical expression are marked as MLI, and all other words are marked as noana, which means no annotation needed.
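As a minimal, illustrative sketch of working with these tables, the snippet below counts how often each tag (MLN, MLP, MLI, noana) occurs across the exported CSV files. The file pattern and the column name holding the tag ("tag" here) are assumptions; adjust them to the actual eMargin export.

import glob
import pandas as pd
from collections import Counter

tag_counts = Counter()
# Assumed layout: one CSV per annotated part, with a column named "tag" (hypothetical name).
for path in glob.glob("emargin_export/*.csv"):
    df = pd.read_csv(path)
    tag_counts.update(df["tag"].astype(str))

# Expected labels: MLN (metaphor related word), MLP (metaphor expression word),
# MLI (functional word inside a metaphorical expression), noana (no annotation needed).
for tag, count in tag_counts.most_common():
    print(tag, count)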
In this study, we explore to what extent language users agree about what kind of stances are expressed in natural language use or whether their interpretations diverge. In order to perform this task, a comprehensive cognitive-functional framework of ten stance categories was developed based on previous work on speaker stance in the literature. A corpus of opinionated texts, where speakers take stance and position themselves, was compiled, the Brexit Blog Corpus (BBC). An analytical interface for the annotations was set up and the data were annotated independently by two annotators. The annotation procedure, the annotation agreement and the co-occurrence of more than one stance category in the utterances are described and discussed. The careful, analytical annotation process has by and large returned satisfactory inter- and intra-annotation agreement scores, resulting in a gold standard corpus, the final version of the BBC.
Purpose:
The aim of this study is to explore the possibility of identifying speaker stance in discourse, provide an analytical resource for it and an evaluation of the level of agreement across speakers in the area of stance-taking in discourse.
The BBC is a collection of texts from blog sources. The corpus texts are thematically related to the 2016 UK referendum on whether the UK should remain a member of the European Union or not. The texts were extracted from the Internet from June to August 2015. With the Gavagai API (https://developer.gavagai.se), the texts were detected using seed words, such as Brexit, EU referendum, pro-Europe, europhiles, eurosceptics, United States of Europe, David Cameron, or Downing Street. The retrieved URLs were filtered so that only entries described as blogs in English were selected. Each downloaded document was split into sentential utterances, from which 2,200 utterances were randomly selected as the analysis data set. The final size of the corpus is 1,682 utterances, 35,492 words (169,762 characters without spaces). Each utterance contains from 3 to 40 words with a mean length of 21 words.
For the data annotation process the Active Learning and Visual Analytics (ALVA) system (https://doi.org/10.1145/3132169 and https://doi.org/10.2312/eurp.20161139) was used. Two annotators, one who is a professional translator with a Licentiate degree in English Linguistics and the other one with a PhD in Computational Linguistics, carried out the annotations independently of one another.
The data set can be downloaded in two different formats: a standard Microsoft Excel format and a raw data format (ZIP archive) which can be useful for analytical and machine learning purposes, for example, with the Python library scikit-learn. The Excel file includes one additional variable (utterance word length). The ZIP archive contains a set of directories (e.g., "contrariety" and "prediction") corresponding to the stance categories. Inside of each such directory, there are two nested directories corresponding to annotations which assign or not assign the respective category to utterances (e.g., inside the top-level category "prediction" there are two directories, "prediction" with utterances which were labeled with this category, and "no" with the rest of the utterances). Inside of the nested directories, there are textual files containing individual utterances.
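Since the ZIP layout mirrors the folder-per-class convention used by scikit-learn, one way to load a single stance category is sketched below. The path "bbc_raw/prediction" is an assumed location after unpacking the archive; the nested folder names ("prediction" and "no") follow the description above.

from sklearn.datasets import load_files

# Assumed path to one unpacked top-level category directory, e.g. "prediction".
data = load_files("bbc_raw/prediction", encoding="utf-8")

# data.data holds the utterance texts; data.target holds 0/1 class indices;
# data.target_names maps indices back to the folder names ("no", "prediction").
print(len(data.data), "utterances loaded")
print(data.target_names)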
When using data from this study, the primary researcher wishes citation also to be made to the publication: Vasiliki Simaki, Carita Paradis, Maria Skeppstedt, Magnus Sahlgren, Kostiantyn Kucher, and Andreas Kerren. Annotating speaker stance in discourse: the Brexit Blog Corpus. In Corpus Linguistics and Linguistic Theory, 2017. De Gruyter, published electronically before print. https://doi.org/10.1515/cllt-2016-0060
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A collection of 1200 texts (292 173 tokens) about clinical trial studies and clinical trial announcements in Spanish:
- 500 abstracts from journals published under a Creative Commons license, e.g. available in PubMed or the Scientific Electronic Library Online (SciELO).
- 700 clinical trial announcements published in the European Clinical Trials Register and the Repositorio Español de Estudios Clínicos.
Texts were annotated with entities from the Unified Medical Language System semantic groups: anatomy (ANAT), pharmacological and chemical substances (CHEM), pathologies (DISO), and lab tests, diagnostic or therapeutic procedures (PROC). 46 699 entities were annotated (13.98% are nested entities). 10% of the corpus was doubly annotated, and inter-annotator agreement (IAA) achieved a mean F-measure of 85.65% (±4.79, strict match) and a mean F-measure of 93.94% (±3.31, relaxed match). The corpus is freely distributed for research and educational purposes under a Creative Commons Attribution-NonCommercial (CC BY-NC) license.
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
DESCRIPTION
For this task, we use a subset of the MIRFLICKR (http://mirflickr.liacs.nl) collection. The entire collection contains 1 million images from the social photo sharing website Flickr and was formed by downloading up to a thousand photos per day that were deemed to be the most interesting according to Flickr. All photos in this collection were released by their users under a Creative Commons license, allowing them to be freely used for research purposes. Of the entire collection, 25 thousand images were manually annotated with a limited number of concepts and many of these annotations have been further refined and expanded over the lifetime of the ImageCLEF photo annotation task. This year we used crowd sourcing to annotate all of these 25 thousand images with the concepts.
On this page we provide you with more information about the textual features, visual features and concept features we supply with each image in the collection we use for this year's task.
TEXTUAL FEATURES
All images are accompanied by the following textual features:
- Flickr user tags
These are the tags that the users assigned to the photos they uploaded to Flickr. The 'raw' tags are the original tags, while the 'clean' tags are those collapsed to lowercase and condensed to remove spaces.
- EXIF metadata
If available, the EXIF metadata contains information about the camera that took the photo and the parameters used. The 'raw' exif is the original camera data, while the 'clean' exif reduces the verbosity.
- User information and Creative Commons license information
This contains information about the user that took the photo and the license associated with it.
VISUAL FEATURES
Over the previous years of the photo annotation task we noticed that participants often use the same types of visual features; in particular, features based on interest points and bag-of-words are popular. To assist you, we have extracted several features that you may want to use, so you can focus on concept detection instead. We additionally give you some pointers to easy-to-use toolkits that will help you extract other features, or the same features with different default settings.
- SIFT, C-SIFT, RGB-SIFT, OPPONENT-SIFT
We used the ISIS Color Descriptors (http://www.colordescriptors.com) toolkit to extract these descriptors. This package provides you with many different types of features based on interest points, mostly using SIFT. It furthermore assists you with building codebooks for bag-of-words. The toolkit is available for Windows, Linux and Mac OS X.
- SURF
We used the OpenSURF (http://www.chrisevansdev.com/computer-vision-opensurf.html) toolkit to extract this descriptor. The open source code is available in C++, C#, Java and many more languages.
- TOP-SURF
We used the TOP-SURF (http://press.liacs.nl/researchdownloads/topsurf) toolkit to extract this descriptor, which represents images with SURF-based bag-of-words. The website provides codebooks of several different sizes that were created using a combination of images from the MIR-FLICKR collection and from the internet. The toolkit also offers the ability to create custom codebooks from your own image collection. The code is open source, written in C++ and available for Windows, Linux and Mac OS X.
- GIST
We used the LabelMe (http://labelme.csail.mit.edu) toolkit to extract this descriptor. The MATLAB-based library offers a comprehensive set of tools for annotating images.
For the interest point-based features above we used a Fast Hessian-based technique to detect the interest points in each image. This detector is built into the OpenSURF library. In comparison with the Hessian-Laplace technique built into the ColorDescriptors toolkit it detects fewer points, resulting in a considerably reduced memory footprint. We therefore also provide you with the interest point locations in each image that the Fast Hessian-based technique detected, so when you would like to recalculate some features you can use them as a starting point for the extraction. The ColorDescriptors toolkit for instance accepts these locations as a separate parameter. Please go to http://www.imageclef.org/2012/photo-flickr/descriptors for more information on the file format of the visual features and how you can extract them yourself if you want to change the default settings.
CONCEPT FEATURES
We have solicited the help of workers on the Amazon Mechanical Turk platform to perform the concept annotation for us. To ensure a high standard of annotation we used the CrowdFlower platform that acts as a quality control layer by removing the judgments of workers that fail to annotate properly. We reused several concepts of last year's task and for most of these we annotated the remaining photos of the MIRFLICKR-25K collection that had not yet been used before in the previous task; for some concepts we reannotated all 25,000 images to boost their quality. For the new concepts we naturally had to annotate all of the images.
- Concepts
For each concept we indicate in which images it is present. The 'raw' concepts contain the judgments of all annotators for each image, where a '1' means an annotator indicated the concept was present whereas a '0' means the concept was not present, while the 'clean' concepts only contain the images for which the majority of annotators indicated the concept was present. Some images in the raw data for which we reused last year's annotations only have one judgment for a concept, whereas the other images have between three and five judgments; the single judgment does not mean only one annotator looked at it, as it is the result of a majority vote amongst last year's annotators.
- Annotations
For each image we indicate which concepts are present, so this is the reverse version of the data above. The 'raw' annotations contain the average agreement of the annotators on the presence of each concept, while the 'clean' annotations only include those for which there was a majority agreement amongst the annotators.
You will notice that the annotations are not perfect. Especially when the concepts are more subjective or abstract, the annotators tend to disagree more with each other. The raw versions of the concept annotations should help you get an understanding of the exact judgments given by the annotators.
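To illustrate the relationship between the 'raw' and 'clean' concept data, here is a small sketch that derives a clean image list from per-image judgments by majority vote. The in-memory structure (a dict from image id to a list of 0/1 judgments) is an assumption for illustration only; the actual file format is described on the ImageCLEF page linked above.

# Hypothetical raw judgments for one concept: image id -> list of 0/1 votes.
raw_judgments = {
    "im1001": [1, 1, 0],   # three judgments, majority says present
    "im1002": [0, 1],      # tie with an even number of votes
    "im1003": [1],         # single reused judgment from last year's task
}

def clean_images(judgments):
    """Return the ids where a strict majority of annotators marked the concept present."""
    return [img for img, votes in judgments.items()
            if sum(votes) * 2 > len(votes)]

print(clean_images(raw_judgments))  # ['im1001', 'im1003']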
This dataset features over 330,000 high-quality interior design images sourced from photographers worldwide. Designed to support AI and machine learning applications, it provides a richly varied and extensively annotated collection of indoor environment visuals.
Key Features: 1. Comprehensive Metadata: the dataset includes full EXIF data, detailing camera settings such as aperture, ISO, shutter speed, and focal length. Each image is pre-annotated with object and scene detection metadata, making it ideal for tasks such as room classification, furniture detection, and spatial layout analysis. Popularity metrics, derived from engagement on our proprietary platform, are also included.
Unique Sourcing Capabilities: the images are collected through a proprietary gamified platform for photographers. Competitions centered on interior design themes ensure a steady stream of fresh, high-quality submissions. Custom datasets can be sourced on-demand within 72 hours to fulfill specific requests, such as particular room types, design styles, or furnishings.
Global Diversity: photographs have been sourced from contributors in over 100 countries, covering a wide spectrum of architectural styles, cultural aesthetics, and functional spaces. The images include homes, offices, restaurants, studios, and public interiors—ranging from minimalist and modern to classic and eclectic designs.
High-Quality Imagery: the dataset includes standard to ultra-high-definition images that capture fine interior details. Both professionally staged and candid real-life spaces are included, offering versatility for training AI across design evaluation, object detection, and environmental understanding.
Popularity Scores: each image is assigned a popularity score based on its performance in GuruShots competitions. This provides valuable insights into global aesthetic trends, helping AI models learn user preferences, design appeal, and stylistic relevance.
AI-Ready Design: the dataset is optimized for machine learning tasks such as interior scene recognition, style transfer, virtual staging, and layout generation. It integrates smoothly with popular AI development environments and tools.
Licensing & Compliance: the dataset fully complies with data privacy regulations and includes transparent licensing suitable for commercial and academic use.
Use Cases: 1. Training AI for interior design recommendation engines and virtual staging tools. 2. Enhancing smart home applications and spatial recognition systems. 3. Powering AR/VR platforms for virtual tours, furniture placement, and room redesign. 4. Supporting architectural visualization, decor style transfer, and real estate marketing.
This dataset offers a comprehensive, high-quality resource tailored for AI-driven innovation in design, real estate, and spatial computing. Customizations are available upon request. Contact us to learn more!
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Labelled industry datasets are the most valuable asset in prognostics and health management (PHM) research. However, creating labelled industry datasets is both difficult and expensive, making publicly available industry datasets rare at best. While labels are generally unavailable, many industry datasets contain annotations, maintenance work orders, or logbooks, with free-form text containing technical-language descriptions of component properties, valuable information for any PHM model. Alas, publicly available annotated industry datasets are also scarce, in particular ones with associated signals available. Therefore, we release data from an annotated process industry dataset, consisting of 21090 pairs of signals and annotations from one year of kraftliner production.

The annotations are written in Swedish by on-site Swedish experts, and the signals consist of accelerometer vibration measurements from two large (80x10x10 m) paper machines. The data is cleaned and structured so that each annotation is associated with ten days of signal measurements leading up to the annotation date, where one signal measurement consists of 8192 samples over 6.4 seconds, which becomes 3200 samples stretching over 500 Hz in the frequency domain. The associated annotations are attached to each signal sample, so that the list of annotations is as long as the list of signals. In total, there are 43 unique annotations, though most are associated with multiple signals from different machines due to commonalities in fault descriptions. The language data is pre-processed so that all letters are lower case, numbers are removed, and names are replaced with the Swedish word "egennamn", meaning "name of a person" in English.

Also included are pre-computed embeddings, which facilitate faster and easier testing for researchers wanting to investigate training signal encoders supervised through technical language supervision. The data presented here was used in the article "Technical Language Supervision for Intelligent Fault Diagnosis in Process Industry" (https://papers.phmsociety.org/index.php/ijphm/article/view/3137). Please cite this article if you use this dataset. To use this dataset without understanding Swedish, please consult "Processing of Condition Monitoring Annotations with BERT and Technical Language Substitution: A Case Study" (https://www.papers.phmsociety.org/index.php/phme/article/view/3356) on how to augment the technical data to facilitate easier language model translations to other languages, and don't hesitate to contact me if you have questions regarding the data.

Accessing the data is simple; to load the spectra and annotation pairs:

import pandas as pd
TL_spectra_note_df = pd.read_pickle("TL_spectra_note_df_big.pkl")
all_spectra = TL_spectra_note_df['Spectra']
all_annotations = TL_spectra_note_df['Notes']

Pre-computed embeddings can be accessed through:

all_embeddings = TL_spectra_note_df['Embeddings']
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Proteomics data-dependent acquisition data sets collected with high-resolution mass-spectrometry (MS) can achieve very high-quality results, but nearly every analysis yields results that are thresholded at some accepted false discovery rate, meaning that a substantial number of results are incorrect. For study conclusions that rely on a small number of peptide-spectrum matches being correct, it is thus important to examine at least some crucial spectra to ensure that they are not one of the incorrect identifications. We present Quetzal, a peptide fragment ion spectrum annotation tool to assist researchers in annotating and examining such spectra to ensure that they correctly support study conclusions. We describe how Quetzal annotates spectra using the new Human Proteome Organization (HUPO) Proteomics Standards Initiative (PSI) mzPAF standard for fragment ion peak annotation, including the Python-based code, a web-service end point that provides annotation services, and a web-based application for annotating spectra and producing publication-quality figures. We illustrate its functionality with several annotated spectra of varying complexity. Quetzal provides easily accessible functionality that can assist in the effort to ensure and demonstrate that crucial spectra support study conclusions. Quetzal is publicly available at https://proteomecentral.proteomexchange.org/quetzal/.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Description
Dataset Summary
These are the annotations made by a team of experts on the speakers with more than 1200 seconds recorded in the Catalan set of the Common Voice dataset (v13).
The annotators were initially tasked with evaluating whether all recordings associated with the same speaker ID correspond to the same individual. Following that, they were instructed to annotate the speaker's accent, gender, and the overall quality of the recordings.
The accents and genders taken into account are the ones used until version 8 of the Common Voice corpus.
See annotations for more details.
Supported Tasks and Leaderboards
Gender classification, Accent classification.
Languages
The dataset is in Catalan (ca).
Dataset Structure
Instances
Two xlsx documents are published, one for each round of annotations.
The following information is available in each of the documents:
{
  'speaker ID': '1b7fc0c4e437188bdf1b03ed21d45b780b525fd0dc3900b9759d0755e34bc25e31d64e69c5bd547ed0eda67d104fc0d658b8ec78277810830167c53ef8ced24b',
  'idx': '31',
  'same speaker': {'AN1': 'SI', 'AN2': 'SI', 'AN3': 'SI', 'agreed': 'SI', 'percentage': '100'},
  'gender': {'AN1': 'H', 'AN2': 'H', 'AN3': 'H', 'agreed': 'H', 'percentage': '100'},
  'accent': {'AN1': 'Central', 'AN2': 'Central', 'AN3': 'Central', 'agreed': 'Central', 'percentage': '100'},
  'audio quality': {'AN1': '4.0', 'AN2': '3.0', 'AN3': '3.0', 'agreed': '3.0', 'percentage': '66', 'mean quality': '3.33', 'stdev quality': '0.58'},
  'comments': {'AN1': '', 'AN2': 'pujades i baixades de volum', 'AN3': "Deu ser d'alguna zona de transició amb el central, perquè no fa una reducció total vocàlica, però hi té molta tendència"},
}
We also publish the document Guia anotació parlants.pdf, with the guidelines the annotators received.
Data Fields
speaker ID (string): An id for which client (voice) made the recording in the Common Voice corpus
idx (int): Id in this corpus
AN1 (string): Annotations from Annotator 1
AN2 (string): Annotations from Annotator 2
AN3 (string): Annotations from Annotator 3
agreed (string): Annotation from the majority of the annotators
percentage (int): Percentage of annotators that agree with the agreed annotation
mean quality (float): Mean of the quality annotation
stdev quality (float): Standard deviation of the mean quality
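As a minimal sketch, the annotations could be loaded and summarised with pandas as below. The file name and the flat column names ("gender agreed", "accent agreed") are assumptions for illustration; check the published xlsx documents for the real structure.

import pandas as pd

# Hypothetical file name for one round of annotations; adjust to the published xlsx.
df = pd.read_excel("speaker_annotations_round2.xlsx")

# Assumed flat column names holding the majority ("agreed") labels.
print(df["gender agreed"].value_counts())   # e.g. H (male) vs D (female)
print(df["accent agreed"].value_counts())   # accent categories used up to Common Voice v8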
Data Splits
The corpus is not divided into splits, as its purpose does not involve training models.
Dataset Creation
Curation Rationale
During 2022, a campaign was launched to promote the Common Voice corpus within the Catalan-speaking community, achieving remarkable success. However, not all participants provided their demographic details such as age, gender, and accent. Additionally, some individuals faced difficulty in self-defining their accent using the standard classifications established by specialists.
In order to obtain a balanced corpus with reliable information, we saw the necessity of enlisting a group of experts from the University of Barcelona to provide accurate annotations.
We release the complete annotations because transparency is fundamental to our project. Furthermore, we believe they hold philological value for studying dialectal and gender variants.
Source Data
The original data comes from the Catalan sentences of the Common Voice corpus.
Initial Data Collection and Normalization
We have selected speakers who have recorded more than 1200 seconds of speech in the Catalan set of the version 13 of the Common Voice corpus.
Who are the source language producers?
The original data comes from the Catalan sentences of the Common Voice corpus.
Annotations
Annotation process
Starting with version 13 of the Common Voice corpus we identified the speakers (273) who have recorded more than 1200 seconds of speech.
A team of three annotators was tasked with annotating:
if all the recordings correspond to the same person
the gender of the speaker
the accent of the speaker
the quality of the recording
They conducted an initial round of annotation, discussed their varying opinions, and subsequently conducted a second round.
We release the complete annotations because transparency is fundamental to our project. Furthermore, we believe they hold philological value for studying dialectal and gender variants.
Who are the annotators?
The annotation was entrusted to the CLiC (Centre de Llenguatge i Computació) team from the University of Barcelona. They selected a group of three annotators (two men and one woman), who received a scholarship to do this work.
The annotation team was composed of:
Annotator 1: 1 female annotator, aged 18-25, L1 Catalan, student in the Modern Languages and Literatures degree, with a focus on Catalan.
Annotators 2 & 3: 2 male annotators, aged 18-25, L1 Catalan, students in the Catalan Philology degree.
1 female supervisor, aged 40-50, L1 Catalan, graduate in Physics and in Linguistics, Ph.D. in Signal Theory and Communications.
To do the annotation, they used a Google Drive spreadsheet.
Personal and Sensitive Information
The Common Voice dataset consists of people who have donated their voice online. We do not share their voices here, only their gender and accent. You agree not to attempt to determine the identity of speakers in the Common Voice dataset.
Considerations for Using the Data
Social Impact of Dataset
The IDs come from the Common Voice dataset, which consists of people who have donated their voice online.
You agree to not attempt to determine the identity of speakers in the Common Voice dataset.
The information from this corpus will allow us to train and evaluate well balanced Catalan ASR models. Furthermore, we believe they hold philological value for studying dialectal and gender variants.
Discussion of Biases
Most of the voices in the Catalan Common Voice correspond to men between 40 and 60 years old with a central accent. The aim of this dataset is to provide information that helps minimize the biases this could cause.
For the gender annotation, we have only considered "H" (male) and "D" (female).
Other Known Limitations
[N/A]
Additional Information
Dataset Curators
Language Technologies Unit at the Barcelona Supercomputing Center (langtech@bsc.es)
This work has been promoted and financed by the Generalitat de Catalunya through the Aina project.
Licensing Information
This dataset is licensed under a CC BY 4.0 license.
It can be used for any purpose, whether academic or commercial, under the terms of the license. Give appropriate credit, provide a link to the license, and indicate if changes were made.
Citation Information
DOI
Contributions
The annotation was entrusted to the STeL team from the University of Barcelona.
TwitterData for "Koelmel JP, Stelben P, McDonough CA, Dukes DA, Aristizabal-Henao JJ, Nason SL, Li Y, Sternberg S, Lin E, Beckmann M, Williams AJ, Draper J, Finch JP, Munk JK, Deigl C, Rennie EE, Bowden JA, Godri Pollitt KJ. FluoroMatch 2.0-making automated and comprehensive non-targeted PFAS annotation a reality. Anal Bioanal Chem. 2022 Jan;414(3):1201-1215. doi: 10.1007/s00216-021-03392-7. Epub 2021 May 20. PMID: 34014358.". Portions of this dataset are inaccessible because: The link provided by UCSD doesn't seem to be working. They can be accessed through the following means: Contact Jeremy Koelmel at Yale University, jeremykoelmel@innovativeomics.com. Format: The final annotated excel sheets with feature intensities, annotations, homologous series groupings, etc., are available as a supplemental excel file with the online version of this manuscript. The raw Agilent “.d” files can be downloaded at: ftp://massive.ucsd.edu/MSV000086811/updates/2021-02-05_jeremykoelmel_e5b21166/raw/McDonough_AFFF_3M_ddMS2_Neg.zip (Note use Google Chrome or Firefox, Microsoft Edge and certain other browsers are unable to download from an ftp link). This dataset is associated with the following publication: Koelmel, J.P., P. Stelben, C.A. McDonough, D.A. Dukes, J.J. Aristizabal-Henao, S.L. Nason, Y. Li, S. Sternberg, E. Lin, M. Beckmann, A. Williams, J. Draper, J. Finch, J.K. Munk, C. Deigl, E. Rennie, J.A. Bowden, and K.J. Godri Pollitt. FluoroMatch 2.0—making automated and comprehensive non-targeted PFAS annotation a reality. Analytical and Bioanalytical Chemistry. Springer, New York, NY, USA, 414(3): 1201-1215, (2022).
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The SloWIC dataset is a Slovenian dataset for the Word in Context task. Each example in the dataset contains a target word with multiple meanings and two sentences that both contain the target word. Each example is also annotated with a label that shows if both sentences use the same meaning of the target word. The dataset contains 1808 manually annotated sentence pairs and an additional 13150 automatically annotated pairs to help with training larger models. The dataset is stored in the JSON format following the format used in the SuperGLUE version of the Word in Context task (https://super.gluebenchmark.com/).
Each example contains the following data fields:
- word: The target word with multiple meanings
- sentence1: The first sentence containing the target word
- sentence2: The second sentence containing the target word
- idx: The index of the example in the dataset
- label: Label showing if the sentences contain the same meaning of the target word
- start1: Start of the target word in the first sentence
- start2: Start of the target word in the second sentence
- end1: End of the target word in the first sentence
- end2: End of the target word in the second sentence
- version: The version of the annotation
- manual_annotation: Boolean showing if the label was manually annotated
- group: The group of annotators that labelled the example
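A minimal loading sketch is shown below. The file name is an assumption, and the code assumes one JSON object per line (as in the SuperGLUE WiC distribution); if the data is stored as a single JSON array instead, a plain json.load would be used.

import json

examples = []
# Assumed file name; the SuperGLUE WiC format stores one JSON object per line.
with open("slowic_train.jsonl", encoding="utf-8") as f:
    for line in f:
        examples.append(json.loads(line))

ex = examples[0]
# Show the target word occurrence in the first sentence using the start1/end1 offsets.
print(ex["word"], "|", ex["sentence1"][ex["start1"]:ex["end1"]], "| label:", ex["label"])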
https://creativecommons.org/publicdomain/zero/1.0/
For this notebook, we're going to use the GMB (Groningen Meaning Bank) corpus for named entity recognition. GMB is a fairly large corpus with a lot of annotations. The data is labeled using the IOB format (short for inside, outside, beginning), which means each entity tag carries a B or I prefix marking the beginning or continuation of an entity, while tokens outside any entity are tagged O.
The following classes appear in the dataset:
- LOC - Geographical Entity
- ORG - Organization
- PER - Person
- GPE - Geopolitical Entity
- TIME - Time indicator
- ART - Artifact
- EVE - Event
- NAT - Natural Phenomenon
Note: GMB is not completely human annotated, and it's not considered 100% correct. For this exercise, the classes ART, EVE, and NAT were combined into a MISC class due to the small number of examples for these classes.
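To make the IOB scheme and the class merging concrete, here is a small illustrative sketch. The sentence and tags are a made-up example in the tagging style described above, and the merging of ART, EVE, and NAT into MISC follows the note above.

# A made-up sentence tagged in IOB style (B- marks the first token of an entity,
# I- a continuation of the same entity, O means no entity).
tokens = ["The", "World", "Health", "Organization", "met", "in", "Geneva", "on", "Monday"]
tags   = ["O", "B-ORG", "I-ORG", "I-ORG", "O", "O", "B-LOC", "O", "B-TIME"]

# Merge the rare classes ART, EVE and NAT into a single MISC class, as done for this exercise.
RARE = {"ART", "EVE", "NAT"}

def merge_rare(tag):
    if tag == "O":
        return tag
    prefix, cls = tag.split("-", 1)
    return f"{prefix}-MISC" if cls in RARE else tag

merged = [merge_rare(t) for t in tags]
print(list(zip(tokens, merged)))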
https://creativecommons.org/publicdomain/zero/1.0/
By Huggingface Hub [source]
HelpSteer is an open-source dataset designed to empower AI alignment through the support of fair, team-oriented annotation. The dataset provides 37,120 samples, each containing a prompt and response along with five human-annotated attributes ranging between 0 and 4, with higher scores indicating better quality. Using cutting-edge methods in machine learning and natural language processing in combination with the annotation of data experts, HelpSteer strives to create a set of standardized values that can be used to measure alignment between human and machine interactions. With comprehensive data providing responses rated for correctness, coherence, complexity, helpfulness and verbosity, HelpSteer sets out to assist organizations in fostering reliable AI models which ensure more accurate results, thereby leading to an improved user experience at all levels.
How to Use HelpSteer: An Open-Source AI Alignment Dataset
HelpSteer is an open-source dataset designed to help researchers create models with AI Alignment. The dataset consists of 37,120 different samples each containing a prompt, a response and five human-annotated attributes used to measure these responses. This guide will give you a step-by-step introduction on how to leverage HelpSteer for your own projects.
Step 1 - Choosing the Data File
HelpSteer contains two data files – one for training and one for validation. To start exploring the dataset, first select the file you would like to use by downloading both train.csv and validation.csv from the Kaggle page linked above, or by getting them from the Google Drive repository attached here: [link]. All the samples in each file consist of 7 columns with information about a single response: prompt (given), response (submitted), helpfulness, correctness, coherence, complexity and verbosity; all with values between 0 and 4, where higher means better in the respective category.
Step 2 - Exploratory Data Analysis (EDA)
Once you have your file loaded into your workspace or favorite software environment (e.g. suggested libraries like Pandas/NumPy, or even Microsoft Excel), it's time to explore it further by running some basic EDA commands that summarize each feature's distribution within the data set and note potential trends or points of interest throughout it - e.g. which traits polarize these responses most? Are there any outliers that might signal something interesting happening? Plotting these results often provides great insight into patterns across the dataset, which can be used later on during the modeling phase, also known as "feature engineering".
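A minimal EDA sketch along those lines is shown below; it assumes train.csv has been downloaded to the working directory and uses the seven columns listed in Step 1.

import pandas as pd

df = pd.read_csv("train.csv")  # assumes the file sits in the working directory

score_cols = ["helpfulness", "correctness", "coherence", "complexity", "verbosity"]

# Distribution of each annotated attribute (values range from 0 to 4).
print(df[score_cols].describe())

# Pairwise correlations can hint at which traits move together across responses.
print(df[score_cols].corr())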
Step 3 - Data Preprocessing
Your interpretation of the raw data during EDA should form some hypotheses about which features matter most when trying to accurately estimate the attribute scores of unknown responses. Preprocessing, such as cleaning up missing entries or handling outliers, is therefore highly recommended before starting any modelling efforts with this data set. Kindly refer back to the Kaggle page description if you are unsure about the allowed value ranges of specific attributes; having the correct numerical ranges ready can make the modelling workload lighter later on when building predictive models. It is important not to rush this stage, otherwise poor results may occur when aiming for high accuracy too quickly upon model deployment due to low data quality.
- Designating and measuring conversational AI engagement goals: Researchers can utilize the HelpSteer dataset to design evaluation metrics for AI engagement systems.
- Identifying conversational trends: By analyzing the annotations and data in HelpSteer, organizations can gain insights into what makes conversations more helpful, cohesive, complex or consistent across datasets or audiences.
- Training Virtual Assistants: Train artificial intelligence algorithms on this dataset to develop virtual assistants that respond effectively to customer queries with helpful answers
If you use this dataset in your research, please credit the original authors. Data Source
**License: [CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication](https://creativecommons.org/pu...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Current research introduces an annotated gold standard dataset based on 2,663 comments from Meta (Facebook). The dataset is manually labelled for stigma, not stigma, and ambiguous sentiment. Each comment is labelled three times (four times in case of dissensus) by independent expert annotators. The overall observed share of agreement reached 68% and the Fleiss Kappa agreement rate achieved 0.62 on the annotation task with three labels ("stigma", "not stigma", and "ambiguous"). The share of agreement between two labels ("stigma", "not stigma") is 89% and Fleiss Kappa is 0.84. The labels are consequently propagated from the annotated Facebook (Meta) dataset to a dataset discussing COVID vaccines with 40,084 comments from Twitter, Reddit, and YouTube corpora. In addition, the corpora are annotated with linguistic features from LIWC (Linguistic Inquiry and Word Count) [1], [2] and additional features: number of characters in the comment string, sentiment score, subjectivity score.
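For readers who want to reproduce agreement figures of this kind on their own annotations, below is a minimal sketch using statsmodels. The label matrix is made-up toy data; with real data each row would hold the three (or four) labels assigned to one comment.

import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Toy data: one row per comment, one column per annotator,
# with labels 0 = "not stigma", 1 = "stigma", 2 = "ambiguous".
labels = np.array([
    [1, 1, 1],
    [0, 0, 2],
    [0, 0, 0],
    [2, 1, 1],
    [0, 0, 0],
])

# aggregate_raters converts rater labels into per-category counts per item.
counts, _ = aggregate_raters(labels)
print("Fleiss' kappa:", fleiss_kappa(counts))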
http://opendatacommons.org/licenses/dbcl/1.0/
Named entity recognition on an annotated corpus: the GMB (Groningen Meaning Bank) corpus is used for entity classification, with enhanced and popular natural language processing features applied to the data set.
The dataset is an extract from the GMB corpus which is tagged, annotated and built specifically to train a classifier to predict named entities such as names, locations, etc. GMB is a fairly large corpus with a lot of annotations. Unfortunately, GMB is not perfect. It is not a gold standard corpus, meaning that it is not completely human annotated and it is not considered 100% correct. The corpus was created using existing annotators and then corrected by humans where needed. The attached dataset is in tab-separated format, and the goal is to create a good model to classify the Tag column. The data is labelled using the IOB tagging system. The following classes appear in the dataset:
- geo = Geographical Entity
- org = Organization
- per = Person
- gpe = Geopolitical Entity
- tim = Time indicator
- art = Artifact
- eve = Event
- nat = Natural Phenomenon
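A minimal sketch of loading the file and inspecting the Tag column is shown below. The file name and any column names other than Tag are assumptions; check the download for the real header.

import pandas as pd

# Assumed file name for the tab-separated download; adjust to the actual file.
df = pd.read_csv("ner_dataset.tsv", sep="\t")

# How often each IOB tag occurs; B-/I- prefixes mark entity beginnings and continuations.
print(df["Tag"].value_counts())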
The dataset is a subset of the original dataset shared here - https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus/kernels
The data can be used by anyone who is starting off with NER in NLP.
This dataset features over 25,000,000 high-quality general-purpose images sourced from photographers worldwide. Designed to support a wide range of AI and machine learning applications, it offers a richly diverse and extensively annotated collection of everyday visual content.
Key Features: 1. Comprehensive Metadata: the dataset includes full EXIF data, detailing camera settings such as aperture, ISO, shutter speed, and focal length. Additionally, each image is pre-annotated with object and scene detection metadata, making it ideal for tasks like classification, detection, and segmentation. Popularity metrics, derived from engagement on our proprietary platform, are also included.
2. Unique Sourcing Capabilities: the images are collected through a proprietary gamified platform for photographers. Competitions spanning various themes ensure a steady influx of diverse, high-quality submissions. Custom datasets can be sourced on-demand within 72 hours, allowing for specific requirements—such as themes, subjects, or scenarios—to be met efficiently.
Global Diversity: photographs have been sourced from contributors in over 100 countries, covering a wide range of human experiences, cultures, environments, and activities. The dataset includes images of people, nature, objects, animals, urban and rural life, and more—captured across different times of day, seasons, and lighting conditions.
High-Quality Imagery: the dataset includes images with resolutions ranging from standard to high-definition to meet the needs of various projects. Both professional and amateur photography styles are represented, offering a balance of realism and creativity across visual domains.
Popularity Scores: each image is assigned a popularity score based on its performance in GuruShots competitions. This unique metric reflects how well the image resonates with a global audience, offering an additional layer of insight for AI models focused on aesthetics, engagement, or content curation.
AI-Ready Design: this dataset is optimized for AI applications, making it ideal for training models in general image recognition, multi-label classification, content filtering, and scene understanding. It integrates easily with leading machine learning frameworks and pipelines.
Licensing & Compliance: the dataset complies fully with data privacy regulations and offers transparent licensing for both commercial and academic use.
Use Cases: 1. Training AI models for general-purpose image classification and tagging. 2. Enhancing content moderation and visual search systems. 3. Building foundational datasets for large-scale vision-language models. 4. Supporting research in computer vision, multimodal AI, and generative modeling.
This dataset offers a comprehensive, diverse, and high-quality resource for training AI and ML models across a wide array of domains. Customizations are available to suit specific project needs. Contact us to learn more!
https://dataintelo.com/privacy-and-policy
According to our latest research, the global robotics data labeling services market size reached USD 1.34 billion in 2024, reflecting robust expansion fueled by the rapid adoption of robotics across multiple industries. The market is set to grow at a CAGR of 21.7% from 2025 to 2033, reaching an estimated USD 9.29 billion by 2033. This impressive growth trajectory is primarily driven by increasing investments in artificial intelligence (AI), machine learning (ML), and automation technologies, which demand high-quality labeled data for effective robotics training and deployment. As per our latest research, the proliferation of autonomous systems and the need for precise data annotation are the key contributors to this market’s upward momentum.
One of the primary growth factors for the robotics data labeling services market is the accelerating adoption of AI-powered robotics in industrial and commercial domains. The increasing sophistication of robotics, especially in sectors like automotive manufacturing, logistics, and healthcare, requires vast amounts of accurately labeled data to train algorithms for object detection, navigation, and interaction. The emergence of Industry 4.0 and the transition toward smart factories have amplified the need for reliable data annotation services. Moreover, the growing complexity of robotic tasks necessitates not just basic labeling but advanced contextual annotation, further fueling demand. The rise in collaborative robots (cobots) in manufacturing environments also underlines the necessity for precise data labeling to ensure safety and efficiency.
Another significant driver is the surge in autonomous vehicle development, which relies heavily on high-quality labeled data for perception, decision-making, and real-time response. Automotive giants and tech startups alike are investing heavily in robotics data labeling services to enhance the performance of their autonomous driving systems. The expansion of sensor technologies, including LiDAR, radar, and high-definition cameras, has led to an exponential increase in the volume and complexity of data that must be annotated. This trend is further supported by regulatory pressures to ensure the safety and reliability of autonomous systems, making robust data labeling a non-negotiable requirement for market players.
Additionally, the healthcare sector is emerging as a prominent end-user of robotics data labeling services. The integration of robotics in surgical procedures, diagnostics, and patient care is driving demand for meticulously annotated datasets to train AI models in recognizing anatomical structures, pathological features, and procedural steps. The need for precision and accuracy in healthcare robotics is unparalleled, as errors can have significant consequences. As a result, healthcare organizations are increasingly outsourcing data labeling tasks to specialized service providers to leverage their expertise and ensure compliance with stringent regulatory standards. The expansion of telemedicine and remote diagnostics is also contributing to the growing need for reliable data annotation in healthcare robotics.
From a regional perspective, North America currently dominates the robotics data labeling services market, accounting for the largest share in 2024, followed closely by Asia Pacific and Europe. The United States is at the forefront, driven by substantial investments in AI research, a strong presence of leading robotics companies, and a mature technology ecosystem. Meanwhile, Asia Pacific is experiencing the fastest growth, propelled by large-scale industrial automation initiatives in China, Japan, and South Korea. Europe remains a critical market, driven by advancements in automotive and healthcare robotics, as well as supportive government policies. The Middle East & Africa and Latin America are also witnessing gradual adoption, primarily in manufacturing and logistics sectors, albeit at a slower pace compared to other regions.
The service type segment in the robotics data labeling services market encompasses image labeling, video labeling, sensor data labeling, text labeling, and others. Image labeling remains the cornerstone of data annotation for robotics, as computer vision is integral to most robotic applications. The demand for image labeling services has surged with the proliferation of robots that rely on visual perception for navigation.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Data
The dataset consists of 5538 images of public spaces, annotated with steps, stairs, ramps and grab bars for stairs and ramps. The dataset has 3564 annotations of steps, 1492 of stairs, 143 of ramps and 922 of grab bars.
Each step annotation is attributed with an estimate of the height of the step, as falling into one of three categories: less than 3cm, 3cm to 7cm or more than 7cm. Additionally it is attributed with a 'type', with the possibilities 'doorstep', 'curb' or 'other'.
Stair annotations are attributed with the number of steps in the stair.
Ramps are attributed with an estimate of their width, also falling into three categories: less than 50cm, 50cm to 100cm and more than 100cm.
In order to preserve all additional attributes of the labels, the data is published in the CVAT XML format for images.
Annotating Process
The labelling has been done using bounding boxes around the objects. This format is compatible with many popular object detection models, e.g. the YOLO family of models. A bounding box is placed so that it contains exactly the visible part of the respective object. This implies that only objects that are visible in the photo are annotated. In particular, a photo of a stair or step taken from above, where the object cannot be seen, has not been annotated, even when a human viewer could infer that there is a stair or a step from other features in the photo.
Steps
A step is annotated when there is a vertical increment that functions as a passage between two surface areas intended for human or vehicle traffic. This means that we have not included:
In particular, the bounding box of a step object contains exactly the incremental part of the step, but does not extend into the top or bottom horizontal surface any more than necessary to enclose the incremental part entirely. This has been chosen for consistency reasons, as including parts of the horizontal surfaces would imply a non-trivial choice of how much to include, which we deemed would most likely lead to more inconsistent annotations.
The heights of the steps are estimated by the annotators, and are therefore not guaranteed to be accurate.
The type of a step typically falls into the category 'doorstep' or 'curb'. Steps that are in a doorway, entrance or similar are attributed as doorsteps. We also include in this category steps that immediately lead to a doorway within a proximity of 1-2m. Steps between different types of pathways, e.g. between streets and sidewalks, are annotated as curbs. Any other type of step is annotated with 'other'. Many of the 'other' steps are, for example, steps to terraces.
Stairs
The stair label is used whenever two or more steps directly follow each other in a consistent pattern. All vertical increments are enclosed in the bounding box, as well as the intermediate surfaces of the steps. However, the top and bottom surfaces are not included more than necessary, for the same reason as for steps, as described in the previous section.
The annotator counts the number of steps and attributes this to the stair object label.
Ramps
Ramps have been annotated when a sloped passageway has been placed or built to connect two surface areas intended for human or vehicle traffic. This implies the same considerations as with steps. Likewise, only the sloped part of a ramp is annotated, not including the bottom or top surface area.
For each ramp, the annotator makes an assessment of the width of the ramp in three categories: less than 50cm, 50cm to 100cm and more than 100cm. This parameter is visually hard to assess, and sometimes impossible to judge due to the viewing angle of the ramp.
Grab Bars
Grab bars are annotated for handrails and similar objects that are in direct connection with a stair or a ramp. While horizontal grab bars could also have been included, this was omitted due to the ambiguities implied by fences and similar objects. As the grab bar was originally intended as attribute information for stairs and ramps, we chose to keep this focus. The bounding box encloses the part of the grab bar that functions as a handrail for the stair or ramp.
Usage
As is often the case when annotating data, much of the information depends on the subjective assessment of the annotator. As each data point in this dataset has been annotated by only one person, caution should be taken when the data is applied.
Generally speaking, the mindset and usage guiding the annotations have been wheelchair accessibility. While we have strived to annotate at an object level, hopefully making the data more widely applicable, we state this explicitly as it may have swayed non-trivial annotation choices.
The attribute data, such as step height or ramp width, are highly subjective estimates. We still provide this data to give a post-hoc method for selecting which annotations to use. For example, for some purposes one may be interested in detecting only steps that are indeed higher than 3cm. The attribute data makes it possible to filter out the steps lower than 3cm, so a machine learning algorithm can be trained on a dataset more appropriate for that use case, as sketched below. We stress, however, that one cannot expect to train accurate machine learning algorithms to infer the attribute data, as this data is not accurate in the first place.
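A minimal sketch of such a post-hoc filter, assuming the CVAT file name and the exact attribute value strings shown below:

```python
# Sketch: build a training subset keeping only step annotations estimated as
# higher than 3cm. File name and attribute value strings are assumptions.
import xml.etree.ElementTree as ET

tree = ET.parse("annotations.xml")          # assumed file name
kept, dropped = 0, 0

for image in tree.getroot().iter("image"):
    for box in list(image.iter("box")):
        if box.get("label") != "step":
            continue
        attrs = {a.get("name"): a.text for a in box.iter("attribute")}
        if attrs.get("height") == "less than 3cm":
            image.remove(box)               # discard low steps for this use case
            dropped += 1
        else:
            kept += 1

tree.write("annotations_steps_over_3cm.xml")
print(f"kept {kept} step boxes, removed {dropped}")
```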
We hope this dataset will be a useful building block in the endeavours for automating barrier detection and documentation.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Portraying emotion and trustworthiness is known to increase the appeal of video content. However, the causal relationship between these signals and online user engagement is not well understood. This limited understanding is partly due to a scarcity in emotionally annotated data and the varied modalities which express user engagement online. In this contribution, we utilize a large dataset of YouTube review videos which includes ca. 600 h of dimensional arousal, valence and trustworthiness annotations. We investigate features extracted from these signals against various user engagement indicators including views, like/dislike ratio, as well as the sentiment of comments. In doing so, we identify the positive and negative influences which single features have, as well as interpretable patterns in each dimension which relate to user engagement. Our results demonstrate that smaller boundary ranges and fluctuations for arousal lead to an increase in user engagement. Furthermore, the extracted time-series features reveal significant (p < 0.05) correlations for each dimension, such as, count below signal mean (arousal), number of peaks (valence), and absolute energy (trustworthiness). From this, an effective combination of features is outlined for approaches aiming to automatically predict several user engagement indicators. In a user engagement prediction paradigm we compare all features against semi-automatic (cross-task), and automatic (task-specific) feature selection methods. These selected feature sets appear to outperform the usage of all features, e.g., using all features achieves 1.55 likes per day (Lp/d) mean absolute error from valence; this improves through semi-automatic and automatic selection to 1.33 and 1.23 Lp/d, respectively (data mean 9.72 Lp/d with a std. 28.75 Lp/d).
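For readers unfamiliar with these time-series features, the following numpy/scipy sketch shows how the three named quantities (count below signal mean, number of peaks, absolute energy) can be computed on a synthetic trace; it is an illustration only, not the authors' feature-extraction pipeline.

```python
# Illustrative computation of three time-series features on a synthetic signal.
import numpy as np
from scipy.signal import find_peaks

rng = np.random.default_rng(0)
signal = rng.normal(size=1000)              # stand-in for an arousal/valence/trustworthiness trace

count_below_mean = int(np.sum(signal < signal.mean()))   # samples below the mean
number_of_peaks = len(find_peaks(signal)[0])              # local maxima
absolute_energy = float(np.sum(signal ** 2))              # sum of squared values

print(count_below_mean, number_of_peaks, absolute_energy)
```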