100+ datasets found
1. PEARC20 submitted paper: "Scientific Data Annotation and Dissemination:...

    • hydroshare.org
    • beta.hydroshare.org
    zip
    Updated Jul 29, 2020
    Cite
    Sean Cleveland; Gwen Jacobs; Jennifer Geis (2020). PEARC20 submitted paper: "Scientific Data Annotation and Dissemination: Using the ‘Ike Wai Gateway to Manage Research Data" [Dataset]. http://doi.org/10.4211/hs.d66ef2686787403698bac5368a29b056
    Explore at:
zip (873 bytes)
    Dataset updated
    Jul 29, 2020
    Dataset provided by
    HydroShare
    Authors
    Sean Cleveland; Gwen Jacobs; Jennifer Geis
    License

Attribution-ShareAlike 4.0 (CC BY-SA 4.0) - https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Time period covered
    Jul 29, 2020
    Description

    Abstract: Granting agencies invest millions of dollars on the generation and analysis of data, making these products extremely valuable. However, without sufficient annotation of the methods used to collect and analyze the data, the ability to reproduce and reuse those products suffers. This lack of assurance of the quality and credibility of the data at the different stages in the research process essentially wastes much of the investment of time and funding and fails to drive research forward to the level of potential possible if everything was effectively annotated and disseminated to the wider research community. In order to address this issue for the Hawai’i Established Program to Stimulate Competitive Research (EPSCoR) project, a water science gateway was developed at the University of Hawai‘i (UH), called the ‘Ike Wai Gateway. In Hawaiian, ‘Ike means knowledge and Wai means water. The gateway supports research in hydrology and water management by providing tools to address questions of water sustainability in Hawai‘i. The gateway provides a framework for data acquisition, analysis, model integration, and display of data products. The gateway is intended to complement and integrate with the capabilities of the Consortium of Universities for the Advancement of Hydrologic Science’s (CUAHSI) Hydroshare by providing sound data and metadata management capabilities for multi-domain field observations, analytical lab actions, and modeling outputs. Functionality provided by the gateway is supported by a subset of the CUAHSI’s Observations Data Model (ODM) delivered as centralized web based user interfaces and APIs supporting multi-domain data management, computation, analysis, and visualization tools to support reproducible science, modeling, data discovery, and decision support for the Hawai’i EPSCoR ‘Ike Wai research team and wider Hawai‘i hydrology community. By leveraging the Tapis platform, UH has constructed a gateway that ties data and advanced computing resources together to support diverse research domains including microbiology, geochemistry, geophysics, economics, and humanities, coupled with computational and modeling workflows delivered in a user friendly web interface with workflows for effectively annotating the project data and products. Disseminating results for the ‘Ike Wai project through the ‘Ike Wai data gateway and Hydroshare makes the research products accessible and reusable.

2. MESINESP2 Corpora: Annotated data for medical semantic indexing in Spanish

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    Updated Oct 28, 2021
    Cite
    Gasco, Luis; Krallinger, Martin (2021). MESINESP2 Corpora: Annotated data for medical semantic indexing in Spanish [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4612274
    Explore at:
    Dataset updated
    Oct 28, 2021
    Dataset provided by
    Barcelona Supercomputing Center
    Authors
    Gasco, Luis; Krallinger, Martin
    License

Attribution 4.0 (CC BY 4.0) - https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

Annotated corpora for the MESINESP2 shared task (Spanish BioASQ track, see https://temu.bsc.es/mesinesp2). BioASQ 2021 will be held at CLEF 2021 (scheduled in Bucharest, Romania, in September): http://clef2021.clef-initiative.eu/

Introduction: These corpora contain the data for each of the subtracks of the MESINESP2 shared task:

    [Subtrack 1] MESINESP - Medical indexing:

Training set: It contains all Spanish records from the LILACS and IBECS databases at the Virtual Health Library (VHL) with a non-empty abstract written in Spanish. We have filtered out empty abstracts and non-Spanish abstracts. We built the training dataset with the data crawled on 01/29/2021. This means that the data is a snapshot of that moment and may change over time, since LILACS and IBECS usually add or modify indexes after the first inclusion in the database. We distribute two different datasets:

    Articles training set: This corpus contains the set of 237574 Spanish scientific papers in VHL that have at least one DeCS code assigned to them.

Full training set: This corpus contains the whole set of 249474 Spanish documents from VHL that have at least one DeCS code assigned to them.

Development set: We provide a development set manually indexed by expert annotators. This dataset includes 1065 articles annotated with DeCS by three expert indexers in this controlled vocabulary. The articles were initially indexed by 7 annotators; after analyzing the Inter-Annotator Agreement among their annotations, we decided to select the 3 best ones, considering their annotations the valid ones to build the test set. From those 1065 records:

213 articles were annotated by more than one annotator. We have selected the union of their annotations.

852 articles were annotated by only one of the three selected, better-performing annotators.

    Test set: To be published

    [Subtrack 2] MESINESP - Clinical trials:

Training set: The training dataset contains records from the Registro Español de Estudios Clínicos (REEC). REEC does not provide documents with the title/abstract structure needed in BioASQ; for that reason, we have built artificial abstracts based on the content available in the data crawled using the REEC API. Clinical trials are not indexed with DeCS terminology, so we have used as training data a set of 3560 clinical trials that were automatically annotated in the first edition of MESINESP and published as a Silver Standard outcome. Because the performance of the models used by the participants was variable, we have only selected predictions from runs with a MiF higher than 0.41, which corresponds to the submission of the best team.

    Development set: We provide a development set manually indexed by expert annotators. This dataset includes 147 clinical trials annotated with DeCS by seven expert indexers in this controlled vocabulary.

    Test set: To be published

    [Subtrack 3] MESINESP - Patents: To be published

    Files structure:

    Subtrack1-Scientific_Literature.zip contains the corpora generated for subtrack 1. Content:

    Subtrack1:

    Train

    training_set_track1_all.json: Full training set for subtrack 1.

    training_set_track1_only_articles.json: Articles training set for subtrack 1.

    Development

    development_set_subtrack1.json: Manually annotated development set for subtrack 1.

    Subtrack2-Clinical_Trials.zip contains the corpora generated for subtrack 2. Content:

    Subtrack2:

    Train

    training_set_subtrack2.json: Training set for subtrack 2.

    Development

    development_set_subtrack2.json: Manually annotated development set for subtrack 2.

    DeCS2020.tsv contains a DeCS table with the following structure:

    DeCS code

    Preferred descriptor (the preferred label in the Latin Spanish DeCS 2020 set)

List of synonyms (the descriptors and synonyms from the Latin Spanish DeCS 2020 set, separated by pipes).

    DeCS2020.obo contains the *.obo file with the hierarchical relationships between DeCS descriptors.

    *Note: The obo and tsv files with DeCS2020 descriptors contain some additional COVID19 descriptors that will be included in future versions of DeCS. These items were provided by the Pan American Health Organization (PAHO), which has kindly shared this content to improve the results of the task by taking these descriptors into account.
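For convenience, a minimal Python loading sketch for the files listed above follows. It assumes the archives have been extracted so that the listed paths exist in the working directory; the internal schema of the training JSON is not documented here, so the snippet only inspects it, and the DeCS TSV is assumed to have no header row.

import json
import pandas as pd

# Inspect the subtrack 1 training file (its internal schema is not described above)
with open("Subtrack1/Train/training_set_track1_all.json", encoding="utf-8") as f:
    training_data = json.load(f)
print(type(training_data))

# DeCS table: code, preferred descriptor, pipe-separated synonyms (header row assumed absent)
decs = pd.read_csv("DeCS2020.tsv", sep="\t", header=None,
                   names=["decs_code", "preferred_descriptor", "synonyms"])
decs["synonyms"] = decs["synonyms"].str.split("|")
print(decs.head())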

    For further information, please visit https://temu.bsc.es/mesinesp2/ or email us at encargo-pln-life@bsc.es

3. Data from: Metaphor annotations in Polish political debates from 2020 (TVP...

    • live.european-language-grid.eu
    binary format
    Updated Jun 30, 2021
    Cite
    (2021). Metaphor annotations in Polish political debates from 2020 (TVP 2019-10-01 and TVN 2019-10-08) – presidential election [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/8682
    Explore at:
binary format
    Dataset updated
    Jun 30, 2021
    License

Attribution-ShareAlike 3.0 (CC BY-SA 3.0) - https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

The data published here are supplementary material for a paper to be published in Metaphor and the Social World (under revision).

Two debates organised and published by TVP and TVN were transcribed and annotated with the Metaphor Identification Method. We used the eMargin software (a collaborative textual annotation tool; Kehoe and Gee 2013) and a slightly modified version of MIP (Pragglejaz 2007). Each lexical unit in the transcript was labelled as a metaphor related word (MRW) if its "contextual meaning was related to the more basic meaning by some form of similarity" (Steen 2007). The meanings were established with the Wielki Słownik Języka Polskiego (Great Dictionary of Polish, ed. Żmigrodzki 2019). In addition to MRW, lexemes which create a metaphorical expression together with an MRW were tagged as metaphor expression words (MEW). At least two words are needed to identify the actual metaphorical expression, since an MRW cannot appear without an MEW. The grammatical construction of the metaphor (Sullivan 2009) is asymmetrical: one word is conceptually autonomous and the other is conceptually dependent on the first. In construction grammar terms (Langacker 2008), the metaphor related word is elaborated by the metaphorical expression word, because the basic meaning of the MRW is elaborated and extended to a more figurative meaning only if it is used jointly with the MEW. Moreover, the meaning of the MEW is rather basic and concrete, as it remains unchanged in connection with the MRW. This can be clearly seen in an expression often used in our data: "Służba zdrowia jest w zapaści" ("Health service suffers from a collapse."), where the word "zapaść" ("collapse") is an example of an MRW and the words "służba zdrowia" ("health service") are labelled as MEW. The English translation of this expression needs a different verb: instead of "jest w zapaści" ("is in collapse"), the English unmarked collocation is "suffers from a collapse", therefore the words "suffers from a collapse" are labelled as MRW. The "collapse" could be caused by heart failure, such as cardiac arrest or any other life-threatening medical condition, and "health service" is portrayed as if it could literally suffer from such a condition – a collapse.

The data are in CSV tables exported from XML files downloaded from the eMargin site. Prior to annotation, the transcripts were divided into 40 parts, each assigned to one annotator. MRW words are marked as MLN, MEW words are marked as MLP, functional words within a metaphorical expression are marked as MLI, and all other words are marked as noana, which means no annotation needed.
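As a hedged illustration only: the exact CSV layout and column names of the eMargin export are not specified above, so the file name and the "label" column below are hypothetical placeholders.

import pandas as pd

# Hypothetical file name; inspect the columns of the actual eMargin export first
annotations = pd.read_csv("debate_part_01.csv")
print(annotations.columns.tolist())

# Assuming a column named "label" holds the MLN / MLP / MLI / noana tags:
if "label" in annotations.columns:
    print(annotations["label"].value_counts())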

4. Annotating speaker stance in discourse: the Brexit Blog Corpus (BBC)

    • demo.researchdata.se
    • researchdata.se
    Updated Jan 15, 2019
    Cite
    Andreas Kerren; Carita Paradis (2019). Annotating speaker stance in discourse: the Brexit Blog Corpus (BBC) [Dataset]. http://doi.org/10.5878/002925
    Explore at:
    Dataset updated
    Jan 15, 2019
    Dataset provided by
    Linnaeus University
    Authors
    Andreas Kerren; Carita Paradis
    Time period covered
    Jun 1, 2015 - May 31, 2016
    Description

    In this study, we explore to what extent language users agree about what kind of stances are expressed in natural language use or whether their interpretations diverge. In order to perform this task, a comprehensive cognitive-functional framework of ten stance categories was developed based on previous work on speaker stance in the literature. A corpus of opinionated texts, where speakers take stance and position themselves, was compiled, the Brexit Blog Corpus (BBC). An analytical interface for the annotations was set up and the data were annotated independently by two annotators. The annotation procedure, the annotation agreement and the co-occurrence of more than one stance category in the utterances are described and discussed. The careful, analytical annotation process has by and large returned satisfactory inter- and intra-annotation agreement scores, resulting in a gold standard corpus, the final version of the BBC.

    Purpose:

    The aim of this study is to explore the possibility of identifying speaker stance in discourse, provide an analytical resource for it and an evaluation of the level of agreement across speakers in the area of stance-taking in discourse.

    The BBC is a collection of texts from blog sources. The corpus texts are thematically related to the 2016 UK referendum concerning whether the UK should remain members of the European Union or not. The texts were extracted from the Internet from June to August 2015. With the Gavagai API (https://developer.gavagai.se), the texts were detected using seed words, such as Brexit, EU referendum, pro-Europe, europhiles, eurosceptics, United States of Europe, David Cameron, or Downing Street. The retrieved URLs were filtered so that only entries described as blogs in English were selected. Each downloaded document was split into sentential utterances, from which 2,200 utterances were randomly selected as the analysis data set. The final size of the corpus is 1,682 utterances, 35,492 words (169,762 characters without spaces). Each utterance contains from 3 to 40 words with a mean length of 21 words.

    For the data annotation process the Active Learning and Visual Analytics (ALVA) system (https://doi.org/10.1145/3132169 and https://doi.org/10.2312/eurp.20161139) was used. Two annotators, one who is a professional translator with a Licentiate degree in English Linguistics and the other one with a PhD in Computational Linguistics, carried out the annotations independently of one another.

    The data set can be downloaded in two different formats: a standard Microsoft Excel format and a raw data format (ZIP archive) which can be useful for analytical and machine learning purposes, for example, with the Python library scikit-learn. The Excel file includes one additional variable (utterance word length). The ZIP archive contains a set of directories (e.g., "contrariety" and "prediction") corresponding to the stance categories. Inside of each such directory, there are two nested directories corresponding to annotations which assign or not assign the respective category to utterances (e.g., inside the top-level category "prediction" there are two directories, "prediction" with utterances which were labeled with this category, and "no" with the rest of the utterances). Inside of the nested directories, there are textual files containing individual utterances.
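Since the nested directory layout described above matches what scikit-learn's load_files helper expects, a minimal loading sketch for one stance category could look like the following (the extraction path is an assumption):

from sklearn.datasets import load_files

# Assumed extraction path; point load_files at one top-level stance directory,
# e.g. "prediction", which contains the two nested class folders described above.
bundle = load_files("bbc_raw/prediction", encoding="utf-8")
texts, labels = bundle.data, bundle.target
print(bundle.target_names)          # e.g. ['no', 'prediction']
print(len(texts), "utterances loaded")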

    When using data from this study, the primary researcher wishes citation also to be made to the publication: Vasiliki Simaki, Carita Paradis, Maria Skeppstedt, Magnus Sahlgren, Kostiantyn Kucher, and Andreas Kerren. Annotating speaker stance in discourse: the Brexit Blog Corpus. In Corpus Linguistics and Linguistic Theory, 2017. De Gruyter, published electronically before print. https://doi.org/10.1515/cllt-2016-0060

  5. Data from: CT-EBM-SP - Corpus of Clinical Trials for Evidence-Based-Medicine...

    • data.europa.eu
    • datos.gob.es
    unknown
    Updated Feb 12, 2022
    Cite
    Zenodo (2022). CT-EBM-SP - Corpus of Clinical Trials for Evidence-Based-Medicine in Spanish [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-6059737?locale=da
    Explore at:
unknown (2576817)
    Dataset updated
    Feb 12, 2022
    Dataset authored and provided by
Zenodo (http://zenodo.org/)
    License

Attribution 4.0 (CC BY 4.0) - https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

A collection of 1200 texts (292 173 tokens) about clinical trial studies and clinical trial announcements in Spanish:

- 500 abstracts from journals published under a Creative Commons license, e.g. available in PubMed or the Scientific Electronic Library Online (SciELO).
- 700 clinical trial announcements published in the European Clinical Trials Register and Repositorio Español de Estudios Clínicos.

Texts were annotated with entities from the Unified Medical Language System semantic groups: anatomy (ANAT), pharmacological and chemical substances (CHEM), pathologies (DISO), and lab tests, diagnostic or therapeutic procedures (PROC). 46 699 entities were annotated (13.98% are nested entities). 10% of the corpus was doubly annotated, and inter-annotator agreement (IAA) achieved a mean F-measure of 85.65% (±4.79, strict match) and a mean F-measure of 93.94% (±3.31, relaxed match). The corpus is freely distributed for research and educational purposes under a Creative Commons Non-Commercial Attribution (CC-BY-NC-A) License.

6. ImageCLEF 2012 Image annotation and retrieval dataset (MIRFLICKR)

    • zenodo.org
    txt, zip
    Updated May 22, 2020
    Cite
    Bart Thomee; Adrian Popescu; Bart Thomee; Adrian Popescu (2020). ImageCLEF 2012 Image annotation and retrieval dataset (MIRFLICKR) [Dataset]. http://doi.org/10.5281/zenodo.1246796
    Explore at:
zip, txt
    Dataset updated
    May 22, 2020
    Dataset provided by
    Zenodo
    Authors
    Bart Thomee; Adrian Popescu; Bart Thomee; Adrian Popescu
    License

Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0) - https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    DESCRIPTION
    For this task, we use a subset of the MIRFLICKR (http://mirflickr.liacs.nl) collection. The entire collection contains 1 million images from the social photo sharing website Flickr and was formed by downloading up to a thousand photos per day that were deemed to be the most interesting according to Flickr. All photos in this collection were released by their users under a Creative Commons license, allowing them to be freely used for research purposes. Of the entire collection, 25 thousand images were manually annotated with a limited number of concepts and many of these annotations have been further refined and expanded over the lifetime of the ImageCLEF photo annotation task. This year we used crowd sourcing to annotate all of these 25 thousand images with the concepts.

    On this page we provide you with more information about the textual features, visual features and concept features we supply with each image in the collection we use for this year's task.


    TEXTUAL FEATURES
    All images are accompanied by the following textual features:

    - Flickr user tags
These are the tags that the users assigned to the photos they uploaded to Flickr. The 'raw' tags are the original tags, while the 'clean' tags are those collapsed to lowercase and condensed to remove spaces.

    - EXIF metadata
    If available, the EXIF metadata contains information about the camera that took the photo and the parameters used. The 'raw' exif is the original camera data, while the 'clean' exif reduces the verbosity.

    - User information and Creative Commons license information
    This contains information about the user that took the photo and the license associated with it.


    VISUAL FEATURES
Over the previous years of the photo annotation task we noticed that the participants often use the same types of visual features; in particular, features based on interest points and bag-of-words are popular. To assist you, we have extracted several features that you may want to use, so you can focus on concept detection instead. We additionally give you some pointers to easy-to-use toolkits that will help you extract other features, or the same features with different default settings.

    - SIFT, C-SIFT, RGB-SIFT, OPPONENT-SIFT
    We used the ISIS Color Descriptors (http://www.colordescriptors.com) toolkit to extract these descriptors. This package provides you with many different types of features based on interest points, mostly using SIFT. It furthermore assists you with building codebooks for bag-of-words. The toolkit is available for Windows, Linux and Mac OS X.

    - SURF
    We used the OpenSURF (http://www.chrisevansdev.com/computer-vision-opensurf.html) toolkit to extract this descriptor. The open source code is available in C++, C#, Java and many more languages.

    - TOP-SURF
    We used the TOP-SURF (http://press.liacs.nl/researchdownloads/topsurf) toolkit to extract this descriptor, which represents images with SURF-based bag-of-words. The website provides codebooks of several different sizes that were created using a combination of images from the MIR-FLICKR collection and from the internet. The toolkit also offers the ability to create custom codebooks from your own image collection. The code is open source, written in C++ and available for Windows, Linux and Mac OS X.

    - GIST
    We used the LabelMe (http://labelme.csail.mit.edu) toolkit to extract this descriptor. The MATLAB-based library offers a comprehensive set of tools for annotating images.

    For the interest point-based features above we used a Fast Hessian-based technique to detect the interest points in each image. This detector is built into the OpenSURF library. In comparison with the Hessian-Laplace technique built into the ColorDescriptors toolkit it detects fewer points, resulting in a considerably reduced memory footprint. We therefore also provide you with the interest point locations in each image that the Fast Hessian-based technique detected, so when you would like to recalculate some features you can use them as a starting point for the extraction. The ColorDescriptors toolkit for instance accepts these locations as a separate parameter. Please go to http://www.imageclef.org/2012/photo-flickr/descriptors for more information on the file format of the visual features and how you can extract them yourself if you want to change the default settings.


    CONCEPT FEATURES
    We have solicited the help of workers on the Amazon Mechanical Turk platform to perform the concept annotation for us. To ensure a high standard of annotation we used the CrowdFlower platform that acts as a quality control layer by removing the judgments of workers that fail to annotate properly. We reused several concepts of last year's task and for most of these we annotated the remaining photos of the MIRFLICKR-25K collection that had not yet been used before in the previous task; for some concepts we reannotated all 25,000 images to boost their quality. For the new concepts we naturally had to annotate all of the images.

    - Concepts
    For each concept we indicate in which images it is present. The 'raw' concepts contain the judgments of all annotators for each image, where a '1' means an annotator indicated the concept was present whereas a '0' means the concept was not present, while the 'clean' concepts only contain the images for which the majority of annotators indicated the concept was present. Some images in the raw data for which we reused last year's annotations only have one judgment for a concept, whereas the other images have between three and five judgments; the single judgment does not mean only one annotator looked at it, as it is the result of a majority vote amongst last year's annotators.

    - Annotations
    For each image we indicate which concepts are present, so this is the reverse version of the data above. The 'raw' annotations contain the average agreement of the annotators on the presence of each concept, while the 'clean' annotations only include those for which there was a majority agreement amongst the annotators.

    You will notice that the annotations are not perfect. Especially when the concepts are more subjective or abstract, the annotators tend to disagree more with each other. The raw versions of the concept annotations should help you get an understanding of the exact judgments given by the annotators.
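As an illustration of the 'raw' versus 'clean' distinction, a clean judgment is essentially a majority vote over the raw per-annotator judgments; the dictionary layout below is hypothetical and not the actual file format.

# Hypothetical raw judgments: image id -> one 0/1 vote per annotator
raw_judgments = {
    "im1001": [1, 1, 0, 1, 0],   # five annotators
    "im1002": [0, 0, 1],         # three annotators
}
# 'Clean' label: the concept counts as present only with a majority of votes
clean = {img: int(sum(votes) > len(votes) / 2) for img, votes in raw_judgments.items()}
print(clean)   # {'im1001': 1, 'im1002': 0}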

7. 330K+ Interior Design Images | AI Training Data | Annotated imagery data for...

    • datarade.ai
    Cite
    Data Seeds, 330K+ Interior Design Images | AI Training Data | Annotated imagery data for AI | Object & Scene Detection | Global Coverage [Dataset]. https://datarade.ai/data-products/200k-interior-design-images-ai-training-data-annotated-i-data-seeds
    Explore at:
.bin, .json, .xml, .csv, .xls, .sql, .txt
    Dataset authored and provided by
    Data Seeds
    Area covered
    Indonesia, Jamaica, Egypt, Curaçao, Turks and Caicos Islands, Congo, Tajikistan, Kuwait, Nicaragua, Ethiopia
    Description

    This dataset features over 330,000 high-quality interior design images sourced from photographers worldwide. Designed to support AI and machine learning applications, it provides a richly varied and extensively annotated collection of indoor environment visuals.

Key Features:

1. Comprehensive Metadata: the dataset includes full EXIF data, detailing camera settings such as aperture, ISO, shutter speed, and focal length. Each image is pre-annotated with object and scene detection metadata, making it ideal for tasks such as room classification, furniture detection, and spatial layout analysis. Popularity metrics, derived from engagement on our proprietary platform, are also included.

2. Unique Sourcing Capabilities: the images are collected through a proprietary gamified platform for photographers. Competitions centered on interior design themes ensure a steady stream of fresh, high-quality submissions. Custom datasets can be sourced on-demand within 72 hours to fulfill specific requests, such as particular room types, design styles, or furnishings.

3. Global Diversity: photographs have been sourced from contributors in over 100 countries, covering a wide spectrum of architectural styles, cultural aesthetics, and functional spaces. The images include homes, offices, restaurants, studios, and public interiors—ranging from minimalist and modern to classic and eclectic designs.

4. High-Quality Imagery: the dataset includes standard to ultra-high-definition images that capture fine interior details. Both professionally staged and candid real-life spaces are included, offering versatility for training AI across design evaluation, object detection, and environmental understanding.

5. Popularity Scores: each image is assigned a popularity score based on its performance in GuruShots competitions. This provides valuable insights into global aesthetic trends, helping AI models learn user preferences, design appeal, and stylistic relevance.

6. AI-Ready Design: the dataset is optimized for machine learning tasks such as interior scene recognition, style transfer, virtual staging, and layout generation. It integrates smoothly with popular AI development environments and tools.

7. Licensing & Compliance: the dataset fully complies with data privacy regulations and includes transparent licensing suitable for commercial and academic use.

Use Cases:

1. Training AI for interior design recommendation engines and virtual staging tools.
2. Enhancing smart home applications and spatial recognition systems.
3. Powering AR/VR platforms for virtual tours, furniture placement, and room redesign.
4. Supporting architectural visualization, decor style transfer, and real estate marketing.

    This dataset offers a comprehensive, high-quality resource tailored for AI-driven innovation in design, real estate, and spatial computing. Customizations are available upon request. Contact us to learn more!

  8. Annotations and associated frequency signals

    • figshare.com
    bin
    Updated Nov 1, 2023
    Cite
    Karl Löwenmark (2023). Annotations and associated frequency signals [Dataset]. http://doi.org/10.6084/m9.figshare.24470620.v1
    Explore at:
bin
    Dataset updated
    Nov 1, 2023
    Dataset provided by
Figshare (http://figshare.com/)
    Authors
    Karl Löwenmark
    License

Attribution 4.0 (CC BY 4.0) - https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

Labelled industry datasets are the most valuable asset in prognostics and health management (PHM) research. However, creating labelled industry datasets is both difficult and expensive, making publicly available industry datasets rare at best. While labels are generally unavailable, many industry datasets contain annotations, maintenance work orders, or logbooks, with free-form text containing technical language descriptions of component properties, valuable information for any PHM model. Alas, publicly available annotated industry datasets are also scarce, in particular ones with associated signals available. Therefore, we release data from an annotated process industry dataset, consisting of 21090 pairs of signals and annotations from one year of kraftliner production.

The annotations are written, in Swedish, by on-site Swedish experts, and the signals consist of accelerometer vibration measurements from two large (80x10x10m) paper machines. The data is cleaned and structured so that each annotation is associated with ten days of signal measurements leading up to the annotation date, where one signal measurement consists of 8192 samples over 6.4 seconds, which becomes 3200 samples stretching over 500 Hz in the frequency domain. The associated annotations are attached to each signal sample, so that the list of annotations is as long as the list of signals. In total, there are 43 unique annotations, though most are associated with multiple signals from different machines due to commonalities in fault descriptions. The language data is pre-processed so that all letters are lower case, numbers are removed, and names are replaced with the Swedish word "egennamn", meaning "name of a person" in English.

Also included are pre-computed embeddings, which facilitate faster and easier testing for researchers wanting to investigate training signal encoders supervised through technical language supervision.

The data presented here was used in the article "Technical Language Supervision for Intelligent Fault Diagnosis in Process Industry" (https://papers.phmsociety.org/index.php/ijphm/article/view/3137). Please cite this article if you use this dataset.

To use this dataset without understanding Swedish, please consult "Processing of Condition Monitoring Annotations with BERT and Technical Language Substitution: A Case Study" (https://www.papers.phmsociety.org/index.php/phme/article/view/3356) on how to augment the technical data to facilitate easier language model translations to other languages, and don't hesitate to contact me if you have questions regarding the data.

Accessing the data is simple; all you need to do to load spectra and annotation pairs is:

import pandas as pd

# Load the pickled DataFrame of spectrum/annotation pairs
spectra_note_df = pd.read_pickle("TL_spectra_note_df_big.pkl")
all_spectra = spectra_note_df['Spectra']
all_annotations = spectra_note_df['Notes']

Pre-computed embeddings can be accessed through:

all_embeddings = spectra_note_df['Embeddings']

9. Data from: Quetzal: Comprehensive Peptide Fragmentation Annotation and...

    • acs.figshare.com
    xlsx
    Updated Mar 20, 2025
    Cite
    Eric W. Deutsch; Luis Mendoza; Robert L. Moritz (2025). Quetzal: Comprehensive Peptide Fragmentation Annotation and Visualization [Dataset]. http://doi.org/10.1021/acs.jproteome.5c00092.s002
    Explore at:
xlsx
    Dataset updated
    Mar 20, 2025
    Dataset provided by
    ACS Publications
    Authors
    Eric W. Deutsch; Luis Mendoza; Robert L. Moritz
    License

Attribution-NonCommercial 4.0 (CC BY-NC 4.0) - https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Proteomics data-dependent acquisition data sets collected with high-resolution mass-spectrometry (MS) can achieve very high-quality results, but nearly every analysis yields results that are thresholded at some accepted false discovery rate, meaning that a substantial number of results are incorrect. For study conclusions that rely on a small number of peptide-spectrum matches being correct, it is thus important to examine at least some crucial spectra to ensure that they are not one of the incorrect identifications. We present Quetzal, a peptide fragment ion spectrum annotation tool to assist researchers in annotating and examining such spectra to ensure that they correctly support study conclusions. We describe how Quetzal annotates spectra using the new Human Proteome Organization (HUPO) Proteomics Standards Initiative (PSI) mzPAF standard for fragment ion peak annotation, including the Python-based code, a web-service end point that provides annotation services, and a web-based application for annotating spectra and producing publication-quality figures. We illustrate its functionality with several annotated spectra of varying complexity. Quetzal provides easily accessible functionality that can assist in the effort to ensure and demonstrate that crucial spectra support study conclusions. Quetzal is publicly available at https://proteomecentral.proteomexchange.org/quetzal/.

10. Expert annotations for the Catalan Common Voice (v13)

    • data.niaid.nih.gov
    Updated May 2, 2024
    Cite
    Language Technologies Unit (2024). Expert annotations for the Catalan Common Voice (v13) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_11104387
    Explore at:
    Dataset updated
    May 2, 2024
    Dataset provided by
Barcelona Supercomputing Center (https://www.bsc.es/)
    Authors
    Language Technologies Unit
    License

Attribution 4.0 (CC BY 4.0) - https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Description

    Dataset Summary

    These are the annotations made by a team of experts on the speakers with more than 1200 seconds recorded in the Catalan set of the Common Voice dataset (v13).

    The annotators were initially tasked with evaluating all recordings associated with the same individual. Following that, they were instructed to annotate the speaker's accent, gender, and the overall quality of the recordings.

    The accents and genders taken into account are the ones used until version 8 of the Common Voice corpus.

    See annotations for more details.

    Supported Tasks and Leaderboards

    Gender classification, Accent classification.

    Languages

    The dataset is in Catalan (ca).

    Dataset Structure

    Instances

    Two xlsx documents are published, one for each round of annotations.

    The following information is available in each of the documents:

{
  'speaker ID': '1b7fc0c4e437188bdf1b03ed21d45b780b525fd0dc3900b9759d0755e34bc25e31d64e69c5bd547ed0eda67d104fc0d658b8ec78277810830167c53ef8ced24b',
  'idx': '31',
  'same speaker': {'AN1': 'SI', 'AN2': 'SI', 'AN3': 'SI', 'agreed': 'SI', 'percentage': '100'},
  'gender': {'AN1': 'H', 'AN2': 'H', 'AN3': 'H', 'agreed': 'H', 'percentage': '100'},
  'accent': {'AN1': 'Central', 'AN2': 'Central', 'AN3': 'Central', 'agreed': 'Central', 'percentage': '100'},
  'audio quality': {'AN1': '4.0', 'AN2': '3.0', 'AN3': '3.0', 'agreed': '3.0', 'percentage': '66', 'mean quality': '3.33', 'stdev quality': '0.58'},
  'comments': {'AN1': '', 'AN2': 'pujades i baixades de volum', 'AN3': "Deu ser d'alguna zona de transició amb el central, perquè no fa una reducció total vocàlica, però hi té molta tendència"},
}

We also publish the document Guia anotació parlants.pdf, with the guidelines the annotators received.

    Data Fields

speaker ID (string): An ID identifying which client (voice) made the recording in the Common Voice corpus

    idx (int): Id in this corpus

    AN1 (string): Annotations from Annotator 1

    AN2 (string): Annotations from Annotator 2

    AN3 (string): Annotations from Annotator 3

    agreed (string): Annotation from the majority of the annotators

    percentage (int): Percentage of annotators that agree with the agreed annotation

    mean quality (float): Mean of the quality annotation

    stdev quality (float): Standard deviation of the mean quality
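A minimal sketch for loading one of the annotation rounds with pandas; the file name is a placeholder, since the exact names of the two published xlsx documents are not stated above, and an xlsx engine such as openpyxl is assumed to be installed.

import pandas as pd

# Placeholder file name for one annotation round
round1 = pd.read_excel("expert_annotations_round1.xlsx")
print(round1.shape)
print(round1.columns.tolist())   # inspect how the fields above map onto spreadsheet columns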

    Data Splits

    The corpus remains undivided into splits, as its purpose does not involve training models.

    Dataset Creation

    Curation Rationale

    During 2022, a campaign was launched to promote the Common Voice corpus within the Catalan-speaking community, achieving remarkable success. However, not all participants provided their demographic details such as age, gender, and accent. Additionally, some individuals faced difficulty in self-defining their accent using the standard classifications established by specialists.

In order to obtain a balanced corpus with reliable information, we have seen the necessity of enlisting a group of experts from the University of Barcelona to provide accurate annotations.

    We release the complete annotations because transparency is fundamental to our project. Furthermore, we believe they hold philological value for studying dialectal and gender variants.

    Source Data

    The original data comes from the Catalan sentences of the Common Voice corpus.

    Initial Data Collection and Normalization

    We have selected speakers who have recorded more than 1200 seconds of speech in the Catalan set of the version 13 of the Common Voice corpus.

    Who are the source language producers?

    The original data comes from the Catalan sentences of the Common Voice corpus.

    Annotations

    Annotation process

    Starting with version 13 of the Common Voice corpus we identified the speakers (273) who have recorded more than 1200 seconds of speech.

    A team of three annotators was tasked with annotating:

    if all the recordings correspond to the same person

    the gender of the speaker

    the accent of the speaker

    the quality of the recording

    They conducted an initial round of annotation, discussed their varying opinions, and subsequently conducted a second round.

    We release the complete annotations because transparency is fundamental to our project. Furthermore, we believe they hold philological value for studying dialectal and gender variants.

    Who are the annotators?

    The annotation was entrusted to the CLiC (Centre de Llenguatge i Computació) team from the University of Barcelona. They selected a group of three annotators (two men and one woman), who received a scholarship to do this work.

    The annotation team was composed of:

    Annotator 1: 1 female annotator, aged 18-25, L1 Catalan, student in the Modern Languages and Literatures degree, with a focus on Catalan.

    Annotators 2 & 3: 2 male annotators, aged 18-25, L1 Catalan, students in the Catalan Philology degree.

    1 female supervisor, aged 40-50, L1 Catalan, graduate in Physics and in Linguistics, Ph.D. in Signal Theory and Communications.

To do the annotation, they used a Google Drive spreadsheet.

    Personal and Sensitive Information

The Common Voice dataset consists of people who have donated their voices online. We do not share their voices here, only their gender and accent. You agree to not attempt to determine the identity of speakers in the Common Voice dataset.

    Considerations for Using the Data

    Social Impact of Dataset

The IDs come from the Common Voice dataset, which consists of people who have donated their voices online.

    You agree to not attempt to determine the identity of speakers in the Common Voice dataset.

    The information from this corpus will allow us to train and evaluate well balanced Catalan ASR models. Furthermore, we believe they hold philological value for studying dialectal and gender variants.

    Discussion of Biases

Most of the voices in the Catalan Common Voice correspond to men with a central accent, between 40 and 60 years old. The aim of this dataset is to provide information that makes it possible to minimize the biases this could cause.

    For the gender annotation, we have only considered "H" (male) and "D" (female).

    Other Known Limitations

    [N/A]

    Additional Information

    Dataset Curators

    Language Technologies Unit at the Barcelona Supercomputing Center (langtech@bsc.es)

    This work has been promoted and financed by the Generalitat de Catalunya through the Aina project.

    Licensing Information

    This dataset is licensed under a CC BY 4.0 license.

    It can be used for any purpose, whether academic or commercial, under the terms of the license. Give appropriate credit, provide a link to the license, and indicate if changes were made.

    Citation Information

    DOI

    Contributions

    The annotation was entrusted to the STeL team from the University of Barcelona.

  11. Data from: FluoroMatch 2.0-making automated and comprehensive non-targeted...

    • catalog.data.gov
    • s.cnmilf.com
    Updated Feb 10, 2023
    Cite
    U.S. EPA Office of Research and Development (ORD) (2023). FluoroMatch 2.0-making automated and comprehensive non-targeted PFAS annotation a reality [Dataset]. https://catalog.data.gov/dataset/fluoromatch-2-0-making-automated-and-comprehensive-non-targeted-pfas-annotation-a-reality
    Explore at:
    Dataset updated
    Feb 10, 2023
    Dataset provided by
United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    Data for "Koelmel JP, Stelben P, McDonough CA, Dukes DA, Aristizabal-Henao JJ, Nason SL, Li Y, Sternberg S, Lin E, Beckmann M, Williams AJ, Draper J, Finch JP, Munk JK, Deigl C, Rennie EE, Bowden JA, Godri Pollitt KJ. FluoroMatch 2.0-making automated and comprehensive non-targeted PFAS annotation a reality. Anal Bioanal Chem. 2022 Jan;414(3):1201-1215. doi: 10.1007/s00216-021-03392-7. Epub 2021 May 20. PMID: 34014358.". Portions of this dataset are inaccessible because: The link provided by UCSD doesn't seem to be working. They can be accessed through the following means: Contact Jeremy Koelmel at Yale University, jeremykoelmel@innovativeomics.com. Format: The final annotated excel sheets with feature intensities, annotations, homologous series groupings, etc., are available as a supplemental excel file with the online version of this manuscript. The raw Agilent “.d” files can be downloaded at: ftp://massive.ucsd.edu/MSV000086811/updates/2021-02-05_jeremykoelmel_e5b21166/raw/McDonough_AFFF_3M_ddMS2_Neg.zip (Note use Google Chrome or Firefox, Microsoft Edge and certain other browsers are unable to download from an ftp link). This dataset is associated with the following publication: Koelmel, J.P., P. Stelben, C.A. McDonough, D.A. Dukes, J.J. Aristizabal-Henao, S.L. Nason, Y. Li, S. Sternberg, E. Lin, M. Beckmann, A. Williams, J. Draper, J. Finch, J.K. Munk, C. Deigl, E. Rennie, J.A. Bowden, and K.J. Godri Pollitt. FluoroMatch 2.0—making automated and comprehensive non-targeted PFAS annotation a reality. Analytical and Bioanalytical Chemistry. Springer, New York, NY, USA, 414(3): 1201-1215, (2022).

12. Data from: Slovenian Word in Context dataset SloWiC 1.0

    • clarin.si
    • live.european-language-grid.eu
    Updated Mar 23, 2023
    Cite
    Timotej Knez; Slavko Žitnik (2023). Slovenian Word in Context dataset SloWiC 1.0 [Dataset]. https://clarin.si/repository/xmlui/handle/11356/1781?locale-attribute=en
    Explore at:
    Dataset updated
    Mar 23, 2023
    Authors
    Timotej Knez; Slavko Žitnik
    License

Attribution-ShareAlike 4.0 (CC BY-SA 4.0) - https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

The SloWiC dataset is a Slovenian dataset for the Word in Context task. Each example in the dataset contains a target word with multiple meanings and two sentences that both contain the target word. Each example is also annotated with a label that shows whether both sentences use the same meaning of the target word. The dataset contains 1808 manually annotated sentence pairs and an additional 13150 automatically annotated pairs to help with training larger models. The dataset is stored in the JSON format following the format used in the SuperGLUE version of the Word in Context task (https://super.gluebenchmark.com/).

Each example contains the following data fields:

- word: The target word with multiple meanings
- sentence1: The first sentence containing the target word
- sentence2: The second sentence containing the target word
- idx: The index of the example in the dataset
- label: Label showing if the sentences contain the same meaning of the target word
- start1: Start of the target word in the first sentence
- start2: Start of the target word in the second sentence
- end1: End of the target word in the first sentence
- end2: End of the target word in the second sentence
- version: The version of the annotation
- manual_annotation: Boolean showing if the label was manually annotated
- group: The group of annotators that labelled the example
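A hedged loading sketch based on the field list above: whether the file is a single JSON array or one JSON object per line (as in the SuperGLUE WiC release) is not stated, so the snippet assumes the latter, and the file name is a placeholder.

import json

examples = []
with open("SloWiC.json", encoding="utf-8") as f:   # placeholder file name
    for line in f:
        if line.strip():
            examples.append(json.loads(line))

ex = examples[0]
# start/end offsets delimit the target word in each sentence
print(ex["word"], ex["label"], ex["manual_annotation"])
print(ex["sentence1"][ex["start1"]:ex["end1"]])
print(ex["sentence2"][ex["start2"]:ex["end2"]])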

  13. GMB Data

    • kaggle.com
    zip
    Updated Jul 31, 2025
    Cite
    Ghassen Khaled (2025). GMB Data [Dataset]. https://www.kaggle.com/datasets/ghassenkhaled/gmb-data
    Explore at:
zip (3265952 bytes)
    Dataset updated
    Jul 31, 2025
    Authors
    Ghassen Khaled
    License

CC0 1.0 Universal (Public Domain Dedication) - https://creativecommons.org/publicdomain/zero/1.0/

    Description

    For this notebook, we're going to use the GMB (Groningen Meaning Bank) corpus for named entity recognition. GMB is a fairly large corpus with a lot of annotations. The data is labeled using the IOB format (short for inside, outside, beginning), which means each annotation also needs a prefix of I, O, or B.

    The following classes appear in the dataset:

LOC - Geographical Entity
ORG - Organization
PER - Person
GPE - Geopolitical Entity
TIME - Time indicator
ART - Artifact
EVE - Event
NAT - Natural Phenomenon

Note: GMB is not completely human annotated, and it is not considered 100% correct. For this exercise, classes ART, EVE, and NAT were combined into a MISC class due to the small number of examples for these classes.
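For illustration only, here is a hypothetical sentence tagged in this IOB scheme (the sentence and tags are invented, not drawn from the corpus):

# B- marks the first token of an entity, I- marks a continuation token,
# and O marks tokens outside any entity.
tokens = ["George", "Washington", "visited", "Paris", "in", "1789", "."]
tags   = ["B-PER",  "I-PER",      "O",       "B-LOC", "O",  "B-TIME", "O"]
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")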

  14. HelpSteer: AI Alignment Dataset

    • kaggle.com
    zip
    Updated Nov 22, 2023
    Cite
    The Devastator (2023). HelpSteer: AI Alignment Dataset [Dataset]. https://www.kaggle.com/datasets/thedevastator/helpsteer-ai-alignment-dataset
    Explore at:
zip (16614333 bytes)
    Dataset updated
    Nov 22, 2023
    Authors
    The Devastator
    License

CC0 1.0 Universal (Public Domain Dedication) - https://creativecommons.org/publicdomain/zero/1.0/

    Description

    HelpSteer: AI Alignment Dataset

    Real-World Helpfulness Annotated for AI Alignment

    By Huggingface Hub [source]

    About this dataset

HelpSteer is an open-source dataset designed to empower AI alignment through fair, team-oriented annotation. The dataset provides 37,120 samples, each containing a prompt and response along with five human-annotated attributes ranging between 0 and 4, with higher values indicating better quality. Using cutting-edge methods in machine learning and natural language processing in combination with annotation by data experts, HelpSteer strives to create a set of standardized values that can be used to measure alignment between human and machine interactions. With responses rated for correctness, coherence, complexity, helpfulness and verbosity, HelpSteer sets out to assist organizations in fostering reliable AI models which produce more accurate results, leading towards an improved user experience at all levels.


    How to use the dataset

    How to Use HelpSteer: An Open-Source AI Alignment Dataset

    HelpSteer is an open-source dataset designed to help researchers create models with AI Alignment. The dataset consists of 37,120 different samples each containing a prompt, a response and five human-annotated attributes used to measure these responses. This guide will give you a step-by-step introduction on how to leverage HelpSteer for your own projects.

    Step 1 - Choosing the Data File

HelpSteer contains two data files – one for training and one for validation. To start exploring the dataset, first select the file you would like to use by downloading both train.csv and validation.csv from the Kaggle page linked above or getting them from the Google Drive repository attached here: [link]. All the samples in each file consist of 7 columns with information about a single response: prompt (given), response (submitted), helpfulness, correctness, coherence, complexity and verbosity; all with values between 0 and 4, where higher means better in the respective category.

Step 2 - Exploratory Data Analysis (EDA)

Once you have your file loaded into your workspace or favorite software environment (e.g., libraries like Pandas/NumPy, or even Microsoft Excel), it's time to explore it further by running some basic EDA commands that summarize each feature's distribution and note potential trends or points of interest, e.g., which traits polarize the responses most, and whether there are outliers that might signal something interesting. Plotting these results often provides useful insight into patterns across the dataset, which can be reused later during the modeling phase, also known as "feature engineering".

Step 3 - Data Preprocessing

Your interpretation of the raw data during EDA should produce some hypotheses about which features matter most for accurately estimating the attribute scores of unseen responses. Preprocessing, such as cleaning up missing entries or handling outliers, is therefore highly recommended before starting any modelling effort with this data set. If you are unsure about the allowed value ranges of specific attributes, refer back to the Kaggle page description section for extra confidence during this step; having the numerical ranges right makes the modelling workload lighter later on when building predictive models. It is important not to rush this stage, otherwise poor results may follow when aiming for high accuracy at model deployment.
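A minimal EDA sketch in Python, assuming train.csv has been downloaded into the working directory and contains the seven columns listed in Step 1:

import pandas as pd

train = pd.read_csv("train.csv")
attributes = ["helpfulness", "correctness", "coherence", "complexity", "verbosity"]

print(train.shape)
print(train[attributes].describe())   # distributions of the 0-4 attribute scores
print(train[attributes].corr())       # how the five annotated attributes co-vary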

    Research Ideas

    • Designating and measuring conversational AI engagement goals: Researchers can utilize the HelpSteer dataset to design evaluation metrics for AI engagement systems.
    • Identifying conversational trends: By analyzing the annotations and data in HelpSteer, organizations can gain insights into what makes conversations more helpful, cohesive, complex or consistent across datasets or audiences.
    • Training Virtual Assistants: Train artificial intelligence algorithms on this dataset to develop virtual assistants that respond effectively to customer queries with helpful answers

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    **License: [CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication](https://creativecommons.org/pu...

  15. Social Media Corpus: Stigma Identification in Vaccination Discourse

    • figshare.com
    txt
    Updated May 21, 2025
    Cite
    Straton (2025). Social Media Corpus: Stigma Identification in Vaccination Discourse [Dataset]. http://doi.org/10.6084/m9.figshare.23277392.v4
    Explore at:
txt
    Dataset updated
    May 21, 2025
    Dataset provided by
Figshare (http://figshare.com/)
    Authors
    Straton
    License

Attribution 4.0 (CC BY 4.0) - https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

The current research introduces an annotated gold standard dataset based on 2,663 comments from Meta (Facebook). The dataset is manually labelled for stigma, not stigma, and ambiguous sentiment. Each comment is labelled three times (four times in case of dissensus) by independent expert annotators. The overall observed share of agreement reached 68% and the Fleiss Kappa agreement rate achieved 0.62 on the annotation task with three labels ("stigma", "not stigma", and "ambiguous"). The annotation share of agreement between two labels ("stigma", "not stigma") is 89% and Fleiss Kappa is 0.84. The labels are subsequently propagated from the annotated Facebook (Meta) dataset to a dataset discussing COVID vaccines with 40,084 comments from Twitter, Reddit, and YouTube corpora. In addition, the corpora are annotated with linguistic features from LIWC (Linguistic Inquiry and Word Count) [1], [2] and additional features: the number of characters in the comment string, a sentiment score, and a subjectivity score.

    1. Pennebaker, J. W., Francis, M. E. & Booth, R. J. Linguistic inquiry and word count: Liwc 2001. Mahway: Lawrence Erlbaum Assoc. 71, 2001 (2001).
    2. Tausczik, Y. R. & Pennebaker, J. W. The psychological meaning of words: Liwc and computerised text analysis methods. J. language social psychology 29, 24–54 (2010)
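The description above does not name the tools used for the sentiment and subjectivity scores; as a hedged illustration only, the snippet below computes the three additional features for a new comment with TextBlob, which is one possible choice rather than the documented pipeline.

from textblob import TextBlob

comment = "Vaccines saved millions of lives."   # example input, not from the corpus
blob = TextBlob(comment)
features = {
    "n_characters": len(comment),
    "sentiment_score": blob.sentiment.polarity,          # in [-1, 1]
    "subjectivity_score": blob.sentiment.subjectivity,   # in [0, 1]
}
print(features)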
  16. Annotated GMB Corpus

    • kaggle.com
    zip
    Updated Oct 7, 2018
    Cite
    Shoumik (2018). Annotated GMB Corpus [Dataset]. https://www.kaggle.com/shoumikgoswami/annotated-gmb-corpus
    Explore at:
zip (473318 bytes)
    Dataset updated
    Oct 7, 2018
    Authors
    Shoumik
    License

Database Contents License (DbCL) 1.0 - http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    Context

Named Entity Recognition on an annotated corpus: the GMB (Groningen Meaning Bank) corpus is used for entity classification, with enhanced and popular Natural Language Processing features applied to the data set.

    Content

    The dataset is an extract from the GMB corpus, which is tagged, annotated, and built specifically to train a classifier to predict named entities such as names, locations, etc. GMB is a fairly large corpus with a lot of annotations. Unfortunately, GMB is not perfect: it is not a gold-standard corpus, meaning that it is not completely human-annotated and is not considered 100% correct. The corpus was created using existing annotation tools and then corrected by humans where needed. The attached dataset is in tab-separated format, and the goal is to build a good model to classify the Tag column. The data is labelled using the IOB tagging system. The classes in the dataset are:

    • geo = Geographical Entity
    • org = Organization
    • per = Person
    • gpe = Geopolitical Entity
    • tim = Time indicator
    • art = Artifact
    • eve = Event
    • nat = Natural Phenomenon
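
    As a starting point for the classification task described above, the sketch below loads a tab-separated token/tag file and regroups the IOB tags into entity spans. The column names ('Sentence #', 'Word', 'Tag'), the encoding, and the file name are assumptions based on common versions of this corpus; adjust them to the actual header of the download.

    ```python
    import pandas as pd

    def read_gmb(path: str) -> pd.DataFrame:
        """Load the tab-separated corpus; assumed columns: 'Sentence #', 'Word', 'POS', 'Tag'."""
        df = pd.read_csv(path, sep="\t", encoding="latin1")  # adjust encoding if needed
        # The sentence number is often given only on the first token of each sentence.
        df["Sentence #"] = df["Sentence #"].ffill()
        return df

    def iob_to_entities(tokens, tags):
        """Collapse IOB tags (B-geo, I-geo, O, ...) into (entity_text, entity_type) spans."""
        entities, current, current_type = [], [], None
        for token, tag in zip(tokens, tags):
            if tag.startswith("B-"):
                if current:
                    entities.append((" ".join(current), current_type))
                current, current_type = [token], tag[2:]
            elif tag.startswith("I-") and current:
                current.append(token)
            else:  # 'O' tag, or a stray 'I-' without a preceding 'B-'
                if current:
                    entities.append((" ".join(current), current_type))
                current, current_type = [], None
        if current:
            entities.append((" ".join(current), current_type))
        return entities

    # Example with a hypothetical file name:
    # df = read_gmb("annotated_gmb.tsv")
    # for _, sent in df.groupby("Sentence #"):
    #     print(iob_to_entities(sent["Word"].tolist(), sent["Tag"].tolist()))
    ```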

    Acknowledgements

    The dataset is a subset of the original dataset shared here - https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus/kernels

    Inspiration

    The data can be used by anyone who is starting off with NER in NLP.

  17. 25M+ Images | AI Training Data | Annotated imagery data for AI | Object &...

    • datarade.ai
    + more versions
    Cite
    Data Seeds, 25M+ Images | AI Training Data | Annotated imagery data for AI | Object & Scene Detection | Global Coverage [Dataset]. https://datarade.ai/data-products/15m-images-ai-training-data-annotated-imagery-data-for-a-data-seeds
    Explore at:
    .bin, .json, .xml, .csv, .xls, .sql, .txtAvailable download formats
    Dataset authored and provided by
    Data Seeds
    Area covered
    Venezuela (Bolivarian Republic of), Macedonia (the former Yugoslav Republic of), Bulgaria, Iraq, Botswana, China, Cabo Verde, Tanzania (United Republic of), Sierra Leone, Honduras
    Description

    This dataset features over 25,000,000 high-quality general-purpose images sourced from photographers worldwide. Designed to support a wide range of AI and machine learning applications, it offers a richly diverse and extensively annotated collection of everyday visual content.

    Key Features:

    1. Comprehensive Metadata: the dataset includes full EXIF data, detailing camera settings such as aperture, ISO, shutter speed, and focal length (a minimal EXIF-reading sketch follows this description). Additionally, each image is pre-annotated with object and scene detection metadata, making it ideal for tasks like classification, detection, and segmentation. Popularity metrics, derived from engagement on our proprietary platform, are also included.

    2. Unique Sourcing Capabilities: the images are collected through a proprietary gamified platform for photographers. Competitions spanning various themes ensure a steady influx of diverse, high-quality submissions. Custom datasets can be sourced on demand within 72 hours, allowing specific requirements, such as themes, subjects, or scenarios, to be met efficiently.

    3. Global Diversity: photographs have been sourced from contributors in over 100 countries, covering a wide range of human experiences, cultures, environments, and activities. The dataset includes images of people, nature, objects, animals, urban and rural life, and more, captured across different times of day, seasons, and lighting conditions.

    4. High-Quality Imagery: the dataset includes images with resolutions ranging from standard to high-definition to meet the needs of various projects. Both professional and amateur photography styles are represented, offering a balance of realism and creativity across visual domains.

    5. Popularity Scores: each image is assigned a popularity score based on its performance in GuruShots competitions. This unique metric reflects how well the image resonates with a global audience, offering an additional layer of insight for AI models focused on aesthetics, engagement, or content curation.

    6. AI-Ready Design: this dataset is optimized for AI applications, making it ideal for training models in general image recognition, multi-label classification, content filtering, and scene understanding. It integrates easily with leading machine learning frameworks and pipelines.

    7. Licensing & Compliance: the dataset complies fully with data privacy regulations and offers transparent licensing for both commercial and academic use.

    Use Cases:
    1. Training AI models for general-purpose image classification and tagging.
    2. Enhancing content moderation and visual search systems.
    3. Building foundational datasets for large-scale vision-language models.
    4. Supporting research in computer vision, multimodal AI, and generative modeling.

    This dataset offers a comprehensive, diverse, and high-quality resource for training AI and ML models across a wide array of domains. Customizations are available to suit specific project needs. Contact us to learn more!
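
    As a quick illustration of how EXIF metadata of the kind listed under Key Features can be consumed, the sketch below reads aperture, ISO, shutter speed, and focal length from an image file with Pillow (assuming a reasonably recent version). The file name is hypothetical, and which fields are actually present will vary from image to image.

    ```python
    from PIL import Image, ExifTags

    def read_exposure_exif(path: str) -> dict:
        """Return exposure-related EXIF fields from one image, where present."""
        exif = Image.open(path).getexif()
        # Camera settings live in the Exif sub-IFD (tag 0x8769); merge it with the base IFD.
        merged = dict(exif)
        merged.update(exif.get_ifd(0x8769))
        named = {ExifTags.TAGS.get(tag_id, tag_id): value for tag_id, value in merged.items()}
        wanted = ("FNumber", "ExposureTime", "ISOSpeedRatings", "FocalLength")
        return {key: named[key] for key in wanted if key in named}

    # Example with a hypothetical file name:
    # print(read_exposure_exif("sample_photo.jpg"))
    ```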

  18. Robotics Data Labeling Services Market Research Report 2033

    • dataintelo.com
    csv, pdf, pptx
    Updated Sep 30, 2025
    Cite
    Dataintelo (2025). Robotics Data Labeling Services Market Research Report 2033 [Dataset]. https://dataintelo.com/report/robotics-data-labeling-services-market
    Explore at:
    pptx, csv, pdfAvailable download formats
    Dataset updated
    Sep 30, 2025
    Dataset authored and provided by
    Dataintelo
    License

    https://dataintelo.com/privacy-and-policy

    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Robotics Data Labeling Services Market Outlook



    According to our latest research, the global robotics data labeling services market size reached USD 1.34 billion in 2024, reflecting robust expansion fueled by the rapid adoption of robotics across multiple industries. The market is set to grow at a CAGR of 21.7% from 2025 to 2033, reaching an estimated USD 9.29 billion by 2033. This impressive growth trajectory is primarily driven by increasing investments in artificial intelligence (AI), machine learning (ML), and automation technologies, which demand high-quality labeled data for effective robotics training and deployment. The proliferation of autonomous systems and the need for precise data annotation are the key contributors to this market’s upward momentum.
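
    As a rough sanity check on figures like these, compound annual growth can be reproduced with the standard CAGR formula. The sketch below only illustrates the arithmetic; small deviations from a report's quoted end value are expected because of rounding and differing base years (the quoted rate is measured over 2025-2033, while the base size is for 2024).

    ```python
    def project(value: float, cagr: float, years: int) -> float:
        """Project a value forward at a constant compound annual growth rate."""
        return value * (1.0 + cagr) ** years

    def implied_cagr(start: float, end: float, years: int) -> float:
        """Back out the constant annual growth rate implied by a start and an end value."""
        return (end / start) ** (1.0 / years) - 1.0

    # Figures quoted above: USD 1.34 billion in 2024 and USD 9.29 billion in 2033.
    print(f"Implied CAGR 2024-2033: {implied_cagr(1.34, 9.29, 9):.1%}")
    # Projecting the 2024 base at the quoted 21.7% rate gives a ballpark figure, not an
    # exact match, because of the base-year convention and rounding noted above.
    print(f"2033 projection at 21.7%: {project(1.34, 0.217, 9):.2f} billion")
    ```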




    One of the primary growth factors for the robotics data labeling services market is the accelerating adoption of AI-powered robotics in industrial and commercial domains. The increasing sophistication of robotics, especially in sectors like automotive manufacturing, logistics, and healthcare, requires vast amounts of accurately labeled data to train algorithms for object detection, navigation, and interaction. The emergence of Industry 4.0 and the transition toward smart factories have amplified the need for reliable data annotation services. Moreover, the growing complexity of robotic tasks necessitates not just basic labeling but advanced contextual annotation, further fueling demand. The rise in collaborative robots (cobots) in manufacturing environments also underlines the necessity for precise data labeling to ensure safety and efficiency.




    Another significant driver is the surge in autonomous vehicle development, which relies heavily on high-quality labeled data for perception, decision-making, and real-time response. Automotive giants and tech startups alike are investing heavily in robotics data labeling services to enhance the performance of their autonomous driving systems. The expansion of sensor technologies, including LiDAR, radar, and high-definition cameras, has led to an exponential increase in the volume and complexity of data that must be annotated. This trend is further supported by regulatory pressures to ensure the safety and reliability of autonomous systems, making robust data labeling a non-negotiable requirement for market players.




    Additionally, the healthcare sector is emerging as a prominent end-user of robotics data labeling services. The integration of robotics in surgical procedures, diagnostics, and patient care is driving demand for meticulously annotated datasets to train AI models in recognizing anatomical structures, pathological features, and procedural steps. The need for precision and accuracy in healthcare robotics is unparalleled, as errors can have significant consequences. As a result, healthcare organizations are increasingly outsourcing data labeling tasks to specialized service providers to leverage their expertise and ensure compliance with stringent regulatory standards. The expansion of telemedicine and remote diagnostics is also contributing to the growing need for reliable data annotation in healthcare robotics.




    From a regional perspective, North America currently dominates the robotics data labeling services market, accounting for the largest share in 2024, followed closely by Asia Pacific and Europe. The United States is at the forefront, driven by substantial investments in AI research, a strong presence of leading robotics companies, and a mature technology ecosystem. Meanwhile, Asia Pacific is experiencing the fastest growth, propelled by large-scale industrial automation initiatives in China, Japan, and South Korea. Europe remains a critical market, driven by advancements in automotive and healthcare robotics, as well as supportive government policies. The Middle East & Africa and Latin America are also witnessing gradual adoption, primarily in manufacturing and logistics sectors, albeit at a slower pace compared to other regions.



    Service Type Analysis



    The service type segment in the robotics data labeling services market encompasses image labeling, video labeling, sensor data labeling, text labeling, and others. Image labeling remains the cornerstone of data annotation for robotics, as computer vision is integral to most robotic applications. The demand for image labeling services has surged with the proliferation of robots that rely on visual perception for nav

  19. Image Dataset of Accessibility Barriers

    • zenodo.org
    zip
    Updated Mar 25, 2022
    Cite
    Jakob Stolberg; Jakob Stolberg (2022). Image Dataset of Accessibility Barriers [Dataset]. http://doi.org/10.5281/zenodo.6382090
    Explore at:
    zipAvailable download formats
    Dataset updated
    Mar 25, 2022
    Dataset provided by
    Zenodo http://zenodo.org/
    Authors
    Jakob Stolberg; Jakob Stolberg
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Data
    The dataset consists of 5,538 images of public spaces, annotated with steps, stairs, ramps, and grab bars for stairs and ramps. It contains 3,564 annotations of steps, 1,492 of stairs, 143 of ramps, and 922 of grab bars.

    Each step annotation is attributed with an estimate of the height of the step, as falling into one of three categories: less than 3cm, 3cm to 7cm or more than 7cm. Additionally it is attributed with a 'type', with the possibilities 'doorstep', 'curb' or 'other'.

    Stair annotations are attributed with the number of steps in the stair.

    Ramps are attributed with an estimate of their width, also falling into three categories: less than 50cm, 50cm to 100cm and more than 100cm.

    In order to preserve all additional attributes of the labels, the data is published in the CVAT XML format for images.
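
    Since the annotations are shipped in the CVAT XML format for images, they can be loaded with the Python standard library; the sketch below collects each bounding box together with its label and extra attributes. The element and attribute names follow the usual "CVAT for images" layout, and the exact label and attribute spellings in this dataset are assumptions to verify against the files.

    ```python
    import xml.etree.ElementTree as ET

    def load_cvat_boxes(xml_path: str):
        """Yield one dict per bounding box from a 'CVAT for images' annotation file."""
        root = ET.parse(xml_path).getroot()
        for image in root.iter("image"):
            for box in image.findall("box"):
                yield {
                    "image": image.get("name"),
                    "label": box.get("label"),  # e.g. step, stair, ramp, grab bar
                    "bbox": tuple(float(box.get(k)) for k in ("xtl", "ytl", "xbr", "ybr")),
                    # Extra attributes such as step height or ramp width are child elements.
                    "attributes": {a.get("name"): a.text for a in box.findall("attribute")},
                }

    # Example with a hypothetical file name:
    # boxes = list(load_cvat_boxes("annotations.xml"))
    # print(boxes[0])
    ```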

    Annotating Process
    The labelling has been done using bounding boxes around the objects. This format is compatible with many popular object detection models, e.g. the YOLO family of models. A bounding box is placed so that it contains exactly the visible part of the respective object. This implies that only objects that are visible in the photo are annotated. In particular, this means that a photo of a stair or step taken from above, where the object cannot be seen, has not been annotated, even when a human viewer could infer that there is a stair or a step from other features in the photo.

    Steps
    A step is annotated when there is a vertical increment that functions as a passage between two surface areas intended for human or vehicle traffic. This means that we have not included:

    • Increments that are too high to reasonably be considered a passage.
    • Increments that do not lead to a surface intended for human or vehicle traffic, e.g. a 'step' in front of a wall or a curb in front of a bush.

    In particular, the bounding box of a step object contains exactly the incremental part of the step, but does not extend into the top or bottom horizontal surface any more than necessary to entirely enclose the incremental part. This has been chosen for consistency reasons, as including parts of the horizontal surfaces would imply a non-trivial choice of how much to include, which we deemed would most likely lead to more inconsistent annotations.

    The heights of the steps are estimated by the annotators, and are therefore not guaranteed to be accurate.

    The type of a step typically falls into the category 'doorstep' or 'curb'. Steps that are in a doorway, entrance or likewise are attributed as doorsteps. We also include in this category steps that immediately lead to a doorway within a proximity of 1-2m. Steps between different types of pathways, e.g. between streets and sidewalks, are annotated as curbs. Any other type of step is annotated with 'other'. Many of the 'other' steps are, for example, steps to terraces.

    Stairs
    The stair label is used whenever two or more steps directly follow each other in a consistent pattern. All vertical increments are enclosed in the bounding box, as well as the intermediate surfaces of the steps. However, the top and bottom surfaces are not included more than necessary, for the same reason as for steps, as described in the previous section.

    The annotator counts the number of steps and attributes this count to the stair object label.

    Ramps
    Ramps have been annotated when a sloped passageway has been placed or built to connect two surface areas intended for human or vehicle traffic. This implies the same considerations as with steps. Likewise, only the sloped part of a ramp is annotated, not including the bottom or top surface area.

    For each ramp, the annotator makes an assessment of the width of the ramp in three categories: less than 50cm, 50cm to 100cm and more than 100cm. This parameter is visually hard to assess, and sometimes impossible due to the view of the ramp.

    Grab Bars
    Grab bars are annotated for hand rails and similar objects that are in direct connection to a stair or a ramp. While horizontal grab bars could also have been included, this was omitted due to the implied ambiguities with fences and similar objects. As the grab bar was originally intended as attribute information for stairs and ramps, we chose to keep this focus. The bounding box encloses the part of the grab bar that functions as a hand rail for the stair or ramp.

    Usage
    As is often the case when annotating data, much information depends on the subjective assessment of the annotator. As each data point in this dataset has been annotated by only one person, caution should be taken when the data is applied.

    Generally speaking, the mindset and usage guiding the annotations have been wheelchair accessibility. While we have strived to annotate at an object level, hopefully making the data more widely applicable, we state this explicitly as it may have swayed non-trivial annotation choices.

    The attribute data, such as step height or ramp width, are highly subjective estimations. We still provide this data to give a post-hoc method for adjusting which annotations to use. For example, for some purposes one may be interested in detecting only steps that are indeed more than 3cm high. The attribute data makes it possible to filter out the steps of less than 3cm, so a machine learning algorithm can be trained on a more appropriate dataset for that use case (see the filtering sketch below). We stress, however, that one cannot expect to train accurate machine learning algorithms to infer the attribute data, as this is not accurate data in the first place.
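
    Building on the CVAT parsing sketch shown earlier in this entry, that kind of post-hoc filtering is a short list comprehension. The attribute name 'height', the category string, and the file name are assumptions to check against the actual annotation files.

    ```python
    # Keep only step annotations whose estimated height is not in the lowest category.
    tall_steps = [
        box for box in load_cvat_boxes("annotations.xml")  # hypothetical file name
        if box["label"] == "step" and box["attributes"].get("height") != "less than 3cm"
    ]
    ```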

    We hope this dataset will be a useful building block in the endeavours for automating barrier detection and documentation.

  20. Data_Sheet_1_An Estimation of Online Video User Engagement From Features of...

    • frontiersin.figshare.com
    pdf
    Updated Jun 6, 2023
    Cite
    Lukas Stappen; Alice Baird; Michelle Lienhart; Annalena Bätz; Björn Schuller (2023). Data_Sheet_1_An Estimation of Online Video User Engagement From Features of Time- and Value-Continuous, Dimensional Emotions.pdf [Dataset]. http://doi.org/10.3389/fcomp.2022.773154.s001
    Explore at:
    pdfAvailable download formats
    Dataset updated
    Jun 6, 2023
    Dataset provided by
    Frontiers
    Authors
    Lukas Stappen; Alice Baird; Michelle Lienhart; Annalena Bätz; Björn Schuller
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Portraying emotion and trustworthiness is known to increase the appeal of video content. However, the causal relationship between these signals and online user engagement is not well understood. This limited understanding is partly due to a scarcity in emotionally annotated data and the varied modalities which express user engagement online. In this contribution, we utilize a large dataset of YouTube review videos which includes ca. 600 h of dimensional arousal, valence and trustworthiness annotations. We investigate features extracted from these signals against various user engagement indicators including views, like/dislike ratio, as well as the sentiment of comments. In doing so, we identify the positive and negative influences which single features have, as well as interpretable patterns in each dimension which relate to user engagement. Our results demonstrate that smaller boundary ranges and fluctuations for arousal lead to an increase in user engagement. Furthermore, the extracted time-series features reveal significant (p < 0.05) correlations for each dimension, such as, count below signal mean (arousal), number of peaks (valence), and absolute energy (trustworthiness). From this, an effective combination of features is outlined for approaches aiming to automatically predict several user engagement indicators. In a user engagement prediction paradigm we compare all features against semi-automatic (cross-task), and automatic (task-specific) feature selection methods. These selected feature sets appear to outperform the usage of all features, e.g., using all features achieves 1.55 likes per day (Lp/d) mean absolute error from valence; this improves through semi-automatic and automatic selection to 1.33 and 1.23 Lp/d, respectively (data mean 9.72 Lp/d with a std. 28.75 Lp/d).
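
    The time-series features named above (count below signal mean, number of peaks, absolute energy) are standard summary statistics and can be computed directly from an annotation signal. The sketch below is a minimal numpy/scipy version applied to a synthetic trace; it is not the authors' feature-extraction code, and a dedicated library such as tsfresh would offer many more features.

    ```python
    import numpy as np
    from scipy.signal import find_peaks

    def ts_features(signal: np.ndarray) -> dict:
        """A few of the time-series summaries mentioned above, for one annotation signal."""
        signal = np.asarray(signal, dtype=float)
        return {
            "count_below_mean": int(np.sum(signal < signal.mean())),
            "number_of_peaks": int(len(find_peaks(signal)[0])),
            "abs_energy": float(np.sum(signal ** 2)),
        }

    # Synthetic stand-in for a time-continuous arousal/valence/trustworthiness trace.
    rng = np.random.default_rng(0)
    trace = np.sin(np.linspace(0, 20, 500)) + 0.1 * rng.standard_normal(500)
    print(ts_features(trace))
    ```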
