100+ datasets found
1. PEARC20 submitted paper: "Scientific Data Annotation and Dissemination:...

    • hydroshare.org
    • beta.hydroshare.org
    zip
    Updated Jul 29, 2020
    Cite
    Sean Cleveland; Gwen Jacobs; Jennifer Geis (2020). PEARC20 submitted paper: "Scientific Data Annotation and Dissemination: Using the ‘Ike Wai Gateway to Manage Research Data" [Dataset]. http://doi.org/10.4211/hs.d66ef2686787403698bac5368a29b056
    Explore at:
zip (873 bytes)
    Dataset updated
    Jul 29, 2020
    Dataset provided by
    HydroShare
    Authors
    Sean Cleveland; Gwen Jacobs; Jennifer Geis
    License

Attribution-ShareAlike 4.0 (CC BY-SA 4.0) - https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Time period covered
    Jul 29, 2020
    Description

    Abstract: Granting agencies invest millions of dollars on the generation and analysis of data, making these products extremely valuable. However, without sufficient annotation of the methods used to collect and analyze the data, the ability to reproduce and reuse those products suffers. This lack of assurance of the quality and credibility of the data at the different stages in the research process essentially wastes much of the investment of time and funding and fails to drive research forward to the level of potential possible if everything was effectively annotated and disseminated to the wider research community. In order to address this issue for the Hawai’i Established Program to Stimulate Competitive Research (EPSCoR) project, a water science gateway was developed at the University of Hawai‘i (UH), called the ‘Ike Wai Gateway. In Hawaiian, ‘Ike means knowledge and Wai means water. The gateway supports research in hydrology and water management by providing tools to address questions of water sustainability in Hawai‘i. The gateway provides a framework for data acquisition, analysis, model integration, and display of data products. The gateway is intended to complement and integrate with the capabilities of the Consortium of Universities for the Advancement of Hydrologic Science’s (CUAHSI) Hydroshare by providing sound data and metadata management capabilities for multi-domain field observations, analytical lab actions, and modeling outputs. Functionality provided by the gateway is supported by a subset of the CUAHSI’s Observations Data Model (ODM) delivered as centralized web based user interfaces and APIs supporting multi-domain data management, computation, analysis, and visualization tools to support reproducible science, modeling, data discovery, and decision support for the Hawai’i EPSCoR ‘Ike Wai research team and wider Hawai‘i hydrology community. By leveraging the Tapis platform, UH has constructed a gateway that ties data and advanced computing resources together to support diverse research domains including microbiology, geochemistry, geophysics, economics, and humanities, coupled with computational and modeling workflows delivered in a user friendly web interface with workflows for effectively annotating the project data and products. Disseminating results for the ‘Ike Wai project through the ‘Ike Wai data gateway and Hydroshare makes the research products accessible and reusable.

2. MESINESP2 Corpora: Annotated data for medical semantic indexing in Spanish

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    Updated Oct 28, 2021
    Cite
    Gasco, Luis; Krallinger, Martin (2021). MESINESP2 Corpora: Annotated data for medical semantic indexing in Spanish [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4612274
    Explore at:
    Dataset updated
    Oct 28, 2021
    Dataset provided by
    Barcelona Supercomputing Center
    Authors
    Gasco, Luis; Krallinger, Martin
    License

Attribution 4.0 (CC BY 4.0) - https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

Annotated corpora for the MESINESP2 shared task (Spanish BioASQ track, see https://temu.bsc.es/mesinesp2). BioASQ 2021 will be held at CLEF 2021 (scheduled in Bucharest, Romania, in September): http://clef2021.clef-initiative.eu/

Introduction: These corpora contain the data for each of the subtracks of the MESINESP2 shared task:

    [Subtrack 1] MESINESP - Medical indexing:

Training set: It contains all Spanish records from the LILACS and IBECS databases at the Virtual Health Library (VHL) with a non-empty abstract written in Spanish. We have filtered out empty abstracts and non-Spanish abstracts. We built the training dataset with the data crawled on 01/29/2021. This means that the data is a snapshot of that moment and may change over time, since LILACS and IBECS usually add or modify indexes after the first inclusion in the database. We distribute two different datasets:

    Articles training set: This corpus contains the set of 237574 Spanish scientific papers in VHL that have at least one DeCS code assigned to them.

Full training set: This corpus contains the whole set of 249474 Spanish documents from VHL that have at least one DeCS code assigned to them.

Development set: We provide a development set manually indexed by expert annotators. This dataset includes 1065 articles annotated with DeCS by three expert indexers in this controlled vocabulary. The articles were initially indexed by 7 annotators; after analyzing the Inter-Annotator Agreement among their annotations, we decided to select the 3 best ones, considering their annotations the valid ones to build the test set. From those 1065 records:

213 articles were annotated by more than one annotator. We have selected the union of their annotations.

852 articles were annotated by only one of the three selected, better-performing annotators.

    Test set: To be published

    [Subtrack 2] MESINESP - Clinical trials:

Training set: The training dataset contains records from the Registro Español de Estudios Clínicos (REEC). REEC does not provide documents with the title/abstract structure needed in BioASQ; for that reason, we have built artificial abstracts based on the content available in the data crawled using the REEC API. Clinical trials are not indexed with DeCS terminology, so we have used as training data a set of 3560 clinical trials that were automatically annotated in the first edition of MESINESP and published as a Silver Standard outcome. Because the performance of the models used by the participants was variable, we have only selected predictions from runs with a MiF higher than 0.41, which corresponds to the submission of the best team.

    Development set: We provide a development set manually indexed by expert annotators. This dataset includes 147 clinical trials annotated with DeCS by seven expert indexers in this controlled vocabulary.

    Test set: To be published

    [Subtrack 3] MESINESP - Patents: To be published

    Files structure:

    Subtrack1-Scientific_Literature.zip contains the corpora generated for subtrack 1. Content:

    Subtrack1:

    Train

    training_set_track1_all.json: Full training set for subtrack 1.

    training_set_track1_only_articles.json: Articles training set for subtrack 1.

    Development

    development_set_subtrack1.json: Manually annotated development set for subtrack 1.

    Subtrack2-Clinical_Trials.zip contains the corpora generated for subtrack 2. Content:

    Subtrack2:

    Train

    training_set_subtrack2.json: Training set for subtrack 2.

    Development

    development_set_subtrack2.json: Manually annotated development set for subtrack 2.

    DeCS2020.tsv contains a DeCS table with the following structure:

    DeCS code

    Preferred descriptor (the preferred label in the Latin Spanish DeCS 2020 set)

List of synonyms (the descriptors and synonyms from the Latin Spanish DeCS 2020 set, separated by pipes).

    DeCS2020.obo contains the *.obo file with the hierarchical relationships between DeCS descriptors.

    *Note: The obo and tsv files with DeCS2020 descriptors contain some additional COVID19 descriptors that will be included in future versions of DeCS. These items were provided by the Pan American Health Organization (PAHO), which has kindly shared this content to improve the results of the task by taking these descriptors into account.
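For convenience, a minimal Python loading sketch for the files listed above follows. It assumes the archives have been extracted so that the listed paths exist in the working directory; the internal schema of the training JSON is not documented here, so the snippet only inspects it, and the DeCS TSV is assumed to have no header row.

import json
import pandas as pd

# Inspect the subtrack 1 training file (its internal schema is not described above)
with open("Subtrack1/Train/training_set_track1_all.json", encoding="utf-8") as f:
    training_data = json.load(f)
print(type(training_data))

# DeCS table: code, preferred descriptor, pipe-separated synonyms (header row assumed absent)
decs = pd.read_csv("DeCS2020.tsv", sep="\t", header=None,
                   names=["decs_code", "preferred_descriptor", "synonyms"])
decs["synonyms"] = decs["synonyms"].str.split("|")
print(decs.head())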

    For further information, please visit https://temu.bsc.es/mesinesp2/ or email us at encargo-pln-life@bsc.es

3. Data from: Metaphor annotations in Polish political debates from 2020 (TVP...

    • live.european-language-grid.eu
    binary format
    Updated Jun 30, 2021
    Cite
    (2021). Metaphor annotations in Polish political debates from 2020 (TVP 2019-10-01 and TVN 2019-10-08) – presidential election [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/8682
    Explore at:
binary format
    Dataset updated
    Jun 30, 2021
    License

Attribution-ShareAlike 3.0 (CC BY-SA 3.0) - https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

The data published here are supplementary material for a paper to be published in Metaphor and the Social World (under revision).

Two debates organised and published by TVP and TVN were transcribed and annotated with the Metaphor Identification Method. We used the eMargin software (a collaborative textual annotation tool; Kehoe and Gee 2013) and a slightly modified version of MIP (Pragglejaz 2007). Each lexical unit in the transcript was labelled as a metaphor related word (MRW) if its "contextual meaning was related to the more basic meaning by some form of similarity" (Steen 2007). The meanings were established with the Wielki Słownik Języka Polskiego (Great Dictionary of Polish, ed. Żmigrodzki 2019). In addition to MRW, lexemes which create a metaphorical expression together with an MRW were tagged as metaphor expression words (MEW). At least two words are needed to identify the actual metaphorical expression, since an MRW cannot appear without an MEW. The grammatical construction of the metaphor (Sullivan 2009) is asymmetrical: one word is conceptually autonomous and the other is conceptually dependent on the first. In construction grammar terms (Langacker 2008), the metaphor related word is elaborated by the metaphorical expression word, because the basic meaning of the MRW is elaborated and extended to a more figurative meaning only if it is used jointly with the MEW. Moreover, the meaning of the MEW is rather basic and concrete, as it remains unchanged in connection with the MRW. This can be clearly seen in an expression often used in our data: "Służba zdrowia jest w zapaści" ("Health service suffers from a collapse."), where the word "zapaść" ("collapse") is an example of an MRW and the words "służba zdrowia" ("health service") are labelled as MEW. The English translation of this expression needs a different verb: instead of "jest w zapaści" ("is in collapse"), the English unmarked collocation is "suffers from a collapse", therefore the words "suffers from a collapse" are labelled as MRW. The "collapse" could be caused by heart failure, such as cardiac arrest or any other life-threatening medical condition, and "health service" is portrayed as if it could literally suffer from such a condition – a collapse.

The data are in CSV tables exported from XML files downloaded from the eMargin site. Prior to annotation, the transcripts were divided into 40 parts, each assigned to one annotator. MRW words are marked as MLN, MEW words are marked as MLP, functional words within a metaphorical expression are marked as MLI, and all other words are marked as noana, which means no annotation needed.
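As a hedged illustration only: the exact CSV layout and column names of the eMargin export are not specified above, so the file name and the "label" column below are hypothetical placeholders.

import pandas as pd

# Hypothetical file name; inspect the columns of the actual eMargin export first
annotations = pd.read_csv("debate_part_01.csv")
print(annotations.columns.tolist())

# Assuming a column named "label" holds the MLN / MLP / MLI / noana tags:
if "label" in annotations.columns:
    print(annotations["label"].value_counts())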

4. Annotating speaker stance in discourse: the Brexit Blog Corpus (BBC)

    • demo.researchdata.se
    • researchdata.se
    Updated Jan 15, 2019
    Cite
    Andreas Kerren; Carita Paradis (2019). Annotating speaker stance in discourse: the Brexit Blog Corpus (BBC) [Dataset]. http://doi.org/10.5878/002925
    Explore at:
    Dataset updated
    Jan 15, 2019
    Dataset provided by
    Linnaeus University
    Authors
    Andreas Kerren; Carita Paradis
    Time period covered
    Jun 1, 2015 - May 31, 2016
    Description

    In this study, we explore to what extent language users agree about what kind of stances are expressed in natural language use or whether their interpretations diverge. In order to perform this task, a comprehensive cognitive-functional framework of ten stance categories was developed based on previous work on speaker stance in the literature. A corpus of opinionated texts, where speakers take stance and position themselves, was compiled, the Brexit Blog Corpus (BBC). An analytical interface for the annotations was set up and the data were annotated independently by two annotators. The annotation procedure, the annotation agreement and the co-occurrence of more than one stance category in the utterances are described and discussed. The careful, analytical annotation process has by and large returned satisfactory inter- and intra-annotation agreement scores, resulting in a gold standard corpus, the final version of the BBC.

    Purpose:

    The aim of this study is to explore the possibility of identifying speaker stance in discourse, provide an analytical resource for it and an evaluation of the level of agreement across speakers in the area of stance-taking in discourse.

    The BBC is a collection of texts from blog sources. The corpus texts are thematically related to the 2016 UK referendum concerning whether the UK should remain members of the European Union or not. The texts were extracted from the Internet from June to August 2015. With the Gavagai API (https://developer.gavagai.se), the texts were detected using seed words, such as Brexit, EU referendum, pro-Europe, europhiles, eurosceptics, United States of Europe, David Cameron, or Downing Street. The retrieved URLs were filtered so that only entries described as blogs in English were selected. Each downloaded document was split into sentential utterances, from which 2,200 utterances were randomly selected as the analysis data set. The final size of the corpus is 1,682 utterances, 35,492 words (169,762 characters without spaces). Each utterance contains from 3 to 40 words with a mean length of 21 words.

    For the data annotation process the Active Learning and Visual Analytics (ALVA) system (https://doi.org/10.1145/3132169 and https://doi.org/10.2312/eurp.20161139) was used. Two annotators, one who is a professional translator with a Licentiate degree in English Linguistics and the other one with a PhD in Computational Linguistics, carried out the annotations independently of one another.

    The data set can be downloaded in two different formats: a standard Microsoft Excel format and a raw data format (ZIP archive) which can be useful for analytical and machine learning purposes, for example, with the Python library scikit-learn. The Excel file includes one additional variable (utterance word length). The ZIP archive contains a set of directories (e.g., "contrariety" and "prediction") corresponding to the stance categories. Inside of each such directory, there are two nested directories corresponding to annotations which assign or not assign the respective category to utterances (e.g., inside the top-level category "prediction" there are two directories, "prediction" with utterances which were labeled with this category, and "no" with the rest of the utterances). Inside of the nested directories, there are textual files containing individual utterances.
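Since the nested directory layout described above matches what scikit-learn's load_files helper expects, a minimal loading sketch for one stance category could look like the following (the extraction path is an assumption):

from sklearn.datasets import load_files

# Assumed extraction path; point load_files at one top-level stance directory,
# e.g. "prediction", which contains the two nested class folders described above.
bundle = load_files("bbc_raw/prediction", encoding="utf-8")
texts, labels = bundle.data, bundle.target
print(bundle.target_names)          # e.g. ['no', 'prediction']
print(len(texts), "utterances loaded")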

    When using data from this study, the primary researcher wishes citation also to be made to the publication: Vasiliki Simaki, Carita Paradis, Maria Skeppstedt, Magnus Sahlgren, Kostiantyn Kucher, and Andreas Kerren. Annotating speaker stance in discourse: the Brexit Blog Corpus. In Corpus Linguistics and Linguistic Theory, 2017. De Gruyter, published electronically before print. https://doi.org/10.1515/cllt-2016-0060

  5. Data from: CT-EBM-SP - Corpus of Clinical Trials for Evidence-Based-Medicine...

    • data.europa.eu
    • datos.gob.es
    unknown
    Updated Feb 12, 2022
    Cite
    Zenodo (2022). CT-EBM-SP - Corpus of Clinical Trials for Evidence-Based-Medicine in Spanish [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-6059737?locale=da
    Explore at:
unknown (2576817)
    Dataset updated
    Feb 12, 2022
    Dataset authored and provided by
Zenodo (http://zenodo.org/)
    License

Attribution 4.0 (CC BY 4.0) - https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

A collection of 1200 texts (292 173 tokens) about clinical trial studies and clinical trial announcements in Spanish:

- 500 abstracts from journals published under a Creative Commons license, e.g. available in PubMed or the Scientific Electronic Library Online (SciELO).
- 700 clinical trial announcements published in the European Clinical Trials Register and Repositorio Español de Estudios Clínicos.

Texts were annotated with entities from the Unified Medical Language System semantic groups: anatomy (ANAT), pharmacological and chemical substances (CHEM), pathologies (DISO), and lab tests, diagnostic or therapeutic procedures (PROC). 46 699 entities were annotated (13.98% are nested entities). 10% of the corpus was doubly annotated, and inter-annotator agreement (IAA) achieved a mean F-measure of 85.65% (±4.79, strict match) and a mean F-measure of 93.94% (±3.31, relaxed match). The corpus is freely distributed for research and educational purposes under a Creative Commons Non-Commercial Attribution (CC-BY-NC-A) License.

6. ImageCLEF 2012 Image annotation and retrieval dataset (MIRFLICKR)

    • zenodo.org
    txt, zip
    Updated May 22, 2020
    Cite
    Bart Thomee; Adrian Popescu; Bart Thomee; Adrian Popescu (2020). ImageCLEF 2012 Image annotation and retrieval dataset (MIRFLICKR) [Dataset]. http://doi.org/10.5281/zenodo.1246796
    Explore at:
zip, txt
    Dataset updated
    May 22, 2020
    Dataset provided by
    Zenodo
    Authors
    Bart Thomee; Adrian Popescu; Bart Thomee; Adrian Popescu
    License

Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0) - https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    DESCRIPTION
    For this task, we use a subset of the MIRFLICKR (http://mirflickr.liacs.nl) collection. The entire collection contains 1 million images from the social photo sharing website Flickr and was formed by downloading up to a thousand photos per day that were deemed to be the most interesting according to Flickr. All photos in this collection were released by their users under a Creative Commons license, allowing them to be freely used for research purposes. Of the entire collection, 25 thousand images were manually annotated with a limited number of concepts and many of these annotations have been further refined and expanded over the lifetime of the ImageCLEF photo annotation task. This year we used crowd sourcing to annotate all of these 25 thousand images with the concepts.

    On this page we provide you with more information about the textual features, visual features and concept features we supply with each image in the collection we use for this year's task.


    TEXTUAL FEATURES
    All images are accompanied by the following textual features:

    - Flickr user tags
These are the tags that the users assigned to the photos they uploaded to Flickr. The 'raw' tags are the original tags, while the 'clean' tags are those collapsed to lowercase and condensed to remove spaces.

    - EXIF metadata
    If available, the EXIF metadata contains information about the camera that took the photo and the parameters used. The 'raw' exif is the original camera data, while the 'clean' exif reduces the verbosity.

    - User information and Creative Commons license information
    This contains information about the user that took the photo and the license associated with it.


    VISUAL FEATURES
Over the previous years of the photo annotation task we noticed that the participants often use the same types of visual features; in particular, features based on interest points and bag-of-words are popular. To assist you, we have extracted several features that you may want to use, so you can focus on concept detection instead. We additionally give you some pointers to easy-to-use toolkits that will help you extract other features, or the same features with different default settings.

    - SIFT, C-SIFT, RGB-SIFT, OPPONENT-SIFT
    We used the ISIS Color Descriptors (http://www.colordescriptors.com) toolkit to extract these descriptors. This package provides you with many different types of features based on interest points, mostly using SIFT. It furthermore assists you with building codebooks for bag-of-words. The toolkit is available for Windows, Linux and Mac OS X.

    - SURF
    We used the OpenSURF (http://www.chrisevansdev.com/computer-vision-opensurf.html) toolkit to extract this descriptor. The open source code is available in C++, C#, Java and many more languages.

    - TOP-SURF
    We used the TOP-SURF (http://press.liacs.nl/researchdownloads/topsurf) toolkit to extract this descriptor, which represents images with SURF-based bag-of-words. The website provides codebooks of several different sizes that were created using a combination of images from the MIR-FLICKR collection and from the internet. The toolkit also offers the ability to create custom codebooks from your own image collection. The code is open source, written in C++ and available for Windows, Linux and Mac OS X.

    - GIST
    We used the LabelMe (http://labelme.csail.mit.edu) toolkit to extract this descriptor. The MATLAB-based library offers a comprehensive set of tools for annotating images.

    For the interest point-based features above we used a Fast Hessian-based technique to detect the interest points in each image. This detector is built into the OpenSURF library. In comparison with the Hessian-Laplace technique built into the ColorDescriptors toolkit it detects fewer points, resulting in a considerably reduced memory footprint. We therefore also provide you with the interest point locations in each image that the Fast Hessian-based technique detected, so when you would like to recalculate some features you can use them as a starting point for the extraction. The ColorDescriptors toolkit for instance accepts these locations as a separate parameter. Please go to http://www.imageclef.org/2012/photo-flickr/descriptors for more information on the file format of the visual features and how you can extract them yourself if you want to change the default settings.


    CONCEPT FEATURES
    We have solicited the help of workers on the Amazon Mechanical Turk platform to perform the concept annotation for us. To ensure a high standard of annotation we used the CrowdFlower platform that acts as a quality control layer by removing the judgments of workers that fail to annotate properly. We reused several concepts of last year's task and for most of these we annotated the remaining photos of the MIRFLICKR-25K collection that had not yet been used before in the previous task; for some concepts we reannotated all 25,000 images to boost their quality. For the new concepts we naturally had to annotate all of the images.

    - Concepts
    For each concept we indicate in which images it is present. The 'raw' concepts contain the judgments of all annotators for each image, where a '1' means an annotator indicated the concept was present whereas a '0' means the concept was not present, while the 'clean' concepts only contain the images for which the majority of annotators indicated the concept was present. Some images in the raw data for which we reused last year's annotations only have one judgment for a concept, whereas the other images have between three and five judgments; the single judgment does not mean only one annotator looked at it, as it is the result of a majority vote amongst last year's annotators.

    - Annotations
    For each image we indicate which concepts are present, so this is the reverse version of the data above. The 'raw' annotations contain the average agreement of the annotators on the presence of each concept, while the 'clean' annotations only include those for which there was a majority agreement amongst the annotators.

    You will notice that the annotations are not perfect. Especially when the concepts are more subjective or abstract, the annotators tend to disagree more with each other. The raw versions of the concept annotations should help you get an understanding of the exact judgments given by the annotators.
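As an illustration of the 'raw' versus 'clean' distinction, a clean judgment is essentially a majority vote over the raw per-annotator judgments; the dictionary layout below is hypothetical and not the actual file format.

# Hypothetical raw judgments: image id -> one 0/1 vote per annotator
raw_judgments = {
    "im1001": [1, 1, 0, 1, 0],   # five annotators
    "im1002": [0, 0, 1],         # three annotators
}
# 'Clean' label: the concept counts as present only with a majority of votes
clean = {img: int(sum(votes) > len(votes) / 2) for img, votes in raw_judgments.items()}
print(clean)   # {'im1001': 1, 'im1002': 0}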

7. 330K+ Interior Design Images | AI Training Data | Annotated imagery data for...

    • datarade.ai
    Cite
    Data Seeds, 330K+ Interior Design Images | AI Training Data | Annotated imagery data for AI | Object & Scene Detection | Global Coverage [Dataset]. https://datarade.ai/data-products/200k-interior-design-images-ai-training-data-annotated-i-data-seeds
    Explore at:
.bin, .json, .xml, .csv, .xls, .sql, .txt
    Dataset authored and provided by
    Data Seeds
    Area covered
    Indonesia, Jamaica, Egypt, Curaçao, Turks and Caicos Islands, Congo, Tajikistan, Kuwait, Nicaragua, Ethiopia
    Description

    This dataset features over 330,000 high-quality interior design images sourced from photographers worldwide. Designed to support AI and machine learning applications, it provides a richly varied and extensively annotated collection of indoor environment visuals.

Key Features:

1. Comprehensive Metadata: the dataset includes full EXIF data, detailing camera settings such as aperture, ISO, shutter speed, and focal length. Each image is pre-annotated with object and scene detection metadata, making it ideal for tasks such as room classification, furniture detection, and spatial layout analysis. Popularity metrics, derived from engagement on our proprietary platform, are also included.

2. Unique Sourcing Capabilities: the images are collected through a proprietary gamified platform for photographers. Competitions centered on interior design themes ensure a steady stream of fresh, high-quality submissions. Custom datasets can be sourced on-demand within 72 hours to fulfill specific requests, such as particular room types, design styles, or furnishings.

3. Global Diversity: photographs have been sourced from contributors in over 100 countries, covering a wide spectrum of architectural styles, cultural aesthetics, and functional spaces. The images include homes, offices, restaurants, studios, and public interiors—ranging from minimalist and modern to classic and eclectic designs.

4. High-Quality Imagery: the dataset includes standard to ultra-high-definition images that capture fine interior details. Both professionally staged and candid real-life spaces are included, offering versatility for training AI across design evaluation, object detection, and environmental understanding.

5. Popularity Scores: each image is assigned a popularity score based on its performance in GuruShots competitions. This provides valuable insights into global aesthetic trends, helping AI models learn user preferences, design appeal, and stylistic relevance.

6. AI-Ready Design: the dataset is optimized for machine learning tasks such as interior scene recognition, style transfer, virtual staging, and layout generation. It integrates smoothly with popular AI development environments and tools.

7. Licensing & Compliance: the dataset fully complies with data privacy regulations and includes transparent licensing suitable for commercial and academic use.

Use Cases:

1. Training AI for interior design recommendation engines and virtual staging tools.
2. Enhancing smart home applications and spatial recognition systems.
3. Powering AR/VR platforms for virtual tours, furniture placement, and room redesign.
4. Supporting architectural visualization, decor style transfer, and real estate marketing.

    This dataset offers a comprehensive, high-quality resource tailored for AI-driven innovation in design, real estate, and spatial computing. Customizations are available upon request. Contact us to learn more!

  8. Annotations and associated frequency signals

    • figshare.com
    bin
    Updated Nov 1, 2023
    Cite
    Karl Löwenmark (2023). Annotations and associated frequency signals [Dataset]. http://doi.org/10.6084/m9.figshare.24470620.v1
    Explore at:
bin
    Dataset updated
    Nov 1, 2023
    Dataset provided by
Figshare (http://figshare.com/)
    Authors
    Karl Löwenmark
    License

Attribution 4.0 (CC BY 4.0) - https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

Labelled industry datasets are the most valuable asset in prognostics and health management (PHM) research. However, creating labelled industry datasets is both difficult and expensive, making publicly available industry datasets rare at best. While labels are generally unavailable, many industry datasets contain annotations, maintenance work orders, or logbooks, with free-form text containing technical language descriptions of component properties, valuable information for any PHM model. Alas, publicly available annotated industry datasets are also scarce, in particular ones with associated signals available. Therefore, we release data from an annotated process industry dataset, consisting of 21090 pairs of signals and annotations from one year of kraftliner production.

The annotations are written, in Swedish, by on-site Swedish experts, and the signals consist of accelerometer vibration measurements from two large (80x10x10m) paper machines. The data is cleaned and structured so that each annotation is associated with ten days of signal measurements leading up to the annotation date, where one signal measurement consists of 8192 samples over 6.4 seconds, which becomes 3200 samples stretching over 500 Hz in the frequency domain. The associated annotations are attached to each signal sample, so that the list of annotations is as long as the list of signals. In total, there are 43 unique annotations, though most are associated with multiple signals from different machines due to commonalities in fault descriptions. The language data is pre-processed so that all letters are lower case, numbers are removed, and names are replaced with the Swedish word "egennamn", meaning "name of a person" in English.

Also included are pre-computed embeddings, which facilitate faster and easier testing for researchers wanting to investigate training signal encoders supervised through technical language supervision.

The data presented here was used in the article "Technical Language Supervision for Intelligent Fault Diagnosis in Process Industry" (https://papers.phmsociety.org/index.php/ijphm/article/view/3137). Please cite this article if you use this dataset.

To use this dataset without understanding Swedish, please consult "Processing of Condition Monitoring Annotations with BERT and Technical Language Substitution: A Case Study" (https://www.papers.phmsociety.org/index.php/phme/article/view/3356) on how to augment the technical data to facilitate easier language model translations to other languages, and don't hesitate to contact me if you have questions regarding the data.

Accessing the data is simple; all you need to do to load spectra and annotation pairs is:

import pandas as pd

# Load the pickled DataFrame of spectrum/annotation pairs
spectra_note_df = pd.read_pickle("TL_spectra_note_df_big.pkl")
all_spectra = spectra_note_df['Spectra']
all_annotations = spectra_note_df['Notes']

Pre-computed embeddings can be accessed through:

all_embeddings = spectra_note_df['Embeddings']

9. Data from: Quetzal: Comprehensive Peptide Fragmentation Annotation and...

    • acs.figshare.com
    xlsx
    Updated Mar 20, 2025
    Cite
    Eric W. Deutsch; Luis Mendoza; Robert L. Moritz (2025). Quetzal: Comprehensive Peptide Fragmentation Annotation and Visualization [Dataset]. http://doi.org/10.1021/acs.jproteome.5c00092.s002
    Explore at:
xlsx
    Dataset updated
    Mar 20, 2025
    Dataset provided by
    ACS Publications
    Authors
    Eric W. Deutsch; Luis Mendoza; Robert L. Moritz
    License

Attribution-NonCommercial 4.0 (CC BY-NC 4.0) - https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Proteomics data-dependent acquisition data sets collected with high-resolution mass-spectrometry (MS) can achieve very high-quality results, but nearly every analysis yields results that are thresholded at some accepted false discovery rate, meaning that a substantial number of results are incorrect. For study conclusions that rely on a small number of peptide-spectrum matches being correct, it is thus important to examine at least some crucial spectra to ensure that they are not one of the incorrect identifications. We present Quetzal, a peptide fragment ion spectrum annotation tool to assist researchers in annotating and examining such spectra to ensure that they correctly support study conclusions. We describe how Quetzal annotates spectra using the new Human Proteome Organization (HUPO) Proteomics Standards Initiative (PSI) mzPAF standard for fragment ion peak annotation, including the Python-based code, a web-service end point that provides annotation services, and a web-based application for annotating spectra and producing publication-quality figures. We illustrate its functionality with several annotated spectra of varying complexity. Quetzal provides easily accessible functionality that can assist in the effort to ensure and demonstrate that crucial spectra support study conclusions. Quetzal is publicly available at https://proteomecentral.proteomexchange.org/quetzal/.

10. Expert annotations for the Catalan Common Voice (v13)

    • data.niaid.nih.gov
    Updated May 2, 2024
    Cite
    Language Technologies Unit (2024). Expert annotations for the Catalan Common Voice (v13) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_11104387
    Explore at:
    Dataset updated
    May 2, 2024
    Dataset provided by
Barcelona Supercomputing Center (https://www.bsc.es/)
    Authors
    Language Technologies Unit
    License

Attribution 4.0 (CC BY 4.0) - https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Description

    Dataset Summary

    These are the annotations made by a team of experts on the speakers with more than 1200 seconds recorded in the Catalan set of the Common Voice dataset (v13).

    The annotators were initially tasked with evaluating all recordings associated with the same individual. Following that, they were instructed to annotate the speaker's accent, gender, and the overall quality of the recordings.

    The accents and genders taken into account are the ones used until version 8 of the Common Voice corpus.

    See annotations for more details.

    Supported Tasks and Leaderboards

    Gender classification, Accent classification.

    Languages

    The dataset is in Catalan (ca).

    Dataset Structure

    Instances

    Two xlsx documents are published, one for each round of annotations.

    The following information is available in each of the documents:

{
  'speaker ID': '1b7fc0c4e437188bdf1b03ed21d45b780b525fd0dc3900b9759d0755e34bc25e31d64e69c5bd547ed0eda67d104fc0d658b8ec78277810830167c53ef8ced24b',
  'idx': '31',
  'same speaker': {'AN1': 'SI', 'AN2': 'SI', 'AN3': 'SI', 'agreed': 'SI', 'percentage': '100'},
  'gender': {'AN1': 'H', 'AN2': 'H', 'AN3': 'H', 'agreed': 'H', 'percentage': '100'},
  'accent': {'AN1': 'Central', 'AN2': 'Central', 'AN3': 'Central', 'agreed': 'Central', 'percentage': '100'},
  'audio quality': {'AN1': '4.0', 'AN2': '3.0', 'AN3': '3.0', 'agreed': '3.0', 'percentage': '66', 'mean quality': '3.33', 'stdev quality': '0.58'},
  'comments': {'AN1': '', 'AN2': 'pujades i baixades de volum', 'AN3': "Deu ser d'alguna zona de transició amb el central, perquè no fa una reducció total vocàlica, però hi té molta tendència"},
}

We also publish the document Guia anotació parlants.pdf, with the guidelines the annotators received.

    Data Fields

speaker ID (string): An ID identifying which client (voice) made the recording in the Common Voice corpus

    idx (int): Id in this corpus

    AN1 (string): Annotations from Annotator 1

    AN2 (string): Annotations from Annotator 2

    AN3 (string): Annotations from Annotator 3

    agreed (string): Annotation from the majority of the annotators

    percentage (int): Percentage of annotators that agree with the agreed annotation

    mean quality (float): Mean of the quality annotation

    stdev quality (float): Standard deviation of the mean quality
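A minimal sketch for loading one of the annotation rounds with pandas; the file name is a placeholder, since the exact names of the two published xlsx documents are not stated above, and an xlsx engine such as openpyxl is assumed to be installed.

import pandas as pd

# Placeholder file name for one annotation round
round1 = pd.read_excel("expert_annotations_round1.xlsx")
print(round1.shape)
print(round1.columns.tolist())   # inspect how the fields above map onto spreadsheet columns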

    Data Splits

    The corpus remains undivided into splits, as its purpose does not involve training models.

    Dataset Creation

    Curation Rationale

    During 2022, a campaign was launched to promote the Common Voice corpus within the Catalan-speaking community, achieving remarkable success. However, not all participants provided their demographic details such as age, gender, and accent. Additionally, some individuals faced difficulty in self-defining their accent using the standard classifications established by specialists.

In order to obtain a balanced corpus with reliable information, we have seen the necessity of enlisting a group of experts from the University of Barcelona to provide accurate annotations.

    We release the complete annotations because transparency is fundamental to our project. Furthermore, we believe they hold philological value for studying dialectal and gender variants.

    Source Data

    The original data comes from the Catalan sentences of the Common Voice corpus.

    Initial Data Collection and Normalization

    We have selected speakers who have recorded more than 1200 seconds of speech in the Catalan set of the version 13 of the Common Voice corpus.

    Who are the source language producers?

    The original data comes from the Catalan sentences of the Common Voice corpus.

    Annotations

    Annotation process

    Starting with version 13 of the Common Voice corpus we identified the speakers (273) who have recorded more than 1200 seconds of speech.

    A team of three annotators was tasked with annotating:

    if all the recordings correspond to the same person

    the gender of the speaker

    the accent of the speaker

    the quality of the recording

    They conducted an initial round of annotation, discussed their varying opinions, and subsequently conducted a second round.

    We release the complete annotations because transparency is fundamental to our project. Furthermore, we believe they hold philological value for studying dialectal and gender variants.

    Who are the annotators?

    The annotation was entrusted to the CLiC (Centre de Llenguatge i Computació) team from the University of Barcelona. They selected a group of three annotators (two men and one woman), who received a scholarship to do this work.

    The annotation team was composed of:

    Annotator 1: 1 female annotator, aged 18-25, L1 Catalan, student in the Modern Languages and Literatures degree, with a focus on Catalan.

    Annotators 2 & 3: 2 male annotators, aged 18-25, L1 Catalan, students in the Catalan Philology degree.

    1 female supervisor, aged 40-50, L1 Catalan, graduate in Physics and in Linguistics, Ph.D. in Signal Theory and Communications.

To do the annotation, they used a Google Drive spreadsheet.

    Personal and Sensitive Information

The Common Voice dataset consists of people who have donated their voices online. We do not share their voices here, only their gender and accent. You agree to not attempt to determine the identity of speakers in the Common Voice dataset.

    Considerations for Using the Data

    Social Impact of Dataset

The IDs come from the Common Voice dataset, which consists of people who have donated their voices online.

    You agree to not attempt to determine the identity of speakers in the Common Voice dataset.

    The information from this corpus will allow us to train and evaluate well balanced Catalan ASR models. Furthermore, we believe they hold philological value for studying dialectal and gender variants.

    Discussion of Biases

Most of the voices in the Catalan Common Voice correspond to men with a central accent, between 40 and 60 years old. The aim of this dataset is to provide information that makes it possible to minimize the biases this could cause.

    For the gender annotation, we have only considered "H" (male) and "D" (female).

    Other Known Limitations

    [N/A]

    Additional Information

    Dataset Curators

    Language Technologies Unit at the Barcelona Supercomputing Center (langtech@bsc.es)

    This work has been promoted and financed by the Generalitat de Catalunya through the Aina project.

    Licensing Information

    This dataset is licensed under a CC BY 4.0 license.

    It can be used for any purpose, whether academic or commercial, under the terms of the license. Give appropriate credit, provide a link to the license, and indicate if changes were made.

    Citation Information

    DOI

    Contributions

    The annotation was entrusted to the STeL team from the University of Barcelona.

  11. Data from: FluoroMatch 2.0-making automated and comprehensive non-targeted...

    • catalog.data.gov
    • s.cnmilf.com
    Updated Feb 10, 2023
    Cite
    U.S. EPA Office of Research and Development (ORD) (2023). FluoroMatch 2.0-making automated and comprehensive non-targeted PFAS annotation a reality [Dataset]. https://catalog.data.gov/dataset/fluoromatch-2-0-making-automated-and-comprehensive-non-targeted-pfas-annotation-a-reality
    Explore at:
    Dataset updated
    Feb 10, 2023
    Dataset provided by
United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    Data for "Koelmel JP, Stelben P, McDonough CA, Dukes DA, Aristizabal-Henao JJ, Nason SL, Li Y, Sternberg S, Lin E, Beckmann M, Williams AJ, Draper J, Finch JP, Munk JK, Deigl C, Rennie EE, Bowden JA, Godri Pollitt KJ. FluoroMatch 2.0-making automated and comprehensive non-targeted PFAS annotation a reality. Anal Bioanal Chem. 2022 Jan;414(3):1201-1215. doi: 10.1007/s00216-021-03392-7. Epub 2021 May 20. PMID: 34014358.". Portions of this dataset are inaccessible because: The link provided by UCSD doesn't seem to be working. They can be accessed through the following means: Contact Jeremy Koelmel at Yale University, jeremykoelmel@innovativeomics.com. Format: The final annotated excel sheets with feature intensities, annotations, homologous series groupings, etc., are available as a supplemental excel file with the online version of this manuscript. The raw Agilent “.d” files can be downloaded at: ftp://massive.ucsd.edu/MSV000086811/updates/2021-02-05_jeremykoelmel_e5b21166/raw/McDonough_AFFF_3M_ddMS2_Neg.zip (Note use Google Chrome or Firefox, Microsoft Edge and certain other browsers are unable to download from an ftp link). This dataset is associated with the following publication: Koelmel, J.P., P. Stelben, C.A. McDonough, D.A. Dukes, J.J. Aristizabal-Henao, S.L. Nason, Y. Li, S. Sternberg, E. Lin, M. Beckmann, A. Williams, J. Draper, J. Finch, J.K. Munk, C. Deigl, E. Rennie, J.A. Bowden, and K.J. Godri Pollitt. FluoroMatch 2.0—making automated and comprehensive non-targeted PFAS annotation a reality. Analytical and Bioanalytical Chemistry. Springer, New York, NY, USA, 414(3): 1201-1215, (2022).

12. Data from: Slovenian Word in Context dataset SloWiC 1.0

    • clarin.si
    • live.european-language-grid.eu
    Updated Mar 23, 2023
    Cite
    Timotej Knez; Slavko Žitnik (2023). Slovenian Word in Context dataset SloWiC 1.0 [Dataset]. https://clarin.si/repository/xmlui/handle/11356/1781?locale-attribute=en
    Explore at:
    Dataset updated
    Mar 23, 2023
    Authors
    Timotej Knez; Slavko Žitnik
    License

Attribution-ShareAlike 4.0 (CC BY-SA 4.0) - https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

The SloWiC dataset is a Slovenian dataset for the Word in Context task. Each example in the dataset contains a target word with multiple meanings and two sentences that both contain the target word. Each example is also annotated with a label that shows whether both sentences use the same meaning of the target word. The dataset contains 1808 manually annotated sentence pairs and an additional 13150 automatically annotated pairs to help with training larger models. The dataset is stored in the JSON format following the format used in the SuperGLUE version of the Word in Context task (https://super.gluebenchmark.com/).

Each example contains the following data fields:

- word: The target word with multiple meanings
- sentence1: The first sentence containing the target word
- sentence2: The second sentence containing the target word
- idx: The index of the example in the dataset
- label: Label showing if the sentences contain the same meaning of the target word
- start1: Start of the target word in the first sentence
- start2: Start of the target word in the second sentence
- end1: End of the target word in the first sentence
- end2: End of the target word in the second sentence
- version: The version of the annotation
- manual_annotation: Boolean showing if the label was manually annotated
- group: The group of annotators that labelled the example
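A hedged loading sketch based on the field list above: whether the file is a single JSON array or one JSON object per line (as in the SuperGLUE WiC release) is not stated, so the snippet assumes the latter, and the file name is a placeholder.

import json

examples = []
with open("SloWiC.json", encoding="utf-8") as f:   # placeholder file name
    for line in f:
        if line.strip():
            examples.append(json.loads(line))

ex = examples[0]
# start/end offsets delimit the target word in each sentence
print(ex["word"], ex["label"], ex["manual_annotation"])
print(ex["sentence1"][ex["start1"]:ex["end1"]])
print(ex["sentence2"][ex["start2"]:ex["end2"]])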

  13. GMB Data

    • kaggle.com
    zip
    Updated Jul 31, 2025
    Cite
    Ghassen Khaled (2025). GMB Data [Dataset]. https://www.kaggle.com/datasets/ghassenkhaled/gmb-data
    Explore at:
zip (3265952 bytes)
    Dataset updated
    Jul 31, 2025
    Authors
    Ghassen Khaled
    License

CC0 1.0 Universal (Public Domain Dedication) - https://creativecommons.org/publicdomain/zero/1.0/

    Description

    For this notebook, we're going to use the GMB (Groningen Meaning Bank) corpus for named entity recognition. GMB is a fairly large corpus with a lot of annotations. The data is labeled using the IOB format (short for inside, outside, beginning), which means each annotation also needs a prefix of I, O, or B.

    The following classes appear in the dataset:

LOC - Geographical Entity
ORG - Organization
PER - Person
GPE - Geopolitical Entity
TIME - Time indicator
ART - Artifact
EVE - Event
NAT - Natural Phenomenon

Note: GMB is not completely human annotated, and it is not considered 100% correct. For this exercise, classes ART, EVE, and NAT were combined into a MISC class due to the small number of examples for these classes.
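For illustration only, here is a hypothetical sentence tagged in this IOB scheme (the sentence and tags are invented, not drawn from the corpus):

# B- marks the first token of an entity, I- marks a continuation token,
# and O marks tokens outside any entity.
tokens = ["George", "Washington", "visited", "Paris", "in", "1789", "."]
tags   = ["B-PER",  "I-PER",      "O",       "B-LOC", "O",  "B-TIME", "O"]
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")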

  14. HelpSteer: AI Alignment Dataset

    • kaggle.com
    zip
    Updated Nov 22, 2023
    Cite
    The Devastator (2023). HelpSteer: AI Alignment Dataset [Dataset]. https://www.kaggle.com/datasets/thedevastator/helpsteer-ai-alignment-dataset
    Explore at:
zip (16614333 bytes)
    Dataset updated
    Nov 22, 2023
    Authors
    The Devastator
    License

CC0 1.0 Universal (Public Domain Dedication) - https://creativecommons.org/publicdomain/zero/1.0/

    Description

    HelpSteer: AI Alignment Dataset

    Real-World Helpfulness Annotated for AI Alignment

    By Huggingface Hub [source]

    About this dataset

HelpSteer is an open-source dataset designed to empower AI alignment through fair, team-oriented annotation. The dataset provides 37,120 samples, each containing a prompt and response along with five human-annotated attributes ranging between 0 and 4, with higher values indicating better quality. Using cutting-edge methods in machine learning and natural language processing in combination with annotation by data experts, HelpSteer strives to create a set of standardized values that can be used to measure alignment between human and machine interactions. With responses rated for correctness, coherence, complexity, helpfulness and verbosity, HelpSteer sets out to assist organizations in fostering reliable AI models which produce more accurate results, leading towards an improved user experience at all levels.


    How to use the dataset

    How to Use HelpSteer: An Open-Source AI Alignment Dataset

    HelpSteer is an open-source dataset designed to help researchers create models with AI Alignment. The dataset consists of 37,120 different samples each containing a prompt, a response and five human-annotated attributes used to measure these responses. This guide will give you a step-by-step introduction on how to leverage HelpSteer for your own projects.

    Step 1 - Choosing the Data File

HelpSteer contains two data files – one for training and one for validation. To start exploring the dataset, first select the file you would like to use by downloading both train.csv and validation.csv from the Kaggle page linked above or getting them from the Google Drive repository attached here: [link]. All the samples in each file consist of 7 columns with information about a single response: prompt (given), response (submitted), helpfulness, correctness, coherence, complexity and verbosity; all with values between 0 and 4, where higher means better in the respective category.

Step 2 - Exploratory Data Analysis (EDA)

Once you have your file loaded into your workspace or favorite software environment (e.g., libraries like Pandas/NumPy, or even Microsoft Excel), it's time to explore it further by running some basic EDA commands that summarize each feature's distribution and note potential trends or points of interest, e.g., which traits polarize the responses most, and whether there are outliers that might signal something interesting. Plotting these results often provides useful insight into patterns across the dataset, which can be reused later during the modeling phase, also known as "feature engineering".

Step 3 - Data Preprocessing

Your interpretation of the raw data during EDA should produce some hypotheses about which features matter most for accurately estimating the attribute scores of unseen responses. Preprocessing, such as cleaning up missing entries or handling outliers, is therefore highly recommended before starting any modelling effort with this data set. If you are unsure about the allowed value ranges of specific attributes, refer back to the Kaggle page description section for extra confidence during this step; having the numerical ranges right makes the modelling workload lighter later on when building predictive models. It is important not to rush this stage, otherwise poor results may follow when aiming for high accuracy at model deployment.
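A minimal EDA sketch in Python, assuming train.csv has been downloaded into the working directory and contains the seven columns listed in Step 1:

import pandas as pd

train = pd.read_csv("train.csv")
attributes = ["helpfulness", "correctness", "coherence", "complexity", "verbosity"]

print(train.shape)
print(train[attributes].describe())   # distributions of the 0-4 attribute scores
print(train[attributes].corr())       # how the five annotated attributes co-vary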

    Research Ideas

    • Designating and measuring conversational AI engagement goals: Researchers can utilize the HelpSteer dataset to design evaluation metrics for AI engagement systems.
    • Identifying conversational trends: By analyzing the annotations and data in HelpSteer, organizations can gain insights into what makes conversations more helpful, cohesive, complex or consistent across datasets or audiences.
    • Training Virtual Assistants: Train artificial intelligence algorithms on this dataset to develop virtual assistants that respond effectively to customer queries with helpful answers

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    **License: [CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication](https://creativecommons.org/pu...

  15. Social Media Corpus: Stigma Identification in Vaccination Discourse

    • figshare.com
    txt
    Updated May 21, 2025
    Cite
    Straton (2025). Social Media Corpus: Stigma Identification in Vaccination Discourse [Dataset]. http://doi.org/10.6084/m9.figshare.23277392.v4
    Explore at:
txt
    Dataset updated
    May 21, 2025
    Dataset provided by
Figshare (http://figshare.com/)
    Authors
    Straton
    License

Attribution 4.0 (CC BY 4.0) - https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

The current research introduces an annotated gold standard dataset based on 2,663 comments from Meta (Facebook). The dataset is manually labelled for stigma, not stigma, and ambiguous sentiment. Each comment is labelled three times (four times in case of dissensus) by independent expert annotators. The overall observed share of agreement reached 68% and the Fleiss Kappa agreement rate achieved 0.62 on the annotation task with three labels ("stigma", "not stigma", and "ambiguous"). The annotation share of agreement between two labels ("stigma", "not stigma") is 89% and Fleiss Kappa is 0.84. The labels are subsequently propagated from the annotated Facebook (Meta) dataset to a dataset discussing COVID vaccines with 40,084 comments from Twitter, Reddit, and YouTube corpora. In addition, the corpora are annotated with linguistic features from LIWC (Linguistic Inquiry and Word Count) [1], [2] and additional features: the number of characters in the comment string, a sentiment score, and a subjectivity score.

    1. Pennebaker, J. W., Francis, M. E. & Booth, R. J. Linguistic inquiry and word count: Liwc 2001. Mahway: Lawrence Erlbaum Assoc. 71, 2001 (2001).
    2. Tausczik, Y. R. & Pennebaker, J. W. The psychological meaning of words: Liwc and computerised text analysis methods. J. language social psychology 29, 24–54 (2010)
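The description above does not name the tools used for the sentiment and subjectivity scores; as a hedged illustration only, the snippet below computes the three additional features for a new comment with TextBlob, which is one possible choice rather than the documented pipeline.

from textblob import TextBlob

comment = "Vaccines saved millions of lives."   # example input, not from the corpus
blob = TextBlob(comment)
features = {
    "n_characters": len(comment),
    "sentiment_score": blob.sentiment.polarity,          # in [-1, 1]
    "subjectivity_score": blob.sentiment.subjectivity,   # in [0, 1]
}
print(features)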
  16. Annotated GMB Corpus

    • kaggle.com
    zip
    Updated Oct 7, 2018
    Cite
    Shoumik (2018). Annotated GMB Corpus [Dataset]. https://www.kaggle.com/shoumikgoswami/annotated-gmb-corpus
    Explore at:
zip (473318 bytes)
    Dataset updated
    Oct 7, 2018
    Authors
    Shoumik
    License

Database Contents License (DbCL) 1.0 - http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    Context

Named Entity Recognition on an annotated corpus: the GMB (Groningen Meaning Bank) corpus is used for entity classification, with enhanced and popular Natural Language Processing features applied to the data set.

    Content

    The dataset is an extract from the GMB corpus, which is tagged, annotated, and built specifically to train a classifier to predict named entities such as names, locations, etc. GMB is a fairly large corpus with a lot of annotations. Unfortunately, GMB is not perfect: it is not a gold-standard corpus, meaning that it is not completely human-annotated and is not considered 100% correct. The corpus was created using existing annotation tools and then corrected by humans where needed. The attached dataset is in tab-separated format, and the goal is to build a good model to classify the Tag column. The data is labelled using the IOB tagging system. The classes in the dataset are:

    • geo = Geographical Entity
    • org = Organization
    • per = Person
    • gpe = Geopolitical Entity
    • tim = Time indicator
    • art = Artifact
    • eve = Event
    • nat = Natural Phenomenon
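
    As a starting point for the classification task described above, the sketch below loads a tab-separated token/tag file and regroups the IOB tags into entity spans. The column names ('Sentence #', 'Word', 'Tag'), the encoding, and the file name are assumptions based on common versions of this corpus; adjust them to the actual header of the download.

    ```python
    import pandas as pd

    def read_gmb(path: str) -> pd.DataFrame:
        """Load the tab-separated corpus; assumed columns: 'Sentence #', 'Word', 'POS', 'Tag'."""
        df = pd.read_csv(path, sep="\t", encoding="latin1")  # adjust encoding if needed
        # The sentence number is often given only on the first token of each sentence.
        df["Sentence #"] = df["Sentence #"].ffill()
        return df

    def iob_to_entities(tokens, tags):
        """Collapse IOB tags (B-geo, I-geo, O, ...) into (entity_text, entity_type) spans."""
        entities, current, current_type = [], [], None
        for token, tag in zip(tokens, tags):
            if tag.startswith("B-"):
                if current:
                    entities.append((" ".join(current), current_type))
                current, current_type = [token], tag[2:]
            elif tag.startswith("I-") and current:
                current.append(token)
            else:  # 'O' tag, or a stray 'I-' without a preceding 'B-'
                if current:
                    entities.append((" ".join(current), current_type))
                current, current_type = [], None
        if current:
            entities.append((" ".join(current), current_type))
        return entities

    # Example with a hypothetical file name:
    # df = read_gmb("annotated_gmb.tsv")
    # for _, sent in df.groupby("Sentence #"):
    #     print(iob_to_entities(sent["Word"].tolist(), sent["Tag"].tolist()))
    ```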

    Acknowledgements

    The dataset is a subset of the original dataset shared here - https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus/kernels

    Inspiration

    The data can be used by anyone who is starting off with NER in NLP.

  17. 25M+ Images | AI Training Data | Annotated imagery data for AI | Object &...

    • datarade.ai
    + more versions
    Cite
    Data Seeds, 25M+ Images | AI Training Data | Annotated imagery data for AI | Object & Scene Detection | Global Coverage [Dataset]. https://datarade.ai/data-products/15m-images-ai-training-data-annotated-imagery-data-for-a-data-seeds
    Explore at:
    .bin, .json, .xml, .csv, .xls, .sql, .txtAvailable download formats
    Dataset authored and provided by
    Data Seeds
    Area covered
    Venezuela (Bolivarian Republic of), Macedonia (the former Yugoslav Republic of), Bulgaria, Iraq, Botswana, China, Cabo Verde, Tanzania (United Republic of), Sierra Leone, Honduras
    Description

    This dataset features over 25,000,000 high-quality general-purpose images sourced from photographers worldwide. Designed to support a wide range of AI and machine learning applications, it offers a richly diverse and extensively annotated collection of everyday visual content.

    Key Features:

    1. Comprehensive Metadata: the dataset includes full EXIF data, detailing camera settings such as aperture, ISO, shutter speed, and focal length (a minimal EXIF-reading sketch follows this description). Additionally, each image is pre-annotated with object and scene detection metadata, making it ideal for tasks like classification, detection, and segmentation. Popularity metrics, derived from engagement on our proprietary platform, are also included.

    2. Unique Sourcing Capabilities: the images are collected through a proprietary gamified platform for photographers. Competitions spanning various themes ensure a steady influx of diverse, high-quality submissions. Custom datasets can be sourced on demand within 72 hours, allowing specific requirements, such as themes, subjects, or scenarios, to be met efficiently.

    3. Global Diversity: photographs have been sourced from contributors in over 100 countries, covering a wide range of human experiences, cultures, environments, and activities. The dataset includes images of people, nature, objects, animals, urban and rural life, and more, captured across different times of day, seasons, and lighting conditions.

    4. High-Quality Imagery: the dataset includes images with resolutions ranging from standard to high-definition to meet the needs of various projects. Both professional and amateur photography styles are represented, offering a balance of realism and creativity across visual domains.

    5. Popularity Scores: each image is assigned a popularity score based on its performance in GuruShots competitions. This unique metric reflects how well the image resonates with a global audience, offering an additional layer of insight for AI models focused on aesthetics, engagement, or content curation.

    6. AI-Ready Design: this dataset is optimized for AI applications, making it ideal for training models in general image recognition, multi-label classification, content filtering, and scene understanding. It integrates easily with leading machine learning frameworks and pipelines.

    7. Licensing & Compliance: the dataset complies fully with data privacy regulations and offers transparent licensing for both commercial and academic use.

    Use Cases:
    1. Training AI models for general-purpose image classification and tagging.
    2. Enhancing content moderation and visual search systems.
    3. Building foundational datasets for large-scale vision-language models.
    4. Supporting research in computer vision, multimodal AI, and generative modeling.

    This dataset offers a comprehensive, diverse, and high-quality resource for training AI and ML models across a wide array of domains. Customizations are available to suit specific project needs. Contact us to learn more!
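
    As a quick illustration of how EXIF metadata of the kind listed under Key Features can be consumed, the sketch below reads aperture, ISO, shutter speed, and focal length from an image file with Pillow (assuming a reasonably recent version). The file name is hypothetical, and which fields are actually present will vary from image to image.

    ```python
    from PIL import Image, ExifTags

    def read_exposure_exif(path: str) -> dict:
        """Return exposure-related EXIF fields from one image, where present."""
        exif = Image.open(path).getexif()
        # Camera settings live in the Exif sub-IFD (tag 0x8769); merge it with the base IFD.
        merged = dict(exif)
        merged.update(exif.get_ifd(0x8769))
        named = {ExifTags.TAGS.get(tag_id, tag_id): value for tag_id, value in merged.items()}
        wanted = ("FNumber", "ExposureTime", "ISOSpeedRatings", "FocalLength")
        return {key: named[key] for key in wanted if key in named}

    # Example with a hypothetical file name:
    # print(read_exposure_exif("sample_photo.jpg"))
    ```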

  18. Robotics Data Labeling Services Market Research Report 2033

    • dataintelo.com
    csv, pdf, pptx
    Updated Sep 30, 2025
    Cite
    Dataintelo (2025). Robotics Data Labeling Services Market Research Report 2033 [Dataset]. https://dataintelo.com/report/robotics-data-labeling-services-market
    Explore at:
    pptx, csv, pdfAvailable download formats
    Dataset updated
    Sep 30, 2025
    Dataset authored and provided by
    Dataintelo
    License

    https://dataintelo.com/privacy-and-policy

    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Robotics Data Labeling Services Market Outlook



    According to our latest research, the global robotics data labeling services market size reached USD 1.34 billion in 2024, reflecting robust expansion fueled by the rapid adoption of robotics across multiple industries. The market is set to grow at a CAGR of 21.7% from 2025 to 2033, reaching an estimated USD 9.29 billion by 2033. This impressive growth trajectory is primarily driven by increasing investments in artificial intelligence (AI), machine learning (ML), and automation technologies, which demand high-quality labeled data for effective robotics training and deployment. The proliferation of autonomous systems and the need for precise data annotation are the key contributors to this market’s upward momentum.
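
    As a rough sanity check on figures like these, compound annual growth can be reproduced with the standard CAGR formula. The sketch below only illustrates the arithmetic; small deviations from a report's quoted end value are expected because of rounding and differing base years (the quoted rate is measured over 2025-2033, while the base size is for 2024).

    ```python
    def project(value: float, cagr: float, years: int) -> float:
        """Project a value forward at a constant compound annual growth rate."""
        return value * (1.0 + cagr) ** years

    def implied_cagr(start: float, end: float, years: int) -> float:
        """Back out the constant annual growth rate implied by a start and an end value."""
        return (end / start) ** (1.0 / years) - 1.0

    # Figures quoted above: USD 1.34 billion in 2024 and USD 9.29 billion in 2033.
    print(f"Implied CAGR 2024-2033: {implied_cagr(1.34, 9.29, 9):.1%}")
    # Projecting the 2024 base at the quoted 21.7% rate gives a ballpark figure, not an
    # exact match, because of the base-year convention and rounding noted above.
    print(f"2033 projection at 21.7%: {project(1.34, 0.217, 9):.2f} billion")
    ```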




    One of the primary growth factors for the robotics data labeling services market is the accelerating adoption of AI-powered robotics in industrial and commercial domains. The increasing sophistication of robotics, especially in sectors like automotive manufacturing, logistics, and healthcare, requires vast amounts of accurately labeled data to train algorithms for object detection, navigation, and interaction. The emergence of Industry 4.0 and the transition toward smart factories have amplified the need for reliable data annotation services. Moreover, the growing complexity of robotic tasks necessitates not just basic labeling but advanced contextual annotation, further fueling demand. The rise in collaborative robots (cobots) in manufacturing environments also underlines the necessity for precise data labeling to ensure safety and efficiency.




    Another significant driver is the surge in autonomous vehicle development, which relies heavily on high-quality labeled data for perception, decision-making, and real-time response. Automotive giants and tech startups alike are investing heavily in robotics data labeling services to enhance the performance of their autonomous driving systems. The expansion of sensor technologies, including LiDAR, radar, and high-definition cameras, has led to an exponential increase in the volume and complexity of data that must be annotated. This trend is further supported by regulatory pressures to ensure the safety and reliability of autonomous systems, making robust data labeling a non-negotiable requirement for market players.




    Additionally, the healthcare sector is emerging as a prominent end-user of robotics data labeling services. The integration of robotics in surgical procedures, diagnostics, and patient care is driving demand for meticulously annotated datasets to train AI models in recognizing anatomical structures, pathological features, and procedural steps. The need for precision and accuracy in healthcare robotics is unparalleled, as errors can have significant consequences. As a result, healthcare organizations are increasingly outsourcing data labeling tasks to specialized service providers to leverage their expertise and ensure compliance with stringent regulatory standards. The expansion of telemedicine and remote diagnostics is also contributing to the growing need for reliable data annotation in healthcare robotics.




    From a regional perspective, North America currently dominates the robotics data labeling services market, accounting for the largest share in 2024, followed closely by Asia Pacific and Europe. The United States is at the forefront, driven by substantial investments in AI research, a strong presence of leading robotics companies, and a mature technology ecosystem. Meanwhile, Asia Pacific is experiencing the fastest growth, propelled by large-scale industrial automation initiatives in China, Japan, and South Korea. Europe remains a critical market, driven by advancements in automotive and healthcare robotics, as well as supportive government policies. The Middle East & Africa and Latin America are also witnessing gradual adoption, primarily in manufacturing and logistics sectors, albeit at a slower pace compared to other regions.



    Service Type Analysis



    The service type segment in the robotics data labeling services market encompasses image labeling, video labeling, sensor data labeling, text labeling, and others. Image labeling remains the cornerstone of data annotation for robotics, as computer vision is integral to most robotic applications. The demand for image labeling services has surged with the proliferation of robots that rely on visual perception for nav

  19. Image Dataset of Accessibility Barriers

    • zenodo.org
    zip
    Updated Mar 25, 2022
    Cite
    Jakob Stolberg; Jakob Stolberg (2022). Image Dataset of Accessibility Barriers [Dataset]. http://doi.org/10.5281/zenodo.6382090
    Explore at:
    zipAvailable download formats
    Dataset updated
    Mar 25, 2022
    Dataset provided by
    Zenodo http://zenodo.org/
    Authors
    Jakob Stolberg; Jakob Stolberg
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Data
    The dataset consists of 5,538 images of public spaces, annotated with steps, stairs, ramps, and grab bars for stairs and ramps. It contains 3,564 annotations of steps, 1,492 of stairs, 143 of ramps, and 922 of grab bars.

    Each step annotation is attributed with an estimate of the height of the step, as falling into one of three categories: less than 3cm, 3cm to 7cm or more than 7cm. Additionally it is attributed with a 'type', with the possibilities 'doorstep', 'curb' or 'other'.

    Stair annotations are attributed with the number of steps in the stair.

    Ramps are attributed with an estimate of their width, also falling into three categories: less than 50cm, 50cm to 100cm and more than 100cm.

    In order to preserve all additional attributes of the labels, the data is published in the CVAT XML format for images.
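
    Since the annotations are shipped in the CVAT XML format for images, they can be loaded with the Python standard library; the sketch below collects each bounding box together with its label and extra attributes. The element and attribute names follow the usual "CVAT for images" layout, and the exact label and attribute spellings in this dataset are assumptions to verify against the files.

    ```python
    import xml.etree.ElementTree as ET

    def load_cvat_boxes(xml_path: str):
        """Yield one dict per bounding box from a 'CVAT for images' annotation file."""
        root = ET.parse(xml_path).getroot()
        for image in root.iter("image"):
            for box in image.findall("box"):
                yield {
                    "image": image.get("name"),
                    "label": box.get("label"),  # e.g. step, stair, ramp, grab bar
                    "bbox": tuple(float(box.get(k)) for k in ("xtl", "ytl", "xbr", "ybr")),
                    # Extra attributes such as step height or ramp width are child elements.
                    "attributes": {a.get("name"): a.text for a in box.findall("attribute")},
                }

    # Example with a hypothetical file name:
    # boxes = list(load_cvat_boxes("annotations.xml"))
    # print(boxes[0])
    ```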

    Annotating Process
    The labelling has been done using bounding boxes around the objects. This format is compatible with many popular object detection models, e.g. the YOLO family of models. A bounding box is placed so that it contains exactly the visible part of the respective object. This implies that only objects that are visible in the photo are annotated. In particular, this means that a photo of a stair or step taken from above, where the object cannot be seen, has not been annotated, even when a human viewer could infer that there is a stair or a step from other features in the photo.

    Steps
    A step is annotated when there is a vertical increment that functions as a passage between two surface areas intended for human or vehicle traffic. This means that we have not included:

    • Increments that are too high to reasonably be considered a passage.
    • Increments that do not lead to a surface intended for human or vehicle traffic, e.g. a 'step' in front of a wall or a curb in front of a bush.

    In particular, the bounding box of a step object contains exactly the incremental part of the step, but does not extend into the top or bottom horizontal surface any more than necessary to entirely enclose the incremental part. This has been chosen for consistency reasons, as including parts of the horizontal surfaces would imply a non-trivial choice of how much to include, which we deemed would most likely lead to more inconsistent annotations.

    The heights of the steps are estimated by the annotators, and are therefore not guaranteed to be accurate.

    The type of a step typically falls into the category 'doorstep' or 'curb'. Steps that are in a doorway, entrance or likewise are attributed as doorsteps. We also include in this category steps that immediately lead to a doorway within a proximity of 1-2m. Steps between different types of pathways, e.g. between streets and sidewalks, are annotated as curbs. Any other type of step is annotated with 'other'. Many of the 'other' steps are, for example, steps to terraces.

    Stairs
    The stair label is used whenever two or more steps directly follow each other in a consistent pattern. All vertical increments are enclosed in the bounding box, as well as the intermediate surfaces of the steps. However, the top and bottom surfaces are not included more than necessary, for the same reason as for steps, as described in the previous section.

    The annotator counts the number of steps and attributes this count to the stair object label.

    Ramps
    Ramps have been annotated when a sloped passageway has been placed or built to connect two surface areas intended for human or vehicle traffic. This implies the same considerations as with steps. Likewise, only the sloped part of a ramp is annotated, not including the bottom or top surface area.

    For each ramp, the annotator makes an assessment of the width of the ramp in three categories: less than 50cm, 50cm to 100cm and more than 100cm. This parameter is visually hard to assess, and sometimes impossible due to the view of the ramp.

    Grab Bars
    Grab bars are annotated for hand rails and similar objects that are in direct connection to a stair or a ramp. While horizontal grab bars could also have been included, this was omitted due to the implied ambiguities with fences and similar objects. As the grab bar was originally intended as attribute information for stairs and ramps, we chose to keep this focus. The bounding box encloses the part of the grab bar that functions as a hand rail for the stair or ramp.

    Usage
    As is often the case when annotating data, much information depends on the subjective assessment of the annotator. As each data point in this dataset has been annotated by only one person, caution should be taken when the data is applied.

    Generally speaking, the mindset and usage guiding the annotations have been wheelchair accessibility. While we have strived to annotate at an object level, hopefully making the data more widely applicable, we state this explicitly as it may have swayed non-trivial annotation choices.

    The attribute data, such as step height or ramp width, are highly subjective estimations. We still provide this data to give a post-hoc method for adjusting which annotations to use. For example, for some purposes one may be interested in detecting only steps that are indeed more than 3cm high. The attribute data makes it possible to filter out the steps of less than 3cm, so a machine learning algorithm can be trained on a more appropriate dataset for that use case (see the filtering sketch below). We stress, however, that one cannot expect to train accurate machine learning algorithms to infer the attribute data, as this is not accurate data in the first place.
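
    Building on the CVAT parsing sketch shown earlier in this entry, that kind of post-hoc filtering is a short list comprehension. The attribute name 'height', the category string, and the file name are assumptions to check against the actual annotation files.

    ```python
    # Keep only step annotations whose estimated height is not in the lowest category.
    tall_steps = [
        box for box in load_cvat_boxes("annotations.xml")  # hypothetical file name
        if box["label"] == "step" and box["attributes"].get("height") != "less than 3cm"
    ]
    ```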

    We hope this dataset will be a useful building block in the endeavours for automating barrier detection and documentation.

  20. Data_Sheet_1_An Estimation of Online Video User Engagement From Features of...

    • frontiersin.figshare.com
    pdf
    Updated Jun 6, 2023
    Cite
    Lukas Stappen; Alice Baird; Michelle Lienhart; Annalena Bätz; Björn Schuller (2023). Data_Sheet_1_An Estimation of Online Video User Engagement From Features of Time- and Value-Continuous, Dimensional Emotions.pdf [Dataset]. http://doi.org/10.3389/fcomp.2022.773154.s001
    Explore at:
    pdfAvailable download formats
    Dataset updated
    Jun 6, 2023
    Dataset provided by
    Frontiers
    Authors
    Lukas Stappen; Alice Baird; Michelle Lienhart; Annalena Bätz; Björn Schuller
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Portraying emotion and trustworthiness is known to increase the appeal of video content. However, the causal relationship between these signals and online user engagement is not well understood. This limited understanding is partly due to a scarcity in emotionally annotated data and the varied modalities which express user engagement online. In this contribution, we utilize a large dataset of YouTube review videos which includes ca. 600 h of dimensional arousal, valence and trustworthiness annotations. We investigate features extracted from these signals against various user engagement indicators including views, like/dislike ratio, as well as the sentiment of comments. In doing so, we identify the positive and negative influences which single features have, as well as interpretable patterns in each dimension which relate to user engagement. Our results demonstrate that smaller boundary ranges and fluctuations for arousal lead to an increase in user engagement. Furthermore, the extracted time-series features reveal significant (p < 0.05) correlations for each dimension, such as, count below signal mean (arousal), number of peaks (valence), and absolute energy (trustworthiness). From this, an effective combination of features is outlined for approaches aiming to automatically predict several user engagement indicators. In a user engagement prediction paradigm we compare all features against semi-automatic (cross-task), and automatic (task-specific) feature selection methods. These selected feature sets appear to outperform the usage of all features, e.g., using all features achieves 1.55 likes per day (Lp/d) mean absolute error from valence; this improves through semi-automatic and automatic selection to 1.33 and 1.23 Lp/d, respectively (data mean 9.72 Lp/d with a std. 28.75 Lp/d).
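
    The time-series features named above (count below signal mean, number of peaks, absolute energy) are standard summary statistics and can be computed directly from an annotation signal. The sketch below is a minimal numpy/scipy version applied to a synthetic trace; it is not the authors' feature-extraction code, and a dedicated library such as tsfresh would offer many more features.

    ```python
    import numpy as np
    from scipy.signal import find_peaks

    def ts_features(signal: np.ndarray) -> dict:
        """A few of the time-series summaries mentioned above, for one annotation signal."""
        signal = np.asarray(signal, dtype=float)
        return {
            "count_below_mean": int(np.sum(signal < signal.mean())),
            "number_of_peaks": int(len(find_peaks(signal)[0])),
            "abs_energy": float(np.sum(signal ** 2)),
        }

    # Synthetic stand-in for a time-continuous arousal/valence/trustworthiness trace.
    rng = np.random.default_rng(0)
    trace = np.sin(np.linspace(0, 20, 500)) + 0.1 * rng.standard_normal(500)
    print(ts_features(trace))
    ```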
