31 datasets found
  1. PEARC20 submitted paper: "Scientific Data Annotation and Dissemination: Using the ʻIke Wai Gateway to Manage Research Data"

    • hydroshare.org
    • beta.hydroshare.org
    zip
    Updated Jul 29, 2020
    Cite
    Sean Cleveland; Gwen Jacobs; Jennifer Geis (2020). PEARC20 submitted paper: "Scientific Data Annotation and Dissemination: Using the ʻIke Wai Gateway to Manage Research Data" [Dataset]. http://doi.org/10.4211/hs.d66ef2686787403698bac5368a29b056
    Explore at:
    zip (873 bytes)
    Dataset updated
    Jul 29, 2020
    Dataset provided by
    HydroShare
    Authors
    Sean Cleveland; Gwen Jacobs; Jennifer Geis
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Time period covered
    Jul 29, 2020
    Description

    Abstract: Granting agencies invest millions of dollars in the generation and analysis of data, making these products extremely valuable. However, without sufficient annotation of the methods used to collect and analyze the data, the ability to reproduce and reuse those products suffers. This lack of assurance of the quality and credibility of the data at the different stages of the research process essentially wastes much of the investment of time and funding and fails to drive research forward to the level that would be possible if everything were effectively annotated and disseminated to the wider research community. In order to address this issue for the Hawaiʻi Established Program to Stimulate Competitive Research (EPSCoR) project, a water science gateway called the ʻIke Wai Gateway was developed at the University of Hawaiʻi (UH). In Hawaiian, ʻIke means knowledge and Wai means water. The gateway supports research in hydrology and water management by providing tools to address questions of water sustainability in Hawaiʻi. The gateway provides a framework for data acquisition, analysis, model integration, and display of data products. It is intended to complement and integrate with the capabilities of the Consortium of Universities for the Advancement of Hydrologic Science's (CUAHSI) HydroShare by providing sound data and metadata management capabilities for multi-domain field observations, analytical lab actions, and modeling outputs. Functionality provided by the gateway is supported by a subset of CUAHSI's Observations Data Model (ODM), delivered as centralized web-based user interfaces and APIs supporting multi-domain data management, computation, analysis, and visualization tools to support reproducible science, modeling, data discovery, and decision support for the Hawaiʻi EPSCoR ʻIke Wai research team and the wider Hawaiʻi hydrology community. By leveraging the Tapis platform, UH has constructed a gateway that ties data and advanced computing resources together to support diverse research domains including microbiology, geochemistry, geophysics, economics, and humanities, coupled with computational and modeling workflows delivered in a user-friendly web interface with workflows for effectively annotating the project data and products. Disseminating results for the ʻIke Wai project through the ʻIke Wai data gateway and HydroShare makes the research products accessible and reusable.

  2. Taxonomies for Semantic Research Data Annotation

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 23, 2024
    Cite
    Göpfert, Christoph; Haas, Jan Ingo; Schröder, Lucas; Gaedke, Martin (2024). Taxonomies for Semantic Research Data Annotation [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7908854
    Explore at:
    Dataset updated
    Jul 23, 2024
    Dataset provided by
    Technische Universität Chemnitz
    Authors
    Göpfert, Christoph; Haas, Jan Ingo; Schröder, Lucas; Gaedke, Martin
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains 35 of 39 taxonomies that were the result of a systematic review. The systematic review was conducted with the goal of identifying taxonomies suitable for semantically annotating research data. A special focus was set on research data from the hybrid societies domain.

    The following taxonomies were identified as part of the systematic review:

    Filename: Taxonomy Title

    acm_ccs: ACM Computing Classification System [1]
    amec: A Taxonomy of Evaluation Towards Standards [2]
    bibo: A BIBO Ontology Extension for Evaluation of Scientific Research Results [3]
    cdt: Cross-Device Taxonomy [4]
    cso: Computer Science Ontology [5]
    ddbm: What Makes a Data-driven Business Model? A Consolidated Taxonomy [6]
    ddi_am: DDI Aggregation Method [7]
    ddi_moc: DDI Mode of Collection [8]
    n/a: DemoVoc [9]
    discretization: Building a New Taxonomy for Data Discretization Techniques [10]
    dp: Demopaedia [11]
    dsg: Data Science Glossary [12]
    ease: A Taxonomy of Evaluation Approaches in Software Engineering [13]
    eco: Evidence & Conclusion Ontology [14]
    edam: EDAM: The Bioscientific Data Analysis Ontology [15]
    n/a: European Language Social Science Thesaurus [16]
    et: Evaluation Thesaurus [17]
    glos_hci: The Glossary of Human Computer Interaction [18]
    n/a: Humanities and Social Science Electronic Thesaurus [19]
    hcio: A Core Ontology on the Human-Computer Interaction Phenomenon [20]
    hft: Human-Factors Taxonomy [21]
    hri: A Taxonomy to Structure and Analyze Human–Robot Interaction [22]
    iim: A Taxonomy of Interaction for Instructional Multimedia [23]
    interrogation: A Taxonomy of Interrogation Methods [24]
    iot: Design Vocabulary for Human–IoT Systems Communication [25]
    kinect: Understanding Movement and Interaction: An Ontology for Kinect-Based 3D Depth Sensors [26]
    maco: Thesaurus Mass Communication [27]
    n/a: Thesaurus Cognitive Psychology of Human Memory [28]
    mixed_initiative: Mixed-Initiative Human-Robot Interaction: Definition, Taxonomy, and Survey [29]
    qos_qoe: A Taxonomy of Quality of Service and Quality of Experience of Multimodal Human-Machine Interaction [30]
    ro: The Research Object Ontology [31]
    senses_sensors: A Human-Centered Taxonomy of Interaction Modalities and Devices [32]
    sipat: A Taxonomy of Spatial Interaction Patterns and Techniques [33]
    social_errors: A Taxonomy of Social Errors in Human-Robot Interaction [34]
    sosa: Semantic Sensor Network Ontology [35]
    swo: The Software Ontology [36]
    tadirah: Taxonomy of Digital Research Activities in the Humanities [37]
    vrs: Virtual Reality and the CAVE: Taxonomy, Interaction Challenges and Research Directions [38]
    xdi: Cross-Device Interaction [39]

    We converted the taxonomies into SKOS (Simple Knowledge Organisation System) representation. The following 4 taxonomies were not converted as they were already available in SKOS and were for this reason excluded from this dataset:

    1) DemoVoc, cf. http://thesaurus.web.ined.fr/navigateur/ available at https://thesaurus.web.ined.fr/exports/demovoc/demovoc.rdf

    2) European Language Social Science Thesaurus, cf. https://thesauri.cessda.eu/elsst/en/ available at https://zenodo.org/record/5506929

    3) Humanities and Social Science Electronic Thesaurus, cf. https://hasset.ukdataservice.ac.uk/hasset/en/ available at https://zenodo.org/record/7568355

    4) Thesaurus Cognitive Psychology of Human Memory, cf. https://www.loterre.fr/presentation/ available at https://skosmos.loterre.fr/P66/en/
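
    For working with the 35 converted taxonomies distributed here, a minimal loading sketch with rdflib is shown below. The filename and Turtle serialization (acm_ccs.ttl) are assumptions, so adjust them to the files actually shipped in the archive.

    from rdflib import Graph, RDF
    from rdflib.namespace import SKOS

    # Load one of the converted taxonomies; the filename and serialization
    # are assumptions -- use whatever files the archive actually contains.
    g = Graph()
    g.parse("acm_ccs.ttl")

    # List every skos:Concept with its preferred label.
    for concept in g.subjects(RDF.type, SKOS.Concept):
        print(concept, g.value(concept, SKOS.prefLabel))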

    References

    [1] "The 2012 ACM Computing Classification System," ACM Digital Library, 2012. https://dl.acm.org/ccs (accessed May 08, 2023).

    [2] AMEC, "A Taxonomy of Evaluation Towards Standards." Aug. 31, 2016. Accessed: May 08, 2023. [Online]. Available: https://amecorg.com/amecframework/home/supporting-material/taxonomy/

    [3] B. Dimić Surla, M. Segedinac, and D. Ivanović, "A BIBO ontology extension for evaluation of scientific research results," in Proceedings of the Fifth Balkan Conference in Informatics, in BCI ’12. New York, NY, USA: Association for Computing Machinery, Sep. 2012, pp. 275–278. doi: 10.1145/2371316.2371376.

    [4] F. Brudy et al., "Cross-Device Taxonomy: Survey, Opportunities and Challenges of Interactions Spanning Across Multiple Devices," in Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, in CHI ’19. New York, NY, USA: Association for Computing Machinery, May 2019, pp. 1–28. doi: 10.1145/3290605.3300792.

    [5] A. A. Salatino, T. Thanapalasingam, A. Mannocci, F. Osborne, and E. Motta, "The Computer Science Ontology: A Large-Scale Taxonomy of Research Areas," in Lecture Notes in Computer Science 1137, D. Vrandečić, K. Bontcheva, M. C. Suárez-Figueroa, V. Presutti, I. Celino, M. Sabou, L.-A. Kaffee, and E. Simperl, Eds., Monterey, California, USA: Springer, Oct. 2018, pp. 187–205. Accessed: May 08, 2023. [Online]. Available: http://oro.open.ac.uk/55484/

    [6] M. Dehnert, A. Gleiss, and F. Reiss, "What makes a data-driven business model? A consolidated taxonomy," presented at the European Conference on Information Systems, 2021.

    [7] DDI Alliance, "DDI Controlled Vocabulary for Aggregation Method," 2014. https://ddialliance.org/Specification/DDI-CV/AggregationMethod_1.0.html (accessed May 08, 2023).

    [8] DDI Alliance, "DDI Controlled Vocabulary for Mode Of Collection," 2015. https://ddialliance.org/Specification/DDI-CV/ModeOfCollection_2.0.html (accessed May 08, 2023).

    [9] INED - French Institute for Demographic Studies, "Thésaurus DemoVoc," Feb. 26, 2020. https://thesaurus.web.ined.fr/navigateur/en/about (accessed May 08, 2023).

    [10] A. A. Bakar, Z. A. Othman, and N. L. M. Shuib, "Building a new taxonomy for data discretization techniques," in 2009 2nd Conference on Data Mining and Optimization, Oct. 2009, pp. 132–140. doi: 10.1109/DMO.2009.5341896.

    [11] N. Brouard and C. Giudici, "Unified second edition of the Multilingual Demographic Dictionary (Demopaedia.org project)," presented at the 2017 International Population Conference, IUSSP, Oct. 2017. Accessed: May 08, 2023. [Online]. Available: https://iussp.confex.com/iussp/ipc2017/meetingapp.cgi/Paper/5713

    [12] DuCharme, Bob, "Data Science Glossary." https://www.datascienceglossary.org/ (accessed May 08, 2023).

    [13] A. Chatzigeorgiou, T. Chaikalis, G. Paschalidou, N. Vesyropoulos, C. K. Georgiadis, and E. Stiakakis, "A Taxonomy of Evaluation Approaches in Software Engineering," in Proceedings of the 7th Balkan Conference on Informatics Conference, in BCI ’15. New York, NY, USA: Association for Computing Machinery, Sep. 2015, pp. 1–8. doi: 10.1145/2801081.2801084.

    [14] M. C. Chibucos, D. A. Siegele, J. C. Hu, and M. Giglio, "The Evidence and Conclusion Ontology (ECO): Supporting GO Annotations," in The Gene Ontology Handbook, C. Dessimoz and N. Škunca, Eds., in Methods in Molecular Biology. New York, NY: Springer, 2017, pp. 245–259. doi: 10.1007/978-1-4939-3743-1_18.

    [15] M. Black et al., "EDAM: the bioscientific data analysis ontology," F1000Research, vol. 11, Jan. 2021, doi: 10.7490/f1000research.1118900.1.

    [16] Council of European Social Science Data Archives (CESSDA), "European Language Social Science Thesaurus ELSST," 2021. https://thesauri.cessda.eu/en/ (accessed May 08, 2023).

    [17] M. Scriven, Evaluation Thesaurus, 3rd Edition. Edgepress, 1981. Accessed: May 08, 2023. [Online]. Available: https://us.sagepub.com/en-us/nam/evaluation-thesaurus/book3562

    [18] Papantoniou, Bill et al., The Glossary of Human Computer Interaction. Interaction Design Foundation. Accessed: May 08, 2023. [Online]. Available: https://www.interaction-design.org/literature/book/the-glossary-of-human-computer-interaction

    [19] "UK Data Service Vocabularies: HASSET Thesaurus." https://hasset.ukdataservice.ac.uk/hasset/en/ (accessed May 08, 2023).

    [20] S. D. Costa, M. P. Barcellos, R. de A. Falbo, T. Conte, and K. M. de Oliveira, "A core ontology on the Human–Computer Interaction phenomenon," Data Knowl. Eng., vol. 138, p. 101977, Mar. 2022, doi: 10.1016/j.datak.2021.101977.

    [21] V. J. Gawron et al., "Human Factors Taxonomy," Proc. Hum. Factors Soc. Annu. Meet., vol. 35, no. 18, pp. 1284–1287, Sep. 1991, doi: 10.1177/154193129103501807.

    [22] L. Onnasch and E. Roesler, "A Taxonomy to Structure and Analyze Human–Robot Interaction," Int. J. Soc. Robot., vol. 13, no. 4, pp. 833–849, Jul. 2021, doi: 10.1007/s12369-020-00666-5.

    [23] R. A. Schwier, "A Taxonomy of Interaction for Instructional Multimedia." Sep. 28, 1992. Accessed: May 09, 2023. [Online]. Available: https://eric.ed.gov/?id=ED352044

    [24] C. Kelly, J. Miller, A. Redlich, and S. Kleinman, "A Taxonomy of Interrogation Methods,"

  3. Annotating speaker stance in discourse: the Brexit Blog Corpus (BBC)

    • demo.researchdata.se
    • researchdata.se
    Updated Jan 15, 2019
    Cite
    Andreas Kerren; Carita Paradis (2019). Annotating speaker stance in discourse: the Brexit Blog Corpus (BBC) [Dataset]. http://doi.org/10.5878/002925
    Explore at:
    Dataset updated
    Jan 15, 2019
    Dataset provided by
    Linnaeus University
    Authors
    Andreas Kerren; Carita Paradis
    Time period covered
    Jun 1, 2015 - May 31, 2016
    Description

    In this study, we explore to what extent language users agree about what kind of stances are expressed in natural language use or whether their interpretations diverge. In order to perform this task, a comprehensive cognitive-functional framework of ten stance categories was developed based on previous work on speaker stance in the literature. A corpus of opinionated texts, where speakers take stance and position themselves, was compiled, the Brexit Blog Corpus (BBC). An analytical interface for the annotations was set up and the data were annotated independently by two annotators. The annotation procedure, the annotation agreement and the co-occurrence of more than one stance category in the utterances are described and discussed. The careful, analytical annotation process has by and large returned satisfactory inter- and intra-annotation agreement scores, resulting in a gold standard corpus, the final version of the BBC.

    Purpose:

    The aim of this study is to explore the possibility of identifying speaker stance in discourse, to provide an analytical resource for it, and to evaluate the level of agreement across speakers in the area of stance-taking in discourse.

    The BBC is a collection of texts from blog sources. The corpus texts are thematically related to the 2016 UK referendum on whether the UK should remain a member of the European Union. The texts were extracted from the Internet from June to August 2015. With the Gavagai API (https://developer.gavagai.se), the texts were detected using seed words such as Brexit, EU referendum, pro-Europe, europhiles, eurosceptics, United States of Europe, David Cameron, or Downing Street. The retrieved URLs were filtered so that only entries described as blogs in English were selected. Each downloaded document was split into sentential utterances, from which 2,200 utterances were randomly selected as the analysis data set. The final size of the corpus is 1,682 utterances, 35,492 words (169,762 characters without spaces). Each utterance contains from 3 to 40 words, with a mean length of 21 words.

    For the data annotation process the Active Learning and Visual Analytics (ALVA) system (https://doi.org/10.1145/3132169 and https://doi.org/10.2312/eurp.20161139) was used. Two annotators, one who is a professional translator with a Licentiate degree in English Linguistics and the other one with a PhD in Computational Linguistics, carried out the annotations independently of one another.

    The data set can be downloaded in two different formats: a standard Microsoft Excel format and a raw data format (ZIP archive) which can be useful for analytical and machine learning purposes, for example with the Python library scikit-learn. The Excel file includes one additional variable (utterance word length). The ZIP archive contains a set of directories (e.g., "contrariety" and "prediction") corresponding to the stance categories. Inside each such directory, there are two nested directories corresponding to annotations which do or do not assign the respective category to utterances (e.g., inside the top-level category "prediction" there are two directories: "prediction", with the utterances that were labeled with this category, and "no", with the rest of the utterances). Inside the nested directories, there are text files containing individual utterances.
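
    As a minimal sketch of the scikit-learn use case mentioned above: assuming the ZIP archive has been extracted and bbc_raw/prediction is a hypothetical path to one top-level category directory (with its nested "prediction" and "no" class folders), the utterances can be loaded with load_files and used for a simple presence/absence baseline.

    from sklearn.datasets import load_files
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline

    # Hypothetical path to one extracted stance-category directory, which
    # contains the nested "prediction" and "no" class folders described above.
    data = load_files("bbc_raw/prediction", encoding="utf-8")

    # TF-IDF + logistic regression baseline for the binary category decision.
    clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    scores = cross_val_score(clf, data.data, data.target, cv=5, scoring="f1")
    print(f"Mean F1 over 5 folds: {scores.mean():.3f}")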

    When using data from this study, the primary researcher wishes citation also to be made to the publication: Vasiliki Simaki, Carita Paradis, Maria Skeppstedt, Magnus Sahlgren, Kostiantyn Kucher, and Andreas Kerren. Annotating speaker stance in discourse: the Brexit Blog Corpus. In Corpus Linguistics and Linguistic Theory, 2017. De Gruyter, published electronically before print. https://doi.org/10.1515/cllt-2016-0060

  4. Data_Sheet_1_A Perceiver-Centered Approach for Representing and Annotating Prosodic Functions in Performed Music.PDF

    • figshare.com
    pdf
    Updated Jun 14, 2023
    Cite
    Daniel Bedoya; Lawrence Fyfe; Elaine Chew (2023). Data_Sheet_1_A Perceiver-Centered Approach for Representing and Annotating Prosodic Functions in Performed Music.PDF [Dataset]. http://doi.org/10.3389/fpsyg.2022.886570.s001
    Explore at:
    pdf
    Dataset updated
    Jun 14, 2023
    Dataset provided by
    Frontiers
    Authors
    Daniel Bedoya; Lawrence Fyfe; Elaine Chew
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Musical prosody is characterized by the acoustic variations that make music expressive. However, few systematic and scalable studies exist on the function it serves or on effective tools to carry out such studies. To address this gap, we introduce a novel approach to capturing information about prosodic functions through a citizen science paradigm. In typical bottom-up approaches to studying musical prosody, acoustic properties in performed music and basic musical structures such as accents and phrases are mapped to prosodic functions, namely segmentation and prominence. In contrast, our top-down, human-centered method puts listener annotations of musical prosodic functions first, to analyze the connection between these functions, the underlying musical structures, and acoustic properties. The method is applied primarily to exploring segmentation and prominence in performed solo piano music. These prosodic functions are marked by means of four annotation types—boundaries, regions, note groups, and comments—in the CosmoNote web-based citizen science platform, which presents the music signal or MIDI data and related acoustic features in information layers that can be toggled on and off. Various annotation strategies are discussed and appraised: intuitive vs. analytical; real-time vs. retrospective; and audio-based vs. visual. The end-to-end process of the data collection is described, from providing prosodic examples, to structuring and formatting the annotation data for analysis, to techniques for preventing precision errors. The aim is to obtain reliable and coherent annotations that can be applied to theoretical and data-driven models of musical prosody. The outcomes include a growing library of prosodic examples with the goal of achieving an annotation convention for studying musical prosody in performed music.

  5. ImageCLEF 2012 Image annotation and retrieval dataset (MIRFLICKR)

    • zenodo.org
    txt, zip
    Updated May 22, 2020
    Cite
    Bart Thomee; Adrian Popescu; Bart Thomee; Adrian Popescu (2020). ImageCLEF 2012 Image annotation and retrieval dataset (MIRFLICKR) [Dataset]. http://doi.org/10.5281/zenodo.1246796
    Explore at:
    zip, txt
    Dataset updated
    May 22, 2020
    Dataset provided by
    Zenodo
    Authors
    Bart Thomee; Adrian Popescu; Bart Thomee; Adrian Popescu
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0), https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    DESCRIPTION
    For this task, we use a subset of the MIRFLICKR (http://mirflickr.liacs.nl) collection. The entire collection contains 1 million images from the social photo sharing website Flickr and was formed by downloading up to a thousand photos per day that were deemed to be the most interesting according to Flickr. All photos in this collection were released by their users under a Creative Commons license, allowing them to be freely used for research purposes. Of the entire collection, 25 thousand images were manually annotated with a limited number of concepts and many of these annotations have been further refined and expanded over the lifetime of the ImageCLEF photo annotation task. This year we used crowd sourcing to annotate all of these 25 thousand images with the concepts.

    On this page we provide you with more information about the textual features, visual features and concept features we supply with each image in the collection we use for this year's task.


    TEXTUAL FEATURES
    All images are accompanied by the following textual features:

    - Flickr user tags
    These are the tags that the users assigned to the photos they uploaded to Flickr. The 'raw' tags are the original tags, while the 'clean' tags have been collapsed to lowercase and condensed to remove spaces.

    - EXIF metadata
    If available, the EXIF metadata contains information about the camera that took the photo and the parameters used. The 'raw' exif is the original camera data, while the 'clean' exif reduces the verbosity.

    - User information and Creative Commons license information
    This contains information about the user that took the photo and the license associated with it.


    VISUAL FEATURES
    Over the previous years of the photo annotation task we noticed that often the same types of visual features are used by the participants; in particular, features based on interest points and bag-of-words are popular. To assist you, we have extracted several features that you may want to use, so you can focus on the concept detection instead. We additionally give you some pointers to easy-to-use toolkits that will help you extract other features, or the same features with different default settings.

    - SIFT, C-SIFT, RGB-SIFT, OPPONENT-SIFT
    We used the ISIS Color Descriptors (http://www.colordescriptors.com) toolkit to extract these descriptors. This package provides you with many different types of features based on interest points, mostly using SIFT. It furthermore assists you with building codebooks for bag-of-words. The toolkit is available for Windows, Linux and Mac OS X.

    - SURF
    We used the OpenSURF (http://www.chrisevansdev.com/computer-vision-opensurf.html) toolkit to extract this descriptor. The open source code is available in C++, C#, Java and many more languages.

    - TOP-SURF
    We used the TOP-SURF (http://press.liacs.nl/researchdownloads/topsurf) toolkit to extract this descriptor, which represents images with SURF-based bag-of-words. The website provides codebooks of several different sizes that were created using a combination of images from the MIR-FLICKR collection and from the internet. The toolkit also offers the ability to create custom codebooks from your own image collection. The code is open source, written in C++ and available for Windows, Linux and Mac OS X.

    - GIST
    We used the LabelMe (http://labelme.csail.mit.edu) toolkit to extract this descriptor. The MATLAB-based library offers a comprehensive set of tools for annotating images.

    For the interest point-based features above we used a Fast Hessian-based technique to detect the interest points in each image. This detector is built into the OpenSURF library. In comparison with the Hessian-Laplace technique built into the ColorDescriptors toolkit it detects fewer points, resulting in a considerably reduced memory footprint. We therefore also provide you with the interest point locations in each image that the Fast Hessian-based technique detected, so when you would like to recalculate some features you can use them as a starting point for the extraction. The ColorDescriptors toolkit for instance accepts these locations as a separate parameter. Please go to http://www.imageclef.org/2012/photo-flickr/descriptors for more information on the file format of the visual features and how you can extract them yourself if you want to change the default settings.


    CONCEPT FEATURES
    We have solicited the help of workers on the Amazon Mechanical Turk platform to perform the concept annotation for us. To ensure a high standard of annotation we used the CrowdFlower platform that acts as a quality control layer by removing the judgments of workers that fail to annotate properly. We reused several concepts of last year's task and for most of these we annotated the remaining photos of the MIRFLICKR-25K collection that had not yet been used before in the previous task; for some concepts we reannotated all 25,000 images to boost their quality. For the new concepts we naturally had to annotate all of the images.

    - Concepts
    For each concept we indicate in which images it is present. The 'raw' concepts contain the judgments of all annotators for each image, where a '1' means an annotator indicated the concept was present whereas a '0' means the concept was not present, while the 'clean' concepts only contain the images for which the majority of annotators indicated the concept was present. Some images in the raw data for which we reused last year's annotations only have one judgment for a concept, whereas the other images have between three and five judgments; the single judgment does not mean only one annotator looked at it, as it is the result of a majority vote amongst last year's annotators.

    - Annotations
    For each image we indicate which concepts are present, so this is the reverse version of the data above. The 'raw' annotations contain the average agreement of the annotators on the presence of each concept, while the 'clean' annotations only include those for which there was a majority agreement amongst the annotators.

    You will notice that the annotations are not perfect. Especially when the concepts are more subjective or abstract, the annotators tend to disagree more with each other. The raw versions of the concept annotations should help you get an understanding of the exact judgments given by the annotators.
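
    As a minimal sketch of how the 'clean' labels relate to the 'raw' judgments described above (the in-memory representation below is an assumption; the distributed files have their own layout):

    def clean_from_raw(raw_judgments):
        """Derive 'clean' (majority-vote) labels from 'raw' 0/1 judgments.

        raw_judgments maps image id -> list of 0/1 judgments by individual
        annotators (a hypothetical in-memory form of the 'raw' files).
        Returns the ids for which a strict majority judged the concept present.
        """
        return {
            image_id
            for image_id, votes in raw_judgments.items()
            if votes and 2 * sum(votes) > len(votes)
        }

    # Toy example: 3 of 5 positive judgments -> included; 1 of 3 -> excluded.
    print(clean_from_raw({"im0001": [1, 1, 0, 1, 0], "im0002": [0, 0, 1]}))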

  6. Expert annotations for the Catalan Common Voice (v13)

    • data.niaid.nih.gov
    Updated May 2, 2024
    Cite
    Language Technologies Unit (2024). Expert annotations for the Catalan Common Voice (v13) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_11104387
    Explore at:
    Dataset updated
    May 2, 2024
    Dataset provided by
    Barcelona Supercomputing Center (https://www.bsc.es/)
    Authors
    Language Technologies Unit
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Description

    Dataset Summary

    These are the annotations made by a team of experts on the speakers with more than 1200 seconds recorded in the Catalan set of the Common Voice dataset (v13).

    The annotators were initially tasked with evaluating whether all recordings associated with the same speaker ID correspond to the same individual. Following that, they were instructed to annotate the speaker's accent, gender, and the overall quality of the recordings.

    The accents and genders taken into account are the ones used until version 8 of the Common Voice corpus.

    See annotations for more details.

    Supported Tasks and Leaderboards

    Gender classification, Accent classification.

    Languages

    The dataset is in Catalan (ca).

    Dataset Structure

    Instances

    Two xlsx documents are published, one for each round of annotations.

    The following information is available in each of the documents:

    {
      'speaker ID': '1b7fc0c4e437188bdf1b03ed21d45b780b525fd0dc3900b9759d0755e34bc25e31d64e69c5bd547ed0eda67d104fc0d658b8ec78277810830167c53ef8ced24b',
      'idx': '31',
      'same speaker': {'AN1': 'SI', 'AN2': 'SI', 'AN3': 'SI', 'agreed': 'SI', 'percentage': '100'},
      'gender': {'AN1': 'H', 'AN2': 'H', 'AN3': 'H', 'agreed': 'H', 'percentage': '100'},
      'accent': {'AN1': 'Central', 'AN2': 'Central', 'AN3': 'Central', 'agreed': 'Central', 'percentage': '100'},
      'audio quality': {'AN1': '4.0', 'AN2': '3.0', 'AN3': '3.0', 'agreed': '3.0', 'percentage': '66', 'mean quality': '3.33', 'stdev quality': '0.58'},
      'comments': {'AN1': '', 'AN2': 'pujades i baixades de volum', 'AN3': "Deu ser d'alguna zona de transició amb el central, perquè no fa una reducció total vocàlica, però hi té molta tendència"},
    }

    We also publish the document Guia anotació parlants.pdf, with the guidelines the annotators received.

    Data Fields

    speaker ID (string): An id for which client (voice) made the recording in the Common Voice corpus

    idx (int): Id in this corpus

    AN1 (string): Annotations from Annotator 1

    AN2 (string): Annotations from Annotator 2

    AN3 (string): Annotations from Annotator 3

    agreed (string): Annotation from the majority of the annotators

    percentage (int): Percentage of annotators that agree with the agreed annotation

    mean quality (float): Mean of the quality annotation

    stdev quality (float): Standard deviation of the mean quality
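
    A minimal sketch for re-deriving the agreed value and agreement percentage from the three annotator columns with pandas. The filename and the flat column names such as 'gender AN1' are hypothetical; the published spreadsheets group the AN1/AN2/AN3 columns per field, so adjust the names to the real layout.

    from collections import Counter

    import pandas as pd

    # Hypothetical filename and flat column names; adapt to the real layout.
    df = pd.read_excel("round1_annotations.xlsx")

    def majority(row, field):
        """Recompute the agreed value and agreement percentage for one field."""
        votes = [row[f"{field} AN{i}"] for i in (1, 2, 3)]
        value, count = Counter(votes).most_common(1)[0]
        return value, round(100 * count / len(votes))

    df[["gender agreed", "gender percentage"]] = df.apply(
        lambda row: pd.Series(majority(row, "gender")), axis=1
    )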

    Data Splits

    The corpus remains undivided into splits, as its purpose does not involve training models.

    Dataset Creation

    Curation Rationale

    During 2022, a campaign was launched to promote the Common Voice corpus within the Catalan-speaking community, achieving remarkable success. However, not all participants provided their demographic details such as age, gender, and accent. Additionally, some individuals faced difficulty in self-defining their accent using the standard classifications established by specialists.

    In order to obtain a balanced corpus with reliable information, we saw the necessity of enlisting a group of experts from the University of Barcelona to provide accurate annotations.

    We release the complete annotations because transparency is fundamental to our project. Furthermore, we believe they hold philological value for studying dialectal and gender variants.

    Source Data

    The original data comes from the Catalan sentences of the Common Voice corpus.

    Initial Data Collection and Normalization

    We have selected speakers who have recorded more than 1200 seconds of speech in the Catalan set of the version 13 of the Common Voice corpus.

    Who are the source language producers?

    The original data comes from the Catalan sentences of the Common Voice corpus.

    Annotations

    Annotation process

    Starting with version 13 of the Common Voice corpus we identified the speakers (273) who have recorded more than 1200 seconds of speech.

    A team of three annotators was tasked with annotating:

    if all the recordings correspond to the same person

    the gender of the speaker

    the accent of the speaker

    the quality of the recording

    They conducted an initial round of annotation, discussed their varying opinions, and subsequently conducted a second round.

    We release the complete annotations because transparency is fundamental to our project. Furthermore, we believe they hold philological value for studying dialectal and gender variants.

    Who are the annotators?

    The annotation was entrusted to the CLiC (Centre de Llenguatge i Computació) team from the University of Barcelona. They selected a group of three annotators (two men and one woman), who received a scholarship to do this work.

    The annotation team was composed of:

    Annotator 1: 1 female annotator, aged 18-25, L1 Catalan, student in the Modern Languages and Literatures degree, with a focus on Catalan.

    Annotators 2 & 3: 2 male annotators, aged 18-25, L1 Catalan, students in the Catalan Philology degree.

    1 female supervisor, aged 40-50, L1 Catalan, graduate in Physics and in Linguistics, Ph.D. in Signal Theory and Communications.

    To do the annotation, they used a Google Drive spreadsheet.

    Personal and Sensitive Information

    The Common Voice dataset consists of people who have donated their voice online. We don't share here their voices, but their gender and accent. You agree to not attempt to determine the identity of speakers in the Common Voice dataset.

    Considerations for Using the Data

    Social Impact of Dataset

    The IDs come from the Common Voice dataset, which consists of people who have donated their voices online.

    You agree to not attempt to determine the identity of speakers in the Common Voice dataset.

    The information from this corpus will allow us to train and evaluate well balanced Catalan ASR models. Furthermore, we believe they hold philological value for studying dialectal and gender variants.

    Discussion of Biases

    Most of the voices in the Catalan Common Voice correspond to men between 40 and 60 years old with a central accent. The aim of this dataset is to provide information that makes it possible to minimize the biases this could cause.

    For the gender annotation, we have only considered "H" (male) and "D" (female).

    Other Known Limitations

    [N/A]

    Additional Information

    Dataset Curators

    Language Technologies Unit at the Barcelona Supercomputing Center (langtech@bsc.es)

    This work has been promoted and financed by the Generalitat de Catalunya through the Aina project.

    Licensing Information

    This dataset is licensed under a CC BY 4.0 license.

    It can be used for any purpose, whether academic or commercial, under the terms of the license. Give appropriate credit, provide a link to the license, and indicate if changes were made.

    Citation Information

    DOI

    Contributions

    The annotation was entrusted to the STeL team from the University of Barcelona.

  7. Image Dataset of Accessibility Barriers

    • zenodo.org
    zip
    Updated Mar 25, 2022
    Cite
    Jakob Stolberg; Jakob Stolberg (2022). Image Dataset of Accessibility Barriers [Dataset]. http://doi.org/10.5281/zenodo.6382090
    Explore at:
    zip
    Dataset updated
    Mar 25, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jakob Stolberg; Jakob Stolberg
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Data
    The dataset consists of 5538 images of public spaces, annotated with steps, stairs, ramps and grab bars for stairs and ramps. The dataset has 3564 annotations of steps, 1492 of stairs, 143 of ramps and 922 of grab bars.

    Each step annotation is attributed with an estimate of the height of the step, as falling into one of three categories: less than 3cm, 3cm to 7cm or more than 7cm. Additionally it is attributed with a 'type', with the possibilities 'doorstep', 'curb' or 'other'.

    Stair annotations are attributed with the number of steps in the stair.

    Ramps are attributed with an estimate of their width, also falling into three categories: less than 50cm, 50cm to 100cm and more than 100cm.

    In order to preserve all additional attributes of the labels, the data is published in the CVAT XML format for images.
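
    A minimal parsing sketch for the CVAT XML follows; the filename is hypothetical, and the element names (image, box, attribute) and label strings follow the usual CVAT 1.1 "XML for images" layout, so verify them against the published file.

    import xml.etree.ElementTree as ET

    tree = ET.parse("annotations.xml")            # hypothetical filename
    for image in tree.getroot().iter("image"):
        for box in image.iter("box"):
            label = box.get("label")              # e.g. 'step', 'stair', 'ramp' (assumed strings)
            coords = tuple(float(box.get(k)) for k in ("xtl", "ytl", "xbr", "ybr"))
            attrs = {a.get("name"): a.text for a in box.iter("attribute")}
            print(image.get("name"), label, coords, attrs)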

    Annotating Process
    The labelling has been done using bounding boxes around the objects. This format is compatible with many popular object detection models, e.g. the YOLO object model. A bounding box is placed so that it contains exactly the visible part of the respective object. This implies that only objects that are visible in the photo are annotated. In particular, a photo of a stair or step taken from above, where the object itself cannot be seen, has not been annotated, even when a human viewer could infer that there is a stair or a step from other features in the photo.

    Steps
    A step is annotated when there is a vertical increment that functions as a passage between two surface areas intended for human or vehicle traffic. This means that we have not included:

    • Increments that are too high to reasonably be considered a passage.
    • Increments that do not lead to a surface intended for human or vehicle traffic, e.g. a 'step' in front of a wall or a curb in front of a bush.

    In particular, the bounding box of a step object contains exactly the incremental part of the step, but does not extend into the top or bottom horizontal surface any more than necessary to entirely enclose the incremental part. This has been chosen for consistency reasons, as including parts of the horizontal surfaces would imply a non-trivial choice of how much to include, which we deemed would most likely lead to more inconsistent annotations.

    The heights of the steps are estimated by the annotators and are therefore not guaranteed to be accurate.

    The type of the steps typically falls into the category 'doorstep' or 'curb'. Steps that are in a doorway, entrance or likewise are attributed as doorsteps. We also include in this category steps that immediately lead to a doorway within a proximity of 1-2 m. Steps between different types of pathways, e.g. between streets and sidewalks, are annotated as curbs. Any other type of step is annotated with 'other'. Many of the 'other' steps are, for example, steps to terraces.

    Stairs
    The stair label is used whenever two or more steps directly follow each other in a consistent pattern. All vertical increments are enclosed in the bounding box, as well as the intermediate surfaces of the steps. However, the top and bottom surfaces are not included more than necessary, for the same reason as for steps, as described in the previous section.

    The annotator counts the number of steps and attributes this count to the stair object label.

    Ramps
    Ramps have been annotated when a sloped passage way has been placed or built to connect two surface areas intended for human or vehicle traffic. This implies the same considerations as with steps. Likewise, only the sloped part of a ramp is annotated, not including the bottom or top surface area.

    For each ramp, the annotator makes an assessment of the width of the ramp in three categories: less than 50cm, 50cm to 100cm and more than 100cm. This parameter is visually hard to assess, and sometimes impossible due to the view of the ramp.

    Grab Bars
    Grab bars are annotated for hand rails and similar objects that are in direct connection to a stair or a ramp. While horizontal grab bars could also have been included, this was omitted due to the implied ambiguities of fences and similar objects. As the grab bar was originally intended as attribute information for stairs and ramps, we chose to keep this focus. The bounding box encloses the part of the grab bar that functions as a hand rail for the stair or ramp.

    Usage
    As is often the case when annotating data, much information depends on the subjective assessment of the annotator. As each data point in this dataset has been annotated only by one person, caution should be taken if the data is applied.

    Generally speaking, the mindset and usage guiding the annotations have been wheelchair accessibility. While we have strived to annotate at an object level, hopefully making the data more widely applicable than this, we state this explicitly as it may have swayed non-trivial annotation choices.

    The attribute data, such as step height or ramp width, are highly subjective estimations. We still provide these data to give a post-hoc method for adjusting which annotations to use. For example, for some purposes one may be interested in detecting only steps that are indeed more than 3 cm high; the attribute data make it possible to filter out the steps of less than 3 cm, so that a machine learning algorithm can be trained on a dataset more appropriate for that use case. We stress, however, that one cannot expect to train accurate machine learning algorithms to infer the attribute data, as these are not accurate in the first place.
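
    Continuing the parsing sketch above, a hedged example of such post-hoc filtering; the attribute name 'height' and the category string 'less than 3cm' are assumptions and should be matched against the values actually used in the published file.

    import xml.etree.ElementTree as ET

    tree = ET.parse("annotations.xml")            # hypothetical filename
    tall_steps = [
        (image.get("name"), box)
        for image in tree.getroot().iter("image")
        for box in image.iter("box")
        if box.get("label") == "step"
        and any(a.get("name") == "height" and a.text != "less than 3cm"
                for a in box.iter("attribute"))
    ]
    print(f"{len(tall_steps)} step annotations above 3 cm")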

    We hope this dataset will be a useful building block in the endeavours for automating barrier detection and documentation.

  8. Data from: Quetzal: Comprehensive Peptide Fragmentation Annotation and Visualization

    • acs.figshare.com
    xlsx
    Updated Mar 20, 2025
    Cite
    Eric W. Deutsch; Luis Mendoza; Robert L. Moritz (2025). Quetzal: Comprehensive Peptide Fragmentation Annotation and Visualization [Dataset]. http://doi.org/10.1021/acs.jproteome.5c00092.s002
    Explore at:
    xlsx
    Dataset updated
    Mar 20, 2025
    Dataset provided by
    ACS Publications
    Authors
    Eric W. Deutsch; Luis Mendoza; Robert L. Moritz
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Proteomics data-dependent acquisition data sets collected with high-resolution mass-spectrometry (MS) can achieve very high-quality results, but nearly every analysis yields results that are thresholded at some accepted false discovery rate, meaning that a substantial number of results are incorrect. For study conclusions that rely on a small number of peptide-spectrum matches being correct, it is thus important to examine at least some crucial spectra to ensure that they are not one of the incorrect identifications. We present Quetzal, a peptide fragment ion spectrum annotation tool to assist researchers in annotating and examining such spectra to ensure that they correctly support study conclusions. We describe how Quetzal annotates spectra using the new Human Proteome Organization (HUPO) Proteomics Standards Initiative (PSI) mzPAF standard for fragment ion peak annotation, including the Python-based code, a web-service end point that provides annotation services, and a web-based application for annotating spectra and producing publication-quality figures. We illustrate its functionality with several annotated spectra of varying complexity. Quetzal provides easily accessible functionality that can assist in the effort to ensure and demonstrate that crucial spectra support study conclusions. Quetzal is publicly available at https://proteomecentral.proteomexchange.org/quetzal/.
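
    As an illustration only, a toy parser for a simplified subset of mzPAF-style peak annotation strings; the assumed shapes 'b2/1.2ppm' and 'y5-H2O/-0.8ppm' are simplifications, and the actual mzPAF grammar is much richer, so consult the PSI specification before relying on this.

    import re

    # Toy pattern for a simplified subset of mzPAF-style annotations; the real
    # grammar (isotopes, charge states, named losses, etc.) is far richer.
    PATTERN = re.compile(
        r"^(?P<series>[abcxyz])(?P<ordinal>\d+)"
        r"(?P<loss>-[A-Za-z0-9]+)?"
        r"(?:/(?P<error>-?\d+(?:\.\d+)?)(?P<unit>ppm)?)?$"
    )

    def parse_annotation(text):
        match = PATTERN.match(text)
        return match.groupdict() if match else None

    print(parse_annotation("b2/1.2ppm"))
    print(parse_annotation("y5-H2O/-0.8ppm"))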

  9. PPG Heart Beat for Cognitive Fatigue Prediction

    • kaggle.com
    zip
    Updated Nov 10, 2018
    Cite
    Canaria (2018). PPG Heart Beat for Cognitive Fatigue Prediction [Dataset]. https://www.kaggle.com/canaria/5-gamers
    Explore at:
    zip (176292669 bytes)
    Dataset updated
    Nov 10, 2018
    Dataset authored and provided by
    Canaria
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Goal 🏆

    Predict "too tired to work" events from PPG data continuously monitored throughout a worker's shift

    Predicting & preventing fatigue-related accidents 😓

    Workers' cognitive fatigue causes two-thirds of accidents in the mining industry, and there are similar dangers in road and air transportation and in any shift-work environment where people work 12-hour shifts (e.g. hospitals).

    The gold standard for neuroscientists studying fatigue is the electroencephalogram (EEG), but it is hard to get good EEG measurements outside of a clinical setting. There has recently been more research using electrocardiogram (ECG or EKG) measurements, particularly using Heart Rate Variability (HRV) as a factor. Good ECG measurements are also not practical on people who are moving around in their working environment.

    Photoplethysmograph (PPG) measurement on the ear is another technique to measure HRV and this can be packaged in a way that is accurate, comfortable and unobtrusive for people to wear all the way through their shift and on their drive home afterwards.

    5-Gamer trial data 🎮

    There were 5 participants, each attempting a 22-hour "shift" of computer gaming. For each participant there is:

    Analysis 📈

    Factors to derive could include:

    • every heartbeat (traditionally ECG "R-peak" but fastest-changing edge of the PPG curve may be a more accurate proxy)
    • heart rate (HR) and heart rate variability (HRV)
    • respiratory rate (RR)

    All in order to estimate:

    • Time to "too tired to work" epoch. This is the point at which you will lose the battle against falling asleep.

    Consider:

    • It's a "mean time to failure" problem
    • "Cognitive fatigue" is not the same as the sleepiness you feel every day at bedtime
    • Cleaning and annotating the data
    • Dealing with gaps in the data (eg earpiece taken off)
    • Dealing with noise in the data
    • Factors to control or monitor better with subjects during future trials
    • Possible sources of more 3rd-party fatigue data, for example Sleep Centres

    Possible approach

    For each gamer:

    1. Clean the PPG .CSV data files -> new raw PPG time-series file(s) (A)
    2. Clean/normalise the annotations in the .CSV fatigue diary? -> new annotations file (B)
    3. Heartbeat peak detection to obtain nanosecond timestamps for every heartbeat in the PPG data (A) -> new heartbeat timestamps file (C)
    4. Calculate heart rate and heart rate variability from the heartbeat timestamps (C) -> new time-series file (D) (a peak-detection and HR/HRV sketch follows this list)
    5. Explore trends and anomalies in the HR & HRV data (D). -> append new annotations to gamer’s annotation file (B)
    6. Document all we learn about baselines, trends, anomalies and correlations with the annotations and any ideas for further factors that move us towards a mean-time-to-failure prediction of their "too tired to work" epoch.
    7. -> new time-series file of experimental attempts of time-to-"too tired to work"-epoch (F)

    8. Breath (respiration) detection to obtain nanosecond timestamps for every breath from the PPG data (A) and/or the heartbeat timestamps (C) -> new breath timestamps file (E)

    9. Calculate respiration rate from (E) -> append Respiratory Rate column to (D)

    10. Revisit 5-7 with extra respiration factor

    11. Eyes closed and/or "nodding dog" detection from .WMV webcam videos -> append new micro-sleep annotations to gamer's annotation file (B)

    12. Revisit 5-7 for correlations in (D) with the new micro-sleep annotations
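
    A minimal sketch for steps 3-4, assuming a cleaned single-channel PPG series stored as CSV with a 'ppg' column sampled at a known rate; the filename, column layout and the 100 Hz rate are assumptions.

    import numpy as np
    from scipy.signal import find_peaks

    fs = 100.0                                    # assumed sampling rate (Hz)
    ppg = np.loadtxt("gamer1_ppg_clean.csv", delimiter=",", skiprows=1, usecols=1)

    # One peak per heartbeat: 'distance' enforces a refractory period
    # (~150 bpm max) and 'prominence' suppresses small ripples.
    peaks, _ = find_peaks(ppg, distance=int(0.4 * fs), prominence=0.5 * np.std(ppg))
    beat_times = peaks / fs                       # seconds from start of recording

    # Inter-beat intervals -> heart rate and a simple HRV measure (RMSSD).
    ibi = np.diff(beat_times)
    hr = 60.0 / ibi                               # instantaneous heart rate (bpm)
    rmssd = np.sqrt(np.mean(np.diff(ibi) ** 2))   # RMSSD in seconds
    print(f"mean HR {hr.mean():.1f} bpm, RMSSD {rmssd * 1000:.1f} ms")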

    For new annotations

    Don’t worry if 7 is rubbish at the moment because there are too few events to make any sense of. Still try!

    For new anomalies you find in 5, I can review webcam video of the gamer at these timestamps to see what they are indications of and then add appropriate "ground truth" annotations to (B) with accurate timestamps. Just contact me through Kaggle to do this. Also if anyone knows how to automatically detect "eyes closed" and/or "nodding dog" micro-sleep events from .WMV webcam videos please let me know.

    Bibliography 📚

    A great place to start is this flawed but nevertheless interesting 2013 article from Korea:

  10. Moods and activities in music

    • datacatalogue.ukdataservice.ac.uk
    Updated Aug 16, 2018
    Cite
    Eerola, T, Durham University, UK; Saari, P, Durham University, UK (2018). Moods and activities in music [Dataset]. http://doi.org/10.5255/UKDA-SN-852024
    Explore at:
    Dataset updated
    Aug 16, 2018
    Authors
    Eerola, T, Durham University, UK; Saari, P, Durham University, UK
    Area covered
    Colombia, Trinidad and Tobago, Germany (October 1990-), Canada, Serbia, Albania, Brazil, Tanzania, Ireland, Dominican Republic
    Description

    The data consist of annotations of music in terms of the moods music may express and the activities that music might fit. The data structures are related to different kinds of annotation tasks, which addressed these questions: 1) annotations of 9 activities that fit a wide range of moods related to music, 2) nominations of music tracks that best fit a particular mood, together with annotations of the activities that fit them, and 3) annotations of these nominated tracks in terms of mood and activities. Users are anonymised, but their background information (gender, music preferences, age, etc.) is also available. The dataset consists of a relational database whose tables are linked together by means of common ids (tracks, users, activities, moods, genres, expertise, language skill).
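
    A minimal sketch of joining such tables on their shared ids with pandas; the file and column names below are hypothetical, since the actual table layout is defined by the deposited database.

    import pandas as pd

    # Hypothetical exports of the relational tables and their id columns.
    tracks = pd.read_csv("tracks.csv")            # track_id, title, genre_id
    annotations = pd.read_csv("annotations.csv")  # user_id, track_id, mood_id, activity_id
    moods = pd.read_csv("moods.csv")              # mood_id, mood_label

    # Link annotations back to track titles and mood labels via the common ids.
    joined = annotations.merge(tracks, on="track_id").merge(moods, on="mood_id")
    print(joined.groupby("mood_label")["track_id"].nunique().head())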

    Current approaches to the tagging of music in online databases predominantly rely on music genre and artist name, with music tags often being ambiguous and inexact. Yet possibly the most salient feature of musical experiences is emotion. The few attempts so far undertaken to tag music for mood or emotion lack a scientific foundation in emotion research. The current project proposes to incorporate recent research on music-evoked emotion into the growing number of online musical databases and catalogues, notably the Geneva Emotional Music Scale (GEMS) - a rating measure for describing emotional effects of music recently developed by our group. Specifically, the aim here is to develop the GEMS into an innovative conceptual and technical tool for the tagging of online musical content for emotion. To this end, three studies are proposed. In Study 1, we will examine whether the GEMS labels and their grouping hold up against a much wider range of musical genres than those that were originally used for its development. In Study 2, we will use advanced data reduction techniques to select the most recurrent and important labels for describing music-evoked emotion. In Study 3, we will examine the added benefit of the new GEMS compared to conventional approaches to the tagging of music. The anticipated impact of the findings is threefold. First, the research described next will advance our understanding of the nature and structure of emotions evoked by music. Developing a valid model of music-evoked emotion is crucial for meaningful research in the social sciences and in the neurosciences. Second, music information organization and retrieval can benefit from a scientifically sound and parsimonious taxonomy for describing the emotional effects of music. Thus, searches of relevant online music databases need no longer be confined to genre or artist, but can also incorporate emotion as a key experiential dimension of music. Third, a valid tagging scheme for emotion can assist both researchers and professionals in the choice of music to induce specific emotions. For example, psychologists, behavioural economists, and neuroscientists often need to induce emotion in their experiments to understand how behaviour or performance is modulated by emotion. Music is an obvious choice for emotion induction in controlled settings because it is a universal language that lends itself to comparisons across cultures and because it is ethically unproblematic.

  11. Data from: An Annotation Protocol for Diachronic Evaluation of Semantic Drift in Disability Sources

    • figshare.com
    csv
    Updated May 30, 2025
    Cite
    Nitisha Jain; Albert Merono Penuela (2025). An Annotation Protocol for Diachronic Evaluation of Semantic Drift in Disability Sources [Dataset]. http://doi.org/10.6084/m9.figshare.29198132.v1
    Explore at:
    csv
    Dataset updated
    May 30, 2025
    Dataset provided by
    figshare
    Authors
    Nitisha Jain; Albert Merono Penuela
    License

    CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Annotating terms referring to aspects of disability in historical texts is crucial for understanding how societies in different periods conceptualized and treated disability. Such annotations help modern readers grasp the evolving language, cultural attitudes, and social structures surrounding disability, shedding light on both marginalization and inclusion throughout history. This is important as evolving societal attitudes can influence the perpetuation of harmful language that reinforces stereotypes and discrimination. However, this task presents significant challenges. Terminology often reflects outdated, offensive, or ambiguous concepts that require sensitive interpretation. The meaning of terms may have shifted over time, making it difficult to align historical terms with contemporary understandings of disability. Additionally, contextual nuances and the lack of standardized language in historical records demand careful scholarly judgment to avoid anachronism or misrepresentation.

    In this paper we introduce an annotation protocol for analysing and describing semantic shifts in the discourse on disabilities in historical texts, reporting on how our protocol's design evolved to address these specific challenges and on issues around annotators' agreement. For designing the annotation protocol for measuring the semantic change in the disability domain, we selected texts for annotation from Gale's History of Disabilities: Disabilities in Society, Seventeenth to Twentieth Century (https://www.gale.com/intl/c/history-of-disabilities-disabilities-in-society-seventeenth-to-twentieth-century).

    Nitisha Jain, Chiara Di Bonaventura, Albert Meroño-Peñuela, Barbara McGillivray. 2025. An Annotation Protocol for Diachronic Evaluation of Semantic Drift in Disability Sources. In Proceedings of the 19th Linguistic Annotation Workshop (LAW 2025). Association for Computational Linguistics.

  12. Disco-Annotation

    • data.niaid.nih.gov
    Updated Oct 6, 2020
    Cite
    Popescu-Bellis, Andrei; Meyer, Thomas; Liyanapathirana, Jeevanthi; Cartoni, Bruno; Zufferey, Sandrine; Hajlaoui, Najeh (2020). Disco-Annotation [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4061389
    Explore at:
    Dataset updated
    Oct 6, 2020
    Dataset provided by
    Idiap Research Institute
    Université Catholique de Louvain
    University of Geneva
    Authors
    Popescu-Bellis, Andrei; Meyer, Thomas; Liyanapathirana, Jeevanthi; Cartoni, Bruno; Zufferey, Sandrine; Hajlaoui, Najeh
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Disco-Annotation is a collection of training and test sets with manually annotated discourse relations for 8 discourse connectives in Europarl texts.

    The 8 connectives with their annotated relations are:

    although (contrast|concession)

    as (prep|causal|temporal|comparison|concession)

    however (contrast|concession)

    meanwhile (contrast|temporal)

    since (causal|temporal|temporal-causal)

    though (contrast|concession)

    while (contrast|concession|temporal|temporal-contrast|temporal-causal)

    yet (adv|contrast|concession)

For each connective there is a training set and a test set. The relations were annotated by two trained annotators using a translation-spotting method. The division into training and test sets also allows for comparison if you train your own models.

    If you need software for the latter, have a look at: https://github.com/idiap/DiscoConn-Classifier
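As a minimal sketch of the label inventory above (the dictionary below simply transcribes the connective-relation pairs listed in this description; no assumptions are made about the file layout of the training and test sets), the admissible relations per connective can be encoded and used to sanity-check annotations:

```python
# Sketch: the connective -> admissible relations inventory from the description above,
# usable to validate (connective, relation) pairs read from the annotated sets.
RELATIONS = {
    "although": {"contrast", "concession"},
    "as": {"prep", "causal", "temporal", "comparison", "concession"},
    "however": {"contrast", "concession"},
    "meanwhile": {"contrast", "temporal"},
    "since": {"causal", "temporal", "temporal-causal"},
    "though": {"contrast", "concession"},
    "while": {"contrast", "concession", "temporal", "temporal-contrast", "temporal-causal"},
    "yet": {"adv", "contrast", "concession"},
}

def is_valid(connective: str, relation: str) -> bool:
    """Return True if the relation belongs to the inventory of the given connective."""
    return relation in RELATIONS.get(connective.lower(), set())

if __name__ == "__main__":
    print(is_valid("since", "temporal"))        # True
    print(is_valid("meanwhile", "concession"))  # False: not part of the inventory
```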

    Citation

    Please cite the following papers if you make use of these datasets (and to know more about the annotation method):

@INPROCEEDINGS{Popescu-Belis-LREC-2012,
  author    = {Popescu-Belis, Andrei and Meyer, Thomas and Liyanapathirana, Jeevanthi and Cartoni, Bruno and Zufferey, Sandrine},
  title     = {{D}iscourse-level {A}nnotation over {E}uroparl for {M}achine {T}ranslation: {C}onnectives and {P}ronouns},
  booktitle = {Proceedings of the eighth international conference on Language Resources and Evaluation ({LREC})},
  year      = {2012},
  address   = {Istanbul, Turkey}
}

@ARTICLE{Cartoni-DD-2013,
  author  = {Cartoni, Bruno and Zufferey, Sandrine and Meyer, Thomas},
  title   = {{Annotating the meaning of discourse connectives by looking at their translation: The translation-spotting technique}},
  journal = {Dialogue & Discourse},
  volume  = {4},
  number  = {2},
  pages   = {65--86},
  year    = {2013}
}

@ARTICLE{Meyer-TSLP-submitted,
  author  = {Meyer, Thomas and Hajlaoui, Najeh and Popescu-Belis, Andrei},
  title   = {{Disambiguating Discourse Connectives for Statistical Machine Translation in Several Languages}},
  journal = {IEEE/ACM Transactions on Audio, Speech, and Language Processing},
  year    = {submitted}
}

  13. o

    Digitised comparative word list of Malay, Nias, Toba-Batak, and Enggano in...

    • ora.ox.ac.uk
    • portal.sds.ox.ac.uk
    zip
    Updated Feb 2, 2025
    + more versions
    Cite
    Rajeg, GPW (2025). Digitised comparative word list of Malay, Nias, Toba-Batak, and Enggano in Modigliani’s ā€œL’isola Delle Donneā€ from 1894 [Dataset]. http://doi.org/10.25446/oxford.28330022.v1
    Explore at:
zip(281039)
Available download formats
    Dataset updated
    Feb 2, 2025
    Dataset provided by
    University of Oxford's Sustainable Digital Scholarship (SDS)
    Authors
    Rajeg, GPW
    License

Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Area covered
    Enggano Island
    Description

The first release (v1.0.0) of the digitised, computer-readable Enggano word list in Modigliani (1894), with comparisons from Nias, Toba-Batak, and Malay. The Enggano word list is included in the EnoLEX database. The data-source directory contains the original data in an .xlsx file that the author hand-digitised from the original source (Modigliani 1894). The light annotation included reflects the content of the original source and covers several aspects. First, string components printed in italics in the original source are annotated; the marking is indicated by an XML tag so it can be traced computationally. Second, there is also annotation concerning remark (

  14. Synthetic Chess Board Images

    • kaggle.com
    zip
    Updated Feb 13, 2022
    Cite
    TheFamousRat (2022). Synthetic Chess Board Images [Dataset]. https://www.kaggle.com/datasets/thefamousrat/synthetic-chess-board-images
    Explore at:
zip(457498797 bytes)
Available download formats
    Dataset updated
    Feb 13, 2022
    Authors
    TheFamousRat
    License

CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    Data collection is perhaps the most crucial part of any machine learning model: without it being done properly, not enough information is present for the model to learn from the patterns leading to one output or another. Data collection is however a very complex endeavor, time-consuming due to the volume of data that needs to be acquired and annotated. Annotation is an especially problematic step, due to its difficulty, length, and vulnerability to human error and inaccuracies when annotating complex data.

    With high processing power becoming ever more accessible, synthetic dataset generation is becoming a viable option when looking to generate large volumes of accurately annotated data. With the help of photorealistic renderers, it is for example possible now to generate immense amounts of data, annotated with pixel-perfect precision and whose content is virtually indistinguishable from real-world pictures.

As an exercise in synthetic dataset generation, the data offered here was generated using the Python API of Blender, with the images rendered through the Cycles renderer. It consists of plausible pictures of chess boards and pieces. The goal is, from those pictures and their annotations, to build a model capable of recognizing the pieces as well as their positions on the board.

    Content

The dataset contains a large number of synthetic, randomly generated images of chess boards and pieces, taken at an angle overlooking the board. Each image is associated with a .json file containing its annotations. The naming convention is that each render is associated with a number X, and that the image and annotations associated with that render are respectively named X.jpg and X.json.

    The data has been generated using the Python scripts and .blend file present in this repository. The chess board and pieces models that have been used for those renders are not provided with the code.

    Data characteristics :

    • Images: 1280x1280 JPEG images representing pictures of chess game boards.
    • Annotations: JSON files containing two variables:
      • "config", a dictionary associating a cell to the type of piece it contains. If a cell is not present in the keys, it means that it is empty.
      • "corners", a 4x2 list which contains the coordinates, in the image, of the board corners. Those corner coordinates are normalized to the [0;1] range.
    • config.json: A JSON file generated before rendering, which contains variables relative to the constant properties of the boards in the renders:
      • "cellsCoordinates", a dictionary associating a cell name to its coordinates on the board. We have for example
      • "piecesTypes", a list of strings containing the types of pieces present in the renders.

No distinction between training, validation, and testing data has been built in; the split is left completely up to the users. A proposed pipeline for the extraction, recognition, and placement of chess pieces is provided in a notebook added with this dataset.
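As an illustrative sketch (not part of the dataset itself), one render and its annotation can be read following the X.jpg / X.json convention and the "config" / "corners" fields described above; the directory name below is a placeholder:

```python
# Sketch: load one render and its annotation, following the X.jpg / X.json convention.
# "renders" is a placeholder directory name; "config" and "corners" are the fields
# described in the dataset content section above.
import json
from pathlib import Path

from PIL import Image  # pip install pillow

def load_render(root: str, index: int):
    root = Path(root)
    image = Image.open(root / f"{index}.jpg")             # 1280x1280 JPEG render
    annotation = json.loads((root / f"{index}.json").read_text())
    config = annotation["config"]    # cell name -> piece type; missing cells are empty
    corners = annotation["corners"]  # 4x2 list of board-corner coordinates in [0;1]
    return image, config, corners

if __name__ == "__main__":
    image, config, corners = load_render("renders", 0)
    print(f"{len(config)} occupied cells, corners: {corners}")
```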

    Acknowledgements

    I would like to express my gratitude for the efforts of the Blender Foundation and all its participants, for their incredible open-source tool which once again has allowed me to conduct interesting projects with great ease.

    Inspiration

Two interesting papers on the generation and use of synthetic data, which have inspired me to conduct this project:

    Erroll Wood, Tadas BaltruŔaitis, Charlie Hewitt (2021). Fake It Till You Make It: Face analysis in the wild using synthetic data alone. https://arxiv.org/abs/2109.15102

    Salehe Erfanian Ebadi, You-Cyuan Jhang, Alex Zook (2021). PeopleSansPeople: A Synthetic Data Generator for Human-Centric Computer Vision. https://arxiv.org/abs/2112.09290

  15. e

    Snow Height Classification Dataset

    • envidat.ch
not available, zip
Available download formats
    Updated Jun 1, 2025
    Cite
    Jan Svoboda; Marc Ruesch; David Liechti; Corinne Jones; Michael Zehnder; Michele Volpi; Jürg Schweizer (2025). Snow Height Classification Dataset [Dataset]. http://doi.org/10.16904/envidat.512
    Explore at:
not available, zip
Available download formats
    Dataset updated
    Jun 1, 2025
    Dataset provided by
    Swiss Data Science Center
    WSL Institute for Snow and Avalanche Research SLF
    Authors
    Jan Svoboda; Marc Ruesch; David Liechti; Corinne Jones; Michael Zehnder; Michele Volpi; Jürg Schweizer
    Time period covered
    Mar 31, 2023 - Jun 30, 2023
    Area covered
    Switzerland
    Dataset funded by
    WSL Institute for Snow and Avalanche Research SLF
    Swiss Data Science Center
    Description

The Snow Height Classification dataset provides manually annotated snow height data that can be used for the development and evaluation of automatic snow height classification approaches. A subset of 20 IMIS stations, which span different locations and elevations and vary in underlying surface (e.g., vegetation, bare ground, glacier), was selected and manually annotated with binary two-class ground truth information for the snow height data:

    • Class 0 - Snow - the surface is covered by snow
    • Class 1 - No Snow - the surface is snow-free (e.g., vegetation, soil, rocks)

The data was annotated with the help of domain experts. It should be noted that annotating historical data is problematic, as there is no way of checking whether there really was snow at the station or not. This means that assessing the presence of snow with the help of information from other sensors should be considered a best-effort approach. Besides the annotated snow height data, the dataset also contains additional data relevant for reproducing the results in the following repository: https://gitlabext.wsl.ch/jan.svoboda/snow-height-classification

Any further use of the data has to comply with the CC BY-NC license: https://creativecommons.org/licenses/by-nc/4.0/

  16. D

    Replication Data for: Die Entwicklung des Definitartikels im...

    • dataverse.no
    • dataverse.azure.uit.no
    • +2more
    bin, csv, pdf, txt +1
    Updated Sep 28, 2023
    Cite
    Johanna Flick; Johanna Flick (2023). Replication Data for: Die Entwicklung des Definitartikels im Althochdeutschen. Eine kognitiv-linguistische Korpusuntersuchung [Dataset]. http://doi.org/10.18710/HZKYL4
    Explore at:
type/x-r-syntax(9181), type/x-r-syntax(2167), csv(17865610), type/x-r-syntax(17690), csv(10109732), csv(6618188), type/x-r-syntax(224), csv(11987224), type/x-r-syntax(12677), csv(12677192), type/x-r-syntax(3554), csv(7423674), bin(31249), type/x-r-syntax(900), pdf(108283), csv(4594937), txt(4459), csv(2264708), csv(1051840), csv(1242602), type/x-r-syntax(311), type/x-r-syntax(17412), type/x-r-syntax(2548), type/x-r-syntax(2084), type/x-r-syntax(7109), csv(953507), csv(7818231), csv(783216), csv(10158246), csv(7649791), type/x-r-syntax(2637)
Available download formats
    Dataset updated
    Sep 28, 2023
    Dataset provided by
    DataverseNO
    Authors
    Johanna Flick; Johanna Flick
    License

https://dataverse.no/api/datasets/:persistentId/versions/1.1/customlicense?persistentId=doi:10.18710/HZKYL4

    Area covered
    St. Gallen, Switzerland, Bavaria, Germany, Franconia, Germany
    Description

This data set is the appendix for Flick (2020): Die Entwicklung des Definitartikels im Althochdeutschen. Eine kognitiv-linguistische Korpusuntersuchung (Empirically oriented theoretical morphology and syntax 6). Berlin: Language Science Press.

English abstract (the publication is in German): The German definite article originated from the adnominally used demonstrative determiner dĆ«r (ā€˜this’). It is known that the categorical shift from demonstrative to definite article took place during the Old High German (OHG) period (750-1050 AD). The genuine demonstrative loses its demonstrative force and appears in contexts in which referents can be identified independently of the speech situation (the so-called semantic definite contexts). In contrast to former investigations, the present study examines this development by means of a broad corpus study. The data consists of the five largest texts of the OHG period, which are accessible via the Old German Reference Corpus. The investigation takes a usage-based and constructional approach to language and shows how animacy interacts with the development of the definite article.

The data set contains: (i) annotation guidelines for the following categories (particularly with regard to Old High German data): animacy, individuation, semantic roles, different types of definiteness, nominal features of the noun phrase; (ii) corpus data which was extracted from a pre-version of the ā€œReferenzkorpus Altdeutsch" (0.1) plus all the annotations that were made during the investigation; (iii) R code that helped transform, annotate, and statistically analyze the data. Please change to tree view to see the folder structure of the data set.

  17. Z

    WormSwin: C. elegans Video Datasets

    • data.niaid.nih.gov
    Updated Jan 31, 2024
    Cite
    Deserno, Maurice; Bozek, Katarzyna (2024). WormSwin: C. elegans Video Datasets [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7456802
    Explore at:
    Dataset updated
    Jan 31, 2024
    Dataset provided by
    University of Cologne
    Authors
    Deserno, Maurice; Bozek, Katarzyna
    License

Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

Data used for our paper "WormSwin: Instance Segmentation of C. elegans using Vision Transformer". This publication is divided into three parts:

    CSB-1 Dataset

    Synthetic Images Dataset

    MD Dataset

    The CSB-1 Dataset consists of frames extracted from videos of Caenorhabditis elegans (C. elegans) annotated with binary masks. Each C. elegans is separately annotated, providing accurate annotations even for overlapping instances. All annotations are provided in binary mask format and as COCO Annotation JSON files (see COCO website).

    The videos are named after the following pattern:

    <"worm age in hours"_"mutation"_"irradiated (binary)"_"video index (zero based)">

    For mutation the following values are possible:

    wild type

    csb-1 mutant

    csb-1 with rescue mutation

An example video name is 24_1_1_2, denoting 24-hour-old C. elegans carrying the csb-1 mutation that were irradiated (video index 2).

Video data was provided by M. Rieckher; instance segmentation annotations were created under the supervision of K. Bozek and M. Deserno. The Synthetic Images Dataset was created by cutting out C. elegans (foreground objects) from the CSB-1 Dataset and placing them randomly on background images also taken from the CSB-1 Dataset. Foreground objects were flipped, rotated, and slightly blurred before being placed on the background images. The same was done with the binary mask annotations taken from the CSB-1 Dataset so that they match the foreground objects in the synthetic images. Additionally, we added rings of random color, size, thickness, and position to the background images to simulate petri-dish edges.

This synthetic dataset was generated by M. Deserno. The Mating Dataset (MD) consists of 450 grayscale image patches of 1,012 x 1,012 px showing C. elegans with high overlap, crawling on a petri dish. We took the patches from a 10 min long video of size 3,036 x 3,036 px. The video was downsampled from 25 fps to 5 fps before selecting 50 random frames for annotation and patching. Like the other datasets, worms were annotated with binary masks, and annotations are provided as COCO Annotation JSON files.

    The video data was provided by X.-L. Chu; Instance Segmentation Annotations were created under supervision of K. Bozek and M. Deserno.

    Further details about the datasets can be found in our paper.
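Since the instance masks are also distributed as COCO Annotation JSON files, a minimal sketch for inspecting them could rely on pycocotools; the annotation file name below is a placeholder for the JSON shipped with whichever part of the dataset is used:

```python
# Sketch: inspect COCO-format instance annotations with pycocotools.
# "annotations.json" is a placeholder for the COCO JSON file of the chosen dataset part.
from pycocotools.coco import COCO  # pip install pycocotools

coco = COCO("annotations.json")

image_ids = coco.getImgIds()
print(f"{len(image_ids)} images, {len(coco.getAnnIds())} instance annotations")

# Decode the binary masks of all C. elegans instances in the first image.
first_image = coco.loadImgs(image_ids[0])[0]
annotations = coco.loadAnns(coco.getAnnIds(imgIds=first_image["id"]))
masks = [coco.annToMask(ann) for ann in annotations]  # one HxW binary array per worm
print(f"{first_image['file_name']}: {len(masks)} annotated worms")
```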

  18. Data from: MetaHarm: Harmful YouTube Video Dataset Annotated by Domain...

    • zenodo.org
    • data.niaid.nih.gov
    Updated Jun 12, 2025
    Cite
    Wonjeong Jo; Wonjeong Jo; Magdalena Wojcieszak; Magdalena Wojcieszak (2025). MetaHarm: Harmful YouTube Video Dataset Annotated by Domain Experts, GPT-4-Turbo, and Crowdworkers [Dataset]. http://doi.org/10.5281/zenodo.14647452
    Explore at:
    Dataset updated
    Jun 12, 2025
    Dataset provided by
Zenodo (http://zenodo.org/)
    Authors
    Wonjeong Jo; Wonjeong Jo; Magdalena Wojcieszak; Magdalena Wojcieszak
    License

Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    YouTube
    Description

    We provide text metadata, image frames, and thumbnails of YouTube videos classified as harmful or harmless by domain experts, GPT-4-Turbo, and crowdworkers. Harmful videos are categorized into one or more of six harm categories: Information harms (IH), Hate and Harassment harms (HH), Clickbait harms (CB), Addictive harms (ADD), Sexual harms (SXL), and Physical harms (PH).

    This repository includes the text metadata and a link to external cloud storage for the image data.

    Text Metadata

    Folder | Subfolder | #Videos
    Ground Truth | Harmful_full_agreement (classified as harmful by all three actors) | 5,109
    Ground Truth | Harmful_subset_agreement (classified as harmful by more than two actors) | 14,019
    Domain Experts | Harmful | 15,115
    Domain Experts | Harmless | 3,303
    GPT-4-Turbo | Harmful | 10,495
    GPT-4-Turbo | Harmless | 7,818
    Crowdworkers (workers from Amazon Mechanical Turk) | Harmful | 12,668
    Crowdworkers (workers from Amazon Mechanical Turk) | Harmless | 4,390
    Unannotated large pool | - | 60,906

    Note. The term "actor" refers to the annotating entities: domain experts, GPT-4-Turbo, and crowdworkers.

    Explanations about the indicators

    1. Ground truth - harmful_full_agreement & harmful_subset_agreement
    - links
    - video_id
    - channel
    - description
    - transcript
    - date
    - maj_harmcat: In the full_agreement version, this represents a harm category identified by all three actors. In the subset_agreement version, it represents a harm category classified by more than two actors.
    - all_harmcat: This includes all harm categories classified by any of the actors without requiring agreement. It captures all classified categories.
    2. Domain Experts, GPT-4-Turbo, Crowdworkers
    - links
    - video_id
    - channel
    - description
    - transcript
    - date
    - harmcat
    3. Unannotated large pool
    - links
    - video_id
    - channel
    - description
    - transcript
    - date
    Note. Some data from the external dataset does not include date information. In such cases, the date was marked as 1990-01-01.
    We retrieved transcripts using the YouTubeTranscriptApi. If a video does not have any text data in the transcript section, it means the API failed to retrieve the transcript, possibly because the video does not contain any detectable language.
    Some image frames are also available in the pickle file.

    Image data

    The image frames and thumbnails are available at this link: https://ucdavis.app.box.com/folder/302772803692?s=d23b20snl1slwkuh4pgvjs31m7r1xae2
    1. Image frames (imageframes_1-20.zip): Image frames are organized into 20 zip folders due to the large size of the image frames. Each zip folder contains subfolders named after the unique video IDs of the annotated videos. Inside each subfolder, there are 15 sequentially numbered image frames (from 0 to 14) extracted from the corresponding video. The image frame folders do not distinguish between videos classified as harmful or non-harmful.
    2. Thumbnails (Thumbnails.zip): The zip folder contains thumbnails from the individual videos used in classification. Each thumbnail is named using the unique video ID. This folder does not distinguish between videos classified as harmful or harmless

    Related works (in preprint)

    For details about the harm classification taxonomy and the performance comparison between crowdworkers, GPT-4-Turbo, and domain experts, please see https://arxiv.org/abs/2411.05854.

  19. f

    PlotTwist: A web app for plotting and annotating continuous data

    • plos.figshare.com
    • figshare.com
    docx
    Updated Jan 24, 2020
    Cite
    Joachim Goedhart (2020). PlotTwist: A web app for plotting and annotating continuous data [Dataset]. http://doi.org/10.1371/journal.pbio.3000581
    Explore at:
docx
Available download formats
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    PLOS Biology
    Authors
    Joachim Goedhart
    License

Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

Experimental data can broadly be divided into discrete and continuous data. Continuous data are obtained from measurements that are performed as a function of another quantitative variable, e.g., time, length, concentration, or wavelength. The results from these types of experiments are often used to generate plots that visualize the measured variable on a continuous, quantitative scale. To simplify state-of-the-art data visualization and annotation of data from such experiments, an open-source tool was created with R/shiny that does not require coding skills to operate. The freely available web app accepts wide (spreadsheet) and tidy data and offers a range of options to normalize the data. The data from individual objects can be shown in 3 different ways: (1) lines with unique colors, (2) small multiples, and (3) heatmap-style display. Next to this, the mean can be displayed with a 95% confidence interval for the visual comparison of different conditions. Several color-blind-friendly palettes are available to label the data and/or statistics. The plots can be annotated with graphical features and/or text to indicate any perturbations that are relevant. All user-defined settings can be stored for reproducibility of the data visualization. The app is dubbed PlotTwist and runs locally or online: https://huygens.science.uva.nl/PlotTwist

  20. h

    imagenet_safety_annotated

    • huggingface.co
    Updated May 12, 2025
    Cite
    Artificial Intelligence & Machine Learning Lab at TU Darmstadt (2025). imagenet_safety_annotated [Dataset]. https://huggingface.co/datasets/AIML-TUDA/imagenet_safety_annotated
    Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 12, 2025
    Dataset authored and provided by
    Artificial Intelligence & Machine Learning Lab at TU Darmstadt
    Description

This is a safety annotation set for ImageNet. It uses the LlavaGuard-13B model for annotation. The annotations entail a safety category (image-category), an explanation (assessment), and a safety rating (decision). Furthermore, it contains the unique ImageNet id class_sampleId, e.g. n04542943_1754. These annotations allow you to train your model on only safety-aligned data. Plus, you can define yourself what safety-aligned means, e.g. discard all images where decision=="Review Needed" or… See the full description on the dataset page: https://huggingface.co/datasets/AIML-TUDA/imagenet_safety_annotated.
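A hedged sketch of filtering with these annotations (assuming the set loads with the Hugging Face datasets library under the repository id above; the split name is an assumption):

```python
# Sketch: keep only samples whose safety rating does not require review.
# The split name "train" is an assumption; the fields "decision", "class_sampleId",
# and "image-category" are named in the description above.
from datasets import load_dataset  # pip install datasets

annotations = load_dataset("AIML-TUDA/imagenet_safety_annotated", split="train")

safe = annotations.filter(lambda row: row["decision"] != "Review Needed")

print(f"{len(safe)} of {len(annotations)} annotations kept")
print(safe[0]["class_sampleId"], safe[0]["image-category"])
```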
