100+ datasets found
  1. Taxonomies for Semantic Research Data Annotation

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 23, 2024
    Cite
    Göpfert, Christoph; Haas, Jan Ingo; Schröder, Lucas; Gaedke, Martin (2024). Taxonomies for Semantic Research Data Annotation [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7908854
    Dataset updated
    Jul 23, 2024
    Dataset provided by
    Technische Universität Chemnitz
    Authors
    Göpfert, Christoph; Haas, Jan Ingo; Schröder, Lucas; Gaedke, Martin
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains 35 of 39 taxonomies that were the result of a systematic review. The systematic review was conducted with the goal of identifying taxonomies suitable for semantically annotating research data. A special focus was placed on research data from the hybrid societies domain.

    The following taxonomies were identified as part of the systematic review:

    Filename | Taxonomy Title
    acm_ccs | ACM Computing Classification System [1]
    amec | A Taxonomy of Evaluation Towards Standards [2]
    bibo | A BIBO Ontology Extension for Evaluation of Scientific Research Results [3]
    cdt | Cross-Device Taxonomy [4]
    cso | Computer Science Ontology [5]
    ddbm | What Makes a Data-driven Business Model? A Consolidated Taxonomy [6]
    ddi_am | DDI Aggregation Method [7]
    ddi_moc | DDI Mode of Collection [8]
    n/a | DemoVoc [9]
    discretization | Building a New Taxonomy for Data Discretization Techniques [10]
    dp | Demopaedia [11]
    dsg | Data Science Glossary [12]
    ease | A Taxonomy of Evaluation Approaches in Software Engineering [13]
    eco | Evidence & Conclusion Ontology [14]
    edam | EDAM: The Bioscientific Data Analysis Ontology [15]
    n/a | European Language Social Science Thesaurus [16]
    et | Evaluation Thesaurus [17]
    glos_hci | The Glossary of Human Computer Interaction [18]
    n/a | Humanities and Social Science Electronic Thesaurus [19]
    hcio | A Core Ontology on the Human-Computer Interaction Phenomenon [20]
    hft | Human-Factors Taxonomy [21]
    hri | A Taxonomy to Structure and Analyze Human–Robot Interaction [22]
    iim | A Taxonomy of Interaction for Instructional Multimedia [23]
    interrogation | A Taxonomy of Interrogation Methods [24]
    iot | Design Vocabulary for Human–IoT Systems Communication [25]
    kinect | Understanding Movement and Interaction: An Ontology for Kinect-Based 3D Depth Sensors [26]
    maco | Thesaurus Mass Communication [27]
    n/a | Thesaurus Cognitive Psychology of Human Memory [28]
    mixed_initiative | Mixed-Initiative Human-Robot Interaction: Definition, Taxonomy, and Survey [29]
    qos_qoe | A Taxonomy of Quality of Service and Quality of Experience of Multimodal Human-Machine Interaction [30]
    ro | The Research Object Ontology [31]
    senses_sensors | A Human-Centered Taxonomy of Interaction Modalities and Devices [32]
    sipat | A Taxonomy of Spatial Interaction Patterns and Techniques [33]
    social_errors | A Taxonomy of Social Errors in Human-Robot Interaction [34]
    sosa | Semantic Sensor Network Ontology [35]
    swo | The Software Ontology [36]
    tadirah | Taxonomy of Digital Research Activities in the Humanities [37]
    vrs | Virtual Reality and the CAVE: Taxonomy, Interaction Challenges and Research Directions [38]
    xdi | Cross-Device Interaction [39]

    We converted the taxonomies into SKOS (Simple Knowledge Organization System) representation; a minimal sketch of what such a conversion produces follows the list below. The following four taxonomies were not converted, as they were already available in SKOS, and were for this reason excluded from this dataset:

    1) DemoVoc, cf. http://thesaurus.web.ined.fr/navigateur/ available at https://thesaurus.web.ined.fr/exports/demovoc/demovoc.rdf

    2) European Language Social Science Thesaurus, cf. https://thesauri.cessda.eu/elsst/en/ available at https://zenodo.org/record/5506929

    3) Humanities and Social Science Electronic Thesaurus, cf. https://hasset.ukdataservice.ac.uk/hasset/en/ available at https://zenodo.org/record/7568355

    4) Thesaurus Cognitive Psychology of Human Memory, cf. https://www.loterre.fr/presentation/ available at https://skosmos.loterre.fr/P66/en/
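    For readers unfamiliar with SKOS, the sketch below shows roughly what one converted entry can look like, using the rdflib Python library. The namespace, scheme, concept, and labels are illustrative placeholders, not the dataset's actual conversion code or URIs.

```python
# Illustrative placeholders throughout: the namespace, concept, and labels
# are invented, and this is not the dataset's conversion code.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, SKOS

EX = Namespace("http://example.org/taxonomy/")  # placeholder namespace

g = Graph()
g.bind("skos", SKOS)

scheme = EX["acm_ccs"]
g.add((scheme, RDF.type, SKOS.ConceptScheme))
g.add((scheme, SKOS.prefLabel,
       Literal("ACM Computing Classification System", lang="en")))

concept = EX["human_centered_computing"]
g.add((concept, RDF.type, SKOS.Concept))
g.add((concept, SKOS.prefLabel, Literal("Human-centered computing", lang="en")))
g.add((concept, SKOS.inScheme, scheme))
g.add((concept, SKOS.topConceptOf, scheme))

print(g.serialize(format="turtle"))
```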

    References

    [1] “The 2012 ACM Computing Classification System,” ACM Digital Library, 2012. https://dl.acm.org/ccs (accessed May 08, 2023).

    [2] AMEC, “A Taxonomy of Evaluation Towards Standards.” Aug. 31, 2016. Accessed: May 08, 2023. [Online]. Available: https://amecorg.com/amecframework/home/supporting-material/taxonomy/

    [3] B. Dimić Surla, M. Segedinac, and D. Ivanović, “A BIBO ontology extension for evaluation of scientific research results,” in Proceedings of the Fifth Balkan Conference in Informatics, in BCI ’12. New York, NY, USA: Association for Computing Machinery, Sep. 2012, pp. 275–278. doi: 10.1145/2371316.2371376.

    [4] F. Brudy et al., “Cross-Device Taxonomy: Survey, Opportunities and Challenges of Interactions Spanning Across Multiple Devices,” in Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, in CHI ’19. New York, NY, USA: Association for Computing Machinery, May 2019, pp. 1–28. doi: 10.1145/3290605.3300792.

    [5] A. A. Salatino, T. Thanapalasingam, A. Mannocci, F. Osborne, and E. Motta, “The Computer Science Ontology: A Large-Scale Taxonomy of Research Areas,” in Lecture Notes in Computer Science 1137, D. Vrandečić, K. Bontcheva, M. C. Suárez-Figueroa, V. Presutti, I. Celino, M. Sabou, L.-A. Kaffee, and E. Simperl, Eds., Monterey, California, USA: Springer, Oct. 2018, pp. 187–205. Accessed: May 08, 2023. [Online]. Available: http://oro.open.ac.uk/55484/

    [6] M. Dehnert, A. Gleiss, and F. Reiss, “What makes a data-driven business model? A consolidated taxonomy,” presented at the European Conference on Information Systems, 2021.

    [7] DDI Alliance, “DDI Controlled Vocabulary for Aggregation Method,” 2014. https://ddialliance.org/Specification/DDI-CV/AggregationMethod_1.0.html (accessed May 08, 2023).

    [8] DDI Alliance, “DDI Controlled Vocabulary for Mode Of Collection,” 2015. https://ddialliance.org/Specification/DDI-CV/ModeOfCollection_2.0.html (accessed May 08, 2023).

    [9] INED - French Institute for Demographic Studies, “Thésaurus DemoVoc,” Feb. 26, 2020. https://thesaurus.web.ined.fr/navigateur/en/about (accessed May 08, 2023).

    [10] A. A. Bakar, Z. A. Othman, and N. L. M. Shuib, “Building a new taxonomy for data discretization techniques,” in 2009 2nd Conference on Data Mining and Optimization, Oct. 2009, pp. 132–140. doi: 10.1109/DMO.2009.5341896.

    [11] N. Brouard and C. Giudici, “Unified second edition of the Multilingual Demographic Dictionary (Demopaedia.org project),” presented at the 2017 International Population Conference, IUSSP, Oct. 2017. Accessed: May 08, 2023. [Online]. Available: https://iussp.confex.com/iussp/ipc2017/meetingapp.cgi/Paper/5713

    [12] B. DuCharme, “Data Science Glossary.” https://www.datascienceglossary.org/ (accessed May 08, 2023).

    [13] A. Chatzigeorgiou, T. Chaikalis, G. Paschalidou, N. Vesyropoulos, C. K. Georgiadis, and E. Stiakakis, “A Taxonomy of Evaluation Approaches in Software Engineering,” in Proceedings of the 7th Balkan Conference on Informatics Conference, in BCI ’15. New York, NY, USA: Association for Computing Machinery, Sep. 2015, pp. 1–8. doi: 10.1145/2801081.2801084.

    [14] M. C. Chibucos, D. A. Siegele, J. C. Hu, and M. Giglio, “The Evidence and Conclusion Ontology (ECO): Supporting GO Annotations,” in The Gene Ontology Handbook, C. Dessimoz and N. Škunca, Eds., in Methods in Molecular Biology. New York, NY: Springer, 2017, pp. 245–259. doi: 10.1007/978-1-4939-3743-1_18.

    [15] M. Black et al., “EDAM: the bioscientific data analysis ontology,” F1000Research, vol. 11, Jan. 2021, doi: 10.7490/f1000research.1118900.1.

    [16] Council of European Social Science Data Archives (CESSDA), “European Language Social Science Thesaurus ELSST,” 2021. https://thesauri.cessda.eu/en/ (accessed May 08, 2023).

    [17] M. Scriven, Evaluation Thesaurus, 3rd Edition. Edgepress, 1981. Accessed: May 08, 2023. [Online]. Available: https://us.sagepub.com/en-us/nam/evaluation-thesaurus/book3562

    [18] B. Papantoniou et al., The Glossary of Human Computer Interaction. Interaction Design Foundation. Accessed: May 08, 2023. [Online]. Available: https://www.interaction-design.org/literature/book/the-glossary-of-human-computer-interaction

    [19] “UK Data Service Vocabularies: HASSET Thesaurus.” https://hasset.ukdataservice.ac.uk/hasset/en/ (accessed May 08, 2023).

    [20] S. D. Costa, M. P. Barcellos, R. de A. Falbo, T. Conte, and K. M. de Oliveira, “A core ontology on the Human–Computer Interaction phenomenon,” Data Knowl. Eng., vol. 138, p. 101977, Mar. 2022, doi: 10.1016/j.datak.2021.101977.

    [21] V. J. Gawron et al., “Human Factors Taxonomy,” Proc. Hum. Factors Soc. Annu. Meet., vol. 35, no. 18, pp. 1284–1287, Sep. 1991, doi: 10.1177/154193129103501807.

    [22] L. Onnasch and E. Roesler, “A Taxonomy to Structure and Analyze Human–Robot Interaction,” Int. J. Soc. Robot., vol. 13, no. 4, pp. 833–849, Jul. 2021, doi: 10.1007/s12369-020-00666-5.

    [23] R. A. Schwier, “A Taxonomy of Interaction for Instructional Multimedia.” Sep. 28, 1992. Accessed: May 09, 2023. [Online]. Available: https://eric.ed.gov/?id=ED352044

    [24] C. Kelly, J. Miller, A. Redlich, and S. Kleinman, “A Taxonomy of Interrogation Methods,”

  2. PEARC20 submitted paper: "Scientific Data Annotation and Dissemination:...

    • hydroshare.org
    • beta.hydroshare.org
    zip
    Updated Jul 29, 2020
    Cite
    Sean Cleveland; Gwen Jacobs; Jennifer Geis (2020). PEARC20 submitted paper: "Scientific Data Annotation and Dissemination: Using the ‘Ike Wai Gateway to Manage Research Data" [Dataset]. http://doi.org/10.4211/hs.d66ef2686787403698bac5368a29b056
    zip (873 bytes)
    Dataset updated
    Jul 29, 2020
    Dataset provided by
    HydroShare
    Authors
    Sean Cleveland; Gwen Jacobs; Jennifer Geis
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Time period covered
    Jul 29, 2020
    Description

    Abstract: Granting agencies invest millions of dollars in the generation and analysis of data, making these products extremely valuable. However, without sufficient annotation of the methods used to collect and analyze the data, the ability to reproduce and reuse those products suffers. This lack of assurance of the quality and credibility of the data at the different stages in the research process wastes much of the investment of time and funding, and keeps research from reaching the level that would be possible if everything were effectively annotated and disseminated to the wider research community.

    To address this issue for the Hawai’i Established Program to Stimulate Competitive Research (EPSCoR) project, a water science gateway was developed at the University of Hawai‘i (UH), called the ‘Ike Wai Gateway. In Hawaiian, ‘Ike means knowledge and Wai means water. The gateway supports research in hydrology and water management by providing tools to address questions of water sustainability in Hawai‘i. It provides a framework for data acquisition, analysis, model integration, and display of data products, and is intended to complement and integrate with the capabilities of the Consortium of Universities for the Advancement of Hydrologic Science’s (CUAHSI) HydroShare by providing sound data and metadata management capabilities for multi-domain field observations, analytical lab actions, and modeling outputs.

    Functionality provided by the gateway is supported by a subset of CUAHSI’s Observations Data Model (ODM), delivered as centralized web-based user interfaces and APIs supporting multi-domain data management, computation, analysis, and visualization tools to support reproducible science, modeling, data discovery, and decision support for the Hawai’i EPSCoR ‘Ike Wai research team and the wider Hawai‘i hydrology community. By leveraging the Tapis platform, UH has constructed a gateway that ties data and advanced computing resources together to support diverse research domains including microbiology, geochemistry, geophysics, economics, and humanities, coupled with computational and modeling workflows delivered in a user-friendly web interface with workflows for effectively annotating the project data and products. Disseminating results for the ‘Ike Wai project through the ‘Ike Wai data gateway and HydroShare makes the research products accessible and reusable.

  3. Data Annotate Dataset

    • universe.roboflow.com
    zip
    Updated Jul 23, 2025
    Cite
    NecleusAIPublic (2025). Data Annotate Dataset [Dataset]. https://universe.roboflow.com/necleusaipublic/data-annotate-ojqb1
    Dataset updated
    Jul 23, 2025
    Dataset authored and provided by
    NecleusAIPublic
    License

    MIT License, https://opensource.org/licenses/MIT
    License information was derived automatically

    Variables measured
    2 Bounding Boxes
    Description

    Here are a few use cases for this project:

    1. Historical Weapon Classification: This computer vision model can be utilized by historians, archeologists, and museum curators to classify and catalog historical weapons and artifacts, including swords, arrows, guns, and knives, enabling them to better understand and contextualize the weapons' origins and usage throughout history.

    2. Video Game Asset Management: Game developers can use the Data Annotate model to automatically tag and categorize in-game assets, such as weapons and visual effects, to streamline their development process and more easily manage game content.

    3. Prop and Costume Design: The model can aid prop and costume designers in the film, theater, and cosplay industries by identifying and categorizing various weapons and related items, allowing them to find suitable props or inspirations for their designs more quickly.

    4. Law Enforcement and Security: Data Annotate can be used by law enforcement agencies and security personnel to effectively detect weapons in surveillance footage or images, enabling them to respond more quickly to potential threats and uphold public safety.

    5. Educational Applications: Teachers and educators can use the model to develop interactive and engaging learning materials in the fields of history, art, and technology. It can help students identify and understand the significance of various weapons and their roles in shaping human history and culture.

  4. Data from: The Distributed Annotation System

    • catalog.data.gov
    • data.virginia.gov
    • +2more
    Updated Sep 6, 2025
    Cite
    National Institutes of Health (2025). The Distributed Annotation System [Dataset]. https://catalog.data.gov/dataset/the-distributed-annotation-system
    Dataset updated
    Sep 6, 2025
    Dataset provided by
    National Institutes of Health
    Description

    Background: Currently, most genome annotation is curated by centralized groups with limited resources. Efforts to share annotations transparently among multiple groups have not yet been satisfactory.

    Results: Here we introduce a concept called the Distributed Annotation System (DAS). DAS allows sequence annotations to be decentralized among multiple third-party annotators and integrated on an as-needed basis by client-side software. The communication between client and servers in DAS is defined by the DAS XML specification. Annotations are displayed in layers, one per server. Any client or server adhering to the DAS XML specification can participate in the system; we describe a simple prototype client and server example.

    Conclusions: The DAS specification is being used experimentally by Ensembl, WormBase, and the Berkeley Drosophila Genome Project. Continued success will depend on the readiness of the research community to adopt DAS and provide annotations. All components are freely available from the project website.
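    Because DAS is defined as plain HTTP requests that return XML annotation layers, a client can be very small. The following is a hedged sketch of fetching and flattening one server's feature layer; the base URL, segment string, and element/attribute names are assumptions modeled on the DAS feature-request pattern, not a verified endpoint.

```python
# Hedged sketch of a minimal DAS client: fetch the features of a sequence
# segment over HTTP and flatten the XML layer a server returns. The base
# URL and names below are assumptions, not a verified DAS endpoint.
import urllib.request
import xml.etree.ElementTree as ET

BASE = "http://example.org/das/mysource"  # hypothetical DAS source URL

def fetch_features(segment: str) -> list:
    url = f"{BASE}/features?segment={segment}"
    with urllib.request.urlopen(url) as resp:
        root = ET.fromstring(resp.read())
    features = []
    for feat in root.iter("FEATURE"):  # one element per annotation
        type_elem = feat.find("TYPE")
        features.append({
            "id": feat.get("id"),
            "label": feat.get("label"),
            "type": type_elem.text if type_elem is not None else None,
        })
    return features

if __name__ == "__main__":
    for feature in fetch_features("1:1,1000"):
        print(feature)
```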

  5. INCEpTION Text Annotation Platform

    • live.european-language-grid.eu
    Updated Sep 2, 2024
    Cite
    (2024). INCEpTION Text Annotation Platform [Dataset]. https://live.european-language-grid.eu/catalogue/tool-service/23683
    Dataset updated
    Sep 2, 2024
    License

    Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    INCEpTION is an open-source text annotation tool primarily designed to annotate text documents. It supports annotations of words and sentences as well as linking annotations to each other.

    • Collaborative Annotation: Multiple users can work on the same project, with built-in support for inter-annotator agreement metrics and different workflow management schemes.
    • Customizable Annotation Layers: Users can define custom annotation schemas and layers tailored to specific project needs.
    • Knowledge Base Integration: Annotations can be linked to knowledge bases and terminologies such as Wikidata, SNOMED CT and other RDF/OWL/OBO resources.
    • Machine Learning Integration: INCEpTION can train and use machine learning models to suggest annotations, improving the efficiency of the annotation process.
    • Interoperability: It supports a wide range of data formats (e.g., XMI, CoNLL, TSV), making it easy to import and export data for use with other NLP tools.

    These features make INCEpTION a comprehensive solution for building and managing annotated corpora.
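    As a concrete illustration of the interoperability point above, the sketch below reads a minimal CoNLL-style export into per-sentence (token, label) pairs. It assumes a plain two-column tab-separated layout with blank lines between sentences; INCEpTION's actual export formats (XMI, CoNLL-U, WebAnno TSV) carry more columns and metadata.

```python
# Assumes a simple "token<TAB>label" layout with blank lines between
# sentences; real INCEpTION exports carry more columns and metadata.
def read_conll(path):
    sentence = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.rstrip("\n")
            if not line:              # blank line ends a sentence
                if sentence:
                    yield sentence
                    sentence = []
                continue
            token, label = line.split("\t")[:2]
            sentence.append((token, label))
    if sentence:                      # flush the final sentence
        yield sentence

for sent in read_conll("export.conll"):  # placeholder file name
    print(sent)
```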

  6. Global Data Labeling and Annotation Service Market Research Report: By...

    • wiseguyreports.com
    Updated Oct 14, 2025
    Cite
    (2025). Global Data Labeling and Annotation Service Market Research Report: By Application (Image Recognition, Text Annotation, Video Annotation, Audio Annotation), By Service Type (Image Annotation, Text Annotation, Audio Annotation, Video Annotation, 3D Point Cloud Annotation), By Industry (Healthcare, Automotive, Retail, Finance, Robotics), By Deployment Model (On-Premise, Cloud-Based, Hybrid) and By Regional (North America, Europe, South America, Asia Pacific, Middle East and Africa) - Forecast to 2035 [Dataset]. https://www.wiseguyreports.com/reports/data-labeling-and-annotation-service-market
    Dataset updated
    Oct 14, 2025
    License

    https://www.wiseguyreports.com/pages/privacy-policy

    Time period covered
    Oct 25, 2025
    Area covered
    Global
    Description
    BASE YEAR: 2024
    HISTORICAL DATA: 2019 - 2023
    REGIONS COVERED: North America, Europe, APAC, South America, MEA
    REPORT COVERAGE: Revenue Forecast, Competitive Landscape, Growth Factors, and Trends
    MARKET SIZE 2024: 2.88 (USD Billion)
    MARKET SIZE 2025: 3.28 (USD Billion)
    MARKET SIZE 2035: 12.0 (USD Billion)
    SEGMENTS COVERED: Application, Service Type, Industry, Deployment Model, Regional
    COUNTRIES COVERED: US, Canada, Germany, UK, France, Russia, Italy, Spain, Rest of Europe, China, India, Japan, South Korea, Malaysia, Thailand, Indonesia, Rest of APAC, Brazil, Mexico, Argentina, Rest of South America, GCC, South Africa, Rest of MEA
    KEY MARKET DYNAMICS: growing AI adoption, increasing demand for accuracy, rise in machine learning, cost optimization needs, regulatory compliance requirements
    MARKET FORECAST UNITS: USD Billion
    KEY COMPANIES PROFILED: Deep Vision, Amazon, Google, Scale AI, Microsoft, Defined.ai, Samhita, Samasource, Figure Eight, Cognitive Cloud, CloudFactory, Appen, Tegas, iMerit, Labelbox
    MARKET FORECAST PERIOD: 2025 - 2035
    KEY MARKET OPPORTUNITIES: AI and machine learning growth, Increasing demand for annotated data, Expansion in autonomous vehicles, Healthcare data management needs, Real-time data processing requirements
    COMPOUND ANNUAL GROWTH RATE (CAGR): 13.9% (2025 - 2035)
  7. Expert annotations for the Catalan Common Voice (v13)

    • data.niaid.nih.gov
    Updated May 2, 2024
    Cite
    Language Technologies Unit (2024). Expert annotations for the Catalan Common Voice (v13) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_11104387
    Dataset updated
    May 2, 2024
    Dataset provided by
    Barcelona Supercomputing Center, https://www.bsc.es/
    Authors
    Language Technologies Unit
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Description

    Dataset Summary

    These are the annotations made by a team of experts on the speakers with more than 1200 seconds recorded in the Catalan set of the Common Voice dataset (v13).

    The annotators were initially tasked with evaluating all recordings associated with the same individual. Following that, they were instructed to annotate the speaker's accent, gender, and the overall quality of the recordings.

    The accents and genders taken into account are the ones used until version 8 of the Common Voice corpus.

    See annotations for more details.

    Supported Tasks and Leaderboards

    Gender classification, Accent classification.

    Languages

    The dataset is in Catalan (ca).

    Dataset Structure

    Instances

    Two xlsx documents are published, one for each round of annotations.

    The following information is available in each of the documents:

```json
{
  "speaker ID": "1b7fc0c4e437188bdf1b03ed21d45b780b525fd0dc3900b9759d0755e34bc25e31d64e69c5bd547ed0eda67d104fc0d658b8ec78277810830167c53ef8ced24b",
  "idx": "31",
  "same speaker": {"AN1": "SI", "AN2": "SI", "AN3": "SI", "agreed": "SI", "percentage": "100"},
  "gender": {"AN1": "H", "AN2": "H", "AN3": "H", "agreed": "H", "percentage": "100"},
  "accent": {"AN1": "Central", "AN2": "Central", "AN3": "Central", "agreed": "Central", "percentage": "100"},
  "audio quality": {"AN1": "4.0", "AN2": "3.0", "AN3": "3.0", "agreed": "3.0", "percentage": "66", "mean quality": "3.33", "stdev quality": "0.58"},
  "comments": {"AN1": "", "AN2": "pujades i baixades de volum", "AN3": "Deu ser d'alguna zona de transició amb el central, perquè no fa una reducció total vocàlica, però hi té molta tendència"}
}
```

    We also publish the document Guia anotació parlants.pdf with the guidelines the annotators received.

    Data Fields

    speaker ID (string): An id for which client (voice) made the recording in the Common Voice corpus

    idx (int): Id in this corpus

    AN1 (string): Annotations from Annotator 1

    AN2 (string): Annotations from Annotator 2

    AN3 (string): Annotations from Annotator 3

    agreed (string): Annotation from the majority of the annotators

    percentage (int): Percentage of annotators that agree with the agreed annotation

    mean quality (float): Mean of the quality annotation

    stdev quality (float): Standard deviation of the mean quality
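    To make the relationship between these fields concrete, here is an illustrative sketch (not project code) that recomputes the agreed label, the agreement percentage, and the quality statistics from the three annotators' values in the example record above.

```python
# Illustrative only: recompute "agreed", "percentage", and the quality
# statistics from the three annotators' values shown in the record above.
from collections import Counter
from statistics import mean, stdev

def aggregate(an1, an2, an3):
    votes = Counter([an1, an2, an3])
    label, count = votes.most_common(1)[0]
    # Integer division reproduces the truncated 66 seen in the record.
    return {"agreed": label, "percentage": 100 * count // 3}

print(aggregate("H", "H", "H"))        # {'agreed': 'H', 'percentage': 100}
print(aggregate("4.0", "3.0", "3.0"))  # {'agreed': '3.0', 'percentage': 66}

quality = [4.0, 3.0, 3.0]
print(round(mean(quality), 2), round(stdev(quality), 2))  # 3.33 0.58
```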

    Data Splits

    The corpus is not divided into splits, as it is not intended for training models.

    Dataset Creation

    Curation Rationale

    During 2022, a campaign was launched to promote the Common Voice corpus within the Catalan-speaking community, achieving remarkable success. However, not all participants provided their demographic details such as age, gender, and accent. Additionally, some individuals faced difficulty in self-defining their accent using the standard classifications established by specialists.

    In order to obtain a balanced corpus with reliable information, we saw the necessity of enlisting a group of experts from the University of Barcelona to provide accurate annotations.

    We release the complete annotations because transparency is fundamental to our project. Furthermore, we believe they hold philological value for studying dialectal and gender variants.

    Source Data

    The original data comes from the Catalan sentences of the Common Voice corpus.

    Initial Data Collection and Normalization

    We have selected speakers who have recorded more than 1200 seconds of speech in the Catalan set of the version 13 of the Common Voice corpus.

    Who are the source language producers?

    The original data comes from the Catalan sentences of the Common Voice corpus.

    Annotations

    Annotation process

    Starting with version 13 of the Common Voice corpus, we identified the speakers (273) who had recorded more than 1200 seconds of speech.

    A team of three annotators was tasked with annotating:

    if all the recordings correspond to the same person

    the gender of the speaker

    the accent of the speaker

    the quality of the recording

    They conducted an initial round of annotation, discussed their varying opinions, and subsequently conducted a second round.

    We release the complete annotations because transparency is fundamental to our project. Furthermore, we believe they hold philological value for studying dialectal and gender variants.

    Who are the annotators?

    The annotation was entrusted to the CLiC (Centre de Llenguatge i Computació) team from the University of Barcelona. They selected a group of three annotators (two men and one woman), who received a scholarship to do this work.

    The annotation team was composed of:

    Annotator 1: 1 female annotator, aged 18-25, L1 Catalan, student in the Modern Languages and Literatures degree, with a focus on Catalan.

    Annotators 2 & 3: 2 male annotators, aged 18-25, L1 Catalan, students in the Catalan Philology degree.

    1 female supervisor, aged 40-50, L1 Catalan, graduate in Physics and in Linguistics, Ph.D. in Signal Theory and Communications.

    To do the annotation, they used a Google Drive spreadsheet.

    Personal and Sensitive Information

    The Common Voice dataset consists of people who have donated their voice online. We do not share their voices here, only their gender and accent. You agree not to attempt to determine the identity of speakers in the Common Voice dataset.

    Considerations for Using the Data

    Social Impact of Dataset

    The IDs come from the Common Voice dataset, which consists of people who have donated their voice online.

    You agree to not attempt to determine the identity of speakers in the Common Voice dataset.

    The information from this corpus will allow us to train and evaluate well balanced Catalan ASR models. Furthermore, we believe they hold philological value for studying dialectal and gender variants.

    Discussion of Biases

    Most of the voices in the Catalan Common Voice correspond to men between 40 and 60 years old with a central accent. The aim of this dataset is to provide information that helps minimize the biases this could cause.

    For the gender annotation, we have only considered "H" (male) and "D" (female).

    Other Known Limitations

    [N/A]

    Additional Information

    Dataset Curators

    Language Technologies Unit at the Barcelona Supercomputing Center (langtech@bsc.es)

    This work has been promoted and financed by the Generalitat de Catalunya through the Aina project.

    Licensing Information

    This dataset is licensed under a CC BY 4.0 license.

    It can be used for any purpose, whether academic or commercial, under the terms of the license. Give appropriate credit, provide a link to the license, and indicate if changes were made.

    Citation Information


    Contributions

    The annotation was entrusted to the STeL team from the University of Barcelona.

  8. Data from: iRead4Skills Dataset 2: annotated corpora by level of complexity...

    • data.niaid.nih.gov
    • chef.afue.org
    • +3more
    Updated Jan 15, 2025
    Cite
    Pintard, Alice; François, Thomas; Justine, Nagant de Deuxchaisnes; Barbosa, Sílvia; Reis, Maria Leonor; Moutinho, Michell; Monteiro, Ricardo; Amaro, Raquel; Correia, Susana; Rodríguez Rey, Sandra; Mu, Keran; Garcia González, Marcos; Bernárdez Braña, André; Blanco Escoda, Xavier (2025). iRead4Skills Dataset 2: annotated corpora by level of complexity for FR, PT and SP [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_12821881
    Dataset updated
    Jan 15, 2025
    Dataset provided by
    Universitat Autònoma de Barcelona
    NOVA FCSH
    CENTAL
    CITIUS
    CLUNL
    Authors
    Pintard, Alice; François, Thomas; Justine, Nagant de Deuxchaisnes; Barbosa, Sílvia; Reis, Maria Leonor; Moutinho, Michell; Monteiro, Ricardo; Amaro, Raquel; Correia, Susana; Rodríguez Rey, Sandra; Mu, Keran; Garcia González, Marcos; Bernárdez Braña, André; Blanco Escoda, Xavier
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0), https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    The Dataset 2: annotated corpora by level of complexity for FR, PT and SP is a collection of texts categorized by complexity level and annotated for complexity features, presented in Excel format (.xlsx). These corpora were compiled and annotated under the scope of the project iRead4Skills – Intelligent Reading Improvement System for Fundamental and Transversal Skills Development, funded by the European Commission (grant number: 1010094837). The project aims to enhance reading skills within the adult population by creating an intelligent system that assesses text complexity and recommends suitable reading materials to adults with low literacy skills, contributing to reducing skills gaps and facilitating access to information and culture (https://iread4skills.com).

    This dataset is the result of specifically devised classification and annotation tasks, in which selected texts were organized and distributed to trainers in Adult Learning (AL) and Vocational Education Training (VET) Centres, as well as to adult students in AL and VET centres. This task was conducted via the Qualtrics platform.

    The Dataset 2: annotated corpora by level of complexity for FR, PT and SP is derived from the iRead4Skills Dataset 1: corpora by level of complexity for FR, PT and SP (https://doi.org/10.5281/zenodo.10055909), which comprises written texts of various genres and complexity levels. From this collection, a sample of texts was selected for classification and annotation. This classification and annotation task aimed to provide additional data and test sets for the complexity analysis systems for the three languages of the project: French, Portuguese, and Spanish. The sample texts in each of the language corpora were selected taking into account the diversity of topics/domains, genres, and the reading preferences of the target audience of the iRead4Skills project. This selection amounted to a total of 462 texts per language, divided by level of complexity as follows:

    · 140 Very Easy texts

    · 140 Easy texts

    · 140 Plain texts

    · 42 More Complex texts.

    Trainers and students were asked to classify the texts according to the complexity levels of the project, here informally defined as:

    · Very Easy (everyone can understand the text or most of the text).

    · Easy (a person with less than the 9th year of schooling can understand the text or most of the text)

    · Plain (a person with the 9th year of schooling can understand the text the first time he/she reads it)

    · More complex (a person with the 9th year of schooling cannot understand the text the first time he/she reads it).

    Annotators were also asked to mark the parts of the texts considered complex according to various types of features, at word level and at sentence level (e.g., word order, sentence composition, etc.). The full details regarding the students' and trainers' tasks, the qualitative and quantitative description of the data, and inter-annotator agreement are described here: https://zenodo.org/records/14653180

    The results are presented here in Excel format. For each language and for each group (trainers and students), there is a pair of files – an annotation file and a classification file – resulting in four files per language and twelve files in total.

    In all files, the data is organized as a matrix, with each row representing an ‘answer’ from a particular participant, and the columns containing various details about that specific input, as shown below:

    Column name | Data
    Annotator's ID | The randomly generated ID code for each annotator, together with information on the dataset assigned to them.
    Progress | Information on the completion of the task (for each text).
    Duration (seconds) | Time used in the completion of the task (for each text).
    File Name | File internal identification, providing its iRead4Skills classification (N1 = Very Easy, N2 = Easy, N3 = Plain, N4 = More Complex).
    Text | The content of the file, i.e. the text itself.
    Annotated Level | Level assigned by the annotator (trainer).
    Proficiency SubLevel (Likert scale, 1 to 5) | SubLevel assigned by the annotator (trainer) for FR data.
    Corresponding CEFR Level | CEFR level closest to the iRead4Skills level.
    Additional Info | Observations made by the trainers/students.
    Annotated Term | Word or set of words selected for annotation.
    Term Label | Annotation assigned to the Annotated Term (difficult word, word order, etc.).
    Term Index | Position of the annotated term in the text.
    Annotator's Proficiency Level | Level of AL/VET of the student.
    Text adequate for user | Validation of the text by the students.

    The content of the column “File Name” is color-coded, where a green shade alludes to a text with a lower level of complexity and a red one alludes to one with a higher level of complexity.
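    As a usage illustration, the sketch below loads one of the classification files with pandas (reading .xlsx requires the openpyxl engine) and cross-tabulates the level encoded in the File Name column against the level assigned by annotators. The file name is a placeholder, and the assumption that File Name values begin with the N1-N4 code follows the table above.

```python
# Hedged sketch: cross-tabulate the corpus level encoded in "File Name"
# (assumed to start with N1-N4, per the table above) against the level
# the annotators assigned. The file name is a placeholder.
import pandas as pd

df = pd.read_excel("FR_trainers_classification.xlsx")  # hypothetical file

levels = {"N1": "Very Easy", "N2": "Easy", "N3": "Plain", "N4": "More Complex"}
df["Reference Level"] = df["File Name"].str[:2].map(levels)

print(pd.crosstab(df["Reference Level"], df["Annotated Level"]))
```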

    The complete datasets are available under a CC BY-NC-ND 4.0 license.

  9. DWUG ES: Diachronic Word Usage Graphs for Spanish

    • zenodo.org
    zip
    Updated Apr 16, 2025
    Cite
    Frank D. Zamora-Reina; Felipe Bravo-Marquez; Dominik Schlechtweg (2025). DWUG ES: Diachronic Word Usage Graphs for Spanish [Dataset]. http://doi.org/10.5281/zenodo.14891659
    Dataset updated
    Apr 16, 2025
    Dataset provided by
    Zenodo, http://zenodo.org/
    Authors
    Frank D. Zamora-Reina; Felipe Bravo-Marquez; Dominik Schlechtweg
    License

    Attribution-NoDerivs 4.0 (CC BY-ND 4.0), https://creativecommons.org/licenses/by-nd/4.0/
    License information was derived automatically

    Description

    This data collection contains diachronic Word Usage Graphs (WUGs) for Spanish. Find a description of the data format, code to process the data and further datasets on the WUGsite.

    Please find more information on the provided data in the papers referenced below.

    The annotation was funded by

    • ANID FONDECYT grant 11200290, U-Inicia VID Project UI-004/20,
    • ANID - Millennium Science Initiative Program - Code ICN17 002 and
    • SemRel Group (DFG Grants SCHU 2580/1 and SCHU 2580/2).

    Version: 4.0.2, 7.1.2025. Full data. Quoting issues in uses resolved. Target word and target sentence indices corrected. One corrected context for word 'metro'. Judgments anonymized. Annotator 'gecsa' removed. Issues with special characters in filenames resolved. Additional removal of wrongly copied graphs.
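    The WUGsite documents the underlying format, in which each word comes with a table of uses and a table of pairwise relatedness judgments. As a rough illustration only, the sketch below aggregates such judgments into a weighted graph with networkx; the path and column names are assumptions based on that format and may not match this release exactly.

```python
# Rough illustration (path and column names are assumptions): aggregate
# pairwise relatedness judgments for one target word into a graph whose
# edge weights are the median judgment over annotators.
import csv
from collections import defaultdict
from statistics import median

import networkx as nx

pair_judgments = defaultdict(list)
with open("data/metro/judgments.csv", encoding="utf-8") as fh:  # placeholder
    for row in csv.DictReader(fh, delimiter="\t"):
        key = tuple(sorted((row["identifier1"], row["identifier2"])))
        pair_judgments[key].append(float(row["judgment"]))

G = nx.Graph()
for (use1, use2), judgments in pair_judgments.items():
    G.add_edge(use1, use2, weight=median(judgments))

print(G.number_of_nodes(), "uses,", G.number_of_edges(), "judged pairs")
```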

    Reference

    Frank D. Zamora-Reina, Felipe Bravo-Marquez, Dominik Schlechtweg. 2022. LSCDiscovery: A shared task on semantic change discovery and detection in Spanish. In Proceedings of the 3rd International Workshop on Computational Approaches to Historical Language Change. Association for Computational Linguistics.

    Dominik Schlechtweg, Tejaswi Choppa, Wei Zhao, Michael Roth. 2025. The CoMeDi Shared Task: Median Judgment Classification & Mean Disagreement Ranking with Ordinal Word-in-Context Judgments. In Proceedings of the 1st Workshop on Context and Meaning--Navigating Disagreements in NLP Annotations.

  10. Annotating speaker stance in discourse: the Brexit Blog Corpus (BBC)

    • demo.researchdata.se
    • researchdata.se
    Updated Jan 15, 2019
    Cite
    Andreas Kerren; Carita Paradis (2019). Annotating speaker stance in discourse: the Brexit Blog Corpus (BBC) [Dataset]. http://doi.org/10.5878/002925
    Dataset updated
    Jan 15, 2019
    Dataset provided by
    Linnaeus University
    Authors
    Andreas Kerren; Carita Paradis
    Time period covered
    Jun 1, 2015 - May 31, 2016
    Description

    In this study, we explore to what extent language users agree about what kind of stances are expressed in natural language use or whether their interpretations diverge. In order to perform this task, a comprehensive cognitive-functional framework of ten stance categories was developed based on previous work on speaker stance in the literature. A corpus of opinionated texts, where speakers take stance and position themselves, was compiled, the Brexit Blog Corpus (BBC). An analytical interface for the annotations was set up and the data were annotated independently by two annotators. The annotation procedure, the annotation agreement and the co-occurrence of more than one stance category in the utterances are described and discussed. The careful, analytical annotation process has by and large returned satisfactory inter- and intra-annotation agreement scores, resulting in a gold standard corpus, the final version of the BBC.

    Purpose:

    The aim of this study is to explore the possibility of identifying speaker stance in discourse, provide an analytical resource for it and an evaluation of the level of agreement across speakers in the area of stance-taking in discourse.

    The BBC is a collection of texts from blog sources. The corpus texts are thematically related to the 2016 UK referendum on whether the UK should remain a member of the European Union or not. The texts were extracted from the Internet from June to August 2015. With the Gavagai API (https://developer.gavagai.se), the texts were detected using seed words, such as Brexit, EU referendum, pro-Europe, europhiles, eurosceptics, United States of Europe, David Cameron, or Downing Street. The retrieved URLs were filtered so that only entries described as blogs in English were selected. Each downloaded document was split into sentential utterances, from which 2,200 utterances were randomly selected as the analysis data set. The final size of the corpus is 1,682 utterances, 35,492 words (169,762 characters without spaces). Each utterance contains from 3 to 40 words, with a mean length of 21 words.

    For the data annotation process the Active Learning and Visual Analytics (ALVA) system (https://doi.org/10.1145/3132169 and https://doi.org/10.2312/eurp.20161139) was used. Two annotators, one who is a professional translator with a Licentiate degree in English Linguistics and the other one with a PhD in Computational Linguistics, carried out the annotations independently of one another.

    The data set can be downloaded in two different formats: a standard Microsoft Excel format and a raw data format (ZIP archive) which can be useful for analytical and machine learning purposes, for example, with the Python library scikit-learn. The Excel file includes one additional variable (utterance word length). The ZIP archive contains a set of directories (e.g., "contrariety" and "prediction") corresponding to the stance categories. Inside of each such directory, there are two nested directories corresponding to annotations which assign or not assign the respective category to utterances (e.g., inside the top-level category "prediction" there are two directories, "prediction" with utterances which were labeled with this category, and "no" with the rest of the utterances). Inside of the nested directories, there are textual files containing individual utterances.
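    Since the ZIP layout described above mirrors scikit-learn's one-folder-per-class convention, a stance category can be loaded directly with load_files. The extraction path below is a placeholder; the nested folder names follow the "prediction"/"no" example given above.

```python
# The ZIP layout (one folder per class, one text file per utterance)
# matches scikit-learn's load_files convention, so a stance category
# loads directly. The extraction path is a placeholder.
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import TfidfVectorizer

data = load_files("bbc_raw/prediction", encoding="utf-8")  # folders: prediction/, no/
X = TfidfVectorizer().fit_transform(data.data)

print(data.target_names)  # ['no', 'prediction']
print(X.shape)            # (n_utterances, n_features)
```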

    When using data from this study, the primary researcher wishes citation also to be made to the publication: Vasiliki Simaki, Carita Paradis, Maria Skeppstedt, Magnus Sahlgren, Kostiantyn Kucher, and Andreas Kerren. Annotating speaker stance in discourse: the Brexit Blog Corpus. In Corpus Linguistics and Linguistic Theory, 2017. De Gruyter, published electronically before print. https://doi.org/10.1515/cllt-2016-0060

  11. Data from: CT-EBM-SP - Corpus of Clinical Trials for Evidence-Based-Medicine...

    • data.europa.eu
    • datos.gob.es
    • +1more
    unknown
    Updated Feb 12, 2022
    Cite
    Zenodo (2022). CT-EBM-SP - Corpus of Clinical Trials for Evidence-Based-Medicine in Spanish [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-6059737?locale=da
    unknown (2576817 bytes)
    Dataset updated
    Feb 12, 2022
    Dataset authored and provided by
    Zenodo, http://zenodo.org/
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A collection of 1200 texts (292,173 tokens) about clinical trial studies and clinical trial announcements in Spanish:

    - 500 abstracts from journals published under a Creative Commons license, e.g. available in PubMed or the Scientific Electronic Library Online (SciELO).
    - 700 clinical trial announcements published in the European Clinical Trials Register and the Repositorio Español de Estudios Clínicos.

    Texts were annotated with entities from the Unified Medical Language System semantic groups: anatomy (ANAT), pharmacological and chemical substances (CHEM), pathologies (DISO), and lab tests, diagnostic or therapeutic procedures (PROC). In total, 46,699 entities were annotated (13.98% of them nested). 10% of the corpus was doubly annotated, and inter-annotator agreement (IAA) achieved a mean F-measure of 85.65% (±4.79, strict match) and a mean F-measure of 93.94% (±3.31, relaxed match). The corpus is freely distributed for research and educational purposes under a Creative Commons Non-Commercial Attribution (CC-BY-NC-A) License.
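    For readers unfamiliar with the strict-versus-relaxed distinction used in the IAA figures above, the sketch below computes a strict-match F-measure between two annotators' entity sets, where a match requires identical entity type and span; the entities are invented for illustration.

```python
# Strict-match inter-annotator F-measure: an entity counts as agreed only
# if type, start, and end are identical (relaxed matching would instead
# accept overlapping spans). The entity sets are invented.
def f_measure(a: set, b: set) -> float:
    tp = len(a & b)                        # entities marked by both
    if tp == 0:
        return 0.0
    precision = tp / len(a)
    recall = tp / len(b)
    return 2 * precision * recall / (precision + recall)

ann1 = {("DISO", 10, 18), ("PROC", 25, 40), ("CHEM", 50, 61)}
ann2 = {("DISO", 10, 18), ("PROC", 25, 39)}  # span differs: no strict match
print(round(f_measure(ann1, ann2), 4))       # 0.4
```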

  12. SURel: Synchronic Usage Relatedness

    • zenodo.org
    zip
    Updated Apr 23, 2025
    Cite
    Anna Hätty; Dominik Schlechtweg; Sabine Schulte im Walde (2025). SURel: Synchronic Usage Relatedness [Dataset]. http://doi.org/10.5281/zenodo.5543307
    Dataset updated
    Apr 23, 2025
    Dataset provided by
    Zenodo, http://zenodo.org/
    Authors
    Anna Hätty; Dominik Schlechtweg; Sabine Schulte im Walde
    License

    Attribution-NoDerivs 4.0 (CC BY-ND 4.0), https://creativecommons.org/licenses/by-nd/4.0/
    License information was derived automatically

    Description


    Synchronic Usage Relatedness (SURel) - Test Set and Annotation Data


    This data collection supplementing the paper referenced below contains:

    - a semantic meaning shift test set with 22 German lexemes with different degrees of meaning shifts from general language to the domain of cooking. It comes as a tab-separated csv file where each line has the form

    lemma | POS | translations | mean relatedness score | frequency GEN | frequency SPEC

    The 'mean relatedness score' denotes the annotation-based measure of semantic shift described in the paper. 'frequency GEN' and 'frequency SPEC' list the frequencies of the target words in the general language corpus (GEN) and the domain-specific cooking corpus (SPEC). 'translations' gives English translations for different senses, illustrating possible meaning shifts. Note that further senses might exist;

    - the full annotation tables as the annotators received and filled them. The tables come in the form of a tab-separated csv file where each line has the form

    sentence 1 | rating | comment | sentence 2;

    - the annotation guidelines in English and German (only the German version was used);
    - data visualization plots.
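    A minimal sketch of reading the test set under the layout just described follows; the file name is a placeholder and the column order is taken from that description, so it may need adjusting against the released files.

```python
# Minimal reading sketch; file name is a placeholder, column order follows
# the layout described above.
import csv

FIELDS = ["lemma", "POS", "translations",
          "mean_relatedness_score", "frequency_GEN", "frequency_SPEC"]

with open("surel_testset.csv", encoding="utf-8") as fh:
    rows = [dict(zip(FIELDS, line)) for line in csv.reader(fh, delimiter="\t")]

for row in sorted(rows, key=lambda r: float(r["mean_relatedness_score"])):
    print(row["lemma"], row["mean_relatedness_score"],
          row["frequency_GEN"], row["frequency_SPEC"])
```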

    Find more information in

    Anna Hätty, Dominik Schlechtweg, Sabine Schulte im Walde. 2019. SURel: A Gold Standard for Incorporating Meaning Shifts into Term Extraction. In Proceedings of the Eighth Joint Conference on Lexical and Computational Semantics (*SEM). Minneapolis, Minnesota USA 2019.

    Dominik Schlechtweg, Sabine Schulte im Walde, Stefanie Eckmann. 2018. Diachronic Usage Relatedness (DURel): A Framework for the Annotation of Lexical Semantic Change. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT). New Orleans, Louisiana USA 2018.


    The resources are freely available for education, research and other non-commercial purposes. More information can be requested via email to the authors.


  13. ActiveHuman Part 2

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    Updated Nov 14, 2023
    Cite
    Charalampos Georgiadis (2023). ActiveHuman Part 2 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8361113
    Dataset updated
    Nov 14, 2023
    Dataset provided by
    Aristotle University of Thessaloniki (AUTh)
    Authors
    Charalampos Georgiadis
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is Part 2/2 of the ActiveHuman dataset! Part 1 can be found here.

    Dataset Description

    ActiveHuman was generated using Unity's Perception package. It consists of 175,428 RGB images and their semantic segmentation counterparts, taken in different environments, lighting conditions, camera distances, and angles. In total, the dataset contains images for 8 environments, 33 humans, 4 lighting conditions, 7 camera distances (1 m to 4 m) and 36 camera angles (0 to 360 degrees at 10-degree intervals). The dataset does not include images at every single combination of available camera distances and angles, since for some values the camera would collide with another object or go outside the confines of an environment; as a result, some combinations of camera distances and angles do not exist in the dataset. Alongside each image, 2D bounding box, 3D bounding box, and keypoint ground truth annotations are also generated via the use of Labelers and are stored as a JSON-based dataset. These Labelers are scripts that are responsible for capturing ground truth annotations for each captured image or frame. Keypoint annotations follow the COCO format defined by the COCO keypoint annotation template offered in the Perception package.

    Folder configuration

    The dataset consists of 3 folders:

    • JSON Data: contains all the generated JSON files.
    • RGB Images: contains the generated RGB images.
    • Semantic Segmentation Images: contains the generated semantic segmentation images.

    Essential Terminology

    • Annotation: recorded data describing a single capture.
    • Capture: one completed rendering process of a Unity sensor which stored the rendered result to data files (e.g. PNG, JPG).
    • Ego: object or person to which a collection of sensors is attached (e.g., if a drone has a camera attached to it, the drone is the ego and the camera is the sensor).
    • Ego coordinate system: coordinates with respect to the ego.
    • Global coordinate system: coordinates with respect to the global origin in Unity.
    • Sensor: device that captures the dataset (in this instance the sensor is a camera).
    • Sensor coordinate system: coordinates with respect to the sensor.
    • Sequence: time-ordered series of captures. This is very useful for video capture, where the time-order relationship of two captures is vital.
    • UUID: Universal Unique Identifier. A unique hexadecimal identifier that can represent an individual instance of a capture, ego, sensor, annotation, labeled object or keypoint, or keypoint template.

    Dataset Data

    The dataset includes 4 types of JSON annotation files:

    annotation_definitions.json: contains annotation definitions for all of the active Labelers of the simulation, stored in an array. Each entry consists of a collection of key-value pairs which describe a particular type of annotation and contain information about how its data should be mapped back to labels or objects in the scene. Each entry contains the following key-value pairs:

    • id: integer identifier of the annotation's definition.
    • name: annotation name (e.g., keypoints, bounding box, bounding box 3D, semantic segmentation).
    • description: description of the annotation's specifications.
    • format: format of the file containing the annotation specifications (e.g., json, PNG).
    • spec: format-specific specifications for the annotation values generated by each Labeler.

    Most Labelers generate different annotation specifications in the spec key-value pair:

    BoundingBox2DLabeler / BoundingBox3DLabeler:
    • label_id: integer identifier of a label.
    • label_name: string identifier of a label.

    KeypointLabeler:
    • template_id: keypoint template UUID.
    • template_name: name of the keypoint template.
    • key_points: array containing all the joints defined by the keypoint template. This array includes the key-value pairs:
      - label: joint label.
      - index: joint index.
      - color: RGBA values of the keypoint.
      - color_code: hex color code of the keypoint.
    • skeleton: array containing all the skeleton connections defined by the keypoint template. Each skeleton connection defines a connection between two different joints. This array includes the key-value pairs:
      - label1: label of the first joint.
      - label2: label of the second joint.
      - joint1: index of the first joint.
      - joint2: index of the second joint.
      - color: RGBA values of the connection.
      - color_code: hex color code of the connection.

    SemanticSegmentationLabeler:
    • label_name: string identifier of a label.
    • pixel_value: RGBA values of the label.
    • color_code: hex color code of the label.

    captures_xyz.json: each of these files contains an array of ground truth annotations generated by each active Labeler for each capture separately, as well as extra metadata that describe the state of each active sensor that is present in the scene. Each array entry contains the following key-value pairs:

    • id: UUID of the capture.
    • sequence_id: UUID of the sequence.
    • step: index of the capture within a sequence.
    • timestamp: timestamp (in ms) since the beginning of a sequence.
    • sensor: properties of the sensor. This entry contains a collection with the following key-value pairs:
      - sensor_id: sensor UUID.
      - ego_id: ego UUID.
      - modality: modality of the sensor (e.g., camera, radar).
      - translation: 3D vector that describes the sensor's position (in meters) with respect to the global coordinate system.
      - rotation: quaternion variable that describes the sensor's orientation with respect to the ego coordinate system.
      - camera_intrinsic: matrix containing (if it exists) the camera's intrinsic calibration.
      - projection: projection type used by the camera (e.g., orthographic, perspective).
    • ego: attributes of the ego. This entry contains a collection with the following key-value pairs:
      - ego_id: ego UUID.
      - translation: 3D vector that describes the ego's position (in meters) with respect to the global coordinate system.
      - rotation: quaternion variable containing the ego's orientation.
      - velocity: 3D vector containing the ego's velocity (in meters per second).
      - acceleration: 3D vector containing the ego's acceleration (in meters per second squared).
    • format: format of the file captured by the sensor (e.g., PNG, JPG).
    • annotations: key-value pair collections, one for each active Labeler. These key-value pairs are as follows:
      - id: annotation UUID.
      - annotation_definition: integer identifier of the annotation's definition.
      - filename: name of the file generated by the Labeler. This entry is only present for Labelers that generate an image.
      - values: list of key-value pairs containing annotation data for the current Labeler.

    Each Labeler generates different annotation specifications in the values key-value pair:

    BoundingBox2DLabeler:

    label_id: Integer identifier of a label. label_name: String identifier of a label. instance_id: UUID of one instance of an object. Each object with the same label that is visible on the same capture has different instance_id values. x: Position of the 2D bounding box on the X axis. y: Position of the 2D bounding box position on the Y axis. width: Width of the 2D bounding box. height: Height of the 2D bounding box. BoundingBox3DLabeler:

    label_id: Integer identifier of a label.
    label_name: String identifier of a label.
    instance_id: UUID of one instance of an object. Each object with the same label that is visible in the same capture has a different instance_id value.
    translation: 3D vector containing the location of the center of the 3D bounding box with respect to the sensor coordinate system (in meters).
    size: 3D vector containing the size of the 3D bounding box (in meters).
    rotation: Quaternion variable containing the orientation of the 3D bounding box.
    velocity: 3D vector containing the velocity of the 3D bounding box (in meters per second).
    acceleration: 3D vector containing the acceleration of the 3D bounding box (in meters per second squared).

    KeypointLabeler:

    label_id: Integer identifier of a label.
    instance_id: UUID of one instance of a joint. Keypoints with the same joint label that are visible in the same capture have different instance_id values.
    template_id: UUID of the keypoint template.
    pose: Pose label for that particular capture.
    keypoints: Array containing the properties of each keypoint. Each keypoint that exists in the keypoint template file is one element of the array. Each entry contains the following key-value pairs:

        index: Index of the keypoint in the keypoint template file.
        x: Pixel coordinates of the keypoint on the X axis.
        y: Pixel coordinates of the keypoint on the Y axis.
        state: State of the keypoint.

    The SemanticSegmentationLabeler does not contain a values list.
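    Putting the per-Labeler specifications together, the sketch below collects all 2D bounding boxes from one capture entry. It is a hedged sketch: the annotation_definition id assigned to the BoundingBox2DLabeler is dataset-specific, so the code matches on the presence of the 2D box fields instead; the .get() call also covers the SemanticSegmentationLabeler, which carries no values list.

```python
def bounding_boxes_2d(capture):
    """Collect (label_name, instance_id, x, y, width, height) tuples from one
    capture entry. Matches annotations on the 2D box fields rather than on a
    specific annotation_definition id, which is dataset-specific."""
    boxes = []
    for annotation in capture["annotations"]:
        # SemanticSegmentationLabeler annotations have no values list.
        for value in annotation.get("values", []):
            if {"x", "y", "width", "height"} <= value.keys():
                boxes.append((value["label_name"], value["instance_id"],
                              value["x"], value["y"],
                              value["width"], value["height"]))
    return boxes
```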

    egos.json: Contains collections of key-value pairs for each ego. These include:

    id: UUID of the ego.
    description: Description of the ego.

    sensors.json: Contains collections of key-value pairs for all sensors of the simulation. These include:

    id: UUID of the sensor.
    ego_id: UUID of the ego to which the sensor is attached.
    modality: Modality of the sensor (e.g., camera, radar, sonar).
    description: Description of the sensor (e.g., camera, radar).

    Image names

    The RGB and semantic segmentation images share the same naming convention; however, the semantic segmentation filenames additionally begin with the string Semantic_. Each RGB image is named "e_h_l_d_r.jpg", where:

    e denotes the id of the environment.
    h denotes the id of the person.
    l denotes the id of the lighting condition.
    d denotes the camera distance at which the image was captured.
    r denotes the camera angle at which the image was captured.
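    A small parsing helper makes this convention explicit. This is a sketch under the naming rules above; the example filename at the end is made up purely to illustrate the format:

```python
import os

def parse_image_name(filename):
    """Split an image name of the form e_h_l_d_r.jpg into its id fields.
    Semantic segmentation filenames carry an extra leading Semantic_."""
    stem, _ = os.path.splitext(os.path.basename(filename))
    is_semantic = stem.startswith("Semantic_")
    if is_semantic:
        stem = stem[len("Semantic_"):]
    env, person, light, dist, ang = stem.split("_")
    return {"environment": env, "person": person, "lighting": light,
            "distance": dist, "angle": ang, "semantic": is_semantic}

# Made-up example name, purely to illustrate the convention:
print(parse_image_name("Semantic_2_5_1_3_45.jpg"))
```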

  14. f

    Data from: Veneer Is a Webtool for Rapid, Standardized, and Transparent...

    • figshare.com
    • datasetcatalog.nlm.nih.gov
    • +1more
    xlsx
    Updated Feb 27, 2024
    Cite
    Linda Berg Luecke; Roneldine Mesidor; Jack Littrell; Morgan Carpenter; Melinda Wojtkiewicz; Rebekah L. Gundry (2024). Veneer Is a Webtool for Rapid, Standardized, and Transparent Interpretation, Annotation, and Reporting of Mammalian Cell Surface N‑Glycocapture Data [Dataset]. http://doi.org/10.1021/acs.jproteome.3c00800.s002
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Feb 27, 2024
    Dataset provided by
    ACS Publications
    Authors
    Linda Berg Luecke; Roneldine Mesidor; Jack Littrell; Morgan Carpenter; Melinda Wojtkiewicz; Rebekah L. Gundry
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically
    License information was derived automatically

    Description

    Currently, no consensus exists regarding criteria required to designate a protein within a proteomic data set as a cell surface protein. Most published proteomic studies rely on varied ontology annotations or computational predictions instead of experimental evidence when attributing protein localization. Consequently, standardized approaches for analyzing and reporting cell surface proteome data sets would increase confidence in localization claims and promote data use by other researchers. Recently, we developed Veneer, a web-based bioinformatic tool that analyzes results from cell surface N-glycocapture workflows, the most popular cell surface proteomics method used to date that generates experimental evidence of subcellular location. Veneer assigns protein localization based on defined experimental and bioinformatic evidence. In this study, we updated the criteria and process for assigning protein localization and added new functionality to Veneer. Results of Veneer analysis of 587 cell surface N-glycocapture data sets from 32 published studies demonstrate the importance of applying defined criteria when analyzing cell surface proteomics data sets and exemplify how Veneer can be used to assess experimental quality and facilitate data extraction for informing future biological studies and annotating public repositories.

  15. Z

    Disco-Annotation

    • data.niaid.nih.gov
    Updated Oct 6, 2020
    Cite
    Popescu-Bellis, Andrei; Meyer, Thomas; Liyanapathirana, Jeevanthi; Cartoni, Bruno; Zufferey, Sandrine; Hajlaoui, Najeh (2020). Disco-Annotation [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4061389
    Explore at:
    Dataset updated
    Oct 6, 2020
    Dataset provided by
    Idiap Research Institute
    University of Geneva
    Université Catholique de Louvain
    Authors
    Popescu-Bellis, Andrei; Meyer, Thomas; Liyanapathirana, Jeevanthi; Cartoni, Bruno; Zufferey, Sandrine; Hajlaoui, Najeh
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description


    Disco-Annotation is a collection of training and test sets with manually annotated discourse relations for 8 discourse connectives in Europarl texts.

    The 8 connectives with their annotated relations are:

    although (contrast|concession)

    as (prep|causal|temporal|comparison|concession)

    however (contrast|concession)

    meanwhile (contrast|temporal)

    since (causal|temporal|temporal-causal)

    though (contrast|concession)

    while (contrast|concession|temporal|temporal-contrast|temporal-causal)

    yet (adv|contrast|concession)

    For each connective there is a training set and a test set. The relations were annotated by two trained annotators using a translation-spotting method. The division into training and test sets also allows for comparison if you train your own models.

    If you need software for the latter, have a look at: https://github.com/idiap/DiscoConn-Classifier

    Citation

    Please cite the following papers if you make use of these datasets; they also describe the annotation method in more detail:

    @inproceedings{Popescu-Belis-LREC-2012,
      author    = {Popescu-Belis, Andrei and Meyer, Thomas and Liyanapathirana, Jeevanthi and Cartoni, Bruno and Zufferey, Sandrine},
      title     = {{D}iscourse-level {A}nnotation over {E}uroparl for {M}achine {T}ranslation: {C}onnectives and {P}ronouns},
      booktitle = {Proceedings of the eighth international conference on Language Resources and Evaluation ({LREC})},
      year      = {2012},
      address   = {Istanbul, Turkey}
    }

    @article{Cartoni-DD-2013,
      author  = {Cartoni, Bruno and Zufferey, Sandrine and Meyer, Thomas},
      title   = {{Annotating the meaning of discourse connectives by looking at their translation: The translation-spotting technique}},
      journal = {Dialogue \& Discourse},
      volume  = {4},
      number  = {2},
      pages   = {65--86},
      year    = {2013}
    }

    @article{Meyer-TSLP-submitted,
      author  = {Meyer, Thomas and Hajlaoui, Najeh and Popescu-Belis, Andrei},
      title   = {{Disambiguating Discourse Connectives for Statistical Machine Translation in Several Languages}},
      journal = {IEEE/ACM Transactions on Audio, Speech, and Language Processing},
      year    = {submitted}
    }

  16. Z

    Data from: TAIR functional annotation data

    • data.niaid.nih.gov
    • zenodo.org
    Updated Oct 6, 2024
    + more versions
    Cite
    Tanya Berardini; Leonore Reiser; Eva Huala (2024). TAIR functional annotation data [Dataset]. https://data.niaid.nih.gov/resources?id=ZENODO_1422852
    Explore at:
    Dataset updated
    Oct 6, 2024
    Dataset provided by
    Phoenix Bioinformatics
    Authors
    Tanya Berardini; Leonore Reiser; Eva Huala
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Quarterly release of curated gene function data for Arabidopsis thaliana from The Arabidopsis Information Resource (www.arabidopsis.org).

    The contents of the compressed archive include the following files, which are described in detail in the included README file.

    1. ATH_GO_GOSLIM.txt.gz: This document is a tab-delimited file containing GO annotations for Arabidopsis genes annotated by TAIR and TIGR with terms from the Gene Ontology Consortium controlled vocabularies (see www.geneontology.org). This file includes an updated set of literature-based annotations and >40,000 electronic annotations based upon matches to INTERPRO domains supplied by Nicola Mulder from SWISS-PROT/INTERPRO.

    Please cite this paper when using TAIR's GO annotations in your research: Berardini, TZ, Mundodi, S, Reiser, L, Huala, E, Garcia-Hernandez, M, Zhang, P, Mueller, LM, Yoon, J, Doyle, A, Lander, G, Moseyko, N, Yoo, D, Xu, I, Zoeckler, B, Montoya, M, Miller, N, Weems, D, and Rhee, SY (2004) Functional annotation of the Arabidopsis genome using controlled vocabularies. Plant Physiol. 135(2):1-11.

    2. gene_aliases_yyyymmdd.txt(.gz): This file lists alternative names for each gene.

    3. Locus_Germplasm_Phenotype_yyyymmdd.txt.gz: This file contains links between loci, germplasms, and phenotypes.

    4. Locus_Published_yyyymmdd.txt.gz: This file contains links between loci and publications.

    5. po_temporal_gene_arabidopsis_tair.assoc.gz and po_anatomy_gene_arabidopsis_tair.assoc.gz: These two tab-delimited files each contain the set of literature-based annotations of Arabidopsis genes and loci annotated at TAIR to terms from the Plant Ontology developed by the Plant Ontology Consortium (POC, www.plantontology.org).

    6. TAIR10 or ARAPORT11_functional_descriptions_yyyymmdd.txt(.gz): This file contains functional descriptions for gene models included in either the TAIR10 or, as of 20170630, the Araport11 genome release. TAIR10/Araport11 refers to the version of the genome annotation.

    7. Araport11_GFF3_genes_transposons.[DATE].gff.gz

    a) Araport11_GFF3_genes_transposons.MMMYYYY.gff.gz: This document is a tab-delimited file in GFF format containing annotations from the Araport11 genome release, including information curated from recent scientific literature. Note: this file is available starting with the 20211231 Data Release.

    Column headers:
    1. Name of the chromosome.
    2. Source: name of the data source that generated this feature (Araport11).
    3. Annotation type, e.g. gene, mRNA, etc.
    4. Start position of the annotation.
    5. Stop position of the annotation.
    6. Score: a floating point value.
    7. Strand information, defined as + (forward) or - (reverse).
    8. Frame: one of '0', '1' or '2'. '0' indicates that the first base of the feature is the first base of a codon, '1' that the second base is the first base of a codon, and so on.
    9. Detailed annotation information: a semicolon-separated list of tag-value pairs providing additional information about each feature, including curator summary, computational description, etc.

    b) Araport11_GTF_genes_transposons.MMMYYYY.gtf.gz: This document is a tab-delimited file in GTF format containing annotations from the Araport11 genome release, including information curated from recent scientific literature. Note: this file is available starting with the 20211231 Data Release.

    Column headers:
    1. Name of the chromosome.
    2. Source: name of the data source that generated this feature (Araport11).
    3. Annotation type, e.g. gene, mRNA, etc.
    4. Start position of the annotation.
    5. Stop position of the annotation.
    6. Score: a floating point value.
    7. Strand information, defined as + (forward) or - (reverse).
    8. Frame: one of '0', '1' or '2'. '0' indicates that the first base of the feature is the first base of a codon, '1' that the second base is the first base of a codon, and so on.
    9. Detailed annotation information: a semicolon-separated list of tag-value pairs providing additional information about each feature, including transcript_id, gene_id, Note, etc.
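    Since both files are tab-delimited with the nine columns listed above, a feature line can be parsed with plain string splitting. The sketch below is illustrative rather than a complete parser; it handles the GFF3 tag=value attribute style, whereas GTF attributes use tag "value" pairs and would need the partition step adjusted:

```python
def parse_gff3_line(line):
    """Parse one 9-column GFF3 feature line as described above. Illustrative
    only: real files also contain comment lines (starting with '#') to skip."""
    (chrom, source, ftype, start, stop,
     score, strand, frame, attributes) = line.rstrip("\n").split("\t")
    attrs = {}
    for pair in attributes.split(";"):
        if pair:
            tag, _, value = pair.partition("=")  # GFF3 style; GTF uses: tag "value"
            attrs[tag.strip()] = value.strip().strip('"')
    return {"chrom": chrom, "source": source, "type": ftype,
            "start": int(start), "stop": int(stop), "score": score,
            "strand": strand, "frame": frame, "attributes": attrs}
```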

  17. Annotation performance in terms of AUC (mean and standard deviation), using...

    • plos.figshare.com
    xls
    Updated Jun 2, 2023
    Cite
    Iulian Pruteanu-Malinici; Daniel L. Mace; Uwe Ohler (2023). Annotation performance in terms of AUC (mean and standard deviation), using the LOO-CV scheme, data set . [Dataset]. http://doi.org/10.1371/journal.pcbi.1002098.t001
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Iulian Pruteanu-Malinici; Daniel L. Mace; Uwe Ohler
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The reported values denote the performance obtained by the SVM and SMLR classifiers on lateral view images only, using both majority (maj) and minority (min) voting. For more details on majority and minority voting, please see 'Materials and methods'. For each case, random partitions of the training and testing data sets are generated over the most popular annotation terms. Abbreviations of the anatomical annotations: AMP - anterior midgut primordium; BP - brain primordium; DEP - dorsal epidermis primordium; FP - foregut primordium; HMP - head mesoderm primordium; HPP - hindgut proper primordium; PMP - posterior midgut primordium; SMP - somatic muscle primordium; TMP - trunk mesoderm primordium; VNCP - ventral nerve cord primordium.

  18. Data from: FluoroMatch 2.0-making automated and comprehensive non-targeted...

    • catalog.data.gov
    • s.cnmilf.com
    • +1more
    Updated Feb 10, 2023
    Cite
    U.S. EPA Office of Research and Development (ORD) (2023). FluoroMatch 2.0-making automated and comprehensive non-targeted PFAS annotation a reality [Dataset]. https://catalog.data.gov/dataset/fluoromatch-2-0-making-automated-and-comprehensive-non-targeted-pfas-annotation-a-reality
    Explore at:
    Dataset updated
    Feb 10, 2023
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    Data for "Koelmel JP, Stelben P, McDonough CA, Dukes DA, Aristizabal-Henao JJ, Nason SL, Li Y, Sternberg S, Lin E, Beckmann M, Williams AJ, Draper J, Finch JP, Munk JK, Deigl C, Rennie EE, Bowden JA, Godri Pollitt KJ. FluoroMatch 2.0-making automated and comprehensive non-targeted PFAS annotation a reality. Anal Bioanal Chem. 2022 Jan;414(3):1201-1215. doi: 10.1007/s00216-021-03392-7. Epub 2021 May 20. PMID: 34014358.". Portions of this dataset are inaccessible because: The link provided by UCSD doesn't seem to be working. They can be accessed through the following means: Contact Jeremy Koelmel at Yale University, jeremykoelmel@innovativeomics.com. Format: The final annotated excel sheets with feature intensities, annotations, homologous series groupings, etc., are available as a supplemental excel file with the online version of this manuscript. The raw Agilent “.d” files can be downloaded at: ftp://massive.ucsd.edu/MSV000086811/updates/2021-02-05_jeremykoelmel_e5b21166/raw/McDonough_AFFF_3M_ddMS2_Neg.zip (Note use Google Chrome or Firefox, Microsoft Edge and certain other browsers are unable to download from an ftp link). This dataset is associated with the following publication: Koelmel, J.P., P. Stelben, C.A. McDonough, D.A. Dukes, J.J. Aristizabal-Henao, S.L. Nason, Y. Li, S. Sternberg, E. Lin, M. Beckmann, A. Williams, J. Draper, J. Finch, J.K. Munk, C. Deigl, E. Rennie, J.A. Bowden, and K.J. Godri Pollitt. FluoroMatch 2.0—making automated and comprehensive non-targeted PFAS annotation a reality. Analytical and Bioanalytical Chemistry. Springer, New York, NY, USA, 414(3): 1201-1215, (2022).

  19. f

    Data from: Quetzal: Comprehensive Peptide Fragmentation Annotation and...

    • acs.figshare.com
    xlsx
    Updated Mar 20, 2025
    Cite
    Eric W. Deutsch; Luis Mendoza; Robert L. Moritz (2025). Quetzal: Comprehensive Peptide Fragmentation Annotation and Visualization [Dataset]. http://doi.org/10.1021/acs.jproteome.5c00092.s002
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Mar 20, 2025
    Dataset provided by
    ACS Publications
    Authors
    Eric W. Deutsch; Luis Mendoza; Robert L. Moritz
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Proteomics data-dependent acquisition data sets collected with high-resolution mass-spectrometry (MS) can achieve very high-quality results, but nearly every analysis yields results that are thresholded at some accepted false discovery rate, meaning that a substantial number of results are incorrect. For study conclusions that rely on a small number of peptide-spectrum matches being correct, it is thus important to examine at least some crucial spectra to ensure that they are not one of the incorrect identifications. We present Quetzal, a peptide fragment ion spectrum annotation tool to assist researchers in annotating and examining such spectra to ensure that they correctly support study conclusions. We describe how Quetzal annotates spectra using the new Human Proteome Organization (HUPO) Proteomics Standards Initiative (PSI) mzPAF standard for fragment ion peak annotation, including the Python-based code, a web-service end point that provides annotation services, and a web-based application for annotating spectra and producing publication-quality figures. We illustrate its functionality with several annotated spectra of varying complexity. Quetzal provides easily accessible functionality that can assist in the effort to ensure and demonstrate that crucial spectra support study conclusions. Quetzal is publicly available at https://proteomecentral.proteomexchange.org/quetzal/.

  20. Z

    Mappings for "Developing a Scalable Annotation Method for Large Datasets...

    • data-staging.niaid.nih.gov
    • data.niaid.nih.gov
    Updated Mar 14, 2025
    Cite
    Klopfenstein, Sophie Anne Inès; Flint, Anne Rike; Heeren, Patrick; Prendke, Mona; Chaoui, Amin; Ocker, Thomas; Chromik, Jonas; Arnrich, Bert; Balzer, Felix; Poncette, Akira-Sebastian (2025). Mappings for "Developing a Scalable Annotation Method for Large Datasets That Enhances Alarms With Actionability Data to Increase Informativeness: Mixed Methods Approach" [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_7511031
    Explore at:
    Dataset updated
    Mar 14, 2025
    Dataset provided by
    Charité – Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt Universität zu Berlin, Department of Anesthesiology and Intensive Care Medicine
    Hasso Plattner Institute, University of Potsdam, Digital Health - Connected Healthcare
    Charité – Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt Universität zu Berlin, Institute of Medical Informatics; Berlin Institute of Health at Charité – Universitätsmedizin Berlin, Core Facility Digital Medicine and Interoperability
    Charité – Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt Universität zu Berlin, Institute of Medical Informatics; Charité – Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt Universität zu Berlin, Department of Anesthesiology and Intensive Care Medicine
    Charité – Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt Universität zu Berlin, Institute of Medical Informatics
    Authors
    Klopfenstein, Sophie Anne Inès; Flint, Anne Rike; Heeren, Patrick; Prendke, Mona; Chaoui, Amin; Ocker, Thomas; Chromik, Jonas; Arnrich, Bert; Balzer, Felix; Poncette, Akira-Sebastian
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Studies have identified false and non-actionable alarms as factors contributing to alarm fatigue in intensive care units.

    To annotate patient alarms and analyse the alarm situation in intensive care units, we conceptualized and performed data mappings related to airway management and medication interventions. The mappings were based on information retrieved from the patient data management system (PDMS) and on clinical expertise. For the airway management mappings, we used additional resources such as ISO 19223:2019 and ventilator instruction manuals. The mappings do not include patient data.

    As the mappings are generic, they could be used in contexts other than alarm annotation and research.

    1. Respiratory Management Mappings:

    General tables 1) summarizing the categories based on ISO 19223:2019 used to describe respiratory support therapies (RSTs), 2) defining the invasiveness level of an RST, and 3) listing the abbreviations used in the mappings

    Tables including PDMS entries for airway devices (ADs), ventilation devices (VDs), and ventilation modes (VMs)

    Mapping of AD entries (from the PDMS) to defined categories

    Mapping of VDs, VMs, and ADs to defined RSTs, including information on invasiveness

    Table specifying suitable ventilation parameters in the context of each RST

    2. Medication Mappings:

    General tables providing information on physiological alarm conditions (PACs), interventions, routes, and techniques of administration of interest

    Mapping of routes of administration to techniques of administration including PDMS entries

    Mapping of active ingredients (including SNOMED CT Fully Specified Names and Identifiers), related PDMS information, and routes and techniques of administration to defined PACs and interventions


