70 datasets found
1. Annotations on COVID-19 state data definitions as of March 7, 2021

    • datadryad.org
    • search.dataone.org
    • +1 more
    zip
    Updated Feb 24, 2022
    Cite
    The COVID Tracking Project at The Atlantic (2022). Annotations on COVID-19 state data definitions as of March 7, 2021 [Dataset]. http://doi.org/10.7272/Q6JD4V1G
    Explore at:
    zip (available download formats)
    Dataset updated
    Feb 24, 2022
    Dataset provided by
    Dryad
    Authors
    The COVID Tracking Project at The Atlantic
    Time period covered
    Feb 14, 2022
    Description

    This dataset was compiled by volunteers with The COVID Tracking Project. As states changed their definitions of testing, outcomes, and hospitalization figures, we updated a centralized database of annotations by state and by metric.

2. Taxonomies for Semantic Research Data Annotation

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 23, 2024
    Cite
    Göpfert, Christoph; Haas, Jan Ingo; Schröder, Lucas; Gaedke, Martin (2024). Taxonomies for Semantic Research Data Annotation [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7908854
    Explore at:
    Dataset updated
    Jul 23, 2024
    Dataset provided by
    Technische Universität Chemnitz
    Authors
    Göpfert, Christoph; Haas, Jan Ingo; Schröder, Lucas; Gaedke, Martin
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains 35 of the 39 taxonomies that resulted from a systematic review. The systematic review was conducted with the goal of identifying taxonomies suitable for semantically annotating research data, with a special focus on research data from the hybrid societies domain.

    The following taxonomies were identified as part of the systematic review:

    Filename            Taxonomy Title
    acm_ccs             ACM Computing Classification System [1]
    amec                A Taxonomy of Evaluation Towards Standards [2]
    bibo                A BIBO Ontology Extension for Evaluation of Scientific Research Results [3]
    cdt                 Cross-Device Taxonomy [4]
    cso                 Computer Science Ontology [5]
    ddbm                What Makes a Data-driven Business Model? A Consolidated Taxonomy [6]
    ddi_am              DDI Aggregation Method [7]
    ddi_moc             DDI Mode of Collection [8]
    n/a                 DemoVoc [9]
    discretization      Building a New Taxonomy for Data Discretization Techniques [10]
    dp                  Demopaedia [11]
    dsg                 Data Science Glossary [12]
    ease                A Taxonomy of Evaluation Approaches in Software Engineering [13]
    eco                 Evidence & Conclusion Ontology [14]
    edam                EDAM: The Bioscientific Data Analysis Ontology [15]
    n/a                 European Language Social Science Thesaurus [16]
    et                  Evaluation Thesaurus [17]
    glos_hci            The Glossary of Human Computer Interaction [18]
    n/a                 Humanities and Social Science Electronic Thesaurus [19]
    hcio                A Core Ontology on the Human-Computer Interaction Phenomenon [20]
    hft                 Human-Factors Taxonomy [21]
    hri                 A Taxonomy to Structure and Analyze Human–Robot Interaction [22]
    iim                 A Taxonomy of Interaction for Instructional Multimedia [23]
    interrogation       A Taxonomy of Interrogation Methods [24]
    iot                 Design Vocabulary for Human–IoT Systems Communication [25]
    kinect              Understanding Movement and Interaction: An Ontology for Kinect-Based 3D Depth Sensors [26]
    maco                Thesaurus Mass Communication [27]
    n/a                 Thesaurus Cognitive Psychology of Human Memory [28]
    mixed_initiative    Mixed-Initiative Human-Robot Interaction: Definition, Taxonomy, and Survey [29]
    qos_qoe             A Taxonomy of Quality of Service and Quality of Experience of Multimodal Human-Machine Interaction [30]
    ro                  The Research Object Ontology [31]
    senses_sensors      A Human-Centered Taxonomy of Interaction Modalities and Devices [32]
    sipat               A Taxonomy of Spatial Interaction Patterns and Techniques [33]
    social_errors       A Taxonomy of Social Errors in Human-Robot Interaction [34]
    sosa                Semantic Sensor Network Ontology [35]
    swo                 The Software Ontology [36]
    tadirah             Taxonomy of Digital Research Activities in the Humanities [37]
    vrs                 Virtual Reality and the CAVE: Taxonomy, Interaction Challenges and Research Directions [38]
    xdi                 Cross-Device Interaction [39]

    We converted the taxonomies into SKOS (Simple Knowledge Organization System) representation. The following four taxonomies were excluded from this dataset because they were already available in SKOS:

    1) DemoVoc, cf. http://thesaurus.web.ined.fr/navigateur/ available at https://thesaurus.web.ined.fr/exports/demovoc/demovoc.rdf

    2) European Language Social Science Thesaurus, cf. https://thesauri.cessda.eu/elsst/en/ available at https://zenodo.org/record/5506929

    3) Humanities and Social Science Electronic Thesaurus, cf. https://hasset.ukdataservice.ac.uk/hasset/en/ available at https://zenodo.org/record/7568355

    4) Thesaurus Cognitive Psychology of Human Memory, cf. https://www.loterre.fr/presentation/ available at https://skosmos.loterre.fr/P66/en/
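The SKOS conversion described above can be sketched in plain Python as RDF-style triples using the SKOS vocabulary. The base namespace, concept ids, and labels below are invented for illustration and do not come from this dataset:

```python
# SKOS and RDF vocabulary IRIs (standard), plus a hypothetical base
# namespace for an example taxonomy; the dataset's actual URIs differ.
SKOS = "http://www.w3.org/2004/02/skos/core#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
BASE = "http://example.org/taxonomy/"

def to_skos_triples(scheme_id, concepts):
    """Convert a flat taxonomy into SKOS triples.

    concepts maps a concept id to (preferred label, parent id or None).
    Returns (subject, predicate, object) triples; a real conversion
    would serialize these to RDF, e.g. with rdflib.
    """
    scheme = BASE + scheme_id
    triples = [(scheme, RDF_TYPE, SKOS + "ConceptScheme")]
    for cid, (label, parent) in concepts.items():
        concept = BASE + cid
        triples.append((concept, RDF_TYPE, SKOS + "Concept"))
        triples.append((concept, SKOS + "prefLabel", label))
        triples.append((concept, SKOS + "inScheme", scheme))
        if parent is not None:
            # skos:broader links a narrower concept to its broader parent.
            triples.append((concept, SKOS + "broader", BASE + parent))
    return triples

triples = to_skos_triples("hci", {
    "interaction": ("Interaction", None),
    "touch": ("Touch interaction", "interaction"),
})
```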

    References

    [1] “The 2012 ACM Computing Classification System,” ACM Digital Library, 2012. https://dl.acm.org/ccs (accessed May 08, 2023).

    [2] AMEC, “A Taxonomy of Evaluation Towards Standards.” Aug. 31, 2016. Accessed: May 08, 2023. [Online]. Available: https://amecorg.com/amecframework/home/supporting-material/taxonomy/

    [3] B. Dimić Surla, M. Segedinac, and D. Ivanović, “A BIBO ontology extension for evaluation of scientific research results,” in Proceedings of the Fifth Balkan Conference in Informatics, in BCI ’12. New York, NY, USA: Association for Computing Machinery, Sep. 2012, pp. 275–278. doi: 10.1145/2371316.2371376.

    [4] F. Brudy et al., “Cross-Device Taxonomy: Survey, Opportunities and Challenges of Interactions Spanning Across Multiple Devices,” in Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, in CHI ’19. New York, NY, USA: Association for Computing Machinery, May 2019, pp. 1–28. doi: 10.1145/3290605.3300792.

    [5] A. A. Salatino, T. Thanapalasingam, A. Mannocci, F. Osborne, and E. Motta, “The Computer Science Ontology: A Large-Scale Taxonomy of Research Areas,” in Lecture Notes in Computer Science 1137, D. Vrandečić, K. Bontcheva, M. C. Suárez-Figueroa, V. Presutti, I. Celino, M. Sabou, L.-A. Kaffee, and E. Simperl, Eds., Monterey, California, USA: Springer, Oct. 2018, pp. 187–205. Accessed: May 08, 2023. [Online]. Available: http://oro.open.ac.uk/55484/

    [6] M. Dehnert, A. Gleiss, and F. Reiss, “What makes a data-driven business model? A consolidated taxonomy,” presented at the European Conference on Information Systems, 2021.

    [7] DDI Alliance, “DDI Controlled Vocabulary for Aggregation Method,” 2014. https://ddialliance.org/Specification/DDI-CV/AggregationMethod_1.0.html (accessed May 08, 2023).

    [8] DDI Alliance, “DDI Controlled Vocabulary for Mode Of Collection,” 2015. https://ddialliance.org/Specification/DDI-CV/ModeOfCollection_2.0.html (accessed May 08, 2023).

    [9] INED - French Institute for Demographic Studies, “Thésaurus DemoVoc,” Feb. 26, 2020. https://thesaurus.web.ined.fr/navigateur/en/about (accessed May 08, 2023).

    [10] A. A. Bakar, Z. A. Othman, and N. L. M. Shuib, “Building a new taxonomy for data discretization techniques,” in 2009 2nd Conference on Data Mining and Optimization, Oct. 2009, pp. 132–140. doi: 10.1109/DMO.2009.5341896.

    [11] N. Brouard and C. Giudici, “Unified second edition of the Multilingual Demographic Dictionary (Demopaedia.org project),” presented at the 2017 International Population Conference, IUSSP, Oct. 2017. Accessed: May 08, 2023. [Online]. Available: https://iussp.confex.com/iussp/ipc2017/meetingapp.cgi/Paper/5713

    [12] B. DuCharme, “Data Science Glossary.” https://www.datascienceglossary.org/ (accessed May 08, 2023).

    [13] A. Chatzigeorgiou, T. Chaikalis, G. Paschalidou, N. Vesyropoulos, C. K. Georgiadis, and E. Stiakakis, “A Taxonomy of Evaluation Approaches in Software Engineering,” in Proceedings of the 7th Balkan Conference on Informatics Conference, in BCI ’15. New York, NY, USA: Association for Computing Machinery, Sep. 2015, pp. 1–8. doi: 10.1145/2801081.2801084.

    [14] M. C. Chibucos, D. A. Siegele, J. C. Hu, and M. Giglio, “The Evidence and Conclusion Ontology (ECO): Supporting GO Annotations,” in The Gene Ontology Handbook, C. Dessimoz and N. Škunca, Eds., in Methods in Molecular Biology. New York, NY: Springer, 2017, pp. 245–259. doi: 10.1007/978-1-4939-3743-1_18.

    [15] M. Black et al., “EDAM: the bioscientific data analysis ontology,” F1000Research, vol. 11, Jan. 2021, doi: 10.7490/f1000research.1118900.1.

    [16] Council of European Social Science Data Archives (CESSDA), “European Language Social Science Thesaurus ELSST,” 2021. https://thesauri.cessda.eu/en/ (accessed May 08, 2023).

    [17] M. Scriven, Evaluation Thesaurus, 3rd Edition. Edgepress, 1981. Accessed: May 08, 2023. [Online]. Available: https://us.sagepub.com/en-us/nam/evaluation-thesaurus/book3562

    [18] B. Papantoniou et al., The Glossary of Human Computer Interaction. Interaction Design Foundation. Accessed: May 08, 2023. [Online]. Available: https://www.interaction-design.org/literature/book/the-glossary-of-human-computer-interaction

    [19] “UK Data Service Vocabularies: HASSET Thesaurus.” https://hasset.ukdataservice.ac.uk/hasset/en/ (accessed May 08, 2023).

    [20] S. D. Costa, M. P. Barcellos, R. de A. Falbo, T. Conte, and K. M. de Oliveira, “A core ontology on the Human–Computer Interaction phenomenon,” Data Knowl. Eng., vol. 138, p. 101977, Mar. 2022, doi: 10.1016/j.datak.2021.101977.

    [21] V. J. Gawron et al., “Human Factors Taxonomy,” Proc. Hum. Factors Soc. Annu. Meet., vol. 35, no. 18, pp. 1284–1287, Sep. 1991, doi: 10.1177/154193129103501807.

    [22] L. Onnasch and E. Roesler, “A Taxonomy to Structure and Analyze Human–Robot Interaction,” Int. J. Soc. Robot., vol. 13, no. 4, pp. 833–849, Jul. 2021, doi: 10.1007/s12369-020-00666-5.

    [23] R. A. Schwier, “A Taxonomy of Interaction for Instructional Multimedia.” Sep. 28, 1992. Accessed: May 09, 2023. [Online]. Available: https://eric.ed.gov/?id=ED352044

    [24] C. Kelly, J. Miller, A. Redlich, and S. Kleinman, “A Taxonomy of Interrogation Methods,”

3. Data from: TermFrame: Terms, definitions and semantic annotations for...

    • live.european-language-grid.eu
    binary format
    Updated Nov 17, 2021
    Cite
    (2021). TermFrame: Terms, definitions and semantic annotations for karstology [Dataset]. https://live.european-language-grid.eu/catalogue/lcr/20243
    Explore at:
    binary format (available download formats)
    Dataset updated
    Nov 17, 2021
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The resource comprises several datasets of domain-specific data in three languages (English, Slovenian and Croatian), which can be used for various knowledge extraction or knowledge modelling tasks. It represents knowledge for the domain of karstology, a subfield of geography studying karst and related phenomena. It contains:

    1. Definitions: Plain-text files contain definitions of karst concepts from relevant glossaries and encyclopaedias, as well as definitions extracted from domain-specific corpora.

    2. Annotated definitions: Definitions were manually annotated and curated in the WebAnno tool. The annotations span several layers, including definition elements, semantic relations following frame-based terminology (FBT) theory, relation definitors that can be used for learning relation patterns, and semantic categories defined in the domain model.

    3. Terms, definitions and sources: The TermFrame knowledge base contains terms with their corresponding concept identifiers, definitions and definition sources.

4. The Semantic Data Dictionary – An Approach for Describing and Annotating...

    • scidb.cn
    Updated Oct 17, 2020
    Cite
    Sabbir M. Rashid; James P. McCusker; Paulo Pinheiro; Marcello P. Bax; Henrique Santos; Jeanette A. Stingone; Amar K. Das; Deborah L. McGuinness (2020). The Semantic Data Dictionary – An Approach for Describing and Annotating Data [Dataset]. http://doi.org/10.11922/sciencedb.j00104.00060
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 17, 2020
    Dataset provided by
    Science Data Bank
    Authors
    Sabbir M. Rashid; James P. McCusker; Paulo Pinheiro; Marcello P. Bax; Henrique Santos; Jeanette A. Stingone; Amar K. Das; Deborah L. McGuinness
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains the 17 tables and two figures of this paper.

    Table 1: a subset of explicit entries identified in NHANES demographics data.
    Table 2: a subset of implicit entries identified in NHANES demographics data.
    Table 3: a subset of NHANES demographic Codebook entries.
    Table 4: a subset of explicit entries identified in SEER.
    Table 5: a subset of the Dictionary Mapping for the MIMIC-III Admission table.
    Table 6: a high-level comparison of semantic data dictionaries, traditional data dictionaries, approaches involving mapping languages, and general data integration tools.
    Table A1: namespace prefixes and IRIs for relevant ontologies.
    Table B1: Infosheet specification.
    Table B2: Infosheet metadata supplement.
    Table B3: Dictionary Mapping specification.
    Table B4: Codebook specification.
    Table B5: Timeline specification.
    Table B6: Properties specification.
    Table C1: NHANES demographics Infosheet.
    Table C2: NHANES demographic implicit entries.
    Table C3: NHANES demographic explicit entries.
    Table C4: expanded NHANES demographic Codebook entries.

    Figure 1: a conceptual diagram of the Dictionary Mapping, which allows for a representation model that aligns with existing scientific ontologies. The Dictionary Mapping is used to create a semantic representation of data columns. Each box, along with the “Relation” label, corresponds to a column in the Dictionary Mapping table. Blue rounded boxes correspond to columns that contain resource URIs, while white boxes refer to entities that are generated on a per-row/column basis. If there is no Codebook for the column, the actual cell value in concrete columns is mapped to the “has value” object of the column object, which is generally either an attribute or an entity.

    Figure 2: (a) a conceptual diagram of the Codebook, which can be used to assign ontology classes to categorical concepts. Unlike other mapping approaches, the use of the Codebook allows for the annotation of cell values, rather than just columns. (b) A conceptual diagram of the Timeline, which can be used to represent complex time-associated concepts, such as time intervals.
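The Codebook idea of Figure 2(a), mapping categorical cell values rather than just columns to ontology classes, can be sketched as a simple lookup. The NHANES-style column name below is real, but the target URIs are hypothetical placeholders, not the paper's actual mappings:

```python
# Sketch of a Codebook lookup: each (column, categorical value) pair is
# mapped to an ontology class IRI. The target URIs here are hypothetical.
codebook = {
    ("RIAGENDR", "1"): "http://example.org/ontology/Male",
    ("RIAGENDR", "2"): "http://example.org/ontology/Female",
}

def annotate_cell(column, value):
    """Return the ontology class IRI for a categorical cell value, or None."""
    return codebook.get((column, str(value)))

def annotate_column(column, values):
    """Annotate every cell of a column, keeping unmapped values as None."""
    return [annotate_cell(column, v) for v in values]
```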

5. ActiveHuman Part 2

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    Updated Nov 14, 2023
    Cite
    Charalampos Georgiadis (2023). ActiveHuman Part 2 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8361113
    Explore at:
    Dataset updated
    Nov 14, 2023
    Dataset provided by
    Aristotle University of Thessaloniki (AUTh)
    Authors
    Charalampos Georgiadis
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is Part 2/2 of the ActiveHuman dataset! Part 1 can be found here.

    Dataset Description

    ActiveHuman was generated using Unity's Perception package. It consists of 175,428 RGB images and their semantic segmentation counterparts, captured in different environments, lighting conditions, camera distances and angles. In total, the dataset contains images for 8 environments, 33 humans, 4 lighting conditions, 7 camera distances (1m-4m) and 36 camera angles (0-360 degrees at 10-degree intervals). The dataset does not include images for every single combination of available camera distances and angles, since for some values the camera would collide with another object or go outside the confines of an environment. As a result, some combinations of camera distances and angles do not exist in the dataset. Alongside each image, 2D Bounding Box, 3D Bounding Box and Keypoint ground truth annotations are also generated via Labelers and are stored as a JSON-based dataset. These Labelers are scripts responsible for capturing ground truth annotations for each captured image or frame. Keypoint annotations follow the COCO format defined by the COCO keypoint annotation template offered in the Perception package.

    Folder configuration

    The dataset consists of three folders:

    JSON Data: contains all the generated JSON files.
    RGB Images: contains the generated RGB images.
    Semantic Segmentation Images: contains the generated semantic segmentation images.

    Essential Terminology

    Annotation: Recorded data describing a single capture.
    Capture: One completed rendering process of a Unity sensor which stored the rendered result to data files (e.g., PNG, JPG).
    Ego: Object or person to which a collection of sensors is attached (e.g., if a drone has a camera attached to it, the drone is the ego and the camera is the sensor).
    Ego coordinate system: Coordinates with respect to the ego.
    Global coordinate system: Coordinates with respect to the global origin in Unity.
    Sensor: Device that captures the dataset (in this instance the sensor is a camera).
    Sensor coordinate system: Coordinates with respect to the sensor.
    Sequence: Time-ordered series of captures. This is very useful for video capture, where the time-order relationship of two captures is vital.
    UUID: Universally Unique Identifier. A unique hexadecimal identifier that can represent an individual instance of a capture, ego, sensor, annotation, labeled object or keypoint, or keypoint template.

    Dataset Data

    The dataset includes four types of JSON annotation files:

    annotation_definitions.json: Contains annotation definitions for all of the active Labelers of the simulation stored in an array. Each entry consists of a collection of key-value pairs which describe a particular type of annotation and contain information about that specific annotation describing how its data should be mapped back to labels or objects in the scene. Each entry contains the following key-value pairs:

    id: Integer identifier of the annotation's definition.
    name: Annotation name (e.g., keypoints, bounding box, bounding box 3D, semantic segmentation).
    description: Description of the annotation's specifications.
    format: Format of the file containing the annotation specifications (e.g., JSON, PNG).
    spec: Format-specific specifications for the annotation values generated by each Labeler.

    Most Labelers generate different annotation specifications in the spec key-value pair:

    BoundingBox2DLabeler/BoundingBox3DLabeler:

    label_id: Integer identifier of a label.
    label_name: String identifier of a label.

    KeypointLabeler:

    template_id: Keypoint template UUID.
    template_name: Name of the keypoint template.
    key_points: Array containing all the joints defined by the keypoint template. This array includes the key-value pairs:
        label: Joint label.
        index: Joint index.
        color: RGBA values of the keypoint.
        color_code: Hex color code of the keypoint.
    skeleton: Array containing all the skeleton connections defined by the keypoint template. Each skeleton connection defines a connection between two different joints. This array includes the key-value pairs:
        label1: Label of the first joint.
        label2: Label of the second joint.
        joint1: Index of the first joint.
        joint2: Index of the second joint.
        color: RGBA values of the connection.
        color_code: Hex color code of the connection.

    SemanticSegmentationLabeler:

    label_name: String identifier of a label.
    pixel_value: RGBA values of the label.
    color_code: Hex color code of the label.

    captures_xyz.json: Each of these files contains an array of ground truth annotations generated by each active Labeler for each capture separately, as well as extra metadata that describes the state of each active sensor present in the scene. Each array entry contains the following key-value pairs:

    id: UUID of the capture.
    sequence_id: UUID of the sequence.
    step: Index of the capture within a sequence.
    timestamp: Timestamp (in ms) since the beginning of a sequence.
    sensor: Properties of the sensor. This entry contains a collection with the following key-value pairs:
        sensor_id: Sensor UUID.
        ego_id: Ego UUID.
        modality: Modality of the sensor (e.g., camera, radar).
        translation: 3D vector that describes the sensor's position (in meters) with respect to the global coordinate system.
        rotation: Quaternion variable that describes the sensor's orientation with respect to the ego coordinate system.
        camera_intrinsic: Matrix containing (if it exists) the camera's intrinsic calibration.
        projection: Projection type used by the camera (e.g., orthographic, perspective).
    ego: Attributes of the ego. This entry contains a collection with the following key-value pairs:
        ego_id: Ego UUID.
        translation: 3D vector that describes the ego's position (in meters) with respect to the global coordinate system.
        rotation: Quaternion variable containing the ego's orientation.
        velocity: 3D vector containing the ego's velocity (in meters per second).
        acceleration: 3D vector containing the ego's acceleration (in m/s²).
    format: Format of the file captured by the sensor (e.g., PNG, JPG).
    annotations: Key-value pair collections, one for each active Labeler. These key-value pairs are as follows:
        id: Annotation UUID.
        annotation_definition: Integer identifier of the annotation's definition.
        filename: Name of the file generated by the Labeler. This entry is only present for Labelers that generate an image.
        values: List of key-value pairs containing annotation data for the current Labeler.

    Each Labeler generates different annotation specifications in the values key-value pair:

    BoundingBox2DLabeler:

    label_id: Integer identifier of a label.
    label_name: String identifier of a label.
    instance_id: UUID of one instance of an object. Each object with the same label that is visible in the same capture has a different instance_id value.
    x: Position of the 2D bounding box on the X axis.
    y: Position of the 2D bounding box on the Y axis.
    width: Width of the 2D bounding box.
    height: Height of the 2D bounding box.

    BoundingBox3DLabeler:

    label_id: Integer identifier of a label.
    label_name: String identifier of a label.
    instance_id: UUID of one instance of an object. Each object with the same label that is visible in the same capture has a different instance_id value.
    translation: 3D vector containing the location of the center of the 3D bounding box with respect to the sensor coordinate system (in meters).
    size: 3D vector containing the size of the 3D bounding box (in meters).
    rotation: Quaternion variable containing the orientation of the 3D bounding box.
    velocity: 3D vector containing the velocity of the 3D bounding box (in meters per second).
    acceleration: 3D vector containing the acceleration of the 3D bounding box (in m/s²).

    KeypointLabeler:

    label_id: Integer identifier of a label.
    instance_id: UUID of one instance of a joint. Keypoints with the same joint label that are visible in the same capture have different instance_id values.
    template_id: UUID of the keypoint template.
    pose: Pose label for that particular capture.
    keypoints: Array containing the properties of each keypoint. Each keypoint that exists in the keypoint template file is one element of the array. Each entry contains:
        index: Index of the keypoint in the keypoint template file.
        x: Pixel coordinate of the keypoint on the X axis.
        y: Pixel coordinate of the keypoint on the Y axis.
        state: State of the keypoint.

    The SemanticSegmentationLabeler does not contain a values list.

    egos.json: Contains collections of key-value pairs for each ego. These include:

        id: UUID of the ego.
        description: Description of the ego.

    sensors.json: Contains collections of key-value pairs for all sensors of the simulation. These include:

        id: UUID of the sensor.
        ego_id: UUID of the ego to which the sensor is attached.
        modality: Modality of the sensor (e.g., camera, radar, sonar).
        description: Description of the sensor.
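As a sketch of how the captures files described above might be consumed, the function below collects 2D bounding boxes from one parsed file. It assumes a top-level "captures" array and the field names listed in the description; it is an illustration, not the official Perception-package reader:

```python
import json

def extract_boxes(data):
    """Collect 2D bounding boxes from parsed captures_xyz.json content.

    Assumes a top-level "captures" array and annotation "values" entries
    carrying x, y, width, height as described above (a sketch only).
    """
    boxes = []
    for capture in data.get("captures", []):
        for ann in capture.get("annotations", []):
            # SemanticSegmentationLabeler entries have no values list.
            for v in ann.get("values") or []:
                if {"x", "y", "width", "height"} <= v.keys():
                    boxes.append((capture["id"], v.get("label_name"),
                                  v["x"], v["y"], v["width"], v["height"]))
    return boxes

def load_bounding_boxes(path):
    """Read one captures_xyz.json file and return its 2D bounding boxes."""
    with open(path) as f:
        return extract_boxes(json.load(f))
```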

    Image names

    The RGB and semantic segmentation images share the same naming convention, except that semantic segmentation filenames additionally begin with the string Semantic_. Each RGB image is named "e_h_l_d_r.jpg", where:

    e denotes the id of the environment.
    h denotes the id of the person.
    l denotes the id of the lighting condition.
    d denotes the camera distance at which the image was captured.
    r denotes the camera angle at which the image was captured.
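A minimal parser for this naming scheme might look as follows. Treating d as a float and the other components as integers, and accepting both .jpg and .png extensions, are assumptions about the concrete filenames:

```python
import re

# Parses "e_h_l_d_r.jpg" (optionally prefixed with "Semantic_") into its
# five components. Value types are an assumption about the real files.
NAME_RE = re.compile(
    r"^(?P<semantic>Semantic_)?"
    r"(?P<e>\d+)_(?P<h>\d+)_(?P<l>\d+)_(?P<d>[\d.]+)_(?P<r>\d+)\.(?:jpg|png)$"
)

def parse_image_name(filename):
    """Split an ActiveHuman-style image name into its labeled components."""
    m = NAME_RE.match(filename)
    if m is None:
        raise ValueError(f"unexpected image name: {filename}")
    return {
        "semantic": m.group("semantic") is not None,
        "environment": int(m.group("e")),
        "human": int(m.group("h")),
        "lighting": int(m.group("l")),
        "distance": float(m.group("d")),
        "angle": int(m.group("r")),
    }
```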

  6. News Articles Defined by annotating and crediblity

    • kaggle.com
    zip
    Updated Jun 22, 2020
    Cite
    Kannan.K.R (2020). News Articles Defined by annotating and crediblity [Dataset]. https://www.kaggle.com/imkrkannan/news-articles-defined-by-annotating-and-crediblity
    Explore at:
    zip (108,576 bytes; available download formats)
    Dataset updated
    Jun 22, 2020
    Authors
    Kannan.K.R
    Description

    Context

    There's a story behind every dataset and here's your opportunity to share yours.

    This study is based on a previous study: Amy Zhang, Aditya Ranganathan, Sarah Emlen Metz, Scott Appling, Connie Moon Sehat, Norman Gilmore, Nick B. Adams, Emmanuel Vincent, Jennifer 8. Lee, Martin Robbins, Ed Bice, Sandro Hawke, David Karger, and An Xiao Mina. A Structured Response to Misinformation: Defining and Annotating Credibility Indicators in News Articles. The Web Conference, April 2018 (available here; dataset here )

    A number of the questions remain the same from the study ("WebConf 2018"), with several modifications based on the earlier results to elicit more information related to indicators (https://credweb.org/cciv/).

    The study has been broken up into 2 major parts:

    1) new annotators reviewing the same articles from the original WebConf 2018 study, and 2) new annotators reviewing new articles.

  7. Qualitative analysis of manual annotations of clinical text with SNOMED CT

    • plos.figshare.com
    pdf
    Updated Jun 1, 2023
    Cite
    Jose Antonio Miñarro-Giménez; Catalina Martínez-Costa; Daniel Karlsson; Stefan Schulz; Kirstine Rosenbeck Gøeg (2023). Qualitative analysis of manual annotations of clinical text with SNOMED CT [Dataset]. http://doi.org/10.1371/journal.pone.0209547
    Explore at:
    pdf (available download formats)
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Jose Antonio Miñarro-Giménez; Catalina Martínez-Costa; Daniel Karlsson; Stefan Schulz; Kirstine Rosenbeck Gøeg
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    SNOMED CT provides about 300,000 codes with fine-grained concept definitions to support interoperability of health data. Coding clinical texts with medical terminologies is not a trivial task and is prone to disagreements between coders. We conducted a qualitative analysis to identify sources of disagreement in an annotation experiment which used a subset of SNOMED CT with some restrictions. A corpus of 20 English clinical text fragments from diverse origins and languages was annotated independently by two medically trained annotators following a specific annotation guideline. Following this guideline, the annotators had to assign sets of SNOMED CT codes to noun phrases, together with concept and term coverage ratings. The annotations were then manually examined against a reference standard to determine sources of disagreement. Five categories were identified. In our results, the most frequent cause of inter-annotator disagreement was related to human issues. In several cases, disagreements revealed gaps in the annotation guidelines and a lack of annotator training. The remaining issues can be influenced by certain SNOMED CT features.
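As an illustration of the kind of agreement measurement involved when two annotators assign code sets to the same noun phrases (the study itself performed a qualitative comparison against a reference standard, not this metric), per-phrase set overlap could be computed like this; phrase and code names below are placeholders:

```python
def jaccard(a, b):
    """Set overlap between two annotators' code sets for one phrase."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def phrase_agreement(ann1, ann2):
    """Mean per-phrase Jaccard overlap; ann1/ann2 map phrase -> code set.

    A simple sketch of inter-annotator agreement on code sets, not the
    method used in the paper referenced above.
    """
    phrases = ann1.keys() & ann2.keys()
    return sum(jaccard(ann1[p], ann2[p]) for p in phrases) / len(phrases)
```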

8. Data from: Veneer Is a Webtool for Rapid, Standardized, and Transparent...

    • figshare.com
    • datasetcatalog.nlm.nih.gov
    • +1 more
    xlsx
    Updated Feb 27, 2024
    Cite
    Linda Berg Luecke; Roneldine Mesidor; Jack Littrell; Morgan Carpenter; Melinda Wojtkiewicz; Rebekah L. Gundry (2024). Veneer Is a Webtool for Rapid, Standardized, and Transparent Interpretation, Annotation, and Reporting of Mammalian Cell Surface N‑Glycocapture Data [Dataset]. http://doi.org/10.1021/acs.jproteome.3c00800.s002
    Explore at:
    xlsx (available download formats)
    Dataset updated
    Feb 27, 2024
    Dataset provided by
    ACS Publications
    Authors
    Linda Berg Luecke; Roneldine Mesidor; Jack Littrell; Morgan Carpenter; Melinda Wojtkiewicz; Rebekah L. Gundry
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Currently, no consensus exists regarding the criteria required to designate a protein within a proteomic data set as a cell surface protein. Most published proteomic studies rely on varied ontology annotations or computational predictions instead of experimental evidence when attributing protein localization. Consequently, standardized approaches for analyzing and reporting cell surface proteome data sets would increase confidence in localization claims and promote data use by other researchers. Recently, we developed Veneer, a web-based bioinformatic tool that analyzes results from cell surface N-glycocapture workflows, the most popular cell surface proteomics method used to date that generates experimental evidence of subcellular location. Veneer assigns protein localization based on defined experimental and bioinformatic evidence. In this study, we updated the criteria and process for assigning protein localization and added new functionality to Veneer. Results of Veneer analysis of 587 cell surface N-glycocapture data sets from 32 published studies demonstrate the importance of applying defined criteria when analyzing cell surface proteomics data sets and exemplify how Veneer can be used to assess experimental quality and facilitate data extraction for informing future biological studies and annotating public repositories.

  9. Z

    DWUG DE Sense: A data set of historical word sense annotations in German

    • data.niaid.nih.gov
    Updated Nov 5, 2024
    Cite
    Schlechtweg, Dominik (2024). DWUG DE Sense: A data set of historical word sense annotations in German [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8197552
    Explore at:
    Dataset updated
    Nov 5, 2024
    Dataset provided by
    University of Stuttgart
    Authors
    Schlechtweg, Dominik
    License

    Attribution-NoDerivs 4.0 (CC BY-ND 4.0)https://creativecommons.org/licenses/by-nd/4.0/
    License information was derived automatically

    Description

    This data collection contains a subset of DWUG DE word usage data annotated with classical word sense definitions (DWUG DE Sense, see data/*/judgments_senses.csv). From these annotations aggregated and cleaned sense labels were derived (labels/*/labels_senses.csv). From these labels we derived additional binary semantic proximity labels between use pairs ('0' for different sense, '1' for same sense, labels/*/labels_proximity.csv) and change labels reflecting sense changes between the two time periods from which word usages were sampled (stats/*/stats_groupings.csv).

The sense labels were derived from the sense annotation by removing instances where fewer than 2/3 of the annotators agree on the label (maj_2/maj_3). Note that the binary proximity labels were derived from the sense annotation, and not directly judged by humans (in contrast to other WUG data sets). Note that, consequently, the change scores EARLIER, LATER and COMPARE were also not calculated directly from human judgments, but from the inferred binary proximity labels. Please find the code aggregating and cleaning the data, deriving proximity labels and deriving change labels in the WUG repository.
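The derivation steps described above can be sketched as follows (a minimal sketch; the real aggregation code lives in the WUG repository, and all names here are hypothetical):

```python
from itertools import combinations

def majority_label(judgments, threshold=2 / 3):
    """Keep a sense label only if at least `threshold` of annotators agree."""
    if not judgments:
        return None
    top = max(set(judgments), key=judgments.count)
    return top if judgments.count(top) / len(judgments) >= threshold else None

def proximity_labels(sense_labels):
    """Derive binary proximity labels (1 = same sense, 0 = different sense)
    for every pair of uses that received a clean sense label."""
    labeled = {use: s for use, s in sense_labels.items() if s is not None}
    return {
        (u1, u2): 1 if labeled[u1] == labeled[u2] else 0
        for u1, u2 in combinations(sorted(labeled), 2)
    }

# Hypothetical uses: use4 had no majority label, so it is dropped.
labels = {"use1": "sense_a", "use2": "sense_a", "use3": "sense_b", "use4": None}
pairs = proximity_labels(labels)
# pairs[("use1", "use2")] == 1; pairs[("use1", "use3")] == 0
```

Change scores such as COMPARE would then be computed over these inferred pairs rather than over direct human judgments.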

    Please find more information on the provided data in the paper referenced below.

Version: 1.0.1, 01.11.2024. Corrects or removes some normalization and lemmatization errors in the uses. Updates references.

    Reference

    Dominik Schlechtweg, Frank D. Zamora-Reina, Felipe Bravo-Marquez, Nikolay Arefyev. 2024. Sense Through Time: Diachronic Word Sense Annotations for Word Sense Induction and Lexical Semantic Change Detection. Language Resources and Evaluation.

    Dominik Schlechtweg. 2023. Human and Computational Measurement of Lexical Semantic Change. PhD thesis. University of Stuttgart.

  10. c

    Data from: Parallel sense-annotated corpus ELEXIS-WSD 1.3

    • clarin.si
    • live.european-language-grid.eu
    Updated Jul 30, 2022
    + more versions
    Cite
    Jaka Čibej; Simon Krek; Carole Tiberius; Federico Martelli; Roberto Navigli; Jelena Kallas; Polona Gantar; Svetla Koeva; Sanni Nimb; Bolette Sandford Pedersen; Sussi Olsen; Margit Langemets; Kristina Koppel; Tiiu Üksik; Kaja Dobrovoljc; Rafael Ureña-Ruiz; José-Luis Sancho-Sánchez; Veronika Lipp; Tamás Váradi; András Győrffy; László Simon; Valeria Quochi; Monica Monachini; Francesca Frontini; Rob Tempelaars; Rute Costa; Ana Salgado; Tina Munda; Iztok Kosem; Rebeka Roblek; Urška Kamenšek; Petra Zaranšek; Karolina Zgaga; Primož Ponikvar; Luka Terčon; Jonas Jensen; Ida Flörke; Henrik Lorentzen; Thomas Troelsgård; Diana Blagoeva; Dimitar Hristov; Sia Kolkovska (2022). Parallel sense-annotated corpus ELEXIS-WSD 1.3 [Dataset]. https://clarin.si/repository/xmlui/handle/11356/2029?show=full
    Explore at:
    Dataset updated
    Jul 30, 2022
    Authors
    Jaka Čibej; Simon Krek; Carole Tiberius; Federico Martelli; Roberto Navigli; Jelena Kallas; Polona Gantar; Svetla Koeva; Sanni Nimb; Bolette Sandford Pedersen; Sussi Olsen; Margit Langemets; Kristina Koppel; Tiiu Üksik; Kaja Dobrovoljc; Rafael Ureña-Ruiz; José-Luis Sancho-Sánchez; Veronika Lipp; Tamás Váradi; András Győrffy; László Simon; Valeria Quochi; Monica Monachini; Francesca Frontini; Rob Tempelaars; Rute Costa; Ana Salgado; Tina Munda; Iztok Kosem; Rebeka Roblek; Urška Kamenšek; Petra Zaranšek; Karolina Zgaga; Primož Ponikvar; Luka Terčon; Jonas Jensen; Ida Flörke; Henrik Lorentzen; Thomas Troelsgård; Diana Blagoeva; Dimitar Hristov; Sia Kolkovska
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    ELEXIS-WSD is a parallel sense-annotated corpus in which content words (nouns, adjectives, verbs, and adverbs) have been assigned senses. Version 1.3 contains sentences for 10 languages: Bulgarian, Danish, English, Spanish, Estonian, Hungarian, Italian, Dutch, Portuguese, and Slovene.

The corpus was compiled by automatically extracting a set of sentences from WikiMatrix (Schwenk et al., 2019), a large open-access collection of parallel sentences derived from Wikipedia, using an automatic approach based on multilingual sentence embeddings. The sentences were manually validated according to specific formal, lexical and semantic criteria (e.g. by removing incorrect punctuation, morphological errors, notes in square brackets and etymological information typically provided in Wikipedia pages). To obtain a satisfactory semantic coverage, we filtered out sentences with fewer than 5 words and fewer than 2 polysemous words. Subsequently, in order to obtain datasets in the other nine target languages, for each selected sentence in English, the corresponding WikiMatrix translation into each of the other languages was retrieved. If no translation was available, the English sentence was translated manually. The resulting corpus comprises 2,024 sentences for each language.
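The length and polysemy filter described above amounts to a simple predicate (a sketch; the helper name and the set-of-polysemous-words representation are assumptions):

```python
def keep_sentence(words, polysemous_vocab):
    """Keep a sentence only if it has at least 5 words and at least
    2 of them are polysemous (per the selected sense inventory)."""
    n_polysemous = sum(1 for w in words if w in polysemous_vocab)
    return len(words) >= 5 and n_polysemous >= 2

# Hypothetical example: "bank" and "spring" are polysemous.
poly = {"bank", "spring"}
# keep_sentence(["the", "bank", "by", "the", "spring"], poly) -> True
# keep_sentence(["bank", "spring"], poly) -> False (too short)
```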

    The sentences were tokenized, lemmatized, and tagged with UPOS tags using UDPipe v2.6 (https://lindat.mff.cuni.cz/services/udpipe/). Senses were annotated using LexTag (https://elexis.babelscape.com/): each content word (noun, verb, adjective, and adverb) was assigned a sense from among the available senses from the sense inventory selected for the language (see below) or BabelNet. Sense inventories were also updated with new senses during annotation. Dependency relations were added with UDPipe 2.15 in version 1.2.

List of sense inventories:
BG: Dictionary of Bulgarian
DA: DanNet – The Danish WordNet
EN: Open English WordNet
ES: Spanish Wiktionary
ET: The EKI Combined Dictionary of Estonian
HU: The Explanatory Dictionary of the Hungarian Language
IT: PSC + Italian WordNet
NL: Open Dutch WordNet
PT: Portuguese Academy Dictionary (DACL)
SL: Digital Dictionary Database of Slovene

    The corpus is available in the CoNLL-U tab-separated format. In order, the columns contain the token ID, its form, its lemma, its UPOS-tag, its XPOS-tag (if available), its morphological features (FEATS), the head of the dependency relation (HEAD), the type of dependency relation (DEPREL); the ninth column (DEPS) is empty; the final MISC column contains the following: the token's whitespace information (whether the token is followed by a whitespace or not; e.g. SpaceAfter=No), the ID of the sense assigned to the token, the index of the multiword expression (if the token is part of an annotated multiword expression), and the index and type of the named entity annotation (currently only available in elexis-wsd-sl and elexis-wsd-en).
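Reading one such token line can be sketched as follows; the ten columns are standard CoNLL-U, but the MISC key used for the sense ID below (`SenseID`) is a placeholder, so check 00README.txt for the actual key names:

```python
def parse_token_line(line):
    """Split a CoNLL-U token line into its 10 tab-separated columns and
    decode the MISC column into a key/value dict."""
    cols = line.rstrip("\n").split("\t")
    misc = dict(kv.split("=", 1) for kv in cols[9].split("|") if "=" in kv)
    return {
        "id": cols[0], "form": cols[1], "lemma": cols[2], "upos": cols[3],
        "space_after": misc.get("SpaceAfter", "Yes"),
        "sense": misc.get("SenseID"),  # placeholder key name, see 00README.txt
    }

line = "1\tbanks\tbank\tNOUN\t_\t_\t2\tnsubj\t_\tSpaceAfter=No|SenseID=bank-1"
tok = parse_token_line(line)
# tok["lemma"] == "bank", tok["sense"] == "bank-1"
```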

    Each language has a separate sense inventory containing all the senses (and their definitions) used for annotation in the corpus. Not all the senses from the sense inventory are necessarily included in the corpus annotations: for instance, all occurrences of the English noun "bank" in the corpus might be annotated with the sense of "financial institution", but the sense inventory also contains the sense "edge of a river" as well as all other possible senses to disambiguate between.

    For more information, please refer to 00README.txt.

Updates in version 1.3:
- A handful of token ID issues were corrected in ELEXIS-WSD-sl. In addition, lemmas were corrected according to the version of ELEXIS-WSD-sl included in the SUK 1.1 Training Corpus of Slovene (http://hdl.handle.net/11356/1959).
- Named entity annotations and named entity core concept annotations were added to ELEXIS-WSD-en.
- For all languages, missing UPOS tags were added for non-content words.

  11. u

    Data from: ReBeatICG database

    • produccioncientifica.ucm.es
    • data.niaid.nih.gov
    • +1more
    Updated 2021
    Cite
    Pale, Una; Meier, David; Müller, Olivier; Valdes, Adriana Arza; Alonso, David Atienza; Pale, Una; Meier, David; Müller, Olivier; Valdes, Adriana Arza; Alonso, David Atienza (2021). ReBeatICG database [Dataset]. https://produccioncientifica.ucm.es/documentos/668fc442b9e7c03b01bd7f2e
    Explore at:
    Dataset updated
    2021
    Authors
    Pale, Una; Meier, David; Müller, Olivier; Valdes, Adriana Arza; Alonso, David Atienza; Pale, Una; Meier, David; Müller, Olivier; Valdes, Adriana Arza; Alonso, David Atienza
    Description

ReBeatICG database contains ICG (impedance cardiography) signals recorded during an experimental session of a virtual search and rescue mission with drones. It includes beat-to-beat annotations of the ICG characteristic points, made by a cardiologist, for the purpose of testing ICG delineation algorithms. Synchronous ECG signals are included as a reference, to allow comparison and to mark cardiac events.

Raw data

The database includes 48 recordings of ICG and ECG signals from 24 healthy subjects during an experimental session of a virtual search and rescue mission with drones, described in [1]. Two 5-minute segments are selected from each subject: one corresponding to a baseline state (task BL), and a second one recorded during higher levels of cognitive workload (task CW). In total, the database consists of 240 minutes of ICG signals. During the experiment, various signals were recorded, but only ICG and ECG data are provided here. Raw data was recorded at 2000 Hz using the Biopac system.

Data preprocessing (filtering)

For the purpose of annotation by cardiologists, data were first downsampled from 2000 Hz to 250 Hz and then filtered with an adaptive Savitzky-Golay filter of order 3. "Adaptive" refers to the adaptive selection of the filter length, which plays a major role in the efficacy of the filter. The filter length was selected based on the SNR level of the first 3 seconds of each recording, following the procedure described below. Starting from a filter length of 3 (i.e., the minimum length allowed), the length is increased in steps of two until the signal SNR reaches 30 or the improvement is lower than 1% (i.e., the SNR improvement saturates with further increases in filter length). These values present a good compromise between reducing noise and over-smoothing the signal (and hence potentially losing valuable details), while keeping the filter length, and thus the complexity, low. The SNR is calculated as the ratio between the 2-norms of the high and low signal frequencies, considering 20 Hz as the cut-off frequency.

Data annotation

In order to assess the performance of the ICG delineation algorithms, a subset of the database was annotated by a cardiologist from Lausanne University Hospital (CHUV) in Switzerland. The annotated subset consists of 4 randomly chosen signal segments containing 10 beats from each subject and task (i.e., 4 segments from the BL and 4 from the CW task). Segments with artifacts and very noisy segments were excluded when selecting the data for annotation; in such cases, 8 segments were chosen from the task with cleaner signals. In total, 1920 (80x24) beats were selected for annotation. For each cardiac cycle, four characteristic points were annotated: B, C, X and O. The following definitions were used when annotating the data:

- C peak -- Defined as the peak with the greatest amplitude in one cardiac cycle, representing the maximum systolic flow.
- B point -- Indicates the onset of the final rapid upstroke toward the C point [3], expressed as the point of significant change in the slope of the ICG signal preceding the C point. It is related to the aortic valve opening. However, its identification can be difficult due to variations in ICG signal morphology. A decisional algorithm has been proposed to guide accurate and reproducible B point identification [4].
- X point -- Often defined as the minimum dZ/dt value in one cardiac cycle. However, this does not always hold true due to variations in the dZ/dt waveform morphology [5]. Thus, the X point is defined as the onset of the steep rise in ICG towards the O point. It represents the aortic valve closing, which occurs simultaneously with the T wave end on the ECG signal.
- O point -- The highest local maximum in the first half of the C-C interval. It represents the mitral valve opening.

Annotation was performed using open-access software (https://doi.org/10.5281/zenodo.4724843). Annotated points are saved in separate files for each person and task, representing the locations of points in the original signal.

Data structure

Data is organized in three folders: one for raw data (01_RawData), one for filtered data (02_FilteredData), and one for annotated points (03_ExpertAnnotations). In each folder, data is separated into files for each subject and task (except in 03_ExpertAnnotations, where 2 CW task files were not annotated due to an excessive amount of noise). All files are Matlab .mat files. Raw data and filtered data .mat files contain synchronized "ICG" and "ECG" data, as well as "samplFreq" values. In filtered data, the final chosen Savitzky-Golay filter length ("SGFiltLen") is provided too. In annotated data, each .mat file contains only the matrix "annotPoints", with each row representing one cardiac cycle and the columns giving the positions of the B, C, X and O points, respectively. Positions are expressed as the number of samples from the beginning of the full database files (signals from the 01_RawData and 02_FilteredData folders). In rare cases, there are fewer than 40 (or 80) values per file, when data was noisy and the cardiologist could not confidently annotate each cardiac cycle.

References

[1] F. Dell'Agnola, "Cognitive Workload Monitoring in Virtual Reality Based Rescue Missions with Drones," pp. 397–409, 2020, doi: 10.1007/978-3-030-49695-1_26.
[2] H. Yazdanian, A. Mahnam, M. Edrisi, and M. A. Esfahani, "Design and Implementation of a Portable Impedance Cardiography System for Noninvasive Stroke Volume Monitoring," J. Med. Signals Sens., vol. 6, no. 1, pp. 47–56, Mar. 2016.
[3] A. Sherwood (Chair), M. T. Allen, J. Fahrenberg, R. M. Kelsey, W. R. Lovallo, and L. J. P. van Doornen, "Methodological Guidelines for Impedance Cardiography," Psychophysiology, vol. 27, no. 1, pp. 1–23, 1990, doi: 10.1111/j.1469-8986.1990.tb02171.x.
[4] J. R. Árbol, P. Perakakis, A. Garrido, J. L. Mata, M. C. Fernández-Santaella, and J. Vila, "Mathematical detection of aortic valve opening (B point) in impedance cardiography: A comparison of three popular algorithms," Psychophysiology, vol. 54, no. 3, pp. 350–357, 2017, doi: 10.1111/psyp.12799.
[5] M. Nabian, Y. Yin, J. Wormwood, K. S. Quigley, L. F. Barrett, and S. Ostadabbas, "An Open-Source Feature Extraction Tool for the Analysis of Peripheral Physiological Data," IEEE J. Transl. Eng. Health Med., vol. 6, p. 2800711, 2018, doi: 10.1109/JTEHM.2018.2878000.
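The adaptive filter-length selection described above can be sketched as follows, assuming the sub-20 Hz band is treated as signal. Note that scipy's savgol_filter requires the window length to exceed the polynomial order, so this sketch starts at 5 rather than the stated minimum of 3:

```python
import numpy as np
from scipy.signal import savgol_filter

FS = 250       # Hz, sampling rate after downsampling
CUTOFF = 20.0  # Hz, cut-off separating "signal" from "noise"

def snr(x, fs=FS, cutoff=CUTOFF):
    """Ratio of the 2-norms of the low- and high-frequency spectral
    content (treating the low band as signal is an assumption here)."""
    spec = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    high = np.linalg.norm(spec[freqs > cutoff])
    low = np.linalg.norm(spec[freqs <= cutoff])
    return np.inf if high == 0 else low / high

def adaptive_savgol(x, order=3, target_snr=30.0, min_gain=0.01, probe_s=3):
    """Grow the window in steps of 2 until SNR >= target or the relative
    improvement drops below 1%, judged on the first 3 s of the recording."""
    probe = x[: probe_s * FS]
    length = 5  # smallest odd window length > order for order=3
    best_snr = snr(savgol_filter(probe, length, order))
    while best_snr < target_snr and length + 2 <= len(probe):
        cand_snr = snr(savgol_filter(probe, length + 2, order))
        if cand_snr < best_snr * (1 + min_gain):
            break  # improvement below 1%: stop growing the window
        length, best_snr = length + 2, cand_snr
    return savgol_filter(x, length, order), length
```

On a clean, low-frequency signal the loop exits almost immediately; on a noisy one the window keeps growing until smoothing stops paying off.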

  12. d

    Data from: Grammar transformations of topographic feature type annotations...

    • catalog.data.gov
    • data.usgs.gov
    • +2more
    Updated Oct 29, 2025
    + more versions
    Cite
    U.S. Geological Survey (2025). Grammar transformations of topographic feature type annotations of the U.S. to structured graph data. [Dataset]. https://catalog.data.gov/dataset/grammar-transformations-of-topographic-feature-type-annotations-of-the-u-s-to-structured-g
    Explore at:
    Dataset updated
    Oct 29, 2025
    Dataset provided by
    United States Geological Surveyhttp://www.usgs.gov/
    Area covered
    United States
    Description

These data were used to examine grammatical structures and patterns within a set of geospatial glossary definitions. The objectives of our study were to analyze the semantic structure of the input definitions, use this information to build triple structures of RDF graph data, upload our lexicon to a knowledge graph software, and perform SPARQL queries on the data. Upon completion of this study, SPARQL queries were shown to effectively convey graph triples which displayed semantic significance. These data represent and characterize the lexicon of our input text, which is used to form graph triples. These data were collected in 2024 by passing text through multiple Python programs utilizing spaCy (a natural language processing library) and its pre-trained English transformer pipeline. Before the data was processed by the Python programs, input definitions were first rewritten as natural language and formatted as tabular data. Passages were then tokenized and characterized by their part-of-speech, tag, dependency relation, dependency head, and lemma. Each word within the lexicon was tokenized. A stop-words list was utilized only to remove punctuation and symbols from the text, excluding hyphenated words (e.g. bowl-shaped), which remained as such. The tokens' lemmas were then aggregated and totaled to find their recurrences within the lexicon. This procedure was repeated for tokenizing noun chunks using the same glossary definitions.
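The stop-word pass and lemma aggregation can be sketched with the standard library alone, given tokens already annotated by the spaCy pipeline (the token triples below are illustrative, not taken from the study's data):

```python
from collections import Counter

# Pre-annotated tokens as (text, upos, lemma) triples; in the actual study
# these attributes came from spaCy's English transformer pipeline.
tokens = [
    ("A", "DET", "a"), ("bowl-shaped", "ADJ", "bowl-shaped"),
    ("depression", "NOUN", "depression"), (",", "PUNCT", ","),
    ("formed", "VERB", "form"), ("by", "ADP", "by"),
    ("a", "DET", "a"), ("glacier", "NOUN", "glacier"), (".", "PUNCT", "."),
]

# The "stop-words" pass removes only punctuation and symbols; hyphenated
# words such as "bowl-shaped" are kept intact.
content = [t for t in tokens if t[1] not in {"PUNCT", "SYM"}]

# Aggregate lemmas and total their recurrences within the lexicon.
lemma_counts = Counter(lemma for _, _, lemma in content)
# lemma_counts["a"] == 2, lemma_counts["bowl-shaped"] == 1
```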

  13. Hive Annotation Job Results - Cleaned and Audited

    • kaggle.com
    zip
    Updated Apr 28, 2021
    Cite
    Brendan Kelley (2021). Hive Annotation Job Results - Cleaned and Audited [Dataset]. https://www.kaggle.com/brendankelley/hive-annotation-job-results-cleaned-and-audited
    Explore at:
    zip(471571 bytes)Available download formats
    Dataset updated
    Apr 28, 2021
    Authors
    Brendan Kelley
    Description

    Context

This notebook serves to showcase my problem-solving ability, knowledge of the data analysis process, proficiency with Excel and its various tools and functions, as well as my strategic mindset and statistical prowess. This project consists of an auditing prompt provided by Hive Data, a raw Excel data set, a cleaned and audited version of the raw Excel data set, and a description of my thought process and the knowledge used during completion of the project. The prompt can be found below:

    Hive Data Audit Prompt

    The raw data that accompanies the prompt can be found below:

    Hive Annotation Job Results - Raw Data

    ^ These are the tools I was given to complete my task. The rest of the work is entirely my own.

    To summarize broadly, my task was to audit the dataset and summarize my process and results. Specifically, I was to create a method for identifying which "jobs" - explained in the prompt above - needed to be rerun based on a set of "background facts," or criteria. The description of my extensive thought process and results can be found below in the Content section.

    Content

    Brendan Kelley April 23, 2021

    Hive Data Audit Prompt Results

    This paper explains the auditing process of the “Hive Annotation Job Results” data. It includes the preparation, analysis, visualization, and summary of the data. It is accompanied by the results of the audit in the excel file “Hive Annotation Job Results – Audited”.

    Observation

The “Hive Annotation Job Results” data comes in the form of a single Excel sheet. It contains 7 columns and 5,001 rows, including column headers. The data includes “file”, “object id”, and pseudonyms for the five questions that each client was instructed to answer about their respective table: “tabular”, “semantic”, “definition list”, “header row”, and “header column”. The “file” column includes non-unique numbers (that is, there are multiple instances of the same value in the column) separated by a dash. The “object id” column includes non-unique numbers ranging from 5 to 487539. The columns containing the answers to the five questions include Boolean values - TRUE or FALSE - which depend upon the yes/no worker judgement.

    Use of the COUNTIF() function reveals that there are no values other than TRUE or FALSE in any of the five question columns. The VLOOKUP() function reveals that the data does not include any missing values in any of the cells.

    Assumptions

    Based on the clean state of the data and the guidelines of the Hive Data Audit Prompt, the assumption is that duplicate values in the “file” column are acceptable and should not be removed. Similarly, duplicated values in the “object id” column are acceptable and should not be removed. The data is therefore clean and is ready for analysis/auditing.

    Preparation

    The purpose of the audit is to analyze the accuracy of the yes/no worker judgement of each question according to the guidelines of the background facts. The background facts are as follows:

• A table that is a definition list should automatically be tabular and also semantic
• Semantic tables should automatically be tabular
• If a table is NOT tabular, then it is definitely not semantic nor a definition list
• A tabular table that has a header row OR header column should definitely be semantic

    These background facts serve as instructions for how the answers to the five questions should interact with one another. These facts can be re-written to establish criteria for each question:

For tabular column:
- If the table is a definition list, it is also tabular
- If the table is semantic, it is also tabular

For semantic column:
- If the table is a definition list, it is also semantic
- If the table is not tabular, it is not semantic
- If the table is tabular and has either a header row or a header column...
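Rewritten as executable checks, the background facts might look like this (a sketch; the row representation and rule names are mine, not Hive's):

```python
def violations(row):
    """Return the names of the background-fact rules a row breaks.
    `row` maps each question pseudonym to its TRUE/FALSE judgement."""
    broken = []
    if row["definition list"] and not (row["tabular"] and row["semantic"]):
        broken.append("definition list must be tabular and semantic")
    if row["semantic"] and not row["tabular"]:
        broken.append("semantic must be tabular")
    if not row["tabular"] and (row["semantic"] or row["definition list"]):
        broken.append("non-tabular cannot be semantic or a definition list")
    if (row["tabular"] and (row["header row"] or row["header column"])
            and not row["semantic"]):
        broken.append("tabular with a header must be semantic")
    return broken

# A row that is tabular with a header row but marked non-semantic is
# internally inconsistent, so its job would need to be rerun.
row = {"tabular": True, "semantic": False, "definition list": False,
       "header row": True, "header column": False}
# violations(row) -> ["tabular with a header must be semantic"]
```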

  14. d

    Data from: Annotated reference transcriptome for female Culicoides...

    • catalog.data.gov
    • agdatacommons.nal.usda.gov
    • +1more
    Updated Apr 21, 2025
    + more versions
    Cite
    Agricultural Research Service (2025). Annotated reference transcriptome for female Culicoides sonorensis biting midges [Dataset]. https://catalog.data.gov/dataset/annotated-reference-transcriptome-for-female-culicoides-sonorensis-biting-midges-fde74
    Explore at:
    Dataset updated
    Apr 21, 2025
    Dataset provided by
    Agricultural Research Service
    Description

Unigene sequences were annotated by BlastX alignment to the non-redundant protein database (National Center for Biotechnology Information/GenBank) and the Aedes aegypti and Culex quinquefasciatus gene annotations (Vectorbase). This was done with a 1e-05 expectation value. Top hits are shown including accession numbers and description, if available. Unigene number and corresponding GenBank accession numbers are provided for all C. sonorensis genes. Both tables are modified from supplementary information tables at http://dx.doi.org/10.1371/journal.pone.0098123.s003 and numbered accordingly.

Resources in this dataset:
Resource Title: table s2 annotation. File Name: table s2 annotation.xlsx
Resource Title: table S3 GO terms. File Name: table S3 GO terms.xlsx
Resource Title: data dictionary Nayduch S2 S3. File Name: data dictionary Nayduch S2 S3_2.csv
Resource Description: Defines parameters for annotation and GO terms.

  15. n

    Hebrew Text Database ETCBC4

    • narcis.nl
    • datasearch.gesis.org
    Updated Jul 18, 2014
    + more versions
    Cite
    Peursen, W.T. van (ETCBC, VU Amsterdam) (2014). Hebrew Text Database ETCBC4 [Dataset]. http://doi.org/10.17026/dans-2z3-arxf
    Explore at:
    application/x-cmdi+xmlAvailable download formats
    Dataset updated
    Jul 18, 2014
    Dataset provided by
    Data Archiving and Networked Services (DANS)
    Authors
    Peursen, W.T. van (ETCBC, VU Amsterdam)
    Area covered
    Middle East
    Description

    The ETCBC database of the Hebrew Bible (formerly known as WIVU database), contains the scholarly text of the Hebrew Bible with linguistic markup. A previous version can be found in EASY (see the link below). The present dataset is an improvement in many ways:

    (A) it contains a new version of the data, called ETCBC4. The content has been heavily updated, with new linguistic annotations and a better organisation of them, and lots of additions and corrections as well.

    (B) the data format is now Linguistic Annotation Framework (see below). This contrasts with the previous version, which has been archived as a database dump in a specialised format: Emdros (see the link below).

    (C) a new tool, LAF-Fabric is added to process the ETCBC4 version directly from its LAF representation. The picture on this page shows a few samples what can be done with it.

    (D) extensive documentation is provided, including a description of all the computing steps involved in getting the data in LAF format.

Since 2012 there has been an ISO standard for the stand-off markup of language resources: the Linguistic Annotation Framework (LAF).

As a result of the SHEBANQ project (see link below), funded by CLARIN-NL and carried out by the ETCBC and DANS, we have created a tool, LAF-Fabric, with which we can convert Emdros databases of the ETCBC into LAF and then do data-analytic work by means of e.g. IPython notebooks. This has been used for the Hebrew Bible, but it can also be applied to the Syriac text in CALAP (see link below).

    This dataset contains a folder laf with the laf files, and the necessary declarations are contained in the folder decl. Among these declarations are feature declaration documents, in TEI format (see link below), with hyperlinks to concept definitions in ISOcat (see link below). For completeness, the ISOcat definitions are repeated in the feature declaration documents. These definitions are terse, and they are more fully documented in the folder documentation.

  16. f

    Data from: A Useful Guide to Lectin Binding: Machine-Learning Directed...

    • acs.figshare.com
    xlsx
    Updated Jun 16, 2023
    Cite
    Daniel Bojar; Lawrence Meche; Guanmin Meng; William Eng; David F. Smith; Richard D. Cummings; Lara K. Mahal (2023). A Useful Guide to Lectin Binding: Machine-Learning Directed Annotation of 57 Unique Lectin Specificities [Dataset]. http://doi.org/10.1021/acschembio.1c00689.s006
    Explore at:
    xlsxAvailable download formats
    Dataset updated
    Jun 16, 2023
    Dataset provided by
    ACS Publications
    Authors
    Daniel Bojar; Lawrence Meche; Guanmin Meng; William Eng; David F. Smith; Richard D. Cummings; Lara K. Mahal
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Glycans are critical to every facet of biology and medicine, from viral infections to embryogenesis. Tools to study glycans are rapidly evolving; however, the majority of our knowledge is deeply dependent on binding by glycan binding proteins (e.g., lectins). The specificities of lectins, which are often naturally isolated proteins, have not been well-defined, making it difficult to leverage their full potential for glycan analysis. Herein, we use a combination of machine learning algorithms and expert annotation to define lectin specificity for this important probe set. Our analysis uses comprehensive glycan microarray analysis of commercially available lectins we obtained using version 5.0 of the Consortium for Functional Glycomics glycan microarray (CFGv5). This data set was made public in 2011. We report the creation of this data set and its use in large-scale evaluation of lectin–glycan binding behaviors. Our motif analysis was performed by integrating 68 manually defined glycan features with systematic probing of computational rules for significant binding motifs using mono- and disaccharides and linkages. Combining machine learning with manual annotation, we create a detailed interpretation of glycan-binding specificity for 57 unique lectins, categorized by their major binding motifs: mannose, complex-type N-glycan, O-glycan, fucose, sialic acid and sulfate, GlcNAc and chitin, Gal and LacNAc, and GalNAc. Our work provides fresh insights into the complex binding features of commercially available lectins in current use, providing a critical guide to these important reagents.

  17. d

    Agrilus planipennis community manual annotations

    • catalog.data.gov
    • agdatacommons.nal.usda.gov
    • +1more
    Updated Apr 21, 2025
    Cite
    Agricultural Research Service (2025). Agrilus planipennis community manual annotations [Dataset]. https://catalog.data.gov/dataset/agrilus-planipennis-community-manual-annotations
    Explore at:
    Dataset updated
    Apr 21, 2025
    Dataset provided by
    Agricultural Research Service
    Description

    Manual annotation at the i5k Workspace@NAL (https://i5k.nal.usda.gov) is the review and improvement of gene models derived from computational gene prediction. Community curators compare an existing gene model to evidence such as RNA-Seq or protein alignments from the same or closely related species and modify the structure or function of the gene accordingly, typically following the i5k Workspace@NAL manual annotation guidelines (https://i5k.nal.usda.gov/content/rules-web-apollo-annotation-i5k-pilot-project). If a gene model is missing, the annotator can also use this evidence to create a new gene model. Because manual annotation, by definition, improves or creates gene models where computational methods have failed, it can be a powerful tool to improve computational gene sets, which often serve as foundational datasets to facilitate research on a species.Here, community curators used manual annotation at the i5k Workspace@NAL to improve computational gene predictions from the dataset Agrilus planipennis genome annotations v0.5.3. The i5k Workspace@NAL set up the Apollo v1 manual annotation software and multiple evidence tracks to facilitate manual annotation. From 2014-10-20 to 2018-07-12, five community curators updated 263 genes, including developmental genes; cytochrome P450s; cathepsin peptidases; cuticle proteins; glycoside hydrolases; and polysaccharide lyases. For this dataset, we used the program LiftOff v1.6.3 to map the manual annotations to the genome assembly GCF_000699045.2. We computed overlaps with annotations from the RefSeq database using gff3_merge from the GFF3toolkit software v2.1.0. FASTA sequences were generated using gff3_to_fasta from the same toolkit. 
These improvements should facilitate continued research on Agrilus planipennis, or emerald ash borer (EAB), which is an invasive insect pest. While these manual annotations will not be integrated with other computational gene sets, they are available to view at the i5k Workspace@NAL (https://i5k.nal.usda.gov) to enhance future research on Agrilus planipennis.

  18. d

    330K+ Interior Design Images | AI Training Data | Annotated imagery data for...

    • datarade.ai
    Cite
    Data Seeds, 330K+ Interior Design Images | AI Training Data | Annotated imagery data for AI | Object & Scene Detection | Global Coverage [Dataset]. https://datarade.ai/data-products/200k-interior-design-images-ai-training-data-annotated-i-data-seeds
    Explore at:
    .bin, .json, .xml, .csv, .xls, .sql, .txt Available download formats
    Dataset authored and provided by
    Data Seeds
    Area covered
    Jamaica, Congo, Kuwait, Nicaragua, Turks and Caicos Islands, Curaçao, Indonesia, Tajikistan, Ethiopia, Egypt
    Description

    This dataset features over 330,000 high-quality interior design images sourced from photographers worldwide. Designed to support AI and machine learning applications, it provides a richly varied and extensively annotated collection of indoor environment visuals.

    Key Features:

    1. Comprehensive Metadata: the dataset includes full EXIF data, detailing camera settings such as aperture, ISO, shutter speed, and focal length. Each image is pre-annotated with object and scene detection metadata, making it ideal for tasks such as room classification, furniture detection, and spatial layout analysis. Popularity metrics, derived from engagement on our proprietary platform, are also included.

    2. Unique Sourcing Capabilities: the images are collected through a proprietary gamified platform for photographers. Competitions centered on interior design themes ensure a steady stream of fresh, high-quality submissions. Custom datasets can be sourced on-demand within 72 hours to fulfill specific requests, such as particular room types, design styles, or furnishings.

    3. Global Diversity: photographs have been sourced from contributors in over 100 countries, covering a wide spectrum of architectural styles, cultural aesthetics, and functional spaces. The images include homes, offices, restaurants, studios, and public interiors, ranging from minimalist and modern to classic and eclectic designs.

    4. High-Quality Imagery: the dataset includes standard to ultra-high-definition images that capture fine interior details. Both professionally staged and candid real-life spaces are included, offering versatility for training AI across design evaluation, object detection, and environmental understanding.

    5. Popularity Scores: each image is assigned a popularity score based on its performance in GuruShots competitions. This provides valuable insights into global aesthetic trends, helping AI models learn user preferences, design appeal, and stylistic relevance.

    6. AI-Ready Design: the dataset is optimized for machine learning tasks such as interior scene recognition, style transfer, virtual staging, and layout generation. It integrates smoothly with popular AI development environments and tools.

    7. Licensing & Compliance: the dataset fully complies with data privacy regulations and includes transparent licensing suitable for commercial and academic use.

    Use Cases:

    1. Training AI for interior design recommendation engines and virtual staging tools.
    2. Enhancing smart home applications and spatial recognition systems.
    3. Powering AR/VR platforms for virtual tours, furniture placement, and room redesign.
    4. Supporting architectural visualization, decor style transfer, and real estate marketing.

    This dataset offers a comprehensive, high-quality resource tailored for AI-driven innovation in design, real estate, and spatial computing. Customizations are available upon request. Contact us to learn more!

  19. f

    Data from: Proteogenomic Annotation of Chinese Hamsters Reveals Extensive...

    • figshare.com
    • datasetcatalog.nlm.nih.gov
    xlsx
    Updated May 31, 2023
    Cite
    Shangzhong Li; Seong Won Cha; Kelly Heffner; Deniz Baycin Hizal; Michael A. Bowen; Raghothama Chaerkady; Robert N. Cole; Vijay Tejwani; Prashant Kaushik; Michael Henry; Paula Meleady; Susan T. Sharfstein; Michael J. Betenbaugh; Vineet Bafna; Nathan E. Lewis (2023). Proteogenomic Annotation of Chinese Hamsters Reveals Extensive Novel Translation Events and Endogenous Retroviral Elements [Dataset]. http://doi.org/10.1021/acs.jproteome.8b00935.s002
    Explore at:
    xlsx Available download formats
    Dataset updated
    May 31, 2023
    Dataset provided by
    ACS Publications
    Authors
    Shangzhong Li; Seong Won Cha; Kelly Heffner; Deniz Baycin Hizal; Michael A. Bowen; Raghothama Chaerkady; Robert N. Cole; Vijay Tejwani; Prashant Kaushik; Michael Henry; Paula Meleady; Susan T. Sharfstein; Michael J. Betenbaugh; Vineet Bafna; Nathan E. Lewis
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    A high-quality genome annotation greatly facilitates successful cell line engineering. Standard draft genome annotation pipelines are based largely on de novo gene prediction, homology, and RNA-Seq data. However, draft annotations can suffer from incorrect predictions of translated sequence, inaccurate splice isoforms, and missing genes. Here, we generated a draft annotation for the newly assembled Chinese hamster genome and used RNA-Seq, proteomics, and Ribo-Seq to experimentally annotate the genome. We identified 3529 new proteins compared to the hamster RefSeq protein annotation and 2256 novel translational events (e.g., alternative splices, mutations, and novel splices). Finally, we used this pipeline to identify the source of translated retroviruses contaminating recombinant products from Chinese hamster ovary (CHO) cell lines, including 119 type-C retroviruses, thus enabling future efforts to eliminate retroviruses to reduce the costs incurred with retroviral particle clearance. In summary, the improved annotation provides a more accurate resource for CHO cell line engineering, by facilitating the interpretation of omics data, defining of cellular pathways, and engineering of complex phenotypes.

  20. K

    TheSu XML Schema Definition, ns 1.0

    • rdr.kuleuven.be
    • data.europa.eu
    txt, xml
    Updated Nov 27, 2023
    Cite
    Daniele Morrone; Daniele Morrone (2023). TheSu XML Schema Definition, ns 1.0 [Dataset]. http://doi.org/10.48804/KD8QPO
    Explore at:
    xml(796287), txt(5295) Available download formats
    Dataset updated
    Nov 27, 2023
    Dataset provided by
    KU Leuven RDR
    Authors
    Daniele Morrone; Daniele Morrone
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    This dataset includes a replica of the current version of the TheSu XML Schema Definition (XSD) within the namespace 1.0, as available at the URL: https://alchemeast.eu/thesu/ns/1.0/TheSu.xsd. TheSu XML, an acronym for 'Thesis-Support', is an XML annotation schema for the digital analysis, indexing, and mapping of ideas and their contexts of enunciation in any source. It is tailored to assist research in the history of ideas, philosophy, science, and technology. The complete documentation for this XML Schema Definition is accessible at the URL: https://alchemeast.eu/thesu/ns/1.0/documentation/TheSu.html.
