31 datasets found
  1. ActiveHuman Part 2

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    Updated Nov 14, 2023
    Cite
    Charalampos Georgiadis (2023). ActiveHuman Part 2 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8361113
    Explore at:
    Dataset updated
    Nov 14, 2023
    Dataset provided by
    Aristotle University of Thessaloniki (AUTh)
    Authors
    Charalampos Georgiadis
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is Part 2/2 of the ActiveHuman dataset! Part 1 can be found here. Dataset Description ActiveHuman was generated using Unity's Perception package. It consists of 175428 RGB images and their semantic segmentation counterparts taken at different environments, lighting conditions, camera distances and angles. In total, the dataset contains images for 8 environments, 33 humans, 4 lighting conditions, 7 camera distances (1m-4m) and 36 camera angles (0-360 at 10-degree intervals). The dataset does not include images at every single combination of available camera distances and angles, since for some values the camera would collide with another object or go outside the confines of an environment. As a result, some combinations of camera distances and angles do not exist in the dataset. Alongside each image, 2D Bounding Box, 3D Bounding Box and Keypoint ground truth annotations are also generated via the use of Labelers and are stored as a JSON-based dataset. These Labelers are scripts that are responsible for capturing ground truth annotations for each captured image or frame. Keypoint annotations follow the COCO format defined by the COCO keypoint annotation template offered in the perception package.

    Folder configuration The dataset consists of 3 folders:

    JSON Data: Contains all the generated JSON files. RGB Images: Contains the generated RGB images. Semantic Segmentation Images: Contains the generated semantic segmentation images.

    Essential Terminology

    Annotation: Recorded data describing a single capture. Capture: One completed rendering process of a Unity sensor which stored the rendered result to data files (e.g. PNG, JPG, etc.). Ego: Object or person to which a collection of sensors is attached (e.g., if a drone has a camera attached to it, the drone would be the ego and the camera would be the sensor). Ego coordinate system: Coordinates with respect to the ego. Global coordinate system: Coordinates with respect to the global origin in Unity. Sensor: Device that captures the dataset (in this instance the sensor is a camera). Sensor coordinate system: Coordinates with respect to the sensor. Sequence: Time-ordered series of captures. This is very useful for video capture where the time-order relationship of two captures is vital. UUID: Universally Unique Identifier. It is a unique hexadecimal identifier that can represent an individual instance of a capture, ego, sensor, annotation, labeled object or keypoint, or keypoint template.

    Dataset Data The dataset includes 4 types of JSON annotation files:

    annotation_definitions.json: Contains annotation definitions for all of the active Labelers of the simulation stored in an array. Each entry consists of a collection of key-value pairs which describe a particular type of annotation and contain information about that specific annotation describing how its data should be mapped back to labels or objects in the scene. Each entry contains the following key-value pairs:

    id: Integer identifier of the annotation's definition. name: Annotation name (e.g., keypoints, bounding box, bounding box 3D, semantic segmentation). description: Description of the annotation's specifications. format: Format of the file containing the annotation specifications (e.g., json, PNG). spec: Format-specific specifications for the annotation values generated by each Labeler.

    Most Labelers generate different annotation specifications in the spec key-value pair:

    BoundingBox2DLabeler/BoundingBox3DLabeler:

    label_id: Integer identifier of a label. label_name: String identifier of a label. KeypointLabeler:

    template_id: Keypoint template UUID. template_name: Name of the keypoint template. key_points: Array containing all the joints defined by the keypoint template. This array includes the key-value pairs:

    label: Joint label. index: Joint index. color: RGBA values of the keypoint. color_code: Hex color code of the keypoint. skeleton: Array containing all the skeleton connections defined by the keypoint template. Each skeleton connection defines a connection between two different joints. This array includes the key-value pairs:

    label1: Label of the first joint. label2: Label of the second joint. joint1: Index of the first joint. joint2: Index of the second joint. color: RGBA values of the connection. color_code: Hex color code of the connection. SemanticSegmentationLabeler:

    label_name: String identifier of a label. pixel_value: RGBA values of the label. color_code: Hex color code of the label.
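
    As a rough illustration of how these definition entries might be read, a minimal Python sketch is given below; the file location and the top-level key are assumptions rather than part of the dataset description, so adjust them to the actual folder layout.

      import json

      # Assumed path inside the "JSON Data" folder; adjust to the actual layout.
      with open("JSON Data/annotation_definitions.json", encoding="utf-8") as f:
          data = json.load(f)

      # The definitions may sit under a wrapping key or be a plain top-level array.
      definitions = data.get("annotation_definitions", data) if isinstance(data, dict) else data

      for definition in definitions:
          # Each entry describes one Labeler: id, name, description, format, spec.
          print(definition["id"], definition["name"], definition["format"])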

    captures_xyz.json: Each of these files contains an array of ground truth annotations generated by each active Labeler for each capture separately, as well as extra metadata that describe the state of each active sensor that is present in the scene. Each array entry contains the following key-value pairs:

    id: UUID of the capture. sequence_id: UUID of the sequence. step: Index of the capture within a sequence. timestamp: Timestamp (in ms) since the beginning of a sequence. sensor: Properties of the sensor. This entry contains a collection with the following key-value pairs:

    sensor_id: Sensor UUID. ego_id: Ego UUID. modality: Modality of the sensor (e.g., camera, radar). translation: 3D vector that describes the sensor's position (in meters) with respect to the global coordinate system. rotation: Quaternion variable that describes the sensor's orientation with respect to the ego coordinate system. camera_intrinsic: matrix containing (if it exists) the camera's intrinsic calibration. projection: Projection type used by the camera (e.g., orthographic, perspective). ego: Attributes of the ego. This entry contains a collection with the following key-value pairs:

    ego_id: Ego UUID. translation: 3D vector that describes the ego's position (in meters) with respect to the global coordinate system. rotation: Quaternion variable containing the ego's orientation. velocity: 3D vector containing the ego's velocity (in meters per second). acceleration: 3D vector containing the ego's acceleration (in ). format: Format of the file captured by the sensor (e.g., PNG, JPG). annotations: Key-value pair collections, one for each active Labeler. These key-value pairs are as follows:

    id: Annotation UUID . annotation_definition: Integer identifier of the annotation's definition. filename: Name of the file generated by the Labeler. This entry is only present for Labelers that generate an image. values: List of key-value pairs containing annotation data for the current Labeler.

    Each Labeler generates different annotation specifications in the values key-value pair:

    BoundingBox2DLabeler:

    label_id: Integer identifier of a label. label_name: String identifier of a label. instance_id: UUID of one instance of an object. Each object with the same label that is visible on the same capture has different instance_id values. x: Position of the 2D bounding box on the X axis. y: Position of the 2D bounding box position on the Y axis. width: Width of the 2D bounding box. height: Height of the 2D bounding box. BoundingBox3DLabeler:

    label_id: Integer identifier of a label. label_name: String identifier of a label. instance_id: UUID of one instance of an object. Each object with the same label that is visible on the same capture has different instance_id values. translation: 3D vector containing the location of the center of the 3D bounding box with respect to the sensor coordinate system (in meters). size: 3D vector containing the size of the 3D bounding box (in meters). rotation: Quaternion variable containing the orientation of the 3D bounding box. velocity: 3D vector containing the velocity of the 3D bounding box (in meters per second). acceleration: 3D vector containing the acceleration of the 3D bounding box (in meters per second squared). KeypointLabeler:

    label_id: Integer identifier of a label. instance_id: UUID of one instance of a joint. Keypoints with the same joint label that are visible on the same capture have different instance_id values. template_id: UUID of the keypoint template. pose: Pose label for that particular capture. keypoints: Array containing the properties of each keypoint. Each keypoint that exists in the keypoint template file is one element of the array. Each entry's contents are as follows:

    index: Index of the keypoint in the keypoint template file. x: Pixel coordinates of the keypoint on the X axis. y: Pixel coordinates of the keypoint on the Y axis. state: State of the keypoint.

    The SemanticSegmentationLabeler does not contain a values list.
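
    As a rough illustration of how the capture files might be traversed, the minimal Python sketch below prints 2D bounding boxes; the folder layout, the file naming pattern and the top-level key are assumptions, not part of the dataset description.

      import glob
      import json

      for path in sorted(glob.glob("JSON Data/captures_*.json")):   # assumed naming pattern
          with open(path, encoding="utf-8") as f:
              data = json.load(f)
          captures = data.get("captures", data) if isinstance(data, dict) else data
          for capture in captures:
              for annotation in capture["annotations"]:
                  # "values" is absent for the SemanticSegmentationLabeler.
                  for value in annotation.get("values", []):
                      if {"x", "y", "width", "height"} <= value.keys():   # looks like a 2D box
                          print(capture["id"], value.get("label_name"),
                                value["x"], value["y"], value["width"], value["height"])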

    egos.json: Contains collections of key-value pairs for each ego. These include:

    id: UUID of the ego. description: Description of the ego. sensors.json: Contains collections of key-value pairs for all sensors of the simulation. These include:

    id: UUID of the sensor. ego_id: UUID of the ego on which the sensor is attached. modality: Modality of the sensor (e.g., camera, radar, sonar). description: Description of the sensor (e.g., camera, radar).

    Image names The RGB and semantic segmentation images share the same image naming convention. However, the semantic segmentation images also contain the string Semantic_ at the beginning of their filenames. Each RGB image is named "e_h_l_d_r.jpg", where:

    e denotes the id of the environment. h denotes the id of the person. l denotes the id of the lighting condition. d denotes the camera distance at which the image was captured. r denotes the camera angle at which the image was captured.
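
    As a rough illustration of this naming convention, the minimal Python sketch below splits a filename into its five components; the example filename is made up, and the Semantic_ prefix is stripped so the same helper works for both image types.

      from pathlib import Path

      def parse_image_name(path):
          """Split 'e_h_l_d_r.jpg' (optionally prefixed with 'Semantic_') into its parts."""
          e, h, l, d, r = Path(path).stem.removeprefix("Semantic_").split("_")
          return {"environment": e, "human": h, "lighting": l, "distance": d, "angle": r}

      print(parse_image_name("3_12_2_2.5_120.jpg"))   # hypothetical filename
      # {'environment': '3', 'human': '12', 'lighting': '2', 'distance': '2.5', 'angle': '120'}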

  2. DWUG ES: Diachronic Word Usage Graphs for Spanish

    • zenodo.org
    zip
    Updated Apr 16, 2025
    + more versions
    Cite
    Frank D. Zamora-Reina; Frank D. Zamora-Reina; Felipe Bravo-Marquez; Felipe Bravo-Marquez; Dominik Schlechtweg; Dominik Schlechtweg (2025). DWUG ES: Diachronic Word Usage Graphs for Spanish [Dataset]. http://doi.org/10.5281/zenodo.14891659
    Explore at:
    Available download formats: zip
    Dataset updated
    Apr 16, 2025
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Frank D. Zamora-Reina; Frank D. Zamora-Reina; Felipe Bravo-Marquez; Felipe Bravo-Marquez; Dominik Schlechtweg; Dominik Schlechtweg
    License

    Attribution-NoDerivs 4.0 (CC BY-ND 4.0): https://creativecommons.org/licenses/by-nd/4.0/
    License information was derived automatically

    Description

    This data collection contains diachronic Word Usage Graphs (WUGs) for Spanish. Find a description of the data format, code to process the data and further datasets on the WUGsite.

    Please find more information on the provided data in the papers referenced below.

    The annotation was funded by

    • ANID FONDECYT grant 11200290, U-Inicia VID Project UI-004/20,
    • ANID - Millennium Science Initiative Program - Code ICN17 002 and
    • SemRel Group (DFG Grants SCHU 2580/1 and SCHU 2580/2).

    Version: 4.0.2, 7.1.2025. Full data. Quoting issues in uses resolved. Target word and target sentence indices corrected. One corrected context for word 'metro'. Judgments anonymized. Annotator 'gecsa' removed. Issues with special characters in filenames resolved. Additional removal of wrongly copied graphs.

    Reference

    Frank D. Zamora-Reina, Felipe Bravo-Marquez, Dominik Schlechtweg. 2022. LSCDiscovery: A shared task on semantic change discovery and detection in Spanish. In Proceedings of the 3rd International Workshop on Computational Approaches to Historical Language Change. Association for Computational Linguistics.

    Dominik Schlechtweg, Tejaswi Choppa, Wei Zhao, Michael Roth. 2025. The CoMeDi Shared Task: Median Judgment Classification & Mean Disagreement Ranking with Ordinal Word-in-Context Judgments. In Proceedings of the 1st Workshop on Context and Meaning--Navigating Disagreements in NLP Annotations.

  3. Hive Annotation Job Results - Cleaned and Audited

    • kaggle.com
    zip
    Updated Apr 28, 2021
    Cite
    Brendan Kelley (2021). Hive Annotation Job Results - Cleaned and Audited [Dataset]. https://www.kaggle.com/brendankelley/hive-annotation-job-results-cleaned-and-audited
    Explore at:
    Available download formats: zip (471571 bytes)
    Dataset updated
    Apr 28, 2021
    Authors
    Brendan Kelley
    Description

    Context

    This notebook serves to showcase my problem-solving ability, knowledge of the data analysis process, proficiency with Excel and its various tools and functions, as well as my strategic mindset and statistical prowess. This project consists of an auditing prompt provided by Hive Data, a raw Excel data set, a cleaned and audited version of the raw Excel data set, and my description of my thought process and knowledge used during completion of the project. The prompt can be found below:

    Hive Data Audit Prompt

    The raw data that accompanies the prompt can be found below:

    Hive Annotation Job Results - Raw Data

    ^ These are the tools I was given to complete my task. The rest of the work is entirely my own.

    To summarize broadly, my task was to audit the dataset and summarize my process and results. Specifically, I was to create a method for identifying which "jobs" - explained in the prompt above - needed to be rerun based on a set of "background facts," or criteria. The description of my extensive thought process and results can be found below in the Content section.

    Content

    Brendan Kelley April 23, 2021

    Hive Data Audit Prompt Results

    This paper explains the auditing process of the “Hive Annotation Job Results” data. It includes the preparation, analysis, visualization, and summary of the data. It is accompanied by the results of the audit in the excel file “Hive Annotation Job Results – Audited”.

    Observation

    The “Hive Annotation Job Results” data comes in the form of a single excel sheet. It contains 7 columns and 5,001 rows, including column headers. The data includes “file”, “object id”, and the pseudonym for five questions that each client was instructed to answer about their respective table: “tabular”, “semantic”, “definition list”, “header row”, and “header column”. The “file” column includes non-unique (that is, there are multiple instances of the same value in the column) numbers separated by a dash. The “object id” column includes non-unique numbers ranging from 5 to 487539. The columns containing the answers to the five questions include Boolean values - TRUE or FALSE – which depend upon the yes/no worker judgement.

    Use of the COUNTIF() function reveals that there are no values other than TRUE or FALSE in any of the five question columns. The VLOOKUP() function reveals that the data does not include any missing values in any of the cells.

    Assumptions

    Based on the clean state of the data and the guidelines of the Hive Data Audit Prompt, the assumption is that duplicate values in the “file” column are acceptable and should not be removed. Similarly, duplicated values in the “object id” column are acceptable and should not be removed. The data is therefore clean and is ready for analysis/auditing.

    Preparation

    The purpose of the audit is to analyze the accuracy of the yes/no worker judgement of each question according to the guidelines of the background facts. The background facts are as follows:

    • A table that is a definition list should automatically be tabular and also semantic
    • Semantic tables should automatically be tabular
    • If a table is NOT tabular, then it is definitely not semantic nor a definition list
    • A tabular table that has a header row OR header column should definitely be semantic

    These background facts serve as instructions for how the answers to the five questions should interact with one another. These facts can be re-written to establish criteria for each question:

    For tabular column:
    - If the table is a definition list, it is also tabular
    - If the table is semantic, it is also tabular

    For semantic column:
    - If the table is a definition list, it is also semantic
    - If the table is not tabular, it is not semantic
    - If the table is tabular and has either a header row or a header column...
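
    As a rough illustration of how these criteria could be checked outside Excel, the minimal pandas sketch below flags rows that contradict the background facts; the CSV export, its file name and the exact column names are assumptions.

      import pandas as pd

      # Hypothetical CSV export of the sheet; TRUE/FALSE cells are parsed as booleans by pandas.
      df = pd.read_csv("hive_annotation_job_results.csv")

      violations = (
          (df["definition list"] & ~(df["tabular"] & df["semantic"]))                      # fact 1
          | (df["semantic"] & ~df["tabular"])                                              # fact 2
          | (~df["tabular"] & (df["semantic"] | df["definition list"]))                    # fact 3
          | (df["tabular"] & (df["header row"] | df["header column"]) & ~df["semantic"])   # fact 4
      )

      print(df.loc[violations, ["file", "object id"]])   # rows whose answers contradict the facts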

  4. Data from: Metaphor annotations in Polish political debates from 2020 (TVP...

    • live.european-language-grid.eu
    binary format
    Updated Jun 30, 2021
    Cite
    (2021). Metaphor annotations in Polish political debates from 2020 (TVP 2019-10-01 and TVN 2019-10-08) – presidential election [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/8682
    Explore at:
    Available download formats: binary format
    Dataset updated
    Jun 30, 2021
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    The data published here are supplementary material for a paper to be published in Metaphor and Social Words (under revision).

    Two debates organised and published by TVP and TVN were transcribed and annotated with Metaphor Identification Method. We have used eMargin software (a collaborative textual annotation tool, (Kehoe and Gee 2013) and a slightly modified version of MIP (Pragglejaz 2007). Each lexical unit in the transcript was labelled as a metaphor related word (MRW) if its “contextual meaning was related to the more basic meaning by some form of similarity” (Steen 2007). The meanings were established with the Wielki Słownik Języka Polskiego (Great Dictionary of Polish, ed. (Żmigrodzki 2019). In addition to MRW, lexemes which create a metaphorical expression together with MRW were tagged as metaphor expression word (MEW). At least two words are needed to identify the actual metaphorical expression, since MRW cannot appear without MEW. Grammatical construction of the metaphor (Sullivan 2009) is asymmetrical: one word is conceptually autonomous and the other is conceptually dependent on the first. Within construction grammar terms (Langacker 2008), metaphor related word is elaborated with/by metaphorical expression word, because the basic meaning of MRW is elaborated and extended to more figurative meaning only if it is used jointly with MEW. Moreover, the meaning of the MEW is rather basic, concrete, as it remains unchanged in connection with the MRW. This can be clearly seen in the expression often used in our data: “Służba zdrowia jest w zapaści” (“Health service suffers from a collapse.”) where the word “zapaść” (“collapse”) is an example of MRW and words “służba zdrowia” (“health service”) are labeled as MEW. The English translation of this expression needs a different verb, instead of “jest w zapaści” (“is in collapse”) the English unmarked collocation is “suffers from a collapse”, therefore words “suffers from a collapse” are labeled as MRW. The “collapse” could be caused by heart failure, such as cardiac arrest or any other life-threatening medical condition and “health service” is portrayed as if it could literally suffer from such a condition – a collapse.

    The data are in CSV tables exported from XML files downloaded from the eMargin site. Prior to annotation, the transcripts were divided into 40 parts, one for each annotator. MRW words are marked as MLN, MEW words are marked as MLP, functional words within a metaphorical expression are marked MLI, and all other words are marked noana, which means no annotation is needed.
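
    As a rough illustration of how one of the exported tables might be inspected, the minimal pandas sketch below counts the annotation labels; the file name, separator and column names are assumptions, since the exact CSV layout is not specified here.

      import pandas as pd

      # Hypothetical file and column names; adjust to the actual export from eMargin.
      df = pd.read_csv("debate_part_01.csv", sep=",")
      print(df["tag"].value_counts())   # expected labels: MLN, MLP, MLI, noana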

  5. Definition of concept coverage scores for ASSESS CT manual annotation.

    • figshare.com
    • plos.figshare.com
    xls
    Updated Jun 2, 2023
    Cite
    Jose Antonio Miñarro-Giménez; Catalina Martínez-Costa; Daniel Karlsson; Stefan Schulz; Kirstine Rosenbeck Gøeg (2023). Definition of concept coverage scores for ASSESS CT manual annotation. [Dataset]. http://doi.org/10.1371/journal.pone.0209547.t002
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Jose Antonio Miñarro-Giménez; Catalina Martínez-Costa; Daniel Karlsson; Stefan Schulz; Kirstine Rosenbeck Gøeg
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Connecticut
    Description

    Definition of concept coverage scores for ASSESS CT manual annotation.

  6. Dataset with four years of condition monitoring technical language...

    • gimi9.com
    Updated Jan 8, 2024
    + more versions
    Cite
    (2024). Dataset with four years of condition monitoring technical language annotations from paper machine industries in northern Sweden | gimi9.com [Dataset]. https://gimi9.com/dataset/eu_https-doi-org-10-5878-hafd-ms27/
    Explore at:
    Dataset updated
    Jan 8, 2024
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Sweden
    Description

    This dataset consists of four years of technical language annotations from two paper machines in northern Sweden, structured as a Pandas dataframe. The same data is also available as a semicolon-separated .csv file. The data consists of two columns, where the first column corresponds to annotation note contents, and the second column corresponds to annotation titles. The annotations are in Swedish, and processed so that all mentions of personal information are replaced with the string ‘egennamn’, meaning "personal name" in Swedish. Each row corresponds to one annotation with the corresponding title. Data can be accessed in Python with:

      import pandas as pd

      annotations_df = pd.read_pickle("Technical_Language_Annotations.pkl")
      annotation_contents = annotations_df['noteComment']
      annotation_titles = annotations_df['title']

  7. Disco-Annotation

    • data.niaid.nih.gov
    Updated Oct 6, 2020
    Cite
    Popescu-Bellis, Andrei; Meyer, Thomas; Liyanapathirana, Jeevanthi; Cartoni, Bruno; Zufferey, Sandrine; Hajlaoui, Najeh (2020). Disco-Annotation [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4061389
    Explore at:
    Dataset updated
    Oct 6, 2020
    Dataset provided by
    University of Geneva
    Idiap Research Institute
    Université Catholique de Louvain
    Authors
    Popescu-Bellis, Andrei; Meyer, Thomas; Liyanapathirana, Jeevanthi; Cartoni, Bruno; Zufferey, Sandrine; Hajlaoui, Najeh
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Disco-Annotation is a collection of training and test sets with manually annotated discourse relations for 8 discourse connectives in Europarl texts.

    The 8 connectives with their annotated relations are:

    although (contrast|concession)

    as (prep|causal|temporal|comparison|concession)

    however (contrast|concession)

    meanwhile (contrast|temporal)

    since (causal|temporal|temporal-causal)

    though (contrast|concession)

    while (contrast|concession|temporal|temporal-contrast|temporal-causal)

    yet (adv|contrast|concession)

    For each connective there is a training set and a test set. The relations were annotated by two trained annotators with a translation spotting method. The division into training and test sets also allows for comparison if you train your own models.

    If you need software for the latter, have a look at: https://github.com/idiap/DiscoConn-Classifier

    Citation

    Please cite the following papers if you make use of these datasets (and to know more about the annotation method):

    @INPROCEEDINGS{Popescu-Belis-LREC-2012,
      author = {Popescu-Belis, Andrei and Meyer, Thomas and Liyanapathirana, Jeevanthi and Cartoni, Bruno and Zufferey, Sandrine},
      title = {{D}iscourse-level {A}nnotation over {E}uroparl for {M}achine {T}ranslation: {C}onnectives and {P}ronouns},
      booktitle = {Proceedings of the eighth international conference on Language Resources and Evaluation ({LREC})},
      year = {2012},
      address = {Istanbul, Turkey}
    }

    @Article{Cartoni-DD-2013,
      Author = {Cartoni, Bruno and Zufferey, Sandrine and Meyer, Thomas},
      Title = {{Annotating the meaning of discourse connectives by looking at their translation: The translation-spotting technique}},
      Journal = {Dialogue & Discourse},
      Volume = {4},
      Number = {2},
      pages = {65--86},
      year = {2013}
    }

    @ARTICLE{Meyer-TSLP-submitted,
      author = {Meyer, Thomas and Hajlaoui, Najeh and Popescu-Belis, Andrei},
      title = {{Disambiguating Discourse Connectives for Statistical Machine Translation in Several Languages}},
      journal = {IEEE/ACM Transactions of Audio, Speech, and Language Processing},
      year = {submitted},
      volume = {},
      pages = {},
      number = {}
    }

  8. GMB Data

    • kaggle.com
    zip
    Updated Jul 31, 2025
    Cite
    Ghassen Khaled (2025). GMB Data [Dataset]. https://www.kaggle.com/datasets/ghassenkhaled/gmb-data
    Explore at:
    Available download formats: zip (3265952 bytes)
    Dataset updated
    Jul 31, 2025
    Authors
    Ghassen Khaled
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    For this notebook, we're going to use the GMB (Groningen Meaning Bank) corpus for named entity recognition. GMB is a fairly large corpus with a lot of annotations. The data is labeled using the IOB format (short for inside, outside, beginning), which means each annotation also needs a prefix of I, O, or B.

    The following classes appear in the dataset:

    LOC - Geographical Entity
    ORG - Organization
    PER - Person
    GPE - Geopolitical Entity
    TIME - Time indicator
    ART - Artifact
    EVE - Event
    NAT - Natural Phenomenon

    Note: GMB is not completely human annotated, and it’s not considered 100% correct. For this exercise, classes ART, EVE, and NAT were combined into a MISC class due to the small number of examples for these classes.
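
    As a rough illustration of the IOB prefixes and the ART/EVE/NAT-to-MISC merge described above, here is a minimal Python sketch; the exact tag spellings (e.g. "B-ART") are an assumption based on the class list.

      def merge_rare_classes(tag: str) -> str:
          """Map ART, EVE and NAT tags to MISC while keeping the IOB prefix."""
          if tag == "O":
              return tag
          prefix, _, entity = tag.partition("-")
          if entity in {"ART", "EVE", "NAT"}:
              entity = "MISC"
          return f"{prefix}-{entity}"

      print([merge_rare_classes(t) for t in ["B-PER", "I-ART", "B-NAT", "O"]])
      # ['B-PER', 'I-MISC', 'B-MISC', 'O']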

  9. Preliminary functional annotation of the sheep genome

    • data.csiro.au
    • researchdata.edu.au
    Updated Nov 9, 2017
    Cite
    Marina Naval Sanchez; Quan Nguyen; Sean McWilliam; Laercio Porto Neto; Toni Reverter-Gomez; James Kijas (2017). Preliminary functional annotation of the sheep genome [Dataset]. http://doi.org/10.4225/08/5a03a9c39a0ba
    Explore at:
    Dataset updated
    Nov 9, 2017
    Dataset provided by
    CSIRO: http://www.csiro.au/
    Authors
    Marina Naval Sanchez; Quan Nguyen; Sean McWilliam; Laercio Porto Neto; Toni Reverter-Gomez; James Kijas
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jan 1, 2016 - Jan 1, 2017
    Dataset funded by
    CSIRO: http://www.csiro.au/
    Description

    In the absence of detailed functional annotation for any livestock genome, we used comparative genomics to predict ovine regulatory elements using human data. Reciprocal liftOver was used to predict the ovine genome location of ENCODE promoters and enhancers, along with 12 chromatin states built using 127 diverse epigenome. Here we make available the following files: a) Sheep_epigenome_predicted_features.tar.gz: contains the final reciprocal best alignment from ENCODE proximal as well as chromHMM ROADMAP features. The result of reciprocal liftOver. b) liftOver_sheep_temporary_files.tar.gz: We have added a new tar file with liftOver temporary files containing i) LiftOver temporary files mapping human to sheep, ii) LiftOver temporary files mapping sheep back to human and iii) Dictionary files containing the link between human to sheep coordinates for exact best-reciprocal files.

    Lineage: Building a comparative sheep functional annotation. Our approach exploited the wealth of functional annotation data generated by the Epigenome Roadmap and ENCODE studies. We performed reciprocal liftOver (minMatch=0.1), meaning elements that mapped to sheep also needed to map in the reverse direction back to human with high quality. This bi-directional comparative mapping approach was applied to 12 chromatin states defined using 5 core histone modification marks, H3K4me3, H3K4me1, H3K36me3, H3K9me3, H3K27me3. Mapping success is given in Supplementary Table 9. The same approach was applied to ENCODE marks derived from 94 cell types (https://www.encodeproject.org/data/annotations/v2/) with DNase-seq and TF ChIP-seq.
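
    As a rough illustration of the reciprocal-best criterion, the minimal Python sketch below keeps a human element only if its sheep mapping maps back to the same human locus; the interval strings are hypothetical stand-ins for the forward and reverse liftOver outputs.

      # Hypothetical forward (human -> sheep) and reverse (sheep -> human) mappings.
      human_to_sheep = {"chr1:100-200": "ovi1:500-600", "chr2:300-400": "ovi2:900-1000"}
      sheep_to_human = {"ovi1:500-600": "chr1:100-200", "ovi2:900-1000": "chr7:1-50"}

      reciprocal_best = {
          human: sheep
          for human, sheep in human_to_sheep.items()
          if sheep_to_human.get(sheep) == human   # must map back to the same human element
      }
      print(reciprocal_best)   # only the chr1 element survives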

  10. Data from: Slovenian Word in Context dataset SloWiC 1.0

    • clarin.si
    • live.european-language-grid.eu
    Updated Mar 23, 2023
    Cite
    Timotej Knez; Slavko Žitnik (2023). Slovenian Word in Context dataset SloWiC 1.0 [Dataset]. https://clarin.si/repository/xmlui/handle/11356/1781?locale-attribute=en
    Explore at:
    Dataset updated
    Mar 23, 2023
    Authors
    Timotej Knez; Slavko Žitnik
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    The SloWIC dataset is a Slovenian dataset for the Word in Context task. Each example in the dataset contains a target word with multiple meanings and two sentences that both contain the target word. Each example is also annotated with a label that shows if both sentences use the same meaning of the target word. The dataset contains 1808 manually annotated sentence pairs and additional 13150 automatically annotated pairs to help with training larger models. The dataset is stored in the JSON format following the format used in the SuperGLUE version of the Word in Context task (https://super.gluebenchmark.com/).

    Each example contains the following data fields:
    - word: The target word with multiple meanings
    - sentence1: The first sentence containing the target word
    - sentence2: The second sentence containing the target word
    - idx: The index of the example in the dataset
    - label: Label showing if the sentences contain the same meaning of the target word
    - start1: Start of the target word in the first sentence
    - start2: Start of the target word in the second sentence
    - end1: End of the target word in the first sentence
    - end2: End of the target word in the second sentence
    - version: The version of the annotation
    - manual_annotation: Boolean showing if the label was manually annotated
    - group: The group of annotators that labelled the example
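
    As a rough illustration of how these fields might be used, the minimal Python sketch below reads the examples and recovers the target-word spans; it assumes one JSON object per line (as in the SuperGLUE distribution) and a hypothetical file name.

      import json

      with open("slowic.jsonl", encoding="utf-8") as f:
          for line in f:
              ex = json.loads(line)
              span1 = ex["sentence1"][ex["start1"]:ex["end1"]]   # surface form in sentence 1
              span2 = ex["sentence2"][ex["start2"]:ex["end2"]]   # surface form in sentence 2
              print(ex["idx"], ex["word"], span1, span2, ex["label"], ex["manual_annotation"])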

  11. DWUG DE: Diachronic Word Usage Graphs for German

    • zenodo.org
    zip
    Updated Apr 23, 2025
    Cite
    Dominik Schlechtweg; Dominik Schlechtweg; Barbara McGillivray; Barbara McGillivray; Simon Hengchen; Simon Hengchen; Haim Dubossarsky; Haim Dubossarsky; Nina Tahmasebi; Nina Tahmasebi (2025). DWUG DE: Diachronic Word Usage Graphs for German [Dataset]. http://doi.org/10.5281/zenodo.5543724
    Explore at:
    Available download formats: zip
    Dataset updated
    Apr 23, 2025
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Dominik Schlechtweg; Dominik Schlechtweg; Barbara McGillivray; Barbara McGillivray; Simon Hengchen; Simon Hengchen; Haim Dubossarsky; Haim Dubossarsky; Nina Tahmasebi; Nina Tahmasebi
    License

    Attribution-NoDerivs 4.0 (CC BY-ND 4.0): https://creativecommons.org/licenses/by-nd/4.0/
    License information was derived automatically

    Description

    This data collection contains diachronic Word Usage Graphs (WUGs) for German. Find a description of the data format, code to process the data and further datasets on the WUGsite.

    We provide additional data under misc/:

    • semeval: a larger list of words and (noisy) change scores assembled in the pre-annotation phase for SemEval-2020 Task 1.

    Please find more information on the provided data in the paper referenced below.

    Version: 1.0.0, 30.9.2021.

    Reference

    Dominik Schlechtweg, Nina Tahmasebi, Simon Hengchen, Haim Dubossarsky, Barbara McGillivray. 2021. DWUG: A large Resource of Diachronic Word Usage Graphs in Four Languages.

  12. DiscoWUG: Discovered Diachronic Word Usage Graphs for German

    • zenodo.org
    zip
    Updated Apr 21, 2025
    Cite
    Dominik Schlechtweg; Dominik Schlechtweg; Sinan Kurtyigit; Maike Park; Maike Park; Jonas Kuhn; Sabine Schulte im Walde; Sabine Schulte im Walde; Sinan Kurtyigit; Jonas Kuhn (2025). DiscoWUG: Discovered Diachronic Word Usage Graphs for German [Dataset]. http://doi.org/10.5281/zenodo.14028592
    Explore at:
    Available download formats: zip
    Dataset updated
    Apr 21, 2025
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Dominik Schlechtweg; Dominik Schlechtweg; Sinan Kurtyigit; Maike Park; Maike Park; Jonas Kuhn; Sabine Schulte im Walde; Sabine Schulte im Walde; Sinan Kurtyigit; Jonas Kuhn
    License

    Attribution-NoDerivs 4.0 (CC BY-ND 4.0): https://creativecommons.org/licenses/by-nd/4.0/
    License information was derived automatically

    Description

    This data collection contains discovered diachronic Word Usage Graphs (WUGs) for German. Find a description of the data format, code to process the data and further datasets on the WUGsite.

    Note:

    • The date given for each word use does not correspond to the exact date of the document from which the use was sampled but only to the midpoint of the respective time period (1800-1899, 1946-1990), as the exact date was not available in the SemEval corpora.

    Please find more information on the provided data in the papers referenced below.

    Reference

    Sinan Kurtyigit, Maike Park, Dominik Schlechtweg, Jonas Kuhn, Sabine Schulte im Walde. 2021. Lexical Semantic Change Discovery. Proceedings of the Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing.

    Dominik Schlechtweg, Pierluigi Cassotti, Bill Noble, David Alfter, Sabine Schulte im Walde, Nina Tahmasebi. More DWUGs: Extending and Evaluating Word Usage Graph Datasets in Multiple Languages. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing.

  13. Annotated GMB Corpus

    • kaggle.com
    zip
    Updated Oct 7, 2018
    + more versions
    Cite
    Shoumik (2018). Annotated GMB Corpus [Dataset]. https://www.kaggle.com/shoumikgoswami/annotated-gmb-corpus
    Explore at:
    Available download formats: zip (473318 bytes)
    Dataset updated
    Oct 7, 2018
    Authors
    Shoumik
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    Context

    Named Entity Recognition on an annotated corpus: the GMB (Groningen Meaning Bank) corpus is used for entity classification, with enhanced and popular Natural Language Processing features applied to the data set.

    Content

    The dataset is an extract from the GMB corpus which is tagged, annotated and built specifically to train a classifier to predict named entities such as name, location, etc. GMB is a fairly large corpus with a lot of annotations. Unfortunately, GMB is not perfect. It is not a gold standard corpus, meaning that it’s not completely human annotated and it’s not considered 100% correct. The corpus was created by using existing annotators and then corrected by humans where needed. The attached dataset is in tab-separated format, and the goal is to create a good model to classify the Tag column. The data is labelled using the IOB tagging system. The dataset contains the following classes:

    geo = Geographical Entity
    org = Organization
    per = Person
    gpe = Geopolitical Entity
    tim = Time indicator
    art = Artifact
    eve = Event
    nat = Natural Phenomenon
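
    As a rough illustration of how IOB tags group into entity spans, here is a minimal Python sketch; the (token, tag) pairs are illustrative only, and the actual tab-separated columns should be checked against the corpus.

      def iob_to_spans(tokens, tags):
          """Collect B-/I- tagged runs into (type, start index, tokens) spans."""
          spans, current = [], None
          for i, (token, tag) in enumerate(zip(tokens, tags)):
              if tag.startswith("B-"):
                  if current:
                      spans.append(current)
                  current = {"type": tag[2:], "start": i, "tokens": [token]}
              elif tag.startswith("I-") and current and current["type"] == tag[2:]:
                  current["tokens"].append(token)
              else:
                  if current:
                      spans.append(current)
                  current = None
          if current:
              spans.append(current)
          return spans

      print(iob_to_spans(["John", "Smith", "visited", "London"],
                         ["B-per", "I-per", "O", "B-geo"]))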

    Acknowledgements

    The dataset is a subset of the original dataset shared here - https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus/kernels

    Inspiration

    The data can be used by anyone who is starting off with NER in NLP.

  14. ActiveHuman Part 1

    • data.niaid.nih.gov
    Updated Nov 14, 2023
    + more versions
    Cite
    Charalampos Georgiadis (2023). ActiveHuman Part 1 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8359765
    Explore at:
    Dataset updated
    Nov 14, 2023
    Dataset provided by
    Aristotle University of Thessaloniki (AUTh)
    Authors
    Charalampos Georgiadis
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is Part 1/2 of the ActiveHuman dataset! Part 2 can be found here. Dataset Description ActiveHuman was generated using Unity's Perception package. It consists of 175428 RGB images and their semantic segmentation counterparts taken at different environments, lighting conditions, camera distances and angles. In total, the dataset contains images for 8 environments, 33 humans, 4 lighting conditions, 7 camera distances (1m-4m) and 36 camera angles (0-360 at 10-degree intervals). The dataset does not include images at every single combination of available camera distances and angles, since for some values the camera would collide with another object or go outside the confines of an environment. As a result, some combinations of camera distances and angles do not exist in the dataset. Alongside each image, 2D Bounding Box, 3D Bounding Box and Keypoint ground truth annotations are also generated via the use of Labelers and are stored as a JSON-based dataset. These Labelers are scripts that are responsible for capturing ground truth annotations for each captured image or frame. Keypoint annotations follow the COCO format defined by the COCO keypoint annotation template offered in the perception package.

    Folder configuration The dataset consists of 3 folders:

    JSON Data: Contains all the generated JSON files. RGB Images: Contains the generated RGB images. Semantic Segmentation Images: Contains the generated semantic segmentation images.

    Essential Terminology

    Annotation: Recorded data describing a single capture. Capture: One completed rendering process of a Unity sensor which stored the rendered result to data files (e.g. PNG, JPG, etc.). Ego: Object or person to which a collection of sensors is attached (e.g., if a drone has a camera attached to it, the drone would be the ego and the camera would be the sensor). Ego coordinate system: Coordinates with respect to the ego. Global coordinate system: Coordinates with respect to the global origin in Unity. Sensor: Device that captures the dataset (in this instance the sensor is a camera). Sensor coordinate system: Coordinates with respect to the sensor. Sequence: Time-ordered series of captures. This is very useful for video capture where the time-order relationship of two captures is vital. UUID: Universally Unique Identifier. It is a unique hexadecimal identifier that can represent an individual instance of a capture, ego, sensor, annotation, labeled object or keypoint, or keypoint template.

    Dataset Data The dataset includes 4 types of JSON annotation files:

    annotation_definitions.json: Contains annotation definitions for all of the active Labelers of the simulation stored in an array. Each entry consists of a collection of key-value pairs which describe a particular type of annotation and contain information about that specific annotation describing how its data should be mapped back to labels or objects in the scene. Each entry contains the following key-value pairs:

    id: Integer identifier of the annotation's definition. name: Annotation name (e.g., keypoints, bounding box, bounding box 3D, semantic segmentation). description: Description of the annotation's specifications. format: Format of the file containing the annotation specifications (e.g., json, PNG). spec: Format-specific specifications for the annotation values generated by each Labeler.

    Most Labelers generate different annotation specifications in the spec key-value pair:

    BoundingBox2DLabeler/BoundingBox3DLabeler:

    label_id: Integer identifier of a label. label_name: String identifier of a label. KeypointLabeler:

    template_id: Keypoint template UUID. template_name: Name of the keypoint template. key_points: Array containing all the joints defined by the keypoint template. This array includes the key-value pairs:

    label: Joint label. index: Joint index. color: RGBA values of the keypoint. color_code: Hex color code of the keypoint. skeleton: Array containing all the skeleton connections defined by the keypoint template. Each skeleton connection defines a connection between two different joints. This array includes the key-value pairs:

    label1: Label of the first joint. label2: Label of the second joint. joint1: Index of the first joint. joint2: Index of the second joint. color: RGBA values of the connection. color_code: Hex color code of the connection. SemanticSegmentationLabeler:

    label_name: String identifier of a label. pixel_value: RGBA values of the label. color_code: Hex color code of the label.

    captures_xyz.json: Each of these files contains an array of ground truth annotations generated by each active Labeler for each capture separately, as well as extra metadata that describe the state of each active sensor that is present in the scene. Each array entry contains the following key-value pairs:

    id: UUID of the capture. sequence_id: UUID of the sequence. step: Index of the capture within a sequence. timestamp: Timestamp (in ms) since the beginning of a sequence. sensor: Properties of the sensor. This entry contains a collection with the following key-value pairs:

    sensor_id: Sensor UUID. ego_id: Ego UUID. modality: Modality of the sensor (e.g., camera, radar). translation: 3D vector that describes the sensor's position (in meters) with respect to the global coordinate system. rotation: Quaternion variable that describes the sensor's orientation with respect to the ego coordinate system. camera_intrinsic: matrix containing (if it exists) the camera's intrinsic calibration. projection: Projection type used by the camera (e.g., orthographic, perspective). ego: Attributes of the ego. This entry contains a collection with the following key-value pairs:

    ego_id: Ego UUID. translation: 3D vector that describes the ego's position (in meters) with respect to the global coordinate system. rotation: Quaternion variable containing the ego's orientation. velocity: 3D vector containing the ego's velocity (in meters per second). acceleration: 3D vector containing the ego's acceleration (in ). format: Format of the file captured by the sensor (e.g., PNG, JPG). annotations: Key-value pair collections, one for each active Labeler. These key-value pairs are as follows:

    id: Annotation UUID . annotation_definition: Integer identifier of the annotation's definition. filename: Name of the file generated by the Labeler. This entry is only present for Labelers that generate an image. values: List of key-value pairs containing annotation data for the current Labeler.

    Each Labeler generates different annotation specifications in the values key-value pair:

    BoundingBox2DLabeler:

    label_id: Integer identifier of a label. label_name: String identifier of a label. instance_id: UUID of one instance of an object. Each object with the same label that is visible on the same capture has different instance_id values. x: Position of the 2D bounding box on the X axis. y: Position of the 2D bounding box position on the Y axis. width: Width of the 2D bounding box. height: Height of the 2D bounding box. BoundingBox3DLabeler:

    label_id: Integer identifier of a label. label_name: String identifier of a label. instance_id: UUID of one instance of an object. Each object with the same label that is visible on the same capture has different instance_id values. translation: 3D vector containing the location of the center of the 3D bounding box with respect to the sensor coordinate system (in meters). size: 3D vector containing the size of the 3D bounding box (in meters). rotation: Quaternion variable containing the orientation of the 3D bounding box. velocity: 3D vector containing the velocity of the 3D bounding box (in meters per second). acceleration: 3D vector containing the acceleration of the 3D bounding box (in meters per second squared). KeypointLabeler:

    label_id: Integer identifier of a label. instance_id: UUID of one instance of a joint. Keypoints with the same joint label that are visible on the same capture have different instance_id values. template_id: UUID of the keypoint template. pose: Pose label for that particular capture. keypoints: Array containing the properties of each keypoint. Each keypoint that exists in the keypoint template file is one element of the array. Each entry's contents are as follows:

    index: Index of the keypoint in the keypoint template file. x: Pixel coordinates of the keypoint on the X axis. y: Pixel coordinates of the keypoint on the Y axis. state: State of the keypoint.

    The SemanticSegmentationLabeler does not contain a values list.

    egos.json: Contains collections of key-value pairs for each ego. These include:

    id: UUID of the ego. description: Description of the ego. sensors.json: Contains collections of key-value pairs for all sensors of the simulation. These include:

    id: UUID of the sensor. ego_id: UUID of the ego on which the sensor is attached. modality: Modality of the sensor (e.g., camera, radar, sonar). description: Description of the sensor (e.g., camera, radar).

    Image names The RGB and semantic segmentation images share the same image naming convention. However, the semantic segmentation images also contain the string Semantic_ at the beginning of their filenames. Each RGB image is named "e_h_l_d_r.jpg", where:

    e denotes the id of the environment. h denotes the id of the person. l denotes the id of the lighting condition. d denotes the camera distance at which the image was captured. r denotes the camera angle at which the image was captured.

  15. TheSu XML Schema Definition, ns 1.0

    • rdr.kuleuven.be
    • data.europa.eu
    txt, xml
    Updated Nov 27, 2023
    Cite
    Daniele Morrone; Daniele Morrone (2023). TheSu XML Schema Definition, ns 1.0 [Dataset]. http://doi.org/10.48804/KD8QPO
    Explore at:
    Available download formats: xml (796287), txt (5295)
    Dataset updated
    Nov 27, 2023
    Dataset provided by
    KU Leuven RDR
    Authors
    Daniele Morrone; Daniele Morrone
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    This dataset includes a replica of the current version of the TheSu XML Schema Definition (XSD) within the namespace 1.0, as available at the URL: https://alchemeast.eu/thesu/ns/1.0/TheSu.xsd. TheSu XML, an acronym for 'Thesis-Support', is an XML annotation schema for the digital analysis, indexing, and mapping of ideas and their contexts of enunciation in any source. It is tailored to assist research in the history of ideas, philosophy, science, and technology. The complete documentation for this XML Schema Definition is accessible at the URL: https://alchemeast.eu/thesu/ns/1.0/documentation/TheSu.html.
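
    As a rough illustration of how a document could be checked against this schema, here is a minimal Python sketch using lxml; the local file names are assumptions.

      from lxml import etree

      schema = etree.XMLSchema(etree.parse("TheSu.xsd"))        # the XSD from this dataset
      document = etree.parse("my_annotation.xml")               # hypothetical annotated document

      if schema.validate(document):
          print("document conforms to the TheSu 1.0 schema")
      else:
          for error in schema.error_log:
              print(error.line, error.message)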

  16. Anatomical similarity annotations v0.2

    • data-staging.niaid.nih.gov
    • data.niaid.nih.gov
    • +1more
    Updated Jun 29, 2022
    Cite
    Niknejad, Anne; Bastian, Frederic (2022). Anatomical similarity annotations v0.2 [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_6778883
    Explore at:
    Dataset updated
    Jun 29, 2022
    Dataset provided by
    SIB Swiss Institute of Bioinformatics
    Authors
    Niknejad, Anne; Bastian, Frederic
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The anatomical similarity annotations are used to define evolutionary relations between anatomical entities described in the Uberon ontology. The annotations are currently focused on the concept of historical homology, meaning that they try to capture which structures are believed to derive from a common ancestral structure. These annotations follow practices similar to the Gene Ontology consortium guidelines to capture evidence lines. Each statement about homology is captured as a single annotation, providing a reference, a UBERON or Cell Ontology term to capture the anatomical entity, an NCBI Taxonomy term for the ancestral taxon, an ECO term for the evidence type, and a Confidence Information Ontology term for the confidence in the annotation.

  17. CCIHP dataset

    • kaggle.com
    zip
    Updated Oct 29, 2022
    Cite
    Angelique Loesch (2022). CCIHP dataset [Dataset]. https://www.kaggle.com/angeliqueloesch/ccihp-characterized-crowd-instancelevel-hp
    Explore at:
    Available download formats: zip (24041 bytes)
    Dataset updated
    Oct 29, 2022
    Authors
    Angelique Loesch
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Characterized Crowd Instance-level Human Parsing (CCIHP) dataset

    CCIHP dataset is devoted to fine-grained description of people in the wild with localized & characterized semantic attributes. It contains 20 attribute classes and 20 characteristic classes split into 3 categories (size, pattern and color). The annotations were made with Pixano, an opensource, smart annotation tool for computer vision applications: https://pixano.cea.fr/

    CCIHP dataset provides pixelwise image annotations for:

    • human segmentation,
    • semantic attribute segmentation and
    • semantic attribute characterization.

    Dataset description

    • Images:
      The image data are the same as CIHP dataset (see Section Related work) proposed at the LIP (Look Into Person) challenge. They are available at google drive and baidu drive. (Baidu link does not need access right).

    • Annotations:
      Please download and unzip the CCIHP_icip.zip file. The CCIHP annotations can be found in the Training and Validation sub-folders of CCIHP_icip2021/dataset/ folder. They correspond to, respectively, 28,280 training images and 5,000 validation images. Annotations consist of:

      • Human_ids: person instance labels
      • Instance_ids: attribute instance labels
      • Category_ids: attribute category labels
      • Size_ids: size category labels
      • Pattern_ids: pattern category labels
      • Color_ids: color category labels

      Label meaning for semantic attribute/body parts:

      1. Hat: Hat, helmet, cap, hood, veil, headscarf, part covering the skull and hair of a hood/balaclava, crown...
      2. Hair
      3. Glove
      4. Sunglasses/Glasses: Sunglasses, eyewear, protective glasses...
      5. UpperClothes: T-shirt, shirt, tank top, sweater under a coat, top of a dress...
      6. Face Mask: Protective mask, surgical mask, carnival mask, facial part of a balaclava, visor of a helmet...
      7. Coat: Coat, jacket worn without anything on it, vest with nothing on it, a sweater with nothing on it...
      8. Socks
      9. Pants: Pants, shorts, tights, leggings, swimsuit bottoms... (clothing with 2 legs)
      10. Torso-skin
      11. Scarf: Scarf, bow tie, tie...
      12. Skirt: Skirt, kilt, bottom of a dress...
      13. Face
      14. Left-arm (naked part)
      15. Right-arm (naked part)
      16. Left-leg (naked part)
      17. Right-leg (naked part)
      18. Left-shoe
      19. Right-shoe
      20. Bag: Backpack, shoulder bag, fanny pack... (bag carried on oneself)
      21. Others: Jewelry, tags, bibs, belts, ribbons, pins, head decorations, headphones...

      Label meaning for size characterization:

      1. Short: Small, short, narrow
      2. Long: Long, large, big
      3. Undetermined: If the attribute is partially hidden
      4. Sparse/bald: For hair attribute only

      Label meaning for pattern characterization:

      1. Solid
      2. Geometrical: Stripes, Checks, Dots...
      3. Fancy: Flowers, Military...
      4. Letters: Letters, numbers, symbols...

      Label meaning for color characterization:

      1. Dark: No dominant color, includes black, navy blue
      2. Medium: No dominant color, including gray
      3. Light: No dominant color, including white
      4. Brown
      5. Red
      6. Pink
      7. Yellow
      8. Orange
      9. Green
      10. Blue
      11. Purple
      12. Multicolor: When there is a pattern with several colors

    Related work

    Our work is based on CIHP image dataset from: Ke Gong, Xiaodan Liang, Yicheng Li, Yimin Chen, Ming Yang and Liang Lin, "Instance-level Human Parsing via Part Grouping Network", ECCV 2018.

    Evaluation

    To evaluate the predictions given by a Human Parsing with Characteristics model, you can run the python scripts in CCIHP_icip2021/evaluation/ folder.

    Requirements

    • python==3.6+
    • opencv-python
    • pillow

    Evaluation steps

    1. Run generate_characteristic_instance_part_ccihp.py
    2. Run eval_test_characteristic_inst_part_ap_ccihp.py for mean Average Precision based on characterized region (AP^(cr)_(vol)). It evaluates the prediction of characteristic (class & score) relative to each instanced and characterized attribute mask, independently of the attribute class prediction.
    3. Run metric_ccihp_miou_evaluation.py for a mIoU performance evaluation of semantic predictions (attribute or characteristics).

    License

    Data annotations are under Creative Commons Attribution Non Commercial 4.0 license (see LICENSE file).

    Evaluation codes are under MIT license.

    Citation

    A. Loesch and R. Audigier, "Describe Me If You Can! Characterized Instance-Level Human Parsing," 2021 IEEE International Conference on Image Processing (ICIP), 2021, pp. 2528-2532, doi: 10.1109/ICIP42928.2021.9506509.

    @INPROCEEDINGS{ccihp_dataset_2021,
     author={Loesch, Angelique and Audigier, Romaric},
     booktitle={2021 IEEE International Conference on Image Processing (ICIP)}, 
     title={Describe Me If You Can! Characterized Instance-Level Human Parsing}, 
     year={2021},
     volume={},
     number={},
     pages={2528-2532},
     doi={10.1109/ICIP42928.2021.9506509}
    }

    Contact

    If you have any question about this dataset, you can contact us by email at: ccihp-dataset@cea.fr

  18. R

    Microsoft Coco Dataset

    • universe.roboflow.com
    zip
    Updated Jul 23, 2025
    Microsoft (2025). Microsoft Coco Dataset [Dataset]. https://universe.roboflow.com/microsoft/coco/model/3
    Explore at:
    zipAvailable download formats
    Dataset updated
    Jul 23, 2025
    Dataset authored and provided by
    Microsoft
    Variables measured
    Object Bounding Boxes
    Description

    Microsoft Common Objects in Context (COCO) Dataset

    The Common Objects in Context (COCO) dataset is a widely recognized collection designed to spur object detection, segmentation, and captioning research. Created by Microsoft, COCO provides annotations including object categories, keypoints, and more, making it a valuable asset for machine learning practitioners and researchers. Today, many model architectures are benchmarked against COCO, which has enabled a standard system by which architectures can be compared.

    While COCO is often touted to comprise over 300k images, it is important to understand that this figure covers several annotation formats, keypoints among them. Specifically, the labeled dataset for object detection stands at 123,272 images.

    The full object detection labeled dataset is made available here, ensuring researchers have access to the most comprehensive data for their experiments. With that said, COCO has not released their test set annotations, meaning the test data doesn't come with labels. Thus, this data is not included in the dataset.
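
    If you work with the annotations outside of Roboflow, the standard pycocotools API can read the COCO JSON files directly. The snippet below is a minimal sketch: the annotation file path and the 'dog' category are placeholders, and the exact file names depend on how you download and unpack the dataset.

```python
from pycocotools.coco import COCO

# The path is a placeholder; point it at your extracted annotation file.
coco = COCO("annotations/instances_train2017.json")

# Look up images containing the 'dog' category and read their boxes.
cat_ids = coco.getCatIds(catNms=["dog"])
img_ids = coco.getImgIds(catIds=cat_ids)
anns = coco.loadAnns(coco.getAnnIds(imgIds=img_ids[:1], catIds=cat_ids))
for ann in anns:
    x, y, w, h = ann["bbox"]        # COCO boxes are [x, y, width, height]
    print(ann["category_id"], x, y, w, h)
```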

    The Roboflow team has worked extensively with COCO and maintains additional guides and resources that may be helpful as you get started working with this dataset.

  19. SURel: Synchronic Usage Relatedness

    • zenodo.org
    zip
    Updated Apr 23, 2025
    Anna Hätty; Dominik Schlechtweg; Dominik Schlechtweg; Sabine Schulte im Walde; Sabine Schulte im Walde; Anna Hätty (2025). SURel: Synchronic Usage Relatedness [Dataset]. http://doi.org/10.5281/zenodo.5543307
    Explore at:
    zipAvailable download formats
    Dataset updated
    Apr 23, 2025
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Anna Hätty; Dominik Schlechtweg; Dominik Schlechtweg; Sabine Schulte im Walde; Sabine Schulte im Walde; Anna Hätty
    License

    Attribution-NoDerivs 4.0 (CC BY-ND 4.0)https://creativecommons.org/licenses/by-nd/4.0/
    License information was derived automatically

    Description

    -------------------------------------
    See below for the German version.
    -------------------------------------

    Synchronic Usage Relatedness (SURel) - Test Set and Annotation Data


    This data collection supplementing the paper referenced below contains:

    - a semantic meaning shift test set with 22 German lexemes with different degrees of meaning shifts from general language to the domain of cooking. It comes as a tab-separated csv file where each line has the form

    lemma | POS | translations | mean relatedness score | frequency GEN | frequency SPEC

    The 'mean relatedness score' denotes the annotation-based measure of semantic shift described in the paper. 'frequency GEN' and 'frequency SPEC' list the frequencies of the target words in the general language corpus (GEN) and the domain-specific cooking corpus (SPEC). 'translations' gives English translations for different senses, illustrating possible meaning shifts. Note that further senses might exist (a minimal loading sketch for this file is given after this list);

    - the full annotation tables as the annotators received and filled them. The tables come as tab-separated csv files where each line has the form

    sentence 1 | rating | comment | sentence 2;

    - the annotation guidelines in English and German (only the German version was used);
    - data visualization plots.
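
    As a convenience, the test set can be loaded with pandas. The file name and the exact column labels below are assumptions for illustration; check the header of the distributed csv file before relying on them.

```python
import pandas as pd

# File name and column labels are assumptions; verify them against the
# header row of the distributed tab-separated csv file.
cols = ["lemma", "POS", "translations", "mean_relatedness", "freq_GEN", "freq_SPEC"]
surel = pd.read_csv("surel_testset.csv", sep="\t", header=0, names=cols)

# Rank the 22 target lexemes by the annotation-based meaning-shift measure.
print(surel.sort_values("mean_relatedness"))
```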

    Find more information in

    Anna Hätty, Dominik Schlechtweg, Sabine Schulte im Walde. 2019. SURel: A Gold Standard for Incorporating Meaning Shifts into Term Extraction. In Proceedings of the Eighth Joint Conference on Lexical and Computational Semantics (*SEM). Minneapolis, Minnesota USA 2019.

    Dominik Schlechtweg, Sabine Schulte im Walde, Stefanie Eckmann. 2018. Diachronic Usage Relatedness (DURel): A Framework for the Annotation of Lexical Semantic Change. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT). New Orleans, Louisiana USA 2018.


    The resources are freely available for education, research and other non-commercial purposes. More information can be requested via email to the authors.

    -------
    German
    -------

    Synchronic Usage Relatedness (SURel) - Test Set and Annotation Data


    This data collection supplements the article cited below and contains the following files:

    - a test set for semantic meaning shift with 22 German lexemes showing different degrees of meaning shift from general language to the domain-specific language of cooking. It is a tab-separated csv file where each line has the form:

    lemma | part of speech | translations | mean relatedness score | frequency GEN | frequency SPEC

    The 'mean relatedness score' denotes the annotation-based measure of meaning shift described in the paper. 'frequency GEN' and 'frequency SPEC' list the frequencies of the target words in the general language corpus (GEN) and the domain-specific corpus (SPEC). 'translations' contains English translations of possible senses to illustrate the meaning shift. Note that further senses might exist;

    - the annotation tables as the annotators received and filled them. The result tables are tab-separated csv files where each line has the form:

    sentence 1 | rating | comment | sentence 2

    - the annotation guidelines in German and English (only the German version was used);
    - data visualization plots.

    Find more information in

    Anna Hätty, Dominik Schlechtweg, Sabine Schulte im Walde. 2019. SURel: A Gold Standard for Incorporating Meaning Shifts into Term Extraction. In Proceedings of the Eighth Joint Conference on Lexical and Computational Semantics (*SEM). Minneapolis, Minnesota USA 2019.

    Dominik Schlechtweg, Sabine Schulte im Walde, Stefanie Eckmann. 2018. Diachronic Usage Relatedness (DURel): A Framework for the Annotation of Lexical Semantic Change. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT). New Orleans, Louisiana USA 2018.

    The resources are freely available for teaching, research and other non-commercial purposes. For further information, please send an email to the authors.

  20. NLM Chem Corpus

    • kaggle.com
    zip
    Updated Mar 26, 2021
    Saurabh Shahane (2021). NLM Chem Corpus [Dataset]. https://www.kaggle.com/saurabhshahane/nlm-chem-corpus
    Explore at:
    zip(2562967 bytes)Available download formats
    Dataset updated
    Mar 26, 2021
    Authors
    Saurabh Shahane
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    Automatically identifying chemical and drug names in scientific publications advances information access for this important class of entities in a variety of biomedical disciplines by enabling improved retrieval and linkage to related concepts. While current methods for tagging chemical entities were developed for the article title and abstract, their performance in the full article text is substantially lower. However, the full text frequently contains more detailed chemical information, such as the properties of chemical compounds, their biological effects, and interactions with diseases, genes, and other chemicals.

    We, therefore, present the NLM-Chem corpus, a full-text resource to support the development and evaluation of automated chemical entity taggers. The NLM-Chem corpus consists of 150 full-text articles, doubly annotated by ten expert NLM indexers, with ~5000 unique chemical name annotations, mapped to ~2000 MeSH identifiers. Using this corpus, we built a substantially improved chemical entity tagger, with automated annotations for all of PubMed and PMC freely accessible through the PubTator web-based interface and API.

    Content

    The NLM-Chem corpus consists of 150 full-text articles from the PubMed Central Open Access dataset, drawn from 67 different chemistry journals and aiming to cover the general distribution of chemical-name usage in the biomedical literature. Articles were selected so that human annotation would be most valuable (meaning they were rich in bio-entities and current state-of-the-art named entity recognition systems disagreed on bio-entity recognition).

    Ten indexing experts at the National Library of Medicine manually annotated the corpus using the TeamTat annotation system, which allows swift annotation project management. The corpus was annotated in three batches, and each batch of articles was annotated in three annotation rounds. Annotators were randomly paired for each article, and pairings were randomly shuffled for each subsequent batch. In this manner, the workload was distributed fairly. To control for bias, annotator identities were hidden during the first two annotation rounds. In the final annotation rounds, annotators worked collaboratively to resolve the last few annotation disagreements and reach 100% consensus.

    The full-text articles were fully annotated for all chemical name occurrences in text, and the chemicals were mapped to Medical Subject Heading (MeSH) entries to facilitate indexing and other downstream article processing tasks at the National Library of Medicine. MeSH is part of the UMLS and as such, chemical entities can be mapped to other standard vocabularies.

    The data has been evaluated for high annotation quality, and its use as training data has already improved chemical named entity recognition in PubMed. The newly improved system has already been incorporated in the PubTator API tools (https://www.ncbi.nlm.nih.gov/research/pubtator/api.html).
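
    For readers who want the automated annotations mentioned above, the sketch below queries the PubTator web API for chemical mentions in a single article. The endpoint, parameters, and response layout follow the PubTator documentation as the author understands it and may have changed, and the PMID is a placeholder; consult the API page linked above before relying on this.

```python
import requests

# Endpoint, parameters, and response layout follow the PubTator documentation
# as understood at the time of writing; verify against the API page above.
URL = "https://www.ncbi.nlm.nih.gov/research/pubtator-api/publications/export/biocjson"
resp = requests.get(URL, params={"pmids": "29355051", "concepts": "chemical"}, timeout=30)
resp.raise_for_status()

# The response is BioC JSON: one or more documents with passages, each
# passage carrying a list of annotations.
data = resp.json()
docs = data if isinstance(data, list) else [data]
for doc in docs:
    for passage in doc.get("passages", []):
        for ann in passage.get("annotations", []):
            print(ann.get("text"), ann.get("infons", {}).get("identifier"))
```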

    Acknowledgements

    Islamaj, Rezarta; Leaman, Robert; Lu, Zhiyong (2021), NLMChem a new resource for chemical entity recognition in PubMed full-text literature, Dryad, Dataset, https://doi.org/10.5061/dryad.3tx95x6dz

Charalampos Georgiadis (2023). ActiveHuman Part 2 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8361113

ActiveHuman Part 2

Explore at:
Dataset updated
Nov 14, 2023
Dataset provided by
Aristotle University of Thessaloniki (AUTh)
Authors
Charalampos Georgiadis
License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

This is Part 2/2 of the ActiveHuman dataset! Part 1 can be found here. Dataset Description ActiveHuman was generated using Unity's Perception package. It consists of 175428 RGB images and their semantic segmentation counterparts taken at different environments, lighting conditions, camera distances and angles. In total, the dataset contains images for 8 environments, 33 humans, 4 lighting conditions, 7 camera distances (1m-4m) and 36 camera angles (0-360 at 10-degree intervals). The dataset does not include images at every single combination of available camera distances and angles, since for some values the camera would collide with another object or go outside the confines of an environment. As a result, some combinations of camera distances and angles do not exist in the dataset. Alongside each image, 2D Bounding Box, 3D Bounding Box and Keypoint ground truth annotations are also generated via the use of Labelers and are stored as a JSON-based dataset. These Labelers are scripts that are responsible for capturing ground truth annotations for each captured image or frame. Keypoint annotations follow the COCO format defined by the COCO keypoint annotation template offered in the perception package.

Folder configuration The dataset consists of 3 folders:

JSON Data: Contains all the generated JSON files. RGB Images: Contains the generated RGB images. Semantic Segmentation Images: Contains the generated semantic segmentation images.

Essential Terminology

Annotation: Recorded data describing a single capture. Capture: One completed rendering process of a Unity sensor which stored the rendered result to data files (e.g. PNG, JPG, etc.). Ego: Object or person on which a collection of sensors is attached to (e.g., if a drone has a camera attached to it, the drone would be the ego and the camera would be the sensor). Ego coordinate system: Coordinates with respect to the ego. Global coordinate system: Coordinates with respect to the global origin in Unity. Sensor: Device that captures the dataset (in this instance the sensor is a camera). Sensor coordinate system: Coordinates with respect to the sensor. Sequence: Time-ordered series of captures. This is very useful for video capture where the time-order relationship of two captures is vital. UIID: Universal Unique Identifier. It is a unique hexadecimal identifier that can represent an individual instance of a capture, ego, sensor, annotation, labeled object or keypoint, or keypoint template.

Dataset Data The dataset includes 4 types of JSON annotation files files:

annotation_definitions.json: Contains annotation definitions for all of the active Labelers of the simulation stored in an array. Each entry consists of a collection of key-value pairs which describe a particular type of annotation and contain information about that specific annotation describing how its data should be mapped back to labels or objects in the scene. Each entry contains the following key-value pairs:

• id: Integer identifier of the annotation's definition.
• name: Annotation name (e.g., keypoints, bounding box, bounding box 3D, semantic segmentation).
• description: Description of the annotation's specifications.
• format: Format of the file containing the annotation specifications (e.g., json, PNG).
• spec: Format-specific specifications for the annotation values generated by each Labeler.

Most Labelers generate different annotation specifications in the spec key-value pair:

BoundingBox2DLabeler/BoundingBox3DLabeler:

• label_id: Integer identifier of a label.
• label_name: String identifier of a label.

KeypointLabeler:

• template_id: Keypoint template UUID.
• template_name: Name of the keypoint template.
• key_points: Array containing all the joints defined by the keypoint template. This array includes the key-value pairs:
  • label: Joint label.
  • index: Joint index.
  • color: RGBA values of the keypoint.
  • color_code: Hex color code of the keypoint.
• skeleton: Array containing all the skeleton connections defined by the keypoint template. Each skeleton connection defines a connection between two different joints. This array includes the key-value pairs:
  • label1: Label of the first joint.
  • label2: Label of the second joint.
  • joint1: Index of the first joint.
  • joint2: Index of the second joint.
  • color: RGBA values of the connection.
  • color_code: Hex color code of the connection.

SemanticSegmentationLabeler:

• label_name: String identifier of a label.
• pixel_value: RGBA values of the label.
• color_code: Hex color code of the label.
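
As a quick orientation aid, the sketch below lists the active Labelers recorded in annotation_definitions.json. The folder path and the top-level key are assumptions based on the description above and the Unity Perception dataset format; adjust them to wherever the JSON Data folder was extracted.

```python
import json
from pathlib import Path

# Path is an assumption; point it at the extracted "JSON Data" folder.
definitions_path = Path("JSON Data") / "annotation_definitions.json"
with open(definitions_path) as f:
    data = json.load(f)

# The top-level key name follows the Unity Perception format; verify it against your copy.
for definition in data.get("annotation_definitions", []):
    print(definition["id"], definition["name"], definition["format"])
```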

captures_xyz.json: Each of these files contains an array of ground truth annotations generated by each active Labeler for each capture separately, as well as extra metadata that describe the state of each active sensor present in the scene. Each array entry contains the following key-value pairs:

• id: UUID of the capture.
• sequence_id: UUID of the sequence.
• step: Index of the capture within a sequence.
• timestamp: Timestamp (in ms) since the beginning of a sequence.
• sensor: Properties of the sensor. This entry contains a collection with the following key-value pairs:
  • sensor_id: Sensor UUID.
  • ego_id: Ego UUID.
  • modality: Modality of the sensor (e.g., camera, radar).
  • translation: 3D vector that describes the sensor's position (in meters) with respect to the global coordinate system.
  • rotation: Quaternion variable that describes the sensor's orientation with respect to the ego coordinate system.
  • camera_intrinsic: Matrix containing (if it exists) the camera's intrinsic calibration.
  • projection: Projection type used by the camera (e.g., orthographic, perspective).
• ego: Attributes of the ego. This entry contains a collection with the following key-value pairs:
  • ego_id: Ego UUID.
  • translation: 3D vector that describes the ego's position (in meters) with respect to the global coordinate system.
  • rotation: Quaternion variable containing the ego's orientation.
  • velocity: 3D vector containing the ego's velocity (in meters per second).
  • acceleration: 3D vector containing the ego's acceleration (in meters per second squared).
• format: Format of the file captured by the sensor (e.g., PNG, JPG).
• annotations: Key-value pair collections, one for each active Labeler. These key-value pairs are as follows:
  • id: Annotation UUID.
  • annotation_definition: Integer identifier of the annotation's definition.
  • filename: Name of the file generated by the Labeler. This entry is only present for Labelers that generate an image.
  • values: List of key-value pairs containing annotation data for the current Labeler.

Each Labeler generates different annotation specifications in the values key-value pair:

BoundingBox2DLabeler:

• label_id: Integer identifier of a label.
• label_name: String identifier of a label.
• instance_id: UUID of one instance of an object. Each object with the same label that is visible in the same capture has a different instance_id value.
• x: Position of the 2D bounding box on the X axis.
• y: Position of the 2D bounding box on the Y axis.
• width: Width of the 2D bounding box.
• height: Height of the 2D bounding box.

BoundingBox3DLabeler:

• label_id: Integer identifier of a label.
• label_name: String identifier of a label.
• instance_id: UUID of one instance of an object. Each object with the same label that is visible in the same capture has a different instance_id value.
• translation: 3D vector containing the location of the center of the 3D bounding box with respect to the sensor coordinate system (in meters).
• size: 3D vector containing the size of the 3D bounding box (in meters).
• rotation: Quaternion variable containing the orientation of the 3D bounding box.
• velocity: 3D vector containing the velocity of the 3D bounding box (in meters per second).
• acceleration: 3D vector containing the acceleration of the 3D bounding box (in meters per second squared).

KeypointLabeler:

• label_id: Integer identifier of a label.
• instance_id: UUID of one instance of a joint. Keypoints with the same joint label that are visible in the same capture have different instance_id values.
• template_id: UUID of the keypoint template.
• pose: Pose label for that particular capture.
• keypoints: Array containing the properties of each keypoint. Each keypoint that exists in the keypoint template file is one element of the array. Each entry contains:
  • index: Index of the keypoint in the keypoint template file.
  • x: Pixel coordinate of the keypoint on the X axis.
  • y: Pixel coordinate of the keypoint on the Y axis.
  • state: State of the keypoint.

The SemanticSegmentationLabeler does not contain a values list.
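
To illustrate how the captures files are typically consumed, the sketch below walks every captures_*.json file and collects the 2D bounding box values described above. The file locations and the top-level "captures" key are assumptions based on the Unity Perception layout; adjust them to match your extracted copy of the dataset.

```python
import glob
import json

# Paths and the top-level "captures" key are assumptions based on the Unity
# Perception layout; adjust them to match your extracted copy of the dataset.
boxes = []
for path in sorted(glob.glob("JSON Data/captures_*.json")):
    with open(path) as f:
        data = json.load(f)
    for capture in data.get("captures", []):
        for annotation in capture.get("annotations", []):
            for value in annotation.get("values", []) or []:
                # 2D bounding box entries carry x, y, width and height keys.
                if {"x", "y", "width", "height"} <= value.keys():
                    boxes.append((capture["id"], value["label_name"],
                                  value["x"], value["y"],
                                  value["width"], value["height"]))

print(f"collected {len(boxes)} 2D bounding boxes")
```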

egos.json: Contains collections of key-value pairs for each ego. These include:

• id: UUID of the ego.
• description: Description of the ego.

sensors.json: Contains collections of key-value pairs for all sensors of the simulation. These include:

• id: UUID of the sensor.
• ego_id: UUID of the ego on which the sensor is attached.
• modality: Modality of the sensor (e.g., camera, radar, sonar).
• description: Description of the sensor (e.g., camera, radar).

Image names The RGB and semantic segmentation images share the same image naming convention. However, the semantic segmentation images also contain the string Semantic_ at the beginning of their filenames. Each RGB image is named "e_h_l_d_r.jpg", where:

e denotes the id of the environment. h denotes the id of the person. l denotes the id of the lighting condition. d denotes the camera distance at which the image was captured. r denotes the camera angle at which the image was captured.
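
A small, hypothetical helper like the one below can unpack this naming convention; the example filename is made up purely for illustration.

```python
from pathlib import Path

def parse_image_name(path: str) -> dict:
    """Unpack the e_h_l_d_r naming convention described above."""
    stem = Path(path).stem
    if stem.startswith("Semantic_"):            # segmentation files share the stem
        stem = stem[len("Semantic_"):]
    e, h, l, d, r = stem.split("_")
    return {"environment": e, "human": h, "lighting": l,
            "camera_distance": d, "camera_angle": r}

# Hypothetical filename, used only to illustrate the convention.
print(parse_image_name("Semantic_3_12_2_2.5_140.png"))
```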
