Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data collection contains test Word Usage Graphs (WUGs) for English. Find a description of the data format, code to process the data and further datasets on the WUGsite.
The data is provided for testing purposes and thus contains specific data cases, which are sometimes artificially created, sometimes picked from existing data sets. The data contains the following cases:
Please find more information in the paper referenced below.
Version: 1.0.0, 05.05.2023.
Reference
Dominik Schlechtweg. 2023. Human and Computational Measurement of Lexical Semantic Change. PhD thesis. University of Stuttgart.
Attribution-NoDerivs 4.0 (CC BY-ND 4.0): https://creativecommons.org/licenses/by-nd/4.0/
License information was derived automatically
This data collection contains diachronic Word Usage Graphs (WUGs) for Spanish. Find a description of the data format, code to process the data and further datasets on the WUGsite.
Please find more information on the provided data in the paper referenced below.
The annotation was funded by
Version: 1.0.0, 7.3.2022. Development data.
Reference
Frank D. Zamora-Reina, Felipe Bravo-Marquez, Dominik Schlechtweg. 2022. LSCDiscovery: A shared task on semantic change discovery and detection in Spanish.
Abstract Meaning Representation (AMR) Annotation Release 3.0 was developed by the Linguistic Data Consortium (LDC), SDL/Language Weaver, Inc., the University of Colorado's Computational Language and Educational Research group and the Information Sciences Institute at the University of Southern California. It contains a sembank (semantic treebank) of over 59,255 English natural language sentences from broadcast conversations, newswire, weblogs, web discussion forums, fiction and web text. This release adds new data to, and updates material contained in, Abstract Meaning Representation 2.0 (LDC2017T10), specifically: more annotations on new and prior data, new or improved PropBank-style frames, enhanced quality control, and multi-sentence annotations. AMR captures "who is doing what to whom" in a sentence. Each sentence is paired with a graph that represents its whole-sentence meaning in a tree structure. AMR utilizes PropBank frames, non-core semantic roles, within-sentence coreference, named entity annotation, modality, negation, questions, quantities, and so on to represent the semantic structure of a sentence largely independently of its syntax. LDC also released Abstract Meaning Representation (AMR) Annotation Release 1.0 (LDC2014T12) and Abstract Meaning Representation (AMR) Annotation Release 2.0 (LDC2017T10).

Data
The source data includes discussion forums collected for the DARPA BOLT and DEFT programs, transcripts and English translations of Mandarin Chinese broadcast news programming from China Central TV, Wall Street Journal text, translated Xinhua news texts, various newswire data from NIST OpenMT evaluations and weblog data used in the DARPA GALE program. New source data in AMR 3.0 includes sentences from Aesop's Fables, parallel text and the situation frame data set developed by LDC for the DARPA LORELEI program, and lead sentences from Wikipedia articles about named entities.
The following table summarizes the number of training, dev, and test AMRs for each dataset in the release. Totals are also provided by partition and dataset:

Dataset                  Training  Dev   Test  Totals
BOLT DF MT               1061      133   133   1327
Broadcast conversation   214       0     0     214
Weblog and WSJ           0         100   100   200
BOLT DF English          7379      210   229   7818
DEFT DF English          32915     0     0     32915
Aesop fables             49        0     0     49
Guidelines AMRs          970       0     0     970
LORELEI                  4441      354   527   5322
2009 Open MT             204       0     0     204
Proxy reports            6603      826   823   8252
Weblog                   866       0     0     866
Wikipedia                192       0     0     192
Xinhua MT                741       99    86    926
Totals                   55635     1722  1898  59255

Data in the "split" directory contains 59,255 AMRs split roughly 93.9%/2.9%/3.2% into training/dev/test partitions, with most smaller datasets assigned to one of the splits as a whole. Note that splits observe document boundaries. The "unsplit" directory contains the same 59,255 AMRs with no train/dev/test partition.
Unfortunately, no README file was found for the datano extension, limiting the ability to provide a detailed and comprehensive description. The following description is therefore based on the extension name and general assumptions about data annotation tools within the CKAN ecosystem.

The datano extension for CKAN, presumably short for "data annotation," likely aims to enhance datasets with annotations, metadata enrichment, and quality control features directly within the CKAN environment. It potentially introduces functionalities for adding textual descriptions, classifications, or other forms of annotation to datasets to improve their discoverability, usability, and overall value. This extension could provide an interface for users to collaboratively annotate data, thereby enriching dataset descriptions and making the data more useful for various purposes.

Key Features (Assumed):
* Dataset Annotation Interface: Provides a user-friendly interface within CKAN for adding structured or unstructured annotations to datasets and associated resources. This allows for a richer understanding of the data's content, purpose, and usage.
* Collaborative Annotation: Supports multiple users collaboratively annotating datasets, fostering knowledge sharing and collective understanding of the data.
* Annotation Versioning: Maintains a history of annotations, enabling users to track changes and revert to previous versions if necessary.
* Annotation Search: Allows users to search for datasets based on annotations, enabling quick discovery of relevant data based on specific criteria.
* Metadata Enrichment: Integrates annotations with existing metadata, enhancing metadata schemas to support more detailed descriptions and contextual information.
* Quality Control Features: Includes options to rate, validate, or flag annotations to ensure they are accurate and relevant, improving overall data quality.

Use Cases (Assumed):
1. Data Discovery Improvement: Enables users to find specific datasets more easily by searching for datasets based on their annotations and enriched metadata.
2. Data Quality Enhancement: Allows data curators to improve the quality of datasets by adding annotations that clarify the data's meaning, provenance, and limitations.
3. Collaborative Data Projects: Facilitates collaborative data annotation efforts, wherein multiple users contribute to the enrichment of datasets with their knowledge and insights.

Technical Integration (Assumed): The datano extension would likely integrate with CKAN's existing plugin framework, adding new UI elements for annotation management and search. It could leverage CKAN's API for programmatic access to annotations and utilize CKAN's security model for managing access permissions.

Benefits & Impact (Assumed): By implementing the datano extension, CKAN users can improve data discoverability, quality, and collaborative potential. The enhancement can help data curators refine the understanding and management of data, making it easier to search and understand data and to promote data-driven decision-making.
Abstract Meaning Representation (AMR) Annotation Release 2.0 was developed by the Linguistic Data Consortium (LDC), SDL/Language Weaver, Inc., the University of Colorado's Computational Language and Educational Research group and the Information Sciences Institute at the University of Southern California. It contains a sembank (semantic treebank) of over 39,260 English natural language sentences from broadcast conversations, newswire, weblogs and web discussion forums. AMR captures "who is doing what to whom" in a sentence. Each sentence is paired with a graph that represents its whole-sentence meaning in a tree structure. AMR utilizes PropBank frames, non-core semantic roles, within-sentence coreference, named entity annotation, modality, negation, questions, quantities, and so on to represent the semantic structure of a sentence largely independently of its syntax. LDC also released Abstract Meaning Representation (AMR) Annotation Release 1.0 (LDC2014T12).

Data
The source data includes discussion forums collected for the DARPA BOLT and DEFT programs, transcripts and English translations of Mandarin Chinese broadcast news programming from China Central TV, Wall Street Journal text, translated Xinhua news texts, various newswire data from NIST OpenMT evaluations and weblog data used in the DARPA GALE program. The following table summarizes the number of training, dev, and test AMRs for each dataset in the release. Totals are also provided by partition and dataset:

Dataset                  Training  Dev   Test  Totals
BOLT DF MT               1061      133   133   1327
Broadcast conversation   214       0     0     214
Weblog and WSJ           0         100   100   200
BOLT DF English          6455      210   229   6894
DEFT DF English          19558     0     0     19558
Guidelines AMRs          819       0     0     819
2009 Open MT             204       0     0     204
Proxy reports            6603      826   823   8252
Weblog                   866       0     0     866
Xinhua MT                741       99    86    926
Totals                   36521     1368  1371  39260

For those interested in utilizing a standard/community partition for AMR research (for instance in development of semantic parsers), data in the "split" directory contains 39,260 AMRs split roughly 93%/3.5%/3.5% into training/dev/test partitions, with most smaller datasets assigned to one of the splits as a whole. Note that splits observe document boundaries. The "unsplit" directory contains the same 39,260 AMRs with no train/dev/test partition.
Attribution-NoDerivs 4.0 (CC BY-ND 4.0): https://creativecommons.org/licenses/by-nd/4.0/
License information was derived automatically
This data collection contains diachronic semantic relatedness judgments for German word usage pairs. Find a description of the data format, code to process the data and further datasets on the WUGsite.
We provide additional data under misc/:

testset: a semantic change test set with 22 German lexemes divided into two classes: (i) lexemes for which the authors found innovative or (ii) reductive meaning change occurring in Deutsches Textarchiv (DTA) in the 19th century. Note that for some lexemes the change is already observable slightly before 1800 and some lexemes occur more than once in the test set (see paper). The columns 'earlier' and 'later' contain the mean of all judgments for the respective word. The columns 'delta_later' and 'compare' contain the predictions of the annotation-based measures of semantic change developed in the paper.
Please find more information on the provided data in the paper referenced below.
Version: 2.0.0, 30.9.2021.
Reference
Dominik Schlechtweg, Sabine Schulte im Walde, Stefanie Eckmann. 2018. Diachronic Usage Relatedness (DURel): A Framework for the Annotation of Lexical Semantic Change. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT). New Orleans, Louisiana USA.
This data collection contains diachronic Word Usage Graphs (WUGs) for Swedish. Find a description of the data format, code to process the data and further datasets on the WUGsite: https://www.ims.uni-stuttgart.de/data/wugs
We provide additional data under misc/:
semeval: a larger list of words and (noisy) change scores assembled in the pre-annotation phase for SemEval-2020 Task 1.
Please find more information on the provided data in the paper referenced below.
Reference
Dominik Schlechtweg, Nina Tahmasebi, Simon Hengchen, Haim Dubossarsky, Barbara McGillivray. 2021. DWUG: A large Resource of Diachronic Word Usage Graphs in Four Languages. https://arxiv.org/abs/2104.08540
Abstract Meaning Representation (AMR) Annotation Release 2.0 was developed by the Linguistic Data Consortium (LDC), SDL/Language Weaver, Inc., the University of Colorado's Computational Language and Educational Research group and the Information Sciences Institute at the University of Southern California. It contains a sembank (semantic treebank) of over 39,260 English natural language sentences from broadcast conversations, newswire, weblogs and web discussion forums.
AMR captures “who is doing what to whom” in a sentence. Each sentence is paired with a graph that represents its whole-sentence meaning in a tree-structure. AMR utilizes PropBank frames, non-core semantic roles, within-sentence coreference, named entity annotation, modality, negation, questions, quantities, and so on to represent the semantic structure of a sentence largely independent of its syntax.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset consists of four years of technical language annotations from two paper machines in northern Sweden, structured as a Pandas dataframe. The same data is also available as a semicolon-separated .csv file. The data consists of two columns, where the first column corresponds to annotation note contents and the second column corresponds to annotation titles. The annotations are in Swedish, and processed so that all mentions of personal information are replaced with the string 'egennamn', meaning "personal name" in Swedish. Each row corresponds to one annotation with the corresponding title. The data can be accessed in Python with:

import pandas as pd
annotations_df = pd.read_pickle("Technical_Language_Annotations.pkl")
annotation_contents = annotations_df['noteComment']
annotation_titles = annotations_df['title']
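Since the collection is also distributed as a semicolon-separated .csv, it could presumably be loaded the same way with pandas. The sketch below reads illustrative sample rows from a string; the column names come from the pickle example above, but the sample rows themselves are invented:

```python
import io

import pandas as pd

# Illustrative sample mimicking the described two-column, semicolon-separated
# layout; the rows are invented, not taken from the real dataset.
sample = io.StringIO(
    "noteComment;title\n"
    "Byte av ventil, egennamn kontaktad;Ventilbyte\n"
    "Larm kvitterat av egennamn;Larmhantering\n"
)
annotations_df = pd.read_csv(sample, sep=";")
annotation_contents = annotations_df["noteComment"]
annotation_titles = annotations_df["title"]
```

For the real file, the `io.StringIO` object would be replaced with the .csv path.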
This notebook serves to showcase my problem solving ability, knowledge of the data analysis process, proficiency with Excel and its various tools and functions, as well as my strategic mindset and statistical prowess. This project consists of an auditing prompt provided by Hive Data, a raw Excel data set, a cleaned and audited version of the raw Excel data set, and a description of my thought process and the knowledge I used during completion of the project. The prompt can be found below:
The raw data that accompanies the prompt can be found below:
Hive Annotation Job Results - Raw Data
^ These are the tools I was given to complete my task. The rest of the work is entirely my own.
To summarize broadly, my task was to audit the dataset and summarize my process and results. Specifically, I was to create a method for identifying which "jobs" - explained in the prompt above - needed to be rerun based on a set of "background facts," or criteria. The description of my extensive thought process and results can be found below in the Content section.
Brendan Kelley April 23, 2021
Hive Data Audit Prompt Results
This paper explains the auditing process of the "Hive Annotation Job Results" data. It includes the preparation, analysis, visualization, and summary of the data. It is accompanied by the results of the audit in the Excel file "Hive Annotation Job Results – Audited".
Observation
The "Hive Annotation Job Results" data comes in the form of a single Excel sheet. It contains 7 columns and 5,001 rows, including column headers. The data includes "file", "object id", and the pseudonyms for five questions that each client was instructed to answer about their respective table: "tabular", "semantic", "definition list", "header row", and "header column". The "file" column includes non-unique numbers (that is, there are multiple instances of the same value in the column) separated by a dash. The "object id" column includes non-unique numbers ranging from 5 to 487539. The columns containing the answers to the five questions include Boolean values - TRUE or FALSE - which depend upon the yes/no worker judgement.
Use of the COUNTIF() function reveals that there are no values other than TRUE or FALSE in any of the five question columns. The VLOOKUP() function reveals that the data does not include any missing values in any of the cells.
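The same spreadsheet checks could be mirrored in pandas; a minimal sketch, assuming the column names from the observation above and a few invented rows:

```python
import pandas as pd

# Hypothetical reconstruction of the sheet's layout; the column names are
# taken from the description, the rows are invented for illustration.
df = pd.DataFrame({
    "file": ["1-1", "1-1", "2-7"],
    "object id": [5, 487539, 1200],
    "tabular": [True, True, False],
    "semantic": [True, False, False],
    "definition list": [False, False, False],
    "header row": [True, False, False],
    "header column": [False, False, False],
})

question_cols = ["tabular", "semantic", "definition list",
                 "header row", "header column"]

# Equivalent of the COUNTIF() check: every answer is strictly TRUE or FALSE.
only_booleans = df[question_cols].isin([True, False]).all().all()

# Equivalent of the missing-value check: no empty cells anywhere.
no_missing = df.notna().all().all()
```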
Assumptions
Based on the clean state of the data and the guidelines of the Hive Data Audit Prompt, the assumption is that duplicate values in the “file” column are acceptable and should not be removed. Similarly, duplicated values in the “object id” column are acceptable and should not be removed. The data is therefore clean and is ready for analysis/auditing.
Preparation
The purpose of the audit is to analyze the accuracy of the yes/no worker judgement of each question according to the guidelines of the background facts. The background facts are as follows:
• A table that is a definition list should automatically be tabular and also semantic
• Semantic tables should automatically be tabular
• If a table is NOT tabular, then it is definitely not semantic nor a definition list
• A tabular table that has a header row OR header column should definitely be semantic
These background facts serve as instructions for how the answers to the five questions should interact with one another. These facts can be re-written to establish criteria for each question:
For the tabular column:
- If the table is a definition list, it is also tabular
- If the table is semantic, it is also tabular

For the semantic column:
- If the table is a definition list, it is also semantic
- If the table is not tabular, it is not semantic
- If the table is tabular and has either a header row or a header column...
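The background facts amount to boolean implications over the five answer columns, so rows that violate any rule can be flagged mechanically. A minimal pandas sketch, assuming the column names from the observation section and invented rows:

```python
import pandas as pd

# Invented example rows; only the column names come from the description.
df = pd.DataFrame({
    "tabular":         [True,  False, True],
    "semantic":        [True,  True,  False],
    "definition list": [False, True,  False],
    "header row":      [True,  False, True],
    "header column":   [False, False, False],
})

def implies(a, b):
    # Elementwise logical implication a -> b.
    return ~a | b

# The "not tabular => not semantic, not definition list" fact is the
# contrapositive of the first two rules, so three checks suffice.
consistent = (
    implies(df["definition list"], df["tabular"] & df["semantic"])
    & implies(df["semantic"], df["tabular"])
    & implies(df["tabular"] & (df["header row"] | df["header column"]),
              df["semantic"])
)

# Rows violating any background fact would need to be rerun.
needs_rerun = df[~consistent]
```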
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Abstract: The Semantic Web aims to optimize document retrieval by enriching documents with semantic information, allowing both people and machines to understand the meaning of a piece of information. Semantic annotation of entities is the path to bringing semantics into documents. The objective of this paper is to outline the Semantic Web concepts that allow entities in the Lattes Curriculum to be annotated automatically based on Linked Open Data (LOD), which stores the meaning of terms and expressions. The problem addressed in this research is determining which Semantic Web concepts can contribute to the automatic semantic annotation of entities in the Lattes Curriculum using Linked Open Data. The literature review presents the concepts, tools and technologies related to the theme. The application of these concepts allowed the creation of the Semantic Web Lattes System. An empirical study was conducted with the objective of identifying the most effective entity extraction tool. The system imports XML curricula from the Lattes Platform, automatically annotates the available data using open databases, and allows semantic queries to be run.
Attribution-NoDerivs 4.0 (CC BY-ND 4.0): https://creativecommons.org/licenses/by-nd/4.0/
License information was derived automatically
This data collection contains diachronic Word Usage Graphs (WUGs) for English. Find a description of the data format, code to process the data and further datasets on the WUGsite.
See previous versions for additional testsets.
Please find more information on the provided data in the paper referenced below.
Version: 2.0.0, 15.12.2021. Important: extends previous versions with one more annotation round and new clusterings.
Reference
Dominik Schlechtweg, Nina Tahmasebi, Simon Hengchen, Haim Dubossarsky, Barbara McGillivray. 2021. DWUG: A large Resource of Diachronic Word Usage Graphs in Four Languages.
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
The data published here are supplementary to a paper to be published in Metaphor and the Social World (under revision).
Two debates organised and published by TVP and TVN were transcribed and annotated with the Metaphor Identification Procedure. We used eMargin software (a collaborative textual annotation tool; Kehoe and Gee 2013) and a slightly modified version of MIP (Pragglejaz 2007). Each lexical unit in the transcript was labelled as a metaphor related word (MRW) if its "contextual meaning was related to the more basic meaning by some form of similarity" (Steen 2007). The meanings were established with the Wielki Słownik Języka Polskiego (Great Dictionary of Polish, ed. Żmigrodzki 2019). In addition to MRWs, lexemes which create a metaphorical expression together with an MRW were tagged as metaphorical expression words (MEW). At least two words are needed to identify an actual metaphorical expression, since an MRW cannot appear without an MEW. The grammatical construction of a metaphor (Sullivan 2009) is asymmetrical: one word is conceptually autonomous and the other is conceptually dependent on the first. In construction grammar terms (Langacker 2008), the metaphor related word is elaborated by the metaphorical expression word, because the basic meaning of the MRW is elaborated and extended to a more figurative meaning only if it is used jointly with the MEW. Moreover, the meaning of the MEW is rather basic and concrete, as it remains unchanged in connection with the MRW. This can be clearly seen in an expression often used in our data: "Służba zdrowia jest w zapaści" ("Health service suffers from a collapse."), where the word "zapaść" ("collapse") is an example of an MRW and the words "służba zdrowia" ("health service") are labelled as MEW. The English translation of this expression needs a different verb: instead of "jest w zapaści" ("is in collapse") the unmarked English collocation is "suffers from a collapse", therefore the words "suffers from a collapse" are labelled as MRW.
The “collapse” could be caused by heart failure, such as cardiac arrest or any other life-threatening medical condition and “health service” is portrayed as if it could literally suffer from such a condition – a collapse.
The data are in csv tables exported from xml files downloaded from the eMargin site. Prior to annotation, the transcripts were divided into 40 parts, one for each annotator. MRW words are marked as MLN, MEW words as MLP, and functional words within a metaphorical expression as MLI; all other words are marked as 'noana', meaning no annotation needed.
https://choosealicense.com/licenses/odbl/
Date: 2022-07-10. Files: ner_dataset.csv. Source: Kaggle entity annotated corpus. Notes: The dataset only contains the tokens and NER tag labels. Labels are uppercase.
About Dataset
from Kaggle Datasets
Context
Annotated Corpus for Named Entity Recognition using the GMB (Groningen Meaning Bank) corpus for entity classification, with enhanced and popular Natural Language Processing features applied to the data set. Tip: Use a Pandas DataFrame to load the dataset if using Python for… See the full description on the dataset page: https://huggingface.co/datasets/rjac/kaggle-entity-annotated-corpus-ner-dataset.
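Following the tip above, the token/tag file could be loaded into a Pandas DataFrame. The sketch below uses an invented in-memory sample, since the real ner_dataset.csv on Kaggle may differ in column names and encoding:

```python
import io

import pandas as pd

# Illustrative stand-in for ner_dataset.csv; the column names "token" and
# "tag" and the rows are assumptions, not taken from the real file.
sample = io.StringIO(
    "token,tag\n"
    "John,B-PER\n"
    "lives,O\n"
    "in,O\n"
    "London,B-GEO\n"
)
ner_df = pd.read_csv(sample)

# Per the notes above, the tag labels are uppercase.
tag_counts = ner_df["tag"].value_counts()
```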
Attribution-NoDerivs 4.0 (CC BY-ND 4.0): https://creativecommons.org/licenses/by-nd/4.0/
License information was derived automatically
-------------------------------------
Siehe unten für die deutsche Version.
-------------------------------------
Diachronic Usage Relatedness (DURel) - Test Set and Annotation Data
This data collection supplementing the paper referenced below contains:
- a semantic change test set with 22 German lexemes divided into two classes: (i) lexemes for which the authors found innovative or (ii) reductive meaning change occurring in Deutsches Textarchiv (DTA) in the 19th century. (Note that for some lexemes the change is already observable slightly before 1800 and some lexemes occur more than once in the test set (see paper).) It comes as a tab-separated csv file where each line has the form
lemma POS type description earlier later delta_later compare frequency_1750-1800/1850-1900 source
The columns 'earlier' and 'later' contain the mean of all judgments for the respective word. The columns 'delta_later' and 'compare' contain the predictions of the annotation-based measures of semantic change developed in the paper;
- the full annotation table as annotators received it and a results table with rows in the same order. The result table comes in the form of a tab-separated csv file where each line has the form
lemma date1 date2 group annotator1 annotator2 annotator3 annotator4 annotator5 mean comments1 comments2 comments3 comments4 comments5
The columns 'date1' and 'date2' contain the date of the first and second use in the row. 'mean' contains the mean of all judgments for the use pair in this row without 0-judgments;
- the annotation guidelines in English and German;
- data visualization plots.
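The tab-separated results table described above can be read directly with pandas. The row below is invented for illustration; only the column layout comes from the description, and the 'mean' value follows the stated rule that 0-judgments are excluded:

```python
import io

import pandas as pd

# Column layout taken from the description of the results table.
columns = (
    ["lemma", "date1", "date2", "group"]
    + [f"annotator{i}" for i in range(1, 6)]
    + ["mean"]
    + [f"comments{i}" for i in range(1, 6)]
)

# Invented example row: judgments 4, 3, 0, 4, 2; the 0-judgment is
# excluded, so mean = (4 + 3 + 4 + 2) / 4 = 3.25.
row = "Donnerwetter\t1820\t1870\tEARLIER\t4\t3\t0\t4\t2\t3.25\t\t\t\t\t"
table = pd.read_csv(io.StringIO(row), sep="\t", names=columns)
```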
Find more information in
Dominik Schlechtweg, Sabine Schulte im Walde, Stefanie Eckmann. 2018. Diachronic Usage Relatedness (DURel): A Framework for the Annotation of Lexical Semantic Change. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT). New Orleans, Louisiana USA 2018.
The resources are freely available for education, research and other non-commercial purposes. More information can be requested via email to the authors.
-------
Deutsch
-------
Diachroner Wortverwendungsbezug (DURel) - Test Set und Annotationsdaten
Diese Datensammlung ergänzt den unten zitierten Artikel und enthält folgende Dateien:
- ein Test set für semantischen Wandel mit 22 deutschen Lexemen, die in zwei Klassen fallen: (i) Lexeme, für die die Autoren innovativen oder (ii) reduktiven Bedeutungswandel im Deutschen Textarchiv (DTA) für das 19. Jahrhundert festgestellt haben. (Für einige Lexeme ist der Wandel schon etwas vor 1800 zu beobachten und manche Lexeme kommen mehr als einmal im Test set vor (siehe Artikel).) Hierbei handelt es sich um eine tab-separierte CSV-Datei, in der jede Zeile folgende Form hat:
Lexem Wortart Klasse Beschreibung earlier later delta_later compare Frequenz_1750-1800/1850-1900 Quelle
Die Spalten 'earlier' und 'later' enthalten den Mittelwert der Bewertungen für das jeweilige Wort. Die Spalten 'delta_later' und 'compare' enthalten die Vorhersagen der annotationsbasierten Maße für semantischen Wandel, die im Artikel entwickelt werden;
- Die Annotationstabelle, wie sie die Annotatoren erhalten haben, und eine Ergebnistabelle mit Zeilen in derselben Reihenfolge. Die Ergebnistabelle ist eine tab-separierte CSV-Datei, in der jede Zeile folgende Form hat:
Lexem Datum1 Datum2 Gruppe Annotator1 Annotator2 Annotator3 Annotator4 Annotator5 Mittelwert Kommentar1 Kommentar2 Kommentar3 Kommentar4 Kommentar5
Die Spalten 'Datum1' und 'Datum2' enthalten das Datum der ersten bzw. der zweiten Wortverwendung in der Zeile. 'Mittelwert' enthält den Mittelwert aller Bewertungen für das Verwendungspaar dieser Zeile ohne 0-Bewertungen;
- die Annotationsrichtlinien auf Deutsch und Englisch;
- Visualisierungsplots der Daten.
Mehr Informationen finden Sie in
Dominik Schlechtweg, Sabine Schulte im Walde, Stefanie Eckmann. 2018. Diachronic Usage Relatedness (DURel): A Framework for the Annotation of Lexical Semantic Change. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT). New Orleans, Louisiana USA 2018.
Die Ressourcen sind frei verfügbar für Lehre, Forschung sowie andere nicht-kommerzielle Zwecke. Für weitere Informationen schreiben Sie bitte eine E-Mail an die Autoren.
http://www.opendefinition.org/licenses/cc-by-sa
Sentence-layer annotation represents the most coarse-grained annotation in this corpus. We adhere to definitions of objectivity and subjectivity introduced in (Wiebe et al., 2005). Additionally, we followed guidelines drawn from (Balahur & Steinberger, 2009). Their clarifications proved to be quite effective, raising inter-annotator agreement in a sentence-layer polarity annotation task from about 50% to >80%. All sentences were annotated in two dimensions.
The first dimension covers the factual nature of the sentence, i.e. whether it provides objective information or if it is intended to express an opinion, belief or subjective argument. Therefore, it is either objective or subjective. The second dimension covers the semantic orientation of the sentence, i.e. its polarity. Thus, it is either positive, negative or neutral.
In the second layer, we model the contextually interpreted sentiments on the levels of words and NP/PP phrases. That is, the annotation decisions are based on the meaning of the words in the context of the sentence.
Word sentiment markers: The sentiments on the level of individual words are expressed by single character markers added at the end of the words.
A word might be positive (+), negative (-), neutral (empty), a shifter (~), an intensifier (^), or a diminisher (%).
If a word ends with a hyphen (e.g., "auf beziehungs-_ bzw. partnerschaftliche Probleme-"), an underscore is added to the word in order to prevent misinterpretation of the hyphen as a negative marker.
Currently, only words that are part of an NP/PP are marked with sentiment markers. Annotated words are nouns, adjectives, negation particles, prepositions, adverbs.
The word-level annotation was done by three annotators individually. The individual results were harmonized into a single reference annotation.
Phrase level markers:
Each phrase is marked up textually by brackets, e.g. "[auf beziehungs-_ bzw. partnerschaftliche Probleme-]". The type of a phrase (NP/PP) is not written to the brackets. We follow largely the annotation model of TIGER for structuring embedded NPs and PPs.
Currently, the following limitations with regard to TIGER exist: (1) Adjectival phrases are not marked up. (2) Relative or infinitival clauses are not included in NPs/PPs if they appear at the end of a phrase or if they are discontiguous. We annotate not only the phrases which immediately contain words that are marked up as polar. Any dependent subphrase (NP/PP) is integrated into all of its dominating NPs/PPs, e.g. "[Die tieferen Ursachen [der Faszination+]]". Dependent subphrases without any polar words are also included; however, there is no internal bracketing for them, e.g. "[hohe+ Ansprüche an Qualität und Lage]".
At the level of phrases, we distinguish the following markers: positive (+), negative (-), neutral (0), bipolar (#). The category 'bipolar' is used mainly for coordinations where negative and positive sentiments of something are kept in balance by the writer. This is quite common for a lot of binomial constructions such as "Krieg und Frieden".
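The word-level marker scheme above can be read off mechanically: a trailing marker character carries the sentiment, and a trailing underscore guards a genuine word-final hyphen. A minimal sketch (the helper function and its name are illustrative, not part of the corpus tooling):

```python
# Marker inventory taken from the description above.
MARKERS = {"+": "positive", "-": "negative", "~": "shifter",
           "^": "intensifier", "%": "diminisher"}

def parse_word(token):
    # An underscore guards a genuine word-final hyphen, so "beziehungs-_"
    # is an unmarked hyphenated word, while "Probleme-" is negative.
    if token.endswith("_"):
        return token[:-1], "neutral"
    if token and token[-1] in MARKERS:
        return token[:-1], MARKERS[token[-1]]
    return token, "neutral"

# Example phrase from the corpus description.
phrase = "[auf beziehungs-_ bzw. partnerschaftliche Probleme-]"
tokens = phrase.strip("[]").split()
parsed = [parse_word(t) for t in tokens]
```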
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In the absence of detailed functional annotation for any livestock genome, we used comparative genomics to predict ovine regulatory elements using human data. Reciprocal liftOver was used to predict the ovine genome location of ENCODE promoters and enhancers, along with 12 chromatin states built using 127 diverse epigenomes. Here we make available the following files: a) Sheep_epigenome_predicted_features.tar.gz: contains the final reciprocal best alignments of the ENCODE proximal features as well as the chromHMM ROADMAP features, i.e. the result of reciprocal liftOver. b) liftOver_sheep_temporary_files.tar.gz: a new tar file with liftOver temporary files containing i) liftOver temporary files mapping human to sheep, ii) liftOver temporary files mapping sheep back to human and iii) dictionary files containing the link between human and sheep coordinates for the exact best-reciprocal files.
Lineage: Building a comparative sheep functional annotation. Our approach exploited the wealth of functional annotation data generated by the Epigenome Roadmap and ENCODE studies. We performed reciprocal liftOver (minMatch=0.1), meaning elements that mapped to sheep also needed to map in the reverse direction back to human with high quality. This bi-directional comparative mapping approach was applied to 12 chromatin states defined using 5 core histone modification marks, H3K4me3, H3K4me1, H3K36me3, H3K9me3, H3K27me3. Mapping success is given in Supplementary Table 9. The same approach was applied to ENCODE marks derived from 94 cell types (https://www.encodeproject.org/data/annotations/v2/) with DNase-seq and TF ChIP-seq.
Attribution-NoDerivs 4.0 (CC BY-ND 4.0): https://creativecommons.org/licenses/by-nd/4.0/
License information was derived automatically
This data collection contains diachronic Word Usage Graphs (WUGs) for German created with reference use sampling. Find a description of the data format, code to process the data and further datasets on the WUGsite.
Please find more information on the provided data in the paper referenced below.
Version: 1.0.0, 30.9.2021.
Reference
Dominik Schlechtweg and Sabine Schulte im Walde. submitted. Clustering Word Usage Graphs: A Flexible Framework to Measure Changes in Contextual Word Meaning.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The SloWIC dataset is a Slovenian dataset for the Word in Context task. Each example in the dataset contains a target word with multiple meanings and two sentences that both contain the target word. Each example is also annotated with a label that indicates whether both sentences use the same meaning of the target word. The dataset contains 1,808 manually annotated sentence pairs and an additional 13,150 automatically annotated pairs to help with training larger models. The dataset is stored in JSON format, following the format used in the SuperGLUE version of the Word in Context task (https://super.gluebenchmark.com/).
Each example contains the following data fields:
- word: The target word with multiple meanings
- sentence1: The first sentence containing the target word
- sentence2: The second sentence containing the target word
- idx: The index of the example in the dataset
- label: Label showing if the sentences contain the same meaning of the target word
- start1: Start of the target word in the first sentence
- start2: Start of the target word in the second sentence
- end1: End of the target word in the first sentence
- end2: End of the target word in the second sentence
- version: The version of the annotation
- manual_annotation: Boolean showing if the label was manually annotated
- group: The group of annotators that labelled the example
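Since the fields follow the SuperGLUE-style Word in Context layout, the start/end offsets recover the target word from each sentence. The record below is invented for illustration; its values are not taken from SloWIC:

```python
import json

# Invented example in the described field layout; values are illustrative.
record = json.loads("""{
  "word": "list",
  "sentence1": "Ladja je dobila list.",
  "sentence2": "Z drevesa je padel list.",
  "idx": 0, "label": false,
  "start1": 16, "end1": 20,
  "start2": 19, "end2": 23,
  "version": 1, "manual_annotation": true, "group": "A"
}""")

# The character offsets point at the target word in each sentence.
target1 = record["sentence1"][record["start1"]:record["end1"]]
target2 = record["sentence2"][record["start2"]:record["end2"]]
```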