Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This is the accompanying data for the paper "Analyzing Dataset Annotation Quality Management in the Wild". Data quality is crucial for training accurate, unbiased, and trustworthy machine learning models, and for their correct evaluation. Recent works, however, have shown that even popular datasets used to train and evaluate state-of-the-art models contain a non-negligible amount of erroneous annotations, bias, or annotation artifacts. Best practices and guidelines regarding annotation projects exist, but to the best of our knowledge, no large-scale analysis has yet been performed on how quality management is actually conducted when creating natural language datasets, or whether these recommendations are followed. Therefore, we first survey and summarize recommended quality management practices for dataset creation as described in the literature and provide suggestions on how to apply them. Then, we compile a corpus of 591 scientific publications introducing text datasets and annotate it for quality-related aspects, such as annotator management, agreement, adjudication, or data validation. Using these annotations, we then analyze how quality management is conducted in practice. We find that a majority of the annotated publications apply good or very good quality management. However, we deem the effort of 30% of the works only subpar. Our analysis also shows common errors, especially in using inter-annotator agreement and computing annotation error rates.
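As a pointer for readers less familiar with agreement statistics: raw percent agreement overstates reliability because it ignores agreement expected by chance, which is one of the common errors the paper discusses. A minimal sketch with hypothetical labels (not data from the paper):

```python
# Minimal sketch: chance-corrected inter-annotator agreement for two
# annotators over the same items (hypothetical labels, not from the paper).
from sklearn.metrics import cohen_kappa_score

annotator_a = ["POS", "NEG", "NEG", "POS", "NEU", "POS"]
annotator_b = ["POS", "NEG", "POS", "POS", "NEU", "NEG"]

# Raw percent agreement ignores chance agreement.
raw = sum(a == b for a, b in zip(annotator_a, annotator_b)) / len(annotator_a)

# Cohen's kappa corrects for agreement expected by chance.
kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"raw agreement: {raw:.2f}, Cohen's kappa: {kappa:.2f}")
```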
The Q-CAT (Querying-Supported Corpus Annotation Tool) is a tool for manual linguistic annotation of corpora, which also enables advanced queries on top of these annotations. The tool has been used in various annotation campaigns related to the ssj500k reference training corpus of Slovenian (http://hdl.handle.net/11356/1210), such as named entities, dependency syntax, semantic roles and multi-word expressions, but it can also be used for adding new annotation layers of various types to this or other language corpora. Q-CAT is a .NET application, which runs on the Windows operating system. Version 1.1 enables the automatic attribution of token IDs and personalized font adjustments. Version 1.2 supports the CONLL-U format and working with UD POS tags. Version 1.3 supports adding new layers of annotation on top of CONLL-U (and then saving the corpus as XML TEI). Version 1.4 introduces new features in command-line mode (filtering by sentence ID, multiple link-type visualizations).
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
The Q-CAT (Querying-Supported Corpus Annotation Tool) is a computational tool for manual annotation of language corpora, which also enables advanced queries on top of these annotations. The tool has been used in various annotation campaigns related to the ssj500k reference training corpus of Slovenian (http://hdl.handle.net/11356/1210), such as named entities, dependency syntax, semantic roles and multi-word expressions, but it can also be used for adding new annotation layers of various types to this or other language corpora. Q-CAT is a .NET application, which runs on the Windows operating system.
Version 1.1 enables the automatic attribution of token IDs and personalized font adjustments. Version 1.2 supports the CONLL-U format and working with UD POS tags. Version 1.3 supports adding new layers of annotation on top of CONLL-U (and then saving the corpus as XML TEI).
Leaves from genetically unique Juglans regia plants were scanned using X-ray micro-computed tomography (microCT) on the X-ray μCT beamline (8.3.2) at the Advanced Light Source (ALS), Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA, USA. Soil samples were collected in fall 2017 from the riparian oak forest located at the Russell Ranch Sustainable Agricultural Institute at the University of California, Davis. The soil was sieved through a 2 mm mesh and air dried before imaging. A single soil aggregate was scanned at 23 keV using the 10x objective lens with a pixel resolution of 650 nanometers on beamline 8.3.2 at the ALS. Additionally, a drought-stressed almond flower bud (Prunus dulcis) from a plant housed at the University of California, Davis, was scanned using a 4x lens with a pixel resolution of 1.72 µm on beamline 8.3.2 at the ALS. Raw tomographic image data were reconstructed using TomoPy. Reconstructions were converted to 8-bit TIF or PNG format using ImageJ or the PIL package in Python before further processing. Images were annotated using Intel's Computer Vision Annotation Tool (CVAT) and ImageJ; both tools are free to use and open source. Leaf images were annotated following Théroux-Rancourt et al. (2020). Specifically, hand labeling was done directly in ImageJ by drawing around each tissue, with 5 images annotated per leaf. Care was taken to cover a range of anatomical variation to help improve the generalizability of the models to other leaves. All slices were labeled by Dr. Mina Momayyezi and Fiona Duong. To annotate the flower bud and soil aggregate, images were imported into CVAT. The exterior border of the bud (i.e., bud scales) and flower were annotated in CVAT and exported as masks. Similarly, the exterior of the soil aggregate and particulate organic matter identified by eye were annotated in CVAT and exported as masks. To annotate air spaces in both the bud and soil aggregate, images were imported into ImageJ. A Gaussian blur was applied to each image to decrease noise, and the air space was then segmented using thresholding. After applying the threshold, the selected air-space region was converted to a binary image, with white representing the air space and black representing everything else. This binary image was overlaid on the original image, and the air space within the flower bud and aggregate was selected using the "free hand" tool. Air space outside the region of interest for both image sets was eliminated. The quality of the air-space annotation was then visually inspected for accuracy against the underlying original image; incomplete annotations were corrected using the brush or pencil tool to paint missing air space white and incorrectly identified air space black. Once the annotation was satisfactorily corrected, the binary image of the air space was saved. Finally, the annotations of the bud and flower, or aggregate and organic matter, were opened in ImageJ and the associated air-space mask was overlaid on top of them, forming a three-layer mask suitable for training the fully convolutional network. All labeling of the soil aggregate images was done by Dr. Devin Rippner. These images and annotations are for training deep learning models to identify different constituents in leaves, almond buds, and soil aggregates.
Limitations: For the walnut leaves, some tissues (stomata, etc.) are not labeled, and the images represent only a small portion of a full leaf. Similarly, the almond bud and the aggregate each represent a single sample. The bud tissues are only divided into bud scales, flower, and air space; many other tissues remain unlabeled. For the soil aggregate, labels were assigned by eye with no supporting chemical information, so particulate organic matter identification may be incorrect.
Resources in this dataset:
Resource Title: Annotated X-ray CT images and masks of a Forest Soil Aggregate. File Name: forest_soil_images_masks_for_testing_training.zip. Resource Description: This aggregate was collected from the riparian oak forest at the Russell Ranch Sustainable Agricultural Facility. The aggregate was scanned using X-ray micro-computed tomography (microCT) on the X-ray μCT beamline (8.3.2) at the Advanced Light Source (ALS), Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA, USA, using the 10x objective lens with a pixel resolution of 650 nanometers. For masks, the background has a value of 0,0,0; pore spaces have a value of 250,250,250; mineral solids have a value of 128,0,0; and particulate organic matter has a value of 0,128,0. These files were used for training a model to segment the forest soil aggregate and for testing the accuracy, precision, recall, and F1 score of the model.
Resource Title: Annotated X-ray CT images and masks of an Almond bud (P. dulcis). File Name: Almond_bud_tube_D_P6_training_testing_images_and_masks.zip. Resource Description: A drought-stressed almond flower bud (Prunus dulcis) from a plant housed at the University of California, Davis, was scanned by X-ray micro-computed tomography (microCT) on the X-ray μCT beamline (8.3.2) at the Advanced Light Source (ALS), Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA, USA, using the 4x lens with a pixel resolution of 1.72 µm. For masks, the background has a value of 0,0,0; air spaces have a value of 255,255,255; bud scales have a value of 128,0,0; and flower tissues have a value of 0,128,0. These files were used for training a model to segment the almond bud and for testing the accuracy, precision, recall, and F1 score of the model. Resource Software Recommended: Fiji (ImageJ), url: https://imagej.net/software/fiji/downloads
Resource Title: Annotated X-ray CT images and masks of Walnut leaves (J. regia). File Name: 6_leaf_training_testing_images_and_masks_for_paper.zip. Resource Description: Stems were collected from genetically unique J. regia accessions at the USDA-ARS-NCGR in Wolfskill Experimental Orchard, Winters, California, USA to use as scion, and were grafted by Sierra Gold Nursery onto a commonly used commercial rootstock, RX1 (J. microcarpa × J. regia). We used a common rootstock to eliminate any own-root effects and to simulate conditions for a commercial walnut orchard setting, where rootstocks are commonly used. The grafted saplings were repotted and transferred to the Armstrong lathe house facility at the University of California, Davis in June 2019, and kept under natural light and temperature. Leaves from each accession and treatment were scanned using X-ray micro-computed tomography (microCT) on the X-ray μCT beamline (8.3.2) at the Advanced Light Source (ALS), Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA, USA, using the 10x objective lens with a pixel resolution of 650 nanometers. For masks, the background has a value of 170,170,170; epidermis 85,85,85; mesophyll 0,0,0; bundle sheath extension 152,152,152; vein 220,220,220; air 255,255,255. Resource Software Recommended: Fiji (ImageJ), url: https://imagej.net/software/fiji/downloads
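Since each tissue class is encoded as an RGB triplet in the masks, a quick way to sanity-check an annotation is to tally pixels per class. A minimal sketch using the walnut-leaf color codes listed above (the mask file name is hypothetical):

```python
# Minimal sketch: tally class pixels in one RGB mask from the walnut-leaf
# set, using the color codes listed above. File name is an assumption.
import numpy as np
from PIL import Image

CLASSES = {
    (170, 170, 170): "background",
    (85, 85, 85): "epidermis",
    (0, 0, 0): "mesophyll",
    (152, 152, 152): "bundle sheath extension",
    (220, 220, 220): "vein",
    (255, 255, 255): "air",
}

mask = np.array(Image.open("leaf_mask_slice_001.png").convert("RGB"))
for rgb, name in CLASSES.items():
    n = np.all(mask == rgb, axis=-1).sum()  # pixels matching this triplet
    print(f"{name:24s} {n} px")
```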
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
The Q-CAT (Querying-Supported Corpus Annotation Tool) is a computational tool for manual annotation of language corpora, which also enables advanced queries on top of these annotations. The tool has been used in various annotation campaigns related to the ssj500k reference training corpus of Slovenian (http://hdl.handle.net/11356/1210), such as named entities, dependency syntax, semantic roles and multi-word expressions, but it can also be used for adding new annotation layers of various types to this or other language corpora. Q-CAT is a .NET application, which runs on the Windows operating system.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
The Q-CAT (Querying-Supported Corpus Annotation Tool) is a tool for manual linguistic annotation of corpora, which also enables advanced queries on top of these annotations. The tool has been used in various annotation campaigns related to the ssj500k reference training corpus of Slovenian (http://hdl.handle.net/11356/1210), such as named entities, dependency syntax, semantic roles and multi-word expressions, but it can also be used for adding new annotation layers of various types to this or other language corpora. Q-CAT is a .NET application, which runs on the Windows operating system.
Version 1.1 enables the automatic attribution of token IDs and personalized font adjustments. Version 1.2 supports the CONLL-U format and working with UD POS tags. Version 1.3 supports adding new layers of annotation on top of CONLL-U (and then saving the corpus as XML TEI). Version 1.4 introduces new features in command-line mode (filtering by sentence ID, multiple link-type visualizations). Version 1.5 supports listening to audio recordings (provided in the # sound_url comment line in CONLL-U).
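A minimal sketch of pulling those per-sentence audio links out of a CONLL-U file, assuming the usual "# key = value" comment form (the file name is hypothetical):

```python
# Minimal sketch: collect "# sound_url" comment lines (used by Q-CAT 1.5)
# from a CONLL-U file. Assumes comments look like "# sound_url = http://...".
def sound_urls(path):
    urls = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            if line.startswith("# sound_url"):
                _, _, value = line.partition("=")  # empty if no "=" present
                urls.append(value.strip())
    return urls

print(sound_urls("corpus.conllu"))  # hypothetical file name
```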
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
As single-cell chromatin accessibility profiling methods advance, scATAC-seq has become ever more important in the study of candidate regulatory genomic regions and their roles in developmental, evolutionary, and disease processes. At the same time, cell type annotation is critical for understanding the cellular composition of complex tissues and identifying potential novel cell types. However, most existing methods that can perform automated cell type annotation are designed to transfer labels from an annotated scRNA-seq data set to another scRNA-seq data set, and it is not clear whether these methods are adaptable to annotating scATAC-seq data. Several methods have recently been proposed for label transfer from scRNA-seq data to scATAC-seq data, but there is a lack of benchmarking studies on the performance of these methods. Here, we evaluated the performance of five scATAC-seq annotation methods on both classification accuracy and scalability using publicly available single-cell datasets from mouse and human tissues including brain, lung, kidney, PBMC, and BMMC. Using the BMMC data as a basis, we further investigated the performance of these methods across different data sizes, mislabeling rates, sequencing depths, and numbers of cell types unique to scATAC-seq. Bridge integration, which is the only method that requires additional multimodal data and does not need gene activity calculation, was overall the best method and robust to changes in data size, mislabeling rate, and sequencing depth. Conos was the most time- and memory-efficient method but performed the worst in terms of prediction accuracy. scJoint tended to assign cells to similar cell types and performed relatively poorly for complex datasets with deep annotations, but performed better for datasets with only major label annotations. The performance of scGCN and Seurat v3 was moderate, but scGCN was the most time-consuming method and performed most similarly to random classifiers for cell types unique to scATAC-seq.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Mass spectrometry (MS) is a powerful technology for the structural elucidation of known or unknown small molecules. However, the accuracy of MS-based structure annotation is still limited due to the presence of numerous isomers in complex matrices. There are still challenges in automatically interpreting the fine structure of molecules, such as the types and positions of substituents (substituent modes, SMs) in the structure. In this study, we employed flavones, flavonols, and isoflavones as examples to develop an automated annotation method for identifying the SMs on the parent molecular skeleton based on a library of characteristic MS/MS fragment ions. Importantly, the user-friendly software AnnoSM was built for the convenience of researchers with limited computational backgrounds. It achieved 76.87% top-1 accuracy on the 148 authentic standards. Among them, 22 sets of flavonoid isomers were successfully differentiated. Moreover, the developed method was successfully applied to complex matrices. One such example is the extract of Ginkgo biloba L. (EGB), in which 331 possible flavonoids with SM candidates were annotated. Among them, 23 flavonoids were verified by authentic standards, and the correct SMs of 13 flavonoids were ranked first on the candidate list. In the future, this software can also be extrapolated to other classes of compounds.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
The Q-CAT (Querying-Supported Corpus Annotation Tool) is a computational tool for manual annotation of language corpora, which also enables advanced queries on top of these annotations. The tool has been used in various annotation campaigns related to the ssj500k reference training corpus of Slovenian (http://hdl.handle.net/11356/1210), such as named entities, dependency syntax, semantic roles and multi-word expressions, but it can also be used for adding new annotation layers of various types to this or other language corpora. Q-CAT is a .NET application, which runs on the Windows operating system.
Version 1.1 enables the automatic attribution of token IDs and personalized font adjustments. Version 1.2 supports the CONLL-U format and working with UD POS tags.
The benthic cover data in this collection result from the analysis of images produced during benthic photo-quadrat surveys conducted along transects at climate stations and permanent sites across American Samoa. These sites were identified by the Ocean and Climate Change team and the ongoing National Coral Reef Monitoring Program. Benthic habitat imagery was quantitatively analyzed using a web-based annotation tool called CoralNet (Beijbom et al. 2015). In general, images are analyzed to produce three functional-group levels of benthic cover: Tier 1 (e.g., hard coral, soft coral, macroalgae, turf algae), Tier 2 (e.g., Hard Coral = massive, branching, foliose, encrusting; Macroalgae = upright macroalgae, encrusting macroalgae, bluegreen macroalgae, and Halimeda), and Tier 3 (e.g., Hard Coral = Astreopora sp., Favia sp., Pocillopora; Macroalgae = Caulerpa sp., Dictyosphaeria sp., Padina sp.). The imagery analyzed to produce the benthic cover data is also included in this collection.
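Percent cover at a given tier is simply the share of annotated points assigned each label in an image. A minimal sketch under that assumption, with hypothetical file and column names (CoralNet's actual export columns may differ):

```python
# Minimal sketch: percent benthic cover per image from a point-annotation
# export. File name and columns ("image", "tier1_label") are assumptions.
import pandas as pd

points = pd.read_csv("coralnet_annotations.csv")

counts = points.groupby(["image", "tier1_label"]).size()  # points per label
totals = points.groupby("image").size()                   # points per image
cover = counts.div(totals, level="image") * 100           # percent cover

print(cover.rename("percent_cover").reset_index())
```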
Source code for replicating the results presented in our paper "On Reliability of Annotations in Contextual Emotion Imagery". This code was developed under MATLAB 2022; for best results, use this or a later version. Authors: Carlos A. Martínez-Miwa and Mario Castelán, Cinvestav, México.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Annotated corpora for the MESINESP2 shared task (Spanish BioASQ track, see https://temu.bsc.es/mesinesp2). BioASQ 2021 will be held at CLEF 2021 (scheduled for Bucharest, Romania, in September): http://clef2021.clef-initiative.eu/
Introduction: These corpora contain the data for each of the subtracks of the MESINESP2 shared task:
[Subtrack 1] MESINESP-L – Scientific Literature:
Training set: It contains all Spanish records from the LILACS and IBECS databases at the Virtual Health Library (VHL) with non-empty abstracts written in Spanish. We have filtered out empty and non-Spanish abstracts. We built the training dataset from data crawled on 01/29/2021. This means the data is a snapshot of that moment and may change over time, since LILACS and IBECS usually add or modify indexes after a record's first inclusion in the database. We distribute two different datasets:
Articles training set: This corpus contains the set of 237574 Spanish scientific papers in VHL that have at least one DeCS code assigned to them.
Full training set: This corpus contains the whole set of 249474 Spanish documents from VHL that have at least one DeCS code assigned to them.
Development set: We provide a development set manually indexed by expert annotators. This dataset includes 1065 articles annotated with DeCS by three expert indexers in this controlled vocabulary. The articles were initially indexed by 7 annotators; after analyzing the inter-annotator agreement among their annotations, we selected the 3 best annotators and took their annotations as the valid ones for building the test set. Of those 1065 records:
213 articles were annotated by more than one annotator. We selected the union of their annotations.
852 articles were annotated by only one of the three selected annotators.
Test set: We provide a test set containing 10179 abstracts without DeCS codes (not annotated) from LILACS and IBECS. Participants will have to predict the DeCS codes for each of the abstracts in the entire dataset. However, the evaluation of the systems will only be made on the set of 500 expert-annotated abstracts that will be published as a Gold Standard after the evaluation period ends.
[Subtrack 2] MESINESP-T – Clinical Trials:
Training set: The training dataset contains records from the Registro Español de Estudios Clínicos (REEC). REEC does not provide documents with the title/abstract structure needed in BioASQ, so we built artificial abstracts based on the content available in the data crawled using the REEC API. Clinical trials are not indexed with DeCS terminology, so we used as training data a set of 3560 clinical trials that were automatically annotated in the first edition of MESINESP and published as a Silver Standard outcome. Because the performance of the models used by the participants was variable, we only selected predictions from runs with a MiF (micro-averaged F1; see the sketch after this subtrack's description) higher than 0.41, which corresponds to the submission of the best team.
Development set: We provide a development set manually indexed by expert annotators. This dataset includes 147 clinical trials annotated with DeCS by seven expert indexers in this controlled vocabulary.
Test set: The test dataset contains a collection of 8919 items. Of these, 461 clinical trials come from REEC and 8458 clinical trials were artificially constructed from drug datasheets that have a structure similar to REEC documents. The evaluation of the systems will be performed on a set of 250 items annotated by DeCS experts following the same protocol as in subtrack 1. Similarly, these items will be published as a Gold Standard after completion of the task.
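The MiF threshold mentioned above refers to micro-averaged F1 over DeCS code assignments. A minimal illustrative sketch (document IDs and codes are hypothetical; this is not the task's official evaluation script):

```python
# Minimal sketch: micro-averaged F1 ("MiF") for multi-label DeCS predictions.
# Document IDs and codes below are hypothetical.
def micro_f1(gold: dict, pred: dict) -> float:
    tp = fp = fn = 0
    for doc, gold_codes in gold.items():
        pred_codes = pred.get(doc, set())
        tp += len(gold_codes & pred_codes)   # correctly predicted codes
        fp += len(pred_codes - gold_codes)   # spurious predictions
        fn += len(gold_codes - pred_codes)   # missed codes
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

gold = {"doc1": {"D001", "D002"}, "doc2": {"D003"}}
pred = {"doc1": {"D001"}, "doc2": {"D003", "D004"}}
print(f"MiF = {micro_f1(gold, pred):.3f}")  # 0.667
```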
[Subtrack 3] MESINESP-P – Patents:
Development set: We provide a development set manually indexed by expert annotators. This dataset includes 115 patents in Spanish extracted from Google Patents that have the IPC codes “A61P” and “A61K31”. We selected these patents based on semantic similarity to the MESINESP-L training set to facilitate model generation and to try to improve model performance.
Test set: We provide a test set containing 68404 records that correspond to the total number of patents published in Spanish with the IPC codes “A61P” and “A61K31”. From this set, 150 will be selected and indexed by DeCS experts under the protocol defined in subtask 1, which will be used to evaluate the quality of the developed systems. Similarly to the development set, we selected these 150 records based on semantic similarity to the MESINESP-L training set.
Additional data:
We provide this information to the participants as additional data in the “Additional Data” folder. For each training, development, and test set there is an additional JSON file with the structure shown here. Each file contains entities related to medications, diseases, symptoms, and medical procedures extracted with the BSC NERs.
Files structure:
Subtrack1-Scientific_Literature.zip contains the corpora generated for subtrack 1. Content:
Subtrack1:
Train
training_set_track1_all.json: Full training set for subtrack 1.
training_set_track1_only_articles.json: Articles training set for subtrack 1.
Development
development_set_subtrack1.json: Manually annotated development set for subtrack 1.
Test
test_set_subtrack1.json: Test set for subtrack 1.
Subtrack2-Clinical_Trials.zip contains the corpora generated for subtrack 2. Content:
Subtrack2:
Train
training_set_subtrack2.json: Training set for subtrack 2.
Development
development_set_subtrack2.json: Manually annotated development set for subtrack 2.
Test
test_set_subtrack2.json: Test set for subtrack 2.
Subtrack3-Patents.zip contains the corpora generated for subtrack 3. Content:
Subtrack3:
Development
development_set_subtrack3.json: Manually annotated development set for subtrack 3.
Test
test_set_subtrack3.json: Test set for subtrack 3.
Additional data.zip contains the corpora with additional data for each subtrack of MESINESP2.
DeCS2020.tsv contains a DeCS table with the following structure:
DeCS code
Preferred descriptor (the preferred label in the Latin Spanish DeCS 2020 set)
List of synonyms (the descriptors and synonyms from the Latin Spanish DeCS 2020 set, separated by pipes)
DeCS2020.obo contains the *.obo file with the hierarchical relationships between DeCS descriptors.
*Note: The obo and tsv files with DeCS2020 descriptors contain some additional COVID19 descriptors that will be included in future versions of DeCS. These items were provided by the Pan American Health Organization (PAHO), which has kindly shared this content to improve the results of the task by taking these descriptors into account.
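As a convenience, a minimal sketch of loading DeCS2020.tsv into a lookup table, assuming exactly the three tab-separated columns described above (adjust if the file ships with a header row):

```python
# Minimal sketch: parse DeCS2020.tsv (code, preferred descriptor,
# pipe-separated synonyms) into a dictionary keyed by DeCS code.
import csv

decs = {}
with open("DeCS2020.tsv", encoding="utf-8") as fh:
    for code, preferred, synonyms in csv.reader(fh, delimiter="\t"):
        decs[code] = {
            "preferred": preferred,
            "synonyms": synonyms.split("|") if synonyms else [],
        }

print(len(decs), "DeCS descriptors loaded")
```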
For further information, please visit https://temu.bsc.es/mesinesp2/ or email us at lgasco@bsc.es
https://www.cancerimagingarchive.net/data-usage-policies-and-restrictions/
This dataset contains image annotations derived from "The Clinical Proteomic Tumor Analysis Consortium Pancreatic Ductal Adenocarcinoma Collection (CPTAC-PDA)". This dataset was generated as part of a National Cancer Institute project to augment images from The Cancer Imaging Archive with tumor annotations that will improve their value for cancer researchers and artificial intelligence experts.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A total of eight deployments of an autonomous baited camera lander were conducted at the Cabo Verde Abyssal Plain (tropical East Atlantic, Lat. 14.72, Lon. -25.19, water depth ~4200 m) using either Atlantic mackerel (Scomber scombrus, n=4) or Patagonian squid (Doryteuthis gahi, n=4) as bait, to photograph organisms attracted to the bait over roughly 24 hours. The deployments took place during the iMirabilis2 campaign in August 2021 from the research vessel Sarmiento de Gamboa. A deep-sea time-lapse camera system with an oblique view of the bait plate (12 cm x 45 cm) and surroundings took a picture every 150 seconds. The bar attached to the bait plate is 6 cm wide. The camera was located about 120 cm above the seafloor with an oblique view of 40 degrees (where straight down is 0 degrees). Annotations were performed in the BIIGLE software (Langenkämper et al. 2017) on every second photograph, providing the morphospecies group label (or 'No ID' if identification to morphospecies level was not possible) and the taxonomic hierarchy to the level of best confidence for each annotation. Annotations were rectangular in shape, enclosing each individual so that the centre of the annotation was roughly the centre of mass, and the coordinates of each rectangle corner are provided in pixels (x,y), where the lower left corner of the picture is 0,0. Images were 6000 pixels in width and 4000 pixels in height.
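Note that most image libraries place the origin at the top-left corner, whereas these annotations use a lower-left origin. A minimal conversion sketch (the rectangle below is hypothetical):

```python
# Minimal sketch: convert rectangle corners from the lower-left pixel
# origin described above to a top-left origin. Images are 6000 x 4000 px.
IMG_H = 4000  # image height in pixels

def to_top_left(corners, img_h=IMG_H):
    """corners: list of (x, y) pairs with origin at the lower-left corner."""
    return [(x, img_h - y) for x, y in corners]

# Hypothetical rectangle around one individual:
rect = [(100, 200), (300, 200), (300, 350), (100, 350)]
print(to_top_left(rect))
```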
Drug interaction annotations over a large set of drug product labels. These annotations were done as part of the National Library of Medicine funded research project "Addressing gaps in clinically useful evidence on drug-drug interactions" (National Library of Medicine R01LM011838).
This data collection consists of two .csv files containing lists of sentences with individual and mean sentence ratings (crowd-sourced judgements) on three modes of presentation.
This research holds out the prospect of important impact in two areas. First, it can shed light on the relationship between the representation and acquisition of linguistic knowledge on the one hand, and learning and the encoding of knowledge in other cognitive domains on the other. This work can, in turn, help to clarify the respective roles of biologically conditioned learning biases and data-driven learning in human cognition.
Second, this work can contribute to the development of more effective language technology by providing insight, from a computational perspective, into the way in which humans represent the syntactic properties of sentences in their language. To the extent that natural language processing systems take account of this class of representations they will provide more efficient tools for parsing and interpreting text and speech.
In the past twenty-five years work in natural language technology has made impressive progress across a wide range of tasks, which include, among others, information retrieval and extraction, text interpretation and summarization, speech recognition, morphological analysis, syntactic parsing, word sense identification, and machine translation. Much of this progress has been due to the successful application of powerful techniques for probabilistic modeling and statistical analysis to large corpora of linguistic data. These methods have given rise to a set of engineering tools that are rapidly shaping the digital environment in which we access and process most of the information that we use.
In recent work (Lappin and Shieber (2007), Clark and Lappin (2011a), Clark and Lappin (2011b)) my co-authors and I have argued that the machine learning methods that are driving the expansion of natural language technology are also directly relevant to understanding central features of human language acquisition. When these methods are used to construct carefully specified formal models and implementations of the grammar induction task, they yield striking insights into the limits and possibility of human learning on the basis of the primary linguistic data to which children are exposed. These models indicate that language learning can be achieved without the sorts of strong innate learning biases that have been posited by traditional theories of universal grammar. Weak biases, some derivable from non-linguistic cognitive domains, and domain general learning procedures are sufficient to support efficient data driven learning of plausible systems of grammatical representation.
In the current research I am focussing on the problem of how to specify the class of representations that encode human knowledge of the syntax of natural languages. I am pursuing the hypothesis that a representation in this class is best expressed as an enriched statistical language model that assigns probability values to the sentences of a language. A central part of the enrichment of the model consists of a procedure for determining the acceptability (grammaticality) of a sentence as a graded value, relative to the properties of that sentence and the language of which it is a part. This procedure avoids the simple reduction of the grammaticality of a string to its estimated probability of occurrence, while still characterizing grammaticality in probabilistic terms. An enriched model of this kind will provide a straightforward explanation for the fact that individual native speakers generally judge the well-formedness of sentences along a continuum, rather than through the imposition of a sharp boundary between acceptable and unacceptable sentences. The pervasiveness of gradedness in the linguistic knowledge of individual speakers poses a serious problem for classical theories of syntax, which partition strings of words into the grammatical sentences of a language and ill-formed strings of words.
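One standard way to characterize graded acceptability probabilistically without equating it with raw probability is to normalize a model's log probability for sentence length and lexical frequency, as in the SLOR score used in related computational work. A minimal sketch with hypothetical model scores (the log probabilities are assumed to come from some language model, not specified here):

```python
# Minimal sketch: SLOR, a graded acceptability score that does not reduce
# grammaticality to raw probability. It subtracts the unigram log
# probability (penalizing rare words less) and divides by length (so short
# sentences are not trivially favored).
def slor(log_p: float, unigram_log_p: float, length: int) -> float:
    """(log P(s) - log P_unigram(s)) / |s|"""
    return (log_p - unigram_log_p) / length

# Hypothetical language-model scores for a 6-word sentence:
print(slor(log_p=-18.2, unigram_log_p=-30.5, length=6))  # 2.05
```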
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background: Finding articles related to a publication of interest remains a challenge in the Life Sciences domain as the number of scientific publications grows day by day. Publication repositories such as PubMed and Elsevier provide a list of similar articles, where similarity is commonly calculated based on the title, abstract, and some keywords assigned to articles. Here we present the datasets and algorithms used in Biolinks. Biolinks uses ontological concepts extracted from publications and makes it possible to calculate a distribution score across semantic groups, as well as a semantic similarity based on either all identified annotations or narrowed to one or more particular semantic groups.
Materials: In a previous work [1], 4,240 articles from the TREC-05 collection [2] were selected. The titles-and-abstracts of those 4,240 articles were annotated with Unified Medical Language System (UMLS) concepts; such annotations are referred to as our TA-dataset and correspond to the JSON files under the pubmed folder in the JSON-LD.zip file. Of those 4,240 articles, full text was available for only 62. The title-and-abstract annotations for those 62 articles, the TAFT-dataset, are located under the pubmed-pmc folder in the JSON-LD.zip file, which also contains the full-text annotations under the folder pmc, the FT-dataset. The list corresponding to articles with title-and-abstract is found in the genomics.qrels.large.pubmed.onlyRelevants.titleAndAbstract.tsv file, while those with full text are recorded in the genomics.qrels.large.pmc.onlyRelevants.fullContent.tsv file.
Methods: The TA-dataset was used to calculate the Information Gain (IG) according to the UMLS semantic groups, see IG_umls_groups.PMID.xlsx. A new grouping is proposed for Biolinks, see biolinks_groups.tsv. The IG was calculated for the Biolinks groups as well, see IG_biolinks_groups.PMID.xlsx, showing an improvement of around 5%.
Biolinks groups were used to calculate a semantic group distribution score for each article in all our datasets. A semantic similarity metric based on PubMed related articles [3] is also provided; the Biolinks groups can be used to narrow the similarity to one or more selected groups. All the corresponding algorithms are open access and available on GitHub under the Apache-2.0 license; a frozen version, biotea-io-parser-master.zip, is provided here. To facilitate the analysis of our datasets based on the annotations as well as the distribution and similarity scores, some web-based visualization components were created, all of them open access and available on GitHub under the Apache-2.0 license; frozen versions are provided here, see files biotea-vis-annotation-master.zip, biotea-vis-similarity-master.zip, biotea-vis-tooltip-master.zip and biotea-vis-topicDistribution-master.zip. These components are brought together by biotea-vis-biolinks-master.zip. A demo is provided at http://ljgarcia.github.io/biotea-biolinks/; this demo was built on top of GitHub Pages, and a frozen version of the gh-pages branch is provided here, see biotea-biolinks-gh-pages.zip.
Conclusions: Biolinks assigns a weight to each semantic group based on the annotations extracted from either title-and-abstract or full-text articles. It also measures similarity for a pair of documents using the semantic information. The distribution and similarity metrics can be narrowed to a subset of the semantic groups, enabling researchers to focus on what is more relevant to them.
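For intuition only, a minimal sketch of a group-narrowed similarity over annotated concepts. This is plain Jaccard over hypothetical UMLS concept IDs, not the similarity metric Biolinks actually uses:

```python
# Minimal sketch: similarity of two articles over their annotated concepts,
# optionally narrowed to one semantic group. Concept IDs are hypothetical;
# this is plain Jaccard, not Biolinks' metric.
def similarity(annos_a, annos_b, group=None):
    groups = [group] if group else set(annos_a) | set(annos_b)
    a = set().union(*(annos_a.get(g, set()) for g in groups))
    b = set().union(*(annos_b.get(g, set()) for g in groups))
    return len(a & b) / len(a | b) if a | b else 0.0

art1 = {"DISO": {"C0011849", "C0020538"}, "CHEM": {"C0021641"}}
art2 = {"DISO": {"C0011849"}, "CHEM": {"C0008838"}}
print(similarity(art1, art2))          # across all groups: 0.25
print(similarity(art1, art2, "DISO"))  # narrowed to disorders: 0.5
```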
[1] Garcia Castro, L.J., R. Berlanga, and A. Garcia, In the pursuit of a semantic similarity metric based on UMLS annotations for articles in PubMed Central Open Access. Journal of Biomedical Informatics, 2015. 57: p. 204-218
[2] Text Retrieval Conference 2005 - Genomics Track. TREC-05 Genomics Track ad hoc relevance judgement. 2005 [cited 2016 23rd August]; Available from: http://trec.nist.gov/data/genomics/05/genomics.qrels.large.txt
[3] Lin, J. and W.J. Wilbur, PubMed related articles: a probabilistic topic-based model for content similarity. BMC Bioinformatics, 2007. 8(1): p. 423
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Top 15 metabolic processes identified among the differentially expressed metabolic genes using DAVID Functional Annotation Tool.
Large public aquaria are complex ecosystems that require constant monitoring to detect and correct anomalies that may affect the habitat and its species. Many of those anomalies can be directly or indirectly spotted by monitoring the behavior of fish, which can be quite a laborious task for biologists to do alone. Automated fish-tracking methods, especially of the non-intrusive type, can help biologists detect such events in a timely manner. These systems require annotated fish data to be trained. We used footage collected from the main aquarium of Oceanário de Lisboa to create a novel dataset with fish annotations for the shark and ray species. The dataset has the following characteristics:
66 shark training tracks with a total of 15812 bounding boxes
88 shark testing tracks with a total of 15978 bounding boxes
133 ray training tracks with a total of 28168 bounding boxes
192 ray testing tracks with a total of 31529 bounding boxes
The training set corresponds to a calm enviro... The dataset was collected using a stationary camera positioned outside the main tank of Oceanário de Lisboa, aimed at the fish. This data was processed using the CVAT annotation tool to create the shark and ray annotations.
Sharks and rays swimming in a large public aquarium
Each set has 2 folders: gt and img1. The gt folder contains 3 txt files: gt, gt_out and labels. The gt and gt_out files contain the bounding box annotations sorted in two distinct ways. The former has the annotations sorted by frame number, while the latter is sorted by the track ID. Each line of the ground truth files represents one bounding box of a fish trajectory. The bounding boxes are represented with the following format: frame id, track id, x, y, w, h, not ignored, class id, visibility. The folder img1 contains all the annotated frames.
frame id points to the frame where the bounding box was obtained;
track id identifies the track of a fish with which the bounding box is associated;
x and y are the pixel coordinates of the top left corner of the bounding box;
w and h are the width and height of the bounding box respectively. These variables are measured in terms of pixels o...
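A minimal sketch of parsing the gt.txt format described above (frame id, track id, x, y, w, h, not ignored, class id, visibility); the path is hypothetical:

```python
# Minimal sketch: load the ground-truth bounding boxes from a gt.txt file
# in the format described above. Path is an assumption.
import csv

def load_gt(path):
    boxes = []
    with open(path, newline="") as fh:
        for row in csv.reader(fh):
            frame, track, x, y, w, h, keep, cls, vis = row
            boxes.append({
                "frame": int(frame), "track": int(track),
                "x": float(x), "y": float(y), "w": float(w), "h": float(h),
                "ignored": int(keep) == 0,  # "not ignored" flag inverted
                "class": int(cls), "visibility": float(vis),
            })
    return boxes

boxes = load_gt("gt/gt.txt")
print(len(boxes), "bounding boxes loaded")
```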
Here, we provide an improved version of the PN40024 genome assembly, called PN40024.v4, which combines the top-quality Sanger contigs from the 12X version with Pacific Biosciences long reads (Sequel SMRT). Along with this new assembly, we also provide a new version of the gene annotation, called PN40024.v4.1 based on a newly developed annotation workflow, RNA-Seq datasets and manual curation of a set of genes of functional interest to the community.