Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This is the accompanying data for the paper "Analyzing Dataset Annotation Quality Management in the Wild". Data quality is crucial for training accurate, unbiased, and trustworthy machine learning models, and for their correct evaluation. Recent works, however, have shown that even popular datasets used to train and evaluate state-of-the-art models contain a non-negligible amount of erroneous annotations, bias, or annotation artifacts. Best practices and guidelines regarding annotation projects exist, but to the best of our knowledge, no large-scale analysis has yet been performed on how quality management is actually conducted when creating natural language datasets, or whether these recommendations are followed. Therefore, we first survey and summarize recommended quality management practices for dataset creation as described in the literature and provide suggestions on how to apply them. Then, we compile a corpus of 591 scientific publications introducing text datasets and annotate it for quality-related aspects, such as annotator management, agreement, adjudication, or data validation. Using these annotations, we then analyze how quality management is conducted in practice. We find that a majority of the annotated publications apply good or very good quality management. However, we deem the effort of 30% of the works only subpar. Our analysis also shows common errors, especially in using inter-annotator agreement and computing annotation error rates.
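As a pointer for readers less familiar with agreement statistics: raw percent agreement overstates reliability because it ignores agreement expected by chance, which is one of the common errors the paper discusses. A minimal sketch with hypothetical labels (not data from the paper):

```python
# Minimal sketch: chance-corrected inter-annotator agreement for two
# annotators over the same items (hypothetical labels, not from the paper).
from sklearn.metrics import cohen_kappa_score

annotator_a = ["POS", "NEG", "NEG", "POS", "NEU", "POS"]
annotator_b = ["POS", "NEG", "POS", "POS", "NEU", "NEG"]

# Raw percent agreement ignores chance agreement.
raw = sum(a == b for a, b in zip(annotator_a, annotator_b)) / len(annotator_a)

# Cohen's kappa corrects for agreement expected by chance.
kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"raw agreement: {raw:.2f}, Cohen's kappa: {kappa:.2f}")
```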
The Q-CAT (Querying-Supported Corpus Annotation Tool) is a tool for manual linguistic annotation of corpora, which also enables advanced queries on top of these annotations. The tool has been used in various annotation campaigns related to the ssj500k reference training corpus of Slovenian (http://hdl.handle.net/11356/1210), such as named entities, dependency syntax, semantic roles and multi-word expressions, but it can also be used for adding new annotation layers of various types to this or other language corpora. Q-CAT is a .NET application, which runs on the Windows operating system. Version 1.1 enables the automatic attribution of token IDs and personalized font adjustments. Version 1.2 supports the CONLL-U format and working with UD POS tags. Version 1.3 supports adding new layers of annotation on top of CONLL-U (and then saving the corpus as XML TEI). Version 1.4 introduces new features in command-line mode (filtering by sentence ID, multiple link-type visualizations).
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
The Q-CAT (Querying-Supported Corpus Annotation Tool) is a computational tool for manual annotation of language corpora, which also enables advanced queries on top of these annotations. The tool has been used in various annotation campaigns related to the ssj500k reference training corpus of Slovenian (http://hdl.handle.net/11356/1210), such as named entities, dependency syntax, semantic roles and multi-word expressions, but it can also be used for adding new annotation layers of various types to this or other language corpora. Q-CAT is a .NET application, which runs on the Windows operating system.
Version 1.1 enables the automatic attribution of token IDs and personalized font adjustments. Version 1.2 supports the CONLL-U format and working with UD POS tags. Version 1.3 supports adding new layers of annotation on top of CONLL-U (and then saving the corpus as XML TEI).
Leaves from genetically unique Juglans regia plants were scanned using X-ray micro-computed tomography (microCT) on the X-ray μCT beamline (8.3.2) at the Advanced Light Source (ALS), Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA, USA. Soil samples were collected in fall 2017 from the riparian oak forest located at the Russell Ranch Sustainable Agricultural Institute at the University of California, Davis. The soil was sieved through a 2 mm mesh and air dried before imaging. A single soil aggregate was scanned at 23 keV using the 10x objective lens with a pixel resolution of 650 nanometers on beamline 8.3.2 at the ALS. Additionally, a drought-stressed almond flower bud (Prunus dulcis) from a plant housed at the University of California, Davis, was scanned using a 4x lens with a pixel resolution of 1.72 µm on beamline 8.3.2 at the ALS. Raw tomographic image data were reconstructed using TomoPy. Reconstructions were converted to 8-bit TIF or PNG format using ImageJ or the PIL package in Python before further processing. Images were annotated using Intel's Computer Vision Annotation Tool (CVAT) and ImageJ; both tools are free to use and open source. Leaf images were annotated following Théroux-Rancourt et al. (2020). Specifically, hand labeling was done directly in ImageJ by drawing around each tissue, with 5 images annotated per leaf. Care was taken to cover a range of anatomical variation to help improve the generalizability of the models to other leaves. All slices were labeled by Dr. Mina Momayyezi and Fiona Duong. To annotate the flower bud and soil aggregate, images were imported into CVAT. The exterior border of the bud (i.e., bud scales) and flower were annotated in CVAT and exported as masks. Similarly, the exterior of the soil aggregate and particulate organic matter identified by eye were annotated in CVAT and exported as masks. To annotate air spaces in both the bud and soil aggregate, images were imported into ImageJ. A Gaussian blur was applied to each image to decrease noise, and the air space was then segmented using thresholding. After applying the threshold, the selected air-space region was converted to a binary image, with white representing the air space and black representing everything else. This binary image was overlaid on the original image, and the air space within the flower bud and aggregate was selected using the "free hand" tool. Air space outside the region of interest for both image sets was eliminated. The quality of the air-space annotation was then visually inspected for accuracy against the underlying original image; incomplete annotations were corrected using the brush or pencil tool to paint missing air space white and incorrectly identified air space black. Once the annotation was satisfactorily corrected, the binary image of the air space was saved. Finally, the annotations of the bud and flower, or aggregate and organic matter, were opened in ImageJ and the associated air-space mask was overlaid on top of them, forming a three-layer mask suitable for training the fully convolutional network. All labeling of the soil aggregate images was done by Dr. Devin Rippner. These images and annotations are for training deep learning models to identify different constituents in leaves, almond buds, and soil aggregates.
Limitations: For the walnut leaves, some tissues (stomata, etc.) are not labeled, and the images represent only a small portion of a full leaf. Similarly, the almond bud and the aggregate each represent a single sample. The bud tissues are only divided into bud scales, flower, and air space; many other tissues remain unlabeled. For the soil aggregate, labels were assigned by eye with no supporting chemical information, so particulate organic matter identification may be incorrect.
Resources in this dataset:
Resource Title: Annotated X-ray CT images and masks of a Forest Soil Aggregate. File Name: forest_soil_images_masks_for_testing_training.zip. Resource Description: This aggregate was collected from the riparian oak forest at the Russell Ranch Sustainable Agricultural Facility. The aggregate was scanned using X-ray micro-computed tomography (microCT) on the X-ray μCT beamline (8.3.2) at the Advanced Light Source (ALS), Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA, USA, using the 10x objective lens with a pixel resolution of 650 nanometers. For masks, the background has a value of 0,0,0; pore spaces have a value of 250,250,250; mineral solids have a value of 128,0,0; and particulate organic matter has a value of 0,128,0. These files were used for training a model to segment the forest soil aggregate and for testing the accuracy, precision, recall, and F1 score of the model.
Resource Title: Annotated X-ray CT images and masks of an Almond bud (P. dulcis). File Name: Almond_bud_tube_D_P6_training_testing_images_and_masks.zip. Resource Description: A drought-stressed almond flower bud (Prunus dulcis) from a plant housed at the University of California, Davis, was scanned by X-ray micro-computed tomography (microCT) on the X-ray μCT beamline (8.3.2) at the Advanced Light Source (ALS), Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA, USA, using the 4x lens with a pixel resolution of 1.72 µm. For masks, the background has a value of 0,0,0; air spaces have a value of 255,255,255; bud scales have a value of 128,0,0; and flower tissues have a value of 0,128,0. These files were used for training a model to segment the almond bud and for testing the accuracy, precision, recall, and F1 score of the model. Resource Software Recommended: Fiji (ImageJ), url: https://imagej.net/software/fiji/downloads
Resource Title: Annotated X-ray CT images and masks of Walnut leaves (J. regia). File Name: 6_leaf_training_testing_images_and_masks_for_paper.zip. Resource Description: Stems were collected from genetically unique J. regia accessions at the USDA-ARS-NCGR in Wolfskill Experimental Orchard, Winters, California, USA to use as scion, and were grafted by Sierra Gold Nursery onto a commonly used commercial rootstock, RX1 (J. microcarpa × J. regia). We used a common rootstock to eliminate any own-root effects and to simulate conditions for a commercial walnut orchard setting, where rootstocks are commonly used. The grafted saplings were repotted and transferred to the Armstrong lathe house facility at the University of California, Davis in June 2019, and kept under natural light and temperature. Leaves from each accession and treatment were scanned using X-ray micro-computed tomography (microCT) on the X-ray μCT beamline (8.3.2) at the Advanced Light Source (ALS), Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA, USA, using the 10x objective lens with a pixel resolution of 650 nanometers. For masks, the background has a value of 170,170,170; epidermis 85,85,85; mesophyll 0,0,0; bundle sheath extension 152,152,152; vein 220,220,220; air 255,255,255. Resource Software Recommended: Fiji (ImageJ), url: https://imagej.net/software/fiji/downloads
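Since each tissue class is encoded as an RGB triplet in the masks, a quick way to sanity-check an annotation is to tally pixels per class. A minimal sketch using the walnut-leaf color codes listed above (the mask file name is hypothetical):

```python
# Minimal sketch: tally class pixels in one RGB mask from the walnut-leaf
# set, using the color codes listed above. File name is an assumption.
import numpy as np
from PIL import Image

CLASSES = {
    (170, 170, 170): "background",
    (85, 85, 85): "epidermis",
    (0, 0, 0): "mesophyll",
    (152, 152, 152): "bundle sheath extension",
    (220, 220, 220): "vein",
    (255, 255, 255): "air",
}

mask = np.array(Image.open("leaf_mask_slice_001.png").convert("RGB"))
for rgb, name in CLASSES.items():
    n = np.all(mask == rgb, axis=-1).sum()  # pixels matching this triplet
    print(f"{name:24s} {n} px")
```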
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
The Q-CAT (Querying-Supported Corpus Annotation Tool) is a computational tool for manual annotation of language corpora, which also enables advanced queries on top of these annotations. The tool has been used in various annotation campaigns related to the ssj500k reference training corpus of Slovenian (http://hdl.handle.net/11356/1210), such as named entities, dependency syntax, semantic roles and multi-word expressions, but it can also be used for adding new annotation layers of various types to this or other language corpora. Q-CAT is a .NET application, which runs on the Windows operating system.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
The Q-CAT (Querying-Supported Corpus Annotation Tool) is a tool for manual linguistic annotation of corpora, which also enables advanced queries on top of these annotations. The tool has been used in various annotation campaigns related to the ssj500k reference training corpus of Slovenian (http://hdl.handle.net/11356/1210), such as named entities, dependency syntax, semantic roles and multi-word expressions, but it can also be used for adding new annotation layers of various types to this or other language corpora. Q-CAT is a .NET application, which runs on the Windows operating system.
Version 1.1 enables the automatic attribution of token IDs and personalized font adjustments. Version 1.2 supports the CONLL-U format and working with UD POS tags. Version 1.3 supports adding new layers of annotation on top of CONLL-U (and then saving the corpus as XML TEI). Version 1.4 introduces new features in command-line mode (filtering by sentence ID, multiple link-type visualizations). Version 1.5 supports listening to audio recordings (provided in the # sound_url comment line in CONLL-U).
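A minimal sketch of pulling those per-sentence audio links out of a CONLL-U file, assuming the usual "# key = value" comment form (the file name is hypothetical):

```python
# Minimal sketch: collect "# sound_url" comment lines (used by Q-CAT 1.5)
# from a CONLL-U file. Assumes comments look like "# sound_url = http://...".
def sound_urls(path):
    urls = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            if line.startswith("# sound_url"):
                _, _, value = line.partition("=")  # empty if no "=" present
                urls.append(value.strip())
    return urls

print(sound_urls("corpus.conllu"))  # hypothetical file name
```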
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
As single-cell chromatin accessibility profiling methods advance, scATAC-seq has become ever more important in the study of candidate regulatory genomic regions and their roles in developmental, evolutionary, and disease processes. At the same time, cell type annotation is critical for understanding the cellular composition of complex tissues and identifying potential novel cell types. However, most existing methods that can perform automated cell type annotation are designed to transfer labels from an annotated scRNA-seq data set to another scRNA-seq data set, and it is not clear whether these methods are adaptable to annotating scATAC-seq data. Several methods have recently been proposed for label transfer from scRNA-seq data to scATAC-seq data, but there is a lack of benchmarking studies on the performance of these methods. Here, we evaluated the performance of five scATAC-seq annotation methods on both classification accuracy and scalability using publicly available single-cell datasets from mouse and human tissues including brain, lung, kidney, PBMC, and BMMC. Using the BMMC data as a basis, we further investigated the performance of these methods across different data sizes, mislabeling rates, sequencing depths, and numbers of cell types unique to scATAC-seq. Bridge integration, which is the only method that requires additional multimodal data and does not need gene activity calculation, was overall the best method and robust to changes in data size, mislabeling rate, and sequencing depth. Conos was the most time- and memory-efficient method but performed the worst in terms of prediction accuracy. scJoint tended to assign cells to similar cell types and performed relatively poorly for complex datasets with deep annotations, but performed better for datasets with only major label annotations. The performance of scGCN and Seurat v3 was moderate, but scGCN was the most time-consuming method and performed most similarly to random classifiers for cell types unique to scATAC-seq.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Mass spectrometry (MS) is a powerful technology for the structural elucidation of known or unknown small molecules. However, the accuracy of MS-based structure annotation is still limited due to the presence of numerous isomers in complex matrices. There are still challenges in automatically interpreting the fine structure of molecules, such as the types and positions of substituents (substituent modes, SMs) in the structure. In this study, we employed flavones, flavonols, and isoflavones as examples to develop an automated annotation method for identifying the SMs on the parent molecular skeleton based on a library of characteristic MS/MS fragment ions. Importantly, the user-friendly software AnnoSM was built for the convenience of researchers with limited computational backgrounds. It achieved 76.87% top-1 accuracy on the 148 authentic standards. Among them, 22 sets of flavonoid isomers were successfully differentiated. Moreover, the developed method was successfully applied to complex matrices. One such example is the extract of Ginkgo biloba L. (EGB), in which 331 possible flavonoids with SM candidates were annotated. Among them, 23 flavonoids were verified by authentic standards, and the correct SMs of 13 flavonoids were ranked first on the candidate list. In the future, this software can also be extrapolated to other classes of compounds.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
The Q-CAT (Querying-Supported Corpus Annotation Tool) is a computational tool for manual annotation of language corpora, which also enables advanced queries on top of these annotations. The tool has been used in various annotation campaigns related to the ssj500k reference training corpus of Slovenian (http://hdl.handle.net/11356/1210), such as named entities, dependency syntax, semantic roles and multi-word expressions, but it can also be used for adding new annotation layers of various types to this or other language corpora. Q-CAT is a .NET application, which runs on the Windows operating system.
Version 1.1 enables the automatic attribution of token IDs and personalized font adjustments. Version 1.2 supports the CONLL-U format and working with UD POS tags.
The benthic cover data in this collection result from the analysis of images produced during benthic photo-quadrat surveys conducted along transects at climate stations and permanent sites across American Samoa. These sites were identified by the Ocean and Climate Change team and the ongoing National Coral Reef Monitoring Program. Benthic habitat imagery was quantitatively analyzed using a web-based annotation tool called CoralNet (Beijbom et al. 2015). In general, images are analyzed to produce three functional-group levels of benthic cover: Tier 1 (e.g., hard coral, soft coral, macroalgae, turf algae), Tier 2 (e.g., Hard Coral = massive, branching, foliose, encrusting; Macroalgae = upright macroalgae, encrusting macroalgae, bluegreen macroalgae, and Halimeda), and Tier 3 (e.g., Hard Coral = Astreopora sp., Favia sp., Pocillopora; Macroalgae = Caulerpa sp., Dictyosphaeria sp., Padina sp.). The imagery analyzed to produce the benthic cover data is also included in this collection.
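Percent cover at a given tier is simply the share of annotated points assigned each label in an image. A minimal sketch under that assumption, with hypothetical file and column names (CoralNet's actual export columns may differ):

```python
# Minimal sketch: percent benthic cover per image from a point-annotation
# export. File name and columns ("image", "tier1_label") are assumptions.
import pandas as pd

points = pd.read_csv("coralnet_annotations.csv")

counts = points.groupby(["image", "tier1_label"]).size()  # points per label
totals = points.groupby("image").size()                   # points per image
cover = counts.div(totals, level="image") * 100           # percent cover

print(cover.rename("percent_cover").reset_index())
```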
Source code for replicating the results presented in our paper "On Reliability of Annotations in Contextual Emotion Imagery". This code was developed under MATLAB 2022; for best results, use this or a later version. Authors: Carlos A. Martínez-Miwa and Mario Castelán, Cinvestav, México.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Annotated corpora for the MESINESP2 shared task (Spanish BioASQ track, see https://temu.bsc.es/mesinesp2). BioASQ 2021 will be held at CLEF 2021 (scheduled for Bucharest, Romania, in September): http://clef2021.clef-initiative.eu/
Introduction: These corpora contain the data for each of the subtracks of the MESINESP2 shared task:
[Subtrack 1] MESINESP-L – Scientific Literature:
Training set: It contains all Spanish records from the LILACS and IBECS databases at the Virtual Health Library (VHL) with non-empty abstracts written in Spanish. We have filtered out empty and non-Spanish abstracts. We built the training dataset from data crawled on 01/29/2021. This means the data is a snapshot of that moment and may change over time, since LILACS and IBECS usually add or modify indexes after a record's first inclusion in the database. We distribute two different datasets:
Articles training set: This corpus contains the set of 237574 Spanish scientific papers in VHL that have at least one DeCS code assigned to them.
Full training set: This corpus contains the whole set of 249474 Spanish documents from VHL that have at least one DeCS code assigned to them.
Development set: We provide a development set manually indexed by expert annotators. This dataset includes 1065 articles annotated with DeCS by three expert indexers in this controlled vocabulary. The articles were initially indexed by 7 annotators; after analyzing the inter-annotator agreement among their annotations, we selected the 3 best annotators and took their annotations as the valid ones for building the test set. Of those 1065 records:
213 articles were annotated by more than one annotator. We selected the union of their annotations.
852 articles were annotated by only one of the three selected annotators.
Test set: We provide a test set containing 10179 abstracts without DeCS codes (not annotated) from LILACS and IBECS. Participants will have to predict the DeCS codes for each of the abstracts in the entire dataset. However, the evaluation of the systems will only be made on the set of 500 expert-annotated abstracts that will be published as a Gold Standard after the evaluation period ends.
[Subtrack 2] MESINESP-T – Clinical Trials:
Training set: The training dataset contains records from the Registro Español de Estudios Clínicos (REEC). REEC does not provide documents with the title/abstract structure needed in BioASQ, so we built artificial abstracts based on the content available in the data crawled using the REEC API. Clinical trials are not indexed with DeCS terminology, so we used as training data a set of 3560 clinical trials that were automatically annotated in the first edition of MESINESP and published as a Silver Standard outcome. Because the performance of the models used by the participants was variable, we only selected predictions from runs with a MiF (micro-averaged F1; see the sketch after this subtrack's description) higher than 0.41, which corresponds to the submission of the best team.
Development set: We provide a development set manually indexed by expert annotators. This dataset includes 147 clinical trials annotated with DeCS by seven expert indexers in this controlled vocabulary.
Test set: The test dataset contains a collection of 8919 items. Of these, 461 clinical trials come from REEC and 8458 clinical trials were artificially constructed from drug datasheets that have a structure similar to REEC documents. The evaluation of the systems will be performed on a set of 250 items annotated by DeCS experts following the same protocol as in subtrack 1. Similarly, these items will be published as a Gold Standard after completion of the task.
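The MiF threshold mentioned above refers to micro-averaged F1 over DeCS code assignments. A minimal illustrative sketch (document IDs and codes are hypothetical; this is not the task's official evaluation script):

```python
# Minimal sketch: micro-averaged F1 ("MiF") for multi-label DeCS predictions.
# Document IDs and codes below are hypothetical.
def micro_f1(gold: dict, pred: dict) -> float:
    tp = fp = fn = 0
    for doc, gold_codes in gold.items():
        pred_codes = pred.get(doc, set())
        tp += len(gold_codes & pred_codes)   # correctly predicted codes
        fp += len(pred_codes - gold_codes)   # spurious predictions
        fn += len(gold_codes - pred_codes)   # missed codes
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

gold = {"doc1": {"D001", "D002"}, "doc2": {"D003"}}
pred = {"doc1": {"D001"}, "doc2": {"D003", "D004"}}
print(f"MiF = {micro_f1(gold, pred):.3f}")  # 0.667
```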
[Subtrack 3] MESINESP-P – Patents:
Development set: We provide a development set manually indexed by expert annotators. This dataset includes 115 patents in Spanish extracted from Google Patents that have the IPC codes “A61P” and “A61K31”. We selected these patents based on semantic similarity to the MESINESP-L training set to facilitate model generation and to try to improve model performance.
Test set: We provide a test set containing 68404 records that correspond to the total number of patents published in Spanish with the IPC codes “A61P” and “A61K31”. From this set, 150 will be selected and indexed by DeCS experts under the protocol defined in subtask 1, which will be used to evaluate the quality of the developed systems. Similarly to the development set, we selected these 150 records based on semantic similarity to the MESINESP-L training set.
Additional data:
We provide this information to the participants as additional data in the “Additional Data” folder. For each training, development, and test set there is an additional JSON file with the structure shown here. Each file contains entities related to medications, diseases, symptoms, and medical procedures extracted with the BSC NERs.
Files structure:
Subtrack1-Scientific_Literature.zip contains the corpora generated for subtrack 1. Content:
Subtrack1:
Train
training_set_track1_all.json: Full training set for subtrack 1.
training_set_track1_only_articles.json: Articles training set for subtrack 1.
Development
development_set_subtrack1.json: Manually annotated development set for subtrack 1.
Test
test_set_subtrack1.json: Test set for subtrack 1.
Subtrack2-Clinical_Trials.zip contains the corpora generated for subtrack 2. Content:
Subtrack2:
Train
training_set_subtrack2.json: Training set for subtrack 2.
Development
development_set_subtrack2.json: Manually annotated development set for subtrack 2.
Test
test_set_subtrack2.json: Test set for subtrack 2.
Subtrack3-Patents.zip contains the corpora generated for subtrack 3. Content:
Subtrack3:
Development
development_set_subtrack3.json: Manually annotated development set for subtrack 3.
Test
test_set_subtrack3.json: Test set for subtrack 3.
Additional data.zip contains the corpora with additional data for each subtrack of MESINESP2.
DeCS2020.tsv contains a DeCS table with the following structure:
DeCS code
Preferred descriptor (the preferred label in the Latin Spanish DeCS 2020 set)
List of synonyms (the descriptors and synonyms from the Latin Spanish DeCS 2020 set, separated by pipes)
DeCS2020.obo contains the *.obo file with the hierarchical relationships between DeCS descriptors.
*Note: The obo and tsv files with DeCS2020 descriptors contain some additional COVID19 descriptors that will be included in future versions of DeCS. These items were provided by the Pan American Health Organization (PAHO), which has kindly shared this content to improve the results of the task by taking these descriptors into account.
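As a convenience, a minimal sketch of loading DeCS2020.tsv into a lookup table, assuming exactly the three tab-separated columns described above (adjust if the file ships with a header row):

```python
# Minimal sketch: parse DeCS2020.tsv (code, preferred descriptor,
# pipe-separated synonyms) into a dictionary keyed by DeCS code.
import csv

decs = {}
with open("DeCS2020.tsv", encoding="utf-8") as fh:
    for code, preferred, synonyms in csv.reader(fh, delimiter="\t"):
        decs[code] = {
            "preferred": preferred,
            "synonyms": synonyms.split("|") if synonyms else [],
        }

print(len(decs), "DeCS descriptors loaded")
```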
For further information, please visit https://temu.bsc.es/mesinesp2/ or email us at lgasco@bsc.es
https://www.cancerimagingarchive.net/data-usage-policies-and-restrictions/
This dataset contains image annotations derived from "The Clinical Proteomic Tumor Analysis Consortium Pancreatic Ductal Adenocarcinoma Collection (CPTAC-PDA)". This dataset was generated as part of a National Cancer Institute project to augment images from The Cancer Imaging Archive with tumor annotations that will improve their value for cancer researchers and artificial intelligence experts.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A total of eight deployments of an autonomous baited camera lander were conducted at the Cabo Verde Abyssal Plain (tropical East Atlantic, Lat. 14.72, Lon. -25.19, water depth ~4200 m) using either Atlantic mackerel (Scomber scombrus, n=4) or Patagonian squid (Doryteuthis gahi, n=4) as bait, to photograph organisms attracted to the bait over roughly 24 hours. The deployments took place during the iMirabilis2 campaign in August 2021 from the research vessel Sarmiento de Gamboa. A deep-sea time-lapse camera system with an oblique view of the bait plate (12 cm x 45 cm) and surroundings took a picture every 150 seconds. The bar attached to the bait plate is 6 cm wide. The camera was located about 120 cm above the seafloor with an oblique view of 40 degrees (where straight down is 0 degrees). Annotations were performed in the BIIGLE software (Langenkämper et al. 2017) on every second photograph, providing the morphospecies group label (or 'No ID' if identification to morphospecies level was not possible) and the taxonomic hierarchy to the level of best confidence for each annotation. Annotations were rectangular in shape, enclosing each individual so that the centre of the annotation was roughly the centre of mass, and the coordinates of each rectangle corner are provided in pixels (x,y), where the lower left corner of the picture is 0,0. Images were 6000 pixels in width and 4000 pixels in height.
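Note that most image libraries place the origin at the top-left corner, whereas these annotations use a lower-left origin. A minimal conversion sketch (the rectangle below is hypothetical):

```python
# Minimal sketch: convert rectangle corners from the lower-left pixel
# origin described above to a top-left origin. Images are 6000 x 4000 px.
IMG_H = 4000  # image height in pixels

def to_top_left(corners, img_h=IMG_H):
    """corners: list of (x, y) pairs with origin at the lower-left corner."""
    return [(x, img_h - y) for x, y in corners]

# Hypothetical rectangle around one individual:
rect = [(100, 200), (300, 200), (300, 350), (100, 350)]
print(to_top_left(rect))
```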
Drug interaction annotations over a large set of drug product labels. These annotations were done as part of the National Library of Medicine funded research project "Addressing gaps in clinically useful evidence on drug-drug interactions" (National Library of Medicine R01LM011838).
This data collection consists of two .csv files containing lists of sentences with individual and mean sentence ratings (crowd-sourced judgements) on three modes of presentation.
This research holds out the prospect of important impact in two areas. First, it can shed light on the relationship between the representation and acquisition of linguistic knowledge on the one hand, and learning and the encoding of knowledge in other cognitive domains on the other. This work can, in turn, help to clarify the respective roles of biologically conditioned learning biases and data-driven learning in human cognition.
Second, this work can contribute to the development of more effective language technology by providing insight, from a computational perspective, into the way in which humans represent the syntactic properties of sentences in their language. To the extent that natural language processing systems take account of this class of representations they will provide more efficient tools for parsing and interpreting text and speech.
In the past twenty-five years work in natural language technology has made impressive progress across a wide range of tasks, which include, among others, information retrieval and extraction, text interpretation and summarization, speech recognition, morphological analysis, syntactic parsing, word sense identification, and machine translation. Much of this progress has been due to the successful application of powerful techniques for probabilistic modeling and statistical analysis to large corpora of linguistic data. These methods have given rise to a set of engineering tools that are rapidly shaping the digital environment in which we access and process most of the information that we use.
In recent work (Lappin and Shieber (2007), Clark and Lappin (2011a), Clark and Lappin (2011b)) my co-authors and I have argued that the machine learning methods that are driving the expansion of natural language technology are also directly relevant to understanding central features of human language acquisition. When these methods are used to construct carefully specified formal models and implementations of the grammar induction task, they yield striking insights into the limits and possibility of human learning on the basis of the primary linguistic data to which children are exposed. These models indicate that language learning can be achieved without the sorts of strong innate learning biases that have been posited by traditional theories of universal grammar. Weak biases, some derivable from non-linguistic cognitive domains, and domain general learning procedures are sufficient to support efficient data driven learning of plausible systems of grammatical representation.
In the current research I am focussing on the problem of how to specify the class of representations that encode human knowledge of the syntax of natural languages. I am pursuing the hypothesis that a representation in this class is best expressed as an enriched statistical language model that assigns probability values to the sentences of a language. A central part of the enrichment of the model consists of a procedure for determining the acceptability (grammaticality) of a sentence as a graded value, relative to the properties of that sentence and the language of which it is a part. This procedure avoids the simple reduction of the grammaticality of a string to its estimated probability of occurrence, while still characterizing grammaticality in probabilistic terms. An enriched model of this kind will provide a straightforward explanation for the fact that individual native speakers generally judge the well-formedness of sentences along a continuum, rather than through the imposition of a sharp boundary between acceptable and unacceptable sentences. The pervasiveness of gradedness in the linguistic knowledge of individual speakers poses a serious problem for classical theories of syntax, which partition strings of words into the grammatical sentences of a language and ill-formed strings of words.
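One standard way to characterize graded acceptability probabilistically without equating it with raw probability is to normalize a model's log probability for sentence length and lexical frequency, as in the SLOR score used in related computational work. A minimal sketch with hypothetical model scores (the log probabilities are assumed to come from some language model, not specified here):

```python
# Minimal sketch: SLOR, a graded acceptability score that does not reduce
# grammaticality to raw probability. It subtracts the unigram log
# probability (penalizing rare words less) and divides by length (so short
# sentences are not trivially favored).
def slor(log_p: float, unigram_log_p: float, length: int) -> float:
    """(log P(s) - log P_unigram(s)) / |s|"""
    return (log_p - unigram_log_p) / length

# Hypothetical language-model scores for a 6-word sentence:
print(slor(log_p=-18.2, unigram_log_p=-30.5, length=6))  # 2.05
```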
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background: Finding articles related to a publication of interest remains a challenge in the Life Sciences domain as the number of scientific publications grows day by day. Publication repositories such as PubMed and Elsevier provide a list of similar articles, where similarity is commonly calculated based on the title, abstract, and some keywords assigned to articles. Here we present the datasets and algorithms used in Biolinks. Biolinks uses ontological concepts extracted from publications and makes it possible to calculate a distribution score across semantic groups, as well as a semantic similarity based on either all identified annotations or narrowed to one or more particular semantic groups.
Materials: In a previous work [1], 4,240 articles from the TREC-05 collection [2] were selected. The titles-and-abstracts of those 4,240 articles were annotated with Unified Medical Language System (UMLS) concepts; such annotations are referred to as our TA-dataset and correspond to the JSON files under the pubmed folder in the JSON-LD.zip file. Of those 4,240 articles, full text was available for only 62. The title-and-abstract annotations for those 62 articles, the TAFT-dataset, are located under the pubmed-pmc folder in the JSON-LD.zip file, which also contains the full-text annotations under the folder pmc, the FT-dataset. The list corresponding to articles with title-and-abstract is found in the genomics.qrels.large.pubmed.onlyRelevants.titleAndAbstract.tsv file, while those with full text are recorded in the genomics.qrels.large.pmc.onlyRelevants.fullContent.tsv file.
Methods: The TA-dataset was used to calculate the Information Gain (IG) according to the UMLS semantic groups, see IG_umls_groups.PMID.xlsx. A new grouping is proposed for Biolinks, see biolinks_groups.tsv. The IG was calculated for the Biolinks groups as well, see IG_biolinks_groups.PMID.xlsx, showing an improvement of around 5%.
Biolinks groups were used to calculate a semantic group distribution score for each article in all our datasets. A semantic similarity metric based on PubMed related articles [3] is also provided; the Biolinks groups can be used to narrow the similarity to one or more selected groups. All the corresponding algorithms are open access and available on GitHub under the Apache-2.0 license; a frozen version, biotea-io-parser-master.zip, is provided here. To facilitate the analysis of our datasets based on the annotations as well as the distribution and similarity scores, some web-based visualization components were created, all of them open access and available on GitHub under the Apache-2.0 license; frozen versions are provided here, see files biotea-vis-annotation-master.zip, biotea-vis-similarity-master.zip, biotea-vis-tooltip-master.zip and biotea-vis-topicDistribution-master.zip. These components are brought together by biotea-vis-biolinks-master.zip. A demo is provided at http://ljgarcia.github.io/biotea-biolinks/; this demo was built on top of GitHub Pages, and a frozen version of the gh-pages branch is provided here, see biotea-biolinks-gh-pages.zip.
Conclusions: Biolinks assigns a weight to each semantic group based on the annotations extracted from either title-and-abstract or full-text articles. It also measures similarity for a pair of documents using the semantic information. The distribution and similarity metrics can be narrowed to a subset of the semantic groups, enabling researchers to focus on what is more relevant to them.
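For intuition only, a minimal sketch of a group-narrowed similarity over annotated concepts. This is plain Jaccard over hypothetical UMLS concept IDs, not the similarity metric Biolinks actually uses:

```python
# Minimal sketch: similarity of two articles over their annotated concepts,
# optionally narrowed to one semantic group. Concept IDs are hypothetical;
# this is plain Jaccard, not Biolinks' metric.
def similarity(annos_a, annos_b, group=None):
    groups = [group] if group else set(annos_a) | set(annos_b)
    a = set().union(*(annos_a.get(g, set()) for g in groups))
    b = set().union(*(annos_b.get(g, set()) for g in groups))
    return len(a & b) / len(a | b) if a | b else 0.0

art1 = {"DISO": {"C0011849", "C0020538"}, "CHEM": {"C0021641"}}
art2 = {"DISO": {"C0011849"}, "CHEM": {"C0008838"}}
print(similarity(art1, art2))          # across all groups: 0.25
print(similarity(art1, art2, "DISO"))  # narrowed to disorders: 0.5
```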
[1] Garcia Castro, L.J., R. Berlanga, and A. Garcia, In the pursuit of a semantic similarity metric based on UMLS annotations for articles in PubMed Central Open Access. Journal of Biomedical Informatics, 2015. 57: p. 204-218
[2] Text Retrieval Conference 2005 - Genomics Track. TREC-05 Genomics Track ad hoc relevance judgement. 2005 [cited 2016 23rd August]; Available from: http://trec.nist.gov/data/genomics/05/genomics.qrels.large.txt
[3] Lin, J. and W.J. Wilbur, PubMed related articles: a probabilistic topic-based model for content similarity. BMC Bioinformatics, 2007. 8(1): p. 423
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Top 15 metabolic processes identified among the differentially expressed metabolic genes using DAVID Functional Annotation Tool.
Large public aquaria are complex ecosystems that require constant monitoring to detect and correct anomalies that may affect the habitat and its species. Many of those anomalies can be directly or indirectly spotted by monitoring the behavior of fish, which can be quite a laborious task for biologists to do alone. Automated fish-tracking methods, especially of the non-intrusive type, can help biologists detect such events in a timely manner. These systems require annotated fish data to be trained. We used footage collected from the main aquarium of Oceanário de Lisboa to create a novel dataset with fish annotations for the shark and ray species. The dataset has the following characteristics:
66 shark training tracks with a total of 15812 bounding boxes
88 shark testing tracks with a total of 15978 bounding boxes
133 ray training tracks with a total of 28168 bounding boxes
192 ray testing tracks with a total of 31529 bounding boxes
The training set corresponds to a calm enviro... The dataset was collected using a stationary camera positioned outside the main tank of Oceanário de Lisboa, aimed at the fish. This data was processed using the CVAT annotation tool to create the shark and ray annotations.
Sharks and rays swimming in a large public aquarium
Each set has 2 folders: gt and img1. The gt folder contains 3 txt files: gt, gt_out and labels. The gt and gt_out files contain the bounding box annotations sorted in two distinct ways. The former has the annotations sorted by frame number, while the latter is sorted by the track ID. Each line of the ground truth files represents one bounding box of a fish trajectory. The bounding boxes are represented with the following format: frame id, track id, x, y, w, h, not ignored, class id, visibility. The folder img1 contains all the annotated frames.
frame id points to the frame where the bounding box was obtained;
track id identifies the track of a fish with which the bounding box is associated;
x and y are the pixel coordinates of the top left corner of the bounding box;
w and h are the width and height of the bounding box respectively. These variables are measured in terms of pixels o...
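A minimal sketch of parsing the gt.txt format described above (frame id, track id, x, y, w, h, not ignored, class id, visibility); the path is hypothetical:

```python
# Minimal sketch: load the ground-truth bounding boxes from a gt.txt file
# in the format described above. Path is an assumption.
import csv

def load_gt(path):
    boxes = []
    with open(path, newline="") as fh:
        for row in csv.reader(fh):
            frame, track, x, y, w, h, keep, cls, vis = row
            boxes.append({
                "frame": int(frame), "track": int(track),
                "x": float(x), "y": float(y), "w": float(w), "h": float(h),
                "ignored": int(keep) == 0,  # "not ignored" flag inverted
                "class": int(cls), "visibility": float(vis),
            })
    return boxes

boxes = load_gt("gt/gt.txt")
print(len(boxes), "bounding boxes loaded")
```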
Here, we provide an improved version of the PN40024 genome assembly, called PN40024.v4, which combines the top-quality Sanger contigs from the 12X version with Pacific Biosciences long reads (Sequel SMRT). Along with this new assembly, we also provide a new version of the gene annotation, called PN40024.v4.1 based on a newly developed annotation workflow, RNA-Seq datasets and manual curation of a set of genes of functional interest to the community.