92 datasets found
  1. Data from: Analyzing Dataset Annotation Quality Management in the Wild

    • tudatalib.ulb.tu-darmstadt.de
    Updated Sep 7, 2023
    + more versions
    Cite
    Klie, Jan-Christoph; Eckart de Castilho, Richard; Gurevych, Iryna (2023). Analyzing Dataset Annotation Quality Management in the Wild [Dataset]. http://doi.org/10.48328/tudatalib-1220
    Dataset updated
    Sep 7, 2023
    Authors
    Klie, Jan-Christoph; Eckart de Castilho, Richard; Gurevych, Iryna
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    This is the accompanying data for the paper "Analyzing Dataset Annotation Quality Management in the Wild". Data quality is crucial for training accurate, unbiased, and trustworthy machine learning models and for their correct evaluation. Recent works, however, have shown that even popular datasets used to train and evaluate state-of-the-art models contain a non-negligible amount of erroneous annotations, bias, or annotation artifacts. Best practices and guidelines for annotation projects exist, but to the best of our knowledge, no large-scale analysis has yet been performed on how quality management is actually conducted when creating natural language datasets and whether these recommendations are followed. Therefore, we first survey and summarize recommended quality management practices for dataset creation as described in the literature and provide suggestions on how to apply them. Then, we compile a corpus of 591 scientific publications introducing text datasets and annotate it for quality-related aspects, such as annotator management, agreement, adjudication, or data validation. Using these annotations, we then analyze how quality management is conducted in practice. We find that a majority of the annotated publications apply good or very good quality management. However, we deem the effort of 30% of the works as only subpar. Our analysis also shows common errors, especially in the use of inter-annotator agreement and the computation of annotation error rates.

  2. A Corpus of Online Drug Usage Guideline Documents Annotated with Type of...

    • live.european-language-grid.eu
    • data.niaid.nih.gov
    tsv
    Updated Sep 8, 2022
    Cite
    (2022). A Corpus of Online Drug Usage Guideline Documents Annotated with Type of Advice [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/7399
    Available download formats
    tsv
    Dataset updated
    Sep 8, 2022
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Introduction: The goal of this dataset is to aid NLP research on recognizing safety-critical information in drug usage guideline (DUG) or patient handout data. The dataset contains annotated advice statements from 90 online DUG documents corresponding to 90 drugs or medications used in the prescriptions of patients suffering from one or more chronic diseases. The advice statements are annotated with eight safety-critical categories: activity or lifestyle related, disease or symptom related, drug administration related, exercise related, food or beverage related, other drug related, pregnancy related, and temporal.

    Data Collection: The data was collected from MedScape, one of the most widely used references for health care providers. First, 34 real anonymized prescriptions of patients suffering from one or more chronic diseases were collected. These prescriptions contain 165 drugs used to treat chronic diseases. MedScape was then crawled to collect the drug usage guideline (DUG) / patient handout for each of these 165 drugs. However, MedScape does not provide a DUG document for every drug; DUG documents were found for 90 of the 165 drugs.

    Data Annotation Tool: A data annotation tool was developed to ease the annotation process. It allows the user to select a DUG document and a position within it by line number. It stores the annotator's log and loads the most recent position from that log when the application is launched. It supports annotating multiple files for the same drug, as there are often multiple overlapping sources of drug usage guidelines for a single drug. DUG documents often contain formatted text, and the tool supports annotating such formatted text as well. The annotation tool is also available upon request.

    Annotated Data Description: The annotated data contains the annotation tag(s) of each advice statement extracted from the 90 online DUG documents. It also contains the phrases or topics in the advice statement that trigger the annotation tag, such as activity, exercise, medication name, food or beverage name, disease name, and pregnancy condition (gestational, postpartum). Sometimes disease names are not mentioned directly but rather as a condition (e.g., stomach bleeding, alcohol abuse) or the state of a parameter (e.g., low blood sugar, low blood pressure). Each annotated record is formatted as follows: drug name, drug number, line number of the first sentence of the advice in the DUG document, advice text, advice tag(s), and the medication, food, activity, exercise, and disease names mentioned in the advice.
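
    As an illustration only, the following sketch reads records laid out as described above; the file name advice_annotations.tsv, the tab separator, and comma-separated advice tags are assumptions based on the tsv download format and the field order listed here, not part of the release.

    # Minimal sketch (assumptions noted above): read one advice record per line
    # and split the advice tag(s) into the eight safety-critical categories.
    import csv

    FIELDS = ["drug_name", "drug_number", "line_number", "advice_text", "advice_tags",
              "medication", "food", "activity", "exercise", "disease"]

    with open("advice_annotations.tsv", newline="", encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            record = dict(zip(FIELDS, row))
            tags = [t.strip() for t in record["advice_tags"].split(",") if t.strip()]
            print(record["drug_name"], record["line_number"], tags)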


    Unannotated Data Description:
    The unannotated data contains the raw DUG documents for the 90 drugs. It also contains drug interaction information for the 165 drugs, categorized into four classes: contraindicated, serious, monitor closely, and minor. This information can be used to automatically detect potential interactions, and the effects of interactions, among multiple drugs.

    Citation: If you use this dataset in your work, please cite the following reference in any publication:

    @inproceedings{preum2018DUG,
      title={A Corpus of Drug Usage Guidelines Annotated with Type of Advice},
      author={Sarah Masud Preum and Md. Rizwan Parvez and Kai-Wei Chang and John A. Stankovic},
      booktitle={Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)},
      publisher={European Language Resources Association (ELRA)},
      year={2018}
    }

  3. Data from: TRANSMAT tables data

    • dataverse.cirad.fr
    application/x-gzip +1
    Updated Feb 15, 2023
    Cite
    Martin Lentschat; Patrice Buche; Luc Menut; Romane Guari (2023). TRANSMAT tables data [Dataset]. http://doi.org/10.18167/DVN1/GCZBC9
    Available download formats
    pdf (772042), application/x-gzip (43365), application/x-gzip (464618), application/x-gzip (31746)
    Dataset updated
    Feb 15, 2023
    Authors
    Martin Lentschat; Patrice Buche; Luc Menut; Romane Guari
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    This dataset presents data obtained from manual and automatic extraction of partial n-ary relations encountered in table form. The tables were obtained from documents available on the Science Direct website. The data are related to permeability n-ary relations, as defined in the TRANSMAT Ontology (https://ico.iate.inra.fr/atWeb/, https://doi.org/10.15454/NK24ID, http://agroportal.lirmm.fr/ontologies/TRANSMAT). The manual annotation was performed following the annotation guide available within this dataset. The tables of ten documents were manually annotated by one annotator and then cross-curated by three annotators. Each folder, one for the manual and one for the automatic annotation, contains one HTML file per annotated table and one .csv file containing all the data. Each line of the .csv file represents an instance of a partial n-ary relation. The information available for each annotation is: Doc (the original document), Relation (the relation concept covering the annotated items), Result_Argument (the argument instance that defines the relation), Arguments (a list of all argument instances belonging to the relation instance), Table (the table name in which the partial relation originated), Caption (the table caption), Segment (the name of the section to which the table belongs), Document (the article title) and DOI (the document DOI).
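
    A minimal sketch of reading the consolidated .csv described above; the relative path manual/annotations.csv and the presence of a header row with the listed column names are assumptions made purely for illustration.

    # Count partial n-ary relation instances per relation concept (sketch).
    import csv
    from collections import Counter

    relation_counts = Counter()
    with open("manual/annotations.csv", newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):          # one row per partial relation instance
            relation_counts[row["Relation"]] += 1

    for relation, n in relation_counts.most_common():
        print(relation, n)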

  4. Annotation Curricula to Implicitly Train Non-Expert Annotators - Dataset -...

    • b2find.dkrz.de
    Updated Aug 29, 2023
    + more versions
    Cite
    (2023). Annotation Curricula to Implicitly Train Non-Expert Annotators - Dataset - B2FIND [Dataset]. https://b2find.dkrz.de/dataset/a5f6640f-4c4c-59be-b3e9-a53b79b57c97
    Dataset updated
    Aug 29, 2023
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Annotation studies often require annotators to familiarize themselves with the task, its annotation scheme, and the data domain. This can be overwhelming at the beginning, mentally taxing, and can induce errors in the resulting annotations, especially in citizen science or crowdsourcing scenarios where domain expertise is not required and only annotation guidelines are provided. To alleviate these issues, we propose annotation curricula, a novel approach to implicitly train annotators. We gradually introduce annotators to the task by ordering the instances to be annotated according to a learning curriculum. To do so, we first formalize annotation curricula for sentence- and paragraph-level annotation tasks, define an ordering strategy, and identify well-performing heuristics and interactively trained models on three existing English datasets. We then conduct a user study with 40 voluntary participants who are asked to identify the most fitting misconception for English tweets about the Covid-19 pandemic. Our results show that using a simple heuristic to order instances can already significantly reduce the total annotation time while preserving a high annotation quality. Annotation curricula thus can provide a novel way to improve data collection. To facilitate future research, we further share our code and data, consisting of 2,400 annotations.

  5. Data from: KPWr annotation guidelines - events

    • live.european-language-grid.eu
    binary format
    Updated Apr 24, 2016
    + more versions
    Cite
    (2016). KPWr annotation guidelines - events [Dataset]. https://live.european-language-grid.eu/catalogue/ld/18724
    Available download formats
    binary format
    Dataset updated
    Apr 24, 2016
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Annotation guidelines for events, describing the process of manual annotation of documents in the Polish Corpus of Wrocław University of Technology (KPWr).

  6. Number of concepts associated with each semantic group and extracted from...

    • plos.figshare.com
    xls
    Updated Jun 5, 2023
    Cite
    Jose Antonio Miñarro-Giménez; Catalina Martínez-Costa; Daniel Karlsson; Stefan Schulz; Kirstine Rosenbeck Gøeg (2023). Number of concepts associated with each semantic group and extracted from SNOMED CT to create the terminology setting. [Dataset]. http://doi.org/10.1371/journal.pone.0209547.t001
    Available download formats
    xls
    Dataset updated
    Jun 5, 2023
    Dataset provided by
    PLOS, http://plos.org/
    Authors
    Jose Antonio Miñarro-Giménez; Catalina Martínez-Costa; Daniel Karlsson; Stefan Schulz; Kirstine Rosenbeck Gøeg
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Number of concepts associated with each semantic group and extracted from SNOMED CT to create the terminology setting.

  7. Data from: KPWr annotation guidelines - keywords (1.0)

    • live.european-language-grid.eu
    binary format
    Updated Jul 26, 2018
    + more versions
    Cite
    (2018). KPWr annotation guidelines - keywords (1.0) [Dataset]. https://live.european-language-grid.eu/catalogue/ld/18787
    Available download formats
    binary format
    Dataset updated
    Jul 26, 2018
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Annotation guidelines (first version) for keywords in KPWr, the Polish Corpus of Wrocław University of Technology (https://clarin-pl.eu/dspace/handle/11321/270).

  8. Definition of concept coverage scores for ASSESS CT manual annotation.

    • plos.figshare.com
    xls
    Updated Jun 2, 2023
    Cite
    Jose Antonio Miñarro-Giménez; Catalina Martínez-Costa; Daniel Karlsson; Stefan Schulz; Kirstine Rosenbeck Gøeg (2023). Definition of concept coverage scores for ASSESS CT manual annotation. [Dataset]. http://doi.org/10.1371/journal.pone.0209547.t002
    Available download formats
    xls
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Jose Antonio Miñarro-Giménez; Catalina Martínez-Costa; Daniel Karlsson; Stefan Schulz; Kirstine Rosenbeck Gøeg
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Definition of concept coverage scores for ASSESS CT manual annotation.

  9. Statistics of text (word) regions and figures with combination of...

    • plos.figshare.com
    xls
    Updated Jun 3, 2023
    + more versions
    Cite
    Xu-Cheng Yin; Chun Yang; Wei-Yi Pei; Haixia Man; Jun Zhang; Erik Learned-Miller; Hong Yu (2023). Statistics of text (word) regions and figures with combination of categories. [Dataset]. http://doi.org/10.1371/journal.pone.0126200.t003
    Available download formats
    xls
    Dataset updated
    Jun 3, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Xu-Cheng Yin; Chun Yang; Wei-Yi Pei; Haixia Man; Jun Zhang; Erik Learned-Miller; Hong Yu
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Statistics of text (word) regions and figures with combination of categories.

  10. OntoNotes Release 5.0

    • search.dataone.org
    • borealisdata.ca
    • +1 more
    Updated Dec 28, 2023
    + more versions
    Cite
    Weischedel, Ralph; Palmer, Martha; Marcus, Mitchell; Hovy, Eduard; Pradhan, Sameer; Ramshaw, Lance; Xue, Nianwen; Taylor, Ann; Kaufman, Jeff; Franchini, Michelle; El-Bachouti, Mohammed; Belvin, Robert; Houston, Ann (2023). OntoNotes Release 5.0 [Dataset]. https://search.dataone.org/view/sha256%3Ac07aab0589f14607d562d25e27bc8ab3f91209a9f516c051b03433364380d00e
    Dataset updated
    Dec 28, 2023
    Dataset provided by
    Borealis
    Authors
    Weischedel, Ralph; Palmer, Martha; Marcus, Mitchell; Hovy, Eduard; Pradhan, Sameer; Ramshaw, Lance; Xue, Nianwen; Taylor, Ann; Kaufman, Jeff; Franchini, Michelle; El-Bachouti, Mohammed; Belvin, Robert; Houston, Ann
    Description

    Introduction: OntoNotes Release 5.0 is the final release of the OntoNotes project, a collaborative effort between BBN Technologies, the University of Colorado, the University of Pennsylvania and the University of Southern California's Information Sciences Institute. The goal of the project was to annotate a large corpus comprising various genres of text (news, conversational telephone speech, weblogs, usenet newsgroups, broadcast, talk shows) in three languages (English, Chinese, and Arabic) with structural information (syntax and predicate argument structure) and shallow semantics (word sense linked to an ontology and coreference). OntoNotes Release 5.0 contains the content of earlier releases -- OntoNotes Release 1.0 LDC2007T21, OntoNotes Release 2.0 LDC2008T04, OntoNotes Release 3.0 LDC2009T24 and OntoNotes Release 4.0 LDC2011T03 -- and adds source data from, and/or additional annotations for, newswire (News), broadcast news (BN), broadcast conversation (BC), telephone conversation (Tele) and web data (Web) in English and Chinese and newswire data in Arabic. Also contained is English pivot text (Old Testament and New Testament text). This cumulative publication consists of 2.9 million words, with counts shown in the table below.

             Arabic    English    Chinese
    News     300k      625k       250k
    BN       n/a       200k       250k
    BC       n/a       200k       150k
    Web      n/a       300k       150k
    Tele     n/a       120k       100k
    Pivot    n/a       n/a        300k

    The OntoNotes project built on two time-tested resources, following the Penn Treebank for syntax and the Penn PropBank for predicate-argument structure. Its semantic representation includes word sense disambiguation for nouns and verbs, with some word senses connected to an ontology, and coreference.

    Data: Documents describing the annotation guidelines and the routines for deriving various views of the data from the database are included in the documentation directory of this release. The annotation is provided both in separate text files for each annotation layer (Treebank, PropBank, word sense, etc.) and in the form of an integrated relational database (ontonotes-v5.0.sql.gz) with a Python API to provide convenient cross-layer access. It is a known issue that this release contains some non-validating XML files. The included tools, however, use a non-validating XML parser to parse the .xml files and load the appropriate values.

    Tools: This release includes OntoNotes DB Tool v0.999 beta, the tool used to assemble the database from the original annotation files. It can be found in the directory tools/ontonotes-db-tool-v0.999b. This tool can be used to derive various views of the data from the database, and it provides an API that can implement new queries or views. Licensing information for the OntoNotes DB Tool package is included in its source directory.

    Samples: Sample files are available for Chinese, Arabic and English.

    Updates: Additional documentation was added on December 11, 2014 and is included in downloads after that date.

    Acknowledgment: This work is supported in part by the Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-003. The content of this publication does not necessarily reflect the position or policy of the Government, and no official endorsement should be inferred.

    Copyright: Portions © 2006 Abu Dhabi TV, © 2006 Agence France Presse, © 2006 Al-Ahram, © 2006 Al Alam News Channel, © 2006 Al Arabiya, © 2006 Al Hayat, © 2006 Al Iraqiyah, © 2006 Al Quds-Al Arabi, © 2006 Anhui TV, © 2002, 2006 An Nahar, © 2006 Asharq-al-Awsat, © 2010 Bible League International, © 2005 Cable News Network, LP, LLLP, © 2000-2001 China Broadcasting System, © 2000-2001, 2005-2006 China Central TV, © 2006 China Military Online, © 2000-2001 China National Radio, © 2006 Chinanews.com, © 2000-2001 China Television System, © 1989 Dow Jones & Company, Inc., © 2006 Dubai TV, © 2006 Guangming Daily, © 2006 Kuwait TV, © 2005-2006 National Broadcasting Company, Inc., © 2006 New Tang Dynasty TV, © 2006 Nile TV, © 2006 Oman TV, © 2006 PAC Ltd, © 2006 Peoples Daily Online, © 2005-2006 Phoenix TV, © 2000-2001 Sinorama Magazine, © 2006 Syria TV, © 1996-1998, 2006 Xinhua News Agency, © 1996, 1997, 2005, 2007, 2008, 2009, 2011, 2013 Trustees of the University of Pennsylvania

  11. STEM-ECR-v1.0

    • data.uni-hannover.de
    zip
    Updated Jan 20, 2022
    + more versions
    Cite
    TIB (2022). STEM-ECR-v1.0 [Dataset]. https://data.uni-hannover.de/dataset/stem-ecr-v1-0
    Available download formats
    zip (10224852), zip (1469015)
    Dataset updated
    Jan 20, 2022
    Dataset authored and provided by
    TIB
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0), https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    Grounding Scientific Entity References in STEM Scholarly Content to Authoritative Encyclopedic and Lexicographic Sources

    The STEM ECR v1.0 dataset has been developed to provide a benchmark for the evaluation of scientific entity extraction, classification, and resolution tasks in a domain-independent fashion. It comprises annotations for scientific entities in scientific abstracts drawn from 10 disciplines in Science, Technology, Engineering, and Medicine. The annotated entities are further grounded in Wikipedia and Wiktionary.

    What this repository contains

    The dataset is organized in the following folders:

    • Scientific Entity Annotations: Contains annotations for Process, Material, Method, and Data scientific entities in the STEM dataset.
    • Scientific Entity Resolution: Annotations for the STEM dataset scientific entities with Entity Linking (EL) annotations to Wikipedia and Word Sense Disambiguation (WSD) annotations to Wiktionary.

    Annotation Guidelines

    The annotation guidelines that supported the creation of this corpus can be found here.

    Supporting Publication

    D'Souza, J., Hoppe, A., Brack, A., Jaradeh, M., Auer, S., & Ewerth, R. (2020). The STEM-ECR Dataset: Grounding Scientific Entity References in STEM Scholarly Content to Authoritative Encyclopedic and Lexicographic Sources. In Proceedings of The 12th Language Resources and Evaluation Conference (pp. 2192–2203). European Language Resources Association.


  12. Data from: Sentiment Annotated Dataset of Croatian News

    • clarin.si
    • live.european-language-grid.eu
    Updated Sep 15, 2020
    + more versions
    Cite
    Andraž Pelicon; Marko Pranjić; Dragana Miljković; Blaž Škrlj; Senja Pollak (2020). Sentiment Annotated Dataset of Croatian News [Dataset]. https://www.clarin.si/repository/xmlui/handle/11356/1342
    Dataset updated
    Sep 15, 2020
    Authors
    Andraž Pelicon; Marko Pranjić; Dragana Miljković; Blaž Škrlj; Senja Pollak
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0), https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    We present a collection of sentiment annotations for news articles (article links) in the Croatian language. A set of 2025 news articles was gathered from 24sata, one of the leading media companies in Croatia with the highest circulation. Six annotators annotated the articles on the document level using a five-level Likert scale (1—very negative, 2—negative, 3—neutral, 4—positive, and 5—very positive). The final sentiment of an instance was defined as the average of the sentiment scores given by the different annotators. An instance was labeled as negative if the average of the given scores was less than or equal to 2.4; neutral, if the average was between 2.4 and 3.6; or positive, if the average was greater than or equal to 3.6. The annotation guidelines correspond to the Slovenian sentiment-annotated collection of news SentiNews 1.0 (http://hdl.handle.net/11356/1110).
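
    The labelling rule above can be stated compactly; the sketch below mirrors the thresholds as described (averages of exactly 2.4 and 3.6 fall into the negative and positive classes, respectively).

    # Map the average of the five-level Likert scores to a document label (sketch).
    def sentiment_label(scores):
        avg = sum(scores) / len(scores)
        if avg <= 2.4:
            return "negative"
        elif avg < 3.6:
            return "neutral"
        return "positive"

    print(sentiment_label([2, 3, 3, 4, 3, 3]))  # average 3.0 -> "neutral"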

    If you use the dataset, please cite the following paper, which also contains the details of the dataset creation and of the monolingual and cross-lingual sentiment classification experiments: Pelicon, A.; Pranjić, M.; Miljković, D.; Škrlj, B.; Pollak, S. Zero-Shot Learning for Cross-Lingual News Sentiment Classification. Appl. Sci. 2020, 10, 5993. https://doi.org/10.3390/app10175993

  13. Agrilus planipennis community manual annotations

    • catalog.data.gov
    • agdatacommons.nal.usda.gov
    Updated Dec 5, 2024
    + more versions
    Cite
    Agricultural Research Service (2024). Agrilus planipennis community manual annotations [Dataset]. https://catalog.data.gov/dataset/agrilus-planipennis-community-manual-annotations
    Dataset updated
    Dec 5, 2024
    Dataset provided by
    Agricultural Research Service
    Description

    Manual annotation at the i5k Workspace@NAL (https://i5k.nal.usda.gov) is the review and improvement of gene models derived from computational gene prediction. Community curators compare an existing gene model to evidence such as RNA-Seq or protein alignments from the same or closely related species and modify the structure or function of the gene accordingly, typically following the i5k Workspace@NAL manual annotation guidelines (https://i5k.nal.usda.gov/content/rules-web-apollo-annotation-i5k-pilot-project). If a gene model is missing, the annotator can also use this evidence to create a new gene model. Because manual annotation, by definition, improves or creates gene models where computational methods have failed, it can be a powerful tool to improve computational gene sets, which often serve as foundational datasets to facilitate research on a species.

    Here, community curators used manual annotation at the i5k Workspace@NAL to improve computational gene predictions from the dataset Agrilus planipennis genome annotations v0.5.3. The i5k Workspace@NAL set up the Apollo v1 manual annotation software and multiple evidence tracks to facilitate manual annotation. From 2014-10-20 to 2018-07-12, five community curators updated 263 genes, including developmental genes; cytochrome P450s; cathepsin peptidases; cuticle proteins; glycoside hydrolases; and polysaccharide lyases. For this dataset, we used the program LiftOff v1.6.3 to map the manual annotations to the genome assembly GCF_000699045.2. We computed overlaps with annotations from the RefSeq database using gff3_merge from the GFF3toolkit software v2.1.0. FASTA sequences were generated using gff3_to_fasta from the same toolkit. These improvements should facilitate continued research on Agrilus planipennis, or emerald ash borer (EAB), which is an invasive insect pest.

    While these manual annotations will not be integrated with other computational gene sets, they are available to view at the i5k Workspace@NAL (https://i5k.nal.usda.gov) to enhance future research on Agrilus planipennis.

  14. Morpho-syntactically annotated corpora provided for the PARSEME Shared Task...

    • b2find.dkrz.de
    Updated Oct 19, 2023
    + more versions
    Cite
    (2023). Morpho-syntactically annotated corpora provided for the PARSEME Shared Task on Semi-Supervised Identification of Verbal Multiword Expressions (edition 1.2) - Dataset - B2FIND [Dataset]. https://b2find.dkrz.de/dataset/4344cffe-e784-5a08-913c-3d7499114838
    Dataset updated
    Oct 19, 2023
    Description

    This multilingual resource contains corpora for 14 languages, gathered on the occasion of edition 1.2 of the PARSEME Shared Task on semi-supervised Identification of Verbal MWEs (2020). These corpora were meant to serve as additional "raw" corpora to help discover unseen verbal MWEs. The corpora are provided in the CONLL-U format (https://universaldependencies.org/format.html). They contain morphosyntactic annotations (parts of speech, lemmas, morphological features, and syntactic dependencies). Depending on the language, the information comes from treebanks (mostly Universal Dependencies v2.x) or from automatic parsers trained on UD v2.x treebanks (e.g., UDPipe).

    VMWEs include idioms (let the cat out of the bag), light-verb constructions (make a decision), verb-particle constructions (give up), inherently reflexive verbs (help oneself), and multi-verb constructions (make do). For the 1.2 shared task edition, the data covers 14 languages, for which VMWEs were annotated according to the universal guidelines. The annotated corpora are provided in the cupt format, inspired by the CONLL-U format. Morphological and syntactic information – not necessarily using UD tagsets – including parts of speech, lemmas, morphological features and/or syntactic dependencies is also provided. Depending on the language, the information comes from treebanks (e.g., Universal Dependencies) or from automatic parsers trained on treebanks (e.g., UDPipe). This item contains training, development and test data, as well as the evaluation tools used in the PARSEME Shared Task 1.2 (2020). The annotation guidelines are available online: http://parsemefr.lif.univ-mrs.fr/parseme-st-guidelines/1.2
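
    Purely as an illustration of the CONLL-U layout referenced above (not a tool shipped with these corpora), a sketch that counts sentences and token rows in one file; the file name is a placeholder.

    # CoNLL-U: '#' lines are comments, token rows are tab-separated,
    # and a blank line closes a sentence (sketch).
    def conllu_stats(path):
        sentences, tokens, in_sentence = 0, 0, False
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.rstrip("\n")
                if not line:
                    sentences += in_sentence
                    in_sentence = False
                elif not line.startswith("#"):
                    tokens += 1          # counts every token row, including range lines
                    in_sentence = True
        return sentences + in_sentence, tokens

    print(conllu_stats("example.conllu"))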

  15. Manual annotation guidelines and examples.

    • plos.figshare.com
    xls
    Updated Mar 19, 2025
    Cite
    Zasim Azhar Siddiqui; Maryam Pathan; Sabina Nduaguba; Traci LeMasters; Virginia G. Scott; Usha Sambamoorthi; Jay S. Patel (2025). Manual annotation guidelines and examples. [Dataset]. http://doi.org/10.1371/journal.pdig.0000765.t001
    Available download formats
    xls
    Dataset updated
    Mar 19, 2025
    Dataset provided by
    PLOS Digital Health
    Authors
    Zasim Azhar Siddiqui; Maryam Pathan; Sabina Nduaguba; Traci LeMasters; Virginia G. Scott; Usha Sambamoorthi; Jay S. Patel
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background: The use of social media platforms in health research is increasing, yet their application in studying rare diseases is limited. Hodgkin’s lymphoma (HL) is a rare malignancy with a high incidence in young adults. This study evaluates the feasibility of using social media data to study the disease and treatment characteristics of HL.

    Methods: We utilized the X (formerly Twitter) API v2 developer portal to download posts (formerly tweets) from January 2010 to October 2022. Annotation guidelines were developed from the literature, and a manual review of a limited set of posts was performed to identify the classes and attributes (characteristics) of HL discussed on X and to create a gold standard dataset. This dataset was subsequently employed to train, test, and validate a Named Entity Recognition (NER) Natural Language Processing (NLP) application.

    Results: After data preparation, 80,811 posts were collected: 500 for annotation guideline development, 2,000 for NLP application development, and the remaining 78,311 for deploying the application. We identified nine classes related to HL, such as HL classification, etiopathology, stages and progression, and treatment. The treatment class and HL stages and progression were the most frequently discussed, with 20,013 (25.56%) posts mentioning HL’s treatments and 17,177 (21.93%) mentioning HL stages and progression. The model exhibited robust performance, achieving 86% accuracy and an 87% F1 score. The etiopathology class demonstrated excellent performance, with 93% accuracy and a 95% F1 score.

    Discussion: The NLP application displayed high efficacy in extracting and characterizing HL-related information from social media posts, as evidenced by the high F1 score. Nonetheless, the data presented limitations in distinguishing between patients, providers, and caregivers and in establishing the temporal relationships between classes and attributes. Further research is necessary to bridge these gaps.

    Conclusion: Our study demonstrated the potential of using social media as a valuable preliminary research source for understanding the characteristics of rare diseases such as Hodgkin’s lymphoma.

  16. Data from: NLM-Gene, a richly annotated gold standard dataset for gene...

    • data.niaid.nih.gov
    • zenodo.org
    • +1 more
    zip
    Updated Jul 9, 2021
    Cite
    Rezarta Islamaj; Zhiyong Lu (2021). NLM-Gene, a richly annotated gold standard dataset for gene entities that addresses ambiguity and multi-species gene recognition [Dataset]. http://doi.org/10.5061/dryad.dv41ns1wt
    Available download formats
    zip
    Dataset updated
    Jul 9, 2021
    Dataset provided by
    United States National Library of Medicine
    Authors
    Rezarta Islamaj; Zhiyong Lu
    License

    CC0 1.0, https://spdx.org/licenses/CC0-1.0.html

    Description

    The automatic recognition of gene names and their corresponding database identifiers in biomedical text is an important first step for many downstream text-mining applications. The NLM-Gene corpus is a high-quality, manually annotated corpus for genes, covering ambiguous gene names, with an average of 29 gene mentions (10 unique identifiers) per article and a broader representation of different species (including Homo sapiens, Mus musculus, Rattus norvegicus, Drosophila melanogaster, Arabidopsis thaliana, Danio rerio, etc.) when compared to previous gene annotation corpora. NLM-Gene consists of 550 PubMed articles from 156 biomedical journals, doubly annotated by six experienced NLM indexers, randomly paired for each article to control for bias. The annotators worked in three annotation rounds until they reached complete agreement. Using the new resource, we developed a new gene-finding algorithm based on deep learning which improved both precision and recall over existing tools. The NLM-Gene annotated corpus is freely available at Dryad and at https://www.ncbi.nlm.nih.gov/research/bionlp/. The gene-finding results of applying this tool to the entire PubMed/PMC are freely accessible through our web-based tool PubTator.

    Methods (Data Selection): Our goal was to identify articles where manual curation is useful for tool improvement, otherwise known as difficult articles, where existing automated tools do not produce accurate results. These articles have the following characteristics: they contain more gene mentions than average, they mention genes from a variety of organisms (often more than one organism per article), they contain ambiguous gene mentions, and they discuss genes in relation to other biomedical topics such as diseases, chemicals, and mutations.

    Data has been doubly annotated in three rounds until annotators achieved 100% agreement.

    Annotation load was distributed so that all annotators annotated a similar number of documents, and a similar number of entities. Annotators did not know the identity of their partners until the very end. All pairings were made at the document level, so each annotator was paired with every other annotator. There were six annotators who were attached to the project from the beginning to end.

    Inter-annotator agreement (IAA) was measured for Gene ID annotations, since annotators had almost perfect agreement for mention recognition.

    IAA was 74% for the first round of annotations, 86% for the second round, and 100% after collaborative discussions.

    NLM-Gene is available in BioC XML and has been partitioned into training and testing sets. The training set consists of 450 articles, and the testing set consists of 100 articles.
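
    A small sketch of reading the BioC XML with the Python standard library; the file name NLM-Gene-train.xml is a placeholder, and the exact infon keys used for gene identifiers are not shown here.

    # Count annotations per document in a BioC XML collection (sketch).
    import xml.etree.ElementTree as ET

    root = ET.parse("NLM-Gene-train.xml").getroot()
    for doc in root.iter("document"):
        doc_id = doc.findtext("id")
        mentions = sum(1 for _ in doc.iter("annotation"))
        print(doc_id, mentions, "annotated mentions")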

    For annotation details, please refer to the annotation guidelines. For methodology, gene recognition results, and corpus characteristics and further details, please refer to the manuscript.

    We believe this resource can be of significant value to researchers in both life sciences and informatics communities. Specifically, people involved in data curation, and biomedical tool development will find the availability of this corpus very useful.

    The corpus can be used in combination with the GNorm+ corpus and the BioCreative gene-annotated corpora to create a richer dataset. NLM-Gene, being richer in the number of species and more complex in terms of bio-entities, should provide an invaluable resource for testing hard-to-predict cases and for building algorithms that can address harder named entity recognition issues.

  17. Example Data for "Tutorial: annotation and interpretation of mammalian...

    • zenodo.org
    application/gzip
    Updated Mar 3, 2025
    Cite
    Dustin Sokolowski; Zoe Clarke (2025). Example Data for "Tutorial: annotation and interpretation of mammalian genomes" [Dataset]. http://doi.org/10.5281/zenodo.14962941
    Available download formats
    application/gzip
    Dataset updated
    Mar 3, 2025
    Dataset provided by
    Zenodo, http://zenodo.org/
    Authors
    Dustin Sokolowski; Zoe Clarke
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Mar 3, 2025
    Description

    This dataset is used to perform end-to-end genome annotation with the Genome Annotation Tutorial (https://github.com/BaderLab/GenomeAnnotationTutorial) in a manner that is less resource- and time-intensive than annotating an entire genome assembly. End-to-end, this pipeline should take ~48 hours to run without any parallelization of jobs. With parallelization (i.e., using a Snakemake pipeline), the workflow on this example data should finish in under 24 hours.

    Abstract from: Tutorial: annotation and interpretation of mammalian genomes.

    As DNA sequencing technologies improve, it is becoming easier to sequence and assemble the genomes of non-model organisms. However, before a genome sequence can be used as a reference it must be annotated with genes and other features, which can now be performed by individual labs using public software. Modern genome annotations integrate gene predictions from the assembled DNA sequence with gene homology information from a high-quality reference and functional evidence (e.g. protein sequences, RNA sequencing). Many genome annotation pipelines exist that vary in accuracy and how resource-intensive and user-friendly they are. This tutorial covers a streamlined genome annotation pipeline that can create high-quality mammalian genome annotations in-lab. Our pipeline integrates existing state-of-the-art genome-annotation tools capable of annotating protein-coding and non-coding RNA genes. This tutorial also guides the user on assigning gene symbols and annotating repeat regions. We lastly describe additional tools to assess annotation quality and combine and format the results.

    This dataset contains a small chromosome from a recent naked mole-rat assembly, short-read RNA-seq data from three tissues, and ISO-seq data from two tissues. This example dataset was generated to allow users to complete the five major steps of genome annotation: (1) identifying repetitive elements and masking repeats that can interfere with gene finding, (2) identifying protein-coding/messenger RNA (mRNA) gene models, (3) optimizing gene models using multiple lines of evidence, (4) adding non-coding RNA (ncRNA) gene models, and (5) labelling gene models with the likely gene identity (i.e. gene symbol).

    The naked mole-rat assembly and ISO-seq data are derived from:

    Sokolowski, D. J., Miclăuș, M., Nater, A., Faykoo-Martinez, M., Hoekzema, K., Zuzarte, P., ... & Wilson, M. D. (2024). An updated reference genome sequence and annotation reveals gene losses and gains underlying naked mole-rat biology. bioRxiv, 2024-11.

    The short read RNA-seq data are derived from:

    Bens, M., Szafranski, K., Holtze, S., Sahm, A., Groth, M., Kestler, H. A., ... & Platzer, M. (2018). Naked mole-rat transcriptome signatures of socially suppressed sexual maturation and links of reproduction to aging. BMC biology, 16, 1-13.

  18. POLIcy design ANNotAtions (POLIANNA): Towards understanding policy design...

    • data.niaid.nih.gov
    Updated Dec 14, 2023
    + more versions
    Cite
    Fride Sigurdsson (2023). POLIcy design ANNotAtions (POLIANNA): Towards understanding policy design through text-as-data approaches [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7569273
    Dataset updated
    Dec 14, 2023
    Dataset provided by
    Fride Sigurdsson
    Sebastian Sewerin
    Onerva Martikainen
    Fabian Hafner
    Alisha Esshaki
    Joel Küttel
    Lynn H. Kaack
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The POLIANNA dataset is a collection of legislative texts from the European Union (EU) that have been annotated based on theoretical concepts of policy design. The dataset consists of 20,577 annotated spans in 412 articles, drawn from 18 EU climate change mitigation and renewable energy laws, and can be used to develop supervised machine learning approaches for scaling policy analysis. The dataset includes a novel coding scheme for annotating text spans; a description of the annotated corpus, an analysis of inter-annotator agreement, and a discussion of potential applications can be found in the paper accompanying this dataset. The objective of this dataset is to build tools that assist with manual coding of policy texts by automatically identifying relevant paragraphs.

    Detailed instructions and further guidance about the dataset as well as all the code used for this project can be found in the accompanying paper and on the GitHub project page. The repository also contains useful code to calculate various inter-annotator agreement measures and can be used to process text annotations generated by INCEpTION.

    Dataset Description

    We provide the dataset in 3 different formats; a minimal loading sketch for the JSON layout follows the format descriptions below.

    JSON: Each article corresponds to a folder, where the Tokens and Spans are stored in separate JSON files. Each article folder further contains the raw policy text in a text file and the metadata about the policy. This is the most human-readable format.

    JSONL: Same folder structure as the JSON format, but the Spans and Tokens are stored in a JSONL file, where each line is a valid JSON document.

    Pickle: We provide the dataset as a Python object. This is the recommended method when using our own Python framework that is provided on GitHub. For more information, check out the GitHub project page.
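
    As referenced above, a minimal loading sketch for the JSON layout; the folder path and the file names spans.json and tokens.json inside an article folder are assumptions, so adjust them to the actual layout of the download.

    # Load the Tokens and Spans of a single article folder (sketch).
    import json
    from pathlib import Path

    article_dir = Path("path/to/one_article_folder")   # placeholder
    spans = json.loads((article_dir / "spans.json").read_text(encoding="utf-8"))
    tokens = json.loads((article_dir / "tokens.json").read_text(encoding="utf-8"))
    print(len(tokens), "tokens,", len(spans), "annotated spans")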

    License

    The POLIANNA dataset is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license. If you use the POLIANNA dataset in your research in any form, please cite the dataset.

    Citation

    Sewerin, S., Kaack, L.H., Küttel, J. et al. Towards understanding policy design through text-as-data approaches: The policy design annotations (POLIANNA) dataset. Sci Data 10, 896 (2023). https://doi.org/10.1038/s41597-023-02801-z

  19. NLUCat

    • zenodo.org
    • huggingface.co
    • +1 more
    zip
    Updated Mar 4, 2024
    Cite
    Zenodo (2024). NLUCat [Dataset]. http://doi.org/10.5281/zenodo.10721193
    Available download formats
    zip
    Dataset updated
    Mar 4, 2024
    Dataset provided by
    Zenodo, http://zenodo.org/
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    NLUCat

    Dataset Description

    Dataset Summary

    NLUCat is a dataset for natural language understanding (NLU) in Catalan. It consists of nearly 12,000 instructions annotated with the most relevant intents and spans. Each instruction is additionally accompanied by the instructions received by the annotator who wrote it.

    The intents taken into account are the habitual ones of a virtual home assistant (activity calendar, IOT, list management, leisure, etc.), but specific ones have also been added to take into account social and healthcare needs for vulnerable people (information on administrative procedures, menu and medication reminders, etc.).

    The spans have been annotated with a tag describing the type of information they contain. They are fine-grained, but can be easily grouped to use them in robust systems.

    The examples are not only written in Catalan; they also take into account the geographical and cultural reality of the speakers of this language (geographic points, cultural references, etc.).

    This dataset can be used to train models for intent classification, spans identification and examples generation.

    This is the complete version of the dataset. A version prepared to train and evaluate intent classifiers has been published in HuggingFace.

    In this repository you'll find the following items:

    • NLUCat_annotation_guidelines.docx: the guidelines provided to the annotation team
    • NLUCat_dataset.json: the completed NLUCat dataset
    • NLUCat_stats.tsv: statistics about the NLUCat dataset
    • dataset: folder with the dataset as published in HuggingFace, split and prepared for training and evaluating intent classifiers
    • reports: folder with the reports done as feedback to the annotators during the annotation process

    This dataset can be used for any purpose, whether academic or commercial, under the terms of the CC BY 4.0. Give appropriate credit, provide a link to the license, and indicate if changes were made.

    Supported Tasks and Leaderboards

    Intent classification, spans identification and examples generation.

    Languages

    The dataset is in Catalan (ca-ES).

    Dataset Structure

    Data Instances

    Three JSON files, one for each split.

    Data Fields

    • example: `str`. Example
    • annotation: `dict`. Annotation of the example
    • intent: `str`. Intent tag
    • slots: `list`. List of slots
    • Tag: `str`. Tag of the slot
    • Text: `str`. Text of the slot
    • Start_char: `int`. First character of the span
    • End_char: `int`. Last character of the span

    Example


    An example looks as follows:

    {
      "example": "Demana una ambulància; la meva dona està de part.",
      "annotation": {
        "intent": "call_emergency",
        "slots": [
          {
            "Tag": "service",
            "Text": "ambulància",
            "Start_char": 11,
            "End_char": 21
          },
          {
            "Tag": "situation",
            "Text": "la meva dona està de part",
            "Start_char": 23,
            "End_char": 48
          }
        ]
      }
    }
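
    A quick check of the span offsets in the example above; in this example End_char behaves as an end offset one past the last character (48 - 23 = 25, the length of the situation span), so Python slicing recovers the annotated text directly.

    # Verify that Start_char/End_char delimit the annotated slot text (sketch).
    example = "Demana una ambulància; la meva dona està de part."
    slots = [
        {"Tag": "service", "Text": "ambulància", "Start_char": 11, "End_char": 21},
        {"Tag": "situation", "Text": "la meva dona està de part", "Start_char": 23, "End_char": 48},
    ]
    for slot in slots:
        assert example[slot["Start_char"]:slot["End_char"]] == slot["Text"]
        print(slot["Tag"], "->", slot["Text"])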


    Data Splits

    • NLUCat.train: 9128 examples
    • NLUCat.dev: 1441 examples
    • NLUCat.test: 1441 examples

    Dataset Creation

    Curation Rationale

    We created this dataset to contribute to the development of language models in Catalan, a low-resource language.

    When creating this dataset, we took into account not only the language but the entire socio-cultural reality of the Catalan-speaking population. Special consideration was also given to the needs of the vulnerable population.

    Source Data

    Initial Data Collection and Normalization

    We commissioned a company to create fictitious examples for the creation of this dataset.

    Who are the source language producers?

    We commissioned the writing of the examples to the company m47 labs.

    Annotations

    Annotation process

    The elaboration of this dataset was done in three steps, taking as a model the process followed by the NLU-Evaluation-Data dataset, as explained in the paper.
    * First step: translation or elaboration of the instructions given to the annotators to write the examples.
    * Second step: writing the examples. This step also includes the grammatical correction and normalization of the texts.
    * Third step: recording the intents and the slots of each example. In this step, some modifications were made to the annotation guidelines to adjust them to the real situations.

    Who are the annotators?

    The drafting of the examples and their annotation was entrusted to the company m47 labs through a public tender process.

    Personal and Sensitive Information

    No personal or sensitive information included.

    The examples used for the preparation of this dataset are fictitious and, therefore, the information shown is not real.

    Considerations for Using the Data

    Social Impact of Dataset

    We hope that this dataset will help the development of virtual assistants in Catalan, a language that is often not taken into account, and that it will especially help to improve the quality of life of people with special needs.

    Discussion of Biases

    When writing the examples, the annotators were asked to take into account the socio-cultural reality (geographic points, artists and cultural references, etc.) of the Catalan-speaking population.
    Likewise, they were asked to be careful to avoid examples that reinforce the stereotypes that exist in this society. For example: be careful with the gender or origin of personal names that are associated with certain activities.

    Other Known Limitations

    [N/A]

    Additional Information

    Dataset Curators

    Language Technologies Unit at the Barcelona Supercomputing Center (langtech@bsc.es)

    This work has been promoted and financed by the Generalitat de Catalunya through the Aina project.

    Licensing Information

    This dataset can be used for any purpose, whether academic or commercial, under the terms of the CC BY 4.0.
    Give appropriate credit, provide a link to the license, and indicate if changes were made.

    Citation Information

    DOI

    Contributions

    The drafting of the examples and their annotation was entrusted to the company m47 labs through a public tender process.

  20. Statistics of text (word) regions with orientation attributes.

    • plos.figshare.com
    xls
    Updated Jun 1, 2023
    Cite
    Xu-Cheng Yin; Chun Yang; Wei-Yi Pei; Haixia Man; Jun Zhang; Erik Learned-Miller; Hong Yu (2023). Statistics of text (word) regions with orientation attributes. [Dataset]. http://doi.org/10.1371/journal.pone.0126200.t004
    Available download formats
    xls
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Xu-Cheng Yin; Chun Yang; Wei-Yi Pei; Haixia Man; Jun Zhang; Erik Learned-Miller; Hong Yu
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Statistics of text (word) regions with orientation attributes.
