100+ datasets found
  1. Global Data Annotation Outsourcing Market Size By Annotation Type, By...

    • verifiedmarketresearch.com
    Updated Aug 29, 2024
    Cite
    VERIFIED MARKET RESEARCH (2024). Global Data Annotation Outsourcing Market Size By Annotation Type, By Industry Vertical, By Deployment Model, By Geographic Scope And Forecast [Dataset]. https://www.verifiedmarketresearch.com/product/data-annotation-outsourcing-market/
    Explore at:
    Dataset updated
    Aug 29, 2024
    Dataset provided by
    Verified Market Research (https://www.verifiedmarketresearch.com/)
    Authors
    VERIFIED MARKET RESEARCH
    License

    https://www.verifiedmarketresearch.com/privacy-policy/

    Time period covered
    2024 - 2031
    Area covered
    Global
    Description

    Data Annotation Outsourcing Market size was valued at USD 0.8 Billion in 2023 and is projected to reach USD 3.6 Billion by 2031, growing at a CAGR of 33.2% during the forecast period 2024 to 2031.

    Global Data Annotation Outsourcing Market Drivers

    The market drivers for the Data Annotation Outsourcing Market can be influenced by various factors. These may include:

    Fast Growth in AI and Machine Learning Applications: Demand for data annotation services has risen because training AI and machine learning models requires large volumes of labeled data. By outsourcing annotation, companies can focus on their core competencies while still obtaining high-quality annotated data.

    Growing Need for High-Quality Labeled Data: The efficacy of AI models depends on precise data labeling. To obtain accurate and reliable labels, businesses are outsourcing their annotation work to specialist service providers, which is propelling market expansion.

    Global Data Annotation Outsourcing Market Restraints

    Several factors can act as restraints or challenges for the Data Annotation Outsourcing Market. These may include:

    Data Privacy and Security Issues: Guaranteeing data privacy and security can be difficult. Businesses must follow strict rules and guidelines to protect sensitive data, which can be expensive and complicated.

    Problems with Quality Control: Maintaining consistent, high-quality data annotation across numerous vendors is difficult. Inconsistent or inaccurate annotations can degrade the performance of AI and machine learning models.

  2. Guidelines for Data Annotation

    • dataverse.tdl.org
    pdf
    Updated Sep 15, 2020
    Cite
    Kate Mesh (2020). Guidelines for Data Annotation [Dataset]. http://doi.org/10.18738/T8/FWOOJQ
    Explore at:
    pdf(167426), pdf(2472574)Available download formats
    Dataset updated
    Sep 15, 2020
    Dataset provided by
    Texas Data Repository
    Authors
    Kate Mesh
    License

    CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Included here are a coding manual and supplementary examples of gesture forms (in still images and video recordings) that informed the coding of the first author (Kate Mesh) and four project reliability coders.

  3. Apriori model code, data files and annotation guide

    • datasetcatalog.nlm.nih.gov
    Updated Nov 8, 2024
    Cite
    Aslam, Kousar (2024). Apriori model code, data files and annotation guide [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001435891
    Explore at:
    Dataset updated
    Nov 8, 2024
    Authors
    Aslam, Kousar
    Description

    This study explores discussions related to whistleblowing in the software industry on Reddit.

  4. Qualitative analysis of manual annotations of clinical text with SNOMED CT

    • plos.figshare.com
    pdf
    Updated Jun 1, 2023
    Cite
    Jose Antonio Miñarro-Giménez; Catalina Martínez-Costa; Daniel Karlsson; Stefan Schulz; Kirstine Rosenbeck Gøeg (2023). Qualitative analysis of manual annotations of clinical text with SNOMED CT [Dataset]. http://doi.org/10.1371/journal.pone.0209547
    Explore at:
    pdfAvailable download formats
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Jose Antonio Miñarro-Giménez; Catalina Martínez-Costa; Daniel Karlsson; Stefan Schulz; Kirstine Rosenbeck Gøeg
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    SNOMED CT provides about 300,000 codes with fine-grained concept definitions to support interoperability of health data. Coding clinical texts with medical terminologies is not a trivial task and is prone to disagreements between coders. We conducted a qualitative analysis to identify sources of disagreement in an annotation experiment that used a subset of SNOMED CT with some restrictions. A corpus of 20 English clinical text fragments from diverse origins and languages was annotated independently by two annotators with medical domain training, following a specific annotation guideline. By following this guideline, the annotators had to assign sets of SNOMED CT codes to noun phrases, together with concept and term coverage ratings. The annotations were then manually examined against a reference standard to determine sources of disagreement. Five categories were identified. In our results, the most frequent cause of inter-annotator disagreement was related to human issues. In several cases disagreements revealed gaps in the annotation guidelines and a lack of annotator training. The remaining issues can be attributed to certain SNOMED CT features.

  5. Annotation guidelines and data for complex known-item search requests for...

    • data.niaid.nih.gov
    • ordo.open.ac.uk
    • +1more
    Updated Feb 5, 2025
    Cite
    Bogers, Toine; Gäde, Maria; Hall, Mark; Koolen, Marijn; Petras, Vivien; Skov, Mette (2025). Annotation guidelines and data for complex known-item search requests for leisure domains [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_14808370
    Explore at:
    Dataset updated
    Feb 5, 2025
    Dataset provided by
    Humboldt-Universität zu Berlin
    Aalborg University
    Royal Netherlands Academy of Arts and Sciences
    IT University of Copenhagen
    The Open University
    Authors
    Bogers, Toine; Gäde, Maria; Hall, Mark; Koolen, Marijn; Petras, Vivien; Skov, Mette
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset consists of the forum threads that have been annotated for known-item search. The post containing the search request and the post containing the correct answer have been marked, as well as the post where the original poster confirms that the correct answer has been given. The guidelines explain the annotation process.

    This dataset is used for and described in the ACM CHIIR 2025 paper entitled "Exploring the Zero-Shot Known-Item Retrieval Capabilities of LLMs for Casual Leisure Information Needs".

  6. Self-Annotated Wearable Activity Data

    • zenodo.org
    • data-staging.niaid.nih.gov
    • +1more
    zip
    Updated Sep 18, 2024
    Cite
    Alexander Hölzemann; Kristof Van Laerhoven (2024). Self-Annotated Wearable Activity Data [Dataset]. http://doi.org/10.3389/fcomp.2024.1379788
    Explore at:
    zipAvailable download formats
    Dataset updated
    Sep 18, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Alexander Hölzemann; Kristof Van Laerhoven
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Our dataset contains 2 weeks of approx. 8-9 hours of acceleration data per day from 11 participants wearing a Bangle.js Version 1 smartwatch with our firmware installed.

    The dataset contains annotations from 4 different commonly used annotation methods utilized in user studies that focus on in-the-wild data. These methods can be grouped into user-driven, in situ annotations, which are performed before or while the activity is recorded, and recall methods, where participants annotate their data in hindsight at the end of the day.

    The participants were asked to label their activities using (1) a button located on the smartwatch, (2) the activity-tracking app Strava, (3) a handwritten diary, and (4) MAD-GUI, a tool for visually inspecting and labelling activity data. Methods (1)-(3) were used in both weeks, while method (4) was introduced at the beginning of the second study week.

    The accelerometer data is recorded at 25 Hz with a sensitivity of ±8 g and is stored in CSV format. Labels and raw data are not yet combined: you can either write your own script to label the data (a minimal sketch follows the label description below) or follow the instructions in our corresponding GitHub repository.

    The following unique classes are included in our dataset:

    laying, sitting, walking, running, cycling, bus_driving, car_driving, vacuum_cleaning, laundry, cooking, eating, shopping, showering, yoga, sport, playing_games, desk_work, guitar_playing, gardening, table_tennis, badminton, horse_riding.

    However, many activities are very participant specific and therefore only performed by one of the participants.

    The labels are also stored as a .csv file and have the following columns:

    week_day, start, stop, activity, layer

    Example:

    week2_day2,10:30:00,11:00:00,vacuum_cleaning,d

    The layer column specifies which annotation method was used to set this label.

    The following identifiers can be found in the column:

    b: in situ button

    a: in situ app

    d: self-recall diary

    g: time-series recall labelled with the MAD-GUI
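
    A minimal labelling sketch in Python, assuming the label file follows the column layout above and has no header row; the file name and the use of zero-padded HH:MM:SS strings for the time comparison are assumptions about the release:

    import pandas as pd

    # Load the label file described above (no header row assumed).
    labels = pd.read_csv("labels.csv",
                         names=["week_day", "start", "stop", "activity", "layer"])

    # Keep only the diary-based self-recall annotations (layer "d").
    diary = labels[labels["layer"] == "d"]

    def label_for(week_day, timestamp):
        """Return the diary activity covering a given day and HH:MM:SS timestamp, if any."""
        rows = diary[(diary["week_day"] == week_day) &
                     (diary["start"] <= timestamp) &
                     (diary["stop"] >= timestamp)]
        return rows["activity"].iloc[0] if not rows.empty else None

    print(label_for("week2_day2", "10:45:00"))  # -> "vacuum_cleaning" for the example row above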

    The corresponding publication is currently under review.

  7. Data from: MT@BZ annotation guidelines v1.0

    • clarin.eurac.edu
    Updated Jan 18, 2024
    Cite
    Elena Chiocchetti; Flavia De Camillis (2024). MT@BZ annotation guidelines v1.0 [Dataset]. https://clarin.eurac.edu/repository/xmlui/handle/20.500.12124/62?show=full
    Explore at:
    Dataset updated
    Jan 18, 2024
    Authors
    Elena Chiocchetti; Flavia De Camillis
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The MT@BZ annotation guidelines are guidelines for assessing the quality of legal Italian-German machine translation. In particular, they cover the South Tyrolean variety of German. They are based on version 1.3.3 of the Annotation Guidelines for English-Dutch Machine Translation Quality Assessment (https://www.lt3.ugent.be/publications/annotation-guidelines-for-english-dutch-machine-tr/). The guidelines also include specific instructions on how to annotate errors in WebAnno/INCEpTION and which sources to consult when assessing the correctness of a translation.

  8. Data from: KPWr annotation guidelines - spatial expressions (1.0)

    • live.european-language-grid.eu
    binary format
    Updated Apr 24, 2016
    + more versions
    Cite
    (2016). KPWr annotation guidelines - spatial expressions (1.0) [Dataset]. https://live.european-language-grid.eu/catalogue/ld/18723
    Explore at:
    binary formatAvailable download formats
    Dataset updated
    Apr 24, 2016
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Spatial expressions annotation guidelines describing the process of manual annotation of documents in the Polish Corpus of Wrocław University of Technology (KPWr).

  9. The number of concepts from each semantic group that were used by each...

    • figshare.com
    • plos.figshare.com
    xls
    Updated Jun 2, 2023
    Cite
    Jose Antonio Miñarro-Giménez; Catalina Martínez-Costa; Daniel Karlsson; Stefan Schulz; Kirstine Rosenbeck Gøeg (2023). The number of concepts from each semantic group that were used by each annotator and the reference standard for coding the 20 medical text snippets. [Dataset]. http://doi.org/10.1371/journal.pone.0209547.t003
    Explore at:
    xlsAvailable download formats
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Jose Antonio Miñarro-Giménez; Catalina Martínez-Costa; Daniel Karlsson; Stefan Schulz; Kirstine Rosenbeck Gøeg
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The number of concepts from each semantic group that were used by each annotator and the reference standard for coding the 20 medical text snippets.

  10. A Corpus of Online Drug Usage Guideline Documents Annotated with Type of...

    • live.european-language-grid.eu
    • data.niaid.nih.gov
    tsv
    Updated Sep 8, 2022
    Cite
    (2022). A Corpus of Online Drug Usage Guideline Documents Annotated with Type of Advice [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/7399
    Explore at:
    tsvAvailable download formats
    Dataset updated
    Sep 8, 2022
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Introduction: The goal of this dataset is to aid NLP research on recognizing safety-critical information in drug usage guideline (DUG) or patient handout data. The dataset contains annotated advice statements from 90 online DUG documents corresponding to 90 drugs or medications used in the prescriptions of patients suffering from one or more chronic diseases. The advice statements are annotated in eight safety-critical categories: activity or lifestyle related, disease or symptom related, drug administration related, exercise related, food or beverage related, other drug related, pregnancy related, and temporal.

    Data Collection: The data was collected from MedScape, one of the most widely used references for health care providers. First, 34 real, anonymized prescriptions of patients suffering from one or more chronic diseases were collected. These prescriptions contain 165 drugs used to treat chronic diseases. MedScape was then crawled to collect the drug usage guideline (DUG) / patient handout for these 165 drugs. However, MedScape does not have a DUG document for every drug; DUG documents were found for 90 of the drugs.

    Data Annotation Tool: A data annotation tool was developed to ease the annotation process. It allows the user to select a DUG document and a position within the document in terms of line number. It stores the user log from the annotator and loads the most recent position from the log when the application is launched. It supports annotating multiple files for the same drug, as there are often multiple overlapping sources of drug usage guidelines for a single drug. DUG documents often contain formatted text, and the tool aids annotation of the formatted text as well. The annotation tool is also available upon request.

    Annotated Data Description: The annotated data contains the annotation tag(s) of each advice statement extracted from the 90 online DUG documents. It also contains the phrases or topics in the advice statement that trigger the annotation tag, such as activity, exercise, medication name, food or beverage name, disease name, and pregnancy condition (gestational, postpartum). Sometimes disease names are not mentioned directly but rather as a condition (e.g., stomach bleeding, alcohol abuse) or the state of a parameter (e.g., low blood sugar, low blood pressure). The annotated data is formatted as follows:
    drug name, drug number, line number of the first sentence of the advice in the DUG document, advice text, advice tag(s), medication, food, activity, exercise, and disease names mentioned in the advice.
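
    A hedged parsing sketch in Python following the column order above; the file name, the tab delimiter, the absence of a header row, and the use of ";" to separate multiple advice tags are assumptions about the released TSV:

    import csv
    from collections import Counter

    tag_counts = Counter()
    with open("annotated_advice.tsv", newline="", encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            # Column order per the description: drug name, drug number, line number,
            # advice text, advice tag(s), medication, food, activity, exercise, disease
            for tag in row[4].split(";"):
                tag_counts[tag.strip()] += 1

    print(tag_counts.most_common())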


    Unannotated Data Description:
    The unannotated data contains the raw DUG documents for the 90 drugs. It also contains drug interaction information for the 165 drugs, categorized into four classes: contraindicated, serious, monitor closely, and minor. This information can be used to automatically detect potential interactions, and the effects of interactions, among multiple drugs.

    Citation: If you use this dataset in your work, please cite the following reference in any publication:

    @inproceedings{preum2018DUG,
      title={A Corpus of Drug Usage Guidelines Annotated with Type of Advice},
      author={Preum, Sarah Masud and Parvez, Md. Rizwan and Chang, Kai-Wei and Stankovic, John A.},
      booktitle={Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)},
      publisher={European Language Resources Association (ELRA)},
      year={2018}
    }

  11. 142-Birds-Species-Object-Detection-V1

    • kaggle.com
    zip
    Updated Oct 17, 2024
    Cite
    Sai Sanjay Kottakota (2024). 142-Birds-Species-Object-Detection-V1 [Dataset]. https://www.kaggle.com/datasets/saisanjaykottakota/142-birds-species-object-detection-v1
    Explore at:
    zip(1081589024 bytes)Available download formats
    Dataset updated
    Oct 17, 2024
    Authors
    Sai Sanjay Kottakota
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Data Annotation for Computer Vision using Web Scraping and CVAT

    Introduction

    This project demonstrates the process of creating a labeled dataset for computer vision tasks using web scraping and the CVAT annotation tool. Web scraping was employed to gather images from the web, and CVAT was utilized to annotate these images with bounding boxes around objects of interest. This dataset can then be used to train object detection models.

    Dataset Creation

    1. Web Scraping: Images of 142 bird species were collected using web scraping techniques. Libraries such as requests and Beautiful Soup were likely used for this task (a hedged sketch follows this list).
    2. CVAT Annotation: The collected images were uploaded to CVAT, where bounding boxes were manually drawn around each bird instance in the images. This created a labeled dataset ready for training computer vision models.
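
    Since the exact scraping code is not included in the description, the following is only a hypothetical sketch of the requests + Beautiful Soup approach it mentions; the URL and HTML structure are placeholders, not the actual sources used for this dataset:

    import requests
    from bs4 import BeautifulSoup

    page_url = "https://example.org/gallery/spot-billed-pelican"  # placeholder URL
    resp = requests.get(page_url, timeout=30)
    resp.raise_for_status()

    soup = BeautifulSoup(resp.text, "html.parser")
    image_urls = [img["src"] for img in soup.find_all("img") if img.get("src")]

    for i, url in enumerate(image_urls):
        img = requests.get(url, timeout=30)
        if img.ok:
            with open(f"spot_billed_pelican_{i:05d}.jpg", "wb") as f:
                f.write(img.content)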

    Usage

    This dataset can be used to train object detection models for bird species identification. It can also be used to evaluate the performance of existing object detection models on a specific dataset.

    Code

    The code used for this project is available in the attached notebook. It demonstrates how to perform the following tasks:

    • Download the dataset.
    • Install necessary libraries.
    • Upload the dataset to Kaggle.
    • Create a dataset in Kaggle and upload the data.

    Conclusion

    This project provides a comprehensive guide to data annotation for computer vision tasks. By combining web scraping and CVAT, we were able to create a high-quality labeled dataset for training object detection models.

    Sources: github.com/cvat-ai/cvat, opencv.org/blog/data-annotation/

    Sample manifest.jsonl metadata

    {"version":"1.1"}
    {"type":"images"}
    {"name":"Spot-billed_Pelican_-_Pelecanus_philippensis_-_Media_Search_-_Macaulay_Library_and_eBirdMacaulay_Library_logoMacaulay_Library_lo/10001","extension":".jpg","width":480,"height":360,"meta":{"related_images":[]}}
    {"name":"Spot-billed_Pelican_-_Pelecanus_philippensis_-_Media_Search_-_Macaulay_Library_and_eBirdMacaulay_Library_logoMacaulay_Library_lo/10002","extension":".jpg","width":480,"height":320,"meta":{"related_images":[]}}
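
    A short sketch for reading the manifest: the first records carry file-level metadata ("version", "type"), and each remaining line describes one image (name, extension, width, height):

    import json

    images = []
    with open("manifest.jsonl", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            if "name" in record:          # skip the version/type header records
                images.append(record)

    print(len(images), "images")
    print(images[0]["name"], images[0]["width"], images[0]["height"])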
    
  12. Data from: KPWr annotation guidelines - events

    • live.european-language-grid.eu
    binary format
    Updated Apr 24, 2016
    + more versions
    Cite
    (2016). KPWr annotation guidelines - events [Dataset]. https://live.european-language-grid.eu/catalogue/ld/18724
    Explore at:
    binary formatAvailable download formats
    Dataset updated
    Apr 24, 2016
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Events annotation guidelines describing the process of manual annotation of documents in the Polish Corpus of Wrocław University of Technology (KPWr).

  13. Training and development dataset for information extraction in plant...

    • entrepot.recherche.data.gouv.fr
    zip
    Updated Feb 20, 2025
    Cite
    MaIAGE; Plateforme ESV (2025). Training and development dataset for information extraction in plant epidemiomonitoring [Dataset]. http://doi.org/10.57745/ZDNOGF
    Explore at:
    zip(479001)Available download formats
    Dataset updated
    Feb 20, 2025
    Dataset provided by
    Recherche Data Gouv
    Authors
    MaIAGE; Plateforme ESV
    License

    https://entrepot.recherche.data.gouv.fr/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.57745/ZDNOGF

    Dataset funded by
    INRAE
    PIA DATAIA
    Agence nationale de la recherche
    Description

    The “Training and development dataset for information extraction in plant epidemiomonitoring” is the annotation set of the “Corpus for the epidemiomonitoring of plant”. The annotations include seven entity types (e.g. species, locations, diseases), their normalisation via the NCBI taxonomy and GeoNames, and binary (seven) and ternary relationships. The annotations refer to character positions within the documents of the corpus; a minimal offset-resolution sketch is given below. The annotation guidelines give their definitions and representative examples. Both datasets are intended for the training and validation of information extraction methods.
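
    Because the annotations are stand-off (character offsets into the corpus documents), resolving a mention amounts to slicing the document text. A minimal Python sketch, assuming hypothetical file names and a simple tab-separated layout (document id, start offset, end offset, entity type); the actual released format may differ:

    from pathlib import Path

    doc_text = Path("corpus/doc_001.txt").read_text(encoding="utf-8")  # hypothetical path

    with open("annotations/doc_001.tsv", encoding="utf-8") as f:       # hypothetical path
        for line in f:
            doc_id, start, end, entity_type = line.rstrip("\n").split("\t")
            mention = doc_text[int(start):int(end)]
            print(entity_type, repr(mention))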

  14. Guide d'annotation des Bulletins de Santé du Végétal

    • entrepot.recherche.data.gouv.fr
    pdf
    Updated Oct 30, 2024
    Cite
    Marine Courtin; Catherine Roussey; Robert Bossy; Stéphan Bernard (2024). Guide d'annotation des Bulletins de Santé du Végétal [Dataset]. http://doi.org/10.57745/I5YVJH
    Explore at:
    pdf(710334)Available download formats
    Dataset updated
    Oct 30, 2024
    Dataset provided by
    Recherche Data Gouv
    Authors
    Marine Courtin; Catherine Roussey; Robert Bossy; Stéphan Bernard
    License

    https://spdx.org/licenses/etalab-2.0.html

    Dataset funded by
    Agence nationale de la recherche
    Description

    This document presents the annotation scheme and the instructions to follow for the manual annotation of the corpus of Bulletins de Santé du Végétal (BSV). It is intended for the experts who perform the annotation, as well as for those developing automatic processing methods for predicting and evaluating the annotations. The annotation covers named entities, their normalisation against reference resources, and the semantic relations between entities. The instructions are illustrated with examples drawn from the BSV corpus.

  15. Data from: Annotation of epidemiological information in animal...

    • dataverse.cirad.fr
    docx, ods, tsv
    Updated Aug 7, 2022
    Cite
    CIRAD Dataverse (2022). Annotation of epidemiological information in animal disease-related news articles: guidelines and manually labelled corpus [Dataset]. http://doi.org/10.18167/DVN1/YGAKNB
    Explore at:
    docx(32281), tsv(5778), ods(116701)Available download formats
    Dataset updated
    Aug 7, 2022
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains two files:

    (i) An annotated corpus ("epi_info_corpus.xlsx") containing 486 manually annotated sentences extracted from 32 animal disease-related news articles. These news articles were obtained from the database of PADI-web (https://padi-web.cirad.fr/en/), an event-based biosurveillance system dedicated to animal health surveillance. The first sheet (‘article_metadata’) provides metadata about the news articles: (1) id_article, the unique id of a news article, (2) title, the title of the news article, (3) source, the name of the news article website, (4) publication_date, the publication date of the news article (mm-dd-yyyy), and (5) URL, the web URL of the news article. The second sheet (‘annot_sentences’) contains the annotated sentences: each row corresponds to a sentence from a news article. Each sentence has two distinct labels, Event type and Information type. The columns are: (1) id_article, the id of the news article to which the sentence belongs, (2) id_sentence, the unique id of the sentence, indicating its position in the news content (an integer ranging from 1 to n, n being the total number of sentences in the news article), (3) sentence_text, the sentence textual content, (4) event_type, the Event type label, and (5) information_type, the Information type label. Event type labels indicate the relation between the sentence and the epidemiological context, i.e. current event (CE), risk event (RE), old event (OE), general (G) and irrelevant (IR). Information type labels indicate the type of epidemiological information, i.e. descriptive epidemiology (DE), distribution (DI), preventive and control measures (PCM), economic and political consequences (EPC), transmission pathway (TP), concern and risk factors (CRF), general epidemiology (GE) and irrelevant (IR).

    (ii) The annotation guidelines ("epi_info_guidelines.doc") providing a detailed description of each category.
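
    A loading sketch with pandas, using the sheet and column names from the description above; depending on the released spreadsheet format, an extra engine (e.g. openpyxl or odfpy) may be needed:

    import pandas as pd

    metadata = pd.read_excel("epi_info_corpus.xlsx", sheet_name="article_metadata")
    sentences = pd.read_excel("epi_info_corpus.xlsx", sheet_name="annot_sentences")

    # Distribution of Event type and Information type labels
    print(sentences["event_type"].value_counts())
    print(sentences["information_type"].value_counts())

    # Attach article metadata to each annotated sentence
    joined = sentences.merge(metadata, on="id_article", how="left")
    print(joined[["id_article", "title", "event_type", "information_type"]].head())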

  16. Data from: Annotated Dataset for Uncertainty Mining : Gold Standard

    • data.niaid.nih.gov
    Updated Nov 13, 2024
    Cite
    Gutehrlé, Nicolas; Ningrum, Panggih Kusuma; Atanassova, Iana (2024). Annotated Dataset for Uncertainty Mining : Gold Standard [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_14134214
    Explore at:
    Dataset updated
    Nov 13, 2024
    Dataset provided by
    Centre de Recherches Interdisciplinaires et Transculturelles
    Authors
    Gutehrlé, Nicolas; Ningrum, Panggih Kusuma; Atanassova, Iana
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Description of the dataset

    In order to study the expression of uncertainty in scientific articles, we have put together an interdisciplinary corpus of journals in the fields of Science, Technology and Medicine (STM) and the Humanities and Social Sciences (SHS). The selection of journals in our corpus is based on the Scimago Journal and Country Rank (SJR) classification, which is based on Scopus, the largest academic database available online. We have selected journals covering various disciplines, such as medicine, biochemistry, genetics and molecular biology, computer science, social sciences, environmental sciences, psychology, arts and humanities. For each discipline, we selected the five highest-ranked journals. In addition, we have included the journals PLoS ONE and Nature, both of which are interdisciplinary and highly ranked.

    Based on the corpus of articles from different disciplines described above, we created a set of annotated sentences as follows:

    593 sentences were pre-selected automatically by searching for occurrences of the uncertainty cues listed by Bongelli et al. (2019), Chen et al. (2018) and Hyland (1996).

    The remaining sentences were extracted from a subset of articles, consisting of two randomly selected articles per journal. These articles were examined by two human annotators to identify sentences containing uncertainty and to annotate them.

    600 sentences not expressing scientific uncertainty were manually identified and reviewed by two annotators

    The sentences were annotated by two independent annotators following the annotation guide proposed by Ningrum and Atanassova (2024). The annotators were trained on the basis of an annotation guide and previously annotated sentences in order to guarantee the consistency of the annotations. Each sentence was annotated as expressing or not expressing uncertainty (Uncertainty and No Uncertainty). Sentences expressing uncertainty were then annotated along five dimensions: Reference, Nature, Context, Timeline and Expression. The annotators reached an average agreement score of 0.414 according to Cohen's Kappa test, which shows the difficulty of the task of annotating scientific uncertainty. Finally, conflicting annotations were resolved by a third independent annotator.

    Our final corpus thus consists of a total of 1,840 sentences from 496 articles in 21 English-language journals from 8 different disciplines. The columns of the table are as follows:

    journal: name of the journal from where the article originates

    article_title: title of the article from where the sentence is extracted

    publication_year: year of publication of the article

    sentence_text: text of the sentence expressing or not expressing uncertainty

    uncertainty: 1 if the sentence expresses uncertainty and 0 otherwise;

    ref, nature, context, timeline, expression: annotations of the type of uncertainty according to the annotation framework proposed by Ningrum and Atanassova (2023). The annotations of each dimension in this dataset are in numeric format rather than textual. The mapping between textual and numeric labels is presented in the table below; a short decoding sketch follows the table.

    Dimension     1            2              3           4            5

    Reference     Author       Former         Both
    Nature        Epistemic    Aleatory       Both
    Context       Background   Methods        Res&Disc    Conclusion   Others
    Timeline      Past         Present        Future
    Expression    Quantified   Unquantified
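
    A short decoding sketch that maps the numeric codes back to the textual labels in the table above; the file name and CSV format are assumptions about the release:

    import pandas as pd

    DIMENSION_LABELS = {
        "ref":        {1: "Author", 2: "Former", 3: "Both"},
        "nature":     {1: "Epistemic", 2: "Aleatory", 3: "Both"},
        "context":    {1: "Background", 2: "Methods", 3: "Res&Disc", 4: "Conclusion", 5: "Others"},
        "timeline":   {1: "Past", 2: "Present", 3: "Future"},
        "expression": {1: "Quantified", 2: "Unquantified"},
    }

    corpus = pd.read_csv("uncertainty_gold_standard.csv")   # hypothetical file name
    uncertain = corpus[corpus["uncertainty"] == 1].copy()
    for column, mapping in DIMENSION_LABELS.items():
        uncertain[column] = uncertain[column].map(mapping)

    print(uncertain[["sentence_text", "ref", "nature", "context", "timeline", "expression"]].head())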

    This gold standard has been produced as part of the ANR InSciM (Modelling Uncertainty in Science) project.

    References

    Bongelli, R., Riccioni, I., Burro, R., & Zuczkowski, A. (2019). Writers’ uncertainty in scientific and popular biomedical articles. A comparative analysis of the British Medical Journal and Discover Magazine [Publisher: Public Library of Science]. PLoS ONE, 14 (9). https://doi.org/10.1371/journal.pone.0221933

    Chen, C., Song, M., & Heo, G. E. (2018). A scalable and adaptive method for finding semantically equivalent cue words of uncertainty. Journal of Informetrics, 12 (1), 158–180. https://doi.org/10.1016/j.joi.2017.12.004

    Hyland, K. E. (1996). Talking to the academy forms of hedging in science research articles [Publisher: SAGE Publications Inc.]. Written Communication, 13 (2), 251–281. https://doi.org/10.1177/0741088396013002004

    Ningrum, P. K., & Atanassova, I. (2023). Scientific Uncertainty: An Annotation Framework and Corpus Study in Different Disciplines. 19th International Conference of the International Society for Scientometrics and Informetrics (ISSI 2023). https://doi.org/10.5281/zenodo.8306035

    Ningrum, P. K., & Atanassova, I. (2024). Annotation of scientific uncertainty using linguistic patterns. Scientometrics. https://doi.org/10.1007/s11192-024-05009-z

  17. Data from: dopanim: A Dataset of Doppelganger Animals with Noisy Annotations...

    • zenodo.org
    json, zip
    Updated Nov 6, 2024
    Cite
    Marek Herde; Denis Huseljic; Lukas Rauch; Bernhard Sick (2024). dopanim: A Dataset of Doppelganger Animals with Noisy Annotations from Multiple Humans [Dataset]. http://doi.org/10.5281/zenodo.14016659
    Explore at:
    json, zipAvailable download formats
    Dataset updated
    Nov 6, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Marek Herde; Denis Huseljic; Lukas Rauch; Bernhard Sick
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Profile

    • The dopanim dataset features about 15,750 animal images of 15 classes, organized into four groups of doppelganger animals and collected together with ground truth labels from iNaturalist. For approximately 10,500 of these images, 20 humans provided over 52,000 annotations with an accuracy of circa 67%.
    • Key attributes include the challenging task of classifying doppelganger animals, human-estimated likelihoods per image-annotator pair, and annotator metadata.
    • The dataset's broad research scope covers noisy label learning, multi-annotator learning, active learning, and learning beyond hard labels.
    • Further information is given in the associated article and our GitHub repository for using the data.

    File Descriptions

    • task_data.json contains data, e.g., the ground truth class labels, for each image classification task. Each task record is indexed by the iNaturalist observation index. A description of each record's entries is given in the supplementary material of the associated article.
    • annotation_data.json contains data, e.g., likelihoods per animal class, for each obtained image annotation. Each annotation record has a unique identifier. A description of each record's entries is given in the supplementary material of the associated article.
    • annotator_metadata.json contains metadata, e.g., self-assessed levels of knowledge of and interest in animals, for each annotator. Each metadata record is indexed by the anonymous identifier of an annotator. A description of each record's entries is given in the supplementary material of the associated article. (A loading sketch follows this list.)
    • train.zip, valid.zip, and test.zip contain the training, validation, and test images organized into directories of the 15 animal classes.
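
    A minimal loading sketch, assuming the three JSON files are mappings keyed by the identifiers named above (the exact per-record fields are documented in the article's supplementary material):

    import json

    with open("task_data.json", encoding="utf-8") as f:
        tasks = json.load(f)          # keyed by iNaturalist observation index
    with open("annotation_data.json", encoding="utf-8") as f:
        annotations = json.load(f)    # keyed by unique annotation identifier
    with open("annotator_metadata.json", encoding="utf-8") as f:
        annotators = json.load(f)     # keyed by anonymous annotator identifier

    print(len(tasks), "tasks,", len(annotations), "annotations,", len(annotators), "annotators")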

    Licenses

    • Images and their associated metadata are collected as observations from iNaturalist. We constrained the collection to images and metadata with CC0, CC-BY, CC-BY-SA, CC-BY-NC, or CC-BY-NC-SA licenses. The information about these licenses is given by the fields license_code and photo_license_code in each record of task_data.json. The links to each image and observation are given for further reference.
    • We collected the data in the files annotation_data.json and annotator_metadata.json in an annotation campaign via LabelStudio and distribute them under the license CC-BY-NC 4.0.

    Contact

    • If you have questions or issues relevant to other dataset users, we ask you to create a corresponding issue at our GitHub repository.
    • In all other cases, you can contact the dataset collectors via the e-mail marek.herde@uni-kassel.de.

    Acknowledgements

    This work was funded by the ALDeep and CIL projects at the University of Kassel. Moreover, we thank Franz Götz-Hahn for his insightful comments on improving our annotation campaign. Finally, we thank the iNaturalist community for their many observations that help explore our nature's biodiversity and our annotators for their dedicated efforts in making the annotation campaign via LabelStudio possible.

    Disclaimer

    • We carefully selected and composed this dataset's content. If you believe that any of this content violates licensing agreements or infringes on intellectual property rights, please contact us immediately (cf. contact information). In such a case, we will promptly investigate the issue and remove the implicated data records from our dataset if necessary.
    • Users are responsible for ensuring that their use of the dataset complies with all licenses, applicable laws, regulations, and ethical guidelines. We make no representations or warranties of any kind and accept no responsibility in the case of violations.

  18. C# Dataset of Data Class, Feature Envy and Refused Bequest code smells

    • nde-dev.biothings.io
    Updated Jan 9, 2024
    Cite
    Luburić, Nikola (2024). C# Dataset of Data Class, Feature Envy and Refused Bequest code smells [Dataset]. https://nde-dev.biothings.io/resources?id=zenodo_10475431
    Explore at:
    Dataset updated
    Jan 9, 2024
    Dataset provided by
    Grujić, Katarina-Glorija
    Slivka, Jelena
    Kovačević, Aleksandar
    Luburić, Nikola
    Prokić, Simona
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset includes open-source projects written in the C# programming language, annotated for the presence of the Data Class, Feature Envy and Refused Bequest code smells. Each code snippet was manually annotated by at least two annotators.

    The dataset contains three excel datasheets:

    DataSet_Data_Class.xlsx - C# classes annotated for the Data Class code smell

    DataSet_Feature_Envy.xlsx - C# methods annotated for the Feature Envy code smell

    DataSet_Refused_Bequest.xlsx - C# classes annotated for the Refused Bequest code smell

    The columns in the datasheet represent:

    Code Snippet ID - the full name of the code snippet.

    for classes, this is the package/namespace name followed by the class name. The full name of inner classes also contains the names of any outer classes (e.g., namespace.subnamespace.outerclass.innerclass).

    for methods, this is the full name of the class and the method's signature (e.g., namespace.class.method(param1Type, param2Type))

    Link - the Github link to the code snippet, including the commit and the start and end LOC.

    Code Smell - code smell for which the code snippet is examined (Data Class, Feature Envy or Refused Bequest)

    Project Link - the link to the version of the code repository that was annotated

    Metrics – a list of metrics for the code snippet, calculated by our platform. Our dataset provides 31 class-level metrics for Data Class and Refused Bequest detection and 19 method-level metrics for Feature Envy detection. The list of metrics and their definitions is available here.

    Final annotation – a single severity score calculated by a majority vote (a small recomputation sketch follows the column list below).

    Annotators – each annotator's (1, 2, or 3) assigned severity score.
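
    A sketch that recomputes the majority-vote severity from the individual annotator scores; the exact column headers ("Annotator 1" to "Annotator 3") are assumptions and may differ in the released sheets:

    import pandas as pd

    df = pd.read_excel("DataSet_Data_Class.xlsx")
    annotator_cols = ["Annotator 1", "Annotator 2", "Annotator 3"]   # assumed headers

    def majority_vote(row):
        scores = row[annotator_cols].dropna().astype(int)
        # mode() returns all tied values; take the first as a simple tie-break
        return scores.mode().iloc[0] if not scores.empty else None

    df["Recomputed annotation"] = df.apply(majority_vote, axis=1)
    print(df[["Code Snippet ID", "Recomputed annotation"]].head())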

    To help guide their reasoning for evaluating the presence and the severity of a code smell, three annotators independently annotated whether the considered heuristics apply to an evaluated code snippet. We provide these results in three separate excel datasheets:

    DataClass_Heuristics.xlsx - C# classes annotated for the presence of heuristics relevant for the Data Class code smell.

    FeatureEnvy_Heuristics.xlsx - C# methods annotated for the presence of heuristics relevant for the Feature Envy code smell.

    RefusedBequest_Heuristics.xlsx - C# classes annotated for the presence of heuristics relevant for the Refused Bequest code smell.

    The columns of these three datasheets are:

    Code Snippet ID - the full name of the code snippet (matching the IDs from DataSet_Data_Class.xlsx, DataSet_Feature_Envy.xlsx and DataSet_Refused_Bequest.xlsx)

    Annotators – heuristics labelled by each of the annotators (1, 2, or 3).

    Heuristics – whether the heuristic is applicable to the examined code snippet or not

    Annotators annotated the dataset based on the annotation procedure and guidelines available here.

  19. Data from: TRANSMAT Gold Standard

    • dataverse.cirad.fr
    application/x-gzip +2
    Updated Oct 25, 2023
    + more versions
    Cite
    Martin Lentschat; Patrice Buche; Luc Menut (2023). TRANSMAT Gold Standard [Dataset]. http://doi.org/10.18167/DVN1/U7HK8J
    Explore at:
    tsv(24301), tsv(382623), tsv(402471), pdf(297359), tsv(31878), application/x-gzip(4980)Available download formats
    Dataset updated
    Oct 25, 2023
    Authors
    Martin Lentschat; Patrice Buche; Luc Menut
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    This dataset presents a Gold Standard of data annotated on documents from the Science Direct website. The annotated entities are the ones related to permeability n-ary relations, as defined in the TRANSMAT Ontology (https://ico.iate.inra.fr/atWeb/, https://doi.org/10.15454/NK24ID, http://agroportal.lirmm.fr/ontologies/TRANSMAT) and following the annotation guide also available here. The annotations were performed by three annotators on a WebAnno (doi: 10.3115/v1/P14-5016) server. The four files present (one per annotator, plus a merged version with priority to annotator 1 in case of conflicts on annotated items) were obtained from the output files of the WebAnno tool. They are presented in table format, without reproducing the full text, for copyright purposes. The information available on each annotation is: Doc (the original document), Target (the generic concept covering the annotated item), Original_Value (the annotated item), Attached_Value (an annotated secondary item for disambiguation), Type (the category of the annotated entity: symbolic, quantitative or additimentionnal) and Annotator (the annotator that performed the annotation). The code of the project for which this Gold Standard was designed is available here: https://github.com/Eskode/ARTEXT4LOD
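
    A sketch of how the annotator files could be combined with priority to annotator 1, mirroring the merged file described above; the file names and the choice of key columns (Doc + Original_Value) are assumptions about the released TSVs:

    import pandas as pd

    frames = [pd.read_csv(f"annotator_{i}.tsv", sep="\t") for i in (1, 2, 3)]   # assumed names
    merged = pd.concat(frames, ignore_index=True)

    # Frames are concatenated in annotator order, so keeping the first occurrence of
    # each (Doc, Original_Value) pair gives annotator 1 priority on conflicting items.
    merged = merged.drop_duplicates(subset=["Doc", "Original_Value"], keep="first")

    print(merged[["Doc", "Target", "Original_Value", "Type", "Annotator"]].head())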

  20. Open Assistant

    • kaggle.com
    zip
    Updated Nov 23, 2023
    Cite
    The Devastator (2023). Open Assistant [Dataset]. https://www.kaggle.com/datasets/thedevastator/multilingual-conversation-dataset/code
    Explore at:
    zip(34307768 bytes)Available download formats
    Dataset updated
    Nov 23, 2023
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Open Assistant

    Over 10,000 Annotated Trees in 35 Languages

    By Huggingface Hub [source]

    About this dataset

    OpenAssistant Conversations (OASST1) is a remarkable conversation corpus created with the help of over 13,500 volunteers and containing over 10,000 fully annotated conversation trees across 35 languages. It contains 161,443 messages that have all been human-annotated with 461,292 quality ratings for quality assurance purposes.

    This dataset offers a valuable resource to researchers and developers who want to explore conversational AI technology. With the breadth of languages supported by OASST1, projects can be built that engage users from all over the world in natural language conversations. Additionally, since every message has undergone human annotation and review, the quality ratings provide a useful signal when building your own bots or related applications.

    Whatever your goals in working with Natural Language Processing (NLP) technologies, OASST1 offers a versatile platform.

    How to use the dataset

    Guide to Using the Multilingual Conversation (OASST1) Dataset

    Introduction

    This guide is intended to help you understand and use the OpenAssistant Conversations (OASST1) dataset. It covers important terms and topics related to the dataset, provides an overview of how it is structured, and outlines a step-by-step approach for utilizing its features. The conversational data included in this dataset can be used to train cognitive assistants across multiple languages, as well as to evaluate language recognition systems.

    What Is Included in the Dataset?

    The OASST1 dataset includes 161,443 messages spread across 35 different languages. These messages carry 461,292 quality ratings from more than 13,500 volunteer annotators and are organized into 10 thousand fully annotated conversation trees. The messages are made up of text content along with associated context labels such as role (user or assistant), language used, synthetic data identification (machine generated), review results (positive/negative/neutral), and a detoxification flag where appropriate. Emojis may also be noted within message text where appropriate.

    Structure of Data

    The data within each conversation tree is organized by a combination of fields listed in the validation and training datasets (a hypothetical loading sketch follows this list):

    • Role – speaker role, identified as user or assistant
    • Text – text content provided by the speaker
    • Language – language spoken or written
    • Review Count – number of reviews indicated by human raters
    • Constructed Message Sets
    • Deleted Flag – whether the message was deleted or not
    • Savior Model Name – name of the playful transformer model used for synthetic message generation
    • Synthetic Indicator – Boolean indicating synthetic vs. real messages
    • Review Results – positive/negative/neutral designation given by humans based on current assertions
    • Detoxification Flag – Boolean flag indicating that detoxification has been applied
    • Tree State – depicts internal conversation tree progression
    • Rank – rating from 1-5 assigned by human raters
    • Labels – contextual tags attached to identify other topic areas besides the default language provided
    • Emojis – nonverbal comments which can be recognized visually
    • Created Date – date on which the message was created
    • Model Name – type name referencing the particular machine learning model used when generating synthetically derived conversations
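
    A hypothetical loading sketch: it assumes the messages ship as a JSON Lines file whose records expose at least role, text, and language fields under the names used below; the actual file name and field names in this Kaggle package may differ:

    import json
    from collections import Counter

    messages = []
    with open("oasst1_messages.jsonl", encoding="utf-8") as f:   # assumed file name
        for line in f:
            messages.append(json.loads(line))

    # Language distribution across the 35 supported languages
    print(Counter(m.get("lang") for m in messages).most_common(10))

    # Collect English user messages (role values per the description: user or assistant)
    english_user_msgs = [m["text"] for m in messages
                         if m.get("lang") == "en" and m.get("role") == "user"]
    print(len(english_user_msgs), "English user messages")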

      How Can This Data Be Used?

      This conversational data can be used in numerous ways both practically and academically depending on your project goals. It supports evaluation

    Research Ideas

    • Natural language understanding machine learning tasks such as intent classification or sentiment analysis
    • Training chatbot models with state-of-the-art performance in multiple languages
    • Language usage studies and AI research with a corpus of human conversations in 35 languages

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. [See Other Informat...
