82 datasets found
  1. Data from: Analyzing Dataset Annotation Quality Management in the Wild

    • tudatalib.ulb.tu-darmstadt.de
    Updated Sep 7, 2023
    + more versions
    Cite
    Klie, Jan-Christoph; Eckart de Castilho, Richard; Gurevych, Iryna (2023). Analyzing Dataset Annotation Quality Management in the Wild [Dataset]. http://doi.org/10.48328/tudatalib-1220
    Explore at:
    Dataset updated
    Sep 7, 2023
    Authors
    Klie, Jan-Christoph; Eckart de Castilho, Richard; Gurevych, Iryna
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    This is the accompanying data for the paper "Analyzing Dataset Annotation Quality Management in the Wild". Data quality is crucial for training accurate, unbiased, and trustworthy machine learning models and their correct evaluation. Recent works, however, have shown that even popular datasets used to train and evaluate state-of-the-art models contain a non-negligible amount of erroneous annotations, bias or annotation artifacts. There exist best practices and guidelines regarding annotation projects. But to the best of our knowledge, no large-scale analysis has been performed as of yet on how quality management is actually conducted when creating natural language datasets and whether these recommendations are followed. Therefore, we first survey and summarize recommended quality management practices for dataset creation as described in the literature and provide suggestions on how to apply them. Then, we compile a corpus of 591 scientific publications introducing text datasets and annotate it for quality-related aspects, such as annotator management, agreement, adjudication or data validation. Using these annotations, we then analyze how quality management is conducted in practice. We find that a majority of the annotated publications apply good or very good quality management. However, we deem the effort of 30% of the works as only subpar. Our analysis also shows common errors, especially with using inter-annotator agreement and computing annotation error rates.
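    Because the analysis highlights frequent mistakes in reporting inter-annotator agreement and error rates, a minimal, hedged sketch of a chance-corrected agreement calculation may be useful; the label arrays below are purely hypothetical and scikit-learn is assumed to be available.

```python
# Minimal sketch: chance-corrected agreement between two annotators.
# The label lists are hypothetical stand-ins for real annotation exports.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["POS", "NEG", "NEG", "POS", "NEU", "POS"]
annotator_b = ["POS", "NEG", "POS", "POS", "NEU", "NEG"]

# Raw percent agreement ignores agreement expected by chance.
percent_agreement = sum(a == b for a, b in zip(annotator_a, annotator_b)) / len(annotator_a)

# Cohen's kappa corrects for chance agreement and is one of the measures
# commonly reported in the surveyed publications.
kappa = cohen_kappa_score(annotator_a, annotator_b)

print(f"percent agreement: {percent_agreement:.2f}")
print(f"Cohen's kappa:     {kappa:.2f}")
```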

  2. Data Collection and Labelling Report

    • marketresearchforecast.com
    doc, pdf, ppt
    Updated Mar 13, 2025
    + more versions
    Cite
    AMA Research & Media LLP (2025). Data Collection and Labelling Report [Dataset]. https://www.marketresearchforecast.com/reports/data-collection-and-labelling-33030
    Explore at:
    Available download formats: ppt, doc, pdf
    Dataset updated
    Mar 13, 2025
    Dataset provided by
    AMA Research & Media LLP
    License

    https://www.marketresearchforecast.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The data collection and labeling market is experiencing robust growth, fueled by the escalating demand for high-quality training data in artificial intelligence (AI) and machine learning (ML) applications. The market, estimated at $15 billion in 2025, is projected to achieve a Compound Annual Growth Rate (CAGR) of 25% over the forecast period (2025-2033), reaching approximately $75 billion by 2033. This expansion is primarily driven by the increasing adoption of AI across diverse sectors, including healthcare (medical image analysis, drug discovery), automotive (autonomous driving systems), finance (fraud detection, risk assessment), and retail (personalized recommendations, inventory management). The rising complexity of AI models and the need for more diverse and nuanced datasets are significant contributing factors to this growth. Furthermore, advancements in data annotation tools and techniques, such as active learning and synthetic data generation, are streamlining the data labeling process and making it more cost-effective. However, challenges remain. Data privacy concerns and regulations like GDPR necessitate robust data security measures, adding to the cost and complexity of data collection and labeling. The shortage of skilled data annotators also hinders market growth, necessitating investments in training and upskilling programs. Despite these restraints, the market’s inherent potential, coupled with ongoing technological advancements and increased industry investments, ensures sustained expansion in the coming years. Geographic distribution shows strong concentration in North America and Europe initially, but Asia-Pacific is poised for rapid growth due to increasing AI adoption and the availability of a large workforce. This makes strategic partnerships and global expansion crucial for market players aiming for long-term success.

  3. Data from: Crowdsourced geometric morphometrics enable rapid large-scale...

    • datadryad.org
    • zenodo.org
    zip
    Updated Nov 10, 2016
    Cite
    Jonathan Chang; Michael E. Alfaro (2016). Crowdsourced geometric morphometrics enable rapid large-scale collection and analysis of phenotypic data [Dataset]. http://doi.org/10.5061/dryad.gh4k7
    Explore at:
    Available download formats: zip
    Dataset updated
    Nov 10, 2016
    Dataset provided by
    Dryad
    Authors
    Jonathan Chang; Michael E. Alfaro
    Time period covered
    2016
    Description
    1. Advances in genomics and informatics have enabled the production of large phylogenetic trees. However, the ability to collect large phenotypic datasets has not kept pace. 2. Here, we present a method to quickly and accurately gather morphometric data using crowdsourced image-based landmarking. 3. We find that crowdsourced workers perform similarly to experienced morphologists on the same digitization tasks. We also demonstrate the speed and accuracy of our method on seven families of ray-finned fishes (Actinopterygii). 4. Crowdsourcing will enable the collection of morphological data across vast radiations of organisms, and can facilitate richer inference on the macroevolutionary processes that shape phenotypic diversity across the tree of life.
  4. Data Annotation and Collection Services Report

    • marketresearchforecast.com
    doc, pdf, ppt
    Updated Mar 9, 2025
    + more versions
    Cite
    Market Research Forecast (2025). Data Annotation and Collection Services Report [Dataset]. https://www.marketresearchforecast.com/reports/data-annotation-and-collection-services-30703
    Explore at:
    Available download formats: doc, ppt, pdf
    Dataset updated
    Mar 9, 2025
    Dataset authored and provided by
    Market Research Forecast
    License

    https://www.marketresearchforecast.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The Data Annotation and Collection Services market is experiencing robust growth, driven by the increasing adoption of artificial intelligence (AI) and machine learning (ML) across diverse sectors. The market, estimated at $10 billion in 2025, is projected to achieve a Compound Annual Growth Rate (CAGR) of 25% from 2025 to 2033, reaching approximately $45 billion by 2033. This significant expansion is fueled by several key factors. The surge in autonomous driving initiatives necessitates high-quality data annotation for training self-driving systems, while the burgeoning smart healthcare sector relies heavily on annotated medical images and data for accurate diagnoses and treatment planning. Similarly, the growth of smart security systems and financial risk control applications demands precise data annotation for improved accuracy and efficiency. Image annotation currently dominates the market, followed by text annotation, reflecting the widespread use of computer vision and natural language processing. However, video and voice annotation segments are showing rapid growth, driven by advancements in AI-powered video analytics and voice recognition technologies. Competition is intense, with both established technology giants like Alibaba Cloud and Baidu, and specialized data annotation companies like Appen and Scale Labs vying for market share. Geographic distribution shows a strong concentration in North America and Europe initially, but Asia-Pacific is expected to emerge as a major growth region in the coming years, driven primarily by China and India's expanding technology sectors. The market, however, faces certain challenges. The high cost of data annotation, particularly for complex tasks such as video annotation, can pose a barrier to entry for smaller companies. Ensuring data quality and accuracy remains a significant concern, requiring robust quality control mechanisms. Furthermore, ethical considerations surrounding data privacy and bias in algorithms require careful attention. To overcome these challenges, companies are investing in automation tools and techniques like synthetic data generation, alongside developing more sophisticated quality control measures. The future of the Data Annotation and Collection Services market will likely be shaped by advancements in AI and ML technologies, the increasing availability of diverse data sets, and the growing awareness of ethical considerations surrounding data usage.

  5. Data from: ImageNet Dataset

    • paperswithcode.com
    Updated Apr 15, 2024
    Cite
    Jia Deng; Wei Dong; Richard Socher; Li-Jia Li; Kai Li; Fei-Fei Li (2024). ImageNet Dataset [Dataset]. https://paperswithcode.com/dataset/imagenet
    Explore at:
    Dataset updated
    Apr 15, 2024
    Authors
    Jia Deng; Wei Dong; Richard Socher; Li-Jia Li; Kai Li; Fei-Fei Li
    Description

    The ImageNet dataset contains 14,197,122 annotated images organized according to the WordNet hierarchy. Since 2010, the dataset has been used in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), a benchmark in image classification and object detection. The publicly released dataset contains a set of manually annotated training images. A set of test images is also released, with the manual annotations withheld. ILSVRC annotations fall into one of two categories: (1) image-level annotation of a binary label for the presence or absence of an object class in the image, e.g., “there are cars in this image” but “there are no tigers,” and (2) object-level annotation of a tight bounding box and class label around an object instance in the image, e.g., “there is a screwdriver centered at position (20,25) with width of 50 pixels and height of 30 pixels”. The ImageNet project does not own the copyright of the images; therefore, only thumbnails and URLs of images are provided.

    • Total number of non-empty WordNet synsets: 21,841
    • Total number of images: 14,197,122
    • Number of images with bounding box annotations: 1,034,908
    • Number of synsets with SIFT features: 1,000
    • Number of images with SIFT features: 1.2 million
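    As a hedged illustration of working with the released training images, the sketch below reads the dataset with torchvision; it assumes the ImageNet archives have already been downloaded manually (torchvision does not distribute them) and unpacked under ./imagenet.

```python
# Sketch only: iterate over ImageNet with torchvision, assuming the archives
# were obtained separately from image-net.org and placed under ./imagenet.
from torchvision import datasets, transforms

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

train_set = datasets.ImageNet(root="./imagenet", split="train", transform=preprocess)
image, class_index = train_set[0]
print(image.shape, train_set.classes[class_index])
```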

  6. text-2-video-human-preferences-runway-alpha

    • huggingface.co
    Updated Mar 11, 2025
    Cite
    Rapidata (2025). text-2-video-human-preferences-runway-alpha [Dataset]. https://huggingface.co/datasets/Rapidata/text-2-video-human-preferences-runway-alpha
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 11, 2025
    Dataset provided by
    Rapidata AG
    Authors
    Rapidata
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Rapidata Video Generation Runway Alpha Human Preference

    If you get value from this dataset and would like to see more in the future, please consider liking it.

    This dataset was collected in ~1 hour total using the Rapidata Python API, accessible to anyone and ideal for large scale data annotation.

      Overview
    

    In this dataset, ~30'000 human annotations were collected to evaluate Runway's Alpha video generation model on our benchmark. The up to date benchmark can… See the full description on the dataset page: https://huggingface.co/datasets/Rapidata/text-2-video-human-preferences-runway-alpha.
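    As a hedged example of accessing the annotations, the dataset can be pulled with the Hugging Face datasets library using the repository id from the page above; the split name "train" is an assumption and may differ on the dataset page.

```python
# Sketch: load the human-preference annotations from the Hugging Face Hub.
# The split name "train" is assumed; check the dataset page for the actual splits.
from datasets import load_dataset

ds = load_dataset("Rapidata/text-2-video-human-preferences-runway-alpha", split="train")
print(ds)       # column names and number of annotation rows
print(ds[0])    # a single human-preference record
```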

  7. Data from: Chromosome-level genome assembly and annotation of the...

    • springernature.figshare.com
    txt
    Updated Jan 27, 2025
    Cite
    Lei Zhang; Wanting Zhang; Mijuan Shi; Xiao-Qin Xia (2025). Chromosome-level genome assembly and annotation of the gynogenetic large-scale loach (Paramisgurnus dabryanus) [Dataset]. http://doi.org/10.6084/m9.figshare.27130323.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Jan 27, 2025
    Dataset provided by
    figshare
    Authors
    Lei Zhang; Wanting Zhang; Mijuan Shi; Xiao-Qin Xia
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset includes 1) the gene structure annotation, repeat element annotation, CDS sequences and peptide sequences of Haplotype B, which is described as the complete large-scale loach genome, and 2) the genome assembly and gene structure annotation of Haplotype A.

  8. Data from: UnientrezDB: Large-scale Gene Ontology Annotation Dataset and...

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Aug 17, 2024
    Cite
    Yuwei Miao; Yuzhi Guo; Hehuan Ma; Jingquan Yan; Feng Jiang; Weizhi An; Jean Gao; Junzhou Huang (2024). UnientrezDB: Large-scale Gene Ontology Annotation Dataset and Evaluation Benchmarks with Unified Entrez Gene Identifiers [Dataset]. http://doi.org/10.5281/zenodo.13335548
    Explore at:
    Available download formats: zip
    Dataset updated
    Aug 17, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Yuwei Miao; Yuzhi Guo; Hehuan Ma; Jingquan Yan; Feng Jiang; Weizhi An; Jean Gao; Junzhou Huang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Aug 17, 2024
    Description

    Our work focuses on providing a comprehensive dataset and benchmarks for evaluating gene ontology annotations using a unified system of Entrez Gene Identifiers.

  9. Code4ML 2.0: a Large-scale Dataset of annotated Machine Learning Code

    • zenodo.org
    csv, txt
    Updated Oct 23, 2024
    Cite
    Anonymous; Anonymous (2024). Code4ML 2.0: a Large-scale Dataset of annotated Machine Learning Code [Dataset]. http://doi.org/10.5281/zenodo.13918465
    Explore at:
    Available download formats: csv, txt
    Dataset updated
    Oct 23, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anonymous; Anonymous
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is an enriched version of the Code4ML dataset, a large-scale corpus of annotated Python code snippets, competition summaries, and data descriptions sourced from Kaggle. The initial release includes approximately 2.5 million snippets of machine learning code extracted from around 100,000 Jupyter notebooks. A portion of these snippets has been manually annotated by human assessors through a custom-built, user-friendly interface designed for this task.

    The original dataset is organized into multiple CSV files, each containing structured data on different entities:

    • code_blocks.csv: Contains raw code snippets extracted from Kaggle.
    • kernels_meta.csv: Metadata for the notebooks (kernels) from which the code snippets were derived.
    • competitions_meta.csv: Metadata describing Kaggle competitions, including information about tasks and data.
    • markup_data.csv: Annotated code blocks with semantic types, allowing deeper analysis of code structure.
    • vertices.csv: A mapping from numeric IDs to semantic types and subclasses, used to interpret annotated code blocks.

    Table 1. code_blocks.csv structure

    • code_blocks_index: Global index linking code blocks to markup_data.csv.
    • kernel_id: Identifier for the Kaggle Jupyter notebook from which the code block was extracted.
    • code_block_id: Position of the code block within the notebook.
    • code_block: The actual machine learning code snippet.

    Table 2. kernels_meta.csv structure

    • kernel_id: Identifier for the Kaggle Jupyter notebook.
    • kaggle_score: Performance metric of the notebook.
    • kaggle_comments: Number of comments on the notebook.
    • kaggle_upvotes: Number of upvotes the notebook received.
    • kernel_link: URL to the notebook.
    • comp_name: Name of the associated Kaggle competition.

    Table 3. competitions_meta.csv structure

    • comp_name: Name of the Kaggle competition.
    • description: Overview of the competition task.
    • data_type: Type of data used in the competition.
    • comp_type: Classification of the competition.
    • subtitle: Short description of the task.
    • EvaluationAlgorithmAbbreviation: Metric used for assessing competition submissions.
    • data_sources: Links to datasets used.
    • metric type: Class label for the assessment metric.

    Table 4. markup_data.csv structure

    • code_block: Machine learning code block.
    • too_long: Flag indicating whether the block spans multiple semantic types.
    • marks: Confidence level of the annotation.
    • graph_vertex_id: ID of the semantic type.
    The dataset allows mapping between these tables. For example:

    • code_blocks.csv can be linked to kernels_meta.csv via the kernel_id column.
    • kernels_meta.csv is connected to competitions_meta.csv through comp_name. To maintain quality, kernels_meta.csv includes only notebooks with available Kaggle scores.

    In addition, data_with_preds.csv contains automatically classified code blocks, with a mapping back to code_blocks.csv via the code_blocks_index column.
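    A small, hedged pandas sketch of the joins described above; it assumes the CSV files have been downloaded and extracted into the working directory.

```python
# Sketch: link the Code4ML tables as described in the documentation.
import pandas as pd

code_blocks = pd.read_csv("code_blocks.csv")
kernels_meta = pd.read_csv("kernels_meta.csv")
competitions_meta = pd.read_csv("competitions_meta.csv")

# code_blocks -> kernels_meta via kernel_id, then -> competitions_meta via comp_name.
blocks_with_kernels = code_blocks.merge(kernels_meta, on="kernel_id", how="inner")
full_view = blocks_with_kernels.merge(competitions_meta, on="comp_name", how="left")

print(full_view[["kernel_id", "comp_name", "kaggle_score"]].head())
```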

    Code4ML 2.0 Enhancements

    The updated Code4ML 2.0 corpus introduces kernels extracted from Meta Kaggle Code. These kernels correspond to Kaggle competitions launched since 2020. The natural-language descriptions of the competitions are retrieved with the aid of an LLM.

    Notebooks in kernels_meta2.csv may not have a Kaggle score but include a leaderboard ranking (rank), providing additional context for evaluation.

    Applications

    The Code4ML 2.0 corpus is a versatile resource, enabling training and evaluation of models in areas such as:

    • Code generation
    • Code understanding
    • Natural language processing of code-related tasks
  10. text-2-video-Rich-Human-Feedback

    • huggingface.co
    Updated Feb 4, 2025
    Cite
    Rapidata (2025). text-2-video-Rich-Human-Feedback [Dataset]. https://huggingface.co/datasets/Rapidata/text-2-video-Rich-Human-Feedback
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 4, 2025
    Dataset provided by
    Rapidata AG
    Authors
    Rapidata
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Rapidata Video Generation Rich Human Feedback Dataset

    If you get value from this dataset and would like to see more in the future, please consider liking it.

    This dataset was collected in ~4 hours total using the Rapidata Python API, accessible to anyone and ideal for large scale data annotation.

      Overview
    

    In this dataset, ~22'000 human annotations were collected to evaluate AI-generated videos (using Sora) in 5 different categories.

    Prompt - Video… See the full description on the dataset page: https://huggingface.co/datasets/Rapidata/text-2-video-Rich-Human-Feedback.

  11. Data from: Improving automated annotation of benthic survey images using...

    • data.niaid.nih.gov
    • zenodo.org
    • +1more
    Updated May 31, 2022
    Cite
    Kline, David I. (2022). Data from: Improving automated annotation of benthic survey images using wide-band fluorescence [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5034901
    Explore at:
    Dataset updated
    May 31, 2022
    Dataset provided by
    Loya, Yossi
    Treibitz, Tali
    Khen, Adi
    Kline, David I.
    Kriegman, David
    Neal, Benjamin
    Mitchell, B. Greg
    Beijbom, Oscar
    Eyal, Gal
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Large-scale imaging techniques are used increasingly for ecological surveys. However, manual analysis can be prohibitively expensive, creating a bottleneck between collected images and desired data-products. This bottleneck is particularly severe for benthic surveys, where millions of images are obtained each year. Recent automated annotation methods may provide a solution, but reflectance images do not always contain sufficient information for adequate classification accuracy. In this work, the FluorIS, a low-cost modified consumer camera, was used to capture wide-band wide-field-of-view fluorescence images during a field deployment in Eilat, Israel. The fluorescence images were registered with standard reflectance images, and an automated annotation method based on convolutional neural networks was developed. Our results demonstrate a 22% reduction of classification error-rate when using both image types compared to using reflectance images alone. The improvements were particularly large for the coral reef genera Platygyra, Acropora and Millepora, where classification recall improved by 38%, 33%, and 41%, respectively. We conclude that convolutional neural networks can be used to combine reflectance and fluorescence imagery in order to significantly improve automated annotation accuracy and reduce the manual annotation bottleneck.
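    The sketch below is an illustrative stand-in, not the authors' network: it shows one simple way to let a CNN use both modalities by stacking registered reflectance and fluorescence patches into a six-channel input.

```python
# Illustrative only: a toy CNN that consumes stacked reflectance + fluorescence
# patches (3 + 3 channels). The architecture is an assumption for demonstration.
import torch
import torch.nn as nn

class TwoModalityCNN(nn.Module):
    def __init__(self, num_classes: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(6, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, reflectance: torch.Tensor, fluorescence: torch.Tensor) -> torch.Tensor:
        x = torch.cat([reflectance, fluorescence], dim=1)  # (N, 6, H, W)
        return self.classifier(self.features(x).flatten(1))

model = TwoModalityCNN(num_classes=10)
logits = model(torch.randn(4, 3, 64, 64), torch.randn(4, 3, 64, 64))
print(logits.shape)  # torch.Size([4, 10])
```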

  12. text-2-video-human-preferences-wan2.1

    • huggingface.co
    Updated Mar 11, 2025
    Cite
    Rapidata (2025). text-2-video-human-preferences-wan2.1 [Dataset]. https://huggingface.co/datasets/Rapidata/text-2-video-human-preferences-wan2.1
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 11, 2025
    Dataset provided by
    Rapidata AG
    Authors
    Rapidata
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Rapidata Video Generation Alibaba Wan2.1 Human Preference

    If you get value from this dataset and would like to see more in the future, please consider liking it.

    This dataset was collected in ~1 hour total using the Rapidata Python API, accessible to anyone and ideal for large scale data annotation.

      Overview
    

    In this dataset, ~45'000 human annotations were collected to evaluate Alibaba Wan 2.1 video generation model on our benchmark. The up to date benchmark… See the full description on the dataset page: https://huggingface.co/datasets/Rapidata/text-2-video-human-preferences-wan2.1.

  13. Data from: ProGene - A Large-scale, High-Quality Protein-Gene Annotated...

    • zenodo.org
    • live.european-language-grid.eu
    • +1more
    zip
    Updated Jun 12, 2020
    Cite
    Erik Faessler; Luise Modersohn; Christina Lohr; Udo Hahn (2020). ProGene - A Large-scale, High-Quality Protein-Gene Annotated Benchmark Corpus [Dataset]. http://doi.org/10.5281/zenodo.3698568
    Explore at:
    Available download formats: zip
    Dataset updated
    Jun 12, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Erik Faessler; Luise Modersohn; Christina Lohr; Udo Hahn
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Pro(tein)/Gene corpus was developed at the JULIE Lab Jena under supervision of Prof. Udo Hahn.

    The goals of the annotation project were

    • to construct a consistent and (as far as possible) subdomain-independent/-comprehensive protein-annotated corpus
    • to differentiate between protein families and groups, protein complexes, protein molecules, protein variants (e.g. alleles) and elliptic enumerations of proteins.

    The corpus has the following annotation levels / entity types:

    • protein
    • protein_familiy_or_group
    • protein_complex
    • protein_variant
    • protein_enum

    For definitions of the annotation levels, please refer to the Proteins-guidelines-final.doc file that is found in the download package.

    To achieve a large coverage of biological subdomains, documents from multiple other protein/gene corpora were reannotated. For further coverage, new document sets were created. All documents are abstracts from PubMed/MEDLINE. The corpus is made up of the union of all the documents in the different subcorpora.
    All documents are delivered as MMAX2 (http://mmax2.net/) annotation projects.

  14. Relabelled MIT classes.

    • figshare.com
    xls
    Updated Apr 1, 2024
    + more versions
    Cite
    Michael Joannou; Pia Rotshtein; Uta Noppeney (2024). Relabelled MIT classes. [Dataset]. http://doi.org/10.1371/journal.pone.0301098.t001
    Explore at:
    Available download formats: xls
    Dataset updated
    Apr 1, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Michael Joannou; Pia Rotshtein; Uta Noppeney
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We present Audiovisual Moments in Time (AVMIT), a large-scale dataset of audiovisual action events. In an extensive annotation task 11 participants labelled a subset of 3-second audiovisual videos from the Moments in Time dataset (MIT). For each trial, participants assessed whether the labelled audiovisual action event was present and whether it was the most prominent feature of the video. The dataset includes the annotation of 57,177 audiovisual videos, each independently evaluated by 3 of 11 trained participants. From this initial collection, we created a curated test set of 16 distinct action classes, with 60 videos each (960 videos). We also offer 2 sets of pre-computed audiovisual feature embeddings, using VGGish/YamNet for audio data and VGG16/EfficientNetB0 for visual data, thereby lowering the barrier to entry for audiovisual DNN research. We explored the advantages of AVMIT annotations and feature embeddings to improve performance on audiovisual event recognition. A series of 6 Recurrent Neural Networks (RNNs) were trained on either AVMIT-filtered audiovisual events or modality-agnostic events from MIT, and then tested on our audiovisual test set. In all RNNs, top 1 accuracy was increased by 2.71-5.94% by training exclusively on audiovisual events, even outweighing a three-fold increase in training data. Additionally, we introduce the Supervised Audiovisual Correspondence (SAVC) task whereby a classifier must discern whether audio and visual streams correspond to the same action label. We trained 6 RNNs on the SAVC task, with or without AVMIT-filtering, to explore whether AVMIT is helpful for cross-modal learning. In all RNNs, accuracy improved by 2.09-19.16% with AVMIT-filtered data. We anticipate that the newly annotated AVMIT dataset will serve as a valuable resource for research and comparative experiments involving computational models and human participants, specifically when addressing research questions where audiovisual correspondence is of critical importance.
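    As a hedged illustration of how the pre-computed embeddings can be consumed, the sketch below is not one of the paper's six RNNs; it simply shows a small GRU classifier over a sequence of fused audiovisual embeddings, with the embedding size (128-d VGGish audio plus 1280-d EfficientNetB0 visual per time step) and the tensor shapes assumed for demonstration.

```python
# Illustrative only: a GRU classifier over precomputed audiovisual embeddings.
# Shapes and the fused embedding size (128 + 1280 = 1408) are assumptions.
import torch
import torch.nn as nn

class AVEventRNN(nn.Module):
    def __init__(self, embed_dim: int = 1408, hidden: int = 256, num_classes: int = 16):
        super().__init__()
        self.gru = nn.GRU(embed_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        _, last_hidden = self.gru(embeddings)      # last_hidden: (1, N, hidden)
        return self.head(last_hidden.squeeze(0))   # (N, num_classes)

model = AVEventRNN()
dummy_batch = torch.randn(8, 3, 1408)  # 8 clips, 3 time steps, fused AV embedding
print(model(dummy_batch).shape)        # torch.Size([8, 16])
```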

  15. Data from: Large-scale integration of single-cell transcriptomic data...

    • datadryad.org
    • data.niaid.nih.gov
    • +1more
    zip
    Updated Oct 22, 2021
    Cite
    David McKellar; Iwijn De Vlaminck; Benjamin Cosgrove (2021). Large-scale integration of single-cell transcriptomic data captures transitional progenitor states in mouse skeletal muscle regeneration [Dataset]. http://doi.org/10.5061/dryad.t4b8gtj34
    Explore at:
    Available download formats: zip
    Dataset updated
    Oct 22, 2021
    Dataset provided by
    Dryad
    Authors
    David McKellar; Iwijn De Vlaminck; Benjamin Cosgrove
    Time period covered
    2021
    Description

    Additional information as well as the code used to prepare these data can be found at github.com/mckellardw/scMuscle.

    These data comprise scMuscle v1.1 Seurat/CellChat Objects (see our github repository for other version tracking information).

  16. Global Crowdsourcing Platform Market Research Report: By Deployment Model...

    • wiseguyreports.com
    Updated Jul 23, 2024
    Cite
    Wiseguy Research Consultants Pvt Ltd (2024). Global Crowdsourcing Platform Market Research Report: By Deployment Model (Cloud-based, On-premises), By Industry Vertical (IT and Telecommunications, Healthcare, Education, Financial Services, Manufacturing), By Application Type (Microtasking, Crowdfunding, Open Innovation, Design Contests, Data Annotation), By Size of Workforce (Small-scale, Large-scale, Gig Economy), By Tier (Tier 1, Tier 2, Tier 3) and By Regional (North America, Europe, South America, Asia Pacific, Middle East and Africa) - Forecast to 2032. [Dataset]. https://www.wiseguyreports.com/reports/crowdsourcing-platform-market
    Explore at:
    Dataset updated
    Jul 23, 2024
    Dataset authored and provided by
    Wiseguy Research Consultants Pvt Ltd
    License

    https://www.wiseguyreports.com/pages/privacy-policy

    Time period covered
    Jan 7, 2024
    Area covered
    Global
    Description
    BASE YEAR: 2024
    HISTORICAL DATA: 2019 - 2024
    REPORT COVERAGE: Revenue Forecast, Competitive Landscape, Growth Factors, and Trends
    MARKET SIZE 2023: 3.49 (USD Billion)
    MARKET SIZE 2024: 4.02 (USD Billion)
    MARKET SIZE 2032: 12.47 (USD Billion)
    SEGMENTS COVERED: Deployment Model, Industry Vertical, Application Type, Size of Workforce, Tier, Regional
    COUNTRIES COVERED: North America, Europe, APAC, South America, MEA
    KEY MARKET DYNAMICS: Increasing demand for cost-effective and agile solutions; growing adoption in various industries; rise of artificial intelligence and machine learning; technological advancements and platform enhancements; emergence of new business models and subscription services
    MARKET FORECAST UNITS: USD Billion
    KEY COMPANIES PROFILED: Hiveby, Coleman Research, Clickworker, CrowdSource by Microsoft, Udemy, Freelancer, Amazon Mechanical Turk, Fiverr, Crowdsource by Google Cloud, Skyword, Upwork, Toptal, Beondeck, 99designs, TaskRabbit
    MARKET FORECAST PERIOD: 2024 - 2032
    KEY MARKET OPPORTUNITIES: Digital Transformation, Remote Work, Data Collection, Artificial Intelligence, Innovation
    COMPOUND ANNUAL GROWTH RATE (CAGR): 15.21% (2024 - 2032)
  17. scGPT: End-to-End Protocol for Fine-tuned Retina Cell Type Annotation

    • zenodo.org
    bin, zip
    Updated Oct 6, 2024
    + more versions
    Cite
    Shanli Ding; Rui Luo; Jin Li (2024). scGPT: End-to-End Protocol for Fine-tuned Retina Cell Type Annotation [Dataset]. http://doi.org/10.5281/zenodo.13896150
    Explore at:
    Available download formats: bin, zip
    Dataset updated
    Oct 6, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Shanli Ding; Rui Luo; Jin Li
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Abstract

    Single-cell research faces challenges in accurately annotating cell types at high resolution, especially when dealing with large-scale datasets and rare cell populations. To address this, foundation models like scGPT offer flexible, scalable solutions by leveraging transformer-based architectures. This protocol provides a comprehensive guide to fine-tuning scGPT for cell-type classification in single-cell RNA sequencing (scRNA-seq) data. We demonstrate how to fine-tune scGPT on a custom retina dataset, highlighting the model’s efficiency in handling complex data and improving annotation accuracy, achieving a 99.5% F1-score. The protocol automates key steps, including data preprocessing, model fine-tuning, and evaluation, and enables researchers to efficiently deploy scGPT for their own datasets. The provided tools, including a command-line script and Jupyter Notebook, simplify the customization and exploration of the model, offering an accessible workflow for users with minimal Python and Linux knowledge. The protocol thus offers an off-the-shelf solution for high-precision cell-type annotation with scGPT for researchers with intermediate bioinformatics skills.
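    Since the protocol's first stage is data preprocessing, a hedged scanpy sketch of a typical scRNA-seq preparation step is shown below; it is not the protocol's own script, "retina.h5ad" is a hypothetical file name, and "celltype" is an assumed label column.

```python
# Generic scRNA-seq preprocessing sketch with scanpy (not the scGPT protocol itself).
import scanpy as sc

adata = sc.read_h5ad("retina.h5ad")                   # hypothetical input file

sc.pp.filter_genes(adata, min_cells=3)                # drop genes seen in <3 cells
sc.pp.normalize_total(adata, target_sum=1e4)          # library-size normalization
sc.pp.log1p(adata)                                    # log-transform the counts
sc.pp.highly_variable_genes(adata, n_top_genes=2000)  # keep informative genes
adata = adata[:, adata.var.highly_variable].copy()

print(adata.shape, adata.obs["celltype"].value_counts().head())
```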

  18. Gene Expression Database

    • rrid.site
    • dknet.org
    • +2more
    Updated Feb 1, 2025
    Cite
    (2025). Gene Expression Database [Dataset]. http://identifiers.org/RRID:SCR_006539
    Explore at:
    Dataset updated
    Feb 1, 2025
    Description

    Community database that collects and integrates the gene expression information in MGI with a primary emphasis on endogenous gene expression during mouse development. The data in GXD are obtained from the literature, from individual laboratories, and from large-scale data providers. All data are annotated and reviewed by GXD curators. GXD stores and integrates different types of expression data (RNA in situ hybridization; Immunohistochemistry; in situ reporter (knock in); RT-PCR; Northern and Western blots; and RNase and Nuclease s1 protection assays) and makes these data freely available in formats appropriate for comprehensive analysis. There is particular emphasis on endogenous gene expression during mouse development. GXD also maintains an index of the literature examining gene expression in the embryonic mouse. It is comprehensive and up-to-date, containing all pertinent journal articles from 1993 to the present and articles from major developmental journals from 1990 to the present. GXD stores primary data from different types of expression assays and by integrating these data, as data accumulate, GXD provides increasingly complete information about the expression profiles of transcripts and proteins in different mouse strains and mutants. GXD describes expression patterns using an extensive, hierarchically-structured dictionary of anatomical terms. In this way, expression results from assays with differing spatial resolution are recorded in a standardized and integrated manner and expression patterns can be queried at different levels of detail. The records are complemented with digitized images of the original expression data. The Anatomical Dictionary for Mouse Development has been developed by our Edinburgh colleagues, as part of the joint Mouse Gene Expression Information Resource project. GXD places the gene expression data in the larger biological context by establishing and maintaining interconnections with many other resources. Integration with MGD enables a combined analysis of genotype, sequence, expression, and phenotype data. Links to PubMed, Online Mendelian Inheritance in Man (OMIM), sequence databases, and databases from other species further enhance the utility of GXD. GXD accepts both published and unpublished data.

  19. MANTI: Automated Annotation of Protein N‑Termini for Rapid Interpretation of...

    • figshare.com
    xlsx
    Updated Jun 3, 2023
    Cite
    Fatih Demir; Jayachandran N. Kizhakkedathu; Markus M. Rinschen; Pitter F. Huesgen (2023). MANTI: Automated Annotation of Protein N‑Termini for Rapid Interpretation of N‑Terminome Data Sets [Dataset]. http://doi.org/10.1021/acs.analchem.1c00310.s002
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Jun 3, 2023
    Dataset provided by
    ACS Publications
    Authors
    Fatih Demir; Jayachandran N. Kizhakkedathu; Markus M. Rinschen; Pitter F. Huesgen
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Site-specific proteolytic processing is an important, irreversible post-translational protein modification with implications in many diseases. Enrichment of protein N-terminal peptides followed by mass spectrometry-based identification and quantification enables proteome-wide characterization of proteolytic processes and protease substrates but is challenged by the lack of specific annotation tools. A common problem is, for example, ambiguous matches of identified peptides to multiple protein entries in the databases used for identification. We developed MaxQuant Advanced N-termini Interpreter (MANTI), a standalone Perl software with an optional graphical user interface that validates and annotates N-terminal peptides identified by database searches with the popular MaxQuant software package by integrating information from multiple data sources. MANTI utilizes diverse annotation information in a multistep decision process to assign a conservative preferred protein entry for each N-terminal peptide, enabling automated classification according to the likely origin, and determines significant changes in N-terminal peptide abundance. Auxiliary R scripts included in the software package summarize and visualize key aspects of the data. To showcase the utility of MANTI, we generated two large-scale TAILS N-terminome data sets from two different animal models of chemically and genetically induced kidney disease: puromycin aminonucleoside-treated rats (PAN) and heterozygous Wilms Tumor protein 1 mice (WT1). MANTI enabled rapid validation and autonomous annotation of >10,000 identified terminal peptides, revealing novel proteolytic proteoforms in 905 and 644 proteins, respectively. Quantitative analysis indicated that proteolytic activities with similar sequence specificity are involved in the pathogenesis of kidney injury and proteinuria in both models, whereas coagulation processes and complement activation were specifically induced after chemical injury.

  20. text-2-video-human-preferences-luma-ray2

    • huggingface.co
    Updated Feb 15, 2025
    Cite
    Rapidata (2025). text-2-video-human-preferences-luma-ray2 [Dataset]. https://huggingface.co/datasets/Rapidata/text-2-video-human-preferences-luma-ray2
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 15, 2025
    Dataset provided by
    Rapidata AG
    Authors
    Rapidata
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Rapidata Video Generation Luma Ray2 Human Preference

    If you get value from this dataset and would like to see more in the future, please consider liking it.

    This dataset was collected in ~1 hour total using the Rapidata Python API, accessible to anyone and ideal for large scale data annotation.

      Overview
    

    In this dataset, ~45'000 human annotations were collected to evaluate Luma's Ray 2 video generation model on our benchmark. The up to date benchmark can be… See the full description on the dataset page: https://huggingface.co/datasets/Rapidata/text-2-video-human-preferences-luma-ray2.
