Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This is the accompanying data for the paper "Analyzing Dataset Annotation Quality Management in the Wild". Data quality is crucial for training accurate, unbiased, and trustworthy machine learning models and for their correct evaluation. Recent works, however, have shown that even popular datasets used to train and evaluate state-of-the-art models contain a non-negligible amount of erroneous annotations, bias, or annotation artifacts. Best practices and guidelines for annotation projects do exist, but to the best of our knowledge, no large-scale analysis has yet been performed on how quality management is actually conducted when creating natural language datasets and whether these recommendations are followed. Therefore, we first survey and summarize recommended quality management practices for dataset creation as described in the literature and provide suggestions on how to apply them. Then, we compile a corpus of 591 scientific publications introducing text datasets and annotate it for quality-related aspects, such as annotator management, agreement, adjudication, or data validation. Using these annotations, we then analyze how quality management is conducted in practice. We find that a majority of the annotated publications apply good or very good quality management. However, we deem the effort of 30% of the works to be only subpar. Our analysis also shows common errors, especially with using inter-annotator agreement and computing annotation error rates.
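As a purely illustrative aside (not part of the released data or the paper's analysis), chance-corrected inter-annotator agreement of the kind the paper examines can be computed with scikit-learn's cohen_kappa_score; the two annotator label lists below are made up.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical labels from two annotators on the same eight items.
annotator_a = ["PER", "ORG", "O", "PER", "O", "LOC", "O", "ORG"]
annotator_b = ["PER", "ORG", "O", "O",   "O", "LOC", "PER", "ORG"]

# Cohen's kappa corrects raw percent agreement for agreement expected by chance.
kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")
```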
https://www.marketresearchforecast.com/privacy-policy
The data collection and labeling market is experiencing robust growth, fueled by the escalating demand for high-quality training data in artificial intelligence (AI) and machine learning (ML) applications. The market, estimated at $15 billion in 2025, is projected to achieve a Compound Annual Growth Rate (CAGR) of 25% over the forecast period (2025-2033), reaching approximately $75 billion by 2033. This expansion is primarily driven by the increasing adoption of AI across diverse sectors, including healthcare (medical image analysis, drug discovery), automotive (autonomous driving systems), finance (fraud detection, risk assessment), and retail (personalized recommendations, inventory management). The rising complexity of AI models and the need for more diverse and nuanced datasets are significant contributing factors to this growth. Furthermore, advancements in data annotation tools and techniques, such as active learning and synthetic data generation, are streamlining the data labeling process and making it more cost-effective.

However, challenges remain. Data privacy concerns and regulations like GDPR necessitate robust data security measures, adding to the cost and complexity of data collection and labeling. The shortage of skilled data annotators also hinders market growth, necessitating investments in training and upskilling programs. Despite these restraints, the market's inherent potential, coupled with ongoing technological advancements and increased industry investments, ensures sustained expansion in the coming years. Geographic distribution shows strong concentration in North America and Europe initially, but Asia-Pacific is poised for rapid growth due to increasing AI adoption and the availability of a large workforce. This makes strategic partnerships and global expansion crucial for market players aiming for long-term success.
https://www.marketresearchforecast.com/privacy-policy
The Data Annotation and Collection Services market is experiencing robust growth, driven by the increasing adoption of artificial intelligence (AI) and machine learning (ML) across diverse sectors. The market, estimated at $10 billion in 2025, is projected to achieve a Compound Annual Growth Rate (CAGR) of 25% from 2025 to 2033, reaching approximately $45 billion by 2033. This significant expansion is fueled by several key factors. The surge in autonomous driving initiatives necessitates high-quality data annotation for training self-driving systems, while the burgeoning smart healthcare sector relies heavily on annotated medical images and data for accurate diagnoses and treatment planning. Similarly, the growth of smart security systems and financial risk control applications demands precise data annotation for improved accuracy and efficiency. Image annotation currently dominates the market, followed by text annotation, reflecting the widespread use of computer vision and natural language processing. However, video and voice annotation segments are showing rapid growth, driven by advancements in AI-powered video analytics and voice recognition technologies. Competition is intense, with both established technology giants like Alibaba Cloud and Baidu, and specialized data annotation companies like Appen and Scale Labs vying for market share. Geographic distribution shows a strong concentration in North America and Europe initially, but Asia-Pacific is expected to emerge as a major growth region in the coming years, driven primarily by China and India's expanding technology sectors.

The market, however, faces certain challenges. The high cost of data annotation, particularly for complex tasks such as video annotation, can pose a barrier to entry for smaller companies. Ensuring data quality and accuracy remains a significant concern, requiring robust quality control mechanisms. Furthermore, ethical considerations surrounding data privacy and bias in algorithms require careful attention. To overcome these challenges, companies are investing in automation tools and techniques like synthetic data generation, alongside developing more sophisticated quality control measures. The future of the Data Annotation and Collection Services market will likely be shaped by advancements in AI and ML technologies, the increasing availability of diverse data sets, and the growing awareness of ethical considerations surrounding data usage.
The ImageNet dataset contains 14,197,122 annotated images organized according to the WordNet hierarchy. Since 2010, the dataset has been used in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), a benchmark in image classification and object detection. The publicly released dataset contains a set of manually annotated training images. A set of test images is also released, with the manual annotations withheld. ILSVRC annotations fall into one of two categories: (1) image-level annotation of a binary label for the presence or absence of an object class in the image, e.g., “there are cars in this image” but “there are no tigers,” and (2) object-level annotation of a tight bounding box and class label around an object instance in the image, e.g., “there is a screwdriver centered at position (20,25) with width of 50 pixels and height of 30 pixels”. The ImageNet project does not own the copyright of the images; therefore, only thumbnails and URLs of images are provided.
Total number of non-empty WordNet synsets: 21,841
Total number of images: 14,197,122
Number of images with bounding box annotations: 1,034,908
Number of synsets with SIFT features: 1,000
Number of images with SIFT features: 1.2 million
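For illustration only (this is not the official ILSVRC file format), the two annotation categories described above can be expressed as plain Python records; the identifiers and values below are hypothetical.

```python
# Image-level annotation: binary presence/absence of object classes.
image_level = {
    "image_id": "ILSVRC2012_val_00000001",        # hypothetical identifier
    "labels": {"car": True, "tiger": False},       # class present / absent
}

# Object-level annotation: class label plus a tight bounding box per instance.
object_level = {
    "image_id": "ILSVRC2012_val_00000002",        # hypothetical identifier
    "objects": [
        # box given as center (cx, cy) with width and height in pixels,
        # mirroring the screwdriver example in the description above
        {"class": "screwdriver", "cx": 20, "cy": 25, "width": 50, "height": 30},
    ],
}
```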
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Rapidata Video Generation Runway Alpha Human Preference
If you get value from this dataset and would like to see more in the future, please consider liking it.
This dataset was collected in ~1 hour total using the Rapidata Python API, accessible to anyone and ideal for large-scale data annotation.
Overview
In this dataset, ~30'000 human annotations were collected to evaluate Runway's Alpha video generation model on our benchmark. The up-to-date benchmark can… See the full description on the dataset page: https://huggingface.co/datasets/Rapidata/text-2-video-human-preferences-runway-alpha.
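A minimal way to pull the annotations for analysis, assuming the Hugging Face `datasets` library is installed; the repository id is the one linked above.

```python
from datasets import load_dataset

# Download the preference annotations from the Hugging Face Hub.
ds = load_dataset("Rapidata/text-2-video-human-preferences-runway-alpha")
print(ds)  # inspect the available splits and columns before any analysis
```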
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset includes (1) the gene structure annotation, repeat element annotation, CDS sequences, and peptide sequences of Haplotype B, which is described as the complete large-scale loach genome, and (2) the genome assembly and gene structure annotation of Haplotype A.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Our work focuses on providing a comprehensive dataset and benchmarks for evaluating gene ontology annotations using a unified system of Entrez Gene Identifiers.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is an enriched version of the Code4ML dataset, a large-scale corpus of annotated Python code snippets, competition summaries, and data descriptions sourced from Kaggle. The initial release includes approximately 2.5 million snippets of machine learning code extracted from around 100,000 Jupyter notebooks. A portion of these snippets has been manually annotated by human assessors through a custom-built, user-friendly interface designed for this task.
The original dataset is organized into multiple CSV files, each containing structured data on different entities:
Table 1. code_blocks.csv structure
| Column | Description |
| --- | --- |
| code_blocks_index | Global index linking code blocks to markup_data.csv. |
| kernel_id | Identifier for the Kaggle Jupyter notebook from which the code block was extracted. |
| code_block_id | Position of the code block within the notebook. |
| code_block | The actual machine learning code snippet. |
Table 2. kernels_meta.csv structure
| Column | Description |
| --- | --- |
| kernel_id | Identifier for the Kaggle Jupyter notebook. |
| kaggle_score | Performance metric of the notebook. |
| kaggle_comments | Number of comments on the notebook. |
| kaggle_upvotes | Number of upvotes the notebook received. |
| kernel_link | URL to the notebook. |
| comp_name | Name of the associated Kaggle competition. |
Table 3. competitions_meta.csv structure
| Column | Description |
| --- | --- |
| comp_name | Name of the Kaggle competition. |
| description | Overview of the competition task. |
| data_type | Type of data used in the competition. |
| comp_type | Classification of the competition. |
| subtitle | Short description of the task. |
| EvaluationAlgorithmAbbreviation | Metric used for assessing competition submissions. |
| data_sources | Links to datasets used. |
| metric type | Class label for the assessment metric. |
Table 4. markup_data.csv structure
| Column | Description |
| --- | --- |
| code_block | Machine learning code block. |
| too_long | Flag indicating whether the block spans multiple semantic types. |
| marks | Confidence level of the annotation. |
| graph_vertex_id | ID of the semantic type. |
The dataset allows mapping between these tables. For example, code_blocks.csv can be linked to kernels_meta.csv via the kernel_id column, and kernels_meta.csv to competitions_meta.csv via the comp_name column. To maintain quality, kernels_meta.csv includes only notebooks with available Kaggle scores. In addition, data_with_preds.csv contains automatically classified code blocks, with a mapping back to code_blocks.csv via the code_blocks_index column.
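As an illustrative sketch (assuming the CSV files have been downloaded to the working directory, and using the column names from Tables 1-3), the tables can be joined with pandas:

```python
import pandas as pd

# Load the three main tables (file names as listed above).
code_blocks = pd.read_csv("code_blocks.csv")
kernels_meta = pd.read_csv("kernels_meta.csv")
competitions_meta = pd.read_csv("competitions_meta.csv")

# Attach notebook metadata to each code block via kernel_id,
# then competition metadata via comp_name.
blocks_with_meta = code_blocks.merge(kernels_meta, on="kernel_id", how="left")
full = blocks_with_meta.merge(competitions_meta, on="comp_name", how="left")

print(full[["code_blocks_index", "kernel_id", "comp_name", "kaggle_score"]].head())
```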
The updated Code4ML 2.0 corpus introduces kernels extracted from Meta Kaggle Code. These kernels correspond to Kaggle competitions launched since 2020. The natural-language descriptions of the competitions are retrieved with the aid of an LLM.
Notebooks in kernels_meta2.csv may not have a Kaggle score but include a leaderboard ranking (rank), providing additional context for evaluation.
The Code4ML 2.0 corpus is a versatile resource, enabling training and evaluation of models in a variety of areas.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Rapidata Video Generation Rich Human Feedback Dataset
If you get value from this dataset and would like to see more in the future, please consider liking it.
This dataset was collected in ~4 hours total using the Rapidata Python API, accessible to anyone and ideal for large-scale data annotation.
Overview
In this dataset, ~22'000 human annotations were collected to evaluate AI-generated videos (using Sora) in 5 different categories.
Prompt - Video… See the full description on the dataset page: https://huggingface.co/datasets/Rapidata/text-2-video-Rich-Human-Feedback.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Large-scale imaging techniques are used increasingly for ecological surveys. However, manual analysis can be prohibitively expensive, creating a bottleneck between collected images and desired data products. This bottleneck is particularly severe for benthic surveys, where millions of images are obtained each year. Recent automated annotation methods may provide a solution, but reflectance images do not always contain sufficient information for adequate classification accuracy. In this work, the FluorIS, a low-cost modified consumer camera, was used to capture wide-band, wide-field-of-view fluorescence images during a field deployment in Eilat, Israel. The fluorescence images were registered with standard reflectance images, and an automated annotation method based on convolutional neural networks was developed. Our results demonstrate a 22% reduction in classification error rate when using both image types compared to using reflectance images alone. The improvements were particularly large for the coral reef genera Platygyra, Acropora, and Millepora, where classification recall improved by 38%, 33%, and 41%, respectively. We conclude that convolutional neural networks can be used to combine reflectance and fluorescence imagery in order to significantly improve automated annotation accuracy and reduce the manual annotation bottleneck.
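Purely as an illustration of the fusion idea (not the architecture or code from the paper), co-registered reflectance and fluorescence patches can be stacked into a six-channel input for a small convolutional network; the layer sizes and class count below are placeholders.

```python
import torch
import torch.nn as nn

class TwoModalityPatchClassifier(nn.Module):
    """Toy classifier that fuses reflectance and fluorescence patches by channel stacking."""

    def __init__(self, num_classes: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(6, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, reflectance, fluorescence):
        # Both inputs: (batch, 3, H, W), assumed already registered to each other.
        x = torch.cat([reflectance, fluorescence], dim=1)  # -> (batch, 6, H, W)
        return self.classifier(self.features(x).flatten(1))

# Example forward pass on random patches, e.g. for three genera of interest.
model = TwoModalityPatchClassifier(num_classes=3)
logits = model(torch.rand(4, 3, 64, 64), torch.rand(4, 3, 64, 64))
print(logits.shape)  # (4, 3)
```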
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Rapidata Video Generation Alibaba Wan2.1 Human Preference
If you get value from this dataset and would like to see more in the future, please consider liking it.
This dataset was collected in ~1 hour total using the Rapidata Python API, accessible to anyone and ideal for large-scale data annotation.
Overview
In this dataset, ~45'000 human annotations were collected to evaluate the Alibaba Wan 2.1 video generation model on our benchmark. The up-to-date benchmark… See the full description on the dataset page: https://huggingface.co/datasets/Rapidata/text-2-video-human-preferences-wan2.1.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Pro(tein)/Gene corpus was developed at the JULIE Lab Jena under supervision of Prof. Udo Hahn.
The goals of the annotation project were
The corpus has the following annotation levels / entity types:
For definitions of the annotation levels, please refer to the Proteins-guidelines-final.doc file that is found in the download package.
To achieve a large coverage of biological subdomains, documents from multiple other protein/gene corpora were re-annotated. For further coverage, new document sets were created. All documents are abstracts from PubMed/MEDLINE. The corpus is made up of the union of all the documents in the different subcorpora.
All documents are delivered as MMAX2 (http://mmax2.net/) annotation projects.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We present Audiovisual Moments in Time (AVMIT), a large-scale dataset of audiovisual action events. In an extensive annotation task, 11 participants labelled a subset of 3-second audiovisual videos from the Moments in Time dataset (MIT). For each trial, participants assessed whether the labelled audiovisual action event was present and whether it was the most prominent feature of the video. The dataset includes the annotation of 57,177 audiovisual videos, each independently evaluated by 3 of 11 trained participants. From this initial collection, we created a curated test set of 16 distinct action classes, with 60 videos each (960 videos). We also offer 2 sets of pre-computed audiovisual feature embeddings, using VGGish/YamNet for audio data and VGG16/EfficientNetB0 for visual data, thereby lowering the barrier to entry for audiovisual DNN research. We explored the advantages of AVMIT annotations and feature embeddings to improve performance on audiovisual event recognition. A series of 6 Recurrent Neural Networks (RNNs) were trained on either AVMIT-filtered audiovisual events or modality-agnostic events from MIT, and then tested on our audiovisual test set. In all RNNs, top-1 accuracy was increased by 2.71-5.94% by training exclusively on audiovisual events, even outweighing a three-fold increase in training data. Additionally, we introduce the Supervised Audiovisual Correspondence (SAVC) task, whereby a classifier must discern whether audio and visual streams correspond to the same action label. We trained 6 RNNs on the SAVC task, with or without AVMIT-filtering, to explore whether AVMIT is helpful for cross-modal learning. In all RNNs, accuracy improved by 2.09-19.16% with AVMIT-filtered data. We anticipate that the newly annotated AVMIT dataset will serve as a valuable resource for research and comparative experiments involving computational models and human participants, specifically when addressing research questions where audiovisual correspondence is of critical importance.
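As an illustrative sketch of the SAVC setup (not the released AVMIT code or feature files; the array shapes, class count, and pairing rule below are assumptions), matched and mismatched audio-visual pairs can be built like this:

```python
import numpy as np

rng = np.random.default_rng(0)
audio_emb = rng.normal(size=(960, 128))    # placeholder audio embeddings (VGGish-like size)
visual_emb = rng.normal(size=(960, 1280))  # placeholder visual embeddings (EfficientNetB0-like size)
labels = rng.integers(0, 16, size=960)     # 16 action classes, as in the curated test set

def make_savc_pairs(audio_emb, visual_emb, labels, rng):
    """Pair each audio clip with one same-label and one different-label video."""
    pairs, targets = [], []
    for i, label in enumerate(labels):
        same = rng.choice(np.flatnonzero(labels == label))
        diff = rng.choice(np.flatnonzero(labels != label))
        pairs.append((audio_emb[i], visual_emb[same])); targets.append(1)  # corresponding
        pairs.append((audio_emb[i], visual_emb[diff])); targets.append(0)  # non-corresponding
    return pairs, np.array(targets)

pairs, targets = make_savc_pairs(audio_emb, visual_emb, labels, rng)
print(len(pairs), targets.mean())  # balanced positive/negative pairs
```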
Additional information as well as the code used to prepare these data can be found at github.com/mckellardw/scMuscle.
These data comprise scMuscle v1.1 Seurat/CellChat Objects (see our github repository for other version tracking information).
https://www.wiseguyreports.com/pages/privacy-policy
| Attribute | Value |
| --- | --- |
| BASE YEAR | 2024 |
| HISTORICAL DATA | 2019 - 2024 |
| REPORT COVERAGE | Revenue Forecast, Competitive Landscape, Growth Factors, and Trends |
| MARKET SIZE 2023 | 3.49 (USD Billion) |
| MARKET SIZE 2024 | 4.02 (USD Billion) |
| MARKET SIZE 2032 | 12.47 (USD Billion) |
| SEGMENTS COVERED | Deployment Model, Industry Vertical, Application Type, Size of Workforce, Tier, Regional |
| COUNTRIES COVERED | North America, Europe, APAC, South America, MEA |
| KEY MARKET DYNAMICS | Increasing demand for cost-effective and agile solutions; growing adoption in various industries; rise of artificial intelligence and machine learning; technological advancements and platform enhancements; emergence of new business models and subscription services |
| MARKET FORECAST UNITS | USD Billion |
| KEY COMPANIES PROFILED | Hiveby, Coleman Research, Clickworker, CrowdSource by Microsoft, Udemy, Freelancer, Amazon Mechanical Turk, Fiverr, Crowdsource by Google Cloud, Skyword, Upwork, Toptal, Beondeck, 99designs, TaskRabbit |
| MARKET FORECAST PERIOD | 2024 - 2032 |
| KEY MARKET OPPORTUNITIES | Digital transformation; remote work; data collection; artificial intelligence innovation |
| COMPOUND ANNUAL GROWTH RATE (CAGR) | 15.21% (2024 - 2032) |
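As a quick, illustrative cross-check of the figures above (not part of the report itself), the implied compound annual growth rate can be recomputed from the 2024 and 2032 market sizes:

```python
# CAGR implied by growth from 4.02 to 12.47 USD billion over the 8-year window 2024-2032.
size_2024, size_2032, years = 4.02, 12.47, 8
cagr = (size_2032 / size_2024) ** (1 / years) - 1
print(f"Implied CAGR: {cagr:.2%}")  # ~15.2%, consistent with the reported 15.21%
```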
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Single-cell research faces challenges in accurately annotating cell types at high resolution, especially when dealing with large-scale datasets and rare cell populations. To address this, foundation models like scGPT offer flexible, scalable solutions by leveraging transformer-based architectures. This protocol provides a comprehensive guide to fine-tuning scGPT for cell-type classification in single-cell RNA sequencing (scRNA-seq) data. We demonstrate how to fine-tune scGPT on a custom retina dataset, highlighting the model's efficiency in handling complex data and improving annotation accuracy, achieving a 99.5% F1-score. The protocol automates key steps, including data preprocessing, model fine-tuning, and evaluation, and enables researchers to efficiently deploy scGPT on their own datasets. The provided tools, including a command-line script and a Jupyter Notebook, simplify customization and exploration of the model, offering an accessible workflow for users with minimal Python and Linux knowledge. The protocol thus provides an off-the-shelf solution for high-precision cell-type annotation with scGPT for researchers with intermediate bioinformatics skills.
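As a loose illustration of where such a fine-tuned model sits in a typical scRNA-seq workflow (the file name, label columns, and parameter values below are placeholders, not the protocol's own command-line script):

```python
import scanpy as sc
from sklearn.metrics import f1_score

# Standard preprocessing of an assumed input AnnData file before fine-tuning.
adata = sc.read_h5ad("retina_scrnaseq.h5ad")           # placeholder file name
sc.pp.normalize_total(adata, target_sum=1e4)            # library-size normalization
sc.pp.log1p(adata)                                      # log-transform counts
sc.pp.highly_variable_genes(adata, n_top_genes=2000)    # keep informative genes

# Evaluation: compare predicted cell types against curated labels,
# here assumed to be stored as columns in adata.obs.
y_true = adata.obs["cell_type"]        # placeholder ground-truth column
y_pred = adata.obs["predicted_type"]   # placeholder column written after fine-tuning
print("macro F1:", f1_score(y_true, y_pred, average="macro"))
```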
Community database that collects and integrates the gene expression information in MGI, with a primary emphasis on endogenous gene expression during mouse development. The data in GXD are obtained from the literature, from individual laboratories, and from large-scale data providers. All data are annotated and reviewed by GXD curators. GXD stores and integrates different types of expression data (RNA in situ hybridization; immunohistochemistry; in situ reporter (knock-in); RT-PCR; Northern and Western blots; and RNase and Nuclease S1 protection assays) and makes these data freely available in formats appropriate for comprehensive analysis.

GXD also maintains an index of the literature examining gene expression in the embryonic mouse. It is comprehensive and up-to-date, containing all pertinent journal articles from 1993 to the present and articles from major developmental journals from 1990 to the present. GXD stores primary data from different types of expression assays and, by integrating these data as they accumulate, provides increasingly complete information about the expression profiles of transcripts and proteins in different mouse strains and mutants.

GXD describes expression patterns using an extensive, hierarchically structured dictionary of anatomical terms. In this way, expression results from assays with differing spatial resolution are recorded in a standardized and integrated manner, and expression patterns can be queried at different levels of detail. The records are complemented with digitized images of the original expression data. The Anatomical Dictionary for Mouse Development has been developed by our Edinburgh colleagues as part of the joint Mouse Gene Expression Information Resource project.

GXD places the gene expression data in the larger biological context by establishing and maintaining interconnections with many other resources. Integration with MGD enables a combined analysis of genotype, sequence, expression, and phenotype data. Links to PubMed, Online Mendelian Inheritance in Man (OMIM), sequence databases, and databases from other species further enhance the utility of GXD. GXD accepts both published and unpublished data.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Site-specific proteolytic processing is an important, irreversible post-translational protein modification with implications in many diseases. Enrichment of protein N-terminal peptides followed by mass spectrometry-based identification and quantification enables proteome-wide characterization of proteolytic processes and protease substrates but is challenged by the lack of specific annotation tools. A common problem is, for example, ambiguous matches of identified peptides to multiple protein entries in the databases used for identification. We developed MaxQuant Advanced N-termini Interpreter (MANTI), a standalone Perl software with an optional graphical user interface that validates and annotates N-terminal peptides identified by database searches with the popular MaxQuant software package by integrating information from multiple data sources. MANTI utilizes diverse annotation information in a multistep decision process to assign a conservative preferred protein entry for each N-terminal peptide, enabling automated classification according to the likely origin, and determines significant changes in N-terminal peptide abundance. Auxiliary R scripts included in the software package summarize and visualize key aspects of the data. To showcase the utility of MANTI, we generated two large-scale TAILS N-terminome data sets from two different animal models of chemically and genetically induced kidney disease: puromycin aminonucleoside-treated rats (PAN) and heterozygous Wilms Tumor protein 1 mice (WT1). MANTI enabled rapid validation and autonomous annotation of >10,000 identified terminal peptides, revealing novel proteolytic proteoforms in 905 and 644 proteins, respectively. Quantitative analysis indicated that proteolytic activities with similar sequence specificity are involved in the pathogenesis of kidney injury and proteinuria in both models, whereas coagulation processes and complement activation were specifically induced after chemical injury.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Rapidata Video Generation Luma Ray2 Human Preference
If you get value from this dataset and would like to see more in the future, please consider liking it.
This dataset was collected in ~1 hour total using the Rapidata Python API, accessible to anyone and ideal for large-scale data annotation.
Overview
In this dataset, ~45'000 human annotations were collected to evaluate Luma's Ray 2 video generation model on our benchmark. The up-to-date benchmark can be… See the full description on the dataset page: https://huggingface.co/datasets/Rapidata/text-2-video-human-preferences-luma-ray2.