Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This is the accompanying data for the paper "Analyzing Dataset Annotation Quality Management in the Wild". Data quality is crucial for training accurate, unbiased, and trustworthy machine learning models and for their correct evaluation. Recent works, however, have shown that even popular datasets used to train and evaluate state-of-the-art models contain a non-negligible amount of erroneous annotations, bias, or annotation artifacts. Best practices and guidelines for annotation projects do exist, but to the best of our knowledge, no large-scale analysis has yet been performed on how quality management is actually conducted when creating natural language datasets and whether these recommendations are followed. Therefore, we first survey and summarize recommended quality management practices for dataset creation as described in the literature and provide suggestions on how to apply them. Then, we compile a corpus of 591 scientific publications introducing text datasets and annotate it for quality-related aspects, such as annotator management, agreement, adjudication, or data validation. Using these annotations, we then analyze how quality management is conducted in practice. We find that a majority of the annotated publications apply good or very good quality management. However, we deem the effort of 30% of the works to be only subpar. Our analysis also shows common errors, especially with using inter-annotator agreement and computing annotation error rates.
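As a purely illustrative aside (not part of the released data or the paper's analysis), chance-corrected inter-annotator agreement of the kind the paper examines can be computed with scikit-learn's cohen_kappa_score; the two annotator label lists below are made up.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical labels from two annotators on the same eight items.
annotator_a = ["PER", "ORG", "O", "PER", "O", "LOC", "O", "ORG"]
annotator_b = ["PER", "ORG", "O", "O",   "O", "LOC", "PER", "ORG"]

# Cohen's kappa corrects raw percent agreement for agreement expected by chance.
kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")
```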
https://www.marketresearchforecast.com/privacy-policy
The data collection and labeling market is experiencing robust growth, fueled by the escalating demand for high-quality training data in artificial intelligence (AI) and machine learning (ML) applications. The market, estimated at $15 billion in 2025, is projected to achieve a Compound Annual Growth Rate (CAGR) of 25% over the forecast period (2025-2033), reaching approximately $75 billion by 2033. This expansion is primarily driven by the increasing adoption of AI across diverse sectors, including healthcare (medical image analysis, drug discovery), automotive (autonomous driving systems), finance (fraud detection, risk assessment), and retail (personalized recommendations, inventory management). The rising complexity of AI models and the need for more diverse and nuanced datasets are significant contributing factors to this growth. Furthermore, advancements in data annotation tools and techniques, such as active learning and synthetic data generation, are streamlining the data labeling process and making it more cost-effective.

However, challenges remain. Data privacy concerns and regulations like GDPR necessitate robust data security measures, adding to the cost and complexity of data collection and labeling. The shortage of skilled data annotators also hinders market growth, necessitating investments in training and upskilling programs. Despite these restraints, the market's inherent potential, coupled with ongoing technological advancements and increased industry investments, ensures sustained expansion in the coming years. Geographic distribution shows strong concentration in North America and Europe initially, but Asia-Pacific is poised for rapid growth due to increasing AI adoption and the availability of a large workforce. This makes strategic partnerships and global expansion crucial for market players aiming for long-term success.
https://www.marketresearchforecast.com/privacy-policy
The Data Annotation and Collection Services market is experiencing robust growth, driven by the increasing adoption of artificial intelligence (AI) and machine learning (ML) across diverse sectors. The market, estimated at $10 billion in 2025, is projected to achieve a Compound Annual Growth Rate (CAGR) of 25% from 2025 to 2033, reaching approximately $45 billion by 2033. This significant expansion is fueled by several key factors. The surge in autonomous driving initiatives necessitates high-quality data annotation for training self-driving systems, while the burgeoning smart healthcare sector relies heavily on annotated medical images and data for accurate diagnoses and treatment planning. Similarly, the growth of smart security systems and financial risk control applications demands precise data annotation for improved accuracy and efficiency. Image annotation currently dominates the market, followed by text annotation, reflecting the widespread use of computer vision and natural language processing. However, video and voice annotation segments are showing rapid growth, driven by advancements in AI-powered video analytics and voice recognition technologies. Competition is intense, with both established technology giants like Alibaba Cloud and Baidu, and specialized data annotation companies like Appen and Scale Labs vying for market share. Geographic distribution shows a strong concentration in North America and Europe initially, but Asia-Pacific is expected to emerge as a major growth region in the coming years, driven primarily by China and India's expanding technology sectors.

The market, however, faces certain challenges. The high cost of data annotation, particularly for complex tasks such as video annotation, can pose a barrier to entry for smaller companies. Ensuring data quality and accuracy remains a significant concern, requiring robust quality control mechanisms. Furthermore, ethical considerations surrounding data privacy and bias in algorithms require careful attention. To overcome these challenges, companies are investing in automation tools and techniques like synthetic data generation, alongside developing more sophisticated quality control measures. The future of the Data Annotation and Collection Services market will likely be shaped by advancements in AI and ML technologies, the increasing availability of diverse data sets, and the growing awareness of ethical considerations surrounding data usage.
The ImageNet dataset contains 14,197,122 annotated images organized according to the WordNet hierarchy. Since 2010, the dataset has been used in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), a benchmark in image classification and object detection. The publicly released dataset contains a set of manually annotated training images. A set of test images is also released, with the manual annotations withheld. ILSVRC annotations fall into one of two categories: (1) image-level annotation of a binary label for the presence or absence of an object class in the image, e.g., “there are cars in this image” but “there are no tigers,” and (2) object-level annotation of a tight bounding box and class label around an object instance in the image, e.g., “there is a screwdriver centered at position (20,25) with width of 50 pixels and height of 30 pixels”. The ImageNet project does not own the copyright of the images; therefore, only thumbnails and URLs of images are provided.
Total number of non-empty WordNet synsets: 21,841
Total number of images: 14,197,122
Number of images with bounding box annotations: 1,034,908
Number of synsets with SIFT features: 1,000
Number of images with SIFT features: 1.2 million
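For illustration only (this is not the official ILSVRC file format), the two annotation categories described above can be expressed as plain Python records; the identifiers and values below are hypothetical.

```python
# Image-level annotation: binary presence/absence of object classes.
image_level = {
    "image_id": "ILSVRC2012_val_00000001",        # hypothetical identifier
    "labels": {"car": True, "tiger": False},       # class present / absent
}

# Object-level annotation: class label plus a tight bounding box per instance.
object_level = {
    "image_id": "ILSVRC2012_val_00000002",        # hypothetical identifier
    "objects": [
        # box given as center (cx, cy) with width and height in pixels,
        # mirroring the screwdriver example in the description above
        {"class": "screwdriver", "cx": 20, "cy": 25, "width": 50, "height": 30},
    ],
}
```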
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Rapidata Video Generation Runway Alpha Human Preference
If you get value from this dataset and would like to see more in the future, please consider liking it.
This dataset was collected in ~1 hour total using the Rapidata Python API, accessible to anyone and ideal for large-scale data annotation.
Overview
In this dataset, ~30'000 human annotations were collected to evaluate Runway's Alpha video generation model on our benchmark. The up-to-date benchmark can… See the full description on the dataset page: https://huggingface.co/datasets/Rapidata/text-2-video-human-preferences-runway-alpha.
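A minimal way to pull the annotations for analysis, assuming the Hugging Face `datasets` library is installed; the repository id is the one linked above.

```python
from datasets import load_dataset

# Download the preference annotations from the Hugging Face Hub.
ds = load_dataset("Rapidata/text-2-video-human-preferences-runway-alpha")
print(ds)  # inspect the available splits and columns before any analysis
```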
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset includes (1) the gene structure annotation, repeat element annotation, CDS sequences, and peptide sequences of Haplotype B, which is described as the complete large-scale loach genome, and (2) the genome assembly and gene structure annotation of Haplotype A.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Our work focuses on providing a comprehensive dataset and benchmarks for evaluating gene ontology annotations using a unified system of Entrez Gene Identifiers.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is an enriched version of the Code4ML dataset, a large-scale corpus of annotated Python code snippets, competition summaries, and data descriptions sourced from Kaggle. The initial release includes approximately 2.5 million snippets of machine learning code extracted from around 100,000 Jupyter notebooks. A portion of these snippets has been manually annotated by human assessors through a custom-built, user-friendly interface designed for this task.
The original dataset is organized into multiple CSV files, each containing structured data on different entities:
Table 1. code_blocks.csv structure
| Column | Description |
| --- | --- |
| code_blocks_index | Global index linking code blocks to markup_data.csv. |
| kernel_id | Identifier for the Kaggle Jupyter notebook from which the code block was extracted. |
| code_block_id | Position of the code block within the notebook. |
| code_block | The actual machine learning code snippet. |
Table 2. kernels_meta.csv structure
| Column | Description |
| --- | --- |
| kernel_id | Identifier for the Kaggle Jupyter notebook. |
| kaggle_score | Performance metric of the notebook. |
| kaggle_comments | Number of comments on the notebook. |
| kaggle_upvotes | Number of upvotes the notebook received. |
| kernel_link | URL to the notebook. |
| comp_name | Name of the associated Kaggle competition. |
Table 3. competitions_meta.csv structure
| Column | Description |
| --- | --- |
| comp_name | Name of the Kaggle competition. |
| description | Overview of the competition task. |
| data_type | Type of data used in the competition. |
| comp_type | Classification of the competition. |
| subtitle | Short description of the task. |
| EvaluationAlgorithmAbbreviation | Metric used for assessing competition submissions. |
| data_sources | Links to datasets used. |
| metric type | Class label for the assessment metric. |
Table 4. markup_data.csv structure
| Column | Description |
| --- | --- |
| code_block | Machine learning code block. |
| too_long | Flag indicating whether the block spans multiple semantic types. |
| marks | Confidence level of the annotation. |
| graph_vertex_id | ID of the semantic type. |
The dataset allows mapping between these tables. For example, code_blocks.csv can be linked to kernels_meta.csv via the kernel_id column, and kernels_meta.csv to competitions_meta.csv via the comp_name column. To maintain quality, kernels_meta.csv includes only notebooks with available Kaggle scores. In addition, data_with_preds.csv contains automatically classified code blocks, with a mapping back to code_blocks.csv via the code_blocks_index column.
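As an illustrative sketch (assuming the CSV files have been downloaded to the working directory, and using the column names from Tables 1-3), the tables can be joined with pandas:

```python
import pandas as pd

# Load the three main tables (file names as listed above).
code_blocks = pd.read_csv("code_blocks.csv")
kernels_meta = pd.read_csv("kernels_meta.csv")
competitions_meta = pd.read_csv("competitions_meta.csv")

# Attach notebook metadata to each code block via kernel_id,
# then competition metadata via comp_name.
blocks_with_meta = code_blocks.merge(kernels_meta, on="kernel_id", how="left")
full = blocks_with_meta.merge(competitions_meta, on="comp_name", how="left")

print(full[["code_blocks_index", "kernel_id", "comp_name", "kaggle_score"]].head())
```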
The updated Code4ML 2.0 corpus introduces kernels extracted from Meta Kaggle Code. These kernels correspond to Kaggle competitions launched since 2020. The natural-language descriptions of the competitions are retrieved with the aid of an LLM.
Notebooks in kernels_meta2.csv may not have a Kaggle score but include a leaderboard ranking (rank), providing additional context for evaluation.
The Code4ML 2.0 corpus is a versatile resource, enabling training and evaluation of models in a variety of areas.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Rapidata Video Generation Rich Human Feedback Dataset
If you get value from this dataset and would like to see more in the future, please consider liking it.
This dataset was collected in ~4 hours total using the Rapidata Python API, accessible to anyone and ideal for large-scale data annotation.
Overview
In this dataset, ~22'000 human annotations were collected to evaluate AI-generated videos (using Sora) in 5 different categories.
Prompt - Video… See the full description on the dataset page: https://huggingface.co/datasets/Rapidata/text-2-video-Rich-Human-Feedback.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Large-scale imaging techniques are used increasingly for ecological surveys. However, manual analysis can be prohibitively expensive, creating a bottleneck between collected images and desired data products. This bottleneck is particularly severe for benthic surveys, where millions of images are obtained each year. Recent automated annotation methods may provide a solution, but reflectance images do not always contain sufficient information for adequate classification accuracy. In this work, the FluorIS, a low-cost modified consumer camera, was used to capture wide-band, wide-field-of-view fluorescence images during a field deployment in Eilat, Israel. The fluorescence images were registered with standard reflectance images, and an automated annotation method based on convolutional neural networks was developed. Our results demonstrate a 22% reduction in classification error rate when using both image types compared to using reflectance images alone. The improvements were particularly large for the coral reef genera Platygyra, Acropora, and Millepora, where classification recall improved by 38%, 33%, and 41%, respectively. We conclude that convolutional neural networks can be used to combine reflectance and fluorescence imagery in order to significantly improve automated annotation accuracy and reduce the manual annotation bottleneck.
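Purely as an illustration of the fusion idea (not the architecture or code from the paper), co-registered reflectance and fluorescence patches can be stacked into a six-channel input for a small convolutional network; the layer sizes and class count below are placeholders.

```python
import torch
import torch.nn as nn

class TwoModalityPatchClassifier(nn.Module):
    """Toy classifier that fuses reflectance and fluorescence patches by channel stacking."""

    def __init__(self, num_classes: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(6, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, reflectance, fluorescence):
        # Both inputs: (batch, 3, H, W), assumed already registered to each other.
        x = torch.cat([reflectance, fluorescence], dim=1)  # -> (batch, 6, H, W)
        return self.classifier(self.features(x).flatten(1))

# Example forward pass on random patches, e.g. for three genera of interest.
model = TwoModalityPatchClassifier(num_classes=3)
logits = model(torch.rand(4, 3, 64, 64), torch.rand(4, 3, 64, 64))
print(logits.shape)  # (4, 3)
```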
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Rapidata Video Generation Alibaba Wan2.1 Human Preference
If you get value from this dataset and would like to see more in the future, please consider liking it.
This dataset was collected in ~1 hour total using the Rapidata Python API, accessible to anyone and ideal for large-scale data annotation.
Overview
In this dataset, ~45'000 human annotations were collected to evaluate the Alibaba Wan 2.1 video generation model on our benchmark. The up-to-date benchmark… See the full description on the dataset page: https://huggingface.co/datasets/Rapidata/text-2-video-human-preferences-wan2.1.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Pro(tein)/Gene corpus was developed at the JULIE Lab Jena under supervision of Prof. Udo Hahn.
The goals of the annotation project were
The corpus has the following annotation levels / entity types:
For definitions of the annotation levels, please refer to the Proteins-guidelines-final.doc file that is found in the download package.
To achieve a large coverage of biological subdomains, documents from multiple other protein/gene corpora were re-annotated. For further coverage, new document sets were created. All documents are abstracts from PubMed/MEDLINE. The corpus is made up of the union of all the documents in the different subcorpora.
All documents are delivered as MMAX2 (http://mmax2.net/) annotation projects.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We present Audiovisual Moments in Time (AVMIT), a large-scale dataset of audiovisual action events. In an extensive annotation task, 11 participants labelled a subset of 3-second audiovisual videos from the Moments in Time dataset (MIT). For each trial, participants assessed whether the labelled audiovisual action event was present and whether it was the most prominent feature of the video. The dataset includes the annotation of 57,177 audiovisual videos, each independently evaluated by 3 of 11 trained participants. From this initial collection, we created a curated test set of 16 distinct action classes, with 60 videos each (960 videos). We also offer 2 sets of pre-computed audiovisual feature embeddings, using VGGish/YamNet for audio data and VGG16/EfficientNetB0 for visual data, thereby lowering the barrier to entry for audiovisual DNN research. We explored the advantages of AVMIT annotations and feature embeddings to improve performance on audiovisual event recognition. A series of 6 Recurrent Neural Networks (RNNs) were trained on either AVMIT-filtered audiovisual events or modality-agnostic events from MIT, and then tested on our audiovisual test set. In all RNNs, top-1 accuracy was increased by 2.71-5.94% by training exclusively on audiovisual events, even outweighing a three-fold increase in training data. Additionally, we introduce the Supervised Audiovisual Correspondence (SAVC) task, whereby a classifier must discern whether audio and visual streams correspond to the same action label. We trained 6 RNNs on the SAVC task, with or without AVMIT-filtering, to explore whether AVMIT is helpful for cross-modal learning. In all RNNs, accuracy improved by 2.09-19.16% with AVMIT-filtered data. We anticipate that the newly annotated AVMIT dataset will serve as a valuable resource for research and comparative experiments involving computational models and human participants, specifically when addressing research questions where audiovisual correspondence is of critical importance.
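As an illustrative sketch of the SAVC setup (not the released AVMIT code or feature files; the array shapes, class count, and pairing rule below are assumptions), matched and mismatched audio-visual pairs can be built like this:

```python
import numpy as np

rng = np.random.default_rng(0)
audio_emb = rng.normal(size=(960, 128))    # placeholder audio embeddings (VGGish-like size)
visual_emb = rng.normal(size=(960, 1280))  # placeholder visual embeddings (EfficientNetB0-like size)
labels = rng.integers(0, 16, size=960)     # 16 action classes, as in the curated test set

def make_savc_pairs(audio_emb, visual_emb, labels, rng):
    """Pair each audio clip with one same-label and one different-label video."""
    pairs, targets = [], []
    for i, label in enumerate(labels):
        same = rng.choice(np.flatnonzero(labels == label))
        diff = rng.choice(np.flatnonzero(labels != label))
        pairs.append((audio_emb[i], visual_emb[same])); targets.append(1)  # corresponding
        pairs.append((audio_emb[i], visual_emb[diff])); targets.append(0)  # non-corresponding
    return pairs, np.array(targets)

pairs, targets = make_savc_pairs(audio_emb, visual_emb, labels, rng)
print(len(pairs), targets.mean())  # balanced positive/negative pairs
```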
Additional information as well as the code used to prepare these data can be found at github.com/mckellardw/scMuscle.
These data comprise scMuscle v1.1 Seurat/CellChat Objects (see our github repository for other version tracking information).
https://www.wiseguyreports.com/pages/privacy-policy
| Attribute | Value |
| --- | --- |
| BASE YEAR | 2024 |
| HISTORICAL DATA | 2019 - 2024 |
| REPORT COVERAGE | Revenue Forecast, Competitive Landscape, Growth Factors, and Trends |
| MARKET SIZE 2023 | 3.49 (USD Billion) |
| MARKET SIZE 2024 | 4.02 (USD Billion) |
| MARKET SIZE 2032 | 12.47 (USD Billion) |
| SEGMENTS COVERED | Deployment Model, Industry Vertical, Application Type, Size of Workforce, Tier, Regional |
| COUNTRIES COVERED | North America, Europe, APAC, South America, MEA |
| KEY MARKET DYNAMICS | Increasing demand for cost-effective and agile solutions; growing adoption in various industries; rise of artificial intelligence and machine learning; technological advancements and platform enhancements; emergence of new business models and subscription services |
| MARKET FORECAST UNITS | USD Billion |
| KEY COMPANIES PROFILED | Hiveby, Coleman Research, Clickworker, CrowdSource by Microsoft, Udemy, Freelancer, Amazon Mechanical Turk, Fiverr, Crowdsource by Google Cloud, Skyword, Upwork, Toptal, Beondeck, 99designs, TaskRabbit |
| MARKET FORECAST PERIOD | 2024 - 2032 |
| KEY MARKET OPPORTUNITIES | Digital transformation; remote work; data collection; artificial intelligence innovation |
| COMPOUND ANNUAL GROWTH RATE (CAGR) | 15.21% (2024 - 2032) |
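As a quick, illustrative cross-check of the figures above (not part of the report itself), the implied compound annual growth rate can be recomputed from the 2024 and 2032 market sizes:

```python
# CAGR implied by growth from 4.02 to 12.47 USD billion over the 8-year window 2024-2032.
size_2024, size_2032, years = 4.02, 12.47, 8
cagr = (size_2032 / size_2024) ** (1 / years) - 1
print(f"Implied CAGR: {cagr:.2%}")  # ~15.2%, consistent with the reported 15.21%
```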
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Single-cell research faces challenges in accurately annotating cell types at high resolution, especially when dealing with large-scale datasets and rare cell populations. To address this, foundation models like scGPT offer flexible, scalable solutions by leveraging transformer-based architectures. This protocol provides a comprehensive guide to fine-tuning scGPT for cell-type classification in single-cell RNA sequencing (scRNA-seq) data. We demonstrate how to fine-tune scGPT on a custom retina dataset, highlighting the model's efficiency in handling complex data and improving annotation accuracy, achieving a 99.5% F1-score. The protocol automates key steps, including data preprocessing, model fine-tuning, and evaluation, and enables researchers to efficiently deploy scGPT on their own datasets. The provided tools, including a command-line script and a Jupyter Notebook, simplify customization and exploration of the model, offering an accessible workflow for users with minimal Python and Linux knowledge. The protocol thus provides an off-the-shelf solution for high-precision cell-type annotation with scGPT for researchers with intermediate bioinformatics skills.
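As a loose illustration of where such a fine-tuned model sits in a typical scRNA-seq workflow (the file name, label columns, and parameter values below are placeholders, not the protocol's own command-line script):

```python
import scanpy as sc
from sklearn.metrics import f1_score

# Standard preprocessing of an assumed input AnnData file before fine-tuning.
adata = sc.read_h5ad("retina_scrnaseq.h5ad")           # placeholder file name
sc.pp.normalize_total(adata, target_sum=1e4)            # library-size normalization
sc.pp.log1p(adata)                                      # log-transform counts
sc.pp.highly_variable_genes(adata, n_top_genes=2000)    # keep informative genes

# Evaluation: compare predicted cell types against curated labels,
# here assumed to be stored as columns in adata.obs.
y_true = adata.obs["cell_type"]        # placeholder ground-truth column
y_pred = adata.obs["predicted_type"]   # placeholder column written after fine-tuning
print("macro F1:", f1_score(y_true, y_pred, average="macro"))
```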
Community database that collects and integrates the gene expression information in MGI, with a primary emphasis on endogenous gene expression during mouse development. The data in GXD are obtained from the literature, from individual laboratories, and from large-scale data providers. All data are annotated and reviewed by GXD curators. GXD stores and integrates different types of expression data (RNA in situ hybridization; immunohistochemistry; in situ reporter (knock-in); RT-PCR; Northern and Western blots; and RNase and Nuclease S1 protection assays) and makes these data freely available in formats appropriate for comprehensive analysis.

GXD also maintains an index of the literature examining gene expression in the embryonic mouse. It is comprehensive and up-to-date, containing all pertinent journal articles from 1993 to the present and articles from major developmental journals from 1990 to the present. GXD stores primary data from different types of expression assays and, by integrating these data as they accumulate, provides increasingly complete information about the expression profiles of transcripts and proteins in different mouse strains and mutants.

GXD describes expression patterns using an extensive, hierarchically structured dictionary of anatomical terms. In this way, expression results from assays with differing spatial resolution are recorded in a standardized and integrated manner, and expression patterns can be queried at different levels of detail. The records are complemented with digitized images of the original expression data. The Anatomical Dictionary for Mouse Development has been developed by our Edinburgh colleagues as part of the joint Mouse Gene Expression Information Resource project.

GXD places the gene expression data in the larger biological context by establishing and maintaining interconnections with many other resources. Integration with MGD enables a combined analysis of genotype, sequence, expression, and phenotype data. Links to PubMed, Online Mendelian Inheritance in Man (OMIM), sequence databases, and databases from other species further enhance the utility of GXD. GXD accepts both published and unpublished data.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Site-specific proteolytic processing is an important, irreversible post-translational protein modification with implications in many diseases. Enrichment of protein N-terminal peptides followed by mass spectrometry-based identification and quantification enables proteome-wide characterization of proteolytic processes and protease substrates but is challenged by the lack of specific annotation tools. A common problem is, for example, ambiguous matches of identified peptides to multiple protein entries in the databases used for identification. We developed MaxQuant Advanced N-termini Interpreter (MANTI), a standalone Perl software with an optional graphical user interface that validates and annotates N-terminal peptides identified by database searches with the popular MaxQuant software package by integrating information from multiple data sources. MANTI utilizes diverse annotation information in a multistep decision process to assign a conservative preferred protein entry for each N-terminal peptide, enabling automated classification according to the likely origin, and determines significant changes in N-terminal peptide abundance. Auxiliary R scripts included in the software package summarize and visualize key aspects of the data. To showcase the utility of MANTI, we generated two large-scale TAILS N-terminome data sets from two different animal models of chemically and genetically induced kidney disease: puromycin aminonucleoside-treated rats (PAN) and heterozygous Wilms Tumor protein 1 mice (WT1). MANTI enabled rapid validation and autonomous annotation of >10,000 identified terminal peptides, revealing novel proteolytic proteoforms in 905 and 644 proteins, respectively. Quantitative analysis indicated that proteolytic activities with similar sequence specificity are involved in the pathogenesis of kidney injury and proteinuria in both models, whereas coagulation processes and complement activation were specifically induced after chemical injury.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Rapidata Video Generation Luma Ray2 Human Preference
If you get value from this dataset and would like to see more in the future, please consider liking it.
This dataset was collected in ~1 hour total using the Rapidata Python API, accessible to anyone and ideal for large-scale data annotation.
Overview
In this dataset, ~45'000 human annotations were collected to evaluate Luma's Ray 2 video generation model on our benchmark. The up-to-date benchmark can be… See the full description on the dataset page: https://huggingface.co/datasets/Rapidata/text-2-video-human-preferences-luma-ray2.