100+ datasets found

G
Data Labeling Market Research Report 2033
growthmarketreports.com
csv, pdf, pptx
Updated Sep 1, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Growth Market Reports (2025). Data Labeling Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/data-labeling-market
Explore at:
pdf, pptx, csvAvailable download formats
Dataset updated
Sep 1, 2025
Dataset authored and provided by
Growth Market Reports
Time period covered
2024 - 2032
Area covered
Global
Description
Data Labeling Market Outlook

According to our latest research, the global data labeling market size reached USD 3.2 billion in 2024, driven by the explosive growth in artificial intelligence and machine learning applications across industries. The market is poised to expand at a CAGR of 22.8% from 2025 to 2033, and is forecasted to reach USD 25.3 billion by 2033. This robust growth is primarily fueled by the increasing demand for high-quality annotated data to train advanced AI models, the proliferation of automation in business processes, and the rising adoption of data-driven decision-making frameworks in both the public and private sectors.

One of the principal growth drivers for the data labeling market is the accelerating integration of AI and machine learning technologies across various industries, including healthcare, automotive, retail, and BFSI. As organizations strive to leverage AI for enhanced customer experiences, predictive analytics, and operational efficiency, the need for accurately labeled datasets has become paramount. Data labeling ensures that AI algorithms can learn from well-annotated examples, thereby improving model accuracy and reliability. The surge in demand for computer vision applicationsÂ—such as facial recognition, autonomous vehicles, and medical imagingÂ—has particularly heightened the need for image and video data labeling, further propelling market growth.

Another significant factor contributing to the expansion of the data labeling market is the rapid digitization of business processes and the exponential growth in unstructured data. Enterprises are increasingly investing in data annotation tools and platforms to extract actionable insights from large volumes of text, audio, and video data. The proliferation of Internet of Things (IoT) devices and the widespread adoption of cloud computing have further amplified data generation, necessitating scalable and efficient data labeling solutions. Additionally, the rise of semi-automated and automated labeling technologies, powered by AI-assisted tools, is reducing manual effort and accelerating the annotation process, thereby enabling organizations to meet the growing demand for labeled data at scale.

The evolving regulatory landscape and the emphasis on data privacy and security are also playing a crucial role in shaping the data labeling market. As governments worldwide introduce stringent data protection regulations, organizations are turning to specialized data labeling service providers that adhere to compliance standards. This trend is particularly pronounced in sectors such as healthcare and BFSI, where the accuracy and confidentiality of labeled data are critical. Furthermore, the increasing outsourcing of data labeling tasks to specialized vendors in emerging economies is enabling organizations to access skilled labor at lower costs, further fueling market expansion.

From a regional perspective, North America currently dominates the data labeling market, followed by Europe and the Asia Pacific. The presence of major technology companies, robust investments in AI research, and the early adoption of advanced analytics solutions have positioned North America as the market leader. However, the Asia Pacific region is expected to witness the fastest growth during the forecast period, driven by the rapid digital transformation in countries like China, India, and Japan. The growing focus on AI innovation, government initiatives to promote digitalization, and the availability of a large pool of skilled annotators are key factors contributing to the regionÂ’s impressive growth trajectory.

In the realm of security, Video Dataset Labeling for Security has emerged as a critical application area within the data labeling market. As surveillance systems become more sophisticated, the need for accurately labeled video data is paramount to ensure the effectiveness of security measures. Video dataset labeling involves annotating video frames to identify and track objects, behaviors, and anomalies, which are essential for developing intelligent security systems capable of real-time threat detection and response. This process not only enhances the accuracy of security algorithms but also aids in the training of AI models that can predict and prevent potential security breaches. The growing emphasis on public safety and
Face Detection - Face Recognition Dataset
kaggle.com
zip
Updated Nov 8, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Unique Data (2023). Face Detection - Face Recognition Dataset [Dataset]. https://www.kaggle.com/datasets/trainingdatapro/face-detection-photos-and-labels
Explore at:
zip(1252666206 bytes)Available download formats
Dataset updated
Nov 8, 2023
Authors
Unique Data
License
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0)https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Description
Face Detection - Object Detection & Face Recognition Dataset

The dataset is created on the basis of Selfies and ID Dataset

The dataset is a collection of images (selfies) of people and bounding box labeling for their faces. It has been specifically curated for face detection and face recognition tasks. The dataset encompasses diverse demographics, age, ethnicities, and genders.

https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F12421376%2F01348572e2ae2836f10bc2f2da381009%2FFrame%2050%20(1).png?generation=1699439342545305&alt=media" alt="">

The dataset is a valuable resource for researchers, developers, and organizations working on age prediction and face recognition to train, evaluate, and fine-tune AI models for real-world applications. It can be applied in various domains like psychology, market research, and personalized advertising.

👉 Legally sourced datasets and carefully structured for AI training and model development. Explore samples from our dataset of 95,000+ human images & videos - Full dataset

Metadata for the full dataset:

assignment_id - unique identifier of the media file

worker_id - unique identifier of the person

age - age of the person

true_gender - gender of the person

country - country of the person

ethnicity - ethnicity of the person

photo_1_extension, photo_2_extension, …, photo_15_extension - photo extensions in the dataset

photo_1_resolution, photo_2_resolution, …, photo_15_resolution - photo resolution in the dataset

OTHER BIOMETRIC DATASETS:

Anti Spoofing Real Dataset

Antispoofing Replay Dataset

Selfies, ID Images dataset (5591 sets of 15 files)

Selfies and video dataset (4 052 sets)

Dataset of bald people, 5000 images

🧩 This is just an example of the data. Leave a request here to learn more

Dataset structure

images - contains of original images of people

labels - includes visualized labeling for the original images

annotations.xml - contains coordinates of the bbox, created for the original photo

Data Format

Each image from images folder is accompanied by an XML-annotation in the annotations.xml file indicating the coordinates of the polygons and labels . For each point, the x and y coordinates are provided.

Example of XML file structure

https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F12421376%2F19e61b2d0780e9db80afe4a0ce879c4b%2Fcarbon.png?generation=1699440100527867&alt=media" alt="">

🚀 You can learn more about our high-quality unique datasets here

keywords: biometric system, biometric system attacks, biometric dataset, face recognition database, face recognition dataset, face detection dataset, facial analysis, object detection dataset, deep learning datasets, computer vision datset, human images dataset, human faces dataset
n
Language Dataset
data.ncl.ac.uk
json
Updated Nov 30, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
David Towers; Rob Geada; Amir Atapour-Abarghouei; Andrew Stephen McGough (2023). Language Dataset [Dataset]. http://doi.org/10.25405/data.ncl.24574729.v1
Explore at:
jsonAvailable download formats
Unique identifier
https://doi.org/10.25405/data.ncl.24574729.v1
Dataset updated
Nov 30, 2023
Dataset provided by
Newcastle University
Authors
David Towers; Rob Geada; Amir Atapour-Abarghouei; Andrew Stephen McGough
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Dataset containing the images and labels for the Language data used in the CVPR NAS workshop Unseen-data challenge under the codename "LaMelo"The Language dataset is a constructed dataset using words from aspell dictionaries. The intention of this dataset is to require machine learning models to not only perform image classification but also linguistic analysis to figure out which letter frequency is associated with each language. For each Language image we selected four six-letter words using the standard latin alphabet and removed any words with letters that used diacritics (such as ́e or ̈u) or included ‘y’ or ‘z’.We encode these words on a graph with one axis representing the index of the 24 character long string (the four words joined together) and the other representing the letter (going A-X).The data is in a channels-first format with a shape of (n, 1, 24, 24) where n is the number of samples in the corresponding set (50,000 for training, 10,000 for validation, and 10,000 for testing).There are ten classes in the dataset, with 7,000 examples of each, distributed evenly between the three subsets.The ten classes and corresponding numerical label are as follows:English: 0,Dutch: 1,German: 2,Spanish: 3,French: 4,Portuguese: 5,Swahili: 6,Zulu: 7,Finnish: 8,Swedish: 9
Z
Data from: ImageNet-Patch: A Dataset for Benchmarking Machine Learning...
data.niaid.nih.gov
zenodo.org
Updated Jun 30, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Maura Pintor; Daniele Angioni; Angelo Sotgiu; Luca Demetrio; Ambra Demontis; Battista Biggio; Fabio Roli (2022). ImageNet-Patch: A Dataset for Benchmarking Machine Learning Robustness against Adversarial Patches [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6568777
Explore at:
Dataset updated
Jun 30, 2022
Dataset provided by
University of Genoa, Italy
University of Cagliari, Italy
Authors
Maura Pintor; Daniele Angioni; Angelo Sotgiu; Luca Demetrio; Ambra Demontis; Battista Biggio; Fabio Roli
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Adversarial patches are optimized contiguous pixel blocks in an input image that cause a machine-learning model to misclassify it. However, their optimization is computationally demanding and requires careful hyperparameter tuning. To overcome these issues, we propose ImageNet-Patch, a dataset to benchmark machine-learning models against adversarial patches. It consists of a set of patches optimized to generalize across different models and applied to ImageNet data after preprocessing them with affine transformations. This process enables an approximate yet faster robustness evaluation, leveraging the transferability of adversarial perturbations.

We release our dataset as a set of folders indicating the patch target label (e.g., banana), each containing 1000 subfolders as the ImageNet output classes.

An example showing how to use the dataset is shown below.

code for testing robustness of a model

import os.path

from torchvision import datasets, transforms, models import torch.utils.data

class ImageFolderWithEmptyDirs(datasets.ImageFolder): """ This is required for handling empty folders from the ImageFolder Class. """

def find_classes(self, directory): classes = sorted(entry.name for entry in os.scandir(directory) if entry.is_dir()) if not classes: raise FileNotFoundError(f"Couldn't find any class folder in {directory}.") class_to_idx = {cls_name: i for i, cls_name in enumerate(classes) if len(os.listdir(os.path.join(directory, cls_name))) > 0} return classes, class_to_idx

extract and unzip the dataset, then write top folder here

dataset_folder = 'data/ImageNet-Patch'

available_labels = { 487: 'cellular telephone', 513: 'cornet', 546: 'electric guitar', 585: 'hair spray', 804: 'soap dispenser', 806: 'sock', 878: 'typewriter keyboard', 923: 'plate', 954: 'banana', 968: 'cup' }

select folder with specific target

target_label = 954

dataset_folder = os.path.join(dataset_folder, str(target_label)) normalizer = transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]) transforms = transforms.Compose([ transforms.ToTensor(), normalizer ])

dataset = ImageFolderWithEmptyDirs(dataset_folder, transform=transforms) model = models.resnet50(pretrained=True) loader = torch.utils.data.DataLoader(dataset, shuffle=True, batch_size=5) model.eval()

batches = 10 correct, attack_success, total = 0, 0, 0 for batch_idx, (images, labels) in enumerate(loader): if batch_idx == batches: break pred = model(images).argmax(dim=1) correct += (pred == labels).sum() attack_success += sum(pred == target_label) total += pred.shape[0]

accuracy = correct / total attack_sr = attack_success / total

print("Robust Accuracy: ", accuracy) print("Attack Success: ", attack_sr)
Z
EmoLit
data.niaid.nih.gov
data.europa.eu
Updated Jun 27, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Rei, Luis (2023). EmoLit [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7883953
Explore at:
Dataset updated
Jun 27, 2023
Dataset provided by
Jozef Stefan Institute
Authors
Rei, Luis
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
Emotions in Literature

Description Literature sentences from Project Gutenberg. 38 emotion labels (+neutral examples). Semi-Supervised dataset.

Article

Detecting Fine-Grained Emotions in Literature

Please cite:

@Article{app13137502, AUTHOR = {Rei, Luis and Mladenić, Dunja}, TITLE = {Detecting Fine-Grained Emotions in Literature}, JOURNAL = {Applied Sciences}, VOLUME = {13}, YEAR = {2023}, NUMBER = {13}, ARTICLE-NUMBER = {7502}, URL = {https://www.mdpi.com/2076-3417/13/13/7502}, ISSN = {2076-3417}, DOI = {10.3390/app13137502} }

Abstract

Emotion detection in text is a fundamental aspect of affective computing and is closely linked to natural language processing. Its applications span various domains, from interactive chatbots to marketing and customer service. This research specifically focuses on its significance in literature analysis and understanding. To facilitate this, we present a novel approach that involves creating a multi-label fine-grained emotion detection dataset, derived from literary sources. Our methodology employs a simple yet effective semi-supervised technique. We leverage textual entailment classification to perform emotion-specific weak-labeling, selecting examples with the highest and lowest scores from a large corpus. Utilizing these emotion-specific datasets, we train binary pseudo-labeling classifiers for each individual emotion. By applying this process to the selected examples, we construct a multi-label dataset. Using this dataset, we train models and evaluate their performance within a traditional supervised setting. Our model achieves an F1 score of 0.59 on our labeled gold set, showcasing its ability to effectively detect fine-grained emotions. Furthermore, we conduct evaluations of the model's performance in zero- and few-shot transfer scenarios using benchmark datasets. Notably, our results indicate that the knowledge learned from our dataset exhibits transferability across diverse data domains, demonstrating its potential for broader applications beyond emotion detection in literature. Our contribution thus includes a multi-label fine-grained emotion detection dataset built from literature, the semi-supervised approach used to create it, as well as the models trained on it. This work provides a solid foundation for advancing emotion detection techniques and their utilization in various scenarios, especially within the cultural heritage analysis.

Labels

admiration: finds something admirable, impressive or worthy of respect

amusement: finds something funny, entertaining or amusing

anger: is angry, furious, or strongly displeased; displays ire, rage, or wrath

annoyance: is annoyed or irritated

approval: expresses a favorable opinion, approves, endorses or agrees with something or someone

boredom: feels bored, uninterested, monotony, tedium

calmness: is calm, serene, free from agitation or disturbance, experiences emotional tranquility

caring: cares about the well-being of someone else, feels sympathy, compassion, affectionate concern towards someone, displays kindness or generosity

courage: feels courage or the ability to do something that frightens one, displays fearlessness or bravery

curiosity: is interested, curious, or has strong desire to learn something

desire: has a desire or ambition, wants something, wishes for something to happen

despair: feels despair, helpless, powerless, loss or absence of hope, desperation, despondency

disappointment: feels sadness or displeasure caused by the non-fulfillment of hopes or expectations, being or let down, expresses regret due to the unfavorable outcome of a decision

disapproval: expresses an unfavorable opinion, disagrees or disapproves of something or someone

disgust: feels disgust, revulsion, finds something or someone unpleasant, offensive or hateful

doubt: has doubt or is uncertain about something, bewildered, confused, or shows lack of understanding

embarrassment: feels embarrassed, awkward, self-conscious, shame, or humiliation

envy: is covetous, feels envy or jealousy; begrudges or resents someone for their achievements, possessions, or qualities

excitement: feels excitement or great enthusiasm and eagerness

faith: expresses religious faith, has a strong belief in the doctrines of a religion, or trust in god

fear: is afraid or scared due to a threat, danger, or harm

frustration: feels frustrated: upset or annoyed because of inability to change or achieve something

gratitude: is thankful or grateful for something

greed: is greedy, rapacious, avaricious, or has selfish desire to acquire or possess more than what one needs

grief: feels grief or intense sorrow, or grieves for someone who has died

guilt: feels guilt, remorse, or regret to have committed wrong or failed in an obligation

indifference: is uncaring, unsympathetic, uncharitable, or callous, shows indifference, lack of concern, coldness towards someone

joy: is happy, feels joy, great pleasure, elation, satisfaction, contentment, or delight

love: feels love, strong affection, passion, or deep romantic attachment for someone

nervousness: feels nervous, anxious, worried, uneasy, apprehensive, stressed, troubled or tense

nostalgia: feels nostalgia, longing or wistful affection for the past, something lost, or for a period in one's life, feels homesickness, a longing for one's home, city, or country while being away; longing for a familiar place

optimism: feels optimism or hope, is hopeful or confident about the future, that something good may happen, or the success of something - pain: feels physical pain or is experiences physical suffering

pride: is proud, feels pride from one's own achievements, self-fulfillment, or from the achievements of those with whom one is closely associated, or from qualities or possessions that are widely admired

relief: feels relaxed, relief from tension or anxiety

sadness: feels sadness, sorrow, unhappiness, depression, dejection

surprise: is surprised, astonished or shocked by something unexpected

trust: trusts or has confidence in someone, or believes that someone is good, honest, or reliable

Dataset

EmoLit (Zenodo)

Code

EmoLit Train (Github)

Models

LARGE

BASE

DISTILL
UCI and OpenML Data Sets for Ordinal Quantification
zenodo.org
data.niaid.nih.gov
+1more
zip
Updated Jul 25, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mirko Bunse; Mirko Bunse; Alejandro Moreo; Alejandro Moreo; Fabrizio Sebastiani; Fabrizio Sebastiani; Martin Senz; Martin Senz (2023). UCI and OpenML Data Sets for Ordinal Quantification [Dataset]. http://doi.org/10.5281/zenodo.8177302
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.8177302
Dataset updated
Jul 25, 2023
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Mirko Bunse; Mirko Bunse; Alejandro Moreo; Alejandro Moreo; Fabrizio Sebastiani; Fabrizio Sebastiani; Martin Senz; Martin Senz
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
These four labeled data sets are targeted at ordinal quantification. The goal of quantification is not to predict the label of each individual instance, but the distribution of labels in unlabeled sets of data.

With the scripts provided, you can extract CSV files from the UCI machine learning repository and from OpenML. The ordinal class labels stem from a binning of a continuous regression label.

We complement this data set with the indices of data items that appear in each sample of our evaluation. Hence, you can precisely replicate our samples by drawing the specified data items. The indices stem from two evaluation protocols that are well suited for ordinal quantification. To this end, each row in the files app_val_indices.csv, app_tst_indices.csv, app-oq_val_indices.csv, and app-oq_tst_indices.csv represents one sample.

Our first protocol is the artificial prevalence protocol (APP), where all possible distributions of labels are drawn with an equal probability. The second protocol, APP-OQ, is a variant thereof, where only the smoothest 20% of all APP samples are considered. This variant is targeted at ordinal quantification tasks, where classes are ordered and a similarity of neighboring classes can be assumed.

Usage

You can extract four CSV files through the provided script extract-oq.jl, which is conveniently wrapped in a Makefile. The Project.toml and Manifest.toml specify the Julia package dependencies, similar to a requirements file in Python.

Preliminaries: You have to have a working Julia installation. We have used Julia v1.6.5 in our experiments.

Data Extraction: In your terminal, you can call either

make

(recommended), or

julia --project="." --eval "using Pkg; Pkg.instantiate()" julia --project="." extract-oq.jl

Outcome: The first row in each CSV file is the header. The first column, named "class_label", is the ordinal class.

Further Reading

Implementation of our experiments: https://github.com/mirkobunse/regularized-oq
Body Segmentation - 6,700 Photos
kaggle.com
zip
Updated Apr 20, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Unique Data (2023). Body Segmentation - 6,700 Photos [Dataset]. https://www.kaggle.com/datasets/trainingdatapro/human-segmentation-dataset
Explore at:
zip(35504836 bytes)Available download formats
Dataset updated
Apr 20, 2023
Authors
Unique Data
License
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0)https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Description
7 Types of Human Segmentation

The dataset includes 7 different types of image segmentation of people in underwear. For women, 4 types of labeling are provided, and for men, 3 types of labeling are provided. The dataset solves tasks in the field of recommendation systems and e-commerce.

👉 Legally sourced datasets and carefully structured for AI training and model development. Explore samples from our dataset - Full dataset

Types of labeling

Women I - distinctively detailed labeling of women. Special emphasis is placed on distinguishing the internal, external side, and lower breast depending on the type of underwear. The labeling also includes the face and hair, hands, forearms, shoulders, armpits, thighs, shins, underwear, accessories, and smartphones.

![https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F12421376%2Fe157d0b7db89497f85c9b2d79d301086%2Fgirls_1_227.png?generation=1681741881080579&alt=media" alt="">

Women II - labeling of images of women with attention to the side abs area (highlighted in gray on the labeling). The labeling also includes the face and hair, hands, forearms, thighs, underwear, accessories, and smartphones.

https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F12421376%2F901d120c0273ea9a5a328fff15e26583%2Fgirls_2_-1087839647-1867563540.png?generation=1681741958025976&alt=media" alt="">

https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F12421376%2F75fb0412edf631adce5f42ab6b9e8052%2Fwomen_fat_image_56570.png?generation=1681742864993159&alt=media" alt="">

Women III - primarily labeling of underwear. In addition to the underwear itself, the labeling includes the face and hair, abdomen, and arms and legs.

https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F12421376%2F6f32a06f0754a5a116fc994feae8c6f1%2Fgirls_5_111.png?generation=1681742011331681&alt=media" alt="">

Women IV - labeling of both underwear and body parts. It includes labeling of underwear, face and hair, hands, forearms, body, legs, as well as smartphones and tattoos.

https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F12421376%2F0dc22fcfd8b6e4fad3aa1806d14223ef%2Fgirls_6_image_4534.png?generation=1681742073295272&alt=media" alt="">

https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F12421376%2F4a398547e13c555fdad142f521e62a5f%2Fsports_girls_IadpBSd3mI%20(1).png?generation=1681742947264828&alt=media" alt="">

Men I - labeling of the upper part of men's bodies. It includes labeling of hands and wrists, shoulders, body, neck, face and hair, as well as phones and accessories.

https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F12421376%2F3dae9889adb2b1415353769ccdd9c01b%2Fman_regular_1532667_38709335.png?generation=1681742128995529&alt=media" alt="">

Men II - more detailed labeling of men's bodies. The labeling includes hands and wrists, shoulders, body and neck, head and hair, underwear, tattoos and accessories, nipple and navel area.

https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F12421376%2Fa57123e41066aa277bfeac140f4457da%2Fmen_1_3046.png?generation=1681742173310957&alt=media" alt="">

Men Neuro - labeling produced by a neural network for subsequent correction by annotators.

https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F12421376%2F5281cd644bc3f5949aaa9c40fb1cafd4%2F4595%20(1).png?generation=1681742187215164&alt=media" alt="">

🧩 This is just an example of the data. Leave a request here to learn more

🚀 You can learn more about our high-quality unique datasets here

keywords: body segmentation dataset, human part segmentation dataset, human semantic part segmentation, human body segmentation data, human body segmentation deep learning, computer vision dataset, people images dataset, biometric data dataset, biometric dataset, images database, image-to-image, people segmentation, machine learning
Physiological signals during activities for daily life: Dataset
zenodo.org
data.niaid.nih.gov
+1more
zip
Updated Mar 29, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Eduardo Gutierrez Maestro; Eduardo Gutierrez Maestro (2022). Physiological signals during activities for daily life: Dataset [Dataset]. http://doi.org/10.5281/zenodo.6391454
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.6391454
Dataset updated
Mar 29, 2022
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Eduardo Gutierrez Maestro; Eduardo Gutierrez Maestro
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The dataset used in this work is composed by four participants, two men and two women. Each of them carried the wearable device Empatica E4 for a total number of 15 days. They carried the wearable during the day, and during the nights we asked participants to charge and load the data into an external memory unit. During these days, participants were asked to answer EMA questionnaires which are used to label our data. However, some participants could not complete the full experiment or some days were discarded due to data corruption. Specific demographic information, total sampling days and total number of EMA answers can be found in table I.

Participant 1 Participant 2 Participant 3 Participant 4
Age 67 55 60 63
Gender Male Female Male Female

Final Valid Days
9 15 12 13
Total EMAs 42 57 64 46

Table I. Summary of participants' collected data.

This dataset provides three different type of labels. Activeness and happiness are two of these labels. These are the answers to EMA questionnaires that participants reported during their daily activities. These labels are numbers between 0 and 4.
These labels are used to interpolate the mental well-being state according to [1] We report in our dataset a total number of eight emotional states: (1) pleasure, (2) excitement, (3) arousal, (4) distress, (5) misery, (6) depression, (7) sleepiness, and (8) contentment.

The data we provide in this repository consist of two type of files:

CSV files: These files contain physiological signals recorded during the data collection process. The first line of each CSV file defines the timestamp by which data started being sampled. The second line defines the sampling frequency used for gathering the signal. From the third line until the end of the file, one can find sampled datapoints.

Excel files: These files contain the labels obtained from EMA answers. It is indicated the timestamp at which the answer was registered. Labels for pleasure, activeness and mood can be found in this file.

NOTE: Files are numbered according to each specific sampling day. For example, ACC1.csv corresponds to the signal ACC for sampling day 1. The same applied to excel files.

Code and a tutorial of how to labelled and extract features can be found in this repository: https://github.com/edugm94/temporal-feat-emotion-prediction

References:

[1] . A. Russell, “A circumplex model of affect,” Journal of personality and social psychology, vol. 39, no. 6, p. 1161, 1980
R
Car Highway Dataset
universe.roboflow.com
zip
Updated Sep 13, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sallar (2023). Car Highway Dataset [Dataset]. https://universe.roboflow.com/sallar/car-highway/dataset/1
Explore at:
zipAvailable download formats
Dataset updated
Sep 13, 2023
Dataset authored and provided by
Sallar
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
Vehicles Bounding Boxes
Description
Car-Highway Data Annotation Project

Introduction

In this project, we aim to annotate car images captured on highways. The annotated data will be used to train machine learning models for various computer vision tasks, such as object detection and classification.

Project Goals

Collect a diverse dataset of car images from highway scenes.

Annotate the dataset to identify and label cars within each image.

Organize and format the annotated data for machine learning model training.

Tools and Technologies

For this project, we will be using Roboflow, a powerful platform for data annotation and preprocessing. Roboflow simplifies the annotation process and provides tools for data augmentation and transformation.

Annotation Process

Upload the raw car images to the Roboflow platform.

Use the annotation tools in Roboflow to draw bounding boxes around each car in the images.

Label each bounding box with the corresponding class (e.g., car).

Review and validate the annotations for accuracy.

Data Augmentation

Roboflow offers data augmentation capabilities, such as rotation, flipping, and resizing. These augmentations can help improve the model's robustness.

Data Export

Once the data is annotated and augmented, Roboflow allows us to export the dataset in various formats suitable for training machine learning models, such as YOLO, COCO, or TensorFlow Record.

Milestones

Data Collection and Preprocessing

Annotation of Car Images

Data Augmentation

Data Export

Model Training

Conclusion

By completing this project, we will have a well-annotated dataset ready for training machine learning models. This dataset can be used for a wide range of applications in computer vision, including car detection and tracking on highways.
Dollar street 10 - 64x64x3
zenodo.org
data.niaid.nih.gov
+1more
bin
Updated May 6, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sven van der burg; Sven van der burg (2025). Dollar street 10 - 64x64x3 [Dataset]. http://doi.org/10.5281/zenodo.10970014
Explore at:
binAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.10970014
Dataset updated
May 6, 2025
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Sven van der burg; Sven van der burg
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The MLCommons Dollar Street Dataset is a collection of images of everyday household items from homes around the world that visually captures socioeconomic diversity of traditionally underrepresented populations. It consists of public domain data, licensed for academic, commercial and non-commercial usage, under CC-BY and CC-BY-SA 4.0. The dataset was developed because similar datasets lack socioeconomic metadata and are not representative of global diversity.

This is a subset of the original dataset that can be used for multiclass classification with 10 categories. It is designed to be used in teaching, similar to the widely used, but unlicensed CIFAR-10 dataset.

These are the preprocessing steps that were performed:

Only take examples with one imagenet_synonym label

Use only examples with the 10 most frequently occuring labels

Downscale images to 64 x 64 pixels

Split data in train and test

Store as numpy array

This is the label mapping:

Category label
day bed 0
dishrag 1
plate 2
running shoe 3
soap dispenser 4
street sign 5
table lamp 6
tile roof 7
toilet seat 8
washing machine 9

Checkout https://github.com/carpentries-lab/deep-learning-intro/blob/main/instructors/prepare-dollar-street-data.ipynb" target="_blank" rel="noopener">this notebook to see how the subset was created.

The original dataset was downloaded from https://www.kaggle.com/datasets/mlcommons/the-dollar-street-dataset. See https://mlcommons.org/datasets/dollar-street/ for more information.
Website Screenshots Dataset
universe.roboflow.com
kaggle.com
zip
Updated Aug 19, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Roboflow (2022). Website Screenshots Dataset [Dataset]. https://universe.roboflow.com/roboflow-gw7yv/website-screenshots/model/1
Explore at:
zipAvailable download formats
Dataset updated
Aug 19, 2022
Dataset authored and provided by
Roboflowhttps://roboflow.com/
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Variables measured
Elements Bounding Boxes
Description
About This Dataset

The Roboflow Website Screenshots dataset is a synthetically generated dataset composed of screenshots from over 1000 of the world's top websites. They have been automatically annotated to label the following classes: :fa-spacer: * button - navigation links, tabs, etc. * heading - text that was enclosed in <h1> to <h6> tags. * link - inline, textual <a> tags. * label - text labeling form fields. * text - all other text. * image - <img>, <svg>, or <video> tags, and icons. * iframe - ads and 3rd party content.

Example

This is an example image and annotation from the dataset: https://i.imgur.com/mOG3u3Z.png" alt="WIkipedia Screenshot">

Usage

Annotated screenshots are very useful in Robotic Process Automation. But they can be expensive to label. This dataset would cost over $4000 for humans to label on popular labeling services. We hope this dataset provides a good starting point for your project. Try it with a model from our model library.

Collecting Custom Data

Roboflow is happy to provide a custom screenshots dataset to meet your particular needs. We can crawl public or internal web applications. Just reach out and we'll be happy to provide a quote!

About Roboflow

Roboflow makes managing, preprocessing, augmenting, and versioning datasets for computer vision seamless. :fa-spacer: Developers reduce 50% of their boilerplate code when using Roboflow's workflow, save training time, and increase model reproducibility. :fa-spacer:
FSDKaggle2019
data.niaid.nih.gov
data.europa.eu
Updated Jan 24, 2020
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Eduardo Fonseca; Manoj Plakal; Frederic Font; Daniel P. W. Ellis; Xavier Serra (2020). FSDKaggle2019 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3612636
Explore at:
Dataset updated
Jan 24, 2020
Dataset provided by
Music Technology Grouphttps://www.upf.edu/web/mtg
Google, Inc., New York, NY, USA
Authors
Eduardo Fonseca; Manoj Plakal; Frederic Font; Daniel P. W. Ellis; Xavier Serra
Description
FSDKaggle2019 is an audio dataset containing 29,266 audio files annotated with 80 labels of the AudioSet Ontology. FSDKaggle2019 has been used for the DCASE Challenge 2019 Task 2, which was run as a Kaggle competition titled Freesound Audio Tagging 2019.

Citation

If you use the FSDKaggle2019 dataset or part of it, please cite our DCASE 2019 paper:

Eduardo Fonseca, Manoj Plakal, Frederic Font, Daniel P. W. Ellis, Xavier Serra. "Audio tagging with noisy labels and minimal supervision". Proceedings of the DCASE 2019 Workshop, NYC, US (2019)

You can also consider citing our ISMIR 2017 paper, which describes how we gathered the manual annotations included in FSDKaggle2019.

Eduardo Fonseca, Jordi Pons, Xavier Favory, Frederic Font, Dmitry Bogdanov, Andres Ferraro, Sergio Oramas, Alastair Porter, and Xavier Serra, "Freesound Datasets: A Platform for the Creation of Open Audio Datasets", In Proceedings of the 18th International Society for Music Information Retrieval Conference, Suzhou, China, 2017

Data curators

Eduardo Fonseca, Manoj Plakal, Xavier Favory, Jordi Pons

Contact

You are welcome to contact Eduardo Fonseca should you have any questions at eduardo.fonseca@upf.edu.

ABOUT FSDKaggle2019

Freesound Dataset Kaggle 2019 (or FSDKaggle2019 for short) is an audio dataset containing 29,266 audio files annotated with 80 labels of the AudioSet Ontology [1]. FSDKaggle2019 has been used for the Task 2 of the Detection and Classification of Acoustic Scenes and Events (DCASE) Challenge 2019. Please visit the DCASE2019 Challenge Task 2 website for more information. This Task was hosted on the Kaggle platform as a competition titled Freesound Audio Tagging 2019. It was organized by researchers from the Music Technology Group (MTG) of Universitat Pompeu Fabra (UPF), and from Sound Understanding team at Google AI Perception. The competition intended to provide insight towards the development of broadly-applicable sound event classifiers able to cope with label noise and minimal supervision conditions.

FSDKaggle2019 employs audio clips from the following sources:

Freesound Dataset (FSD): a dataset being collected at the MTG-UPF based on Freesound content organized with the AudioSet Ontology

The soundtracks of a pool of Flickr videos taken from the Yahoo Flickr Creative Commons 100M dataset (YFCC)

The audio data is labeled using a vocabulary of 80 labels from Google’s AudioSet Ontology [1], covering diverse topics: Guitar and other Musical Instruments, Percussion, Water, Digestive, Respiratory sounds, Human voice, Human locomotion, Hands, Human group actions, Insect, Domestic animals, Glass, Liquid, Motor vehicle (road), Mechanisms, Doors, and a variety of Domestic sounds. The full list of categories can be inspected in vocabulary.csv (see Files & Download below). The goal of the task was to build a multi-label audio tagging system that can predict appropriate label(s) for each audio clip in a test set.

What follows is a summary of some of the most relevant characteristics of FSDKaggle2019. Nevertheless, it is highly recommended to read our DCASE 2019 paper for a more in-depth description of the dataset and how it was built.

Ground Truth Labels

The ground truth labels are provided at the clip-level, and express the presence of a sound category in the audio clip, hence can be considered weak labels or tags. Audio clips have variable lengths (roughly from 0.3 to 30s).

The audio content from FSD has been manually labeled by humans following a data labeling process using the Freesound Annotator platform. Most labels have inter-annotator agreement but not all of them. More details about the data labeling process and the Freesound Annotator can be found in [2].

The YFCC soundtracks were labeled using automated heuristics applied to the audio content and metadata of the original Flickr clips. Hence, a substantial amount of label noise can be expected. The label noise can vary widely in amount and type depending on the category, including in- and out-of-vocabulary noises. More information about some of the types of label noise that can be encountered is available in [3].

Specifically, FSDKaggle2019 features three types of label quality, one for each set in the dataset:

curated train set: correct (but potentially incomplete) labels

noisy train set: noisy labels

test set: correct and complete labels

Further details can be found below in the sections for each set.

Format

All audio clips are provided as uncompressed PCM 16 bit, 44.1 kHz, mono audio files.

DATA SPLIT

FSDKaggle2019 consists of two train sets and one test set. The idea is to limit the supervision provided for training (i.e., the manually-labeled, hence reliable, data), thus promoting approaches to deal with label noise.

Curated train set

The curated train set consists of manually-labeled data from FSD.

Number of clips/class: 75 except in a few cases (where there are less)

Total number of clips: 4970

Avg number of labels/clip: 1.2

Total duration: 10.5 hours

The duration of the audio clips ranges from 0.3 to 30s due to the diversity of the sound categories and the preferences of Freesound users when recording/uploading sounds. Labels are correct but potentially incomplete. It can happen that a few of these audio clips present additional acoustic material beyond the provided ground truth label(s).

Noisy train set

The noisy train set is a larger set of noisy web audio data from Flickr videos taken from the YFCC dataset [5].

Number of clips/class: 300

Total number of clips: 19,815

Avg number of labels/clip: 1.2

Total duration: ~80 hours

The duration of the audio clips ranges from 1s to 15s, with the vast majority lasting 15s. Labels are automatically generated and purposefully noisy. No human validation is involved. The label noise can vary widely in amount and type depending on the category, including in- and out-of-vocabulary noises.

Considering the numbers above, the per-class data distribution available for training is, for most of the classes, 300 clips from the noisy train set and 75 clips from the curated train set. This means 80% noisy / 20% curated at the clip level, while at the duration level the proportion is more extreme considering the variable-length clips.

Test set

The test set is used for system evaluation and consists of manually-labeled data from FSD.

Number of clips/class: between 50 and 150

Total number of clips: 4481

Avg number of labels/clip: 1.4

Total duration: 12.9 hours

The acoustic material present in the test set clips is labeled exhaustively using the aforementioned vocabulary of 80 classes. Most labels have inter-annotator agreement but not all of them. Except human error, the label(s) are correct and complete considering the target vocabulary; nonetheless, a few clips could still present additional (unlabeled) acoustic content out of the vocabulary.

During the DCASE2019 Challenge Task 2, the test set was split into two subsets, for the public and private leaderboards, and only the data corresponding to the public leaderboard was provided. In this current package you will find the full test set with all the test labels. To allow comparison with previous work, the file test_post_competition.csv includes a flag to determine the corresponding leaderboard (public or private) for each test clip (see more info in Files & Download below).

Acoustic mismatch

As mentioned before, FSDKaggle2019 uses audio clips from two sources:

FSD: curated train set and test set, and

YFCC: noisy train set.

While the sources of audio (Freesound and Flickr) are collaboratively contributed and pretty diverse themselves, a certain acoustic mismatch can be expected between FSD and YFCC. We conjecture this mismatch comes from a variety of reasons. For example, through acoustic inspection of a small sample of both data sources, we find a higher percentage of high quality recordings in FSD. In addition, audio clips in Freesound are typically recorded with the purpose of capturing audio, which is not necessarily the case in YFCC.

This mismatch can have an impact in the evaluation, considering that most of the train data come from YFCC, while all test data are drawn from FSD. This constraint (i.e., noisy training data coming from a different web audio source than the test set) is sometimes a real-world condition.

LICENSE

All clips in FSDKaggle2019 are released under Creative Commons (CC) licenses. For attribution purposes and to facilitate attribution of these files to third parties, we include a mapping from the audio clips to their corresponding licenses.

Curated train set and test set. All clips in Freesound are released under different modalities of Creative Commons (CC) licenses, and each audio clip has its own license as defined by the audio clip uploader in Freesound, some of them requiring attribution to their original authors and some forbidding further commercial reuse. The licenses are specified in the files train_curated_post_competition.csv and test_post_competition.csv. These licenses can be CC0, CC-BY, CC-BY-NC and CC Sampling+.

Noisy train set. Similarly, the licenses of the soundtracks from Flickr used in FSDKaggle2019 are specified in the file train_noisy_post_competition.csv. These licenses can be CC-BY and CC BY-SA.

In addition, FSDKaggle2019 as a whole is the result of a curation process and it has an additional license. FSDKaggle2019 is released under CC-BY. This license is specified in the LICENSE-DATASET file downloaded with the FSDKaggle2019.doc zip file.

FILES & DOWNLOAD

FSDKaggle2019 can be downloaded as a series of zip files with the following directory structure:

root │
└───FSDKaggle2019.audio_train_curated/ Audio clips in the curated train set │ └───FSDKaggle2019.audio_train_noisy/ Audio clips in the noisy
AI Training Dataset Market Analysis, Size, and Forecast 2025-2029: North...
technavio.com
pdf
Updated Jul 15, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Technavio (2025). AI Training Dataset Market Analysis, Size, and Forecast 2025-2029: North America (US and Canada), Europe (France, Germany, and UK), APAC (China, India, Japan, and South Korea), South America (Brazil), and Rest of World (ROW) [Dataset]. https://www.technavio.com/report/ai-training-dataset-market-industry-analysis
Explore at:
pdfAvailable download formats
Dataset updated
Jul 15, 2025
Dataset provided by
TechNavio
Authors
Technavio
License
https://www.technavio.com/content/privacy-noticehttps://www.technavio.com/content/privacy-notice
Time period covered
2025 - 2029
Area covered
United Kingdom, Canada, United States
Description
Snapshot img

AI Training Dataset Market Size 2025-2029

The ai training dataset market size is valued to increase by USD 7.33 billion, at a CAGR of 29% from 2024 to 2029. Proliferation and increasing complexity of foundational AI models will drive the ai training dataset market.

Market Insights

North America dominated the market and accounted for a 36% growth during the 2025-2029. By Service Type - Text segment was valued at USD 742.60 billion in 2023 By Deployment - On-premises segment accounted for the largest market revenue share in 2023

Market Size & Forecast

Market Opportunities: USD 479.81 million Market Future Opportunities 2024: USD 7334.90 million CAGR from 2024 to 2029 : 29%

Market Summary

The market is experiencing significant growth as businesses increasingly rely on artificial intelligence (AI) to optimize operations, enhance customer experiences, and drive innovation. The proliferation and increasing complexity of foundational AI models necessitate large, high-quality datasets for effective training and improvement. This shift from data quantity to data quality and curation is a key trend in the market. Navigating data privacy, security, and copyright complexities, however, poses a significant challenge. Businesses must ensure that their datasets are ethically sourced, anonymized, and securely stored to mitigate risks and maintain compliance. For instance, in the supply chain optimization sector, companies use AI models to predict demand, optimize inventory levels, and improve logistics. Access to accurate and up-to-date training datasets is essential for these applications to function efficiently and effectively. Despite these challenges, the benefits of AI and the need for high-quality training datasets continue to drive market growth. The potential applications of AI are vast and varied, from healthcare and finance to manufacturing and transportation. As businesses continue to explore the possibilities of AI, the demand for curated, reliable, and secure training datasets will only increase.

What will be the size of the AI Training Dataset Market during the forecast period?

Get Key Insights on Market Forecast (PDF) Request Free SampleThe market continues to evolve, with businesses increasingly recognizing the importance of high-quality datasets for developing and refining artificial intelligence models. According to recent studies, the use of AI in various industries is projected to grow by over 40% in the next five years, creating a significant demand for training datasets. This trend is particularly relevant for boardrooms, as companies grapple with compliance requirements, budgeting decisions, and product strategy. Moreover, the importance of data labeling, feature selection, and imbalanced data handling in model performance cannot be overstated. For instance, a mislabeled dataset can lead to biased and inaccurate models, potentially resulting in costly errors. Similarly, effective feature selection algorithms can significantly improve model accuracy and reduce computational resources. Despite these challenges, advances in model compression methods, dataset scalability, and data lineage tracking are helping to address some of the most pressing issues in the market. For example, model compression techniques can reduce the size of models, making them more efficient and easier to deploy. Similarly, data lineage tracking can help ensure data consistency and improve model interpretability. In conclusion, the market is a critical component of the broader AI ecosystem, with significant implications for businesses across industries. By focusing on data quality, effective labeling, and advanced techniques for handling imbalanced data and improving model performance, organizations can stay ahead of the curve and unlock the full potential of AI.

Unpacking the AI Training Dataset Market Landscape

In the realm of artificial intelligence (AI), the significance of high-quality training datasets is indisputable. Businesses harnessing AI technologies invest substantially in acquiring and managing these datasets to ensure model robustness and accuracy. According to recent studies, up to 80% of machine learning projects fail due to insufficient or poor-quality data. Conversely, organizations that effectively manage their training data experience an average ROI improvement of 15% through cost reduction and enhanced model performance.

Distributed computing systems and high-performance computing facilitate the processing of vast datasets, enabling businesses to train models at scale. Data security protocols and privacy preservation techniques are crucial to protect sensitive information within these datasets. Reinforcement learning models and supervised learning models each have their unique applications, with the former demonstrating a 30% faster convergence rate in certain use cases.

Data annot
Z
AIT Log Data Set V2.0
data.niaid.nih.gov
zenodo.org
+1more
Updated Jun 28, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Landauer, Max; Skopik, Florian; Frank, Maximilian; Hotwagner, Wolfgang; Wurzenberger, Markus; Rauber, Andreas (2024). AIT Log Data Set V2.0 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5789063
Explore at:
Dataset updated
Jun 28, 2024
Dataset provided by
Vienna University of Technology
AIT Austrian Institute of Technology
Authors
Landauer, Max; Skopik, Florian; Frank, Maximilian; Hotwagner, Wolfgang; Wurzenberger, Markus; Rauber, Andreas
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
AIT Log Data Sets

This repository contains synthetic log data suitable for evaluation of intrusion detection systems, federated learning, and alert aggregation. A detailed description of the dataset is available in [1]. The logs were collected from eight testbeds that were built at the Austrian Institute of Technology (AIT) following the approach by [2]. Please cite these papers if the data is used for academic publications.

In brief, each of the datasets corresponds to a testbed representing a small enterprise network including mail server, file share, WordPress server, VPN, firewall, etc. Normal user behavior is simulated to generate background noise over a time span of 4-6 days. At some point, a sequence of attack steps is launched against the network. Log data is collected from all hosts and includes Apache access and error logs, authentication logs, DNS logs, VPN logs, audit logs, Suricata logs, network traffic packet captures, horde logs, exim logs, syslog, and system monitoring logs. Separate ground truth files are used to label events that are related to the attacks. Compared to the AIT-LDSv1.1, a more complex network and diverse user behavior is simulated, and logs are collected from all hosts in the network. If you are only interested in network traffic analysis, we also provide the AIT-NDS containing the labeled netflows of the testbed networks. We also provide the AIT-ADS, an alert data set derived by forensically applying open-source intrusion detection systems on the log data.

The datasets in this repository have the following structure:

The gather directory contains all logs collected from the testbed. Logs collected from each host are located in gather//logs/.

The labels directory contains the ground truth of the dataset that indicates which events are related to attacks. The directory mirrors the structure of the gather directory so that each label files is located at the same path and has the same name as the corresponding log file. Each line in the label files references the log event corresponding to an attack by the line number counted from the beginning of the file ("line"), the labels assigned to the line that state the respective attack step ("labels"), and the labeling rules that assigned the labels ("rules"). An example is provided below.

The processing directory contains the source code that was used to generate the labels.

The rules directory contains the labeling rules.

The environment directory contains the source code that was used to deploy the testbed and run the simulation using the Kyoushi Testbed Environment.

The dataset.yml file specifies the start and end time of the simulation.

The following table summarizes relevant properties of the datasets:

fox

Simulation time: 2022-01-15 00:00 - 2022-01-20 00:00

Attack time: 2022-01-18 11:59 - 2022-01-18 13:15

Scan volume: High

Unpacked size: 26 GB

harrison

Simulation time: 2022-02-04 00:00 - 2022-02-09 00:00

Attack time: 2022-02-08 07:07 - 2022-02-08 08:38

Scan volume: High

Unpacked size: 27 GB

russellmitchell

Simulation time: 2022-01-21 00:00 - 2022-01-25 00:00

Attack time: 2022-01-24 03:01 - 2022-01-24 04:39

Scan volume: Low

Unpacked size: 14 GB

santos

Simulation time: 2022-01-14 00:00 - 2022-01-18 00:00

Attack time: 2022-01-17 11:15 - 2022-01-17 11:59

Scan volume: Low

Unpacked size: 17 GB

shaw

Simulation time: 2022-01-25 00:00 - 2022-01-31 00:00

Attack time: 2022-01-29 14:37 - 2022-01-29 15:21

Scan volume: Low

Data exfiltration is not visible in DNS logs

Unpacked size: 27 GB

wardbeck

Simulation time: 2022-01-19 00:00 - 2022-01-24 00:00

Attack time: 2022-01-23 12:10 - 2022-01-23 12:56

Scan volume: Low

Unpacked size: 26 GB

wheeler

Simulation time: 2022-01-26 00:00 - 2022-01-31 00:00

Attack time: 2022-01-30 07:35 - 2022-01-30 17:53

Scan volume: High

No password cracking in attack chain

Unpacked size: 30 GB

wilson

Simulation time: 2022-02-03 00:00 - 2022-02-09 00:00

Attack time: 2022-02-07 10:57 - 2022-02-07 11:49

Scan volume: High

Unpacked size: 39 GB

The following attacks are launched in the network:

Scans (nmap, WPScan, dirb)

Webshell upload (CVE-2020-24186)

Password cracking (John the Ripper)

Privilege escalation

Remote command execution

Data exfiltration (DNSteal)

Note that attack parameters and their execution orders vary in each dataset. Labeled log files are trimmed to the simulation time to ensure that their labels (which reference the related event by the line number in the file) are not misleading. Other log files, however, also contain log events generated before or after the simulation time and may therefore be affected by testbed setup or data collection. It is therefore recommended to only consider logs with timestamps within the simulation time for analysis.

The structure of labels is explained using the audit logs from the intranet server in the russellmitchell data set as an example in the following. The first four labels in the labels/intranet_server/logs/audit/audit.log file are as follows:

{"line": 1860, "labels": ["attacker_change_user", "escalate"], "rules": {"attacker_change_user": ["attacker.escalate.audit.su.login"], "escalate": ["attacker.escalate.audit.su.login"]}}

{"line": 1861, "labels": ["attacker_change_user", "escalate"], "rules": {"attacker_change_user": ["attacker.escalate.audit.su.login"], "escalate": ["attacker.escalate.audit.su.login"]}}

{"line": 1862, "labels": ["attacker_change_user", "escalate"], "rules": {"attacker_change_user": ["attacker.escalate.audit.su.login"], "escalate": ["attacker.escalate.audit.su.login"]}}

{"line": 1863, "labels": ["attacker_change_user", "escalate"], "rules": {"attacker_change_user": ["attacker.escalate.audit.su.login"], "escalate": ["attacker.escalate.audit.su.login"]}}

Each JSON object in this file assigns a label to one specific log line in the corresponding log file located at gather/intranet_server/logs/audit/audit.log. The field "line" in the JSON objects specify the line number of the respective event in the original log file, while the field "labels" comprise the corresponding labels. For example, the lines in the sample above provide the information that lines 1860-1863 in the gather/intranet_server/logs/audit/audit.log file are labeled with "attacker_change_user" and "escalate" corresponding to the attack step where the attacker receives escalated privileges. Inspecting these lines shows that they indeed correspond to the user authenticating as root:

type=USER_AUTH msg=audit(1642999060.603:2226): pid=27950 uid=33 auid=4294967295 ses=4294967295 msg='op=PAM:authentication acct="jhall" exe="/bin/su" hostname=? addr=? terminal=/dev/pts/1 res=success'

type=USER_ACCT msg=audit(1642999060.603:2227): pid=27950 uid=33 auid=4294967295 ses=4294967295 msg='op=PAM:accounting acct="jhall" exe="/bin/su" hostname=? addr=? terminal=/dev/pts/1 res=success'

type=CRED_ACQ msg=audit(1642999060.615:2228): pid=27950 uid=33 auid=4294967295 ses=4294967295 msg='op=PAM:setcred acct="jhall" exe="/bin/su" hostname=? addr=? terminal=/dev/pts/1 res=success'

type=USER_START msg=audit(1642999060.627:2229): pid=27950 uid=33 auid=4294967295 ses=4294967295 msg='op=PAM:session_open acct="jhall" exe="/bin/su" hostname=? addr=? terminal=/dev/pts/1 res=success'

The same applies to all other labels for this log file and all other log files. There are no labels for logs generated by "normal" (i.e., non-attack) behavior; instead, all log events that have no corresponding JSON object in one of the files from the labels directory, such as the lines 1-1859 in the example above, can be considered to be labeled as "normal". This means that in order to figure out the labels for the log data it is necessary to store the line numbers when processing the original logs from the gather directory and see if these line numbers also appear in the corresponding file in the labels directory.

Beside the attack labels, a general overview of the exact times when specific attack steps are launched are available in gather/attacker_0/logs/attacks.log. An enumeration of all hosts and their IP addresses is stated in processing/config/servers.yml. Moreover, configurations of each host are provided in gather//configs/ and gather//facts.json.

Version history:

AIT-LDS-v1.x: Four datasets, logs from single host, fine-granular audit logs, mail/CMS.

AIT-LDS-v2.0: Eight datasets, logs from all hosts, system logs and network traffic, mail/CMS/cloud/web.

Acknowledgements: Partially funded by the FFG projects INDICAETING (868306) and DECEPT (873980), and the EU projects GUARD (833456) and PANDORA (SI2.835928).

If you use the dataset, please cite the following publications:

[1] M. Landauer, F. Skopik, M. Frank, W. Hotwagner, M. Wurzenberger, and A. Rauber. "Maintainable Log Datasets for Evaluation of Intrusion Detection Systems". IEEE Transactions on Dependable and Secure Computing, vol. 20, no. 4, pp. 3466-3482, doi: 10.1109/TDSC.2022.3201582. [PDF]

[2] M. Landauer, F. Skopik, M. Wurzenberger, W. Hotwagner and A. Rauber, "Have it Your Way: Generating Customized Log Datasets With a Model-Driven Simulation Testbed," in IEEE Transactions on Reliability, vol. 70, no. 1, pp. 402-415, March 2021, doi: 10.1109/TR.2020.3031317. [PDF]
Dataset for "Enhancing Cloud Detection in Sentinel-2 Imagery: A...
zenodo.org
data.niaid.nih.gov
bin
Updated Feb 4, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Gong Chengjuan; Yin Ranyu; Yin Ranyu; Long Tengfei; Long Tengfei; He Guojin; Jiao Weili; Wang Guizhou; Gong Chengjuan; He Guojin; Jiao Weili; Wang Guizhou (2024). Dataset for "Enhancing Cloud Detection in Sentinel-2 Imagery: A Spatial-Temporal Approach and Dataset" [Dataset]. http://doi.org/10.5281/zenodo.10613705
Explore at:
binAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.10613705
Dataset updated
Feb 4, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Gong Chengjuan; Yin Ranyu; Yin Ranyu; Long Tengfei; Long Tengfei; He Guojin; Jiao Weili; Wang Guizhou; Gong Chengjuan; He Guojin; Jiao Weili; Wang Guizhou
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
This dataset is built for time-series Sentinel-2 cloud detection and stored in Tensorflow TFRecord (refer to https://www.tensorflow.org/tutorials/load_data/tfrecord).

Each file is compressed in 7z format and can be decompressed using Bandzip or 7-zip software.

Dataset Structure:

Each filename can be split into three parts using underscores. The first part indicates whether it is designated for training or validation ('train' or 'val'); the second part indicates the Sentinel-2 tile name, and the last part indicates the number of samples in this file.

For each sample, it includes:

Sample ID;

Array of time series 4 band image patches in 10m resolution, shaped as (n_timestamps, 4, 42, 42);

Label list indicating cloud cover status for the center $6\times6$ pixels of each timestamp;

Ordinal list for each timestamp;

Sample weight list (reserved);

Here is a demonstration function for parsing the TFRecord file:

import tensorflow as tf # init Tensorflow Dataset from file name def parseRecordDirect(fname): sep = '/' parts = tf.strings.split(fname,sep) tn = tf.strings.split(parts[-1],sep='_')[-2] nn = tf.strings.to_number(tf.strings.split(parts[-1],sep='_')[-1],tf.dtypes.int64) t = tf.data.Dataset.from_tensors(tn).repeat().take(nn) t1 = tf.data.TFRecordDataset(fname) ds = tf.data.Dataset.zip((t, t1)) return ds keys_to_features_direct = { 'localid': tf.io.FixedLenFeature([], tf.int64, -1), 'image_raw_ldseries': tf.io.FixedLenFeature((), tf.string, ''), 'labels': tf.io.FixedLenFeature((), tf.string, ''), 'dates': tf.io.FixedLenFeature((), tf.string, ''), 'weights': tf.io.FixedLenFeature((), tf.string, '') } # The Decoder (Optional) class SeriesClassificationDirectDecorder(decoder.Decoder): """A tf.Example decoder for tfds classification datasets.""" def _init_(self) -> None: super()._init_() def decode(self, tid, ds): parsed = tf.io.parse_single_example(ds, keys_to_features_direct) encoded = parsed['image_raw_ldseries'] labels_encoded = parsed['labels'] decoded = tf.io.decode_raw(encoded, tf.uint16) label = tf.io.decode_raw(labels_encoded, tf.int8) dates = tf.io.decode_raw(parsed['dates'], tf.int64) weight = tf.io.decode_raw(parsed['weights'], tf.float32) decoded = tf.reshape(decoded,[-1,4,42,42]) sample_dict = { 'tid': tid, # tile ID 'dates': dates, # Date list 'localid': parsed['localid'], # sample ID 'imgs': decoded, # image array 'labels': label, # label list 'weights': weight } return sample_dict # simple function def preprocessDirect(tid, record): parsed = tf.io.parse_single_example(record, keys_to_features_direct) encoded = parsed['image_raw_ldseries'] labels_encoded = parsed['labels'] decoded = tf.io.decode_raw(encoded, tf.uint16) label = tf.io.decode_raw(labels_encoded, tf.int8) dates = tf.io.decode_raw(parsed['dates'], tf.int64) weight = tf.io.decode_raw(parsed['weights'], tf.float32) decoded = tf.reshape(decoded,[-1,4,42,42]) return tid, dates, parsed['localid'], decoded, label, weight t1 = parseRecordDirect('filename here') dataset = t1.map(preprocessDirect, num_parallel_calls=tf.data.experimental.AUTOTUNE) #

Class Definition:

0: clear

1: opaque cloud

2: thin cloud

3: haze

4: cloud shadow

5: snow

Dataset Construction:

First, we randomly generate 500 points for each tile, and all these points are aligned to the pixel grid center of the subdatasets in 60m resolution (eg. B10) for consistence when comparing with other products.
It is because that other cloud detection method may use the cirrus band as features, which is in 60m resolution.

Then, the time series image patches of two shapes are cropped with each point as the center.
The patches of shape $42 \times 42$ are cropped from the bands in 10m resolution (B2, B3, B4, B8) and are used to construct this dataset.
And the patches of shape $348 \times 348$ are cropped from the True Colour Image (TCI, details see sentinel-2 user guide) file and are used to interpreting class labels.

The samples with a large number of timestamps could be time-consuming in the IO stage, thus the time series patches are divided into different groups with timestamps not exceeding 100 for every group.
T
imagenet_a
tensorflow.org
Updated Jun 1, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2024). imagenet_a [Dataset]. https://www.tensorflow.org/datasets/catalog/imagenet_a
Explore at:
Dataset updated
Jun 1, 2024
Description
ImageNet-A is a set of images labelled with ImageNet labels that were obtained by collecting new data and keeping only those images that ResNet-50 models fail to correctly classify. For more details please refer to the paper.

The label space is the same as that of ImageNet2012. Each example is represented as a dictionary with the following keys:

'image': The image, a (H, W, 3)-tensor.

'label': An integer in the range [0, 1000).

'file_name': A unique sting identifying the example within the dataset.

To use this dataset:

import tensorflow_datasets as tfds ds = tfds.load('imagenet_a', split='train') for ex in ds.take(4): print(ex)

See the guide for more informations on tensorflow_datasets.

https://storage.googleapis.com/tfds-data/visualization/fig/imagenet_a-0.1.0.png" alt="Visualization" width="500px">
Kyoushi Log Data Set
zenodo.org
data.niaid.nih.gov
zip
Updated Apr 24, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Max Landauer; Maximilian Frank; Florian Skopik; Wolfgang Hotwagner; Markus Wurzenberger; Andreas Rauber; Max Landauer; Maximilian Frank; Florian Skopik; Wolfgang Hotwagner; Markus Wurzenberger; Andreas Rauber (2025). Kyoushi Log Data Set [Dataset]. http://doi.org/10.5281/zenodo.5779411
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.5779411
Dataset updated
Apr 24, 2025
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Max Landauer; Maximilian Frank; Florian Skopik; Wolfgang Hotwagner; Markus Wurzenberger; Andreas Rauber; Max Landauer; Maximilian Frank; Florian Skopik; Wolfgang Hotwagner; Markus Wurzenberger; Andreas Rauber
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
This repository contains synthetic log data suitable for evaluation of intrusion detection systems. The logs were collected from a testbed that was built at the Austrian Institute of Technology (AIT) following the approaches by [1], [2], and [3]. Please refer to these papers for more detailed information on the dataset and cite them if the data is used for academic publications. Other than the related AIT-LDSv1.1, this dataset involves a more complex network structure, makes use of a different attack scenario, and collects log data from multiple hosts in the network. In brief, the testbed simulates a small enterprise network including mail server, file share, WordPress server, VPN, firewall, etc. Normal user behavior is simulated to generate background noise. After some days, two attack scenarios are launched against the network. Note that the AIT-LDSv2.0 extends this dataset with additional attack cases and variations of attack parameters.

The archives have the following structure. The gather directory contains the raw log data from each host in the network, as well as their system configurations. The labels directory contains the ground truth for those log files that are labeled. The processing directory contains configurations for the labeling procedure and the rules directory contains the labeling rules. Labeling of events that are related to the attacks is carried out with the Kyoushi Labeling Framework.

Each dataset contains traces of a specific attack scenario:

Scenario 1 (see gather/attacker_0/logs/sm.log for detailed attack log):

nmap scan

WPScan

dirb scan

webshell upload through wpDiscuz exploit (CVE-2020-24186)

privilege escalation

Scenario 2 (see gather/attacker_0/logs/dnsteal.log for detailed attack log):

DNSteal data exfiltration

The log data collected from the servers includes

Apache access and error logs (labeled)

audit logs (labeled)

auth logs (labeled)

VPN logs (labeled)

DNS logs (labeled)

syslog

suricata logs

exim logs

horde logs

mail logs

Note that only log files from affected servers are labeled. Label files and the directories in which they are located have the same name as their corresponding log file in the gather directory. Labels are in JSON format and comprise the following attributes: line (number of line in corresponding log file), labels (list of labels assigned to that log line), rules (names of labeling rules matching that log line). Note that not all attack traces are labeled in all log files; please refer to the labeling rules in case that some labels are not clear.

Acknowledgements: Partially funded by the FFG projects INDICAETING (868306) and DECEPT (873980), and the EU project GUARD (833456).

If you use the dataset, please cite the following publications:

[1] M. Landauer, F. Skopik, M. Wurzenberger, W. Hotwagner and A. Rauber, "Have it Your Way: Generating Customized Log Datasets With a Model-Driven Simulation Testbed," in IEEE Transactions on Reliability, vol. 70, no. 1, pp. 402-415, March 2021, doi: 10.1109/TR.2020.3031317.

[2] M. Landauer, M. Frank, F. Skopik, W. Hotwagner, M. Wurzenberger, and A. Rauber, "A Framework for Automatic Labeling of Log Datasets from Model-driven Testbeds for HIDS Evaluation". ACM Workshop on Secure and Trustworthy Cyber-Physical Systems (ACM SaT-CPS 2022), April 27, 2022, Baltimore, MD, USA. ACM.

[3] M. Frank, "Quality improvement of labels for model-driven benchmark data generation for intrusion detection systems", Master's Thesis, Vienna University of Technology, 2021.
H
Replication Data for: Active Learning Approaches for Labeling Text: Review...
dataverse.harvard.edu
dataone.org
Updated Dec 11, 2019
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Blake Miller; Fridolin Linder; Walter Mebane (2019). Replication Data for: Active Learning Approaches for Labeling Text: Review and Assessment of the Performance of Active Learning Approaches [Dataset]. http://doi.org/10.7910/DVN/T88EAX
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.7910/DVN/T88EAX
Dataset updated
Dec 11, 2019
Dataset provided by
Harvard Dataverse
Authors
Blake Miller; Fridolin Linder; Walter Mebane
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Supervised machine learning methods are increasingly employed in political science. Such models require costly manual labeling of documents. In this paper we introduce active learning, a framework in which data to be labeled by human coders are not chosen at random but rather targeted in such a way that the required amount of data to train a machine learning model can be minimized. We study the benefits of active learning using text data examples. We perform simulation studies that illustrate conditions where active learning can reduce the cost of labeling text data. We perform these simulations on three corpora that vary in size, document length and domain. We find that in cases where the document class of interest is not balanced, researchers can label a fraction of the documents one would need using random sampling (or `passive' learning) to achieve equally performing classifiers. We further investigate how varying levels of inter-coder reliability affect the active learning procedures and find that even with low-reliability active learning performs more efficiently than does random sampling.
Z
Data from: 19th Century United States Newspaper Advert images with...
data.niaid.nih.gov
Updated Jan 12, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
van Strien, Daniel (2022). 19th Century United States Newspaper Advert images with 'illustrated' or 'non illustrated' labels [Dataset]. https://data.niaid.nih.gov/resources?id=ZENODO_4075210
Explore at:
Dataset updated
Jan 12, 2022
Dataset provided by
British Library
Authors
van Strien, Daniel
Area covered
United States
Description
The Dataset contains images derived from the Newspaper Navigator (news-navigator.labs.loc.gov/), a dataset of images drawn from the Library of Congress Chronicling America collection (chroniclingamerica.loc.gov/).

[The Newspaper Navigator dataset] consists of extracted visual content for 16,358,041 historic newspaper pages in Chronicling America. The visual content was identified using an object detection model trained on annotations of World War 1-era Chronicling America pages, including annotations made by volunteers as part of the Beyond Words crowdsourcing project.

source: https://news-navigator.labs.loc.gov/

One of these categories is 'advertisements. This dataset contains a sample of these images with additional labels indicating if the advert is 'illustrated' or 'not illustrated'.

The data is organised as follows:

The images themselves can be found in images.zip

newspaper-navigator-sample-metadata.csv contains metadata about each image drawn from the Newspaper Navigator Dataset.

ads.csv contains the labels for the images as a CSV file

sample.csv contains additional metadata about the images (based on the newspapers those images came from).

This dataset was created for use in an under-review Programming Historian tutorial (http://programminghistorian.github.io/ph-submissions/lessons/computer-vision-deep-learning-pt1) The primary aim of the data was to provide a realistic example dataset for teaching computer vision for working with digitised heritage material. The data is shared here since it may be useful for others. This data documentation is a work in progress and will be updated when the Programming Historian tutorial is released publicly.

The metadata CSV file contains the following columns:

filepath

pub_date

page_seq_num

edition_seq_num

batch

lccn

box

score

ocr

place_of_publication

geographic_coverage

name

publisher

url

page_url

month

year

iiif_url
Self-Annotated Wearable Activity Data
zenodo.org
data-staging.niaid.nih.gov
+1more
zip
Updated Sep 18, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Alexander Hölzemann; Alexander Hölzemann; Kristof Van Laerhoven; Kristof Van Laerhoven (2024). Self-Annotated Wearable Activity Data [Dataset]. http://doi.org/10.3389/fcomp.2024.1379788
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.3389/fcomp.2024.1379788
Dataset updated
Sep 18, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Alexander Hölzemann; Alexander Hölzemann; Kristof Van Laerhoven; Kristof Van Laerhoven
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Our dataset contains 2 weeks of approx. 8-9 hours of acceleration data per day from 11 participants wearing a Bangle.js Version 1 smartwatch with our firmware installed.

The dataset contains annotations from 4 different commonly used annotation methods utilized in user studies that focus on in-the-wild data. These methods can be grouped in user-driven, in situ annotations - which are performed before or during the activity is recorded - and recall methods - where participants annotate their data in hindsight at the end of the day.

The participants had the task to label their activities using (1) a button located on the smartwatch, (2) the activity tracking app Strava, (3) a (hand)written diary and (4) a tool to visually inspect and label activity data, called MAD-GUI. Methods (1)-(3) are used in both weeks, however method (4) is introduced in the beginning of the second study week.

The accelerometer data is recorded with 25 Hz, a sensitivity of ±8g and is stored in a csv format. Labels and raw data are not yet combined. You can either write your own script to label the data or follow the instructions in our corresponding Github repository.

The following unique classes are included in our dataset:

laying, sitting, walking, running, cycling, bus_driving, car_driving, vacuum_cleaning, laundry, cooking, eating, shopping, showering, yoga, sport, playing_games, desk_work, guitar_playing, gardening, table_tennis, badminton, horse_riding.

However, many activities are very participant specific and therefore only performed by one of the participants.

The labels are also stored as a .csv file and have the following columns:

week_day, start, stop, activity, layer

Example:

week2_day2,10:30:00,11:00:00,vacuum_cleaning,d

The layer columns specifies which annotation method was used to set this label.

The following identifiers can be found in the column:

b: in situ button

a: in situ app

d: self-recall diary

g: time-series recall labelled with a the MAD-GUI

The corresponding publication is currently under review.

	Participant 1	Participant 2	Participant 3	Participant 4
Age	67	55	60	63
Gender	Male	Female	Male	Female
Final Valid Days	9	15	12	13
Total EMAs	42	57	64	46

Facebook

Twitter

Click to copy link

Link copied

Cite

Growth Market Reports (2025). Data Labeling Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/data-labeling-market

Data Labeling Market Research Report 2033

Explore at:

pdf, pptx, csvAvailable download formats

Dataset updated

Sep 1, 2025

Dataset authored and provided by

Growth Market Reports

Time period covered

2024 - 2032

Area covered

Global

Description

Data Labeling Market Outlook

According to our latest research, the global data labeling market size reached USD 3.2 billion in 2024, driven by the explosive growth in artificial intelligence and machine learning applications across industries. The market is poised to expand at a CAGR of 22.8% from 2025 to 2033, and is forecasted to reach USD 25.3 billion by 2033. This robust growth is primarily fueled by the increasing demand for high-quality annotated data to train advanced AI models, the proliferation of automation in business processes, and the rising adoption of data-driven decision-making frameworks in both the public and private sectors.

One of the principal growth drivers for the data labeling market is the accelerating integration of AI and machine learning technologies across various industries, including healthcare, automotive, retail, and BFSI. As organizations strive to leverage AI for enhanced customer experiences, predictive analytics, and operational efficiency, the need for accurately labeled datasets has become paramount. Data labeling ensures that AI algorithms can learn from well-annotated examples, thereby improving model accuracy and reliability. The surge in demand for computer vision applicationsÂ—such as facial recognition, autonomous vehicles, and medical imagingÂ—has particularly heightened the need for image and video data labeling, further propelling market growth.

Another significant factor contributing to the expansion of the data labeling market is the rapid digitization of business processes and the exponential growth in unstructured data. Enterprises are increasingly investing in data annotation tools and platforms to extract actionable insights from large volumes of text, audio, and video data. The proliferation of Internet of Things (IoT) devices and the widespread adoption of cloud computing have further amplified data generation, necessitating scalable and efficient data labeling solutions. Additionally, the rise of semi-automated and automated labeling technologies, powered by AI-assisted tools, is reducing manual effort and accelerating the annotation process, thereby enabling organizations to meet the growing demand for labeled data at scale.

The evolving regulatory landscape and the emphasis on data privacy and security are also playing a crucial role in shaping the data labeling market. As governments worldwide introduce stringent data protection regulations, organizations are turning to specialized data labeling service providers that adhere to compliance standards. This trend is particularly pronounced in sectors such as healthcare and BFSI, where the accuracy and confidentiality of labeled data are critical. Furthermore, the increasing outsourcing of data labeling tasks to specialized vendors in emerging economies is enabling organizations to access skilled labor at lower costs, further fueling market expansion.

From a regional perspective, North America currently dominates the data labeling market, followed by Europe and the Asia Pacific. The presence of major technology companies, robust investments in AI research, and the early adoption of advanced analytics solutions have positioned North America as the market leader. However, the Asia Pacific region is expected to witness the fastest growth during the forecast period, driven by the rapid digital transformation in countries like China, India, and Japan. The growing focus on AI innovation, government initiatives to promote digitalization, and the availability of a large pool of skilled annotators are key factors contributing to the regionÂ’s impressive growth trajectory.

In the realm of security, Video Dataset Labeling for Security has emerged as a critical application area within the data labeling market. As surveillance systems become more sophisticated, the need for accurately labeled video data is paramount to ensure the effectiveness of security measures. Video dataset labeling involves annotating video frames to identify and track objects, behaviors, and anomalies, which are essential for developing intelligent security systems capable of real-time threat detection and response. This process not only enhances the accuracy of security algorithms but also aids in the training of AI models that can predict and prevent potential security breaches. The growing emphasis on public safety and

Clear search

Close search

Google apps

Main menu

Category	label
day bed	0
dishrag	1
plate	2
running shoe	3
soap dispenser	4
street sign	5
table lamp	6
tile roof	7
toilet seat	8
washing machine	9

Data Labeling Market Research Report 2033

Data Labeling Market Outlook

Face Detection - Face Recognition Dataset

Face Detection - Object Detection & Face Recognition Dataset

The dataset is created on the basis of Selfies and ID Dataset

👉 Legally sourced datasets and carefully structured for AI training and model development. Explore samples from our dataset of 95,000+ human images & videos - Full dataset

Metadata for the full dataset:

OTHER BIOMETRIC DATASETS:

🧩 This is just an example of the data. Leave a request here to learn more

Dataset structure

Data Format

Example of XML file structure

Language Dataset

Data from: ImageNet-Patch: A Dataset for Benchmarking Machine Learning...

code for testing robustness of a model

extract and unzip the dataset, then write top folder here

select folder with specific target

EmoLit

Description Literature sentences from Project Gutenberg. 38 emotion labels (+neutral examples). Semi-Supervised dataset.

Article

Abstract

Labels

Dataset

Code

Models

UCI and OpenML Data Sets for Ordinal Quantification

Body Segmentation - 6,700 Photos

7 Types of Human Segmentation

👉 Legally sourced datasets and carefully structured for AI training and model development. Explore samples from our dataset - Full dataset

Types of labeling

🧩 This is just an example of the data. Leave a request here to learn more

Physiological signals during activities for daily life: Dataset

Car Highway Dataset

Car-Highway Data Annotation Project

Introduction

Project Goals

Tools and Technologies

Annotation Process

Data Augmentation

Data Export

Milestones

Conclusion

Dollar street 10 - 64x64x3

Website Screenshots Dataset

About This Dataset

Example

Usage

Collecting Custom Data

About Roboflow

FSDKaggle2019

AI Training Dataset Market Analysis, Size, and Forecast 2025-2029: North...

Snapshot img

AIT Log Data Set V2.0

Dataset for "Enhancing Cloud Detection in Sentinel-2 Imagery: A...

imagenet_a

Kyoushi Log Data Set

Replication Data for: Active Learning Approaches for Labeling Text: Review...

Data from: 19th Century United States Newspaper Advert images with...

Self-Annotated Wearable Activity Data

Data Labeling Market Research Report 2033

Data Labeling Market Outlook