100+ datasets found
  1. Data Labeling Market Research Report 2033

    • growthmarketreports.com
    csv, pdf, pptx
    Updated Sep 1, 2025
    Cite
    Growth Market Reports (2025). Data Labeling Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/data-labeling-market
    Explore at:
    pdf, pptx, csv; available download formats
    Dataset updated
    Sep 1, 2025
    Dataset authored and provided by
    Growth Market Reports
    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Data Labeling Market Outlook



    According to our latest research, the global data labeling market size reached USD 3.2 billion in 2024, driven by the explosive growth in artificial intelligence and machine learning applications across industries. The market is poised to expand at a CAGR of 22.8% from 2025 to 2033, and is forecasted to reach USD 25.3 billion by 2033. This robust growth is primarily fueled by the increasing demand for high-quality annotated data to train advanced AI models, the proliferation of automation in business processes, and the rising adoption of data-driven decision-making frameworks in both the public and private sectors.




    One of the principal growth drivers for the data labeling market is the accelerating integration of AI and machine learning technologies across various industries, including healthcare, automotive, retail, and BFSI. As organizations strive to leverage AI for enhanced customer experiences, predictive analytics, and operational efficiency, the need for accurately labeled datasets has become paramount. Data labeling ensures that AI algorithms can learn from well-annotated examples, thereby improving model accuracy and reliability. The surge in demand for computer vision applications—such as facial recognition, autonomous vehicles, and medical imaging—has particularly heightened the need for image and video data labeling, further propelling market growth.




    Another significant factor contributing to the expansion of the data labeling market is the rapid digitization of business processes and the exponential growth in unstructured data. Enterprises are increasingly investing in data annotation tools and platforms to extract actionable insights from large volumes of text, audio, and video data. The proliferation of Internet of Things (IoT) devices and the widespread adoption of cloud computing have further amplified data generation, necessitating scalable and efficient data labeling solutions. Additionally, the rise of semi-automated and automated labeling technologies, powered by AI-assisted tools, is reducing manual effort and accelerating the annotation process, thereby enabling organizations to meet the growing demand for labeled data at scale.




    The evolving regulatory landscape and the emphasis on data privacy and security are also playing a crucial role in shaping the data labeling market. As governments worldwide introduce stringent data protection regulations, organizations are turning to specialized data labeling service providers that adhere to compliance standards. This trend is particularly pronounced in sectors such as healthcare and BFSI, where the accuracy and confidentiality of labeled data are critical. Furthermore, the increasing outsourcing of data labeling tasks to specialized vendors in emerging economies is enabling organizations to access skilled labor at lower costs, further fueling market expansion.




    From a regional perspective, North America currently dominates the data labeling market, followed by Europe and the Asia Pacific. The presence of major technology companies, robust investments in AI research, and the early adoption of advanced analytics solutions have positioned North America as the market leader. However, the Asia Pacific region is expected to witness the fastest growth during the forecast period, driven by the rapid digital transformation in countries like China, India, and Japan. The growing focus on AI innovation, government initiatives to promote digitalization, and the availability of a large pool of skilled annotators are key factors contributing to the region's impressive growth trajectory.



    In the realm of security, Video Dataset Labeling for Security has emerged as a critical application area within the data labeling market. As surveillance systems become more sophisticated, the need for accurately labeled video data is paramount to ensure the effectiveness of security measures. Video dataset labeling involves annotating video frames to identify and track objects, behaviors, and anomalies, which are essential for developing intelligent security systems capable of real-time threat detection and response. This process not only enhances the accuracy of security algorithms but also aids in the training of AI models that can predict and prevent potential security breaches. The growing emphasis on public safety and

  2. _labels1.csv. This data set represents the label of the corresponding...

    • figshare.com
    txt
    Updated Oct 9, 2023
    Cite
    naillah gul (2023). _labels1.csv. This data set representss the label of the corresponding samples in data.csv file [Dataset]. http://doi.org/10.6084/m9.figshare.24270088.v1
    Explore at:
    txt; available download formats
    Dataset updated
    Oct 9, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    naillah gul
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The datasets contain pixel-level hyperspectral data of six snow and glacier classes. They have been extracted from a hyperspectral image. The dataset "data.csv" has 5417 * 142 samples belonging to the classes: Clean snow, Dirty ice, Firn, Glacial ice, Ice mixed debris, and Water body. The dataset "_labels1.csv" has the corresponding labels of the "data.csv" file. The dataset "RGB.csv" has only 5417 * 3 samples; there are only three band values in this file, while "data.csv" has 142 band values.
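
    A minimal way to pair the samples with their labels is to load both CSV files by row position. The sketch below assumes the files contain no header row and that row i of "_labels1.csv" corresponds to row i of "data.csv"; adjust the header handling if the actual files differ.

    import pandas as pd

    # Assumed: rows of _labels1.csv align one-to-one with rows of data.csv.
    X = pd.read_csv('data.csv', header=None)      # 5417 rows x 142 spectral bands
    y = pd.read_csv('_labels1.csv', header=None)  # 5417 rows x 1 class label

    print(X.shape, y.shape)
    print(y[0].value_counts())  # distribution over the six snow/glacier classes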

  3. Resume_Dataset

    • kaggle.com
    zip
    Updated Jul 26, 2025
    Cite
    RayyanKauchali0 (2025). Resume_Dataset [Dataset]. https://www.kaggle.com/datasets/rayyankauchali0/resume-dataset
    Explore at:
    zip (3616108 bytes); available download formats
    Dataset updated
    Jul 26, 2025
    Authors
    RayyanKauchali0
    License

    CC0 1.0 Universal: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Tech Resume Dataset (3,500+ Samples):

    This dataset is designed for cutting-edge NLP research in resume parsing, job classification, and ATS system development. Below are extensive details and several ready-made diagrams you can include in your Kaggle upload (just save and upload as “Additional Files” or use them in your dataset description).

    Dataset Composition and Sourcing

    • Total Resumes: 3,500+
    • Sources:
      • Real Data: 2,047 resumes (58.5%) from ResumeAtlas and reputable open repositories; all records strictly anonymized.
      • Template-Based Synthetic: 573 resumes featuring varied narratives and realistic achievements for classic, modern, and professional styles.
      • LLM-Generated Variations: 460 unique samples using structured prompts to diversify skills, summaries, and career tracks, focusing on AI, ML, and data.
      • Faker-Seeded Synthetic: 420 resumes, especially for junior/support/cloud/network tracks, populated with robust Faker-generated work and education fields.
    • Role Coverage:
      • 15 major technology clusters (Software Engineering, DevOps, Cloud, AI/ML, Security, Data Engineering, QA, UI/UX, and more)
      • At least 200 samples per primary role group for label balance
      • 60+ subcategories reflecting granular tech job roles

    Key Dataset Fields (JSONL Schema)

    | Field | Description | Example/Data Type |
    |:------|:------------|:------------------|
    | ResumeID | Unique, anonymized string | "DIS4JE91Z..." (string) |
    | Category | Tech job category/label | "DevOps Engineer" |
    | Name | Anonymized (Faker-generated) name | "Jordan Patel" |
    | Email | Anonymized email address | "jpatel@example.com" |
    | Phone | Anonymized phone number | "+1-555-343-2123" |
    | Location | City, country or region (anonymized) | "Austin, TX, USA" |
    | Summary | Professional summary/intro | String (3-6 sentences) |
    | Skills | List or comma-separated tech/soft skills | "Python, Kubernetes..." |
    | Experience | Work chronology, organizations, bullet-point details | String (multiline) |
    | Education | Universities, degrees, certs | String (multiline) |
    | Source | "real", "template", "llm", or "faker" | String |


    Dataset Schema Overview with Field Descriptions and Data Types

    Technical Validation & Quality Assurance

    • Formatting:
      • Uniform schema, right-tab alignment for dates (MMM-YYYY)
      • Standard ATS/NLP-friendly section headers
    • De-duplication:
      • All records checked with BERT/MinHash for uniqueness (cosine similarity >0.9 removed); a minimal similarity-check sketch follows this list
    • PII Scrubbing:
      • Names, contacts, locations anonymized with Python Faker
    • Role/Skill Taxonomy:
      • Job titles & skills mapped to ESCO, O*NET, NIST NICE, CNCF lexicons for research alignment
    • Quality Checks:
      • Automatic and manual validation for section presence, data type conformity, and format alignment
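
    The de-duplication step above can be approximated with a simple pairwise similarity check. The sketch below is illustrative only: it substitutes TF-IDF vectors for the BERT/MinHash pipeline described, and reuses the 0.9 cosine-similarity threshold and the Summary/Experience field names from this page.

    import json

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # Load the resumes (field names follow the JSONL schema above).
    with open('tech_resumes_dataset.jsonl', 'r', encoding='utf-8') as f:
        resumes = [json.loads(line) for line in f]

    texts = [r['Summary'] + ' ' + r['Experience'] for r in resumes]

    # TF-IDF stand-in for the BERT embeddings used in the original pipeline.
    vectors = TfidfVectorizer().fit_transform(texts)
    sim = cosine_similarity(vectors)

    # Flag the later record of any pair whose similarity exceeds 0.9.
    duplicates = set()
    for i in range(len(resumes)):
        for j in range(i + 1, len(resumes)):
            if sim[i, j] > 0.9:
                duplicates.add(j)

    deduped = [r for k, r in enumerate(resumes) if k not in duplicates]
    print(f"Removed {len(duplicates)} near-duplicates, kept {len(deduped)}")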

    Role & Source Coverage Visualizations

    Composition by Data Source:


    Composition of Tech Resume Dataset by Data Source

    Role Cluster Diversity:


    Distribution of Major Tech Role Clusters in the 3,500 Resumes Dataset

    Alternative: Dataset by Source Type (Pie Chart):


    Resume Dataset Composition by Source Type

    Typical Use Cases

    • Resume parsing & sectioning (training for models like BERT, RoBERTa, spaCy)
    • Fine-tuning for NER, job classification (60+ labels), skill extraction, and ATS research
    • Development or benchmarking of AI-powered job matching, candidate ranking, and automated tracking tools
    • ML/data science education and demo pipelines

    How to Use the JSONL File

    Each line in tech_resumes_dataset.jsonl is a single, fully structured resume object:

    import json
    
    with open('tech_resumes_dataset.jsonl', 'r', encoding='utf-8') as f:
      resumes = [json.loads(line) for line in f]
    # Each record is now a Python dictionary
    

    Citing and Sharing

    If you use this dataset, credit it as “[your Kaggle dataset URL]” and mention original sources (ResumeAtlas, Resume_Classification, Kaggle Resume Dataset, and synthetic methodology as described).

  4. Data from: X-ray CT data with semantic annotations for the paper "A workflow...

    • catalog.data.gov
    • datasetcatalog.nlm.nih.gov
    • +1more
    Updated Jun 5, 2025
    Cite
    Agricultural Research Service (2025). X-ray CT data with semantic annotations for the paper "A workflow for segmenting soil and plant X-ray CT images with deep learning in Google’s Colaboratory" [Dataset]. https://catalog.data.gov/dataset/x-ray-ct-data-with-semantic-annotations-for-the-paper-a-workflow-for-segmenting-soil-and-p-d195a
    Explore at:
    Dataset updated
    Jun 5, 2025
    Dataset provided by
    Agricultural Research Service
    Description

    Leaves from genetically unique Juglans regia plants were scanned using X-ray micro-computed tomography (microCT) on the X-ray μCT beamline (8.3.2) at the Advanced Light Source (ALS) at Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA, USA. Soil samples were collected in Fall of 2017 from the riparian oak forest located at the Russell Ranch Sustainable Agricultural Institute at the University of California, Davis. The soil was sieved through a 2 mm mesh and air dried before imaging. A single soil aggregate was scanned at 23 keV using the 10x objective lens with a pixel resolution of 650 nanometers on beamline 8.3.2 at the ALS. Additionally, a drought-stressed almond flower bud (Prunus dulcis) from a plant housed at the University of California, Davis, was scanned using a 4x lens with a pixel resolution of 1.72 µm on beamline 8.3.2 at the ALS. Raw tomographic image data were reconstructed using TomoPy. Reconstructions were converted to 8-bit tif or png format using ImageJ or the PIL package in Python before further processing. Images were annotated using Intel's Computer Vision Annotation Tool (CVAT) and ImageJ; both CVAT and ImageJ are free to use and open source.

    Leaf images were annotated following Théroux-Rancourt et al. (2020). Specifically, hand labeling was done directly in ImageJ by drawing around each tissue, with 5 images annotated per leaf. Care was taken to cover a range of anatomical variation to help improve the generalizability of the models to other leaves. All slices were labeled by Dr. Mina Momayyezi and Fiona Duong. To annotate the flower bud and soil aggregate, images were imported into CVAT. The exterior border of the bud (i.e. bud scales) and flower were annotated in CVAT and exported as masks. Similarly, the exterior of the soil aggregate and particulate organic matter identified by eye were annotated in CVAT and exported as masks. To annotate air spaces in both the bud and soil aggregate, images were imported into ImageJ. A Gaussian blur was applied to the image to decrease noise, and the air space was then segmented using thresholding. After applying the threshold, the selected air space region was converted to a binary image with white representing the air space and black representing everything else. This binary image was overlaid upon the original image and the air space within the flower bud and aggregate was selected using the "free hand" tool. Air space outside of the region of interest for both image sets was eliminated. The quality of the air space annotation was then visually inspected for accuracy against the underlying original image; incomplete annotations were corrected using the brush or pencil tool to paint missing air space white and incorrectly identified air space black. Once the annotation was satisfactorily corrected, the binary image of the air space was saved. Finally, the annotations of the bud and flower or aggregate and organic matter were opened in ImageJ and the associated air space mask was overlaid on top of them, forming a three-layer mask suitable for training the fully convolutional network. All labeling of the soil aggregate and soil aggregate images was done by Dr. Devin Rippner. These images and annotations are for training deep learning models to identify different constituents in leaves, almond buds, and soil aggregates.

    Limitations: For the walnut leaves, some tissues (stomata, etc.) are not labeled and only represent a small portion of a full leaf. Similarly, both the almond bud and the aggregate represent just one single sample of each. The bud tissues are only divided into bud scales, flower, and air space; many other tissues remain unlabeled. For the soil aggregate, labels were assigned by eye with no actual chemical information, so particulate organic matter identification may be incorrect.

    Resources in this dataset:

    Resource Title: Annotated X-ray CT images and masks of a Forest Soil Aggregate. File Name: forest_soil_images_masks_for_testing_training.zip. Resource Description: This aggregate was collected from the riparian oak forest at the Russell Ranch Sustainable Agricultural Facility. The aggregate was scanned using X-ray micro-computed tomography (microCT) on the X-ray μCT beamline (8.3.2) at the Advanced Light Source (ALS) at Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA, USA, using the 10x objective lens with a pixel resolution of 650 nanometers. For masks, the background has a value of 0,0,0; pore spaces have a value of 250,250,250; mineral solids have a value of 128,0,0; and particulate organic matter has a value of 0,128,0. These files were used for training a model to segment the forest soil aggregate and for testing the accuracy, precision, recall, and F1 score of the model.

    Resource Title: Annotated X-ray CT images and masks of an Almond bud (P. dulcis). File Name: Almond_bud_tube_D_P6_training_testing_images_and_masks.zip. Resource Description: A drought-stressed almond flower bud (Prunus dulcis) from a plant housed at the University of California, Davis, was scanned by X-ray micro-computed tomography (microCT) on the X-ray μCT beamline (8.3.2) at the Advanced Light Source (ALS) at Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA, USA, using the 4x lens with a pixel resolution of 1.72 µm. For masks, the background has a value of 0,0,0; air spaces have a value of 255,255,255; bud scales have a value of 128,0,0; and flower tissues have a value of 0,128,0. These files were used for training a model to segment the almond bud and for testing the accuracy, precision, recall, and F1 score of the model. Resource Software Recommended: Fiji (ImageJ), url: https://imagej.net/software/fiji/downloads

    Resource Title: Annotated X-ray CT images and masks of Walnut leaves (J. regia). File Name: 6_leaf_training_testing_images_and_masks_for_paper.zip. Resource Description: Stems were collected from genetically unique J. regia accessions at the USDA-ARS-NCGR in Wolfskill Experimental Orchard, Winters, California, USA to use as scion, and were grafted by Sierra Gold Nursery onto a commonly used commercial rootstock, RX1 (J. microcarpa × J. regia). We used a common rootstock to eliminate any own-root effects and to simulate conditions for a commercial walnut orchard setting, where rootstocks are commonly used. The grafted saplings were repotted and transferred to the Armstrong lathe house facility at the University of California, Davis in June 2019, and kept under natural light and temperature. Leaves from each accession and treatment were scanned using X-ray micro-computed tomography (microCT) on the X-ray μCT beamline (8.3.2) at the Advanced Light Source (ALS) at Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA, USA, using the 10x objective lens with a pixel resolution of 650 nanometers. For masks, the background has a value of 170,170,170; Epidermis has a value of 85,85,85; Mesophyll has a value of 0,0,0; Bundle Sheath Extension has a value of 152,152,152; Vein has a value of 220,220,220; Air has a value of 255,255,255. Resource Software Recommended: Fiji (ImageJ), url: https://imagej.net/software/fiji/downloads
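
    To use these masks for training, the RGB values listed above typically need to be mapped to integer class indices. Below is a minimal sketch for the forest soil aggregate masks, using the values quoted in the resource description and assuming the masks are ordinary RGB image files; the file name inside the zip is hypothetical.

    import numpy as np
    from PIL import Image

    # RGB value -> class index, as listed in the forest soil aggregate resource description.
    COLOR_TO_CLASS = {
        (0, 0, 0): 0,        # background
        (250, 250, 250): 1,  # pore space
        (128, 0, 0): 2,      # mineral solids
        (0, 128, 0): 3,      # particulate organic matter
    }

    def mask_to_class_indices(path):
        rgb = np.array(Image.open(path).convert('RGB'))
        out = np.zeros(rgb.shape[:2], dtype=np.uint8)
        for color, idx in COLOR_TO_CLASS.items():
            out[np.all(rgb == color, axis=-1)] = idx
        return out

    # Hypothetical file name inside forest_soil_images_masks_for_testing_training.zip:
    # labels = mask_to_class_indices('masks/slice_0001.png')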

  5. UCI and OpenML Data Sets for Ordinal Quantification

    • zenodo.org
    • data.niaid.nih.gov
    • +1more
    zip
    Updated Jul 25, 2023
    + more versions
    Cite
    Mirko Bunse; Mirko Bunse; Alejandro Moreo; Alejandro Moreo; Fabrizio Sebastiani; Fabrizio Sebastiani; Martin Senz; Martin Senz (2023). UCI and OpenML Data Sets for Ordinal Quantification [Dataset]. http://doi.org/10.5281/zenodo.8177302
    Explore at:
    zip; available download formats
    Dataset updated
    Jul 25, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Mirko Bunse; Mirko Bunse; Alejandro Moreo; Alejandro Moreo; Fabrizio Sebastiani; Fabrizio Sebastiani; Martin Senz; Martin Senz
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    These four labeled data sets are targeted at ordinal quantification. The goal of quantification is not to predict the label of each individual instance, but the distribution of labels in unlabeled sets of data.

    With the scripts provided, you can extract CSV files from the UCI machine learning repository and from OpenML. The ordinal class labels stem from a binning of a continuous regression label.

    We complement this data set with the indices of data items that appear in each sample of our evaluation. Hence, you can precisely replicate our samples by drawing the specified data items. The indices stem from two evaluation protocols that are well suited for ordinal quantification. To this end, each row in the files app_val_indices.csv, app_tst_indices.csv, app-oq_val_indices.csv, and app-oq_tst_indices.csv represents one sample.

    Our first protocol is the artificial prevalence protocol (APP), where all possible distributions of labels are drawn with an equal probability. The second protocol, APP-OQ, is a variant thereof, where only the smoothest 20% of all APP samples are considered. This variant is targeted at ordinal quantification tasks, where classes are ordered and a similarity of neighboring classes can be assumed.
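
    A sample can be replicated by selecting the listed data items from the extracted CSV. The sketch below is a minimal illustration: it assumes each row of app_val_indices.csv holds the integer row indices of one sample and has no header, and the extracted data file name is hypothetical.

    import pandas as pd

    # Data CSV produced by extract-oq.jl; the first column is "class_label".
    data = pd.read_csv('extracted_dataset.csv')            # hypothetical file name

    # Each row of the index file lists the data items of one evaluation sample.
    indices = pd.read_csv('app_val_indices.csv', header=None)

    sample_0 = data.iloc[indices.iloc[0].dropna().astype(int)]
    print(sample_0['class_label'].value_counts(normalize=True))  # label distribution of this sample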

    Usage

    You can extract four CSV files through the provided script extract-oq.jl, which is conveniently wrapped in a Makefile. The Project.toml and Manifest.toml specify the Julia package dependencies, similar to a requirements file in Python.

    Preliminaries: You have to have a working Julia installation. We have used Julia v1.6.5 in our experiments.

    Data Extraction: In your terminal, you can call either

    make

    (recommended), or

    julia --project="." --eval "using Pkg; Pkg.instantiate()"
    julia --project="." extract-oq.jl

    Outcome: The first row in each CSV file is the header. The first column, named "class_label", is the ordinal class.

    Further Reading

    Implementation of our experiments: https://github.com/mirkobunse/regularized-oq

  6. Replication Data for: Active Learning Approaches for Labeling Text: Review...

    • dataverse.harvard.edu
    • dataone.org
    Updated Dec 11, 2019
    Cite
    Blake Miller; Fridolin Linder; Walter Mebane (2019). Replication Data for: Active Learning Approaches for Labeling Text: Review and Assessment of the Performance of Active Learning Approaches [Dataset]. http://doi.org/10.7910/DVN/T88EAX
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 11, 2019
    Dataset provided by
    Harvard Dataverse
    Authors
    Blake Miller; Fridolin Linder; Walter Mebane
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Supervised machine learning methods are increasingly employed in political science. Such models require costly manual labeling of documents. In this paper we introduce active learning, a framework in which data to be labeled by human coders are not chosen at random but rather targeted in such a way that the required amount of data to train a machine learning model can be minimized. We study the benefits of active learning using text data examples. We perform simulation studies that illustrate conditions where active learning can reduce the cost of labeling text data. We perform these simulations on three corpora that vary in size, document length and domain. We find that in cases where the document class of interest is not balanced, researchers can label a fraction of the documents one would need using random sampling (or `passive' learning) to achieve equally performing classifiers. We further investigate how varying levels of inter-coder reliability affect the active learning procedures and find that even with low reliability, active learning performs more efficiently than random sampling.

  7. Machine Learning Basics for Beginners🤖🧠

    • kaggle.com
    zip
    Updated Jun 22, 2023
    Cite
    Bhanupratap Biswas (2023). Machine Learning Basics for Beginners🤖🧠 [Dataset]. https://www.kaggle.com/datasets/bhanupratapbiswas/machine-learning-basics-for-beginners
    Explore at:
    zip (492015 bytes); available download formats
    Dataset updated
    Jun 22, 2023
    Authors
    Bhanupratap Biswas
    License

    ODC Public Domain Dedication and Licence (PDDL) v1.0: http://www.opendatacommons.org/licenses/pddl/1.0/
    License information was derived automatically

    Description

    This dataset provides an introduction to machine learning basics for beginners. Machine learning is a subfield of artificial intelligence (AI) that focuses on enabling computers to learn and make predictions or decisions without being explicitly programmed. Here are some key concepts and terms to help you get started:

    1. Supervised Learning: In supervised learning, the machine learning algorithm learns from labeled training data. The training data consists of input examples and their corresponding correct output or target values. The algorithm learns to generalize from this data and make predictions or classify new, unseen examples.

    2. Unsupervised Learning: Unsupervised learning involves learning patterns and relationships from unlabeled data. Unlike supervised learning, there are no target values provided. Instead, the algorithm aims to discover inherent structures or clusters in the data.

    3. Training Data and Test Data: Machine learning models require a dataset to learn from. The dataset is typically split into two parts: the training data and the test data. The model learns from the training data, and the test data is used to evaluate its performance and generalization ability.

    4. Features and Labels: In supervised learning, the input examples are often represented by features or attributes. For example, in a spam email classification task, features might include the presence of certain keywords or the length of the email. The corresponding output or target values are called labels, indicating the class or category to which the example belongs (e.g., spam or not spam).

    5. Model Evaluation Metrics: To assess the performance of a machine learning model, various evaluation metrics are used. Common metrics include accuracy (the proportion of correctly predicted examples), precision (the proportion of true positives among all positive predictions), recall (the proportion of actual positives that are correctly predicted), and F1 score (the harmonic mean of precision and recall). A short scikit-learn sketch of these metrics follows this list.

    6. Overfitting and Underfitting: Overfitting occurs when a model becomes too complex and learns to memorize the training data instead of generalizing well to unseen examples. On the other hand, underfitting happens when a model is too simple and fails to capture the underlying patterns in the data. Balancing the complexity of the model is crucial to achieve good generalization.

    7. Feature Engineering: Feature engineering involves selecting or creating relevant features that can help improve the performance of a machine learning model. It often requires domain knowledge and creativity to transform raw data into a suitable representation that captures the important information.

    8. Bias and Variance Trade-off: The bias-variance trade-off is a fundamental concept in machine learning. Bias refers to the errors introduced by the model's assumptions and simplifications, while variance refers to the model's sensitivity to small fluctuations in the training data. Reducing bias may increase variance and vice versa. Finding the right balance is important for building a well-performing model.

    9. Supervised Learning Algorithms: There are various supervised learning algorithms, including linear regression, logistic regression, decision trees, random forests, support vector machines (SVM), and neural networks. Each algorithm has its own strengths, weaknesses, and specific use cases.

    10. Unsupervised Learning Algorithms: Unsupervised learning algorithms include clustering algorithms like k-means clustering and hierarchical clustering, dimensionality reduction techniques like principal component analysis (PCA) and t-SNE, and anomaly detection algorithms, among others.
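
    The evaluation metrics from point 5 can be computed directly with scikit-learn; here is a minimal sketch on toy labels (the numbers are made up for illustration).

    from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

    y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # ground-truth labels (e.g. spam = 1)
    y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # model predictions

    print("accuracy :", accuracy_score(y_true, y_pred))
    print("precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
    print("recall   :", recall_score(y_true, y_pred))     # TP / (TP + FN)
    print("f1       :", f1_score(y_true, y_pred))         # harmonic mean of precision and recall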

    These concepts provide a starting point for understanding the basics of machine learning. As you delve deeper, you can explore more advanced topics such as deep learning, reinforcement learning, and natural language processing. Remember to practice hands-on with real-world datasets to gain practical experience and further refine your skills.

  8. Face Detection - Face Recognition Dataset

    • kaggle.com
    zip
    Updated Nov 8, 2023
    Cite
    Unique Data (2023). Face Detection - Face Recognition Dataset [Dataset]. https://www.kaggle.com/datasets/trainingdatapro/face-detection-photos-and-labels
    Explore at:
    zip (1252666206 bytes); available download formats
    Dataset updated
    Nov 8, 2023
    Authors
    Unique Data
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    Face Detection - Object Detection & Face Recognition Dataset

    The dataset is created on the basis of Selfies and ID Dataset

    The dataset is a collection of images (selfies) of people and bounding box labeling for their faces. It has been specifically curated for face detection and face recognition tasks. The dataset encompasses diverse demographics, ages, ethnicities, and genders.


    The dataset is a valuable resource for researchers, developers, and organizations working on age prediction and face recognition to train, evaluate, and fine-tune AI models for real-world applications. It can be applied in various domains like psychology, market research, and personalized advertising.

    👉 Legally sourced and carefully structured datasets for AI training and model development. Explore samples from our dataset of 95,000+ human images & videos - Full dataset

    Metadata for the full dataset:

    • assignment_id - unique identifier of the media file
    • worker_id - unique identifier of the person
    • age - age of the person
    • true_gender - gender of the person
    • country - country of the person
    • ethnicity - ethnicity of the person
    • photo_1_extension, photo_2_extension, …, photo_15_extension - photo extensions in the dataset
    • photo_1_resolution, photo_2_resolution, …, photo_15_resolution - photo resolution in the dataset

    OTHER BIOMETRIC DATASETS:

    🧩 This is just an example of the data. Leave a request here to learn more

    Dataset structure

    • images - contains the original images of people
    • labels - includes visualized labeling for the original images
    • annotations.xml - contains coordinates of the bbox, created for the original photo

    Data Format

    Each image from the images folder is accompanied by an XML annotation in the annotations.xml file indicating the coordinates of the bounding boxes (or polygons) and their labels. For each point, the x and y coordinates are provided.

    Example of XML file structure

    (Screenshot of the annotations.xml structure was shown here.)
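
    A hedged sketch of reading the annotations with Python's standard library is shown below. The exact tag and attribute names depend on the CVAT export that was used, so the image/box element names and xtl/ytl/xbr/ybr attributes are assumptions to adapt after inspecting annotations.xml.

    import xml.etree.ElementTree as ET

    tree = ET.parse('annotations.xml')
    root = tree.getroot()

    # Assumed CVAT-style layout: one <image> element per photo, with <box> children
    # carrying the face bounding-box corner coordinates as attributes.
    for image in root.iter('image'):
        name = image.get('name')
        for box in image.iter('box'):
            xtl, ytl = float(box.get('xtl')), float(box.get('ytl'))
            xbr, ybr = float(box.get('xbr')), float(box.get('ybr'))
            print(name, box.get('label'), (xtl, ytl, xbr, ybr))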

    🚀 You can learn more about our high-quality unique datasets here

    keywords: biometric system, biometric system attacks, biometric dataset, face recognition database, face recognition dataset, face detection dataset, facial analysis, object detection dataset, deep learning datasets, computer vision dataset, human images dataset, human faces dataset

  9. FSDKaggle2018

    • data.niaid.nih.gov
    • opendatalab.com
    • +1more
    Updated Jan 24, 2020
    + more versions
    Cite
    Eduardo Fonseca; Xavier Favory; Jordi Pons; Frederic Font; Manoj Plakal; Daniel P. W. Ellis; Xavier Serra (2020). FSDKaggle2018 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_2552859
    Explore at:
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Music Technology Group (https://www.upf.edu/web/mtg)
    Google, Inc., New York, NY, USA
    Authors
    Eduardo Fonseca; Xavier Favory; Jordi Pons; Frederic Font; Manoj Plakal; Daniel P. W. Ellis; Xavier Serra
    Description

    FSDKaggle2018 is an audio dataset containing 11,073 audio files annotated with 41 labels of the AudioSet Ontology. FSDKaggle2018 has been used for the DCASE Challenge 2018 Task 2, which was run as a Kaggle competition titled Freesound General-Purpose Audio Tagging Challenge.

    Citation

    If you use the FSDKaggle2018 dataset or part of it, please cite our DCASE 2018 paper:

    Eduardo Fonseca, Manoj Plakal, Frederic Font, Daniel P. W. Ellis, Xavier Favory, Jordi Pons, Xavier Serra. "General-purpose Tagging of Freesound Audio with AudioSet Labels: Task Description, Dataset, and Baseline". Proceedings of the DCASE 2018 Workshop (2018)

    You can also consider citing our ISMIR 2017 paper, which describes how we gathered the manual annotations included in FSDKaggle2018.

    Eduardo Fonseca, Jordi Pons, Xavier Favory, Frederic Font, Dmitry Bogdanov, Andres Ferraro, Sergio Oramas, Alastair Porter, and Xavier Serra, "Freesound Datasets: A Platform for the Creation of Open Audio Datasets", In Proceedings of the 18th International Society for Music Information Retrieval Conference, Suzhou, China, 2017

    Contact

    You are welcome to contact Eduardo Fonseca should you have any questions at eduardo.fonseca@upf.edu.

    About this dataset

    Freesound Dataset Kaggle 2018 (or FSDKaggle2018 for short) is an audio dataset containing 11,073 audio files annotated with 41 labels of the AudioSet Ontology [1]. FSDKaggle2018 has been used for the Task 2 of the Detection and Classification of Acoustic Scenes and Events (DCASE) Challenge 2018. Please visit the DCASE2018 Challenge Task 2 website for more information. This Task was hosted on the Kaggle platform as a competition titled Freesound General-Purpose Audio Tagging Challenge. It was organized by researchers from the Music Technology Group of Universitat Pompeu Fabra, and from Google Research’s Machine Perception Team.

    The goal of this competition was to build an audio tagging system that can categorize an audio clip as belonging to one of a set of 41 diverse categories drawn from the AudioSet Ontology.

    All audio samples in this dataset are gathered from Freesound [2] and are provided here as uncompressed PCM 16 bit, 44.1 kHz, mono audio files. Note that because Freesound content is collaboratively contributed, recording quality and techniques can vary widely.

    The ground truth data provided in this dataset has been obtained after a data labeling process which is described below in the Data labeling process section. FSDKaggle2018 clips are unequally distributed in the following 41 categories of the AudioSet Ontology:

    "Acoustic_guitar", "Applause", "Bark", "Bass_drum", "Burping_or_eructation", "Bus", "Cello", "Chime", "Clarinet", "Computer_keyboard", "Cough", "Cowbell", "Double_bass", "Drawer_open_or_close", "Electric_piano", "Fart", "Finger_snapping", "Fireworks", "Flute", "Glockenspiel", "Gong", "Gunshot_or_gunfire", "Harmonica", "Hi-hat", "Keys_jangling", "Knock", "Laughter", "Meow", "Microwave_oven", "Oboe", "Saxophone", "Scissors", "Shatter", "Snare_drum", "Squeak", "Tambourine", "Tearing", "Telephone", "Trumpet", "Violin_or_fiddle", "Writing".

    Some other relevant characteristics of FSDKaggle2018:

    The dataset is split into a train set and a test set.

    The train set is meant to be for system development and includes ~9.5k samples unequally distributed among 41 categories. The minimum number of audio samples per category in the train set is 94, and the maximum 300. The duration of the audio samples ranges from 300ms to 30s due to the diversity of the sound categories and the preferences of Freesound users when recording sounds. The total duration of the train set is roughly 18h.

    Out of the ~9.5k samples from the train set, ~3.7k have manually-verified ground truth annotations and ~5.8k have non-verified annotations. The non-verified annotations of the train set have a quality estimate of at least 65-70% in each category. Check out the Data labeling process section below for more information about this aspect.

    Non-verified annotations in the train set are properly flagged in train.csv so that participants can opt to use this information during the development of their systems.

    The test set is composed of 1.6k samples with manually-verified annotations and with a similar category distribution to that of the train set. The total duration of the test set is roughly 2h.

    All audio samples in this dataset have a single label (i.e. are only annotated with one label). Check out the Data labeling process section below for more information about this aspect. A single label should be predicted for each file in the test set.

    Data labeling process

    The data labeling process started from a manual mapping between Freesound tags and AudioSet Ontology categories (or labels), which was carried out by researchers at the Music Technology Group, Universitat Pompeu Fabra, Barcelona. Using this mapping, a number of Freesound audio samples were automatically annotated with labels from the AudioSet Ontology. These annotations can be understood as weak labels since they express the presence of a sound category in an audio sample.

    Then, a data validation process was carried out in which a number of participants did listen to the annotated sounds and manually assessed the presence/absence of an automatically assigned sound category, according to the AudioSet category description.

    Audio samples in FSDKaggle2018 are only annotated with a single ground truth label (see train.csv). A total of 3,710 annotations included in the train set of FSDKaggle2018 are annotations that have been manually validated as present and predominant (some with inter-annotator agreement but not all of them). This means that in most cases there is no additional acoustic material other than the labeled category. In few cases there may be some additional sound events, but these additional events won't belong to any of the 41 categories of FSDKaggle2018.

    The rest of the annotations have not been manually validated and therefore some of them could be inaccurate. Nonetheless, we have estimated that at least 65-70% of the non-verified annotations per category in the train set are indeed correct. It can happen that some of these non-verified audio samples present several sound sources even though only one label is provided as ground truth. These additional sources are typically out of the set of the 41 categories, but in a few cases they could be within.

    More details about the data labeling process can be found in [3].

    License

    FSDKaggle2018 has licenses at two different levels, as explained next.

    All sounds in Freesound are released under Creative Commons (CC) licenses, and each audio clip has its own license as defined by the audio clip uploader in Freesound. For attribution purposes and to facilitate attribution of these files to third parties, we include a relation of the audio clips included in FSDKaggle2018 and their corresponding license. The licenses are specified in the files train_post_competition.csv and test_post_competition_scoring_clips.csv.

    In addition, FSDKaggle2018 as a whole is the result of a curation process and it has an additional license. FSDKaggle2018 is released under CC-BY. This license is specified in the LICENSE-DATASET file downloaded with the FSDKaggle2018.doc zip file.

    Files

    FSDKaggle2018 can be downloaded as a series of zip files with the following directory structure:

    root
    │
    └───FSDKaggle2018.audio_train/                     Audio clips in the train set
    │
    └───FSDKaggle2018.audio_test/                      Audio clips in the test set
    │
    └───FSDKaggle2018.meta/                            Files for evaluation setup
    │   │
    │   └───train_post_competition.csv                 Data split and ground truth for the train set
    │   │
    │   └───test_post_competition_scoring_clips.csv    Ground truth for the test set
    │
    └───FSDKaggle2018.doc/
        └───README.md                                  The dataset description file you are reading
        └───LICENSE-DATASET                            License of FSDKaggle2018 dataset as a whole

    NOTE: the original train.csv file provided during the competition has been updated with more metadata (licenses, Freesound ids, etc.) into train_post_competition.csv. Likewise, the original test.csv that was not public during the competition is now available with ground truth and metadata as test_post_competition_scoring_clips.csv. The file name test_post_competition_scoring_clips.csv refers to the fact that only the 1600 clips used for systems' ranking are included. During the competition, an additional subset of padding clips was added in order to prevent undesired practices. This padding subset (that was never used for systems' ranking) is no longer included in the dataset (see our DCASE 2018 paper for more details.)

    Each row (i.e. audio clip) of the train_post_competition.csv file contains the following information:

    fname: the file name

    label: the audio classification label (ground truth)

    manually_verified: Boolean (1 or 0) flag to indicate whether or not that annotation has been manually verified; see description above for more info

    freesound_id: the Freesound id for the audio clip

    license: the license for the audio clip

    Each row (i.e. audio clip) of the test_post_competition_scoring_clips.csv file contains the following information:

    fname: the file name

    label: the audio classification label (ground truth)

    usage: string that indicates to which Kaggle leaderboard the clip was associated during the competition: Public or Private

    freesound_id: the Freesound id for the audio clip

    license: the license for the audio clip
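
    A minimal pandas sketch, using only the column names documented above, for separating the manually verified portion of the train set:

    import pandas as pd

    train = pd.read_csv('FSDKaggle2018.meta/train_post_competition.csv')

    verified = train[train['manually_verified'] == 1]    # ~3.7k manually verified clips
    unverified = train[train['manually_verified'] == 0]  # ~5.8k weakly labeled clips

    print(train['label'].nunique(), "categories")
    print(len(verified), "verified /", len(unverified), "non-verified clips")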

    Baseline System

    A CNN baseline system for FSDKaggle2018 is available at

  10. Labeling Data Governance for Warehouses Market Research Report 2033

    • growthmarketreports.com
    csv, pdf, pptx
    Updated Oct 6, 2025
    Cite
    Growth Market Reports (2025). Labeling Data Governance for Warehouses Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/labeling-data-governance-for-warehouses-market
    Explore at:
    pdf, csv, pptx; available download formats
    Dataset updated
    Oct 6, 2025
    Dataset authored and provided by
    Growth Market Reports
    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Labeling Data Governance for Warehouses Market Outlook



    According to our latest research, the global labeling data governance for warehouses market size reached USD 1.78 billion in 2024, with a robust year-on-year growth trajectory. The market is forecasted to expand at a CAGR of 17.2% from 2025 to 2033, propelling the market size to approximately USD 7.21 billion by 2033. This growth is primarily driven by the increasing demand for accurate, compliant, and efficient data labeling solutions within warehouse operations, as enterprises across sectors recognize the value of data governance in optimizing inventory, quality, and supply chains. As per our latest research, the market is witnessing a rapid shift towards digital transformation, particularly in the context of Industry 4.0, which is further accelerating the adoption of advanced data governance tools and strategies in warehouse environments.




    One of the primary growth factors in the labeling data governance for warehouses market is the rising complexity of warehouse operations, fueled by the proliferation of e-commerce, omnichannel retailing, and globalized supply chains. As warehouses become central hubs for distribution, inventory management, and order fulfillment, the need for precise and standardized labeling processes has never been more critical. Data governance ensures that all labeling activities are consistent, traceable, and compliant with both internal policies and external regulations. This is especially important for sectors such as food & beverage, healthcare, and logistics, where labeling accuracy directly impacts product safety, traceability, and customer satisfaction. The integration of advanced analytics, automation, and IoT devices within warehouses further amplifies the volume and complexity of data, necessitating robust governance frameworks to maintain data integrity and operational efficiency.




    Another significant driver is the tightening regulatory landscape around data management and product labeling. Governments and industry bodies worldwide are imposing stricter standards for labeling accuracy, traceability, and data privacy, particularly in highly regulated industries. For example, the healthcare and food & beverage sectors must comply with regulations such as the FDA’s Unique Device Identification (UDI) and the EU’s Food Information to Consumers (FIC) Regulation. These mandates require warehouses to implement comprehensive data governance solutions capable of supporting end-to-end label management, audit trails, and real-time compliance reporting. As a result, organizations are increasingly investing in sophisticated software and services that can automate compliance tasks, reduce human error, and provide actionable insights into labeling processes. This regulatory pressure is expected to sustain high demand for data governance solutions in the warehouse sector throughout the forecast period.




    The surge in digital transformation initiatives across industries is also playing a pivotal role in shaping the labeling data governance for warehouses market. Enterprises are leveraging cloud computing, artificial intelligence, and machine learning to optimize warehouse operations and drive business agility. These technologies enable real-time data capture, analysis, and decision-making, which are essential for effective data governance. Cloud-based solutions, in particular, offer scalability, flexibility, and ease of integration with existing warehouse management systems, making them attractive to organizations of all sizes. Furthermore, the growing emphasis on sustainability and supply chain transparency is prompting companies to adopt data governance practices that enhance visibility, accountability, and reporting capabilities. As digital transformation continues to gain momentum, the demand for integrated, intelligent, and automated data governance solutions in warehouses is expected to rise exponentially.




    Regionally, North America remains the dominant market for labeling data governance in warehouses, accounting for more than 35% of the global market share in 2024. This leadership is attributed to the region’s advanced logistics infrastructure, high adoption rate of digital technologies, and stringent regulatory environment. Europe follows closely, driven by strong compliance requirements and a mature manufacturing sector. Meanwhile, the Asia Pacific region is emerging as the fastest-growing market, supported by rapid industrialization, expanding e-commerce, and increasing investment

  11. Self-Annotated Wearable Activity Data

    • zenodo.org
    • data-staging.niaid.nih.gov
    • +1more
    zip
    Updated Sep 18, 2024
    Cite
    Alexander Hölzemann; Alexander Hölzemann; Kristof Van Laerhoven; Kristof Van Laerhoven (2024). Self-Annotated Wearable Activity Data [Dataset]. http://doi.org/10.3389/fcomp.2024.1379788
    Explore at:
    zip; available download formats
    Dataset updated
    Sep 18, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Alexander Hölzemann; Alexander Hölzemann; Kristof Van Laerhoven; Kristof Van Laerhoven
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Our dataset contains 2 weeks of approx. 8-9 hours of acceleration data per day from 11 participants wearing a Bangle.js Version 1 smartwatch with our firmware installed.

    The dataset contains annotations from 4 different commonly used annotation methods utilized in user studies that focus on in-the-wild data. These methods can be grouped into user-driven, in situ annotations - which are performed before or while the activity is recorded - and recall methods - where participants annotate their data in hindsight at the end of the day.

    The participants were asked to label their activities using (1) a button located on the smartwatch, (2) the activity tracking app Strava, (3) a (hand)written diary, and (4) a tool to visually inspect and label activity data, called MAD-GUI. Methods (1)-(3) are used in both weeks; however, method (4) is introduced at the beginning of the second study week.

    The accelerometer data is recorded at 25 Hz with a sensitivity of ±8 g and is stored in CSV format. Labels and raw data are not yet combined; you can either write your own script to label the data (a minimal sketch follows the column description below) or follow the instructions in our corresponding GitHub repository.

    The following unique classes are included in our dataset:

    laying, sitting, walking, running, cycling, bus_driving, car_driving, vacuum_cleaning, laundry, cooking, eating, shopping, showering, yoga, sport, playing_games, desk_work, guitar_playing, gardening, table_tennis, badminton, horse_riding.

    However, many activities are very participant specific and therefore only performed by one of the participants.

    The labels are also stored as a .csv file and have the following columns:

    week_day, start, stop, activity, layer

    Example:

    week2_day2,10:30:00,11:00:00,vacuum_cleaning,d

    The layer column specifies which annotation method was used to set this label.

    The following identifiers can be found in the column:

    b: in situ button

    a: in situ app

    d: self-recall diary

    g: time-series recall, labelled with the MAD-GUI
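
    Because labels and raw data ship separately, each accelerometer sample has to be matched to a label row by time. The sketch below is a minimal illustration: only the label columns (week_day, start, stop, activity, layer) are documented above, so the raw file name and its 'time' column are assumptions to adapt to the actual CSVs.

    import pandas as pd

    labels = pd.read_csv('labels.csv')  # columns: week_day, start, stop, activity, layer

    # Hypothetical raw accelerometer file for one day, with a zero-padded HH:MM:SS 'time' column.
    raw = pd.read_csv('week2_day2_acc.csv')
    raw['activity'] = 'unlabeled'

    day_labels = labels[labels['week_day'] == 'week2_day2']
    for _, row in day_labels.iterrows():
        in_window = (raw['time'] >= row['start']) & (raw['time'] <= row['stop'])
        raw.loc[in_window, 'activity'] = row['activity']

    print(raw['activity'].value_counts())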

    The corresponding publication is currently under review.

  12. DBpedia Ontology

    • kaggle.com
    zip
    Updated Dec 2, 2023
    Cite
    The Devastator (2023). DBpedia Ontology [Dataset]. https://www.kaggle.com/datasets/thedevastator/dbpedia-ontology-dataset/code
    Explore at:
    zip (69520449 bytes); available download formats
    Dataset updated
    Dec 2, 2023
    Authors
    The Devastator
    License

    CC0 1.0 Universal: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    DBpedia Ontology

    Text Classification Dataset with 14 Classes

    By dbpedia_14 (From Huggingface) [source]

    About this dataset

    The DBpedia Ontology Classification Dataset, known as dbpedia_14, is a comprehensive and meticulously constructed dataset containing a vast collection of text samples. These samples have been expertly classified into 14 distinct and non-overlapping classes. The dataset draws its information from the highly reliable and up-to-date DBpedia 2014 knowledge base, ensuring the accuracy and relevance of the data.

    Each text sample in this extensive dataset consists of various components that provide valuable insights into its content. These components include a title, which succinctly summarizes the main topic or subject matter of the text sample, and content that comprehensively covers all relevant information related to a specific topic.

    To facilitate effective training of machine learning models for text classification tasks, each text sample is further associated with a corresponding label. This categorical label serves as an essential element for supervised learning algorithms to classify new instances accurately.

    Furthermore, this exceptional dataset is part of the larger DBpedia Ontology Classification Dataset with 14 Classes (dbpedia_14). It offers numerous possibilities for researchers, practitioners, and enthusiasts alike to conduct in-depth analyses ranging from sentiment analysis to topic modeling.

    Aspiring data scientists will find great value in utilizing this well-organized dataset for training their machine learning models. Although specific details about train.csv and test.csv files are not provided here due to their dynamic nature, they play pivotal roles during model training and testing processes by respectively providing labeled training samples and unseen test samples.

    Lastly, it's worth mentioning that users can refer to the included classes.txt file within this dataset for an exhaustive list of all 14 classes used in classifying these diverse text samples accurately.

    Overall, with its wealth of carefully curated textual data across multiple domains and precise class labels assigned based on well-defined categories derived from the DBpedia 2014 knowledge base, the DBpedia Ontology Classification Dataset (dbpedia_14) proves instrumental in advancing research efforts related to natural language processing (NLP), text classification, and other related fields.

    Research Ideas

    • Text classification: The DBpedia Ontology Classification Dataset can be used to train machine learning models for text classification tasks. With 14 different classes, the dataset is suitable for various classification tasks such as sentiment analysis, topic classification, or intent detection.
    • Ontology development: The dataset can also be used to improve or expand existing ontologies. By analyzing the text samples and their assigned labels, researchers can identify missing or incorrect relationships between concepts in the ontology and make improvements accordingly.
    • Semantic search engine: The DBpedia knowledge base is widely used in semantic search engines that aim to provide more accurate and relevant search results by understanding the meaning of user queries and matching them with structured data. This dataset can help in training models for improving the performance of these semantic search engines by enhancing their ability to classify and categorize information accurately based on user queries

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: train.csv

    | Column name | Description |
    |:------------|:------------|
    | label | The class label assigned to each text sample. (Categorical) |
    | title | The heading or name given to each text sample, providing some context or overview of its content. (Text) |

    File: test.csv

    | Column name | Description |
    |:------------|:------------|
    ...
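
    Since this listing is derived from the Hugging Face dbpedia_14 dataset, one way to explore the same data is through the datasets library; a minimal sketch, assuming the Kaggle copy mirrors the upstream splits and columns:

    from datasets import load_dataset

    # dbpedia_14 provides train/test splits with 'label', 'title', and 'content' columns.
    ds = load_dataset("dbpedia_14")
    print(ds["train"].features["label"].names)  # the 14 class names
    print(ds["train"][0]["title"])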

  13. Data from: Example computer vision classification training data derived from...

    • live.european-language-grid.eu
    jpeg
    Updated May 16, 2024
    Cite
    (2024). Example computer vision classification training data derived from British Library 19th Century Books Image collection [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/7572
    Explore at:
    Available download formats: jpeg
    Dataset updated
    May 16, 2024
    License

    U.S. Government Works: https://www.usa.gov/government-works
    License information was derived automatically

    Description

    Example computer vision classification training data derived from British Library 19th Century Books Image collection

    This dataset provides training data for image classification for use in a computer vision workshop. The images are derived from 'Digitised Books - Images identified as Embellishments. c. 1510 - c. 1900. JPG' from the year '1839'.

    Currently, four folders are included, containing a variety of images derived from the BL books corpus.

    • 'cv_workshop_exercise_data' includes images of 'building', 'people', and 'coat of arms'.
    • 'humancats' contains images of humans and images of cats.
    • The 'fashion' and 'portraits' folders both contain images of people organised into 'female' and 'male'. These labels were annotated by a single annotator, and the categories may themselves not be meaningful; they are included in the workshop data as a point of discussion about how we should label data, both in general and when working with historical data.

    This data is intended primarily as an educational resource.
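
    A minimal sketch of how such folder-per-class workshop data might be loaded for a classification exercise (the folder name and jpg file layout are assumptions based on the description above):

        from pathlib import Path
        from PIL import Image

        root = Path("cv_workshop_exercise_data")  # assumed folder name from the description

        samples = []
        for class_dir in sorted(p for p in root.iterdir() if p.is_dir()):
            for img_path in class_dir.glob("*.jpg"):
                # Resize to a common shape so the images can be batched for a classifier.
                img = Image.open(img_path).convert("RGB").resize((224, 224))
                samples.append((img, class_dir.name))

        labels = sorted({label for _, label in samples})
        print(f"loaded {len(samples)} labelled images across classes: {labels}")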

  14. Z

    EmoLit

    • data.niaid.nih.gov
    • data.europa.eu
    Updated Jun 27, 2023
    Cite
    Rei, Luis (2023). EmoLit [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7883953
    Explore at:
    Dataset updated
    Jun 27, 2023
    Dataset provided by
    Jozef Stefan Institute
    Authors
    Rei, Luis
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Emotions in Literature

    Description: Literature sentences from Project Gutenberg. 38 emotion labels (+ neutral examples). Semi-supervised dataset.

    Article

    Detecting Fine-Grained Emotions in Literature

    Please cite:

    @Article{app13137502,
      AUTHOR = {Rei, Luis and Mladenić, Dunja},
      TITLE = {Detecting Fine-Grained Emotions in Literature},
      JOURNAL = {Applied Sciences},
      VOLUME = {13},
      YEAR = {2023},
      NUMBER = {13},
      ARTICLE-NUMBER = {7502},
      URL = {https://www.mdpi.com/2076-3417/13/13/7502},
      ISSN = {2076-3417},
      DOI = {10.3390/app13137502}
    }

    Abstract

    Emotion detection in text is a fundamental aspect of affective computing and is closely linked to natural language processing. Its applications span various domains, from interactive chatbots to marketing and customer service. This research specifically focuses on its significance in literature analysis and understanding. To facilitate this, we present a novel approach that involves creating a multi-label fine-grained emotion detection dataset, derived from literary sources. Our methodology employs a simple yet effective semi-supervised technique. We leverage textual entailment classification to perform emotion-specific weak-labeling, selecting examples with the highest and lowest scores from a large corpus. Utilizing these emotion-specific datasets, we train binary pseudo-labeling classifiers for each individual emotion. By applying this process to the selected examples, we construct a multi-label dataset. Using this dataset, we train models and evaluate their performance within a traditional supervised setting. Our model achieves an F1 score of 0.59 on our labeled gold set, showcasing its ability to effectively detect fine-grained emotions. Furthermore, we conduct evaluations of the model's performance in zero- and few-shot transfer scenarios using benchmark datasets. Notably, our results indicate that the knowledge learned from our dataset exhibits transferability across diverse data domains, demonstrating its potential for broader applications beyond emotion detection in literature. Our contribution thus includes a multi-label fine-grained emotion detection dataset built from literature, the semi-supervised approach used to create it, as well as the models trained on it. This work provides a solid foundation for advancing emotion detection techniques and their utilization in various scenarios, especially within the cultural heritage analysis.
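
    The entailment-based weak-labeling step described above can be approximated with an off-the-shelf zero-shot NLI classifier. The sketch below only illustrates the idea and is not the authors' exact pipeline; the model choice, example sentences, and score thresholds are assumptions.

        from transformers import pipeline

        # An NLI model scores each sentence against emotion hypotheses.
        nli = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

        sentences = [
            "She could not stop smiling as she opened the letter.",
            "He stared at the ruins of his house in silence.",
        ]
        emotions = ["joy", "grief", "fear", "surprise"]  # a small subset of the 38 labels

        for text in sentences:
            result = nli(text, candidate_labels=emotions, multi_label=True)
            # Keep only very confident scores as weak positive labels (threshold is an assumption).
            weak_positives = [l for l, s in zip(result["labels"], result["scores"]) if s > 0.9]
            print(text, "->", weak_positives)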

    Labels

    • admiration: finds something admirable, impressive or worthy of respect

    • amusement: finds something funny, entertaining or amusing

    • anger: is angry, furious, or strongly displeased; displays ire, rage, or wrath

    • annoyance: is annoyed or irritated

    • approval: expresses a favorable opinion, approves, endorses or agrees with something or someone

    • boredom: feels bored, uninterested, monotony, tedium

    • calmness: is calm, serene, free from agitation or disturbance, experiences emotional tranquility

    • caring: cares about the well-being of someone else, feels sympathy, compassion, affectionate concern towards someone, displays kindness or generosity

    • courage: feels courage or the ability to do something that frightens one, displays fearlessness or bravery

    • curiosity: is interested, curious, or has strong desire to learn something

    • desire: has a desire or ambition, wants something, wishes for something to happen

    • despair: feels despair, helpless, powerless, loss or absence of hope, desperation, despondency

    • disappointment: feels sadness or displeasure caused by the non-fulfillment of hopes or expectations, being let down, expresses regret due to the unfavorable outcome of a decision

    • disapproval: expresses an unfavorable opinion, disagrees or disapproves of something or someone

    • disgust: feels disgust, revulsion, finds something or someone unpleasant, offensive or hateful

    • doubt: has doubt or is uncertain about something, bewildered, confused, or shows lack of understanding

    • embarrassment: feels embarrassed, awkward, self-conscious, shame, or humiliation

    • envy: is covetous, feels envy or jealousy; begrudges or resents someone for their achievements, possessions, or qualities

    • excitement: feels excitement or great enthusiasm and eagerness

    • faith: expresses religious faith, has a strong belief in the doctrines of a religion, or trust in god

    • fear: is afraid or scared due to a threat, danger, or harm

    • frustration: feels frustrated: upset or annoyed because of inability to change or achieve something

    • gratitude: is thankful or grateful for something

    • greed: is greedy, rapacious, avaricious, or has selfish desire to acquire or possess more than what one needs

    • grief: feels grief or intense sorrow, or grieves for someone who has died

    • guilt: feels guilt, remorse, or regret to have committed wrong or failed in an obligation

    • indifference: is uncaring, unsympathetic, uncharitable, or callous, shows indifference, lack of concern, coldness towards someone

    • joy: is happy, feels joy, great pleasure, elation, satisfaction, contentment, or delight

    • love: feels love, strong affection, passion, or deep romantic attachment for someone

    • nervousness: feels nervous, anxious, worried, uneasy, apprehensive, stressed, troubled or tense

    • nostalgia: feels nostalgia, longing or wistful affection for the past, something lost, or for a period in one's life, feels homesickness, a longing for one's home, city, or country while being away; longing for a familiar place

    • optimism: feels optimism or hope, is hopeful or confident about the future, that something good may happen, or the success of something

    • pain: feels physical pain or experiences physical suffering

    • pride: is proud, feels pride from one's own achievements, self-fulfillment, or from the achievements of those with whom one is closely associated, or from qualities or possessions that are widely admired

    • relief: feels relaxed, relief from tension or anxiety

    • sadness: feels sadness, sorrow, unhappiness, depression, dejection

    • surprise: is surprised, astonished or shocked by something unexpected

    • trust: trusts or has confidence in someone, or believes that someone is good, honest, or reliable

    Dataset

    EmoLit (Zenodo)

    Code

    EmoLit Train (Github)

    Models

  15. African Wildlife

    • kaggle.com
    zip
    Updated May 25, 2020
    Cite
    Bianca Ferreira (2020). African Wildlife [Dataset]. https://www.kaggle.com/biancaferreira/african-wildlife
    Explore at:
    Available download formats: zip (469442673 bytes)
    Dataset updated
    May 25, 2020
    Authors
    Bianca Ferreira
    Area covered
    Africa
    Description

    Context

    This data set was collected with the original goal of training an embedded device to perform real-time animal detection in nature reserves in South Africa.

    Content

    The data was collected using the following steps:

    1. Perform a Google search on the image class.
    2. Manually download images that are good representations of the class.
    3. Manually label the images in the YOLO format.

    Yeah ... there was a lot of manual labor involved, but what can you do!

    Four animal classes commonly found in nature reserves in South Africa are represented in this data set: buffalo, elephant, rhino and zebra. (The original listing includes sample images of each class.)

    This data set contains at least 376 images for each animal class collected via Google's image search function and labelled for object detection. Each example in the data set consists of a jpg image and a txt label file. The images have differing aspect ratios and contain at least one example of the specified animal class. Multiple instances of animals can exist in a single image. There could also be occurrences of the other classes in the same image, e.g. a zebra(3) in the file with an elephant(1).

    The txt file contains, on separate lines, the detectable instances of the class in the YOLOv3 labelling format; each label file holds the object labels for its corresponding image.
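
    A minimal sketch for reading one of these label files. The class-index order is an assumption consistent with the elephant(1) and zebra(3) hints above, and the file name is hypothetical.

        from pathlib import Path

        CLASSES = ["buffalo", "elephant", "rhino", "zebra"]  # assumed index order

        def read_yolo_labels(path):
            """Return (class_name, x_center, y_center, width, height) tuples with normalised coordinates."""
            boxes = []
            for line in Path(path).read_text().splitlines():
                class_id, x, y, w, h = line.split()
                boxes.append((CLASSES[int(class_id)], float(x), float(y), float(w), float(h)))
            return boxes

        print(read_yolo_labels("elephant/001.txt"))  # hypothetical label file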

  16. m

    Handwritten Arabic Numerals (0-9) Image Dataset

    • data.mendeley.com
    Updated May 20, 2024
    + more versions
    Cite
    Huzain Azis (2024). Handwritten Arabic Numerals (0-9) Image Dataset [Dataset]. http://doi.org/10.17632/5hpkf8v7bg.1
    Explore at:
    Dataset updated
    May 20, 2024
    Authors
    Huzain Azis
    License

    Attribution-NonCommercial 3.0 (CC BY-NC 3.0): https://creativecommons.org/licenses/by-nc/3.0/
    License information was derived automatically

    Description

    This dataset contains images of handwritten Arabic numerals ranging from 0 to 9. It comprises a total of 9350 samples, with 935 images for each numeral class. The images were collected from various individuals to ensure diversity in handwriting styles.

    Key Features:

    • Classes: 10 (Arabic numerals 0-9)
    • Total Samples: 9350
    • Samples per Class: 935
    • Image Format: Grayscale
    • Image Size: 28x28 pixels

    Data Collection and Labeling:

    The dataset was created by collecting handwritten numerals from participants with different handwriting styles. Each image was manually labeled to ensure accurate and consistent annotations. The data collection and labeling process was meticulously carried out by one of the authors.

    Usage:

    This dataset is suitable for training and testing machine learning models for handwritten digit recognition. It can be used in various applications such as optical character recognition (OCR) systems, pattern recognition, and other related fields.
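
    As a hedged sketch of such usage, the baseline below flattens the 28x28 grayscale images and fits a simple classifier. The on-disk layout (one sub-folder per digit, PNG files) is an assumption.

        import numpy as np
        from pathlib import Path
        from PIL import Image
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import train_test_split

        X, y = [], []
        root = Path("handwritten_arabic_numerals")  # assumed: sub-folders named 0-9
        for digit_dir in sorted(root.iterdir()):
            for img_path in digit_dir.glob("*.png"):
                pixels = np.asarray(Image.open(img_path).convert("L"), dtype=np.float32) / 255.0
                X.append(pixels.ravel())       # flatten 28x28 to a 784-dim vector
                y.append(int(digit_dir.name))

        X_train, X_test, y_train, y_test = train_test_split(
            np.array(X), np.array(y), test_size=0.2, random_state=0)
        clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
        print("test accuracy:", clf.score(X_test, y_test))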

    Contributors:

    • Author 1: Conducted the data collection and labeling process, ensuring accurate and consistent annotations for all samples.
    • Author 2: Handled the data labelling process.

    Acknowledgments:

    We would like to thank all the participants who contributed their handwritten numerals for this dataset.

    License:

    CC BY-NC 3.0: You are free to adapt, copy, or redistribute the material, provided you attribute it appropriately and do not use it for commercial purposes.

  17. Number of data samples for each label.

    • plos.figshare.com
    xls
    Updated May 31, 2023
    Cite
    Yasuhito Sawahata; Kazuteru Komine; Toshiya Morita; Nobuyuki Hiruma (2023). Number of data samples for each label. [Dataset]. http://doi.org/10.1371/journal.pone.0081009.t003
    Explore at:
    Available download formats: xls
    Dataset updated
    May 31, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Yasuhito Sawahata; Kazuteru Komine; Toshiya Morita; Nobuyuki Hiruma
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    (mean ± s.d. across subjects).

  18. Dollar street 10 - 64x64x3

    • zenodo.org
    • data.niaid.nih.gov
    • +1more
    bin
    Updated May 6, 2025
    Cite
    Sven van der burg; Sven van der burg (2025). Dollar street 10 - 64x64x3 [Dataset]. http://doi.org/10.5281/zenodo.10970014
    Explore at:
    Available download formats: bin
    Dataset updated
    May 6, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Sven van der burg; Sven van der burg
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The MLCommons Dollar Street Dataset is a collection of images of everyday household items from homes around the world that visually captures socioeconomic diversity of traditionally underrepresented populations. It consists of public domain data, licensed for academic, commercial and non-commercial usage, under CC-BY and CC-BY-SA 4.0. The dataset was developed because similar datasets lack socioeconomic metadata and are not representative of global diversity.

    This is a subset of the original dataset that can be used for multiclass classification with 10 categories. It is designed to be used in teaching, similar to the widely used, but unlicensed CIFAR-10 dataset.

    These are the preprocessing steps that were performed:

    1. Only take examples with one imagenet_synonym label
    2. Use only examples with the 10 most frequently occurring labels
    3. Downscale images to 64 x 64 pixels
    4. Split data in train and test
    5. Store as numpy array

    This is the label mapping:

    | Category        | label |
    |:----------------|:------|
    | day bed         | 0 |
    | dishrag         | 1 |
    | plate           | 2 |
    | running shoe    | 3 |
    | soap dispenser  | 4 |
    | street sign     | 5 |
    | table lamp      | 6 |
    | tile roof       | 7 |
    | toilet seat     | 8 |
    | washing machine | 9 |
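
    The same mapping as a Python dictionary, together with a hedged sketch of loading the preprocessed arrays; the .npy file names and shapes are assumptions, and the notebook linked below shows how the subset was actually stored.

        import numpy as np

        LABELS = {
            0: "day bed", 1: "dishrag", 2: "plate", 3: "running shoe", 4: "soap dispenser",
            5: "street sign", 6: "table lamp", 7: "tile roof", 8: "toilet seat", 9: "washing machine",
        }

        x_train = np.load("x_train.npy")   # assumed shape: (n_samples, 64, 64, 3)
        y_train = np.load("y_train.npy")   # assumed shape: (n_samples,)
        print(x_train.shape, "first label:", LABELS[int(y_train[0])])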

    Check out this notebook to see how the subset was created: https://github.com/carpentries-lab/deep-learning-intro/blob/main/instructors/prepare-dollar-street-data.ipynb

    The original dataset was downloaded from https://www.kaggle.com/datasets/mlcommons/the-dollar-street-dataset. See https://mlcommons.org/datasets/dollar-street/ for more information.

  19. Z

    TreeSatAI Benchmark Archive for Deep Learning in Forest Applications

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 16, 2024
    Cite
    Schulz, Christian; Ahlswede, Steve; Gava, Christiano; Helber, Patrick; Bischke, Benjamin; Arias, Florencia; Förster, Michael; Hees, Jörn; Demir, Begüm; Kleinschmit, Birgit (2024). TreeSatAI Benchmark Archive for Deep Learning in Forest Applications [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6598390
    Explore at:
    Dataset updated
    Jul 16, 2024
    Dataset provided by
    Technische Universität Berlin, Remote Sensing Image Analysis Group
    Technische Universität Berlin, Geoinformation in Environmental Planning Lab
    Vision Impulse GmbH
    Deutsches Forschungszentrum für Künstliche Intelligenz (DFKI), Smart Data and Knowledge Services
    Authors
    Schulz, Christian; Ahlswede, Steve; Gava, Christiano; Helber, Patrick; Bischke, Benjamin; Arias, Florencia; Förster, Michael; Hees, Jörn; Demir, Begüm; Kleinschmit, Birgit
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Context and Aim

    Deep learning in Earth Observation requires large image archives with highly reliable labels for model training and testing. However, a preferable quality standard for forest applications in Europe has not yet been determined. The TreeSatAI consortium investigated numerous sources for annotated datasets as an alternative to manually labeled training datasets.

    We found the federal forest inventory of Lower Saxony, Germany represents an unseen treasure of annotated samples for training data generation. The respective 20-cm Color-infrared (CIR) imagery, which is used for forestry management through visual interpretation, constitutes an excellent baseline for deep learning tasks such as image segmentation and classification.

    Description

    The data archive is highly suitable for benchmarking as it represents the real-world data situation of many German forest management services. On the one hand, it has a high number of samples supported by the high-resolution aerial imagery. On the other hand, this data archive presents challenges, including class label imbalances between the different forest stand types.

    The TreeSatAI Benchmark Archive contains:

    50,381 image triplets (aerial, Sentinel-1, Sentinel-2)

    synchronized time steps and locations

    all original spectral bands/polarizations from the sensors

    20 species classes (single labels)

    12 age classes (single labels)

    15 genus classes (multi labels)

    60 m and 200 m patches

    fixed split for train (90%) and test (10%) data

    additional single labels such as English species name, genus, forest stand type, foliage type, land cover

    The geoTIFF and GeoJSON files are readable in any GIS software, such as QGIS. For further information, we refer to the PDF document in the archive and publications in the reference section.

    Version history

    v1.0.2 - Minor bug fix multi label JSON file

    v1.0.1 - Minor bug fixes in multi label JSON file and description file

    v1.0.0 - First release

    Citation

    Ahlswede, S., Schulz, C., Gava, C., Helber, P., Bischke, B., Förster, M., Arias, F., Hees, J., Demir, B., and Kleinschmit, B.: TreeSatAI Benchmark Archive: a multi-sensor, multi-label dataset for tree species classification in remote sensing, Earth Syst. Sci. Data, 15, 681–695, https://doi.org/10.5194/essd-15-681-2023, 2023.

    GitHub

    Full code examples and pre-trained models from the dataset article (Ahlswede et al. 2022) using the TreeSatAI Benchmark Archive are published on the GitLab and GitHub repositories of the Remote Sensing Image Analysis (RSiM) Group (https://git.tu-berlin.de/rsim/treesat_benchmark) and the Deutsches Forschungszentrum für Künstliche Intelligenz (DFKI) (https://github.com/DFKI/treesatai_benchmark). Code examples for the sampling strategy can be made available by Christian Schulz via email request.

    Folder structure

    We refer to the proposed folder structure in the PDF file.

    Folder “aerial” contains the aerial imagery patches derived from summertime orthophotos of the years 2011 to 2020. Patches are available in 60 x 60 m (304 x 304 pixels). Band order is near-infrared, red, green, and blue. Spatial resolution is 20 cm.

    Folder “s1” contains the Sentinel-1 imagery patches derived from summertime mosaics of the years 2015 to 2020. Patches are available in 60 x 60 m (6 x 6 pixels) and 200 x 200 m (20 x 20 pixels). Band order is VV, VH, and VV/VH ratio. Spatial resolution is 10 m.

    Folder “s2” contains the Sentinel-2 imagery patches derived from summertime mosaics of the years 2015 to 2020. Patches are available in 60 x 60 m (6 x 6 pixels) and 200 x 200 m (20 x 20 pixels). Band order is B02, B03, B04, B08, B05, B06, B07, B8A, B11, B12, B01, and B09. Spatial resolution is 10 m.
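
    A minimal sketch for reading one patch with rasterio. The file name is taken from the labels example below, the sub-folder layout follows the recommendation later in this description, and the expected array shapes follow the folder descriptions above; all of these should be checked against the downloaded archive.

        import rasterio

        patch = "Abies_alba_3_834_WEFL_NLF.tif"  # example file name from the labels section

        with rasterio.open(f"aerial/60m/{patch}") as src:
            aerial = src.read()   # expected (4, 304, 304): near-infrared, red, green, blue at 20 cm
        with rasterio.open(f"s2/60m/{patch}") as src:
            s2 = src.read()       # expected (12, 6, 6): B02, B03, B04, B08, ... at 10 m

        print(aerial.shape, s2.shape)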

    The folder “labels” contains a JSON string which was used for multi-labeling of the training patches. Code example of an image sample with respective proportions of 94% for Abies and 6% for Larix is: "Abies_alba_3_834_WEFL_NLF.tif": [["Abies", 0.93771], ["Larix", 0.06229]]

    The two files “test_filesnames.lst” and “train_filenames.lst” define the filenames used for train (90%) and test (10%) split. We refer to this fixed split for better reproducibility and comparability.
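
    A hedged sketch combining the multi-label JSON with the fixed split lists; the name of the JSON file inside the "labels" folder is hypothetical, while the .lst file names are taken verbatim from the description above.

        import json
        from pathlib import Path

        labels = json.loads(Path("labels/multi_labels.json").read_text())  # hypothetical file name
        train_files = Path("train_filenames.lst").read_text().split()
        test_files = Path("test_filesnames.lst").read_text().split()

        # For each training patch, pick the genus with the largest proportion, e.g. ("Abies", 0.93771).
        train_set = set(train_files)
        dominant = {
            name: max(pairs, key=lambda genus_prop: genus_prop[1])
            for name, pairs in labels.items() if name in train_set
        }
        print(len(train_files), "train /", len(test_files), "test patches;", len(dominant), "with labels")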

    The folder “geojson” contains geoJSON files with all the samples chosen for the derivation of training patch generation (point, 60 m bounding box, 200 m bounding box).

    CAUTION: As we could not upload the aerial patches as a single zip file on Zenodo, you need to download the 20 single species files (aerial_60m_…zip) separately. Then, unzip them into a folder named “aerial” with a subfolder named “60m”. This structure is recommended for better reproducibility and comparability to the experimental results of Ahlswede et al. (2022).

    Join the archive

    Model training, benchmarking, algorithm development… many applications are possible! Feel free to add samples from other regions in Europe or even worldwide. Additional remote sensing data from Lidar, UAVs or aerial imagery from different time steps are very welcome. This helps the research community in the development of better deep learning and machine learning models for forest applications. If you have questions or want to share code, results, or publications using the archive, feel free to contact the authors.

    Project description

    This work was part of the project TreeSatAI (Artificial Intelligence with Satellite data and Multi-Source Geodata for Monitoring of Trees at Infrastructures, Nature Conservation Sites and Forests). Its overall aim is the development of AI methods for the monitoring of forests and woody features on a local, regional and global scale. Based on freely available geodata from different sources (e.g., remote sensing, administration maps, and social media), prototypes will be developed for the deep learning-based extraction and classification of tree- and tree stand features. These prototypes deal with real cases from the monitoring of managed forests, nature conservation and infrastructures. The development of the resulting services by three enterprises (liveEO, Vision Impulse and LUP Potsdam) will be supported by three research institutes (German Research Center for Artificial Intelligence, TUB Remote Sensing Image Analysis Group, TUB Geoinformation in Environmental Planning Lab).

    Project publications

    Ahlswede, S., Schulz, C., Gava, C., Helber, P., Bischke, B., Förster, M., Arias, F., Hees, J., Demir, B., and Kleinschmit, B.: TreeSatAI Benchmark Archive: a multi-sensor, multi-label dataset for tree species classification in remote sensing, Earth System Science Data, 15, 681–695, https://doi.org/10.5194/essd-15-681-2023, 2023.

    Schulz, C., Förster, M., Vulova, S. V., Rocha, A. D., and Kleinschmit, B.: Spectral-temporal traits in Sentinel-1 C-band SAR and Sentinel-2 multispectral remote sensing time series for 61 tree species in Central Europe. Remote Sensing of Environment, 307, 114162, https://doi.org/10.1016/j.rse.2024.114162, 2024.

    Conference contributions

    Ahlswede, S. Madam, N.T., Schulz, C., Kleinschmit, B., and Demіr, B.: Weakly Supervised Semantic Segmentation of Remote Sensing Images for Tree Species Classification Based on Explanation Methods, IEEE International Geoscience and Remote Sensing Symposium, Kuala Lumpur, Malaysia, https://doi.org/10.48550/arXiv.2201.07495, 2022.

    Schulz, C., Förster, M., Vulova, S., Gränzig, T., and Kleinschmit, B.: Exploring the temporal fingerprints of mid-European forest types from Sentinel-1 RVI and Sentinel-2 NDVI time series, IEEE International Geoscience and Remote Sensing Symposium, Kuala Lumpur, Malaysia, https://doi.org/10.1109/IGARSS46834.2022.9884173, 2022.

    Schulz, C., Förster, M., Vulova, S., and Kleinschmit, B.: The temporal fingerprints of common European forest types from SAR and optical remote sensing data, AGU Fall Meeting, New Orleans, USA, 2021.

    Kleinschmit, B., Förster, M., Schulz, C., Arias, F., Demir, B., Ahlswede, S., Aksoy, A.K., Ha Minh, T., Hees, J., Gava, C., Helber, P., Bischke, B., Habelitz, P., Frick, A., Klinke, R., Gey, S., Seidel, D., Przywarra, S., Zondag, R., and Odermatt B.: Artificial Intelligence with Satellite data and Multi-Source Geodata for Monitoring of Trees and Forests, Living Planet Symposium, Bonn, Germany, 2022.

    Schulz, C., Förster, M., Vulova, S., Gränzig, T., and Kleinschmit, B.: Exploring the temporal fingerprints of sixteen mid-European forest types from Sentinel-1 and Sentinel-2 time series, ForestSAT, Berlin, Germany, 2022.

  20. f

    Ratio of samples with positive labels for each subgroup in the protect class...

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Feb 5, 2024
    Cite
    de Sousa, Rafael; Pereira, Mayana; Mukherjee, Sumit; Dodhia, Rahul; Kshirsagar, Meghana; Ferres, Juan Lavista (2024). Ratio of samples with positive labels for each subgroup in the protect class in the Adult, COMPAS and COMPAS (fair) datasets. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001399957
    Explore at:
    Dataset updated
    Feb 5, 2024
    Authors
    de Sousa, Rafael; Pereira, Mayana; Mukherjee, Sumit; Dodhia, Rahul; Kshirsagar, Meghana; Ferres, Juan Lavista
    Description

    We compare the percentage of positive labels in the true labels of the real data with that in the predicted labels. Analogously, we measure the ratio of samples with a positive label in the synthetically generated data and in the predicted labels, for datasets generated with distinct synthesizer techniques. Predictions (R) denotes the ratio of positive predicted labels when a model trained on synthetic data is evaluated on real data, and Predictions (S) the corresponding ratio when the model is evaluated on synthetic data.
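
    As a small illustration of the quantity reported here, a sketch computing the ratio of positive labels per subgroup of the protected class; the column names and values are hypothetical.

        import pandas as pd

        df = pd.DataFrame({
            "sex":   ["Female", "Female", "Male", "Male", "Male"],
            "label": [1, 0, 1, 1, 0],          # 1 = positive label
        })

        # Ratio of samples with a positive label within each subgroup of the protected class.
        positive_ratio = df.groupby("sex")["label"].mean()
        print(positive_ratio)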
