In 2023, the global market size for data labeling software was valued at approximately USD 1.2 billion and is projected to reach USD 6.5 billion by 2032, with a CAGR of 21% during the forecast period. The primary growth factor driving this market is the increasing adoption of artificial intelligence (AI) and machine learning (ML) technologies across various industry verticals, necessitating high-quality labeled data for model training and validation.
The surge in AI and ML applications is a significant growth driver for the data labeling software market. As businesses increasingly harness these advanced technologies to gain insights, optimize operations, and innovate products and services, the demand for accurately labeled data has skyrocketed. This trend is particularly pronounced in sectors such as healthcare, automotive, and finance, where AI and ML applications are critical for advancements like predictive analytics, autonomous driving, and fraud detection. The growing reliance on AI and ML is propelling the market forward, as labeled data forms the backbone of effective AI model development.
Another crucial growth factor is the proliferation of big data. With the explosion of data generated from various sources, including social media, IoT devices, and enterprise systems, organizations are seeking efficient ways to manage and utilize this vast amount of information. Data labeling software enables companies to systematically organize and annotate large datasets, making them usable for AI and ML applications. The ability to handle diverse data types, including text, images, and audio, further amplifies the demand for these solutions, facilitating more comprehensive data analysis and better decision-making.
The increasing emphasis on data privacy and security is also driving the growth of the data labeling software market. With stringent regulations such as GDPR and CCPA coming into play, companies are under pressure to ensure that their data handling practices comply with legal standards. Data labeling software helps in anonymizing and protecting sensitive information during the labeling process, thus providing a layer of security and compliance. This has become particularly important as data breaches and cyber threats continue to rise, making secure data management a top priority for organizations worldwide.
Regionally, North America holds a significant share of the data labeling software market due to early adoption of AI and ML technologies, substantial investments in tech startups, and advanced IT infrastructure. However, the Asia Pacific region is expected to witness the highest growth rate during the forecast period. This growth is driven by the rapid digital transformation in countries like China and India, increasing investments in AI research, and the expansion of IT services. Europe and Latin America also present substantial growth opportunities, supported by technological advancements and increasing regulatory compliance needs.
The data labeling software market can be segmented by component into software and services. The software segment encompasses various platforms and tools designed to label data efficiently. These software solutions offer features such as automation, integration with other AI tools, and scalability, which are critical for handling large datasets. The growing demand for automated data labeling solutions is a significant trend in this segment, driven by the need for faster and more accurate data annotation processes.
In contrast, the services segment includes human-in-the-loop solutions, consulting, and managed services. These services are essential for ensuring the quality and accuracy of labeled data, especially for complex tasks that require human judgment. Companies often turn to service providers for their expertise in specific domains, such as healthcare or automotive, where domain knowledge is crucial for effective data labeling. The services segment is also seeing growth due to the increasing need for customized solutions tailored to specific business requirements.
Moreover, hybrid approaches that combine software and human expertise are gaining traction. These solutions leverage the scalability and speed of automated software while incorporating human oversight for quality assurance. This combination is particularly useful in scenarios where data quality is paramount, such as in medical imaging or autonomous vehicle training. The hybrid model is expected to grow as companies seek to balance efficiency with accuracy in their data labeling workflows.
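As a rough illustration of such a hybrid workflow, the routing step can be sketched as a confidence gate: automated predictions below a threshold go to a human reviewer. The item names, labels, and 0.9 threshold below are assumptions for illustration, not any particular vendor's API.

```python
# Illustrative sketch of a hybrid (human-in-the-loop) labeling flow:
# low-confidence predictions from the automated labeler are routed to
# a human-review queue; confident ones are auto-accepted.

REVIEW_THRESHOLD = 0.9  # assumed cutoff; below this, a human checks the label

def route(predictions, threshold=REVIEW_THRESHOLD):
    """Split (item_id, label, confidence) triples into auto-accepted
    labels and a human-review queue."""
    auto, review = [], []
    for item_id, label, confidence in predictions:
        if confidence >= threshold:
            auto.append((item_id, label))
        else:
            review.append((item_id, label))
    return auto, review

preds = [
    ("img-001", "tumor", 0.97),
    ("img-002", "tumor", 0.62),    # ambiguous prediction -> human review
    ("img-003", "no_tumor", 0.91),
]
auto, review = route(preds)
```

In practice the threshold is tuned against the cost of human review versus the cost of a wrong auto-accepted label.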
Data Labeling And Annotation Tools Market Size 2025-2029
The data labeling and annotation tools market size is forecast to increase by USD 2.69 billion at a CAGR of 28% between 2024 and 2029.
The market is experiencing significant growth, driven by the explosive expansion of generative AI applications. As AI models become increasingly complex, there is a pressing need for specialized platforms to manage and label the vast amounts of data required for training. This trend is further fueled by the emergence of generative AI, which demands unique data pipelines for effective training. However, this market's growth trajectory is not without challenges. Maintaining data quality and managing escalating complexity pose significant obstacles. ML models are being applied across various sectors, from fraud detection and sales forecasting to speech recognition and image recognition.
Ensuring the accuracy and consistency of annotated data is crucial for AI model performance, necessitating robust quality control measures. Moreover, the growing complexity of AI systems requires advanced tools to handle intricate data structures and diverse data types. The market continues to evolve, driven by advancements in machine learning (ML), computer vision, and natural language processing. Companies seeking to capitalize on market opportunities must address these challenges effectively, investing in innovative solutions to streamline data labeling and annotation processes while maintaining high data quality.
What will be the Size of the Data Labeling And Annotation Tools Market during the forecast period?
Explore in-depth regional segment analysis with market size data - historical 2019-2023 and forecasts 2025-2029 - in the full report.
The market is experiencing significant activity and trends, with a focus on enhancing annotation efficiency, ensuring data privacy, and improving model performance. Annotation task delegation and remote workflows enable teams to collaborate effectively, while version control systems facilitate model deployment pipelines and error rate reduction. Label inter-annotator agreement and quality control checks are crucial for maintaining data consistency and accuracy. Data security and privacy remain paramount, with cloud computing and edge computing solutions offering secure alternatives. Data privacy concerns are addressed through secure data handling practices and access controls. Model retraining strategies and cost optimization techniques are essential for adapting to evolving datasets and budgets. Dataset bias mitigation and accuracy improvement methods are key to producing high-quality annotated data.
Training data preparation involves data preprocessing steps and annotation guidelines creation, while human-in-the-loop systems allow for real-time feedback and model fine-tuning. Data validation techniques and team collaboration tools are essential for maintaining data integrity and reducing errors. Scalable annotation processes and annotation project management tools streamline workflows and ensure a consistent output. Model performance evaluation and annotation tool comparison are ongoing efforts to optimize processes and select the best tools for specific use cases. Data security measures and dataset bias mitigation strategies are essential for maintaining trust and reliability in annotated data.
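As a concrete instance of the inter-annotator agreement checks mentioned above, Cohen's kappa compares observed agreement between two annotators against agreement expected by chance. A minimal pure-Python sketch with toy labels follows (in practice a library routine such as scikit-learn's `cohen_kappa_score` would typically be used):

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two annotators' label sequences over the same items."""
    assert len(a) == len(b) and a
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n                # observed agreement
    ca, cb = Counter(a), Counter(b)
    pe = sum(ca[l] * cb[l] for l in set(a) | set(b)) / n**2   # chance agreement
    return (po - pe) / (1 - pe)

# Toy annotations: two labelers tagging the same eight items.
ann1 = ["cat", "cat", "dog", "dog", "cat", "dog", "cat", "dog"]
ann2 = ["cat", "dog", "dog", "dog", "cat", "dog", "cat", "cat"]
kappa = cohens_kappa(ann1, ann2)
```

A kappa near 1.0 indicates strong agreement; values near 0 mean the annotators agree no more often than chance, which is usually a signal to revise the annotation guidelines.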
How is this Data Labeling And Annotation Tools Industry segmented?
The data labeling and annotation tools industry research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD million' for the period 2025-2029, as well as historical data from 2019-2023 for the following segments.
Type
Text
Video
Image
Audio
Technique
Manual labeling
Semi-supervised labeling
Automatic labeling
Deployment
Cloud-based
On-premises
Geography
North America
US
Canada
Mexico
Europe
France
Germany
Italy
Spain
UK
APAC
China
South America
Brazil
Rest of World (ROW)
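Of the labeling techniques in the segmentation above, semi-supervised labeling is often realized as pseudo-labeling: a model trained on a small labeled seed set labels the unlabeled pool, and only confident predictions are kept. A toy sketch follows; the 1-D features and nearest-centroid "model" are illustrative assumptions, not a production approach.

```python
# Pseudo-labeling sketch: label unlabeled points with the nearest class
# centroid, keeping only points within a confidence distance.

def centroids(labeled):
    """Mean feature value per class from (value, label) pairs."""
    sums, counts = {}, {}
    for x, y in labeled:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def pseudo_label(labeled, unlabeled, max_dist=1.0):
    """Assign the nearest centroid's label when it lies within max_dist."""
    cents = centroids(labeled)
    out = []
    for x in unlabeled:
        y, d = min(((y, abs(x - c)) for y, c in cents.items()),
                   key=lambda t: t[1])
        if d <= max_dist:          # confidence gate: skip ambiguous points
            out.append((x, y))
    return out

seed = [(0.0, "neg"), (1.0, "neg"), (9.0, "pos"), (10.0, "pos")]
new_labels = pseudo_label(seed, [0.4, 9.6, 5.0])   # 5.0 is ambiguous, dropped
```

The confidence gate is what distinguishes semi-supervised labeling from fully automatic labeling: ambiguous items are left for manual annotation rather than guessed.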
By Type Insights
The Text segment is estimated to witness significant growth during the forecast period. The data labeling market is witnessing significant growth and advancements, primarily driven by the increasing adoption of generative artificial intelligence and large language models (LLMs). This segment encompasses various annotation techniques, including text annotation, which involves adding structured metadata to unstructured text. Text annotation is crucial for machine learning models to understand and learn from raw data. Core text annotation tasks range from fundamental natural language processing (NLP) techniques, such as Named Entity Recognition (NER), where entities like persons, organizations, and locations are identified and tagged, to the complex requirements of modern AI systems.
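Text annotation output such as NER tags is commonly stored as character-offset spans over the raw text. A minimal sketch of that representation follows; the sentence and entities are invented for illustration.

```python
# NER annotations as (start, end, label) character-offset spans.
# Offsets are half-open: text[start:end] is the tagged substring.

text = "Acme Corp hired Jane Doe in Paris."
spans = [
    (0, 9, "ORG"),     # "Acme Corp"
    (16, 24, "PER"),   # "Jane Doe"
    (28, 33, "LOC"),   # "Paris"
]

def surface_forms(text, spans):
    """Recover the tagged substrings and check offsets are consistent."""
    out = []
    for start, end, label in spans:
        assert 0 <= start < end <= len(text)   # span must lie inside the text
        out.append((text[start:end], label))
    return out

entities = surface_forms(text, spans)
```

Storing offsets rather than substrings keeps annotations unambiguous when the same word appears more than once in a document.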
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Our dataset consists of images paired with textual questions. One entry (instance) in our dataset is a question-image pair labeled with the ground-truth coordinates of a bounding box containing the visual answer to the given question. The images were obtained from a CC BY-licensed subset of the Microsoft Common Objects in Context dataset, MS COCO. All data labeling was performed on the Toloka crowdsourcing platform, https://toloka.ai/.
Our dataset has 45,199 instances split among three subsets: train (38,990 instances), public test (1,705 instances), and private test (4,504 instances). The entire train set has been available to everyone since the start of the challenge. The public test set became available at the evaluation phase of the competition, but without ground-truth labels. After the end of the competition, the public and private test sets were released.
The datasets will be provided as files in the comma-separated values (CSV) format containing the following columns.
Column | Type | Description
image | string | URL of an image on a public content delivery network
width | integer | image width
height | integer | image height
left | integer | bounding box coordinate: left
top | integer | bounding box coordinate: top
right | integer | bounding box coordinate: right
bottom | integer | bounding box coordinate: bottom
question | string | question in English
This upload also contains a ZIP file with the images from MS COCO.
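Given the column layout above, loading and sanity-checking one row with Python's csv module can be sketched as follows. The sample row is invented for illustration, not taken from the real dataset.

```python
import csv
import io

# Illustrative CSV content with the documented columns; the row is made up.
SAMPLE = """image,width,height,left,top,right,bottom,question
https://example.com/coco/000000001.jpg,640,480,100,50,300,200,What can you use to cut paper?
"""

def parse_rows(text):
    """Parse rows, convert numeric columns, and validate each bounding box."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        for col in ("width", "height", "left", "top", "right", "bottom"):
            row[col] = int(row[col])
        # The box must lie inside the image (half-open pixel coordinates).
        assert 0 <= row["left"] < row["right"] <= row["width"]
        assert 0 <= row["top"] < row["bottom"] <= row["height"]
        rows.append(row)
    return rows

rows = parse_rows(SAMPLE)
box = rows[0]
box_w = box["right"] - box["left"]   # 200
box_h = box["bottom"] - box["top"]   # 150
```

The same validation would flag common annotation errors such as boxes extending past the image border or with swapped left/right coordinates.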
Overview: The Medical Image Processing service from Pixta AI and its network provides multimodal, high-quality labeling and annotation of medical data, ready to use for optimizing the accuracy of computer vision models. We have a strong understanding of medical expertise and terminology, ensuring accurate labeling of medical images.
Medical Processing categories: The datasets cover the following modalities and annotation types:
X-ray Detection & Segmentation
CT Detection & Segmentation
MRI Detection & Segmentation
Mammography Detection & Segmentation
Segmentation datasets
Classification datasets
Regression datasets
Use cases: The datasets can be used for various healthcare and medical models:
Medical Image Analysis
Remote Diagnosis
Medical Record Keeping ...
Each dataset is supported by both an AI review and an expert-doctor review process to ensure labeling consistency and accuracy. Contact us for more custom datasets.
About PIXTA: PIXTASTOCK is the largest Asian-featured stock platform, providing data, content, tools, and services since 2005. PIXTA has 15 years of experience integrating advanced AI technology to manage, curate, and process over 100M visual materials and to serve global leading brands' creative and data demands. Visit us at https://www.pixta.ai/ or contact us via email at admin.bi@pixta.co.jp.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The M-pox dataset covers the period from May 1 to September 5, 2022.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Explore and download labeled image datasets for AI, ML, and computer vision. Find datasets for object detection, image classification, and image segmentation.
Introduction
The data set is based on 3,004 images collected by the Pancam instruments mounted on the Opportunity and Spirit rovers from NASA's Mars Exploration Rovers (MER) mission. We used rotation, skewing, and shearing augmentation methods to increase the total collection to 70,864 images (see the Augmentation section for more information). Based on the MER Data Catalog User Survey [1], we identified 25 classes of both scientific (e.g., soil trench, float rocks) and engineering (e.g., rover deck, Pancam calibration target) interest (see the Classes section for more information). The 3,004 images were labeled on the Zooniverse platform, and each image may be assigned multiple labels. The images are either 512 x 512 or 1024 x 1024 pixels in size (see the Image Sampling section for more information).
Classes
There is a total of 25 classes in this data set. The list below gives class names, counts, and percentages (each percentage is the count divided by 3,004). Note that the counts do not sum to 3,004 and the percentages do not sum to 1.0 because each image may be assigned more than one class.
Class name, count, percentage of dataset
Rover Deck, 222, 7.39%
Pancam Calibration Target, 14, 0.47%
Arm Hardware, 4, 0.13%
Other Hardware, 116, 3.86%
Rover Tracks, 301, 10.02%
Soil Trench, 34, 1.13%
RAT Brushed Target, 17, 0.57%
RAT Hole, 30, 1.00%
Rock Outcrop, 1915, 63.75%
Float Rocks, 860, 28.63%
Clasts, 1676, 55.79%
Rocks (misc), 249, 8.29%
Bright Soil, 122, 4.06%
Dunes/Ripples, 1000, 33.29%
Rock (Linear Features), 943, 31.39%
Rock (Round Features), 219, 7.29%
Soil, 2891, 96.24%
Astronomy, 12, 0.40%
Spherules, 868, 28.89%
Distant Vista, 903, 30.23%
Sky, 954, 31.76%
Close-up Rock, 23, 0.77%
Nearby Surface, 2006, 66.78%
Rover Parts, 301, 10.02%
Artifacts, 28, 0.93%
Image Sampling
Images in the MER rover Pancam archive range in size from 64 x 64 to 1024 x 1024 pixels. The largest size, 1024 x 1024, was by far the most common in the archive. For the deep learning dataset, we elected to sample only 1024 x 1024 and 512 x 512 images, as the higher resolution would be beneficial for feature extraction. To ensure that the data set is representative of the total image archive of 4.3 million images, we sampled by "site code". Each Pancam image has a corresponding two-digit alphanumeric "site code" used to track location throughout the mission. Since each site code corresponds to a different general location, sampling a fixed proportion of images from each site ensures that the data set contains some images from each location. In this way, a model performing well on this dataset should generalize well to the unlabeled archive as a whole. We randomly sampled 20% of the images at each site within the subset of Pancam data fitting all other image criteria, applying a floor function to non-whole-number sample sizes, resulting in a dataset of 3,004 images.
Train/validation/test split
The 3,004 images were split into train, validation, and test sets so that roughly 60, 15, and 25 percent of the images, respectively, would end up in each set, while ensuring that images from a given site are not split across the train/validation/test sets. This resulted in 1,806 train images, 456 validation images, and 742 test images.
Augmentation
To augment the images in the train and validation sets (images in the test set were not augmented), three augmentation methods were chosen that best represent transformations that could realistically be seen in Pancam images: rotation, skew, and shear. The augmentation methods were applied with random magnitude, followed by random horizontal flipping, to create 30 augmented images per original image. Since each transformation is followed by a square crop to keep the input shape consistent, we constrained the magnitude of each augmentation to avoid cropping out important features at the edges of input images. Thus, rotations were limited to 15 degrees in either direction, the 3-dimensional skew was limited to 45 degrees in any direction, and shearing was limited to 10 degrees in either direction.
Directory Contents
images: contains all 70,864 images
train-set-v1.1.0.txt: label file for the training data set
val-set-v1.1.0.txt: label file for the validation data set
test-set-v1.1.0.txt: label file for the testing data set
Images with relatively short file names (e.g., 1p128287181mrd0000p2303l2m1.img.jpg) are original images, and images with long file names (e.g., 1p128287181mrd0000p2303l2m1.img.jpg_04140167-5781-49bd-a913-6d4d0a61dab1.jpg) are augmented images. The label files are formatted as "Image name, Class1, Class2, ..., ClassN".
Reference
[1] S.B. Cole, J.C. Aubele, B.A. Cohen, S.M. Milkovich, and S.A...
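The stated augmentation limits can be mirrored by a small parameter sampler. This sketch only reproduces the published magnitude limits (rotation within 15 degrees, skew within 45 degrees, shear within 10 degrees, random horizontal flip, 30 augmentations per image); the actual MER augmentation code is not published here, so everything else is an assumption.

```python
import random

# Limits taken from the dataset description above.
ROTATION_LIMIT = 15.0   # degrees, either direction
SKEW_LIMIT = 45.0       # degrees, any direction
SHEAR_LIMIT = 10.0      # degrees, either direction
AUGMENTATIONS_PER_IMAGE = 30

def sample_params(rng):
    """Draw one set of augmentation parameters within the stated limits."""
    return {
        "rotation": rng.uniform(-ROTATION_LIMIT, ROTATION_LIMIT),
        "skew": rng.uniform(-SKEW_LIMIT, SKEW_LIMIT),
        "shear": rng.uniform(-SHEAR_LIMIT, SHEAR_LIMIT),
        "hflip": rng.random() < 0.5,   # random horizontal flip
    }

rng = random.Random(0)  # fixed seed for reproducibility
params = [sample_params(rng) for _ in range(AUGMENTATIONS_PER_IMAGE)]
```

Each parameter set would then drive one transform-plus-square-crop pass over the source image in an imaging library of choice.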
Top Notch Label Co Limited Company Export Import Records. Follow the Eximpedia platform for HS code, importer-exporter records, and customs shipment details.
According to our latest research, the AI-powered medical imaging annotation market size reached USD 1.85 billion globally in 2024. The market is experiencing robust expansion, driven by technological advancements and the rising adoption of artificial intelligence in healthcare. The market is projected to grow at a CAGR of 27.8% from 2025 to 2033, reaching a forecasted value of USD 15.69 billion by 2033. The primary growth factor fueling this trajectory is the increasing demand for accurate, scalable, and rapid annotation solutions to support AI-driven diagnostics and decision-making in clinical settings.
The growth of the AI-powered medical imaging annotation market is propelled by the exponential rise in medical imaging data generated by advanced diagnostic modalities. As healthcare providers continue to digitize patient records and imaging workflows, there is a pressing need for sophisticated annotation tools that can efficiently label vast volumes of images for training and validating AI algorithms. This trend is further amplified by the integration of machine learning and deep learning techniques, which require large, well-annotated datasets to achieve high accuracy in disease detection and classification. Consequently, hospitals, research institutes, and diagnostic centers are increasingly investing in AI-powered annotation platforms to streamline their operations and enhance clinical outcomes.
Another significant driver for the market is the growing prevalence of chronic diseases and the subsequent surge in diagnostic imaging procedures. Conditions such as cancer, cardiovascular diseases, and neurological disorders necessitate frequent imaging for early detection, monitoring, and treatment planning. The complexity and volume of these images make manual annotation labor-intensive and prone to variability. AI-powered annotation solutions address these challenges by automating the labeling process, ensuring consistency, and significantly reducing turnaround times. This not only improves the efficiency of radiologists and clinicians but also accelerates the deployment of AI-based diagnostic tools in routine clinical practice.
The evolution of regulatory frameworks and the increasing emphasis on data quality and patient safety are also shaping the growth of the AI-powered medical imaging annotation market. Regulatory agencies worldwide are encouraging the adoption of AI in healthcare, provided that the underlying data used for algorithm development is accurately annotated and validated. This has led to the emergence of specialized service providers offering compliant annotation solutions tailored to the stringent requirements of medical device approvals and clinical trials. As a result, the market is witnessing heightened collaboration between healthcare providers, technology vendors, and regulatory bodies to establish best practices and standards for medical image annotation.
Regionally, North America continues to dominate the AI-powered medical imaging annotation market, accounting for the largest share in 2024, followed closely by Europe and Asia Pacific. The United States, in particular, benefits from a mature healthcare IT infrastructure, strong research funding, and a high concentration of leading AI technology companies. Meanwhile, Asia Pacific is emerging as a high-growth region, fueled by rapid healthcare digitization, increasing investments in AI research, and expanding patient populations. Latin America and the Middle East & Africa are also witnessing steady growth, albeit at a slower pace, as healthcare systems modernize and adopt advanced imaging technologies.
The component segment of the AI-powered medical imaging annotation market is bifurcated into software and services, both of which play pivotal roles in the overall ecosystem. Software solutions encompass annotation platforms, data management tools, and integration modules that enable seamless image labeling, workflow automation, and interoperability with existing hospital information systems. These platforms leverage advanced algorithms for image segmentation, object detection, and feature extraction, significantly enhancing the speed and accuracy of annotation tasks. The increasing sophistication of annotation software, including support for multi-modality images and customizable labeling protocols, is driving widespread adoption among healthcare providers.
According to our latest research, the AI in Human-in-the-Loop AI market size reached USD 4.1 billion in 2024, reflecting robust expansion driven by the rising demand for high-quality, reliable AI systems across industries. The market is poised for significant growth, projected to achieve a value of USD 15.6 billion by 2033, registering a compelling CAGR of 15.8% over the forecast period. The surge in adoption is primarily fueled by the necessity for human intervention in critical AI processes, ensuring accuracy, compliance, and ethical outcomes in machine learning applications, as per the latest research findings.
One of the principal growth factors in the AI in Human-in-the-Loop AI market is the increasing complexity and scale of AI models, which necessitate human oversight to maintain accuracy and fairness. As organizations across sectors deploy AI solutions for mission-critical tasks, the need to mitigate algorithmic bias and ensure compliance with evolving regulatory frameworks has become paramount. Human-in-the-loop (HITL) approaches allow experts to validate, correct, and annotate data, improving both the performance and trustworthiness of AI models. This trend is particularly evident in sectors such as healthcare, autonomous vehicles, and financial services, where the cost of error is high and explainability is crucial.
Another significant driver is the proliferation of data-intensive applications, which require extensive data labeling, annotation, and continuous model training. The rise of generative AI, conversational agents, and computer vision systems has exponentially increased the volume of data that needs to be processed. HITL frameworks enable organizations to leverage human expertise for nuanced tasks such as sentiment analysis, object recognition, and content moderation, which are challenging for fully automated systems. As businesses strive for higher model accuracy and reduced time-to-market, the integration of human feedback loops into AI workflows has emerged as a best practice, further accelerating market growth.
Furthermore, the adoption of AI in Human-in-the-Loop AI solutions is being bolstered by the growing emphasis on ethical AI and responsible innovation. Enterprises are increasingly held accountable for the societal impacts of their AI systems, prompting investments in transparent, auditable, and human-centric AI development processes. The convergence of AI with regulatory requirements such as GDPR, HIPAA, and emerging AI Acts in various regions underscores the necessity for HITL mechanisms. This alignment between business objectives and regulatory compliance is creating a virtuous cycle, driving sustained demand for HITL solutions across diverse industry verticals.
From a regional perspective, North America continues to dominate the AI in Human-in-the-Loop AI market, accounting for the largest share in 2024, followed by Europe and Asia Pacific. The United States, in particular, is at the forefront due to its advanced AI research ecosystem, significant investments by tech giants, and a mature regulatory landscape. Europe is witnessing steady growth driven by stringent data protection laws and a strong focus on ethical AI. Meanwhile, Asia Pacific is emerging as a high-growth region, propelled by rapid digitalization, government initiatives, and the expansion of AI-driven industries in countries such as China, Japan, and India. These regional dynamics are expected to shape the competitive landscape and innovation trajectories in the years ahead.
The Component segment of the AI in Human-in-the-Loop AI market is categorized into Software, Hardware, and Services, each playing a crucial role in the ecosystem. Software solutions form the backbone of HITL systems, encompassing data annotation platforms, model management tools, and workflow automation suites. These tools enable seamless collaboration between human experts and AI models, facilitating efficient data labeling, validation, and feedback integration. The demand for advanced software platforms is surging as organizations seek scalable, user-friendly, and secure solutions to manage complex HITL workflows. Innovations in user interface design, integration capabilities, and automation features are further enhancing the value proposition of software offerings in this segment.
Hardware components represent a smaller share of the segment compared to software.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
A new relative quantification strategy for glycomics, named deuterium oxide (D2O) labeling for global omics relative quantification (DOLGOReQ), has been developed based on partial metabolic D2O labeling, which induces a subtle change in the isotopic distribution of glycan ions. The relative abundance of unlabeled to D-labeled glycans was extracted from the overlapped isotopic envelope obtained from a mixture containing equal amounts of unlabeled and D-labeled glycans. The glycan quantification accuracy of DOLGOReQ was examined with mixtures of unlabeled and D-labeled HeLa glycans combined in varying ratios according to the number of cells present in the samples. The relative quantification of the glycans mixed in an equimolar ratio revealed that 92.4 and 97.8% of the DOLGOReQ results were within a 1.5- and 2-fold range of the predicted mixing ratio, respectively. Furthermore, the dynamic quantification range of DOLGOReQ was investigated with unlabeled and D-labeled HeLa glycans mixed in ratios from 20:1 to 1:20. A good correlation (Pearson's r > 0.90) between the expected and measured quantification ratios over 2 orders of magnitude was observed for 87% of the quantified glycans. DOLGOReQ was also applied to measure quantitative changes in HeLa cell glycans under normoxic and hypoxic conditions. Given that metabolic D2O labeling can incorporate D into all types of glycans, DOLGOReQ has the potential to serve as a universal quantification platform for large-scale comparative glycomic experiments.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Discover the top import markets for paper label globally, based on data from the IndexBox market intelligence platform. Explore key statistics and market insights.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Machine learning (ML) has gained much attention and has been incorporated into our daily lives. While there are numerous publicly available ML projects on open source platforms such as GitHub, there have been limited attempts to filter those projects and curate ML projects of high quality. The limited availability of such high-quality datasets poses an obstacle to understanding ML projects. To help clear this obstacle, we present NICHE, a manually labelled dataset consisting of 572 ML projects. Based on evidence of good software engineering practices, we label 441 of these projects as engineered and 131 as non-engineered. In this repository we provide the "NICHE.csv" file, which contains the list of project names along with their labels, descriptive information for every dimension, and several basic statistics, such as the number of stars and commits. This dataset can help researchers understand the practices that are followed in high-quality ML projects. It can also be used as a benchmark for classifiers designed to identify engineered ML projects.
GitHub page: https://github.com/soarsmu/NICHE
Top Label Fzc Company Export Import Records. Follow the Eximpedia platform for HS code, importer-exporter records, and customs shipment details.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
A random sample of 200 machine learning publications, systematically analyzed by a team of labelers who asked up to 15 questions about how each publication discusses its training data. Supervised machine learning, in which models are automatically derived from labeled training data, is only as good as the quality of that data. This study builds on prior work that investigated to what extent 'best practices' around labeling training data were followed in applied ML publications within a single domain (social media platforms). In this paper, we expand by studying publications that apply supervised ML across a far broader spectrum of disciplines, focusing on human-labeled data. We report to what extent a random sample of ML application papers across disciplines gives specific details about whether best practices were followed, while acknowledging that a greater range of application fields necessarily produces a greater diversity of labeling and annotation methods. Because much of machine learning research and education focuses only on what is done once a "ground truth" or "gold standard" of training data is available, it is especially relevant to discuss the equally important question of whether such data is reliable in the first place. This determination becomes increasingly complex when applied to a variety of specialized fields, as labeling can range from a task requiring little to no background knowledge to one that must be performed by someone with career expertise.
This dataset contains images (scenes) containing fashion products, which are labeled with bounding boxes and links to the corresponding products.
Metadata includes
product IDs
bounding boxes
Basic Statistics:
Scenes: 47,739
Products: 38,111
Scene-Product Pairs: 93,274
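A scene-product annotation like the one described above can be sketched as a small data structure; the field names and coordinate convention here are our own illustration, not the dataset's actual schema:

```python
from dataclasses import dataclass

@dataclass
class SceneProductPair:
    """One labeled fashion product within a scene image.

    Field names are illustrative; the real dataset schema may differ.
    """
    scene_id: str
    product_id: str
    # Bounding box in pixel coordinates: (x_min, y_min, x_max, y_max)
    bbox: tuple

    def bbox_area(self) -> int:
        """Area of the bounding box in pixels."""
        x_min, y_min, x_max, y_max = self.bbox
        return (x_max - x_min) * (y_max - y_min)

# Example: one product labeled inside a scene
pair = SceneProductPair(scene_id="scene_00001",
                        product_id="prod_12345",
                        bbox=(10, 20, 110, 220))
print(pair.bbox_area())  # 100 * 200 = 20000
```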
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Please cite the following paper when using this dataset:
N. Thakur, “Five Years of COVID-19 Discourse on Instagram: A Labeled Instagram Dataset of Over Half a Million Posts for Multilingual Sentiment Analysis”, Proceedings of the 7th International Conference on Machine Learning and Natural Language Processing (MLNLP 2024), Chengdu, China, October 18-20, 2024 (Paper accepted for publication, Preprint available at: https://arxiv.org/abs/2410.03293)
Abstract
The outbreak of COVID-19 served as a catalyst for content creation and dissemination on social media platforms, as such platforms serve as virtual communities where people can connect and communicate with one another seamlessly. While there have been several works related to the mining and analysis of COVID-19-related posts on social media platforms such as Twitter (or X), YouTube, Facebook, and TikTok, there is still limited research that focuses on the public discourse on Instagram in this context. Furthermore, the prior works in this field have only focused on the development and analysis of datasets of Instagram posts published during the first few months of the outbreak. The work presented in this paper aims to address this research gap and presents a novel multilingual dataset of 500,153 Instagram posts about COVID-19 published between January 2020 and September 2024. This dataset contains Instagram posts in 161 different languages. After the development of this dataset, multilingual sentiment analysis was performed using VADER and twitter-xlm-roberta-base-sentiment. This process involved classifying each post as positive, negative, or neutral. The results of sentiment analysis are presented as a separate attribute in this dataset.
For each of these posts, the Post ID, Post Description, Date of publication, language code, full version of the language, and sentiment label are presented as separate attributes in the dataset.
The Instagram posts in this dataset are in 161 different languages, of which the top 10 by frequency are: English (343041 posts), Spanish (30220), Hindi (15832), Portuguese (15779), Indonesian (11491), Tamil (9592), Arabic (9416), German (7822), Italian (5162), and Turkish (4632).
There are 535,021 distinct hashtags in this dataset; the top 10 by frequency are: #covid19 (169865 posts), #covid (132485), #coronavirus (117518), #covid_19 (104069), #covidtesting (95095), #coronavirusupdates (75439), #corona (39416), #healthcare (38975), #staysafe (36740), and #coronavirusoutbreak (34567).
The following is a description of the attributes present in this dataset:
Post ID: Unique ID of each Instagram post
Post Description: Complete description of each post in the language in which it was originally published
Date: Date of publication in MM/DD/YYYY format
Language code: Language code (for example: “en”) that represents the language of the post as detected using the Google Translate API
Full Language: Full form of the language (for example: “English”) that represents the language of the post as detected using the Google Translate API
Sentiment: Results of sentiment analysis (using the preprocessed version of each post) where each post was classified as positive, negative, or neutral
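The positive/negative/neutral labels can be reproduced from VADER-style compound scores; the sketch below uses the conventional VADER cutoffs of ±0.05, which are an assumption here and may not be exactly what the dataset authors used:

```python
def sentiment_label(compound: float) -> str:
    """Map a VADER-style compound score in [-1, 1] to a 3-class label.

    The +/-0.05 thresholds follow the standard VADER convention; the
    dataset authors' exact cutoffs are an assumption.
    """
    if compound >= 0.05:
        return "positive"
    if compound <= -0.05:
        return "negative"
    return "neutral"

# Examples
print(sentiment_label(0.62))   # positive
print(sentiment_label(-0.40))  # negative
print(sentiment_label(0.01))   # neutral
```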
Open Research Questions
This dataset is expected to be helpful for the investigation of the following research questions and even beyond:
How does sentiment toward COVID-19 vary across different languages?
How has public sentiment toward COVID-19 evolved from 2020 to the present?
How do cultural differences affect social media discourse about COVID-19 across various languages?
How has COVID-19 impacted mental health, as reflected in social media posts across different languages?
How effective were public health campaigns in shifting public sentiment in different languages?
What patterns of vaccine hesitancy or support are present in different languages?
How did geopolitical events influence public sentiment about COVID-19 in multilingual social media discourse?
What role does social media discourse play in shaping public behavior toward COVID-19 in different linguistic communities?
How does the sentiment of minority or underrepresented languages compare to that of major world languages regarding COVID-19?
What insights can be gained by comparing the sentiment of COVID-19 posts in widely spoken languages (e.g., English, Spanish) to those in less common languages?
All the Instagram posts collected during this data mining process were publicly available on Instagram and did not require a user to log in to view them (at the time of writing this paper).
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Machine learning (and especially deep learning) algorithms require large training and validation datasets, which are often unavailable, and creating on-the-ground datasets is costly and time-consuming. Within the European Space Agency funded project ‘Crowds & Machine – Next Level’ (by Blackshore B.V., 52impact B.V. and The Hague Centre for Strategic Studies), we aimed to solve this issue by generating labelled data effectively using an innovative gamified crowdsourcing method.
The objective of the project ‘Crowds & Machines Next Level’ was to generate labelled data for the training and validation of machine learning algorithms to classify the crop wheat. We make those labelled datasets freely available as open data to organisations that use machine learning for their activities, mainly companies and knowledge institutes. As part of the project we developed example scripts (Jupyter notebooks) that enable organisations to use the crowdsourced data smoothly in their own machine learning systems.
BlackShore has developed the online platform Cerberus to enable large-scale generation of labelled datasets. It was deployed at twenty locations around the Mediterranean Sea to generate labelled datasets of wheat and other land cover classes (see table). These locations encompass a diversity of climate regions, harvest cultures and crop calendars, posing a challenge to the training of machine learning algorithms. Gamers click on hexagons plotted on top of very high resolution satellite imagery (captured during the harvest period in 2021), and by combining three different hexagon grids those clicks are converted into triangles. Each triangle has a number of clicks (by different users) per land cover category, which provides a measure of accuracy for the label.
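The click-to-label aggregation described above can be sketched as follows: for each triangle, the category with the most clicks becomes the label, and the share of clicks agreeing with it serves as a confidence score. The data structures here are illustrative, not the platform's actual format:

```python
def label_triangle(clicks_per_category: dict) -> tuple:
    """Aggregate crowd clicks on one triangle into a label and confidence.

    clicks_per_category maps a land cover class to the number of users
    who clicked it, e.g. {"wheat": 8, "bare soil": 2}.
    Returns (label, agreement), where agreement is the fraction of
    clicks that chose the winning class.
    """
    total = sum(clicks_per_category.values())
    if total == 0:
        return None, 0.0
    label = max(clicks_per_category, key=clicks_per_category.get)
    agreement = clicks_per_category[label] / total
    return label, agreement

# Example: 8 of 10 users labeled this triangle as wheat
print(label_triangle({"wheat": 8, "bare soil": 2}))  # ('wheat', 0.8)
```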
52impact developed example tutorials for using the data to train pixel-based (Random Forest) and segmentation-based (U-Net) machine learning models with Sentinel-2 imagery (provided in the data folder); the tutorials can be forked here: https://bitbucket.org/52impact/crowds-machines.
Overview of locations (dates in DD/MM/YYYY)

| ID | location_id | Country | Region | Shape | Harvest period | VHR image date | S-2 pre-harvest | S-2 harvest | S-2 post-harvest |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 01 | portugalAlentejo | Portugal | Alentejo | 01_Portugal_Alentejo_SELECTION | 10 Jul - 1 Aug | 07/07/2021 | 14/05/2021 | 13/07/2021 | 22/08/2022 |
| 02 | spainAndalusia | Spain | Andalusia | 02_Spain_Andalusia_SELECTION | 10 Jul - 1 Aug | 02/07/2021 | 16/05/2021 | 15/07/2021 | 03/09/2021 |
| 03 | spainAragon | Spain | Aragon | 03_Spain_Aragon_SELECTION | 10 Jul - 1 Aug | 26/10/2021 | 20/05/2021 | 19/07/2021 | 05/09/2021 |
| 04 | franceAude | France | Aude | 04_France_Aude_SELECTION | 1 Jul - 1 Oct | 22/09/2021 | 12/05/2021 | 10/08/2021 | 18/11/2021 |
| 05 | franceCamargue | France | Camargue | 05_France_Camargue_SELECTION | 1 Jul - 1 Oct | 07/10/2021 | 12/05/2021 | 10/08/2021 | 18/11/2021 |
| 06 | franceProvence | France | Provence | 06_France_Provence_SELECTION | 1 Jul - 1 Oct | 26/10/2021 | 19/05/2021 | 17/08/2021 | 20/11/2021 |
| 07_08 | italyMarche | Italy | Marche (East and West) | 07_08_Italy_Marche_SELECTION | 1 Jul - 1 Sept | 09/08/2021 | 26/05/2021 | 25/07/2021 | 20/11/2021 |
| 09 | italySardinia | Italy | Sardinia | 09_Italy_Sardinia_SELECTION | 1 Jul - 1 Sept | 31/08/2021 | 26/05/2021 | 22/07/2021 | 10/10/2021 |
| 10 | italySicily | Italy | Sicily | 10_Italy_Sicily_SELECTION | 1 Jul - 1 Sept | 19/09/2021 | 22/05/2021 | 26/07/2021 | 10/10/2021 |
| 11 | italyPugliaNorth | Italy | Puglia (North) | 11_Italy_PugliaNorth_SELECTION | 1 Jul - 1 Sept | 06/10/2021 | 11/06/2021 | 31/07/2021 | 04/10/2021 |
| 12 | italyPuglia | Italy | Puglia | 12_Italy_Puglia_SELECTION | 1 Jul - 1 Sept | 19/08/2021 | 03/06/2021 | 02/08/2021 | 21/10/2021 |
| 13 | greeceWest | Greece | West | 13_Greece_West_SELECTION | 1 Sept - 1 Nov | 02/09/2021 | 27/07/2021 | 05/10/2021 | 14/12/2021 |
| 14 | greeceThessaly | Greece | Thessaly | 14_Greece_Thessaly_SELECTION | 1 Sept - 1 Nov | 14/07/2021 | 27/07/2021 | 25/09/2021 | 19/12/2021 |
| 15 | greeceMacedoniaCentral | Greece | Macedonia (Central) | 15_Greece_MacedoniaCentral_SELECTION | 1 Jun - 1 Aug | 22/07/2021 | 13/05/2021 | 22/07/2021 | 15/09/2021 |
| 16 | greeceMacedoniaEast | Greece | Macedonia (East) | 16_Greece_MacedoniaEast_SELECTION | 1 Jun - 1 Aug | 05/08/2021 | 25/05/2021 | 29/07/2021 | 27/10/2021 |
| 17 | greeceRhodes | Greece | Rhodes | 17_Greece_Rhodes_SELECTION | 15 May - 1 Jul | 09/05/2021 | 25/03/2021 | 24/05/2021 | 22/08/2021 |
| 18 | cyprusLarnaca | Cyprus | Larnaca | 18_Cyprus_Larnaca_SELECTION | 15 May - 1 Jul | 05/06/2021 | 19/03/2021 | 07/06/2021 | 21/08/2021 |
| 19 | turkeyCyprus | Cyprus (T) | Famagusta | 19_Turkey_Cyprus_SELECTION | 15 May - 1 Jul | 05/06/2021 | 29/03/2021 | 17/06/2021 | 26/08/2021 |
| 20 | egyptBehera | Egypt | Behera | 20_Egypt_Behera_SELECTION | 1 Apr - 1 Jul | 06/03/2021 | 26/01/2021 | 07/03/2021 | 19/08/2021 |
The following data is provided:
Triangulated_data.zip: contains, per region and per category, a GeoPackage (gpkg) file of triangular polygons with the number of clicks per polygon. Filenames depend on the location and category; for example, the file containing the triangles for Cattle in Alentejo, Portugal, is called 01_Portugal_Alentejo_Cattle.gpkg
Data.zip: all data necessary to run the Jupyter notebooks, i.e., location data, cropped Sentinel-2 satellite imagery (for training location IDs 01, 02, 12 and 15, and validation locations near IDs 02 and 15) and also the triangulated polygons.
Models.zip: pre-trained random forest and U-Net models based on the data, which can be generated by the Jupyter notebooks.
Combining satellite imagery with machine learning (SIML) has the potential to address global challenges by remotely estimating socioeconomic and environmental conditions in data-poor regions, yet the resource requirements of SIML limit its accessibility and use. The mission of MOSAIKS is to make SIML more accessible by making the process simpler and easier. Using MOSAIKS, you can make predictions in areas of interest in five steps:
1. Download MOSAIKS features from this API for the areas where you have labels.
2. Merge the features spatially with your own ground truth information (called “labels”).
3. Run a regression of your labels on the MOSAIKS features.
4. Evaluate performance.
5. Make predictions in a new area of interest, downloading additional features as necessary.
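The regression and evaluation steps are deliberately simple; a minimal sketch with scikit-learn, using random stand-in arrays in place of downloaded features and real labels (the feature count K = 4000 matches the API description, everything else is illustrative):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-ins for real inputs: rows are grid cells, columns are the
# K = 4000 MOSAIKS features; y is a hypothetical label (e.g. forest cover).
n_cells, k = 500, 4000
X = rng.standard_normal((n_cells, k))
y = X @ (rng.standard_normal(k) * 0.01) + rng.standard_normal(n_cells) * 0.1

# Step 3: regress labels on features (ridge regression keeps K > n stable).
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = Ridge(alpha=1.0).fit(X_train, y_train)

# Step 4: evaluate held-out performance.
print("test R^2:", round(r2_score(y_test, model.predict(X_test)), 3))

# Step 5 would apply model.predict() to features downloaded for a new area.
```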
We’ve found that MOSAIKS, though simple, works well across diverse prediction tasks (e.g., forest cover, house prices, road length). And it is fast: MOSAIKS achieves accuracy competitive with deep neural networks at orders of magnitude lower computational cost (Rolf et al., 2021). Additional tutorial materials on how to use MOSAIKS can be found at mosaiks.org.
The native resolution features are organized on a 0.01 x 0.01 degree latitude-longitude global grid, with cell centers at .005 degree intervals. Features were created from a 2019 Quarter 3 composite image of the Earth from Planet Labs.
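Snapping a coordinate to this grid means rounding down to the enclosing 0.01-degree cell and taking its center at a 0.005-degree offset; a small sketch of that computation (the function name is ours, not part of the MOSAIKS API):

```python
import math

def snap_to_grid(lat: float, lon: float, res: float = 0.01):
    """Snap a lat/lon point to the center of its 0.01 x 0.01 degree cell.

    Cell centers sit at .005-degree offsets (..., -0.005, 0.005, 0.015, ...),
    matching the native-resolution grid described above.
    """
    snap = lambda v: math.floor(v / res) * res + res / 2
    return round(snap(lat), 3), round(snap(lon), 3)

# Example: an arbitrary point in San Francisco
print(snap_to_grid(37.7749, -122.4194))  # (37.775, -122.415)
```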
You will generally receive features in a tabular .csv format. Each row represents a unique grid cell (or administrative unit), with the first two columns representing latitude and longitude coordinates (or the administrative unit code), and subsequent columns representing K features (for now, there are K = 4000 features).
We offer MOSAIKS features for the globe at coarsened resolutions that are easy to download. The advantage of these files is that they provide rich information globally while remaining relatively small in file size. For many users intending to experiment with the platform, these grid files may be a great place to start.
Currently, we offer 1 x 1 degree, 0.25 x 0.25 degree, and 0.1 x 0.1 degree coarsened grids. These aggregations are available with area weights as well as population weights.
**Proceed to** Coarsened Global Grids
We offer MOSAIKS features that are aggregated to the country (ADM0), state/province (ADM1), and county/municipality (ADM2) levels. A significant amount of administrative data is only available when aggregated up to these political units. For many users using label data for ADM units, these files may be all that is needed.
Just as with the Global Grids, these administrative unit aggregations are available with area weights as well as population weights.
These data were also used to produce the results of Sherman et al. (2023); see that paper for more information on administrative unit aggregations.
**Proceed to** Administrative Region Aggregations
More advanced users may want the native-resolution grid files (0.01 x 0.01 degree). These files can be queried directly using Redivis.
More information on these query methods will be added soon. Data download limits may apply.
For questions, contact mosaiksteam@gmail.com.
When referring to the MOSAIKS methodology or when generating MOSAIKS features, please reference “A generalizable and accessible approach to machine learning with global satellite imagery,” Nature Communications (2021).
You can use the following Bibtex:
@article{article,
  author  = {Rolf, Esther and Proctor, Jonathan and Carleton, Tamma and Bolliger, Ian and Shankar, Vaishaal and Ishihara, Miyabi and Recht, Benjamin and Hsiang, Solomon},
  title   = {A generalizable and accessible approach to machine learning with global satellite imagery},
  journal = {Nature Communications},
  volume  = {12},
  year    = {2021},
  month   = {07},
  pages   = {},
  doi     = {10.1038/s41467-021-24638-z}
}
Regionally, North America holds a significant share of the data labeling software market due to early adoption of AI and ML technologies, substantial investments in tech startups, and advanced IT infrastructure. However, the Asia Pacific region is expected to witness the highest growth rate during the forecast period. This growth is driven by the rapid digital transformation in countries like China and India, increasing investments in AI research, and the expansion of IT services. Europe and Latin America also present substantial growth opportunities, supported by technological advancements and increasing regulatory compliance needs.
The data labeling software market can be segmented by component into software and services. The software segment encompasses various platforms and tools designed to label data efficiently. These software solutions offer features such as automation, integration with other AI tools, and scalability, which are critical for handling large datasets. The growing demand for automated data labeling solutions is a significant trend in this segment, driven by the need for faster and more accurate data annotation processes.
In contrast, the services segment includes human-in-the-loop solutions, consulting, and managed services. These services are essential for ensuring the quality and accuracy of labeled data, especially for complex tasks that require human judgment. Companies often turn to service providers for their expertise in specific domains, such as healthcare or automotive, where domain knowledge is crucial for effective data labeling. The services segment is also seeing growth due to the increasing need for customized solutions tailored to specific business requirements.
Moreover, hybrid approaches that combine software and human expertise are gaining traction. These solutions leverage the scalability and speed of automated software while incorporating human oversight for quality assurance. This combination is particularly useful in scenarios where data quality is paramount, such as in medical imaging or autonomous vehicle training. The hybrid model is expected to grow as companies seek to balance efficiency with accuracy in their data labeling workflows.