Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains news headlines relevant to key forex pairs: AUDUSD, EURCHF, EURUSD, GBPUSD, and USDJPY. The data was extracted from the reputable platforms Forex Live and FXstreet over a period of 86 days, from January to May 2023. The dataset comprises 2,291 unique news headlines. Each headline includes an associated forex pair, timestamp, source, author, URL, and the corresponding article text. Data was collected using web scraping techniques executed by a custom service running on a virtual machine. This service periodically retrieves the latest news for a specified forex pair (ticker) from each platform, parsing all available information. The collected data is then processed to extract details such as the article's timestamp, author, and URL. The URL is further used to retrieve the full text of each article. This data acquisition process repeats approximately every 15 minutes.
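As a rough illustration of this kind of periodic acquisition loop, the sketch below polls a news page for a given ticker roughly every 15 minutes. It is not the authors' actual service: the feed URL, CSS selector, and field names are placeholders/assumptions.

```python
# Illustrative polling scraper sketch; endpoints and selectors are assumptions,
# not the custom service described above.
import time
import requests
from bs4 import BeautifulSoup

FEEDS = {
    "EURUSD": "https://www.forexlive.com/",  # placeholder feed URL (assumption)
}

def fetch_headlines(ticker: str, url: str) -> list:
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    items = []
    for article in soup.select("article"):  # selector is an assumption
        link = article.find("a")
        if link is not None:
            items.append({
                "pair": ticker,
                "headline": link.get_text(strip=True),
                "url": link.get("href"),
            })
    return items

while True:
    for ticker, url in FEEDS.items():
        try:
            for item in fetch_headlines(ticker, url):
                print(item)  # in practice: deduplicate, fetch article text, store
        except requests.RequestException as exc:
            print(f"fetch failed for {ticker}: {exc}")
    time.sleep(15 * 60)  # ~15-minute polling interval
```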
To ensure the reliability of the dataset, we manually annotated each headline for sentiment. Instead of solely focusing on the textual content, we ascertained sentiment based on the potential short-term impact of the headline on its corresponding forex pair. This method recognizes the currency market's acute sensitivity to economic news, which significantly influences many trading strategies. As such, this dataset could serve as an invaluable resource for fine-tuning sentiment analysis models in the financial realm.
We used three categories for annotation: 'positive', 'negative', and 'neutral', which correspond to bullish, bearish, and hold sentiments, respectively, for the forex pair linked to each headline. The following Table provides examples of annotated headlines along with brief explanations of the assigned sentiment.
Examples of Annotated Headlines

| Forex Pair | Headline | Sentiment | Explanation |
|---|---|---|---|
| GBPUSD | Diminishing bets for a move to 12400 | Neutral | Lack of strong sentiment in either direction |
| GBPUSD | No reasons to dislike Cable in the very near term as long as the Dollar momentum remains soft | Positive | Positive sentiment towards GBPUSD (Cable) in the near term |
| GBPUSD | When are the UK jobs and how could they affect GBPUSD | Neutral | Poses a question and does not express a clear sentiment |
| JPYUSD | Appropriate to continue monetary easing to achieve 2% inflation target with wage growth | Positive | Monetary easing from the Bank of Japan (BoJ) could lead to a weaker JPY in the short term due to increased money supply |
| USDJPY | Dollar rebounds despite US data. Yen gains amid lower yields | Neutral | Since both the USD and JPY are gaining, the effects on the USDJPY forex pair might offset each other |
| USDJPY | USDJPY to reach 124 by Q4 as the likelihood of a BoJ policy shift should accelerate Yen gains | Negative | USDJPY is expected to reach a lower value, with the USD losing value against the JPY |
| AUDUSD | RBA Governor Lowe’s Testimony High inflation is damaging and corrosive | Positive | Reserve Bank of Australia (RBA) expresses concerns about inflation. Typically, central banks combat high inflation with higher interest rates, which could strengthen AUD. |
Moreover, the dataset includes two columns with the predicted sentiment class and score as predicted by the FinBERT model. Specifically, the FinBERT model outputs a set of probabilities for each sentiment class (positive, negative, and neutral), representing the model's confidence in associating the input headline with each sentiment category. These probabilities are used to determine the predicted class and a sentiment score for each headline. The sentiment score is computed by subtracting the negative class probability from the positive one.
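For readers who want to reproduce the two FinBERT columns, the sketch below shows one way to obtain the per-class probabilities and the positive-minus-negative score, assuming the publicly available ProsusAI/finbert checkpoint; the dataset does not specify which FinBERT variant or preprocessing was used.

```python
# Sketch of deriving a predicted class and sentiment score from FinBERT
# probabilities (score = P(positive) - P(negative)); model choice is an assumption.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("ProsusAI/finbert")
model = AutoModelForSequenceClassification.from_pretrained("ProsusAI/finbert")

def finbert_sentiment(headline: str):
    inputs = tokenizer(headline, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1).squeeze(0)
    # Map probabilities by label name rather than relying on a fixed index order.
    by_label = {model.config.id2label[i].lower(): p.item() for i, p in enumerate(probs)}
    predicted_class = max(by_label, key=by_label.get)
    score = by_label["positive"] - by_label["negative"]
    return predicted_class, score

print(finbert_sentiment("Dollar rebounds despite US data. Yen gains amid lower yields"))
```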
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for "applescript-lines-annotated"
Description
This is a dataset of single lines of AppleScript code scraped from GitHub and GitHub Gist and manually annotated with descriptions, intents, prompts, and other metadata.
Content
Each row contains 8 features:
- text - The raw text of the AppleScript code.
- source - The name of the file from which the line originates.
- type - Either compiled (files using the .scpt extension) or uncompiled (everything else).
… See the full description on the dataset page: https://huggingface.co/datasets/HelloImSteven/applescript-lines-annotated.
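A minimal way to explore the rows is with the Hugging Face datasets library; the sketch below assumes a single train split and uses the feature names listed above (text, type).

```python
# Sketch: load the dataset and filter by the `type` feature; split name is an assumption.
from datasets import load_dataset

ds = load_dataset("HelloImSteven/applescript-lines-annotated", split="train")
compiled_only = ds.filter(lambda row: row["type"] == "compiled")
print(len(ds), "rows total,", len(compiled_only), "compiled")
print(ds[0]["text"])
```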
eturok/zerobench-annotated dataset hosted on Hugging Face and contributed by the HF Datasets community
http://researchdatafinder.qut.edu.au/display/n47576
md5sum: 116aade568ccfeaefcdd07b5110b815a. QUT Research Data Repository dataset resource available for download.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
ELEXIS-WSD is a parallel sense-annotated corpus in which content words (nouns, adjectives, verbs, and adverbs) have been assigned senses. Version 1.1 contains sentences for 10 languages: Bulgarian, Danish, English, Spanish, Estonian, Hungarian, Italian, Dutch, Portuguese, and Slovene.
The corpus was compiled by automatically extracting a set of sentences from WikiMatrix (Schwenk et al., 2019), a large open-access collection of parallel sentences derived from Wikipedia, using an automatic approach based on multilingual sentence embeddings. The sentences were manually validated according to specific formal, lexical and semantic criteria (e.g. by removing incorrect punctuation, morphological errors, notes in square brackets and etymological information typically provided in Wikipedia pages). To obtain satisfactory semantic coverage, sentences with fewer than 5 words or fewer than 2 polysemous words were filtered out. Subsequently, in order to obtain datasets in the other nine target languages, for each selected sentence in English, the corresponding WikiMatrix translation into each of the other languages was retrieved. If no translation was available, the English sentence was translated manually. The resulting corpus comprises 2,024 sentences for each language.
The sentences were tokenized, lemmatized, and tagged with POS tags using UDPipe v2.6 (https://lindat.mff.cuni.cz/services/udpipe/). Senses were annotated using LexTag (https://elexis.babelscape.com/): each content word (noun, verb, adjective, and adverb) was assigned a sense from among the available senses from the sense inventory selected for the language (see below) or BabelNet. Sense inventories were also updated with new senses during annotation.
List of sense inventories:
- BG: Dictionary of Bulgarian
- DA: DanNet – The Danish WordNet
- EN: Open English WordNet
- ES: Spanish Wiktionary
- ET: The EKI Combined Dictionary of Estonian
- HU: The Explanatory Dictionary of the Hungarian Language
- IT: PSC + Italian WordNet
- NL: Open Dutch WordNet
- PT: Portuguese Academy Dictionary (DACL)
- SL: Digital Dictionary Database of Slovene
The corpus is available in the CoNLL-U tab-separated format. In order, the columns contain the token ID, its form, its lemma, its UPOS-tag, five empty columns (reserved for e.g. dependency parsing, which is absent from this version), and the final MISC column containing the following: the token's whitespace information (whether the token is followed by a whitespace or not), the ID of the sense assigned to the token, and the index of the multiword expression (if the token is part of an annotated multiword expression).
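As a quick way to inspect these columns, the sketch below reads the file with the `conllu` library and prints the MISC attributes; the corpus file name is an assumption, and the exact MISC key names are whatever the corpus provides (they are not spelled out above), so the code simply prints the key=value pairs it finds.

```python
# Sketch: print token-level fields and MISC attributes (whitespace info, sense ID,
# multiword-expression index) for the first sentence of one language file.
from conllu import parse_incr

with open("elexis-wsd-en.conllu", encoding="utf-8") as f:  # file name is an assumption
    for sentence in parse_incr(f):
        for token in sentence:
            misc = dict(token["misc"] or {})
            print(token["id"], token["form"], token["lemma"], token["upos"], misc)
        break  # first sentence only
```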
Each language has a separate sense inventory containing all the senses (and their definitions) used for annotation in the corpus. Not all the senses from the sense inventory are necessarily included in the corpus annotations: for instance, all occurrences of the English noun "bank" in the corpus might be annotated with the sense of "financial institution", but the sense inventory also contains the sense "edge of a river" as well as all other possible senses to disambiguate between.
For more information, please refer to 00README.txt.
Differences from version 1.0:
- Several minor errors were fixed (e.g. a typo in one of the Slovene sense IDs).
- The corpus was converted to the true CoNLL-U format (as opposed to the CoNLL-U-like format used in v1.0).
- An error that resulted in missing UPOS tags in version 1.0 was fixed.
- The sentences in all corpora now follow the same order (from 1 to 2024).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
SunspotsYoloDataset is a set of 1690+380+128 high-resolution RGB astronomical images captured with smart telescopes fitted with dedicated solar filters and annotated with the positions of the sunspots actually visible in the images. Two instruments were used for several months from Luxembourg and France between January 2023 and May 2024: a Stellina smart telescope (https://vaonis.com/stellina) and a Vespera smart telescope (https://vaonis.com/vespera).
SunspotsYoloDataset can be used to train YOLO detection models on solar images, enabling the prediction of unexpected events such as the Aurora Borealis with astronomical equipment accessible to the public.
SunspotsYoloDataset is formatted with the YOLO standard, i.e., with separate files for images and annotations, usable by state-of-the-art training tools and graphical software like MakeSense (https://www.makesense.ai). More precisely, there is a ZIP file containing RGB images in JPEG format (minimal compression), and text files containing the positions of sunspots. Each RGB image has a resolution of 640 × 640 pixels.
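For orientation, the sketch below converts one YOLO-format label file into pixel coordinates, using the standard YOLO detection layout (class x_center y_center width height, normalized to [0, 1]) and the 640 × 640 resolution stated above; the label file name is hypothetical.

```python
# Sketch: read a YOLO detection label file and convert normalized boxes to pixels.
from pathlib import Path

IMG_W = IMG_H = 640  # image resolution stated above

def read_yolo_boxes(label_path: str):
    boxes = []
    for line in Path(label_path).read_text().splitlines():
        cls, xc, yc, w, h = (float(v) for v in line.split())
        x_min = (xc - w / 2) * IMG_W
        y_min = (yc - h / 2) * IMG_H
        boxes.append((int(cls), x_min, y_min, w * IMG_W, h * IMG_H))
    return boxes

print(read_yolo_boxes("sunspots_0001.txt"))  # hypothetical label file name
```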
For more details about the dataset, please contact the author: olivier.parisot@list.lu .
For more information about the Luxembourg Institute of Science and Technology (LIST), please consult: https://www.list.lu .
Directory content: This directory contains 21 .CSV files reporting the data about 23 specific drivers’ mannerisms and behaviors (e.g., rubbing/holding face, yawning) observed during the driving session (ID_Number_LabelData).

Method and instruments: To collect the annotation data, two trained and independent raters employed a customized video analysis tool (HADRIAN’s EYE software; Di Stasi et al., 2023) to identify and annotate drivers’ fatigue- and sleepiness-related mannerisms and behaviors. The tool allows the synchronized reproduction of the videos obtained through the RGB camera (for further details, see the RGB and Depth videos directory) and the recording of the main central screen of the simulator (for further details, see the Driving simulator indices directory). Both videos were automatically divided by the HADRIAN’s EYE software into a series of 39 5-min chunks that were then shuffled and presented to the raters in a randomized order to minimize bias (e.g., overestimating the level of fatigue towards the end of the experimental session). A customized Matlab code (Mathworks Inc., Natick, MA, USA) was then used to detect discrepancies between the outputs of the two raters. Two types of discrepancies were detected: (i) the type, and (ii) the timing of the detected mannerism/behavior. Finally, in case of discrepancies, a third independent rater reviewed the videos to resolve the issue.
Nayana-cognitivelab/Nayana-DocVQA-22-langs-annotated dataset hosted on Hugging Face and contributed by the HF Datasets community
https://www.cancerimagingarchive.net/data-usage-policies-and-restrictions/
This dataset contains image annotations derived from the NCI Clinical Trial "ACRIN-HNSCC-FDG-PET-CT (ACRIN 6685)". This dataset was generated as part of an NCI project to augment TCIA datasets with annotations that will improve their value for cancer researchers and AI developers.
https://www.datainsightsmarket.com/privacy-policy
The AI data annotation service market is experiencing robust growth, driven by the increasing demand for high-quality training data to fuel the advancement of artificial intelligence applications. The market, estimated at $2 billion in 2025, is projected to exhibit a Compound Annual Growth Rate (CAGR) of 25% from 2025 to 2033, reaching an estimated $10 billion by 2033. This expansion is fueled by several key factors. The rapid adoption of AI across diverse sectors, including medical imaging analysis, autonomous driving systems, and sophisticated content moderation tools, is a major driver. Furthermore, the rising complexity of AI models necessitates larger, more accurately annotated datasets, contributing to market growth. The market is segmented by application (medical, education, autonomous driving, content moderation, others) and type of service (image, text, video data annotation, others). The medical and autonomous driving segments are currently leading the market due to their high data requirements and the critical role of accuracy in these fields. However, the education and content moderation sectors show significant growth potential as AI adoption expands in these areas. While the market presents significant opportunities, certain challenges exist. The high cost of data annotation, the need for specialized expertise, and the potential for human error in the annotation process act as restraints. However, technological advancements in automation and the emergence of more efficient annotation tools are gradually mitigating these challenges. The competitive landscape is characterized by a mix of established players and emerging startups, with companies like Appen, iMerit, and Scale AI occupying significant market share. Geographic concentration is currently skewed towards North America and Europe, but emerging economies in Asia and elsewhere are expected to witness rapid growth in the coming years as AI adoption expands globally. The continuous improvement in AI algorithms and increasing availability of affordable annotation tools further contribute to the dynamic nature of this ever-evolving market.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
ODDS Smart Building Depth Dataset
The goal of this dataset is to facilitate research focusing on recognizing objects in smart buildings using the depth sensor mounted at the ceiling. This dataset contains annotations of depth images for eight frequently seen object classes. The classes are: person, backpack, laptop, gun, phone, umbrella, cup, and box.
We collected data in two settings. We had a Kinect mounted on a 9.3-foot ceiling near a 6-foot-wide door, and we also used a tripod with a horizontal extender holding the Kinect at a similar height, looking downwards. We asked about 20 volunteers to enter and exit several times each in different directions (3 times walking straight, 3 times walking towards the left side, 3 times walking towards the right side), holding objects in many different ways and poses underneath the Kinect. Each subject used his/her own backpack, purse, laptop, etc. As a result, we captured variety within the same object class, e.g., for laptops we included MacBooks and HP and Lenovo laptops of different years and models, and for backpacks we included backpacks, side bags, and women's purses. We asked the subjects to walk while holding each object in many ways, e.g., the laptop was fully open, partially closed, and fully closed while carried; people also held laptops in front of and beside their bodies and underneath their elbows. The subjects carried their backpacks on their backs and at their sides at different levels from foot to shoulder. We wanted to collect data with real guns; however, bringing real guns to the office is prohibited, so we obtained a few Nerf guns, and the subjects carried these guns pointing to the front, side, up, and down while walking.
The annotated dataset is created following the structure of the Pascal VOC devkit, so that data preparation is simple and it can be used quickly with different object detection libraries that are friendly to Pascal VOC style annotations (e.g., Faster-RCNN, YOLO, SSD). The annotated data consists of a set of images; each image has an annotation file giving a bounding box and object class label for each object present in the image from one of the eight classes. Multiple objects from multiple classes may be present in the same image. The dataset has 3 main directories:
1) DepthImages: Contains all the images of the training set and validation set.
2) Annotations: Contains one XML file per image file (e.g., 1.xml for image file 1.png). The XML file includes the bounding box annotations for all objects in the corresponding image; a parsing sketch is given after this list.
3) ImagesSets: Contains two text files, training_samples.txt and testing_samples.txt. The training_samples.txt file lists the images used for training and testing_samples.txt lists the images used for testing (a random 80%/20% split).
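The sketch below shows the kind of parser referred to above: it reads one Pascal VOC-style XML file from the Annotations directory, using the standard VOC tag names that the dataset states it mirrors.

```python
# Sketch: parse a Pascal VOC-style annotation file into (label, bounding box) records.
import xml.etree.ElementTree as ET

def read_voc_annotation(xml_path: str):
    root = ET.parse(xml_path).getroot()
    objects = []
    for obj in root.iter("object"):
        box = obj.find("bndbox")
        objects.append({
            "label": obj.findtext("name"),
            "xmin": int(float(box.findtext("xmin"))),
            "ymin": int(float(box.findtext("ymin"))),
            "xmax": int(float(box.findtext("xmax"))),
            "ymax": int(float(box.findtext("ymax"))),
        })
    return objects

print(read_voc_annotation("Annotations/1.xml"))  # e.g., annotations for 1.png
```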
The un-annotated data consists of several sets of depth images. No ground-truth annotation is available for these images yet. These un-annotated sets contain several challenging scenarios, and no data was collected from this office during annotated dataset construction. Hence, they provide a way to test the generalization performance of an algorithm.
If you use the ODDS Smart Building dataset in your work, please cite the following reference in any publications: @inproceedings{mithun2018odds, title={ODDS: Real-Time Object Detection using Depth Sensors on Embedded GPUs}, author={Niluthpol Chowdhury Mithun and Sirajum Munir and Karen Guo and Charles Shelton}, booktitle={ACM/IEEE Conference on Information Processing in Sensor Networks (IPSN)}, year={2018}}
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Annotated T2-weighted MR images of the Lower Spine
Chengwen Chu, Daniel Belavy, Gabriele Armbrecht, Martin Bansmann, Dieter Felsenberg, and Guoyan Zheng
Introduction
The Institute for Surgical Technology and Biomechanics, University of Bern, Switzerland, Charité - University Medicine Berlin, Centre of Muscle and Bone Research, Free University & Humboldt-University Berlin, Germany, Centre for Physical Activity and Nutrition Research, School of Exercise and Nutrition Sciences, Deakin University Burwood Campus, Australia and Institut für Diagnostische und Interventionelle Radiologie, Krankenhaus Porz Am Rhein gGmbH, Köln, Germany, are making this dataset available as a resource in the development of algorithms and tools for spinal image analysis.
Description
The database consists of T2-weighted turbo spin echo MR spine images of 23 anonymized patients, each containing at least 7 vertebral bodies (VBs) of the lower spine (T11 – L5). For each vertebral body, reference manual segmentation is provided in the form of a binary mask. All images and binary masks are stored in the Neuroimaging Informatics Technology Initiative (NIFTI) file format, see details at http://nifti.nimh.nih.gov/. Image files are stored as "Img_xx.nii" while the associated annotation files are stored as "Img_xx_Labels.nii", where "xx" is the internal case number for the patient.
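A small sketch of loading one image/label pair with nibabel follows; the case number is just an example, and whether the label volume is strictly binary or encodes each vertebral body with its own integer should be checked against the data itself.

```python
# Sketch: load an MR volume and its annotation mask from the NIfTI files.
import nibabel as nib
import numpy as np

img = nib.load("Img_01.nii").get_fdata()            # hypothetical case number
labels = nib.load("Img_01_Labels.nii").get_fdata()

print("image shape:", img.shape)
values, counts = np.unique(labels[labels > 0], return_counts=True)
print("annotated voxels per label value:", dict(zip(values.astype(int), counts.astype(int))))
```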
Image annotations were prepared by Mr. Chengwen Chu (no professional training in radiology).
Acknowledgements
Reference
C. Chu, D. Belavy, W. Yu, G. Armbrecht, M. Bansmann, D. Felsenberg, and G. Zheng, “Fully Automatic Localization and Segmentation of 3D Vertebral Bodies from CT/MR Images via A Learning-based Method”, PLoS One. 2015 Nov 23;10(11):e0143327. doi: 10.1371/journal.pone.0143327. eCollection 2015.
This dataset features over 340,000 high-quality images of jewelry sourced from photographers worldwide. Designed to support AI and machine learning applications, it provides a richly detailed and carefully annotated collection of jewelry imagery across styles, materials, and contexts.
Key Features:
1. Comprehensive Metadata: the dataset includes full EXIF data, detailing camera settings such as aperture, ISO, shutter speed, and focal length. Each image is pre-annotated with object and scene detection metadata, including jewelry type, material, and context—ideal for tasks like object detection, style classification, and fine-grained visual analysis. Popularity metrics, derived from engagement on our proprietary platform, are also included.
2. Unique Sourcing Capabilities: the images are collected through a proprietary gamified platform for photographers. Competitions focused on jewelry photography ensure high-quality, well-lit, and visually appealing submissions. Custom datasets can be sourced on-demand within 72 hours to meet specific requirements such as jewelry category (rings, necklaces, bracelets, etc.), material type, or presentation style (worn vs. product shots).
3. Global Diversity: photographs have been submitted by contributors in over 100 countries, offering an extensive range of cultural styles, design traditions, and jewelry aesthetics. The dataset includes handcrafted and luxury items, traditional and contemporary pieces, and representations across diverse ethnic and regional fashions.
4. High-Quality Imagery: the dataset includes high-resolution images suitable for detailed product analysis. Both studio-lit commercial shots and lifestyle/editorial photography are included, allowing models to learn from various presentation styles and settings.
5. Popularity Scores: each image is assigned a popularity score based on its performance in GuruShots competitions. This metric offers insight into aesthetic appeal and global consumer preferences, aiding AI models focused on trend analysis or user engagement.
6. AI-Ready Design: this dataset is optimized for training AI in jewelry classification, attribute tagging, visual search, and recommendation systems. It integrates easily into retail AI workflows and supports model development for e-commerce and fashion platforms.
7. Licensing & Compliance: the dataset complies fully with data privacy and IP standards, offering transparent licensing for commercial and academic purposes.
Use Cases:
1. Training AI for visual search and recommendation engines in jewelry e-commerce.
2. Enhancing product recognition, classification, and tagging systems.
3. Powering AR/VR applications for virtual try-ons and 3D visualization.
4. Supporting fashion analytics, trend forecasting, and cultural design research.
This dataset offers a diverse, high-quality resource for training AI and ML models in the jewelry and fashion space. Customizations are available to meet specific product or market needs. Contact us to learn more!
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This collection contains 285 hour-long soundscape recordings which have been annotated by expert ornithologists who provided 50,760 bounding box labels for 81 different bird species from the Northeastern USA. The data were recorded in 2017 in the Sapsucker Woods bird sanctuary in Ithaca, NY, USA. This collection has (partially) been used as test data in the 2019, 2020 and 2021 BirdCLEF competition and can primarily be used for training and evaluation of machine learning algorithms.
Data collection
As part of the Sapsucker Woods Acoustic Monitoring Project (SWAMP), the K. Lisa Yang Center for Conservation Bioacoustics at the Cornell Lab of Ornithology deployed 30 first-generation SWIFT recorders in the surrounding bird sanctuary area in Ithaca, NY, USA. The sensitivity of the microphones used was -44 (+/-3) dB re 1 V/Pa. The microphones' frequency response was not measured, but is assumed to be flat (+/- 2 dB) in the frequency range 100 Hz to 7.5 kHz. The analog signal was amplified by 33 dB and digitized (16-bit resolution) using an analog-to-digital converter (ADC) with a clipping level of +/- 0.9 V. This ongoing study aims to investigate the vocal activity patterns and seasonally changing diversity of local bird species. The data are also used to assess the impact of noise pollution on the behavior of birds. Recordings were made 24 h/day in 1-hour uncompressed WAVE files at 48 kHz, then converted to FLAC and resampled to 32 kHz for this collection. Parts of this dataset have previously been used in the 2019, 2020 and 2021 BirdCLEF competitions.
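To mirror the stated post-processing (1-hour 48 kHz WAVE recordings converted to FLAC and resampled to 32 kHz), a minimal sketch using librosa and soundfile is shown below; this is illustrative, not the project's actual pipeline, and the file names are placeholders.

```python
# Sketch: resample a 48 kHz WAV recording to 32 kHz and write it as FLAC.
import librosa
import soundfile as sf

audio, sr = librosa.load("recording_48k.wav", sr=32_000, mono=True)  # resample on load
sf.write("recording_32k.flac", audio, sr)  # FLAC is lossless, as used in this collection
print(f"wrote {len(audio) / sr:.1f} s at {sr} Hz")
```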
Sampling and annotation protocol
We subsampled data for this collection by randomly selecting one 1-hour file from one of the 30 different recording units for each hour of one day per week between Feb and Aug 2017. For this collection, we excluded recordings that were shorter than one hour or did not contain a bird vocalization. Annotators were asked to box every bird call they could recognize, ignoring those that are too faint or unidentifiable. Raven Pro software was used to annotate the data. Provided labels contain full bird calls that are boxed in time and frequency. Annotators were allowed to combine multiple consecutive calls of one species into one bounding box label if pauses between calls were shorter than five seconds. We use eBird species codes as labels, following the 2021 eBird taxonomy (Clements list).
Files in this collection
Audio recordings can be accessed by downloading and extracting the “soundscape_data.zip” file. Soundscape recording filenames contain a sequential file ID, recording date and timestamp in UTC. As an example, the file “SSW_001_20170225_010000Z.flac” has sequential ID 001 and was recorded on Feb 25th 2017 at 01:00:00 UTC. Ground truth annotations are listed in “annotations.csv” where each line specifies the corresponding filename, start and end time in seconds, low and high frequency in Hertz and an eBird species code. These species codes can be assigned to scientific and common name of a species with the “species.csv” file. The approximate recording location with longitude and latitude can be found in the “recording_location.txt” file.
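A short sketch of working with these files follows: it decodes the filename scheme described above and reads annotations.csv with Python's csv module. The CSV header names are assumptions; the column order is what the description specifies.

```python
# Sketch: parse a soundscape filename and iterate over the bounding-box annotations.
import csv
import re
from datetime import datetime, timezone

name = "SSW_001_20170225_010000Z.flac"
m = re.match(r"SSW_(\d+)_(\d{8})_(\d{6})Z\.flac", name)
file_id, date_str, time_str = m.groups()
start_utc = datetime.strptime(date_str + time_str, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
print(file_id, start_utc.isoformat())

with open("annotations.csv", newline="") as f:
    for row in csv.DictReader(f):
        # Expected fields: filename, start/end time (s), low/high frequency (Hz), eBird code.
        print(row)
        break
```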
Acknowledgements
Compiling this extensive dataset was a major undertaking, and we are very thankful to the domain experts who helped to collect and manually annotate the data for this collection (individual contributors in alphabetic order): Jessie Barry, Sarah Dzielski, Cullen Hanks, Robert Koch, Jim Lowe, Jay McGowan, Ashik Rahaman, Yu Shiu, Laurel Symes, and Matt Young.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
SOTAB V2 for SemTab 2023 includes datasets used to evaluate Column Type Annotation (CTA) and Columns Property Annotation (CPA) systems in the 2023 edition of the SemTab challenge. The datasets for both rounds of the challenge were down-sampled from the full train, test and validation splits of the SOTAB V2 (WDC Schema.org Table Annotation Benchmark version 2) benchmark. The first-round datasets have a smaller vocabulary of 40 labels for CTA and 50 labels for CPA, corresponding to easier/more general domains, while the second-round datasets include the full vocabulary of 80 and 105 labels and are therefore considered harder to annotate. The columns and the relationships between columns are annotated using the Schema.org and DBpedia vocabularies.
SOTAB V2 for SemTab 2023 contains the splits used in Round 1 and Round 2 of the challenge. Each round includes a training, validation and test split together with the ground truth for the test splits and the vocabulary list. The ground truth of the test sets of both rounds are manually verified.
Files contained in SOTAB V2 for SemTab 2023:
Round1-SOTAB-CPA-DatasetsAndGroundTruth = This file contains the csv files of the first round of the challenge for the task of CPA for training, validation and test set, as well as label set and ground truth in the "gt" folder. The columns in these files are annotated with Schema.org.
Round1-SOTAB-CTA-DatasetsAndGroundTruth = This file contains the csv files of the first round of the challenge for the task of CTA for training, validation and test set, as well as label set and ground truth in the "gt" folder. The columns in these files are annotated with Schema.org.
Round2-SOTAB-CPA-SCH-DatasetsAndGroundTruth = This file contains the csv files of the second round of the challenge for the task of CPA for training, validation and test set, as well as label set and ground truth in the "gt" folder. The columns in these files are annotated with Schema.org.
Round2-SOTAB-CTA-SCH-DatasetsAndGroundTruth = This file contains the csv files of the second round of the challenge for the task of CTA for training, validation and test set, as well as label set and ground truth in the "gt" folder. The columns in these files are annotated with Schema.org.
Round2-SOTAB-CPA-DBP-DatasetsAndGroundTruth = This file contains the csv files of the second round of the challenge for the task of CPA for training, validation and test set, as well as label set and ground truth in the "gt" folder. The columns in these files are annotated with DBpedia.
Round2-SOTAB-CTA-DBP-DatasetsAndGroundTruth = This file contains the csv files of the second round of the challenge for the task of CTA for training, validation and test set, as well as label set and ground truth in the "gt" folder. The columns in these files are annotated with DBpedia.
All the corresponding tables can be found in the "Tables" zip folders.
Note on License: This data includes data from the following sources. Refer to each source for license details:
THIS DATA IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
https://dataintelo.com/privacy-and-policy
According to our latest research, the AI-powered medical imaging annotation market size reached USD 1.85 billion globally in 2024. The market is experiencing robust expansion, driven by technological advancements and the rising adoption of artificial intelligence in healthcare. The market is projected to grow at a CAGR of 27.8% from 2025 to 2033, reaching a forecasted value of USD 15.69 billion by 2033. The primary growth factor fueling this trajectory is the increasing demand for accurate, scalable, and rapid annotation solutions to support AI-driven diagnostics and decision-making in clinical settings.
The growth of the AI-powered medical imaging annotation market is propelled by the exponential rise in medical imaging data generated by advanced diagnostic modalities. As healthcare providers continue to digitize patient records and imaging workflows, there is a pressing need for sophisticated annotation tools that can efficiently label vast volumes of images for training and validating AI algorithms. This trend is further amplified by the integration of machine learning and deep learning techniques, which require large, well-annotated datasets to achieve high accuracy in disease detection and classification. Consequently, hospitals, research institutes, and diagnostic centers are increasingly investing in AI-powered annotation platforms to streamline their operations and enhance clinical outcomes.
Another significant driver for the market is the growing prevalence of chronic diseases and the subsequent surge in diagnostic imaging procedures. Conditions such as cancer, cardiovascular diseases, and neurological disorders necessitate frequent imaging for early detection, monitoring, and treatment planning. The complexity and volume of these images make manual annotation labor-intensive and prone to variability. AI-powered annotation solutions address these challenges by automating the labeling process, ensuring consistency, and significantly reducing turnaround times. This not only improves the efficiency of radiologists and clinicians but also accelerates the deployment of AI-based diagnostic tools in routine clinical practice.
The evolution of regulatory frameworks and the increasing emphasis on data quality and patient safety are also shaping the growth of the AI-powered medical imaging annotation market. Regulatory agencies worldwide are encouraging the adoption of AI in healthcare, provided that the underlying data used for algorithm development is accurately annotated and validated. This has led to the emergence of specialized service providers offering compliant annotation solutions tailored to the stringent requirements of medical device approvals and clinical trials. As a result, the market is witnessing heightened collaboration between healthcare providers, technology vendors, and regulatory bodies to establish best practices and standards for medical image annotation.
Regionally, North America continues to dominate the AI-powered medical imaging annotation market, accounting for the largest share in 2024, followed closely by Europe and Asia Pacific. The United States, in particular, benefits from a mature healthcare IT infrastructure, strong research funding, and a high concentration of leading AI technology companies. Meanwhile, Asia Pacific is emerging as a high-growth region, fueled by rapid healthcare digitization, increasing investments in AI research, and expanding patient populations. Latin America and the Middle East & Africa are also witnessing steady growth, albeit at a slower pace, as healthcare systems modernize and adopt advanced imaging technologies.
The component segment of the AI-powered medical imaging annotation market is bifurcated into software and services, both of which play pivotal roles in the overall ecosystem. Software solutions encompass annotation platforms, data management tools, and integration modules that enable seamless image labeling, workflow automation, and interoperability with existing hospital information systems. These platforms leverage advanced algorithms for image segmentation, object detection, and feature extraction, significantly enhancing the speed and accuracy of annotation tasks. The increasing sophistication of annotation software, including support for multi-modality images and customizable labeling protocols, is driving widespread adoption among health
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Folder descriptions:
- images -> dataset images.
- PixelLabelData -> pixel representations for each image (annotations).
- dataset.zip -> a compressed folder of the dataset files and folders.
File descriptions:
- imds.mat -> image datastore in Matlab format.
- pxds.mat -> pixel datastore in Matlab format.
Note: file paths in the imds and pxds Matlab files need to be modified to point to the new location of the images on your disk.
Classes: 'door', 'pull door handle', 'push button', 'moveable door handle', 'push door handle', 'fire extinguisher', 'key slot', 'carpet floor', 'background wall'.
Video: - IndoorDataset_960_540.mp4 -> a short video that can be used to benchmark the system's speed (FPS) and inference capabilities.
Paper: E. Mohamed, K. Sirlantzis and G. Howells, "A pixel-wise annotated dataset of small overlooked indoor objects for semantic segmentation applications". in Data in Brief, vol.40, pp. 107791, 2022, doi:10.1016/j.dib.2022.107791.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset accompanies the manuscript titled "Deep Learning-Based Methods for Automated Estimation of Insect Length, Volume, and Biomass". It contains the annotated image data used to develop, train, and evaluate two complementary deep learning methods for automated insect morphometrics.
The dataset is divided into two primary components corresponding to the methods developed:
1. Oriented Bounding Box (OBB) Dataset:
Purpose: Used for training and evaluating an OBB model for rapid insect length estimation.
Content: Contains 815 images of insects from diverse families within the orders Diptera, Hymenoptera, and Coleoptera.
Image Acquisition: Images were captured using a low-cost, high-resolution DIY microscope, the Entomoscope (Wührl et al., 2024). Multiple focal planes were stacked using Helicon Focus (Helicon Soft Ltd, 2025) to achieve optimal clarity. Specimens were preserved in ethanol.
Format: YOLO OBB format label files.
Metadata: Specimen details and image lists are provided in the supporting information file.
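As a rough illustration of how such labels translate into a length estimate, the sketch below takes the longer side of one oriented box; it assumes the common YOLO OBB layout of "class x1 y1 x2 y2 x3 y3 x4 y4" with normalized corner coordinates, and the pixel-to-millimetre factor is a placeholder calibration value rather than anything specified by the dataset.

```python
# Sketch: estimate body length (mm) from one YOLO OBB label line (layout assumed).
import math

def obb_length_mm(label_line: str, img_w: int, img_h: int, mm_per_px: float) -> float:
    parts = label_line.split()
    coords = [float(v) for v in parts[1:9]]                     # four (x, y) corners
    pts = [(coords[i] * img_w, coords[i + 1] * img_h) for i in range(0, 8, 2)]
    side_a = math.dist(pts[0], pts[1])
    side_b = math.dist(pts[1], pts[2])
    return max(side_a, side_b) * mm_per_px                      # longer side as length

line = "0 0.42 0.31 0.58 0.33 0.57 0.69 0.41 0.67"              # hypothetical label line
print(round(obb_length_mm(line, img_w=4000, img_h=3000, mm_per_px=0.002), 2))
```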
2. Segmentation Dataset:
Purpose: Used for training and evaluating a segmentation model for detailed curvilinear length and volume estimation, demonstrated with Tachinidae (Diptera).
Content: Comprises a total of 1,320 images of several representative tachinid species.
Image Acquisition: Same as the OBB dataset (Entomoscope, Helicon Focus, ethanol preservation).
Annotation Strategy & Structure: A two-stage annotation strategy was employed, and the dataset is structured to reflect this:
initial_labeling_strategy_820_images/: Data corresponding to the initial annotation approach focusing only on directly visible body parts
final_refined_dataset_500_images/: This is the dataset used for the final model. The key refinement in this stage was the explicit annotation of inferred outlines for body parts partially obscured by wings or legs, where reliable estimation was possible.
Format: YOLO segmentation format label files.
Metadata: Specimen details and image lists are provided in the Supporting Information file.
General Information:
Diversity: For both datasets, images were captured from various views and orientations to ensure model robustness.
This dataset is provided to ensure the reproducibility of our research and to serve as a resource for the broader community working on automated image-based analysis of insects and other biological specimens.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
No description provided.