Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains news headlines relevant to key forex pairs: AUDUSD, EURCHF, EURUSD, GBPUSD, and USDJPY. The data was extracted from the reputable platforms Forex Live and FXstreet over a period of 86 days, from January to May 2023. The dataset comprises 2,291 unique news headlines. Each headline includes an associated forex pair, timestamp, source, author, URL, and the corresponding article text. Data was collected using web scraping techniques executed by a custom service running on a virtual machine. This service periodically retrieves the latest news for a specified forex pair (ticker) from each platform, parsing all available information. The collected data is then processed to extract details such as the article's timestamp, author, and URL. The URL is further used to retrieve the full text of each article. This data acquisition process repeats approximately every 15 minutes.
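As a rough illustration of this kind of periodic acquisition loop, the sketch below polls a news page for a given ticker roughly every 15 minutes. It is not the authors' actual service: the feed URL, CSS selector, and field names are placeholders/assumptions.

```python
# Illustrative polling scraper sketch; endpoints and selectors are assumptions,
# not the custom service described above.
import time
import requests
from bs4 import BeautifulSoup

FEEDS = {
    "EURUSD": "https://www.forexlive.com/",  # placeholder feed URL (assumption)
}

def fetch_headlines(ticker: str, url: str) -> list:
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    items = []
    for article in soup.select("article"):  # selector is an assumption
        link = article.find("a")
        if link is not None:
            items.append({
                "pair": ticker,
                "headline": link.get_text(strip=True),
                "url": link.get("href"),
            })
    return items

while True:
    for ticker, url in FEEDS.items():
        try:
            for item in fetch_headlines(ticker, url):
                print(item)  # in practice: deduplicate, fetch article text, store
        except requests.RequestException as exc:
            print(f"fetch failed for {ticker}: {exc}")
    time.sleep(15 * 60)  # ~15-minute polling interval
```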
To ensure the reliability of the dataset, we manually annotated each headline for sentiment. Instead of solely focusing on the textual content, we ascertained sentiment based on the potential short-term impact of the headline on its corresponding forex pair. This method recognizes the currency market's acute sensitivity to economic news, which significantly influences many trading strategies. As such, this dataset could serve as an invaluable resource for fine-tuning sentiment analysis models in the financial realm.
We used three categories for annotation: 'positive', 'negative', and 'neutral', which correspond to bullish, bearish, and hold sentiments, respectively, for the forex pair linked to each headline. The following Table provides examples of annotated headlines along with brief explanations of the assigned sentiment.
Examples of Annotated Headlines

| Forex Pair | Headline | Sentiment | Explanation |
|---|---|---|---|
| GBPUSD | Diminishing bets for a move to 12400 | Neutral | Lack of strong sentiment in either direction |
| GBPUSD | No reasons to dislike Cable in the very near term as long as the Dollar momentum remains soft | Positive | Positive sentiment towards GBPUSD (Cable) in the near term |
| GBPUSD | When are the UK jobs and how could they affect GBPUSD | Neutral | Poses a question and does not express a clear sentiment |
| JPYUSD | Appropriate to continue monetary easing to achieve 2% inflation target with wage growth | Positive | Monetary easing from the Bank of Japan (BoJ) could lead to a weaker JPY in the short term due to increased money supply |
| USDJPY | Dollar rebounds despite US data. Yen gains amid lower yields | Neutral | Since both the USD and JPY are gaining, the effects on the USDJPY forex pair might offset each other |
| USDJPY | USDJPY to reach 124 by Q4 as the likelihood of a BoJ policy shift should accelerate Yen gains | Negative | USDJPY is expected to reach a lower value, with the USD losing value against the JPY |
| AUDUSD | RBA Governor Lowe’s Testimony High inflation is damaging and corrosive | Positive | Reserve Bank of Australia (RBA) expresses concerns about inflation. Typically, central banks combat high inflation with higher interest rates, which could strengthen AUD. |
Moreover, the dataset includes two columns with the predicted sentiment class and score as predicted by the FinBERT model. Specifically, the FinBERT model outputs a set of probabilities for each sentiment class (positive, negative, and neutral), representing the model's confidence in associating the input headline with each sentiment category. These probabilities are used to determine the predicted class and a sentiment score for each headline. The sentiment score is computed by subtracting the negative class probability from the positive one.
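For readers who want to reproduce the two FinBERT columns, the sketch below shows one way to obtain the per-class probabilities and the positive-minus-negative score, assuming the publicly available ProsusAI/finbert checkpoint; the dataset does not specify which FinBERT variant or preprocessing was used.

```python
# Sketch of deriving a predicted class and sentiment score from FinBERT
# probabilities (score = P(positive) - P(negative)); model choice is an assumption.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("ProsusAI/finbert")
model = AutoModelForSequenceClassification.from_pretrained("ProsusAI/finbert")

def finbert_sentiment(headline: str):
    inputs = tokenizer(headline, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1).squeeze(0)
    # Map probabilities by label name rather than relying on a fixed index order.
    by_label = {model.config.id2label[i].lower(): p.item() for i, p in enumerate(probs)}
    predicted_class = max(by_label, key=by_label.get)
    score = by_label["positive"] - by_label["negative"]
    return predicted_class, score

print(finbert_sentiment("Dollar rebounds despite US data. Yen gains amid lower yields"))
```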
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for "applescript-lines-annotated"
Description
This is a dataset of single lines of AppleScript code scraped from GitHub and GitHub Gist and manually annotated with descriptions, intents, prompts, and other metadata.
Content
Each row contains 8 features:
- text - The raw text of the AppleScript code.
- source - The name of the file from which the line originates.
- type - Either compiled (files using the .scpt extension) or uncompiled (everything else).
… See the full description on the dataset page: https://huggingface.co/datasets/HelloImSteven/applescript-lines-annotated.
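A minimal way to explore the rows is with the Hugging Face datasets library; the sketch below assumes a single train split and uses the feature names listed above (text, type).

```python
# Sketch: load the dataset and filter by the `type` feature; split name is an assumption.
from datasets import load_dataset

ds = load_dataset("HelloImSteven/applescript-lines-annotated", split="train")
compiled_only = ds.filter(lambda row: row["type"] == "compiled")
print(len(ds), "rows total,", len(compiled_only), "compiled")
print(ds[0]["text"])
```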
eturok/zerobench-annotated dataset hosted on Hugging Face and contributed by the HF Datasets community
http://researchdatafinder.qut.edu.au/display/n47576
md5sum: 116aade568ccfeaefcdd07b5110b815a. QUT Research Data Repository dataset resource available for download.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
ELEXIS-WSD is a parallel sense-annotated corpus in which content words (nouns, adjectives, verbs, and adverbs) have been assigned senses. Version 1.1 contains sentences for 10 languages: Bulgarian, Danish, English, Spanish, Estonian, Hungarian, Italian, Dutch, Portuguese, and Slovene.
The corpus was compiled by automatically extracting a set of sentences from WikiMatrix (Schwenk et al., 2019), a large open-access collection of parallel sentences derived from Wikipedia, using an automatic approach based on multilingual sentence embeddings. The sentences were manually validated according to specific formal, lexical and semantic criteria (e.g. by removing incorrect punctuation, morphological errors, notes in square brackets and etymological information typically provided in Wikipedia pages). To obtain satisfactory semantic coverage, sentences with fewer than 5 words or fewer than 2 polysemous words were filtered out. Subsequently, in order to obtain datasets in the other nine target languages, for each selected sentence in English, the corresponding WikiMatrix translation into each of the other languages was retrieved. If no translation was available, the English sentence was translated manually. The resulting corpus comprises 2,024 sentences for each language.
The sentences were tokenized, lemmatized, and tagged with POS tags using UDPipe v2.6 (https://lindat.mff.cuni.cz/services/udpipe/). Senses were annotated using LexTag (https://elexis.babelscape.com/): each content word (noun, verb, adjective, and adverb) was assigned a sense from among the available senses from the sense inventory selected for the language (see below) or BabelNet. Sense inventories were also updated with new senses during annotation.
List of sense inventories:
- BG: Dictionary of Bulgarian
- DA: DanNet – The Danish WordNet
- EN: Open English WordNet
- ES: Spanish Wiktionary
- ET: The EKI Combined Dictionary of Estonian
- HU: The Explanatory Dictionary of the Hungarian Language
- IT: PSC + Italian WordNet
- NL: Open Dutch WordNet
- PT: Portuguese Academy Dictionary (DACL)
- SL: Digital Dictionary Database of Slovene
The corpus is available in the CoNLL-U tab-separated format. In order, the columns contain the token ID, its form, its lemma, its UPOS-tag, five empty columns (reserved for e.g. dependency parsing, which is absent from this version), and the final MISC column containing the following: the token's whitespace information (whether the token is followed by a whitespace or not), the ID of the sense assigned to the token, and the index of the multiword expression (if the token is part of an annotated multiword expression).
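As a quick way to inspect these columns, the sketch below reads the file with the `conllu` library and prints the MISC attributes; the corpus file name is an assumption, and the exact MISC key names are whatever the corpus provides (they are not spelled out above), so the code simply prints the key=value pairs it finds.

```python
# Sketch: print token-level fields and MISC attributes (whitespace info, sense ID,
# multiword-expression index) for the first sentence of one language file.
from conllu import parse_incr

with open("elexis-wsd-en.conllu", encoding="utf-8") as f:  # file name is an assumption
    for sentence in parse_incr(f):
        for token in sentence:
            misc = dict(token["misc"] or {})
            print(token["id"], token["form"], token["lemma"], token["upos"], misc)
        break  # first sentence only
```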
Each language has a separate sense inventory containing all the senses (and their definitions) used for annotation in the corpus. Not all the senses from the sense inventory are necessarily included in the corpus annotations: for instance, all occurrences of the English noun "bank" in the corpus might be annotated with the sense of "financial institution", but the sense inventory also contains the sense "edge of a river" as well as all other possible senses to disambiguate between.
For more information, please refer to 00README.txt.
Differences from version 1.0:
- Several minor errors were fixed (e.g. a typo in one of the Slovene sense IDs).
- The corpus was converted to the true CoNLL-U format (as opposed to the CoNLL-U-like format used in v1.0).
- An error that resulted in missing UPOS tags in version 1.0 was fixed.
- The sentences in all corpora now follow the same order (from 1 to 2024).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
SunspotsYoloDataset is a set of 1690+380+128 high-resolution RGB astronomical images captured with smart telescopes fitted with dedicated solar filters and annotated with the positions of the sunspots actually visible in the images. Two instruments were used for several months from Luxembourg and France between January 2023 and May 2024: a Stellina smart telescope (https://vaonis.com/stellina) and a Vespera smart telescope (https://vaonis.com/vespera).
SunspotsYoloDataset can be used to train YOLO detection models on solar images, enabling the prediction of unexpected events such as the Aurora Borealis with astronomical equipment accessible to the public.
SunspotsYoloDataset is formatted with the YOLO standard, i.e., with separate files for images and annotations, usable by state-of-the-art training tools and graphical software like MakeSense (https://www.makesense.ai). More precisely, there is a ZIP file containing RGB images in JPEG format (minimal compression), and text files containing the positions of sunspots. Each RGB image has a resolution of 640 × 640 pixels.
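For orientation, the sketch below converts one YOLO-format label file into pixel coordinates, using the standard YOLO detection layout (class x_center y_center width height, normalized to [0, 1]) and the 640 × 640 resolution stated above; the label file name is hypothetical.

```python
# Sketch: read a YOLO detection label file and convert normalized boxes to pixels.
from pathlib import Path

IMG_W = IMG_H = 640  # image resolution stated above

def read_yolo_boxes(label_path: str):
    boxes = []
    for line in Path(label_path).read_text().splitlines():
        cls, xc, yc, w, h = (float(v) for v in line.split())
        x_min = (xc - w / 2) * IMG_W
        y_min = (yc - h / 2) * IMG_H
        boxes.append((int(cls), x_min, y_min, w * IMG_W, h * IMG_H))
    return boxes

print(read_yolo_boxes("sunspots_0001.txt"))  # hypothetical label file name
```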
For more details about the dataset, please contact the author: olivier.parisot@list.lu .
For more information about the Luxembourg Institute of Science and Technology (LIST), please consult: https://www.list.lu .
Directory content: This directory contains 21 .CSV files reporting the data about 23 specific drivers’ mannerisms and behaviors (e.g., rubbing/holding face, yawning) observed during the driving session (ID_Number_LabelData).

Method and instruments: To collect the annotation data, two trained and independent raters employed a customized video analysis tool (HADRIAN’s EYE software; Di Stasi et al., 2023) to identify and annotate drivers’ fatigue- and sleepiness-related mannerisms and behaviors. The tool allows the synchronized reproduction of the videos obtained through the RGB camera (for further details, see the RGB and Depth videos directory) and the recording of the main central screen of the simulator (for further details, see the Driving simulator indices directory). Both videos were automatically divided by the HADRIAN’s EYE software into a series of 39 5-min chunks that were then shuffled and presented to the raters in a randomized order to minimize bias (e.g., overestimating the level of fatigue towards the end of the experimental session). A customized Matlab code (Mathworks Inc., Natick, MA, USA) was then used to detect discrepancies between the outputs of the two raters. Two types of discrepancies were detected: (i) the type, and (ii) the timing of the detected mannerism/behavior. Finally, in case of discrepancies, a third independent rater reviewed the videos to resolve the issue.
Nayana-cognitivelab/Nayana-DocVQA-22-langs-annotated dataset hosted on Hugging Face and contributed by the HF Datasets community
https://www.cancerimagingarchive.net/data-usage-policies-and-restrictions/
This dataset contains image annotations derived from the NCI Clinical Trial "ACRIN-HNSCC-FDG-PET-CT (ACRIN 6685)". This dataset was generated as part of an NCI project to augment TCIA datasets with annotations that will improve their value for cancer researchers and AI developers.
https://www.datainsightsmarket.com/privacy-policy
The AI data annotation service market is experiencing robust growth, driven by the increasing demand for high-quality training data to fuel the advancement of artificial intelligence applications. The market, estimated at $2 billion in 2025, is projected to exhibit a Compound Annual Growth Rate (CAGR) of 25% from 2025 to 2033, reaching an estimated $10 billion by 2033. This expansion is fueled by several key factors. The rapid adoption of AI across diverse sectors, including medical imaging analysis, autonomous driving systems, and sophisticated content moderation tools, is a major driver. Furthermore, the rising complexity of AI models necessitates larger, more accurately annotated datasets, contributing to market growth. The market is segmented by application (medical, education, autonomous driving, content moderation, others) and type of service (image, text, video data annotation, others). The medical and autonomous driving segments are currently leading the market due to their high data requirements and the critical role of accuracy in these fields. However, the education and content moderation sectors show significant growth potential as AI adoption expands in these areas. While the market presents significant opportunities, certain challenges exist. The high cost of data annotation, the need for specialized expertise, and the potential for human error in the annotation process act as restraints. However, technological advancements in automation and the emergence of more efficient annotation tools are gradually mitigating these challenges. The competitive landscape is characterized by a mix of established players and emerging startups, with companies like Appen, iMerit, and Scale AI occupying significant market share. Geographic concentration is currently skewed towards North America and Europe, but emerging economies in Asia and elsewhere are expected to witness rapid growth in the coming years as AI adoption expands globally. The continuous improvement in AI algorithms and increasing availability of affordable annotation tools further contribute to the dynamic nature of this ever-evolving market.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
ODDS Smart Building Depth Dataset
The goal of this dataset is to facilitate research focusing on recognizing objects in smart buildings using the depth sensor mounted at the ceiling. This dataset contains annotations of depth images for eight frequently seen object classes. The classes are: person, backpack, laptop, gun, phone, umbrella, cup, and box.
We collected data in two settings. We had a Kinect mounted on a 9.3-foot ceiling near a 6-foot-wide door, and we also used a tripod with a horizontal extender holding the Kinect at a similar height, looking downwards. We asked about 20 volunteers to enter and exit several times each in different directions (3 times walking straight, 3 times walking towards the left side, 3 times walking towards the right side), holding objects in many different ways and poses underneath the Kinect. Each subject used his/her own backpack, purse, laptop, etc. As a result, we captured variety within the same object class, e.g., for laptops we included MacBooks and HP and Lenovo laptops of different years and models, and for backpacks we included backpacks, side bags, and women's purses. We asked the subjects to walk while holding each object in many ways, e.g., the laptop was fully open, partially closed, and fully closed while carried; people also held laptops in front of and beside their bodies and underneath their elbows. The subjects carried their backpacks on their backs and at their sides at different levels from foot to shoulder. We wanted to collect data with real guns; however, bringing real guns to the office is prohibited, so we obtained a few Nerf guns, and the subjects carried these guns pointing to the front, side, up, and down while walking.
The annotated dataset is created following the structure of the Pascal VOC devkit, so that data preparation is simple and it can be used quickly with different object detection libraries that are friendly to Pascal VOC style annotations (e.g., Faster-RCNN, YOLO, SSD). The annotated data consists of a set of images; each image has an annotation file giving a bounding box and object class label for each object present in the image from one of the eight classes. Multiple objects from multiple classes may be present in the same image. The dataset has 3 main directories:
1) DepthImages: Contains all the images of the training set and validation set.
2) Annotations: Contains one XML file per image file (e.g., 1.xml for image file 1.png). The XML file includes the bounding box annotations for all objects in the corresponding image; a parsing sketch is given after this list.
3) ImagesSets: Contains two text files, training_samples.txt and testing_samples.txt. The training_samples.txt file lists the images used for training and testing_samples.txt lists the images used for testing (a random 80%/20% split).
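The sketch below shows the kind of parser referred to above: it reads one Pascal VOC-style XML file from the Annotations directory, using the standard VOC tag names that the dataset states it mirrors.

```python
# Sketch: parse a Pascal VOC-style annotation file into (label, bounding box) records.
import xml.etree.ElementTree as ET

def read_voc_annotation(xml_path: str):
    root = ET.parse(xml_path).getroot()
    objects = []
    for obj in root.iter("object"):
        box = obj.find("bndbox")
        objects.append({
            "label": obj.findtext("name"),
            "xmin": int(float(box.findtext("xmin"))),
            "ymin": int(float(box.findtext("ymin"))),
            "xmax": int(float(box.findtext("xmax"))),
            "ymax": int(float(box.findtext("ymax"))),
        })
    return objects

print(read_voc_annotation("Annotations/1.xml"))  # e.g., annotations for 1.png
```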
The un-annotated data consists of several sets of depth images. No ground-truth annotation is available for these images yet. These un-annotated sets contain several challenging scenarios, and no data was collected from this office during annotated dataset construction. Hence, they provide a way to test the generalization performance of an algorithm.
If you use the ODDS Smart Building dataset in your work, please cite the following reference in any publications: @inproceedings{mithun2018odds, title={ODDS: Real-Time Object Detection using Depth Sensors on Embedded GPUs}, author={Niluthpol Chowdhury Mithun and Sirajum Munir and Karen Guo and Charles Shelton}, booktitle={ACM/IEEE Conference on Information Processing in Sensor Networks (IPSN)}, year={2018}}
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Annotated T2-weighted MR images of the Lower Spine
Chengwen Chu, Daniel Belavy, Gabriele Armbrecht, Martin Bansmann, Dieter Felsenberg, and Guoyan Zheng
Introduction
The Institute for Surgical Technology and Biomechanics, University of Bern, Switzerland, Charité - University Medicine Berlin, Centre of Muscle and Bone Research, Free University & Humboldt-University Berlin, Germany, Centre for Physical Activity and Nutrition Research, School of Exercise and Nutrition Sciences, Deakin University Burwood Campus, Australia and Institut für Diagnostische und Interventionelle Radiologie, Krankenhaus Porz Am Rhein gGmbH, Köln, Germany, are making this dataset available as a resource in the development of algorithms and tools for spinal image analysis.
Description
The database consists of T2-weighted turbo spin echo MR spine images of 23 anonymized patients, each containing at least 7 vertebral bodies (VBs) of the lower spine (T11 – L5). For each vertebral body, reference manual segmentation is provided in the form of a binary mask. All images and binary masks are stored in the Neuroimaging Informatics Technology Initiative (NIFTI) file format, see details at http://nifti.nimh.nih.gov/. Image files are stored as "Img_xx.nii" while the associated annotation files are stored as "Img_xx_Labels.nii", where "xx" is the internal case number for the patient.
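A small sketch of loading one image/label pair with nibabel follows; the case number is just an example, and whether the label volume is strictly binary or encodes each vertebral body with its own integer should be checked against the data itself.

```python
# Sketch: load an MR volume and its annotation mask from the NIfTI files.
import nibabel as nib
import numpy as np

img = nib.load("Img_01.nii").get_fdata()            # hypothetical case number
labels = nib.load("Img_01_Labels.nii").get_fdata()

print("image shape:", img.shape)
values, counts = np.unique(labels[labels > 0], return_counts=True)
print("annotated voxels per label value:", dict(zip(values.astype(int), counts.astype(int))))
```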
Image annotations were prepared by Mr. Chengwen Chu (no professional training in radiology).
Acknowledgements
Reference
C. Chu, D. Belavy, W. Yu, G. Armbrecht, M. Bansmann, D. Felsenberg, and G. Zheng, “Fully Automatic Localization and Segmentation of 3D Vertebral Bodies from CT/MR Images via A Learning-based Method”, PLoS One. 2015 Nov 23;10(11):e0143327. doi: 10.1371/journal.pone.0143327. eCollection 2015.
This dataset features over 340,000 high-quality images of jewelry sourced from photographers worldwide. Designed to support AI and machine learning applications, it provides a richly detailed and carefully annotated collection of jewelry imagery across styles, materials, and contexts.
Key Features:
1. Comprehensive Metadata: the dataset includes full EXIF data, detailing camera settings such as aperture, ISO, shutter speed, and focal length. Each image is pre-annotated with object and scene detection metadata, including jewelry type, material, and context—ideal for tasks like object detection, style classification, and fine-grained visual analysis. Popularity metrics, derived from engagement on our proprietary platform, are also included.
2. Unique Sourcing Capabilities: the images are collected through a proprietary gamified platform for photographers. Competitions focused on jewelry photography ensure high-quality, well-lit, and visually appealing submissions. Custom datasets can be sourced on-demand within 72 hours to meet specific requirements such as jewelry category (rings, necklaces, bracelets, etc.), material type, or presentation style (worn vs. product shots).
3. Global Diversity: photographs have been submitted by contributors in over 100 countries, offering an extensive range of cultural styles, design traditions, and jewelry aesthetics. The dataset includes handcrafted and luxury items, traditional and contemporary pieces, and representations across diverse ethnic and regional fashions.
4. High-Quality Imagery: the dataset includes high-resolution images suitable for detailed product analysis. Both studio-lit commercial shots and lifestyle/editorial photography are included, allowing models to learn from various presentation styles and settings.
5. Popularity Scores: each image is assigned a popularity score based on its performance in GuruShots competitions. This metric offers insight into aesthetic appeal and global consumer preferences, aiding AI models focused on trend analysis or user engagement.
6. AI-Ready Design: this dataset is optimized for training AI in jewelry classification, attribute tagging, visual search, and recommendation systems. It integrates easily into retail AI workflows and supports model development for e-commerce and fashion platforms.
7. Licensing & Compliance: the dataset complies fully with data privacy and IP standards, offering transparent licensing for commercial and academic purposes.
Use Cases:
1. Training AI for visual search and recommendation engines in jewelry e-commerce.
2. Enhancing product recognition, classification, and tagging systems.
3. Powering AR/VR applications for virtual try-ons and 3D visualization.
4. Supporting fashion analytics, trend forecasting, and cultural design research.
This dataset offers a diverse, high-quality resource for training AI and ML models in the jewelry and fashion space. Customizations are available to meet specific product or market needs. Contact us to learn more!
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This collection contains 285 hour-long soundscape recordings which have been annotated by expert ornithologists who provided 50,760 bounding box labels for 81 different bird species from the Northeastern USA. The data were recorded in 2017 in the Sapsucker Woods bird sanctuary in Ithaca, NY, USA. This collection has (partially) been used as test data in the 2019, 2020 and 2021 BirdCLEF competition and can primarily be used for training and evaluation of machine learning algorithms.
Data collection
As part of the Sapsucker Woods Acoustic Monitoring Project (SWAMP), the K. Lisa Yang Center for Conservation Bioacoustics at the Cornell Lab of Ornithology deployed 30 first-generation SWIFT recorders in the surrounding bird sanctuary area in Ithaca, NY, USA. The sensitivity of the microphones used was -44 (+/-3) dB re 1 V/Pa. The microphones' frequency response was not measured, but is assumed to be flat (+/- 2 dB) in the frequency range 100 Hz to 7.5 kHz. The analog signal was amplified by 33 dB and digitized (16-bit resolution) using an analog-to-digital converter (ADC) with a clipping level of +/- 0.9 V. This ongoing study aims to investigate the vocal activity patterns and seasonally changing diversity of local bird species. The data are also used to assess the impact of noise pollution on the behavior of birds. Recordings were made 24 h/day in 1-hour uncompressed WAVE files at 48 kHz, then converted to FLAC and resampled to 32 kHz for this collection. Parts of this dataset have previously been used in the 2019, 2020 and 2021 BirdCLEF competitions.
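To mirror the stated post-processing (1-hour 48 kHz WAVE recordings converted to FLAC and resampled to 32 kHz), a minimal sketch using librosa and soundfile is shown below; this is illustrative, not the project's actual pipeline, and the file names are placeholders.

```python
# Sketch: resample a 48 kHz WAV recording to 32 kHz and write it as FLAC.
import librosa
import soundfile as sf

audio, sr = librosa.load("recording_48k.wav", sr=32_000, mono=True)  # resample on load
sf.write("recording_32k.flac", audio, sr)  # FLAC is lossless, as used in this collection
print(f"wrote {len(audio) / sr:.1f} s at {sr} Hz")
```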
Sampling and annotation protocol
We subsampled data for this collection by randomly selecting one 1-hour file from one of the 30 different recording units for each hour of one day per week between Feb and Aug 2017. For this collection, we excluded recordings that were shorter than one hour or did not contain a bird vocalization. Annotators were asked to box every bird call they could recognize, ignoring those that are too faint or unidentifiable. Raven Pro software was used to annotate the data. Provided labels contain full bird calls that are boxed in time and frequency. Annotators were allowed to combine multiple consecutive calls of one species into one bounding box label if pauses between calls were shorter than five seconds. We use eBird species codes as labels, following the 2021 eBird taxonomy (Clements list).
Files in this collection
Audio recordings can be accessed by downloading and extracting the “soundscape_data.zip” file. Soundscape recording filenames contain a sequential file ID, recording date and timestamp in UTC. As an example, the file “SSW_001_20170225_010000Z.flac” has sequential ID 001 and was recorded on Feb 25th 2017 at 01:00:00 UTC. Ground truth annotations are listed in “annotations.csv” where each line specifies the corresponding filename, start and end time in seconds, low and high frequency in Hertz and an eBird species code. These species codes can be assigned to scientific and common name of a species with the “species.csv” file. The approximate recording location with longitude and latitude can be found in the “recording_location.txt” file.
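A short sketch of working with these files follows: it decodes the filename scheme described above and reads annotations.csv with Python's csv module. The CSV header names are assumptions; the column order is what the description specifies.

```python
# Sketch: parse a soundscape filename and iterate over the bounding-box annotations.
import csv
import re
from datetime import datetime, timezone

name = "SSW_001_20170225_010000Z.flac"
m = re.match(r"SSW_(\d+)_(\d{8})_(\d{6})Z\.flac", name)
file_id, date_str, time_str = m.groups()
start_utc = datetime.strptime(date_str + time_str, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
print(file_id, start_utc.isoformat())

with open("annotations.csv", newline="") as f:
    for row in csv.DictReader(f):
        # Expected fields: filename, start/end time (s), low/high frequency (Hz), eBird code.
        print(row)
        break
```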
Acknowledgements
Compiling this extensive dataset was a major undertaking, and we are very thankful to the domain experts who helped to collect and manually annotate the data for this collection (individual contributors in alphabetic order): Jessie Barry, Sarah Dzielski, Cullen Hanks, Robert Koch, Jim Lowe, Jay McGowan, Ashik Rahaman, Yu Shiu, Laurel Symes, and Matt Young.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
SOTAB V2 for SemTab 2023 includes datasets used to evaluate Column Type Annotation (CTA) and Columns Property Annotation (CPA) systems in the 2023 edition of the SemTab challenge. The datasets for both rounds of the challenge were down-sampled from the full train, test and validation splits of the SOTAB V2 (WDC Schema.org Table Annotation Benchmark version 2) benchmark. The first-round datasets have a smaller vocabulary of 40 labels for CTA and 50 labels for CPA, corresponding to easier/more general domains, while the second-round datasets include the full vocabulary of 80 and 105 labels and are therefore considered harder to annotate. The columns and the relationships between columns are annotated using the Schema.org and DBpedia vocabularies.
SOTAB V2 for SemTab 2023 contains the splits used in Round 1 and Round 2 of the challenge. Each round includes a training, validation and test split together with the ground truth for the test splits and the vocabulary list. The ground truth of the test sets of both rounds are manually verified.
Files contained in SOTAB V2 for SemTab 2023:
Round1-SOTAB-CPA-DatasetsAndGroundTruth = This file contains the csv files of the first round of the challenge for the task of CPA for training, validation and test set, as well as label set and ground truth in the "gt" folder. The columns in these files are annotated with Schema.org.
Round1-SOTAB-CTA-DatasetsAndGroundTruth = This file contains the csv files of the first round of the challenge for the task of CTA for training, validation and test set, as well as label set and ground truth in the "gt" folder. The columns in these files are annotated with Schema.org.
Round2-SOTAB-CPA-SCH-DatasetsAndGroundTruth = This file contains the csv files of the second round of the challenge for the task of CPA for training, validation and test set, as well as label set and ground truth in the "gt" folder. The columns in these files are annotated with Schema.org.
Round2-SOTAB-CTA-SCH-DatasetsAndGroundTruth = This file contains the csv files of the second round of the challenge for the task of CTA for training, validation and test set, as well as label set and ground truth in the "gt" folder. The columns in these files are annotated with Schema.org.
Round2-SOTAB-CPA-DBP-DatasetsAndGroundTruth = This file contains the csv files of the second round of the challenge for the task of CPA for training, validation and test set, as well as label set and ground truth in the "gt" folder. The columns in these files are annotated with DBpedia.
Round2-SOTAB-CTA-DBP-DatasetsAndGroundTruth = This file contains the csv files of the second round of the challenge for the task of CTA for training, validation and test set, as well as label set and ground truth in the "gt" folder. The columns in these files are annotated with DBpedia.
All the corresponding tables can be found in the "Tables" zip folders.
Note on License: This data includes data from the following sources. Refer to each source for license details:
THIS DATA IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
https://dataintelo.com/privacy-and-policy
According to our latest research, the AI-powered medical imaging annotation market size reached USD 1.85 billion globally in 2024. The market is experiencing robust expansion, driven by technological advancements and the rising adoption of artificial intelligence in healthcare. The market is projected to grow at a CAGR of 27.8% from 2025 to 2033, reaching a forecasted value of USD 15.69 billion by 2033. The primary growth factor fueling this trajectory is the increasing demand for accurate, scalable, and rapid annotation solutions to support AI-driven diagnostics and decision-making in clinical settings.
The growth of the AI-powered medical imaging annotation market is propelled by the exponential rise in medical imaging data generated by advanced diagnostic modalities. As healthcare providers continue to digitize patient records and imaging workflows, there is a pressing need for sophisticated annotation tools that can efficiently label vast volumes of images for training and validating AI algorithms. This trend is further amplified by the integration of machine learning and deep learning techniques, which require large, well-annotated datasets to achieve high accuracy in disease detection and classification. Consequently, hospitals, research institutes, and diagnostic centers are increasingly investing in AI-powered annotation platforms to streamline their operations and enhance clinical outcomes.
Another significant driver for the market is the growing prevalence of chronic diseases and the subsequent surge in diagnostic imaging procedures. Conditions such as cancer, cardiovascular diseases, and neurological disorders necessitate frequent imaging for early detection, monitoring, and treatment planning. The complexity and volume of these images make manual annotation labor-intensive and prone to variability. AI-powered annotation solutions address these challenges by automating the labeling process, ensuring consistency, and significantly reducing turnaround times. This not only improves the efficiency of radiologists and clinicians but also accelerates the deployment of AI-based diagnostic tools in routine clinical practice.
The evolution of regulatory frameworks and the increasing emphasis on data quality and patient safety are also shaping the growth of the AI-powered medical imaging annotation market. Regulatory agencies worldwide are encouraging the adoption of AI in healthcare, provided that the underlying data used for algorithm development is accurately annotated and validated. This has led to the emergence of specialized service providers offering compliant annotation solutions tailored to the stringent requirements of medical device approvals and clinical trials. As a result, the market is witnessing heightened collaboration between healthcare providers, technology vendors, and regulatory bodies to establish best practices and standards for medical image annotation.
Regionally, North America continues to dominate the AI-powered medical imaging annotation market, accounting for the largest share in 2024, followed closely by Europe and Asia Pacific. The United States, in particular, benefits from a mature healthcare IT infrastructure, strong research funding, and a high concentration of leading AI technology companies. Meanwhile, Asia Pacific is emerging as a high-growth region, fueled by rapid healthcare digitization, increasing investments in AI research, and expanding patient populations. Latin America and the Middle East & Africa are also witnessing steady growth, albeit at a slower pace, as healthcare systems modernize and adopt advanced imaging technologies.
The component segment of the AI-powered medical imaging annotation market is bifurcated into software and services, both of which play pivotal roles in the overall ecosystem. Software solutions encompass annotation platforms, data management tools, and integration modules that enable seamless image labeling, workflow automation, and interoperability with existing hospital information systems. These platforms leverage advanced algorithms for image segmentation, object detection, and feature extraction, significantly enhancing the speed and accuracy of annotation tasks. The increasing sophistication of annotation software, including support for multi-modality images and customizable labeling protocols, is driving widespread adoption among health
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Folder descriptions:
- images -> dataset images.
- PixelLabelData -> pixel representations for each image (annotations).
- dataset.zip -> a compressed folder of the dataset files and folders.
File descriptions:
- imds.mat -> image datastore in Matlab format.
- pxds.mat -> pixel datastore in Matlab format.
Note: file paths in the imds and pxds Matlab files need to be modified to point to the new location of the images on your disk.
Classes: 'door', 'pull door handle', 'push button', 'moveable door handle', 'push door handle', 'fire extinguisher', 'key slot', 'carpet floor', 'background wall'.
Video: - IndoorDataset_960_540.mp4 -> a short video that can be used to benchmark the system's speed (FPS) and inference capabilities.
Paper: E. Mohamed, K. Sirlantzis and G. Howells, "A pixel-wise annotated dataset of small overlooked indoor objects for semantic segmentation applications". in Data in Brief, vol.40, pp. 107791, 2022, doi:10.1016/j.dib.2022.107791.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset accompanies the manuscript titled "Deep Learning-Based Methods for Automated Estimation of Insect Length, Volume, and Biomass". It contains the annotated image data used to develop, train, and evaluate two complementary deep learning methods for automated insect morphometrics.
The dataset is divided into two primary components corresponding to the methods developed:
1. Oriented Bounding Box (OBB) Dataset:
Purpose: Used for training and evaluating an OBB model for rapid insect length estimation.
Content: Contains 815 images of insects from diverse families within the orders Diptera, Hymenoptera, and Coleoptera.
Image Acquisition: Images were captured using a low-cost, high-resolution DIY microscope, the Entomoscope (Wührl et al., 2024). Multiple focal planes were stacked using Helicon Focus (Helicon Soft Ltd, 2025) to achieve optimal clarity. Specimens were preserved in ethanol.
Format: YOLO OBB format label files.
Metadata: Specimen details and image lists are provided in the supporting information file.
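As a rough illustration of how such labels translate into a length estimate, the sketch below takes the longer side of one oriented box; it assumes the common YOLO OBB layout of "class x1 y1 x2 y2 x3 y3 x4 y4" with normalized corner coordinates, and the pixel-to-millimetre factor is a placeholder calibration value rather than anything specified by the dataset.

```python
# Sketch: estimate body length (mm) from one YOLO OBB label line (layout assumed).
import math

def obb_length_mm(label_line: str, img_w: int, img_h: int, mm_per_px: float) -> float:
    parts = label_line.split()
    coords = [float(v) for v in parts[1:9]]                     # four (x, y) corners
    pts = [(coords[i] * img_w, coords[i + 1] * img_h) for i in range(0, 8, 2)]
    side_a = math.dist(pts[0], pts[1])
    side_b = math.dist(pts[1], pts[2])
    return max(side_a, side_b) * mm_per_px                      # longer side as length

line = "0 0.42 0.31 0.58 0.33 0.57 0.69 0.41 0.67"              # hypothetical label line
print(round(obb_length_mm(line, img_w=4000, img_h=3000, mm_per_px=0.002), 2))
```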
2. Segmentation Dataset:
Purpose: Used for training and evaluating a segmentation model for detailed curvilinear length and volume estimation, demonstrated with Tachinidae (Diptera).
Content: Comprises a total of 1,320 images of several representative tachinid species.
Image Acquisition: Same as the OBB dataset (Entomoscope, Helicon Focus, ethanol preservation).
Annotation Strategy & Structure: A two-stage annotation strategy was employed, and the dataset is structured to reflect this:
initial_labeling_strategy_820_images/: Data corresponding to the initial annotation approach focusing only on directly visible body parts
final_refined_dataset_500_images/: This is the dataset used for the final model. The key refinement in this stage was the explicit annotation of inferred outlines for body parts partially obscured by wings or legs, where reliable estimation was possible.
Format: YOLO segmentation format label files.
Metadata: Specimen details and image lists are provided in the Supporting Information file.
General Information:
Diversity: For both datasets, images were captured from various views and orientations to ensure model robustness.
This dataset is provided to ensure the reproducibility of our research and to serve as a resource for the broader community working on automated image-based analysis of insects and other biological specimens.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
No description provided.