Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Fire Data Annotations is a dataset for object detection tasks - it contains Fire annotations for 1,942 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
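For quick use, the dataset can be pulled with Roboflow's `roboflow` Python package; a minimal sketch is below, where the API key, workspace and project slugs, version number, and export format are placeholders (check the dataset page for the actual values).

```python
# Minimal sketch: download a Roboflow dataset with the `roboflow` pip package.
# The API key, workspace and project slugs, version number, and export format
# are placeholders -- substitute the values shown on the dataset page.
from roboflow import Roboflow

rf = Roboflow(api_key="YOUR_API_KEY")
project = rf.workspace("your-workspace").project("fire-data-annotations")  # hypothetical slugs
dataset = project.version(1).download("coco")  # or "yolov8", "voc", ...
print(dataset.location)  # local folder containing images and annotations
```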
Leaves from genetically unique Juglans regia plants were scanned using X-ray micro-computed tomography (microCT) on the X-ray μCT beamline (8.3.2) at the Advanced Light Source (ALS) at Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA, USA. Soil samples were collected in fall 2017 from the riparian oak forest located at the Russell Ranch Sustainable Agricultural Institute at the University of California, Davis. The soil was sieved through a 2 mm mesh and air dried before imaging. A single soil aggregate was scanned at 23 keV using the 10x objective lens at a pixel resolution of 650 nanometers on beamline 8.3.2 at the ALS. Additionally, a drought-stressed almond flower bud (Prunus dulcis) from a plant housed at the University of California, Davis, was scanned using a 4x lens at a pixel resolution of 1.72 µm on beamline 8.3.2 at the ALS.
Raw tomographic image data were reconstructed using TomoPy. Reconstructions were converted to 8-bit TIFF or PNG format using ImageJ or the PIL package in Python before further processing. Images were annotated using Intel's Computer Vision Annotation Tool (CVAT) and ImageJ; both are free to use and open source. Leaf images were annotated following Théroux-Rancourt et al. (2020): hand labeling was done directly in ImageJ by drawing around each tissue, with 5 images annotated per leaf. Care was taken to cover a range of anatomical variation to help improve the generalizability of the models to other leaves. All slices were labeled by Dr. Mina Momayyezi and Fiona Duong.
To annotate the flower bud and soil aggregate, images were imported into CVAT. The exterior border of the bud (i.e., bud scales) and the flower were annotated in CVAT and exported as masks. Similarly, the exterior of the soil aggregate and particulate organic matter identified by eye were annotated in CVAT and exported as masks. To annotate air spaces in both the bud and the soil aggregate, images were imported into ImageJ. A Gaussian blur was applied to the image to decrease noise, and the air space was then segmented using thresholding. After applying the threshold, the selected air space region was converted to a binary image with white representing the air space and black representing everything else. This binary image was overlaid upon the original image, and the air space within the flower bud and aggregate was selected using the "free hand" tool. Air space outside of the region of interest was eliminated for both image sets. The quality of the air space annotation was then visually inspected for accuracy against the underlying original image; incomplete annotations were corrected using the brush or pencil tool to paint missing air space white and incorrectly identified air space black. Once the annotation was satisfactorily corrected, the binary image of the air space was saved. Finally, the annotations of the bud and flower, or of the aggregate and organic matter, were opened in ImageJ and the associated air space mask was overlaid on top of them, forming a three-layer mask suitable for training the fully convolutional network. All labeling of the soil aggregate and soil aggregate images was done by Dr. Devin Rippner. These images and annotations are for training deep learning models to identify different constituents in leaves, almond buds, and soil aggregates.
Limitations: For the walnut leaves, some tissues (stomata, etc.) are not labeled, and the annotated slices represent only a small portion of a full leaf. Similarly, both the almond bud and the aggregate represent just one sample of each. The bud tissues are divided only into bud scales, flower, and air space; many other tissues remain unlabeled. For the soil aggregate, labels were assigned by eye with no actual chemical information, so particulate organic matter identification may be incorrect.
Resources in this dataset:
Resource Title: Annotated X-ray CT images and masks of a Forest Soil Aggregate. File Name: forest_soil_images_masks_for_testing_training.zip. Resource Description: This aggregate was collected from the riparian oak forest at the Russell Ranch Sustainable Agricultural Facility. The aggregate was scanned using X-ray micro-computed tomography (microCT) on the X-ray μCT beamline (8.3.2) at the Advanced Light Source (ALS) at Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA, USA, using the 10x objective lens at a pixel resolution of 650 nanometers. For masks, the background has a value of 0,0,0; pore spaces have a value of 250,250,250; mineral solids have a value of 128,0,0; and particulate organic matter has a value of 0,128,0. These files were used for training a model to segment the forest soil aggregate and for testing the accuracy, precision, recall, and F1 score of the model.
Resource Title: Annotated X-ray CT images and masks of an Almond Bud (P. dulcis). File Name: Almond_bud_tube_D_P6_training_testing_images_and_masks.zip. Resource Description: A drought-stressed almond flower bud (Prunus dulcis) from a plant housed at the University of California, Davis, was scanned by X-ray micro-computed tomography (microCT) on the X-ray μCT beamline (8.3.2) at the Advanced Light Source (ALS) at Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA, USA, using the 4x lens at a pixel resolution of 1.72 µm. For masks, the background has a value of 0,0,0; air spaces have a value of 255,255,255; bud scales have a value of 128,0,0; and flower tissues have a value of 0,128,0. These files were used for training a model to segment the almond bud and for testing the accuracy, precision, recall, and F1 score of the model. Resource Software Recommended: Fiji (ImageJ), URL: https://imagej.net/software/fiji/downloads
Resource Title: Annotated X-ray CT images and masks of Walnut Leaves (J. regia). File Name: 6_leaf_training_testing_images_and_masks_for_paper.zip. Resource Description: Stems were collected from genetically unique J. regia accessions at the USDA-ARS-NCGR in Wolfskill Experimental Orchard, Winters, California, USA, to use as scion, and were grafted by Sierra Gold Nursery onto a commonly used commercial rootstock, RX1 (J. microcarpa × J. regia). We used a common rootstock to eliminate any own-root effects and to simulate conditions for a commercial walnut orchard setting, where rootstocks are commonly used. The grafted saplings were repotted and transferred to the Armstrong lathe house facility at the University of California, Davis, in June 2019 and kept under natural light and temperature. Leaves from each accession and treatment were scanned using X-ray micro-computed tomography (microCT) on the X-ray μCT beamline (8.3.2) at the Advanced Light Source (ALS) at Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA, USA, using the 10x objective lens at a pixel resolution of 650 nanometers. For masks, the background has a value of 170,170,170; epidermis 85,85,85; mesophyll 0,0,0; bundle sheath extension 152,152,152; vein 220,220,220; and air 255,255,255. Resource Software Recommended: Fiji (ImageJ), URL: https://imagej.net/software/fiji/downloads
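Because the mask colors encode the classes directly, a small script can convert a mask image into integer class labels for training; below is a minimal sketch using the soil-aggregate color scheme listed above (the mask file name is a placeholder).

```python
# Sketch: convert an RGB mask into integer class labels using the forest soil
# aggregate color scheme described above. The mask file name is a placeholder.
import numpy as np
from PIL import Image

COLOR_TO_CLASS = {
    (0, 0, 0): 0,        # background
    (250, 250, 250): 1,  # pore space
    (128, 0, 0): 2,      # mineral solids
    (0, 128, 0): 3,      # particulate organic matter
}

mask_rgb = np.array(Image.open("aggregate_mask_0001.png").convert("RGB"))
labels = np.zeros(mask_rgb.shape[:2], dtype=np.uint8)
for color, cls in COLOR_TO_CLASS.items():
    labels[np.all(mask_rgb == np.array(color), axis=-1)] = cls
```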
https://entrepot.recherche.data.gouv.fr/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.57745/ZDNOGF
The “Training and development dataset for information extraction in plant epidemiomonitoring” is the annotation set of the “Corpus for the epidemiomonitoring of plant”. The annotations include seven entity types (e.g., species, locations, diseases), their normalisation to the NCBI taxonomy and GeoNames, and binary (seven types) and ternary relationships. The annotations refer to character positions within the documents of the corpus. The annotation guidelines give their definitions and representative examples. Both datasets are intended for the training and validation of information extraction methods.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In this project, we aim to annotate car images captured on highways. The annotated data will be used to train machine learning models for various computer vision tasks, such as object detection and classification.
For this project, we will be using Roboflow, a powerful platform for data annotation and preprocessing. Roboflow simplifies the annotation process and provides tools for data augmentation and transformation.
Roboflow offers data augmentation capabilities, such as rotation, flipping, and resizing. These augmentations can help improve the model's robustness.
Once the data is annotated and augmented, Roboflow allows us to export the dataset in various formats suitable for training machine learning models, such as YOLO, COCO, or TensorFlow Record.
By completing this project, we will have a well-annotated dataset ready for training machine learning models. This dataset can be used for a wide range of applications in computer vision, including car detection and tracking on highways.
This dataset is the tagged CSV file resulting from a study investigating the vocalisations of Koala populations on St Bees Island. Audio data can be retrieved by date and time period and by searching annotation tags which have been applied to the audio recordings (for example, it is possible to search for all audio samples tagged with Kookaburra). Researchers can download audio files and CSV files containing information about the tags specified in the search. The 'tag' file includes: Tag Name, Start Time, End Time, Max Frequency (hz), Min Frequency (hz), Project Site, Sensor Name, Score, and a link to the specific audio sample associated with the individual tag.
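For example, the tag CSV can be filtered with pandas using the columns listed above; the file name and the exact tag spelling below are placeholders.

```python
# Sketch: load the tag CSV and keep koala tags below 1 kHz. Column names follow the
# description above; the file name and the tag value "Koala" are placeholders.
import pandas as pd

tags = pd.read_csv("st_bees_tags.csv")
koala = tags[(tags["Tag Name"] == "Koala") & (tags["Max Frequency (hz)"] < 1000)]
print(koala[["Project Site", "Sensor Name", "Start Time", "End Time", "Score"]].head())
```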
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Annotations in the context of the real sentence are as follows:
The phenotypes of mxp19 (Fig 1B) |A2:**1SP3E3| and mxp170 (data not shown) homozygotes and hemizygotes (data not shown) are identical, |A3:**1SP3E3| |A4:**1SP3E3| |A5:**1GP3E3| suggesting that mxp19 and mxp170 are null alleles. |A1:**1SP3E3| |A2:**2SP3E1| |A3:**1SP2E0| |A4:**2SP2E0| |A5:**2GP2E3|
The minimum number of sentence fragments required to represent these annotations is three:
A = “The phenotypes of mxp19 (Fig 1B)”
B = “and mxp170 (data not shown) homozygotes and hemizygotes (data not shown) are identical,”
C = “suggesting that mxp19 and mxp170 are null alleles.”
Annotators' identities are concealed with codes A1, A2, A3, A4, and A5.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is an emoticon (meme) visual annotation dataset: it collects 5,329 emoticon images and uses the glm-4v API and the step-free-api project to produce visual annotations with multimodal large models.
Example:
0f20b31d-e019-4565-9286-fdf29cc8e144.jpg
Original: 这个表情包中的内容和笑点在于它展示了一只卡通兔子,兔子的表情看起来既无奈又有些生气,配文是“活着已经够累了,上网你还要刁难我”。这句话以一种幽默的方式表达了许多人在上网时可能会遇到的挫折感或烦恼,尤其是当遇到困难或不顺心的事情时。这种对现代生活压力的轻松吐槽使得这个表情包在社交媒体上很受欢迎,人们用它来表达自己在网络世界中的疲惫感或面对困难时的幽默态度。
Translated: The content and humor of this meme come from the cartoon rabbit it shows: the rabbit's expression looks both helpless and a little angry, and the caption reads "Living is already tiring enough, and you still give me a hard time online." The caption humorously expresses the frustration or annoyance many people feel when surfing the internet, especially when things are difficult or don't go their way. This lighthearted take on the pressures of modern life has made the meme popular on social media, where people use it to express exhaustion with the online world or to face difficulties with humor.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Included here are a coding manual and supplementary examples of gesture forms (in still images and video recordings) that informed the coding of the first author (Kate Mesh) and four project reliability coders.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Our dataset contains 2 weeks of approx. 8-9 hours of acceleration data per day from 11 participants wearing a Bangle.js Version 1 smartwatch with our firmware installed.
The dataset contains annotations from 4 different annotation methods commonly utilized in user studies that focus on in-the-wild data. These methods can be grouped into user-driven, in situ annotations, which are performed before or during the recorded activity, and recall methods, where participants annotate their data in hindsight at the end of the day.
The participants were asked to label their activities using (1) a button located on the smartwatch, (2) the activity tracking app Strava, (3) a (hand)written diary, and (4) a tool to visually inspect and label activity data, called MAD-GUI. Methods (1)-(3) were used in both weeks, whereas method (4) was introduced at the beginning of the second study week.
The accelerometer data are recorded at 25 Hz with a sensitivity of ±8g and are stored in CSV format. Labels and raw data are not yet combined; you can either write your own script to label the data or follow the instructions in our corresponding GitHub repository.
The following unique classes are included in our dataset:
laying, sitting, walking, running, cycling, bus_driving, car_driving, vacuum_cleaning, laundry, cooking, eating, shopping, showering, yoga, sport, playing_games, desk_work, guitar_playing, gardening, table_tennis, badminton, horse_riding.
However, many activities are very participant-specific and were therefore performed by only one of the participants.
The labels are also stored as a .csv file and have the following columns:
week_day, start, stop, activity, layer
Example:
week2_day2,10:30:00,11:00:00,vacuum_cleaning,d
The layer column specifies which annotation method was used to set this label; a short parsing sketch follows the identifier list below.
The following identifiers can be found in the column:
b: in situ button
a: in situ app
d: self-recall diary
g: time-series recall labelled with the MAD-GUI
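A minimal sketch for reading the label file and mapping the layer identifiers is below; the file name is a placeholder, while the columns and identifiers follow the description above.

```python
# Sketch: load the label CSV (columns: week_day, start, stop, activity, layer) and
# map the layer identifiers to the four annotation methods. File name is a placeholder.
import pandas as pd

LAYER_NAMES = {
    "b": "in situ button",
    "a": "in situ app",
    "d": "self-recall diary",
    "g": "time-series recall (MAD-GUI)",
}

labels = pd.read_csv("participant_01_labels.csv")
labels["method"] = labels["layer"].map(LAYER_NAMES)
print(labels[labels["activity"] == "vacuum_cleaning"].head())
```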
The corresponding publication is currently under review.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
IT Skills Named Entity Recognition (NER) Dataset
Description:
This dataset includes 5,029 curriculum vitae (CV) samples, each annotated with IT skills using Named Entity Recognition (NER). The skills are manually labeled and extracted from PDFs, and the data is provided in JSON format. This dataset is ideal for training and evaluating NER models, especially for extracting IT skills from CVs.
Highlights:
5,029 CV samples with annotated IT skills; manual annotations for… See the full description on the dataset page: https://huggingface.co/datasets/Mehyaar/Annotated_NER_PDF_Resumes.
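As a loose illustration only, a record-level pass over the JSON annotations might look like the sketch below; the file name and field names (`text`, `entities` as [start, end, label] spans) are assumptions rather than the documented schema, so check the dataset page for the actual format.

```python
# Hedged sketch: iterate over annotated CV records and print labelled IT skills.
# The file name and the field names ("text", "entities" as [start, end, label])
# are assumptions for illustration only.
import json

with open("annotated_resumes.json", encoding="utf-8") as f:  # placeholder file name
    records = json.load(f)

for record in records[:3]:
    text = record["text"]
    for start, end, label in record["entities"]:
        print(label, "->", text[start:end])
```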
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The HED schema library for the Standardized Computer-based Organized Reporting of EEG (SCORE) can be used to add annotations for BIDS datasets. The annotations are machine readable and validated with the BIDS and HED validators.
This example is related to the following preprint: Dora Hermes, Tal Pal Attia, Sándor Beniczky, Jorge Bosch-Bayard, Arnaud Delorme, Brian Nils Lundstrom, Christine Rogers, Stefan Rampp, Seyed Yahya Shirazi, Dung Truong, Pedro Valdes-Sosa, Greg Worrell, Scott Makeig, Kay Robbins. Hierarchical Event Descriptor library schema for EEG data annotation. arXiv preprint arXiv:2310.15173. 2024 Oct 27.
This BIDS example dataset includes iEEG data from one subject that were measured during clinical photic stimulation. Intracranial EEG data were collected at Mayo Clinic Rochester, MN under IRB#: 15-006530.
The events are annotated according to the HED-SCORE schema library. Data are annotated by adding a column for annotations in the _events.tsv. The levels and annotations in this column are defined in the _events.json sidecar as HED tags.
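For illustration, a sidecar entry of this kind can be written as below; the column name, level, and HED tags are illustrative placeholders rather than the actual annotations used in this dataset.

```python
# Sketch: write an _events.json sidecar that attaches HED tags to the levels of an
# annotation column. Column name, level, and tags are illustrative placeholders.
import json

sidecar = {
    "annotation": {
        "Description": "Clinical annotation of the event",
        "Levels": {"photic_stimulation": "Photic stimulation period"},
        "HED": {"photic_stimulation": "Sensory-event, Visual-presentation"},
    }
}

with open("sub-01_task-rest_events.json", "w") as f:
    json.dump(sidecar, f, indent=2)
```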
HED: https://www.hedtags.org/ HED schema library for SCORE: https://github.com/hed-standard/hed-schema-library
Dora Hermes: hermes.dora@mayo.edu
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Labelled industry datasets are one of the most valuable assets in prognostics and health management (PHM) research. However, creating labelled industry datasets is both difficult and expensive, making publicly available industry datasets rare at best, in particular labelled ones. Recent studies have showcased that industry annotations can be used to train artificial intelligence models directly on industry data (https://doi.org/10.36001/ijphm.2022.v13i2.3137, https://doi.org/10.36001/phmconf.2023.v15i1.3507), but while many industry datasets also contain text descriptions or logbooks in the form of annotations and maintenance work orders, few, if any, are publicly available. Therefore, we release a dataset consisting of annotated signal data from two large (80 m × 10 m × 10 m) paper machines at a Kraftliner production company in northern Sweden. The data consist of 21,090 pairs of signals and annotations from one year of production. The annotations are written in Swedish by on-site Swedish experts, and the signals consist primarily of accelerometer vibration measurements from the two machines. The dataset is structured as a pandas dataframe and serialized as a pickle (.pkl) file and a JSON (.json) file. The first column (‘id’) is the ID of the samples; the second column (‘Spectra’) contains the fast-Fourier-transformed and envelope-transformed vibration signals; the third column (‘Notes’) contains the associated annotations, mapped so that each annotation is associated with all signals from ten days before the annotation date up to the annotation date; and the fourth column (‘Embeddings’) contains pre-computed embeddings using Swedish SentenceBERT. Each row corresponds to a vibration measurement sample, though there is no distinction in this data between which sensor or machine part each measurement is from.
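A minimal loading sketch follows; the .pkl file name is a placeholder, while the column names are those described above.

```python
# Sketch: load the serialized dataframe and inspect one vibration/annotation pair.
# The file name is a placeholder; columns follow the dataset description.
import pandas as pd

df = pd.read_pickle("paper_machine_annotations.pkl")
sample = df.iloc[0]
print(sample["id"])               # sample ID
print(sample["Notes"])            # associated Swedish annotation text
print(len(sample["Spectra"]))     # FFT / envelope-transformed vibration signal
print(len(sample["Embeddings"]))  # Swedish SentenceBERT embedding of the notes
```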
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This online appendix contains the coding guide and the data used in the paper "Information Correspondence between Types of Documentation for APIs", accepted for publication in the Empirical Software Engineering (EMSE) journal. The tutorial data was retrieved in October 2018.
It contains the following files:
CodingGuide.pdf: the coding guide to classify a sentence as API Information or Supporting Text.
annotated_sampled_sentences.csv: the set of 332 sampled sentences and two columns of corresponding annotations – one by the first author of this work and the second by an external annotator. This data was used to calculate the agreement score reported in the paper.
[language]-[topic].csv: the data set of annotated sentences in the tutorial on [topic] in [language]. For example, Python-REGEX.csv is the file containing sentences from the Python tutorial on regular expressions. Each such file contains the preprocessed sentences from the tutorial, their source files, and their annotation of sentence correspondence with the reference documentation.
For licensing reasons, we are unable to upload the original API reference documentation and tutorials; however, these are available on request.
This dataset was recorded in the Virtual Annotated Cooking Environment (VACE), a new open-source virtual reality dataset (https://sites.google.com/view/vacedataset) and simulator (https://github.com/michaelkoller/vacesimulator) for object interaction tasks in a rich kitchen environment. We use the Unity-based VR simulator to create thoroughly annotated video sequences of a virtual human avatar performing food preparation activities. Based on the MPII Cooking 2 dataset, it enables the recreation of recipes for meals such as sandwiches, pizzas, and fruit salads, as well as smaller activity sequences such as cutting vegetables. For complex recipes, multiple samples are present, following different orderings of valid partially ordered plans. The dataset includes an RGB and depth camera view, bounding boxes, object segmentation masks, human joint poses and object poses, as well as ground-truth interaction data in the form of temporally labeled semantic predicates (holding, on, in, colliding, moving, cutting). In our effort to make the simulator accessible as an open-source tool, researchers are able to expand the setting and annotation to create additional data samples.
The research leading to these results has received funding from the Austrian Science Fund (FWF) under grant agreement No. I3969-N30 InDex and the project Doctorate College TrustRobots by TU Wien. Thanks go out to Simon Schreiberhuber for sharing his Unity expertise and to the colleagues at the TU Wien Center for Research Data Management for data hosting and support.
Background Currently, most genome annotation is curated by centralized groups with limited resources. Efforts to share annotations transparently among multiple groups have not yet been satisfactory.
Results
Here we introduce a concept called the Distributed Annotation System (DAS). DAS allows sequence annotations to be decentralized among multiple third-party annotators and integrated on an as-needed basis by client-side software. The communication between client and servers in DAS is defined by the DAS XML specification. Annotations are displayed in layers, one per server. Any client or server adhering to the DAS XML specification can participate in the system; we describe a simple prototype client and server example.
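For orientation, a DAS-style client request might look like the sketch below; the server URL and data-source name are hypothetical, and the URL pattern and XML element names follow the general DAS 1.x specification rather than the specific prototype described here.

```python
# Hedged sketch of a DAS client: request features for a sequence segment and list
# the returned annotations. Server URL and data source are placeholders; the
# "features" command and FEATURE elements follow the DAS 1.x specification.
from urllib.request import urlopen
import xml.etree.ElementTree as ET

url = "http://das.example.org/das/my_source/features?segment=chr1:1,10000"
with urlopen(url) as response:
    tree = ET.parse(response)

for feature in tree.iter("FEATURE"):
    print(feature.get("id"), feature.get("label"))
```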
Conclusions
The DAS specification is being used experimentally by Ensembl, WormBase, and the Berkeley Drosophila Genome Project. Continued success will depend on the readiness of the research community to adopt DAS and provide annotations. All components are freely available from the project website.
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0) https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
The dataset is a collection of images along with corresponding bounding box annotations that are specifically curated for detecting cows in images. The dataset covers different cow breeds, sizes, and orientations, providing a comprehensive representation of cow appearances and positions. Additionally, the visibility of each cow is recorded in the .xml file.
The cow detection dataset offers a diverse collection of annotated images, allowing for comprehensive algorithm development, evaluation, and benchmarking, ultimately aiding in the development of accurate and robust models.
Each image from the images folder is accompanied by an XML annotation in the annotations.xml file indicating the coordinates of the bounding boxes for cow detection. For each point, the x and y coordinates are provided. Visibility of the cow is also provided by the is_visible label (true, false).
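A hedged parsing sketch is shown below; it assumes a CVAT-style annotations.xml layout (`image` elements containing `box` children with corner coordinates and an `is_visible` attribute), and the actual element and attribute names may differ.

```python
# Hedged sketch: parse annotations.xml and print the bounding boxes per image.
# Element/attribute names (image, box, xtl/ytl/xbr/ybr, is_visible) assume a
# CVAT-style export and may differ from the actual file.
import xml.etree.ElementTree as ET

root = ET.parse("annotations.xml").getroot()
for image in root.iter("image"):
    for box in image.iter("box"):
        print(image.get("name"),
              box.get("xtl"), box.get("ytl"), box.get("xbr"), box.get("ybr"),
              box.get("is_visible"))
```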
keywords: farm animal, animal recognition, farm animal detection, image-based recognition, farmers, “on-farm” data, cows detection, cow images dataset, object detection, deep learning, computer vision, animal contacts, images dataset, agriculture, multiple animal pose estimation, cattle detection, identification, posture recognition, cattle images, individual beef cattle, cattle ranch, dairy cattle, farming, bounding boxes
Overview
This dataset of medical misinformation was collected and is published by Kempelen Institute of Intelligent Technologies (KInIT). It consists of approx. 317k news articles and blog posts on medical topics published between January 1, 1998 and February 1, 2022 from a total of 207 reliable and unreliable sources. The dataset contains full-texts of the articles, their original source URL and other extracted metadata. If a source has a credibility score available (e.g., from Media Bias/Fact Check), it is also included in the form of annotation. Besides the articles, the dataset contains around 3.5k fact-checks and extracted verified medical claims with their unified veracity ratings published by fact-checking organisations such as Snopes or FullFact. Lastly and most importantly, the dataset contains 573 manually and more than 51k automatically labelled mappings between previously verified claims and the articles; mappings consist of two values: claim presence (i.e., whether a claim is contained in the given article) and article stance (i.e., whether the given article supports or rejects the claim or provides both sides of the argument).
The dataset is primarily intended to be used as a training and evaluation set for machine learning methods for claim presence detection and article stance classification, but it enables a range of other misinformation related tasks, such as misinformation characterisation or analyses of misinformation spreading.
Its novelty and our main contributions lie in (1) the focus on medical news articles and blog posts as opposed to social media posts or political discussions; (2) providing multiple modalities (besides full-texts of the articles, there are also images and videos), thus enabling research of multimodal approaches; (3) the mapping of the articles to the fact-checked claims (with manual as well as predicted labels); (4) providing source credibility labels for 95% of all articles and other potential sources of weak labels that can be mined from the articles' content and metadata.
The dataset is associated with the research paper "Monant Medical Misinformation Dataset: Mapping Articles to Fact-Checked Claims" accepted and presented at ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '22).
The accompanying GitHub repository provides a small static sample of the dataset and the dataset's descriptive analysis in the form of Jupyter notebooks.
Options to access the dataset
There are two ways to get access to the dataset:
1. Static dump of the dataset available in the CSV format
2. Continuously updated dataset available via REST API
In order to obtain access to the dataset (either the full static dump or the REST API), please request access by following the instructions provided below.
References
If you use this dataset in any publication, project, tool or in any other form, please cite the following papers:
@inproceedings{SrbaMonantPlatform,
author = {Srba, Ivan and Moro, Robert and Simko, Jakub and Sevcech, Jakub and Chuda, Daniela and Navrat, Pavol and Bielikova, Maria},
booktitle = {Proceedings of Workshop on Reducing Online Misinformation Exposure (ROME 2019)},
pages = {1--7},
title = {Monant: Universal and Extensible Platform for Monitoring, Detection and Mitigation of Antisocial Behavior},
year = {2019}
}
@inproceedings{SrbaMonantMedicalDataset,
author = {Srba, Ivan and Pecher, Branislav and Tomlein, Matus and Moro, Robert and Stefancova, Elena and Simko, Jakub and Bielikova, Maria},
booktitle = {Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '22)},
numpages = {11},
title = {Monant Medical Misinformation Dataset: Mapping Articles to Fact-Checked Claims},
year = {2022},
doi = {10.1145/3477495.3531726},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3477495.3531726},
}
Dataset creation process
In order to create this dataset (and to continuously obtain new data), we used our research platform Monant. The Monant platform provides so-called data providers to extract news articles/blogs from news/blog sites as well as fact-checking articles from fact-checking sites. General parsers (for RSS feeds, WordPress sites, Google Fact Check Tool, etc.) as well as custom crawlers and parsers were implemented (e.g., for the fact-checking site Snopes.com). All data are stored in a unified format in a central data storage.
Ethical considerations
The dataset was collected and is published for research purposes only. We collected only publicly available content of news/blog articles. The dataset contains identities of authors of the articles if they were stated in the original source; we left this information, since the presence of an author's name can be a strong credibility indicator. However, we anonymised the identities of the authors of discussion posts included in the dataset.
The main identified ethical issue related to the presented dataset lies in the risk of mislabelling of an article as supporting a false fact-checked claim and, to a lesser extent, in mislabelling an article as not containing a false claim or not supporting it when it actually does. To minimise these risks, we developed a labelling methodology and require an agreement of at least two independent annotators to assign a claim presence or article stance label to an article. It is also worth noting that we do not label an article as a whole as false or true. Nevertheless, we provide partial article-claim pair veracities based on the combination of claim presence and article stance labels.
As to the veracity labels of the fact-checked claims and the credibility (reliability) labels of the articles' sources, we take these from the fact-checking sites and external listings such as Media Bias/Fact Check as they are and refer to their methodologies for more details on how they were established.
Lastly, the dataset also contains automatically predicted labels of claim presence and article stance using our baselines described in the next section. These methods have their limitations and work with certain accuracy as reported in this paper. This should be taken into account when interpreting them.
Reporting mistakes in the dataset
The means to report considerable mistakes in raw collected data or in manual annotations is by creating a new issue in the accompanying GitHub repository. Alternatively, general enquiries or requests can be sent to info [at] kinit.sk.
Dataset structure
Raw data
At first, the dataset contains so-called raw data (i.e., data extracted by the web monitoring module of the Monant platform and stored in exactly the same form as they appear at the original websites). Raw data consist of articles from news sites and blogs (e.g., naturalnews.com), discussions attached to such articles, and fact-checking articles from fact-checking portals (e.g., snopes.com). In addition, the dataset contains feedback (number of likes, shares, comments) provided by users on the social network Facebook, which is regularly extracted for all news/blog articles.
Raw data are contained in these CSV files (and corresponding REST API endpoints):
Note: Personal information about discussion posts' authors (name, website, gravatar) are anonymised.
Annotations
Secondly, the dataset contains so-called annotations. Entity annotations describe individual raw data entities (e.g., article, source). Relation annotations describe a relation between two such entities.
Each annotation is described by the following attributes:
At the same time, annotations are associated with a particular object identified by:
entity_type (in case of entity annotations), or source_entity_type and target_entity_type (in case of relation annotations). Possible values: sources, articles, fact-checking-articles.
entity_id (in case of entity annotations), or source_entity_id and target_entity_id (in case of relation annotations).
FAVA Dataset (Processed)
Dataset Description
Dataset Summary
The FAVA (Factual Association and Verification Annotations) dataset is designed for evaluating hallucinations in language model outputs. This processed version contains binary hallucination labels derived from detailed span-level annotations in the original dataset.
Dataset Structure
Each example contains:
Required columns:
query: the prompt given to the model
context: empty field (for… See the full description on the dataset page: https://huggingface.co/datasets/wandb/fava-data-processed.
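A minimal loading sketch with the Hugging Face `datasets` library is below; the split name is an assumption, so check the dataset page for the available splits.

```python
# Sketch: load the processed FAVA dataset from the Hugging Face Hub.
# The split name ("train") is an assumption.
from datasets import load_dataset

fava = load_dataset("wandb/fava-data-processed", split="train")
print(fava.column_names)  # expect fields such as `query` and `context` per the description
print(fava[0])
```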
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The SEE-AI Project Dataset is a collection of small bowel capsule endoscopy (CE) images obtained using the PillCam™ SB 3 (Medtronic, Minneapolis, MN, USA), which is the subject of the present paper (Small Bowel Capsule Endoscopy Examination with Object Detection Artificial Intelligence Model: The SEE-AI Project; the paper is currently in submission). This dataset comprises 18,481 images extracted from 523 small bowel capsule endoscopy videos. We annotated 12,320 images with 23,033 disease lesions and combined them with 6,161 normal mucosa images. The annotations are provided in YOLO format. While automated or assisted reading techniques for small bowel CE are highly desired, current AI models have not yet been able to accurately identify multiple types of clinically relevant lesions from CE images to the same extent as expert physicians. One major reason for this is the presence of a certain number of images that are difficult to annotate and label, and the lack of adequately constructed datasets. In the aforementioned paper, we tested an object detection model using YOLOv5. The annotations were created by us, and we believe that more effective methods for annotation should be further investigated. We hope that this dataset will be useful for future small bowel CE object detection research.
We have presented the dataset of the SEE-AI project at Kaggle (https://www.kaggle.com/), the world’s largest data science online community. Our data are licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) License. The material is free to copy and redistribute in any medium or format and can be remixed, transformed, and built upon for any purpose if appropriate credit is given.
More details on this data set can be found in the following paper. Please cite this paper when using this dataset. Yokote, A., Umeno, J., Kawasaki, K., Fujioka, S., Fuyuno, Y., Matsuno, Y., Yoshida, Y., Imazu, N., Miyazono, S., Moriyama, T., Kitazono, T. and Torisu, T. (2024), Small bowel capsule endoscopy examination and open access database with artificial intelligence: The SEE-artificial intelligence project. DEN Open, 4: e258. https://doi.org/10.1002/deo2.258
The main content of The SEE-AI Project Dataset comprises image data and annotation data: 18,481 images and annotations in YOLO format are available. The annotations are written in .txt files whose filenames match the corresponding image files; empty .txt files are provided for images without annotations.
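For reference, each line of a YOLO-format label file holds a class index followed by the normalized center x, center y, width, and height of one box; a small reading sketch (with a placeholder file name) is below.

```python
# Sketch: parse one YOLO-format label file. Each line is
# "class x_center y_center width height" with coordinates normalized to [0, 1];
# an empty file means the image has no annotated lesions. File name is a placeholder.
from pathlib import Path

label_path = Path("image_000001.txt")  # matches the corresponding image file name
for line in label_path.read_text().splitlines():
    cls, x_c, y_c, w, h = line.split()
    print(int(cls), float(x_c), float(y_c), float(w), float(h))
```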
We want to thank the Department of Medicine and Clinical Science, Kyushu University, for their cooperation in data collection. We also thank Ultralytics for making YOLOv5 available. The project name of this dataset was changed due to a name duplication; the previous project name was The AICE Project. This was changed on May 14, 2023.
https://colab.research.google.com/drive/1mEE5zXq1U9vC01P-qjxHR2kvxr_3Imz0?usp=sharing
We would be grateful if you could consider setting up better annotation and collecting more small intestine CE images. We hope that many more facilities will collect CE images in the future and that datasets will become larger.
https://dataintelo.com/privacy-and-policy
The global image tagging and annotation services market size was valued at approximately USD 1.5 billion in 2023 and is projected to reach around USD 4.8 billion by 2032, growing at a compound annual growth rate (CAGR) of about 14%. This robust growth is driven by the exponential rise in demand for machine learning and artificial intelligence applications, which heavily rely on annotated datasets to train algorithms effectively. The surge in digital content creation and the increasing need for organized data for analytical purposes are also significant contributors to the market expansion.
One of the primary growth factors for the image tagging and annotation services market is the increasing adoption of AI and machine learning technologies across various industries. These technologies require large volumes of accurately labeled data to function optimally, making image tagging and annotation services crucial. Specifically, sectors such as healthcare, automotive, and retail are investing in AI-driven solutions that necessitate high-quality annotated images to enhance machine learning models' efficiency. For example, in healthcare, annotated medical images are essential for developing tools that can aid in diagnostics and treatment decisions. Similarly, in the automotive industry, annotated images are pivotal for the development of autonomous vehicles.
Another significant driver is the growing emphasis on improving customer experience through personalized solutions. Companies are leveraging image tagging and annotation services to better understand consumer behavior and preferences by analyzing visual content. In retail, for instance, businesses analyze customer-generated images to tailor marketing strategies and improve product offerings. Additionally, the integration of augmented reality (AR) and virtual reality (VR) in various applications has escalated the need for precise image tagging and annotation, as these technologies rely on accurately labeled datasets to deliver immersive experiences.
Data Collection and Labeling are foundational components in the realm of image tagging and annotation services. The process of collecting and labeling data involves gathering vast amounts of raw data and meticulously annotating it to create structured datasets. These datasets are crucial for training machine learning models, enabling them to recognize patterns and make informed decisions. The accuracy of data labeling directly impacts the performance of AI systems, making it a critical step in the development of reliable AI applications. As industries increasingly rely on AI-driven solutions, the demand for high-quality data collection and labeling services continues to rise, underscoring their importance in the broader market landscape.
The rising trend of digital transformation across industries has also significantly bolstered the demand for image tagging and annotation services. Organizations are increasingly investing in digital tools that can automate processes and enhance productivity. Image annotation plays a critical role in enabling technologies such as computer vision, which is instrumental in automating tasks ranging from quality control to inventory management. Moreover, the proliferation of smart devices and the Internet of Things (IoT) has led to an unprecedented amount of image data generation, further fueling the need for efficient image tagging and annotation services to make sense of the vast data deluge.
From a regional perspective, North America is currently the largest market for image tagging and annotation services, attributed to the early adoption of advanced technologies and the presence of numerous tech giants investing in AI and machine learning. The region is expected to maintain its dominance due to ongoing technological advancements and the growing demand for AI solutions across various sectors. Meanwhile, the Asia Pacific region is anticipated to experience the fastest growth during the forecast period, driven by rapid industrialization, increasing internet penetration, and the rising adoption of AI technologies in countries like China, India, and Japan. The European market is also witnessing steady growth, supported by government initiatives promoting digital innovation and the use of AI-driven applications.
The service type segment in the image tagging and annotation services market is bifurcated into manual annotation and automated annotation.