Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Manual annotation for human action recognition with content semantics using 3D Point Cloud (3D-PC) data in industrial environments consumes a lot of time and resources. This work aims to recognize, analyze, and model human actions to develop a framework for automatically extracting content semantics. Main contributions of this work: 1. design of a multi-layer structure of various DNN classifiers to detect and extract humans and dynamic objects from 3D-PC precisely, 2. empirical experiments with over 10 subjects for collecting datasets of human actions and activities in an industrial setting, 3. development of an intuitive GUI to verify human actions and their interaction activities with the environment, 4. design and implementation of a methodology for automatic sequence matching of human actions in 3D-PC. All these procedures are merged in the proposed framework and evaluated in one industrial use case with flexible patch sizes. Comparing the new approach with standard methods has shown that the annotation process can be accelerated by a factor of 5.2 through automation.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Robotic manipulation remains a core challenge in robotics, particularly for contact-rich tasks such as industrial assembly and disassembly. Existing datasets have significantly advanced learning in manipulation but are primarily focused on simpler tasks like object rearrangement, falling short of capturing the complexity and physical dynamics involved in assembly and disassembly. To bridge this gap, we present REASSEMBLE (Robotic assEmbly disASSEMBLy datasEt), a new dataset designed specifically for contact-rich manipulation tasks. Built around the NIST Assembly Task Board 1 benchmark, REASSEMBLE includes four actions (pick, insert, remove, and place) involving 17 objects. The dataset contains 4,551 demonstrations, of which 4,035 were successful, spanning a total of 781 minutes. Our dataset features multi-modal sensor data including event cameras, force-torque sensors, microphones, and multi-view RGB cameras. This diverse dataset supports research in areas such as learning contact-rich manipulation, task condition identification, action segmentation, and more. We believe REASSEMBLE will be a valuable resource for advancing robotic manipulation in complex, real-world scenarios.
Each demonstration starts by randomizing the board and object poses, after which an operator teleoperates the robot to assemble and disassemble the board while narrating their actions and marking task segment boundaries with key presses. The narrated descriptions are transcribed using Whisper [1], and the board and camera poses are measured at the beginning using a motion capture system, though continuous tracking is avoided due to interference with the event camera. Sensory data is recorded with rosbag and later post-processed into HDF5 files without downsampling or synchronization, preserving raw data and timestamps for future flexibility. To reduce memory usage, video and audio are stored as encoded MP4 and MP3 files, respectively. Transcription errors are corrected automatically or manually, and a custom visualization tool is used to validate the synchronization and correctness of all data and annotations. Missing or incorrect entries are identified and corrected, ensuring the dataset’s completeness. Low-level Skill annotations were added manually after data collection, and all labels were carefully reviewed to ensure accuracy.
The dataset consists of several HDF5 (.h5) and JSON (.json) files, organized into two directories. The poses directory contains the JSON files, which store the poses of the cameras and the board in the world coordinate frame. The data directory contains the HDF5 files, which store the sensory readings and annotations collected as part of the REASSEMBLE dataset. Each JSON file can be matched with its corresponding HDF5 file based on their filenames, which include the timestamp when the data was recorded. For example, 2025-01-09-13-59-54_poses.json corresponds to 2025-01-09-13-59-54.h5.
The structure of the JSON files is as follows:
{"Hama1": [
[x ,y, z],
[qx, qy, qz, qw]
],
"Hama2": [
[x ,y, z],
[qx, qy, qz, qw]
],
"DAVIS346": [
[x ,y, z],
[qx, qy, qz, qw]
],
"NIST_Board1": [
[x ,y, z],
[qx, qy, qz, qw]
]
}
[x, y, z]
represent the position of the object, and [qx, qy, qz, qw]
represent its orientation as a quaternion.
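For illustration, a minimal Python sketch (standard library only; the helper name is ours) for loading such a pose file and looking up an object's pose:

import json

def load_poses(path):
    # Returns {object_name: (position [x, y, z], quaternion [qx, qy, qz, qw])}.
    with open(path, "r") as f:
        poses = json.load(f)
    return {name: (pos, quat) for name, (pos, quat) in poses.items()}

# Example, using the file name shown above:
# poses = load_poses("poses/2025-01-09-13-59-54_poses.json")
# position, quaternion = poses["NIST_Board1"]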
The HDF5 (.h5) format organizes data into two main types of structures: datasets, which hold the actual data, and groups, which act like folders that can contain datasets or other groups. In the diagram below, groups are shown as folder icons, and datasets as file icons. The main group of the file directly contains the video, audio, and event data. To save memory, video and audio are stored as encoded byte strings, while event data is stored as arrays. The robot’s proprioceptive information is kept in the robot_state group as arrays. Because different sensors record data at different rates, the arrays vary in length (signified by the N_xxx variable in the data shapes). To align the sensory data, each sensor’s timestamps are stored separately in the timestamps group. Information about action segments is stored in the segments_info group. Each segment is saved as a subgroup, named according to its order in the demonstration, and includes a start timestamp, end timestamp, a success indicator, and a natural language description of the action. Within each segment, low-level skills are organized under a low_level subgroup, following the same structure as the high-level annotations.
[Diagram: structure of a REASSEMBLE HDF5 file, with groups shown as folder icons and datasets as file icons.]
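For orientation, a minimal sketch of inspecting one of these files with h5py, based on the group names described above (robot_state, timestamps, segments_info, low_level). Whether the per-segment fields are stored as attributes or small datasets is not specified here, so the sketch prints both:

import h5py

# File name follows the timestamp naming convention described above.
with h5py.File("data/2025-01-09-13-59-54.h5", "r") as f:
    print(list(f.keys()))            # video, audio, event data, and the groups below

    robot_state = f["robot_state"]   # proprioceptive arrays of differing lengths
    timestamps = f["timestamps"]     # per-sensor timestamps used for alignment

    # Annotated action segments, named by their order in the demonstration.
    for name, segment in f["segments_info"].items():
        print(name, dict(segment.attrs), list(segment.keys()))
        if "low_level" in segment:
            # Low-level skill annotations mirror the high-level structure.
            print("  low-level:", list(segment["low_level"].keys()))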
The splits folder contains two text files that list the .h5 files used for the training and validation splits.
The project website contains more details about the REASSEMBLE dataset. The code for loading and visualizing the data is available in our GitHub repository.
📄 Project website: https://tuwien-asl.github.io/REASSEMBLE_page/
💻 Code: https://github.com/TUWIEN-ASL/REASSEMBLE
Recording | Issue |
2025-01-10-15-28-50.h5 | hand cam missing at beginning |
2025-01-10-16-17-40.h5 | missing hand cam |
2025-01-10-17-10-38.h5 | hand cam missing at beginning |
2025-01-10-17-54-09.h5 | no empty action at |
https://www.htfmarketinsights.com/privacy-policy
Global Generative AI in Data Labeling Solution and Services is segmented by Application (Autonomous driving, NLP, Medical imaging, Retail AI, Robotics), Type (Text Annotation, Image/Video Tagging, Audio Labeling, 3D Point Cloud Labeling, Synthetic Data Generation) and Geography (North America, LATAM, West Europe, Central & Eastern Europe, Northern Europe, Southern Europe, East Asia, Southeast Asia, South Asia, Central Asia, Oceania, MEA).
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Introduction
As mobile service robots increasingly operate in human-centered environments, they must learn to use elevators without modifying elevator hardware. This task traditionally involves processing an image of an elevator control panel using instance segmentation of the buttons and labels, reading the text on the labels, and associating buttons with their corresponding labels. In addition to the standard approach, our project also implements an additional segmentation step in which missing buttons and labels are recovered after the first feature-detection pass. In a robust system, training data for both the first segmentation pass and the recovery model requires pixel-level annotations of buttons and labels, while the label-reading step needs annotations of the text on the labels. Current elevator panel feature datasets, however, either do not provide segmentation annotations or do not draw distinctions between buttons and labels. The "Living With Robots Elevator Button Dataset" was assembled for the purpose of training segmentation and scene text recognition models on realistic scenarios involving varying conditions such as lighting, blur, and position of the camera relative to the elevator control panel. Buttons are labeled with the same action as their respective labels for the purpose of training a button-label association model. A pipeline including all of the task steps mentioned above was trained and evaluated, producing state-of-the-art accuracy and precision results using the high-quality elevator button dataset.
Dataset Contents
400 JPEG images of elevator panels: 292 taken of 25 different elevators across 24 buildings on the University of Texas at Austin campus, and 108 sourced from the internet, with varying lighting, quality, and perspective conditions.
JSON files containing border annotations, button and label distinctions, and text on labels for the Campus and Internet sub-datasets.
PyTorch files containing state dictionaries with network weights for:
The first-pass segmentation model, a transformer-based model trained to segment buttons and labels in a full-color image: "segmentation_vit_model.pth".
The feature-recovery segmentation model, a transformer-based model trained to segment masks of missed buttons and labels from the class-map output of the first pass: "recovery_vit_model.pth".
The scene text recognition model, trained from PARSeq to read the special characters present on elevator panel labels: "parseq_str.ckpt".
Links to the data loader, training, and evaluation scripts for the segmentation models, hosted on GitHub.
The data subsets are all JPGs collected through two different means. The campus subset images were taken in buildings on and around the University of Texas at Austin campus. All pictures were taken facing the elevator panel's wall roughly straight-on, while the camera itself was positioned in each of nine locations in a 3x3 grid layout relative to the panel: top left, top middle, top right, middle left, center, middle right, bottom left, bottom middle, and bottom right. A subset of these also includes versions of each image with the elevator door closed or open, varying the lighting and background conditions. All of these images are 3024 × 4032 pixels and were taken with either an iPhone 12 or 12 Pro Max. The Internet subset deliberately features user-shared photos with irregular or uncommon panel characteristics. Images in this subset vary widely in terms of resolution, clarity, button/label shape, and angle of the image, adding variety to the dataset and robustness to any models trained with it.
Data Segmentation
The segmentation annotations served two training purposes. First, they were used to identify the pixels that comprise the elevator buttons and labels in the images; a segmentation model was then trained to accurately recognize buttons and labels in an image at the pixel level. The second use, and the one that most distinguishes our approach, was training a separate model to recover missed button and label detections. The annotations were used to generate class maps of each image, which were then procedurally masked to provide a ground truth (the remaining masks) and a target (the hidden masks) for the recovery model.
Data Annotation Method
All annotations were done with the VGG Image Annotator published by the University of Oxford. All images were given their own set of annotations, identified by their file naming convention. Regarding the segmentation annotations, any button that was largely in view of the image was segmented as one of several shapes that most closely fit the feature: rectangle, ellipse, or polygon. In the annotation JSONs, these appear either as the coordinates of each point of a polygon or as the dimensions of an ellipse (center coordinates, radius dimensions, and angle of rotation). Additionally, each feature was designated as a "button" or "label". For retraining the model that reads text on labels, each label and its...
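As an illustration of reading such annotations, a minimal sketch assuming the common VGG Image Annotator (VIA) 2.x JSON export layout (per-image entries holding a regions list with shape_attributes and region_attributes); the exact keys in this dataset's JSON files may differ:

import json

def load_via_annotations(path):
    # Parse a VIA-style JSON export into (filename, shape, attrs) tuples.
    with open(path, "r") as f:
        via = json.load(f)
    annotations = []
    for entry in via.values():
        filename = entry["filename"]
        for region in entry.get("regions", []):
            shape = region["shape_attributes"]    # polygon / ellipse / rect geometry
            attrs = region["region_attributes"]   # e.g. "button" vs "label", text
            annotations.append((filename, shape, attrs))
    return annotations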
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is Part 2/2 of the ActiveHuman dataset! Part 1 can be found here.
Dataset Description
ActiveHuman was generated using Unity's Perception package.
It consists of 175,428 RGB images and their semantic segmentation counterparts, taken in different environments, lighting conditions, camera distances, and camera angles. In total, the dataset contains images for 8 environments, 33 humans, 4 lighting conditions, 7 camera distances (1 m–4 m), and 36 camera angles (0–360 degrees at 10-degree intervals).
The dataset does not include images at every single combination of available camera distances and angles, since for some values the camera would collide with another object or go outside the confines of an environment. As a result, some combinations of camera distances and angles do not exist in the dataset.
Alongside each image, 2D Bounding Box, 3D Bounding Box and Keypoint ground truth annotations are also generated via the use of Labelers and are stored as a JSON-based dataset. These Labelers are scripts that are responsible for capturing ground truth annotations for each captured image or frame. Keypoint annotations follow the COCO format defined by the COCO keypoint annotation template offered in the perception package.
Folder configuration
The dataset consists of 3 folders:
Essential Terminology
Dataset Data
The dataset includes 4 types of JSON annotation files:
Most Labelers generate different annotation specifications in the spec key-value pair:
Each Labeler generates different annotation specifications in the values key-value pair:
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset focuses on enabling Tuta absoluta detection, which necessitated annotated images. It was created as part of the H2020 PestNu project (No. 101037128) using the SpyFly AI-robotic trap from Agrorobotica. The SpyFly trap features a color camera (Svpro 13MP, sensor: Sony 1/3" IMX214) with a resolution of 3840 × 2880 for high-quality image capture. The camera was positioned 15 cm from the glue paper to capture the entire adhesive board. In total, 217 images were captured.
Expert agronomists annotated the images using Roboflow, labeling a total of 6787 T. absoluta insects, averaging 62.26 annotations per image. Images without insects were excluded, resulting in 109 annotated images, one per day.
The dataset was split into training and validation subsets with an 80–20% ratio, leading to 87 images for training and 22 for validation. The dataset is organized into two main folders: "0_captured_dataset" contains the original 217 .jpg images. "1_annotated_dataset" includes the images and the annotated data, split into separate subfolders for training and validation. The Tuta absoluta count in each subset can be seen in the following table:
Set | Images | Tuta absoluta instances |
Training | 87 | 5344 |
Validation | 22 | 1443 |
Total | 109 | 6787 |
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overview
The HA4M dataset is a collection of multi-modal data relative to actions performed by different subjects in an assembly scenario for manufacturing. It has been collected to provide a good test-bed for developing, validating and testing techniques and methodologies for the recognition of assembly actions. To the best of the authors' knowledge, few vision-based datasets exist in the context of object assembly. The HA4M dataset provides a considerable variety of multi-modal data compared to existing datasets. Six types of simultaneous data are supplied: RGB frames, Depth maps, IR frames, RGB-Depth-Aligned frames, Point Clouds and Skeleton data. These data allow the scientific community to make consistent comparisons among processing approaches or machine learning approaches by using one or more data modalities. Researchers in computer vision, pattern recognition and machine learning can use/reuse the data for different investigations in different application domains such as motion analysis, human-robot cooperation, action recognition, and so on.
Dataset details
The dataset includes 12 assembly actions performed by 41 subjects for building an Epicyclic Gear Train (EGT). The assembly task involves three phases: first, the assembly of Block 1 and Block 2 separately, and then the final setting up of both Blocks to build the EGT. The EGT is made up of a total of 12 components divided into two sets: the first eight components for building Block 1 and the remaining four components for Block 2. Finally, two screws are fixed with an Allen key to assemble the two blocks and thus obtain the EGT.
Acquisition setup
The acquisition experiment took place in two laboratories (one in Italy and one in Spain), where an acquisition area was reserved for the experimental setup. A Microsoft Azure Kinect camera acquires videos during the execution of the assembly task. It is placed in front of the operator and the table where the components are spread over. The camera is placed on a tripod at a height of 1.54 m and a distance of 1.78 m, and is down-tilted by an angle of 17 degrees.
Technical information
The HA4M dataset contains 217 videos of the assembly task performed by 41 subjects (15 females and 26 males). Their ages ranged from 23 to 60. All the subjects participated voluntarily and were provided with a written description of the experiment. Each subject was asked to execute the task several times and to perform the actions at their own convenience (e.g. with both hands), independently of their dominant hand. The HA4M project is a growing project, so new acquisitions, planned for the near future, will expand the current dataset.
Actions
Twelve actions are considered in HA4M. Actions 1 to 4 are needed to build Block 1, actions 5 to 8 build Block 2, and actions 9 to 12 complete the EGT. The actions are listed below:
1. Pick up/Place Carrier
2. Pick up/Place Gear Bearings (x3)
3. Pick up/Place Planet Gears (x3)
4. Pick up/Place Carrier Shaft
5. Pick up/Place Sun Shaft
6. Pick up/Place Sun Gear
7. Pick up/Place Sun Gear Bearing
8. Pick up/Place Ring Bear
9. Pick up Block 2 and place it on Block 1
10. Pick up/Place Cover
11. Pick up/Place Screws (x2)
12. Pick up/Place Allen Key, Turn Screws, Return Allen Key and EGT
Annotation
Data annotation concerns the labeling of the different actions in the video sequences. The annotation of the actions has been done manually by observing the RGB videos frame by frame. The start frame of each action is identified as the subject starts to move the arm toward the component to be grasped. The end frame, instead, is recorded when the subject releases the component, so the next frame becomes the start frame of the subsequent action. The total number of actions annotated in this study is 4123, including the "don't care" action (ID=0) and the action repetitions in the case of actions 2, 3 and 11.
Available code
The dataset has been acquired using the Multiple Azure Kinect GUI software, available at https://gitlab.com/roberto.marani/multiple-azure-kinect-gui, based on the Azure Kinect Sensor SDK v1.4.1 and the Azure Kinect Body Tracking SDK v1.1.2. The software records device data to a Matroska (.mkv) file containing video tracks, IMU samples, and device calibration. In this work, IMU samples are not considered. The same Multiple Azure Kinect GUI software processes the Matroska file and returns the different types of data provided with our dataset: RGB images, RGB-Depth-Aligned (RGB-A) images, Depth images, IR images, Point Cloud and Skeleton data.
The following files are available with the dataset:
rocog_s00.zip, ..., rocog_s12.zip (26.2 GB): Raw videos for the human subjects performing the gestures and annotations
rocog_human_frames.zip, ..., rocog_human_frames.z02 (18.7 GB): Frames for human data used for training and testing. Each folder also has annotations for gesture (label.bin), orientation (orientation.bin), and the number of times the gesture is repeated (repetitions.bin)
rocog_synth_frames.zip, ..., rocog_synth_frames.z09 (~85.0 GB): Frames for synthetic data used for training and testing. Each folder also has annotations for gesture (label.bin), orientation (orientation.bin), and the number of times the gesture is repeated (repetitions.bin)
The labels are saved into Python binary struct arrays. Each file contains one entry per frame in the corresponding directory. Here's Python sample code to open these files:
import glob
import os
import struct
frames_dir = 'FemaleCivilian\10_Advance_11_1_2019_1...
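The sample code above is truncated. As an illustrative continuation (not the authors' original script), the binary label files could be read along the following lines, assuming one little-endian 32-bit integer entry per frame; the actual struct layout may differ:

import glob
import os
import struct

def read_bin_labels(path, fmt="<i"):
    # Assumes one fixed-size record per frame; change `fmt` to match the
    # actual layout of label.bin, orientation.bin, and repetitions.bin.
    size = struct.calcsize(fmt)
    with open(path, "rb") as f:
        data = f.read()
    return [struct.unpack_from(fmt, data, i * size)[0]
            for i in range(len(data) // size)]

# Hypothetical usage: one entry per frame image in the directory.
# labels = read_bin_labels(os.path.join(frames_dir, "label.bin"))
# n_frames = len(glob.glob(os.path.join(frames_dir, "*.jpg")))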
Data Description
The DIPSER dataset is designed to assess student attention and emotion in in-person classroom settings, consisting of RGB camera data, smartwatch sensor data, and labeled attention and emotion metrics. It includes multiple camera angles per student to capture posture and facial expressions, complemented by smartwatch data for inertial and biometric metrics. Attention and emotion labels are derived from self-reports and expert evaluations. The dataset includes diverse demographic groups, with data collected in real-world classroom environments, facilitating the training of machine learning models for predicting attention and correlating it with emotional states.
Data Collection and Generation Procedures
The dataset was collected in a natural classroom environment at the University of Alicante, Spain. The recording setup consisted of six general cameras positioned to capture the overall classroom context and individual cameras placed at each student's desk. Additionally, smartwatches were used to collect biometric data, such as heart rate, accelerometer, and gyroscope readings.
Experimental Sessions
Nine distinct educational activities were designed to ensure a comprehensive range of engagement scenarios:
1. News Reading – students read projected or device-displayed news.
2. Brainstorming Session – idea generation for problem-solving.
3. Lecture – passive listening to an instructor-led session.
4. Information Organization – synthesizing information from different sources.
5. Lecture Test – assessment of lecture content via mobile devices.
6. Individual Presentations – students present their projects.
7. Knowledge Test – conducted using Kahoot.
8. Robotics Experimentation – hands-on session with robotics.
9. MTINY Activity Design – development of educational activities with computational thinking.
Technical Specifications
RGB Cameras: individual cameras recorded at 640×480 pixels, while context cameras captured at 1280×720 pixels.
Frame Rate: 9-10 FPS depending on the setup.
Smartwatch Sensors: collected heart rate, accelerometer, gyroscope, rotation vector, and light sensor data at a frequency of 1–100 Hz.
Data Organization and Formats
The dataset follows a structured directory format: /groupX/experimentY/subjectZ.zip
Each subject-specific folder contains:
images/ (individual facial images)
watch_sensors/ (sensor readings in JSON format)
labels/ (engagement & emotion annotations)
metadata/ (subject demographics & session details)
Annotations and Labeling
Each data entry includes engagement levels (1-5) and emotional states (9 categories) based on both self-reported labels and evaluations by four independent experts. A custom annotation tool was developed to ensure consistency across evaluations.
Missing Data and Data Quality
Synchronization: a centralized server ensured time alignment across devices. Brightness changes were used to verify synchronization.
Completeness: no major missing data, except for occasional random frame drops due to embedded device performance.
Data Consistency: uniform collection methodology across sessions, ensuring high reliability.
Data Processing Methods
To enhance usability, the dataset includes preprocessed bounding boxes for face, body, and hands, along with gaze estimation and head pose annotations. These were generated using YOLO, MediaPipe, and DeepFace.
File Formats and Accessibility
Images: stored in standard JPEG format.
Sensor Data: provided as structured JSON files.
Labels: available as CSV files with timestamps.
The dataset is publicly available under the CC-BY license and can be accessed along with the necessary processing scripts via the DIPSER GitHub repository.
Potential Errors and Limitations
Due to camera angles, some student movements may be out of frame in collaborative sessions.
Lighting conditions vary slightly across experiments.
Sensor latency variations are minimal but exist due to embedded device constraints.
Citation
If you find this project helpful for your research, please cite our work using the following bibtex entry:
@misc{marquezcarpintero2025dipserdatasetinpersonstudent1,
  title={DIPSER: A Dataset for In-Person Student Engagement Recognition in the Wild},
  author={Luis Marquez-Carpintero and Sergio Suescun-Ferrandiz and Carolina Lorenzo Álvarez and Jorge Fernandez-Herrero and Diego Viejo and Rosabel Roig-Vila and Miguel Cazorla},
  year={2025},
  eprint={2502.20209},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2502.20209},
}
Usage and Reproducibility
Researchers can utilize standard tools like OpenCV, TensorFlow, and PyTorch for analysis. The dataset supports research in machine learning, affective computing, and education analytics, offering a unique resource for engagement and attention studies in real-world classroom environments.
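As an illustration of the directory layout above, a minimal Python sketch (standard library only; the file names inside each subject archive and the exact archive path are assumptions) for reading smartwatch sensor JSON files from one subject zip:

import json
import zipfile

def load_watch_sensors(subject_zip_path):
    # Collect all parsed JSON files under watch_sensors/ in a subject zip.
    # The internal folder name follows the dataset description; adjust the
    # prefix if the archives nest files under an extra subject directory.
    readings = {}
    with zipfile.ZipFile(subject_zip_path) as zf:
        for name in zf.namelist():
            if "watch_sensors/" in name and name.endswith(".json"):
                with zf.open(name) as f:
                    readings[name] = json.load(f)
    return readings

# Hypothetical path following /groupX/experimentY/subjectZ.zip:
# sensors = load_watch_sensors("group1/experiment1/subject1.zip")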
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset presents a set of acoustic signals captured during a single-bead wall experiment in robotic Laser Directed Energy Deposition (LDED) using Maraging Steel C300. The acoustic data was recorded with a high-fidelity prepolarized microphone (Xiris WeldMIC), capturing the intricate sound profiles associated with the LDED process at a sampling rate of 44,100 Hz. The dataset was generated with a robotic LDED process that consists of a six-axis industrial robot (KUKA KR90) coupled with a two-axis positioner, a laser head, and a coaxial powder-feeding nozzle.
Folder Structure:
/sample-1: The main folder for the experiment sample.
/audio_files: Contains 4624 .wav audio files, each representing a 40 ms chunk of the LDED process sound.
/annotations_1.csv: A CSV file providing annotations for the audio files, labeling each as "Defect-free", "Defective", or "Laser-off".
audio_features.h5: Extracted acoustic features in time-domain, frequency-domain, and time-frequency representations (MFCC features). Feature extraction was conducted using the Python Essentia library.
File Naming Convention:
Audio files within the audio_files folder are named following the pattern sample_ExperimentID_SampleID.wav. Given that there is only one experiment and one sample, the naming is consistent, for example, sample_1_1.wav for the first file.
Annotation Details:
The annotations_1.csv file contains detailed labels for each audio file, correlating to the conditions observed during the experiment, aiding in quick identification and analysis.
Experimental Parameters:
The dataset reflects a controlled experiment setup with the following specifications:
Geometry: single-bead wall structure
Dimensions: 90 mm × 42.5 mm
Number of layers: 50
Laser beam diameter: 2 mm
Layer thickness: 0.85 mm
Stand-off distance: 12 mm
Laser profile: Gaussian
Laser wavelength: 1064 nm
Process Parameters:
Laser power: 2.3 kW
Speed: 25 mm/s
Dwell time: 0 s
Powder flow rate: 12 g/min
This dataset aims to facilitate the development and testing of acoustic-based defect detection models for real-time quality monitoring in LDED processes. It can also serve as a reference point for further research on sensor fusion, machine learning, and real-time monitoring of manufacturing processes.
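A minimal sketch (standard library only) for pairing audio chunks with their annotations; the CSV column names used here are assumptions, since only the label values are specified above:

import csv
import wave

# Read the annotation CSV; check the header of annotations_1.csv for the actual column names.
with open("sample-1/annotations_1.csv", newline="") as f:
    annotations = list(csv.DictReader(f))
print(annotations[0])  # a row mapping an audio file to "Defect-free", "Defective", or "Laser-off"

# Open one 40 ms audio chunk (44,100 Hz sampling rate, per the description).
with wave.open("sample-1/audio_files/sample_1_1.wav", "rb") as w:
    print(w.getframerate(), w.getnframes())  # expect 44100 Hz and roughly 1764 frames per 40 ms chunk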
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Ground-level Blueberry Orchard Dataset v1 consists of 2000 RGB images of blueberry orchard scenes captured in the village of Babe, Serbia, on three occasions in March, May, and August of 2022. Images were captured using the RGB module of a Luxonis OAK-D device, with a resolution of 1920×1080 pixels, and are stored in the lossless PNG format.
The dataset was created for the purpose of training deep learning models for blueberry bush detection, for the task of autonomous UGV guidance. It contains sequences of images captured from a UGV moving and rotating in blueberry orchard rows. Images were captured from a height of approximately 0.5 meters, with the camera angled towards the base of a blueberry plant and the surrounding bank on which it grows. The dataset was captured in real-life outdoor conditions and contains multiple sources of variability (bush shape and size, lighting conditions, shadows, saturation, etc.) and artifacts (occlusions by weeds, branches, presence of irregular objects, etc.).
There are two classes of annotated objects of interest:
Bush, corresponding to the base of the blueberry bush.
Pole, corresponding to hail netting poles and similar obstructing objects such as lamp posts or wooden legs of bumblebee hives (distinguishing poles is important to prevent equipment damage in operations such as soil sampling and pruning).
Objects of interest are annotated with bounding boxes. Labels are saved in two formats:
LabelMe JSON format (x1, y1, x2, y2; in pixels)
YOLO TXT format (x_center, y_center, width, height; as a ratio of total image size, with numerical labels 0 and 1 corresponding to Bush and Pole)
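For illustration, a minimal sketch (hypothetical helper, standard library only) that converts one YOLO-format line into pixel-coordinate corners, given the image size:

def yolo_to_pixels(line, img_w, img_h):
    # Convert a YOLO bbox line "class x_center y_center width height" (ratios)
    # into (class_id, x1, y1, x2, y2) in pixels.
    class_id, xc, yc, w, h = line.split()
    xc, yc = float(xc) * img_w, float(yc) * img_h
    w, h = float(w) * img_w, float(h) * img_h
    return int(class_id), xc - w / 2, yc - h / 2, xc + w / 2, yc + h / 2

# Example with the 1920×1080 images of this dataset (label values are made up):
# yolo_to_pixels("0 0.5 0.5 0.25 0.4", 1920, 1080) -> (0, 720.0, 324.0, 1200.0, 756.0)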
There are 61 images with no annotated objects, and there are no corresponding label files for these images.
The dataset is split into train, validation and test sets with 75%, 10%, and 15% split (1490, 200, and 310 images, respectively). As the data contains sequences of images, the split is made based on sequences rather than individual images to prevent data leakage.
Detailed description and statistics are available in:
V. Filipović, D. Stefanović, N. Pajević, Ž. Grbović, N. Đurić and M. Panić, "Bush Detection for Vision-based UGV Guidance in Blueberry Orchards: Data Set and Methods," Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Vancouver, Canada, 2023. (Accepted)
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Example annotations: https://www.robots.ox.ac.uk/%7Evgg/data/pets/pet_annotations.jpg
The Oxford Pets dataset (also known as the "dogs vs cats" dataset) is a collection of images and annotations labeling various breeds of dogs and cats. There are approximately 100 examples of each of the 37 breeds. This dataset contains the object detection portion of the original dataset with bounding boxes around the animals' heads.
This dataset was collected by the Visual Geometry Group (VGG) at the University of Oxford.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The CHIRLA dataset (Comprehensive High-resolution Identification and Re-identification for Large-scale Analysis) is designed for long-term person re-identification (Re-ID) in real-world scenarios. The dataset consists of multi-camera video recordings captured over seven months in an indoor office environment. It aims to facilitate the development and evaluation of Re-ID algorithms capable of handling significant variations in individuals' appearances, including changes in clothing and physical characteristics. The dataset includes 22 individuals with 963,554 bounding box annotations across 596,345 frames.
Data Generation Procedures
The dataset was recorded at the Robotics, Vision, and Intelligent Systems Research Group headquarters at the University of Alicante, Spain. Seven strategically placed Reolink RLC-410W cameras were used to capture videos in a typical office setting, covering areas such as laboratories, hallways, and shared workspaces. Each camera features a 1/2.7" CMOS image sensor with a 5.0-megapixel resolution and an 80° horizontal field of view. The cameras were connected via Ethernet and WiFi to ensure stable streaming and synchronization. A ROS-based interconnection framework was used to synchronize and retrieve images from all cameras. The dataset includes video recordings at a resolution of 1080×720 pixels, with a consistent frame rate of 30 fps, stored in AVI format with DivX MPEG-4 encoding.
Data Processing Methods and Steps
Data processing involved a semi-automatic labeling procedure:
Detection: YOLOv8x was used to detect individuals in video frames and extract bounding boxes.
Tracking: the Deep SORT algorithm was employed to generate tracklets and assign unique IDs to detected individuals.
Manual Verification: a custom graphical user interface (GUI) was developed to facilitate manual verification and correction of the automatically generated labels.
Bounding boxes and IDs were assigned consistently across different cameras and sequences to maintain identity coherence.
Data Structure and Format
The dataset comprises:
Video Files: 70 videos, each corresponding to a specific camera view in a sequence, stored in AVI format.
Annotation Files: JSON files containing frame-wise annotations, including bounding box coordinates and identity labels.
The dataset is structured as follows:
videos/seq_XXX/camera_Y.avi: video files for each camera view.
annotations/seq_XXX/camera_Y.json: annotation files providing labeled bounding boxes and IDs.
Use Cases and Reusability
The CHIRLA dataset is suitable for:
Long-term person re-identification
Multi-camera tracking and re-identification
Single-camera tracking and re-identification
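A minimal sketch for iterating over one sequence's video together with its annotations; OpenCV and the directory layout above are assumed, and the annotation lookup key is hypothetical since the JSON schema is not detailed in this description:

import json
import cv2  # OpenCV, assumed available

seq, cam = "seq_001", "camera_1"   # hypothetical sequence/camera names

# Load the frame-wise annotations for this camera view.
with open(f"annotations/{seq}/{cam}.json") as f:
    annotations = json.load(f)

cap = cv2.VideoCapture(f"videos/{seq}/{cam}.avi")
frame_idx = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    # Look up this frame's bounding boxes and identity labels;
    # adapt the key and field names to the actual annotation schema.
    boxes = annotations.get(str(frame_idx), [])
    frame_idx += 1
cap.release()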
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The Oxford-IIIT Pet Dataset
Description
A 37-category pet dataset with roughly 200 images for each class. The images have large variations in scale, pose, and lighting. This instance of the dataset uses standard label ordering and includes the standard train/test splits. Trimaps and bounding boxes are not included, but there is an image_id field that can be used to reference those annotations from the official metadata. Website: https://www.robots.ox.ac.uk/~vgg/data/pets/… See the full description on the dataset page: https://huggingface.co/datasets/timm/oxford-iiit-pet.
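For example, the dataset can presumably be loaded with the Hugging Face datasets library (a sketch; split names follow the standard train/test splits mentioned above):

from datasets import load_dataset

# Loads the Oxford-IIIT Pet images with the standard splits.
ds = load_dataset("timm/oxford-iiit-pet")
print(ds)                 # DatasetDict with the available splits
example = ds["train"][0]  # fields include the image, the class label, and image_id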
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains annotated underwater images of pipeline components, designed for robotics applications such as subsea inspection, maintenance, and navigation. The dataset was obtained from Roboflow Universe - Yellow Pipes v4.
The dataset includes the following object classes, each represented with pixel-accurate segmentation masks:
tpipe: T-junctions in pipelines (where three pipes connect in a "T" shape).
lpipe: Pipe elbows or bends (usually at 90° or 45° angles).
coupler: Pipe couplers or connectors joining two straight pipe segments.
pipe: Straight pipe sections without visible joints or bends.
YOLO Segmentation:
Each image has an associated .txt
file with segmentation label data in YOLO format.
Each line represents one object instance.
The first value is the class ID (0
= tpipe, 1
= lpipe, 2
= coupler, 3
= pipe).
The remaining values are normalized segmentation points describing the object’s outline as polygons.
Images:
Supplied in standard image formats (e.g., .jpg, .png).
Example segmentation label line (YOLO format):
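The values below are illustrative only (class 0 = tpipe, followed by normalized x y polygon points); actual label files will contain the real polygon coordinates:
0 0.512 0.334 0.540 0.352 0.561 0.398 0.533 0.421 0.505 0.389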