The dataset contains 400 human action classes, with at least 400 video clips for each action. Each clip lasts around 10 seconds and is taken from a different YouTube video. The actions are human-focused and cover a broad range of classes, including human-object interactions such as playing instruments, as well as human-human interactions such as shaking hands.
The Kinetics dataset is licensed by Google Inc. under a Creative Commons Attribution 4.0 International License. Published May 22, 2017.
Kinetics-700 is a video dataset of 650,000 clips that covers 700 human action classes. The videos include human-object interactions such as playing instruments, as well as human-human interactions such as shaking hands and hugging. Each action class has at least 700 video clips. Each clip is annotated with an action class and lasts approximately 10 seconds.
The Kinetics dataset is a large-scale human action dataset, which consists of 400 action classes where each category has more than 400 videos.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Kinetics dataset is a large-scale, high-quality dataset for human action recognition in videos. The dataset consists of around 500,000 video clips covering 600 human action classes with at least 600 video clips for each action class. Each video clip lasts around 10 seconds and is labeled with a single action class. The videos are collected from YouTube.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Description
Original source: https://www.deepmind.com/open-source/kinetics
innat/KineticsTop5
A small set from Kinetics-400. It contains 5 classes. {0: 'opening_bottle', 1: 'squat', 2: 'reading_book', 3: 'sneezing', 4: 'reading_newspaper'}
kinetics_top5.zip: No internal data drop. kinetics_top5_tiny.zip: Internal data drop.
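The integer-to-label mapping above can be used directly to decode model predictions. A minimal sketch, using the exact five-class dictionary given for this subset (the per-class scores and the `decode_prediction` helper are illustrative, not part of the dataset):

```python
# id2label mapping for the KineticsTop5 subset, as given above.
id2label = {
    0: "opening_bottle",
    1: "squat",
    2: "reading_book",
    3: "sneezing",
    4: "reading_newspaper",
}

def decode_prediction(scores):
    """Map a list of per-class scores to the most likely label name."""
    pred_id = max(range(len(scores)), key=lambda i: scores[i])
    return id2label[pred_id]

# Example: the score at index 2 is highest, so class id 2 wins.
print(decode_prediction([0.1, 0.2, 0.9, 0.05, 0.3]))  # reading_book
```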
This repository contains the mapping from integer id's to actual label names (in HuggingFace Transformers typically called id2label) for several datasets. Current datasets include:
ImageNet-1k, ImageNet-22k (also called ImageNet-21k, as there are 21,843 classes), COCO detection 2017, COCO panoptic 2017, ADE20k (actually the MIT Scene Parsing benchmark, which is a subset of ADE20k), Cityscapes, VQAv2, Kinetics-700, RVL-CDIP, PASCAL VOC, Kinetics-400, ...
You can read in a label file as follows (using… See the full description on the dataset page: https://huggingface.co/datasets/huggingface/label-files.
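The description above is truncated, but the label files in this repository are JSON objects whose keys are string-encoded integer ids. A hedged sketch of the usual read-in pattern, where the inline JSON stands in for a downloaded file (in practice one fetched with `huggingface_hub.hf_hub_download` from the repo), and the sample class names are illustrative:

```python
import json

# Stand-in for the contents of a downloaded label file; real files in
# huggingface/label-files follow this shape with far more entries.
raw = '{"0": "abseiling", "1": "air drumming", "2": "answering questions"}'

# JSON object keys are always strings, so convert them back to ints to
# get the id2label dict expected by HuggingFace Transformers configs.
id2label = {int(k): v for k, v in json.loads(raw).items()}
label2id = {v: k for k, v in id2label.items()}

print(id2label[1])            # air drumming
print(label2id["abseiling"])  # 0
```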
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
I3D Video Features, Labels and Splits for Multicamera Overlapping Datasets Pets-2009, HQFS and Up-Fall
The Inflated 3D (I3D) video features, ground truths, and train/test splits for the multicamera datasets Pets-2009, HQFS, and Up-Fall are available here. We relabeled two datasets (HQFS and Pets-2009) for the task of video anomaly detection under multiple-instance learning (VAD-MIL) with multiple cameras. Three dispositions of the I3D features are available: I3D-RGB, I3D-OF, and the linear concatenation of the two. These datasets can be used as benchmarks for video anomaly detection under multiple-instance learning with multiple overlapping cameras.
Preprocessed Datasets
PETS-2009 is a benchmark dataset (https://cs.binghamton.edu/~mrldata/pets2009) aggregating different scene sets with multiple overlapping camera views and distinct events involving crowds. We labeled the scenes at the frame level as anomalous or normal. Scenes with background only, people walking individually or in a crowd, and the regular passing of cars are considered normal patterns. Frames with people running (individually or in a crowd), people crowding in the middle of the traffic intersection, or people moving in the counterflow are considered anomalous patterns. Videos containing anomalous frames are labeled as anomalous, while videos without any anomalies are marked as normal. The High-Quality Fall Simulation Data (HQFS) dataset (https://iiw.kuleuven.be/onderzoek/advise/datasets/fall-and-adl-meta-data) is an indoor scenario with five overlapping cameras and occurrences of fall incidents. We consider a person falling to the floor an uncommon event. We also relabeled the frame annotations to cover the intervals where the person remains lying on the ground after the fall. The multi-class Up-Fall detection dataset (https://sites.google.com/up.edu.mx/har-up/) contains two overlapping camera views and infrared sensors in a laboratory scenario.
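The video-level labels described above follow mechanically from the frame-level ones: a video is anomalous if any of its frames is. A small sketch of that rule (the 0/1 frame annotations are illustrative stand-ins for the ground-truth files):

```python
def video_label(frame_labels):
    """Derive a video-level label from per-frame annotations.

    frame_labels: sequence of 0 (normal) / 1 (anomalous) frame marks.
    Returns 1 if any frame is anomalous, else 0, matching the rule that
    a video containing anomalous frames is labeled anomalous.
    """
    return 1 if any(frame_labels) else 0

print(video_label([0, 0, 1, 1, 0]))  # 1 (contains anomalous frames)
print(video_label([0, 0, 0]))        # 0 (all frames normal)
```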
Video Feature Extraction
We use Inflated 3D (I3D) features to represent video clips of 16 frames, extracted with the Video Features library (https://github.com/v-iashin/video_features), which uses a model pre-trained on the Kinetics-400 dataset. Both the window size (the number of frames from which each clip feature is computed) and the step between consecutive windows were set to 16 frames. After extraction, each video from each camera corresponds to a matrix of dimension n x 1024, where n is the variable number of clips and 1024 is the number of attributes (I3D features from either RGB appearance or optical-flow information). Note that the videos (bags) are divided into clips with a fixed number of frames, so each video bag contains a variable number of clips. A clip can be completely normal, completely anomalous, or a mix of normal and anomalous frames. Three deep feature dispositions are considered: I3D features from RGB only (1024 features), from optical flow only (1024 features), and the combination of both by simple linear concatenation. We also make 10-crop features available (https://pytorch.org/vision/main/generated/torchvision.transforms.TenCrop.html), yielding 10 crops for each video clip.
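Concretely, with a window size and step of 16 frames, a video of F frames yields roughly F // 16 non-overlapping clips, each camera's video becomes an n x 1024 matrix per modality, and the concatenated disposition is n x 2048. A sketch with random stand-in features (the actual matrices come from the zip files described below):

```python
import numpy as np

FRAMES_PER_CLIP = 16  # window size and step used for extraction
FEATURE_DIM = 1024    # I3D feature size per modality

num_frames = 160
n_clips = num_frames // FRAMES_PER_CLIP  # non-overlapping windows -> 10 clips

# Stand-ins for the extracted I3D features of one camera's video.
rgb_feats = np.random.rand(n_clips, FEATURE_DIM)   # appearance stream
flow_feats = np.random.rand(n_clips, FEATURE_DIM)  # optical-flow stream

# Third disposition: simple linear concatenation of both streams.
combined = np.concatenate([rgb_feats, flow_feats], axis=1)

print(rgb_feats.shape)  # (10, 1024)
print(combined.shape)   # (10, 2048)
```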
File Description
center-crop.zip: Folder with I3D features of Pets-2009, HQFS and Up-Fall datasets;
10-crop.zip: Folder with I3D features (10-crop) of Pets-2009, HQFS and Up-Fall datasets;
gts.zip: Folder with ground truths at frame-level and video-level of Pets-2009, HQFS and Up-Fall datasets;
splits.zip: Folder with lists of training and test splits of Pets-2009, HQFS and Up-Fall datasets.
A portion of the preprocessed I3D feature sets was leveraged in the studies outlined in these publications:
Pereira, S. S., & Maia, J. E. B. (2024). MC-MIL: video surveillance anomaly detection with multi-instance learning and multiple overlapped cameras. Neural Computing and Applications, 36(18), 10527-10543. Available at https://link.springer.com/article/10.1007/s00521-024-09611-3.
Pereira, S. S. L., Maia, J. E. B., & Proença, H. (2024, September). Video Anomaly Detection in Overlapping Data: The More Cameras, the Better?. In 2024 IEEE International Joint Conference on Biometrics (IJCB) (pp. 1-10). IEEE. Available at https://ieeexplore.ieee.org/document/10744502.