Attribution-NonCommercial 3.0 (CC BY-NC 3.0): https://creativecommons.org/licenses/by-nc/3.0/
License information was derived automatically
About
We provide a comprehensive talking-head video dataset with over 50,000 videos, totaling more than 500 hours of footage and featuring 23,841 unique identities from around the world.
Distribution
Detailing the format, size, and structure of the dataset:
Data Volume:
- Total Size: 2.5 TB
- Total Videos: 47,200
- Identities Covered: 23,000
- Resolution: 60% 4K, 33% Full HD (1080p)
- Formats: MP4
- Full-length videos with visible mouth movements in every frame.
- Minimum face size of 400 pixels.
- Video durations range from 20 seconds to 5 minutes.
- Faces have not been cropped out; videos are full-frame and include backgrounds.
Usage
This dataset is ideal for a variety of applications:
Face Recognition & Verification: Training and benchmarking facial recognition models.
Action Recognition: Identifying human activities and behaviors.
Re-Identification (Re-ID): Tracking identities across different videos and environments.
Deepfake Detection: Developing methods to detect manipulated videos.
Generative AI: Training high-resolution video generation models.
Lip Syncing Applications: Enhancing AI-driven lip-syncing models for dubbing and virtual avatars.
Background AI Applications: Developing AI models for automated background replacement, segmentation, and enhancement.
Coverage
Explaining the scope and coverage of the dataset:
Geographic Coverage: Worldwide
Time Range: The time range and size of each video are noted in the CSV file.
Demographics: Includes information about age, gender, ethnicity, format, resolution, and file size.
Languages Covered (Videos):
English: 23,038 videos
Portuguese: 1,346 videos
Spanish: 677 videos
Norwegian: 1,266 videos
Swedish: 1,056 videos
Korean: 848 videos
Polish: 1,807 videos
Indonesian: 1,163 videos
French: 1,102 videos
German: 1,276 videos
Japanese: 1,433 videos
Dutch: 1,666 videos
Indian: 1,163 videos
Czech: 590 videos
Chinese: 685 videos
Italian: 975 videos
Who Can Use It
List examples of intended users and their use cases:
Data Scientists: Training machine learning models for video-based AI applications.
Researchers: Studying human behavior, facial analysis, or video AI advancements.
Businesses: Developing facial recognition systems, video analytics, or AI-driven media applications.
Additional Notes
Ensure ethical usage and compliance with privacy regulations. The dataset's quality and scale make it valuable for high-performance AI training. Preprocessing (cropping, downsampling) may be needed for some use cases. The dataset is not yet complete and expands daily; please contact us for the most up-to-date CSV file. The dataset has been divided into 100GB zipped files and is hosted on a private server (with the option to upload to the cloud if needed). To verify the dataset's quality, please contact me for the full CSV file; I'd be happy to provide example videos selected by the potential buyer.
Open Data Commons Attribution License (ODC-By) v1.0: https://www.opendatacommons.org/licenses/by/1.0/
License information was derived automatically
Studying how YouTube videos become viral, or more generally how they evolve in terms of views, likes, and subscriptions, is a topic of interest in many disciplines. With this dataset you can study such phenomena, with statistics about 1 million YouTube videos. The information was collected in 2013, when YouTube still exposed these data publicly; the functionality has since been removed, and such statistics are now available only to the owner of a video. This makes the dataset unique.
This dataset has been generated with YOUStatAnalyzer, a tool developed by myself (Mattia Zeni) while I was working for CREATE-NET (www.create-net.org) within the framework of the CONGAS FP7 project (http://www.congas-project.eu). For the project we needed to collect and analyse the dynamics of YouTube video popularity. The dataset contains statistics of more than 1 million YouTube videos, chosen according to random keywords extracted from the WordNet library (http://wordnet.princeton.edu).
The motivation that led us to develop the YOUStatAnalyzer data collection tool and create this dataset is that there is an active research community working on the interplay among individual user preferences, social dynamics, and advertising mechanisms, and a common problem is the lack of open large-scale datasets. At the same time, no suitable tool existed. Today YouTube no longer displays these data on each video's page, making this dataset unique.
When using our dataset for research purposes, please cite it as:
@INPROCEEDINGS{YOUStatAnalyzer,
author={Mattia Zeni and Daniele Miorandi and Francesco {De Pellegrini}},
title = {{YOUStatAnalyzer}: a Tool for Analysing the Dynamics of {YouTube} Content Popularity},
booktitle = {Proc.\ 7th International Conference on Performance Evaluation Methodologies and Tools
(Valuetools, Torino, Italy, December 2013)},
address = {Torino, Italy},
year = {2013}
}
The dataset contains statistics and metadata of 1 million YouTube videos, collected in 2013. The videos have been chosen according to random keywords extracted from the WordNet library (http://wordnet.princeton.edu).
The structure of a dataset entry is the following:
{
u'_id': u'9eToPjUnwmU',
u'title': u'Traitor Compilation # 1 (Trouble ...',
u'description': u'A traitor compilation by one are ...',
u'category': u'Games',
u'commentsNumber': u'6',
u'publishedDate': u'2012-10-09T23:42:12.000Z',
u'author': u'ServilityGaming',
u'duration': u'208',
u'type': u'video/3gpp',
u'relatedVideos': [u'acjHy7oPmls', u'EhW2LbCjm7c', u'UUKigFAQLMA', ...],
u'accessControl': {
u'comment': {u'permission': u'allowed'},
u'list': {u'permission': u'allowed'},
u'videoRespond': {u'permission': u'moderated'},
u'rate': {u'permission': u'allowed'},
u'syndicate': {u'permission': u'allowed'},
u'embed': {u'permission': u'allowed'},
u'commentVote': {u'permission': u'allowed'},
u'autoPlay': {u'permission': u'allowed'}
},
u'views': {
u'cumulative': {
u'data': [15.0, 25.0, 26.0, 26.0, ...]
},
u'daily': {
u'data': [15.0, 10.0, 1.0, 0.0, ..]
}
},
u'shares': {
u'cumulative': {
u'data': [0.0, 0.0, 0.0, 0.0, ...]
},
u'daily': {
u'data': [0.0, 0.0, 0.0, 0.0, ...]
}
},
u'watchtime': {
u'cumulative': {
u'data': [22.5666666667, 36.5166666667, 36.7, 36.7, ...]
},
u'daily': {
u'data': [22.5666666667, 13.95, 0.166666666667, 0.0, ...]
}
},
u'subscribers': {
u'cumulative': {
u'data': [0.0, 0.0, 0.0, 0.0, ...]
},
u'daily': {
u'data': [-1.0, 0.0, 0.0, 0.0, ...]
}
},
u'day': {
u'data': [1349740800000.0, 1349827200000.0, 1349913600000.0, 1350000000000.0, ...]
}
}
From the structure above it is possible to see which fields an entry in the dataset has. They can be divided into two sections:
1) Video Information.
_id -> The video ID, which is also the unique identifier of an entry in the database.
title -> The video's title.
description -> The video's description.
category -> The YouTube category the video is inserted in.
commentsNumber -> The number of comments posted by users.
publishedDate -> The date the video was published.
author -> The author of the video.
duration -> The video duration in seconds.
type -> The encoding type of the video.
relatedVideos -> A list of related videos.
accessControl -> A list of access policies for different aspects related to the video.
2) Video Statistics.
Each video can have 4 different statistics variables: views, shares, subscribers and watchtime. Recent videos have all of them, while older videos may have only the 'views' variable. Each variable has 2 dimensions, daily and cumulative.
views -> number of views collected by the video.
shares -> number of sharing operations performed by users.
watchtime -> the time spent by users watching the video, in minutes.
subscribers -> number of subscriptions to the channel the video belongs to that were driven by the selected video.
day -> a list of days indicating the analysed period for the statistic.
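To align the statistics with calendar dates, the epoch-millisecond values in 'day' can be paired with the corresponding daily series. Below is a minimal Python sketch, assuming an entry with the structure shown above (field names as in the sample document); it is illustrative and not part of the original toolchain.
from datetime import datetime, timezone

def daily_views_by_date(entry):
    # Pair each epoch-millisecond timestamp in 'day' with the corresponding
    # daily view count; assumes the two arrays are aligned, as in the sample.
    days = entry["day"]["data"]
    views = entry["views"]["daily"]["data"]
    return [
        (datetime.fromtimestamp(ms / 1000.0, tz=timezone.utc).date(), int(v))
        for ms, v in zip(days, views)
    ]

# With the first values of the sample entry above this yields pairs such as
# (datetime.date(2012, 10, 9), 15), (datetime.date(2012, 10, 10), 10), ...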
If you are using MongoDB as your database system, you can import the dataset using the following command:
mongoimport --db [MONGODB_NAME] --collection [MONGODB_COLLECTION] --file dataset.json
Once you have imported the dataset into your database, you can access the data by performing queries. Here is some example Python code.
The following code performs a query without any search parameters, returning all entries in the database, each one stored in the variable entry:
from pymongo import MongoClient

client = MongoClient('localhost', 27017)
db = client[MONGODB_NAME]
collection = db[MONGODB_COLLECTION]
for entry in collection.find():
    print(entry["day"]["data"])
If you want to restrict the results to entries that match a specific query, you can use:
from pymongo import MongoClient

client = MongoClient('localhost', 27017)
db = client[MONGODB_NAME]
collection = db[MONGODB_COLLECTION]
# Both conditions go into a single query document (implicit logical AND)
for entry in collection.find({"watchtime": {"$exists": True}, "category": "Music"}):
    print(entry["day"]["data"])
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Monitoring animals in their natural habitat is essential for the advancement of animal behavioural studies, especially in pollination studies. We present a novel hybrid detection and tracking algorithm, "HyDaT", to monitor unmarked insects outdoors. Our software can detect an insect, identify when a tracked insect becomes occluded from view and when it re-emerges, determine when an insect exits the camera field of view, and assemble a series of insect locations into a coherent trajectory. The insect detection component of the software combines background subtraction and deep learning-based detection to locate the insect accurately and efficiently. This dataset includes videos of honeybees foraging in two ground covers, Scaevola and Lamb's ear, comprising complex background detail, wind-blown foliage, and honeybees moving into and out of occlusion beneath leaves and among three-dimensional plant structures. Honeybee tracks and associated outputs of experiments extracted using the HyDaT algorithm are included in the dataset. The dataset also contains annotated images and pre-trained YOLOv2 object detection models of honeybees.
https://creativecommons.org/publicdomain/zero/1.0/
YouTube maintains a list of the top trending videos on the platform. According to Variety magazine, “To determine the year’s top-trending videos, YouTube uses a combination of factors including measuring users interactions (number of views, shares, comments and likes). Note that they’re not the most-viewed videos overall for the calendar year”.
Note that this dataset is a structurally improved version of this dataset.
This dataset includes several months (and counting) of data on daily trending YouTube videos. Data is included for the IN, US, GB, DE, CA, FR, RU, BR, MX, KR, and JP regions (India, USA, Great Britain, Germany, Canada, France, Russia, Brazil, Mexico, South Korea, and Japan, respectively), with up to 200 listed trending videos per day.
Each region’s data is in a separate file. Data includes the video title, channel title, publish time, tags, views, likes and dislikes, description, and comment count.
The data also includes a category_id field, which varies between regions. To retrieve the categories for a specific video, find it in the associated JSON. One such file is included for each of the 11 regions in the dataset.
For more information on specific columns in the dataset refer to the column metadata.
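As a concrete illustration of that category join, here is a minimal Python sketch; the per-region file names (e.g. USvideos.csv, US_category_id.json) and the 'items'/'snippet' layout of the category JSON are assumptions about the download and should be adjusted to the actual files.
import json
import pandas as pd

def load_region(region):
    # Load one region's trending CSV and map category_id to a readable name.
    # File names and JSON layout are assumptions; adjust to the actual files.
    videos = pd.read_csv(f"{region}videos.csv")
    with open(f"{region}_category_id.json", encoding="utf-8") as f:
        items = json.load(f)["items"]
    id_to_name = {int(item["id"]): item["snippet"]["title"] for item in items}
    videos["category_name"] = videos["category_id"].map(id_to_name)
    return videos

us = load_region("US")
print(us[["title", "category_name", "views"]].head())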
This dataset was collected using the YouTube API. This dataset is the updated version of Trending YouTube Video Statistics.
Possible uses for this dataset could include:
- Sentiment analysis in a variety of forms
- Categorizing YouTube videos based on their comments and statistics
- Training ML algorithms like RNNs to generate their own YouTube comments
- Analyzing what factors affect how popular a YouTube video will be
- Statistical analysis over time
For further inspiration, see the kernels on this dataset!
Every recording captures a single subject performing slow head sweeps (left ↔ right, up ↕ down) while counting "one … ten" in English, yielding synchronized face, lip, and voice data.
Participants: 2403 (≈1.28 clips per person)
Capture Protocol
The dataset was assembled through a GDPR-compliant crowdsourcing task focused on secure-transaction AI. Contributors followed a strict brief:
Environment – indoor, even lighting, plain or uncluttered background, no back-lighting or shadows.
Appearance – full face visible; no glasses, hats, masks, filters, or overlays.
Action – look straight at the camera, then slowly rotate head left, right, up, down while maintaining gaze; finish by speaking the ten-count.
Duration – ~20 s continuous take, 30 fps or higher.
Framing – single person, shoulders-up composition; no other people, pets, or distractions.
All submissions passed automated and manual QC for framing, focus, lighting, and audio intelligibility.
Demographic Breakdown
Gender: Male - 84.2%, Female - 15.8%
Ethnicity: African - 69.3%, South Asian - 10.3%, South-East Asian - 9.7%, European - 4.0%, Middle East - 3.4%, Arab - 1.9%, Latino - 1.2%, East Asian - 0.2%
Age distribution: <18 - 5.42%, 18-25 - 48.88%, 25-30 - 20.92%, 30-40 - 18.17%, 40-50 - 5.35%, 60+ - 1.26%
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
The dataset consists of 10,000+ files featuring 7,000+ people, providing a comprehensive resource for research in deepfake detection and deepfake technology. It includes real videos of individuals with AI-generated faces overlaid, specifically designed to enhance liveness detection systems.
By utilizing this dataset, researchers can advance their understanding of deepfake generation and improve the performance of detection methods.
The dataset was created by generating fake faces and overlaying them onto authentic video clips sourced from platforms such as aisaver.io, faceswapvideo.ai, and magichour.ai. The videos feature different individuals, backgrounds, and scenarios, making the dataset suitable for various research applications.
Researchers can leverage this dataset to enhance their understanding of deepfake detection and contribute to the development of more robust detection methods that can effectively combat the challenges posed by deepfake technology.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Comparisons between different video camera outputs as well as the background (*: denotes statistical significance below the p = 0.05 level).
According to a study on podcast listening in the United States in 2024, ** percent of weekly podcast listeners stated to have consumed video podcasts. While ** percent of the respondents indicated to watch video podcasts actively, ** percent also stated to let the video content play in the background while they were listening to the audio.
Action recognition in video is known to be more challenging than image recognition. Unlike image recognition models, which use 2D convolutional blocks, action classification models require an additional dimension to capture the spatio-temporal information in video sequences. This intrinsically makes video action recognition models computationally intensive and significantly more data-hungry than their image recognition counterparts. Unequivocally, existing video datasets such as Kinetics, AVA, Charades, Something-Something, HMDB51, and UCF101 have had a tremendous impact on the recently evolving video recognition technologies. Artificial intelligence models trained on these datasets have largely benefited applications such as behavior monitoring in elderly people, video summarization, and content-based retrieval. However, this growing concept of action recognition has yet to be explored in Intelligent Transportation Systems (ITS), particularly in vital applications such as incident detection. This is partly due to the lack of annotated datasets adequate for training models suitable for such direct ITS use cases. In this paper, the concept of video action recognition is explored to tackle the problem of highway incident detection and classification from live surveillance footage. First, a novel dataset, HWID12 (Highway Incidents Detection), is introduced. HWID12 consists of 11 distinct highway incident categories and one additional category for negative samples representing normal traffic. The proposed dataset includes 2,780+ video segments of 3 to 8 seconds each on average, and 500k+ temporal frames. Next, the baseline for highway incident detection and classification is established with a state-of-the-art action recognition model trained on the proposed HWID12 dataset. Performance benchmarking for 12-class (normal traffic vs. 11 incident categories) and 2-class (incident vs. normal traffic) settings is performed. This benchmarking reveals a recognition accuracy of up to 88% and 98% for the 12-class and 2-class settings, respectively.
The proposed Highway Incidents Detection dataset (HWID12) is the first of its kind aimed at fostering experimentation with video action recognition technologies to solve the practical problem of real-time highway incident detection, which currently challenges intelligent transportation systems. The lack of such a dataset has limited the adoption of recent breakthroughs in video action classification for practical use cases in intelligent transportation systems. The proposed dataset contains more than 2,780 video clips of length varying between 3 and 8 seconds. These video clips capture the moments leading up to, and right after, an incident. The clips were manually segmented from accident compilation videos sourced from YouTube and other video data platforms.
There is one main zip file available for download. The zip file contains 2780+ video clips.
1) 12 folders.
2) Each folder represents an incident category. One of the classes represents the negative sample class, which simulates normal traffic (see the indexing sketch below).
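To turn that folder layout into training labels, one might index the unzipped archive along the following lines (Python); the .mp4 extension and the root folder name are assumptions about the download.
from pathlib import Path

def index_hwid12(root):
    # Walk the 12 class folders (11 incident categories + normal traffic)
    # and return (clip_path, class_name) pairs, using folder names as labels.
    pairs = []
    for class_dir in sorted(Path(root).iterdir()):
        if class_dir.is_dir():
            for clip in class_dir.glob("*.mp4"):  # extension is an assumption
                pairs.append((clip, class_dir.name))
    return pairs

clips = index_hwid12("HWID12")
print(len(clips), "clips across", len({label for _, label in clips}), "classes")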
Any publication using this database must reference the following journal manuscript:
Note: if the link is broken, please use http instead of https.
In Chrome, use the steps recommended in the following website to view the webpage if it appears to be broken https://www.technipages.com/chrome-enabledisable-not-secure-warning
Other relevant datasets:
VCoR dataset: https://www.kaggle.com/landrykezebou/vcor-vehicle-color-recognition-dataset
VRiV dataset: https://www.kaggle.com/landrykezebou/vriv-vehicle-recognition-in-videos-dataset
For any enquiries regarding the HWID12 dataset, contact: landrykezebou@gmail.com
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Video data provides a rich source of information that is available to us today in large quantities, e.g. from online resources. Tasks like segmentation benefit greatly from the analysis of spatio-temporal motion patterns in videos, and recent advances in video segmentation have shown great progress in exploiting these additional cues. However, observing a single video is often not enough to predict meaningful segmentations, and inference across videos becomes necessary in order to predict segmentations that are consistent with object classes. Therefore the task of video co-segmentation is being proposed, which aims at inferring segmentation from multiple videos. But current approaches are limited to considering only binary foreground/background segmentation and multiple videos of the same object. This is a clear mismatch to the challenges that we are facing with videos from online resources or consumer videos. We propose to study multi-class video co-segmentation, where the number of object classes is unknown, as well as the number of instances in each frame and video. We achieve this by formulating a non-parametric Bayesian model across video sequences that is based on a new video segmentation prior as well as a global appearance model that links segments of the same class. We present the first multi-class video co-segmentation evaluation. We show that our method is applicable to real video data from online resources and outperforms state-of-the-art video segmentation and image co-segmentation baselines.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data used for our paper "WormSwin: Instance Segmentation of C. elegans using Vision Transformer".
This publication is divided into three parts:
The CSB-1 Dataset consists of frames extracted from videos of Caenorhabditis elegans (C. elegans) annotated with binary masks. Each C. elegans is separately annotated, providing accurate annotations even for overlapping instances. All annotations are provided in binary mask format and as COCO Annotation JSON files (see COCO website).
The videos are named after the following pattern:
<"worm age in hours"_"mutation"_"irradiated (binary)"_"video index (zero based)">
For mutation the following values are possible:
An example video name would be 24_1_1_2, meaning the video shows C. elegans that are 24 h old and carry the csb-1 mutation, which were irradiated; the final number is the video index.
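For convenience, the naming pattern can be parsed programmatically. A minimal Python sketch (the mapping from the mutation code to a strain name is not reproduced here, since the list of possible values is given separately):
from typing import NamedTuple

class Csb1VideoName(NamedTuple):
    age_hours: int
    mutation_code: int   # see the list of possible mutation values above
    irradiated: bool
    video_index: int     # zero-based

def parse_video_name(name):
    # Parse "<age in hours>_<mutation>_<irradiated (0/1)>_<video index>".
    age, mutation, irradiated, index = name.split("_")
    return Csb1VideoName(int(age), int(mutation), irradiated == "1", int(index))

print(parse_video_name("24_1_1_2"))
# Csb1VideoName(age_hours=24, mutation_code=1, irradiated=True, video_index=2)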
Video data was provided by M. Rieckher; Instance Segmentation Annotations were created under supervision of K. Bozek and M. Deserno.
The Synthetic Images Dataset was created by cutting out C. elegans (foreground objects) from the CSB-1 Dataset and placing them randomly on background images also taken from the CSB-1 Dataset. Foreground objects were flipped, rotated and slightly blurred before being placed on the background images.
The same was done with the binary mask annotations taken from CSB-1 Dataset so that they match the foreground objects in the synthetic images. Additionally, we added rings of random color, size, thickness and position to the background images to simulate petri-dish edges.
This synthetic dataset was generated by M. Deserno.
The Mating Dataset (MD) consists of 450 grayscale image patches of 1,012 x 1,012 px showing C. elegans with high overlap, crawling on a petri-dish.
We took the patches from a 10 min. long video of size 3,036 x 3,036 px. The video was downsampled from 25 fps to 5 fps before selecting 50 random frames for annotating and patching.
Like the other datasets, worms were annotated with binary masks and annotations are provided as COCO Annotation JSON files.
The video data was provided by X.-L. Chu; Instance Segmentation Annotations were created under supervision of K. Bozek and M. Deserno.
Further details about the datasets can be found in our paper.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Small-scale Deepfake Forgery Video Dataset (SDFVD) is a custom dataset of real and deepfake videos with diverse contexts, designed to study and benchmark deepfake detection algorithms. The dataset comprises a total of 106 videos: 53 original and 53 deepfake. The equal number of real and deepfake videos ensures balance for machine learning model training and evaluation. The original videos were collected from Pexels, a well-known provider of stock photography and stock footage (video). These videos include a variety of backgrounds, and the subjects represent different genders and ages, reflecting a diverse range of scenarios. The input videos have been pre-processed by cropping them to a length of approximately 4 to 5 seconds and resizing them to 720p resolution, ensuring a consistent and uniform format across the dataset. Deepfake videos were generated using Remaker AI, employing face-swapping techniques. Remaker AI is an AI-powered platform that can generate images, swap faces in photos and videos, and edit content. The source face photos for these swaps were taken from Freepik, an image bank website that provides content such as photographs, illustrations and vector images. SDFVD was created due to the lack of comparable small-scale deepfake video datasets. Key benefits of such datasets are:
• In educational settings or smaller research labs, smaller datasets can be particularly useful as they require fewer resources, allowing students and researchers to conduct experiments with limited budgets and computational resources.
• Researchers can use small-scale datasets to quickly prototype new ideas, test concepts, and refine algorithms before scaling up to larger datasets.
Overall, SDFVD offers a compact but diverse collection of real and deepfake videos, suitable for a variety of applications, including research, security, and education. It serves as a valuable resource for exploring the rapidly evolving field of deepfake technology and its impact on society.
Video service providers (cable) are required to compensate municipalities for the use of public rights-of-way. This compensation is used by the City of Bloomington for a number of communications and information technology projects. This data reflects the payments of wireline video service providers in the City of Bloomington. Attached is an Excel report using this dataset.
https://www.kcl.ac.uk/researchsupport/assets/DataAccessAgreement-Description.pdf
This dataset contains annotated images for object detection of containers and hands in a first-person (egocentric) view during drinking activities. Both YOLOv8 and COCO formats are provided (see the loading sketch below). Please refer to our paper for more details.
Purpose: Training and testing the object detection model.
Content: Videos from Session 1 of Subjects 1-20.
Images: Extracted from the videos of Subjects 1-20, Session 1.
Additional Images:
- ~500 hand/container images from Roboflow Open Source data.
- ~1,500 null (background) images from the VOC Dataset and the MIT Indoor Scene Recognition Dataset: 1,000 indoor scenes from 'MIT Indoor Scene Recognition' and 400 other unrelated objects from the VOC Dataset.
Data Augmentation:
- Horizontal flipping
- ±15% brightness change
- ±10° rotation
Formats Provided:
- COCO format
- PyTorch YOLOv8 format
Image Size: 416x416 pixels
Total Images: 16,834 (Training: 13,862; Validation: 1,975; Testing: 997)
Instance Numbers:
- Containers: Over 10,000
- Hands: Over 8,000
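For a quick sanity check of the COCO-format annotations, the JSON can be inspected with the standard library alone; the annotation file path below is a placeholder and should be replaced with the actual file in the download.
import json
from collections import Counter

# Placeholder path; point this at the dataset's COCO annotation JSON file.
with open("annotations/instances_train.json", encoding="utf-8") as f:
    coco = json.load(f)

category_names = {c["id"]: c["name"] for c in coco["categories"]}
instance_counts = Counter(category_names[a["category_id"]] for a in coco["annotations"])

print(len(coco["images"]), "images")
for name, count in instance_counts.items():
    print(name, count, "instances")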
https://www.statsndata.org/how-to-order
The Video Background Remover market is rapidly evolving, driven by the growing demand for high-quality content in digital media production, social media marketing, and virtual communications. This technology enables users to seamlessly remove or alter backgrounds in videos without extensive editing skills, providing
https://www.verifiedmarketresearch.com/privacy-policy/
Stock Video Market size was valued at USD 5.99 Billion in 2024 and is projected to reach USD 9.98 Billion by 2032, growing at a CAGR of 8.75% during the forecast period 2026-2032.
Stock Video Market: Definition/ Overview
Stock video is pre-recorded footage available for licensing to filmmakers, video producers and content developers. These clips cover a wide range of subjects and scenarios, from natural scenes to urban landscapes, and are utilized to supplement video projects without requiring original filming. Stock videos save time and resources by providing high-quality visuals quickly.
Stock video assets are adaptable and can be utilized in a variety of media projects. They improve marketing campaigns, social media postings and advertising by providing professional quality without the cost of specialized shoots. Filmmakers and video developers use them for B-roll, background scenes and visual storytelling. They can also be used in educational videos, presentations and website designs to interest and inform viewers.
Stock video offers the potential to transform content development by allowing for quick, cost-effective production in marketing, education and entertainment. It benefits a wide range of industries, including advertising and film, by strengthening storytelling with high-quality visuals. As AI progresses, personalized and dynamic stock footage will enhance user experiences, making it a useful tool for both creators and corporations.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Introduction
The ComplexVAD dataset consists of 104 training and 113 testing video sequences taken from a static camera looking at a scene of a two-lane street with sidewalks on either side of the street and another sidewalk going across the street at a crosswalk. The videos were collected over a period of a few months on the campus of the University of South Florida using a camcorder with 1920 x 1080 pixel resolution. Videos were collected at various times during the day and on each day of the week. Videos vary in duration with most being about 12 minutes long. The total duration of all training and testing videos is a little over 34 hours. The scene includes cars, buses and golf carts driving in two directions on the street, pedestrians walking and jogging on the sidewalks and crossing the street, people on scooters, skateboards and bicycles on the street and sidewalks, and cars moving in the parking lot in the background. Branches of a tree also move at the top of many frames.
The 113 testing videos have a total of 118 anomalous events consisting of 40 different anomaly types.
Ground truth annotations are provided for each testing video in the form of bounding boxes around each anomalous event in each frame. Each bounding box is also labeled with a track number, meaning each anomalous event is labeled as a track of bounding boxes. A single frame can have more than one anomaly labeled.
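The on-disk annotation format is not described here, but the structure above (per-frame bounding boxes grouped into labeled tracks) can be represented in memory along the following lines; this is only an illustrative Python sketch, and the field names and example values are hypothetical.
from dataclasses import dataclass, field

@dataclass
class AnomalyTrack:
    # One anomalous event: a labeled track of per-frame bounding boxes.
    track_id: int
    anomaly_type: str
    boxes: dict = field(default_factory=dict)  # frame index -> (x_min, y_min, x_max, y_max)

track = AnomalyTrack(track_id=3, anomaly_type="example_anomaly")  # hypothetical values
track.boxes[120] = (640, 380, 720, 520)
print(len(track.boxes), "annotated frame(s) in track", track.track_id)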
At a Glance
License
The ComplexVAD dataset is released under CC-BY-SA-4.0 license.
All data:
Created by Mitsubishi Electric Research Laboratories (MERL), 2024
SPDX-License-Identifier: CC-BY-SA-4.0
Financial overview and grant giving statistics of The Video Game History Foundation Inc
On June 17, 2016, Korean education brand Pinkfong released their video "Baby Shark Dance", and the rest is history. In January 2021, "Baby Shark Dance" became the first YouTube video to surpass 10 billion views, after snatching the crown of most-viewed YouTube video of all time from the former record holder "Despacito" one year before. "Baby Shark Dance" currently has over 15 billion lifetime views on YouTube.
Music videos on YouTube
"Baby Shark Dance" might be the current record-holder in terms of total views, but Korean artist Psy's "Gangnam Style" video remained in the top spot for longest (1,689 days, or 4.6 years) before ceding its place to its successor. With figures like these, it comes as little surprise that the majority of the most popular videos on YouTube are music videos. Since 2010, all but one of the most-viewed videos on YouTube have been music videos, signifying the platform's shift in focus from funny, viral videos to professionally produced content. As of 2022, about 40 percent of the U.S. digital music audience uses YouTube Music.
Popular video content on YouTube
Music fans are also highly engaged audiences, and it is not uncommon for music videos to garner significant amounts of traffic within the first 24 hours of release. Other popular types of videos that generate lots of views right after their release are movie trailers, especially for superhero movies related to the MCU (Marvel Cinematic Universe). The first official trailer for the upcoming film "Avengers: Endgame" generated 289 million views within the first 24 hours of release, while the movie trailer for "Spider-Man: No Way Home" generated over 355 million views on its first day, making it the most viral movie trailer.
This dataset is made up of images from 8 different environments. 37 video sources have been processed; every second an image is extracted (a frame at 0.5s, 1.5s, 2.5s, and so on), and to accompany that image, MFCC audio statistics are also extracted from the relevant second of video.
In this dataset, you will notice some common errors from single classifiers. For example, in the video of London, the image classifier confuses the environment with "FOREST" when a lady walks past with flowing hair. Likewise, the audio classifier is fooled into predicting "RIVER" when we walk past a large fountain in Las Vegas, due to the sound of flowing water. Both of these errors can be fixed by a multi-modal approach, where fusion allows for the correction of errors. In our study, both of these cases were correctly classified as "CITY", since multimodality can compensate for single-modal errors caused by anomalous data.
Look and Listen: A Multi-Modal Late Fusion Approach to Scene Classification for Autonomous Machines. Jordan J. Bird, Diego R. Faria, Cristiano Premebida, Aniko Ekart, and George Vogiatzis.
In this challenge, we can learn environments ("Where am I?") from either images, audio, or take a multimodal approach to fuse the data.
Multi-modal fusion often requires far fewer computing resources than temporal models, but sometimes at the cost of classification ability. Can a method of fusion overcome this? Let's find out!
Class data are given as strings in dataset.csv
Each row of the dataset contains a path to the image, as well as the MFCC data extracted from the second of video that accompanies the frame.
(copied and pasted from the paper) We extract the Mel-Frequency Cepstral Coefficients (MFCC) of the audio clips through a set of sliding windows 0.25 s in length (i.e. a frame size of 4K sampling points) and an additional set of overlapping windows, thus producing 8 sliding windows, i.e. 8 frames/sec. From each audio-frame, we extract 13 MFCC attributes, producing 104 attributes per 1-second clip.
These are numbered in sequence from MFCC_1
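The following Python sketch approximates that feature extraction with librosa; the 16 kHz sampling rate (so that 0.25 s ≈ 4K samples), the 50% window overlap, and the file name are assumptions, and the exact parameters of the original pipeline may differ.
import librosa
import numpy as np

def mfcc_features_per_second(audio_path, sr=16000):
    # 13 MFCCs over 0.25 s windows with a 0.125 s hop (8 frames/sec),
    # grouped into 1-second rows of 8 x 13 = 104 attributes.
    y, sr = librosa.load(audio_path, sr=sr)
    mfcc = librosa.feature.mfcc(
        y=y, sr=sr, n_mfcc=13,
        n_fft=sr // 4,       # 0.25 s analysis window
        hop_length=sr // 8,  # 0.125 s hop -> 8 frames per second
        center=False,
    )
    n_seconds = mfcc.shape[1] // 8
    return mfcc[:, : n_seconds * 8].T.reshape(n_seconds, 104)

features = mfcc_features_per_second("example_clip.wav")  # placeholder file name
print(features.shape)  # (n_seconds, 104)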
The original study deals with Class 2 (the actual environment, 8 classes), but we have included Class 1 as well. Class 1 is a much easier binary classification problem of "Outdoors" vs. "Indoors".