40 datasets found
  1. Data from: YouTube Video Network Dataset for Israel-Hamas War

    • ieee-dataport.org
    Updated Dec 23, 2023
    Cite
    Thejas T (2023). YouTube Video Network Dataset for Israel-Hamas War [Dataset]. https://ieee-dataport.org/documents/youtube-video-network-dataset-israel-hamas-war
    Explore at:
    Dataset updated
    Dec 23, 2023
    Authors
    Thejas T
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Israel, YouTube
    Description

    Over the past few years, YouTube has become a popular platform for video broadcasting and for earning money by publishing videos that showcase a wide range of skills. For some people it has become a primary source of income. Getting videos to trend among viewers is a major goal for every content creator, and a video's popularity and reach depend largely on YouTube's recommendation algorithm. This document is a dataset descriptor for a dataset collected over a span of about 45 days during the Israel-Hamas war.

  2. Long Video Dataset Dataset

    • paperswithcode.com
    Updated Nov 18, 2020
    + more versions
    Cite
    Yongqing Liang; Xin Li; Navid Jafari; Qin Chen (2020). Long Video Dataset Dataset [Dataset]. https://paperswithcode.com/dataset/long-video-dataset
    Explore at:
    Dataset updated
    Nov 18, 2020
    Authors
    Yongqing Liang; Xin Li; Navid Jafari; Qin Chen
    Description

    We randomly selected three videos from the Internet that are longer than 1.5K frames and whose main objects appear continuously. Each video has 20 uniformly sampled frames manually annotated for evaluation.
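    The annotation protocol above (20 uniformly sampled frames per video) is easy to reproduce; the sketch below is a minimal illustration using OpenCV and NumPy, with a placeholder filename since the dataset's own file layout is not described here.

        import cv2
        import numpy as np

        def sample_uniform_frames(video_path, num_frames=20):
            """Return num_frames frames sampled at uniform intervals across the video."""
            cap = cv2.VideoCapture(video_path)
            total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
            frames = []
            for idx in np.linspace(0, total - 1, num_frames, dtype=int):
                cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
                ok, frame = cap.read()
                if ok:
                    frames.append(frame)
            cap.release()
            return frames

        # Hypothetical usage; "long_video_01.mp4" is a placeholder filename.
        frames = sample_uniform_frames("long_video_01.mp4")
        print(f"Sampled {len(frames)} frames")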

  3. A Labelled Dataset for Sentiment Analysis of Videos on YouTube, TikTok, and...

    • search.dataone.org
    Updated Sep 24, 2024
    + more versions
    Cite
    Thakur, Nirmalya; Su, Vanessa; Shao, Mingchen; Patel, Kesha A.; Jeong, Hongseok; Knieling, Victoria; Bian, Andrew (2024). A Labelled Dataset for Sentiment Analysis of Videos on YouTube, TikTok, and Other Sources about the 2024 Outbreak of Measles [Dataset]. http://doi.org/10.7910/DVN/QTJ9HC
    Explore at:
    Dataset updated
    Sep 24, 2024
    Dataset provided by
    Harvard Dataverse
    Authors
    Thakur, Nirmalya; Su, Vanessa; Shao, Mingchen; Patel, Kesha A.; Jeong, Hongseok; Knieling, Victoria; Bian, Andrew
    Time period covered
    Jan 1, 2024 - May 31, 2024
    Area covered
    YouTube
    Description

    Please cite the following paper when using this dataset: N. Thakur, V. Su, M. Shao, K. Patel, H. Jeong, V. Knieling, and A. Bian, “A labelled dataset for sentiment analysis of videos on YouTube, TikTok, and other sources about the 2024 outbreak of measles,” arXiv [cs.CY], 2024. Available: http://arxiv.org/abs/2406.07693

    Abstract

    This dataset contains the data of 4011 videos about the ongoing outbreak of measles published on 264 websites on the internet between January 1, 2024, and May 31, 2024. These websites primarily include YouTube and TikTok, which account for 48.6% and 15.2% of the videos, respectively. The remainder of the websites include Instagram and Facebook as well as the websites of various global and local news organizations. For each of these videos, the URL of the video, title of the post, description of the post, and the date of publication of the video are presented as separate attributes in the dataset. After developing this dataset, sentiment analysis (using VADER), subjectivity analysis (using TextBlob), and fine-grain sentiment analysis (using DistilRoBERTa-base) of the video titles and video descriptions were performed. This included classifying each video title and video description into (i) one of the sentiment classes i.e. positive, negative, or neutral, (ii) one of the subjectivity classes i.e. highly opinionated, neutral opinionated, or least opinionated, and (iii) one of the fine-grain sentiment classes i.e. fear, surprise, joy, sadness, anger, disgust, or neutral. These results are presented as separate attributes in the dataset for the training and testing of machine learning algorithms for performing sentiment analysis or subjectivity analysis in this field as well as for other applications. The paper associated with this dataset (please see the above-mentioned citation) also presents a list of open research questions that may be investigated using this dataset.
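    As a rough illustration of the sentiment and subjectivity analyses described above, the sketch below applies VADER and TextBlob to a single video title. It is a minimal example under assumed class thresholds, not the authors' exact pipeline, and the sample title is invented.

        # pip install vaderSentiment textblob
        from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
        from textblob import TextBlob

        title = "Measles outbreak: what parents need to know"  # invented example title

        # Sentiment class from VADER's compound score (common default thresholds).
        compound = SentimentIntensityAnalyzer().polarity_scores(title)["compound"]
        sentiment = "positive" if compound >= 0.05 else "negative" if compound <= -0.05 else "neutral"

        # Subjectivity from TextBlob (0 = objective, 1 = subjective); the cut-offs here are assumptions.
        subjectivity = TextBlob(title).sentiment.subjectivity
        if subjectivity > 0.6:
            subjectivity_class = "highly opinionated"
        elif subjectivity > 0.3:
            subjectivity_class = "neutral opinionated"
        else:
            subjectivity_class = "least opinionated"

        print(sentiment, subjectivity_class)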

  4. Average daily time spent on social media worldwide 2012-2025

    • statista.com
    Updated Jun 19, 2025
    + more versions
    Cite
    Statista (2025). Average daily time spent on social media worldwide 2012-2025 [Dataset]. https://www.statista.com/statistics/433871/daily-social-media-usage-worldwide/
    Explore at:
    Dataset updated
    Jun 19, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Area covered
    Worldwide
    Description

    How much time do people spend on social media? As of 2025, the average daily social media usage of internet users worldwide amounted to 141 minutes per day, down from 143 minutes in the previous year. Currently, the country with the most time spent on social media per day is Brazil, with online users spending an average of 3 hours and 49 minutes on social media each day. In comparison, the daily time spent with social media in the U.S. was just 2 hours and 16 minutes.

    Global social media usage: Currently, the global social network penetration rate is 62.3 percent. Northern Europe had an 81.7 percent social media penetration rate, topping the ranking of global social media usage by region. Eastern and Middle Africa closed the ranking with 10.1 and 9.6 percent usage reach, respectively. People access social media for a variety of reasons. Users like to find funny or entertaining content and enjoy sharing photos and videos with friends, but mainly use social media to stay in touch with friends and current events.

    Global impact of social media: Social media has a wide-reaching and significant impact not only on online activities but also on offline behavior and life in general. During a global online user survey in February 2019, a significant share of respondents stated that social media had increased their access to information, ease of communication, and freedom of expression. On the flip side, respondents also felt that social media had worsened their personal privacy, increased polarization in politics, and heightened everyday distractions.

  5. Custom dataset from any website on the Internet

    • datarade.ai
    Updated Sep 21, 2022
    Cite
    ScrapeLabs (2022). Custom dataset from any website on the Internet [Dataset]. https://datarade.ai/data-products/custom-dataset-from-any-website-on-the-internet-scrapelabs
    Explore at:
    .bin, .json, .xml, .csv, .xls, .sql, .txt (available download formats)
    Dataset updated
    Sep 21, 2022
    Dataset authored and provided by
    ScrapeLabs
    Area covered
    Kazakhstan, Bulgaria, India, Argentina, Tunisia, Lebanon, Jordan, Guinea-Bissau, Aruba, Turks and Caicos Islands
    Description

    We'll extract any data from any website on the Internet. You don't have to worry about buying and maintaining complex and expensive software, or hiring developers.

    Some common use cases our customers use the data for: • Data Analysis • Market Research • Price Monitoring • Sales Leads • Competitor Analysis • Recruitment

    We can get data from websites with pagination or scroll, with captchas, and even from behind logins. Text, images, videos, documents.

    Receive data in any format you need: Excel, CSV, JSON, or any other.

  6. Facebook Spam Dataset

    • kaggle.com
    Updated Apr 11, 2021
    Cite
    Khaja Hussain SK (2021). Facebook Spam Dataset [Dataset]. https://www.kaggle.com/khajahussainsk/facebook-spam-dataset/activity
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 11, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Khaja Hussain SK
    Description

    Context

    Collection of Facebook spam and legitimate profiles with profile- and content-based features. It can be used for classification tasks.

    Content

    The dataset can be used for building machine learning models. To collect the dataset, the Facebook API and Facebook Graph API were used, and the data was collected from public profiles. There are 500 legit profiles and 100 spam profiles. The features are as follows, with Label (0 = legit, 1 = spam):

    1. Number of friends
    2. Number of followings
    3. Number of communities
    4. Age of the user account (in days)
    5. Total number of posts shared
    6. Total number of URLs shared
    7. Total number of photos/videos shared
    8. Fraction of posts containing URLs
    9. Fraction of posts containing photos/videos
    10. Average number of comments per post
    11. Average number of likes per post
    12. Average number of tags in a post (rate of tagging)
    13. Average number of hashtags present in a post

    Inspiration

    The dataset helps the community understand how these features can distinguish legitimate Facebook users from spam users.
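    A minimal sketch of the classification task this dataset targets, assuming it is exported as a CSV with one column per feature above plus a Label column (the filename and column name are placeholders, not the actual Kaggle headers).

        # pip install pandas scikit-learn
        import pandas as pd
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.metrics import classification_report
        from sklearn.model_selection import train_test_split

        df = pd.read_csv("facebook_spam_dataset.csv")  # hypothetical filename
        X = df.drop(columns=["Label"])
        y = df["Label"]  # 0 = legit, 1 = spam

        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.2, stratify=y, random_state=42
        )
        clf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_train, y_train)
        print(classification_report(y_test, clf.predict(X_test)))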

  7. youtube

    • kaggle.com
    Updated Sep 25, 2020
    Cite
    Ramin Rahimzada (2020). youtube [Dataset]. https://www.kaggle.com/raminrahimzada/youtube/tasks
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 25, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Ramin Rahimzada
    Area covered
    YouTube
    Description

    Context

    A portion of data grabbed from YouTube.

    Content

    The dataset contains YouTube channels, videos, and comments.

    Acknowledgements

    Values shown in the dataset, such as like counts, may have changed since collection, so a GrabDate column records when the data was grabbed.

  8. Replication Data for: Automated Coding of Political Campaign Advertisement...

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 8, 2023
    Cite
    Tarr, Alex; Imai, Kosuke; Hwang, June (2023). Replication Data for: Automated Coding of Political Campaign Advertisement Videos: An Empirical Validation Study [Dataset]. http://doi.org/10.7910/DVN/6SWKPR
    Explore at:
    Dataset updated
    Nov 8, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Tarr, Alex; Imai, Kosuke; Hwang, June
    Description

    Video advertisements, whether on television or the Internet, play an essential role in modern political campaigns. For over two decades, researchers have studied television video ads by analyzing the hand-coded data from the Wisconsin Advertising Project and its successor, the Wesleyan Media Project (WMP). Unfortunately, manually coding more than a hundred variables, such as issue mentions, opponent appearance, and negativity, for many videos is a laborious and expensive process. We propose to automatically code campaign advertisement videos. Applying state-of-the-art machine learning methods, we extract various audio and image features from each video file. We show that our machine coding is comparable to human coding for many variables of the WMP data sets. Since many candidates make their advertisement videos available on the Internet, automated coding can dramatically improve the efficiency and scope of campaign advertisement research. An open-source software package is available for implementing the proposed methodology.
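    The paper's actual feature-extraction pipeline is not reproduced here; the sketch below only illustrates the general idea of pulling simple image and audio descriptors from a campaign video, using OpenCV for frames and librosa for audio, and assuming the audio track has already been exported to a WAV file.

        # pip install opencv-python librosa numpy
        import cv2
        import librosa
        import numpy as np

        def basic_video_features(video_path, audio_path, num_frames=8):
            """Very rough image + audio descriptors for one advertisement video."""
            cap = cv2.VideoCapture(video_path)
            total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
            frame_feats = []
            for idx in np.linspace(0, total - 1, num_frames, dtype=int):
                cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
                ok, frame = cap.read()
                if ok:
                    # Mean colour per channel as a trivial image descriptor.
                    frame_feats.append(frame.reshape(-1, 3).mean(axis=0))
            cap.release()

            # MFCCs summarise the audio track (speech/music characteristics).
            y, sr = librosa.load(audio_path, sr=16000)
            mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).mean(axis=1)
            return np.concatenate([np.mean(frame_feats, axis=0), mfcc])

        # Hypothetical paths; the replication archive uses its own file layout.
        # features = basic_video_features("ad_video.mp4", "ad_audio.wav")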

  9. EDUVSUM Dataset

    • paperswithcode.com
    + more versions
    Cite
    Junaid Ahmed Ghauri; Sherzod Hakimov; Ralph Ewerth, EDUVSUM Dataset [Dataset]. https://paperswithcode.com/dataset/eduvsum
    Explore at:
    Authors
    Junaid Ahmed Ghauri; Sherzod Hakimov; Ralph Ewerth
    Description

    EDUVSUM contains educational videos with subtitles from three popular e-learning platforms (EdX, YouTube, and the TIB AV-Portal) covering the following topics: a crash course on the history of science and engineering, computer science, Python and web programming, machine learning and computer vision, the Internet of Things (IoT), and software engineering. In total, the current version of the dataset contains 98 videos with ground-truth values annotated by a user with an academic background in computer science.

  10. RECOD.ai Events Dataset

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip, pdf
    Updated Jul 17, 2024
    + more versions
    Cite
    José Nascimento; Anderson Rocha (2024). RECOD.ai Events Dataset [Dataset]. http://doi.org/10.5281/zenodo.5547606
    Explore at:
    pdf, application/gzip (available download formats)
    Dataset updated
    Jul 17, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    José Nascimento; Anderson Rocha
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Overview

    This data set consists of links to social network items for 34 different forensic events that took place between August 14th, 2018 and January 6th, 2021. The majority of the text and images are from Twitter (a minor part is from Flickr, Facebook and Google+), and every video is from YouTube.

    Data Collection

    We used Social Tracker (https://github.com/MKLab-ITI/mmdemo-dockerized), along with the social media platforms' APIs, to gather most of the collections. For a minor part, we used Twint (https://github.com/twintproject/twint). In both cases, we provided keywords related to the event to retrieve the data.

    It is important to mention that, in procedures like this one, usually only a small fraction of the collected data is actually related to the event and useful for further forensic analysis.

    Content

    We have data from 34 events, and for each of them we provide the files:

    items_full.csv: contains links to every social media post that was collected.

    images.csv: lists the images collected. Some files include a field called "ItemUrl" that refers to the social network post (e.g., a tweet) that mentions that media.

    video.csv: URLs of YouTube videos gathered about the event.

    video_tweet.csv: contains IDs of tweets and IDs of YouTube videos. A tweet whose ID is in this file has a video in its content; in turn, the link of a YouTube video whose ID is in this file was mentioned by at least one collected tweet. Only two collections have this file.

    description.txt: Contains some standard information about the event, and possibly some comments about any specific issue related to it.

    Note that most of the collections do not have all of the files above, due to changes in our collection procedure over the course of this work.
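    A minimal sketch, under the assumption that each event's files sit in their own folder, of loading the link lists above with pandas; the folder name is a placeholder, and the optional files are checked before reading.

        # pip install pandas
        import os
        import pandas as pd

        event_dir = "event_01"  # hypothetical folder for one of the 34 events

        # items_full.csv: links to every collected social media post.
        items = pd.read_csv(os.path.join(event_dir, "items_full.csv"))
        print(f"{len(items)} collected items")

        # video.csv: YouTube video URLs gathered for the event (not present for every event).
        video_path = os.path.join(event_dir, "video.csv")
        if os.path.exists(video_path):
            videos = pd.read_csv(video_path)
            print(f"{len(videos)} YouTube video links")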

    Events

    We divided the events into six groups:

    1. Fire

    • Devastating fire is the main issue of the event, so most of the informative pictures show flames or burned structures.

    • 14 Events

    2. Collapse

    • Most of the relevant images depict collapsed buildings, bridges, etc. (not caused by fire).

    • 5 Events

    3. Shooting

    • Images likely show guns and police officers, with little or no destruction of the environment.

    • 5 Events

    4. Demonstration

    • Large numbers of people on the streets. Some problem may have occurred during the demonstration, but in most cases the demonstration itself is the event.

    • 7 Events

    5. Collision

    • Traffic collision. Pictures of damaged vehicles in an urban landscape; there may be images of victims on the street.

    • 1 Event

    6. Flood

    • Events that range from fierce rain to a tsunami. Many pictures depict water.

    • 2 Events

    We list the events in the file recod-ai-events-dataset-list.pdf.

    Media Content

    Due to the social networks' terms of use, we do not make the collected texts, images, and videos publicly available. However, additional media content related to one or more events may be provided by contacting the authors.

    Funding

    DéjàVu thematic project, São Paulo Research Foundation (grants 2017/12646-3, 2018/18264-8 and 2020/02241-9)

  11. Data from: Faces in the wild: A naturalistic study of children’s facial...

    • tandf.figshare.com
    avi
    Updated May 30, 2023
    Cite
    Michael M. Shuster; Linda A. Camras; Adam Grabell; Susan B. Perlman (2023). Faces in the wild: A naturalistic study of children’s facial expressions in response to an Internet prank [Dataset]. http://doi.org/10.6084/m9.figshare.8121359.v2
    Explore at:
    avi (available download formats)
    Dataset updated
    May 30, 2023
    Dataset provided by
    Taylor & Francis
    Authors
    Michael M. Shuster; Linda A. Camras; Adam Grabell; Susan B. Perlman
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    There is surprisingly little empirical evidence supporting theoretical and anecdotal claims regarding the spontaneous production of prototypic facial expressions used in numerous emotion recognition studies. Proponents of innate prototypic expressions believe that this lack of evidence may be due to ethical restrictions against presenting powerful elicitors in the lab. The current popularity of internet platforms designed for public sharing of videos allows investigators to shed light on this debate by examining naturally-occurring facial expressions outside the laboratory. An Internet prank (“Scary Maze”) has provided a unique opportunity to observe children reacting to a consistent fear- and surprise-inducing stimulus: the unexpected presentation of a “scary face” during an online maze game. The purpose of this study was to examine children’s facial expressions in this naturalistic setting. Emotion ratings of non-facial behaviour (provided by untrained undergraduates) and anatomically-based facial codes were obtained from 60 videos of children (ages 4–7) found on YouTube. Emotion ratings were highest for fear and surprise. Correspondingly, children displayed more facial expressions of fear and surprise than of other emotions (e.g. anger, joy). These findings provide partial support for the ecological validity of fear and surprise expressions. Still, prototypic expressions were produced by fewer than half of the children.

  12. UMAHand: Hand Activity Dataset (Universidad de Málaga)

    • figshare.com
    • portaldelainvestigacion.uma.es
    zip
    Updated Jul 2, 2024
    Cite
    Eduardo Casilari; Jennifer Barbosa-Galeano; Francisco Javier González-Cañete (2024). UMAHand: Hand Activity Dataset (Universidad de Málaga) [Dataset]. http://doi.org/10.6084/m9.figshare.25638246.v3
    Explore at:
    zip (available download formats)
    Dataset updated
    Jul 2, 2024
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Eduardo Casilari; Jennifer Barbosa-Galeano; Francisco Javier González-Cañete
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The objective of the UMAHand dataset is to provide a systematic, Internet-accessible benchmarking database for evaluating algorithms for the automatic identification of manual activities. The database was created by monitoring 29 predefined activities involving specific movements of the dominant hand. These activities were performed by 25 participants, each completing a certain number of repetitions. During each movement, participants wore a 'mote' or Shimmer sensor device on their dominant hand's wrist. This sensor, comparable in weight and volume to a wristwatch, was attached with an elastic band in a predetermined orientation. The Shimmer device contains an Inertial Measurement Unit (IMU) with a triaxial accelerometer, gyroscope, magnetometer, and barometer. These sensors recorded measurements of acceleration, angular velocity, magnetic field, and atmospheric pressure at a constant sampling frequency of 100 Hz during each movement.

    The UMAHand dataset comprises a main directory and three subdirectories: TRACES (containing the measurements), VIDEOS (containing video sequences) and SCRIPTS (with two scripts that automate the downloading, unzipping and processing of the dataset). The main directory also includes three descriptive plain text files and an image:

    • "readme.txt": a brief guide to the dataset which describes its basic characteristics, the testbed or experimental framework used to generate it, and the organization of the data files.

    • "user_characteristics.txt": contains, for each participant, a line of six comma-separated numerical values describing their personal characteristics in the following order: 1) an abstract user identifier (a number from 01 to 25), 2) a binary value indicating whether the participant is left-handed (0) or right-handed (1), 3) a numerical value indicating gender: male (0), female (1), undefined or undisclosed (2), 4) the weight in kg, 5) the height in cm, and 6) the age in years.

    • "activity_description.txt": for each activity, this text file contains a line with the activity identifier (numbered from 01 to 29) and an alphanumeric string that briefly describes the performed action.

    • "sensor_orientation.jpg": a JPEG image illustrating how the sensor is worn and the orientation of the measurement axes.

    The TRACES subfolder with the data is, in turn, organized into 25 secondary subfolders, one per participant, named with the word "output" followed by an underscore (_) and the corresponding participant identifier (a number from 1 to 25). Each subdirectory contains one CSV (Comma Separated Values) file for each trial (each repetition of any activity) performed by the corresponding volunteer. The filenames of the monitored data follow the format "user_XX_activity_YY_trial_ZZ.csv", where XX, YY, and ZZ are the identifiers of the participant (XX), the activity (YY), and the repetition number (ZZ), respectively.

    The files do not include any header, and each line corresponds to one sample taken by the sensing node, i.e. a set of simultaneous measurements captured by the sensors of the Shimmer mote at a certain instant. The values in each line are arranged as follows:

    Timestamp, Ax, Ay, Az, Gx, Gy, Gz, Mx, My, Mz, P

    where:

    • Timestamp is the time at which the following measurements were taken, in milliseconds elapsed since the start of the recording. The first sample, in the first line of the file, therefore has a zero value, while the rest of the timestamps in the file are relative to this first sample.

    • Ax, Ay, Az are the measurements of the three axes of the triaxial accelerometer (in g units).

    • Gx, Gy, Gz are the components of the angular velocity measured by the triaxial gyroscope (in degrees per second, dps).

    • Mx, My, Mz are the 3-axis data in microteslas (µT) captured by the magnetometer.

    • P is the pressure measurement in millibars.

    The VIDEOS directory includes 29 anonymized video clips that illustrate, with corresponding examples, the 29 manual activities carried out by the participants. The video files are encoded in MPEG4 format and named according to the format "Example_Activity_XX.mp4", where XX is the identifier of the movement (as described in the activity_description.txt file).

    Finally, the SCRIPTS subfolder comprises two scripts written in Python and Matlab. These two programs (both named Load_traces), which perform the same function, automate the downloading and processing of the data. Specifically, the scripts perform the following tasks:

    1. Download the database from the public repository as a single compressed zip file.

    2. Unzip the file and create the subfolder structure of the dataset in a directory named UMAHand_Dataset. As mentioned above, the TRACES subfolder contains one CSV trace file per experiment (i.e. per movement, user, and trial).

    3. Read all the CSV files and store their information in a list of dictionaries (Python) or a matrix of structures (Matlab) named datasetTraces. Each element in that list/matrix has two fields: the filename (which identifies the user, the type of performed activity, and the trial number) and a numerical array of 11 columns containing the timestamps and the measurements of the sensors for that experiment (arranged as described above).

    All experiments and data acquisition were conducted in private home environments. Participants were asked to perform activities involving sustained or continuous hand movements (e.g. clapping hands) for at least 10 seconds. For brief, punctual movements that might take less than 10 seconds (e.g. picking up an object from the floor), volunteers were simply asked to execute the action until its conclusion. In total, 752 samples were collected, with durations ranging from 1.98 to 119.98 seconds.
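    A minimal sketch of reading one trace file into a pandas DataFrame using the column order documented above; the filename is a placeholder, and the official Load_traces scripts in the SCRIPTS folder remain the reference implementation.

        # pip install pandas
        import pandas as pd

        # Column order as documented: timestamp plus accelerometer, gyroscope,
        # magnetometer and barometer channels.
        COLUMNS = ["Timestamp", "Ax", "Ay", "Az", "Gx", "Gy", "Gz", "Mx", "My", "Mz", "P"]

        # Hypothetical trace: participant 01, activity 05, trial 02 (no header row in the files).
        trace = pd.read_csv("user_01_activity_05_trial_02.csv", header=None, names=COLUMNS)

        duration_s = trace["Timestamp"].iloc[-1] / 1000.0  # timestamps are in milliseconds
        print(f"{len(trace)} samples, {duration_s:.2f} s at ~100 Hz")
        print(trace[["Ax", "Ay", "Az"]].describe())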

  13. Network traffic and code for machine learning classification

    • data.mendeley.com
    Updated Feb 20, 2020
    + more versions
    Cite
    Víctor Labayen (2020). Network traffic and code for machine learning classification [Dataset]. http://doi.org/10.17632/5pmnkshffm.2
    Explore at:
    Dataset updated
    Feb 20, 2020
    Authors
    Víctor Labayen
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset is a set of network traffic traces in pcap/csv format captured from a single user. The traffic is classified into 5 different activities (Video, Bulk, Idle, Web, and Interactive), and the label is shown in the filename. There is also a file (mapping.csv) with the mapping of the host's IP address, the csv/pcap filename and the activity label.

    Activities:

    Interactive: applications that perform real-time interactions to provide a suitable user experience, such as editing a file in Google Docs or remote CLI sessions over SSH.

    Bulk data transfer: applications that transfer large data volumes over the network, for example SCP/FTP applications and direct downloads of large files from web servers such as Mediafire, Dropbox or the university repository.

    Web browsing: all traffic generated while searching and consuming different web pages, for example several blogs, news sites and the university's Moodle.

    Video playback: traffic from applications that consume video via streaming or pseudo-streaming. The best-known servers used are Twitch and YouTube, but the university's online classroom has also been used.

    Idle behaviour: the background traffic generated by the user's computer while the user is idle. This traffic was captured with every application closed and with some pages open (e.g. Google Docs, YouTube and several other web pages), but always without user interaction.

    The capture is performed on a network probe, attached via a SPAN port to the router that forwards the user's network traffic. The traffic is stored in pcap format with all the packet payload. In the CSV files, every non-TCP/UDP packet is filtered out, as well as every packet with no payload. The fields in the CSV files are the following (one line per packet): timestamp, protocol, payload size, source and destination IP address, and source and destination UDP/TCP port. The fields are also included as a header in every CSV file.
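    A minimal sketch of loading one of the per-activity CSV traces with pandas. The filename is a placeholder and, since the exact header names are not given in this description, the code only assumes the field order listed above (with payload size as the third column).

        # pip install pandas
        import pandas as pd

        # Hypothetical filename; the activity label is encoded in the real filenames.
        df = pd.read_csv("video_trace_01.csv")

        print(df.columns.tolist())           # header row as shipped in the file
        payload_bytes = df.iloc[:, 2].sum()  # assumes the third column is payload size
        print(f"{len(df)} packets, {payload_bytes / 1e6:.1f} MB of payload")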

    The amount of data is stated as follows:

    Bulk: 19 traces, 3599 s of total duration, 8704 MBytes of pcap files
    Video: 23 traces, 4496 s, 1405 MBytes
    Web: 23 traces, 4203 s, 148 MBytes
    Interactive: 42 traces, 8934 s, 30.5 MBytes
    Idle: 52 traces, 6341 s, 0.69 MBytes

    The code of our machine learning approach is also included. There is a README.txt file with the documentation of how to use the code.

  14. Shot2Story20K Dataset

    • paperswithcode.com
    Updated May 30, 2024
    Cite
    Mingfei Han; Linjie Yang; Xiaojun Chang; Heng Wang (2024). Shot2Story20K Dataset [Dataset]. https://paperswithcode.com/dataset/shot2story20k
    Explore at:
    Dataset updated
    May 30, 2024
    Authors
    Mingfei Han; Linjie Yang; Xiaojun Chang; Heng Wang
    Description

    A short video clip may contain the progression of multiple events and an interesting story line. A human needs to capture the event in every shot and associate the shots in order to understand the story behind them.

    In this work, we present a new multi-shot video understanding benchmark Shot2Story with detailed shot-level captions and comprehensive video summaries. To facilitate better semantic understanding of videos, we provide captions for both visual signals and human narrations. We design several distinct tasks including single-shot video and narration captioning, multi-shot video summarization, and video retrieval with shot descriptions.

    Preliminary experiments show that generating a long and comprehensive video summary remains challenging. Nevertheless, the generated imperfect summaries can already significantly boost the performance of existing video understanding tasks such as video question-answering, promoting an underexplored setting of video understanding with detailed summaries.

  15. Data from: Television shows based on video games 1975-2019: Original data...

    • jyx.jyu.fi
    Updated Mar 23, 2025
    Cite
    Tero Kerttula (2025). Television shows based on video games 1975-2019: Original data and preliminary analysis [Dataset]. http://doi.org/10.17011/jyx/dataset/71622
    Explore at:
    Dataset updated
    Mar 23, 2025
    Authors
    Tero Kerttula
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset consists of data on different types of television shows based on video games from the years mentioned in the title. The data has been used in articles and conference presentations before (e.g. Kerttula 2019; Kerttula 2020). The data is free to use in any future publications with proper references to the author and the original data. Should the data be used in further research, note that the dataset is not 100% complete. The reasons for this are language and cultural barriers. It also needs to be mentioned that some of the television shows and production companies have probably been forgotten over time, which means that a complete list would quite likely prove very difficult to gather. Some of the data included is missing classification information because the required data was not available or was hard to determine; for example, time-slot data was missing for these shows, or there was not enough information available to draw conclusions about the structure of the show. This applies only to a handful of shows, however. This data does not compromise or endanger any copyrights or personal information. All the data gathered here is publicly available from different internet sources. No personal information, such as addresses, phone numbers or contact persons, was recorded in the data. Some shows feature episodes from video repositories around the internet, but if a production company wants to take the episodes offline, it does not harm the dataset.

  16. A Labelled Dataset for Sentiment Analysis of Videos on YouTube, TikTok, and...

    • zenodo.org
    • data.niaid.nih.gov
    • +2more
    csv
    Updated Jul 20, 2024
    + more versions
    Cite
    Nirmalya Thakur; Vanessa Su; Mingchen Shao; Kesha A. Patel; Hongseok Jeong; Victoria Knieling; Andrew Bian (2024). A Labelled Dataset for Sentiment Analysis of Videos on YouTube, TikTok, and other sources about the 2024 outbreak of Measles [Dataset]. http://doi.org/10.5281/zenodo.11711230
    Explore at:
    csv (available download formats)
    Dataset updated
    Jul 20, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Nirmalya Thakur; Vanessa Su; Mingchen Shao; Kesha A. Patel; Hongseok Jeong; Victoria Knieling; Andrew Bian
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jun 15, 2024
    Area covered
    YouTube
    Description

    Please cite the following paper when using this dataset:

    N. Thakur, V. Su, M. Shao, K. Patel, H. Jeong, V. Knieling, and A. Bian “A labelled dataset for sentiment analysis of videos on YouTube, TikTok, and other sources about the 2024 outbreak of measles,” Proceedings of the 26th International Conference on Human-Computer Interaction (HCII 2024), Washington, USA, 29 June - 4 July 2024. (Accepted as a Late Breaking Paper, Preprint Available at: https://doi.org/10.48550/arXiv.2406.07693)

    Abstract

    This dataset contains the data of 4011 videos about the ongoing outbreak of measles published on 264 websites on the internet between January 1, 2024, and May 31, 2024. These websites primarily include YouTube and TikTok, which account for 48.6% and 15.2% of the videos, respectively. The remainder of the websites include Instagram and Facebook as well as the websites of various global and local news organizations. For each of these videos, the URL of the video, title of the post, description of the post, and the date of publication of the video are presented as separate attributes in the dataset. After developing this dataset, sentiment analysis (using VADER), subjectivity analysis (using TextBlob), and fine-grain sentiment analysis (using DistilRoBERTa-base) of the video titles and video descriptions were performed. This included classifying each video title and video description into (i) one of the sentiment classes i.e. positive, negative, or neutral, (ii) one of the subjectivity classes i.e. highly opinionated, neutral opinionated, or least opinionated, and (iii) one of the fine-grain sentiment classes i.e. fear, surprise, joy, sadness, anger, disgust, or neutral. These results are presented as separate attributes in the dataset for the training and testing of machine learning algorithms for performing sentiment analysis or subjectivity analysis in this field as well as for other applications. The paper associated with this dataset (please see the above-mentioned citation) also presents a list of open research questions that may be investigated using this dataset.
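    To complement the VADER/TextBlob sketch shown for the Dataverse copy of this dataset (entry 3), the snippet below illustrates fine-grain (emotion) classification with a DistilRoBERTa-based model through the Hugging Face transformers pipeline; the specific checkpoint named here is an assumption, not necessarily the one the authors used.

        # pip install transformers torch
        from transformers import pipeline

        # Assumed checkpoint: a DistilRoBERTa model fine-tuned for emotion classification.
        emotion = pipeline("text-classification",
                           model="j-hartmann/emotion-english-distilroberta-base",
                           top_k=1)

        title = "Measles cases surge as vaccination rates drop"  # invented example title
        print(emotion(title))  # e.g. [[{'label': 'fear', 'score': ...}]]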

  17. youtube

    • networkrepository.com
    csv
    Updated Jun 23, 2016
    + more versions
    Cite
    Network Data Repository (2016). youtube [Dataset]. https://networkrepository.com/soc-youtube.php
    Explore at:
    csv (available download formats)
    Dataset updated
    Jun 23, 2016
    Dataset authored and provided by
    Network Data Repository
    License

    https://networkrepository.com/policy.php

    Area covered
    YouTube
    Description

    YouTube online social network. YouTube is a video-sharing website that includes a social network. The dataset contains a list of all of the user-to-user links.
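    A minimal sketch, assuming the user-to-user links are distributed as a plain edge list (one "source target" pair per line), of loading the graph with NetworkX; the filename is a placeholder.

        # pip install networkx
        import networkx as nx

        # Hypothetical filename; comments="%" skips any '%'-prefixed header lines if present.
        G = nx.read_edgelist("soc-youtube.edges", comments="%", nodetype=int)

        print(G.number_of_nodes(), "users,", G.number_of_edges(), "user-to-user links")
        # The ten most connected accounts by degree.
        print(sorted(G.degree, key=lambda kv: kv[1], reverse=True)[:10])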

  18. QUILT-1M Dataset

    • paperswithcode.com
    Updated Feb 10, 2025
    Cite
    (2025). QUILT-1M Dataset [Dataset]. https://paperswithcode.com/dataset/quilt-1m
    Explore at:
    Dataset updated
    Feb 10, 2025
    Description

    Recent accelerations in multi-modal applications have been made possible by the plethora of image and text data available online. However, the scarcity of similar data in the medical field, specifically in histopathology, has halted similar progress. To enable similar representation learning for histopathology, we turn to YouTube, an untapped resource of videos, offering 1,087 hours of valuable educational histopathology videos from expert clinicians. From YouTube, we curate Quilt: a large-scale vision-language dataset consisting of 768,826 image and text pairs. Quilt was automatically curated using a mixture of models, including large language models, handcrafted algorithms, human knowledge databases, and automatic speech recognition. In comparison, the most comprehensive datasets curated for histopathology amass only around 200K samples. We combine Quilt with datasets from other sources, including Twitter, research papers, and the internet in general, to create an even larger dataset: Quilt-1M, with 1M paired image-text samples, making it the largest vision-language histopathology dataset to date. We demonstrate the value of Quilt-1M by fine-tuning a pre-trained CLIP model. Our model outperforms state-of-the-art models on both zero-shot and linear probing tasks for classifying new pathology images across 13 diverse patch-level datasets of 8 different sub-pathologies, as well as on cross-modal retrieval tasks.
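    The fine-tuned Quilt-1M model itself is not included here; as a rough illustration of the zero-shot evaluation described, the sketch below runs a stock CLIP checkpoint from Hugging Face on a single patch image with hypothetical class prompts.

        # pip install transformers torch pillow
        from PIL import Image
        from transformers import CLIPModel, CLIPProcessor

        model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
        processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

        # Hypothetical patch image and class prompts for a histopathology task.
        image = Image.open("patch.png")
        prompts = ["a histopathology image of benign tissue",
                   "a histopathology image of malignant tissue"]

        inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
        probs = model(**inputs).logits_per_image.softmax(dim=-1)
        print(dict(zip(prompts, probs[0].tolist())))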

  19. activities-videos-of-120-uinque-people-different-skin-tones

    • huggingface.co
    Updated May 20, 2025
    Cite
    AIxBlock (2025). activities-videos-of-120-uinque-people-different-skin-tones [Dataset]. https://huggingface.co/datasets/AIxBlock/activities-videos-of-120-uinque-people-different-skin-tones
    Explore at:
    Dataset updated
    May 20, 2025
    Authors
    AIxBlock
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    This dataset is provided by AIxBlock and features 40 sets, each containing 120 unique individuals with diverse skin tones. Each set includes multiple participants performing a variety of activities across different backgrounds. All videos were recorded using smartphones following a strict guideline—none of the footage was sourced from the internet. These videos are designed for training computer vision (CV) models to recognize human activities and movement across a range of environments and… See the full description on the dataset page: https://huggingface.co/datasets/AIxBlock/activities-videos-of-120-uinque-people-different-skin-tones.

  20. Next Generation Simulation (NGSIM) Vehicle Trajectories and Supporting Data

    • catalog.data.gov
    • data.transportation.gov
    • +5more
    Updated Jun 16, 2025
    Cite
    Federal Highway Administration (2025). Next Generation Simulation (NGSIM) Vehicle Trajectories and Supporting Data [Dataset]. https://catalog.data.gov/dataset/next-generation-simulation-ngsim-vehicle-trajectories-and-supporting-data
    Explore at:
    Dataset updated
    Jun 16, 2025
    Dataset provided by
    Federal Highway Administration
    Description

    Click “Export” on the right to download the vehicle trajectory data. The associated metadata and additional data can be downloaded below under "Attachments". Researchers for the Next Generation Simulation (NGSIM) program collected detailed vehicle trajectory data on southbound US 101 and Lankershim Boulevard in Los Angeles, CA, eastbound I-80 in Emeryville, CA and Peachtree Street in Atlanta, Georgia. Data was collected through a network of synchronized digital video cameras. NGVIDEO, a customized software application developed for the NGSIM program, transcribed the vehicle trajectory data from the video. This vehicle trajectory data provided the precise location of each vehicle within the study area every one-tenth of a second, resulting in detailed lane positions and locations relative to other vehicles. Click the "Show More" button below to find additional contextual data and metadata for this dataset. For site-specific NGSIM video file datasets, please see the following: - NGSIM I-80 Videos: https://data.transportation.gov/Automobiles/Next-Generation-Simulation-NGSIM-Program-I-80-Vide/2577-gpny - NGSIM US-101 Videos: https://data.transportation.gov/Automobiles/Next-Generation-Simulation-NGSIM-Program-US-101-Vi/4qzi-thur - NGSIM Lankershim Boulevard Videos: https://data.transportation.gov/Automobiles/Next-Generation-Simulation-NGSIM-Program-Lankershi/uv3e-y54k - NGSIM Peachtree Street Videos: https://data.transportation.gov/Automobiles/Next-Generation-Simulation-NGSIM-Program-Peachtree/mupt-aksf
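    A minimal sketch of working with the exported trajectory table, assuming standard NGSIM column names such as Vehicle_ID, Frame_ID, Local_X and Local_Y (positions in feet, one frame per 0.1 s); these names should be verified against the dataset's own metadata before use.

        # pip install pandas numpy
        import numpy as np
        import pandas as pd

        df = pd.read_csv("ngsim_trajectories.csv")  # hypothetical export filename

        def add_speed(group):
            """Approximate speed (ft/s) from successive positions 0.1 s apart."""
            group = group.sort_values("Frame_ID")
            dx = group["Local_X"].diff()
            dy = group["Local_Y"].diff()
            group["speed_ftps"] = np.hypot(dx, dy) / 0.1
            return group

        df = df.groupby("Vehicle_ID", group_keys=False).apply(add_speed)
        print(df[["Vehicle_ID", "Frame_ID", "speed_ftps"]].head())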
