As of June 2022, more than *** hours of video were uploaded to YouTube every minute. This equates to approximately ****** hours of newly uploaded content per hour. The amount of content on YouTube has increased dramatically as consumers' appetites for online video have grown. In fact, the number of video content hours uploaded every 60 seconds grew by around ** percent between 2014 and 2020.

YouTube global users

Online video is one of the most popular digital activities worldwide, with ** percent of internet users worldwide watching more than ** hours of online video on a weekly basis in 2023. It was estimated that in 2023 YouTube would reach approximately *** million users worldwide. In 2022, the video platform was one of the leading media and entertainment brands worldwide, with a value of more than ** billion U.S. dollars.

YouTube video content consumption

The most viewed YouTube channels of all time have racked up billions of views, millions of subscribers, and cover a wide variety of topics ranging from music to cosmetics. The YouTube channel owner with the most video views is the Indian music label T-Series, which counted ****** billion lifetime views. Other popular YouTubers are gaming personalities such as PewDiePie, DanTDM, and Markiplier.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We used the YouTube Data API to augment the YouTube 8M corpus by crawling a variety of metadata for the videos.

The first point of interest was the "video resource," which comprises data about the video, such as the video's title, description, uploader name, tags, view count, and more. The metadata also records whether comments have been left for the video. If so, we downloaded them as well, including information about their authors, likes, dislikes, and responses.

There is no property that specifies a video's language, since this information is not mandatory when uploading a video. Also, the API provides only information about the available captions, but not the captions themselves. Only the uploader of a video is given access to its captions via the API; we extracted them using youtube-dl. For each video, all manually created captions were downloaded, as well as auto-generated captions in the "default" language and in English. The "default" auto-generated caption gives perhaps the only hint at a video's original language.

Finally, we downloaded all thumbnails used to advertise a video, which are not available via the API, but only via a canonical URL. Our corpus makes it possible to recreate the way a video is presented on YouTube (metadata and thumbnail), what its actual content is ((sub)titles and descriptions), and how its viewers reacted (comments).
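A videos.list request covering the metadata fields described above could be assembled roughly as follows; this is a minimal sketch, and the video ID and API key are placeholders, not values from the corpus.

```python
from urllib.parse import urlencode

API_BASE = "https://www.googleapis.com/youtube/v3/videos"

def build_video_request(video_id: str, api_key: str) -> str:
    # "snippet" covers title, description, uploader (channelTitle) and tags;
    # "statistics" covers view, like and comment counts.
    params = {"part": "snippet,statistics", "id": video_id, "key": api_key}
    return f"{API_BASE}?{urlencode(params)}"

url = build_video_request("dQw4w9WgXcQ", "YOUR_API_KEY")
```

Sending the resulting URL (with a valid key) returns a JSON body whose `items[0].snippet` and `items[0].statistics` objects hold the fields listed above.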
If you use this dataset in your publication, please cite the dataset as outlined in the right column.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
These are the statistics for the top 10 songs of various Spotify artists and their YouTube videos. The creators above generated the data and uploaded it to Kaggle on February 6-7, 2023. The license to use this data is "CC0: Public Domain", allowing the data to be copied, modified, distributed, and worked on without having to ask permission. The data is in numerical and textual CSV format as attached. This dataset contains the statistics and attributes of the top 10 songs of various artists in the world. As described by the creators above, it includes 26 variables for each of the songs collected from Spotify. These variables are briefly described next:
- Track: name of the song, as visible on the Spotify platform.
- Artist: name of the artist.
- Url_spotify: the URL of the artist.
- Album: the album in which the song is contained on Spotify.
- Album_type: indicates whether the song is released on Spotify as a single or contained in an album.
- Uri: a Spotify link used to find the song through the API.
- Danceability: describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
- Energy: a measure from 0.0 to 1.0 representing a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
- Key: the key the track is in. Integers map to pitches using standard Pitch Class notation, e.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1.
- Loudness: the overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing the relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.
- Speechiness: detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
- Acousticness: a confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
- Instrumentalness: predicts whether a track contains no vocals. "Ooh" and "aah" sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly "vocal". The closer the instrumentalness value is to 1.0, the greater the likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
- Liveness: detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
- Valence: a measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
- Tempo: the overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
- Duration_ms: the duration of the track in milliseconds.
- Stream: number of streams of the song on Spotify.
- Url_youtube: URL of the video linked to the song on YouTube, if it has one.
- Title: title of the video clip on YouTube.
- Channel: name of the channel that published the video.
- Views: number of views.
- Likes: number of likes.
- Comments: number of comments.
- Description: description of the video on YouTube.
- Licensed: indicates whether the video represents licensed content, meaning the content was uploaded to a channel linked to a YouTube content partner and then claimed by that partner.
- official_video: boolean value indicating whether the video found is the official video of the song.

The data was last updated on February 7, 2023.
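The threshold semantics described above (e.g., Speechiness above 0.66, Danceability close to 1.0) can be applied directly to rows of the CSV. The records below are invented purely for illustration:

```python
# Hypothetical rows mirroring a few of the columns described above.
tracks = [
    {"Track": "Song A", "Danceability": 0.82, "Energy": 0.91, "Speechiness": 0.05},
    {"Track": "Song B", "Danceability": 0.31, "Energy": 0.40, "Speechiness": 0.71},
]

def is_mostly_spoken(row: dict) -> bool:
    # Per the column description, values above 0.66 indicate tracks
    # probably made entirely of spoken words.
    return row["Speechiness"] > 0.66

# Tracks that are both danceable and energetic (thresholds chosen for the example).
dance_hits = [r["Track"] for r in tracks
              if r["Danceability"] > 0.7 and r["Energy"] > 0.7]
```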
Open Data Commons Attribution License (ODC-By) v1.0: https://www.opendatacommons.org/licenses/by/1.0/
License information was derived automatically
The collection “Protests Belarus 2015 – 2018” contains 78 videos (mp4) on protests on March 25th in 2015–2018 (mainly the Minsk area). We downloaded all data in March–April 2018 and made screenshots (PDF) of the websites so that the discussion and comments on the individual video posts can be followed. All data is processed in an MS Excel database with metadata.
We collect all videos that are 1) event-related AND show actions of this event, 2) downloadable, and 3) findable with our search words during a particular period. We strictly aim at a systematic and objective selection and organized storage of protest-related videos. We identify particular event-related search words after intensive research on the event. Following the snowball principle, we then start the collection of videos with the help of these search words and try to download as much relevant content as possible. However, we cannot guarantee the completeness of protest videos on the particular event. We search for videos and include them in the collection until a particular degree of saturation has been reached. Due to copyright restrictions, we are only allowed to give access to the database of the collected video files, including the hyperlinks and their metadata, and not to the videos themselves.
The videos have been posted mainly by the participants of the events. Therefore, the material is only an extract and biased by the perspective of the single creator.
The collection is part of a larger and ongoing collection of videos on protest events in the post-Soviet region.
The dataset contains the following files:
Filename | Data Format | Description |
01_dataset_scholarly_references_on_YouTube.json.gz | JSON Lines | An integrated dataset of scholarly references in YouTube video descriptions, covering videos posted up to the end of December 2023. This dataset combines the Altmetric dataset and the YA Domain Dataset and is the basis for identifying references to retracted articles. It contains 743,529 scholarly references (386,628 unique DOIs) found in 322,521 YouTube videos uploaded by 77,974 channels. |
02_dataset_references_to_retracted_articles_on_YouTube.json.gz | JSON Lines | A dataset of retracted articles referenced in YouTube videos, used as the primary source for analysis in this paper. The dataset was created by cross-referencing the integrated reference dataset with the Retraction Watch database. It includes metadata such as DOI, article title, retraction reason, and severity classification (Severe, Moderate, or Minor) based on Woo and Walsh (2024), along with video- and channel-level statistics (e.g., view counts and subscriber counts) retrieved via the YouTube Data API v3 as of April 22, 2025. It contains 1,002 retracted articles (360 unique DOIs) found in 956 YouTube videos uploaded by 714 channels. |
03_full_list_table3_sorted_by_reference_count_retracted_articles_on_YouTube.json.gz | JSON Lines | Complete list corresponding to Table 3, "Top 7 retracted articles ranked by the number of YouTube videos in which they are referenced," in the paper. |
04_full_list_table5_top10_most-viewed_video.json.gz | JSON Lines | Complete list corresponding to Table 5, "Top 10 most-viewed YouTube videos that reference retracted articles, sorted by video view count," in the paper. |
05_detailed_manual_coding_40_sampled_retracted_articles.xlsx | XLSX | This file provides detailed annotations for a manually coded sample of 40 YouTube videos referencing retracted scholarly articles. The sample includes 10 randomly selected videos from each of the four analytical groups categorized by publication timing (before/after retraction) and retraction severity (Moderate/Severe). The file includes the reference stance for each video, visual/verbal mention of the article, and relevant timestamps when applicable. This dataset supplements the manual analysis results presented in Tables 6 and 7 in the paper. |
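The cross-referencing step behind file 02 amounts to intersecting the DOIs found in YouTube video descriptions with those recorded in Retraction Watch. The DOIs below are hypothetical placeholders, not entries from either dataset:

```python
# Hypothetical DOI sets illustrating the cross-referencing step.
youtube_refs = {"10.1000/a", "10.1000/b", "10.1000/c"}   # DOIs found in video descriptions
retraction_watch = {"10.1000/b", "10.1000/z"}            # DOIs of retracted articles

# Set intersection yields the retracted articles referenced on YouTube.
retracted_on_youtube = youtube_refs & retraction_watch
```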
Due to concerns over potential misuse (e.g., identification or harassment of individual content creators), this dataset is not made publicly available.
Researchers who wish to use this dataset for scholarly purposes may contact the authors to request access.
Funding
JSPS KAKENHI Grant Numbers JP22K18147 and JP23K11761.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
By Jonathan A. [source]
This dataset provides valuable insights into crisis actor videos and their corresponding recommendations on YouTube. It consists of a total of 8,823 videos, accounting for an astounding 3,956,454,363 views. These videos were retrieved from YouTube's API and cover various categories and topics.
Specifically, this dataset focuses on crisis actor videos related to mass shootings, false flags, and other conspiracy theories that comprise around 20% of the collection. The remaining 80% explores conspiracies revolving around history, government institutions, and religions.
The dataset includes essential information such as the name and channel of the video uploader. Additionally, it provides details about viewer engagement through likes and dislikes counts. Furthermore, each video is assigned a category or topic to facilitate analysis.
It is important to note that approximately 100 music videos were excluded from the initial data set to maintain relevance to crisis actors.
Overall, this project aims to shed light on the prevalent issue of crisis actors on YouTube by providing researchers with a comprehensive dataset for further exploration and analysis. The dataset serves as a valuable resource for investigating trends within crisis actor content while contributing to public awareness of this topic.
- Understanding the Dataset:
The dataset comprises several columns that provide specific information about each video and its corresponding recommendations. Here's a brief overview of the key columns:
- name: The title or name of the YouTube video.
- channel: The name of the YouTube channel that uploaded the video.
- category: The category or topic of the video.
- views: The number of views the video has received.
- likes: The number of likes received by each video.
- dislikes: The number of dislikes received by each video.
- Exploring Categories:
One way to analyze this dataset is by examining different categories mentioned in each video entry. This could involve identifying patterns within categories or comparing engagement metrics (views, likes, dislikes) across various topics.
For example, you might want to investigate how crisis actor videos are categorized compared to other conspiracy-related videos present in this dataset.
- Analyzing Engagement Metrics:
To gain insights into users' response towards different videos related to crisis actors or conspiracy theories, it is recommended that you examine engagement metrics such as views, likes, and dislikes.
You can compare these metrics between individual videos within specific categories or observe trends across all entries.
- Investigating Popularity:
Understanding which channels have maximum viewership within this particular subject area can offer valuable information for further analysis.
Examining which channels have consistently high views or engagement metrics (likes/dislikes) can help identify influential content creators related to crisis actors or conspiracy theories.
- Identifying Recommendations:
The dataset also provides information about the recommendations associated with each video entry. By analyzing these recommendations, you can gain insights into the video content YouTube suggests to users who view crisis actor videos.
You could focus on specific keywords within recommendation titles or explore patterns in terms of topic relevance or common recommendations across multiple entries.
- Cross-Referencing External Information:
As this dataset does not provide detailed descriptions or context for each video, it is advisable to cross-reference external sources to gather additional information if needed.
By using the provided video titles and channel names, you can search for more details about specific videos.
- Analyzing the correlation between likes, dislikes, and views: This dataset can be used to analyze the relationship between the number of likes and dislikes a video receives and its overall views. By examining this relationship, one could gain insights into factors that contribute to increased engagement or disinterest in crisis actor videos.
- Identifying popular YouTube channels in the crisis actor category: By analyzing the dataset, one can identify which YouTube channels have uploaded the most crisis actor videos and have gained high viewership. Th...
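The likes-views relationship suggested above can be examined with a simple Pearson correlation. The sketch below uses made-up numbers purely for illustration, not figures from the dataset:

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical per-video counts; real values would come from the CSV columns.
views = [1200, 5400, 300, 9800]
likes = [30, 150, 5, 400]
r = pearson(views, likes)  # close to 1.0 when likes scale with views
```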
📺 YouTube-CC-BY-Music Annoted 📺

YouTube-CC-BY-Music is a comprehensive collection of prompts and metadata for 50,000 music tracks shared on YouTube. The data has been processed with https://github.com/WaveGenAI/pipelines.
Content
The dataset includes prompts that describe the music, YouTube descriptions, tags, and other metadata associated with 50,000 music videos uploaded to YouTube under the CC-BY license. These videos come from a diverse range of artists and genres, providing… See the full description on the dataset page: https://huggingface.co/datasets/WaveGenAI/youtube-cc-by-music_annoted.
Motivation
The rise of online media has enabled users to choose various unethical and artificial ways of gaining social growth to boost their credibility (number of followers/retweets/views/likes/subscriptions) within a short time period. In this work, we present ABOME, a novel data repository consisting of datasets collected from multiple platforms for the analysis of blackmarket-driven collusive activities, which are prevalent but often unnoticed in online media. ABOME contains data related to tweets and users on Twitter, and to YouTube videos and channels. We believe ABOME is a unique data repository that one can leverage to identify and analyze blackmarket-based temporal fraudulent activities in online media, as well as the network dynamics.
License
Creative Commons License.
Description of the dataset
- Historical Data
We collected the metadata of each entity present in the historical data.
Twitter:
We collected the following fields for retweets and followers on Twitter:
user_details
: A JSON object representing a Twitter user.
tweet_details
: A JSON object representing a tweet.
tweet_retweets
: A JSON list of tweet objects representing the most recent 100 retweets of a given tweet.
https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/user-object↩︎
https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object↩︎
YouTube:
We collected the following fields for YouTube likes and comments:
is_family_friendly:
Whether the video is marked as family friendly or not.
genre:
Genre of the video.
duration:
Duration of the video in ISO 8601 format (duration type). This format is generally used when the duration denotes the amount of intervening time in a time interval.
description:
Description of the video.
upload_date:
Date that the video was uploaded.
is_paid:
Whether the video is paid or not.
is_unlisted:
The privacy status of the video, i.e., whether the video is unlisted or not. Here, the flag unlisted indicates that the video can only be accessed by people who have a direct link to it.
statistics:
A JSON object containing the number of dislikes, views and likes for the video.
comments:
A list of comments for the video. Each element in the list is a JSON object of the text (the comment text) and time (the time when the comment was posted).
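The ISO 8601 durations used for the duration field above can be converted to seconds with a small parser. This sketch handles only the common PT…H…M…S form (no date components such as days):

```python
import re

def iso8601_duration_to_seconds(duration: str) -> int:
    """Convert an ISO 8601 duration such as 'PT1H2M30S' to total seconds.

    Only the time components (H, M, S) are supported here; durations with
    date components (e.g. 'P1DT2H') would need a fuller parser.
    """
    match = re.fullmatch(r"PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?", duration)
    if not match:
        raise ValueError(f"unsupported duration: {duration}")
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds
```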
We collected the following fields for YouTube channels:
channel_description:
Description of the channel.
hidden_subscriber_count:
Whether the subscriber count of the channel is hidden.
published_at:
Time when the channel was created. The time is specified in ISO 8601 format (YYYY-MM-DDThh:mm:ss.sZ).
video_count:
Total number of videos uploaded to the channel.
subscriber_count:
Total number of subscribers of the channel.
view_count:
The number of times the channel has been viewed.
kind:
The API resource type (e.g., youtube#channel for YouTube channels).
country:
The country the channel is associated with.
comment_count:
Total number of comments the channel has received.
etag:
The ETag of the channel which is an HTTP header used for web browser cache validation.
The historical data is stored in five directories named according to the type of data inside it. Each directory contains json files corresponding to the data described above.
- Time-series Data
We collect the following time-series data for retweets and followers on Twitter:
user_timeline
: This is a JSON list of tweet objects in the user’s timeline, which consists of the tweets posted, retweeted and quoted by the user. The file created at each time interval contains the new tweets posted by the user during each time interval.
user_followers
: This is a JSON file containing the user ids of all the followers of a user that were added or removed from the follower list during each time interval.
user_followees
: This is a JSON file consisting of the user ids of all the users followed by a user, i.e., the followees of a user, that were added or removed from the followee list during each time interval.
tweet_details
: This is a JSON object representing a given tweet, collected after every time interval.
tweet_retweets
: This is a JSON list of tweet objects representing the most recent 100 retweets of a given tweet, collected after every time interval.
The time-series data is stored in directories named according to the timestamp of the collection time. Each directory contains sub-directories corresponding to the data described above.
Data Anonymization
The data is anonymized by removing all Personally Identifiable Information (PII) and generating pseudo-IDs corresponding to the original IDs. A consistent mapping between the original and pseudo-IDs is maintained to preserve the integrity of the data.
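One minimal way to implement such a consistent pseudonymization is a keyed hash: the same original ID always maps to the same pseudo-ID, so referential integrity is preserved without publishing a lookup table. This is a generic sketch, not the repository's actual scheme, and the key below is a placeholder:

```python
import hashlib
import hmac

SECRET_KEY = b"replace-with-a-private-key"  # kept separate from the released data

def pseudo_id(original_id: str) -> str:
    """Map an original ID to a stable pseudo-ID.

    HMAC-SHA256 keyed with a private secret: deterministic for a given key,
    but the original ID cannot be recovered from the pseudo-ID alone.
    """
    digest = hmac.new(SECRET_KEY, original_id.encode(), hashlib.sha256)
    return digest.hexdigest()[:16]
```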
Anansi Masters - the story continues
The Anansi Masters project is developed by Vista Far Reaching Visuals (Mr. Jean Hellwig) and partners. It is designed as a public digital platform at http://www.anansimasters.net and opened in 2007. At the website one can find information about the story character of Nanzi (or Anansi or Kweku Ananse), with English- and Dutch-subtitled video recordings of storytelling in several countries in different languages, educational modules about storytelling for use at schools and academies, and digital issues of the Anansi Masters Journal published since the beginning of the project. All storytelling videos and videos that were made for documentation or marketing purposes are published on YouTube. Since 2012 all films of Anansi Masters have been uploaded to YouTube and linked to the Anansi Masters website. Their display is embedded in the website together with the respective metadata, which are entered through a custom-made content management system (CMS).
In March 2012, public storytelling events were organized by Drs. Jean Hellwig (Hellwig Productions AV / Vista Far Reaching Visuals Foundation) on the islands of Curacao and Aruba. Any professional or non-professional storyteller was invited to tell a story in front of the Anansi Masters camera and the available audience. Storytellers were free to choose their story and language. Each storyteller had to agree that the video registration of their story could be made available for open access. Storytellers were asked in front of the camera to answer a few questions about who they are and how they selected the story that they told. The Anansi Masters project started in 2007 with the registration of Kweku Ananse stories in Ghana and The Netherlands. The storytelling events organized on Curacao and Aruba in 2012 were part of the second phase 'Anansi Masters - the story continues'. The project registers contemporary ways of storytelling from an old tradition and aims to stimulate and revitalize the Nanzi storytelling by making the storytelling videos available to a large international audience. In 2008 a DVD in Dutch was released with 22 stories from Ghana and The Netherlands. In 2013 a DVD in English was released with all 32 stories that were recorded on Curaçao and Aruba.
The stories of the Anansi tradition originate in Africa and were exported to other parts of the world through slave trade and migration. In Anansi Masters, the similarities and differences between the stories and storytellers, who tell in their own language, can be found. Anansi Masters initiates different activities all over the world where stories from this oral tradition can be found. The founder has the ambition to film as many stories from this tradition as possible in as many countries as possible. Anansi Masters collaborates with writers, theatre makers, filmmakers, researchers, schools and of course with many many storytellers.
This dataset contains the documentation, video files, documents and pictures that were made to document the second phase of the Anansi Masters project with the subtitle 'the story continues'. These files were produced to report the process and results to the sponsoring funds and to be used in marketing through Facebook.
This dataset contains the following:
- report in Dutch with separate appendices
- videos with datasheets 0015 - 0022 reflecting some of the performances in the media to market the storytelling events
- short video impression with datasheet 0023 of a musical performance at the storytelling event in Curacao
- a list with names and codes of the recorded stories and storytellers
For each storyteller and their stories a new dataset has been created. Links to these datasets can be found under 'Relations'.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset contains visual features extracted from 12,875 movie trailers. The visual features are extracted from key-frames of movie trailers with the VGG-19 CNN, pre-trained on ImageNet.
Movies in the dataset are identified by their MovieLens movieId.
Features_sparse.zip contains the 4096-dimensional feature vectors of each key-frame from every movie.
Visual labels.zip contains the 1000-dimensional label feature vectors of each key-frame from every movie.
DeepCineProp-f.p combines the label features of each movie into a vector space model using tf-idf.
CineSub.p contains the subtitles of each movie represented in a vector space model, pre-processed with various NLP techniques and produced using tf-idf.
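The tf-idf weighting behind DeepCineProp-f.p and CineSub.p is straightforward to sketch. The toy implementation below uses one common variant (term frequency times log inverse document frequency); the thesis's exact weighting scheme may differ, and the documents are invented:

```python
import math
from collections import Counter

# Toy "documents" standing in for per-movie label or subtitle term lists.
docs = [
    "space adventure on mars".split(),
    "romantic comedy in paris".split(),
    "mars rover documentary".split(),
]

def tfidf(docs):
    """Build tf-idf weighted vectors, one dict of term -> weight per document."""
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n / df[term])
            for term, count in tf.items()
        })
    return vectors

vecs = tfidf(docs)
```

Terms shared across documents (like "mars") receive a lower weight than terms unique to one document (like "space"), which is what makes the vectors useful for comparing movies.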
Abstract:
When a movie is uploaded to a movie recommender system (e.g., YouTube), the system can exploit various forms of descriptive features (e.g., tags and genre) in order to generate personalized recommendations for users. However, there are situations where the descriptive features are missing or very limited, and the system may fail to include such a movie in the recommendation list; this is known as the cold-start problem. This thesis investigates recommendation based on a novel form of content features, extracted from movies, in order to generate recommendations for users. Such features represent the visual aspects of movies, based on deep learning models, and hence do not require any human annotation when extracted. The proposed technique has been evaluated in both offline and online evaluations using a large dataset of movies. The online evaluation has been carried out in an evaluation framework developed for this thesis. Results from the offline and online evaluation (N=150) show that automatically extracted visual features can mitigate the cold-start problem by generating recommendations of superior quality compared to different baselines, including recommendation based on human-annotated features. The results also point to subtitles as a high-quality future source of automatically extracted features.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The associated files have been created for, and are analysed in, a forthcoming article entitled 'Automatic Identification of Hate Speech – A Case-Study of Alt-Right YouTube Videos'. The material is divided into six tables as follows:
Sentence top 5% | The 19th 20-quantile of predicted most hateful sentences |
Sentence bottom 5% | The bottom 20-quantile of predicted most hateful sentences (the least likely to contain hate speech) |
Paragraphs | Prediction and annotation of paragraphs |
Video top 10% | Titles of the top decile of predicted hateful videos |
Video bottom 10% | Titles of the bottom decile of predicted hateful videos |
Video bottom 10% - Alt right | Titles of the bottom decile of predicted hateful videos without History |
The data is uploaded in two formats:
Excel file: Automatic_Detection_of_Hate_Speech_a_Case-Study_of_Alt-Right_Videos.xlsx contains all six tables in one file, with a supplementary codebook.
Tab Separated Values (TSV): Each file corresponds to a single sheet from the Excel file and is named accordingly. UTF-8 encoded.
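Reading the TSV files back is straightforward with the standard library's csv module; the sample data below is made up to keep the sketch self-contained, and real use would open one of the released files instead of an in-memory string:

```python
import csv
import io

# A miniature stand-in for one of the released TSV sheets (UTF-8, tab-separated).
tsv_data = "title\tscore\nVideo A\t0.91\nVideo B\t0.12\n"

# DictReader uses the first row as column names; delimiter="\t" selects TSV.
rows = list(csv.DictReader(io.StringIO(tsv_data), delimiter="\t"))
```

For a file on disk, replace the `io.StringIO(...)` with `open(path, encoding="utf-8", newline="")`.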
VGGSound
VGG-Sound is an audio-visual correspondence dataset consisting of short clips of audio sounds, extracted from videos uploaded to YouTube.
Homepage: https://www.robots.ox.ac.uk/~vgg/data/vggsound/ Paper: https://arxiv.org/abs/2004.14368 Github: https://github.com/hche11/VGGSound
Analysis
310+ classes: VGG-Sound contains audio spanning a large number of challenging acoustic environments and noise characteristics of real applications. 200,000+ videos: All… See the full description on the dataset page: https://huggingface.co/datasets/Loie/VGGSound.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Abstract This study aimed to analyze the videos available on YouTube related to dentistry and the novel coronavirus (COVID-19), as there is no such analysis in the existing literature. The terms “dental” and “COVID-19” were searched on YouTube on May 9, 2020. The top 116 English-language videos with at least 300 views were analyzed by two observers. Data was saved for each video, including target audience, source, country of origin, content, number of views, time watched, average views, duration, like/dislike ratio, and usefulness. Total video information and quality index (VIQI) scores were calculated, consisting of flow, information, accuracy, quality, and precision indices. Non-parametric tests were used for analysis. The analyzed videos were viewed 375,000 times and totaled 20 h of content. Most videos were uploaded by dentists (45.7%), originated from the United States (79.3%), and contained information targeted towards patients (48.3%). Nearly half of the videos (47.4%) were moderately useful. For the usefulness of the videos, statistically significant differences were found for all indices as well as total VIQI scores. A comparison of the indices according to the relevance of the videos showed statistically significant differences in the videos’ information and precision indices and total VIQI scores. The results of this study showed that dentistry YouTube videos related to COVID-19 had high view numbers; however, the videos were generally moderate in quality and usefulness.
Paper: Barari, Soubhik, and Tyler Simko. "LocalView, a database of public meetings for the study of local politics and policy-making in the United States." Nature: Scientific Data 10.1 (2023): 135. Abstract: Despite the fundamental importance of American local governments for service provision in areas like education and public health, local policy-making remains difficult and expensive to study at scale due to a lack of centralized data. This article introduces LocalView , the largest existing dataset of real-time local government public meetings – the central policy-making process in local government. In sum, the dataset currently covers 139,616 videos and their corresponding textual and audio transcripts of local government meetings publicly uploaded to YouTube – the world’s largest public video-sharing website – from 1,012 places and 2,861 distinct governments across the United States between 2006-2022. The data are processed, downloaded, cleaned, and publicly disseminated (at localview.net) for analysis across places and over time. We validate this dataset using a variety of methods and demonstrate how it can be used to map local governments’ attention to policy areas of interest. Finally, we discuss how LocalView may be used by journalists, academics, and other users for understanding how local communities deliberate crucial policy questions on topics including climate change, public health, and immigration.
VoxCeleb 2
VoxCeleb2 contains over 1 million utterances for 6,112 celebrities, extracted from videos uploaded to YouTube.
Verification Split
| | train | validation | test |
| Speakers | 5,994 | 5,994 | 118 |
| Utterances | 982,808 | 109,201 | 36,237 |
Data Fields
ID (string): The ID of the sample with format
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
The dataset contains two basic attributes from which you can extract an arrangement of exciting features, starting from DateTime-based features up to text-based features.
The first is the time in the video at which the comment was posted; it is important to note that the live stream started at 2:15 EST.
The second is the comment that was posted; here, it is important to note that non-English comments were removed.
I think it might be interesting to get a better understanding of how people around the world reacted to the rover landing on Mars and the content shown in the video. There were many points where the video lagged, or the site crashed.
During the first half of 2023, the majority of copyright claims received by YouTube were spotted by the platform's Content ID tool, which cross-checks uploaded videos against a large database of reference files. Over 2.75 million claims were submitted via the Copyright Match Tool, while approximately two million claims were submitted to the platform via webforms.
Facebook received 73,390 user data requests from federal agencies and courts in the United States during the second half of 2023. The social network produced some user data in 88.84 percent of requests from U.S. federal authorities. The United States accounts for the largest share of Facebook user data requests worldwide.