https://brightdata.com/license
Use our YouTube Videos dataset to extract detailed information from public videos and filter by video title, views, upload date, or likes. Data points include video URL, title, description, thumbnail, upload date, view count, like count, comment count, tags, and more. You can purchase the entire dataset or a customized subset, tailored to your needs. Popular use cases for this dataset include trend analysis, content performance tracking, brand monitoring, and influencer campaign optimization.
As of June 2022, more than *** hours of video were uploaded to YouTube every minute. This equates to approximately ****** hours of newly uploaded content per hour. The amount of content on YouTube has increased dramatically as consumers' appetite for online video has grown. In fact, the number of video content hours uploaded every 60 seconds grew by around ** percent between 2014 and 2020.
YouTube global users
Online video is one of the most popular digital activities worldwide, with ** percent of internet users worldwide watching more than ** hours of online videos on a weekly basis in 2023. It was estimated that in 2023 YouTube would reach approximately *** million users worldwide. In 2022, the video platform was one of the leading media and entertainment brands worldwide, with a value of more than ** billion U.S. dollars.
YouTube video content consumption
The most viewed YouTube channels of all time have racked up billions of views and millions of subscribers, and cover a wide variety of topics ranging from music to cosmetics. The YouTube channel owner with the most video views is Indian music label T-Series, which counted ****** billion lifetime views. Other popular YouTubers are gaming personalities such as PewDiePie, DanTDM and Markiplier.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
YouTube is the world's largest video-sharing platform, launched in 2005. It allows users to upload, view, and share videos, and has grown to be a central hub for content creators across various fields, including entertainment, education, music, and more. With over 2 billion logged-in users monthly, YouTube has become an essential platform for digital content and marketing.
The Top 1000 YouTube Channels Dataset captures detailed information about the top-performing YouTube channels globally. This dataset includes the following columns:
This dataset is invaluable for analyzing trends, understanding content strategies, and benchmarking channel performances within the YouTube ecosystem.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
These are the statistics for the top 10 songs of various Spotify artists and their YouTube videos. The creators generated the data and uploaded it to Kaggle on February 6-7, 2023. The license to use this data is "CC0: Public Domain", allowing the data to be copied, modified, distributed, and worked on without having to ask permission. The data is in numerical and textual CSV format as attached. This dataset contains the statistics and attributes of the top 10 songs of various artists in the world. As described by the creators, it includes 26 variables for each of the songs collected from Spotify. These variables are briefly described next:
Track: name of the song, as visible on the Spotify platform.
Artist: name of the artist.
Url_spotify: the URL of the artist.
Album: the album in which the song is contained on Spotify.
Album_type: indicates whether the song is released on Spotify as a single or contained in an album.
Uri: a Spotify link used to find the song through the API.
Danceability: describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
Energy: a measure from 0.0 to 1.0 that represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
Key: the key the track is in. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1.
Loudness: the overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.
Speechiness: detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
Acousticness: a confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
Instrumentalness: predicts whether a track contains no vocals. "Ooh" and "aah" sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly "vocal". The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
Liveness: detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
Valence: a measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
Tempo: the overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
Duration_ms: the duration of the track in milliseconds.
Stream: number of streams of the song on Spotify.
Url_youtube: URL of the video linked to the song on YouTube, if it has one.
Title: title of the video clip on YouTube.
Channel: name of the channel that published the video.
Views: number of views.
Likes: number of likes.
Comments: number of comments.
Description: description of the video on YouTube.
Licensed: indicates whether the video represents licensed content, which means that the content was uploaded to a channel linked to a YouTube content partner and then claimed by that partner.
official_video: boolean value that indicates whether the video found is the official video of the song.
The data was last updated on February 7, 2023.
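The Key and Speechiness encodings described above can be decoded as in the following minimal Python sketch. The function names and band labels are hypothetical, and the handling of values exactly at 0.33 and 0.66 is an assumption, since the description does not say which band the boundaries belong to.

```python
# Pitch Class notation: index 0 = C, 1 = C♯/D♭, 2 = D, and so on.
PITCH_CLASSES = [
    "C", "C♯/D♭", "D", "D♯/E♭", "E", "F",
    "F♯/G♭", "G", "G♯/A♭", "A", "A♯/B♭", "B",
]


def key_name(key: int) -> str:
    """Translate the integer Key column into a pitch name."""
    if key == -1:
        return "no key detected"
    if not 0 <= key <= 11:
        raise ValueError(f"key must be -1 or in 0..11, got {key}")
    return PITCH_CLASSES[key]


def speechiness_band(value: float) -> str:
    """Bucket a Speechiness value using the 0.33 and 0.66 thresholds.

    Assumption: boundary values fall into the middle band.
    """
    if value > 0.66:
        return "mostly speech"
    if value >= 0.33:
        return "music and speech"
    return "mostly music"
```

For example, `key_name(2)` gives `"D"`, and a rap track with a Speechiness of 0.5 falls into the `"music and speech"` band.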
https://creativecommons.org/publicdomain/zero/1.0/
YouTube was created in 2005, with the first video, Me at the Zoo, being uploaded on 23 April 2005. Since then, 1.3 billion people have set up YouTube accounts. In 2018, people watched nearly 5 billion videos each day. People upload 300 hours of video to the site every minute.
According to 2016 research undertaken by Pexeso, music only accounts for 4.3% of YouTube’s content. Yet it makes 11% of the views. Clearly, an awful lot of people watch a comparatively small number of music videos. It should be no surprise, therefore, that the most watched videos of all time on YouTube are predominantly music videos.
On August 13, BTS became the most-viewed artist in YouTube history, accumulating over 26.7 billion views across all their official channels. This count includes all music videos and dance practice videos.
Justin Bieber and Ed Sheeran now hold the records for second and third-highest views, with over 26 billion views each.
Currently, BTS’s most viewed videos are their music videos for “**Boy With Luv**,” “**Dynamite**,” and “**DNA**,” which all have over 1.4 billion views.
Headers of the Dataset
Total = Total views (in millions) across all official channels
Avg = Current daily average of all videos combined
100M = Number of videos with more than 100 million views
YouTube is an American online video-sharing platform headquartered in San Bruno, California. The service, created in February 2005 by three former PayPal employees—Chad Hurley, Steve Chen, and Jawed Karim—was bought by Google in November 2006 for US$1.65 billion and now operates as one of the company's subsidiaries. YouTube is the second most-visited website after Google Search, according to Alexa Internet rankings.
YouTube allows users to upload, view, rate, share, add to playlists, report, comment on videos, and subscribe to other users. Available content includes video clips, TV show clips, music videos, short and documentary films, audio recordings, movie trailers, live streams, video blogging, short original videos, and educational videos.
YouTube (the world-famous video sharing website) maintains a list of the top trending videos on the platform. According to Variety magazine, “To determine the year’s top-trending videos, YouTube uses a combination of factors including measuring users interactions (number of views, shares, comments, and likes). Note that they’re not the most-viewed videos overall for the calendar year”. Top performers on the YouTube trending list are music videos (such as the famously viral “Gangnam Style”), celebrity and/or reality TV performances, and the random dude-with-a-camera viral videos that YouTube is well-known for.
This dataset is a daily record of the top trending YouTube videos.
Note that this dataset is a structurally improved version of this dataset.
This dataset was collected using the YouTube API. This Description is cited in Wikipedia.
This dataset is designed to explore multistreaming social media video as a research method used to collect semi-structured interview data. The data are provided by Dr Karen E. Sutherland and Ms Krisztina Morris from the School of Business and Creative Industries at the University of the Sunshine Coast in Queensland, Australia. The dataset is drawn from the publicly available video recording of an interview undertaken as part of the research project called: ‘Like, Share, Follow’, a multistreaming show, featuring Dr Sutherland interviewing university graduates about their career journeys, that is broadcast across Facebook, LinkedIn, and Twitter and later uploaded to YouTube. This dataset examines how multistreaming video interview data can be used to answer research questions and the benefits and challenges this specific method of data collection can pose in the process of data analysis. The video example is accompanied by a teaching guide and a student guide.
https://creativecommons.org/publicdomain/zero/1.0/
This dataset contains detailed information about IShowSpeed's YouTube channel, including metadata for all his videos and associated performance metrics. IShowSpeed, widely known for his energetic and entertaining content, has amassed a massive following on YouTube. This dataset provides insights into his content strategy, video performance, and audience engagement patterns.
Use Cases:
- Analyze patterns in video uploads and performance over time.
- Study audience interaction through likes, comments, and views.
- Forecast future growth and engagement based on historical data.
- Identify video features that resonate most with viewers.
The dataset contains the following files:
Filename | Data Format | Description |
01_dataset_scholarly_references_on_YouTube.json.gz | JSON Lines | An integrated dataset of scholarly references in YouTube video descriptions, covering videos posted up to the end of December 2023. This dataset combines the Altmetric dataset and the YA Domain Dataset and is the basis for identifying references to retracted articles. This dataset contains 743,529 scholarly references (386,628 unique DOIs) found in 322,521 YouTube videos uploaded by 77,974 channels. |
02_dataset_references_to_retracted_articles_on_YouTube.json.gz | JSON Lines | A dataset of retracted articles referenced in YouTube videos, used as the primary source for analysis in this paper. The dataset was created by cross-referencing the integrated reference dataset with the Retraction Watch database. It includes metadata such as DOI, article title, retraction reason, and severity classification (Severe, Moderate, or Minor) based on Woo and Walsh (2024), along with video- and channel-level statistics (e.g., view counts and subscriber counts) retrieved via the YouTube Data API v3 as of April 22, 2025. This dataset contains 1,002 retracted articles (360 unique DOIs) found in 956 YouTube videos uploaded by 714 channels. |
03_full_list_table3_sorted_by_reference_count_retracted_articles_on_YouTube.json.gz | JSON Lines | Complete list corresponding to Table 3, "Top 7 retracted articles ranked by the number of YouTube videos in which they are referenced." in the paper. |
04_full_list_table5_top10_most-viewed_video.json.gz | JSON Lines | Complete list corresponding to Table 5, "Top 10 most-viewed YouTube videos that reference retracted articles, sorted by video view count." in the paper. |
05_detailed_manual_coding_40_sampled_retracted_articles.xlsx | XLSX | This file provides detailed annotations for a manually coded sample of 40 YouTube videos referencing retracted scholarly articles. The sample includes 10 randomly selected videos from each of the four analytical groups categorized by publication timing (before/after retraction) and retraction severity (Moderate/Severe). The file includes reference stance for each video, visual/verbal mention of the article, and relevant timestamps when applicable. This dataset supplements the manual analysis results presented in Tables 6 and 7 in the paper. |
Due to concerns over potential misuse (e.g., identification or harassment of individual content creators), this dataset is not made publicly available.
Researchers who wish to use this dataset for scholarly purposes may contact the authors to request access.
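The `.json.gz` files listed above are described as gzip-compressed JSON Lines, which can be streamed one record at a time. The following is a minimal Python sketch using only the standard library; the function name `read_jsonl_gz` is hypothetical, and no particular record keys are assumed because the field names are not spelled out here.

```python
import gzip
import json
from typing import Iterator


def read_jsonl_gz(path: str) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of a gzipped JSON Lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines between records
                yield json.loads(line)
```

A call such as `for record in read_jsonl_gz("01_dataset_scholarly_references_on_YouTube.json.gz"): ...` would then process the file without loading it all into memory.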
Funding
JSPS KAKENHI Grant Numbers JP22K18147 and JP23K11761.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The dataset was first introduced in the following paper: Siqi Wu, Marian-Andrei Rizoiu, and Lexing Xie. Beyond Views: Measuring and Predicting Engagement in Online Videos. In AAAI International Conference on Weblogs and Social Media (ICWSM), 2018.
Tweeted videos dataset
This dataset contains YouTube videos published between July 1st and August 31st, 2016. To be collected, a video needs to (a) be mentioned on Twitter during the aforementioned collection period; (b) have insight statistics available; and (c) have at least 100 views within the first 30 days after upload.
Quality videos datasets
These datasets contain videos deemed of high quality by domain experts.
Vevo videos: Videos of verified Vevo artists, as of August 31st, 2016.
Billboard16 videos: Videos of 2016 Billboard Hot 100 chart.
Top news videos: Videos of top 100 most viewed News channels.
freebase_mid_type_name.csv
It maps a freebase mid to a real-world entity. See more details in this data description.
During the first half of 2023, the majority of copyright claims received by YouTube were spotted by the platform's Content ID tool, which cross-checks uploaded videos against a larger file database. Over 2.75 million claims were submitted via Copyright Match Tool, while approximately two million claims were submitted to the platform via webforms.
Our dataset offers a unique blend of attributes from YouTube and Google Maps, empowering users with comprehensive insights into online content and geographical reach. Let's delve into what makes our data stand out:
Unique Attributes:
- From YouTube: Detailed video information including title, description, upload date, video ID, and channel URL. Video metrics such as views, likes, comments, and duration are also provided.
- Creator Info: Access author details like name and channel URL.
- Channel Information: Gain insights into channel title, description, location, join date, and visual branding elements like logo and banner URLs.
- Channel Metrics: Understand a channel's performance with metrics like total views, subscribers, and video count.
- Google Maps Integration: Explore business ratings from Google My Business and location data from Google Maps.
Data Sourcing: - Our data is meticulously sourced from publicly available information on YouTube and Google Maps, ensuring accuracy and reliability.
Primary Use-Cases:
- Marketing: Analyze video performance metrics to optimize content strategies.
- Research: Explore trends in creator behavior and audience engagement.
- Location-Based Insights: Utilize Google Maps data for market research, competitor analysis, and location-based targeting.
Fit within Broader Offering: - This dataset complements our broader data offering by providing rich insights into online content consumption and geographical presence. It enhances decision-making processes across various industries, including marketing, advertising, research, and business intelligence.
Usage Examples:
- Marketers can identify popular video topics and optimize advertising campaigns accordingly.
- Researchers can analyze audience engagement patterns to understand viewer preferences.
- Businesses can assess their Google My Business ratings and geographical distribution for strategic planning.
With scalable solutions and high-quality data, our dataset offers unparalleled depth for extracting actionable insights and driving informed decisions in the digital landscape.
https://creativecommons.org/publicdomain/zero/1.0/
Dataset Description:
Column1: Video ID of 11 characters (string).
Column2: Uploader of the video (string).
Column3: Interval between the date YouTube was established and the date the video was uploaded (integer).
Column4: Category of the video (string).
Column5: Length of the video (integer).
Column6: Number of views for the video (integer).
Column7: Rating of the video (float).
Column8: Number of ratings given to the video.
Column9: Number of comments on the video (integer).
Column10: IDs of videos related to the uploaded video.
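The column layout above can be mapped onto a typed record, as in the following Python sketch. Several points are assumptions that the description does not confirm: that each row is a single tab-separated line, that the related video IDs occupy every field from the tenth onward, and that Column8 is an integer. The class and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class VideoRecord:
    video_id: str           # Column1
    uploader: str           # Column2
    upload_interval: int    # Column3 (unit not stated in the description)
    category: str           # Column4
    length: int             # Column5 (unit not stated in the description)
    views: int              # Column6
    rating: float           # Column7
    rating_count: int       # Column8 (assumed integer)
    comment_count: int      # Column9
    related_ids: List[str]  # Column10 (assumed: all remaining fields)


def parse_record(line: str, sep: str = "\t") -> VideoRecord:
    """Parse one row; the tab separator is an assumption about the file format."""
    fields = line.rstrip("\n").split(sep)
    if len(fields) < 9:
        raise ValueError(f"expected at least 9 fields, got {len(fields)}")
    return VideoRecord(
        video_id=fields[0],
        uploader=fields[1],
        upload_interval=int(fields[2]),
        category=fields[3],
        length=int(fields[4]),
        views=int(fields[5]),
        rating=float(fields[6]),
        rating_count=int(fields[7]),
        comment_count=int(fields[8]),
        related_ids=[f for f in fields[9:] if f],
    )
```

Keeping the separator as a parameter makes it easy to adapt the sketch if the actual files turn out to use a different delimiter.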
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We used the YouTube Data API to augment the YouTube 8M corpus by crawling a variety of meta data for the videos.
First point of interest was the "video resource," which comprises data about the video, such as the video’s title, description, uploader name, tags, view count, and more. Also included in the meta data is whether comments have been left for the video. If so, we downloaded them as well, including information about their authors, likes, dislikes, and responses.
There is no property which specifies a video’s language, since this information is not mandatory when uploading a video. Also, the API provides only information about the available captions, but not the captions themselves. Only the uploader of a video is given access to its captions via the API; we extracted them using youtube-dl. For each video, all manually created captions were downloaded, and auto-generated captions in the "default" language and English. The "default" auto-generated caption gives perhaps the only hint at a video’s original language.
Finally, we downloaded all thumbnails used to advertise a video, which are not available via the API, but only via a canonical URL. Our corpus provides the possibility to recreate the way a video is presented on YouTube (meta data and thumbnail), what the actual content is ((sub)titles and descriptions), and how its viewers reacted (comments).
If you use this dataset in your publication, please cite the dataset as outlined in the right column.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We are publishing a dataset we created for HTTPS traffic classification.
Since the data were captured mainly in a real backbone network, we omitted IP addresses and ports. The datasets consist of features calculated from bidirectional flows exported with the flow probe ipfixprobe. This exporter can export a sequence of packet lengths and times and a sequence of packet bursts and times. For more information, please visit the ipfixprobe repository.
During our research, we divided HTTPS into five categories: L -- Live Video Streaming, P -- Video Player, M -- Music Player, U -- File Upload, D -- File Download, W -- Website, and other traffic.
We have chosen the service representatives known for particular traffic types based on the Alexa Top 1M list and Moz's list of the most popular 500 websites for each category. We also used several popular websites that primarily focus on the audience in our country. The identified traffic classes and their representatives are provided below:
Live Video Stream: Twitch, Czech TV, YouTube Live
Video Player: DailyMotion, Stream.cz, Vimeo, YouTube
Music Player: AppleMusic, Spotify, SoundCloud
File Upload/Download: FileSender, OwnCloud, OneDrive, Google Drive
Website and Other Traffic: Websites from Alexa Top 1M list
https://creativecommons.org/publicdomain/zero/1.0/
The data was scraped using the YouTube API.
videoId: A unique video ID of the YouTube video.
publishedAt: Date of upload of the video.
channelID: A unique channel ID of the YouTube channel.
title: The title of the YouTube video.
channelTitle: The name of the channel.
channelType: The YouTube category ID of the channel type.
Motivation
The rise of online media has enabled users to choose various unethical and artificial ways of gaining social growth to boost their credibility (number of followers/retweets/views/likes/subscriptions) within a short time period. In this work, we present ABOME, a novel data repository consisting of datasets collected from multiple platforms for the analysis of blackmarket-driven collusive activities, which are prevalent but often unnoticed in online media. ABOME contains data related to tweets and users on Twitter, YouTube videos, YouTube channels. We believe ABOME is a unique data repository that one can leverage to identify and analyze blackmarket based temporal fraudulent activities in online media as well as the network dynamics.
License
Creative Commons License.
Description of the dataset
- Historical Data
We collected the metadata of each entity present in the historical data
Twitter:
We collected the following fields for retweets and followers on Twitter:
user_details:
A JSON object representing a Twitter user (https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/user-object).
tweet_details:
A JSON object representing a tweet (https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object).
tweet_retweets:
A JSON list of tweet objects representing the most recent 100 retweets of a given tweet.
YouTube:
We collected the following fields for YouTube likes and comments:
is_family_friendly:
Whether the video is marked as family friendly or not.
genre:
Genre of the video.
duration:
Duration of the video in ISO 8601 format (duration type). This format is generally used when the duration denotes the amount of intervening time in a time interval.
description:
Description of the video.
upload_date:
Date that the video was uploaded.
is_paid:
Whether the video is paid or not.
is_unlisted:
The privacy status of the video, i.e., whether the video is unlisted or not. Here, the flag unlisted indicates that the video can only be accessed by people who have a direct link to it.
statistics:
A JSON object containing the number of dislikes, views and likes for the video.
comments:
A list of comments for the video. Each element in the list is a JSON object of the text (the comment text) and time (the time when the comment was posted).
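The duration field above is given as an ISO 8601 duration string. The following Python sketch converts such strings to a `timedelta`; it covers only the day, hour, minute, and whole-second designators (an assumption about which forms appear in the data) and rejects anything else. The name `parse_duration` is hypothetical.

```python
import re
from datetime import timedelta

# Supports forms such as "PT4M13S", "PT1H", and "P1DT2H".
# Years, months, weeks, and fractional seconds are deliberately not handled.
_DURATION = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)


def parse_duration(value: str) -> timedelta:
    """Convert a subset of ISO 8601 duration strings to a timedelta."""
    match = _DURATION.match(value)
    if match is None or value in ("P", "PT"):
        raise ValueError(f"unsupported ISO 8601 duration: {value!r}")
    parts = {
        name: int(number)
        for name, number in match.groupdict().items()
        if number is not None
    }
    return timedelta(**parts)
```

With this helper, `parse_duration("PT4M13S").total_seconds()` evaluates to `253.0`, which is convenient for sorting or aggregating videos by length.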
We collected the following fields for YouTube channels:
channel_description:
Description of the channel.
hidden_subscriber_count:
Total number of hidden subscribers of the channel.
published_at:
Time when the channel was created. The time is specified in ISO 8601 format (YYYY-MM-DDThh:mm:ss.sZ).
video_count:
Total number of videos uploaded to the channel.
subscriber_count:
Total number of subscribers of the channel.
view_count:
The number of times the channel has been viewed.
kind:
The API resource type (e.g., youtube#channel for YouTube channels).
country:
The country the channel is associated with.
comment_count:
Total number of comments the channel has received.
etag:
The ETag of the channel which is an HTTP header used for web browser cache validation.
The historical data is stored in five directories named according to the type of data inside them. Each directory contains JSON files corresponding to the data described above.
- Time-series Data
We collect the following time-series data for retweets and followers on Twitter:
user_timeline:
This is a JSON list of tweet objects in the user’s timeline, which consists of the tweets posted, retweeted and quoted by the user. The file created at each time interval contains the new tweets posted by the user during each time interval.
user_followers:
This is a JSON file containing the user ids of all the followers of a user that were added or removed from the follower list during each time interval.
user_followees:
This is a JSON file consisting of the user ids of all the users followed by a user, i.e., the followees of a user, that were added or removed from the followee list during each time interval.
tweet_details:
This is a JSON object representing a given tweet, collected after every time interval.
tweet_retweets:
This is a JSON list of tweet objects representing the most recent 100 retweets of a given tweet, collected after every time interval.
The time-series data is stored in directories named according to the timestamp of the collection time. Each directory contains sub-directories corresponding to the data described above.
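The follower and followee files above record the IDs added or removed during each interval. As an illustration only, the sketch below shows how such a per-interval change set could be derived from two consecutive snapshots of a follower list. The snapshot inputs and the function name are hypothetical; the repository itself is described as storing the changes, not full snapshots.

```python
from typing import Dict, Iterable, Set


def follower_changes(previous: Iterable[str], current: Iterable[str]) -> Dict[str, Set[str]]:
    """Return the IDs added and removed between two snapshots of a follower list."""
    prev, curr = set(previous), set(current)
    return {
        "added": curr - prev,    # present now, absent before
        "removed": prev - curr,  # present before, absent now
    }
```

Applying the recorded change sets in timestamp order would, conversely, let an analyst reconstruct how a follower list evolved over the collection period.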
Data Anonymization
The data is anonymized by removing all Personally Identifiable Information (PII) and generating pseudo-IDs corresponding to the original IDs. A consistent mapping between the original and pseudo-IDs is maintained to preserve the integrity of the data.
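A consistent mapping of this kind can be sketched as follows. This is not the authors' procedure, which is not described in detail; it is a minimal Python illustration, with a hypothetical class name, of how the same original ID can always receive the same random pseudo-ID so that links between records survive anonymization.

```python
import secrets
from typing import Dict, Set


class Pseudonymizer:
    """Assign each original ID a stable, random pseudo-ID."""

    def __init__(self) -> None:
        self._mapping: Dict[str, str] = {}
        self._used: Set[str] = set()

    def pseudo_id(self, original_id: str) -> str:
        if original_id not in self._mapping:
            candidate = secrets.token_hex(8)
            while candidate in self._used:  # guard against rare collisions
                candidate = secrets.token_hex(8)
            self._used.add(candidate)
            self._mapping[original_id] = candidate
        return self._mapping[original_id]
```

Because the pseudo-IDs are random rather than derived from the originals, they cannot be reversed without the mapping table, which would have to be kept private.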
The videoviewer extension for CKAN aims to enhance the data catalog's capabilities by enabling direct video viewing or embedding of videos associated with datasets. While the provided documentation is minimal, it suggests the extension focuses on facilitating the integration and playback of video resources within the CKAN platform. It appears to allow CKAN to better handle and present video-based data resources, making them more accessible to users.
Key Features (Inferred from the context of a video viewer extension):
Video Resource Integration: Likely allows linking or embedding video resources (e.g., from YouTube, Vimeo, or direct file uploads) to datasets within CKAN.
Inline Video Playback: Potentially provides a built-in video player within the CKAN interface, allowing users to view videos directly without leaving the platform.
Configuration Settings (Assumed): May offer configuration options for specifying supported video formats, player settings, or integration with third-party video hosting services.
Metadata Display (Inferred): Could display video-related metadata, such as duration, resolution, or upload date, alongside the video player.
Theming Integration (Expected): Should seamlessly integrate with CKAN's theming system to provide a consistent user experience.
Technical Integration: Though specific details are not provided, installation instructions suggest integrating the extension by adding videoviewer to the ckan.plugins setting in the CKAN configuration file. Activation involves installing the Python package and restarting CKAN.
Benefits & Impact (Predicted): While the documentation is sparse, based on common video viewer features, we can assume the video viewer extension would improve the accessibility and utility of video-based data resources managed within CKAN. This enhancement will likely increase user engagement and provide a richer data discovery experience.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Authors from the Czech Republic have published a dataset for HTTPS traffic classification.
Since the data were captured mainly in a real backbone network, they omitted IP addresses and ports. The datasets consist of features calculated from bidirectional flows exported with the flow probe ipfixprobe. This exporter can export a sequence of packet lengths and times and a sequence of packet bursts and times. For more information, please visit the ipfixprobe repository.
During their research, they divided HTTPS into five categories: L -- Live Video Streaming, P -- Video Player, M -- Music Player, U -- File Upload, D -- File Download, W -- Website, and other traffic.
They chose the service representatives known for particular traffic types based on the Alexa Top 1M list and Moz's list of the most popular 500 websites for each category. They also used several popular websites that primarily focus on the audience in the Czech Republic. The identified traffic classes and their representatives are provided below:
Live Video Stream: Twitch, Czech TV, YouTube Live
Video Player: DailyMotion, Stream.cz, Vimeo, YouTube
Music Player: AppleMusic, Spotify, SoundCloud
File Upload/Download: FileSender, OwnCloud, OneDrive, Google Drive
Website and Other Traffic: Websites from Alexa Top 1M list
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Reasons for excluding videos for each search term.