Throughout 2024, the majority of copyright claims received by YouTube were spotted by the platform's Content ID tool, which cross-checks uploaded videos against a larger file database. Over 2.2 billion claims were submitted via Copyright Match Tool, while approximately three million claims were submitted to the platform via webforms.
https://brightdata.com/license
Use our YouTube profiles dataset to extract both business and non-business information from public channels and filter by channel name, views, creation date, or subscribers. Datapoints include URL, handle, banner image, profile image, name, subscribers, description, video count, create date, views, details, and more. You may purchase the entire dataset or a customized subset, depending on your needs. Popular use cases for this dataset include sentiment analysis, brand monitoring, influencer marketing, and more.
What contender will emerge as the next big creator economy company? To find out, we've built a database of more than 500 global startups serving the millions of individuals making money off their online followings. Many founders see an opportunity to help creators connect with fans. Others have developed artificial intelligence tools or financial management services for creators. U.S. creator startups have raised more than $9.8 billion since early 2021, and creator startups based outside the U.S. have raised more than $4 billion in that period. The database comes from our reporting, founders and investors, and estimates from PitchBook.
Open Data Commons Attribution License (ODC-By) v1.0 https://www.opendatacommons.org/licenses/by/1.0/
License information was derived automatically
How YouTube videos go viral or, more generally, how they evolve in terms of views, likes, and subscriptions is a topic of interest in many disciplines. This dataset lets you study such phenomena using statistics about 1 million YouTube videos. The information was collected in 2013, when YouTube still exposed the data publicly; the functionality has since been removed, and such statistics are now available only to the owner of a video. This makes the dataset unique.
This dataset was generated with YOUStatAnalyzer, a tool I (Mattia Zeni) developed while working for CREATE-NET (www.create-net.org) within the framework of the CONGAS FP7 project (http://www.congas-project.eu). For the project we needed to collect and analyse the dynamics of YouTube video popularity. The dataset contains statistics for more than 1 million YouTube videos, chosen according to random keywords extracted from the WordNet library (http://wordnet.princeton.edu).
The motivation for developing the YOUStatAnalyzer data collection tool and creating this dataset is that an active research community works on the interplay among individual user preferences, social dynamics, and advertising mechanisms, and a common problem is the lack of open large-scale datasets. At the time, no tool for collecting such data existed either. YouTube has since removed the possibility of viewing these data on each video's page, which makes this dataset unique.
When using our dataset for research purposes, please cite it as:
@INPROCEEDINGS{YOUStatAnalyzer,
author={Mattia Zeni and Daniele Miorandi and Francesco {De Pellegrini}},
title = {{YOUStatAnalyzer}: a Tool for Analysing the Dynamics of {YouTube} Content Popularity},
booktitle = {Proc.\ 7th International Conference on Performance Evaluation Methodologies and Tools
(Valuetools, Torino, Italy, December 2013)},
address = {Torino, Italy},
year = {2013}
}
The dataset contains statistics and metadata for 1 million YouTube videos, collected in 2013. The videos were chosen according to random keywords extracted from the WordNet library (http://wordnet.princeton.edu).
The structure of a dataset entry is the following:
{
u'_id': u'9eToPjUnwmU',
u'title': u'Traitor Compilation # 1 (Trouble ...',
u'description': u'A traitor compilation by one are ...',
u'category': u'Games',
u'commentsNumber': u'6',
u'publishedDate': u'2012-10-09T23:42:12.000Z',
u'author': u'ServilityGaming',
u'duration': u'208',
u'type': u'video/3gpp',
u'relatedVideos': [u'acjHy7oPmls', u'EhW2LbCjm7c', u'UUKigFAQLMA', ...],
u'accessControl': {
u'comment': {u'permission': u'allowed'},
u'list': {u'permission': u'allowed'},
u'videoRespond': {u'permission': u'moderated'},
u'rate': {u'permission': u'allowed'},
u'syndicate': {u'permission': u'allowed'},
u'embed': {u'permission': u'allowed'},
u'commentVote': {u'permission': u'allowed'},
u'autoPlay': {u'permission': u'allowed'}
},
u'views': {
u'cumulative': {
u'data': [15.0, 25.0, 26.0, 26.0, ...]
},
u'daily': {
u'data': [15.0, 10.0, 1.0, 0.0, ...]
}
},
u'shares': {
u'cumulative': {
u'data': [0.0, 0.0, 0.0, 0.0, ...]
},
u'daily': {
u'data': [0.0, 0.0, 0.0, 0.0, ...]
}
},
u'watchtime': {
u'cumulative': {
u'data': [22.5666666667, 36.5166666667, 36.7, 36.7, ...]
},
u'daily': {
u'data': [22.5666666667, 13.95, 0.166666666667, 0.0, ...]
}
},
u'subscribers': {
u'cumulative': {
u'data': [0.0, 0.0, 0.0, 0.0, ...]
},
u'daily': {
u'data': [-1.0, 0.0, 0.0, 0.0, ...]
}
},
u'day': {
u'data': [1349740800000.0, 1349827200000.0, 1349913600000.0, 1350000000000.0, ...]
}
}
The structure above shows which fields an entry in the dataset has. They can be divided into 2 sections:
1) Video Information.
_id -> Corresponding to the video ID and to the unique identifier of an entry in the database.
title -> The video's title.
description -> The video's description.
category -> The YouTube category the video is inserted in.
commentsNumber -> The number of comments posted by users.
publishedDate -> The date the video has been published.
author -> The author of the video.
duration -> The video duration in seconds.
type -> The encoding type of the video.
relatedVideos -> A list of related videos.
accessControl -> A list of access policies for different aspects related to the video.
2) Video Statistics.
Each video can have 4 different statistics variables: views, shares, subscribers and watchtime. Recent videos have all of them, while older videos may have only the 'views' variable. Each variable has 2 dimensions, daily and cumulative.
views -> number of views collected by the video.
shares -> number of sharing operations performed by users.
watchtime -> the time spent by users watching the video, in minutes.
subscribers -> number of subscriptions to the channel the video belongs to, attributed to the selected video.
day -> a list of days indicating the analysed period for the statistic.
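The fields described above can be combined in a short Python sketch. It uses a trimmed copy of the sample entry and assumes that the 'day' values are Unix epoch milliseconds in UTC (the first sample value then falls on 2012-10-09, the same day as publishedDate). The helper names to_dates and daily_series are illustrative and not part of the dataset or of YOUStatAnalyzer.

```python
from datetime import datetime, timezone
from itertools import accumulate

# Trimmed copy of the sample entry shown above (first four days only).
entry = {
    "_id": "9eToPjUnwmU",
    "duration": "208",  # numeric fields are stored as strings in the sample
    "views": {
        "cumulative": {"data": [15.0, 25.0, 26.0, 26.0]},
        "daily": {"data": [15.0, 10.0, 1.0, 0.0]},
    },
    "day": {"data": [1349740800000.0, 1349827200000.0,
                     1349913600000.0, 1350000000000.0]},
}


def to_dates(day_values):
    """Convert 'day' values to ISO dates, assuming epoch milliseconds (UTC)."""
    return [
        datetime.fromtimestamp(ms / 1000, tz=timezone.utc).date().isoformat()
        for ms in day_values
    ]


def daily_series(entry, variable):
    """Pair each date with the daily value of one statistic, e.g. 'views'."""
    dates = to_dates(entry["day"]["data"])
    return list(zip(dates, entry[variable]["daily"]["data"]))


views_by_day = daily_series(entry, "views")
duration_seconds = int(entry["duration"])

# In the sample, the cumulative series is the running sum of the daily series.
cumulative_matches = (
    list(accumulate(entry["views"]["daily"]["data"]))
    == entry["views"]["cumulative"]["data"]
)
```

The same daily_series call should work for 'shares', 'watchtime' and 'subscribers' on entries that have those variables; older entries may carry only 'views', so checking for the key first is advisable.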
If you are using MongoDB as your database system, you can import the dataset with the command:
mongoimport --db [MONGODB_NAME] --collection [MONGODB_COLLECTION] --file dataset.json
Once you have imported the dataset into your database, you can access the data by running queries. Below is some example Python code for doing so.
The following code performs a query without search parameters, returning all the entries in the database, each one stored in the variable entry:
from pymongo import MongoClient

client = MongoClient('localhost', 27017)
db = client[MONGODB_NAME]
collection = db[MONGODB_COLLECTION]
# Use the collection variable; db.collection would address a collection literally named "collection"
for entry in collection.find():
    print(entry["day"]["data"])
If you want to restrict the results to entries that match a specific query, you can use:
from pymongo import MongoClient

client = MongoClient('localhost', 27017)
db = client[MONGODB_NAME]
collection = db[MONGODB_COLLECTION]
# Both conditions go in a single filter document, so they are ANDed together
for entry in collection.find({"watchtime": {"$exists": True}, "category": "Music"}):
    print(entry["day"]["data"])
Success.ai’s LinkedIn Data for Creative Industry Professionals enables businesses and organizations to connect with global creators, designers, and innovators in the digital, artistic, and creative fields. With access to over 700 million verified LinkedIn profiles, this dataset provides actionable insights and contact details for graphic designers, content creators, photographers, artists, and other professionals in the creative space. Whether your goal is to identify collaborators, market tools tailored to creatives, or analyze emerging trends in the industry, Success.ai ensures your outreach is supported by accurate, enriched, and continuously updated data.
Why Choose Success.ai’s LinkedIn Data for Creative Industry Professionals?
Comprehensive Professional Profiles
Access verified LinkedIn profiles of creative professionals, including designers, illustrators, animators, content marketers, photographers, and digital creators. Gain AI-driven validation for accuracy, ensuring minimal bounce rates and effective communication.
Global Coverage Across Creative Sectors
Includes professionals from various industries, such as advertising, media, entertainment, technology, and fashion. Covers key markets like North America, Europe, APAC, and emerging creative hubs worldwide.
Continuously Updated Dataset
Reflects real-time professional updates, role changes, and new industry trends to keep your targeting relevant and effective.
Tailored for Creative Insights
Enriched profiles include work history, professional achievements, areas of expertise, and creative specialties for deeper audience understanding.
Data Highlights:
700M+ Verified LinkedIn Profiles: Access a vast network of verified creative professionals worldwide.
100M+ Work Emails: Direct communication with designers, creators, and industry leaders.
Enriched Professional Histories: Gain insights into career trajectories, collaborations, and creative projects.
Industry-Specific Segmentation: Target creatives in advertising, film, tech, and more with precision filters.
Key Features of the Dataset:
Creative Industry Profiles
Identify and connect with graphic designers, UX/UI specialists, motion graphic artists, video editors, photographers, and other creative professionals. Engage with individuals who drive innovation in marketing, branding, and design.
Detailed Firmographic Data
Leverage firmographic insights, including company size, industry focus, and regional activity, to tailor your approach to specific creative segments.
Advanced Filters for Targeting
Refine your search by job title, creative specialty, region, or years of experience for precision outreach. Customize campaigns based on emerging design trends, content needs, or artistic expertise.
AI-Driven Enrichment
Enhanced datasets deliver actionable data for personalized campaigns, highlighting creative portfolios, awards, and career milestones.
Strategic Use Cases:
Product Marketing and Outreach
Promote design software, content creation tools, or creative platforms to designers, video editors, and content strategists. Engage with professionals who shape marketing campaigns, advertising, and digital media production.
Talent Acquisition and Recruitment
Target creative recruiters, agency leads, and in-house HR professionals seeking designers, animators, and content creators. Simplify hiring for roles requiring artistic and technical expertise.
Collaboration and Partnerships
Identify collaborators for design projects, creative campaigns, or artistic ventures. Build partnerships with agencies, freelance networks, and individual creators for co-branded initiatives.
Market Research and Trend Analysis
Explore shifts in creative technologies, design aesthetics, and artistic practices across global markets. Use insights to refine product development and marketing strategies.
Why Choose Success.ai?
Best Price Guarantee
Get industry-leading data quality at unmatched pricing, ensuring your campaigns are cost-effective and impactful.
Seamless Integration
Easily integrate LinkedIn Data into your CRM or marketing platforms with downloadable formats or API access.
AI-Validated Accuracy
Rely on 99% data accuracy to minimize waste and maximize engagement outcomes in your campaigns.
Customizable Solutions
Tailor datasets to focus on specific creative fields, industry verticals, or geographical areas, ensuring a perfect fit for your objectives.
Strategic APIs for Enhanced Campaigns:
Data Enrichment API
Update your internal records with verified creative profiles for better audience targeting and engagement.
Lead Generation API
Automate lead generation to maintain a steady flow of qualified creative professionals, scaling your campaigns efficiently.
Success.ai’s LinkedIn Data for Creative Industry Professionals empowers you to connect with the creative minds shaping today’s industries. With verified contact details, enriched prof...
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
I created this dataset as part of a data analysis project and concluded that it might be relevant for others who are interested in analyzing content on YouTube. This dataset is a collection of over 6000 videos with the following columns:
Comments: comments count for the video
Using the YouTube API and Python, I collected data about the videos of some popular channels that provide educational content about Machine Learning and Data Science, in order to extract insights about which topics have been popular over the last couple of years. The dataset features the following creators:
Krish Naik
Nicholas Renotte
Sentdex
DeepLearningAI
Artificial Intelligence — All in One
Siraj Raval
Jeremy Howard
Applied AI Course
Daniel Bourke
Jeff Heaton
DeepLearning.TV
Arxiv Insights
These channels are featured in multiple "top AI channels to subscribe to" lists and have seen strong growth on YouTube in the last couple of years. All of them were created in or before 2018.
With barely 10 seconds and 61 million likes, Bella Poarch's lip sync to "M to the B" by Millie B was the most engaging video on TikTok as of March 2023. Bella Poarch, who as of the beginning of 2023 was the third-most followed creator on the popular social video platform, rose to popularity as a singer and content creator after opening a TikTok account in January 2020. The second-ranked video, "dancing in front of the bathroom mirror" by user @jamie32bsh, generated almost 52 million likes between its upload in January 2022 and March 2023.
The Reddit Subreddit Dataset by Dataplex offers a comprehensive and detailed view of Reddit’s vast ecosystem, now enhanced with appended AI-generated columns that provide additional insights and categorization. This dataset includes data from over 2.1 million subreddits, making it an invaluable resource for a wide range of analytical applications, from social media analysis to market research.
Dataset Overview:
This dataset includes detailed information on subreddit activities, user interactions, post frequency, comment data, and more. The inclusion of AI-generated columns adds an extra layer of analysis, offering sentiment analysis, topic categorization, and predictive insights that help users better understand the dynamics of each subreddit.
2.1 Million Subreddits with Enhanced AI Insights: The dataset covers over 2.1 million subreddits and now includes AI-enhanced columns that provide:
- Sentiment Analysis: AI-driven sentiment scores for posts and comments, allowing users to gauge community mood and reactions.
- Topic Categorization: Automated categorization of subreddit content into relevant topics, making it easier to filter and analyze specific types of discussions.
- Predictive Insights: AI models that predict trends, content virality, and user engagement, helping users anticipate future developments within subreddits.
Sourced Directly from Reddit:
All social media data in this dataset is sourced directly from Reddit, ensuring accuracy and authenticity. The dataset is updated regularly, reflecting the latest trends and user interactions on the platform. This ensures that users have access to the most current and relevant data for their analyses.
Key Features:
Use Cases:
Data Quality and Reliability:
The Reddit Subreddit Dataset emphasizes data quality and reliability. Each record is carefully compiled from Reddit’s vast database, ensuring that the information is both accurate and up-to-date. The AI-generated columns further enhance the dataset's value, providing automated insights that help users quickly identify key trends and sentiments.
Integration and Usability:
The dataset is provided in a format that is compatible with most data analysis tools and platforms, making it easy to integrate into existing workflows. Users can quickly import, analyze, and utilize the data for various applications, from market research to academic studies.
User-Friendly Structure and Metadata:
The data is organized for easy navigation and analysis, with metadata files included to help users identify relevant subreddits and data points. The AI-enhanced columns are clearly labeled and structured, allowing users to efficiently incorporate these insights into their analyses.
Ideal For:
This dataset is an essential resource for anyone looking to understand the intricacies of Reddit's vast ecosystem, offering the data and AI-enhanced insights needed to drive informed decisions and strategies across various fields. Whether you’re tracking emerging trends, analyzing user behavior, or conduc...
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Influencer marketing campaigns: Brands can use the contact data of Indian content creators to reach out to them for influencer marketing campaigns. Influencers can create content for the brand, promote it on their social media channels, and help increase brand awareness and engagement.
Product reviews and sponsorships: Companies can use the contact data of Indian content creators to send them their products for review or to sponsor their content. This can help increase brand exposure and generate positive word-of-mouth for the product.
Brand partnerships and collaborations: Brands can use the contact data of Indian content creators to collaborate with them on brand partnerships. This can include creating co-branded content, sponsored posts, or joint events.
Content creation services: Companies can use the contact data of Indian content creators to hire them for content creation services. This can include creating social media posts, blog articles, videos, or other types of content.
Market research: Companies can use the contact data of Indian content creators to conduct market research. They can ask influencers to participate in surveys or focus groups to get insights into their target audience's preferences and behaviors.
Overall, the contact data of Indian content creators can be a valuable resource for companies and brands looking to leverage the power of influencer marketing and content creation.
Social Media Audiences
contact,influencers,marketing,sponsorship,collaboration
2779
$500.00
https://crawlfeeds.com/privacy_policy
Explore the "Largest News Articles Dataset from CNBC," a comprehensive collection of news articles published by CNBC, one of the leading global news sources for business, finance, and current affairs.
This dataset includes thousands of articles covering a wide range of topics, such as financial markets, economic trends, technology, politics, health, and more. Each article in the dataset provides detailed information, including headlines, publication dates, authors, article content, and categories, offering valuable insights for researchers, data analysts, and media professionals.
Key Features:
Whether you're conducting research on financial markets, analyzing media trends, or developing new content, the "Largest News Articles Dataset from CNBC" is an invaluable resource that provides detailed insights and comprehensive coverage of the latest news.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset provides comprehensive information on effective local SEO content creation strategies for businesses in Colorado Springs, Colorado. It covers topics such as local keyword research, neighborhood targeting, community-based marketing, and creating content that resonates with the local audience. The dataset includes details on the creator, publisher, spatial and temporal coverage, variable metrics, data sources, and usage information.
https://crawlfeeds.com/privacy_policy
The CNBC Economy Articles Dataset is an invaluable collection of data extracted from CNBC’s economy section, offering deep insights into global and U.S. economic trends, market dynamics, financial policies, and industry developments.
This dataset encompasses a diverse array of economic articles on critical topics like GDP growth, inflation rates, employment statistics, central bank policies, and major global events influencing the market. Designed for researchers, analysts, and businesses, it serves as an essential resource for understanding economic patterns, conducting sentiment analysis, and developing financial forecasting models.
Each record in the dataset is meticulously structured and includes:
This rich combination of fields ensures seamless integration into data science projects, research papers, and market analyses.
Interested in additional structured news datasets for your research or analytics needs? Check out our news dataset collection to find datasets tailored for diverse analytical applications.
According to our latest research, the global audio dataset market size reached USD 6.7 billion in 2024, driven by surging demand for machine learning and AI-powered audio applications. The market is experiencing robust expansion with a CAGR of 21.4% from 2025 to 2033, with forecasts indicating the market will attain USD 48.1 billion by 2033. Key growth factors include the proliferation of voice-activated technologies, increased adoption of smart devices, and the widespread integration of audio analytics in diverse sectors such as healthcare, automotive, and media & entertainment.
The primary growth driver for the audio dataset market is the exponential rise in the adoption of automatic speech recognition (ASR) and natural language processing (NLP) technologies. With businesses and consumers increasingly relying on voice assistants, chatbots, and virtual agents, the demand for high-quality, diverse, and annotated audio datasets has soared. These datasets are fundamental to training and refining AI models for voice recognition, transcription, and sentiment analysis. The integration of audio datasets into customer service, accessibility solutions for the differently-abled, and language learning platforms further amplifies market growth. Additionally, advancements in deep learning algorithms are enabling the extraction of more nuanced information from audio data, making datasets more valuable and broadening their use cases.
Another significant factor fueling the audio dataset market is the surge in smart device penetration and IoT adoption across industries. The proliferation of smart speakers, connected vehicles, wearable devices, and intelligent home appliances has created a massive influx of audio data. Organizations are leveraging this data to enhance user experience, personalize services, and enable real-time decision-making. In sectors like automotive, audio datasets are instrumental in developing advanced driver assistance systems (ADAS) and in-car voice assistants. In healthcare, audio datasets support the development of diagnostic tools and remote patient monitoring solutions. The convergence of audio datasets with big data analytics and cloud computing is unlocking new business models and revenue streams, further propelling market expansion.
The media & entertainment industry is also playing a pivotal role in the growth of the audio dataset market. The demand for music information retrieval, sound event detection, and content recommendation systems is at an all-time high. Streaming platforms, broadcasters, and content creators are increasingly utilizing audio datasets to optimize content delivery, improve audience engagement, and automate content moderation. The emergence of immersive audio experiences, such as spatial audio and 3D sound, is creating new opportunities for dataset providers. Furthermore, regulatory mandates for accessibility, such as closed captioning and audio descriptions, are compelling organizations to invest in robust audio datasets, driving further market growth.
Regionally, North America holds the largest share of the audio dataset market, attributed to early technology adoption, high R&D investments, and the presence of major AI and tech companies. However, the Asia Pacific region is witnessing the fastest growth, fueled by rapid digital transformation, increasing smartphone penetration, and government initiatives to promote AI research. Europe is also a significant market, driven by stringent data privacy regulations and a strong focus on innovation in automotive and healthcare sectors. Latin America and the Middle East & Africa are emerging markets, with growing investments in digital infrastructure and AI-driven applications. The global landscape is characterized by intense competition, continuous innovation, and a focus on developing multilingual and culturally diverse audio datasets.
The audio dataset market is segmented by dataset type into speech, music, environmental sounds,
This dataset contains information about trending YouTube videos from multiple countries, providing valuable insights for predicting video popularity based on various attributes. The dataset includes both numerical and categorical features that are essential for analyzing viewer behavior, engagement, and trends in content creation. The original source of this dataset can be found at: https://www.kaggle.com/datasets/datasnaek/youtube-new/data
title: The title of the YouTube video.
channel_title: Name of the channel that published the video.
trending_date: The date the video started trending.
publish_date: The original upload date of the video.
publish_time: The exact time the video was published.
views: The total number of views the video received.
likes: The number of likes the video received.
dislikes: The number of dislikes the video received.
comment_count: The total number of comments on the video.
tags: Keywords or tags associated with the video, helping discoverability.
description: A detailed text description provided by the uploader.
category_id: The category assigned to the video (e.g., Music, Gaming, News).
The task is to predict the number of views on YouTube videos based on video attributes. The goal is to develop a model that can accurately predict the number of views a video will receive, using various video attributes such as likes, shares, comments, video duration, and more.
RMSE (Root Mean Squared Error) RMSE is a metric that measures the magnitude of the error between the values predicted by the model (Predicted Views) and the actual values (Actual Views). The lower the RMSE value, the more accurate the model's predictions.
R² (Coefficient of Determination) R² measures the extent to which the model can explain the variation in the data. R² values typically range from 0 to 1, where 1 means the model can explain all the variation in the number of views based on the given attributes, and 0 means the model explains none of it (no better than always predicting the mean); on held-out data the value can even be negative when the model does worse than that. The higher the R², the better the model is at predicting views and the more relevant the features used in the model.
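As a minimal sketch of the two metrics described above, the following Python code computes RMSE and R² from their standard definitions. The view counts are made-up illustrative numbers, not values from the dataset, and the function names are hypothetical.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between actual and predicted values."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Made-up example values (not taken from the dataset).
actual_views = [100.0, 200.0, 300.0, 400.0]
predicted_views = [110.0, 190.0, 310.0, 390.0]

example_rmse = rmse(actual_views, predicted_views)
example_r2 = r_squared(actual_views, predicted_views)
```

RMSE is expressed in the same unit as the target (here, views), whereas R² is unitless, which is why the two are usually reported together.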
The machine learning model was evaluated using several approaches, including different pre-processing techniques and multiple ML models. Ultimately, the chosen model for this analysis is the Random Forest Regressor. The final evaluation results show an RMSE of 630.741, indicating a typical prediction error of roughly 630.741 views. Additionally, the R² score is 0.9623, meaning that the model explains 96.23% of the variance in the data (number of views). These results were deemed satisfactory, and this model was selected as the final modeling approach for the system and its potential future applications.
https://brightdata.com/license
We'll tailor a BuzzFeed dataset to meet your unique needs, encompassing article titles, reader engagement metrics, content types, demographic data of readers, social media shares, comment statistics, and other pertinent metrics.
Leverage our BuzzFeed datasets for diverse applications to bolster strategic planning and market analysis. Scrutinizing these datasets enables organizations to grasp reader preferences and digital media trends, facilitating nuanced content creation and marketing initiatives. Customize your access to the entire dataset or specific subsets as per your business requisites.
Popular use cases involve optimizing content strategy based on engagement insights, enhancing marketing strategies through targeted audience segmentation, and identifying and forecasting trends to stay ahead in the digital media landscape.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains 4 semi-structured interviews on the topics of content creation in later life as a pathway to support digital participation. It also includes selected quotes from a workshop.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
About the NUDA Dataset
Media bias is a multifaceted problem, leading to one-sided views and impacting decision-making. A way to address bias in news articles is to automatically detect and indicate it through machine-learning methods. However, such detection is limited due to the difficulty of obtaining reliable training data. To facilitate the data-gathering process, we introduce NewsUnravel, a news-reading web application leveraging an initially tested feedback mechanism to collect reader feedback on machine-generated bias highlights within news articles. Our approach augments dataset quality by significantly increasing inter-annotator agreement by 26.31% and improving classifier performance by 2.49%. As the first human-in-the-loop application for media bias, NewsUnravel shows that a user-centric approach to media bias data collection can return reliable data while being scalable and evaluated as easy to use. NewsUnravel demonstrates that feedback mechanisms are a promising strategy to reduce data collection expenses, fluidly adapt to changes in language, and enhance evaluators' diversity.
General
This dataset was created through user feedback on automatically generated bias highlights on news articles on the website NewsUnravel made by ANON. Its goal is to improve the detection of linguistic media bias for analysis and to indicate it to the public. Support came from ANON. None of the funders played any role in the dataset creation process or publication-related decisions.
The dataset consists of text, namely biased sentences with binary bias labels (processed, biased or not biased) as well as metadata about the article. It includes all feedback that was given. The single ratings (unprocessed) used to create the labels with correlating User IDs are included.
For training, this dataset was combined with the BABE dataset. All data is completely anonymous. Some sentences might be offensive or triggering as they were taken from biased or more extreme news sources. The dataset does not identify sub-populations, nor can it be considered sensitive to them, and it is not possible to identify individuals.
Description of the Data Files
This repository contains the datasets for the anonymous NewsUnravel submission. The tables contain the following data:
NUDAdataset.csv: the NUDA dataset with 310 new sentences with bias labels
Statistics.png: contains all Umami statistics for NewsUnravel's usage data
Feedback.csv: holds the participantID of a single feedback with the sentence ID (contentId), the bias rating, and provided reasons
Content.csv: holds the participant ID of a rating with the sentence ID (contentId) of a rated sentence and the bias rating, and reason, if given
Article.csv: holds the article ID, title, source, article metadata, article topic, and bias amount in %
Participant.csv: holds the participant IDs and data processing consent
Collection Process
Data was collected through interactions with the Feedback Mechanism on NewsUnravel. A news article was displayed with automatically generated bias highlights. Each highlight could be selected, and readers were able to agree or disagree with the automatic label. Through a majority vote, labels were generated from those feedback interactions. Spammers were excluded through a spam detection approach.
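The majority-vote step described above can be sketched as follows. This is a minimal illustration, not the NewsUnravel implementation: the tuple layout, the function name majority_labels, the rating strings, and the handling of ties (returning None) are all assumptions, and the spam detection itself is represented only by a set of participant IDs to skip.

```python
from collections import Counter, defaultdict


def majority_labels(feedback, excluded_participants=frozenset()):
    """Aggregate individual ratings into one label per sentence.

    feedback: iterable of (participant_id, content_id, rating) tuples.
    excluded_participants: IDs flagged as spam by some separate step.
    Ties yield None (an assumption; the source does not describe tie handling).
    """
    votes = defaultdict(Counter)
    for participant_id, content_id, rating in feedback:
        if participant_id in excluded_participants:
            continue  # ignore ratings from excluded participants
        votes[content_id][rating] += 1

    labels = {}
    for content_id, counter in votes.items():
        ranked = counter.most_common(2)
        if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
            labels[content_id] = None  # no majority
        else:
            labels[content_id] = ranked[0][0]
    return labels


# Hypothetical ratings for two sentences.
feedback = [
    ("p1", "s1", "biased"),
    ("p2", "s1", "biased"),
    ("p3", "s1", "not biased"),
    ("p1", "s2", "biased"),
    ("p2", "s2", "not biased"),
    ("spam", "s2", "biased"),
]
labels = majority_labels(feedback, excluded_participants={"spam"})
```

Here sentence s1 receives the label "biased" (two votes to one), while s2 ends in a tie once the excluded participant is dropped and therefore gets no label.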
Readers came to our website voluntarily through posts on LinkedIn and social media as well as posts on university boards. The data collection period lasted for one week, from March 4th to March 11th (2023). The landing page informed them about the goal and the data processing. After being informed, they could proceed to the article overview.
So far, the dataset has been used on top of BABE to train a linguistic bias classifier, adopting hyperparameter configurations from BABE with a pre-trained model from Hugging Face.
The dataset will be open source. On acceptance, a link with all details and contact information will be provided. No third parties are involved.
The dataset will not be maintained as it captures the first test of NewsUnravel at a specific point in time. However, new datasets will arise from further iterations. Those will be linked in the repository. Please cite the NewsUnravel paper if you use the dataset and contact us if you're interested in more information or joining the project.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a dataset of videos and comments related to the invasion of Ukraine, published on TikTok by a number of users over the year of 2022. It was compiled by Benjamin Steel, Sara Parker and Derek Ruths at the Network Dynamics Lab, McGill University. We created this dataset to facilitate the study of TikTok, and the nature of social interaction on the platform relevant to a major political event.
The dataset has been released here on Zenodo: https://doi.org/10.5281/zenodo.7534952 as well as on Github: https://github.com/networkdynamics/data-and-code/tree/master/ukraine_tiktok
To create the dataset, we identified hashtags and keywords explicitly related to the conflict to collect a core set of videos (or "TikToks"). We then compiled comments associated with these videos. All of the data captured is publicly available information, and contains personally identifiable information. In total we collected approximately 16 thousand videos and 12 million comments, from approximately 6 million users. There are approximately 1.9 comments on average per user captured, and 1.5 videos per user who posted a video. The author personally collected this data using the web scraping PyTok library, developed by the author: https://github.com/networkdynamics/pytok.
Due to scraping duration, this is just a sample of the publicly available discourse concerning the invasion of Ukraine on TikTok. Due to the fuzzy search functionality of TikTok, the dataset contains videos with a range of relatedness to the invasion.
We release here the unique video IDs of the dataset in a CSV format. The data was collected without the specific consent of the content creators, so we have released only the data required to re-create it, to allow users to delete content from TikTok and be removed from the dataset if they wish. Contained in this repository are scripts that will automatically pull the full dataset, which will take the form of JSON files organised into a folder for each video. The JSON files are the entirety of the data returned by the TikTok API. We include a script to parse the JSON files into CSV files with the most commonly used data. We plan to further expand this dataset as collection processes progress and the war continues. We will version the dataset to ensure reproducibility.
To build this dataset from the IDs here:
Go to https://github.com/networkdynamics/pytok and clone the repo locally
Run pip install -e . in the pytok directory
Run pip install pandas tqdm to install these libraries if not already installed
Run get_videos.py to get the video data
Run video_comments.py to get the comment data
Run user_tiktoks.py to get the video history of the users
Run hashtag_tiktoks.py or search_tiktoks.py to get more videos from other hashtags and search terms
Run load_json_to_csv.py to compile the JSON files into two CSV files, comments.csv and videos.csv
If you get an error about the wrong Chrome version, use the command line argument get_videos.py --chrome-version YOUR_CHROME_VERSION
Please note that pulling data from TikTok takes a while! We recommend leaving the scripts running on a server until they finish downloading everything. Feel free to play around with the delay constants to either speed up the process or avoid TikTok rate limiting.
Please do not hesitate to make an issue in this repo to get our help with this!
The videos.csv will contain the following columns:
video_id: Unique video ID
createtime: UTC datetime of video creation time in YYYY-MM-DD HH:MM:SS format
author_name: Unique author name
author_id: Unique author ID
desc: The full video description from the author
hashtags: A list of hashtags used in the video description
share_video_id: If the video is sharing another video, this is the video ID of that original video, else empty
share_video_user_id: If the video is sharing another video, this is the user ID of the author of that video, else empty
share_video_user_name: If the video is sharing another video, this is the user name of the author of that video, else empty
share_type: If the video is sharing another video, this is the type of share (stitch, duet, etc.)
mentions: A list of users mentioned in the video description, if any
The comments.csv will contain the following columns:
comment_id: Unique comment ID
createtime: UTC datetime of comment creation time in YYYY-MM-DD HH:MM:SS format
author_name: Unique author name
author_id: Unique author ID
text: Text of the comment
mentions: A list of users that are tagged in the comment
video_id: The ID of the video the comment is on
comment_language: The language of the comment, as predicted by the TikTok API
reply_comment_id: If the comment is replying to another comment, this is the ID of that comment
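A minimal sketch of reading a file with the comments.csv columns listed above, using only the Python standard library. The sample rows are invented for illustration, and treating an empty reply_comment_id cell as "not a reply" is an assumption about how the file is written. For the real file, the same function could be given a handle from open("comments.csv", newline="", encoding="utf-8").

```python
import csv
import io
from collections import Counter
from datetime import datetime, timezone

# Invented sample rows that follow the documented column names.
SAMPLE = """comment_id,createtime,author_name,author_id,text,mentions,video_id,comment_language,reply_comment_id
c1,2022-03-01 12:00:00,alice,u1,hello,,v1,en,
c2,2022-03-01 12:05:30,bob,u2,pryvit,,v1,uk,c1
c3,2022-03-02 08:15:00,carol,u3,hi again,,v2,en,
"""


def load_comments(handle):
    """Read comment rows, parsing createtime as a timezone-aware UTC datetime."""
    rows = []
    for row in csv.DictReader(handle):
        naive = datetime.strptime(row["createtime"], "%Y-%m-%d %H:%M:%S")
        row["createtime"] = naive.replace(tzinfo=timezone.utc)
        # Assumption: an empty cell means the comment is not a reply.
        row["reply_comment_id"] = row["reply_comment_id"] or None
        rows.append(row)
    return rows


comments = load_comments(io.StringIO(SAMPLE))
per_language = Counter(c["comment_language"] for c in comments)
replies = [c["comment_id"] for c in comments if c["reply_comment_id"]]
```

Attaching the UTC timezone explicitly avoids accidental local-time interpretation when the timestamps are later compared or bucketed by day.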
The data can be compiled into a user interaction network to facilitate study of interaction dynamics. There is code to help with that here: https://github.com/networkdynamics/polar-seeds. Additional scripts for further preprocessing of this data can be found there too.
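As a rough illustration of what such a network could look like, the sketch below builds weighted, directed edges between users from the videos.csv and comments.csv columns described above. The edge definition is an assumption made for this example (a reply points to the author of the parent comment, any other comment points to the video's author, and self-interactions are skipped); the polar-seeds code may define interactions differently. The function name build_interaction_edges and the sample rows are hypothetical.

```python
from collections import Counter

# Invented rows using a subset of the documented columns.
videos = [
    {"video_id": "v1", "author_id": "u1"},
    {"video_id": "v2", "author_id": "u2"},
]
comments = [
    {"comment_id": "c1", "author_id": "u2", "video_id": "v1", "reply_comment_id": ""},
    {"comment_id": "c2", "author_id": "u3", "video_id": "v1", "reply_comment_id": "c1"},
    {"comment_id": "c3", "author_id": "u3", "video_id": "v2", "reply_comment_id": ""},
    {"comment_id": "c4", "author_id": "u1", "video_id": "v1", "reply_comment_id": ""},
]


def build_interaction_edges(videos, comments):
    """Return a Counter mapping (source_user, target_user) to interaction count."""
    video_author = {v["video_id"]: v["author_id"] for v in videos}
    comment_author = {c["comment_id"]: c["author_id"] for c in comments}
    edges = Counter()
    for c in comments:
        source = c["author_id"]
        if c["reply_comment_id"]:
            # Reply: interaction with the author of the parent comment.
            target = comment_author.get(c["reply_comment_id"])
        else:
            # Top-level comment: interaction with the video's author.
            target = video_author.get(c["video_id"])
        # Skip unknown targets and users commenting on their own content.
        if target is not None and target != source:
            edges[(source, target)] += 1
    return edges


edges = build_interaction_edges(videos, comments)
```

The resulting edge counts can be passed to a graph library of choice as edge weights; a plain Counter keeps the sketch free of third-party dependencies.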
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Accounts-Payable Time Series for Qyou Media Inc. QYOU Media Inc., through its subsidiaries, curates, produces, and distributes content created by social media stars and digital content creators in the United States, Canada, and India. The company offers chatterbox, an influencer and marketing platform agency that connects brands and categories with social media influencers, and distributes content to connected TV, mobile, and app-based platforms. It also creates and manages influencer marketing campaigns for film studios, game publishers, and other brands. In addition, the company operates digital channels, such as The Q Kahaniyan, QGameX, QToonz, and RDC Movies. It distributes its content through social platforms, including TikTok, YouTube, Instagram, Snapchat, and X. QYOU Media Inc. is headquartered in Toronto, Canada.
In the fourth quarter of 2024, TikTok generated around 186 million downloads from users worldwide. Initially launched in China by ByteDance as Douyin, the short-video format was popularized by TikTok and took over the global social media environment in 2020. In the first quarter of 2020, TikTok downloads peaked at over 313.5 million worldwide, up by 62.3 percent compared to the first quarter of 2019.
TikTok interactions: is there a magic formula for content success?
In 2024, TikTok registered an engagement rate of approximately 4.64 percent on video content hosted on its platform. During the same year, the social video app recorded over 1,100 interactions on average. These interactions consisted primarily of likes, with fewer than 20 comments per piece of content on average in 2024.
The platform has been actively monitoring fake interactions: it removed around 236 million fake likes during the first quarter of 2024. Although there is no secret formula for maximizing these metrics, following the recommended video length may contribute to the success of content on TikTok.
As of the first quarter of 2024, tiny TikTok accounts with up to 500 followers were recommended to post videos around 2.6 minutes long, while the ideal video duration for huge TikTok accounts with over 50,000 followers was 7.28 minutes. The average length of TikTok videos posted by creators in 2024 was around 43 seconds.
What’s trending on TikTok Shop?
Since its launch in September 2023, TikTok Shop has become one of the most popular online shopping platforms, offering consumers a wide variety of products. In 2023, TikTok shops featuring beauty and personal care items sold over 370 million products worldwide.
TikTok shops featuring womenswear and underwear, as well as food and beverages, followed with 285 and 138 million products sold, respectively. Similarly, in the United States market, health and beauty products were the best-selling items, accounting for 85 percent of sales made via the TikTok Shop feature during the first month of its launch.
In 2023, Indonesia was the market with the largest number of TikTok Shops, hosting over 20 percent of all TikTok Shops. Thailand and Vietnam followed with 18.29 and 17.54 percent of the total shops listed on the famous short video platform, respectively.