Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a dataset of videos and comments related to the invasion of Ukraine, published on TikTok by a number of users over the year of 2022. It was compiled by Benjamin Steel, Sara Parker and Derek Ruths at the Network Dynamics Lab, McGill University. We created this dataset to facilitate the study of TikTok, and the nature of social interaction on the platform relevant to a major political event.
The dataset has been released here on Zenodo: https://doi.org/10.5281/zenodo.7926959 as well as on GitHub: https://github.com/networkdynamics/data-and-code/tree/master/ukraine_tiktok
To create the dataset, we identified hashtags and keywords explicitly related to the conflict to collect a core set of videos (or "TikToks"). We then compiled comments associated with these videos. All of the data captured is publicly available information, and it contains personally identifiable information. In total we collected approximately 16 thousand videos and 12 million comments, from approximately 6 million users. On average there are approximately 1.9 comments per captured user, and 1.5 videos per user who posted a video. The author collected this data using PyTok, a web scraping library developed by the author: https://github.com/networkdynamics/pytok.
Because scraping takes time, this is just a sample of the publicly available discourse concerning the invasion of Ukraine on TikTok. Because TikTok's search functionality is fuzzy, the dataset contains videos with a range of relatedness to the invasion.
We release here the unique video IDs of the dataset in a CSV format. The data was collected without the specific consent of the content creators, so we have released only the data required to re-create it, to allow users to delete content from TikTok and be removed from the dataset if they wish. Contained in this repository are scripts that will automatically pull the full dataset, which will take the form of JSON files organised into a folder for each video. The JSON files are the entirety of the data returned by the TikTok API. We include a script to parse the JSON files into CSV files with the most commonly used data. We plan to further expand this dataset as collection processes progress and the war continues. We will version the dataset to ensure reproducibility.
To build this dataset from the IDs here:
- Run pip install -e . in the pytok directory
- Run pip install pandas tqdm to install these libraries if not already installed
- Run get_videos.py to get the video data
- Run video_comments.py to get the comment data
- Run user_tiktoks.py to get the video history of the users
- Run hashtag_tiktoks.py or search_tiktoks.py to get more videos from other hashtags and search terms
- Run load_json_to_csv.py to compile the JSON files into two CSV files, comments.csv and videos.csv
If you get an error about the wrong chrome version, use the command line argument get_videos.py --chrome-version YOUR_CHROME_VERSION
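The build steps above can be chained from a small driver. The following is a minimal sketch and not part of the repository: it assumes the scripts are run from the directory that contains them and that each one runs without further required arguments. The names `PIPELINE`, `build_commands`, and `run_all` are hypothetical; only the script names and the `--chrome-version` flag on `get_videos.py` come from the instructions above.

```python
import subprocess
import sys

# Script names in the order given by the build steps.
PIPELINE = [
    "get_videos.py",
    "video_comments.py",
    "user_tiktoks.py",
    "hashtag_tiktoks.py",
    "search_tiktoks.py",
    "load_json_to_csv.py",
]


def build_commands(scripts=PIPELINE, chrome_version=None):
    """Return one command (as an argument list) per script, in order."""
    commands = []
    for script in scripts:
        cmd = [sys.executable, script]
        # The instructions mention this flag only for get_videos.py.
        if chrome_version is not None and script == "get_videos.py":
            cmd += ["--chrome-version", str(chrome_version)]
        commands.append(cmd)
    return commands


def run_all(scripts=PIPELINE, chrome_version=None):
    """Run each script in turn and stop at the first failure."""
    for cmd in build_commands(scripts, chrome_version):
        subprocess.run(cmd, check=True)


# Preview the commands without running anything.
for cmd in build_commands(chrome_version=114):
    print(" ".join(cmd[1:]))
```

Calling `run_all()` would actually start the downloads, so it is left uncalled here. Passing a shorter `scripts` list is one way to skip the steps that only extend the dataset.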
Please note pulling data from TikTok takes a while! We recommend leaving the scripts running on a server for a while for them to finish downloading everything. Feel free to play around with the delay constants to either speed up the process or avoid TikTok rate limiting.
Please do not hesitate to make an issue in this repo to get our help with this!
The videos.csv will contain the following columns:
- video_id: Unique video ID
- createtime: UTC datetime of video creation time in YYYY-MM-DD HH:MM:SS format
- author_name: Unique author name
- author_id: Unique author ID
- desc: The full video description from the author
- hashtags: A list of hashtags used in the video description
- share_video_id: If the video is sharing another video, this is the video ID of that original video, else empty
- share_video_user_id: If the video is sharing another video, this is the user ID of the author of that video, else empty
- share_video_user_name: If the video is sharing another video, this is the user name of the author of that video, else empty
- share_type: If the video is sharing another video, this is the type of the share (stitch, duet, etc.)
- mentions: A list of users mentioned in the video description, if any
The comments.csv will contain the following columns:
- comment_id: Unique comment ID
- createtime: UTC datetime of comment creation time in YYYY-MM-DD HH:MM:SS format
- author_name: Unique author name
- author_id: Unique author ID
- text: Text of the comment
- mentions: A list of users that are tagged in the comment
- video_id: The ID of the video the comment is on
- comment_language: The language of the comment, as predicted by the TikTok API
- reply_comment_id: If the comment is replying to another comment, this is the ID of that comment
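Both CSV files store createtime as a UTC string in YYYY-MM-DD HH:MM:SS format, so it usually needs converting before any time-based analysis. The sketch below uses only the Python standard library; the helper names `parse_createtime` and `read_rows` are hypothetical, and the sample rows are made up and carry only a few of the columns listed above.

```python
import csv
import io
from datetime import datetime, timezone

CREATETIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_createtime(value):
    """Parse a createtime string into a timezone-aware UTC datetime."""
    return datetime.strptime(value, CREATETIME_FORMAT).replace(tzinfo=timezone.utc)


def read_rows(handle):
    """Yield each CSV row as a dict with createtime converted to a datetime."""
    for row in csv.DictReader(handle):
        row["createtime"] = parse_createtime(row["createtime"])
        yield row


# Made-up sample standing in for comments.csv; real files have more columns.
sample = io.StringIO(
    "comment_id,createtime,author_id,video_id\n"
    "c1,2022-03-01 12:30:00,u1,v1\n"
)
rows = list(read_rows(sample))
print(rows[0]["createtime"].isoformat())  # 2022-03-01T12:30:00+00:00
```

To read a real file, pass `open("comments.csv", newline="", encoding="utf-8")` in place of the in-memory sample.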
The data can be compiled into a user interaction network to facilitate study of interaction dynamics. There is code to help with that here: https://github.com/networkdynamics/polar-seeds. Additional scripts for further preprocessing of this data can be found there too.
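As a rough illustration of what such a network could look like, the sketch below counts directed edges between users from rows shaped like videos.csv and comments.csv. The edge definition is an assumption made for this example (a commenter points to the video's author, and a reply points to the author of the comment it answers); it is not taken from the polar-seeds code, and `build_interaction_edges` is a hypothetical name.

```python
from collections import Counter


def build_interaction_edges(videos, comments):
    """Count directed interactions between users.

    An edge (a, b) means user a commented on a video by user b, or replied
    to a comment written by user b. This edge definition is an assumption.
    """
    video_author = {v["video_id"]: v["author_id"] for v in videos}
    comment_author = {c["comment_id"]: c["author_id"] for c in comments}
    edges = Counter()
    for c in comments:
        source = c["author_id"]
        reply_to = c.get("reply_comment_id")
        if reply_to:
            target = comment_author.get(reply_to)
        else:
            target = video_author.get(c["video_id"])
        # Skip targets missing from the sample and self-interactions.
        if target is not None and target != source:
            edges[(source, target)] += 1
    return edges


# Made-up rows using the column names described above.
videos = [{"video_id": "v1", "author_id": "u1"}]
comments = [
    {"comment_id": "c1", "author_id": "u2", "video_id": "v1", "reply_comment_id": ""},
    {"comment_id": "c2", "author_id": "u3", "video_id": "v1", "reply_comment_id": "c1"},
    {"comment_id": "c3", "author_id": "u1", "video_id": "v1", "reply_comment_id": ""},
]
edges = build_interaction_edges(videos, comments)
print(dict(edges))
```

The resulting counter can be passed to a graph library as a weighted edge list if more than simple counts are needed.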
https://choosealicense.com/licenses/other/
TikTok-10M Dataset
Dataset Description
TikTok-10M is a large-scale dataset containing 10 million short-form posts from TikTok, designed for video understanding, multimodal learning, and social media content analysis. The dataset was curated to bridge the gap between academic video datasets and actual user-generated content, providing researchers with authentic patterns and characteristics of modern short-form video content that dominates social media platforms.… See the full description on the dataset page: https://huggingface.co/datasets/The-data-company/TikTok-10M.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We used TikTok’s built-in account analytics to download and record video and account metrics for the period between 10/8/2021 and 2/6/2022. We collected the following summary data for each individual video: video views, likes, comments, shares, total cumulative play time, average duration the video was watched, percentage of viewers who watched the full video, unique reached audience, and the percentage of video views by section (For You, personal profile, Following, hashtags).
We evaluated the “success” of videos based on reach and engagement metrics, as well as viewer retention (how long a video is watched). We used metrics of reach (number of unique users the video was seen by) and engagement (likes, comments, and shares) to calculate the engagement rate of each video. The engagement rate is calculated as the engagement parameter as a percentage of total reach (e.g., Likes / Audience Reached *100).
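The engagement rate described above reduces to a one-line calculation. The following minimal sketch shows it for a single engagement parameter; the function name `engagement_rate` and the sample numbers are made up for illustration.

```python
def engagement_rate(engagement_count, audience_reached):
    """Return an engagement parameter as a percentage of total reach."""
    if audience_reached <= 0:
        raise ValueError("audience_reached must be positive")
    return engagement_count / audience_reached * 100


# Made-up numbers: 150 likes on a video that reached 2,000 unique users.
like_rate = engagement_rate(150, 2000)
print(f"{like_rate:.1f}%")  # 7.5%
```

The same function applies to comments or shares by passing the corresponding count as the first argument.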
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
The TikHarm dataset is a curated collection of TikTok videos designed to train models for classifying harmful content. The dataset is in the format of UCF101, and it is specifically focused on content accessible to children, with the aim of distinguishing between different types of potentially harmful material.
Data was gathered from TikTok, targeting videos that are accessible to children to ensure the dataset reflects the type of content they are likely to encounter.
Collected videos were manually labeled into four predefined categories:
- Harmful Content: Videos that depict violence, dangerous actions that children might imitate, or other harmful behavior.
- Adult Content: Videos containing sexual content or other material deemed inappropriate for children.
- Safe: Videos that are appropriate and safe for children to view, such as popular cartoons.
- Suicide: Videos that depict, suggest, or discuss suicidal behavior or ideation.
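A dataset in UCF101 format is commonly laid out with one folder per class, each holding that class's video files. Assuming TikHarm follows that layout, the sketch below indexes the videos by class. The folder names, the .mp4 extension, and the function name `index_by_class` are assumptions for illustration and should be checked against the actual download.

```python
import tempfile
from pathlib import Path


def index_by_class(root, pattern="*.mp4"):
    """Map each class folder name to the sorted video file names inside it."""
    index = {}
    for class_dir in sorted(Path(root).iterdir()):
        if class_dir.is_dir():
            index[class_dir.name] = sorted(p.name for p in class_dir.glob(pattern))
    return index


# Build a tiny throwaway tree to show the expected shape of the result.
with tempfile.TemporaryDirectory() as tmp:
    for label, files in {"Safe": ["a.mp4", "b.mp4"], "Suicide": ["c.mp4"]}.items():
        class_dir = Path(tmp) / label
        class_dir.mkdir()
        for name in files:
            (class_dir / name).touch()
    demo_index = index_by_class(tmp)

print(demo_index)
```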
Subset | Samples | Min Duration (s) | Max Duration (s) | Avg Duration (s) | Total Duration (h) |
---|---|---|---|---|---|
Train | 2762 | 3.88 | 600 | 38.71 | 29.71 |
Dev | 790 | 5.04 | 600 | 38.57 | 4.24 |
Test | 396 | 1.95 | 600 | 38.77 | 8.51 |
Class | Samples | Min Duration (s) | Max Duration (s) | Avg Duration (s) | Total Duration (h) |
---|---|---|---|---|---|
Safe | 997 | 5.04 | 568.8 | 65.36 | 18.1 |
Adult | 977 | 1.95 | 600 | 36.25 | 9.84 |
Harmful | 990 | 4.8 | 600 | 35.92 | 9.88 |
Suicide | 984 | 3.88 | 181.23 | 16.96 | 4.63 |
These tables present the duration statistics for each subset and class within the TikHarm dataset.
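The total duration column can be cross-checked against the other columns, since it should be roughly the sample count times the average duration. The sketch below does this for the per-class rows; small differences are expected because the listed averages are rounded. The names `CLASS_STATS` and `total_hours` are hypothetical.

```python
# (samples, average duration in seconds, listed total in hours) per class,
# copied from the per-class table above.
CLASS_STATS = {
    "Safe": (997, 65.36, 18.1),
    "Adult": (977, 36.25, 9.84),
    "Harmful": (990, 35.92, 9.88),
    "Suicide": (984, 16.96, 4.63),
}


def total_hours(samples, avg_seconds):
    """Estimate total duration in hours from a count and a mean duration."""
    return samples * avg_seconds / 3600


for name, (samples, avg_s, listed_h) in CLASS_STATS.items():
    computed = total_hours(samples, avg_s)
    print(f"{name}: computed {computed:.2f} h, listed {listed_h} h")
```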
This comprehensive dataset is invaluable for developing robust video classification models to automatically detect and categorize harmful content on social media platforms.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description: Tik Tak Tok - Est. 2023
Model: HotshotXL
Voice: Julian
Orientation: Portrait
Tags: Short Dancing
Style: tiktok video, instagram, beautiful, sharp, detailed
Music: mainstream pop music
Prompt: A channel generating short vertical videos, between 20 seconds and 60 seconds. Most videos are about people dancing, doing choreography, or taking selfies, filming their cats, daily life (eg. going to a cafe… See the full description on the dataset page: https://huggingface.co/datasets/jbilcke-hf/ai-tube-tik-tak-tok.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A ranked dataset of the most viral TikTok videos in 2024, based on total views and creator engagement.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
📲 Example Dataset: TikTok Scraper Tool
👉 Start Scraping TikTok: TikTok Scraper Tool
✨ Key Features
⚡ Instant Transcription – Turn any TikTok video into an AI-ready transcript
🎯 Metadata – Get the title, language description, and video hashtags
🔗 URL-Based Access – Just drop in a TikTok video URL to start scraping
🧩 LLM-Ready Output – Receive clean JSON ready for agents, RAG, or AI tools
💸 Free Tier – Use up to 100 queries during the beta period
💫 Easy… See the full description on the dataset page: https://huggingface.co/datasets/MasaFoundation/TikTok_Most_Shared_Video_Transcription_Example.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Social media platforms use short, highly engaging videos to catch users’ attention. While the short-form video feeds popularized by TikTok are rapidly spreading to other platforms, we do not yet understand their impact on cognitive functions. We conducted a between-subjects experiment (𝑁 = 60) investigating the impact of engaging with TikTok, Twitter, and YouTube while performing a Prospective Memory task (i.e., executing a previously planned action). The study required participants to remember intentions over interruptions. We found that the TikTok condition significantly degraded the users’ performance in this task. As none of the other conditions (Twitter, YouTube, no activity) had a similar effect, our results indicate that the combination of short videos and rapid context-switching impairs intention recall and execution. We contribute a quantified understanding of the effect of social media feed format on Prospective Memory and outline consequences for media technology designers, so that their designs do not harm users’ memory and wellbeing.
Description of the Dataset
Data frame: The ./data/rt.csv file provides the data frame of reaction times. The ./data/acc.csv file provides the data frame of reaction accuracy scores. The ./data/q.csv file provides the data frame collected from questionnaires. The ./data/ddm.csv file contains the DDM features learned using ./appendix2_ddm_fitting.ipynb, which are then used in ./3.ddm_anova.ipynb.
Figures: All figures that appear in the paper are placed in ./figures and can be reproduced using the *_vis.ipynb files.
MovingFashion Dataset
MovingFashion is the first publicly available benchmark designed to address the video-to-shop challenge in computer vision, where the goal is to retrieve fashion items worn in social media videos (e.g., Instagram, TikTok) by matching them to corresponding e-commerce product images. GitHub Repository license: cc-by-nc-4.0
Overview
Total Videos: 14,855 social videos
Source Platforms: Instagram, TikTok, and Net-A-Porter
Associated Shop Images:… See the full description on the dataset page: https://huggingface.co/datasets/christianjoppi/MovingFashion.
https://choosealicense.com/licenses/other/
Myanmar Celebrity Voices
A high-quality speech dataset extracted from the official TikTok channel of Myanmar Celebrity TV.
Myanmar Celebrity Voices is a collection of 69,781 short audio segments (≈46 hours total) derived from public TikTok videos by The Official TikTok Channel of Myanmar Celebrity TV — one of the most popular digital media platforms in Myanmar. The source channel regularly publishes:
Interviews with Myanmar’s top movie actors and actresses Behind-the-scenes… See the full description on the dataset page: https://huggingface.co/datasets/freococo/myanmar_cele_voices.
https://www.apache.org/licenses/LICENSE-2.0.html
We propose a large-scale multimodal video log database (LMVD) for identifying depression in the wild. LMVD contains 1823 samples, capturing 214 hours from 1475 participants across four multimedia platforms (Sina Weibo, Bilibili, TikTok, and YouTube). For all collected data, we extract video features and audio features separately. For audio features, we use a pre-trained VGGish model. For visual features, we use FAU, facial markers, eye gaze, and head posture features. It is worth mentioning that LMVD is the largest dataset for identifying visual and auditory depression in an individual's daily life, which is a positive contribution to the field of emotional computing.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset presents a comprehensive compilation of the most streamed songs on Spotify in 2024. It provides extensive insights into each track's attributes, popularity, and presence on various music platforms, offering a valuable resource for music analysts, enthusiasts, and industry professionals. The dataset includes information such as track name, artist, release date, ISRC, streaming statistics, and presence on platforms like YouTube, TikTok, and more.
Here is the link for the 2023 data: Most Streamed Spotify Songs 2023 (https://www.kaggle.com/datasets/nelgiriyewithana/top-spotify-songs-2023)
- Track Name: Name of the song.
- Album Name: Name of the album the song belongs to.
- Artist: Name of the artist(s) of the song.
- Release Date: Date when the song was released.
- ISRC: International Standard Recording Code for the song.
- All Time Rank: Ranking of the song based on its all-time popularity.
- Track Score: Score assigned to the track based on various factors.
- Spotify Streams: Total number of streams on Spotify.
- Spotify Playlist Count: Number of Spotify playlists the song is included in.
- Spotify Playlist Reach: Reach of the song across Spotify playlists.
- Spotify Popularity: Popularity score of the song on Spotify.
- YouTube Views: Total views of the song's official video on YouTube.
- YouTube Likes: Total likes on the song's official video on YouTube.
- TikTok Posts: Number of TikTok posts featuring the song.
- TikTok Likes: Total likes on TikTok posts featuring the song.
- TikTok Views: Total views on TikTok posts featuring the song.
- YouTube Playlist Reach: Reach of the song across YouTube playlists.
- Apple Music Playlist Count: Number of Apple Music playlists the song is included in.
- AirPlay Spins: Number of times the song has been played on radio stations.
- SiriusXM Spins: Number of times the song has been played on SiriusXM.
- Deezer Playlist Count: Number of Deezer playlists the song is included in.
- Deezer Playlist Reach: Reach of the song across Deezer playlists.
- Amazon Playlist Count: Number of Amazon Music playlists the song is included in.
- Pandora Streams: Total number of streams on Pandora.
- Pandora Track Stations: Number of Pandora stations featuring the song.
- Soundcloud Streams: Total number of streams on Soundcloud.
- Shazam Counts: Total number of times the song has been Shazamed.
- TIDAL Popularity: Popularity score of the song on TIDAL.
- Explicit Track: Indicates whether the song contains explicit content.
- Music Analysis: Analyze trends in audio features to understand popular song characteristics.
- Platform Comparison: Compare song popularity across different music platforms.
- Artist Impact: Study the relationship between artist attributes and song success.
- Temporal Trends: Identify changes in music attributes and preferences over time.
- Cross-Platform Presence: Investigate song performance across various streaming services.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A structured dataset comparing viral view thresholds and timeframes across major platforms, including TikTok, YouTube (long-form & Shorts), Instagram Reels, Facebook, Twitter (X), LinkedIn Video, and LinkedIn Posts.
https://brightdata.com/license
Gain valuable insights with our comprehensive Social Media Dataset, designed to help businesses, marketers, and analysts track trends, monitor engagement, and optimize strategies. This dataset provides structured and reliable social media data from multiple platforms.
Dataset Features
- User Profiles: Access public social media profiles, including usernames, bios, follower counts, engagement metrics, and more. Ideal for audience analysis, influencer marketing, and competitive research.
- Posts & Content: Extract posts, captions, hashtags, media (images/videos), timestamps, and engagement metrics such as likes, shares, and comments. Useful for trend analysis, sentiment tracking, and content strategy optimization.
- Comments & Interactions: Analyze user interactions, including replies, mentions, and discussions. This data helps brands understand audience sentiment and engagement patterns.
- Hashtag & Trend Tracking: Monitor trending hashtags, topics, and viral content across platforms to stay ahead of industry trends and consumer interests.
Customizable Subsets for Specific Needs
Our Social Media Dataset is fully customizable, allowing you to filter data based on platform, region, keywords, engagement levels, or specific user profiles. Whether you need a broad dataset for market research or a focused subset for brand monitoring, we tailor the dataset to your needs.
Popular Use Cases
- Brand Monitoring & Reputation Management: Track brand mentions, customer feedback, and sentiment analysis to manage online reputation effectively.
- Influencer Marketing & Audience Analysis: Identify key influencers, analyze engagement metrics, and optimize influencer partnerships.
- Competitive Intelligence: Monitor competitor activity, content performance, and audience engagement to refine marketing strategies.
- Market Research & Consumer Insights: Analyze social media trends, customer preferences, and emerging topics to inform business decisions.
- AI & Predictive Analytics: Leverage structured social media data for AI-driven trend forecasting, sentiment analysis, and automated content recommendations.
Whether you're tracking brand sentiment, analyzing audience engagement, or monitoring industry trends, our Social Media Dataset provides the structured data you need. Get started today and customize your dataset to fit your business objectives.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background: TikTok is an important channel for consumers to obtain and adopt health information. However, misinformation on TikTok could potentially impact public health. Currently, the quality of content related to GDM on TikTok has not been thoroughly reviewed.
Objective: This study aims to explore the information quality of GDM videos on TikTok.
Methods: A comprehensive cross-sectional study was conducted on TikTok videos related to GDM. The quality of the videos was assessed using three standardized evaluation tools: DISCERN, the Journal of the American Medical Association (JAMA) benchmarks, and the Global Quality Scale (GQS). The comprehensiveness of the content was evaluated through six questions covering definitions, signs/symptoms, risk factors, evaluation, management, and outcomes. Additionally, a correlational analysis was conducted between video quality and the characteristics of the uploaders and the videos themselves.
Results: A total of 216 videos were included in the final analysis, with 162 uploaded by health professionals, 40 by general users, and the remaining videos contributed by individual science communicators, for-profit organizations, and news agencies. The average DISCERN, JAMA, and GQS scores for all videos were 48.87, 1.86, and 2.06, respectively. The videos uploaded by health professionals scored the highest in DISCERN, while the videos uploaded by individual science communicators scored significantly higher in JAMA and GQS than those from other sources. Correlation analysis between video quality and video features showed DISCERN scores, JAMA scores and GQS scores were positively correlated with video duration (P
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
UGC-VideoCaptioner Dataset
Real-world user-generated videos, especially on platforms like TikTok, often feature rich and intertwined audio-visual content. However, existing video captioning benchmarks and models remain predominantly visual-centric, overlooking the crucial role of audio in conveying scene dynamics, speaker intent, and narrative context. This lack of full-modality datasets and lightweight, capable models hampers progress in fine-grained, multimodal video… See the full description on the dataset page: https://huggingface.co/datasets/openinterx/UGC-VideoCap.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background: The development of short popular science video platforms helps people obtain health information, but no research has evaluated the information characteristics and quality of short videos related to cervical cancer. The purpose of this study was to evaluate the quality and reliability of short cervical cancer-related videos on TikTok and Kwai.
Methods: The Chinese keyword "cervical cancer" was used to search for related videos on TikTok and Kwai, and a total of 163 videos were ultimately included. The overall quality of these videos was evaluated by the Global Quality Score (GQS) and the modified DISCERN tool.
Results: A total of 163 videos were included in this study; TikTok and Kwai contributed 82 and 81 videos, respectively. Overall, these videos received much attention; the median number of likes received was 1360 (403–6867), the median number of comments was 147 (40–601), and the median number of collections was 282 (71–1296). In terms of video content, the etiology of cervical cancer was the most frequently discussed topic. Short videos posted on TikTok received more attention than did those posted on Kwai, and the GQS and DISCERN scores of videos posted on TikTok were significantly better than those of videos posted on Kwai. In addition, the videos posted by specialists were of the highest quality, with GQS and DISCERN scores of 3 (2–3) and 2 (2–3), respectively. Correlation analysis showed that GQS was significantly correlated with the modified DISCERN scores (p
As of January 2024, Instagram was slightly more popular with men than women, with men accounting for 50.6 percent of the platform’s global users. Additionally, the social media app was most popular amongst younger audiences, with almost 32 percent of users aged between 18 and 24 years.
Instagram’s Global Audience
As of January 2024, Instagram was the fourth most popular social media platform globally, reaching two billion monthly active users (MAU). This number is projected to keep growing with no signs of slowing down, which is not a surprise as the global online social penetration rate across all regions is constantly increasing.
As of January 2024, the country with the largest Instagram audience was India with 362.9 million users, followed by the United States with 169.7 million users.
Who is winning over the generations?
Even though Instagram’s audience is almost twice the size of TikTok’s on a global scale, TikTok has shown itself to be a fierce competitor, particularly amongst younger audiences. TikTok was the most downloaded mobile app globally in 2022, generating 672 million downloads. As of 2022, Generation Z in the United States spent more time on TikTok than on Instagram monthly.
How much time do people spend on social media?
As of 2024, the average daily social media usage of internet users worldwide amounted to 143 minutes per day, down from 151 minutes in the previous year. Currently, the country with the most time spent on social media per day is Brazil, with online users spending an average of three hours and 49 minutes on social media each day. In comparison, the daily time spent with social media in the U.S. was just two hours and 16 minutes.
Global social media usage
Currently, the global social network penetration rate is 62.3 percent. Northern Europe had an 81.7 percent social media penetration rate, topping the ranking of global social media usage by region. Eastern and Middle Africa closed the ranking with 10.1 and 9.6 percent usage reach, respectively.
People access social media for a variety of reasons. Users like to find funny or entertaining content and enjoy sharing photos and videos with friends, but mainly use social media to stay in touch with current events and friends.
Global impact of social media
Social media has a wide-reaching and significant impact on not only online activities but also offline behavior and life in general.
During a global online user survey in February 2019, a significant share of respondents stated that social media had increased their access to information, ease of communication, and freedom of expression. On the flip side, respondents also felt that social media had worsened their personal privacy, increased polarization in politics, and heightened everyday distractions.
As of April 2024, almost 32 percent of global Instagram audiences were aged between 18 and 24 years, and 30.6 percent of users were aged between 25 and 34 years. Overall, 16 percent of users belonged to the 35 to 44 year age group.
Instagram users
With roughly one billion monthly active users, Instagram is among the most popular social networks worldwide. The social photo sharing app is especially popular in India and in the United States, which have 362.9 million and 169.7 million Instagram users, respectively.
Instagram features
One of the most popular features of Instagram is Stories. Users can post photos and videos to their Stories stream and the content is live for others to view for 24 hours before it disappears. In January 2019, the company reported that there were 500 million daily active Instagram Stories users. Instagram Stories directly competes with Snapchat, another photo sharing app that initially became famous due to its “vanishing photos” feature.
As of the second quarter of 2021, Snapchat had 293 million daily active users.