29 datasets found

The Invasion of Ukraine Viewed through TikTok: A Dataset
zenodo.org
bin, csv +1
Updated May 13, 2023
Cite
Benjamin Steel; Sara Parker; Derek Ruths (2023). The Invasion of Ukraine Viewed through TikTok: A Dataset [Dataset]. http://doi.org/10.5281/zenodo.7926959
Available download formats
text/x-python, bin, csv
Unique identifier
https://doi.org/10.5281/zenodo.7926959
Dataset updated
May 13, 2023
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Benjamin Steel; Sara Parker; Derek Ruths
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Ukraine
Description
This is a dataset of videos and comments related to the invasion of Ukraine, published on TikTok by a number of users over the year of 2022. It was compiled by Benjamin Steel, Sara Parker and Derek Ruths at the Network Dynamics Lab, McGill University. We created this dataset to facilitate the study of TikTok, and the nature of social interaction on the platform relevant to a major political event.

The dataset has been released here on Zenodo: https://doi.org/10.5281/zenodo.7926959 as well as on Github: https://github.com/networkdynamics/data-and-code/tree/master/ukraine_tiktok

To create the dataset, we identified hashtags and keywords explicitly related to the conflict to collect a core set of videos (or "TikToks"). We then compiled comments associated with these videos. All of the data captured is publicly available information, and contains personally identifiable information. In total we collected approximately 16 thousand videos and 12 million comments, from approximately 6 million users. There are approximately 1.9 comments on average per user captured, and 1.5 videos per user who posted a video. The author personally collected this data using the web scraping PyTok library, developed by the author: https://github.com/networkdynamics/pytok.

Due to scraping duration, this is just a sample of the publicly available discourse concerning the invasion of Ukraine on TikTok. Due to the fuzzy search functionality of TikTok, the dataset contains videos with a range of relatedness to the invasion.

We release here the unique video IDs of the dataset in a CSV format. The data was collected without the specific consent of the content creators, so we have released only the data required to re-create it, to allow users to delete content from TikTok and be removed from the dataset if they wish. Contained in this repository are scripts that will automatically pull the full dataset, which will take the form of JSON files organised into a folder for each video. The JSON files are the entirety of the data returned by the TikTok API. We include a script to parse the JSON files into CSV files with the most commonly used data. We plan to further expand this dataset as collection processes progress and the war continues. We will version the dataset to ensure reproducibility.

To build this dataset from the IDs here:

Go to https://github.com/networkdynamics/pytok and clone the repo locally

Run pip install -e . in the pytok directory

Run pip install pandas tqdm to install these libraries if not already installed

Run get_videos.py to get the video data

Run video_comments.py to get the comment data

Run user_tiktoks.py to get the video history of the users

Run hashtag_tiktoks.py or search_tiktoks.py to get more videos from other hashtags and search terms

Run load_json_to_csv.py to compile the JSON files into two CSV files, comments.csv and videos.csv

If you get an error about the wrong Chrome version, pass the --chrome-version command line argument, for example: get_videos.py --chrome-version YOUR_CHROME_VERSION

Please note that pulling data from TikTok takes a while! We recommend leaving the scripts running on a server until they finish downloading everything. Feel free to adjust the delay constants to either speed up the process or avoid TikTok rate limiting.

Please do not hesitate to make an issue in this repo to get our help with this!

The videos.csv will contain the following columns:

video_id: Unique video ID

createtime: UTC datetime of video creation time in YYYY-MM-DD HH:MM:SS format

author_name: Unique author name

author_id: Unique author ID

desc: The full video description from the author

hashtags: A list of hashtags used in the video description

share_video_id: If the video is sharing another video, this is the video ID of that original video, else empty

share_video_user_id: If the video is sharing another video, this is the user ID of the author of that video, else empty

share_video_user_name: If the video is sharing another video, this is the user name of the author of that video, else empty

share_type: If the video is sharing another video, this is the type of the share (stitch, duet, etc.)

mentions: A list of users mentioned in the video description, if any
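
The column layout above can be read with the standard library alone. The following is a minimal sketch, not the repository's own loader: the sample rows are invented, and it assumes that the list-valued columns (hashtags, mentions) are serialised as Python list literals, which the description does not confirm.

```python
import ast
import csv
import io
from collections import Counter
from datetime import datetime, timezone

# Hypothetical two-row sample standing in for videos.csv. The real file is
# produced by load_json_to_csv.py; every value below is made up.
SAMPLE_VIDEOS_CSV = """video_id,createtime,author_name,author_id,desc,hashtags,share_video_id,share_video_user_id,share_video_user_name,share_type,mentions
1001,2022-03-01 12:30:00,user_a,11,"News update #ukraine","['ukraine']",,,,,[]
1002,2022-03-02 08:15:45,user_b,22,"Reply #ukraine #news","['ukraine', 'news']",1001,11,user_a,stitch,"['user_a']"
"""


def parse_list(cell):
    """Parse a list-valued cell (assumed to be a Python list literal)."""
    return ast.literal_eval(cell) if cell else []


def load_videos(handle):
    """Read video rows and convert createtime to a timezone-aware UTC datetime."""
    rows = []
    for row in csv.DictReader(handle):
        row["createtime"] = datetime.strptime(
            row["createtime"], "%Y-%m-%d %H:%M:%S"
        ).replace(tzinfo=timezone.utc)
        row["hashtags"] = parse_list(row["hashtags"])
        row["mentions"] = parse_list(row["mentions"])
        rows.append(row)
    return rows


videos = load_videos(io.StringIO(SAMPLE_VIDEOS_CSV))

# Count hashtag usage and pick out videos that share (stitch/duet) another video.
hashtag_counts = Counter(tag for v in videos for tag in v["hashtags"])
shares = [v for v in videos if v["share_video_id"]]

print(hashtag_counts.most_common(2))
print([(v["video_id"], v["share_type"]) for v in shares])
```

To use it on real data, replace the io.StringIO sample with an open file handle to videos.csv and adjust parse_list if the lists turn out to be stored differently.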

The comments.csv will contain the following columns:

comment_id: Unique comment ID

createtime: UTC datetime of comment creation time in YYYY-MM-DD HH:MM:SS format

author_name: Unique author name

author_id: Unique author ID

text: Text of the comment

mentions: A list of users that are tagged in the comment

video_id: The ID of the video the comment is on

comment_language: The language of the comment, as predicted by the TikTok API

reply_comment_id: If the comment is replying to another comment, this is the ID of that comment
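
As an illustration of how these columns might be used, the sketch below computes a language breakdown and the mean number of comments per user from a tiny invented sample. The language codes ("en", "uk") and all row values are assumptions for illustration only.

```python
import csv
import io
from collections import Counter

# Hypothetical sample standing in for comments.csv; all values are invented.
SAMPLE_COMMENTS_CSV = """comment_id,createtime,author_name,author_id,text,mentions,video_id,comment_language,reply_comment_id
c1,2022-03-01 13:00:00,user_c,33,Stay safe,[],1001,en,
c2,2022-03-01 13:05:10,user_d,44,Dyakuyu,[],1001,uk,c1
c3,2022-03-01 14:20:00,user_c,33,Thank you,[],1001,en,c2
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_COMMENTS_CSV)))

# Number of comments per predicted language.
language_counts = Counter(r["comment_language"] for r in rows)

# Mean comments per commenting user (the same kind of figure as the
# "comments on average per user" statistic quoted in the description).
comments_per_user = Counter(r["author_id"] for r in rows)
mean_comments_per_user = len(rows) / len(comments_per_user)

# Comments that reply to another comment have a non-empty reply_comment_id.
replies = [r for r in rows if r["reply_comment_id"]]

print(language_counts)
print(mean_comments_per_user, len(replies))
```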

The data can be compiled into a user interaction network to facilitate study of interaction dynamics. There is code to help with that here: https://github.com/networkdynamics/polar-seeds. Additional scripts for further preprocessing of this data can be found there too.
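
One possible way to derive such a network is sketched below. This is not the polar-seeds code: it is a simplified, hypothetical edge definition in which a reply points at the author of the parent comment and a top-level comment points at the video's author. The rows and the video_authors mapping are invented.

```python
from collections import Counter

# Hypothetical comment rows using the comments.csv column names.
comments = [
    {"comment_id": "c1", "author_id": "u1", "video_id": "v1", "reply_comment_id": ""},
    {"comment_id": "c2", "author_id": "u2", "video_id": "v1", "reply_comment_id": "c1"},
    {"comment_id": "c3", "author_id": "u3", "video_id": "v1", "reply_comment_id": "c1"},
    {"comment_id": "c4", "author_id": "u2", "video_id": "v1", "reply_comment_id": "c1"},
]
# Hypothetical video_id -> author_id mapping, e.g. built from videos.csv.
video_authors = {"v1": "u9"}


def build_edges(comments, video_authors):
    """Return weighted directed edges {(source_user, target_user): count}."""
    comment_author = {c["comment_id"]: c["author_id"] for c in comments}
    edges = Counter()
    for c in comments:
        parent = c["reply_comment_id"]
        if parent:
            # A reply interacts with the author of the parent comment.
            target = comment_author.get(parent)
        else:
            # A top-level comment interacts with the video's author.
            target = video_authors.get(c["video_id"])
        # Skip unknown targets (parent not captured) and self-interactions.
        if target is not None and target != c["author_id"]:
            edges[(c["author_id"], target)] += 1
    return edges


edges = build_edges(comments, video_authors)
for (source, target), weight in sorted(edges.items()):
    print(source, "->", target, weight)
```

The resulting edge counts can be passed to a graph library of your choice; other edge definitions (for example, using the mentions column) are equally plausible.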
TikTok-10M
huggingface.co
Cite
Dataset Company, TikTok-10M [Dataset]. https://huggingface.co/datasets/The-data-company/TikTok-10M
Dataset authored and provided by
Dataset Company
License
https://choosealicense.com/licenses/other/
Description
TikTok-10M Dataset

Dataset Description

TikTok-10M is a large-scale dataset containing 10 million short-form posts from TikTok, designed for video understanding, multimodal learning, and social media content analysis. The dataset was curated to bridge the gap between academic video datasets and actual user-generated content, providing researchers with authentic patterns and characteristics of modern short-form video content that dominates social media platforms.… See the full description on the dataset page: https://huggingface.co/datasets/The-data-company/TikTok-10M.
TikTokData.xlsx
figshare.com
xlsx
Updated Jun 14, 2022
Cite
Emily Zawacki (2022). TikTokData.xlsx [Dataset]. http://doi.org/10.6084/m9.figshare.20069333.v1
Available download formats
xlsx
Unique identifier
https://doi.org/10.6084/m9.figshare.20069333.v1
Dataset updated
Jun 14, 2022
Dataset provided by
figshare
Authors
Emily Zawacki
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
We used TikTok’s built-in account analytics to download and record video and account metrics for the period between 10/8/2021 and 2/6/2022. We collected the following summary data for each individual video: video views, likes, comments, shares, total cumulative play time, average duration the video was watched, percentage of viewers who watched the full video, unique reached audience, and the percentage of video views by section (For You, personal profile, Following, hashtags).
We evaluated the “success” of videos based on reach and engagement metrics, as well as viewer retention (how long a video is watched). We used metrics of reach (number of unique users the video was seen by) and engagement (likes, comments, and shares) to calculate the engagement rate of each video. The engagement rate is calculated as the engagement parameter as a percentage of total reach (e.g., Likes / Audience Reached *100).
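
The engagement-rate formula described above (engagement parameter as a percentage of total reach) can be written as a short function. The metric values below are invented for illustration, and the handling of zero reach is an assumption, since the description does not say how such videos were treated.

```python
def engagement_rate(engagement_count, audience_reached):
    """Engagement as a percentage of reach, e.g. Likes / Audience Reached * 100."""
    if audience_reached <= 0:
        # Assumption: the rate is undefined when nobody was reached.
        return None
    return engagement_count * 100 / audience_reached


# Hypothetical metrics for a single video.
video = {"likes": 150, "comments": 20, "shares": 5, "reach": 2000}

rates = {
    metric: engagement_rate(video[metric], video["reach"])
    for metric in ("likes", "comments", "shares")
}
print(rates)
```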
TikHarm Dataset
kaggle.com
Updated Jun 29, 2024
Cite
An Hoang Vo (2024). TikHarm Dataset [Dataset]. https://www.kaggle.com/datasets/anhoangvo/tikharm-dataset/versions/1
Explore at:
Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
Dataset updated
Jun 29, 2024
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
An Hoang Vo
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
The TikHarm dataset is a curated collection of TikTok videos designed to train models for classifying harmful content. The dataset is in the format of UCF101, and it is specifically focused on content accessible to children, with the aim of distinguishing between different types of potentially harmful material.

Data Collection:

Data was gathered from TikTok, targeting videos that are accessible to children to ensure the dataset reflects the type of content they are likely to encounter.

Data Labeling:

Collected videos were manually labeled into four predefined categories:
- Harmful Content: Videos that depict violence, dangerous actions that children might imitate, or other harmful behavior.
- Adult Content: Videos containing sexual content or other material deemed inappropriate for children.
- Safe: Videos that are appropriate and safe for children to view (popular cartoons, etc.).
- Suicide: Videos that depict, suggest, or discuss suicidal behavior or ideation.

Dataset Statistics:

Subset | Samples | Min Duration (s) | Max Duration (s) | Avg Duration (s) | Total Duration (h)
Train | 2762 | 3.88 | 600 | 38.71 | 29.71
Dev | 790 | 5.04 | 600 | 38.57 | 4.24
Test | 396 | 1.95 | 600 | 38.77 | 8.51

Class | Samples | Min Duration (s) | Max Duration (s) | Avg Duration (s) | Total Duration (h)
Safe | 997 | 5.04 | 568.8 | 65.36 | 18.1
Adult | 977 | 1.95 | 600 | 36.25 | 9.84
Harmful | 990 | 4.8 | 600 | 35.92 | 9.88
Suicide | 984 | 3.88 | 181.23 | 16.96 | 4.63

These tables present the duration statistics for each subset and class within the TikHarm dataset.
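
The columns in the class table are related by simple arithmetic: total duration in hours is approximately samples × average duration in seconds / 3600. The sketch below recomputes that estimate from the class rows. It is only a sanity check on the published figures, not part of the dataset's tooling, and small differences are expected because the averages are rounded.

```python
# Rows copied from the class table above:
# (class, samples, average duration in seconds, reported total duration in hours)
CLASS_STATS = [
    ("Safe", 997, 65.36, 18.1),
    ("Adult", 977, 36.25, 9.84),
    ("Harmful", 990, 35.92, 9.88),
    ("Suicide", 984, 16.96, 4.63),
]


def estimated_hours(samples, avg_seconds):
    """Estimate total duration in hours from sample count and mean duration."""
    return samples * avg_seconds / 3600


differences = {}
for name, samples, avg_s, reported_h in CLASS_STATS:
    est = estimated_hours(samples, avg_s)
    differences[name] = abs(est - reported_h)
    print(f"{name:8s} estimated {est:6.2f} h, reported {reported_h:6.2f} h")

# Sample-weighted mean duration across all classes, in seconds.
total_samples = sum(samples for _, samples, _, _ in CLASS_STATS)
overall_avg = sum(samples * avg_s for _, samples, avg_s, _ in CLASS_STATS) / total_samples
print(total_samples, round(overall_avg, 2))
```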

This comprehensive dataset is invaluable for developing robust video classification models to automatically detect and categorize harmful content on social media platforms.
ai-tube-tik-tak-tok
huggingface.co
Updated Dec 21, 2023
Cite
Julian Bilcke (2023). ai-tube-tik-tak-tok [Dataset]. https://huggingface.co/datasets/jbilcke-hf/ai-tube-tik-tak-tok
Explore at:
Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
Dataset updated
Dec 21, 2023
Authors
Julian Bilcke
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description

Tik Tak Tok - Est. 2023

Model: HotshotXL
Voice: Julian
Orientation: Portrait
Tags: Short Dancing
Style: tiktok video, instagram, beautiful, sharp, detailed
Music: mainstream pop music

Prompt

A channel generating short vertical videos, between 20 seconds and 60 seconds. Most videos are about people dancing, doing choreography, or taking selfies, filming their cats, daily life (eg. going to a cafe… See the full description on the dataset page: https://huggingface.co/datasets/jbilcke-hf/ai-tube-tik-tak-tok.
Top 10 Most Viral TikTok Videos of 2024
learningrevolution.net
html
Updated Jun 24, 2025
Cite
Jawad Khan (2025). Top 10 Most Viral TikTok Videos of 2024 [Dataset]. https://www.learningrevolution.net/viral-on-tiktok/
Available download formats
html
Dataset updated
Jun 24, 2025
Authors
Jawad Khan
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
A ranked dataset of the most viral TikTok videos in 2024, based on total views and creator engagement.
TikTok_Most_Shared_Video_Transcription_Example
huggingface.co
Updated Jul 17, 2025
Cite
Masa (2025). TikTok_Most_Shared_Video_Transcription_Example [Dataset]. https://huggingface.co/datasets/MasaFoundation/TikTok_Most_Shared_Video_Transcription_Example
Dataset updated
Jul 17, 2025
Dataset authored and provided by
Masa
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
📲 Example Dataset: TikTok Scraper Tool

👉 Start Scraping TikTok: TikTok Scraper Tool

✨ Key Features

⚡ Instant Transcription – Turn any TikTok video into an AI-ready transcript
🎯 Metadata – Get the title, language description, and video hashtags
🔗 URL-Based Access – Just drop in a TikTok video URL to start scraping
🧩 LLM-Ready Output – Receive clean JSON ready for agents, RAG, or AI tools
💸 Free Tier – Use up to 100 queries during the beta period
💫 Easy… See the full description on the dataset page: https://huggingface.co/datasets/MasaFoundation/TikTok_Most_Shared_Video_Transcription_Example.
Dataset for "Short-Form Videos Degrade Our Capacity to Retain Intentions:...
darus.uni-stuttgart.de
b2find.eudat.eu
Updated Sep 16, 2024
Cite
Francesco Chiossi; Luke Haliburton; Changkun Ou; Andreas Butz; Albrecht Schmidt (2024). Dataset for "Short-Form Videos Degrade Our Capacity to Retain Intentions: Effect of Context Switching On Prospective Memory" [Dataset]. http://doi.org/10.18419/DARUS-3327
Explore at:
Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
Unique identifier
https://doi.org/10.18419/DARUS-3327
Dataset updated
Sep 16, 2024
Dataset provided by
DaRUS
Authors
Francesco Chiossi; Luke Haliburton; Changkun Ou; Andreas Butz; Albrecht Schmidt
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset funded by
DFG
Description
Social media platforms use short, highly engaging videos to catch users’ attention. While the short-form video feeds popularized by TikTok are rapidly spreading to other platforms, we do not yet understand their impact on cognitive functions. We conducted a between-subjects experiment (N = 60) investigating the impact of engaging with TikTok, Twitter, and YouTube while performing a Prospective Memory task (i.e., executing a previously planned action). The study required participants to remember intentions over interruptions. We found that the TikTok condition significantly degraded the users’ performance in this task. As none of the other conditions (Twitter, YouTube, no activity) had a similar effect, our results indicate that the combination of short videos and rapid context-switching impairs intention recall and execution. We contribute a quantified understanding of the effect of social media feed format on Prospective Memory and outline implications for media technology designers so that they do not harm users’ memory and wellbeing.

Description of the Dataset

Data frame:
The ./data/rt.csv file provides the data frame of reaction times.
The ./data/acc.csv file provides the data frame of reaction accuracy scores.
The ./data/q.csv file provides the data frame collected from questionnaires.
The ./data/ddm.csv file contains the DDM features learned using ./appendix2_ddm_fitting.ipynb, which are then used in ./3.ddm_anova.ipynb.

Figures:
All figures that appear in the paper are placed in ./figures and can be reproduced using the *_vis.ipynb files.
MovingFashion
huggingface.co
Updated Jul 27, 2025
Cite
Christian Joppi (2025). MovingFashion [Dataset]. https://huggingface.co/datasets/christianjoppi/MovingFashion
Dataset updated
Jul 27, 2025
Authors
Christian Joppi
Description
MovingFashion Dataset

MovingFashion is the first publicly available benchmark designed to address the video-to-shop challenge in computer vision, where the goal is to retrieve fashion items worn in social media videos (e.g., Instagram, TikTok) by matching them to corresponding e-commerce product images. GitHub Repository license: cc-by-nc-4.0

Overview

Total Videos: 14,855 social videos
Source Platforms: Instagram, TikTok, and Net-A-Porter
Associated Shop Images:… See the full description on the dataset page: https://huggingface.co/datasets/christianjoppi/MovingFashion.
myanmar_cele_voices
huggingface.co
Cite
Wynn, myanmar_cele_voices [Dataset]. https://huggingface.co/datasets/freococo/myanmar_cele_voices
Authors
Wynn
License
https://choosealicense.com/licenses/other/
Area covered
Myanmar (Burma)
Description
Myanmar Celebrity Voices

A high-quality speech dataset extracted from the official TikTok channel of Myanmar Celebrity TV.

Myanmar Celebrity Voices is a collection of 69,781 short audio segments (≈46 hours total) derived from public TikTok videos by The Official TikTok Channel of Myanmar Celebrity TV — one of the most popular digital media platforms in Myanmar. The source channel regularly publishes:

Interviews with Myanmar’s top movie actors and actresses
Behind-the-scenes… See the full description on the dataset page: https://huggingface.co/datasets/freococo/myanmar_cele_voices.
LMVD
figshare.com
bin
Updated Apr 26, 2024
Cite
Lang He (2024). LMVD [Dataset]. http://doi.org/10.6084/m9.figshare.25698351.v1
Available download formats
bin
Unique identifier
https://doi.org/10.6084/m9.figshare.25698351.v1
Dataset updated
Apr 26, 2024
Dataset provided by
Figshare (http://figshare.com/)
Authors
Lang He
License
Apache License 2.0: https://www.apache.org/licenses/LICENSE-2.0.html
Description
We propose a large-scale multimodal video log database (LMVD) for identifying depression in the wild. LMVD contains 1823 samples, capturing 214 hours from 1475 participants across four multimedia platforms (Sina Weibo, Bilibili, TikTok, and YouTube). For all collected data, we extract video features and audio features separately. For audio features, we use a pre-trained VGGish41 model. For visual features, we use FAU, facial markers, eye gaze, and head posture features. It is worth mentioning that our LMVD is the largest dataset for identifying visual and auditory depression in an individual's daily life, which is a positive contribution to the field of emotional computing.
Most Streamed Spotify Songs 2024
kaggle.com
Updated Jun 15, 2024
Cite
Nidula Elgiriyewithana ⚡ (2024). Most Streamed Spotify Songs 2024 [Dataset]. http://doi.org/10.34740/kaggle/dsv/8700156
Explore at:
Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
Unique identifier
https://doi.org/10.34740/kaggle/dsv/8700156
Dataset updated
Jun 15, 2024
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Nidula Elgiriyewithana ⚡
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description

This dataset presents a comprehensive compilation of the most streamed songs on Spotify in 2024. It provides extensive insights into each track's attributes, popularity, and presence on various music platforms, offering a valuable resource for music analysts, enthusiasts, and industry professionals. The dataset includes information such as track name, artist, release date, ISRC, streaming statistics, and presence on platforms like YouTube, TikTok, and more.

Here is the link for the 2023 data: Most Streamed Spotify Songs 2023 (https://www.kaggle.com/datasets/nelgiriyewithana/top-spotify-songs-2023) 🟢

Key Features

Track Name: Name of the song.

Album Name: Name of the album the song belongs to.

Artist: Name of the artist(s) of the song.

Release Date: Date when the song was released.

ISRC: International Standard Recording Code for the song.

All Time Rank: Ranking of the song based on its all-time popularity.

Track Score: Score assigned to the track based on various factors.

Spotify Streams: Total number of streams on Spotify.

Spotify Playlist Count: Number of Spotify playlists the song is included in.

Spotify Playlist Reach: Reach of the song across Spotify playlists.

Spotify Popularity: Popularity score of the song on Spotify.

YouTube Views: Total views of the song's official video on YouTube.

YouTube Likes: Total likes on the song's official video on YouTube.

TikTok Posts: Number of TikTok posts featuring the song.

TikTok Likes: Total likes on TikTok posts featuring the song.

TikTok Views: Total views on TikTok posts featuring the song.

YouTube Playlist Reach: Reach of the song across YouTube playlists.

Apple Music Playlist Count: Number of Apple Music playlists the song is included in.

AirPlay Spins: Number of times the song has been played on radio stations.

SiriusXM Spins: Number of times the song has been played on SiriusXM.

Deezer Playlist Count: Number of Deezer playlists the song is included in.

Deezer Playlist Reach: Reach of the song across Deezer playlists.

Amazon Playlist Count: Number of Amazon Music playlists the song is included in.

Pandora Streams: Total number of streams on Pandora.

Pandora Track Stations: Number of Pandora stations featuring the song.

Soundcloud Streams: Total number of streams on Soundcloud.

Shazam Counts: Total number of times the song has been Shazamed.

TIDAL Popularity: Popularity score of the song on TIDAL.

Explicit Track: Indicates whether the song contains explicit content.

Potential Use Cases

Music Analysis: Analyze trends in audio features to understand popular song characteristics.

Platform Comparison: Compare song popularity across different music platforms.

Artist Impact: Study the relationship between artist attributes and song success.

Temporal Trends: Identify changes in music attributes and preferences over time.

Cross-Platform Presence: Investigate song performance across various streaming services.

Your support through an upvote would be greatly appreciated if you find this dataset useful! ❤️🙂 Thank you.
Viral Views by Platform – How Many Views Is Viral (2025)
learningrevolution.net
html
Updated Jun 23, 2025
Cite
Jawad Khan (2025). Viral Views by Platform – How Many Views Is Viral (2025) [Dataset]. https://www.learningrevolution.net/how-many-views-is-viral/
Available download formats
html
Dataset updated
Jun 23, 2025
Authors
Jawad Khan
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
Platform, Time to Go Viral, Viral Views Threshold
Description
A structured dataset comparing viral view thresholds and timeframes across major platforms, including TikTok, YouTube (long-form & Shorts), Instagram Reels, Facebook, Twitter (X), LinkedIn Video, and LinkedIn Posts.
Social Media Datasets
brightdata.com
.json, .csv, .xlsx
Updated Sep 18, 2024
Cite
Bright Data (2024). Social Media Datasets [Dataset]. https://brightdata.com/products/datasets/social-media
Available download formats
.json, .csv, .xlsx
Dataset updated
Sep 18, 2024
Dataset authored and provided by
Bright Data (https://brightdata.com/)
License
https://brightdata.com/license
Area covered
Worldwide
Description
Gain valuable insights with our comprehensive Social Media Dataset, designed to help businesses, marketers, and analysts track trends, monitor engagement, and optimize strategies. This dataset provides structured and reliable social media data from multiple platforms.

Dataset Features

User Profiles: Access public social media profiles, including usernames, bios, follower counts, engagement metrics, and more. Ideal for audience analysis, influencer marketing, and competitive research.
Posts & Content: Extract posts, captions, hashtags, media (images/videos), timestamps, and engagement metrics such as likes, shares, and comments. Useful for trend analysis, sentiment tracking, and content strategy optimization.
Comments & Interactions: Analyze user interactions, including replies, mentions, and discussions. This data helps brands understand audience sentiment and engagement patterns.
Hashtag & Trend Tracking: Monitor trending hashtags, topics, and viral content across platforms to stay ahead of industry trends and consumer interests.

Customizable Subsets for Specific Needs

Our Social Media Dataset is fully customizable, allowing you to filter data based on platform, region, keywords, engagement levels, or specific user profiles. Whether you need a broad dataset for market research or a focused subset for brand monitoring, we tailor the dataset to your needs.

Popular Use Cases

Brand Monitoring & Reputation Management: Track brand mentions, customer feedback, and sentiment analysis to manage online reputation effectively.
Influencer Marketing & Audience Analysis: Identify key influencers, analyze engagement metrics, and optimize influencer partnerships.
Competitive Intelligence: Monitor competitor activity, content performance, and audience engagement to refine marketing strategies.
Market Research & Consumer Insights: Analyze social media trends, customer preferences, and emerging topics to inform business decisions.
AI & Predictive Analytics: Leverage structured social media data for AI-driven trend forecasting, sentiment analysis, and automated content recommendations.

Whether you're tracking brand sentiment, analyzing audience engagement, or monitoring industry trends, our Social Media Dataset provides the structured data you need. Get started today and customize your dataset to fit your business objectives.
Original data set used for the current study.
plos.figshare.com
xlsx
Updated Feb 6, 2025
Cite
Genyan Jiang; Lei Chen; Lan Geng; Yuhan Zhang; Zhiqi Chen; Yaqi Zhu; Shuangshuang Ma; Mei Zhao (2025). Original data set used for the current study. [Dataset]. http://doi.org/10.1371/journal.pone.0316242.s001
Available download formats
xlsx
Unique identifier
https://doi.org/10.1371/journal.pone.0316242.s001
Dataset updated
Feb 6, 2025
Dataset provided by
PLOS ONE
Authors
Genyan Jiang; Lei Chen; Lan Geng; Yuhan Zhang; Zhiqi Chen; Yaqi Zhu; Shuangshuang Ma; Mei Zhao
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Background: TikTok is an important channel for consumers to obtain and adopt health information. However, misinformation on TikTok could potentially impact public health. Currently, the quality of content related to GDM on TikTok has not been thoroughly reviewed.

Objective: This study aims to explore the information quality of GDM videos on TikTok.

Methods: A comprehensive cross-sectional study was conducted on TikTok videos related to GDM. The quality of the videos was assessed using three standardized evaluation tools: DISCERN, the Journal of the American Medical Association (JAMA) benchmarks, and the Global Quality Scale (GQS). The comprehensiveness of the content was evaluated through six questions covering definitions, signs/symptoms, risk factors, evaluation, management, and outcomes. Additionally, a correlational analysis was conducted between video quality and the characteristics of the uploaders and the videos themselves.

Results: A total of 216 videos were included in the final analysis, with 162 uploaded by health professionals, 40 by general users, and the remaining videos contributed by individual science communicators, for-profit organizations, and news agencies. The average DISCERN, JAMA, and GQS scores for all videos were 48.87, 1.86, and 2.06, respectively. The videos uploaded by health professionals scored the highest in DISCERN, while the videos uploaded by individual science communicators scored significantly higher in JAMA and GQS than those from other sources. Correlation analysis between video quality and video features showed DISCERN scores, JAMA scores and GQS scores were positively correlated with video duration (P
UGC-VideoCap
huggingface.co
Updated Jul 16, 2025
Cite
Memories.ai Research (2025). UGC-VideoCap [Dataset]. https://huggingface.co/datasets/openinterx/UGC-VideoCap
Dataset updated
Jul 16, 2025
Dataset authored and provided by
Memories.ai Research
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
UGC-VideoCaptioner Dataset

Real-world user-generated videos, especially on platforms like TikTok, often feature rich and intertwined audio-visual content. However, existing video captioning benchmarks and models remain predominantly visual-centric, overlooking the crucial role of audio in conveying scene dynamics, speaker intent, and narrative context. This lack of full-modality datasets and lightweight, capable models hampers progress in fine-grained, multimodal video… See the full description on the dataset page: https://huggingface.co/datasets/openinterx/UGC-VideoCap.
Original data set used for the current study.
plos.figshare.com
xlsx
Updated Mar 8, 2024
Cite
Juanjuan Zhang; Jun Yuan; Danqin Zhang; Yi Yang; Chaoyun Wang; Zhiqian Dou; Yan Li (2024). Original data set used for the current study. [Dataset]. http://doi.org/10.1371/journal.pone.0300180.s001
Available download formats
xlsx
Unique identifier
https://doi.org/10.1371/journal.pone.0300180.s001
Dataset updated
Mar 8, 2024
Dataset provided by
PLOS ONE
Authors
Juanjuan Zhang; Jun Yuan; Danqin Zhang; Yi Yang; Chaoyun Wang; Zhiqian Dou; Yan Li
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Background: The development of short popular science video platforms helps people obtain health information, but no research has evaluated the information characteristics and quality of short videos related to cervical cancer. The purpose of this study was to evaluate the quality and reliability of short cervical cancer-related videos on TikTok and Kwai.

Methods: The Chinese keyword "cervical cancer" was used to search for related videos on TikTok and Kwai, and a total of 163 videos were ultimately included. The overall quality of these videos was evaluated by the Global Quality Score (GQS) and the modified DISCERN tool.

Results: A total of 163 videos were included in this study; TikTok and Kwai contributed 82 and 81 videos, respectively. Overall, these videos received much attention; the median number of likes received was 1360 (403–6867), the median number of comments was 147 (40–601), and the median number of collections was 282 (71–1296). In terms of video content, the etiology of cervical cancer was the most frequently discussed topic. Short videos posted on TikTok received more attention than did those posted on Kwai, and the GQS and DISCERN scores of videos posted on TikTok were significantly better than those of videos posted on Kwai. In addition, the videos posted by specialists were of the highest quality, with a GQS and DISCERN score of 3 (2–3) and 2 (2–3), respectively. Correlation analysis showed that GQS was significantly correlated with the modified DISCERN scores (p

Subset	Samples	Min Duration (s)	Max Duration (s)	Avg Duration (s)	Total Duration (h)
Train	2762	3.88	600	38.71	29.71
Dev	790	5.04	600	38.57	4.24
Test	396	1.95	600	38.77	8.51

Class	Samples	Min Duration (s)	Max Duration (s)	Avg Duration (s)	Total Duration (h)
Safe	997	5.04	568.8	65.36	18.1
Adult	977	1.95	600	36.25	9.84
Harmful	990	4.8	600	35.92	9.88
Suicide	984	3.88	181.23	16.96	4.63

Instagram: distribution of global audiences 2024, by gender

statista.com
es.statista.com

Stacy Jo Dixon, Instagram: distribution of global audiences 2024, by gender [Dataset]. https://www.statista.com/topics/1164/social-networks/

Dataset provided by

Statista: http://statista.com/

Authors

Stacy Jo Dixon

Description

As of January 2024, Instagram was slightly more popular with men than women, with men accounting for 50.6 percent of the platform’s global users. Additionally, the social media app was most popular amongst younger audiences, with almost 32 percent of users aged between 18 and 24 years.

Instagram’s Global Audience

As of January 2024, Instagram was the fourth most popular social media platform globally, reaching two billion monthly active users (MAU). This number is projected to keep growing with no signs of slowing down, which is not a surprise as the global online social penetration rate across all regions is constantly increasing.
As of January 2024, the country with the largest Instagram audience was India with 362.9 million users, followed by the United States with 169.7 million users.

Who is winning over the generations?

Even though Instagram’s audience is almost twice the size of TikTok’s on a global scale, TikTok has shown itself to be a fierce competitor, particularly amongst younger audiences. TikTok was the most downloaded mobile app globally in 2022, generating 672 million downloads. As of 2022, Generation Z in the United States spent more time on TikTok than on Instagram monthly.

Average daily time spent on social media worldwide 2012-2024

statista.com
es.statista.com

Stacy Jo Dixon, Average daily time spent on social media worldwide 2012-2024 [Dataset]. https://www.statista.com/topics/1164/social-networks/

Dataset provided by

Statista: http://statista.com/

Authors

Stacy Jo Dixon

Description

How much time do people spend on social media?

As of 2024, the average daily social media usage of internet users worldwide amounted to 143 minutes per day, down from 151 minutes in the previous year. Currently, the country with the most time spent on social media per day is Brazil, with online users spending an average of three hours and 49 minutes on social media each day. In comparison, the daily time spent on social media in the U.S. was just two hours and 16 minutes.

Global social media usage

Currently, the global social network penetration rate is 62.3 percent. Northern Europe had an 81.7 percent social media penetration rate, topping the ranking of global social media usage by region. Eastern and Middle Africa closed the ranking with 10.1 and 9.6 percent usage reach, respectively.
People access social media for a variety of reasons. Users like to find funny or entertaining content and enjoy sharing photos and videos with friends, but mainly use social media to stay in touch with current events and friends.

Global impact of social media

Social media has a wide-reaching and significant impact on not only online activities but also offline behavior and life in general.
During a global online user survey in February 2019, a significant share of respondents stated that social media had increased their access to information, ease of communication, and freedom of expression. On the flip side, respondents also felt that social media had worsened their personal privacy, increased polarization in politics, and heightened everyday distractions.

Instagram: distribution of global audiences 2024, by age group

statista.com
es.statista.com

Stacy Jo Dixon, Instagram: distribution of global audiences 2024, by age group [Dataset]. https://www.statista.com/topics/1164/social-networks/

Dataset provided by

Statista: http://statista.com/

Authors

Stacy Jo Dixon

Description

As of April 2024, almost 32 percent of global Instagram audiences were aged between 18 and 24 years, and 30.6 percent of users were aged between 25 and 34 years. Overall, 16 percent of users belonged to the 35 to 44 year age group.

Instagram users

With roughly one billion monthly active users, Instagram is among the most popular social networks worldwide. The social photo sharing app is especially popular in India and in the United States, which have 362.9 million and 169.7 million Instagram users, respectively.

Instagram features

One of the most popular features of Instagram is Stories. Users can post photos and videos to their Stories stream and the content is live for others to view for 24 hours before it disappears. In January 2019, the company reported that there were 500 million daily active Instagram Stories users. Instagram Stories directly competes with Snapchat, another photo sharing app that initially became famous due to its “vanishing photos” feature.
As of the second quarter of 2021, Snapchat had 293 million daily active users.

The Invasion of Ukraine Viewed through TikTok: A Dataset

Benjamin Steel; Sara Parker; Derek Ruths (2023). The Invasion of Ukraine Viewed through TikTok: A Dataset [Dataset]. http://doi.org/10.5281/zenodo.7926959

10 scholarly articles cite this dataset (View in Google Scholar)

Available download formats

text/x-python, bin, csv

Unique identifier

https://doi.org/10.5281/zenodo.7926959

Dataset updated

May 13, 2023

Dataset provided by

Zenodo: http://zenodo.org/

Authors

Benjamin Steel; Sara Parker; Derek Ruths

License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Area covered

Ukraine

Description

This is a dataset of videos and comments related to the invasion of Ukraine, published on TikTok by a number of users over the course of 2022. It was compiled by Benjamin Steel, Sara Parker and Derek Ruths at the Network Dynamics Lab, McGill University. We created this dataset to facilitate the study of TikTok, and the nature of social interaction on the platform relevant to a major political event.

The dataset has been released here on Zenodo: https://doi.org/10.5281/zenodo.7926959 as well as on GitHub: https://github.com/networkdynamics/data-and-code/tree/master/ukraine_tiktok

To create the dataset, we identified hashtags and keywords explicitly related to the conflict to collect a core set of videos (or “TikToks”). We then compiled comments associated with these videos. All of the data captured is publicly available information, and contains personally identifiable information. In total we collected approximately 16 thousand videos and 12 million comments, from approximately 6 million users. There are approximately 1.9 comments on average per user captured, and 1.5 videos per user who posted a video. The author personally collected this data using the web scraping PyTok library, developed by the author: https://github.com/networkdynamics/pytok.

Due to scraping duration, this is just a sample of the publicly available discourse concerning the invasion of Ukraine on TikTok. Due to the fuzzy search functionality of TikTok, the dataset contains videos with a range of relatedness to the invasion.

We release here the unique video IDs of the dataset in a CSV format. The data was collected without the specific consent of the content creators, so we have released only the data required to re-create it, to allow users to delete content from TikTok and be removed from the dataset if they wish. Contained in this repository are scripts that will automatically pull the full dataset, which will take the form of JSON files organised into a folder for each video. The JSON files are the entirety of the data returned by the TikTok API. We include a script to parse the JSON files into CSV files with the most commonly used data. We plan to further expand this dataset as collection processes progress and the war continues. We will version the dataset to ensure reproducibility.

To build this dataset from the IDs here:

1. Go to https://github.com/networkdynamics/pytok and clone the repo locally
2. Run pip install -e . in the pytok directory
3. Run pip install pandas tqdm to install these libraries if not already installed
4. Run get_videos.py to get the video data
5. Run video_comments.py to get the comment data
6. Run user_tiktoks.py to get the video history of the users
7. Run hashtag_tiktoks.py or search_tiktoks.py to get more videos from other hashtags and search terms
8. Run load_json_to_csv.py to compile the JSON files into two CSV files, comments.csv and videos.csv
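
The build steps above could be chained with a small driver script. The following is only a sketch, not part of the released code: it assumes the scripts sit in one working directory (the hypothetical `cwd` argument) and are invoked with the current Python interpreter. The only flag passed is `--chrome-version` on get_videos.py, as described in this README; any other arguments the scripts may need are not covered.

```python
import subprocess
import sys

# Script names come from the build steps; their full argument lists are not
# documented here, so nothing beyond --chrome-version is passed.
CORE_SCRIPTS = ["get_videos.py", "video_comments.py", "user_tiktoks.py"]
OPTIONAL_SCRIPTS = ["hashtag_tiktoks.py", "search_tiktoks.py"]
FINAL_SCRIPT = "load_json_to_csv.py"


def build_commands(chrome_version=None, extra_scripts=()):
    """Return the commands for the build steps, in order, without running them."""
    unknown = set(extra_scripts) - set(OPTIONAL_SCRIPTS)
    if unknown:
        raise ValueError(f"unknown optional scripts: {sorted(unknown)}")
    commands = []
    for script in [*CORE_SCRIPTS, *extra_scripts, FINAL_SCRIPT]:
        cmd = [sys.executable, script]
        # The README only mentions --chrome-version for get_videos.py.
        if chrome_version is not None and script == "get_videos.py":
            cmd += ["--chrome-version", str(chrome_version)]
        commands.append(cmd)
    return commands


def run_all(commands, cwd="."):
    """Run each command in turn, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, cwd=cwd, check=True)
```

Calling `build_commands(extra_scripts=["hashtag_tiktoks.py"])` and inspecting the result before passing it to `run_all` is a cheap way to confirm the order, since the actual downloads can run for a long time.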

If you get an error about the wrong Chrome version, pass the version as a command-line argument: get_videos.py --chrome-version YOUR_CHROME_VERSION. Please note that pulling data from TikTok takes a while! We recommend leaving the scripts running on a server until they finish downloading everything. Feel free to adjust the delay constants to either speed up the process or avoid TikTok rate limiting.

Please do not hesitate to open an issue in this repo to get our help with this!

The videos.csv will contain the following columns:

video_id: Unique video ID

createtime: UTC datetime of video creation time in YYYY-MM-DD HH:MM:SS format

author_name: Unique author name

author_id: Unique author ID

desc: The full video description from the author

hashtags: A list of hashtags used in the video description

share_video_id: If the video is sharing another video, this is the video ID of that original video, else empty

share_video_user_id: If the video is sharing another video, this is the user ID of the author of that video, else empty

share_video_user_name: If the video is sharing another video, this is the user name of the author of that video, else empty

share_type: If the video is sharing another video, this is the type of share (stitch, duet, etc.)

mentions: A list of users mentioned in the video description, if any
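
As an illustration of working with these columns, the sketch below reads a videos.csv-style file using only the Python standard library. It rests on assumptions that are not confirmed here: that list-valued columns (hashtags, mentions) are stored as Python-style list literals, and that createtime follows the documented YYYY-MM-DD HH:MM:SS layout exactly. The sample row is made up, and `is_share` is a derived field added for convenience, not a column in the file.

```python
import ast
import csv
import io
from datetime import datetime, timezone


def parse_list(cell):
    """Parse a list-valued cell; assumes a Python-style literal such as "['a', 'b']"."""
    if not cell:
        return []
    return list(ast.literal_eval(cell))


def read_videos(handle):
    """Yield one dict per row of a videos.csv-style file, with typed fields."""
    for row in csv.DictReader(handle):
        # createtime is documented as UTC, so attach the UTC timezone.
        row["createtime"] = datetime.strptime(
            row["createtime"], "%Y-%m-%d %H:%M:%S"
        ).replace(tzinfo=timezone.utc)
        row["hashtags"] = parse_list(row["hashtags"])
        row["mentions"] = parse_list(row["mentions"])
        # Derived flag: share_video_id is documented as empty for non-shares.
        row["is_share"] = bool(row["share_video_id"])
        yield row


# Made-up sample row for illustration only; not real data from the dataset.
SAMPLE = io.StringIO(
    "video_id,createtime,author_name,author_id,desc,hashtags,"
    "share_video_id,share_video_user_id,share_video_user_name,share_type,mentions\n"
    "1,2022-03-01 12:30:00,example_user,42,demo #ukraine,\"['ukraine']\",,,,,[]\n"
)
videos = list(read_videos(SAMPLE))
print(videos[0]["createtime"].isoformat(), videos[0]["hashtags"])
```

If the real files serialise lists differently (for example as JSON or as a delimiter-separated string), only `parse_list` would need to change.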

The comments.csv will contain the following columns:

comment_id: Unique comment ID

createtime: UTC datetime of comment creation time in YYYY-MM-DD HH:MM:SS format

author_name: Unique author name

author_id: Unique author ID

text: Text of the comment

mentions: A list of users that are tagged in the comment

video_id: The ID of the video the comment is on

comment_language: The language of the comment, as predicted by the TikTok API

reply_comment_id: If the comment is replying to another comment, this is the ID of that comment
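
A short sketch of two summaries these columns allow: a count of comments per predicted language and a grouping of replies under their parent comment. The rows are made up, the language codes are placeholders, and the code assumes that reply_comment_id is empty for top-level comments, which is not stated above.

```python
from collections import Counter, defaultdict

# Made-up rows shaped like comments.csv; values are illustrative only.
comments = [
    {"comment_id": "c1", "video_id": "v1", "comment_language": "en", "reply_comment_id": ""},
    {"comment_id": "c2", "video_id": "v1", "comment_language": "uk", "reply_comment_id": "c1"},
    {"comment_id": "c3", "video_id": "v1", "comment_language": "en", "reply_comment_id": "c1"},
]

# Number of comments per predicted language.
language_counts = Counter(c["comment_language"] for c in comments)

# Map each parent comment ID to the IDs of its direct replies.
replies = defaultdict(list)
for c in comments:
    if c["reply_comment_id"]:  # assumed empty for top-level comments
        replies[c["reply_comment_id"]].append(c["comment_id"])

print(dict(language_counts), dict(replies))
```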

The data can be compiled into a user interaction network to facilitate study of interaction dynamics. There is code to help with that here: https://github.com/networkdynamics/polar-seeds. Additional scripts for further preprocessing of this data can be found there too.
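
One simple way to form such a network, shown here as a sketch and not as the method used in the polar-seeds repository, is to draw a directed edge from the author of each reply to the author of the comment being replied to, weighted by the number of replies. The rows are made up, and the code assumes reply_comment_id is empty for top-level comments.

```python
from collections import Counter


def reply_edges(comments):
    """Count directed edges (replier author_id -> replied-to author_id)."""
    author_of = {c["comment_id"]: c["author_id"] for c in comments}
    edges = Counter()
    for c in comments:
        parent = c.get("reply_comment_id")
        # Skip top-level comments and replies whose parent was not collected.
        if parent and parent in author_of:
            edges[(c["author_id"], author_of[parent])] += 1
    return edges


# Made-up rows shaped like comments.csv; values are illustrative only.
sample_comments = [
    {"comment_id": "c1", "author_id": "a", "reply_comment_id": ""},
    {"comment_id": "c2", "author_id": "b", "reply_comment_id": "c1"},
    {"comment_id": "c3", "author_id": "b", "reply_comment_id": "c1"},
    {"comment_id": "c4", "author_id": "a", "reply_comment_id": "c2"},
    {"comment_id": "c5", "author_id": "c", "reply_comment_id": "missing"},
]
network = reply_edges(sample_comments)
print(dict(network))
```

Other edge types are possible from the same columns, for example commenter to video author via video_id, or author to mentioned user via mentions; the reply edge is used here only because it needs a single file.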
