Facebook
Twitterhttps://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
This dataset provides detailed rankings and key metrics for 100+ social media platforms and sites in 2025. It includes information such as user base, popularity trends, and global reach. Ideal for analyzing social media growth, user engagement, and market trends. Whether you're a data scientist, marketer, or researcher, this dataset offers valuable insights into the evolving digital landscape.
Facebook
Twitterhttps://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description:
The "Daily Social Media Active Users" dataset provides a comprehensive and dynamic look into the digital presence and activity of global users across major social media platforms. The data was generated to simulate real-world usage patterns for 13 popular platforms, including Facebook, YouTube, WhatsApp, Instagram, WeChat, TikTok, Telegram, Snapchat, X (formerly Twitter), Pinterest, Reddit, Threads, LinkedIn, and Quora. This dataset contains 10,000 rows and includes several key fields that offer insights into user demographics, engagement, and usage habits.
Dataset Breakdown:
Platform: The name of the social media platform where the user activity is tracked. It includes globally recognized platforms, such as Facebook, YouTube, and TikTok, that are known for their large, active user bases.
Owner: The company or entity that owns and operates the platform. Examples include Meta for Facebook, Instagram, and WhatsApp, Google for YouTube, and ByteDance for TikTok.
Primary Usage: This category identifies the primary function of each platform. Social media platforms differ in their primary usage, whether it's for social networking, messaging, multimedia sharing, professional networking, or more.
Country: The geographical region where the user is located. The dataset simulates global coverage, showcasing users from diverse locations and regions. It helps in understanding how user behavior varies across different countries.
Daily Time Spent (min): This field tracks how much time a user spends on a given platform on a daily basis, expressed in minutes. Time spent data is critical for understanding user engagement levels and the popularity of specific platforms.
Verified Account: Indicates whether the user has a verified account. This feature mimics real-world patterns where verified users (often public figures, businesses, or influencers) have enhanced status on social media platforms.
Date Joined: The date when the user registered or started using the platform. This data simulates user account history and can provide insights into user retention trends or platform growth over time.
Context and Use Cases:
Researchers, data scientists, and developers can use this dataset to:
Model User Behavior: By analyzing patterns in daily time spent, verified status, and country of origin, users can model and predict social media engagement behavior.
Test Analytics Tools: Social media monitoring and analytics platforms can use this dataset to simulate user activity and optimize their tools for engagement tracking, reporting, and visualization.
Train Machine Learning Algorithms: The dataset can be used to train models for various tasks like user segmentation, recommendation systems, or churn prediction based on engagement metrics.
Create Dashboards: This dataset can serve as the foundation for creating user-friendly dashboards that visualize user trends, platform comparisons, and engagement patterns across the globe.
Conduct Market Research: Business intelligence teams can use the data to understand how various demographics use social media, offering valuable insights into the most engaged regions, platform preferences, and usage behaviors.
Sources of Inspiration: This dataset is inspired by public data from industry reports, such as those from Statista, DataReportal, and other market research platforms. These sources provide insights into the global user base and usage statistics of popular social media platforms. The synthetic nature of this dataset allows for the use of realistic engagement metrics without violating any privacy concerns, making it an ideal tool for educational, analytical, and research purposes.
The structure and design of the dataset are based on real-world usage patterns and aim to represent a variety of users from different backgrounds, countries, and activity levels. This diversity makes it an ideal candidate for testing data-driven solutions and exploring social media trends.
Future Considerations:
As the social media landscape continues to evolve, this dataset can be updated or extended to include new platforms, engagement metrics, or user behaviors. Future iterations may incorporate features like post frequency, follower counts, engagement rates (likes, comments, shares), or even sentiment analysis from user-generated content.
By leveraging this dataset, analysts and data scientists can create better, more effective strategies ...
Facebook
Twitterhttps://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
This dataset contains 600 synthetic entries simulating social media activity across three major platforms: Twitter, Reddit, and Instagram. The data was generated to analyze trends, sentiments, and user engagement patterns based on hashtags and posts. It can be useful for researchers, data analysts, and machine learning enthusiasts interested in studying social media behavior.
Dataset Structure The dataset includes the following columns:
Date: The date of the post, ranging across a simulated timeline. Platform: The social media platform where the post was made (Twitter, Reddit, or Instagram). Hashtag: The main hashtag associated with the post, such as #AI, #MachineLearning, or #Python. Post Content: The text of the post, crafted to simulate common social media interactions. Sentiment: The sentiment of the post, classified as Positive, Neutral, or Negative. Likes: The number of likes the post received. Shares: The number of shares or retweets the post received. Potential Use Cases Sentiment analysis: Train machine learning models to detect sentiment in text. Hashtag popularity analysis: Determine which hashtags are most commonly used or generate the most engagement. Engagement trends: Explore correlations between post sentiment and engagement metrics (likes/shares). Platform comparison: Compare user behavior across different social media platforms. Acknowledgments This dataset is fully synthetic and was generated using Python. It does not contain any real user data and is intended for educational and research purposes.
Facebook
TwitterHow many people use social media?
Social media usage is one of the most popular online activities. In 2024, over five billion people were using social media worldwide, a number projected to increase to over six billion in 2028.
Who uses social media?
Social networking is one of the most popular digital activities worldwide and it is no surprise that social networking penetration across all regions is constantly increasing. As of January 2023, the global social media usage rate stood at 59 percent. This figure is anticipated to grow as lesser developed digital markets catch up with other regions
when it comes to infrastructure development and the availability of cheap mobile devices. In fact, most of social media’s global growth is driven by the increasing usage of mobile devices. Mobile-first market Eastern Asia topped the global ranking of mobile social networking penetration, followed by established digital powerhouses such as the Americas and Northern Europe.
How much time do people spend on social media?
Social media is an integral part of daily internet usage. On average, internet users spend 151 minutes per day on social media and messaging apps, an increase of 40 minutes since 2015. On average, internet users in Latin America had the highest average time spent per day on social media.
What are the most popular social media platforms?
Market leader Facebook was the first social network to surpass one billion registered accounts and currently boasts approximately 2.9 billion monthly active users, making it the most popular social network worldwide. In June 2023, the top social media apps in the Apple App Store included mobile messaging apps WhatsApp and Telegram Messenger, as well as the ever-popular app version of Facebook.
Facebook
Twitterhttps://cubig.ai/store/terms-of-servicehttps://cubig.ai/store/terms-of-service
1) Data Introduction • The Social Media Usage Dataset(Applications) features patterns and activity indicators that 1,000 users use seven major social media platforms, including Facebook, Instagram, and Twitter.
2) Data Utilization (1) Social Media Usage Dataset(Applications) has characteristics that: • This dataset provides different social media activity data for each user, including daily usage time, number of posts, number of likes received, and number of new followers. (2) Social Media Usage Dataset(Applications) can be used to: • Analysis of User Participation by Platform: You can analyze participation and popular trends by platform by comparing usage time and activity for each social media. • Establish marketing strategy: Based on user activity data, it can be used for targeted marketing, content production, and user retention strategies.
Facebook
Twitterhttps://brightdata.com/licensehttps://brightdata.com/license
Gain valuable insights with our comprehensive Social Media Dataset, designed to help businesses, marketers, and analysts track trends, monitor engagement, and optimize strategies. This dataset provides structured and reliable social media data from multiple platforms.
Dataset Features
User Profiles: Access public social media profiles, including usernames, bios, follower counts, engagement metrics, and more. Ideal for audience analysis, influencer marketing, and competitive research. Posts & Content: Extract posts, captions, hashtags, media (images/videos), timestamps, and engagement metrics such as likes, shares, and comments. Useful for trend analysis, sentiment tracking, and content strategy optimization. Comments & Interactions: Analyze user interactions, including replies, mentions, and discussions. This data helps brands understand audience sentiment and engagement patterns. Hashtag & Trend Tracking: Monitor trending hashtags, topics, and viral content across platforms to stay ahead of industry trends and consumer interests.
Customizable Subsets for Specific Needs Our Social Media Dataset is fully customizable, allowing you to filter data based on platform, region, keywords, engagement levels, or specific user profiles. Whether you need a broad dataset for market research or a focused subset for brand monitoring, we tailor the dataset to your needs.
Popular Use Cases
Brand Monitoring & Reputation Management: Track brand mentions, customer feedback, and sentiment analysis to manage online reputation effectively. Influencer Marketing & Audience Analysis: Identify key influencers, analyze engagement metrics, and optimize influencer partnerships. Competitive Intelligence: Monitor competitor activity, content performance, and audience engagement to refine marketing strategies. Market Research & Consumer Insights: Analyze social media trends, customer preferences, and emerging topics to inform business decisions. AI & Predictive Analytics: Leverage structured social media data for AI-driven trend forecasting, sentiment analysis, and automated content recommendations.
Whether you're tracking brand sentiment, analyzing audience engagement, or monitoring industry trends, our Social Media Dataset provides the structured data you need. Get started today and customize your dataset to fit your business objectives.
Facebook
Twitterhttps://www.ibisworld.com/about/termsofuse/https://www.ibisworld.com/about/termsofuse/
Over the five years through 2025-26, industry revenue is forecast to expand at a compound annual rate of 20.3% to reach £12.5 billion. Social media platforms are integral to people's lives, offering ways to communicate, create and view content and share information. According to Ofcom, approximately 89% of UK internet users in 2023 used social media apps or sites. Teenagers and young adults are the biggest users. Advertising is the primary revenue source for social media platforms, although subscription-based services are gaining momentum as platforms seek to diversify their incomes. TikTok is the success story of the past five years, becoming the most downloaded app between 2020 and 2022, according to Apptopia. The short-form video platform has over 30 million monthly users in the UK in 2025. After Musk's takeover, X, formerly known as Twitter, adjusted its content moderation and allowed previously banned accounts to return. As a result, over 600 advertisers pulled their ads from the site because of fears their brand may be associated with malcontent. In response to falling ad revenue, X has introduced a subscription-based service which enables users to verify themselves and boosts the number of people who view their tweets. Meta-owned Facebook and Instagram have responded by introducing a similar service. In 2025, more social media platforms are using AI to boost user engagement. This improves click-through rates and drives higher advertising revenue. Industry revenue is expected to grow by 6.3% in 2025-26. Over the five years through 2030-31, social media platforms' revenue is projected to climb at an estimated 9.2% to reach £19.4 billion. Regulations relating to how data is collected, stored, and shared will force advertisers and platforms to rethink how they can target their desired demographics. The tightening of regulations will raise industry compliance costs, weighing on profit margin. Older age groups present a new revenue opportunity for social media platforms if they can bridge the gap between passive TV consumption and interactive digital engagement. Augmented Reality (AR) technology will move beyond filters to become standard for immersive product trials, interactive ads, and virtual meetups
Facebook
TwitterThis dataset was created by Jigyashu Singh Lodhi
Released under Other (specified in description)
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The report provides a snapshot of the social media usage trends amongst online Canadian adults based on an online survey of 1500 participants. Canada continues to be one of the most connected countries in the world. An overwhelming majority of online Canadian adults (94%) have an account on at least one social media platform. However, the 2022 survey results show that the COVID-19 pandemic has ushered in some changes in how and where Canadians are spending their time on social media. Dominant platforms such as Facebook, messaging apps and YouTube are still on top but are losing ground to newer platforms such as TikTok and more niche platforms such as Reddit and Twitch.
Facebook
TwitterThe number of Twitter users in the United States was forecast to continuously increase between 2024 and 2028 by in total 4.3 million users (+5.32 percent). After the ninth consecutive increasing year, the Twitter user base is estimated to reach 85.08 million users and therefore a new peak in 2028. Notably, the number of Twitter users of was continuously increasing over the past years.User figures, shown here regarding the platform twitter, have been estimated by taking into account company filings or press material, secondary research, app downloads and traffic data. They refer to the average monthly active users over the period.The shown data are an excerpt of Statista's Key Market Indicators (KMI). The KMI are a collection of primary and secondary indicators on the macro-economic, demographic and technological environment in up to 150 countries and regions worldwide. All indicators are sourced from international and national statistical offices, trade associations and the trade press and they are processed to generate comparable data sets (see supplementary notes under details for more information).Find more key insights for the number of Twitter users in countries like Canada and Mexico.
Facebook
TwitterCC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset provides detailed daily records of social media follower growth for multiple brands across major platforms, including engagement rates, campaign activity, ad spend, and top-performing content. It empowers marketing teams to analyze the impact of campaigns, content strategies, and paid promotions on follower growth, enabling data-driven optimization and benchmarking.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Please cite this paper when using this dataset: N. Thakur, “Mpox narrative on Instagram: A labeled multilingual dataset of Instagram posts on mpox for sentiment, hate speech, and anxiety analysis,” arXiv [cs.LG], 2024, URL: https://arxiv.org/abs/2409.05292Abstract: The world is currently experiencing an outbreak of mpox, which has been declared a Public Health Emergency of International Concern by WHO. During recent virus outbreaks, social media platforms have played a crucial role in keeping the global population informed and updated regarding various aspects of the outbreaks. As a result, in the last few years, researchers from different disciplines have focused on the development of social media datasets focusing on different virus outbreaks. No prior work in this field has focused on the development of a dataset of Instagram posts about the mpox outbreak. The work presented in this paper (stated above) aims to address this research gap. It presents this multilingual dataset of 60,127 Instagram posts about mpox, published between July 23, 2022, and September 5, 2024. This dataset contains Instagram posts about mpox in 52 languages.For each of these posts, the Post ID, Post Description, Date of publication, language, and translated version of the post (translation to English was performed using the Google Translate API) are presented as separate attributes in the dataset. After developing this dataset, sentiment analysis, hate speech detection, and anxiety or stress detection were also performed. This process included classifying each post intoone of the fine-grain sentiment classes, i.e., fear, surprise, joy, sadness, anger, disgust, or neutralhate or not hateanxiety/stress detected or no anxiety/stress detected.These results are presented as separate attributes in the dataset for the training and testing of machine learning algorithms for sentiment, hate speech, and anxiety or stress detection, as well as for other applications.The 52 distinct languages in which Instagram posts are present in the dataset are English, Portuguese, Indonesian, Spanish, Korean, French, Hindi, Finnish, Turkish, Italian, German, Tamil, Urdu, Thai, Arabic, Persian, Tagalog, Dutch, Catalan, Bengali, Marathi, Malayalam, Swahili, Afrikaans, Panjabi, Gujarati, Somali, Lithuanian, Norwegian, Estonian, Swedish, Telugu, Russian, Danish, Slovak, Japanese, Kannada, Polish, Vietnamese, Hebrew, Romanian, Nepali, Czech, Modern Greek, Albanian, Croatian, Slovenian, Bulgarian, Ukrainian, Welsh, Hungarian, and Latvian.The following is a description of the attributes present in this dataset:Post ID: Unique ID of each Instagram postPost Description: Complete description of each post in the language in which it was originally publishedDate: Date of publication in MM/DD/YYYY formatLanguage: Language of the post as detected using the Google Translate APITranslated Post Description: Translated version of the post description. All posts which were not in English were translated into English using the Google Translate API. No language translation was performed for English posts.Sentiment: Results of sentiment analysis (using the preprocessed version of the translated Post Description) where each post was classified into one of the sentiment classes: fear, surprise, joy, sadness, anger, disgust, and neutralHate: Results of hate speech detection (using the preprocessed version of the translated Post Description) where each post was classified as hate or not hateAnxiety or Stress: Results of anxiety or stress detection (using the preprocessed version of the translated Post Description) where each post was classified as stress/anxiety detected or no stress/anxiety detected.All the Instagram posts that were collected during this data mining process to develop this dataset were publicly available on Instagram and did not require a user to log in to Instagram to view the same (at the time of writing this paper).
Facebook
Twitterhttps://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
This dataset captures the pulse of viral social media trends across TikTok, Instagram, Twitter, and YouTube. It provides insights into the most popular hashtags, content types, and user engagement levels, offering a comprehensive view of how trends unfold across platforms. With regional data and influencer-driven content, this dataset is perfect for:
Dive in to explore what makes content go viral, the behaviors that drive engagement, and how trends evolve on a global scale! 🌍
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These datasets consist of qualitative data collected through semi-structured in-depth interviews as well as a focus group from three different companies with seven industry experts.The data collected was to address the use of social media to enhance organisational learning and also to address the gap that exists in terms of the integration of organisational learning (OL) and social media and also address the lack of guidelines for organisations that would like to implement the use of social media to facilitate OL. The data were triangulated by comparing the results from the three companies.
Facebook
TwitterTiktok network graph with 5,638 nodes and 318,986 unique links, representing up to 790,599 weighted links between labels, using Gephi network analysis software. Source of: Peña-Fernández, Simón, Larrondo-Ureta, Ainara, & Morales-i-Gras, Jordi. (2022). Current affairs on TikTok. Virality and entertainment for digital natives. Profesional De La Información, 31(1), 1–12. https://doi.org/10.5281/zenodo.5962655 Abstract: Since its appearance in 2018, TikTok has become one of the most popular social media platforms among digital natives because of its algorithm-based engagement strategies, a policy of public accounts, and a simple, colorful, and intuitive content interface. As happened in the past with other platforms such as Facebook, Twitter, and Instagram, various media are currently seeking ways to adapt to TikTok and its particular characteristics to attract a younger audience less accustomed to the consumption of journalistic material. Against this background, the aim of this study is to identify the presence of the media and journalists on TikTok, measure the virality and engagement of the content they generate, describe the communities created around them, and identify the presence of journalistic use of these accounts. For this, 23,174 videos from 143 accounts belonging to media from 25 countries were analyzed. The results indicate that, in general, the presence and impact of the media in this social network are low and that most of their content is oriented towards the creation of user communities based on viral content and entertainment. However, albeit with a lesser presence, one can also identify accounts and messages that adapt their content to the specific characteristics of TikTok. Their virality and engagement figures illustrate that there is indeed a niche for current affairs on this social network.
Facebook
Twitterhttps://www.technavio.com/content/privacy-noticehttps://www.technavio.com/content/privacy-notice
Social Media Analytics Market Size 2025-2029
The social media analytics market size is forecast to increase by USD 21.2 billion, at a CAGR of 35.2% between 2024 and 2029.
The market is experiencing significant growth, driven by the expanding availability and complexity of social media data. Businesses increasingly recognize the value of social media insights to inform marketing strategies, enhance customer engagement, and gauge brand reputation. In response, social media platforms continue to roll out advanced targeting options, enabling more precise audience segmentation and personalized messaging. However, the surging use of social media data also presents challenges. Interpreting unstructured data from various sources remains a formidable task, requiring sophisticated analytics tools and expertise.
Companies must navigate these complexities to effectively harness the power of social media analytics and stay competitive in today's digital landscape. To succeed, organizations need to invest in advanced analytics solutions, cultivate data literacy skills, and establish clear data governance policies. By addressing these challenges, businesses can unlock valuable insights from social media data and capitalize on emerging opportunities in this dynamic market.
What will be the Size of the Social Media Analytics Market during the forecast period?
Explore in-depth regional segment analysis with market size data - historical 2019-2023 and forecasts 2025-2029 - in the full report.
Request Free Sample
The market continues to evolve, offering valuable insights for businesses across various sectors. Hashtag tracking and sentiment classification help organizations understand public perception and engagement with their brand. Engagement metrics, share of voice, and trend analysis algorithms provide valuable data for brand reputation management and customer journey mapping. Social media ROI, influencer marketing metrics, and sentiment scoring offer insights into the effectiveness of advertising campaigns. User behavior patterns, predictive modeling, and anomaly detection enable businesses to anticipate trends and respond to crises in real-time. Social media listening, lead generation attribution, influencer identification, and customer satisfaction scores provide actionable insights for community management and crisis communication management.
Data visualization dashboards and social listening tools facilitate effective audience segmentation and conversational AI. Reach forecasting, content performance, keyword analysis, and campaign effectiveness metrics offer valuable insights for optimizing social media strategies. Platform-specific insights enable businesses to tailor their approach to each social media channel. According to recent market research, the market is expected to grow by over 15% annually, reflecting the increasing importance of social media data for businesses. For instance, a retail company used social media listening tools to monitor customer conversations and identified a trend in customer complaints about product packaging. The company responded by redesigning the packaging, resulting in a 12% increase in sales.
This example highlights the potential impact of social media analytics on business performance.
How is this Social Media Analytics Industry segmented?
The social media analytics industry research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD million' for the period 2025-2029, as well as historical data from 2019-2023 for the following segments.
End-user
Retail
Government
Media and entertainment
Travel
Others
Application
Sales and marketing management
Customer experience management
Competitive intelligence
Risk management
Public safety and law enforcement
Deployment
On-premises
Cloud
Type
Predictive analytics
Prescriptive analytics
Descriptive analytics
Diagnostics analytics
Geography
North America
US
Canada
Europe
France
Germany
Italy
UK
APAC
China
India
Japan
South Korea
Rest of World (ROW)
By End-user Insights
The retail segment is estimated to witness significant growth during the forecast period.
Social media analytics plays a pivotal role in retail marketing, enabling businesses to track and analyze customer engagement, sentiment, and trends in real-time. Tools such as hashtag tracking, sentiment classification, and engagement metrics help retailers understand their audience's preferences and behavior patterns. Share of voice and trend analysis algorithms provide insights into market dynamics and brand reputation management. Customer journey mapping and social media ROI measurement allow businesses to optimize their marketing strategies and improve sales. Influencer marketing metrics, sentiment scoring, and advertising campai
Facebook
TwitterThe Reddit Subreddit Dataset by Dataplex offers a comprehensive and detailed view of Reddit’s vast ecosystem, now enhanced with appended AI-generated columns that provide additional insights and categorization. This dataset includes data from over 2.1 million subreddits, making it an invaluable resource for a wide range of analytical applications, from social media analysis to market research.
Dataset Overview:
This dataset includes detailed information on subreddit activities, user interactions, post frequency, comment data, and more. The inclusion of AI-generated columns adds an extra layer of analysis, offering sentiment analysis, topic categorization, and predictive insights that help users better understand the dynamics of each subreddit.
2.1 Million Subreddits with Enhanced AI Insights: The dataset covers over 2.1 million subreddits and now includes AI-enhanced columns that provide: - Sentiment Analysis: AI-driven sentiment scores for posts and comments, allowing users to gauge community mood and reactions. - Topic Categorization: Automated categorization of subreddit content into relevant topics, making it easier to filter and analyze specific types of discussions. - Predictive Insights: AI models that predict trends, content virality, and user engagement, helping users anticipate future developments within subreddits.
Sourced Directly from Reddit:
All social media data in this dataset is sourced directly from Reddit, ensuring accuracy and authenticity. The dataset is updated regularly, reflecting the latest trends and user interactions on the platform. This ensures that users have access to the most current and relevant data for their analyses.
Key Features:
Use Cases:
Data Quality and Reliability:
The Reddit Subreddit Dataset emphasizes data quality and reliability. Each record is carefully compiled from Reddit’s vast database, ensuring that the information is both accurate and up-to-date. The AI-generated columns further enhance the dataset's value, providing automated insights that help users quickly identify key trends and sentiments.
Integration and Usability:
The dataset is provided in a format that is compatible with most data analysis tools and platforms, making it easy to integrate into existing workflows. Users can quickly import, analyze, and utilize the data for various applications, from market research to academic studies.
User-Friendly Structure and Metadata:
The data is organized for easy navigation and analysis, with metadata files included to help users identify relevant subreddits and data points. The AI-enhanced columns are clearly labeled and structured, allowing users to efficiently incorporate these insights into their analyses.
Ideal For:
This dataset is an essential resource for anyone looking to understand the intricacies of Reddit's vast ecosystem, offering the data and AI-enhanced insights needed to drive informed decisions and strategies across various fields. Whether you’re tracking emerging trends, analyzing user behavior, or conduc...
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analyzing the dynamics of information diffusion cascades and accurately predicting their behavior holds significant importance in various applications. In this paper, we concentrate specifically on a recently introduced contrastive cascade graph learning framework, for the task of predicting cascade popularity. This framework follows a pre-training and fine-tuning paradigm to address cascade prediction tasks. In a previous study, the transferability of pre-trained models within the contrastive cascade graph learning framework was examined solely between two social media datasets. However, in our present study, we comprehensively evaluate the transferability of pre-trained models across 13 real datasets and six synthetic datasets. We construct several pre-trained models using real cascades and synthetic cascades generated by the independent cascade model and the Profile model. Then, we fine-tune these pre-trained models on real cascade datasets and evaluate their prediction accuracy based on the mean squared logarithmic error. The main findings derived from our results are as follows. (1) The pre-trained models exhibit transferability across diverse types of real datasets in different domains, encompassing different languages, social media platforms, and diffusion time scales. (2) Synthetic cascade data prove effective for pre-training purposes. The pre-trained models constructed with synthetic cascade data demonstrate comparable effectiveness to those constructed using real data. (3) Synthetic cascade data prove beneficial for fine-tuning the contrastive cascade graph learning models and training other state-of-the-art popularity prediction models. Models trained using a combination of real and synthetic cascades yield significantly lower mean squared logarithmic error compared to those trained solely on real cascades. Our findings affirm the effectiveness of synthetic cascade data in enhancing the accuracy of cascade popularity prediction.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Please cite the following paper when using this dataset:
N. Thakur, “Five Years of COVID-19 Discourse on Instagram: A Labeled Instagram Dataset of Over Half a Million Posts for Multilingual Sentiment Analysis”, Proceedings of the 7th International Conference on Machine Learning and Natural Language Processing (MLNLP 2024), Chengdu, China, October 18-20, 2024 (Paper accepted for publication, Preprint available at: https://arxiv.org/abs/2410.03293)
Abstract
The outbreak of COVID-19 served as a catalyst for content creation and dissemination on social media platforms, as such platforms serve as virtual communities where people can connect and communicate with one another seamlessly. While there have been several works related to the mining and analysis of COVID-19-related posts on social media platforms such as Twitter (or X), YouTube, Facebook, and TikTok, there is still limited research that focuses on the public discourse on Instagram in this context. Furthermore, the prior works in this field have only focused on the development and analysis of datasets of Instagram posts published during the first few months of the outbreak. The work presented in this paper aims to address this research gap and presents a novel multilingual dataset of 500,153 Instagram posts about COVID-19 published between January 2020 and September 2024. This dataset contains Instagram posts in 161 different languages. After the development of this dataset, multilingual sentiment analysis was performed using VADER and twitter-xlm-roberta-base-sentiment. This process involved classifying each post as positive, negative, or neutral. The results of sentiment analysis are presented as a separate attribute in this dataset.
For each of these posts, the Post ID, Post Description, Date of publication, language code, full version of the language, and sentiment label are presented as separate attributes in the dataset.
The Instagram posts in this dataset are present in 161 different languages out of which the top 10 languages in terms of frequency are English (343041 posts), Spanish (30220 posts), Hindi (15832 posts), Portuguese (15779 posts), Indonesian (11491 posts), Tamil (9592 posts), Arabic (9416 posts), German (7822 posts), Italian (5162 posts), Turkish (4632 posts)
There are 535,021 distinct hashtags in this dataset with the top 10 hashtags in terms of frequency being #covid19 (169865 posts), #covid (132485 posts), #coronavirus (117518 posts), #covid_19 (104069 posts), #covidtesting (95095 posts), #coronavirusupdates (75439 posts), #corona (39416 posts), #healthcare (38975 posts), #staysafe (36740 posts), #coronavirusoutbreak (34567 posts)
The following is a description of the attributes present in this dataset
Post ID: Unique ID of each Instagram post
Post Description: Complete description of each post in the language in which it was originally published
Date: Date of publication in MM/DD/YYYY format
Language code: Language code (for example: “en”) that represents the language of the post as detected using the Google Translate API
Full Language: Full form of the language (for example: “English”) that represents the language of the post as detected using the Google Translate API
Sentiment: Results of sentiment analysis (using the preprocessed version of each post) where each post was classified as positive, negative, or neutral
Open Research Questions
This dataset is expected to be helpful for the investigation of the following research questions and even beyond:
How does sentiment toward COVID-19 vary across different languages?
How has public sentiment toward COVID-19 evolved from 2020 to the present?
How do cultural differences affect social media discourse about COVID-19 across various languages?
How has COVID-19 impacted mental health, as reflected in social media posts across different languages?
How effective were public health campaigns in shifting public sentiment in different languages?
What patterns of vaccine hesitancy or support are present in different languages?
How did geopolitical events influence public sentiment about COVID-19 in multilingual social media discourse?
What role does social media discourse play in shaping public behavior toward COVID-19 in different linguistic communities?
How does the sentiment of minority or underrepresented languages compare to that of major world languages regarding COVID-19?
What insights can be gained by comparing the sentiment of COVID-19 posts in widely spoken languages (e.g., English, Spanish) to those in less common languages?
All the Instagram posts that were collected during this data mining process to develop this dataset were publicly available on Instagram and did not require a user to log in to Instagram to view the same (at the time of writing this paper).
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset covers aspects of online politics in 25 democracies: 15 relatively old established European democracies (Austria, Belgium, Denmark, Finland, France, Germany, Iceland, Ireland, Italy, Luxembourg, Netherlands, Norway, Sweden, Switzerland, United Kingdom); five non-European veteran democracies (Australia, Canada, Israel, Japan, New Zealand); two early (Portugal, Spain) and three late (Czech Republic, Hungary, Poland) third-wave (young) European democracies. The research population includes, in each country, parties that won 4% or more of the votes in two consecutive elections before April 2019 (a total of 141 parties and 145 leaders). The dataset includes external party level information such as performance in the last national elections, governmental status, party age, populism affiliation and leadership selection method. It also includes information related to the party leaders such as their term in leadership office and other formal positions. In addition it includes information about online activity mainly on the consumption (user related activities) of the parties and their leaders in Facebook and Twitter two of the most used social media platforms for political purposes.
Facebook
Twitterhttps://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
This dataset provides detailed rankings and key metrics for 100+ social media platforms and sites in 2025. It includes information such as user base, popularity trends, and global reach. Ideal for analyzing social media growth, user engagement, and market trends. Whether you're a data scientist, marketer, or researcher, this dataset offers valuable insights into the evolving digital landscape.