The global number of Facebook users was forecast to increase continuously between 2023 and 2027 by a total of 391 million users (+14.36 percent). After the fourth consecutive year of growth, the Facebook user base is estimated to reach 3.1 billion users, a new peak, in 2027. Notably, the number of Facebook users has also increased continuously over the past years. User figures, shown here for the platform Facebook, have been estimated by taking into account company filings or press material, secondary research, app downloads and traffic data. They refer to the average monthly active users over the period and count multiple accounts held by one person only once. The data shown are an excerpt of Statista's Key Market Indicators (KMI). The KMI are a collection of primary and secondary indicators on the macro-economic, demographic and technological environment in up to 150 countries and regions worldwide. All indicators are sourced from international and national statistical offices, trade associations and the trade press, and they are processed to generate comparable data sets (see supplementary notes under details for more information).
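As a quick consistency check on the forecast figures above, the implied 2023 base can be recovered from the absolute and relative increase. This is only an arithmetic sketch using the two quoted numbers; the variable names are illustrative and the 2023 base is derived here, not stated in the source.

```python
# Figures quoted in the forecast above.
increase_millions = 391   # absolute growth between 2023 and 2027
increase_pct = 14.36      # relative growth over the same period

# increase = base * pct / 100, so the implied 2023 base is:
base_2023_millions = increase_millions / (increase_pct / 100)

# Adding the increase gives the implied 2027 figure.
peak_2027_millions = base_2023_millions + increase_millions

print(f"Implied 2023 base: {base_2023_millions:.0f} million")
print(f"Implied 2027 peak: {peak_2027_millions / 1000:.1f} billion")
```

The implied 2027 value rounds to the 3.1 billion quoted in the text.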
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset contains 20,000 synthetic social media posts crafted to mimic realistic user activity on a fictional platform. It simulates various user demographics, post content, hashtags, topics, and detailed engagement metrics such as likes, comments, and shares.
Each record represents a unique social media post made by a user, enriched with features that allow for analysis of trends, behavior, and engagement. The dataset includes:
Column | Description |
---|---|
post_id | Unique identifier for each post |
user_id | Unique identifier for each user |
user_name | Synthetic username |
user_gender | Gender of the user (Male, Female, Other) |
user_age | Age of the user (16–60) |
followers_count | Number of followers the user has |
following_count | Number of accounts the user follows |
account_creation_date | Account registration date |
is_verified | Boolean flag for verified users |
location | City or region where the user is located |
topic | Main topic of the post (e.g., Travel, Food, Fashion, etc.) |
post_content | Actual content of the post |
content_length | Number of characters in the post content |
hashtags | Relevant hashtags used in the post |
has_media | Whether the post includes image or video |
post_date | Timestamp of when the post was made |
device | Device used to make the post (e.g., iPhone, Android) |
language | Language of the post |
likes | Number of likes received |
comments | Number of comments received |
shares | Number of times the post was shared |
engagement_rate | Normalized metric: (likes + comments + shares) / followers_count |
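The `engagement_rate` column can be recomputed from the other columns in the table. The sketch below uses a few made-up rows and assumes that a post from a user with zero followers should get no rate at all; how the dataset itself handles that case is not stated.

```python
def engagement_rate(likes, comments, shares, followers_count):
    """Return (likes + comments + shares) / followers_count.

    Returning None for users without followers is an assumption made
    for this sketch; the dataset description does not cover that case.
    """
    if followers_count <= 0:
        return None
    return (likes + comments + shares) / followers_count


# Illustrative rows only; these values are not taken from the dataset.
posts = [
    {"post_id": 1, "likes": 120, "comments": 30, "shares": 50, "followers_count": 1000},
    {"post_id": 2, "likes": 5, "comments": 0, "shares": 1, "followers_count": 0},
]

for post in posts:
    rate = engagement_rate(
        post["likes"], post["comments"], post["shares"], post["followers_count"]
    )
    print(post["post_id"], rate)
```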
How many people use social media?
Social media usage is one of the most popular online activities. In 2024, over five billion people were using social media worldwide, a number projected to increase to over six billion in 2028.
Who uses social media?
Social networking is one of the most popular digital activities worldwide and it is no surprise that social networking penetration across all regions is constantly increasing. As of January 2023, the global social media usage rate stood at 59 percent. This figure is anticipated to grow as less developed digital markets catch up with other regions when it comes to infrastructure development and the availability of cheap mobile devices. In fact, most of social media’s global growth is driven by the increasing usage of mobile devices. Mobile-first market Eastern Asia topped the global ranking of mobile social networking penetration, followed by established digital powerhouses such as the Americas and Northern Europe.
How much time do people spend on social media?
Social media is an integral part of daily internet usage. On average, internet users spend 151 minutes per day on social media and messaging apps, an increase of 40 minutes since 2015. Internet users in Latin America spent the most time per day on social media.
What are the most popular social media platforms?
Market leader Facebook was the first social network to surpass one billion registered accounts and currently boasts approximately 2.9 billion monthly active users, making it the most popular social network worldwide. In June 2023, the top social media apps in the Apple App Store included mobile messaging apps WhatsApp and Telegram Messenger, as well as the ever-popular app version of Facebook.
https://brightdata.com/license
Gain valuable insights with our comprehensive Social Media Dataset, designed to help businesses, marketers, and analysts track trends, monitor engagement, and optimize strategies. This dataset provides structured and reliable social media data from multiple platforms.
Dataset Features
- User Profiles: Access public social media profiles, including usernames, bios, follower counts, engagement metrics, and more. Ideal for audience analysis, influencer marketing, and competitive research.
- Posts & Content: Extract posts, captions, hashtags, media (images/videos), timestamps, and engagement metrics such as likes, shares, and comments. Useful for trend analysis, sentiment tracking, and content strategy optimization.
- Comments & Interactions: Analyze user interactions, including replies, mentions, and discussions. This data helps brands understand audience sentiment and engagement patterns.
- Hashtag & Trend Tracking: Monitor trending hashtags, topics, and viral content across platforms to stay ahead of industry trends and consumer interests.
Customizable Subsets for Specific Needs
Our Social Media Dataset is fully customizable, allowing you to filter data based on platform, region, keywords, engagement levels, or specific user profiles. Whether you need a broad dataset for market research or a focused subset for brand monitoring, we tailor the dataset to your needs.
Popular Use Cases
- Brand Monitoring & Reputation Management: Track brand mentions, customer feedback, and sentiment analysis to manage online reputation effectively.
- Influencer Marketing & Audience Analysis: Identify key influencers, analyze engagement metrics, and optimize influencer partnerships.
- Competitive Intelligence: Monitor competitor activity, content performance, and audience engagement to refine marketing strategies.
- Market Research & Consumer Insights: Analyze social media trends, customer preferences, and emerging topics to inform business decisions.
- AI & Predictive Analytics: Leverage structured social media data for AI-driven trend forecasting, sentiment analysis, and automated content recommendations.
Whether you're tracking brand sentiment, analyzing audience engagement, or monitoring industry trends, our Social Media Dataset provides the structured data you need. Get started today and customize your dataset to fit your business objectives.
Which country has the most Facebook users?
There are more than 378 million Facebook users in India alone, making it the leading country in terms of Facebook audience size. To put this into context, if India’s Facebook audience were a country then it would be ranked third in terms of largest population worldwide. Apart from India, there are several other markets with more than 100 million Facebook users each: The United States, Indonesia, and Brazil with 193.8 million, 119.05 million, and 112.55 million Facebook users respectively.
Facebook – the most used social media
Meta, the company that was previously called Facebook, owns four of the most popular social media platforms worldwide: WhatsApp, Facebook Messenger, Facebook, and Instagram. As of the third quarter of 2021, there were around 3.5 billion cumulative monthly users of the company’s products worldwide. With around 2.9 billion monthly active users, Facebook is the most popular social media platform worldwide. With an audience of this scale, it is no surprise that the vast majority of Facebook’s revenue is generated through advertising.
Facebook usage by device
As of July 2021, it was found that 98.5 percent of active users accessed their Facebook account from mobile devices. In fact, 81.8 percent of Facebook audiences worldwide access the platform only via mobile phone. Facebook is not only available through mobile browsers, as the company has published several mobile apps for users to access its products and services. As of the third quarter of 2021, the four core Meta products were leading the ranking of most downloaded mobile apps worldwide, with WhatsApp amassing approximately six billion downloads.
By CrowdFlower
Welcome to the disaster tweets dataset! This collection of tweets holds a wealth of information about global disasters and their effects on people, governments, and organizations all over the world. With over 10,000 tweets collected and carefully annotated with labels of whether they reported an actual disaster or not, this dataset provides unique insight into what these events look like in terms of social media conversations.
This information is derived from a variety of key terms related to disaster events, such as “ablaze” and “pandemonium”, which were used to gather each individual tweet for analysis. The columns for each tweet include detailed metadata about the user who posted it along with variables such as keyword relevance and location. Alongside all these attributes is the core text of each individual tweet, giving you access to all sorts of stories from natural disasters, contagious disease outbreaks or conflicts between nations in one place!
So whatever you're looking for - whether it's observations about first-hand accounts or conducting research on public sentiment during a major event - this dataset offers you an invaluable source full of timely information that could potentially save lives down the line. So take your journey through this data now and embark upon discovering what devastation looks like through social media!
This dataset contains tweets related to disaster events, including the keyword, location, text, tweetid and userid. It provides insights into how people interact with each other on social media during a disaster. Using this dataset you can gain valuable insight into the dynamics of online communication in disasters and provide an important point of reference for future disaster management initiatives.
- Analyzing the effectiveness of disaster relief and humanitarian aid efforts, by mapping tweets against public data of areas affected by disasters and donations made to help those affected.
- Developing advanced statistical models to predict the magnitude and impact of an oncoming natural disaster using keyword analysis in social media posts related to past disasters.
- Creating text-based classifiers to accurately detect disaster-related tweets in real time, giving emergency service providers early warning signs before a potential event occurs.
If you use this dataset in your research, please credit the original authors.
Unknown License - Please check the dataset description for more information.
File: socialmedia-disaster-tweets-DFE.csv
Column name | Description |
---|---|
_golden | A boolean value indicating whether the tweet is a golden tweet or not. (Boolean) |
_unit_state | The state of the tweet (e.g. finalized, judged, etc.). (String) |
_trusted_judgments | The number of trusted judgments for the tweet. (Integer) |
_last_judgment_at | The date and time of the last judgment for the tweet. (DateTime) |
choose_one | The label assigned to the tweet (e.g. relevant, not relevant, etc.). (String) |
choose_one_gold | The gold label assigned to the tweet (e.g. relevant, not relevant, etc.). (String) |
keyword | The keyword associated with the tweet. (String) |
location | The location associated with the tweet. (String) |
text | The text content of the tweet. (String) |
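A first step with a file of this shape is usually to check how the `choose_one` labels are distributed. The sketch below reads a tiny in-memory stand-in for the CSV; the three rows and the exact label spellings ("Relevant", "Not Relevant") are placeholders for illustration, so check the real file for the values it actually uses.

```python
import csv
import io
from collections import Counter

# Tiny stand-in for socialmedia-disaster-tweets-DFE.csv (illustrative rows only).
sample_csv = io.StringIO(
    "choose_one,keyword,location,text\n"
    "Relevant,ablaze,,Building on fire downtown\n"
    "Not Relevant,ablaze,London,This mixtape is ablaze\n"
    "Relevant,pandemonium,,Pandemonium after the quake\n"
)

# Count how many tweets carry each label.
label_counts = Counter(row["choose_one"] for row in csv.DictReader(sample_csv))

for label, count in label_counts.items():
    print(label, count)
```

To run this against the real file, replace `sample_csv` with an opened file handle; an explicit encoding may be needed depending on how the file was exported.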
If you use this dataset in your research, please credit CrowdFlower.
https://creativecommons.org/publicdomain/zero/1.0/
Youtube social network and ground-truth communities
Dataset information
Youtube is a video-sharing web site that includes a social network. In the Youtube social network, users form friendships with each other and can create groups which other users can join. We consider such user-defined groups as ground-truth communities. This data is provided by Alan Mislove et al.
We regard each connected component in a group as a separate ground-truth community. We remove the ground-truth communities which have fewer than 3 nodes. We also provide the top 5,000 communities with the highest quality, which are described in our paper. As for the network, we provide the largest connected component.
more info : https://snap.stanford.edu/data/com-Youtube.html
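The community construction described above (each connected component of a group becomes its own community, and components with fewer than 3 nodes are dropped) can be sketched as follows. This is an illustrative reimplementation on a toy graph, not the code used to build the dataset; the function name and the edge-list input format are assumptions.

```python
from collections import defaultdict


def ground_truth_communities(edges, group_members, min_size=3):
    """Split one user-defined group into its connected components.

    Only edges whose endpoints are both in the group are considered,
    and components smaller than min_size are discarded.
    """
    members = set(group_members)
    adjacency = defaultdict(set)
    for u, v in edges:
        if u in members and v in members:
            adjacency[u].add(v)
            adjacency[v].add(u)

    seen = set()
    communities = []
    for start in sorted(members):
        if start in seen:
            continue
        # Depth-first traversal collects one connected component.
        component = set()
        stack = [start]
        seen.add(start)
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        if len(component) >= min_size:
            communities.append(component)
    return communities


# Toy friendship graph and one group; node 6 is not a group member.
edges = [(1, 2), (2, 3), (4, 5), (3, 6)]
group = [1, 2, 3, 4, 5, 7]
communities = ground_truth_communities(edges, group)
print(communities)
```

In the toy example the group splits into components {1, 2, 3}, {4, 5} and {7}, and only the first is large enough to be kept.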
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is structured as a graph, where nodes represent users and edges capture their interactions, including tweets, retweets, replies, and mentions. Each node provides detailed user attributes, such as unique ID, follower and following counts, and verification status, offering insights into each user's identity, role, and influence in the mental health discourse. The edges illustrate user interactions, highlighting engagement patterns and types of content that drive responses, such as tweet impressions. This interconnected structure enables sentiment analysis and public reaction studies, allowing researchers to explore engagement trends and identify the mental health topics that resonate most with users.
The dataset consists of three files:
1. Edges Data: Contains graph data essential for social network analysis, including fields for UserID (Source), UserID (Destination), Post/Tweet ID, and Date of Relationship. This file enables analysis of user connections without including tweet content, maintaining compliance with Twitter/X’s data-sharing policies.
2. Nodes Data: Offers user-specific details relevant to network analysis, including UserID, Account Creation Date, Follower and Following counts, Verified Status, and Date Joined Twitter. This file allows researchers to examine user behavior (e.g., identifying influential users or spam-like accounts) without direct reference to tweet content.
3. Twitter/X Content Data: This file contains only the raw tweet text as a single-column dataset, without associated user identifiers or metadata. By isolating the text, we ensure alignment with anonymization standards observed in similar published datasets, safeguarding user privacy in compliance with Twitter/X's data guidelines. This content is crucial for addressing the research focus on mental health discourse in social media. (References to prior Data in Brief publications involving Twitter/X data informed the dataset's structure.)
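As an illustration of what the Edges Data file supports, the sketch below counts incoming and outgoing interactions per user. The column headers are taken from the field names listed above and may not match the actual file headers; the three rows are invented for the example.

```python
import csv
import io
from collections import Counter

# Stand-in for the Edges Data file; header names are assumed from the
# field list in the description, and the rows are illustrative only.
edges_csv = io.StringIO(
    "UserID (Source),UserID (Destination),Post/Tweet ID,Date of Relationship\n"
    "u1,u2,t1,2023-01-01\n"
    "u3,u2,t2,2023-01-02\n"
    "u2,u1,t3,2023-01-03\n"
)

out_degree = Counter()  # interactions initiated by each user
in_degree = Counter()   # interactions received by each user

for row in csv.DictReader(edges_csv):
    out_degree[row["UserID (Source)"]] += 1
    in_degree[row["UserID (Destination)"]] += 1

# Users who receive the most interactions are candidates for "influential" accounts.
print(in_degree.most_common(1))
```

Joining these counts with the follower and verification fields in the Nodes Data file would be a natural next step.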
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
- This dataset was gathered by crawling Twitter's REST API using the Python library tweepy 3. It contains the tweets of the 20 most popular Twitter users (those with the most followers); retweets are excluded. These accounts belong to public figures, such as Katy Perry and Barack Obama, platforms, such as YouTube and Instagram, and television channels and shows, e.g., CNN Breaking News and The Ellen Show.
- Consequently, the dataset contains a mix of relatively structured tweets, tweets written in a formal and informative manner, and completely unstructured tweets written in a colloquial style. Unfortunately, the geocoordinates were not available for those tweets.
- This dataset has been used to produce a research paper titled "Machine Learning Techniques for Anomalies Detection in Post Arrays".
- Crawled attributes are: Author (Twitter User), Content (Tweet), Date_Time, id (Twitter User ID), language (Tweet Language), Number_of_Likes, Number_of_Shares.

Overall: 52543 tweets of the top 20 users on Twitter.

Screen_Name | #Tweets | Time span (in days) |
---|---|---|
TheEllenShow | 3,147 | 662 |
jimmyfallon | 3,123 | 1,231 |
ArianaGrande | 3,104 | 613 |
YouTube | 3,077 | 411 |
KimKardashian | 2,939 | 603 |
katyperry | 2,924 | 1,598 |
selenagomez | 2,913 | 2,266 |
rihanna | 2,877 | 1,557 |
BarackObama | 2,863 | 849 |
britneyspears | 2,776 | 1,548 |
instagram | 2,577 | 456 |
shakira | 2,530 | 1,850 |
Cristiano | 2,507 | 2,407 |
jtimberlake | 2,478 | 2,491 |
ladygaga | 2,329 | 894 |
Twitter | 2,290 | 2,593 |
ddlovato | 2,217 | 741 |
taylorswift13 | 2,029 | 2,091 |
justinbieber | 2,000 | 664 |
cnnbrk | 1,842 | 183 |
Cristiano Ronaldo has one of the most popular Instagram accounts as of April 2024.
The Portuguese footballer is the most-followed person on the photo-sharing platform with 628 million followers. Instagram's own account was ranked first with roughly 672 million followers.
How popular is Instagram?
Instagram is a photo-sharing social networking service that enables users to take pictures and edit them with filters. The platform allows users to post and share their images online and directly with their friends and followers on the social network. The cross-platform app reached one billion monthly active users in mid-2018. In 2020, there were over 114 million Instagram users in the United States and experts project this figure to surpass 127 million users in 2023.
Who uses Instagram?
Instagram audiences are predominantly young – recent data states that almost 60 percent of U.S. Instagram users are aged 34 years or younger. Fall 2020 data reveals that Instagram is also one of the most popular social media for teens and one of the social networks with the biggest reach among teens in the United States.
Celebrity influencers on Instagram
Many celebrities and athletes are brand spokespeople and generate additional income with social media advertising and sponsored content. Unsurprisingly, Ronaldo ranked first again, as the average media value of one of his Instagram posts was 985,441 U.S. dollars.
The Reddit Subreddit Dataset by Dataplex offers a comprehensive and detailed view of Reddit’s vast ecosystem, now enhanced with appended AI-generated columns that provide additional insights and categorization. This dataset includes data from over 2.1 million subreddits, making it an invaluable resource for a wide range of analytical applications, from social media analysis to market research.
Dataset Overview:
This dataset includes detailed information on subreddit activities, user interactions, post frequency, comment data, and more. The inclusion of AI-generated columns adds an extra layer of analysis, offering sentiment analysis, topic categorization, and predictive insights that help users better understand the dynamics of each subreddit.
2.1 Million Subreddits with Enhanced AI Insights: The dataset covers over 2.1 million subreddits and now includes AI-enhanced columns that provide:
- Sentiment Analysis: AI-driven sentiment scores for posts and comments, allowing users to gauge community mood and reactions.
- Topic Categorization: Automated categorization of subreddit content into relevant topics, making it easier to filter and analyze specific types of discussions.
- Predictive Insights: AI models that predict trends, content virality, and user engagement, helping users anticipate future developments within subreddits.
Sourced Directly from Reddit:
All social media data in this dataset is sourced directly from Reddit, ensuring accuracy and authenticity. The dataset is updated regularly, reflecting the latest trends and user interactions on the platform. This ensures that users have access to the most current and relevant data for their analyses.
Key Features:
Use Cases:
Data Quality and Reliability:
The Reddit Subreddit Dataset emphasizes data quality and reliability. Each record is carefully compiled from Reddit’s vast database, ensuring that the information is both accurate and up-to-date. The AI-generated columns further enhance the dataset's value, providing automated insights that help users quickly identify key trends and sentiments.
Integration and Usability:
The dataset is provided in a format that is compatible with most data analysis tools and platforms, making it easy to integrate into existing workflows. Users can quickly import, analyze, and utilize the data for various applications, from market research to academic studies.
User-Friendly Structure and Metadata:
The data is organized for easy navigation and analysis, with metadata files included to help users identify relevant subreddits and data points. The AI-enhanced columns are clearly labeled and structured, allowing users to efficiently incorporate these insights into their analyses.
Ideal For:
This dataset is an essential resource for anyone looking to understand the intricacies of Reddit's vast ecosystem, offering the data and AI-enhanced insights needed to drive informed decisions and strategies across various fields. Whether you’re tracking emerging trends, analyzing user behavior, or conduc...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
MultiSocial is a dataset (described in a paper) for benchmarking multilingual (22 languages) machine-generated text detection in the social-media domain (5 platforms). It contains 472,097 texts, of which about 58k are human-written and approximately the same amount is generated by each of 7 multilingual large language models using 3 iterations of paraphrasing. The dataset has been anonymized to minimize the amount of sensitive data by hiding email addresses, usernames, and phone numbers.
If you use this dataset in any publication, project, tool or in any other form, please, cite the paper.
Due to the data sources (described below), the dataset may contain harmful, disinformation, or offensive content. Based on a multilingual toxicity detector, about 8% of the text samples are probably toxic (from 5% in WhatsApp to 10% in Twitter). Although we have used older data sources (which have a lower probability of including machine-generated texts), the labeling (of human-written text) might not be 100% accurate. The anonymization procedure might not successfully hide all the sensitive/personal content; thus, use the data cautiously (if feeling affected by such content, report the found issues in this regard to dpo[at]kinit.sk). The intended use is for non-commercial research purposes only.
The human-written part consists of a pseudo-randomly selected subset of social media posts from 6 publicly available datasets:
Telegram data originated in Pushshift Telegram, containing 317M messages (Baumgartner et al., 2020). It contains messages from 27k+ channels. The collection started with a set of right-wing extremist and cryptocurrency channels (about 300 in total) and was expanded based on occurrence of forwarded messages from other channels. In the end, it thus contains a wide variety of topics and societal movements reflecting the data collection time.
Twitter data originated in CLEF2022-CheckThat! Task 1, containing 34k tweets on COVID-19 and politics (Nakov et al., 2022), combined with Sentiment140, containing 1.6M tweets on various topics (Go et al., 2009).
Gab data originated in the dataset containing 22M posts from Gab social network. The authors of the dataset (Zannettou et al., 2018) found out that “Gab is predominantly used for the dissemination and discussion of news and world events, and that it attracts alt-right users, conspiracy theorists, and other trolls.” They also found out that hate speech is much more prevalent there compared to Twitter, but lower than 4chan's Politically Incorrect board.
Discord data originated in Discord-Data, containing 51M messages. This is a long-context, anonymized, clean, multi-turn and single-turn conversational dataset based on Discord data scraped from a large variety of servers, big and small. According to the dataset authors, it contains around 0.1% of potentially toxic comments (based on the applied heuristic/classifier).
WhatsApp data originated in whatsapp-public-groups, containing 300k messages (Garimella & Tyson, 2018). The public dataset contains the anonymised data, collected for around 5 months from around 178 groups. Original messages were made available to us on request to dataset authors for research purposes.
From these datasets, we have pseudo-randomly sampled up to 1300 texts (up to 300 for test split and the remaining up to 1000 for train split if available) for each of the selected 22 languages (using a combination of automated approaches to detect the language) and platform. This process resulted in 61,592 human-written texts, which were further filtered out based on occurrence of some characters or their length, resulting in about 58k human-written texts.
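The per-language, per-platform sampling described above can be sketched roughly as follows. This is a hedged illustration, not the authors' code: the function name, the fixed seed, and the choice to fill the test split first are assumptions based on the wording "up to 300 for test split and the remaining up to 1000 for train split".

```python
import random
from collections import defaultdict


def sample_splits(texts, test_size=300, train_size=1000, seed=0):
    """Sample up to test_size + train_size texts per (language, source) pair.

    The first test_size sampled texts go to the test split and the
    remainder (up to train_size) go to the train split.
    """
    groups = defaultdict(list)
    for item in texts:
        groups[(item["language"], item["source"])].append(item)

    rng = random.Random(seed)  # fixed seed keeps the sampling reproducible
    sampled = []
    for key in sorted(groups):
        items = groups[key][:]
        rng.shuffle(items)
        chosen = items[: test_size + train_size]
        for index, item in enumerate(chosen):
            split = "test" if index < test_size else "train"
            sampled.append({**item, "split": split})
    return sampled


# Toy input: one well-covered pair and one sparse pair (illustrative values).
texts = [{"text": f"en {i}", "language": "en", "source": "twitter"} for i in range(1500)]
texts += [{"text": f"sk {i}", "language": "sk", "source": "telegram"} for i in range(200)]

sampled = sample_splits(texts)
en_test = sum(1 for s in sampled if s["language"] == "en" and s["split"] == "test")
en_train = sum(1 for s in sampled if s["language"] == "en" and s["split"] == "train")
sk_test = sum(1 for s in sampled if s["language"] == "sk" and s["split"] == "test")
sk_train = sum(1 for s in sampled if s["language"] == "sk" and s["split"] == "train")
print(en_test, en_train, sk_test, sk_train)
```

With this ordering, a sparse language/platform pair contributes to the test split before it contributes any training data, which is one plausible reading of "if available".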
The machine-generated part contains texts generated by 7 LLMs (Aya-101, Gemini-1.0-pro, GPT-3.5-Turbo-0125, Mistral-7B-Instruct-v0.2, opt-iml-max-30b, v5-Eagle-7B-HF, vicuna-13b). All these models were self-hosted except for GPT and Gemini, where we used the publicly available APIs. We generated the texts using 3 paraphrases of the original human-written data and then preprocessed the generated texts (filtered out cases when the generation obviously failed).
The dataset has the following fields:
'text' - a text sample,
'label' - 0 for human-written text, 1 for machine-generated text,
'multi_label' - a string representing a large language model that generated the text or the string "human" representing a human-written text,
'split' - a string identifying train or test split of the dataset for the purpose of training and evaluation respectively,
'language' - the ISO 639-1 language code identifying the detected language of the given text,
'length' - word count of the given text,
'source' - a string identifying the source dataset / platform of the given text,
'potential_noise' - 0 for text without identified noise, 1 for text with potential noise.
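A typical way to use these fields is to select a split, drop potentially noisy samples, and inspect the label balance. The records below are invented placeholders that only mirror the field names listed above; the `source` and `multi_label` values in particular are assumptions, not values confirmed by the dataset.

```python
from collections import Counter

# Illustrative records using the documented field names (values are made up).
records = [
    {"text": "good morning all", "label": 0, "multi_label": "human", "split": "train",
     "language": "en", "length": 3, "source": "twitter", "potential_noise": 0},
    {"text": "morning, everyone!", "label": 1, "multi_label": "vicuna-13b", "split": "train",
     "language": "en", "length": 2, "source": "twitter", "potential_noise": 0},
    {"text": "@@@ ###", "label": 1, "multi_label": "vicuna-13b", "split": "train",
     "language": "en", "length": 2, "source": "twitter", "potential_noise": 1},
    {"text": "dobre rano", "label": 0, "multi_label": "human", "split": "test",
     "language": "sk", "length": 2, "source": "telegram", "potential_noise": 0},
]

# Keep only training samples without identified noise.
train_clean = [
    r for r in records if r["split"] == "train" and r["potential_noise"] == 0
]

# Binary label balance (0 = human-written, 1 = machine-generated).
label_counts = Counter(r["label"] for r in train_clean)
print(label_counts)
```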
ToDo Statistics (under construction)
This dataset provides comprehensive social media profile links discovered through real-time web search. It includes profiles from major social networks like Facebook, TikTok, Instagram, Twitter, LinkedIn, Youtube, Pinterest, Github and more. The data is gathered through intelligent search algorithms and pattern matching. Users can leverage this dataset for social media research, influencer discovery, social presence analysis, and social media marketing. The API enables efficient discovery of social profiles across multiple platforms. The dataset is delivered in a JSON format via REST API.
During a 2024 survey, 77 percent of respondents from Nigeria stated that they used social media as a source of news. In comparison, just 23 percent of Japanese respondents said the same. Large portions of social media users around the world admit that they do not trust social platforms either as media sources or as a way to get news, and yet they continue to access such networks on a daily basis.
Social media: trust and consumption
Despite the majority of adults surveyed in each country reporting that they used social networks to keep up to date with news and current affairs, a 2018 study showed that social media is the least trusted news source in the world. Less than 35 percent of adults in Europe considered social networks to be trustworthy in this respect, yet more than 50 percent of adults in Portugal, Poland, Romania, Hungary, Bulgaria, Slovakia and Croatia said that they got their news on social media.
What is clear is that we live in an era where social media is such an enormous part of daily life that consumers will still use it in spite of their doubts or reservations. Concerns about fake news and propaganda on social media have not stopped billions of users accessing their favorite networks on a daily basis.
Most Millennials in the United States use social media for news every day, and younger consumers in European countries are much more likely to use social networks for national political news than their older peers.
Like it or not, reading news on social is fast becoming the norm for younger generations, and this form of news consumption will likely increase further regardless of whether consumers fully trust their chosen network or not.
Researcher(s): Alexandros Mokas, Eleni Kamateri
Supervisor: Ioannis Tsampoulatidis
This dataset contains the post-processing of the social media data collected for two different use cases during the first two years of the Deepcube project. More specifically, it contains two sub-datasets:
The UC2 dataset, containing the post-processing of the Twitter data collected for the DeepCube use case (UC2) dealing with climate-induced migration in Africa. This dataset contains in total 5,695,253 social media posts collected from the Twitter platform, based on the initial version of search criteria relevant to UC2 (defined by Universitat De Valencia), focused on the regions of Ethiopia and Somalia and collected from 26 June 2021 until March 2023.
The UC5 dataset, containing the post-processing of the Twitter and Instagram data collected for the DeepCube use case (UC5) related to sustainable and environmentally friendly tourism. This dataset contains in total 58,143 social media posts collected from the Twitter and Instagram platforms (12,881 collected from Twitter and 45,262 collected from Instagram), based on the initial version of search criteria relevant to UC5 (defined by MURMURATION SAS), focused on the region of Brazil and collected from 26 June 2021 until March 2023.
For every social media post retrieved from Twitter and Instagram, a preprocessing step was performed. This involved a three-step analysis of each post using the appropriate web service. First, the location of the post was automatically extracted from the text using a location extraction service. Second, the images included in the post were analyzed using a concept extraction service, which identified and provided the top ten concepts that best described the image. These concepts included items such as "person," "building," "drought," "sun," and so on. Finally, the sentiment expressed in the post's text was determined by using a sentiment analysis service. The sentiment was classified as either positive, negative, or neutral.
After the social media posts were preprocessed, they were visualized using the Social Media Web Application. This intuitive, user-friendly online application was designed for both expert and non-expert users and offers a web-based user interface for filtering and visualizing the collected social media data. The application provides various filtering options, an interactive map, a timeline, and a collection of graphs to help users analyze the data. Moreover, this application provides users with the option to download aggregated data for specific periods by applying filters and clicking the "Download Posts" button. This feature allows users to easily extract and analyze social media data outside of the web application, providing greater flexibility and control over data analysis.
The dataset is provided by INFALIA. INFALIA, being a spin-off of the CERTH institute and a partner of a research EU project, releases this dataset containing Tweet IDs and post pre-processing data for the sole purpose of enabling the validation of the research conducted within DeepCube. Moreover, Twitter Content provided in this dataset to third parties remains subject to the Twitter Policy, and those third parties must agree to the Twitter Terms of Service, Privacy Policy, Developer Agreement, and Developer Policy (https://developer.twitter.com/en/developer-terms) before receiving this download.
License: Creative Commons Attribution 4.0 International
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Researcher(s): Alexandros Mokas, Eleni Kamateri
Supervisor: Ioannis Tsampoulatidis
This repository contains 3 social media datasets:
2 Post-processing datasets: These datasets contain post-processing data extracted from the analysis of social media posts collected for two different use cases during the first two years of the DeepCube project. More specifically, these include:
The UC2 dataset, containing the post-processing analysis of the Twitter data collected for the DeepCube use case (UC2) dealing with climate-induced migration in Africa. This dataset contains a total of 5,695,253 social media posts collected from the Twitter platform, based on the initial version of the search criteria relevant to UC2 defined by Universitat De Valencia, focused on the regions of Ethiopia and Somalia and covering the period from 26 June 2021 to March 2023.
The UC5 dataset, containing the post-processing analysis of the Twitter and Instagram data collected for the DeepCube use case (UC5) related to sustainable and environmentally friendly tourism. This dataset contains a total of 58,143 social media posts collected from the Twitter and Instagram platforms (12,881 from Twitter and 45,262 from Instagram), based on the initial version of the search criteria relevant to UC5 defined by MURMURATION SAS, focused on Brazil and covering the period from 26 June 2021 to March 2023.
1 Annotated dataset: An additional annotated dataset was created that contains post-processing data along with annotations of Twitter posts collected for UC2 for the years 2010-2022. More specifically, it includes:
The UC2 annotated dataset, containing the post-processing of the Twitter data collected for the DeepCube use case (UC2) dealing with climate-induced migration in Africa. This dataset contains a total of 1,721 annotated social media posts (412 relevant and 1,309 irrelevant) collected from the Twitter platform, focused on the region of Somalia and covering the period from 1 January 2010 to 31 December 2022.
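The annotation counts above imply a fairly imbalanced label distribution, which matters for anyone training a relevance classifier on this data. The following is a minimal sketch of that arithmetic; only the two counts come from the description, while the inverse-frequency class weighting is an illustrative assumption and not something the dataset authors describe.

```python
# Counts taken from the dataset description.
RELEVANT = 412
IRRELEVANT = 1309
TOTAL = RELEVANT + IRRELEVANT  # 1721, matching the stated total

# Share of posts annotated as relevant.
relevant_share = RELEVANT / TOTAL

# Accuracy of a trivial classifier that always predicts the majority class.
majority_baseline = max(RELEVANT, IRRELEVANT) / TOTAL

# Assumption: inverse-frequency weights, one common way to offset imbalance.
class_weights = {
    "relevant": TOTAL / (2 * RELEVANT),
    "irrelevant": TOTAL / (2 * IRRELEVANT),
}

print(f"relevant share: {relevant_share:.1%}")
print(f"majority baseline accuracy: {majority_baseline:.1%}")
print(class_weights)
```

Roughly one post in four is labelled relevant, so plain accuracy is a weak metric here: a model has to beat the majority baseline of about 76 percent to be useful.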
For every social media post retrieved from Twitter and Instagram, a preprocessing step was performed. This involved a three-step analysis of each post using the appropriate web service. First, the location of the post was automatically extracted from the text using a location extraction service. Second, the images included in the post were analyzed using a concept extraction service, which identified and provided the top ten concepts that best described the image. These concepts included items such as "person," "building," "drought," "sun," and so on. Finally, the sentiment expressed in the post's text was determined by using a sentiment analysis service. The sentiment was classified as either positive, negative, or neutral.
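The three-step analysis described above can be pictured as a small pipeline. The sketch below is only an illustration under stated assumptions: the actual location, concept, and sentiment services are not documented here, so `toy_locate` and `toy_classify` are hypothetical stand-ins, and the `ProcessedPost` record layout is invented for the example rather than taken from the dataset.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

SENTIMENT_LABELS = {"positive", "negative", "neutral"}


@dataclass
class ProcessedPost:
    """Hypothetical record layout for one post-processed post."""
    post_id: str
    location: Optional[str]
    concepts: List[str] = field(default_factory=list)
    sentiment: str = "neutral"


def top_concepts(scores: Dict[str, float], k: int = 10) -> List[str]:
    """Return the k highest-scoring concept labels, best first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [label for label, _ in ranked[:k]]


def preprocess(
    post_id: str,
    text: str,
    image_scores: Dict[str, float],
    locate: Callable[[str], Optional[str]],
    classify: Callable[[str], str],
) -> ProcessedPost:
    """Run the three steps: location, image concepts, sentiment."""
    sentiment = classify(text)
    if sentiment not in SENTIMENT_LABELS:
        raise ValueError(f"unexpected sentiment label: {sentiment!r}")
    return ProcessedPost(post_id, locate(text), top_concepts(image_scores), sentiment)


# Toy stand-ins for the real web services (assumptions for illustration only).
def toy_locate(text: str) -> Optional[str]:
    for place in ("Somalia", "Ethiopia", "Brazil"):
        if place.lower() in text.lower():
            return place
    return None


def toy_classify(text: str) -> str:
    lowered = text.lower()
    if "drought" in lowered:
        return "negative"
    if "beautiful" in lowered:
        return "positive"
    return "neutral"


example = preprocess(
    "1",
    "Severe drought reported in Somalia",
    {"drought": 0.9, "sun": 0.7, "person": 0.4},
    toy_locate,
    toy_classify,
)
print(example)
```

Passing the services in as callables keeps the pipeline independent of whichever real services are used, and makes each step easy to replace or test in isolation.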
After the social media posts were preprocessed, they were visualized using the Social Media Web Application. This intuitive, user-friendly online application was designed for both expert and non-expert users and offers a web-based user interface for filtering and visualizing the collected social media data. The application provides various filtering options, an interactive map, a timeline, and a collection of graphs to help users analyze the data. Moreover, this application provides users with the option to download aggregated data for specific periods by applying filters and clicking the "Download Posts" button. This feature allows users to easily extract and analyze social media data outside of the web application, providing greater flexibility and control over data analysis.
The dataset is provided by INFALIA. INFALIA, being a spin-off of the CERTH institute and a partner in an EU research project, releases this dataset containing Tweet IDs and post pre-processing data for the sole purpose of enabling the validation of the research conducted within the DeepCube project. Moreover, Twitter Content provided in this dataset to third parties remains subject to the Twitter Policy, and those third parties must agree to the Twitter Terms of Service, Privacy Policy, Developer Agreement, and Developer Policy (https://developer.twitter.com/en/developer-terms) before receiving this download.
How much time do people spend on social media? As of 2025, the average daily social media usage of internet users worldwide amounted to 141 minutes per day, down from 143 minutes in the previous year. Currently, the country with the most time spent on social media per day is Brazil, with online users spending an average of 3 hours and 49 minutes on social media each day. In comparison, the daily time spent with social media in the U.S. was just 2 hours and 16 minutes.
Global social media usage
Currently, the global social network penetration rate is 62.3 percent. Northern Europe had an 81.7 percent social media penetration rate, topping the ranking of global social media usage by region. Eastern and Middle Africa closed the ranking with 10.1 and 9.6 percent usage reach, respectively. People access social media for a variety of reasons. Users like to find funny or entertaining content and enjoy sharing photos and videos with friends, but mainly use social media to stay in touch with current events and friends.
Global impact of social media
Social media has a wide-reaching and significant impact on not only online activities but also offline behavior and life in general. During a global online user survey in February 2019, a significant share of respondents stated that social media had increased their access to information, ease of communication, and freedom of expression. On the flip side, respondents also felt that social media had worsened their personal privacy, increased polarization in politics, and heightened everyday distractions.
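The daily usage figures quoted in this entry mix plain minutes with hours-and-minutes, which makes them awkward to compare directly. As a small worked example, the sketch below converts everything to minutes and derives the gaps; the variable and function names are illustrative, and only the input figures come from the text.

```python
def to_minutes(hours: int, minutes: int) -> int:
    """Convert an hours-and-minutes duration to whole minutes."""
    return hours * 60 + minutes


GLOBAL_2025 = 141       # minutes per day, worldwide average
GLOBAL_PREVIOUS = 143   # minutes per day, previous year

usage = {
    "Brazil": to_minutes(3, 49),  # 229 minutes
    "U.S.": to_minutes(2, 16),    # 136 minutes
}

# Difference between each country and the worldwide average, in minutes.
gap_to_global = {country: m - GLOBAL_2025 for country, m in usage.items()}

# Year-over-year change of the worldwide average, in percent.
yoy_change_pct = (GLOBAL_2025 - GLOBAL_PREVIOUS) / GLOBAL_PREVIOUS * 100

print(usage)
print(gap_to_global)
print(f"{yoy_change_pct:.2f}%")
```

On these numbers, users in Brazil spend 88 minutes per day more than the worldwide average, while users in the U.S. spend 5 minutes less.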
MyDigitalFootprint (MDF) is a novel large-scale dataset composed of smartphone embedded sensors data, physical proximity information, and Online Social Networks interactions aimed at supporting multimodal context-recognition and social relationships modelling in mobile environments. The dataset includes two months of measurements and information collected from the personal mobile devices of 31 volunteer users by following the in-the-wild data collection approach: the data has been collected in the users' natural environment, without limiting their usual behaviour. Existing public datasets generally consist of a limited set of context data, aimed at optimising specific application domains (human activity recognition is the most common example). On the contrary, the dataset contains a comprehensive set of information describing the user context in the mobile environment.
The complete analysis of the data contained in MDF has been presented in the following publication:
https://www.sciencedirect.com/science/article/abs/pii/S1574119220301383?via%3Dihub
The full anonymised dataset is contained in the folder MDF. Moreover, in order to demonstrate the efficacy of MDF, there are three proof of concept context-aware applications based on different machine learning tasks:
For the sake of reproducibility, the data used to evaluate the proof-of-concept applications are contained in the folders link-prediction, context-recognition, and cars, respectively.
https://dataverse.harvard.edu/api/datasets/:persistentId/versions/1.1/customlicense?persistentId=doi:10.7910/DVN/RLLL1V
Pinning down the role of social ties in the decision to protest has been notoriously elusive, largely due to data limitations. Social media and their global use by protesters offer an unprecedented opportunity to observe real-time social ties and online behavior, though often without an attendant measure of real-world behavior. We collect data on Twitter activity during the 2015 Charlie Hebdo protest in Paris which, unusually, record real-world protest attendance and network structure measured beyond egocentric networks. We devise a test of social theories of protest that hold that participation depends on exposure to others’ intentions and network position determines exposure. Our findings are strongly consistent with these theories, showing that protesters are significantly more connected to one another via direct, indirect, triadic, and reciprocated ties than comparable non-protesters. These results offer the first large-scale empirical support for the claim that social network structure has consequences for protest participation. The data were collected by the NYU Social Media and Political Participation (SMaPP) laboratory (https://wp.nyu.edu/smapp/), of which Nagler and Tucker are co-Directors along with Richard Bonneau and John T. Jost. The SMaPP lab is supported by the INSPIRE program of the National Science Foundation (Award SES-1248077), the New York University Global Institute for Advanced Study, the Moore-Sloan Data Science Environment, and Dean Thomas Carew’s Research Investment Fund at New York University. In order to run the replication end-to-end, we recommend downloading the comprehensive archive (charlie-hebdo-replication.tar.gz). The archive contains all the files with the appropriate directory structure. Once the archive is expanded, the full replication pipeline may be executed by running the script run-all.sh in the scripts directory.
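To make the tie types mentioned above more concrete, here is a minimal sketch of how direct, reciprocated, and indirect ties within a group of users could be counted on a directed follow graph. The definitions used are simplifying assumptions for illustration and are not taken from the replication code: a direct tie means at least one user in a pair follows the other, a reciprocated tie means both follow each other, and an indirect tie means the pair has no direct tie but shares at least one neighbour. Triadic ties are left out to keep the example short, and the sample graph is made up.

```python
from itertools import combinations
from typing import Dict, Set

Graph = Dict[str, Set[str]]  # follower -> set of accounts they follow


def follows(graph: Graph, a: str, b: str) -> bool:
    """True if account a follows account b."""
    return b in graph.get(a, set())


def neighbours(graph: Graph, a: str) -> Set[str]:
    """Accounts connected to a in either direction."""
    outgoing = graph.get(a, set())
    incoming = {u for u, followed in graph.items() if a in followed}
    return outgoing | incoming


def tie_counts(graph: Graph, group: Set[str]) -> Dict[str, int]:
    """Count tie types over all unordered pairs of users in the group."""
    counts = {"direct": 0, "reciprocated": 0, "indirect": 0}
    for a, b in combinations(sorted(group), 2):
        ab, ba = follows(graph, a, b), follows(graph, b, a)
        if ab or ba:
            counts["direct"] += 1
            if ab and ba:
                counts["reciprocated"] += 1
        elif neighbours(graph, a) & neighbours(graph, b):
            counts["indirect"] += 1
    return counts


# Made-up follow graph for illustration.
sample_graph: Graph = {
    "ana": {"ben", "cy"},
    "ben": {"ana"},
    "cy": {"dee"},
    "dee": set(),
}
result = tie_counts(sample_graph, {"ana", "ben", "dee"})
print(result)  # {'direct': 1, 'reciprocated': 1, 'indirect': 1}
```

Comparing such counts between a group of protesters and a matched group of non-protesters is the kind of contrast the abstract describes, although the actual measures and matching procedure are those in the replication archive, not this sketch.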
https://brightdata.com/license
The LinkedIn posts dataset is a comprehensive collection of user-generated content on LinkedIn, featuring key fields such as post ID, user ID, URL, title, post text, date posted, hashtags, and engagement metrics like the number of likes and comments. This dataset also includes additional elements such as embedded links, images, videos, top visible comments, and links to more posts by the user or relevant content. It is ideal for social media analysts, marketers, and researchers looking to analyze user behavior, content trends, and engagement on LinkedIn.