100+ datasets found

Facebook users worldwide 2017-2027
statista.com
tokrwards.com
Cite
Stacy Jo Dixon, Facebook users worldwide 2017-2027 [Dataset]. https://www.statista.com/topics/1164/social-networks/
Explore at:
Dataset provided by
Statista (http://statista.com/)
Authors
Stacy Jo Dixon
Description
The global number of Facebook users was forecast to increase continuously between 2023 and 2027 by a total of 391 million users (+14.36 percent). After the fourth consecutive year of growth, the Facebook user base is estimated to reach 3.1 billion users in 2027, a new peak. Notably, the number of Facebook users has increased continuously over the past years. The user figures shown here for Facebook were estimated by taking into account company filings or press material, secondary research, app downloads and traffic data. They refer to the average monthly active users over the period and count multiple accounts held by one person only once. The data shown are an excerpt of Statista's Key Market Indicators (KMI). The KMI are a collection of primary and secondary indicators on the macro-economic, demographic and technological environment in up to 150 countries and regions worldwide. All indicators are sourced from international and national statistical offices, trade associations and the trade press, and they are processed to generate comparable data sets (see supplementary notes under details for more information).
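As a quick consistency check on the figures above, the quoted increase (391 million) and growth rate (+14.36 percent) imply a 2023 base and a 2027 total. The sketch below derives both; the derived values are not stated by the source and are only as precise as the rounded inputs.

```python
# Figures quoted in the description above.
increase_millions = 391.0   # forecast growth between 2023 and 2027
increase_share = 0.1436     # the same growth expressed as a fraction

# Derived values (an inference from the two figures, not source data).
base_2023_millions = increase_millions / increase_share
total_2027_millions = base_2023_millions + increase_millions

print(f"Implied 2023 base:  {base_2023_millions / 1000:.2f} billion")
print(f"Implied 2027 total: {total_2027_millions / 1000:.2f} billion")
```

The implied 2027 total rounds to the 3.1 billion users quoted in the description.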
Social Media Datasets
brightdata.com
.json, .csv, .xlsx
Updated Sep 7, 2022
Cite
Bright Data (2022). Social Media Datasets [Dataset]. https://brightdata.com/products/datasets/social-media
Explore at:
Available download formats: .json, .csv, .xlsx
Dataset updated
Sep 7, 2022
Dataset authored and provided by
Bright Data (https://brightdata.com/)
License
https://brightdata.com/license
Area covered
Worldwide
Description
Gain valuable insights with our comprehensive Social Media Dataset, designed to help businesses, marketers, and analysts track trends, monitor engagement, and optimize strategies. This dataset provides structured and reliable social media data from multiple platforms.

Dataset Features

User Profiles: Access public social media profiles, including usernames, bios, follower counts, engagement metrics, and more. Ideal for audience analysis, influencer marketing, and competitive research.
Posts & Content: Extract posts, captions, hashtags, media (images/videos), timestamps, and engagement metrics such as likes, shares, and comments. Useful for trend analysis, sentiment tracking, and content strategy optimization.
Comments & Interactions: Analyze user interactions, including replies, mentions, and discussions. This data helps brands understand audience sentiment and engagement patterns.
Hashtag & Trend Tracking: Monitor trending hashtags, topics, and viral content across platforms to stay ahead of industry trends and consumer interests.

Customizable Subsets for Specific Needs

Our Social Media Dataset is fully customizable, allowing you to filter data based on platform, region, keywords, engagement levels, or specific user profiles. Whether you need a broad dataset for market research or a focused subset for brand monitoring, we tailor the dataset to your needs.

Popular Use Cases

Brand Monitoring & Reputation Management: Track brand mentions, customer feedback, and sentiment analysis to manage online reputation effectively.
Influencer Marketing & Audience Analysis: Identify key influencers, analyze engagement metrics, and optimize influencer partnerships.
Competitive Intelligence: Monitor competitor activity, content performance, and audience engagement to refine marketing strategies.
Market Research & Consumer Insights: Analyze social media trends, customer preferences, and emerging topics to inform business decisions.
AI & Predictive Analytics: Leverage structured social media data for AI-driven trend forecasting, sentiment analysis, and automated content recommendations.

Whether you're tracking brand sentiment, analyzing audience engagement, or monitoring industry trends, our Social Media Dataset provides the structured data you need. Get started today and customize your dataset to fit your business objectives.
Social Media Usage Dataset(Applications)
cubig.ai
Updated May 28, 2025
Cite
CUBIG (2025). Social Media Usage Dataset(Applications) [Dataset]. https://cubig.ai/store/products/321/social-media-usage-datasetapplications
Explore at:
Dataset updated
May 28, 2025
Dataset authored and provided by
CUBIG
License
https://cubig.ai/store/terms-of-service
Measurement technique
Synthetic data generation using AI techniques for model training, Privacy-preserving data transformation via differential privacy
Description
1) Data Introduction
• The Social Media Usage Dataset(Applications) contains usage patterns and activity indicators for 1,000 users across seven major social media platforms, including Facebook, Instagram, and Twitter.

2) Data Utilization
(1) Characteristics of the Social Media Usage Dataset(Applications):
• This dataset provides social media activity data for each user, including daily usage time, number of posts, number of likes received, and number of new followers.
(2) The Social Media Usage Dataset(Applications) can be used for:
• Analysis of user participation by platform: compare usage time and activity across platforms to analyze participation and popular trends.
• Marketing strategy: user activity data can inform targeted marketing, content production, and user retention strategies.

Social Media Engagement (2025)

kaggle.com

Updated Mar 21, 2025

Cite

Damla Ağaça (2025). Social Media Engagement (2025) [Dataset]. https://www.kaggle.com/datasets/dagaca/social-media-engagement-2025

Explore at:

Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)

Dataset updated

Mar 21, 2025

Dataset provided by

Kaggle

Authors

Damla Ağaça

License

MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically

Description

Social Media Engagement (2025)

This dataset contains 20,000 synthetic social media posts crafted to mimic realistic user activity on a fictional platform. It simulates various user demographics, post content, hashtags, topics, and detailed engagement metrics such as likes, comments, and shares.

Overview

Each record represents a unique social media post made by a user, enriched with features that allow for analysis of trends, behavior, and engagement. The dataset includes:

User-level information: age, gender, followers, verified status, etc.
Post-level information: topic, hashtags, media, engagement
Platform and device data
Calculated engagement rate

Column Descriptions

Column	Description
`post_id`	Unique identifier for each post
`user_id`	Unique identifier for each user
`user_name`	Synthetic username
`user_gender`	Gender of the user (Male, Female, Other)
`user_age`	Age of the user (16–60)
`followers_count`	Number of followers the user has
`following_count`	Number of accounts the user follows
`account_creation_date`	Account registration date
`is_verified`	Boolean flag for verified users
`location`	City or region where the user is located
`topic`	Main topic of the post (e.g., Travel, Food, Fashion, etc.)
`post_content`	Actual content of the post
`content_length`	Number of characters in the post content
`hashtags`	Relevant hashtags used in the post
`has_media`	Whether the post includes image or video
`post_date`	Timestamp of when the post was made
`device`	Device used to make the post (e.g., iPhone, Android)
`language`	Language of the post
`likes`	Number of likes received
`comments`	Number of comments received
`shares`	Number of times the post was shared
`engagement_rate`	Normalized metric: (likes + comments + shares) / followers_count
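
The `engagement_rate` definition in the table can be reproduced directly from the other columns. The sketch below is illustrative only: the sample record is invented, and returning 0.0 for a user with no followers is an assumption, since the dataset description does not say how that case is handled.

```python
def engagement_rate(likes: int, comments: int, shares: int, followers_count: int) -> float:
    """Compute (likes + comments + shares) / followers_count, as defined in the table."""
    if followers_count == 0:
        # Assumption: the source does not specify behaviour for zero followers.
        return 0.0
    return (likes + comments + shares) / followers_count


# Hypothetical record using the column names from the table above.
post = {"likes": 120, "comments": 30, "shares": 10, "followers_count": 800}
rate = engagement_rate(
    post["likes"], post["comments"], post["shares"], post["followers_count"]
)
print(f"engagement_rate = {rate:.3f}")
```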

Social Media Datasets
promptcloud.com
csv
Updated Jul 28, 2025
Cite
PromptCloud (2025). Social Media Datasets [Dataset]. https://www.promptcloud.com/dataset/social-media/
Explore at:
Available download formats: csv
Dataset updated
Jul 28, 2025
Dataset authored and provided by
PromptCloud
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Social media datasets provide real-time insight into public opinion, trending topics, user behavior, sentiment, and global events as reflected on platforms like Twitter (X), Facebook, and Instagram. These datasets are crucial for marketing analysts, newsrooms, political strategists, crisis response teams, and brand managers to monitor discourse and take data-driven action. Extracted from live user-generated content, […]
Average daily time spent on social media worldwide 2012-2025
statista.com
thefarmdosupply.com
Updated Jun 19, 2025
Cite
Statista (2025). Average daily time spent on social media worldwide 2012-2025 [Dataset]. https://www.statista.com/statistics/433871/daily-social-media-usage-worldwide/
Explore at:
Dataset updated
Jun 19, 2025
Dataset authored and provided by
Statista (http://statista.com/)
Area covered
Worldwide
Description
How much time do people spend on social media? As of 2025, the average daily social media usage of internet users worldwide amounted to 141 minutes per day, down from 143 minutes in the previous year. Currently, the country with the most time spent on social media per day is Brazil, with online users spending an average of 3 hours and 49 minutes on social media each day. In comparison, the daily time spent on social media in the U.S. was just 2 hours and 16 minutes.

Global social media usage

Currently, the global social network penetration rate is 62.3 percent. Northern Europe had an 81.7 percent social media penetration rate, topping the ranking of global social media usage by region. Eastern and Middle Africa closed the ranking with 10.1 and 9.6 percent usage reach, respectively. People access social media for a variety of reasons. Users like to find funny or entertaining content and enjoy sharing photos and videos with friends, but mainly use social media to stay in touch with current events and friends.

Global impact of social media

Social media has a wide-reaching and significant impact on not only online activities but also offline behavior and life in general. During a global online user survey in February 2019, a significant share of respondents stated that social media had increased their access to information, ease of communication, and freedom of expression. On the flip side, respondents also felt that social media had worsened their personal privacy, increased polarization in politics and heightened everyday distractions.
Dataset: Decentralized Social Media Use and Users
borealisdata.ca
search.dataone.org
Updated Aug 7, 2024
Cite
Anatoliy Gruzd; Alyssa Saiphoo; Philip Mai (2024). Dataset: Decentralized Social Media Use and Users [Dataset]. http://doi.org/10.5683/SP3/MJYGAR
Unique identifier
https://doi.org/10.5683/SP3/MJYGAR
Dataset updated
Aug 7, 2024
Dataset provided by
Borealis
Authors
Anatoliy Gruzd; Alyssa Saiphoo; Philip Mai
License
https://borealisdata.ca/api/datasets/:persistentId/versions/1.1/customlicense?persistentId=doi:10.5683/SP3/MJYGAR
Dataset funded by
Canada Research Chairs Program
Description
The dataset contains 31 transcribed and anonymized interviews of blockchain-based social media users. The dataset was collected during the summer of 2022 as part of a research project at the Social Media Lab at Toronto Metropolitan University. The dataset is available upon request for validation by peer-reviewers or other researchers in the field.

IMDB & Social Media Dataset

kaggle.com

Updated Nov 5, 2023

Cite

momo5577 (2023). IMDB & Social Media Dataset [Dataset]. https://www.kaggle.com/datasets/momo5577/imdb-and-social-media-dataset

Dataset updated

Nov 5, 2023

Dataset provided by

Kaggle (http://kaggle.com/)

Authors

momo5577

Description

This dataset is compiled using this dataset from GitHub.

Data Description Table

Variable Name	Description
`movie_title`	Title of the Movie
`duration`	Duration in minutes
`director_name`	Name of the Director of the Movie
`director_facebook_likes`	Number of likes of the Director on his Facebook Page
`actor_1_name`	Primary actor starring in the movie
`actor_1_facebook_likes`	Number of likes of the Actor_1 on his/her Facebook Page
`actor_2_name`	Other actor starring in the movie
`actor_2_facebook_likes`	Number of likes of the Actor_2 on his/her Facebook Page
`actor_3_name`	Other actor starring in the movie
`actor_3_facebook_likes`	Number of likes of the Actor_3 on his/her Facebook Page
`num_user_for_reviews`	Number of users who gave a review
`num_critic_for_reviews`	Number of critical reviews on imdb
`num_voted_users`	Number of people who voted for the movie
`cast_total_facebook_likes`	Total number of facebook likes of the entire cast of the movie
`movie_facebook_likes`	Number of Facebook likes in the movie page
`plot_keywords`	Keywords describing the movie plot
`facenumber_in_poster`	Number of actors featured in the movie poster
`color`	Film colorization. ‘Black and White’ or ‘Color’
`genres`	Film categorization like ‘Animation’, ‘Comedy’, etc
`title_year`	The year in which the movie is released (1916:2016)
`language`	Languages like English, Arabic, Chinese, etc
`country`	Country where the movie is produced
`content_rating`	Content rating of the movie
`aspect_ratio`	Aspect ratio the movie was made in
`movie_imdb_link`	IMDB link of the movie
`gross`	Gross earnings of the movie in Dollars
`budget`	Budget of the movie in Dollars
`imdb_score`	IMDB Score of the movie on IMDB

Abbreviated FOMO and social media dataset
figshare.mq.edu.au
researchdata.edu.au
txt
Updated May 30, 2023
Cite
Danielle Einstein; Carol Dabb; Madeleine Ferrari; Anne McMaugh; Peter McEvoy; Ron Rapee; Eyal Karin; Maree J. Abbott (2023). Abbreviated FOMO and social media dataset [Dataset]. http://doi.org/10.25949/20188298.v1
Explore at:
Available download formats: txt
Unique identifier
https://doi.org/10.25949/20188298.v1
Dataset updated
May 30, 2023
Dataset provided by
Macquarie University
Authors
Danielle Einstein; Carol Dabb; Madeleine Ferrari; Anne McMaugh; Peter McEvoy; Ron Rapee; Eyal Karin; Maree J. Abbott
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This database comprises 951 participants who provided self-report data online in their school classrooms. The data were collected in 2016 and 2017. The dataset comprises 509 males (54%) and 442 females (46%). Their ages ranged from 12 to 16 years (M = 13.69, SD = 0.72). Seven participants did not report their age. The majority were born in Australia (N = 849, 89%). The next most common countries of birth were China (N = 24, 2.5%), the UK (N = 23, 2.4%), and the USA (N = 9, 0.9%). Data were drawn from students at five Australian independent secondary schools. The data contain item responses for the Spence Children’s Anxiety Scale (SCAS; Spence, 1998), which comprises 44 items. The social media question asked about frequency of use with the question “How often do you use social media?”. The response options ranged from constantly to once a week or less. Items measuring Fear of Missing Out were included and incorporated the following five questions based on the APS Stress and Wellbeing in Australia Survey (APS, 2015): “When I have a good time it is important for me to share the details online”; “I am afraid that I will miss out on something if I don’t stay connected to my online social networks”; “I feel worried and uncomfortable when I can’t access my social media accounts”; “I find it difficult to relax or sleep after spending time on social networking sites”; “I feel my brain burnout with the constant connectivity of social media”. Internal consistency for this measure was α = .81. Self-compassion was measured using the 12-item short form of the Self-Compassion Scale (SCS-SF; Raes et al., 2011). The dataset can be downloaded as an Excel file (composed of two worksheet tabs) or as CSV files: 1) Data and 2) Variable labels.
References:
Australian Psychological Society. (2015). Stress and wellbeing in Australia survey. https://www.headsup.org.au/docs/default-source/default-document-library/stress-and-wellbeing-in-australia-report.pdf?sfvrsn=7f08274d_4
Raes, F., Pommier, E., Neff, K. D., & Van Gucht, D. (2011). Construction and factorial validation of a short form of the self-compassion scale. Clinical Psychology and Psychotherapy, 18(3), 250-255. https://doi.org/10.1002/cpp.702
Spence, S. H. (1998). A measure of anxiety symptoms among children. Behaviour Research and Therapy, 36(5), 545-566. https://doi.org/10.1016/S0005-7967(98)00034-5
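The internal consistency reported for the five Fear of Missing Out items (α = .81) is presumably Cronbach's alpha; the description does not name the statistic, so that reading is an assumption. A minimal sketch of the standard formula follows. The responses are invented for illustration and will not reproduce the published value.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents, each a list of k item scores."""
    k = len(rows[0])
    # Variance of each item across respondents.
    item_vars = [variance([row[i] for row in rows]) for i in range(k)]
    # Variance of each respondent's total score.
    total_var = variance([sum(row) for row in rows])
    if total_var == 0:
        raise ValueError("alpha is undefined when total scores do not vary")
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical responses: 5 respondents x 5 FOMO items (not from the dataset).
responses = [
    [4, 4, 3, 4, 5],
    [2, 1, 2, 2, 1],
    [3, 3, 3, 2, 3],
    [5, 4, 5, 5, 4],
    [1, 2, 1, 1, 2],
]
print(f"alpha = {cronbach_alpha(responses):.2f}")
```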
Graph-Based Social Media Data on Mental Health Topics
data.mendeley.com
Updated Nov 4, 2024
Cite
Samuel Ady Sanjaya (2024). Graph-Based Social Media Data on Mental Health Topics [Dataset]. http://doi.org/10.17632/z45txpdp7f.2
Explore at:
Unique identifier
https://doi.org/10.17632/z45txpdp7f.2
Dataset updated
Nov 4, 2024
Authors
Samuel Ady Sanjaya
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset is structured as a graph, where nodes represent users and edges capture their interactions, including tweets, retweets, replies, and mentions. Each node provides detailed user attributes, such as unique ID, follower and following counts, and verification status, offering insights into each user's identity, role, and influence in the mental health discourse. The edges illustrate user interactions, highlighting engagement patterns and types of content that drive responses, such as tweet impressions. This interconnected structure enables sentiment analysis and public reaction studies, allowing researchers to explore engagement trends and identify the mental health topics that resonate most with users.

The dataset consists of three files:
1. Edges Data: Contains graph data essential for social network analysis, including fields for UserID (Source), UserID (Destination), Post/Tweet ID, and Date of Relationship. This file enables analysis of user connections without including tweet content, maintaining compliance with Twitter/X’s data-sharing policies.
2. Nodes Data: Offers user-specific details relevant to network analysis, including UserID, Account Creation Date, Follower and Following counts, Verified Status, and Date Joined Twitter. This file allows researchers to examine user behavior (e.g., identifying influential users or spam-like accounts) without direct reference to tweet content.
3. Twitter/X Content Data: This file contains only the raw tweet text as a single-column dataset, without associated user identifiers or metadata. By isolating the text, we ensure alignment with anonymization standards observed in similar published datasets, safeguarding user privacy in compliance with Twitter/X's data guidelines. This content is crucial for addressing the research focus on mental health discourse in social media.
(References to prior Data in Brief publications involving Twitter/X data informed the dataset's structure.)
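
As a rough illustration of how the Edges Data file could feed a network analysis, the sketch below counts incoming interactions per user (in-degree). The header names and sample rows are hypothetical stand-ins for the fields described above; the real file's column labels may differ.

```python
import csv
import io
from collections import Counter

# Hypothetical sample in place of the Edges Data file; column names are assumed.
EDGES_CSV = """source_user_id,destination_user_id,post_id,date
u1,u2,p1,2024-01-01
u3,u2,p2,2024-01-02
u2,u1,p3,2024-01-03
"""


def in_degree(edges_file):
    """Count how many interactions point at each destination user."""
    counts = Counter()
    for row in csv.DictReader(edges_file):
        counts[row["destination_user_id"]] += 1
    return counts


degrees = in_degree(io.StringIO(EDGES_CSV))
# Users who receive the most interactions are candidates for "influential" accounts.
print(degrees.most_common(2))
```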
US B2B Marketing Data | 148MM B2B Marketing Contacts: Email, Phone + Social...
datarade.ai
.json, .csv, .xls
Updated Oct 16, 2023
Cite
Salutary Data (2023). US B2B Marketing Data | 148MM B2B Marketing Contacts: Email, Phone + Social Media Marketing Data [Dataset]. https://datarade.ai/data-products/salutary-data-direct-marketing-data-62m-us-b2b-contacts-salutary-data
Explore at:
Available download formats: .json, .csv, .xls
Dataset updated
Oct 16, 2023
Dataset authored and provided by
Salutary Data
Area covered
United States of America
Description
Salutary Data is a boutique B2B contact and company data provider that's committed to delivering high quality data for sales intelligence, lead generation, marketing, recruiting / HR, identity resolution, and ML / AI. Our database currently consists of 148MM+ highly curated B2B Contacts (US only), along with over 4M companies, and is updated regularly to ensure we have the most up-to-date information.

We can enrich your in-house data (CRM Enrichment, Lead Enrichment, etc.) and provide you with a custom dataset (such as a lead list) tailored to your target audience specifications and data use-case. We also support large-scale data licensing to software providers and agencies that intend to redistribute our data to their customers and end-users.

What makes Salutary unique?
- We offer our clients a truly unique, one-stop aggregation of the best-of-breed quality data sources. Our supplier network consists of numerous, established high quality suppliers that are rigorously vetted.
- We leverage third party verification vendors to ensure phone numbers and emails are accurate and connect to the right person. Additionally, we deploy automated and manual verification techniques to ensure we have the latest job information for contacts.
- We're reasonably priced and easy to work with.

Products: API Suite; Web UI; Full and Custom Data Feeds

Services:
- Data Enrichment: We assess the fill rate gaps and profile your customer file for the purpose of appending fields, updating information, and/or rendering net new “look alike” prospects for your campaigns.
- ABM Match & Append: Send us your domain or other company related files, and we’ll match your Account Based Marketing targets and provide you with B2B contacts to campaign. Optionally throw in your suppression file to avoid any redundant records.
- Verification (“Cleaning/Hygiene”) Services: Address the 2% per month aging issue on contact records! We will identify duplicate records and contacts no longer at the company, remove your email hard bounces, and update/replace titles or phones. This is right up our alley and leverages our existing internal and external processes and systems.
Social media profile growth, engagement rate, and reach
data.sugarlandtx.gov
xlsx
Updated Jan 3, 2024
Cite
Communications and Community Engagement (2024). Social media profile growth, engagement rate, and reach [Dataset]. https://data.sugarlandtx.gov/dataset/social-media-profile-growth-engagement-rate-and-reach
Explore at:
Available download formats: xlsx
Dataset updated
Jan 3, 2024
Dataset authored and provided by
Communications and Community Engagement
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Profile growth - the growth on our social platforms to see where and when we're gaining followers. Engagement rate - a ratio of how many people interacted with our posts based on when users are usually online. Reach - the number of feeds our posts appeared in (doesn't mean people interacted with the post).
Data from: Youtube social network
kaggle.com
zip
Updated Sep 1, 2019
Cite
Lorenzo De Tomasi (2019). Youtube social network [Dataset]. https://www.kaggle.com/datasets/lodetomasi1995/youtube-social-network
Explore at:
Available download formats: zip (10604317 bytes)
Dataset updated
Sep 1, 2019
Authors
Lorenzo De Tomasi
License
https://creativecommons.org/publicdomain/zero/1.0/
Area covered
YouTube
Description
Youtube social network and ground-truth communities

Dataset information

Youtube is a video-sharing web site that includes a social network. In the Youtube social network, users form friendships with each other and can create groups which other users can join. We consider such user-defined groups as ground-truth communities. This data is provided by Alan Mislove et al.

We regard each connected component in a group as a separate ground-truth community. We remove the ground-truth communities which have fewer than 3 nodes. We also provide the top 5,000 communities with highest quality which are described in our paper. As for the network, we provide the largest connected component.
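
The community construction described above (split each group into connected components, then drop components with fewer than 3 nodes) can be sketched as follows. This is an illustration under assumptions, not the authors' code: the function name, the edge-list input, and the toy data are all hypothetical.

```python
from collections import defaultdict


def group_communities(edges, group_members, min_size=3):
    """Split one user-defined group into connected components of at least min_size nodes."""
    members = set(group_members)
    # Keep only friendships where both endpoints belong to the group.
    adjacency = defaultdict(set)
    for a, b in edges:
        if a in members and b in members:
            adjacency[a].add(b)
            adjacency[b].add(a)

    seen = set()
    communities = []
    for start in members:
        if start in seen:
            continue
        # Depth-first traversal collects one connected component.
        component = set()
        stack = [start]
        seen.add(start)
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        if len(component) >= min_size:
            communities.append(component)
    return communities


# Toy friendship edges and one group; user 6 is not a group member.
friendships = [(1, 2), (2, 3), (4, 5), (3, 6)]
communities = group_communities(friendships, group_members=[1, 2, 3, 4, 5])
print(communities)
```

In the toy data, members 4 and 5 form a two-node component and are dropped, leaving a single community of three users.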

more info : https://snap.stanford.edu/data/com-Youtube.html
daily_socialmedia_engagement
kaggle.com
Updated Feb 27, 2023
Cite
Jeel Gajera (2023). daily_socialmedia_engagement [Dataset]. https://www.kaggle.com/datasets/earthian/daily-socialmedia-engagement
Dataset updated
Feb 27, 2023
Dataset provided by
Kaggle
Authors
Jeel Gajera
License
http://opendatacommons.org/licenses/dbcl/1.0/
Description
This dataset contains information about daily engagement hours on various social media platforms for 1000 users. The data includes user IDs, age, and daily engagement hours on Facebook, Instagram, WhatsApp, Twitter, LinkedIn, Snapchat, and YouTube.
MultiSocial
zenodo.org
data.niaid.nih.gov
Updated Aug 20, 2025
Cite
Dominik Macko; Jakub Kopal; Robert Moro; Ivan Srba (2025). MultiSocial [Dataset]. http://doi.org/10.5281/zenodo.13846152
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.13846152
Dataset updated
Aug 20, 2025
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Dominik Macko; Jakub Kopal; Robert Moro; Ivan Srba
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
MultiSocial is a dataset (described in a paper) for benchmarking multilingual (22 languages) machine-generated text detection in the social-media domain (5 platforms). It contains 472,097 texts, of which about 58k are human-written and approximately the same number are generated by each of 7 multilingual large language models using 3 iterations of paraphrasing. The dataset has been anonymized to minimize the amount of sensitive data by hiding email addresses, usernames, and phone numbers.

If you use this dataset in any publication, project, tool or in any other form, please, cite the paper.

Disclaimer

Due to the data sources (described below), the dataset may contain harmful, disinformation, or offensive content. Based on a multilingual toxicity detector, about 8% of the text samples are probably toxic (from 5% in WhatsApp to 10% in Twitter). Although we have used older data sources (with a lower probability of including machine-generated texts), the labeling (of human-written text) might not be 100% accurate. The anonymization procedure might not successfully hide all the sensitive/personal content; thus, use the data cautiously (if you feel affected by such content, report the issues found to dpo[at]kinit.sk). The intended use is for non-commercial research purposes only.

Data Source

The human-written part consists of a pseudo-randomly selected subset of social media posts from 6 publicly available datasets:

Telegram data originated in Pushshift Telegram, containing 317M messages (Baumgartner et al., 2020). It contains messages from 27k+ channels. The collection started with a set of right-wing extremist and cryptocurrency channels (about 300 in total) and was expanded based on occurrence of forwarded messages from other channels. In the end, it thus contains a wide variety of topics and societal movements reflecting the data collection time.

Twitter data originated in CLEF2022-CheckThat! Task 1, containing 34k tweets on COVID-19 and politics (Nakov et al., 2022), combined with Sentiment140, containing 1.6M tweets on various topics (Go et al., 2009).

Gab data originated in the dataset containing 22M posts from Gab social network. The authors of the dataset (Zannettou et al., 2018) found out that “Gab is predominantly used for the dissemination and discussion of news and world events, and that it attracts alt-right users, conspiracy theorists, and other trolls.” They also found out that hate speech is much more prevalent there compared to Twitter, but lower than 4chan's Politically Incorrect board.

Discord data originated in Discord-Data, containing 51M messages. This is a long-context, anonymized, clean, multi-turn and single-turn conversational dataset based on Discord data scraped from a large variety of servers, big and small. According to the dataset authors, it contains around 0.1% of potentially toxic comments (based on the applied heuristic/classifier).

WhatsApp data originated in whatsapp-public-groups, containing 300k messages (Garimella & Tyson, 2018). The public dataset contains the anonymised data, collected for around 5 months from around 178 groups. Original messages were made available to us on request to dataset authors for research purposes.

From these datasets, we have pseudo-randomly sampled up to 1300 texts (up to 300 for test split and the remaining up to 1000 for train split if available) for each of the selected 22 languages (using a combination of automated approaches to detect the language) and platform. This process resulted in 61,592 human-written texts, which were further filtered out based on occurrence of some characters or their length, resulting in about 58k human-written texts.
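
The per-language, per-platform sampling described above can be sketched as follows. This is a hedged illustration, not the authors' pipeline: the function name is hypothetical, and filling the test split first (so that train receives whatever remains, up to 1000) is one reading of "if available".

```python
import random


def sample_splits(texts, rng, test_cap=300, train_cap=1000):
    """Pseudo-randomly pick up to test_cap + train_cap texts for one language/platform pair."""
    pool = list(texts)
    rng.shuffle(pool)
    selected = pool[: test_cap + train_cap]
    # Assumption: the test split is filled first; train gets the remainder.
    test = selected[:test_cap]
    train = selected[test_cap:]
    return train, test


# Hypothetical pool of 500 texts for one language on one platform.
rng = random.Random(42)  # fixed seed so the pseudo-random selection is repeatable
train_texts, test_texts = sample_splits((f"text {i}" for i in range(500)), rng)
print(len(train_texts), len(test_texts))
```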

The machine-generated part contains texts generated by 7 LLMs (Aya-101, Gemini-1.0-pro, GPT-3.5-Turbo-0125, Mistral-7B-Instruct-v0.2, opt-iml-max-30b, v5-Eagle-7B-HF, vicuna-13b). All these models were self-hosted except for GPT and Gemini, where we used the publicly available APIs. We generated the texts using 3 paraphrases of the original human-written data and then preprocessed the generated texts (filtered out cases when the generation obviously failed).

The dataset has the following fields:

'text' - a text sample,

'label' - 0 for human-written text, 1 for machine-generated text,

'multi_label' - a string representing a large language model that generated the text or the string "human" representing a human-written text,

'split' - a string identifying train or test split of the dataset for the purpose of training and evaluation respectively,

'language' - the ISO 639-1 language code identifying the detected language of the given text,

'length' - word count of the given text,

'source' - a string identifying the source dataset / platform of the given text,

'potential_noise' - 0 for text without identified noise, 1 for text with potential noise.
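
A minimal sketch of how these fields might be used to select clean training data follows. The records are invented, and the exact strings stored in 'multi_label' and 'source' are assumptions; only the field names and their meanings come from the list above.

```python
from collections import Counter

# Hypothetical records using the documented field names.
records = [
    {"text": "sample a", "label": 0, "multi_label": "human", "split": "train",
     "language": "en", "length": 2, "source": "telegram", "potential_noise": 0},
    {"text": "sample b", "label": 1, "multi_label": "GPT-3.5-Turbo-0125", "split": "train",
     "language": "en", "length": 2, "source": "telegram", "potential_noise": 0},
    {"text": "sample c", "label": 0, "multi_label": "human", "split": "test",
     "language": "sk", "length": 2, "source": "twitter", "potential_noise": 0},
    {"text": "sample d", "label": 1, "multi_label": "vicuna-13b", "split": "train",
     "language": "de", "length": 2, "source": "discord", "potential_noise": 1},
]

# Keep training rows that were not flagged as potentially noisy.
clean_train = [
    r for r in records if r["split"] == "train" and r["potential_noise"] == 0
]

# 'label' is 0 exactly when 'multi_label' is "human".
labels_consistent = all((r["label"] == 0) == (r["multi_label"] == "human") for r in records)

by_generator = Counter(r["multi_label"] for r in clean_train)
print(by_generator, labels_consistent)
```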

ToDo Statistics (under construction)
Social Media Disaster-Related Discussions
kaggle.com
Updated Dec 14, 2022
Cite
The Devastator (2022). Social Media Disaster-Related Discussions [Dataset]. https://www.kaggle.com/datasets/thedevastator/mining-disaster-related-insights-from-social-med
Dataset updated
Dec 14, 2022
Dataset provided by
Kaggle
Authors
The Devastator
Description
Social Media Disaster-Related Discussions

Detecting Relevant Content with Trusted Judgments

By CrowdFlower [source]

About this dataset

Welcome to the disaster tweets dataset! This collection of tweets holds a wealth of information about global disasters and their effects on people, governments, and organizations all over the world. With over 10,000 tweets collected and carefully annotated with labels of whether they reported an actual disaster or not, this dataset provides unique insight into what these events look like in terms of social media conversations.

This information is derived from a variety of key terms related to disaster events, such as “ablaze” and “pandemonium”, which were used to gather each individual tweet for analysis. The columns for each tweet include detailed metadata about the user who posted it, along with variables such as keyword relevance and location. Alongside these attributes is the core text of each tweet, giving you access in one place to stories about natural disasters, contagious disease outbreaks, and conflicts between nations.

So whatever you're looking for - whether it's observations about first-hand accounts or conducting research on public sentiment during a major event - this dataset offers you an invaluable source full of timely information that could potentially save lives down the line. So take your journey through this data now and embark upon discovering what devastation looks like through social media!

How to use the dataset

This dataset contains tweets related to disaster events, including the keyword, location, text, tweetid and userid. It provides insights into how people interact with each other on social media during a disaster. Using this dataset you can gain valuable insight into the dynamics of online communication in disasters and provide an important point of reference for future disaster management initiatives.

Research Ideas

Analyzing the effectiveness of disaster relief and humanitarian aid efforts, by mapping tweets against public data of areas affected by disasters and donations made to help those affected.

Developing advanced statistical models to predict the magnitude and impact of an oncoming natural disaster using keyword analysis in social media posts related to past disasters.

Creating text-based classifiers to accurately detect disaster-related tweets in real time, giving emergency service providers early warning signs before a potential event occurs.

Acknowledgements

If you use this dataset in your research, please credit the original authors.

License

Unknown License - Please check the dataset description for more information.

Columns

File: socialmedia-disaster-tweets-DFE.csv

| Column name | Description |
|:-----------------------|:-----------------------------------------------------------------------------------|
| _golden | A boolean value indicating whether the tweet is a golden tweet or not. (Boolean) |
| _unit_state | The state of the tweet (e.g. finalized, judged, etc.). (String) |
| _trusted_judgments | The number of trusted judgments for the tweet. (Integer) |
| _last_judgment_at | The date and time of the last judgment for the tweet. (DateTime) |
| choose_one | The label assigned to the tweet (e.g. relevant, not relevant, etc.). (String) |
| choose_one_gold | The gold label assigned to the tweet (e.g. relevant, not relevant, etc.). (String) |
| keyword | The keyword associated with the tweet. (String) |
| location | The location associated with the tweet. (String) |
| text | The text content of the tweet. (String) |
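A minimal sketch of reading a file with these columns and tallying the assigned labels, using only the Python standard library. The sample rows are invented, and the exact label strings ("relevant", "not relevant") are an assumption based on the examples in the column description; the real file may spell them differently.

```python
import csv
import io
from collections import Counter

# Invented sample with the documented column names; not real dataset rows.
SAMPLE = """_golden,_unit_state,_trusted_judgments,_last_judgment_at,choose_one,choose_one_gold,keyword,location,text
false,finalized,5,,relevant,,ablaze,,Toy tweet about a building fire
false,finalized,5,,not relevant,,ablaze,,Toy tweet about a mixtape being fire
false,finalized,6,,relevant,,pandemonium,,Toy tweet about an evacuation
"""


def label_counts(csv_file):
    """Count tweets per value of the 'choose_one' label column."""
    reader = csv.DictReader(csv_file)
    return Counter(row["choose_one"] for row in reader)


def texts_with_label(csv_file, label):
    """Return the tweet texts whose 'choose_one' label equals `label`."""
    reader = csv.DictReader(csv_file)
    return [row["text"] for row in reader if row["choose_one"] == label]


print(label_counts(io.StringIO(SAMPLE)))
```

For the actual file, the same functions accept a handle opened with `open("socialmedia-disaster-tweets-DFE.csv", newline="")`; the file's encoding is not stated in the description and may need to be passed explicitly.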

Acknowledgements

If you use this dataset in your research, please credit CrowdFlower.
Replication Data for: Social Networks and Protest Participation: Evidence...
dataverse.harvard.edu
Updated Nov 22, 2019
Cite
Jennifer Larson; Jonathan Nagler; Jonathan Ronen; Joshua Tucker (2019). Replication Data for: Social Networks and Protest Participation: Evidence from 130 Million Twitter Users [Dataset]. http://doi.org/10.7910/DVN/RLLL1V
Unique identifier
https://doi.org/10.7910/DVN/RLLL1V
Dataset updated
Nov 22, 2019
Dataset provided by
Harvard Dataverse
Authors
Jennifer Larson; Jonathan Nagler; Jonathan Ronen; Joshua Tucker
License
https://dataverse.harvard.edu/api/datasets/:persistentId/versions/1.1/customlicense?persistentId=doi:10.7910/DVN/RLLL1V
Description
Pinning down the role of social ties in the decision to protest has been notoriously elusive, largely due to data limitations. Social media and their global use by protesters offer an unprecedented opportunity to observe real-time social ties and online behavior, though often without an attendant measure of real-world behavior. We collect data on Twitter activity during the 2015 Charlie Hebdo protest in Paris which, unusually, record real-world protest attendance and network structure measured beyond egocentric networks. We devise a test of social theories of protest that hold that participation depends on exposure to others’ intentions and network position determines exposure. Our findings are strongly consistent with these theories, showing that protesters are significantly more connected to one another via direct, indirect, triadic, and reciprocated ties than comparable non-protesters. These results offer the first large-scale empirical support for the claim that social network structure has consequences for protest participation. The data were collected by the NYU Social Media and Political Participation (SMaPP) laboratory (https://wp.nyu.edu/smapp/), of which Nagler and Tucker are co-Directors along with Richard Bonneau and John T. Jost. The SMaPP lab is supported by the INSPIRE program of the National Science Foundation (Award SES-1248077), the New York University Global Institute for Advanced Study, the Moore-Sloan Data Science Environment, and Dean Thomas Carew’s Research Investment Fund at New York University. In order to run the replication end-to-end, we recommend downloading the comprehensive archive (charlie-hebdo-replication.tar.gz). The archive contains all the files with the appropriate directory structure. Once the archive is expanded, the full replication pipeline may be executed by running the script run-all.sh in the scripts directory.
Twitter cascade dataset
researchdata.smu.edu.sg
smu.edu.sg
pdf
Updated May 31, 2023
Cite
Living Analytics Research Centre (2023). Twitter cascade dataset [Dataset]. http://doi.org/10.25440/smu.12062709.v1
Available download formats: pdf
Unique identifier
https://doi.org/10.25440/smu.12062709.v1
Dataset updated
May 31, 2023
Dataset provided by
SMU Research Data Repository (RDR)
Authors
Living Analytics Research Centre
License
http://rightsstatements.org/vocab/InC/1.0/
Description
This dataset comprises a set of information cascades generated by Singapore Twitter users. Here a cascade is defined as a set of tweets about the same topic. The dataset was collected via the Twitter REST and streaming APIs in the following way. Starting from popular seed users (i.e., users having many followers), we crawled their follow, retweet, and user mention links. We then added those followers/followees, retweet sources, and mentioned users who state Singapore in their profile location. This gave a total of 184,794 Twitter user accounts. Tweets were then crawled from these users from 1 April to 31 August 2012, yielding 32,479,134 tweets in all.

To identify cascades, we extracted all the URL links and hashtags from the above tweets and treated them as the identities of cascades. In other words, all the tweets that contain the same URL link (or the same hashtag) represent a cascade. Mathematically, a cascade is represented as a set of user-timestamp pairs. Figure 1 provides an example, i.e. cascade C = {< u1, t1 >, < u2, t2 >, < u1, t3 >, < u3, t4 >, < u4, t5 >}.

For evaluation, the dataset was split into two parts: four months of data for training and the last month of data for testing. Table 1 summarizes the basic (count) statistics of the dataset. Each line in each file represents a cascade. The first term in each line is a hashtag or URL, and the second term is a list of user-timestamp pairs. Due to privacy concerns, all user identities are anonymized.
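The cascade definition above (all tweets sharing a hashtag or URL, stored as user-timestamp pairs) can be sketched as a simple grouping step. This is an illustration only: the regular expressions, the tuple layout of the input tweets, and the function name are assumptions, and the extraction rules actually used to build the dataset are not described.

```python
import re
from collections import defaultdict

# Simplified patterns; the dataset's real extraction rules are unknown.
URL = re.compile(r"https?://\S+")
HASHTAG = re.compile(r"#\w+")


def build_cascades(tweets):
    """Group (user, timestamp, text) tweets into cascades keyed by hashtag or URL.

    Each cascade is a list of (user, timestamp) pairs sorted by timestamp.
    A user may appear more than once, as u1 does in the example cascade C.
    """
    cascades = defaultdict(list)
    for user, timestamp, text in tweets:
        urls = URL.findall(text)
        # Remove URLs first so a '#fragment' inside a link is not read as a hashtag.
        hashtags = HASHTAG.findall(URL.sub(" ", text))
        for key in set(urls) | set(hashtags):
            cascades[key].append((user, timestamp))
    return {key: sorted(pairs, key=lambda p: p[1]) for key, pairs in cascades.items()}


# Toy tweets with anonymised user ids and integer timestamps.
tweets = [
    ("u1", 1, "#sg rain again"),
    ("u2", 2, "so hot today #sg http://x.co/a"),
    ("u1", 3, "#sg still raining"),
    ("u3", 4, "read this http://x.co/a"),
]
cascades = build_cascades(tweets)
print(cascades["#sg"])
```

A tweet that carries both a hashtag and a URL contributes to two cascades, which follows from treating each hashtag and each URL as its own cascade identity.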
Dataplex: Reddit Data | Global Social Media Data | 2.1M+ subreddits: trends,...
datarade.ai
.json, .csv
Updated Aug 12, 2024
Cite
Dataplex (2024). Dataplex: Reddit Data | Global Social Media Data | 2.1M+ subreddits: trends, audience insights + more | Ideal for Interest-Based Segmentation [Dataset]. https://datarade.ai/data-products/dataplex-reddit-data-global-social-media-data-1-1m-mill-dataplex
Available download formats: .json, .csv
Dataset updated
Aug 12, 2024
Dataset authored and provided by
Dataplex
Area covered
Holy See, Botswana, Macao, Chile, Jersey, Gambia, Martinique, Mexico, Christmas Island, Côte d'Ivoire
Description
The Reddit Subreddit Dataset by Dataplex offers a comprehensive and detailed view of Reddit’s vast ecosystem, now enhanced with appended AI-generated columns that provide additional insights and categorization. This dataset includes data from over 2.1 million subreddits, making it an invaluable resource for a wide range of analytical applications, from social media analysis to market research.

Dataset Overview:

This dataset includes detailed information on subreddit activities, user interactions, post frequency, comment data, and more. The inclusion of AI-generated columns adds an extra layer of analysis, offering sentiment analysis, topic categorization, and predictive insights that help users better understand the dynamics of each subreddit.

2.1 Million Subreddits with Enhanced AI Insights:

The dataset covers over 2.1 million subreddits and now includes AI-enhanced columns that provide:

Sentiment Analysis: AI-driven sentiment scores for posts and comments, allowing users to gauge community mood and reactions.

Topic Categorization: Automated categorization of subreddit content into relevant topics, making it easier to filter and analyze specific types of discussions.

Predictive Insights: AI models that predict trends, content virality, and user engagement, helping users anticipate future developments within subreddits.

Sourced Directly from Reddit:

All social media data in this dataset is sourced directly from Reddit, ensuring accuracy and authenticity. The dataset is updated regularly, reflecting the latest trends and user interactions on the platform. This ensures that users have access to the most current and relevant data for their analyses.

Key Features:

Subreddit Metrics: Detailed data on subreddit activity, including the number of posts, comments, votes, and user participation.

User Engagement: Insights into how users interact with content, including comment threads, upvotes/downvotes, and participation rates.

Trending Topics: Track emerging trends and viral content across the platform, helping you stay ahead of the curve in understanding social media dynamics.

AI-Enhanced Analysis: Utilize AI-generated columns for sentiment analysis, topic categorization, and predictive insights, providing a deeper understanding of the data.

Use Cases:

Social Media Analysis: Researchers and analysts can use this dataset to study online behavior, track the spread of information, and understand how content resonates with different audiences.

Market Research: Marketers can leverage the dataset to identify target audiences, understand consumer preferences, and tailor campaigns to specific communities.

Content Strategy: Content creators and strategists can use insights from the dataset to craft content that aligns with trending topics and user interests, maximizing engagement.

Academic Research: Academics can explore the dynamics of online communities, studying everything from the spread of misinformation to the formation of online subcultures.

Data Quality and Reliability:

The Reddit Subreddit Dataset emphasizes data quality and reliability. Each record is carefully compiled from Reddit’s vast database, ensuring that the information is both accurate and up-to-date. The AI-generated columns further enhance the dataset's value, providing automated insights that help users quickly identify key trends and sentiments.

Integration and Usability:

The dataset is provided in a format that is compatible with most data analysis tools and platforms, making it easy to integrate into existing workflows. Users can quickly import, analyze, and utilize the data for various applications, from market research to academic studies.

User-Friendly Structure and Metadata:

The data is organized for easy navigation and analysis, with metadata files included to help users identify relevant subreddits and data points. The AI-enhanced columns are clearly labeled and structured, allowing users to efficiently incorporate these insights into their analyses.

Ideal For:

Data Analysts: Conduct in-depth analyses of subreddit trends, user engagement, and content virality. The dataset’s extensive coverage and AI-enhanced insights make it an invaluable tool for data-driven research.

Marketers: Use the dataset to better understand your target audience, tailor campaigns to specific interests, and track the effectiveness of marketing efforts across Reddit.

Researchers: Explore the social dynamics of online communities, analyze the spread of ideas and information, and study the impact of digital media on public discourse, all while leveraging AI-generated insights.

This dataset is an essential resource for anyone looking to understand the intricacies of Reddit's vast ecosystem, offering the data and AI-enhanced insights needed to drive informed decisions and strategies across various fields. Whether you’re tracking emerging trends, analyzing user behavior, or conduc...
Social Media and Mental Health - Dataset - BSOS Data Repository
bsos-data.umd.edu
Updated Jul 24, 2024
Cite
(2024). Social Media and Mental Health - Dataset - BSOS Data Repository [Dataset]. https://bsos-data.umd.edu/dataset/social-media-and-mental-health
Dataset updated
Jul 24, 2024
License
ODC Public Domain Dedication and Licence (PDDL) v1.0 (http://www.opendatacommons.org/licenses/pddl/1.0/)
License information was derived automatically
Description
The dataset encompasses demographic, health, and mental health information of students from 48 different states in the USA, born between 1971 and 2003. It includes data on general health ratings, responses to the PHQ-9 depression screening tool, and the GAD-7 anxiety assessment tool. It details how often students experienced various mental health symptoms over the past two weeks, their depression severity scores, and anxiety severity scores. Also, it covers experiences of feeling overwhelmed, exhausted, and hopeless within the last 12 months, along with diagnoses of depression, therapy, and medication usage. The dataset also includes information on various medical conditions, student status (full-time or international), sex, and race.
