The global number of Facebook users was forecast to increase continuously between 2023 and 2027 by a total of 391 million users (+14.36 percent). After the fourth consecutive year of growth, the Facebook user base is estimated to reach 3.1 billion users in 2027, a new peak. Notably, the number of Facebook users has also increased continuously over the past years. User figures, shown here for the platform Facebook, have been estimated by taking into account company filings or press material, secondary research, app downloads and traffic data. They refer to the average monthly active users over the period and count multiple accounts held by one person only once.
The shown data are an excerpt of Statista's Key Market Indicators (KMI). The KMI are a collection of primary and secondary indicators on the macro-economic, demographic and technological environment in up to 150 countries and regions worldwide. All indicators are sourced from international and national statistical offices, trade associations and the trade press and they are processed to generate comparable data sets (see supplementary notes under details for more information).
https://brightdata.com/license
Gain valuable insights with our comprehensive Social Media Dataset, designed to help businesses, marketers, and analysts track trends, monitor engagement, and optimize strategies. This dataset provides structured and reliable social media data from multiple platforms.
Dataset Features
- User Profiles: Access public social media profiles, including usernames, bios, follower counts, engagement metrics, and more. Ideal for audience analysis, influencer marketing, and competitive research.
- Posts & Content: Extract posts, captions, hashtags, media (images/videos), timestamps, and engagement metrics such as likes, shares, and comments. Useful for trend analysis, sentiment tracking, and content strategy optimization.
- Comments & Interactions: Analyze user interactions, including replies, mentions, and discussions. This data helps brands understand audience sentiment and engagement patterns.
- Hashtag & Trend Tracking: Monitor trending hashtags, topics, and viral content across platforms to stay ahead of industry trends and consumer interests.
Customizable Subsets for Specific Needs
Our Social Media Dataset is fully customizable, allowing you to filter data based on platform, region, keywords, engagement levels, or specific user profiles. Whether you need a broad dataset for market research or a focused subset for brand monitoring, we tailor the dataset to your needs.
Popular Use Cases
- Brand Monitoring & Reputation Management: Track brand mentions, customer feedback, and sentiment analysis to manage online reputation effectively.
- Influencer Marketing & Audience Analysis: Identify key influencers, analyze engagement metrics, and optimize influencer partnerships.
- Competitive Intelligence: Monitor competitor activity, content performance, and audience engagement to refine marketing strategies.
- Market Research & Consumer Insights: Analyze social media trends, customer preferences, and emerging topics to inform business decisions.
- AI & Predictive Analytics: Leverage structured social media data for AI-driven trend forecasting, sentiment analysis, and automated content recommendations.
Whether you're tracking brand sentiment, analyzing audience engagement, or monitoring industry trends, our Social Media Dataset provides the structured data you need. Get started today and customize your dataset to fit your business objectives.
https://cubig.ai/store/terms-of-service
1) Data Introduction
• The Social Media Usage Dataset (Applications) describes the usage patterns and activity indicators of 1,000 users across seven major social media platforms, including Facebook, Instagram, and Twitter.
2) Data Utilization
(1) Characteristics of the Social Media Usage Dataset (Applications):
• The dataset provides social media activity data for each user, including daily usage time, number of posts, number of likes received, and number of new followers.
(2) The Social Media Usage Dataset (Applications) can be used for:
• Analysis of user participation by platform: Compare usage time and activity across social media platforms to analyze participation and popular trends.
• Marketing strategy: Use the user activity data for targeted marketing, content production, and user retention strategies.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
This dataset contains 20,000 synthetic social media posts crafted to mimic realistic user activity on a fictional platform. It simulates various user demographics, post content, hashtags, topics, and detailed engagement metrics such as likes, comments, and shares.
Each record represents a unique social media post made by a user, enriched with features that allow for analysis of trends, behavior, and engagement. The dataset includes:
Column | Description |
---|---|
post_id | Unique identifier for each post |
user_id | Unique identifier for each user |
user_name | Synthetic username |
user_gender | Gender of the user (Male, Female, Other) |
user_age | Age of the user (16–60) |
followers_count | Number of followers the user has |
following_count | Number of accounts the user follows |
account_creation_date | Account registration date |
is_verified | Boolean flag for verified users |
location | City or region where the user is located |
topic | Main topic of the post (e.g., Travel, Food, Fashion, etc.) |
post_content | Actual content of the post |
content_length | Number of characters in the post content |
hashtags | Relevant hashtags used in the post |
has_media | Whether the post includes image or video |
post_date | Timestamp of when the post was made |
device | Device used to make the post (e.g., iPhone, Android) |
language | Language of the post |
likes | Number of likes received |
comments | Number of comments received |
shares | Number of times the post was shared |
engagement_rate | Normalized metric: (likes + comments + shares) / followers_count |
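The `engagement_rate` column can be recomputed from the other columns. The following is a minimal sketch of the formula given in the table; the helper name, the sample values, and the handling of accounts with zero followers are assumptions, since the dataset description does not say how that case is treated.

```python
def engagement_rate(likes: int, comments: int, shares: int, followers_count: int) -> float:
    """Return (likes + comments + shares) / followers_count."""
    if followers_count == 0:
        # Assumption: the source does not specify this case; 0.0 avoids a division error.
        return 0.0
    return (likes + comments + shares) / followers_count


# Hypothetical records using the column names from the table above.
posts = [
    {"post_id": "p1", "likes": 120, "comments": 30, "shares": 10, "followers_count": 800},
    {"post_id": "p2", "likes": 5, "comments": 0, "shares": 0, "followers_count": 0},
]

rates = {
    p["post_id"]: engagement_rate(p["likes"], p["comments"], p["shares"], p["followers_count"])
    for p in posts
}
```

Because the metric is normalized by follower count, a small account with a few interactions can score higher than a large account with many.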
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Social media datasets provide real-time insight into public opinion, trending topics, user behavior, sentiment, and global events as reflected on platforms like Twitter (X), Facebook, and Instagram. These datasets are crucial for marketing analysts, newsrooms, political strategists, crisis response teams, and brand managers to monitor discourse and take data-driven action. Extracted from live user-generated content, […]
How much time do people spend on social media?
As of 2025, the average daily social media usage of internet users worldwide amounted to 141 minutes per day, down from 143 minutes in the previous year. Currently, the country with the most time spent on social media per day is Brazil, with online users spending an average of 3 hours and 49 minutes on social media each day. In comparison, the daily time spent on social media in the U.S. was just 2 hours and 16 minutes.
Global social media usage
Currently, the global social network penetration rate is 62.3 percent. Northern Europe had an 81.7 percent social media penetration rate, topping the ranking of global social media usage by region. Eastern and Middle Africa closed the ranking with 10.1 and 9.6 percent usage reach, respectively. People access social media for a variety of reasons. Users like to find funny or entertaining content and enjoy sharing photos and videos with friends, but mainly use social media to stay in touch with current events and friends.
Global impact of social media
Social media has a wide-reaching and significant impact on not only online activities but also offline behavior and life in general. During a global online user survey in February 2019, a significant share of respondents stated that social media had increased their access to information, ease of communication, and freedom of expression. On the flip side, respondents also felt that social media had worsened their personal privacy, increased polarization in politics, and heightened everyday distractions.
https://borealisdata.ca/api/datasets/:persistentId/versions/1.1/customlicense?persistentId=doi:10.5683/SP3/MJYGAR
The dataset contains 31 transcribed and anonymized interviews of blockchain-based social media users. The dataset was collected during the summer of 2022 as part of a research project at the Social Media Lab at Toronto Metropolitan University. The dataset is available upon request for validation by peer-reviewers or other researchers in the field.
This dataset is compiled using this dataset from GitHub.
Data Description Table
Variable Name | Description |
---|---|
movie_title | Title of the Movie |
duration | Duration in minutes |
director_name | Name of the Director of the Movie |
director_facebook_likes | Number of likes of the Director on his Facebook Page |
actor_1_name | Primary actor starring in the movie |
actor_1_facebook_likes | Number of likes of the Actor_1 on his/her Facebook Page |
actor_2_name | Other actor starring in the movie |
actor_2_facebook_likes | Number of likes of the Actor_2 on his/her Facebook Page |
actor_3_name | Other actor starring in the movie |
actor_3_facebook_likes | Number of likes of the Actor_3 on his/her Facebook Page |
num_user_for_reviews | Number of users who gave a review |
num_critic_for_reviews | Number of critical reviews on imdb |
num_voted_users | Number of people who voted for the movie |
cast_total_facebook_likes | Total number of facebook likes of the entire cast of the movie |
movie_facebook_likes | Number of Facebook likes in the movie page |
plot_keywords | Keywords describing the movie plot |
facenumber_in_poster | Number of actors featured in the movie poster |
color | Film colorization. ‘Black and White’ or ‘Color’ |
genres | Film categorization like ‘Animation’, ‘Comedy’, etc |
title_year | The year in which the movie is released (1916:2016) |
language | Languages like English, Arabic, Chinese, etc |
country | Country where the movie is produced |
content_rating | Content rating of the movie |
aspect_ratio | Aspect ratio the movie was made in |
movie_imdb_link | IMDB link of the movie |
gross | Gross earnings of the movie in Dollars |
budget | Budget of the movie in Dollars |
imdb_score | Score of the movie on IMDB |
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This database is comprised of 951 participants who provided self-report data online in their school classrooms. The data were collected in 2016 and 2017. The dataset is comprised of 509 males (54%) and 442 females (46%). Their ages ranged from 12 to 16 years (M = 13.69, SD = 0.72). Seven participants did not report their age. The majority were born in Australia (N = 849, 89%). The next most common countries of birth were China (N = 24, 2.5%), the UK (N = 23, 2.4%), and the USA (N = 9, 0.9%). Data were drawn from students at five Australian independent secondary schools. The data contain item responses for the Spence Children’s Anxiety Scale (SCAS; Spence, 1998), which is comprised of 44 items. Social media use was measured by frequency, with the question “How often do you use social media?”. The response options ranged from constantly to once a week or less. Fear of Missing Out was measured with the following five items based on the APS Stress and Wellbeing in Australia Survey (APS, 2015): “When I have a good time it is important for me to share the details online; I am afraid that I will miss out on something if I don’t stay connected to my online social networks; I feel worried and uncomfortable when I can’t access my social media accounts; I find it difficult to relax or sleep after spending time on social networking sites; I feel my brain burnout with the constant connectivity of social media.” Internal consistency for this measure was α = .81. Self-compassion was measured using the 12-item short-form of the Self-Compassion Scale (SCS-SF; Raes et al., 2011). The data set can be downloaded as an Excel file (composed of two worksheet tabs) or as CSV files: 1) Data and 2) Variable labels.
References:
Australian Psychological Society. (2015). Stress and wellbeing in Australia survey. https://www.headsup.org.au/docs/default-source/default-document-library/stress-and-wellbeing-in-australia-report.pdf?sfvrsn=7f08274d_4
Raes, F., Pommier, E., Neff, K. D., & Van Gucht, D. (2011). Construction and factorial validation of a short form of the self-compassion scale. Clinical Psychology and Psychotherapy, 18(3), 250-255. https://doi.org/10.1002/cpp.702
Spence, S. H. (1998). A measure of anxiety symptoms among children. Behaviour Research and Therapy, 36(5), 545-566. https://doi.org/10.1016/S0005-7967(98)00034-5
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is structured as a graph, where nodes represent users and edges capture their interactions, including tweets, retweets, replies, and mentions. Each node provides detailed user attributes, such as unique ID, follower and following counts, and verification status, offering insights into each user's identity, role, and influence in the mental health discourse. The edges illustrate user interactions, highlighting engagement patterns and types of content that drive responses, such as tweet impressions. This interconnected structure enables sentiment analysis and public reaction studies, allowing researchers to explore engagement trends and identify the mental health topics that resonate most with users.
The dataset consists of three files:
1. Edges Data: Contains graph data essential for social network analysis, including fields for UserID (Source), UserID (Destination), Post/Tweet ID, and Date of Relationship. This file enables analysis of user connections without including tweet content, maintaining compliance with Twitter/X’s data-sharing policies.
2. Nodes Data: Offers user-specific details relevant to network analysis, including UserID, Account Creation Date, Follower and Following counts, Verified Status, and Date Joined Twitter. This file allows researchers to examine user behavior (e.g., identifying influential users or spam-like accounts) without direct reference to tweet content.
3. Twitter/X Content Data: This file contains only the raw tweet text as a single-column dataset, without associated user identifiers or metadata. By isolating the text, we ensure alignment with anonymization standards observed in similar published datasets, safeguarding user privacy in compliance with Twitter/X's data guidelines. This content is crucial for addressing the research focus on mental health discourse in social media.
(References to prior Data in Brief publications involving Twitter/X data informed the dataset's structure.)
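As an illustration of how the Edges Data could support the network analysis described above, the sketch below counts incoming and outgoing interactions per user. The dictionary keys (`source`, `destination`, `tweet_id`, `date`) and the sample rows are hypothetical stand-ins for the fields listed above; the actual column names in the file may differ, and in-degree is only one possible proxy for influence.

```python
from collections import Counter

# Hypothetical edge rows; the keys are assumed names for the documented fields.
edges = [
    {"source": "u1", "destination": "u2", "tweet_id": "t1", "date": "2023-01-01"},
    {"source": "u3", "destination": "u2", "tweet_id": "t2", "date": "2023-01-02"},
    {"source": "u2", "destination": "u1", "tweet_id": "t3", "date": "2023-01-03"},
]


def degree_counts(edges):
    """Return (in_degree, out_degree) counters keyed by user ID."""
    out_degree = Counter(edge["source"] for edge in edges)
    in_degree = Counter(edge["destination"] for edge in edges)
    return in_degree, out_degree


in_degree, out_degree = degree_counts(edges)
# User receiving the most interactions in this toy sample.
most_addressed = in_degree.most_common(1)[0][0]
```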
Salutary Data is a boutique B2B contact and company data provider that's committed to delivering high quality data for sales intelligence, lead generation, marketing, recruiting / HR, identity resolution, and ML / AI. Our database currently consists of 148MM+ highly curated B2B contacts (US only), along with 4M+ companies, and is updated regularly to ensure we have the most up-to-date information.
We can enrich your in-house data (CRM Enrichment, Lead Enrichment, etc.) and provide you with a custom dataset (such as a lead list) tailored to your target audience specifications and data use-case. We also support large-scale data licensing to software providers and agencies that intend to redistribute our data to their customers and end-users.
What makes Salutary unique?
- We offer our clients a truly unique, one-stop aggregation of the best-of-breed quality data sources. Our supplier network consists of numerous, established high quality suppliers that are rigorously vetted.
- We leverage third party verification vendors to ensure phone numbers and emails are accurate and connect to the right person. Additionally, we deploy automated and manual verification techniques to ensure we have the latest job information for contacts.
- We're reasonably priced and easy to work with.
Products:
- API Suite
- Web UI
- Full and Custom Data Feeds
Services:
- Data Enrichment: We assess the fill rate gaps and profile your customer file for the purpose of appending fields, updating information, and/or rendering net new “look alike” prospects for your campaigns.
- ABM Match & Append: Send us your domain or other company related files, and we’ll match your Account Based Marketing targets and provide you with B2B contacts to campaign. Optionally throw in your suppression file to avoid any redundant records.
- Verification (“Cleaning/Hygiene”) Services: Address the 2% per month aging issue on contact records! We will identify duplicate records and contacts no longer at the company, remove email hard bounces, and update/replace titles or phones. This is right up our alley and leverages our existing internal and external processes and systems.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Profile growth - the growth on our social platforms, showing where and when we're gaining followers.
Engagement rate - a ratio of how many people interacted with our posts based on when users are usually online.
Reach - the number of feeds our posts appeared in (this doesn't mean people interacted with the post).
https://creativecommons.org/publicdomain/zero/1.0/
Youtube social network and ground-truth communities
Dataset information
Youtube is a video-sharing web site that includes a social network. In the Youtube social network, users form friendships with each other and can create groups which other users can join. We consider such user-defined groups as ground-truth communities. This data is provided by Alan Mislove et al.
We regard each connected component in a group as a separate ground-truth community. We remove the ground-truth communities which have fewer than 3 nodes. We also provide the top 5,000 communities with highest quality which are described in our paper. As for the network, we provide the largest connected component.
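The community construction described above can be sketched as follows: restrict the friendship graph to one group's members, split it into connected components, and keep the components with at least 3 nodes. This is an illustrative reading of the description, not the dataset authors' code; the function name, the input layout, and the toy data are assumptions.

```python
from collections import defaultdict, deque


def split_group_into_communities(group_members, edges, min_size=3):
    """Split one user-defined group into connected components of at least min_size nodes."""
    members = set(group_members)
    adjacency = defaultdict(set)
    for u, v in edges:
        # Keep only friendships where both users belong to the group.
        if u in members and v in members:
            adjacency[u].add(v)
            adjacency[v].add(u)

    seen = set()
    communities = []
    for start in sorted(members):
        if start in seen:
            continue
        # Breadth-first search collects one connected component.
        component = {start}
        seen.add(start)
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    component.add(neighbor)
                    queue.append(neighbor)
        if len(component) >= min_size:
            communities.append(sorted(component))
    return communities


# Toy friendship edges and one group; user 6 is not a group member.
friendships = [(1, 2), (2, 3), (4, 5), (3, 6)]
group = [1, 2, 3, 4, 5]
communities = split_group_into_communities(group, friendships)
```

In this toy input the pair {4, 5} forms its own component but is dropped because it has fewer than 3 nodes.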
More info: https://snap.stanford.edu/data/com-Youtube.html
http://opendatacommons.org/licenses/dbcl/1.0/
This dataset contains information about daily engagement hours on various social media platforms for 1000 users. The data includes user IDs, age, and daily engagement hours on Facebook, Instagram, WhatsApp, Twitter, LinkedIn, Snapchat, and YouTube.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
MultiSocial is a dataset (described in a paper) for benchmarking multilingual (22 languages) machine-generated text detection in the social-media domain (5 platforms). It contains 472,097 texts, of which about 58k are human-written and approximately the same amount is generated by each of 7 multilingual large language models using 3 iterations of paraphrasing. The dataset has been anonymized to minimize the amount of sensitive data by hiding email addresses, usernames, and phone numbers.
If you use this dataset in any publication, project, tool or in any other form, please, cite the paper.
Due to the data sources (described below), the dataset may contain harmful content, disinformation, or offensive content. Based on a multilingual toxicity detector, about 8% of the text samples are probably toxic (from 5% in WhatsApp to 10% in Twitter). Although we have used older data sources (with a lower probability of including machine-generated texts), the labeling (of human-written text) might not be 100% accurate. The anonymization procedure might not successfully hide all the sensitive/personal content; thus, use the data cautiously (if you feel affected by such content, report the issues found to dpo[at]kinit.sk). The intended use is for non-commercial research purposes only.
The human-written part consists of a pseudo-randomly selected subset of social media posts from 6 publicly available datasets:
Telegram data originated in Pushshift Telegram, containing 317M messages (Baumgartner et al., 2020). It contains messages from 27k+ channels. The collection started with a set of right-wing extremist and cryptocurrency channels (about 300 in total) and was expanded based on occurrence of forwarded messages from other channels. In the end, it thus contains a wide variety of topics and societal movements reflecting the data collection time.
Twitter data originated in CLEF2022-CheckThat! Task 1, containing 34k tweets on COVID-19 and politics (Nakov et al., 2022), combined with Sentiment140, containing 1.6M tweets on various topics (Go et al., 2009).
Gab data originated in the dataset containing 22M posts from Gab social network. The authors of the dataset (Zannettou et al., 2018) found out that “Gab is predominantly used for the dissemination and discussion of news and world events, and that it attracts alt-right users, conspiracy theorists, and other trolls.” They also found out that hate speech is much more prevalent there compared to Twitter, but lower than 4chan's Politically Incorrect board.
Discord data originated in Discord-Data, containing 51M messages. This is a long-context, anonymized, clean, multi-turn and single-turn conversational dataset based on Discord data scraped from a large variety of servers, big and small. According to the dataset authors, it contains around 0.1% of potentially toxic comments (based on the applied heuristic/classifier).
WhatsApp data originated in whatsapp-public-groups, containing 300k messages (Garimella & Tyson, 2018). The public dataset contains the anonymised data, collected for around 5 months from around 178 groups. Original messages were made available to us on request to dataset authors for research purposes.
From these datasets, we have pseudo-randomly sampled up to 1300 texts (up to 300 for test split and the remaining up to 1000 for train split if available) for each of the selected 22 languages (using a combination of automated approaches to detect the language) and platform. This process resulted in 61,592 human-written texts, which were further filtered out based on occurrence of some characters or their length, resulting in about 58k human-written texts.
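The per-language, per-platform sampling described above can be sketched as below. This is only an illustration under stated assumptions: the function name, the fixed seed, and the record keys are hypothetical, and the rule that the test split is filled first when fewer than 1300 texts are available is a guess, since the description does not say how small buckets were divided. The later filtering by characters and length is not shown.

```python
import random
from collections import defaultdict


def sample_splits(records, test_cap=300, train_cap=1000, seed=0):
    """Sample up to test_cap + train_cap texts per (language, source) bucket."""
    rng = random.Random(seed)  # fixed seed makes the pseudo-random sample reproducible
    buckets = defaultdict(list)
    for record in records:
        buckets[(record["language"], record["source"])].append(record)

    sampled = []
    for key in sorted(buckets):
        pool = buckets[key]
        picked = rng.sample(pool, min(len(pool), test_cap + train_cap))
        for index, record in enumerate(picked):
            # Assumption: the first test_cap picks go to test, the rest to train.
            split = "test" if index < test_cap else "train"
            sampled.append({**record, "split": split})
    return sampled


# Hypothetical input: 1500 English tweets and 200 Slovak Telegram messages.
records = [{"text": f"en-{i}", "language": "en", "source": "twitter"} for i in range(1500)]
records += [{"text": f"sk-{i}", "language": "sk", "source": "telegram"} for i in range(200)]
sampled = sample_splits(records)
```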
The machine-generated part contains texts generated by 7 LLMs (Aya-101, Gemini-1.0-pro, GPT-3.5-Turbo-0125, Mistral-7B-Instruct-v0.2, opt-iml-max-30b, v5-Eagle-7B-HF, vicuna-13b). All these models were self-hosted except for GPT and Gemini, where we used the publicly available APIs. We generated the texts using 3 paraphrases of the original human-written data and then preprocessed the generated texts (filtered out cases when the generation obviously failed).
The dataset has the following fields:
'text' - a text sample,
'label' - 0 for human-written text, 1 for machine-generated text,
'multi_label' - a string representing a large language model that generated the text or the string "human" representing a human-written text,
'split' - a string identifying train or test split of the dataset for the purpose of training and evaluation respectively,
'language' - the ISO 639-1 language code identifying the detected language of the given text,
'length' - word count of the given text,
'source' - a string identifying the source dataset / platform of the given text,
'potential_noise' - 0 for text without identified noise, 1 for text with potential noise.
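To show how these fields fit together, here is a minimal sketch that selects one split, optionally one language, and drops rows flagged as potentially noisy. The rows, the `select` helper, and the values `"model-a"`, `"twitter"`, and `"telegram"` are hypothetical placeholders; the exact strings used in `multi_label` and `source` are not given in the description.

```python
from collections import Counter

# Hypothetical rows that follow the field list above.
rows = [
    {"text": "example one", "label": 0, "multi_label": "human", "split": "train",
     "language": "en", "length": 2, "source": "twitter", "potential_noise": 0},
    {"text": "example two", "label": 1, "multi_label": "model-a", "split": "train",
     "language": "en", "length": 2, "source": "twitter", "potential_noise": 0},
    {"text": "example three", "label": 1, "multi_label": "model-a", "split": "test",
     "language": "sk", "length": 2, "source": "telegram", "potential_noise": 1},
]


def select(rows, split, language=None, drop_noisy=True):
    """Return rows of one split, optionally limited to a language and to noise-free texts."""
    selected = []
    for row in rows:
        if row["split"] != split:
            continue
        if language is not None and row["language"] != language:
            continue
        if drop_noisy and row["potential_noise"] == 1:
            continue
        selected.append(row)
    return selected


train_en = select(rows, "train", language="en")
# Binary class balance (0 = human-written, 1 = machine-generated) in the selection.
label_counts = Counter(row["label"] for row in train_en)
```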
ToDo Statistics (under construction)
By CrowdFlower
Welcome to the disaster tweets dataset! This collection of tweets holds a wealth of information about global disasters and their effects on people, governments, and organizations all over the world. With over 10,000 tweets collected and carefully annotated with labels of whether they reported an actual disaster or not, this dataset provides unique insight into what these events look like in terms of social media conversations.
This information is derived from a variety of key terms related to disaster events, such as “ablaze” and “pandemonium”, which were used to gather each individual tweet for analysis. The columns for each tweet include detailed metadata about the user who posted it along with variables such as keyword relevance and location. Alongside all these attributes is the core text of each individual tweet, giving you access to all sorts of stories, from natural disasters and contagious disease outbreaks to conflicts between nations, all in one place!
So whatever you're looking for - whether it's observations about first-hand accounts or conducting research on public sentiment during a major event - this dataset offers you an invaluable source full of timely information that could potentially save lives down the line. So take your journey through this data now and embark upon discovering what devastation looks like through social media!
This dataset contains tweets related to disaster events, including the keyword, location, text, tweetid and userid. It provides insights into how people interact with each other on social media during a disaster. Using this dataset you can gain valuable insight into the dynamics of online communication in disasters and provide an important point of reference for future disaster management initiatives.
- Analyzing the effectiveness of disaster relief and humanitarian aid efforts, by mapping tweets against public data of areas affected by disasters and donations made to help those affected.
- Developing advanced statistical models to predict the magnitude and impact of an oncoming natural disaster using keyword analysis in social media posts related to past disasters.
- Creating text-based classifiers to accurately detect disaster-related tweets in real time, giving emergency service providers early warning signs before a potential event occurs.
If you use this dataset in your research, please credit the original authors.
Unknown License - Please check the dataset description for more information.
File: socialmedia-disaster-tweets-DFE.csv
Column name | Description |
---|---|
_golden | A boolean value indicating whether the tweet is a golden tweet or not. (Boolean) |
_unit_state | The state of the tweet (e.g. finalized, judged, etc.). (String) |
_trusted_judgments | The number of trusted judgments for the tweet. (Integer) |
_last_judgment_at | The date and time of the last judgment for the tweet. (DateTime) |
choose_one | The label assigned to the tweet (e.g. relevant, not relevant, etc.). (String) |
choose_one_gold | The gold label assigned to the tweet (e.g. relevant, not relevant, etc.). (String) |
keyword | The keyword associated with the tweet. (String) |
location | The location associated with the tweet. (String) |
text | The text content of the tweet. (String) |
If you use this dataset in your research, please credit CrowdFlower.
https://dataverse.harvard.edu/api/datasets/:persistentId/versions/1.1/customlicense?persistentId=doi:10.7910/DVN/RLLL1V
Pinning down the role of social ties in the decision to protest has been notoriously elusive, largely due to data limitations. Social media and their global use by protesters offer an unprecedented opportunity to observe real-time social ties and online behavior, though often without an attendant measure of real-world behavior. We collect data on Twitter activity during the 2015 Charlie Hebdo protest in Paris which, unusually, record real-world protest attendance and network structure measured beyond egocentric networks. We devise a test of social theories of protest that hold that participation depends on exposure to others’ intentions and network position determines exposure. Our findings are strongly consistent with these theories, showing that protesters are significantly more connected to one another via direct, indirect, triadic, and reciprocated ties than comparable non-protesters. These results offer the first large-scale empirical support for the claim that social network structure has consequences for protest participation. The data were collected by the NYU Social Media and Political Participation (SMaPP) laboratory (https://wp.nyu.edu/smapp/), of which Nagler and Tucker are co-Directors along with Richard Bonneau and John T. Jost. The SMaPP lab is supported by the INSPIRE program of the National Science Foundation (Award SES-1248077), the New York University Global Institute for Advanced Study, the Moore-Sloan Data Science Environment, and Dean Thomas Carew’s Research Investment Fund at New York University. In order to run the replication end-to-end, we recommend downloading the comprehensive archive (charlie-hebdo-replication.tar.gz). The archive contains all the files with the appropriate directory structure. Once the archive is expanded, the full replication pipeline may be executed by running the script run-all.sh in the scripts directory.
http://rightsstatements.org/vocab/InC/1.0/
This dataset comprises a set of information cascades generated by Singapore Twitter users. Here a cascade is defined as a set of tweets about the same topic. This dataset was collected via the Twitter REST and streaming APIs in the following way. Starting from popular seed users (i.e., users having many followers), we crawled their follow, retweet, and user mention links. We then added those followers/followees, retweet sources, and mentioned users who state Singapore in their profile location. With this, we have a total of 184,794 Twitter user accounts. Tweets were then crawled from these users from 1 April to 31 August 2012. In all, we got 32,479,134 tweets. To identify cascades, we extracted all the URL links and hashtags from the above tweets. These URL links and hashtags are considered the identities of cascades. In other words, all the tweets which contain the same URL link (or the same hashtag) represent a cascade. Mathematically, a cascade is represented as a set of user-timestamp pairs. Figure 1 provides an example, i.e. cascade C = {< u1, t1 >, < u2, t2 >, < u1, t3 >, < u3, t4 >, < u4, t5 >}. For evaluation, the dataset was split into two parts: four months of data for training and the last month of data for testing. Table 1 summarizes the basic (count) statistics of the dataset. Each line in each file represents a cascade. The first term in each line is a hashtag or URL, the second term is a list of user-timestamp pairs. Due to privacy concerns, all user identities are anonymized.
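The cascade definition above can be illustrated with a short sketch that groups tweets by the URLs and hashtags they contain and stores each cascade as a time-ordered list of user-timestamp pairs. The regular expressions, the tuple layout of the input, and the choice to treat hashtags as case-sensitive are assumptions for illustration; the original extraction rules are not described in that detail.

```python
import re
from collections import defaultdict

URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#\w+")


def build_cascades(tweets):
    """Map each URL or hashtag to a time-ordered list of (user, timestamp) pairs."""
    cascades = defaultdict(list)
    for user, timestamp, text in tweets:
        urls = URL_RE.findall(text)
        # Remove URLs first so that URL fragments are not counted as hashtags.
        remainder = URL_RE.sub(" ", text)
        hashtags = HASHTAG_RE.findall(remainder)
        for key in set(urls) | set(hashtags):
            cascades[key].append((user, timestamp))
    for pairs in cascades.values():
        pairs.sort(key=lambda pair: pair[1])
    return dict(cascades)


# Hypothetical tweets as (anonymized user, timestamp, text) tuples.
tweets = [
    ("u1", 1, "flooding downtown #sgflood"),
    ("u2", 2, "#sgflood photos http://example.com/a"),
    ("u1", 3, "still raining #sgflood"),
    ("u3", 4, "see http://example.com/a"),
]
cascades = build_cascades(tweets)
```

As in the example cascade C, the same user can appear more than once in a cascade (here `u1` at timestamps 1 and 3).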
The Reddit Subreddit Dataset by Dataplex offers a comprehensive and detailed view of Reddit’s vast ecosystem, now enhanced with appended AI-generated columns that provide additional insights and categorization. This dataset includes data from over 2.1 million subreddits, making it an invaluable resource for a wide range of analytical applications, from social media analysis to market research.
Dataset Overview:
This dataset includes detailed information on subreddit activities, user interactions, post frequency, comment data, and more. The inclusion of AI-generated columns adds an extra layer of analysis, offering sentiment analysis, topic categorization, and predictive insights that help users better understand the dynamics of each subreddit.
2.1 Million Subreddits with Enhanced AI Insights:
The dataset covers over 2.1 million subreddits and now includes AI-enhanced columns that provide:
- Sentiment Analysis: AI-driven sentiment scores for posts and comments, allowing users to gauge community mood and reactions.
- Topic Categorization: Automated categorization of subreddit content into relevant topics, making it easier to filter and analyze specific types of discussions.
- Predictive Insights: AI models that predict trends, content virality, and user engagement, helping users anticipate future developments within subreddits.
Sourced Directly from Reddit:
All social media data in this dataset is sourced directly from Reddit, ensuring accuracy and authenticity. The dataset is updated regularly, reflecting the latest trends and user interactions on the platform. This ensures that users have access to the most current and relevant data for their analyses.
Key Features:
Use Cases:
Data Quality and Reliability:
The Reddit Subreddit Dataset emphasizes data quality and reliability. Each record is carefully compiled from Reddit’s vast database, ensuring that the information is both accurate and up-to-date. The AI-generated columns further enhance the dataset's value, providing automated insights that help users quickly identify key trends and sentiments.
Integration and Usability:
The dataset is provided in a format that is compatible with most data analysis tools and platforms, making it easy to integrate into existing workflows. Users can quickly import, analyze, and utilize the data for various applications, from market research to academic studies.
User-Friendly Structure and Metadata:
The data is organized for easy navigation and analysis, with metadata files included to help users identify relevant subreddits and data points. The AI-enhanced columns are clearly labeled and structured, allowing users to efficiently incorporate these insights into their analyses.
Ideal For:
This dataset is an essential resource for anyone looking to understand the intricacies of Reddit's vast ecosystem, offering the data and AI-enhanced insights needed to drive informed decisions and strategies across various fields. Whether you’re tracking emerging trends, analyzing user behavior, or conduc...
ODC Public Domain Dedication and Licence (PDDL) v1.0 http://www.opendatacommons.org/licenses/pddl/1.0/
License information was derived automatically
The dataset encompasses demographic, health, and mental health information of students from 48 different states in the USA, born between 1971 and 2003. It includes data on general health ratings, responses to the PHQ-9 depression screening tool, and the GAD-7 anxiety assessment tool. It details how often students experienced various mental health symptoms over the past two weeks, their depression severity scores, and anxiety severity scores. Also, it covers experiences of feeling overwhelmed, exhausted, and hopeless within the last 12 months, along with diagnoses of depression, therapy, and medication usage. The dataset also includes information on various medical conditions, student status (full-time or international), sex, and race.
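For readers unfamiliar with the two screening tools, the sketch below shows how PHQ-9 and GAD-7 severity scores are conventionally derived from item responses. It assumes the standard scoring, in which each item is answered on a 0 to 3 frequency scale and the items are summed, with the commonly published severity bands. Whether this dataset computed its severity columns in exactly this way is an assumption, and the function names and sample responses are hypothetical.

```python
# Standard published severity bands as (inclusive upper bound, label).
# Assumption: the dataset's severity scores follow this convention.
PHQ9_BANDS = [(4, "minimal"), (9, "mild"), (14, "moderate"),
              (19, "moderately severe"), (27, "severe")]
GAD7_BANDS = [(4, "minimal"), (9, "mild"), (14, "moderate"), (21, "severe")]


def total_score(responses, n_items):
    """Sum item responses, each coded 0 (not at all) to 3 (nearly every day)."""
    if len(responses) != n_items:
        raise ValueError(f"expected {n_items} responses, got {len(responses)}")
    if any(r not in (0, 1, 2, 3) for r in responses):
        raise ValueError("each response must be 0, 1, 2, or 3")
    return sum(responses)


def severity(score, bands):
    """Return the label of the first band whose upper bound covers the score."""
    for upper, label in bands:
        if score <= upper:
            return label
    raise ValueError("score exceeds the maximum for this instrument")


# Hypothetical responses for one student.
phq9_total = total_score([1, 1, 2, 0, 1, 3, 0, 2, 1], n_items=9)
gad7_total = total_score([3, 3, 2, 2, 3, 2, 1], n_items=7)
phq9_label = severity(phq9_total, PHQ9_BANDS)
gad7_label = severity(gad7_total, GAD7_BANDS)
```

The PHQ-9 total ranges from 0 to 27 and the GAD-7 total from 0 to 21, which is why the two instruments use different band tables.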