71 datasets found

Reddit user worldwide 2024, by country
statista.com
Updated Jul 10, 2025
Cite
Statista (2025). Reddit user worldwide 2024, by country [Dataset]. https://www.statista.com/forecasts/1174696/reddit-user-by-country
Dataset updated
Jul 10, 2025
Dataset authored and provided by
Statista (http://statista.com/)
Time period covered
Jan 1, 2024 - Dec 31, 2024
Area covered
Albania
Description
Comparing the *** selected regions by number of Reddit users, the United States leads the ranking (****** million users), followed by the United Kingdom with ***** million users. At the other end of the spectrum is Gabon with **** million users, a difference of ****** million users from the United States. User figures for Reddit have been estimated from company filings or press material, secondary research, app downloads, and traffic data. They refer to the average monthly active users over the period and count a person with multiple accounts only once. Reddit users encompass both users that are logged in and those that are not. The data shown are an excerpt of Statista's Key Market Indicators (KMI). The KMI are a collection of primary and secondary indicators on the macro-economic, demographic and technological environment in up to *** countries and regions worldwide. All indicators are sourced from international and national statistical offices, trade associations and the trade press, and they are processed to generate comparable data sets (see supplementary notes under details for more information).
Reddit Datasets
brightdata.com
.json, .csv, .xlsx
Updated Jan 11, 2023
Cite
Bright Data (2023). Reddit Datasets [Dataset]. https://brightdata.com/products/datasets/reddit
Available download formats: .json, .csv, .xlsx
Dataset updated
Jan 11, 2023
Dataset authored and provided by
Bright Data (https://brightdata.com/)
License
https://brightdata.com/license
Area covered
Worldwide
Description
Access our extensive Reddit datasets that provide detailed information on posts, communities (subreddits), and user engagement. Gain insights into post performance, user comments, community statistics, and content trends with our ethically sourced data. Free samples are available for evaluation.

3M+ records available
Price starts at $250/100K records
Data formats are available in JSON, NDJSON, CSV, XLSX and Parquet
100% ethical and compliant data collection

Included datapoints:

Post ID, Title & URL
Post Description & Date
Username of Poster
Upvotes & Comment Count
Community Name, URL & Description
Community Member Count
Attached Photos & Videos
Full Post Comments
Related Posts
Post Karma
Post Tags
And more
Reddit users in the United States 2019-2028
statista.com
Updated Jul 30, 2025
Cite
Statista Research Department (2025). Reddit users in the United States 2019-2028 [Dataset]. https://www.statista.com/topics/3196/social-media-usage-in-the-united-states/
Dataset updated
Jul 30, 2025
Dataset provided by
Statista (http://statista.com/)
Authors
Statista Research Department
Area covered
United States
Description
The number of Reddit users in the United States was forecast to increase continuously between 2024 and 2028 by a total of 10.3 million users (+5.21 percent). After the ninth consecutive year of growth, the Reddit user base is estimated to reach 208.12 million users in 2028, a new peak. Notably, the number of Reddit users has increased continuously over the past years. User figures for Reddit have been estimated from company filings or press material, secondary research, app downloads, and traffic data. They refer to the average monthly active users over the period and count a person with multiple accounts only once. Reddit users encompass both users that are logged in and those that are not. The data shown are an excerpt of Statista's Key Market Indicators (KMI). The KMI are a collection of primary and secondary indicators on the macro-economic, demographic and technological environment in up to 150 countries and regions worldwide. All indicators are sourced from international and national statistical offices, trade associations and the trade press, and they are processed to generate comparable data sets (see supplementary notes under details for more information). Find more key insights for the number of Reddit users in countries like Mexico and Canada.
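The growth figures quoted in this description can be cross-checked with a few lines of arithmetic. The sketch below assumes that the 10.3 million increase and the 208.12 million peak both refer to the 2024 to 2028 forecast window, so the implied 2024 baseline is simply their difference; the variable names are illustrative only.

```python
# Figures taken from the description above (in millions of users).
peak_2028 = 208.12
increase_2024_to_2028 = 10.3

# Implied 2024 baseline, assuming the increase is measured from 2024.
baseline_2024 = peak_2028 - increase_2024_to_2028

# Relative growth over the forecast window, in percent.
pct_increase = increase_2024_to_2028 / baseline_2024 * 100

print(f"Implied 2024 baseline: {baseline_2024:.2f} million")
print(f"Relative increase: {pct_increase:.2f} percent")
```

Under that assumption the relative increase rounds to the quoted 5.21 percent, so the three published numbers are mutually consistent.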
Dataplex: Reddit Data | Global Social Media Data | 2.1M+ subreddits: trends,...
datarade.ai
.json, .csv
Updated Aug 12, 2024
+ more versions
Cite
Dataplex (2024). Dataplex: Reddit Data | Global Social Media Data | 2.1M+ subreddits: trends, audience insights + more | Ideal for Interest-Based Segmentation [Dataset]. https://datarade.ai/data-products/dataplex-reddit-data-global-social-media-data-1-1m-mill-dataplex
Available download formats: .json, .csv
Dataset updated
Aug 12, 2024
Dataset authored and provided by
Dataplex
Area covered
Gambia, Chile, Martinique, Christmas Island, Côte d'Ivoire, Holy See, Macao, Botswana, Jersey, Mexico
Description
The Reddit Subreddit Dataset by Dataplex offers a comprehensive and detailed view of Reddit’s vast ecosystem, now enhanced with appended AI-generated columns that provide additional insights and categorization. This dataset includes data from over 2.1 million subreddits, making it an invaluable resource for a wide range of analytical applications, from social media analysis to market research.

Dataset Overview:

This dataset includes detailed information on subreddit activities, user interactions, post frequency, comment data, and more. The inclusion of AI-generated columns adds an extra layer of analysis, offering sentiment analysis, topic categorization, and predictive insights that help users better understand the dynamics of each subreddit.

2.1 Million Subreddits with Enhanced AI Insights: The dataset covers over 2.1 million subreddits and now includes AI-enhanced columns that provide:
- Sentiment Analysis: AI-driven sentiment scores for posts and comments, allowing users to gauge community mood and reactions.
- Topic Categorization: Automated categorization of subreddit content into relevant topics, making it easier to filter and analyze specific types of discussions.
- Predictive Insights: AI models that predict trends, content virality, and user engagement, helping users anticipate future developments within subreddits.

Sourced Directly from Reddit:

All social media data in this dataset is sourced directly from Reddit, ensuring accuracy and authenticity. The dataset is updated regularly, reflecting the latest trends and user interactions on the platform. This ensures that users have access to the most current and relevant data for their analyses.

Key Features:

Subreddit Metrics: Detailed data on subreddit activity, including the number of posts, comments, votes, and user participation.

User Engagement: Insights into how users interact with content, including comment threads, upvotes/downvotes, and participation rates.

Trending Topics: Track emerging trends and viral content across the platform, helping you stay ahead of the curve in understanding social media dynamics.

AI-Enhanced Analysis: Utilize AI-generated columns for sentiment analysis, topic categorization, and predictive insights, providing a deeper understanding of the data.

Use Cases:

Social Media Analysis: Researchers and analysts can use this dataset to study online behavior, track the spread of information, and understand how content resonates with different audiences.

Market Research: Marketers can leverage the dataset to identify target audiences, understand consumer preferences, and tailor campaigns to specific communities.

Content Strategy: Content creators and strategists can use insights from the dataset to craft content that aligns with trending topics and user interests, maximizing engagement.

Academic Research: Academics can explore the dynamics of online communities, studying everything from the spread of misinformation to the formation of online subcultures.

Data Quality and Reliability:

The Reddit Subreddit Dataset emphasizes data quality and reliability. Each record is carefully compiled from Reddit’s vast database, ensuring that the information is both accurate and up-to-date. The AI-generated columns further enhance the dataset's value, providing automated insights that help users quickly identify key trends and sentiments.

Integration and Usability:

The dataset is provided in a format that is compatible with most data analysis tools and platforms, making it easy to integrate into existing workflows. Users can quickly import, analyze, and utilize the data for various applications, from market research to academic studies.

User-Friendly Structure and Metadata:

The data is organized for easy navigation and analysis, with metadata files included to help users identify relevant subreddits and data points. The AI-enhanced columns are clearly labeled and structured, allowing users to efficiently incorporate these insights into their analyses.

Ideal For:

Data Analysts: Conduct in-depth analyses of subreddit trends, user engagement, and content virality. The dataset’s extensive coverage and AI-enhanced insights make it an invaluable tool for data-driven research.

Marketers: Use the dataset to better understand your target audience, tailor campaigns to specific interests, and track the effectiveness of marketing efforts across Reddit.

Researchers: Explore the social dynamics of online communities, analyze the spread of ideas and information, and study the impact of digital media on public discourse, all while leveraging AI-generated insights.

This dataset is an essential resource for anyone looking to understand the intricacies of Reddit's vast ecosystem, offering the data and AI-enhanced insights needed to drive informed decisions and strategies across various fields. Whether you’re tracking emerging trends, analyzing user behavior, or conduc...

Data from: WikiReddit: Tracing Information and Attention Flows Between...

zenodo.org

bin

Updated May 4, 2025

Cite

Patrick Gildersleve; Anna Beers; Viviane Ito; Agustin Orozco; Francesca Tripodi (2025). WikiReddit: Tracing Information and Attention Flows Between Online Platforms [Dataset]. http://doi.org/10.5281/zenodo.14653265

Available download formats: bin

Unique identifier

https://doi.org/10.5281/zenodo.14653265

Dataset updated

May 4, 2025

Dataset provided by

Zenodo (http://zenodo.org/)

Authors

Patrick Gildersleve; Anna Beers; Viviane Ito; Agustin Orozco; Francesca Tripodi

License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Time period covered

Jan 15, 2025

Description

Preprint

Gildersleve, P., Beers, A., Ito, V., Orozco, A., & Tripodi, F. (2025). WikiReddit: Tracing Information and Attention Flows Between Online Platforms. arXiv [Cs.CY]. https://doi.org/10.48550/arXiv.2502.04942

Accepted at the International AAAI Conference on Web and Social Media (ICWSM) 2025

Abstract

The World Wide Web is a complex interconnected digital ecosystem, where information and attention flow between platforms and communities throughout the globe. These interactions co-construct how we understand the world, reflecting and shaping public discourse. Unfortunately, researchers often struggle to understand how information circulates and evolves across the web because platform-specific data is often siloed and restricted by linguistic barriers. To address this gap, we present a comprehensive, multilingual dataset capturing all Wikipedia links shared in posts and comments on Reddit from 2020 to 2023, excluding those from private and NSFW subreddits. Each linked Wikipedia article is enriched with revision history, page view data, article ID, redirects, and Wikidata identifiers. Through a research agreement with Reddit, our dataset ensures user privacy while providing a query and ID mechanism that integrates with the Reddit and Wikipedia APIs. This enables extended analyses for researchers studying how information flows across platforms. For example, Reddit discussions use Wikipedia for deliberation and fact-checking which subsequently influences Wikipedia content, by driving traffic to articles or inspiring edits. By analyzing the relationship between information shared and discussed on these platforms, our dataset provides a foundation for examining the interplay between social media discourse and collaborative knowledge consumption and production.

Datasheet

Motivation

The motivations for this dataset stem from the challenges researchers face in studying the flow of information across the web. While the World Wide Web enables global communication and collaboration, data silos, linguistic barriers, and platform-specific restrictions hinder our ability to understand how information circulates, evolves, and impacts public discourse. Wikipedia and Reddit, as major hubs of knowledge sharing and discussion, offer an invaluable lens into these processes. However, without comprehensive data capturing their interactions, researchers are unable to fully examine how platforms co-construct knowledge. This dataset bridges this gap, providing the tools needed to study the interconnectedness of social media and collaborative knowledge systems.

Composition

WikiReddit is a comprehensive dataset capturing all Wikipedia mentions (including links) shared in posts and comments on Reddit from 2020 to 2023, excluding those from private and NSFW (not safe for work) subreddits. The SQL database comprises 336K total posts, 10.2M comments, 1.95M unique links, and 1.26M unique articles spanning 59 languages on Reddit and 276 Wikipedia language subdomains. Each linked Wikipedia article is enriched with its revision history and page view data within a ±10-day window of its posting, as well as article ID, redirects, and Wikidata identifiers. Supplementary anonymous metadata from Reddit posts and comments further contextualizes the links, offering a robust resource for analysing cross-platform information flows, collective attention dynamics, and the role of Wikipedia in online discourse.

Collection Process

Data was collected from the Reddit4Researchers and Wikipedia APIs. No personally identifiable information is published in the dataset. Data from Reddit to Wikipedia is linked via the hyperlink and article titles appearing in Reddit posts.

Preprocessing/cleaning/labeling

Extensive processing with tools such as regex was applied to the Reddit post/comment text to extract the Wikipedia URLs. Redirects for Wikipedia URLs and article titles were found through the API and mapped to the collected data. Reddit IDs are hashed with SHA-256 for post/comment/user/subreddit anonymity.
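As an illustration of the anonymization step described above, the sketch below hashes an identifier with SHA-256 using Python's standard `hashlib`. It is a minimal sketch only: the function name `hash_reddit_id` and the example ID are hypothetical, and the datasheet does not say how the authors encoded the IDs or whether a salt was applied, so digests produced here should not be expected to match those in the dataset.

```python
import hashlib


def hash_reddit_id(raw_id: str) -> str:
    """Return the SHA-256 hex digest of an identifier string.

    Assumption: the ID is UTF-8 encoded and hashed without a salt.
    The dataset authors may have used a different scheme.
    """
    return hashlib.sha256(raw_id.encode("utf-8")).hexdigest()


# Hypothetical identifier, used purely for illustration.
example_id = "t3_abc123"
digest = hash_reddit_id(example_id)

# A SHA-256 hex digest is always 64 hexadecimal characters long.
print(len(digest), digest[:12])
```

Because the hash is deterministic, the same original ID always maps to the same digest, which is what allows hashed posts, comments, users, and subreddits to be linked across tables without exposing the original IDs.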

Uses

We foresee several applications of this dataset and preview four here. First, Reddit linking data can be used to understand how attention is driven from one platform to another. Second, Reddit linking data can shed light on how Wikipedia's archive of knowledge is used in the larger social web. Third, our dataset could provide insights into how external attention is topically distributed across Wikipedia. Our dataset can help extend that analysis into the disparities in what types of external communities Wikipedia is used in, and how it is used. Fourth, relatedly, a topic analysis of our dataset could reveal how Wikipedia usage on Reddit contributes to societal benefits and harms. Our dataset could help examine if homogeneity within the Reddit and Wikipedia audiences shapes topic patterns and assess whether these relationships mitigate or amplify problematic engagement online.

Distribution

The dataset is publicly shared with a Creative Commons Attribution 4.0 International license. The article describing this dataset should be cited: https://doi.org/10.48550/arXiv.2502.04942

Maintenance

Patrick Gildersleve will maintain this dataset, and add further years of content as and when available.

SQL Database Schema

Table: `posts`

Column Name	Type	Description
`subreddit_id`	TEXT	The unique identifier for the subreddit.
`crosspost_parent_id`	TEXT	The ID of the original Reddit post if this post is a crosspost.
`post_id`	TEXT	Unique identifier for the Reddit post.
`created_at`	TIMESTAMP	The timestamp when the post was created.
`updated_at`	TIMESTAMP	The timestamp when the post was last updated.
`language_code`	TEXT	The language code of the post.
`score`	INTEGER	The score (upvotes minus downvotes) of the post.
`upvote_ratio`	REAL	The ratio of upvotes to total votes.
`gildings`	INTEGER	Number of awards (gildings) received by the post.
`num_comments`	INTEGER	Number of comments on the post.

Table: `comments`

Column Name	Type	Description
`subreddit_id`	TEXT	The unique identifier for the subreddit.
`post_id`	TEXT	The ID of the Reddit post the comment belongs to.
`parent_id`	TEXT	The ID of the parent comment (if a reply).
`comment_id`	TEXT	Unique identifier for the comment.
`created_at`	TIMESTAMP	The timestamp when the comment was created.
`last_modified_at`	TIMESTAMP	The timestamp when the comment was last modified.
`score`	INTEGER	The score (upvotes minus downvotes) of the comment.
`upvote_ratio`	REAL	The ratio of upvotes to total votes for the comment.
`gilded`	INTEGER	Number of awards (gildings) received by the comment.

Table: `postlinks`

Column Name	Type	Description
`post_id`	TEXT	Unique identifier for the Reddit post.
`end_processed_valid`	INTEGER	Whether the extracted URL from the post resolves to a valid URL.
`end_processed_url`	TEXT	The extracted URL from the Reddit post.
`final_valid`	INTEGER	Whether the final URL from the post resolves to a valid URL after redirections.
`final_status`	INTEGER	HTTP status code of the final URL.
`final_url`	TEXT	The final URL after redirections.
`redirected`	INTEGER	Indicator of whether the posted URL was redirected (1) or not (0).
`in_title`	INTEGER	Indicator of whether the link appears in the post title (1) or post body (0).
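The `posts` and `postlinks` tables share the `post_id` column, so they can be joined to relate a post's metadata to the links it contains. The sketch below is an assumption-laden illustration: it builds a tiny in-memory SQLite database with only a subset of the documented columns and two made-up rows, since the engine and file layout of the published database are not described here.

```python
import sqlite3


def valid_links_by_score(conn: sqlite3.Connection):
    """Return (post_id, score, final_url) for posts whose final URL is valid."""
    query = """
        SELECT p.post_id, p.score, l.final_url
        FROM posts AS p
        JOIN postlinks AS l ON l.post_id = p.post_id
        WHERE l.final_valid = 1
        ORDER BY p.score DESC
    """
    return conn.execute(query).fetchall()


# Toy database with a subset of the documented columns (rows are invented).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE posts (post_id TEXT, language_code TEXT, score INTEGER, num_comments INTEGER);
    CREATE TABLE postlinks (post_id TEXT, final_url TEXT, final_valid INTEGER, in_title INTEGER);
    """
)
conn.executemany(
    "INSERT INTO posts VALUES (?, ?, ?, ?)",
    [("p1", "en", 10, 3), ("p2", "de", 5, 0)],
)
conn.executemany(
    "INSERT INTO postlinks VALUES (?, ?, ?, ?)",
    [
        ("p1", "https://en.wikipedia.org/wiki/Reddit", 1, 0),
        ("p2", "https://de.wikipedia.org/wiki/Wikipedia", 0, 1),
    ],
)

rows = valid_links_by_score(conn)
print(rows)
```

The same pattern should carry over to `comments` and `commentlinks`, which share `comment_id`.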

Table: `commentlinks`

Column Name	Type	Description
`comment_id`	TEXT	Unique identifier for the Reddit comment.
`end_processed_valid`	INTEGER	Whether the extracted URL from the comment resolves to a valid URL.
`end_processed_url`	TEXT	The extracted URL from the comment.
`final_valid`	INTEGER	Whether the final URL from the comment resolves to a valid URL after redirections.
`final_status`	INTEGER	HTTP status code of the final

Reddit r/AskScience Flair Dataset
data.mendeley.com
Updated May 23, 2022
+ more versions
Cite
Sumit Mishra (2022). Reddit r/AskScience Flair Dataset [Dataset]. http://doi.org/10.17632/k9r2d9z999.3
Unique identifier
https://doi.org/10.17632/k9r2d9z999.3
Dataset updated
May 23, 2022
Authors
Sumit Mishra
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Reddit is a social news, content rating and discussion website, and one of the most popular sites on the internet. Reddit has 52 million daily active users and approximately 430 million users who use it once a month. Reddit is organized into subreddits; this dataset uses the r/AskScience subreddit.

The dataset is extracted from the subreddit /r/AskScience on Reddit. The data was collected between 01-01-2016 and 20-05-2022. It contains 612,668 datapoints and 25 columns. It includes information about the questions asked on the subreddit, the description of the submission, the flair of the question, NSFW or SFW status, the year of the submission, and more (see the descriptions of individual columns below). The data was extracted using Python and Pushshift's API, with some light cleaning done using NumPy and pandas.

The dataset contains the following columns and descriptions:

author - Redditor Name
author_fullname - Redditor Full name
contest_mode - Contest mode [implement obscured scores and randomized sorting].
created_utc - Time the submission was created, represented in Unix Time.
domain - Domain of submission.
edited - If the post is edited or not.
full_link - Link of the post on the subreddit.
id - ID of the submission.
is_self - Whether or not the submission is a self post (text-only).
link_flair_css_class - CSS Class used to identify the flair.
link_flair_text - Flair on the post or The link flair’s text content.
locked - Whether or not the submission has been locked.
num_comments - The number of comments on the submission.
over_18 - Whether or not the submission has been marked as NSFW.
permalink - A permalink for the submission.
retrieved_on - time ingested.
score - The number of upvotes for the submission.
description - Description of the Submission.
spoiler - Whether or not the submission has been marked as a spoiler.
stickied - Whether or not the submission is stickied.
thumbnail - Thumbnail of Submission.
question - Question Asked in the Submission.
url - The URL the submission links to, or the permalink if a self post.
year - Year of the Submission.
banned - Banned by the moderator or not.
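Since `created_utc` is stored as Unix time, it usually needs converting before any date-based analysis. The sketch below shows one way to do that with the standard library, assuming the value is in whole seconds since the epoch; the helper names are hypothetical, and the dataset's own `year` column may have been derived differently.

```python
from datetime import datetime, timezone


def to_utc_datetime(created_utc: int) -> datetime:
    """Convert a Unix timestamp in seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(created_utc, tz=timezone.utc)


def submission_year(created_utc: int) -> int:
    """Return the calendar year (UTC) of a submission timestamp."""
    return to_utc_datetime(created_utc).year


# 1451606400 is 2016-01-01 00:00:00 UTC, the start of the collection period.
print(to_utc_datetime(1451606400).isoformat())
print(submission_year(1451606400))
```

Using an explicit UTC timezone avoids results that shift with the local timezone of the machine running the analysis.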

This dataset can be used for Flair Prediction, NSFW Classification, and different Text Mining/NLP tasks. Exploratory Data Analysis can also be done to get the insights and see the trend and patterns over the years.
Reddit Mental Health Dataset
zenodo.org
csv
Updated Oct 16, 2020
Cite
Daniel M. Low; Laurie Rumker; Tanya Talker; John Torous; Guillermo Cecchi; Satrajit S. Ghosh (2020). Reddit Mental Health Dataset [Dataset]. http://doi.org/10.17605/osf.io/7peyq
Available download formats: csv
Unique identifier
https://doi.org/10.17605/osf.io/7peyq
Dataset updated
Oct 16, 2020
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Daniel M. Low; Laurie Rumker; Tanya Talker; John Torous; Guillermo Cecchi; Satrajit S. Ghosh
License
ODC Public Domain Dedication and Licence (PDDL) v1.0: http://www.opendatacommons.org/licenses/pddl/1.0/
License information was derived automatically
Description
This dataset contains posts from 28 subreddits (15 mental health support groups) from 2018-2020. We used this dataset to understand the impact of COVID-19 on mental health support groups from January to April, 2020 and included older timeframes to obtain baseline posts before COVID-19.

Please cite if you use this dataset:

Low, D. M., Rumker, L., Torous, J., Cecchi, G., Ghosh, S. S., & Talkar, T. (2020). Natural Language Processing Reveals Vulnerable Mental Health Support Groups and Heightened Health Anxiety on Reddit During COVID-19: Observational Study. Journal of medical Internet research, 22(10), e22635.

@article{low2020natural,
  title={Natural Language Processing Reveals Vulnerable Mental Health Support Groups and Heightened Health Anxiety on Reddit During COVID-19: Observational Study},
  author={Low, Daniel M and Rumker, Laurie and Torous, John and Cecchi, Guillermo and Ghosh, Satrajit S and Talkar, Tanya},
  journal={Journal of medical Internet research},
  volume={22},
  number={10},
  pages={e22635},
  year={2020},
  publisher={JMIR Publications Inc., Toronto, Canada}
}

License

This dataset is made available under the Public Domain Dedication and License v1.0 whose full text can be found at: http://www.opendatacommons.org/licenses/pddl/1.0/

It was downloaded using pushshift API. Re-use of this data is subject to Reddit API terms.

Reddit Mental Health Dataset

Contains posts and text features for the following timeframes from 28 mental health and non-mental health subreddits:

15 specific mental health support groups (r/EDAnonymous, r/addiction, r/alcoholism, r/adhd, r/anxiety, r/autism, r/bipolarreddit, r/bpd, r/depression, r/healthanxiety, r/lonely, r/ptsd, r/schizophrenia, r/socialanxiety, and r/suicidewatch)

2 broad mental health subreddits (r/mentalhealth, r/COVID19_support)

11 non-mental health subreddits (r/conspiracy, r/divorce, r/fitness, r/guns, r/jokes, r/legaladvice, r/meditation, r/parenting, r/personalfinance, r/relationships, r/teaching).

filenames and corresponding timeframes:

post: Jan 1 to April 20, 2020 (called "mid-pandemic" in manuscript; r/COVID19_support appears). Unique users: 320,364.

pre: Dec 2018 to Dec 2019. A full year which provides more data for a baseline of Reddit posts. Unique users: 327,289.

2019: Jan 1 to April 20, 2019 (r/EDAnonymous appears). A control for seasonal fluctuations to match post data. Unique users: 282,560.

2018: Jan 1 to April 20, 2018. A control for seasonal fluctuations to match post data. Unique users: 177,089

Unique users across all time windows (pre and 2019 overlap): 826,961.

See manuscript Supplementary Materials (https://doi.org/10.31234/osf.io/xvwcy) for more information.

Note: if subsampling (e.g., to balance subreddits), we recommend bootstrapping analyses for unbiased results.
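The bootstrapping recommendation above can be sketched as a percentile bootstrap for the mean of some per-post feature. This is a generic illustration using only the standard library, not the authors' procedure: the function name, the default of 1,000 resamples, and the 95% interval are all assumptions chosen for the example.

```python
import random
import statistics


def bootstrap_mean_ci(values, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `values`.

    Each resample draws len(values) items with replacement; the interval is
    read off the sorted resample means.
    """
    if not values:
        raise ValueError("values must be non-empty")
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    n = len(values)
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical feature values for a subsample of posts.
sample = [3.0, 5.0, 4.0, 6.0, 2.0, 7.0, 5.0, 4.0]
low, high = bootstrap_mean_ci(sample)
print(low, high)
```

Repeating the subsampling step itself inside the resampling loop would capture the extra variability that balancing introduces, at the cost of more computation.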
Subreddits
kaggle.com
zip
Updated May 12, 2018
Cite
Severan (2018). Subreddits [Dataset]. https://www.kaggle.com/rayraegah/subreddits
Available download formats: zip (0 bytes)
Dataset updated
May 12, 2018
Authors
Severan
License
https://www.reddit.com/wiki/api
Description
Context

Analyse the popularity of public subreddits

Content

The CSV contains a long list of every subreddit on Reddit. There are a total of 1067472 subreddits and the columns in the dataset are:

base10 id,

base36 reddit id,

creation epoch,

subreddit name,

number of subscribers

Acknowledgements

This dataset was originally published on /r/datasets by /u/Stuck_In_the_Matrix

Inspiration

What's on Reddit?

Find your subreddit
Dataset — Make Reddit Great Again: Assessing Community Effects of Moderation...
data.niaid.nih.gov
Updated Jan 10, 2023
Cite
Trujillo, Amaury (2023). Dataset — Make Reddit Great Again: Assessing Community Effects of Moderation Interventions on r/The_Donald [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6250576
Dataset updated
Jan 10, 2023
Dataset provided by
Cresci, Stefano
Trujillo, Amaury
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Reddit contents and complementary data regarding the r/The_Donald community and its main moderation interventions, used for the corresponding article indicated in the title.

An accompanying R notebook can be found in: https://github.com/amauryt/make_reddit_great_again

If you use this dataset please cite the related article.

The dataset timeframe of the Reddit contents (submissions and comments) spans from 30 weeks before Quarantine (2018-11-28) to 30 weeks after Restriction (2020-09-23). The original Reddit content was collected from the Pushshift monthly data files, transformed, and loaded into two SQLite databases.

The first database, the_donald.sqlite, contains all the available content from r/The_Donald created during the dataset timeframe, with the last content being posted several weeks before the timeframe upper limit. It has only two tables: submissions and comments. Note that content IDs are stored in base 10 (numeric integers), unlike the original base 36 (alphanumeric) IDs used on Reddit and Pushshift. This is for efficient storage and processing. If necessary, many programming languages or libraries can easily convert IDs from one base to another.
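Converting between the base-10 IDs stored in the database and the base-36 IDs used on Reddit can be sketched as follows. Python's built-in `int(text, 36)` handles the decoding direction; the encoder `to_base36` is a hypothetical helper written for this example, and it assumes lowercase digits `0-9a-z` with no type prefix such as `t3_`.

```python
import string

# Digits used for base 36: 0-9 followed by a-z (lowercase is an assumption).
ALPHABET = string.digits + string.ascii_lowercase


def to_base36(number: int) -> str:
    """Encode a non-negative integer as a base-36 string."""
    if number < 0:
        raise ValueError("number must be non-negative")
    if number == 0:
        return "0"
    digits = []
    while number:
        number, remainder = divmod(number, 36)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))


def from_base36(text: str) -> int:
    """Decode a base-36 string into an integer."""
    return int(text, 36)


print(to_base36(1295))    # "zz"
print(from_base36("zz"))  # 1295
```

A round trip through both functions returns the original value, which is a quick sanity check before joining against data that still uses base-36 IDs.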

The second database, core_the_donald.sqlite, contains all the available content from core users of r/The_Donald made platform-wide (i.e., both inside and outside the subreddit) during the dataset timeframe. Core users are defined as those who authored either a submission or a comment a week in r/The_Donald during the 30 weeks prior to the subreddit's Quarantine. The database has four tables: submissions, comments, subreddits, and perspective_scores. The subreddits table contains the names of the subreddits to which submissions and comments were made (their IDs are also in base 10). The perspective_scores table contains comment toxicity scores.

The Perspective API was used to score comments based on the attributes toxicity and severe_toxicity. It should be noted that not all of the comments in core_the_donald have a score because the comment body was blank or because the Perspective API returned a request error (after three tries). However, the percentage of missing scores is minuscule.

A third file, mbfc_scores.csv, contains the bias and factual reporting accuracy collected in October 2021 from Media Bias / Fact Check (MBFC). Both attributes are scored in a Likert-like manner. Submissions can be associated with MBFC scores by joining on the domain column.
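The join on the domain column can be sketched with plain dictionaries. Only the `domain` column is confirmed by the description; the `bias` and `factual_reporting` field names, the `join_on_domain` helper, and the sample rows are assumptions made for illustration, so the real column names in mbfc_scores.csv should be checked first.

```python
def join_on_domain(submissions, mbfc_rows):
    """Inner-join submissions to MBFC rows that share the same `domain` value."""
    scores_by_domain = {row["domain"]: row for row in mbfc_rows}
    joined = []
    for submission in submissions:
        match = scores_by_domain.get(submission["domain"])
        if match is None:
            continue  # no MBFC rating for this domain
        joined.append(
            {
                **submission,
                "bias": match["bias"],
                "factual_reporting": match["factual_reporting"],
            }
        )
    return joined


# Invented rows; IDs are base-10 integers as in the databases described above.
submissions = [
    {"id": 101, "domain": "example-news.com"},
    {"id": 102, "domain": "unrated-site.org"},
]
mbfc_rows = [
    {"domain": "example-news.com", "bias": 1, "factual_reporting": 4},
]

print(join_on_domain(submissions, mbfc_rows))
```

An inner join drops submissions whose domain has no MBFC entry; keeping them with empty scores instead would correspond to a left join.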
1 million Reddit comments from 40 subreddits
kaggle.com
Updated Feb 3, 2020
Cite
Samuel Magnan (2020). 1 million Reddit comments from 40 subreddits [Dataset]. https://www.kaggle.com/smagnan/1-million-reddit-comments-from-40-subreddits/kernels
Dataset updated
Feb 3, 2020
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Samuel Magnan
License
https://creativecommons.org/publicdomain/zero/1.0/
Description

Content

This data is an extract from a bigger Reddit dataset (all Reddit comments from May 2019, 157 GB of data uncompressed) that contains both more comments and more associated information (timestamps, author, flairs, etc.).

For ease of use, I picked the first 25 000 comments for each of the 40 most frequented subreddits (May 2019), so that the volumes are balanced if anyone wants to use the subreddit as categorical data.

I also excluded any removed comments, comments whose author was deleted, and comments deemed too short (fewer than 4 tokens), and changed the format (JSON -> CSV).
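The selection and filtering described above can be approximated as follows. This is a sketch under several assumptions that the description does not confirm: tokens are counted by whitespace splitting, removed or deleted content is marked with the placeholder strings `[removed]` and `[deleted]`, and the filters are applied before the per-subreddit cap. The function name `select_comments` is hypothetical.

```python
from collections import defaultdict

# Placeholder strings assumed to mark removed or deleted content.
REMOVED_MARKERS = {"[removed]", "[deleted]"}


def select_comments(comments, per_subreddit=25_000, min_tokens=4):
    """Keep the first `per_subreddit` usable comments for each subreddit."""
    kept = defaultdict(list)
    for comment in comments:
        body = comment["body"]
        if body in REMOVED_MARKERS or comment["author"] == "[deleted]":
            continue  # removed comment or deleted author
        if len(body.split()) < min_tokens:
            continue  # too short to be useful
        bucket = kept[comment["subreddit"]]
        if len(bucket) < per_subreddit:
            bucket.append(comment)
    return dict(kept)


# Invented comments for illustration.
comments = [
    {"subreddit": "a", "author": "u1", "body": "this is a long enough comment"},
    {"subreddit": "a", "author": "[deleted]", "body": "this is a long enough comment"},
    {"subreddit": "a", "author": "u2", "body": "too short"},
    {"subreddit": "a", "author": "u3", "body": "[removed]"},
    {"subreddit": "a", "author": "u4", "body": "another sufficiently long comment here"},
    {"subreddit": "a", "author": "u5", "body": "a third comment that is long"},
    {"subreddit": "b", "author": "u6", "body": "one two three four"},
]

selected = select_comments(comments, per_subreddit=2)
print({name: len(items) for name, items in selected.items()})
```

Capping every subreddit at the same count is what keeps the classes balanced when the subreddit is used as a categorical label.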

This is primarily an NLP dataset, but in addition to the comments I added the 3 features I deemed the most important. I also aimed for feature type variety.

The information kept here is:

subreddit (categorical): on which subreddit the comment was posted

body (str): comment content

controversiality (binary): a reddit aggregated metric

score (scalar): upvotes minus downvotes

Acknowledgements

The data is but a small extract of what is being collected by pushshift.io on a monthly basis. You can easily find the full information if you want to work with more features and more data.

What can I do with that?

Have fun! The variety of feature types should allow you to gain a few interesting insights or build some simple models.

Note

If you think the License (CC0: Public Domain) should be different, contact me
Reddit: /r/CryptoCurrency
kaggle.com
Updated Dec 18, 2022
Cite
The Devastator (2022). Reddit: /r/CryptoCurrency [Dataset]. https://www.kaggle.com/datasets/thedevastator/unlocking-financial-opportunities-through-crypto
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Dec 18, 2022
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
The Devastator
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Reddit: /r/CryptoCurrency

Posts, Scores, Comment Counts and Creation Timestamps

By Reddit [source]

About this dataset

This dataset contains detailed information on posts, scores and comments from the Reddit subreddit ‘CryptoCurrency’, an online community devoted to discussion and analysis of the latest developments in blockchain investments, digital currencies, and related topics. Post titles, scores (the net upvotes a post has received), comment counts, creation dates and timestamps are all included. This snapshot of crypto discussions shows which topics have been popular over time and how they are discussed across the community. Are there particular trends or patterns that emerge? It's up to you to uncover them!

How to use the dataset

This dataset contains posts and comments from the subreddit ‘CryptoCurrency’, a widely followed discussion board devoted to cryptocurrencies, blockchain investments, and related topics. It holds a large number of posts from the subreddit along with their scores, comment counts and creation timestamps, and can be used for both research and practical business applications.
The columns are: title, score, url, comms_num (number of comments), created (date and time the post was created), body (the content of the post), and timestamp. With these you can begin answering questions such as: Which topics attract more attention? Which topics are not popular? Is there a correlation between a post's score (net upvotes) and its number of comments?
Several tools can help with these questions. Natural language processing techniques such as TF-IDF vectorizers or Latent Dirichlet Allocation can reveal which themes dominate the conversations. Unsupervised learning techniques such as k-means clustering or Principal Component Analysis can also help uncover structure in the data. For example, to find out which words in titles are associated with higher scores, you could cluster posts by their title vocabulary and compare scores across clusters, which gives an overview of how related words relate to scores before you analyze the post content or other factors.
Reddit users actively engage with posts, so comment counts also offer a view of popularity. For example, one may observe that when new coin values arise, posts tend to have more comments than usual, which would indicate high user engagement at certain moments compared to regular periods and could be useful when comparing individual coins.
Overall, the value of this data depends on the use case, whether pure research or applied analytics aimed at predicting prices, engagement, or user sentiment.
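
To make the TF-IDF idea above concrete, here is a minimal standard-library sketch that weights the words of each post title. It uses the plain `tf * log(N / df)` formulation, which differs slightly from the smoothed variant in libraries such as scikit-learn. The three sample rows are hypothetical and only borrow the `title` and `score` column names from the description.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase a title and split it on whitespace (deliberately naive)."""
    return text.lower().split()

def tfidf(titles):
    """Return one {term: weight} dict per title, using tf * log(N / df)."""
    docs = [Counter(tokenize(t)) for t in titles]
    n_docs = len(docs)
    # Document frequency: in how many titles does each term appear?
    df = Counter(term for doc in docs for term in doc)
    weights = []
    for doc in docs:
        total = sum(doc.values())
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in doc.items()
        })
    return weights

# Hypothetical rows using the `title` and `score` columns described above.
rows = [
    {"title": "bitcoin hits new high", "score": 120},
    {"title": "bitcoin wallet question", "score": 8},
    {"title": "new exchange listing", "score": 45},
]
weights = tfidf([r["title"] for r in rows])
for row, w in zip(rows, weights):
    # Show each post's score next to its most distinctive title word.
    print(row["score"], max(w, key=w.get))
```

Words that appear in many titles receive a low weight, and a word that appears in every title receives a weight of zero, so the highest-weighted terms are the ones that distinguish a post from the rest.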

Research Ideas

This dataset can be used to create a sentiment analysis of the comments and posts on CryptoCurrency topics and how these conversations have changed over time. This can help ascertain how different events within the crypto market have been received by investors, speculators, and other users on the subreddit.

The dataset can also be utilized to identify trends in successful topics of conversation (in terms of post scores) and give insight into what types of topics are popular among Redditors in the CryptoCurrency space.

Furthermore, this dataset could provide insight into user behavior on CryptoCurrency subreddits by enabling analysis of peak times for certain conversations or post popularity, as well as which users tend to comment or post more frequently than others.

Acknowledgements

If you use this dataset in your research, please credit the original authors. Data Source

...
Reddit Datasets
promptcloud.com
csv
Updated Mar 28, 2025
Cite
PromptCloud (2025). Reddit Datasets [Dataset]. https://www.promptcloud.com/dataset/reddit/
Explore at:
Available download formats: csv
Dataset updated
Mar 28, 2025
Dataset authored and provided by
PromptCloud
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Extracting Insights from Online Discussions: Reddit is one of the largest social discussion platforms, making it a valuable source for real-time opinions, trends, sentiment analysis, and user interactions across various industries. Scraping Reddit data allows businesses, researchers, and analysts to explore public discussions, track sentiment, and gain actionable insights from user-generated content. Benefits and Impact: Trend […]
Reddit dataset
service.tib.eu
Updated Nov 25, 2024
Cite
(2024). Reddit dataset [Dataset]. https://service.tib.eu/ldmservice/dataset/reddit-dataset
Explore at:
Dataset updated
Nov 25, 2024
Description
The Reddit dataset contains tuples of user name, a subreddit where the user makes a comment to a thread, and a timestamp for the interaction, split into sessions manually.
Comprehensive dataset of over 4000 subreddits across 13 categories
data.niaid.nih.gov
Updated Oct 1, 2024
Cite
Vinod, Dayanand (2024). Comprehensive dataset of over 4000 subreddits across 13 categories [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_13343577
Explore at:
Dataset updated
Oct 1, 2024
Dataset provided by
S, Arjhun
Vinod, Dayanand
Deepak, Pranav
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset encompasses a rich collection of 4000 subreddits organized into 13 distinct categories, providing a valuable resource for researchers and data scientists in the fields of social media analysis, natural language processing, and community dynamics. The subreddits and the respective categories were obtained here.

Each subreddit contains an average of over 400 posts and 11 million unique users.

The dataset is formatted in JSON.

The data is structured in the following manner.

id: the post's unique identifier

post_user: the post's author (anonymized)

post_time: the time at which the post was created, in unix time

post_body: the post's body

comments: a list of comments on the post, where each comment is a dictionary with the following keys:

id: the comment's unique identifier

user: the comment's author (anonymized)

time: the time at which the comment was created, in unix time

body: the comment's body

replies: a list of replies to the comment, where each reply is a dictionary with the same information as a comment.

The comments and replies are threaded within.
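
Because replies nest inside comments, a recursive walk is a natural way to read this structure. The sketch below flattens a thread depth-first and records each comment's nesting depth. The sample post is hypothetical and only follows the field names listed above; how posts are grouped into files is not specified here and is left as an assumption.

```python
import json

def iter_comments(comments, depth=0):
    """Yield (depth, comment) for every comment and nested reply, depth-first."""
    for comment in comments:
        yield depth, comment
        yield from iter_comments(comment.get("replies", []), depth + 1)

# Hypothetical post that follows the field list above.
post = json.loads("""
{
  "id": "p1",
  "post_user": "user_a",
  "post_time": 1700000000,
  "post_body": "Example post",
  "comments": [
    {"id": "c1", "user": "user_b", "time": 1700000100, "body": "First",
     "replies": [
       {"id": "c2", "user": "user_a", "time": 1700000200, "body": "Reply",
        "replies": []}
     ]},
    {"id": "c3", "user": "user_c", "time": 1700000300, "body": "Second",
     "replies": []}
  ]
}
""")

flat = [(depth, c["id"]) for depth, c in iter_comments(post["comments"])]
for depth, comment_id in flat:
    print("  " * depth + comment_id)
```

A generator keeps memory use low on long threads, and the depth value makes it easy to compute measures such as maximum thread depth or replies per top-level comment.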
reddit user posting behavior (mid-2013)
figshare.com
application/gzip
Updated May 31, 2023
Cite
Randy Olson (2023). reddit user posting behavior (mid-2013) [Dataset]. http://doi.org/10.6084/m9.figshare.874101.v2
Explore at:
Available download formats: application/gzip
Unique identifier
https://doi.org/10.6084/m9.figshare.874101.v2
Dataset updated
May 31, 2023
Dataset provided by
Figshare (http://figshare.com/)
Authors
Randy Olson
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This file contains the posting preferences for over 850,000 active reddit users. This sample was taken in mid-2013. This data was used to generate the interactive visualization, "redditviz," and will be analyzed in detail in an upcoming research article. Please cite our paper "Navigating the massive world of reddit" if you use this data in your work. URL: http://arxiv.org/abs/1312.3387 The file is organized as follows: Each line is an entry for an anonymous user. Each user was randomly assigned a unique ID, which is what shows in the first entry of each line. Following the user ID, separated by commas, are the subreddits (i.e., interests) that the user regularly posts in. In order for a user to be considered "active" in that subreddit, they had to post or comment there at least 10 times in their last 1,000 posts and comments.
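
The line format described above (a user ID followed by comma-separated subreddits) can be read with a few lines of Python. This is a sketch on hypothetical sample lines; the real archive is gzip-compressed, so in practice the lines would come from something like `gzip.open(path, "rt")`, where the path is whatever you downloaded.

```python
import csv
from collections import Counter

def parse_users(lines):
    """Map each user ID to the list of subreddits on that user's line."""
    users = {}
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        user_id, *subreddits = row
        users[user_id] = subreddits
    return users

# Hypothetical sample lines in the described format.
sample_lines = [
    "1001,askreddit,science",
    "1002,askreddit",
    "1003,gaming,science,askreddit",
]
users = parse_users(sample_lines)

# Count how many active users each subreddit has.
popularity = Counter(sub for subs in users.values() for sub in subs)
print(popularity.most_common(2))
```

From the `users` mapping it is a short step to a subreddit co-occurrence count, which is the kind of structure an interest network visualization would be built on.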
Regression exploring the relationship between amount of missing content per...
plos.figshare.com
xls
Updated May 31, 2023
Cite
Devin Gaffney; J. Nathan Matias (2023). Regression exploring the relationship between amount of missing content per subreddit and total amount of known content per subreddit, and month in which the subreddit was created. [Dataset]. http://doi.org/10.1371/journal.pone.0200162.t002
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0200162.t002
Dataset updated
May 31, 2023
Dataset provided by
PLOS ONE
Authors
Devin Gaffney; J. Nathan Matias
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
We expect that these two variables would have meaningful explanatory power for where missing content is. We find that this appears to be the case for missing comments but not for missing submissions, as evidenced by the relative R² values.
R/The_Donald Reddit Dataset
figshare.com
txt
Updated Jun 4, 2022
Cite
Vivian Ferrillo (2022). R/The_Donald Reddit Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.19991777.v1
Explore at:
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.19991777.v1
Dataset updated
Jun 4, 2022
Dataset provided by
Figshare (http://figshare.com/)
Authors
Vivian Ferrillo
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset contains publicly available posting data on users who posted on r/The_Donald in January 2017. All data was scraped via the PushShift.io project. The dataset contains monthly posting data for each individual and the results of term frequency analysis. All sampled users were anonymized.
Reddit users in Africa 2020-2028
statista.com
Updated Jan 10, 2024
Cite
Statista Research Department (2024). Reddit users in Africa 2020-2028 [Dataset]. https://www.statista.com/topics/9922/social-media-in-africa/
Explore at:
Dataset updated
Jan 10, 2024
Dataset provided by
Statista (http://statista.com/)
Authors
Statista Research Department
Area covered
Africa
Description
The number of Reddit users in Africa was forecast to continuously increase between 2024 and 2028 by in total 4.7 million users (+66.67 percent). After the eighth consecutive increasing year, the Reddit user base is estimated to reach 11.78 million users and therefore a new peak in 2028. Notably, the number of Reddit users has been continuously increasing over the past years. User figures, shown here with regard to the platform Reddit, have been estimated by taking into account company filings or press material, secondary research, app downloads and traffic data. They refer to the average monthly active users over the period and count multiple accounts by persons only once. Reddit users encompass both users that are logged in and those that are not. The shown data are an excerpt of Statista's Key Market Indicators (KMI). The KMI are a collection of primary and secondary indicators on the macro-economic, demographic and technological environment in up to 150 countries and regions worldwide. All indicators are sourced from international and national statistical offices, trade associations and the trade press and they are processed to generate comparable data sets (see supplementary notes under details for more information). Find more key insights for the number of Reddit users in regions like North America and Asia.
Reddit, Inc. - Change-Receivables
macro-rankings.com
csv, excel
Updated Aug 9, 2025
Cite
macro-rankings (2025). Reddit, Inc. - Change-Receivables [Dataset]. https://www.macro-rankings.com/Markets/Stocks/RDDT-NYSE/Cashflow-Statement/Change-Receivables
Explore at:
Available download formats: excel, csv
Dataset updated
Aug 9, 2025
Dataset authored and provided by
macro-rankings
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
United States
Description
Change-Receivables Time Series for Reddit, Inc. Reddit, Inc. operates a digital community in the United States and internationally. The company's platform enables users to engage in conversations, explore passions, research new hobbies, exchange goods and services, create new communities and experiences, share laughs, and find belonging. It also organizes communities based on specific interests that enable users to engage in conversations by sharing experiences, submitting links, uploading images and videos, and replying to one another. The company was founded in 2005 and is headquartered in San Francisco, California.
Question-Answer Jokes
kaggle.com
Updated Jan 5, 2017
Cite
Jiri Roznovjak (2017). Question-Answer Jokes [Dataset]. https://www.kaggle.com/datasets/jiriroz/qa-jokes/data
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Jan 5, 2017
Dataset provided by
Kaggle
Authors
Jiri Roznovjak
Description
This dataset contains 38,269 jokes of the question-answer form, obtained from the r/Jokes subreddit. The dataset contains a csv file, where a row contains a question ("Why did the chicken cross the road"), the corresponding answer ("To get to the other side") and a unique ID.

The data comes from the end of 2016 all the way to 2008. The entries with a higher ID correspond to the ones submitted earlier.

An example of what one might do with the data is build a sequence-to-sequence model where the input is a question and the output is an answer. Then, given a question, the model should generate a funny answer. This is what I did as the final project for my fall 2016 machine learning class. The project page can be viewed here.
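
As a rough sketch of the first step toward such a sequence-to-sequence model, the snippet below reads the CSV into (input, target) pairs. The column names `ID`, `Question` and `Answer` are assumptions made for illustration, since the description does not give the actual header, and the sample row reuses the example quoted above.

```python
import csv
import io

# Hypothetical header and row; the real column names may differ.
sample_csv = io.StringIO(
    "ID,Question,Answer\n"
    "1,Why did the chicken cross the road?,To get to the other side\n"
)

def load_pairs(handle):
    """Return a list of (question, answer) training pairs from a CSV handle."""
    return [
        (row["Question"].strip(), row["Answer"].strip())
        for row in csv.DictReader(handle)
    ]

pairs = load_pairs(sample_csv)
for question, answer in pairs:
    print(question, "->", answer)
```

The question would serve as the encoder input and the answer as the decoder target; tokenization and vocabulary building depend on the chosen model and are left out here.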

Disclaimer: The dataset contains jokes that some may find inappropriate.

License

Released under reddit's API terms

Statista (2025). Reddit user worldwide 2024, by country [Dataset]. https://www.statista.com/forecasts/1174696/reddit-user-by-country

18 scholarly articles cite this dataset (View in Google Scholar)
