The Reddit Subreddit Dataset by Dataplex offers a comprehensive and detailed view of Reddit’s vast ecosystem, now enhanced with appended AI-generated columns that provide additional insights and categorization. This dataset includes data from over 2.1 million subreddits, making it an invaluable resource for a wide range of analytical applications, from social media analysis to market research.
Dataset Overview:
This dataset includes detailed information on subreddit activities, user interactions, post frequency, comment data, and more. The inclusion of AI-generated columns adds an extra layer of analysis, offering sentiment analysis, topic categorization, and predictive insights that help users better understand the dynamics of each subreddit.
2.1 Million Subreddits with Enhanced AI Insights: The dataset covers over 2.1 million subreddits and now includes AI-enhanced columns that provide:
- Sentiment Analysis: AI-driven sentiment scores for posts and comments, allowing users to gauge community mood and reactions.
- Topic Categorization: Automated categorization of subreddit content into relevant topics, making it easier to filter and analyze specific types of discussions.
- Predictive Insights: AI models that predict trends, content virality, and user engagement, helping users anticipate future developments within subreddits.
Sourced Directly from Reddit:
All social media data in this dataset is sourced directly from Reddit, ensuring accuracy and authenticity. The dataset is updated regularly, reflecting the latest trends and user interactions on the platform. This ensures that users have access to the most current and relevant data for their analyses.
Key Features:
Use Cases:
Data Quality and Reliability:
The Reddit Subreddit Dataset emphasizes data quality and reliability. Each record is carefully compiled from Reddit’s vast database, ensuring that the information is both accurate and up-to-date. The AI-generated columns further enhance the dataset's value, providing automated insights that help users quickly identify key trends and sentiments.
Integration and Usability:
The dataset is provided in a format that is compatible with most data analysis tools and platforms, making it easy to integrate into existing workflows. Users can quickly import, analyze, and utilize the data for various applications, from market research to academic studies.
User-Friendly Structure and Metadata:
The data is organized for easy navigation and analysis, with metadata files included to help users identify relevant subreddits and data points. The AI-enhanced columns are clearly labeled and structured, allowing users to efficiently incorporate these insights into their analyses.
Ideal For:
This dataset is an essential resource for anyone looking to understand the intricacies of Reddit's vast ecosystem, offering the data and AI-enhanced insights needed to drive informed decisions and strategies across various fields. Whether you’re tracking emerging trends, analyzing user behavior, or conduc...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Gendered terms in the 100 most popular subreddits, as judged by a chi-squared test. Spreadsheets are organised by subjectively determined subreddit theme. This is the data associated with the paper "She's Reddit: A new source of gendered interest information?", with Emma Stuart.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The included datasets form the basis of the article "No space for Reddit spacing: Mapping the reflexive relationship between groups on 4chan and Reddit", published in Social Media + Society. They include cross-mentions between 4chan and Reddit, as well as various metrics associated with these cross-references. The timeframe ranges from the earliest date available (for Reddit: June 2006; for 4chan/b/: April 2006; for 4chan/pol/: December 2013) and ends in January 2023 (except for the 4chan/b/ dataset, which ends in December 2008). The datasets specifically entail the following:

1. Cross-mentions from Reddit to 4chan (reddit-mentions-to-4chan.csv)
I used the Pushshift API's search endpoint to fetch Reddit comments (so no opening posts) with the keyword "4chan" (note: this Pushshift functionality is now deprecated). I also used a rudimentary filter to remove posts by bots, specifically by 1) deleting posts from every account that had "bot" or "auto" in the username and 2) removing all posts by authors with 100 or more contributions whom I manually identified as automated accounts. I removed URL-only cross-references, i.e. posts that only mentioned "://boards.4chan.org" or "://boards.4channel.org" without another 4chan reference. This resulted in 2,638,621 "4chan" references across Reddit.

2. Cross-mentions from 4chan/pol/ to Reddit (4chan-pol_mentions-of-reddit.csv)
With a complete dataset of /pol/ collected through 4CAT, I queried for "reddit" or the common synonym "plebbit", case-insensitive, with prefixes and suffixes allowed (e.g. "Redditor"). I removed URL-only cross-references, i.e. posts that only mentioned "://reddit.com/", "www.reddit.com/", or "i.reddit.com/" without another Reddit reference. This resulted in 1,640,273 "Reddit" references on /pol/.

3. Cross-mentions from 4chan/b/ to Reddit (4chan-b_mentions-to-reddit.csv)
I extracted five million posts from Jason Scott's 4chan/b/ dump, then queried for "reddit" or the common synonym "plebbit", case-insensitive, with prefixes and suffixes allowed (e.g. "Redditor"). I removed URL-only cross-references as above. This resulted in 1,287 "Reddit" references on /b/. See Hagen (2020) for more information on the 4chan/b/ dataset.

4. Cross-mention metrics (cross-mention-metrics.xlsx)
I extracted the following metrics from the datasets above:
4.1 The total number of cross-mentions, absolute and relative, per month. This simply used the monthly counts from datasets 1 and 2.
4.2 The most mentioned subreddits on /pol/, per year, using the regular expression: r\/[a-zA-Z_]
4.3 Subreddits that mention 4chan most often, per year.
4.4 4chan boards mentioned across Reddit, per month.
4.5 4chan boards mentioned by subreddits.
I counted every subreddit or board mention once per post instead of counting total occurrences. For 4.4 and 4.5, I used the following regular expression to extract 4chan board names:
(\s|^|4chan)\/(a|b|c|d|e|f|g|gif|h|hr|k|m|o|p|t|v|vg|vm|vmg|vr|vrpg|vst|w|wg|i|ic|r9k|s4s|vip|qa|cm|hm|lgbt|y|3|aco|adv|an|bant|biz|cgl|ck|co|diy|fa|fit|gd|hc|his|int|jp|lit|mlp|mu|n|news|out|po|pol|pw|qst|sci|soc|sp|tg|toy|trv|tv|vp|vt|wsg|wsr|x|xs|new)\/(\s|$)
I also omitted 4chan's /r/, /u/, and /s/ boards; despite their small scale, they appeared as false positives due to their unrelated vernacular meaning on Reddit (e.g. /u/ as a username prefix). 4.5 was also transformed and included as a Gephi network file (subreddit-board-mentions.gephi).

Lastly, I also included:
4.6 The total number of posts on 4chan and Reddit. This was used to calculate 4.1. It uses Pushshift's database statistics (which as of Nov. 2023 require a login; see this Pastebin for an alternative) and metrics of total 4chan post counts from 4stats.io. Each of these metrics has its own corresponding tab in the Excel file.

5. Co-words of "4chan" and "reddit" in the cross-mentions (co-words.xlsx)
Using datasets 1, 2, and 3, I extracted the top ten words appearing directly next to "4chan" on Reddit, and next to "Reddit" on 4chan, per year. I first pre-processed the text, which involved tokenisation, filtering of unwanted text elements like URLs, stop word removal (I whitelisted "back"), and lemmatisation. For the co-word extraction I used a window size of two. I excluded a range of semantically uninteresting words and commonly used hate speech terms prevalent throughout 4chan.

6. Annotated cross-mentions between Reddit and 4chan/pol/ in September 2014 (annotations_4chanpol-2014.csv, annotations_reddit-2014-kotakuinaction-anonimised.csv, annotations_reddit-2014-tumblrinaction-anonimised.csv)
I extracted cross-mentions from /pol/ to Reddit and from Reddit to 4chan in September 2014 for close reading and annotation.

Author names are removed for all datasets.
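As an illustration of the per-post counting described above, here is a minimal Python sketch, assuming an abbreviated board list; the full extraction used the complete regular expression listed under 4.4/4.5:

```python
import re
from collections import Counter

# Abbreviated board list for illustration; the dataset used the full list of
# 4chan boards given in the regular expression above.
BOARD_RE = re.compile(r"(\s|^|4chan)/(pol|b|v|g|x|r9k)/(\s|$)")

def boards_mentioned(post_text: str) -> set[str]:
    """Return the set of 4chan boards mentioned in one post.

    Each board is counted at most once per post, matching the per-post
    counting described above (rather than total occurrences).
    """
    return {m.group(2) for m in BOARD_RE.finditer(post_text)}

posts = [
    "saw this on 4chan/pol/ yesterday",
    "/b/ is unusable, /pol/ too",
]
counts = Counter(board for post in posts for board in boards_mentioned(post))
print(counts)  # Counter({'pol': 2, 'b': 1})
```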
Starting June 12, 2023, many Reddit communities (subreddits) "went dark" by switching to private mode, in protest of Reddit's plans to change its API access policies and fee structure. Supporters of the protest criticize the planned changes for being prohibitively expensive for 3rd-party apps. Beyond 3rd-party apps, there is significant concern that the API changes are a move by the platform to increase monetization, degrade the user experience, and eventually kill off other custom features such as the old.reddit.com interface, the Reddit Enhancement Suite browser extension, and more. Additionally, there are concerns that the API changes will impede the ability of subreddit moderators (who are all unpaid users) to access tools to keep their communities on-topic and free of spam. This dataset includes the "stickied" posts that appeared on 5,351 subreddits on June 11, 2023 and June 12, 2023 - including many subreddits announcing their plans to pa... The list of subreddits was created from the list of participating subreddits that had been collated in the /r/ModCoord subreddit. An initial Python script looks at three Reddit posts and grabs the list of participating subreddits:
https://www.reddit.com/r/ModCoord/comments/1401qw5/incomplete_and_growing_list_of_participating/
https://www.reddit.com/r/ModCoord/comments/143fzf6/incomplete_and_growing_list_of_participating/
https://www.reddit.com/r/ModCoord/comments/146ffpb/incomplete_and_growing_list_of_participating/
It uses the requests library to get the HTTP response body, then uses re to search for links that look like r/iphone, i.e. how the list appears in the post. Next comes a bit of string cleanup before writing to an output file. This script does not use the Reddit API at all; it makes only basic HTTP requests. A second Python script then reads that list and uses the Reddit API to request information about current posts in each subr...

# Reddit Blackout Announcements - 2023 API Protest
This dataset includes the list of scraped subreddits, a single CSV file for each subreddit, and a copy of the Python scripts used to scrape the data.
The dataset is uploaded as a single .zip file. Once it is downloaded and decompressed, it will include several files and directories, organized as follows:

.
└── subreddit-list.txt
└── CSVs
    └── [subreddit-name].csv
    └── [...]
└── code
    └── [...]
└── parsed TXTs
    └── API.txt
    └── blackout.txt
    └── community.txt
    └── mod-team.txt
    └── moderator.txt
    └── platform.txt
    └── protest.txt
The subreddit-list.txt file contains a list of 5,351 subreddit names, each on its own line. This list was generated using the list-subreddits.py script, as described above.
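A minimal sketch of what list-subreddits.py plausibly looks like, based on the description above (plain HTTP requests plus a regular expression); the exact regex and cleanup steps are assumptions, not the original code:

```python
import re
import requests

# The three r/ModCoord list posts named above; one shown here for brevity.
URLS = [
    "https://www.reddit.com/r/ModCoord/comments/1401qw5/incomplete_and_growing_list_of_participating/",
]

subreddits = set()
for url in URLS:
    # Plain HTTP request -- no Reddit API involved.
    body = requests.get(url, headers={"User-Agent": "blackout-list-scraper"}).text
    # Capture anything that looks like r/<name> in the body (assumed regex).
    subreddits.update(name.lower() for name in re.findall(r"\br/([A-Za-z0-9_]+)", body))

with open("subreddit-list.txt", "w") as f:
    f.write("\n".join(sorted(subreddits)))
```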
The "CSVs" directory contains 5,351 CSV (Comma Separated Value) files, each named ...
This dataset collects one year of Reddit posts, post metadata, post title sentiments, and post comment threads from the subreddits r/GME, r/superstonk, r/DDintoGME, and r/GMEJungle.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Dataset Metrics
Total size of data uncompressed: 59,515,177,346 bytes
Number of objects (submissions): 19,456,493
Reddit API documentation: https://www.reddit.com/dev/api/

Overview
This dataset contains all available submissions from Reddit during the month of May 2019 (using UTC time boundaries). The data has been split to accommodate the file upload limitations of Dataverse. Each file is a collection of JSON objects (ndjson) and was compressed using zstandard compression (https://facebook.github.io/zstd). The files should be ordered by the id of the submission (represented by the id field). The time that each object was ingested is recorded in the retrieved_on field (in epoch seconds).

Methodology
Monthly Reddit ingests are usually started around a week into a new month for the previous month (but can be delayed). This gives submission scores, gildings, and num_comments time to "settle" close to their eventual values before Reddit archives the posts (usually six months after a post's creation). All submissions are ingested via Reddit's API (using the /api/info endpoint). This is a "best effort" attempt to get all available data at the time of ingest. Due to the nature of Reddit, subreddits can go from private to public at any time, so it is possible more submissions could be found by rescanning missing ids. The author of this dataset highly encourages researchers to do a sanity check on the data and to rescan for missing ids to ensure all available data has been gathered. If you need assistance, you can contact me directly.

All efforts were made to capture as much data as possible; generally, more than 95% of all ids are captured. Missing data can result from Reddit API errors, submissions that were private during the ingest but later became public, and subreddits that were quarantined and not added to the whitelist before ingesting the data. When collecting the data, two scans are done: a first scan of ids using the /api/info endpoint collects all available data, and a second scan then requests only the ids missing from the first. This helps keep the data as complete and comprehensive as possible.

Contact
If you have any questions about the data or require more details on the methodology, you are welcome to contact the author.
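To give a sense of the file format, here is a minimal Python sketch for reading one of these zstandard-compressed ndjson files; the filename is a placeholder, and max_window_size is set large because monthly dumps are typically compressed with a long window:

```python
import io
import json
import zstandard  # pip install zstandard

# Placeholder filename; substitute one of the dataset's split files.
with open("RS_2019-05_part1.zst", "rb") as fh:
    reader = zstandard.ZstdDecompressor(max_window_size=2**31).stream_reader(fh)
    for line in io.TextIOWrapper(reader, encoding="utf-8"):
        submission = json.loads(line)
        # id and retrieved_on are fields documented in the description above.
        print(submission["id"], submission.get("retrieved_on"))
        break  # just peek at the first object
```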
The Reddit Politosphere is a large-scale resource of online political discourse covering more than 600 political discussion groups over a period of 12 years. Based on the Pushshift Reddit Dataset, it is to the best of our knowledge the largest and ideologically most comprehensive dataset of its type now available. One key feature of the Reddit Politosphere is that it consists of both text and network data. We also release annotated metadata for subreddits and users. Documentation and scripts for easy data access are provided in an associated repository on GitHub.
https://creativecommons.org/publicdomain/zero/1.0/
A tool created for scraping data from Reddit.com that lets the user choose subreddits to crawl and keywords to search for. Data is acquired through the Reddit API using Python and the Python Reddit API Wrapper (PRAW). The data is stored as a collection of JSON files. This is a reusable tool suited to various data collections; an example of its output is included.
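A minimal sketch of how such a PRAW-based collector might look; the credentials, subreddit, keyword, and output filename are placeholders, not the tool's actual configuration:

```python
import json
import praw  # pip install praw

reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",          # placeholder credentials
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="keyword-crawler/0.1",
)

# Crawl a chosen subreddit for a chosen keyword and store results as JSON.
results = []
for submission in reddit.subreddit("python").search("dataset", limit=10):
    results.append({
        "id": submission.id,
        "title": submission.title,
        "score": submission.score,
        "url": submission.url,
    })

with open("results.json", "w") as f:
    json.dump(results, f, indent=2)
```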
Reddit is a web traffic powerhouse: in March 2024, approximately 2.2 billion visits were measured to the online forum, making it one of the most-visited websites online.

The front page of the internet
Formerly known as "the front page of the internet", Reddit is an online forum platform with over 130,000 sub-forums and communities. The platform allows registered users, called Redditors, to post content. Each post is open to the entire Reddit community to vote upon, either by downvotes or upvotes. The most popular posts are featured directly on the front page. Subreddits are available by category, and Redditors can follow selected subreddits relevant to their interests and control what content they see on their custom front page. Some of the most popular subreddits are r/AskReddit and r/AMA, the "Ask Me Anything" format. According to the company, Reddit hosted 1,800 AMAs in 2018, with a wide range of topics and hosts. One of the most popular Reddit AMAs of 2022 by number of upvotes was by actor Nicolas Cage, with more than 238.5 thousand upvotes.

Reddit usage
The United States accounts for the biggest share of Reddit's desktop traffic, followed by the UK and Canada. As of March 2023, Reddit ranked among the most popular social media websites in the United States.
Pushshift makes available all the submissions and comments posted on Reddit between June 2005 and April 2019. The dataset consists of 651,778,198 submissions and 5,601,331,385 comments posted on 2,888,885 subreddits.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The World Wide Web is a complex interconnected digital ecosystem, where information and attention flow between platforms and communities throughout the globe. These interactions co-construct how we understand the world, reflecting and shaping public discourse. Unfortunately, researchers often struggle to understand how information circulates and evolves across the web because platform-specific data is often siloed and restricted by linguistic barriers. To address this gap, we present a comprehensive, multilingual dataset capturing all Wikipedia links shared in posts and comments on Reddit from 2020 to 2023, excluding those from private and NSFW subreddits. Each linked Wikipedia article is enriched with revision history, page view data, article ID, redirects, and Wikidata identifiers. Through a research agreement with Reddit, our dataset ensures user privacy while providing a query and ID mechanism that integrates with the Reddit and Wikipedia APIs. This enables extended analyses for researchers studying how information flows across platforms. For example, Reddit discussions draw on Wikipedia for deliberation and fact-checking, which in turn influences Wikipedia content by driving traffic to articles or inspiring edits. By analyzing the relationship between information shared and discussed on these platforms, our dataset provides a foundation for examining the interplay between social media discourse and collaborative knowledge consumption and production.
The motivations for this dataset stem from the challenges researchers face in studying the flow of information across the web. While the World Wide Web enables global communication and collaboration, data silos, linguistic barriers, and platform-specific restrictions hinder our ability to understand how information circulates, evolves, and impacts public discourse. Wikipedia and Reddit, as major hubs of knowledge sharing and discussion, offer an invaluable lens into these processes. However, without comprehensive data capturing their interactions, researchers are unable to fully examine how platforms co-construct knowledge. This dataset bridges this gap, providing the tools needed to study the interconnectedness of social media and collaborative knowledge systems.
WikiReddit is a comprehensive dataset capturing all Wikipedia mentions (including links) shared in posts and comments on Reddit from 2020 to 2023, excluding those from private and NSFW (not safe for work) subreddits. The SQL database comprises 336K total posts, 10.2M comments, 1.95M unique links, and 1.26M unique articles spanning 59 languages on Reddit and 276 Wikipedia language subdomains. Each linked Wikipedia article is enriched with its revision history and page view data within a ±10-day window of its posting, as well as article ID, redirects, and Wikidata identifiers. Supplementary anonymous metadata from Reddit posts and comments further contextualizes the links, offering a robust resource for analysing cross-platform information flows, collective attention dynamics, and the role of Wikipedia in online discourse.
Data was collected from the Reddit4Researchers and Wikipedia APIs. No personally identifiable information is published in the dataset. Reddit data is linked to Wikipedia via the hyperlinks and article titles appearing in Reddit posts.
Extensive processing with tools such as regex was applied to the Reddit post/comment text to extract the Wikipedia URLs. Redirects for Wikipedia URLs and article titles were found through the API and mapped to the collected data. Reddit IDs are hashed with SHA-256 for post/comment/user/subreddit anonymity.
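For illustration, the SHA-256 hashing step might look like the following in Python; whether the original pipeline salted the identifiers before hashing is not stated, so this unsalted version is an assumption:

```python
import hashlib

# Hash a Reddit identifier with SHA-256, as in the anonymisation step above.
# Salting (if any) in the original pipeline is unknown; none is applied here.
def anonymise_id(reddit_id: str) -> str:
    return hashlib.sha256(reddit_id.encode("utf-8")).hexdigest()

print(anonymise_id("t3_abc123"))  # deterministic 64-character hex digest
```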
We foresee several applications of this dataset and preview four here. First, Reddit linking data can be used to understand how attention is driven from one platform to another. Second, it can shed light on how Wikipedia's archive of knowledge is used in the larger social web. Third, our dataset could provide insights into how external attention is topically distributed across Wikipedia, and extend that analysis to the disparities in which types of external communities Wikipedia is used in, and how it is used. Fourth, and relatedly, a topic analysis of our dataset could reveal how Wikipedia usage on Reddit contributes to societal benefits and harms. Our dataset could help examine whether homogeneity within the Reddit and Wikipedia audiences shapes topic patterns, and assess whether these relationships mitigate or amplify problematic engagement online.
The dataset is publicly shared with a Creative Commons Attribution 4.0 International license. The article describing this dataset should be cited: https://doi.org/10.48550/arXiv.2502.04942
Patrick Gildersleve will maintain this dataset, and add further years of content as and when available.
posts
Column Name | Type | Description |
---|---|---|
subreddit_id | TEXT | The unique identifier for the subreddit. |
crosspost_parent_id | TEXT | The ID of the original Reddit post if this post is a crosspost. |
post_id | TEXT | Unique identifier for the Reddit post. |
created_at | TIMESTAMP | The timestamp when the post was created. |
updated_at | TIMESTAMP | The timestamp when the post was last updated. |
language_code | TEXT | The language code of the post. |
score | INTEGER | The score (upvotes minus downvotes) of the post. |
upvote_ratio | REAL | The ratio of upvotes to total votes. |
gildings | INTEGER | Number of awards (gildings) received by the post. |
num_comments | INTEGER | Number of comments on the post. |
comments
Column Name | Type | Description |
---|---|---|
subreddit_id | TEXT | The unique identifier for the subreddit. |
post_id | TEXT | The ID of the Reddit post the comment belongs to. |
parent_id | TEXT | The ID of the parent comment (if a reply). |
comment_id | TEXT | Unique identifier for the comment. |
created_at | TIMESTAMP | The timestamp when the comment was created. |
last_modified_at | TIMESTAMP | The timestamp when the comment was last modified. |
score | INTEGER | The score (upvotes minus downvotes) of the comment. |
upvote_ratio | REAL | The ratio of upvotes to total votes for the comment. |
gilded | INTEGER | Number of awards (gildings) received by the comment. |
postlinks
Column Name | Type | Description |
---|---|---|
post_id | TEXT | Unique identifier for the Reddit post. |
end_processed_valid | INTEGER | Whether the extracted URL from the post resolves to a valid URL. |
end_processed_url | TEXT | The extracted URL from the Reddit post. |
final_valid | INTEGER | Whether the final URL from the post resolves to a valid URL after redirections. |
final_status | INTEGER | HTTP status code of the final URL. |
final_url | TEXT | The final URL after redirections. |
redirected | INTEGER | Indicator of whether the posted URL was redirected (1) or not (0). |
in_title | INTEGER | Indicator of whether the link appears in the post title (1) or post body (0). |
commentlinks
Column Name | Type | Description |
---|---|---|
comment_id | TEXT | Unique identifier for the Reddit comment. |
end_processed_valid | INTEGER | Whether the extracted URL from the comment resolves to a valid URL. |
end_processed_url | TEXT | The extracted URL from the comment. |
final_valid | INTEGER | Whether the final URL from the comment resolves to a valid URL after redirections. |
final_status | INTEGER | HTTP status code of the final URL. |
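As a usage illustration of the schemas above, the following sketch queries the database with Python's sqlite3 module; the filename is a placeholder and the storage engine is an assumption (shown here with SQLite). Column names come from the posts and postlinks tables:

```python
import sqlite3

# Placeholder database filename; substitute the actual dataset file.
conn = sqlite3.connect("wikireddit.db")

# Top ten highest-scoring posts whose extracted link resolved to a valid URL.
rows = conn.execute(
    """
    SELECT p.post_id, p.score, p.num_comments, l.final_url
    FROM posts AS p
    JOIN postlinks AS l ON l.post_id = p.post_id
    WHERE l.final_valid = 1
    ORDER BY p.score DESC
    LIMIT 10
    """
).fetchall()

for row in rows:
    print(row)
```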
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains mentions of the QAnon conspiracy theory across the Web between 28 October 2017 and 1 November 2018. The following list details the data per platform and its collection process:
4chan: Posts and comments on 4chan/pol/ mentioning "Q" or "QAnon". The data is collected through 4CAT, a data capturing and analysis tool that hosts all posts and comments made on 4chan/pol/ since 2014.
8chan: Posts and comments on the /qresearch/ board and other smaller boards mentioning "Q" or "QAnon". The data is derived from qanon.news, a grassroots archive. Given its amateur nature, the dataset is likely not 100% complete, but it still includes over 200,000 posts.
Reddit: Comments made on politically-oriented subreddits mentioning "Q" or "QAnon". The data is gathered through the Pushshift API.
YouTube: Videos mentioning QAnon or "Q" in the title or video description. The data is collected via the YouTube v3 API using the search endpoint. Multiple keywords were queried ("qanon", "qanon 4chan", etc.) to collect a large sample. False positives were then filtered out manually.
Breitbart: Disqus comments on Breitbart.com mentioning QAnon or "Q". The data was gathered by crawling all of Breitbart.com in the timeframe and using the Disqus API.
Online news media: Articles from English online news sources mentioning QAnon. The data is derived from Nexis Uni and ContextualWeb Search by searching for "QAnon". Irrelevant sources and false positives were filtered manually.
The datasets include timestamps, text bodies, and platform-specific information like subreddits and channel titles. To collect data from 4chan, 8chan, Reddit, and Breitbart, we used the same SQL query, sampled 200 comments, and edited the query so it would yield a sufficient proportion of true positives (>94%). The YouTube and online news media datasets were filtered manually.
For Breitbart and Reddit, the data is anonymised by omitting author information. The online news media article text is omitted because of copyright concerns.
See the article on First Monday for the full collection process.
Data and code to replicate analyses. Visit https://dataone.org/datasets/sha256%3Ab1d67e16fe2eec9a88d92b4fc0e8149182c36d5062b5ec384c744c8292e426f8 for complete metadata about this dataset.
This data set contains anonymized data collected from Reddit (via the Pushshift API) and StackOverflow (from Kaggle's dataset). Each folder includes the data split by trimester. The schema of the StackOverflow and Reddit-related files follows:

Fields from StackOverflow
- question_id
- answer_id
- creation_date - answer creation date
- score - score of the question/answer
- tags - all tags flagged for a question
- answer_count - number of answers for a question
- start_question - question's time of creation
- last_activity_date - last update on the question
- new_id - hashed id of the answerer
- q_new_id - hashed id of the questioner

Fields from Reddit
- comment_id
- submission_id
- score - score of the question/submission
- subreddit
- created_utc - time of creation (unrelated to last modified comments)
- new_id - hashed id

The .txt files represent the structure of the corresponding hypergraphs.
This code replicates the main tables and figures in "Place Your Bets? The Value of Investment Research on Reddit's Wallstreetbets" using pseudo-data.
During the first quarter of 2025, online forum and news aggregator Reddit saw approximately 108.1 million daily active users (DAU) engaging with its platform. Of these, over 59.4 million users were not logged in and accessed the platform's content without signing in to a Reddit account. This represents an increase of approximately 6.8 percent compared to the previous quarter, when Reddit saw 55.6 million logged-off DAU.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
One file containing posts to Reddit's r/Scholar forum, scraped from April 2014 through August 2014. The following variables are available for each post:
* DayOfWeek
* Date
* Week
* Title
* User
* Description
* Link
* FirstLinkProvided
* FirstLinkProvidedDomain
* UserSuppliedISBN
* UserSuppliedPMID
* UserSuppliedDOI

Preparation:
(1) Scraping: Scraping took place via Google Sheets and Google Scripts. The Google Sheets code used to scrape Reddit is available here: https://ctrlq.org/code/19600-reddit-scraper-script?_ga=2.69874256.1771062494.1515537454-939238282.1515537453
(2) Cleaning: After scraping, data was cleaned and enriched in Excel. Extraneous HTML tags were removed, and Excel's Text to Columns command was used to separate the posts into more easily analyzable chunks. Excel's IF, SEARCH, and FIND functions were used to extract the FirstLinkProvided, UserSuppliedISBN, UserSuppliedPMID, and UserSuppliedDOI information. OpenRefine (http://openrefine.org) was used to turn the FirstLinkProvided values into the FirstLinkProvidedDomain values.

Caveats:
* Some posts requested more than one item. If two or more URLs were supplied in a post, only the first was captured by the URL extraction method used in Excel.
* To allow for verification of the scrape's accuracy, a link back to each original post on Reddit is included. (These links also allow researchers to see the comments on the original post, which may have been made after the scrape took place and are not included in this file.) Links to original posts reveal user information. Because identification of users would be trivial for someone who followed links to the original posts, no attempt was made to anonymize user IDs in this file.
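A rough Python equivalent of the Excel extraction described under (2); the patterns below are illustrative assumptions, not the original spreadsheet formulas:

```python
import re

# first_link mirrors the FirstLinkProvided extraction; user_supplied_doi
# mirrors UserSuppliedDOI. Both patterns are assumptions for illustration.
def first_link(text: str) -> str | None:
    m = re.search(r"https?://\S+", text)
    return m.group(0) if m else None

def user_supplied_doi(text: str) -> str | None:
    m = re.search(r"\b10\.\d{4,9}/\S+", text)
    return m.group(0) if m else None

post = "Requesting https://doi.org/10.1000/xyz123 please and thank you"
print(first_link(post))         # https://doi.org/10.1000/xyz123
print(user_supplied_doi(post))  # 10.1000/xyz123
```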
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
A dataset including two CSVs collecting Reddit posts, with the following columns:
- Permalink: the post's link
- Title: the post's title
- author: the author's name
- authorUrl: the URL of the author's profile
- commentCount: the number of comments related to the post
- id: the id of the post
- createdDate: the creation date of the Reddit post
- query: contains the URL of the board
- category: the category related to the subreddit post
- score: post score
- awardCount: number of awards
- silverCount: number of silver gildings
- goldCount: number of gold gildings
- platinumCount: number of platinum gildings
- upvoteRatio: the upvote ratio
Hostnames and YouTube videos within collected posts on 4chan /pol/ and political subreddits
Context
As anyone who's been keeping track for the last ten years can tell you, the world of cryptocurrency moves fast. Its movements are all too often supported or hindered by viral fads - be it posts on Reddit, Twitter takes, or something else entirely. We have compiled a month of the most famous cryptocurrency subreddit, /r/Bitcoin, into two convenient CSV files, creating a large cryptocurrency dataset for both enterprise and academic use.
For a larger version, please see our Reddit /r/Bitcoin dataset.
Content
This dataset contains a comprehensive collection of posts and comments mentioning AAPL in their title and body text, respectively. The data is procured using SocialGrep.
To preserve users' anonymity and to prevent targeted harassment, the data does not include usernames.
Acknowledgements
This dataset was created using SocialGrep Exports. If social data analysis is your thing, we also have a good Reddit search tool.
We would also like to thank André François McKenzie for providing us with the background image for this dataset.
Inspiration
Cryptocurrency is still a new topic in everyone's minds. It fluctuates wildly as time goes on - can we predict any future trends from seeing the public opinion shift?
CC-BY
Original Data Source: Reddit /r/Bitcoin Data for Jun 2022