The Reddit Subreddit Dataset by Dataplex offers a comprehensive and detailed view of Reddit’s vast ecosystem, now enhanced with appended AI-generated columns that provide additional insights and categorization. This dataset includes data from over 2.1 million subreddits, making it an invaluable resource for a wide range of analytical applications, from social media analysis to market research.
Dataset Overview:
This dataset includes detailed information on subreddit activities, user interactions, post frequency, comment data, and more. The inclusion of AI-generated columns adds an extra layer of analysis, offering sentiment analysis, topic categorization, and predictive insights that help users better understand the dynamics of each subreddit.
2.1 Million Subreddits with Enhanced AI Insights: The dataset covers over 2.1 million subreddits and now includes AI-enhanced columns that provide:
- Sentiment Analysis: AI-driven sentiment scores for posts and comments, allowing users to gauge community mood and reactions.
- Topic Categorization: Automated categorization of subreddit content into relevant topics, making it easier to filter and analyze specific types of discussions.
- Predictive Insights: AI models that predict trends, content virality, and user engagement, helping users anticipate future developments within subreddits.
Sourced Directly from Reddit:
All social media data in this dataset is sourced directly from Reddit, ensuring accuracy and authenticity. The dataset is updated regularly, reflecting the latest trends and user interactions on the platform. This ensures that users have access to the most current and relevant data for their analyses.
Key Features:
Use Cases:
Data Quality and Reliability:
The Reddit Subreddit Dataset emphasizes data quality and reliability. Each record is carefully compiled from Reddit’s vast database, ensuring that the information is both accurate and up-to-date. The AI-generated columns further enhance the dataset's value, providing automated insights that help users quickly identify key trends and sentiments.
Integration and Usability:
The dataset is provided in a format that is compatible with most data analysis tools and platforms, making it easy to integrate into existing workflows. Users can quickly import, analyze, and utilize the data for various applications, from market research to academic studies.
User-Friendly Structure and Metadata:
The data is organized for easy navigation and analysis, with metadata files included to help users identify relevant subreddits and data points. The AI-enhanced columns are clearly labeled and structured, allowing users to efficiently incorporate these insights into their analyses.
Ideal For:
This dataset is an essential resource for anyone looking to understand the intricacies of Reddit's vast ecosystem, offering the data and AI-enhanced insights needed to drive informed decisions and strategies across various fields. Whether you’re tracking emerging trends, analyzing user behavior, or conduc...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Gendered terms in the 100 most popular subreddits, as judged by a chi-squared test. Spreadsheets are organised by subjectively determined subreddit theme. This is the data associated with the paper "She's Reddit: A new source of gendered interest information?", with Emma Stuart.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The included datasets form the basis of the article "No space for Reddit spacing: Mapping the reflexive relationship between groups on 4chan and Reddit", published in Social Media + Society. They include cross-mentions between 4chan and Reddit, as well as various metrics associated with these cross-references. The timeframe ranges from the earliest date available (for Reddit: June 2006; for 4chan/b/: April 2006; for 4chan/pol/: December 2013) and ends in January 2023 (except for the 4chan/b/ dataset, which ends in December 2008). The datasets specifically entail the following:

1. Cross-mentions from Reddit to 4chan (reddit-mentions-to-4chan.csv)
I used the Pushshift API's search endpoint to fetch Reddit comments (so no opening posts) with the keyword "4chan" (note: this Pushshift functionality is now deprecated). I also used a rudimentary filter to remove posts by bots, specifically by 1) deleting posts from every account that had "bot" or "auto" in the username and 2) removing all posts by authors with 100 or more contributions whom I manually identified as automated accounts. I removed URL-only cross-references, i.e. posts that only mentioned "://boards.4chan.org" or "://boards.4channel.org" without another 4chan reference. This resulted in 2,638,621 "4chan" references across Reddit.

2. Cross-mentions from 4chan/pol/ to Reddit (4chan-pol_mentions-of-reddit.csv)
With a complete dataset of /pol/ collected through 4CAT, I queried for "reddit" or the common synonym "plebbit", case-insensitive, with prefixes and suffixes allowed (e.g. "Redditor"). I removed URL-only cross-references, i.e. posts that only mentioned "://reddit.com/", "www.reddit.com/", or "i.reddit.com/" without another Reddit reference. This resulted in 1,640,273 "Reddit" references on /pol/.

3. Cross-mentions from 4chan/b/ to Reddit (4chan-b_mentions-to-reddit.csv)
I extracted five million posts from Jason Scott's 4chan/b/ dump, then queried for "reddit" or the common synonym "plebbit", case-insensitive, with prefixes and suffixes allowed (e.g. "Redditor"). I removed URL-only cross-references as above. This resulted in 1,287 "Reddit" references on /b/. See Hagen (2020) for more information on the 4chan/b/ dataset.

4. Cross-mention metrics (cross-mention-metrics.xlsx)
I extracted the following metrics from the datasets above:
4.1 The total number of cross-mentions, absolute and relative, per month. This simply used the monthly counts from datasets 1 and 2.
4.2 The most mentioned subreddits on /pol/, per year, using the regular expression: r\/[a-zA-Z_]
4.3 Subreddits that mention 4chan most often, per year.
4.4 4chan boards mentioned across Reddit, per month.
4.5 4chan boards mentioned by subreddits.
I counted every subreddit or board mention once per post instead of counting total occurrences. For 4.4 and 4.5, I used the following regular expression to extract 4chan board names:
(\s|^|4chan)\/(a|b|c|d|e|f|g|gif|h|hr|k|m|o|p|t|v|vg|vm|vmg|vr|vrpg|vst|w|wg|i|ic|r9k|s4s|vip|qa|cm|hm|lgbt|y|3|aco|adv|an|bant|biz|cgl|ck|co|diy|fa|fit|gd|hc|his|int|jp|lit|mlp|mu|n|news|out|po|pol|pw|qst|sci|soc|sp|tg|toy|trv|tv|vp|vt|wsg|wsr|x|xs|new)\/(\s|$)
I also omitted 4chan's /r/, /u/, and /s/ boards; despite their small scale, they appeared as false positives due to their unrelated vernacular meaning on Reddit (e.g. /u/ as a username prefix). 4.5 was also transformed and included as a Gephi network file (subreddit-board-mentions.gephi).

Lastly, I also included:
4.6 The total number of posts on 4chan and Reddit. This was used to calculate 4.1. It uses Pushshift's database statistics (which as of Nov. 2023 require a login; see this Pastebin for an alternative) and metrics of total 4chan post counts from 4stats.io. Each of these metrics has its own corresponding tab in the Excel file.

5. Co-words of "4chan" and "reddit" in the cross-mentions (co-words.xlsx)
Using datasets 1, 2, and 3, I extracted the top ten words appearing directly next to "4chan" on Reddit, and next to "Reddit" on 4chan, per year. I first pre-processed the text, which involved tokenisation, filtering of unwanted text elements like URLs, stop word removal (I whitelisted "back"), and lemmatisation. For the co-word extraction I used a window size of two. I excluded a range of semantically uninteresting words and commonly used hate speech terms prevalent throughout 4chan.

6. Annotated cross-mentions between Reddit and 4chan/pol/ in September 2014 (annotations_4chanpol-2014.csv, annotations_reddit-2014-kotakuinaction-anonimised.csv, annotations_reddit-2014-tumblrinaction-anonimised.csv)
I extracted cross-mentions from /pol/ to Reddit and from Reddit to 4chan in September 2014 for close reading and annotation.

Author names are removed for all datasets.
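As an illustration of the per-post counting described above, here is a minimal Python sketch, assuming an abbreviated board list; the full extraction used the complete regular expression listed under 4.4/4.5:

```python
import re
from collections import Counter

# Abbreviated board list for illustration; the dataset used the full list of
# 4chan boards given in the regular expression above.
BOARD_RE = re.compile(r"(\s|^|4chan)/(pol|b|v|g|x|r9k)/(\s|$)")

def boards_mentioned(post_text: str) -> set[str]:
    """Return the set of 4chan boards mentioned in one post.

    Each board is counted at most once per post, matching the per-post
    counting described above (rather than total occurrences).
    """
    return {m.group(2) for m in BOARD_RE.finditer(post_text)}

posts = [
    "saw this on 4chan/pol/ yesterday",
    "/b/ is unusable, /pol/ too",
]
counts = Counter(board for post in posts for board in boards_mentioned(post))
print(counts)  # Counter({'pol': 2, 'b': 1})
```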
Starting June 12, 2023, many Reddit communities (subreddits) "went dark" by switching to private mode, in protest of Reddit's plans to change its API access policies and fee structure. Supporters of the protest criticize the planned changes for being prohibitively expensive for 3rd-party apps. Beyond 3rd-party apps, there is significant concern that the API changes are a move by the platform to increase monetization, degrade the user experience, and eventually kill off other custom features such as the old.reddit.com interface, the Reddit Enhancement Suite browser extension, and more. Additionally, there are concerns that the API changes will impede the ability of subreddit moderators (who are all unpaid users) to access tools to keep their communities on-topic and free of spam. This dataset includes the "stickied" posts that appeared on 5,351 subreddits on June 11, 2023 and June 12, 2023 - including many subreddits announcing their plans to pa... The list of subreddits was created from the list of participating subreddits that had been collated in the /r/ModCoord subreddit. An initial Python script looks at three Reddit posts and grabs the list of participating subreddits:
https://www.reddit.com/r/ModCoord/comments/1401qw5/incomplete_and_growing_list_of_participating/
https://www.reddit.com/r/ModCoord/comments/143fzf6/incomplete_and_growing_list_of_participating/
https://www.reddit.com/r/ModCoord/comments/146ffpb/incomplete_and_growing_list_of_participating/
It uses the requests library to get the HTTP response body, then uses re to search for links that look like r/iphone, i.e. how the list appears in the post. Next comes a bit of string cleanup before writing to an output file. This script does not use the Reddit API at all; it makes only basic HTTP requests. A second Python script then reads that list and uses the Reddit API to request information about current posts in each subr...

# Reddit Blackout Announcements - 2023 API Protest
This dataset includes the list of scraped subreddits, a single CSV file for each subreddit, and a copy of the Python scripts used to scrape the data.
The dataset is uploaded as a single .zip file. Once it is downloaded and decompressed, it will include several files and directories, organized as follows:

.
└── subreddit-list.txt
└── CSVs
    └── [subreddit-name].csv
    └── [...]
└── code
    └── [...]
└── parsed TXTs
    └── API.txt
    └── blackout.txt
    └── community.txt
    └── mod-team.txt
    └── moderator.txt
    └── platform.txt
    └── protest.txt
The subreddit-list.txt file contains a list of 5,351 subreddit names, each on its own line. This list was generated using the list-subreddits.py script, as described above.
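A minimal sketch of what list-subreddits.py plausibly looks like, based on the description above (plain HTTP requests plus a regular expression); the exact regex and cleanup steps are assumptions, not the original code:

```python
import re
import requests

# The three r/ModCoord list posts named above; one shown here for brevity.
URLS = [
    "https://www.reddit.com/r/ModCoord/comments/1401qw5/incomplete_and_growing_list_of_participating/",
]

subreddits = set()
for url in URLS:
    # Plain HTTP request -- no Reddit API involved.
    body = requests.get(url, headers={"User-Agent": "blackout-list-scraper"}).text
    # Capture anything that looks like r/<name> in the body (assumed regex).
    subreddits.update(name.lower() for name in re.findall(r"\br/([A-Za-z0-9_]+)", body))

with open("subreddit-list.txt", "w") as f:
    f.write("\n".join(sorted(subreddits)))
```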
The "CSVs" directory contains 5,351 CSV (Comma Separated Value) files, each named ...
This dataset collects one year of Reddit posts, post metadata, post title sentiments, and post comment threads from the subreddits r/GME, r/superstonk, r/DDintoGME, and r/GMEJungle.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Dataset Metrics
Total size of data uncompressed: 59,515,177,346 bytes
Number of objects (submissions): 19,456,493
Reddit API documentation: https://www.reddit.com/dev/api/

Overview
This dataset contains all available submissions from Reddit during the month of May 2019 (using UTC time boundaries). The data has been split to accommodate the file upload limitations of Dataverse. Each file is a collection of JSON objects (ndjson) and was compressed using zstandard compression (https://facebook.github.io/zstd). The files should be ordered by the id of the submission (represented by the id field). The time that each object was ingested is recorded in the retrieved_on field (in epoch seconds).

Methodology
Monthly Reddit ingests are usually started around a week into a new month for the previous month (but can be delayed). This gives submission scores, gildings, and num_comments time to "settle" close to their eventual values before Reddit archives the posts (usually six months after a post's creation). All submissions are ingested via Reddit's API (using the /api/info endpoint). This is a "best effort" attempt to get all available data at the time of ingest. Due to the nature of Reddit, subreddits can go from private to public at any time, so it is possible more submissions could be found by rescanning missing ids. The author of this dataset highly encourages researchers to do a sanity check on the data and to rescan for missing ids to ensure all available data has been gathered. If you need assistance, you can contact me directly.

All efforts were made to capture as much data as possible; generally, more than 95% of all ids are captured. Missing data can result from Reddit API errors, submissions that were private during the ingest but later became public, and subreddits that were quarantined and not added to the whitelist before ingesting the data. When collecting the data, two scans are done: a first scan of ids using the /api/info endpoint collects all available data, and a second scan then requests only the ids missing from the first. This helps keep the data as complete and comprehensive as possible.

Contact
If you have any questions about the data or require more details on the methodology, you are welcome to contact the author.
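To give a sense of the file format, here is a minimal Python sketch for reading one of these zstandard-compressed ndjson files; the filename is a placeholder, and max_window_size is set large because monthly dumps are typically compressed with a long window:

```python
import io
import json
import zstandard  # pip install zstandard

# Placeholder filename; substitute one of the dataset's split files.
with open("RS_2019-05_part1.zst", "rb") as fh:
    reader = zstandard.ZstdDecompressor(max_window_size=2**31).stream_reader(fh)
    for line in io.TextIOWrapper(reader, encoding="utf-8"):
        submission = json.loads(line)
        # id and retrieved_on are fields documented in the description above.
        print(submission["id"], submission.get("retrieved_on"))
        break  # just peek at the first object
```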
The Reddit Politosphere is a large-scale resource of online political discourse covering more than 600 political discussion groups over a period of 12 years. Based on the Pushshift Reddit Dataset, it is to the best of our knowledge the largest and ideologically most comprehensive dataset of its type now available. One key feature of the Reddit Politosphere is that it consists of both text and network data. We also release annotated metadata for subreddits and users. Documentation and scripts for easy data access are provided in an associated repository on GitHub.
https://creativecommons.org/publicdomain/zero/1.0/
A tool created for scraping data from Reddit.com that lets the user choose subreddits to crawl and keywords to search for. Data is acquired through the Reddit API using Python and the Python Reddit API Wrapper (PRAW). The data is stored as a collection of JSON files. This is a reusable tool suited to various data collections; an example of its output is included.
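A minimal sketch of how such a PRAW-based collector might look; the credentials, subreddit, keyword, and output filename are placeholders, not the tool's actual configuration:

```python
import json
import praw  # pip install praw

reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",          # placeholder credentials
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="keyword-crawler/0.1",
)

# Crawl a chosen subreddit for a chosen keyword and store results as JSON.
results = []
for submission in reddit.subreddit("python").search("dataset", limit=10):
    results.append({
        "id": submission.id,
        "title": submission.title,
        "score": submission.score,
        "url": submission.url,
    })

with open("results.json", "w") as f:
    json.dump(results, f, indent=2)
```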
Reddit is a web traffic powerhouse: in March 2024, approximately 2.2 billion visits were measured to the online forum, making it one of the most-visited websites online.

The front page of the internet
Formerly known as "the front page of the internet", Reddit is an online forum platform with over 130,000 sub-forums and communities. The platform allows registered users, called Redditors, to post content. Each post is open to the entire Reddit community to vote upon, either by downvotes or upvotes. The most popular posts are featured directly on the front page. Subreddits are available by category, and Redditors can follow selected subreddits relevant to their interests and control what content they see on their custom front page. Some of the most popular subreddits are r/AskReddit and r/AMA, the "Ask Me Anything" format. According to the company, Reddit hosted 1,800 AMAs in 2018, with a wide range of topics and hosts. One of the most popular Reddit AMAs of 2022 by number of upvotes was by actor Nicolas Cage, with more than 238.5 thousand upvotes.

Reddit usage
The United States accounts for the biggest share of Reddit's desktop traffic, followed by the UK and Canada. As of March 2023, Reddit ranked among the most popular social media websites in the United States.
Pushshift makes available all the submissions and comments posted on Reddit between June 2005 and April 2019. The dataset consists of 651,778,198 submissions and 5,601,331,385 comments posted on 2,888,885 subreddits.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The World Wide Web is a complex interconnected digital ecosystem, where information and attention flow between platforms and communities throughout the globe. These interactions co-construct how we understand the world, reflecting and shaping public discourse. Unfortunately, researchers often struggle to understand how information circulates and evolves across the web because platform-specific data is often siloed and restricted by linguistic barriers. To address this gap, we present a comprehensive, multilingual dataset capturing all Wikipedia links shared in posts and comments on Reddit from 2020 to 2023, excluding those from private and NSFW subreddits. Each linked Wikipedia article is enriched with revision history, page view data, article ID, redirects, and Wikidata identifiers. Through a research agreement with Reddit, our dataset ensures user privacy while providing a query and ID mechanism that integrates with the Reddit and Wikipedia APIs. This enables extended analyses for researchers studying how information flows across platforms. For example, Reddit discussions draw on Wikipedia for deliberation and fact-checking, which in turn influences Wikipedia content by driving traffic to articles or inspiring edits. By analyzing the relationship between information shared and discussed on these platforms, our dataset provides a foundation for examining the interplay between social media discourse and collaborative knowledge consumption and production.
The motivations for this dataset stem from the challenges researchers face in studying the flow of information across the web. While the World Wide Web enables global communication and collaboration, data silos, linguistic barriers, and platform-specific restrictions hinder our ability to understand how information circulates, evolves, and impacts public discourse. Wikipedia and Reddit, as major hubs of knowledge sharing and discussion, offer an invaluable lens into these processes. However, without comprehensive data capturing their interactions, researchers are unable to fully examine how platforms co-construct knowledge. This dataset bridges this gap, providing the tools needed to study the interconnectedness of social media and collaborative knowledge systems.
WikiReddit is a comprehensive dataset capturing all Wikipedia mentions (including links) shared in posts and comments on Reddit from 2020 to 2023, excluding those from private and NSFW (not safe for work) subreddits. The SQL database comprises 336K total posts, 10.2M comments, 1.95M unique links, and 1.26M unique articles spanning 59 languages on Reddit and 276 Wikipedia language subdomains. Each linked Wikipedia article is enriched with its revision history and page view data within a ±10-day window of its posting, as well as article ID, redirects, and Wikidata identifiers. Supplementary anonymous metadata from Reddit posts and comments further contextualizes the links, offering a robust resource for analysing cross-platform information flows, collective attention dynamics, and the role of Wikipedia in online discourse.
Data was collected from the Reddit4Researchers and Wikipedia APIs. No personally identifiable information is published in the dataset. Reddit data is linked to Wikipedia via the hyperlinks and article titles appearing in Reddit posts.
Extensive processing with tools such as regex was applied to the Reddit post/comment text to extract the Wikipedia URLs. Redirects for Wikipedia URLs and article titles were found through the API and mapped to the collected data. Reddit IDs are hashed with SHA-256 for post/comment/user/subreddit anonymity.
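For illustration, the SHA-256 hashing step might look like the following in Python; whether the original pipeline salted the identifiers before hashing is not stated, so this unsalted version is an assumption:

```python
import hashlib

# Hash a Reddit identifier with SHA-256, as in the anonymisation step above.
# Salting (if any) in the original pipeline is unknown; none is applied here.
def anonymise_id(reddit_id: str) -> str:
    return hashlib.sha256(reddit_id.encode("utf-8")).hexdigest()

print(anonymise_id("t3_abc123"))  # deterministic 64-character hex digest
```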
We foresee several applications of this dataset and preview four here. First, Reddit linking data can be used to understand how attention is driven from one platform to another. Second, it can shed light on how Wikipedia's archive of knowledge is used in the larger social web. Third, our dataset could provide insights into how external attention is topically distributed across Wikipedia, and extend that analysis to the disparities in which types of external communities Wikipedia is used in, and how it is used. Fourth, and relatedly, a topic analysis of our dataset could reveal how Wikipedia usage on Reddit contributes to societal benefits and harms. Our dataset could help examine whether homogeneity within the Reddit and Wikipedia audiences shapes topic patterns, and assess whether these relationships mitigate or amplify problematic engagement online.
The dataset is publicly shared with a Creative Commons Attribution 4.0 International license. The article describing this dataset should be cited: https://doi.org/10.48550/arXiv.2502.04942
Patrick Gildersleve will maintain this dataset, and add further years of content as and when available.
posts
Column Name | Type | Description |
---|---|---|
subreddit_id | TEXT | The unique identifier for the subreddit. |
crosspost_parent_id | TEXT | The ID of the original Reddit post if this post is a crosspost. |
post_id | TEXT | Unique identifier for the Reddit post. |
created_at | TIMESTAMP | The timestamp when the post was created. |
updated_at | TIMESTAMP | The timestamp when the post was last updated. |
language_code | TEXT | The language code of the post. |
score | INTEGER | The score (upvotes minus downvotes) of the post. |
upvote_ratio | REAL | The ratio of upvotes to total votes. |
gildings | INTEGER | Number of awards (gildings) received by the post. |
num_comments | INTEGER | Number of comments on the post. |
comments
Column Name | Type | Description |
---|---|---|
subreddit_id | TEXT | The unique identifier for the subreddit. |
post_id | TEXT | The ID of the Reddit post the comment belongs to. |
parent_id | TEXT | The ID of the parent comment (if a reply). |
comment_id | TEXT | Unique identifier for the comment. |
created_at | TIMESTAMP | The timestamp when the comment was created. |
last_modified_at | TIMESTAMP | The timestamp when the comment was last modified. |
score | INTEGER | The score (upvotes minus downvotes) of the comment. |
upvote_ratio | REAL | The ratio of upvotes to total votes for the comment. |
gilded | INTEGER | Number of awards (gildings) received by the comment. |
postlinks
Column Name | Type | Description |
---|---|---|
post_id | TEXT | Unique identifier for the Reddit post. |
end_processed_valid | INTEGER | Whether the extracted URL from the post resolves to a valid URL. |
end_processed_url | TEXT | The extracted URL from the Reddit post. |
final_valid | INTEGER | Whether the final URL from the post resolves to a valid URL after redirections. |
final_status | INTEGER | HTTP status code of the final URL. |
final_url | TEXT | The final URL after redirections. |
redirected | INTEGER | Indicator of whether the posted URL was redirected (1) or not (0). |
in_title | INTEGER | Indicator of whether the link appears in the post title (1) or post body (0). |
commentlinks
Column Name | Type | Description |
---|---|---|
comment_id | TEXT | Unique identifier for the Reddit comment. |
end_processed_valid | INTEGER | Whether the extracted URL from the comment resolves to a valid URL. |
end_processed_url | TEXT | The extracted URL from the comment. |
final_valid | INTEGER | Whether the final URL from the comment resolves to a valid URL after redirections. |
final_status | INTEGER | HTTP status code of the final URL. |
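As a usage illustration of the schemas above, the following sketch queries the database with Python's sqlite3 module; the filename is a placeholder and the storage engine is an assumption (shown here with SQLite). Column names come from the posts and postlinks tables:

```python
import sqlite3

# Placeholder database filename; substitute the actual dataset file.
conn = sqlite3.connect("wikireddit.db")

# Top ten highest-scoring posts whose extracted link resolved to a valid URL.
rows = conn.execute(
    """
    SELECT p.post_id, p.score, p.num_comments, l.final_url
    FROM posts AS p
    JOIN postlinks AS l ON l.post_id = p.post_id
    WHERE l.final_valid = 1
    ORDER BY p.score DESC
    LIMIT 10
    """
).fetchall()

for row in rows:
    print(row)
```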
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains mentions of the QAnon conspiracy theory across the Web between 28 October 2017 and 1 November 2018. The following list details the data per platform and its collection process:
4chan: Posts and comments on 4chan/pol/ mentioning "Q" or "QAnon". The data is collected through 4CAT, a data capturing and analysis tool that hosts all posts and comments made on 4chan/pol/ since 2014.
8chan: Posts and comments on the /qresearch/ board and other smaller boards mentioning "Q" or "QAnon". The data is derived from qanon.news, a grassroots archive. Given its amateur nature, the dataset is likely not 100% complete, but it still includes over 200,000 posts.
Reddit: Comments made on politically-oriented subreddits mentioning "Q" or "QAnon". The data is gathered through the Pushshift API.
YouTube: Videos mentioning QAnon or "Q" in the title or video description. The data is collected via the YouTube v3 API using the search endpoint. Multiple keywords were queried ("qanon", "qanon 4chan", etc.) to collect a large sample. False positives were then filtered out manually.
Breitbart: Disqus comments on Breitbart.com mentioning QAnon or "Q". The data was gathered by crawling all of Breitbart.com in the timeframe and using the Disqus API.
Online news media: Articles from English online news sources mentioning QAnon. The data is derived from Nexis Uni and ContextualWeb Search by searching for "QAnon". Irrelevant sources and false positives were filtered manually.
The datasets include timestamps, text bodies, and platform-specific information like subreddits and channel titles. To collect data from 4chan, 8chan, Reddit, and Breitbart, we used the same SQL query, sampled 200 comments, and edited the query so it would yield a sufficient proportion of true positives (>94%). The YouTube and online news media datasets were filtered manually.
For Breitbart and Reddit, the data is anonymised by omitting author information. The online news media article text is omitted because of copyright concerns.
See the article on First Monday for the full collection process.
Data and code to replicate analyses. Visit https://dataone.org/datasets/sha256%3Ab1d67e16fe2eec9a88d92b4fc0e8149182c36d5062b5ec384c744c8292e426f8 for complete metadata about this dataset.
This data set contains anonymized data collected from Reddit (via the Pushshift API) and StackOverflow (from Kaggle's dataset). Each folder includes the data split by trimester. The schema of the StackOverflow and Reddit-related files follows:

Fields from StackOverflow
- question_id
- answer_id
- creation_date - answer creation date
- score - score of the question/answer
- tags - all tags flagged for a question
- answer_count - number of answers for a question
- start_question - question's time of creation
- last_activity_date - last update on the question
- new_id - hashed id of the answerer
- q_new_id - hashed id of the questioner

Fields from Reddit
- comment_id
- submission_id
- score - score of the question/submission
- subreddit
- created_utc - time of creation (unrelated to last modified comments)
- new_id - hashed id

The .txt files represent the structure of the corresponding hypergraphs.
This code replicates the main tables and figures in "Place Your Bets? The Value of Investment Research on Reddit's Wallstreetbets" using pseudo-data.
During the first quarter of 2025, online forum and news aggregator Reddit saw approximately 108.1 million daily active users (DAU) engaging with its platform. Of these, over 59.4 million users were not logged in and accessed the platform's content without signing in to a Reddit account. This represents an increase of approximately 6.8 percent compared to the previous quarter, when Reddit saw 55.6 million logged-off DAU.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
One file containing posts to Reddit's r/Scholar forum, scraped from April 2014 through August 2014. The following variables are available for each post:
* DayOfWeek
* Date
* Week
* Title
* User
* Description
* Link
* FirstLinkProvided
* FirstLinkProvidedDomain
* UserSuppliedISBN
* UserSuppliedPMID
* UserSuppliedDOI

Preparation:
(1) Scraping: Scraping took place via Google Sheets and Google Scripts. The Google Sheets code used to scrape Reddit is available here: https://ctrlq.org/code/19600-reddit-scraper-script?_ga=2.69874256.1771062494.1515537454-939238282.1515537453
(2) Cleaning: After scraping, data was cleaned and enriched in Excel. Extraneous HTML tags were removed, and Excel's Text to Columns command was used to separate the posts into more easily analyzable chunks. Excel's IF, SEARCH, and FIND functions were used to extract the FirstLinkProvided, UserSuppliedISBN, UserSuppliedPMID, and UserSuppliedDOI information. OpenRefine (http://openrefine.org) was used to turn the FirstLinkProvided values into the FirstLinkProvidedDomain values.

Caveats:
* Some posts requested more than one item. If two or more URLs were supplied in a post, only the first was captured by the URL extraction method used in Excel.
* To allow for verification of the scrape's accuracy, a link back to each original post on Reddit is included. (These links also allow researchers to see the comments on the original post, which may have been made after the scrape took place and are not included in this file.) Links to original posts reveal user information. Because identification of users would be trivial for someone who followed links to the original posts, no attempt was made to anonymize user IDs in this file.
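A rough Python equivalent of the Excel extraction described under (2); the patterns below are illustrative assumptions, not the original spreadsheet formulas:

```python
import re

# first_link mirrors the FirstLinkProvided extraction; user_supplied_doi
# mirrors UserSuppliedDOI. Both patterns are assumptions for illustration.
def first_link(text: str) -> str | None:
    m = re.search(r"https?://\S+", text)
    return m.group(0) if m else None

def user_supplied_doi(text: str) -> str | None:
    m = re.search(r"\b10\.\d{4,9}/\S+", text)
    return m.group(0) if m else None

post = "Requesting https://doi.org/10.1000/xyz123 please and thank you"
print(first_link(post))         # https://doi.org/10.1000/xyz123
print(user_supplied_doi(post))  # 10.1000/xyz123
```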
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
A dataset including two CSVs collecting Reddit posts, with the following columns:
- Permalink: the post's link
- Title: the post's title
- author: the author's name
- authorUrl: the URL of the author's profile
- commentCount: the number of comments related to the post
- id: the id of the post
- createdDate: the creation date of the Reddit post
- query: contains the URL of the board
- category: the category related to the subreddit post
- score: post score
- awardCount: number of awards
- silverCount: number of silver gildings
- goldCount: number of gold gildings
- platinumCount: number of platinum gildings
- upvoteRatio: the upvote ratio
Hostnames and YouTube videos within collected posts on 4chan /pol/ and political subreddits
Context
As anyone who's been keeping track for the last ten years can tell you, the world of cryptocurrency moves fast. Its movements are all too often supported or hindered by viral fads - be it posts on Reddit, Twitter takes, or something else entirely. We have compiled a month of the most famous cryptocurrency subreddit, /r/Bitcoin, into two convenient CSV files, creating a large cryptocurrency dataset for both enterprise and academic use.
For a larger version, please see our Reddit /r/Bitcoin dataset.
Content
This dataset contains a comprehensive collection of posts and comments mentioning AAPL in their title and body text, respectively. The data is procured using SocialGrep.
To preserve users' anonymity and to prevent targeted harassment, the data does not include usernames.
Acknowledgements
This dataset was created using SocialGrep Exports. If social data analysis is your thing, we also have a good Reddit search tool.
We would also like to thank André François McKenzie for providing us with the background image for this dataset.
Inspiration
Cryptocurrency is still a new topic in everyone's minds. It fluctuates wildly as time goes on - can we predict any future trends from seeing the public opinion shift?
CC-BY
Original Data Source: Reddit /r/Bitcoin Data for Jun 2022