44 datasets found
  1. Dataplex: Reddit Data | Global Social Media Data | 2.1M+ subreddits: trends,...

    • datarade.ai
    .json, .csv
    Cite
    Dataplex (2024). Dataplex: Reddit Data | Global Social Media Data | 2.1M+ subreddits: trends, audience insights + more | Ideal for Interest-Based Segmentation [Dataset]. https://datarade.ai/data-products/dataplex-reddit-data-global-social-media-data-1-1m-mill-dataplex
    Explore at:
    Available download formats: .json, .csv
    Dataset authored and provided by
    Dataplex
    Area covered
    Mexico, Holy See, Christmas Island, Jersey, Chile, Gambia, Macao, Botswana, Martinique, Côte d'Ivoire
    Description

    The Reddit Subreddit Dataset by Dataplex offers a comprehensive and detailed view of Reddit’s vast ecosystem, now enhanced with appended AI-generated columns that provide additional insights and categorization. This dataset includes data from over 2.1 million subreddits, making it an invaluable resource for a wide range of analytical applications, from social media analysis to market research.

    Dataset Overview:

    This dataset includes detailed information on subreddit activities, user interactions, post frequency, comment data, and more. The inclusion of AI-generated columns adds an extra layer of analysis, offering sentiment analysis, topic categorization, and predictive insights that help users better understand the dynamics of each subreddit.

    2.1 Million Subreddits with Enhanced AI Insights:

    The dataset covers over 2.1 million subreddits and now includes AI-enhanced columns that provide:

    • Sentiment Analysis: AI-driven sentiment scores for posts and comments, allowing users to gauge community mood and reactions.
    • Topic Categorization: Automated categorization of subreddit content into relevant topics, making it easier to filter and analyze specific types of discussions.
    • Predictive Insights: AI models that predict trends, content virality, and user engagement, helping users anticipate future developments within subreddits.

    Sourced Directly from Reddit:

    All social media data in this dataset is sourced directly from Reddit, ensuring accuracy and authenticity. The dataset is updated regularly, reflecting the latest trends and user interactions on the platform. This ensures that users have access to the most current and relevant data for their analyses.

    Key Features:

    • Subreddit Metrics: Detailed data on subreddit activity, including the number of posts, comments, votes, and user participation.
    • User Engagement: Insights into how users interact with content, including comment threads, upvotes/downvotes, and participation rates.
    • Trending Topics: Track emerging trends and viral content across the platform, helping you stay ahead of the curve in understanding social media dynamics.
    • AI-Enhanced Analysis: Utilize AI-generated columns for sentiment analysis, topic categorization, and predictive insights, providing a deeper understanding of the data.

    Use Cases:

    • Social Media Analysis: Researchers and analysts can use this dataset to study online behavior, track the spread of information, and understand how content resonates with different audiences.
    • Market Research: Marketers can leverage the dataset to identify target audiences, understand consumer preferences, and tailor campaigns to specific communities.
    • Content Strategy: Content creators and strategists can use insights from the dataset to craft content that aligns with trending topics and user interests, maximizing engagement.
    • Academic Research: Academics can explore the dynamics of online communities, studying everything from the spread of misinformation to the formation of online subcultures.

    Data Quality and Reliability:

    The Reddit Subreddit Dataset emphasizes data quality and reliability. Each record is carefully compiled from Reddit’s vast database, ensuring that the information is both accurate and up-to-date. The AI-generated columns further enhance the dataset's value, providing automated insights that help users quickly identify key trends and sentiments.

    Integration and Usability:

    The dataset is provided in a format that is compatible with most data analysis tools and platforms, making it easy to integrate into existing workflows. Users can quickly import, analyze, and utilize the data for various applications, from market research to academic studies.

    User-Friendly Structure and Metadata:

    The data is organized for easy navigation and analysis, with metadata files included to help users identify relevant subreddits and data points. The AI-enhanced columns are clearly labeled and structured, allowing users to efficiently incorporate these insights into their analyses.

    Ideal For:

    • Data Analysts: Conduct in-depth analyses of subreddit trends, user engagement, and content virality. The dataset’s extensive coverage and AI-enhanced insights make it an invaluable tool for data-driven research.
    • Marketers: Use the dataset to better understand your target audience, tailor campaigns to specific interests, and track the effectiveness of marketing efforts across Reddit.
    • Researchers: Explore the social dynamics of online communities, analyze the spread of ideas and information, and study the impact of digital media on public discourse, all while leveraging AI-generated insights.

    This dataset is an essential resource for anyone looking to understand the intricacies of Reddit's vast ecosystem, offering the data and AI-enhanced insights needed to drive informed decisions and strategies across various fields. Whether you’re tracking emerging trends, analyzing user behavior, or conduc...

  2. Gendered terms in 100 most popular subreddits

    • figshare.com
    zip
    Updated Jun 10, 2018
    Cite
    Mike Thelwall; Emma Stuart (2018). Gendered terms in 100 most popular subreddits [Dataset]. http://doi.org/10.6084/m9.figshare.6470930.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Jun 10, 2018
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Mike Thelwall; Emma Stuart
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Gendered terms in 100 most popular subreddits, as judged by a chi-squared test. Spreadsheets are organised by subjectively determined subreddit theme. This is data associated with the paper: She’s Reddit: A new source of Gendered interest information? with Emma Stuart.

  3. Cross-mentions between 4chan and Reddit

    • data.niaid.nih.gov
    • zenodo.org
    Updated Oct 31, 2023
    Cite
    Hagen, Sal (2023). Cross-mentions between 4chan and Reddit [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10059084
    Explore at:
    Dataset updated
    Oct 31, 2023
    Dataset authored and provided by
    Hagen, Sal
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The included datasets form the basis of the article "No space for Reddit spacing: Mapping the reflexive relationship between groups on 4chan and Reddit", published in Social Media + Society. They include cross-mentions between 4chan and Reddit, as well as various metrics associated with these cross-references. The timeframe ranges from the earliest date available (for Reddit: June 2006; for 4chan/b/: April 2006; for 4chan/pol/: December 2013) and ends in January 2023 (except for the 4chan/b/ dataset, which ends in December 2008). The datasets specifically entail the following:

    1. Cross-mentions from Reddit to 4chan (reddit-mentions-to-4chan.csv)

    I used the Pushshift API's search endpoint to fetch Reddit comments (so no opening posts) with the keyword "4chan" (note: this Pushshift functionality is now deprecated). I also used a rudimentary filter to remove posts by bots, specifically by 1) deleting posts from every account that had "bot" or "auto" in the username and 2) removing all posts by authors with 100 or more contributions and which I manually identified as automated accounts. I removed URL-only cross-references, i.e. posts that only mentioned "://boards.4chan.org" or "://boards.4channel.org" without another 4chan reference. This resulted in 2,638,621 "4chan" references across Reddit.

    2. Cross-mentions from 4chan/pol/ to Reddit (4chan-pol_mentions-of-reddit.csv)

    With a complete dataset of /pol/ collected through 4CAT, I queried for "reddit" or the common synonym "plebbit", case-insensitive, with post- and suffixes allowed (e.g. "Redditor"). I removed URL-only cross-references, i.e. posts that only mentioned "://reddit.com/", "www.reddit.com/", or "i.reddit.com/" without another Reddit reference. This resulted in 1,640,273 "Reddit" references on /pol/.

    3. Cross-mentions from 4chan/b/ to Reddit (4chan-b_mentions-to-reddit.csv)

    I extracted five million posts from Jason Scott's 4chan/b/ dump. I then queried for "reddit" or the common synonym "plebbit", case-insensitive, with post- and suffixes allowed (e.g. "Redditor"). I removed URL-only cross-references, i.e. posts that only mentioned "://reddit.com/", "www.reddit.com/", or "i.reddit.com/" without another Reddit reference. This resulted in 1,287 "Reddit" references on /b/. See Hagen (2020) for more information on the 4chan/b/ dataset.

    4. Cross-mention metrics (cross-mention-metrics.xlsx)

    I extracted the following metrics from the datasets above:

    4.1 The total number of cross-mentions, absolute and relative, per month. This simply used the monthly counts from datasets 1 and 2.
    4.2 The most mentioned subreddits on /pol/, per year. Using the regular expression: r\/[a-zA-Z_]
    4.3 Subreddits that mention 4chan most often, per year
    4.4 4chan boards mentioned across Reddit, per month
    4.5 4chan boards mentioned by subreddits

    I counted every subreddit- or board-mention per post instead of total occurrences. For 4.4 and 4.5, I used the following regular expression to extract 4chan board names:

    (\s|^|4chan)\/(a|b|c|d|e|f|g|gif|h|hr|k|m|o|p|t|v|vg|vm|vmg|vr|vrpg|vst|w|wg|i|ic|r9k|s4s|vip|qa|cm|hm|lgbt|y|3|aco|adv|an|bant|biz|cgl|ck|co|diy|fa|fit|gd|hc|his|int|jp|lit|mlp|mu|n|news|out|po|pol|pw|qst|sci|soc|sp|tg|toy|trv|tv|vp|vt|wsg|wsr|x|xs|new)\/(\s|$)

    I also omitted 4chan's /r/, /u/, and /s/ boards; despite their small scale, they appeared as false positives due to their unrelated vernacular meaning on Reddit (e.g. /u/ as a username prefix). 4.5 was also transformed and included as a Gephi network file (subreddit-board-mentions.gephi). Lastly, I also included:

    4.6 The total amount of posts on 4chan and Reddit. This was used to calculate 4.1. It uses Pushshift's database statistics (which as of Nov. 2023 requires a login; see this Pastebin for an alternative) and metrics of total 4chan post counts from 4stats.io.

    Each of these metrics has its own corresponding tab in the Excel file.

    5. Co-words of "4chan" and "reddit" in the cross-mentions (co-words.xslx)

    Using datasets 1, 2, and 3, I extracted the top ten words appearing directly next to "4chan" on Reddit, and next to "Reddit" on 4chan, per year. I first pre-processed the text, which involved tokenisation, filtering of unwanted text elements like URLs, stop word removal (I whitelisted back), and lemmatisation. For the co-word extraction I used a window size of two. I excluded a range of semantically uninteresting words or commonly used hate speech terms prevalent throughout 4chan.

    6. Annotated cross-mentions between Reddit and 4chan/pol/ in September 2014 (annotations_4chanpol-2014.csv, annotations_reddit-2014-kotakuinaction-anonimised.csv, annotations_reddit-2014-tumblrinaction-anonimised.csv)

    I extracted cross-mentions from /pol/ to Reddit and from Reddit to 4chan in September 2014 for close-reading and annotation.

    The author names are removed for all datasets.
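The board-extraction step described for metrics 4.4 and 4.5 can be sketched roughly as follows. This is only an illustration, not the author's script: the regular expression is copied from the description, while the function names and the sample posts are hypothetical. Each board is counted once per post, in line with the "per post instead of total occurrences" rule.

```python
import re
from collections import Counter

# Regular expression quoted in the dataset description (boards /r/, /u/, /s/ omitted).
BOARD_RE = re.compile(
    r"(\s|^|4chan)\/(a|b|c|d|e|f|g|gif|h|hr|k|m|o|p|t|v|vg|vm|vmg|vr|vrpg|vst|w|wg"
    r"|i|ic|r9k|s4s|vip|qa|cm|hm|lgbt|y|3|aco|adv|an|bant|biz|cgl|ck|co|diy|fa|fit"
    r"|gd|hc|his|int|jp|lit|mlp|mu|n|news|out|po|pol|pw|qst|sci|soc|sp|tg|toy|trv"
    r"|tv|vp|vt|wsg|wsr|x|xs|new)\/(\s|$)"
)


def boards_in_post(text):
    """Return the set of board names mentioned in one post (hypothetical helper)."""
    return {match.group(2) for match in BOARD_RE.finditer(text)}


def count_board_mentions(posts):
    """Count each board at most once per post across a list of post texts."""
    counts = Counter()
    for text in posts:
        counts.update(boards_in_post(text))
    return counts


if __name__ == "__main__":
    sample_posts = [
        "check 4chan/pol/ today",
        "/b/ is old and /b/ again",
        "see /r/ and /u/ spez",
    ]
    print(count_board_mentions(sample_posts))
```

Because the pattern consumes the whitespace after a match, two board names separated by a single space may not both be found; the sketch keeps the quoted expression unchanged rather than correcting for this.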

  4. Reddit blackout announcements: 2023 API protest

    • search.dataone.org
    • data.niaid.nih.gov
    • +1 more
    Updated Feb 8, 2024
    Cite
    Ben Pettis (2024). Reddit blackout announcements: 2023 API protest [Dataset]. http://doi.org/10.5061/dryad.qfttdz0qd
    Explore at:
    Dataset updated
    Feb 8, 2024
    Dataset provided by
    Dryad Digital Repository
    Authors
    Ben Pettis
    Time period covered
    Jan 1, 2024
    Description

    Starting June 12, 2023, many Reddit communities (subreddits) began a protest in which they "went dark" by changing to private mode, in response to Reddit's plans to change its API access policies and fee structure. Supporters of the protest criticize the planned changes for being prohibitively expensive for 3rd party apps. Beyond 3rd party apps, there is significant concern that the API changes are a move by the platform to increase monetization, degrade the user experience, and eventually kill off other custom features such as the old.reddit.com interface, the Reddit Enhancement Suite browser extension, and more. Additionally, there are concerns that the API changes will impede the ability of subreddit moderators (who are all unpaid users) to access tools to keep their communities on-topic and free of spam. This dataset includes the "stickied" posts that appeared on 5,351 subreddits on June 11, 2023 and June 12, 2023 - including many subreddits announcing their plans to pa...

    The list of subreddits was created from the list of participating subreddits that had been collated in the /r/ModCoord subreddit. An initial Python script looks at three Reddit posts and grabs the list of participating subreddits:

    https://www.reddit.com/r/ModCoord/comments/1401qw5/incomplete_and_growing_list_of_participating/ https://www.reddit.com/r/ModCoord/comments/143fzf6/incomplete_and_growing_list_of_participating/ https://www.reddit.com/r/ModCoord/comments/146ffpb/incomplete_and_growing_list_of_participating/

    It uses the requests library to get the HTTP response body. Then it uses re to search for links that look like r/iphone, which is how the list appears in the posts. Next it's just a bit of string cleanup and then writing to an output file. This script does not use the Reddit API at all. It's just basic HTTP requests. A second Python script then reads that list and uses the Reddit API to request information about current posts in each subr...
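The link-matching and cleanup step of the first script might look roughly like the sketch below. It is not the dataset's own list-subreddits.py: the pattern, the function name, and the sample HTML are assumptions, and the HTTP request itself is left out so the example runs offline on a response body that has already been fetched.

```python
import re

# Assumed pattern: "r/" followed by a subreddit-like name
# (letters, digits, underscores). The original script's pattern is not shown.
SUBREDDIT_LINK = re.compile(r"\br/([A-Za-z0-9_]{2,21})")


def extract_subreddits(body):
    """Return a sorted, de-duplicated, lower-cased list of subreddit names."""
    names = {name.lower() for name in SUBREDDIT_LINK.findall(body)}
    return sorted(names)


if __name__ == "__main__":
    # Hypothetical fragment of a response body.
    html = '<a href="/r/iphone">r/iphone</a>, r/Apple and r/iphone again'
    for name in extract_subreddits(html):
        print(name)
```

Writing the result one name per line would give a file in the same shape as the subreddit-list.txt described for this dataset.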

    Reddit Blackout Announcements - 2023 API Protest

    This dataset includes the list of scraped subreddits, a single CSV file for each subreddit, and a copy of the Python scripts used to scrape the data.

    Description of the data and file structure

    The dataset is uploaded as a single .zip file. Once it is downloaded and decompressed, it will include several files and directories. Here is how they are organized:

    .
    ├── subreddit-list.txt
    ├── CSVs
    │   ├── [subreddit-name].csv
    │   └── [...]
    ├── code
    │   └── [...]
    └── parsed TXTs
        ├── API.txt
        ├── blackout.txt
        ├── community.txt
        ├── mod-team.txt
        ├── moderator.txt
        ├── platform.txt
        └── protest.txt

    Subreddit List

    The subreddit-list.txt file contains a list of 5,351 subreddit names. Each appears on its own line. This list was generated using the list-subreddits.py script, as described below.

    Stickied Posts - CSVs

    The "CSVs" directory contains 5,351 CSV (Comma Separated Value) files, each named ...

  5. Data from: Reddit Dataset on Meme Stock: GameStop

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 8, 2023
    Cite
    Han, Jing (2023). Reddit Dataset on Meme Stock: GameStop [Dataset]. http://doi.org/10.7910/DVN/TUMIPC
    Explore at:
    Dataset updated
    Nov 8, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Han, Jing
    Description

    This dataset collects one-year Reddit posts, post metadata, post title sentiments, and post comments threads from subreddit: r/GME, r/superstonk, r/DDintoGME, and r/GMEJungle.

  6. Reddit May 2019 Submissions

    • dataverse.harvard.edu
    • search.dataone.org
    Updated Jul 7, 2019
    Cite
    Jason Baumgartner (2019). Reddit May 2019 Submissions [Dataset]. http://doi.org/10.7910/DVN/JVI8CT
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jul 7, 2019
    Dataset provided by
    Harvard Dataverse
    Authors
    Jason Baumgartner
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Dataset Metrics

    Total size of data uncompressed: 59,515,177,346 bytes
    Number of objects (submissions): 19,456,493
    Reddit API Documentation: https://www.reddit.com/dev/api/

    Overview

    This dataset contains all available submissions from Reddit during the month of May, 2019 (using UTC time boundaries). The data has been split to accommodate the file upload limitations for dataverse. Each file is a collection of json objects (ndjson). Each file was then compressed using zstandard compression (https://facebook.github.io/zstd). The files should be ordered by the id of the submission (represented by the id field). The time that each object was ingested is recorded in the retrieved_on field (in epoch seconds).

    Methodology

    Monthly Reddit ingests are usually started around a week into a new month for the previous month (but could be delayed). This gives submission scores, gildings and num_comments time to "settle" close to their eventual score before Reddit archives the posts (usually done after six months from the post's creation). All submissions are ingested via Reddit's API (using the /api/info endpoint). This is a "best effort" attempt to get all available data at the time of ingest. Due to the nature of Reddit, subreddits can go from private to public at any time, so it's possible more submissions could be found by rescanning missing ids. The author of this dataset highly encourages any researchers to do a sanity check on the data and to rescan for missing ids to ensure all available data has been gathered. If you need assistance, you can contact me directly.

    All efforts were made to capture as much data as possible. Generally, > 95% of all ids are captured. Missing data could be the result of Reddit API errors, submissions that were private during the ingest but then became public and subreddits that were quarantined and were not added to the whitelist before ingesting the data.

    When collecting the data, two scans are done. The first scan of ids using the /api/info endpoint collects all available data. After the first scan, a second scan is done requesting only missing ids from the first scan. This helps to keep the data as complete and comprehensive as possible.

    Contact

    If you have any questions about the data or require more details on the methodology, you are welcome to contact the author.
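The suggested sanity check for missing ids could be approached along these lines. This is a minimal sketch under stated assumptions: it takes already-decompressed ndjson lines (zstandard decompression is not shown), assumes submission ids are sequential base-36 strings, and uses hypothetical function names and sample records.

```python
import json

BASE36_DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"


def to_base36(number):
    """Encode a non-negative integer as a lower-case base-36 string."""
    if number == 0:
        return "0"
    digits = []
    while number:
        number, remainder = divmod(number, 36)
        digits.append(BASE36_DIGITS[remainder])
    return "".join(reversed(digits))


def missing_ids(ndjson_lines):
    """List ids absent between the smallest and largest id that were seen."""
    seen = {
        int(json.loads(line)["id"], 36)
        for line in ndjson_lines
        if line.strip()
    }
    if not seen:
        return []
    return [
        to_base36(value)
        for value in range(min(seen), max(seen) + 1)
        if value not in seen
    ]


if __name__ == "__main__":
    # Hypothetical records; real objects carry many more fields.
    lines = [
        '{"id": "bk0a", "retrieved_on": 1560000000}',
        '{"id": "bk0b", "retrieved_on": 1560000001}',
        '{"id": "bk0d", "retrieved_on": 1560000002}',
    ]
    print(missing_ids(lines))
```

The returned ids are candidates for a rescan; an id in the gap may also simply never have been available, so the list is an upper bound on what a second scan can recover.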

  7. Data from: The Reddit Politosphere: A Large-Scale Text and Network Resource...

    • explore.openaire.eu
    • data.niaid.nih.gov
    • +1 more
    Updated Jan 14, 2022
    Cite
    Valentin Hofmann; Hinrich Schütze; Janet B. Pierrehumbert (2022). The Reddit Politosphere: A Large-Scale Text and Network Resource of Online Political Discourse [Dataset]. http://doi.org/10.5281/zenodo.5851729
    Explore at:
    Dataset updated
    Jan 14, 2022
    Authors
    Valentin Hofmann; Hinrich Schütze; Janet B. Pierrehumbert
    Description

    The Reddit Politosphere is a large-scale resource of online political discourse covering more than 600 political discussion groups over a period of 12 years. Based on the Pushshift Reddit Dataset, it is to the best of our knowledge the largest and ideologically most comprehensive dataset of its type now available. One key feature of the Reddit Politosphere is that it consists of both text and network data. We also release annotated metadata for subreddits and users. Documentation and scripts for easy data access are provided in an associated repository on GitHub.

  8. Reddit Scraper

    • kaggle.com
    Updated Jan 11, 2021
    Cite
    Eliyah Afzal (2021). Reddit Scraper [Dataset]. https://www.kaggle.com/afzale/reddit-scraper/tasks
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jan 11, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Eliyah Afzal
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    A tool created for scraping data on Reddit.com, allowing the user to choose subreddits to crawl through and keywords to search for. Data is acquired by accessing the Reddit API using Python and the Python Reddit API Wrapper (PRAW). The data is stored as a collection of JSON files. This is a reusable tool that can be used for various data collections; I have included an example of its output.

  9. Total global visitor traffic to Reddit.com 2024

    • statista.com
    Updated Nov 11, 2024
    Cite
    Statista (2024). Total global visitor traffic to Reddit.com 2024 [Dataset]. https://www.statista.com/statistics/443332/reddit-monthly-visitors/
    Explore at:
    Dataset updated
    Nov 11, 2024
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    Oct 2023 - Mar 2024
    Area covered
    Worldwide
    Description

    Reddit is a web traffic powerhouse: in March 2024 approximately 2.2 billion visits were measured to the online forum, making it one of the most-visited websites online.

    The front page of the internet

    Formerly known as “the front page of the internet”, Reddit is an online forum platform with over 130,000 sub-forums and communities. The platform allows registered users, called Redditors, to post content. Each post is open to the entire Reddit community to vote upon, either by down- or upvotes. The most popular posts are featured directly on the front page. Subreddits are available by category and Redditors can follow selected subreddits relevant to their interest and also control what content they see on their custom front page. Some of the most popular subreddits are r/AskReddit and r/AMA, the “Ask Me Anything” format. According to the company, Reddit hosted 1,800 AMAs in 2018, with a wide range of topics and hosts. One of the most popular Reddit AMAs of 2022 by number of upvotes was by actor Nicolas Cage, with more than 238.5 thousand upvotes.

    Reddit usage

    The United States accounts for the biggest share of Reddit's desktop traffic, followed by the UK and Canada. As of March 2023, Reddit ranked among the most popular social media websites in the United States.

  10. Pushshift Reddit Dataset

    • paperswithcode.com
    • opendatalab.com
    Cite
    Jason Baumgartner; Savvas Zannettou; Brian Keegan; Megan Squire; Jeremy Blackburn, Pushshift Reddit Dataset [Dataset]. https://paperswithcode.com/dataset/pushshift-reddit
    Explore at:
    Authors
    Jason Baumgartner; Savvas Zannettou; Brian Keegan; Megan Squire; Jeremy Blackburn
    Description

    Pushshift makes available all the submissions and comments posted on Reddit between June 2005 and April 2019. The dataset consists of 651,778,198 submissions and 5,601,331,385 comments posted on 2,888,885 subreddits.

  11. WikiReddit: Tracing Information and Attention Flows Between Online Platforms...

    • zenodo.org
    bin
    Updated May 4, 2025
    Cite
    Patrick Gildersleve; Anna Beers; Viviane Ito; Agustin Orozco; Francesca Tripodi (2025). WikiReddit: Tracing Information and Attention Flows Between Online Platforms [Dataset]. http://doi.org/10.5281/zenodo.14653265
    Explore at:
    Available download formats: bin
    Dataset updated
    May 4, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Patrick Gildersleve; Anna Beers; Viviane Ito; Agustin Orozco; Francesca Tripodi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jan 15, 2025
    Description

    Preprint

    Gildersleve, P., Beers, A., Ito, V., Orozco, A., & Tripodi, F. (2025). WikiReddit: Tracing Information and Attention Flows Between Online Platforms. arXiv [Cs.CY]. https://doi.org/10.48550/arXiv.2502.04942
    Accepted at the International AAAI Conference on Web and Social Media (ICWSM) 2025

    Abstract

    The World Wide Web is a complex interconnected digital ecosystem, where information and attention flow between platforms and communities throughout the globe. These interactions co-construct how we understand the world, reflecting and shaping public discourse. Unfortunately, researchers often struggle to understand how information circulates and evolves across the web because platform-specific data is often siloed and restricted by linguistic barriers. To address this gap, we present a comprehensive, multilingual dataset capturing all Wikipedia links shared in posts and comments on Reddit from 2020 to 2023, excluding those from private and NSFW subreddits. Each linked Wikipedia article is enriched with revision history, page view data, article ID, redirects, and Wikidata identifiers. Through a research agreement with Reddit, our dataset ensures user privacy while providing a query and ID mechanism that integrates with the Reddit and Wikipedia APIs. This enables extended analyses for researchers studying how information flows across platforms. For example, Reddit discussions use Wikipedia for deliberation and fact-checking which subsequently influences Wikipedia content, by driving traffic to articles or inspiring edits. By analyzing the relationship between information shared and discussed on these platforms, our dataset provides a foundation for examining the interplay between social media discourse and collaborative knowledge consumption and production.

    Datasheet

    Motivation

    The motivations for this dataset stem from the challenges researchers face in studying the flow of information across the web. While the World Wide Web enables global communication and collaboration, data silos, linguistic barriers, and platform-specific restrictions hinder our ability to understand how information circulates, evolves, and impacts public discourse. Wikipedia and Reddit, as major hubs of knowledge sharing and discussion, offer an invaluable lens into these processes. However, without comprehensive data capturing their interactions, researchers are unable to fully examine how platforms co-construct knowledge. This dataset bridges this gap, providing the tools needed to study the interconnectedness of social media and collaborative knowledge systems.

    Composition

    WikiReddit is a comprehensive dataset capturing all Wikipedia mentions (including links) shared in posts and comments on Reddit from 2020 to 2023, excluding those from private and NSFW (not safe for work) subreddits. The SQL database comprises 336K total posts, 10.2M comments, 1.95M unique links, and 1.26M unique articles spanning 59 languages on Reddit and 276 Wikipedia language subdomains. Each linked Wikipedia article is enriched with its revision history and page view data within a ±10-day window of its posting, as well as article ID, redirects, and Wikidata identifiers. Supplementary anonymous metadata from Reddit posts and comments further contextualizes the links, offering a robust resource for analysing cross-platform information flows, collective attention dynamics, and the role of Wikipedia in online discourse.

    Collection Process

    Data was collected from the Reddit4Researchers and Wikipedia APIs. No personally identifiable information is published in the dataset. Data from Reddit to Wikipedia is linked via the hyperlink and article titles appearing in Reddit posts.

    Preprocessing/cleaning/labeling

    Extensive processing with tools such as regex was applied to the Reddit post/comment text to extract the Wikipedia URLs. Redirects for Wikipedia URLs and article titles were found through the API and mapped to the collected data. Reddit IDs are hashed with SHA-256 for post/comment/user/subreddit anonymity.
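The ID-hashing step mentioned above can be illustrated with a short sketch. It assumes a plain, unsalted SHA-256 over the UTF-8 bytes of the identifier; the description does not say whether a salt or any other preprocessing was applied, and the function name and sample ID are hypothetical.

```python
import hashlib


def anonymise_id(reddit_id):
    """Return the hex SHA-256 digest of an identifier (assumed: no salt)."""
    return hashlib.sha256(reddit_id.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    # Hypothetical identifier, for illustration only.
    print(anonymise_id("t3_example"))
```

Because the digest is deterministic, the same post, comment, user, or subreddit always maps to the same 64-character value, which is what allows rows to be joined across tables without exposing the original IDs.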

    Uses

    We foresee several applications of this dataset and preview four here. First, Reddit linking data can be used to understand how attention is driven from one platform to another. Second, Reddit linking data can shed light on how Wikipedia's archive of knowledge is used in the larger social web. Third, our dataset could provide insights into how external attention is topically distributed across Wikipedia. Our dataset can help extend that analysis into the disparities in what types of external communities Wikipedia is used in, and how it is used. Fourth, relatedly, a topic analysis of our dataset could reveal how Wikipedia usage on Reddit contributes to societal benefits and harms. Our dataset could help examine if homogeneity within the Reddit and Wikipedia audiences shapes topic patterns and assess whether these relationships mitigate or amplify problematic engagement online.

    Distribution

    The dataset is publicly shared with a Creative Commons Attribution 4.0 International license. The article describing this dataset should be cited: https://doi.org/10.48550/arXiv.2502.04942

    Maintenance

    Patrick Gildersleve will maintain this dataset, and add further years of content as and when available.


    SQL Database Schema

    Table: posts

    Column Name          Type       Description
    subreddit_id         TEXT       The unique identifier for the subreddit.
    crosspost_parent_id  TEXT       The ID of the original Reddit post if this post is a crosspost.
    post_id              TEXT       Unique identifier for the Reddit post.
    created_at           TIMESTAMP  The timestamp when the post was created.
    updated_at           TIMESTAMP  The timestamp when the post was last updated.
    language_code        TEXT       The language code of the post.
    score                INTEGER    The score (upvotes minus downvotes) of the post.
    upvote_ratio         REAL       The ratio of upvotes to total votes.
    gildings             INTEGER    Number of awards (gildings) received by the post.
    num_comments         INTEGER    Number of comments on the post.

    Table: comments

    Column Name          Type       Description
    subreddit_id         TEXT       The unique identifier for the subreddit.
    post_id              TEXT       The ID of the Reddit post the comment belongs to.
    parent_id            TEXT       The ID of the parent comment (if a reply).
    comment_id           TEXT       Unique identifier for the comment.
    created_at           TIMESTAMP  The timestamp when the comment was created.
    last_modified_at     TIMESTAMP  The timestamp when the comment was last modified.
    score                INTEGER    The score (upvotes minus downvotes) of the comment.
    upvote_ratio         REAL       The ratio of upvotes to total votes for the comment.
    gilded               INTEGER    Number of awards (gildings) received by the comment.

    Table: postlinks

    Column Name | Type | Description
    post_id | TEXT | Unique identifier for the Reddit post.
    end_processed_valid | INTEGER | Whether the extracted URL from the post resolves to a valid URL.
    end_processed_url | TEXT | The extracted URL from the Reddit post.
    final_valid | INTEGER | Whether the final URL from the post resolves to a valid URL after redirections.
    final_status | INTEGER | HTTP status code of the final URL.
    final_url | TEXT | The final URL after redirections.
    redirected | INTEGER | Indicator of whether the posted URL was redirected (1) or not (0).
    in_title | INTEGER | Indicator of whether the link appears in the post title (1) or post body (0).
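
    As an illustration of how the posts and postlinks tables relate, the following minimal sketch builds a tiny in-memory SQLite database from a subset of the column names listed above and joins the two tables on post_id. The choice of SQLite, the sample rows, and the helper name valid_links_with_scores are assumptions made for illustration; the published database may use a different engine, types, or constraints.

```python
import sqlite3


def valid_links_with_scores(conn):
    """Return (post_id, score, final_url) for links whose final URL was valid."""
    query = """
        SELECT p.post_id, p.score, l.final_url
        FROM posts AS p
        JOIN postlinks AS l ON l.post_id = p.post_id
        WHERE l.final_valid = 1
        ORDER BY p.score DESC
    """
    return conn.execute(query).fetchall()


# Toy schema: only a few of the documented columns, with assumed SQLite types.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (post_id TEXT, subreddit_id TEXT,
                        score INTEGER, num_comments INTEGER);
    CREATE TABLE postlinks (post_id TEXT, final_valid INTEGER, final_status INTEGER,
                            final_url TEXT, redirected INTEGER, in_title INTEGER);
""")

# Hypothetical rows, not taken from the dataset.
conn.executemany("INSERT INTO posts VALUES (?, ?, ?, ?)", [
    ("p1", "s1", 42, 7),
    ("p2", "s1", 5, 0),
])
conn.executemany("INSERT INTO postlinks VALUES (?, ?, ?, ?, ?, ?)", [
    ("p1", 1, 200, "https://en.wikipedia.org/wiki/Reddit", 0, 0),
    ("p2", 0, 404, "https://en.wikipedia.org/wiki/Missing", 1, 1),
])

rows = valid_links_with_scores(conn)
```

    The same join pattern would apply to comments and commentlinks via comment_id.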

    Table: commentlinks

    Column Name | Type | Description
    comment_id | TEXT | Unique identifier for the Reddit comment.
    end_processed_valid | INTEGER | Whether the extracted URL from the comment resolves to a valid URL.
    end_processed_url | TEXT | The extracted URL from the comment.
    final_valid | INTEGER | Whether the final URL from the comment resolves to a valid URL after redirections.
    final_status | INTEGER | HTTP status code of the final

  12. Cross-platform mentions of the QAnon conspiracy theory

    • data.niaid.nih.gov
    Updated Jun 22, 2021
    Sal Hagen (2021). Cross-platform mentions of the QAnon conspiracy theory [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3758478
    Explore at:
    Dataset updated
    Jun 22, 2021
    Dataset provided by
    Daniël de Zeeuw
    Emilija Jokubauskaitė
    Stijn Peeters
    Sal Hagen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains mentions of the QAnon conspiracy theory across the Web between 28 October 2017 and 1 November 2018. The following list details the data per platform and its collection process:

    4chan: Posts and comments on 4chan/pol/ mentioning "Q" or "QAnon". The data is collected through 4CAT, a data capturing and analysis tool that hosts all posts and comments made on 4chan/pol/ since 2014.

    8chan: Posts and comments on the /qresearch/ board and other smaller boards mentioning "Q" or "QAnon". The data is derived from qanon.news, a grassroots archive. Given the archive's amateur nature, the dataset is likely not fully complete, but it still includes over 200,000 posts.

    Reddit: Comments made on politically-oriented subreddits mentioning "Q" or "QAnon". The data is gathered through the Pushshift API.

    YouTube: Videos mentioning QAnon or "Q" in the title or video description. The data is collected via the YouTube v3 API using the search endpoint. Multiple keywords were queried ("qanon", "qanon 4chan", etc.) to collect a large sample. False positives were then filtered out manually.

    Breitbart: Disqus comments on Breitbart.com mentioning QAnon or "Q". The data was gathered by crawling all of Breitbart.com in the timeframe and using the Disqus API.

    Online news media: Articles from English online news sources mentioning QAnon. The data is derived from Nexis Uni and ContextualWeb Search by searching for "QAnon". Irrelevant sources and false positives were filtered manually.

    The datasets include timestamps, text bodies, and platform-specific information like subreddits and channel titles. To collect data from 4chan, 8chan, Reddit, and Breitbart, we used the same SQL query, sampled 200 comments, and edited the query so that it would yield a sufficient share of true positives (> 94%). The YouTube and online news media datasets were filtered manually.
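
    The query-tuning step described above amounts to estimating precision on a hand-labelled sample. The following is a minimal sketch under stated assumptions: the labels list and the helper name sample_precision are hypothetical, and only the sample size of 200 and the 94% threshold come from the description. It is not the authors' actual procedure.

```python
def sample_precision(labels):
    """Share of manually labelled samples judged to be true positives."""
    if not labels:
        raise ValueError("at least one labelled sample is required")
    return sum(1 for is_hit in labels if is_hit) / len(labels)


# Hypothetical labelling outcome for a sample of 200 comments:
# 190 judged relevant to QAnon, 10 judged false positives.
labels = [True] * 190 + [False] * 10

precision = sample_precision(labels)

# The description asks for more than 94% true positives before accepting a query.
query_is_acceptable = precision > 0.94
```

    If the check fails, the query would be edited and a fresh sample labelled again.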

    For Breitbart and Reddit, the data is anonymised by omitting author information. The online news media article text is omitted because of copyright concerns.

    See the article on First Monday for the full collection process.

  13. Replication Data for: Engagement with fact-checked posts on Reddit

    • search.dataone.org
    Updated Nov 8, 2023
    Bond, Robert; Garrett, R. Kelly (2023). Replication Data for: Engagement with fact-checked posts on Reddit [Dataset]. http://doi.org/10.7910/DVN/E4FHIZ
    Explore at:
    Dataset updated
    Nov 8, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Bond, Robert; Garrett, R. Kelly
    Description

    Data and code to replicate analyses. Visit https://dataone.org/datasets/sha256%3Ab1d67e16fe2eec9a88d92b4fc0e8149182c36d5062b5ec384c744c8292e426f8 for complete metadata about this dataset.

  14. Reddit and StackOverflow dataset (Programming languages)

    • explore.openaire.eu
    Updated Jan 1, 2023
    Daniele De Vinco; Alessia Antelmi (2023). Reddit and StackOverflow dataset (Programming languages) [Dataset]. http://doi.org/10.5281/zenodo.7685061
    Explore at:
    Dataset updated
    Jan 1, 2023
    Authors
    Daniele De Vinco; Alessia Antelmi
    Description

    This data set contains anonymized data collected from Reddit (via the Pushshift API) and StackOverflow (from Kaggle's dataset). Each folder includes the data split by trimester. The schema of the StackOverflow and Reddit-related files follows.

    Fields from StackOverflow:

    • question_id
    • answer_id
    • creation_date - answer creation_date
    • score - score of the question/answer
    • tags - all tags flagged for a question
    • answer_count - number of answers for a question
    • start_question - question's time of creation
    • last_activity_date - last update on the question
    • new_id - hashed id of the answerer
    • q_new_id - hashed id of the questioner

    Fields from Reddit:

    • comment_id
    • submission_id
    • score - score of the question/submission
    • subreddit
    • created_utc - time of creation (unrelated to last modified comments)
    • new_id - hashed id

    The .txt files represent the structure of the corresponding hypergraphs.

  15. Replication Data for: Place Your Bets? The Value of Investment Research on...

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Dec 16, 2023
    Jame, Russell; Dan Bradley; Jan Hanousek Jr; Zicheng Xiao (2023). Replication Data for: Place Your Bets? The Value of Investment Research on Reddit's Wallstreetbets [Dataset]. http://doi.org/10.7910/DVN/X9ZFV9
    Explore at:
    Dataset updated
    Dec 16, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Jame, Russell; Dan Bradley; Jan Hanousek Jr; Zicheng Xiao
    Description

    This code replicates the main tables and figures in "Place Your Bets? The Value of Investment Research on Reddit's Wallstreetbets" using pseudo-data.

  16. Reddit: quarterly number of DAU 2021-2025, by online status

    • statista.com
    Updated May 27, 2025
    Statista (2025). Reddit: quarterly number of DAU 2021-2025, by online status [Dataset]. https://www.statista.com/statistics/1453133/reddit-quarterly-dau-by-online-status/
    Explore at:
    Dataset updated
    May 27, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Area covered
    Worldwide
    Description

    During the first quarter of 2025, online forum and news aggregator Reddit saw approximately 108.1 million daily active users (DAU) engaging with its platform. Of these, over 59.4 million users were not logged in and accessed the platform's content without signing in to a registered account. This represents an increase of approximately 6.8 percent compared to the previous quarter, when Reddit saw 55.6 million logged-out DAU.
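
    The quarter-over-quarter figure can be checked with a short calculation. The sketch below uses only the three values quoted in the description (in millions); the function and variable names are illustrative, and the share of users who were not logged in is a derived figure that does not appear in the source.

```python
def percent_change(previous, current):
    """Relative change from previous to current, expressed in percent."""
    return (current - previous) / previous * 100.0


total_dau = 108.1            # million, Q1 2025
not_logged_in_now = 59.4     # million, Q1 2025
not_logged_in_before = 55.6  # million, previous quarter

# Growth of users who were not logged in, quarter over quarter.
growth = percent_change(not_logged_in_before, not_logged_in_now)

# Derived figure (not stated in the source): share of all DAU not logged in.
not_logged_in_share = not_logged_in_now / total_dau * 100.0

print(f"growth: {growth:.1f}%  share: {not_logged_in_share:.1f}%")
```

    Rounded to one decimal place, the growth matches the 6.8 percent quoted above.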

  17. Reddit Scholar Posts (Scraped from April - August 2014)

    • figshare.com
    html
    Updated May 30, 2023
    Gabriel Gardner (2023). Reddit Scholar Posts (Scraped from April - August 2014) [Dataset]. http://doi.org/10.6084/m9.figshare.5771364.v1
    Explore at:
    Available download formats: html
    Dataset updated
    May 30, 2023
    Dataset provided by
    figshare
    Figshare (http://figshare.com/)
    Authors
    Gabriel Gardner
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    One file containing posts to Reddit's r/Scholar forum. These were scraped from April 2014 through August 2014. The following variables are available for each post:

    • DayOfWeek
    • Date
    • Week
    • Title
    • User
    • Description
    • Link
    • FirstLinkProvided
    • FirstLinkProvidedDomain
    • UserSuppliedISBN
    • UserSuppliedPMID
    • UserSuppliedDOI

    Preparation:

    (1) Scraping: Scraping took place via Google Sheets and Google Scripts. Google Sheets code used to scrape Reddit is available here: https://ctrlq.org/code/19600-reddit-scraper-script?_ga=2.69874256.1771062494.1515537454-939238282.1515537453

    (2) Cleaning: After scraping, data was cleaned and enriched in Excel. Extraneous HTML tags were removed and Excel's Text to Columns command was used to separate the posts into more easily analyzable chunks. Excel's IF, SEARCH, and FIND functions were used to extract the FirstLinkProvided, UserSuppliedISBN, UserSuppliedPMID, and UserSuppliedDOI information. OpenRefine (http://openrefine.org) was used to turn the FirstLinkProvided values into the FirstLinkProvidedDomain values.

    Caveats:

    • Some posts requested more than one item. If two or more URLs were supplied in a post, only the first was captured by the URL extraction method used in Excel.
    • To allow for verification of the scrape's accuracy, a link back to each original post on Reddit is included. (These links also allow researchers to see the comments on the original post, which may have been made after the scrape took place and are not included in this file.) Links to original posts reveal user information. Because identification of users would be a trivial matter for someone who followed links to the original posts, no attempts were made to anonymize user IDs in this file.
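
    The first-link and domain extraction described above was done in Excel and OpenRefine. As a rough Python equivalent, the following sketch captures only the first URL in a post (mirroring the stated caveat) and derives its domain. The regular expression, function names, and sample text are assumptions for illustration and are not the author's original formulas.

```python
import re
from urllib.parse import urlparse

# Simplified pattern (an assumption): any run of non-whitespace after http(s)://
URL_PATTERN = re.compile(r"https?://\S+")


def first_link(text):
    """Return the first URL found in text, or None if there is none."""
    match = URL_PATTERN.search(text)
    return match.group(0) if match else None


def link_domain(url):
    """Return the lower-cased host part of a URL, or None for a missing URL."""
    return urlparse(url).netloc.lower() if url else None


# Hypothetical post body containing two links; only the first is captured.
post = "Need https://journal.example.org/article/1 and https://www.example.com/paper.pdf"

first = first_link(post)
domain = link_domain(first)
```

    A post without any link yields None for both values, which corresponds to an empty FirstLinkProvided cell.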

  18. Smart Contracts Posts and Topic on Reddit

    • data.niaid.nih.gov
    • zenodo.org
    Updated Nov 29, 2023
    Ibba, Giacomo Francesco (2023). Smart Contracts Posts and Topic on Reddit [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10222218
    Explore at:
    Dataset updated
    Nov 29, 2023
    Dataset authored and provided by
    Ibba, Giacomo Francesco
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    A dataset comprising two CSVs that collect Reddit posts, with the following fields:

    • Permalink: The post's link
    • Title: The post's title
    • author: The author's name
    • authorUrl: The URL to the author's profile
    • commentCount: The number of comments related to the post
    • id: The id of the post
    • createdDate: The creation date of the Reddit post
    • query: Contains the URL of the board
    • category: The category related to the subreddit post
    • score: Post score
    • awardCount: Number of awards
    • silverCount: Number of silver gildings
    • goldCount: Number of gold gildings
    • platinumCount: Number of platinum gildings
    • upvoteRatio: The upvote ratio

  19. Replication Data for: The Spread of Political Misinformation on Online...

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 22, 2023
    Burton, Anthony (2023). Replication Data for: The Spread of Political Misinformation on Online Subcultural Platforms [Dataset]. http://doi.org/10.7910/DVN/ZDN6BN
    Explore at:
    Dataset updated
    Nov 22, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Burton, Anthony
    Description

    Hostnames and YouTube videos within collected posts on 4chan /pol/ and political subreddits

  20. Reddit /r/Bitcoin Data for Jun 2022

    • opendatabay.com
    Updated Jun 22, 2025
    Datasimple (2025). Reddit /r/Bitcoin Data for Jun 2022 [Dataset]. https://www.opendatabay.com/data/ai-ml/5ae5899f-7e6a-4519-8b82-8cf6463b478b
    Explore at:
    Dataset updated
    Jun 22, 2025
    Dataset authored and provided by
    Datasimple
    Area covered
    Finance & Banking Analytics
    Description

    Context

    As anyone who's been keeping track for the last ten years can tell you, the world of cryptocurrency moves fast. Its movements are all too often supported or hindered by viral fads, be it posts on Reddit, Twitter takes, or something else entirely. We have compiled a month of the most famous cryptocurrency subreddit, /r/Bitcoin, into two convenient CSV files, creating a large cryptocurrency dataset for both enterprise and academic use.

    For a larger version, please see our Reddit /r/Bitcoin dataset.

    Content

    This dataset contains a comprehensive collection of posts and comments mentioning AAPL in their title and body text respectively. The data is procured using SocialGrep.

    To preserve users' anonymity and to prevent targeted harassment, the data does not include usernames.

    Acknowledgements

    This dataset was created using SocialGrep Exports. If social data analysis is your thing, we also have a good Reddit search tool.

    We would also like to thank André François McKenzie for providing us with the background image for this dataset.

    Inspiration

    Cryptocurrency is still a new topic in everyone's minds. It fluctuates wildly as time goes on. Can we predict any future trends from seeing the public opinion shift?

    License

    CC-BY

    Original Data Source: Reddit /r/Bitcoin Data for Jun 2022

Dataplex (2024). Dataplex: Reddit Data | Global Social Media Data | 2.1M+ subreddits: trends, audience insights + more | Ideal for Interest-Based Segmentation [Dataset]. https://datarade.ai/data-products/dataplex-reddit-data-global-social-media-data-1-1m-mill-dataplex

Dataplex: Reddit Data | Global Social Media Data | 2.1M+ subreddits: trends, audience insights + more | Ideal for Interest-Based Segmentation

Explore at:
Available download formats: .json, .csv
Dataset authored and provided by
Dataplex
Area covered
Mexico, Holy See, Christmas Island, Jersey, Chile, Gambia, Macao, Botswana, Martinique, Côte d'Ivoire
Description

The Reddit Subreddit Dataset by Dataplex offers a comprehensive and detailed view of Reddit’s vast ecosystem, now enhanced with appended AI-generated columns that provide additional insights and categorization. This dataset includes data from over 2.1 million subreddits, making it an invaluable resource for a wide range of analytical applications, from social media analysis to market research.

Dataset Overview:

This dataset includes detailed information on subreddit activities, user interactions, post frequency, comment data, and more. The inclusion of AI-generated columns adds an extra layer of analysis, offering sentiment analysis, topic categorization, and predictive insights that help users better understand the dynamics of each subreddit.

2.1 Million Subreddits with Enhanced AI Insights:

The dataset covers over 2.1 million subreddits and now includes AI-enhanced columns that provide:

  • Sentiment Analysis: AI-driven sentiment scores for posts and comments, allowing users to gauge community mood and reactions.
  • Topic Categorization: Automated categorization of subreddit content into relevant topics, making it easier to filter and analyze specific types of discussions.
  • Predictive Insights: AI models that predict trends, content virality, and user engagement, helping users anticipate future developments within subreddits.

Sourced Directly from Reddit:

All social media data in this dataset is sourced directly from Reddit, ensuring accuracy and authenticity. The dataset is updated regularly, reflecting the latest trends and user interactions on the platform. This ensures that users have access to the most current and relevant data for their analyses.

Key Features:

  • Subreddit Metrics: Detailed data on subreddit activity, including the number of posts, comments, votes, and user participation.
  • User Engagement: Insights into how users interact with content, including comment threads, upvotes/downvotes, and participation rates.
  • Trending Topics: Track emerging trends and viral content across the platform, helping you stay ahead of the curve in understanding social media dynamics.
  • AI-Enhanced Analysis: Utilize AI-generated columns for sentiment analysis, topic categorization, and predictive insights, providing a deeper understanding of the data.

Use Cases:

  • Social Media Analysis: Researchers and analysts can use this dataset to study online behavior, track the spread of information, and understand how content resonates with different audiences.
  • Market Research: Marketers can leverage the dataset to identify target audiences, understand consumer preferences, and tailor campaigns to specific communities.
  • Content Strategy: Content creators and strategists can use insights from the dataset to craft content that aligns with trending topics and user interests, maximizing engagement.
  • Academic Research: Academics can explore the dynamics of online communities, studying everything from the spread of misinformation to the formation of online subcultures.

Data Quality and Reliability:

The Reddit Subreddit Dataset emphasizes data quality and reliability. Each record is carefully compiled from Reddit’s vast database, ensuring that the information is both accurate and up-to-date. The AI-generated columns further enhance the dataset's value, providing automated insights that help users quickly identify key trends and sentiments.

Integration and Usability:

The dataset is provided in a format that is compatible with most data analysis tools and platforms, making it easy to integrate into existing workflows. Users can quickly import, analyze, and utilize the data for various applications, from market research to academic studies.

User-Friendly Structure and Metadata:

The data is organized for easy navigation and analysis, with metadata files included to help users identify relevant subreddits and data points. The AI-enhanced columns are clearly labeled and structured, allowing users to efficiently incorporate these insights into their analyses.

Ideal For:

  • Data Analysts: Conduct in-depth analyses of subreddit trends, user engagement, and content virality. The dataset’s extensive coverage and AI-enhanced insights make it an invaluable tool for data-driven research.
  • Marketers: Use the dataset to better understand your target audience, tailor campaigns to specific interests, and track the effectiveness of marketing efforts across Reddit.
  • Researchers: Explore the social dynamics of online communities, analyze the spread of ideas and information, and study the impact of digital media on public discourse, all while leveraging AI-generated insights.

This dataset is an essential resource for anyone looking to understand the intricacies of Reddit's vast ecosystem, offering the data and AI-enhanced insights needed to drive informed decisions and strategies across various fields. Whether you’re tracking emerging trends, analyzing user behavior, or conduc...
