100+ datasets found
  1. Reddit usage reach in the United States 2024, by age group

    • statista.com
    Updated Feb 17, 2025
    Cite
    Statista (2025). Reddit usage reach in the United States 2024, by age group [Dataset]. https://www.statista.com/statistics/261766/share-of-us-internet-users-who-use-reddit-by-age-group/
    Dataset updated
    Feb 17, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    Feb 1, 2024 - Jun 10, 2024
    Area covered
    United States
    Description

    According to a survey of adults in the United States in 2024, 46 percent of respondents who used Reddit were aged between 19 and 29 years. Reddit usage tends to be affected by users’ age, with older users reporting lower levels of engagement.

    Reddit engagement in numbers

    Reddit is one of the most popular websites in the forum category, allowing users to interact in multiple close-knit communities organized in sub-threads and divided by topics. In March 2024, Reddit.com registered an average of 2.2 billion monthly visits from desktop and mobile combined. Reddit users are mostly based in North America, with the United States accounting for by far the biggest share of traffic worldwide.

    The future of Reddit

    Reddit was created in 2005 and was redesigned for the first time in 2018 to make it more appealing to new users and to increase engagement from non-participating guests (jokingly called “lurkers”) who nonetheless enjoy the content. In February 2024, the company announced it was entering the public market by releasing its S-1 registration statement. In 2024, the company generated around 1.3 billion U.S. dollars in revenue worldwide. This translated into an average revenue per user (ARPU) of around 4.21 dollars in the last quarter of 2024.

  2. Reddit usage reach in the United States 2023, by ethnicity

    • statista.com
    Updated Feb 17, 2025
    Cite
    Statista (2025). Reddit usage reach in the United States 2023, by ethnicity [Dataset]. https://www.statista.com/statistics/261770/share-of-us-internet-users-who-use-reddit-by-ethnicity/
    Dataset updated
    Feb 17, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    Feb 1, 2024 - Jun 10, 2024
    Area covered
    United States
    Description

    According to a survey of internet users conducted in the United States between February and June 2024, 14 percent of Black Americans reported having ever used Reddit. Asian Americans appeared more likely than both Black and white Americans to have ever used the platform, with 36 percent of respondents in this demographic reporting that they had used it.

  3. Reddit usage reach in the United States 2024, by gender

    • statista.com
    Updated Feb 17, 2025
    Cite
    Statista (2025). Reddit usage reach in the United States 2024, by gender [Dataset]. https://www.statista.com/statistics/261765/share-of-us-internet-users-who-use-reddit-by-gender/
    Dataset updated
    Feb 17, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    Feb 1, 2024 - Jun 10, 2024
    Area covered
    United States
    Description

    As of June 2024, 28 percent of male respondents in the United States stated that they used Reddit, compared to 20 percent of female respondents. Reddit is a social networking and online forum company. The platform is organized into thematic groups called subreddits.

  4. Reddit r/AskScience Flair Dataset

    • data.mendeley.com
    Updated May 23, 2022
    + more versions
    Cite
    Sumit Mishra (2022). Reddit r/AskScience Flair Dataset [Dataset]. http://doi.org/10.17632/k9r2d9z999.3
    Dataset updated
    May 23, 2022
    Authors
    Sumit Mishra
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Reddit is a social news, content rating and discussion website. It's one of the most popular sites on the internet. Reddit has 52 million daily active users and approximately 430 million users who use it at least once a month. Reddit has many subreddits; this dataset uses the r/AskScience subreddit.

    The dataset is extracted from the subreddit r/AskScience. The data was collected between 01-01-2016 and 20-05-2022. It contains 612,668 data points and 25 columns. The dataset contains information about the questions asked on the subreddit, including the description of the submission, the flair of the question, NSFW or SFW status, the year of the submission, and more (see the descriptions of individual columns below). The data was extracted using Python and Pushshift's API, with light cleaning done using NumPy and pandas.

    The dataset contains the following columns and descriptions:
    author - Redditor name.
    author_fullname - Redditor full name.
    contest_mode - Contest mode [implement obscured scores and randomized sorting].
    created_utc - Time the submission was created, represented in Unix time.
    domain - Domain of submission.
    edited - If the post is edited or not.
    full_link - Link of the post on the subreddit.
    id - ID of the submission.
    is_self - Whether or not the submission is a self post (text-only).
    link_flair_css_class - CSS class used to identify the flair.
    link_flair_text - Flair on the post, i.e. the link flair’s text content.
    locked - Whether or not the submission has been locked.
    num_comments - The number of comments on the submission.
    over_18 - Whether or not the submission has been marked as NSFW.
    permalink - A permalink for the submission.
    retrieved_on - Time ingested.
    score - The number of upvotes for the submission.
    description - Description of the submission.
    spoiler - Whether or not the submission has been marked as a spoiler.
    stickied - Whether or not the submission is stickied.
    thumbnail - Thumbnail of the submission.
    question - Question asked in the submission.
    url - The URL the submission links to, or the permalink if a self post.
    year - Year of the submission.
    banned - Banned by the moderator or not.

    This dataset can be used for Flair Prediction, NSFW Classification, and different Text Mining/NLP tasks. Exploratory Data Analysis can also be done to get the insights and see the trend and patterns over the years.
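
    As a minimal sketch of the exploratory analysis mentioned above, the snippet below derives a year from the created_utc column (Unix time) and counts flairs per year. It assumes rows are available as dictionaries keyed by the column names listed in the description; the function names are illustrative and not part of the dataset.

```python
from collections import Counter
from datetime import datetime, timezone


def year_from_created_utc(created_utc: int) -> int:
    """Convert a Unix timestamp (seconds) to its calendar year in UTC."""
    return datetime.fromtimestamp(created_utc, tz=timezone.utc).year


def flair_counts_by_year(rows):
    """Count submissions per (year, link_flair_text) pair.

    `rows` is any iterable of dicts, e.g. from csv.DictReader (assumed layout).
    """
    counts = Counter()
    for row in rows:
        year = year_from_created_utc(int(row["created_utc"]))
        counts[(year, row["link_flair_text"])] += 1
    return counts


if __name__ == "__main__":
    sample = [
        {"created_utc": "1451606400", "link_flair_text": "Physics"},
        {"created_utc": "1653004800", "link_flair_text": "Biology"},
    ]
    print(flair_counts_by_year(sample))
```

    The dataset already ships a year column, so recomputing it this way is mainly useful as a consistency check.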

  5. Distribution of Reddit.com traffic 2024, by country

    • statista.com
    Updated Nov 11, 2024
    Cite
    Statista (2024). Distribution of Reddit.com traffic 2024, by country [Dataset]. https://www.statista.com/statistics/325144/reddit-global-active-user-distribution/
    Dataset updated
    Nov 11, 2024
    Dataset authored and provided by
    Statista (http://statista.com/)
    Area covered
    Worldwide
    Description

    In the six months ending March 2024, the United States accounted for 48.46 percent of traffic to the online forum Reddit.com. The United Kingdom ranked second, accounting for 7.16 percent of web visits to the social media platform.

    Reddit in the United States

    In August 2023, Reddit accounted for slightly over 1.6 percent of social media website traffic in the United States. Founded in 2005, Reddit is a discussion website that enables users to aggregate news by posting links and lets other users vote and comment on them. There are thousands of subforums, called subreddits, on a wide range of topics. One of the most popular subreddits is AMA (“Ask Me Anything”), where celebrities, public figures, or people in unique positions post threads that allow other Reddit users to ask them anything. In 2022, Nicolas Cage's AMA post generated over 238.5 thousand upvotes, making it the most popular AMA of the year.

    Reddit users in the United States

    Reddit use in the United States is more prevalent among younger online audiences. A February 2021 survey found that 36 percent of internet users aged 18 to 29 years and 22 percent of users aged 30 to 49 years used Reddit. The reach of the social platform declines strongly with age. Also, whilst around 23 percent of male adults in the U.S. access Reddit, only 12 percent of women do the same.

  6. Reddit Comments Dataset for Text Style Transfer Tasks

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jun 17, 2023
    Cite
    Kopf, Fabian (2023). Reddit Comments Dataset for Text Style Transfer Tasks [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8023141
    Dataset updated
    Jun 17, 2023
    Dataset authored and provided by
    Kopf, Fabian
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Reddit Comments Dataset for Text Style Transfer Tasks

    A dataset of Reddit comments prepared for Text Style Transfer Tasks.

    The dataset contains Reddit comments translated into a formal language. For the translation of Reddit comments into a formal language text-davinci-003 was used. To make text-davinci-003 translate the comments into a more formal version, the following prompt was used: "Here is some text: {original_comment} Here is a rewrite of the text, which is more neutral: {" This prompting technique was taken from A Recipe For Arbitrary Text Style Transfer with Large Language Models.
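
    The prompt quoted above can be assembled with a small helper. This is only a sketch: it assumes the curly braces are literal delimiters around the comment and the expected rewrite, so the model's completion would end at the first closing brace. The function names are hypothetical, and no model call is shown.

```python
# Doubled braces are escapes for str.format, so "{{{comment}}}" renders as "{<comment>}".
PROMPT_TEMPLATE = (
    "Here is some text: {{{comment}}} "
    "Here is a rewrite of the text, which is more neutral: {{"
)


def build_prompt(original_comment: str) -> str:
    """Fill the quoted prompt with one Reddit comment."""
    return PROMPT_TEMPLATE.format(comment=original_comment)


def extract_rewrite(completion: str) -> str:
    """Keep only the text before the first closing brace (assumed stop marker)."""
    return completion.split("}", 1)[0].strip()


if __name__ == "__main__":
    print(build_prompt("lol this is wild"))
```

    If the braces in the description are only template placeholders rather than literal characters, the template would need to drop them.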

    The dataset contains comments from the following Subreddits: antiwork, atheism, Conservative, conspiracy, dankmemes, gaybros, leagueoflegends, lgbt, libertarian, linguistics, MensRights, news, offbeat, PoliticalCompassMemes, politics, teenagers, TrueReddit, TwoXChromosomes, wallstreetbets, worldnews.

    The quality of formal translations was assessed with BERTScore and chrF++:

    BERTScore: F1-Score: 0.89, Precision: 0.90, Recall: 0.88

    chrF++: 37.16

    The average perplexity of the generated formal texts was calculated using GPT-2 and is 123.77

    The dataset consists of 3 components.

    reddit_commments.csv

    This file contains a collection of randomly selected comments from 20 Subreddits. For each comment, the following information was collected:
    - subreddit (name of the subreddit in which the comment was posted)
    - id (ID of the comment)
    - submission_id (ID of the submission to which the comment was posted)
    - body (the comment itself)
    - created_utc (timestamp in seconds)
    - parent_id (the ID of the comment or submission to which the comment is a reply)
    - permalink (the URL to the original comment)
    - token_size (how many tokens the comment will be split into by the standard GPT-2 tokenizer)
    - perplexity (what perplexity GPT-2 calculates for the comment)

    The comments were filtered. This file contains only comments that:
    - have been split by GPT-2 Tokenizer into more than 10 tokens but less than 512 tokens
    - are not [removed] or [deleted]
    - do not contain URLs
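
    The three filter criteria can be sketched as a single predicate. The source counts tokens with the GPT-2 tokenizer; to stay self-contained, this sketch takes the token counter as a parameter and defaults to a whitespace split, which is only a stand-in. The URL pattern is likewise an assumption, since the description does not say how URLs were detected.

```python
import re

# Assumed URL heuristic; the original detection rule is not documented.
URL_PATTERN = re.compile(r"https?://|www\.", re.IGNORECASE)


def whitespace_token_count(text: str) -> int:
    """Stand-in for a GPT-2 token count (NOT the tokenizer used by the dataset)."""
    return len(text.split())


def keep_comment(body: str, count_tokens=whitespace_token_count) -> bool:
    """Return True if a comment passes all three documented filters."""
    if body.strip() in {"[removed]", "[deleted]"}:
        return False
    if URL_PATTERN.search(body):
        return False
    n_tokens = count_tokens(body)
    return 10 < n_tokens < 512  # more than 10, fewer than 512


if __name__ == "__main__":
    print(keep_comment("[deleted]"))
```

    To reproduce the original filter more closely, a real GPT-2 token counter can be passed as count_tokens.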

    This file was used as a source for the other two file types.

    Labeled Files (training_labeled.csv and eval_labeled.csv)

    These files contain the formal translations of the Reddit comments.

    The 150 comments with the highest calculated perplexity of GPT-2 from each Subreddit were translated into a formal version. This filter was used to translate as many comments as possible that have large stylistic salience.
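
    The per-subreddit selection described above amounts to a grouped top-N by perplexity. A minimal sketch, assuming each comment is a dict with the subreddit and perplexity fields listed for the source file (the function name is illustrative):

```python
import heapq
from collections import defaultdict


def top_perplexity_per_subreddit(comments, n=150):
    """Return, for each subreddit, the n comments with the highest perplexity."""
    by_subreddit = defaultdict(list)
    for comment in comments:
        by_subreddit[comment["subreddit"]].append(comment)
    return {
        name: heapq.nlargest(n, rows, key=lambda c: c["perplexity"])
        for name, rows in by_subreddit.items()
    }


if __name__ == "__main__":
    demo = [
        {"subreddit": "news", "perplexity": 50.0, "body": "a"},
        {"subreddit": "news", "perplexity": 300.0, "body": "b"},
        {"subreddit": "politics", "perplexity": 120.0, "body": "c"},
    ]
    print(top_perplexity_per_subreddit(demo, n=1))
```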

    They are structured as follows:
    - Subreddit (name of the subreddit where the comment was posted)
    - Original Comment
    - Formal Comment

    Labeled Files with Style Examples (training_labeled_with_style_samples.json and eval_labeled_with_style_samples.json)

    These files contain an original Reddit comment, three sample comments from the same subreddit, and the formal translation of the original Reddit comment.

    These files can be used to train models to perform style transfers based on given examples. The task is to transform the formal translation of the Reddit comment, using the three given examples, into the style of the examples.
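
    One way to turn an entry of these files into a training pair for that task is sketched below. The field names (input_sentence, style_samples, results_sentence) come from the entry structure documented for these files; the text layout of the model input is purely an assumption made for illustration.

```python
import json


def to_training_pair(entry: dict) -> tuple:
    """Build (model_input, target) from one entry.

    The model sees the style samples plus the formal text and should
    produce the original comment. The "Example:/Formal:/Rewrite:" layout
    is a hypothetical choice, not something the dataset prescribes.
    """
    samples = "\n".join(f"Example: {s}" for s in entry["style_samples"])
    model_input = f"{samples}\nFormal: {entry['results_sentence']}\nRewrite:"
    return model_input, entry["input_sentence"]


if __name__ == "__main__":
    raw = json.loads(
        '{"data": [{"input_sentence": "thats wild lol", '
        '"style_samples": ["s1", "s2", "s3"], '
        '"results_sentence": "That is surprising.", "subreddit": "news"}]}'
    )
    for item in raw["data"]:
        print(to_training_pair(item))
```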

    An entry in this file is structured as follows:

    "data":[
      {
        "input_sentence":"The original Reddit comment",
        "style_samples":[ "sample1", "sample2", "sample3" ],
        "results_sentence":"The formal translated input_sentence",
        "subreddit":"The subreddit from which the comments originated"
      },
      "..."
    ]

  7. reddit user posting behavior (mid-2013)

    • figshare.com
    application/gzip
    Updated May 31, 2023
    Cite
    Randy Olson (2023). reddit user posting behavior (mid-2013) [Dataset]. http://doi.org/10.6084/m9.figshare.874101.v2
    Available download formats: application/gzip
    Dataset updated
    May 31, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Randy Olson
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This file contains the posting preferences for over 850,000 active reddit users. This sample was taken in mid-2013. This data was used to generate the interactive visualization, "redditviz," and will be analyzed in detail in an upcoming research article. Please cite our paper "Navigating the massive world of reddit" if you use this data in your work. URL: http://arxiv.org/abs/1312.3387

    The file is organized as follows: Each line is an entry for an anonymous user. Each user was randomly assigned a unique ID, which is what shows in the first entry of each line. Following the user ID, separated by commas, are the subreddits (i.e., interests) that the user regularly posts in. In order for a user to be considered "active" in that subreddit, they had to post or comment there at least 10 times in their last 1,000 posts and comments.
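
    The line layout described above (user ID first, then comma-separated subreddits) can be parsed with a few lines of Python. This is a sketch under that description only; the function names are illustrative and the sample IDs are made up.

```python
from collections import Counter


def parse_user_line(line: str):
    """Split one line into (user_id, [subreddits])."""
    fields = [field.strip() for field in line.strip().split(",")]
    user_id = fields[0]
    subreddits = [name for name in fields[1:] if name]
    return user_id, subreddits


def subreddit_popularity(lines):
    """Count how many users are active in each subreddit."""
    counts = Counter()
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        _, subreddits = parse_user_line(line)
        counts.update(set(subreddits))
    return counts


if __name__ == "__main__":
    demo = ["1001,AskReddit,funny", "1002,funny,pics"]
    print(subreddit_popularity(demo).most_common(3))
```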

  8. Reddit: content created H1 2024

    • statista.com
    Updated Apr 25, 2025
    + more versions
    Cite
    Statista Research Department (2025). Reddit: content created H1 2024 [Dataset]. https://www.statista.com/topics/5672/reddit/
    Dataset updated
    Apr 25, 2025
    Dataset provided by
    Statista (http://statista.com/)
    Authors
    Statista Research Department
    Description

    In the first half of 2024, a total of 5.33 billion pieces of content were created on Reddit. Of these, over 1.65 billion were comments left by registered users under posted content. Over 2.9 billion chats were exchanged during the examined period, while private messages on the platform had a volume of approximately 492 million pieces of content.

  9. Dataplex: Reddit Data | Global Social Media Data | 2.1M+ subreddits: trends,...

    • datarade.ai
    .json, .csv
    + more versions
    Cite
    Dataplex, Dataplex: Reddit Data | Global Social Media Data | 2.1M+ subreddits: trends, audience insights + more | Ideal for Interest-Based Segmentation [Dataset]. https://datarade.ai/data-products/dataplex-reddit-data-global-social-media-data-1-1m-mill-dataplex
    Available download formats: .json, .csv
    Dataset authored and provided by
    Dataplex
    Area covered
    Macao, Martinique, Gambia, Chile, Jersey, Christmas Island, Holy See, Botswana, Côte d'Ivoire, Mexico
    Description

    The Reddit Subreddit Dataset by Dataplex offers a comprehensive and detailed view of Reddit’s vast ecosystem, now enhanced with appended AI-generated columns that provide additional insights and categorization. This dataset includes data from over 2.1 million subreddits, making it an invaluable resource for a wide range of analytical applications, from social media analysis to market research.

    Dataset Overview:

    This dataset includes detailed information on subreddit activities, user interactions, post frequency, comment data, and more. The inclusion of AI-generated columns adds an extra layer of analysis, offering sentiment analysis, topic categorization, and predictive insights that help users better understand the dynamics of each subreddit.

    2.1 Million Subreddits with Enhanced AI Insights: The dataset covers over 2.1 million subreddits and now includes AI-enhanced columns that provide:

    • Sentiment Analysis: AI-driven sentiment scores for posts and comments, allowing users to gauge community mood and reactions.
    • Topic Categorization: Automated categorization of subreddit content into relevant topics, making it easier to filter and analyze specific types of discussions.
    • Predictive Insights: AI models that predict trends, content virality, and user engagement, helping users anticipate future developments within subreddits.

    Sourced Directly from Reddit:

    All social media data in this dataset is sourced directly from Reddit, ensuring accuracy and authenticity. The dataset is updated regularly, reflecting the latest trends and user interactions on the platform. This ensures that users have access to the most current and relevant data for their analyses.

    Key Features:

    • Subreddit Metrics: Detailed data on subreddit activity, including the number of posts, comments, votes, and user participation.
    • User Engagement: Insights into how users interact with content, including comment threads, upvotes/downvotes, and participation rates.
    • Trending Topics: Track emerging trends and viral content across the platform, helping you stay ahead of the curve in understanding social media dynamics.
    • AI-Enhanced Analysis: Utilize AI-generated columns for sentiment analysis, topic categorization, and predictive insights, providing a deeper understanding of the data.

    Use Cases:

    • Social Media Analysis: Researchers and analysts can use this dataset to study online behavior, track the spread of information, and understand how content resonates with different audiences.
    • Market Research: Marketers can leverage the dataset to identify target audiences, understand consumer preferences, and tailor campaigns to specific communities.
    • Content Strategy: Content creators and strategists can use insights from the dataset to craft content that aligns with trending topics and user interests, maximizing engagement.
    • Academic Research: Academics can explore the dynamics of online communities, studying everything from the spread of misinformation to the formation of online subcultures.

    Data Quality and Reliability:

    The Reddit Subreddit Dataset emphasizes data quality and reliability. Each record is carefully compiled from Reddit’s vast database, ensuring that the information is both accurate and up-to-date. The AI-generated columns further enhance the dataset's value, providing automated insights that help users quickly identify key trends and sentiments.

    Integration and Usability:

    The dataset is provided in a format that is compatible with most data analysis tools and platforms, making it easy to integrate into existing workflows. Users can quickly import, analyze, and utilize the data for various applications, from market research to academic studies.

    User-Friendly Structure and Metadata:

    The data is organized for easy navigation and analysis, with metadata files included to help users identify relevant subreddits and data points. The AI-enhanced columns are clearly labeled and structured, allowing users to efficiently incorporate these insights into their analyses.

    Ideal For:

    • Data Analysts: Conduct in-depth analyses of subreddit trends, user engagement, and content virality. The dataset’s extensive coverage and AI-enhanced insights make it an invaluable tool for data-driven research.
    • Marketers: Use the dataset to better understand your target audience, tailor campaigns to specific interests, and track the effectiveness of marketing efforts across Reddit.
    • Researchers: Explore the social dynamics of online communities, analyze the spread of ideas and information, and study the impact of digital media on public discourse, all while leveraging AI-generated insights.

    This dataset is an essential resource for anyone looking to understand the intricacies of Reddit's vast ecosystem, offering the data and AI-enhanced insights needed to drive informed decisions and strategies across various fields. Whether you’re tracking emerging trends, analyzing user behavior, or conduc...

  10. Cross-mentions between 4chan and Reddit

    • zenodo.org
    • data.niaid.nih.gov
    bin, csv
    Updated Oct 31, 2023
    Cite
    Sal Hagen (2023). Cross-mentions between 4chan and Reddit [Dataset]. http://doi.org/10.5281/zenodo.10059085
    Available download formats: bin, csv
    Dataset updated
    Oct 31, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Sal Hagen
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The included datasets form the basis of the article "No space for Reddit spacing: Mapping the reflexive relationship between groups on 4chan and Reddit", published in Social Media + Society. They include cross-mentions between 4chan and Reddit, as well as various metrics associated with these cross-references.

    The timeframe ranges from the earliest date available (for Reddit: June 2006; for 4chan/b/: April 2006; for 4chan/pol/: December 2013) and ends in January 2023 (except for the 4chan/b/ dataset, which ends in December 2008).

    The datasets specifically entail the following:

    1. Cross-mentions from Reddit to 4chan

    reddit-mentions-to-4chan.csv

    I used the Pushshift API's search endpoint to fetch Reddit comments (so no opening posts) with the keyword "4chan" (note: this Pushshift functionality is now deprecated). I also used a rudimentary filter to remove posts by bots, specifically by 1) deleting posts from every account that had "bot" or "auto" in the username and 2) removing all posts by authors with 100 or more contributions and which I manually identified as automated accounts.

    I removed URL-only cross-references, i.e. posts that only mentioned "://boards.4chan.org" or "://boards.4channel.org" without another 4chan reference.

    This resulted in 2,638,621 "4chan" references across Reddit.
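
    The rudimentary bot filter described above could look roughly like the sketch below. The first function applies the documented username rule; the second only lists accounts with 100 or more contributions as candidates, because the source says the final decision on those was made manually. Function names and the sample data are illustrative.

```python
from collections import Counter


def looks_automated(username: str) -> bool:
    """Rule 1: the username contains "bot" or "auto" (case-insensitive here)."""
    name = username.lower()
    return "bot" in name or "auto" in name


def manual_review_candidates(authors, min_posts=100):
    """Rule 2 helper: authors with at least min_posts contributions.

    These are only candidates; the dataset author checked them by hand.
    """
    counts = Counter(authors)
    return sorted(name for name, n in counts.items() if n >= min_posts)


if __name__ == "__main__":
    print(looks_automated("AutoModerator"), looks_automated("alice"))
```

    Note that the substring rule also drops human accounts whose names happen to contain these strings, which matches the "rudimentary" label in the description.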

    2. Cross-mentions from 4chan/pol/ to Reddit

    4chan-pol_mentions-of-reddit.csv

    With a complete dataset of /pol/ collected through 4CAT, I queried for "reddit" or the common synonym "plebbit", case-insensitive, with prefixes and suffixes allowed (e.g. "Redditor").

    I removed URL-only cross-references, i.e. posts that only mentioned "://reddit.com/", "www.reddit.com/", or "i.reddit.com/" without another Reddit reference.

    This resulted in 1,640,273 "Reddit" references on /pol/.
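
    A minimal sketch of the keyword query and the URL-only exclusion for Reddit mentions: Reddit URLs are stripped first, and a post counts as a mention only if "reddit" or "plebbit" still occurs afterwards. Substring matching is what allows affixes such as "Redditor". The exact query used in 4CAT is not given in the source, so these patterns are assumptions.

```python
import re

# Keyword match without word boundaries, so affixed forms also match.
KEYWORD = re.compile(r"reddit|plebbit", re.IGNORECASE)

# The three URL forms named in the description.
REDDIT_URL = re.compile(
    r"://reddit\.com/|www\.reddit\.com/|i\.reddit\.com/", re.IGNORECASE
)


def mentions_reddit(body: str) -> bool:
    """True if the post references Reddit by more than a bare URL."""
    without_urls = REDDIT_URL.sub("", body)
    return bool(KEYWORD.search(without_urls))


if __name__ == "__main__":
    print(mentions_reddit("go back to plebbit"))
    print(mentions_reddit("https://www.reddit.com/r/example"))
```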

    3. Cross-mentions from 4chan/b/ to Reddit

    4chan-b_mentions-to-reddit.csv

    I extracted five million posts from Jason Scott's 4chan/b/ dump. I then queried for "reddit" or the common synonym "plebbit", case-insensitive, with prefixes and suffixes allowed (e.g. "Redditor").

    I removed URL-only cross-references, i.e. posts that only mentioned "://reddit.com/", "www.reddit.com/", or "i.reddit.com/" without another Reddit reference.

    This resulted in 1,287 "Reddit" references on /b/.

    See Hagen (2020) for more information on the 4chan/b/ dataset.

    4. Cross-mention metrics

    cross-mention-metrics.xlsx

    I extracted the following metrics from the datasets above:

    4.1 The total number of cross-mentions, absolute and relative, per month
    This simply used the monthly counts from datasets 1 and 2.

    4.2 The most mentioned subreddits on /pol/, per year
    Using the regular expression: r\/[a-zA-Z_]
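
    As a sketch of how such a pattern can be applied, the snippet below extracts subreddit names and counts each subreddit once per post. The expression quoted above matches only a single character after "r/"; the trailing "+" quantifier used here is an assumption added so that whole names are captured. Lower-casing the names is also an assumption.

```python
import re
from collections import Counter

# Quoted pattern plus an assumed "+" quantifier to capture full names.
SUBREDDIT = re.compile(r"r\/[a-zA-Z_]+")


def subreddits_in_post(body: str) -> set:
    """Distinct subreddit mentions in one post, lower-cased."""
    return {match.lower() for match in SUBREDDIT.findall(body)}


def count_mentions(posts) -> Counter:
    """Count each subreddit at most once per post."""
    counts = Counter()
    for body in posts:
        counts.update(subreddits_in_post(body))
    return counts


if __name__ == "__main__":
    print(count_mentions(["r/news r/news", "r/news and r/politics"]))
```

    Because the character class excludes digits, names containing digits would be cut short by this pattern.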

    4.3 Subreddits that mention 4chan most often, per year

    4.4 4chan boards mentioned across Reddit, per month

    4.5 4chan boards mentioned by subreddits

    I counted every subreddit- or board-mention per post instead of total occurrences.

    For 4.4 and 4.5, I used the following regular expression to extract 4chan board names:

    (\s|^|4chan)\/(a|b|c|d|e|f|g|gif|h|hr|k|m|o|p|t|v|vg|vm|vmg|vr|vrpg|vst|w|wg|i|ic|r9k|s4s|vip|qa|cm|hm|lgbt|y|3|aco|adv|an|bant|biz|cgl|ck|co|diy|fa|fit|gd|hc|his|int|jp|lit|mlp|mu|n|news|out|po|pol|pw|qst|sci|soc|sp|tg|toy|trv|tv|vp|vt|wsg|wsr|x|xs|new)\/(\s|$)

    I also omitted 4chan's /r/, /u/, and /s/ boards; despite their small scale, they appeared as false positives due to their unrelated vernacular meaning on Reddit (e.g. /u/ as a username prefix).
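
    The board expression quoted above can be used directly with Python's re module. The sketch below returns the distinct boards named in a post, in line with counting each board once per post; the function name is illustrative.

```python
import re

# Board pattern copied from the description; group 2 holds the board name.
BOARD_PATTERN = re.compile(
    r"(\s|^|4chan)\/(a|b|c|d|e|f|g|gif|h|hr|k|m|o|p|t|v|vg|vm|vmg|vr|vrpg|vst"
    r"|w|wg|i|ic|r9k|s4s|vip|qa|cm|hm|lgbt|y|3|aco|adv|an|bant|biz|cgl|ck|co"
    r"|diy|fa|fit|gd|hc|his|int|jp|lit|mlp|mu|n|news|out|po|pol|pw|qst|sci|soc"
    r"|sp|tg|toy|trv|tv|vp|vt|wsg|wsr|x|xs|new)\/(\s|$)"
)


def boards_in_post(body: str) -> set:
    """Distinct 4chan boards mentioned in one post."""
    return {match.group(2) for match in BOARD_PATTERN.finditer(body)}


if __name__ == "__main__":
    print(boards_in_post("I saw it on /pol/ yesterday"))
```

    Since each match consumes its trailing whitespace, two boards separated by a single space (e.g. "/b/ /pol/") yield only the first one with this pattern.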

    4.5 was also transformed and included as a Gephi network file (subreddit-board-mentions.gephi).

    Lastly, I also included:

    4.6 The total number of posts on 4chan and Reddit

    This was used to calculate 4.1. It uses Pushshift's database statistics (which as of Nov. 2023 requires a login; see this Pastebin for an alternative) and metrics of total 4chan post counts from 4stats.io.

    Each of these metrics has its own corresponding tab in the Excel file.

    5. Co-words of "4chan" and "reddit" in the cross-mentions

    co-words.xslx

    Using datasets 1, 2, and 3, I extracted the top ten words appearing directly next to "4chan" on Reddit, and next to "Reddit" on 4chan, per year.

    I first pre-processed the text, which involved tokenisation, filtering of unwanted text elements like URLs, stop word removal (I whitelisted "back"), and lemmatisation.

    For the co-word extraction I used a window size of two. I excluded a range of semantically uninteresting words or commonly used hate speech terms prevalent throughout 4chan.
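
    A minimal sketch of the co-word extraction on already pre-processed token lists. It assumes that a window size of two means the token directly before and the token directly after the keyword; the source does not spell this out. The exclusion list is passed in, since the actual excluded terms are not provided.

```python
from collections import Counter


def co_words(tokens, keyword, excluded=frozenset()):
    """Count tokens directly adjacent to each occurrence of keyword."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != keyword:
            continue
        for j in (i - 1, i + 1):  # assumed window: one token on each side
            if 0 <= j < len(tokens):
                neighbour = tokens[j]
                if neighbour != keyword and neighbour not in excluded:
                    counts[neighbour] += 1
    return counts


def top_co_words(posts, keyword, n=10, excluded=frozenset()):
    """Top-n co-words over many tokenised posts."""
    total = Counter()
    for tokens in posts:
        total.update(co_words(tokens, keyword, excluded))
    return total.most_common(n)


if __name__ == "__main__":
    demo = [["go", "back", "reddit"], ["back", "reddit", "loser"]]
    print(top_co_words(demo, "reddit", n=2))
```

    Running this once per year of data would give the per-year top ten described above.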

    6. Annotated cross-mentions between Reddit and 4chan/pol/ in September 2014

    annotations_4chanpol-2014.csv
    annotations_reddit-2014-kotakuinaction-anonimised.csv
    annotations_reddit-2014-tumblrinaction-anonimised.csv

    I extracted cross-mentions from /pol/ to Reddit and from Reddit to 4chan in September 2014 for close-reading and annotation.


    The author names are removed for all datasets.

  11. AITA-Reddit-Dataset

    • huggingface.co
    Updated Nov 1, 2023
    Cite
    Osama Bsher (2023). AITA-Reddit-Dataset [Dataset]. https://huggingface.co/datasets/OsamaBsher/AITA-Reddit-Dataset
    Dataset updated
    Nov 1, 2023
    Authors
    Osama Bsher
    Description

    Dataset Card for AITA Reddit Posts and Comments

    Posts of the AITA subreddit, with the 2 top voted comments that share the post verdict. Extracted using REDDIT PushShift (from 2013 to April 2023)

      Dataset Details
    

    The dataset contains 270,709 entries, each of which contains the post title, text, verdict, comment1, comment2, and score (number of upvotes). For more details see the paper: https://arxiv.org/abs/2310.18336

      Dataset Sources
    

    The Reddit PushShift data dumps are… See the full description on the dataset page: https://huggingface.co/datasets/OsamaBsher/AITA-Reddit-Dataset.

  12. Dataset — Make Reddit Great Again: Assessing Community Effects of Moderation...

    • zenodo.org
    • data.niaid.nih.gov
    bin, csv
    Updated Jan 10, 2023
    Cite
    Amaury Trujillo; Stefano Cresci (2023). Dataset — Make Reddit Great Again: Assessing Community Effects of Moderation Interventions on r/The_Donald [Dataset]. http://doi.org/10.5281/zenodo.6250577
    Available download formats: bin, csv
    Dataset updated
    Jan 10, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Amaury Trujillo; Stefano Cresci
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Reddit contents and complementary data regarding the r/The_Donald community and its main moderation interventions, used for the corresponding article indicated in the title.

    An accompanying R notebook can be found in: https://github.com/amauryt/make_reddit_great_again

    If you use this dataset please cite the related article.

    The dataset timeframe of the Reddit contents (submissions and comments) spans from 30 weeks before Quarantine (2018-11-28) to 30 weeks after Restriction (2020-09-23). The original Reddit content was collected from the Pushshift monthly data files, transformed, and loaded into two SQLite databases.

    The first database, the_donald.sqlite, contains all the available content from r/The_Donald created during the dataset timeframe, with the last content being posted several weeks before the timeframe upper limit. It only has two tables: submissions and comments. It should be noted that the IDs of contents are in base 10 (numeric integer), unlike the original base 36 (alphanumeric) used on Reddit and Pushshift. This is for efficient storage and processing. If necessary, many programming languages or libraries can easily convert IDs from one base to another.
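
    As an illustration of the base conversion mentioned above, Python's built-in int() accepts a base argument, and the reverse direction takes only a short loop. The function names are illustrative; the sketch assumes the usual lower-case base-36 alphabet (digits followed by a to z).

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz"


def base36_to_int(reddit_id: str) -> int:
    """Convert an alphanumeric (base 36) ID to a base 10 integer."""
    return int(reddit_id, 36)


def int_to_base36(number: int) -> str:
    """Convert a non-negative base 10 integer back to a base 36 ID."""
    if number < 0:
        raise ValueError("IDs are non-negative")
    if number == 0:
        return "0"
    digits = []
    while number:
        number, remainder = divmod(number, 36)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))


if __name__ == "__main__":
    print(base36_to_int("abc"))   # 13368
    print(int_to_base36(13368))   # abc
```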

    The second database, core_the_donald.sqlite, contains all the available content from core users of r/The_Donald made platform-wide (i.e., within and outside the subreddit) during the dataset timeframe. Core users are defined as those who authored at least one submission or comment per week in r/The_Donald during the 30 weeks prior to the subreddit's Quarantine. The database has four tables: submissions, comments, subreddits, and perspective_scores. The subreddits table contains the names of the subreddits to which submissions and comments were made (their IDs are also in base 10). The perspective_scores table contains comment toxicity scores.
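
    The core-user definition can be sketched as a coverage check over weekly windows. This reading assumes consecutive 7-day windows counted back from the Quarantine date and activity given as (author, date) pairs; how the original study drew week boundaries is not stated, so treat this as one plausible interpretation.

```python
from datetime import date, timedelta


def core_users(activity, quarantine: date, weeks: int = 30) -> set:
    """Authors with at least one post in each of the `weeks` windows
    before `quarantine`.

    `activity` is an iterable of (author, day) pairs, where day is a
    datetime.date for a submission or comment in the subreddit.
    """
    weeks_seen = {}
    for author, day in activity:
        age_in_days = (quarantine - day).days
        if 0 < age_in_days <= weeks * 7:
            week_index = (age_in_days - 1) // 7  # 0 = week just before quarantine
            weeks_seen.setdefault(author, set()).add(week_index)
    return {author for author, seen in weeks_seen.items() if len(seen) == weeks}


if __name__ == "__main__":
    q = date(2019, 6, 26)  # 30 weeks after the documented start, 2018-11-28
    weekly = [("alice", q - timedelta(days=7 * k + 1)) for k in range(30)]
    print(core_users(weekly, q))
```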

    The Perspective API was used to score comments based on the attributes toxicity and severe_toxicity. It should be noted that not all of the comments in core_the_donald have a score because the comment body was blank or because the Perspective API returned a request error (after three tries). However, the percentage of missing scores is minuscule.

    A third file, mbfc_scores.csv, contains the bias and factual reporting accuracy collected in October 2021 from Media Bias / Fact Check (MBFC). Both attributes are scored on a Likert-like scale. One can associate submissions with MBFC scores by joining on the domain column.
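
    The join on the domain column can be sketched with plain dictionaries. Only the domain column is named in the description; the score column names used here (bias, factual_reporting) are hypothetical placeholders and should be replaced by the actual headers of mbfc_scores.csv.

```python
def attach_mbfc_scores(submissions, mbfc_rows):
    """Inner-join submissions with MBFC rows on the shared domain column.

    Both inputs are iterables of dicts, e.g. rows read via sqlite3 and
    csv.DictReader. Score column names are assumed, not documented.
    """
    by_domain = {row["domain"]: row for row in mbfc_rows}
    joined = []
    for submission in submissions:
        scores = by_domain.get(submission["domain"])
        if scores is None:
            continue  # no MBFC rating for this domain
        joined.append(
            {
                **submission,
                "bias": scores["bias"],
                "factual_reporting": scores["factual_reporting"],
            }
        )
    return joined


if __name__ == "__main__":
    subs = [{"id": 1, "domain": "example.com"}, {"id": 2, "domain": "other.org"}]
    mbfc = [{"domain": "example.com", "bias": "0", "factual_reporting": "4"}]
    print(attach_mbfc_scores(subs, mbfc))
```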

  13. Ask Reddit

    • kaggle.com
    zip
    Updated Oct 28, 2021
    Cite
    Gabriel Preda (2021). Ask Reddit [Dataset]. https://www.kaggle.com/gpreda/ask-reddit
    Explore at:
    zip(1781882 bytes)Available download formats
    Dataset updated
    Oct 28, 2021
    Authors
    Gabriel Preda
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    AskReddit ... (r/AskReddit) is one of the largest Reddit communities. Posts and comments consist of questions and answers on a variety of random topics. The collection is a rich source of all kinds of topics in the English language. The data is not filtered.

    Collection

    Reddit posts from subreddit r/AskReddit, downloaded from https://www.reddit.com/r/AskReddit using praw (The Python Reddit API Wrapper).

    Script used for collection can be found here: Reddit extract content

    Content

    Data contains both posts and comments. Both posts and comments contain the following fields:
    * title - relevant for posts
    * score - relevant for posts - based on impact, number of comments
    * id - unique id for posts/comments
    * url - relevant for posts - url of post thread
    * commns_num - relevant for posts - number of comments to this post
    * created - date of creation
    * body - relevant for posts/comments - text of the post or comment
    * timestamp - timestamp

    Acknowledgements

    All merit goes to the contributors to the posts of subreddit r/AskReddit. I only collect them daily.

    Inspiration

    You can use the data to:
    * Perform sentiment analysis;
    * Identify discussion topics.

  14. tldr-17

    • huggingface.co
    Updated Jun 5, 2023
    Cite
    Webis Group (2023). tldr-17 [Dataset]. https://huggingface.co/datasets/webis/tldr-17
    Explore at:
    Croissant
    Dataset updated
    Jun 5, 2023
    Dataset authored and provided by
    Webis Group
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This corpus contains preprocessed posts from the Reddit dataset. The dataset consists of 3,848,330 posts with an average length of 270 words for content, and 28 words for the summary.

    Features include the strings author, body, normalizedBody, content, summary, subreddit, and subreddit_id. Content is used as the document and summary is used as the summary.

  15. REDDIT_comments

    • huggingface.co
    Updated Aug 5, 2023
    + more versions
    Cite
    HuggingFaceGECLM (2023). REDDIT_comments [Dataset]. https://huggingface.co/datasets/HuggingFaceGECLM/REDDIT_comments
    Explore at:
    Croissant
    Dataset updated
    Aug 5, 2023
    Dataset authored and provided by
    HuggingFaceGECLM
    License

    https://choosealicense.com/licenses/undefined/

    Description

    Dataset Card for "REDDIT_comments"

    Dataset Summary

    Comments of 50 high-quality subreddits, extracted from the REDDIT PushShift data dumps (from 2006 to Jan 2023).

    Supported Tasks

    These comments can be used for text generation and language modeling, as well as dialogue modeling.

    Dataset Structure

    Data Splits

    Each split corresponds to a specific subreddit in the following list: "tifu", "explainlikeimfive", "WritingPrompts", "changemyview"… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceGECLM/REDDIT_comments.

  16. Reddit usage reach in the United States 2021, by education

    • statista.com
    Updated Feb 26, 2024
    Cite
    Statista (2024). Reddit usage reach in the United States 2021, by education [Dataset]. https://www.statista.com/statistics/261776/share-of-us-internet-users-who-use-reddit-by-education-level/
    Explore at:
    Dataset updated
    Feb 26, 2024
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    Jan 25, 2021 - Feb 8, 2021
    Area covered
    United States
    Description

    According to a February 2021 survey of internet users based in the United States, respondents who attended college were more likely to use Reddit than respondents with lower levels of education. 26 percent of respondents with a bachelor's or advanced degree reported using the social network, compared to only nine percent of respondents holding a high school diploma or less.

  17. Reddit blackout announcements: 2023 API protest

    • datadryad.org
    • data.niaid.nih.gov
    zip
    Updated Feb 6, 2024
    Cite
    Ben Pettis (2024). Reddit blackout announcements: 2023 API protest [Dataset]. http://doi.org/10.5061/dryad.qfttdz0qd
    Explore at:
    zip
    Dataset updated
    Feb 6, 2024
    Dataset provided by
    Dryad
    Authors
    Ben Pettis
    Time period covered
    Jan 22, 2024
    Description

    Reddit Blackout Announcements - 2023 API Protest

    https://doi.org/10.5061/dryad.qfttdz0qd

    This dataset includes the list of scraped subreddits, a single CSV file for each subreddit, and a copy of the Python scripts used to scrape the data.

    Description of the data and file structure

    The dataset is uploaded as a single .zip file. Once it is downloaded and decompressed, it will include several files and directories. Here is how they are organized:

    .
    ├── subreddit-list.txt
    ├── CSVs
    │   ├── [subreddit-name].csv
    │   └── [...]
    ├── code
    │   └── [...]
    └── parsed TXTs
        ├── API.txt
        ├── blackout.txt
        ├── community.txt
        ├── mod-team.txt
        ├── moderator.txt
        ├── platform.txt
        └── protest.txt

    Subreddit List

    The subreddit-list.txt file contains a list of 5,351 subreddit names. Each appears on its own line. This list was generated using the list-subreddits.py script, as described below.

    Stickied Posts - CSVs

    The "CSVs" directory contains 5,351 CSV (...

  18. The reddit self-post classification task

    • kaggle.com
    Updated Oct 29, 2018
    Cite
    Mike Swarbrick Jones (2018). The reddit self-post classification task [Dataset]. https://www.kaggle.com/mswarbrickjones/reddit-selfposts/code
    Explore at:
    Croissant
    Dataset updated
    Oct 29, 2018
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Mike Swarbrick Jones
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Introduction

    Welcome to the Reddit Self-Post Classification Task (RSPCT)!

    The aim of this dataset was to create an interesting, large text classification problem with many classes, that does not suffer from label sparsity as most datasets of its type do. See the blog post for a more detailed write up, or the paper here. The aim is to classify self-posts into the subreddit into which they were posted. A great deal of effort has gone into selecting a ‘good’ set of subreddits to minimise overlap in content.

    We recommend you look at the blogpost write-up for this dataset before continuing. There is also a rough draft of a paper here if you have more detailed questions.

    Data

    The data consists of 1.013M self-posts, posted from 1013 subreddits (1000 examples per class). For each post we give the subreddit, the title and content of the self-post.

    We have also given a manual annotation of about 3000 subreddits which went into the creation of this dataset, in subreddit_info.csv; this was the main criterion for selecting which subreddits went into this dataset. We include a top-level category and subcategory for each subreddit, and a reason for exclusion if it does not appear in the data.

    Recommendations

    We recommend splitting out the last 20% of the data as a test set (we have organised the data so that this is a random, stratified sample of all the data). In our experiments, we have been optimising for the precision-at-K metric for K = {1, 3, 5}.
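The recommended split and metric might be sketched as follows in Python. This is an illustration under stated assumptions, not the authors' evaluation code: the function names are hypothetical, and the metric is interpreted here as a top-K hit rate (an example counts as correct if its true subreddit is among the K highest-scored classes). Dividing that hit rate by K would give the classical precision-at-K for a single-label task; the description does not say which variant was used.

```python
from typing import List, Sequence, Tuple


def split_last_fraction(rows: Sequence, test_fraction: float = 0.2) -> Tuple[Sequence, Sequence]:
    """Hold out the final `test_fraction` of the rows as the test set."""
    n_test = int(round(len(rows) * test_fraction))
    cut = len(rows) - n_test
    return rows[:cut], rows[cut:]


def hit_rate_at_k(scores: List[List[float]], true_labels: List[int], k: int) -> float:
    """Fraction of examples whose true class is among the k highest-scored classes."""
    hits = 0
    for row_scores, label in zip(scores, true_labels):
        # Class indices sorted by descending score, truncated to the top k.
        top_k = sorted(range(len(row_scores)), key=lambda c: row_scores[c], reverse=True)[:k]
        hits += label in top_k
    return hits / len(true_labels)


# Toy example: 3 examples, 3 classes (scores are made up).
toy_scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
toy_labels = [1, 1, 0]
train_rows, test_rows = split_last_fraction(list(range(10)))
```

With the toy data, the hit rate rises as K grows, because more of each example's ranking is accepted as correct.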

    Questions that we think would be interesting to answer

    • can sequential models (e.g. LSTMs) be trained to be competitive with / outperform bag-of-word approaches?
    • does transfer learning (e.g. OpenAI, ULMFIT) help on this problem? You may want to look at the GitHub page (https://github.com/mikesj-public/rspct-dataset/tree/master) to get hold of an unsupervised training set.
    • can you leverage a hierarchy (such as the one detailed in subreddit_info.csv), to improve accuracy?
    • can you use techniques from XML (extreme multi-class) machine learning to get a better score on this dataset?

  19. Subreddit comments/submissions 2005-06 to 2024-12

    • academictorrents.com
    bittorrent
    Updated Feb 16, 2025
    + more versions
    Cite
    Watchful1 (2025). Subreddit comments/submissions 2005-06 to 2024-12 [Dataset]. https://academictorrents.com/details/1614740ac8c94505e4ecb9d88be8bed7b6afddd4
    Explore at:
    bittorrent (3275329715321)
    Dataset updated
    Feb 16, 2025
    Dataset authored and provided by
    Watchful1
    License

    https://academictorrents.com/nolicensespecified

    Description

    These are the top 40,000 subreddits from reddit's history in separate files. You can use your torrent client to download only the subreddits you're interested in. These are from the pushshift dumps from 2005-06 to 2024-12, which can be found here. These are zstandard compressed ndjson files. Example python scripts for parsing the data can be found here. If you have questions, please reply to this reddit post, DM u/Watchful on reddit, or respond to this post.
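For readers unfamiliar with the format, the sketch below shows one way zstandard-compressed ndjson might be read in Python. It is not the author's parsing script. `iter_ndjson` uses only the standard library; `open_zst_text` is a hypothetical helper that assumes the third-party `zstandard` package is installed, and the enlarged `max_window_size` is an assumption about how the dumps were compressed. The sample records and their field names are invented for illustration.

```python
import io
import json
from typing import Iterator, TextIO


def iter_ndjson(stream: TextIO) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of a text stream."""
    for line in stream:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        yield json.loads(line)


def open_zst_text(path: str) -> TextIO:
    """Open a .zst file as a decompressed UTF-8 text stream (needs `zstandard`)."""
    import zstandard  # third-party package, imported lazily

    fh = open(path, "rb")
    # Assumption: a large window size is needed for these dumps.
    reader = zstandard.ZstdDecompressor(max_window_size=2**31).stream_reader(fh)
    return io.TextIOWrapper(reader, encoding="utf-8")


# Demonstration on an in-memory sample instead of a real dump file.
sample = io.StringIO(
    '{"id": "abc", "subreddit": "AskReddit"}\n\n{"id": "abd", "subreddit": "tifu"}\n'
)
records = list(iter_ndjson(sample))
```

With a real file, `iter_ndjson(open_zst_text(path))` would stream records one at a time rather than loading the whole dump into memory.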

  20. Reddit brand profile in the United States 2024

    • statista.com
    • ai-chatbox.pro
    Updated Jul 18, 2025
    Cite
    Statista (2025). Reddit brand profile in the United States 2024 [Dataset]. https://www.statista.com/forecasts/1304993/reddit-social-media-brand-profile-in-the-united-states
    Explore at:
    Dataset updated
    Jul 18, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    Feb 2024
    Area covered
    United States
    Description

    How high is the brand awareness of Reddit in the United States?
    When it comes to social media users, brand awareness of Reddit is at ** percent in the United States. The survey was conducted using the concept of aided brand recognition, showing respondents both the brand's logo and the written brand name.

    How popular is Reddit in the United States?
    In total, ** percent of U.S. social media users say they like Reddit. However, in actuality, among the ** percent of U.S. respondents who know Reddit, ** percent of people like the brand.

    What is the usage share of Reddit in the United States?
    All in all, ** percent of social media users in the United States use Reddit. That means, of the ** percent who know the brand, ** percent use them.

    How loyal are the users of Reddit?
    Around ** percent of social media users in the United States say they are likely to use Reddit again. Set in relation to the ** percent usage share of the brand, this means that ** percent of their users show loyalty to the brand.

    What's the buzz around Reddit in the United States?
    In February 2023, about ** percent of U.S. social media users had heard about Reddit in the media, on social media, or in advertising over the past four weeks. Of the ** percent who know the brand, that's ** percent, meaning at the time of the survey there's little buzz around Reddit in the United States.

    If you want to compare brands, do deep-dives by survey items of your choice, filter by total online population or users of a certain brand, or drill down on your very own hand-tailored target groups, our Consumer Insights Brand KPI survey has you covered.

Cite
Statista (2025). Reddit usage reach in the United States 2024, by age group [Dataset]. https://www.statista.com/statistics/261766/share-of-us-internet-users-who-use-reddit-by-age-group/

Reddit usage reach in the United States 2024, by age group

Explore at:
36 scholarly articles cite this dataset (View in Google Scholar)
Dataset updated
Feb 17, 2025
Dataset authored and provided by
Statista (http://statista.com/)
Time period covered
Feb 1, 2024 - Jun 10, 2024
Area covered
United States
