100+ datasets found
  1. the-reddit-dataset-dataset

    • huggingface.co
    Updated Jun 25, 2022
    Cite
    SocialGrep (2022). the-reddit-dataset-dataset [Dataset]. https://huggingface.co/datasets/SocialGrep/the-reddit-dataset-dataset
    Dataset updated
    Jun 25, 2022
    Authors
    SocialGrep
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A meta dataset of Reddit's own /r/datasets community.

  2. depression reddit dataset

    • ieee-dataport.org
    Updated Mar 24, 2023
    Cite
    Jaedong Oh (2023). depression reddit dataset [Dataset]. http://doi.org/10.21227/0dfh-5a29
    Dataset updated
    Mar 24, 2023
    Dataset provided by
    IEEE Dataport
    Authors
    Jaedong Oh
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We pre-processed posts and comments posted on the subreddit r/depression during 2010-2016 and built this dataset from them.

  3. reddit-clustering

    • huggingface.co
    Updated Apr 28, 2022
    Cite
    Massive Text Embedding Benchmark (2022). reddit-clustering [Dataset]. https://huggingface.co/datasets/mteb/reddit-clustering
    Dataset updated
    Apr 28, 2022
    Dataset authored and provided by
    Massive Text Embedding Benchmark
    License

    https://choosealicense.com/licenses/unknown/

    Description

    RedditClustering.v2: an MTEB (Massive Text Embedding Benchmark) dataset

    Clustering of titles from 199 subreddits, organized as 25 sets, each with 10-50 classes, and each class with 100-1000 sentences.

    Task category: t2c

    Domains: Web, Social, Written
    Reference: https://arxiv.org/abs/2104.07081

    How to evaluate on this task

    You can evaluate an embedding model on this dataset using the following code:

        import mteb

        task = mteb.get_tasks(["RedditClustering.v2"])…

    See the full description on the dataset page: https://huggingface.co/datasets/mteb/reddit-clustering.

  4. REDDIT-5K Dataset

    • paperswithcode.com
    + more versions
    Cite
    Pinar Yanardag; S. V. N. Vishwanathan, REDDIT-5K Dataset [Dataset]. https://paperswithcode.com/dataset/reddit-5k
    Authors
    Pinar Yanardag; S. V. N. Vishwanathan
    Description

    Reddit-5K is a relational dataset extracted from Reddit.

  5. Reddit Datasets

    • promptcloud.com
    csv
    Updated Mar 28, 2025
    Cite
    PromptCloud (2025). Reddit Datasets [Dataset]. https://www.promptcloud.com/dataset/reddit/
    Available download formats: csv
    Dataset updated
    Mar 28, 2025
    Dataset authored and provided by
    PromptCloud
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Extracting Insights from Online Discussions

    Reddit is one of the largest social discussion platforms, making it a valuable source for real-time opinions, trends, sentiment analysis, and user interactions across various industries. Scraping Reddit data allows businesses, researchers, and analysts to explore public discussions, track sentiment, and gain actionable insights from user-generated content. Benefits and Impact: Trend […]

  6. Data from: Reddit Dataset on Meme Stock: GameStop

    • dataverse.harvard.edu
    • search.dataone.org
    Updated Jul 9, 2022
    Cite
    Jing Han (2022). Reddit Dataset on Meme Stock: GameStop [Dataset]. http://doi.org/10.7910/DVN/TUMIPC
    Dataset updated
    Jul 9, 2022
    Dataset provided by
    Harvard Dataverse
    Authors
    Jing Han
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset collects one year of Reddit posts, post metadata, post title sentiments, and post comment threads from the subreddits r/GME, r/superstonk, r/DDintoGME, and r/GMEJungle.

  7. Total global visitor traffic to Reddit.com 2024

    • statista.com
    • ai-chatbox.pro
    Updated Nov 11, 2024
    Cite
    Statista (2024). Total global visitor traffic to Reddit.com 2024 [Dataset]. https://www.statista.com/statistics/443332/reddit-monthly-visitors/
    Dataset updated
    Nov 11, 2024
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    Oct 2023 - Mar 2024
    Area covered
    Worldwide
    Description

    Reddit is a web traffic powerhouse: in March 2024, approximately 2.2 billion visits to the online forum were measured, making it one of the most-visited websites online.

    The front page of the internet

    Formerly known as “the front page of the internet”, Reddit is an online forum platform with over 130,000 sub-forums and communities. The platform allows registered users, called Redditors, to post content. Each post is open to the entire Reddit community to vote on, with either upvotes or downvotes. The most popular posts are featured directly on the front page. Subreddits are available by category, and Redditors can follow selected subreddits relevant to their interests and control what content they see on their custom front page. Some of the most popular subreddits are r/AskReddit and r/AMA, the “Ask Me Anything” format. According to the company, Reddit hosted 1,800 AMAs in 2018, with a wide range of topics and hosts. One of the most popular Reddit AMAs of 2022 by number of upvotes was by actor Nicolas Cage, with more than 238.5 thousand upvotes.

    Reddit usage

    The United States accounts for the biggest share of Reddit's desktop traffic, followed by the UK and Canada. As of March 2023, Reddit ranked among the most popular social media websites in the United States.

  8. reddit-self-disclosure

    • huggingface.co
    Updated Jul 4, 2024
    Cite
    Yao Dou (2024). reddit-self-disclosure [Dataset]. https://huggingface.co/datasets/douy/reddit-self-disclosure
    Dataset updated
    Jul 4, 2024
    Authors
    Yao Dou
    Description

    The data are in CoNLL IOB2 format. Each instance in batches 1-8 is annotated by one annotator, while each instance in batches 9 and 10 is annotated by two annotators, followed by adjudication.
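As a rough illustration of working with IOB2-tagged data, the sketch below reads CoNLL-style text and groups tags into labeled spans. It assumes one token per line with the token in the first column, the tag in the last column, and blank lines between sentences; the label name `AGE` and both function names are hypothetical and not taken from the dataset.

```python
from typing import List, Tuple


def read_conll(text: str) -> List[List[Tuple[str, str]]]:
    """Split CoNLL-style text into sentences of (token, tag) pairs."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        parts = line.split()
        # Assumption: token is the first column, tag is the last column.
        current.append((parts[0], parts[-1]))
    if current:
        sentences.append(current)
    return sentences


def iob2_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Convert IOB2 tags into (label, start, end) spans; end is exclusive."""
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        # An open span continues only on an "I-" tag with the same label.
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.append((label, start, i))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    if start is not None:
        spans.append((label, start, len(tags)))
    return spans


sample = "I O\nam O\n23 B-AGE\nyears I-AGE\nold I-AGE\n\nHi O\n"
first_sentence = read_conll(sample)[0]
spans = iob2_to_spans([tag for _, tag in first_sentence])
```

In IOB2 every span opens with a `B-` tag, so a stray `I-` tag with no open span is simply ignored by this sketch.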

      Accessing this dataset implies automatic agreement to the following guidelines:
    

    Use of the corpus is limited to research purposes only. Redistribution of the corpus without the authors’ permission is prohibited. Compliance with Reddit’s policy is mandatory.

  9. Reddit

    • redivis.com
    application/jsonl +7
    Updated Oct 27, 2021
    Cite
    Redivis Demo Organization (2021). Reddit [Dataset]. https://redivis.com/datasets/prpw-49sqq9ehv
    Available download formats: sas, stata, csv, avro, parquet, spss, application/jsonl, arrow
    Dataset updated
    Oct 27, 2021
    Dataset provided by
    Redivis Inc.
    Authors
    Redivis Demo Organization
    Description

    Abstract

    Reddit posts, 2019-01-01 through 2019-08-01.

    Documentation

    Source: https://console.cloud.google.com/bigquery?p=fh-bigquery&page=project

  10. reddit

    • huggingface.co
    Updated Jul 16, 2024
    Cite
    peternasser (2024). reddit [Dataset]. https://huggingface.co/datasets/peternasser99/reddit
    Dataset updated
    Jul 16, 2024
    Authors
    peternasser
    Description

    peternasser99/reddit dataset hosted on Hugging Face and contributed by the HF Datasets community

  11. Reddit posts published annually 2018-2026

    • statista.com
    Updated Feb 16, 2024
    Cite
    Statista (2024). Reddit posts published annually 2018-2026 [Dataset]. https://www.statista.com/forecasts/1309798/reddit-posts-published
    Dataset updated
    Feb 16, 2024
    Dataset authored and provided by
    Statista (http://statista.com/)
    Area covered
    Worldwide
    Description

    In 2023, the number of Reddit posts published on the platform during the year was estimated to have reached 470 million. Output volume is estimated to have increased steadily between 2018 and 2023, with users of the social forum and news aggregator tripling the number of posts they published over this period. Reddit, which was launched in 2005, is a social forum and news aggregator with high traffic volumes.

  12. Reddit Comments Dataset for Text Style Transfer Tasks

    • zenodo.org
    • data.niaid.nih.gov
    csv, json
    Updated Jun 17, 2023
    Cite
    Fabian Kopf; Fabian Kopf (2023). Reddit Comments Dataset for Text Style Transfer Tasks [Dataset]. http://doi.org/10.5281/zenodo.8023142
    Available download formats: csv, json
    Dataset updated
    Jun 17, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Fabian Kopf; Fabian Kopf
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Reddit Comments Dataset for Text Style Transfer Tasks

    A dataset of Reddit comments prepared for Text Style Transfer Tasks.

    The dataset contains Reddit comments translated into a formal language. For the translation of Reddit comments into a formal language text-davinci-003 was used. To make text-davinci-003 translate the comments into a more formal version, the following prompt was used:
    "Here is some text: {original_comment} Here is a rewrite of the text, which is more neutral: {"
    This prompting technique was taken from A Recipe For Arbitrary Text Style Transfer with Large Language Models.

    The dataset contains comments from the following Subreddits: antiwork, atheism, Conservative, conspiracy, dankmemes, gaybros, leagueoflegends, lgbt, libertarian, linguistics, MensRights, news, offbeat, PoliticalCompassMemes, politics, teenagers, TrueReddit, TwoXChromosomes, wallstreetbets, worldnews.

    The quality of formal translations was assessed with BERTScore and chrF++:

    • BERTScore: F1-Score: 0.89, Precision: 0.90, Recall: 0.88
    • chrF++: 37.16

    The average perplexity of the generated formal texts was calculated using GPT-2 and is 123.77


    The dataset consists of 3 components.

    reddit_commments.csv

    This file contains a collection of randomly selected comments from 20 Subreddits. For each comment, the following information was collected:
    - subreddit (name of the subreddit in which the comment was posted)
    - id (ID of the comment)
    - submission_id (ID of the submission to which the comment was posted)
    - body (the comment itself)
    - created_utc (timestamp in seconds)
    - parent_id (The ID of the comment or submission to which the comment is a reply)
    - permalink (The URL to the original comment)
    - token_size (How many tokens the comment will be split into by the standard GPT-2 tokenizer)
    - perplexity (What perplexity does GPT-2 calculate for the comment)

    The comments were filtered. This file contains only comments that:
    - have been split by the GPT-2 tokenizer into more than 10 but fewer than 512 tokens.
    - are not [removed] or [deleted]
    - do not contain URLs
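The three filter criteria above could be sketched roughly as follows. This is an illustration, not the authors' code: the token counter is passed in as a function, and the whitespace-based counter used here is only a stand-in for the GPT-2 tokenizer. The URL pattern is likewise an assumption about what counts as a URL.

```python
import re
from typing import Callable

# Assumed URL pattern; the dataset's actual URL check is not documented here.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+", re.IGNORECASE)


def keep_comment(body: str, count_tokens: Callable[[str], int]) -> bool:
    """Return True if a comment passes all three filters."""
    # Drop placeholder bodies left behind by moderation or deletion.
    if body.strip() in ("[removed]", "[deleted]"):
        return False
    # Drop comments that contain a URL.
    if URL_PATTERN.search(body):
        return False
    # Keep only comments with more than 10 and fewer than 512 tokens.
    return 10 < count_tokens(body) < 512


def whitespace_token_count(text: str) -> int:
    """Stand-in tokenizer: counts whitespace-separated words, not GPT-2 tokens."""
    return len(text.split())


comments = [
    "[deleted]",
    "short one",
    "see https://example.com for a much longer explanation of this whole thing ok",
    "this comment is long enough to pass the minimum token threshold used here",
]
kept = [c for c in comments if keep_comment(c, whitespace_token_count)]
```

Swapping in a real GPT-2 tokenizer only requires passing a different `count_tokens` function.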

    This file was used as a source for the other two file types.

    Labeled Files (training_labeled.csv and eval_labeled.csv)

    These files contain the formal translations of the Reddit comments.

    The 150 comments with the highest GPT-2 perplexity from each subreddit were translated into a formal version. This filter was used to translate as many comments as possible that have large stylistic salience.
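A minimal sketch of this selection step, assuming each row is a dict with the `subreddit` and `perplexity` columns listed for the comments file. The function name is hypothetical, and the rows below are made-up illustrations rather than real data.

```python
import heapq
from collections import defaultdict
from typing import Dict, Iterable, List


def top_k_by_perplexity(rows: Iterable[dict], k: int = 150) -> Dict[str, List[dict]]:
    """Group rows by subreddit and keep the k rows with the highest perplexity."""
    by_subreddit = defaultdict(list)
    for row in rows:
        by_subreddit[row["subreddit"]].append(row)
    # heapq.nlargest returns the rows sorted from highest to lowest perplexity.
    return {
        subreddit: heapq.nlargest(k, items, key=lambda r: r["perplexity"])
        for subreddit, items in by_subreddit.items()
    }


rows = [
    {"subreddit": "news", "id": "a", "perplexity": 50.0},
    {"subreddit": "news", "id": "b", "perplexity": 210.5},
    {"subreddit": "news", "id": "c", "perplexity": 120.0},
    {"subreddit": "politics", "id": "d", "perplexity": 80.0},
]
selected = top_k_by_perplexity(rows, k=2)
```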

    They are structured as follows:
    - Subreddit (name of the subreddit where the comment was posted).
    - Original Comment
    - Formal Comment

    Labeled Files with Style Examples (training_labeled_with_style_samples.json and eval_labeled_with_style_samples.json)

    These files contain an original Reddit comment, three sample comments from the same subreddit, and the formal translation of the original Reddit comment.

    These files can be used to train models to perform style transfers based on given examples.
    The task is to transform the formal translation of the Reddit comment, using the three given examples, into the style of the examples.

    An entry in this file is structured as follows:

    "data": [
        {
            "input_sentence": "The original Reddit comment",
            "style_samples": [
                "sample1",
                "sample2",
                "sample3"
            ],
            "results_sentence": "The formal translated input_sentence",
            "subreddit": "The subreddit from which the comments originated"
        },
        "..."
    ]
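To make the task concrete, here is a hedged sketch that turns one such entry into a (model input, target) pair: the formal text plus the three style samples form the input, and the original comment is the target. It assumes the file is a JSON object with a top-level `data` key, as the fragment above suggests. The prompt layout and the function name are illustrative choices, not something the dataset prescribes.

```python
import json
from typing import Tuple


def to_training_pair(entry: dict) -> Tuple[str, str]:
    """Build (source, target) for style transfer from one dataset entry."""
    samples = "\n".join(f"- {s}" for s in entry["style_samples"])
    # Assumed prompt layout: examples first, then the formal text to restyle.
    source = (
        f"Style examples:\n{samples}\n"
        f"Formal text: {entry['results_sentence']}\n"
        "Rewrite in the style of the examples:"
    )
    # The target is the original Reddit comment.
    return source, entry["input_sentence"]


raw = json.dumps({
    "data": [
        {
            "input_sentence": "lol no way that's real",
            "style_samples": ["sample1", "sample2", "sample3"],
            "results_sentence": "I find it hard to believe that this is real.",
            "subreddit": "news",
        }
    ]
})
entries = json.loads(raw)["data"]
source, target = to_training_pair(entries[0])
```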

  13. Reddit Conversation Dataset

    • kaggle.com
    Updated Jun 20, 2023
    Cite
    Sai J (2023). Reddit Conversation Dataset [Dataset]. https://www.kaggle.com/datasets/psyflow/reddit-conversation-dataset
    Dataset updated
    Jun 20, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Sai J
    Description

    This dataset is intended for building a conversational AI bot or for next-word prediction. Please upvote the dataset so that it reaches as many Kagglers as possible and helps them build a good chatbot; the dataset is 2.6 GB in size.

  14. suicidal ideation reddit dataset

    • ieee-dataport.org
    Updated Jul 8, 2024
    Cite
    Shiv Shukla (2024). suicidal ideation reddit dataset [Dataset]. https://ieee-dataport.org/documents/suicidal-ideation-reddit-dataset
    Dataset updated
    Jul 8, 2024
    Authors
    Shiv Shukla
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A suicidal ideation dataset extracted from the Reddit and Twitter social media platforms.

  15. Reddit comments/submissions 2005-06 to 2024-06

    • academictorrents.com
    bittorrent
    Updated Jul 14, 2024
    + more versions
    Cite
    stuck_in_the_matrix, Watchful1, RaiderBDev (2024). Reddit comments/submissions 2005-06 to 2024-06 [Dataset]. https://academictorrents.com/details/20520c420c6c846f555523babc8c059e9daa8fc5
    Available download formats: bittorrent
    Dataset updated
    Jul 14, 2024
    Dataset authored and provided by
    stuck_in_the_matrix, Watchful1, RaiderBDev
    License

    https://academictorrents.com/nolicensespecified

    Description

    Reddit comments and submissions from 2005-06 to 2023-09 collected by Pushshift and u/RaiderBDev. These are Zstandard-compressed NDJSON files. Example Python scripts for parsing the data can be found here. The more recent dumps are collected by u/RaiderBDev.
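A minimal sketch of reading such a dump in Python, assuming one JSON object per line. The function names are illustrative and are not the example scripts mentioned above. The decompression helper assumes the third-party `zstandard` package, and its large `max_window_size` is an assumption that dumps compressed with a long window typically need.

```python
import io
import json
from typing import Iterable, Iterator


def iter_ndjson(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed object per non-empty line of newline-delimited JSON."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        yield json.loads(line)


def iter_zst_ndjson(path: str) -> Iterator[dict]:
    """Stream records from a Zstandard-compressed NDJSON file."""
    import zstandard  # third-party package, imported lazily (assumption)

    with open(path, "rb") as fh:
        # Assumed setting: a large window so long-window archives decompress.
        decompressor = zstandard.ZstdDecompressor(max_window_size=2**31)
        reader = decompressor.stream_reader(fh)
        yield from iter_ndjson(io.TextIOWrapper(reader, encoding="utf-8"))


# Works on any iterable of lines, so it can be tried without a real dump.
records = list(iter_ndjson(io.StringIO('{"id": "a1"}\n\n{"id": "b2"}\n')))
```

Streaming line by line keeps memory use flat, which matters because the full dumps are far too large to load at once.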

  16. scandi-reddit

    • huggingface.co
    Updated Jan 30, 2023
    Cite
    Alexandra Institute (2023). scandi-reddit [Dataset]. https://huggingface.co/datasets/alexandrainst/scandi-reddit
    Dataset updated
    Jan 30, 2023
    Dataset authored and provided by
    Alexandra Institute
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Card for ScandiReddit

      Dataset Summary
    

    ScandiReddit is a filtered and post-processed corpus consisting of comments from Reddit. All Reddit comments from December 2005 up until October 2022 were downloaded through PushShift, after which these were filtered based on the FastText language detection model. Any comment which was classified as Danish (da), Norwegian (no), Swedish (sv) or Icelandic (is) with a confidence score above 70% was kept. The resulting comments… See the full description on the dataset page: https://huggingface.co/datasets/alexandrainst/scandi-reddit.
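The language filter described above can be sketched as follows. This is an assumption-laden illustration: `predict` stands in for a language-detection call (for example, a wrapper around a FastText model) that returns a language code and a confidence between 0 and 1. The function name and the stub predictions are hypothetical.

```python
from typing import Callable, Iterable, List, Tuple

# Language codes named in the description: Danish, Norwegian, Swedish, Icelandic.
KEEP_LANGUAGES = {"da", "no", "sv", "is"}


def keep_scandinavian(
    comments: Iterable[str],
    predict: Callable[[str], Tuple[str, float]],
    threshold: float = 0.7,
) -> List[Tuple[str, str]]:
    """Keep comments classified as a target language with confidence above threshold."""
    kept = []
    for text in comments:
        language, confidence = predict(text)
        if language in KEEP_LANGUAGES and confidence > threshold:
            kept.append((text, language))
    return kept


# Stub predictions for illustration only; a real run would call a detection model.
fake_predictions = {
    "Hej med dig": ("da", 0.95),
    "Hello there": ("en", 0.99),
    "Hei": ("no", 0.60),
}
result = keep_scandinavian(fake_predictions, lambda text: fake_predictions[text])
```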

  17. Distribution of Reddit.com traffic 2024, by country

    • statista.com
    • ai-chatbox.pro
    Updated Nov 11, 2024
    Cite
    Statista (2024). Distribution of Reddit.com traffic 2024, by country [Dataset]. https://www.statista.com/statistics/325144/reddit-global-active-user-distribution/
    Dataset updated
    Nov 11, 2024
    Dataset authored and provided by
    Statista (http://statista.com/)
    Area covered
    Worldwide
    Description

    In the six months ending March 2024, the United States accounted for 48.46 percent of traffic to the online forum Reddit.com. The United Kingdom ranked second, accounting for 7.16 percent of web visits to the social media platform.

    Reddit in the United States

    In August 2023, Reddit accounted for slightly over 1.6 percent of social media website traffic in the United States. Founded in 2005, Reddit is a discussion website that enables users to aggregate news by posting links and lets other users vote and comment on them. There are thousands of subforums, called subreddits, on a wide range of topics. One of the most popular subreddits is AMA (“Ask Me Anything”), where celebrities, public figures, or people in unique positions post threads that allow other Reddit users to ask them anything. In 2022, Nicolas Cage's AMA post generated over 238.5 thousand upvotes, making it the most popular AMA of the year.

    Reddit users in the United States

    Reddit use in the United States is more prevalent among younger online audiences. A February 2021 survey found that 36 percent of internet users aged 18 to 29 years and 22 percent of users aged 30 to 49 years used Reddit. However, the reach of the social platform declines strongly with age. Also, whilst around 23 percent of male adults in the U.S. access Reddit, only 12 percent of women do the same.

  18. Data from: The Reddit Politosphere: A Large-Scale Text and Network Resource...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 16, 2022
    Cite
    Hofmann, Valentin (2022). The Reddit Politosphere: A Large-Scale Text and Network Resource of Online Political Discourse [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5851728
    Dataset updated
    Jan 16, 2022
    Dataset provided by
    Schütze, Hinrich
    Hofmann, Valentin
    Pierrehumbert, Janet B.
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Reddit Politosphere is a large-scale resource of online political discourse covering more than 600 political discussion groups over a period of 12 years. Based on the Pushshift Reddit Dataset, it is to the best of our knowledge the largest and ideologically most comprehensive dataset of its type now available. One key feature of the Reddit Politosphere is that it consists of both text and network data. We also release annotated metadata for subreddits and users.

    Documentation and scripts for easy data access are provided in an associated repository on GitHub.

  19. reddit-depression-cleaned

    • huggingface.co
    Updated Feb 21, 2023
    Cite
    fastai X Hugging Face Group 2022 (2023). reddit-depression-cleaned [Dataset]. https://huggingface.co/datasets/hugginglearners/reddit-depression-cleaned
    Dataset updated
    Feb 21, 2023
    Dataset provided by
    Hugging Face (https://huggingface.co/)
    Authors
    fastai X Hugging Face Group 2022
    License

    https://choosealicense.com/licenses/cc0-1.0/

    Description

    Dataset Card for Depression: Reddit Dataset (Cleaned)

      Dataset Summary
    

    The raw data was collected by web scraping subreddits and cleaned using multiple NLP techniques. The data is in English only. It mainly targets mental health classification.

      Supported Tasks and Leaderboards
    

    [More Information Needed]

      Languages
    

    [More Information Needed]

      Dataset Structure
    
    
    
    
    
      Data Instances
    

    [More Information Needed]

      Data… See the full description on the dataset page: https://huggingface.co/datasets/hugginglearners/reddit-depression-cleaned.
    
  20. one-million-reddit-jokes

    • huggingface.co
    Updated Nov 1, 2021
    Cite
    one-million-reddit-jokes [Dataset]. https://huggingface.co/datasets/SocialGrep/one-million-reddit-jokes
    Dataset updated
    Nov 1, 2021
    Authors
    SocialGrep
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Card for one-million-reddit-jokes

      Dataset Summary
    

    This corpus contains a million posts from /r/jokes. Posts are annotated with their score.

      Languages
    

    Mainly English.

      Dataset Structure
    
    
    
    
    
    
    
      Data Instances
    

    A data point is a Reddit post.

      Data Fields
    

    'type': the type of the data point. Can be 'post' or 'comment'. 'id': the base-36 Reddit ID of the data point. Unique when combined with type.… See the full description on the dataset page: https://huggingface.co/datasets/SocialGrep/one-million-reddit-jokes.
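Because the `id` field is a base-36 string that is only unique together with `type`, a small sketch may help. The helper names and the sample records are hypothetical; only the `type` and `id` field names come from the description above.

```python
from typing import Dict, Iterable, List, Tuple


def reddit_id_to_int(base36_id: str) -> int:
    """Decode a base-36 Reddit ID into an integer."""
    return int(base36_id, 36)


def unique_key(point: dict) -> Tuple[str, str]:
    """IDs are unique only in combination with the data point type."""
    return (point["type"], point["id"])


def deduplicate(points: Iterable[dict]) -> List[dict]:
    """Keep the first occurrence of each (type, id) pair."""
    seen: Dict[Tuple[str, str], dict] = {}
    for point in points:
        seen.setdefault(unique_key(point), point)
    return list(seen.values())


points = [
    {"type": "post", "id": "10"},
    {"type": "comment", "id": "10"},  # same ID, different type: a distinct record
    {"type": "post", "id": "10"},     # exact duplicate of the first record
]
unique_points = deduplicate(points)
```

Keying on `id` alone would wrongly merge a post and a comment that happen to share the same base-36 value.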

SocialGrep (2022). the-reddit-dataset-dataset [Dataset]. https://huggingface.co/datasets/SocialGrep/the-reddit-dataset-dataset

the-reddit-dataset-dataset

SocialGrep/the-reddit-dataset-dataset

2 scholarly articles cite this dataset (View in Google Scholar)
Dataset updated
Jun 25, 2022
Authors
SocialGrep
License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

A meta dataset of Reddit's own /r/datasets community.
