Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A meta dataset of Reddit's own /r/datasets community.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We pre-processed posts and comments published during 2010-2016 on the subreddit r/depression and built a dataset from them.
https://choosealicense.com/licenses/unknown/
RedditClustering.v2, an MTEB (Massive Text Embedding Benchmark) dataset
Clustering of titles from 199 subreddits. Clustering of 25 sets, each with 10-50 classes, and each class with 100-1000 sentences.
Task category: t2c
Domains: Web, Social, Written
Reference: https://arxiv.org/abs/2104.07081
How to evaluate on this task
You can evaluate an embedding model on this dataset using the following code:
import mteb
task = mteb.get_tasks(["RedditClustering.v2"])
… See the full description on the dataset page: https://huggingface.co/datasets/mteb/reddit-clustering.
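The clustering tasks in MTEB are scored with the V-measure. As a self-contained illustration of that metric (pure Python, independent of the mteb package; function names are illustrative, not part of any library API):

```python
from collections import Counter
from math import log

def entropy(labels):
    """Shannon entropy of a label assignment."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())

def conditional_entropy(labels, given):
    """H(labels | given), computed from the joint distribution."""
    n = len(labels)
    joint = Counter(zip(given, labels))
    marg = Counter(given)
    return -sum((c / n) * log(c / marg[g]) for (g, _), c in joint.items())

def v_measure(true_labels, pred_labels):
    """Harmonic mean of homogeneity and completeness (V-measure)."""
    h_c, h_k = entropy(true_labels), entropy(pred_labels)
    homogeneity = 1.0 if h_c == 0 else 1.0 - conditional_entropy(true_labels, pred_labels) / h_c
    completeness = 1.0 if h_k == 0 else 1.0 - conditional_entropy(pred_labels, true_labels) / h_k
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)
```

A clustering that matches the true classes exactly (up to relabeling) scores 1.0, while a clustering that lumps everything together scores 0.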
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Extracting Insights from Online Discussions
Reddit is one of the largest social discussion platforms, making it a valuable source for real-time opinions, trends, sentiment analysis, and user interactions across various industries. Scraping Reddit data allows businesses, researchers, and analysts to explore public discussions, track sentiment, and gain actionable insights from user-generated content. Benefits and Impact: Trend […]
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset collects one year of Reddit posts, post metadata, post title sentiments, and post comment threads from the subreddits r/GME, r/superstonk, r/DDintoGME, and r/GMEJungle.
Reddit is a web traffic powerhouse: in March 2024, approximately 2.2 billion visits were measured to the online forum, making it one of the most-visited websites online.
The front page of the internet
Formerly known as "the front page of the internet", Reddit is an online forum platform with over 130,000 sub-forums and communities. The platform allows registered users, called Redditors, to post content. Each post is open to the entire Reddit community to vote upon, either by down- or upvotes. The most popular posts are featured directly on the front page. Subreddits are organized by category, and Redditors can follow selected subreddits relevant to their interests and control what content they see on their custom front page. Some of the most popular subreddits are r/AskReddit and r/AMA, the "Ask Me Anything" format. According to the company, Reddit hosted 1,800 AMAs in 2018, with a wide range of topics and hosts. One of the most popular Reddit AMAs of 2022 by number of upvotes was by actor Nicolas Cage, with more than 238.5 thousand upvotes.
Reddit usage
The United States accounts for the biggest share of Reddit's desktop traffic, followed by the UK and Canada. As of March 2023, Reddit ranked among the most popular social media websites in the United States.
The data are in CoNLL IOB2 format. Each instance in batches 1-8 is annotated by one annotator, while each instance in batches 9 and 10 is annotated by two annotators, followed by adjudication.
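As a sketch of how IOB2-tagged data of this kind can be consumed, the following groups (token, tag) pairs into labeled entity spans (the function name and example tokens are illustrative, not taken from this dataset):

```python
def iob2_spans(tokens, tags):
    """Group IOB2-tagged tokens into (label, token_list) entity spans."""
    spans, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # B- always starts a new span, closing any open one
            if current:
                spans.append((label, current))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            # I- continues the open span of the same label
            current.append(tok)
        else:
            # "O", or an I- tag that does not continue the open span
            if current:
                spans.append((label, current))
            current, label = [], None
    if current:
        spans.append((label, current))
    return spans
```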
Accessing this dataset implies automatic agreement to the following guidelines:
Use of the corpus is limited to research purposes only. Redistribution of the corpus without the authors’ permission is prohibited. Compliance with Reddit’s policy is mandatory.
Reddit posts, 2019-01-01 through 2019-08-01.
Source: https://console.cloud.google.com/bigquery?p=fh-bigquery&page=project
peternasser99/reddit dataset hosted on Hugging Face and contributed by the HF Datasets community
In 2023, the number of Reddit posts published on the platform during the year was estimated to have reached 470 million. The output volume is estimated to have increased steadily between 2018 and 2023, with the social and news aggregator tripling the number of posts users published in this period. Reddit, launched in 2005, is a social forum and news aggregator with high traffic volumes.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Reddit Comments Dataset for Text Style Transfer Tasks
A dataset of Reddit comments prepared for Text Style Transfer Tasks.
The dataset contains Reddit comments translated into formal language. text-davinci-003 was used for the translation. To make text-davinci-003 translate the comments into a more formal version, the following prompt was used:
"Here is some text: {original_comment} Here is a rewrite of the text, which is more neutral: {"
This prompting technique was taken from A Recipe For Arbitrary Text Style Transfer with Large Language Models.
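A sketch of how this prompt could be assembled (wrapping the comment in curly braces follows the cited recipe, where the trailing "{" cues the model to emit the rewrite and close it with "}"; the function name is illustrative):

```python
def build_rewrite_prompt(original_comment: str) -> str:
    """Build the delimiter-based style-transfer prompt described above.

    The comment is wrapped in braces, and the prompt ends with an opening
    brace so the completion is the rewrite, terminated by a closing brace.
    """
    return (
        f"Here is some text: {{{original_comment}}} "
        "Here is a rewrite of the text, which is more neutral: {"
    )
```

The model's completion would then be truncated at the first closing brace to recover the rewritten text.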
The dataset contains comments from the following Subreddits: antiwork, atheism, Conservative, conspiracy, dankmemes, gaybros, leagueoflegends, lgbt, libertarian, linguistics, MensRights, news, offbeat, PoliticalCompassMemes, politics, teenagers, TrueReddit, TwoXChromosomes, wallstreetbets, worldnews.
The quality of the formal translations was assessed with BERTScore and chrF++.
The average perplexity of the generated formal texts, calculated with GPT-2, is 123.77.
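For reference, perplexity is the exponential of the average negative log-probability per token. A minimal sketch of the computation from per-token probabilities (toy values, not GPT-2 itself):

```python
from math import exp, log

def perplexity(token_probs):
    """Perplexity from per-token probabilities assigned by a language model:
    exp of the negative mean log-probability."""
    n = len(token_probs)
    return exp(-sum(log(p) for p in token_probs) / n)
```

A uniform probability of 1/4 per token yields a perplexity of 4; lower perplexity means the model finds the text more predictable.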
The dataset consists of 3 components.
reddit_comments.csv
This file contains a collection of randomly selected comments from 20 Subreddits. For each comment, the following information was collected:
- subreddit (name of the subreddit in which the comment was posted)
- id (ID of the comment)
- submission_id (ID of the submission to which the comment was posted)
- body (the comment itself)
- created_utc (timestamp in seconds)
- parent_id (ID of the comment or submission to which the comment is a reply)
- permalink (URL of the original comment)
- token_size (number of tokens the comment is split into by the standard GPT-2 tokenizer)
- perplexity (perplexity that GPT-2 calculates for the comment)
The comments were filtered. This file contains only comments that:
- are split by the GPT-2 tokenizer into more than 10 but fewer than 512 tokens
- are not [removed] or [deleted]
- do not contain URLs
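The filter above can be sketched as follows (column names follow the field list above; the sample rows and the URL pattern are invented for illustration):

```python
import csv
import io
import re

URL_RE = re.compile(r"https?://\S+")

def keep(row):
    """Mirror the three filters described above."""
    n_tokens = int(row["token_size"])
    body = row["body"]
    return (
        10 < n_tokens < 512
        and body not in ("[removed]", "[deleted]")
        and not URL_RE.search(body)
    )

# Toy stand-in for the CSV file (subset of its columns)
sample = io.StringIO(
    "subreddit,id,body,token_size\n"
    "news,a1,this is a perfectly ordinary comment with enough tokens,12\n"
    "news,a2,[deleted],11\n"
    "news,a3,see https://example.com,14\n"
)
rows = [r for r in csv.DictReader(sample) if keep(r)]
```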
This file was used as a source for the other two file types.
Labeled Files (training_labeled.csv and eval_labeled.csv)
These files contain the formal translations of the Reddit comments.
The 150 comments with the highest GPT-2 perplexity from each Subreddit were translated into a formal version. This filter was used to translate as many comments as possible with high stylistic salience.
They are structured as follows:
- Subreddit (name of the subreddit where the comment was posted)
- Original Comment
- Formal Comment
Labeled Files with Style Examples (training_labeled_with_style_samples.json and eval_labeled_with_style_samples.json)
These files contain an original Reddit comment, three sample comments from the same subreddit, and the formal translation of the original Reddit comment.
These files can be used to train models to perform style transfers based on given examples.
The task is to transform the formal translation of the Reddit comment, using the three given examples, into the style of the examples.
An entry in this file is structured as follows:
"data":[
{
"input_sentence":"The original Reddit comment",
"style_samples":[
"sample1",
"sample2",
"sample3"
],
"results_sentence":"The formal translated input_sentence",
"subreddit":"The subreddit from which the comments originated"
},
"..."
]
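An entry of this shape can be read back with the standard library, for example (the record content below is an abridged, invented stand-in for a real entry):

```python
import json

# Minimal record in the shape described above
raw = """
{"data": [
  {"input_sentence": "The original Reddit comment",
   "style_samples": ["sample1", "sample2", "sample3"],
   "results_sentence": "The formal translated input_sentence",
   "subreddit": "news"}
]}
"""

entries = json.loads(raw)["data"]
for e in entries:
    # Each entry pairs one comment with exactly three style examples
    assert len(e["style_samples"]) == 3
```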
This dataset is aimed at building a conversational AI bot or at next-word prediction. Please upvote the dataset so that it reaches as many Kagglers as possible and helps them build a good chatbot; the dataset is 2.6 GB in size.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A suicidal ideation dataset extracted from the Reddit and Twitter social media platforms.
https://academictorrents.com/nolicensespecified
Reddit comments and submissions from 2005-06 to 2023-09, collected by Pushshift and u/RaiderBDev. These are zstandard-compressed ndjson files. Example Python scripts for parsing the data can be found here. The more recent dumps are collected by u/RaiderBDev.
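A sketch of parsing such ndjson dumps: the helper below takes any iterable of text lines and yields one object per line (the decompression step, noted in the docstring, would use the third-party zstandard package; the sample lines here are invented):

```python
import json

def iter_ndjson(lines):
    """Yield one JSON object per non-empty line (ndjson).

    For the compressed dumps, `lines` would come from a zstandard stream,
    e.g. io.TextIOWrapper(zstandard.ZstdDecompressor(max_window_size=2**31)
    .stream_reader(fh)); zstandard is a third-party package.
    """
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)

# Toy stand-in for a decompressed dump
dump = [
    '{"author": "example_user", "body": "first comment"}',
    '',
    '{"author": "another_user", "body": "second comment"}',
]
comments = list(iter_ndjson(dump))
```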
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Card for ScandiReddit
Dataset Summary
ScandiReddit is a filtered and post-processed corpus consisting of comments from Reddit. All Reddit comments from December 2005 up until October 2022 were downloaded through PushShift, after which these were filtered based on the FastText language detection model. Any comment which was classified as Danish (da), Norwegian (no), Swedish (sv) or Icelandic (is) with a confidence score above 70% was kept. The resulting comments… See the full description on the dataset page: https://huggingface.co/datasets/alexandrainst/scandi-reddit.
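The language-filtering rule described above can be sketched as a simple threshold check (in the real pipeline the label and confidence come from a FastText language-identification model; this helper only replicates the decision logic):

```python
def keep_comment(label, confidence,
                 languages=("da", "no", "sv", "is"), threshold=0.7):
    """Keep a comment only if it was classified as Danish, Norwegian,
    Swedish, or Icelandic with confidence above 70%."""
    return label in languages and confidence > threshold
```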
In the six months ending March 2024, the United States accounted for 48.46 percent of traffic to the online forum Reddit.com. The United Kingdom ranked second, accounting for 7.16 percent of web visits to the social media platform.
Reddit in the United States
In August 2023, Reddit accounted for slightly over 1.6 percent of social media website traffic in the United States. Founded in 2005, Reddit is a discussion website that enables users to aggregate news by posting links and letting other users vote and comment on them. There are thousands of subforums, called subreddits, on a wide range of topics. One of the most popular subreddits is the AMA ("Ask Me Anything"), where celebrities, public figures, or people in unique positions post threads that allow other Reddit users to ask them anything. In 2022, Nicolas Cage's AMA post generated over 238.5 thousand upvotes, making it the most popular AMA of the year.
Reddit users in the United States
Reddit use in the United States is more prevalent among younger online audiences. During a February 2021 survey, it was found that 36 percent of internet users aged 18 to 29 years and 22 percent of users aged 30 to 49 years used Reddit. However, the reach of the social platform declines strongly with age. Also, while around 23 percent of male adults in the U.S. access Reddit, only 12 percent of women do the same.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Reddit Politosphere is a large-scale resource of online political discourse covering more than 600 political discussion groups over a period of 12 years. Based on the Pushshift Reddit Dataset, it is to the best of our knowledge the largest and ideologically most comprehensive dataset of its type now available. One key feature of the Reddit Politosphere is that it consists of both text and network data. We also release annotated metadata for subreddits and users.
Documentation and scripts for easy data access are provided in an associated repository on GitHub.
https://choosealicense.com/licenses/cc0-1.0/
Dataset Card for Depression: Reddit Dataset (Cleaned)
Dataset Summary
The raw data were collected by web scraping subreddits and cleaned using multiple NLP techniques. The data are in English only. The dataset mainly targets mental health classification.
Supported Tasks and Leaderboards
[More Information Needed]
Languages
[More Information Needed]
Dataset Structure
Data Instances
[More Information Needed]
Data… See the full description on the dataset page: https://huggingface.co/datasets/hugginglearners/reddit-depression-cleaned.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Card for one-million-reddit-jokes
Dataset Summary
This corpus contains a million posts from /r/jokes. Posts are annotated with their score.
Languages
Mainly English.
Dataset Structure
Data Instances
A data point is a Reddit post.
Data Fields
'type': the type of the data point; can be 'post' or 'comment'.
'id': the base-36 Reddit ID of the data point; unique when combined with type.
… See the full description on the dataset page: https://huggingface.co/datasets/SocialGrep/one-million-reddit-jokes.
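Base-36 Reddit IDs (digits 0-9 and letters a-z) decode to integers with Python's built-in int; a minimal sketch (the function name and sample IDs are illustrative):

```python
def reddit_id_to_int(base36_id: str) -> int:
    """Decode a base-36 Reddit ID to its integer value."""
    return int(base36_id, 36)
```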