https://brightdata.com/license
Access our extensive Reddit datasets that provide detailed information on posts, communities (subreddits), and user engagement. Gain insights into post performance, user comments, community statistics, and content trends with our ethically sourced data. Free samples are available for evaluation.
- 3M+ records available
- Price starts at $250/100K records
- Data formats are available in JSON, NDJSON, CSV, XLSX and Parquet
- 100% ethical and compliant data collection
Included datapoints:
- Post ID, Title & URL
- Post Description & Date
- Username of Poster
- Upvotes & Comment Count
- Community Name, URL & Description
- Community Member Count
- Attached Photos & Videos
- Full Post Comments
- Related Posts
- Post Karma
- Post Tags
- And more
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A meta dataset of Reddit's own /r/datasets community.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Extracting Insights from Online Discussions
Reddit is one of the largest social discussion platforms, making it a valuable source for real-time opinions, trends, sentiment analysis, and user interactions across various industries. Scraping Reddit data allows businesses, researchers, and analysts to explore public discussions, track sentiment, and gain actionable insights from user-generated content. Benefits and Impact: Trend […]
ODC Public Domain Dedication and Licence (PDDL) v1.0 http://www.opendatacommons.org/licenses/pddl/1.0/
License information was derived automatically
This dataset contains posts from 28 subreddits (15 mental health support groups) from 2018-2020. We used this dataset to understand the impact of COVID-19 on mental health support groups from January to April, 2020 and included older timeframes to obtain baseline posts before COVID-19.
Please cite if you use this dataset:
Low, D. M., Rumker, L., Torous, J., Cecchi, G., Ghosh, S. S., & Talkar, T. (2020). Natural Language Processing Reveals Vulnerable Mental Health Support Groups and Heightened Health Anxiety on Reddit During COVID-19: Observational Study. Journal of Medical Internet Research, 22(10), e22635.
@article{low2020natural,
  title={Natural Language Processing Reveals Vulnerable Mental Health Support Groups and Heightened Health Anxiety on Reddit During COVID-19: Observational Study},
  author={Low, Daniel M and Rumker, Laurie and Torous, John and Cecchi, Guillermo and Ghosh, Satrajit S and Talkar, Tanya},
  journal={Journal of medical Internet research},
  volume={22},
  number={10},
  pages={e22635},
  year={2020},
  publisher={JMIR Publications Inc., Toronto, Canada}
}
License
This dataset is made available under the Public Domain Dedication and License v1.0 whose full text can be found at: http://www.opendatacommons.org/licenses/pddl/1.0/
It was downloaded using the Pushshift API. Re-use of this data is subject to Reddit API terms.
Reddit Mental Health Dataset
Contains posts and text features for the following timeframes from 28 mental health and non-mental health subreddits.
Filenames and corresponding timeframes:
- post: Jan 1 to April 20, 2020 (called "mid-pandemic" in manuscript; r/COVID19_support appears). Unique users: 320,364.
- pre: Dec 2018 to Dec 2019. A full year which provides more data for a baseline of Reddit posts. Unique users: 327,289.
- 2019: Jan 1 to April 20, 2019 (r/EDAnonymous appears). A control for seasonal fluctuations to match post data. Unique users: 282,560.
- 2018: Jan 1 to April 20, 2018. A control for seasonal fluctuations to match post data. Unique users: 177,089.
Unique users across all time windows (pre and 2019 overlap): 826,961.
See manuscript Supplementary Materials (https://doi.org/10.31234/osf.io/xvwcy) for more information.
Note: if subsampling (e.g., to balance subreddits), we recommend bootstrapping analyses for unbiased results.
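The note above recommends bootstrapping but does not say which variant. As a minimal sketch, assuming a simple percentile bootstrap over a list of per-post feature values, the idea could look like this (the function name `bootstrap_ci`, its parameters, and the sample values are hypothetical and not part of the dataset):

```python
import random
from statistics import mean


def bootstrap_ci(values, stat=mean, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for `stat` over `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement, recompute the statistic, and sort the results.
    estimates = sorted(
        stat([values[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return estimates[lo_idx], estimates[hi_idx]


# Hypothetical per-post feature values from one subsampled subreddit.
sample = [0.12, 0.30, 0.25, 0.41, 0.18, 0.22, 0.35, 0.27]
low, high = bootstrap_ci(sample)
print(f"mean={mean(sample):.3f}, 95% CI=({low:.3f}, {high:.3f})")
```

Repeating the subsampling step itself inside the resampling loop may be closer to what the authors intend; the manuscript's Supplementary Materials are the authority on that.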
https://academictorrents.com/nolicensespecified
~1.7 billion JSON comment objects from reddit.com, complete with the comment, score, author, subreddit, position in comment tree and other fields that are available through Reddit's API.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This corpus contains preprocessed posts from the Reddit dataset. The dataset consists of 3,848,330 posts with an average length of 270 words for the content and 28 words for the summary.
Features include the string fields author, body, normalizedBody, content, summary, subreddit, and subreddit_id. The content field is used as the document and the summary field as the summary.
https://www.reddit.com/wiki/api
Recently Reddit released an enormous dataset containing all ~1.7 billion of their publicly available comments. The full dataset is an unwieldy 1+ terabyte uncompressed, so we've decided to host a small portion of the comments here for Kagglers to explore. (You don't even need to leave your browser!)
You can find all the comments from May 2015 on scripts for your natural language processing pleasure. What had redditors laughing, bickering, and NSFW-ing this spring?
Who knows? Top visualizations may just end up on Reddit.
The database has one table, May2015, with the following fields:
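As a rough sketch of how such a single-table database might be queried, the example below assumes it is a SQLite file and builds a tiny in-memory stand-in. The column names `subreddit`, `score`, and `body` are assumptions for illustration, since the field list is not reproduced here; only the table name `May2015` comes from the description.

```python
import sqlite3

# In-memory stand-in for the real database file (assumed to be SQLite).
conn = sqlite3.connect(":memory:")
# Column names are hypothetical; only the table name comes from the description.
conn.execute("CREATE TABLE May2015 (subreddit TEXT, score INTEGER, body TEXT)")
conn.executemany(
    "INSERT INTO May2015 VALUES (?, ?, ?)",
    [
        ("AskReddit", 5, "first comment"),
        ("AskReddit", 3, "second comment"),
        ("funny", 10, "third comment"),
    ],
)


def top_subreddits(conn, limit=5):
    """Return (subreddit, comment count, average score), busiest first."""
    query = """
        SELECT subreddit, COUNT(*) AS n_comments, AVG(score) AS avg_score
        FROM May2015
        GROUP BY subreddit
        ORDER BY n_comments DESC, subreddit ASC
        LIMIT ?
    """
    return conn.execute(query, (limit,)).fetchall()


for row in top_subreddits(conn):
    print(row)
```

Against the real file, the `sqlite3.connect(":memory:")` call would be replaced by a connection to the downloaded database, and the `CREATE TABLE` and `INSERT` statements would be dropped.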
Dataset Card for "pushshift-reddit"
More Information needed
https://www.reddit.com/wiki/api
Analyse the popularity of public subreddits
The CSV lists every subreddit on Reddit. There are a total of 1,067,472 subreddits, and the columns in the dataset are:
This dataset was originally published on /r/datasets by /u/Stuck_In_the_Matrix
Reddit posts, 2019-01-01 through 2019-08-01.
Source: https://console.cloud.google.com/bigquery?p=fh-bigquery&page=project
Dataset Card for AITA Reddit Posts and Comments
Posts from the AITA subreddit, each with the two top-voted comments that share the post verdict. Extracted using Reddit PushShift (from 2013 to April 2023).
Dataset Details
The dataset contains 270,709 entries, each of which contains the post title, text, verdict, comment1, comment2 and score (number of upvotes). For more details see the paper: https://arxiv.org/abs/2310.18336
Dataset Sources
The Reddit PushShift data dumps are… See the full description on the dataset page: https://huggingface.co/datasets/OsamaBsher/AITA-Reddit-Dataset.
The Reddit dataset contains tuples of user name, a subreddit where the user makes a comment to a thread, and a timestamp for the interaction, split into sessions manually.
In the six months ending March 2024, the United States accounted for 48.46 percent of traffic to the online forum Reddit.com. The United Kingdom was ranked second, accounting for 7.16 percent of web visits to the social media platform.
Reddit in the United States
In August 2023, Reddit accounted for slightly over 1.6 percent of social media website traffic in the United States. Founded in 2005, Reddit is a discussion website which enables users to aggregate news by posting links and lets other users vote and comment on them. There are thousands of subforums, called subreddits, on a wide range of topics. One of the most popular subreddits is the AMA (“Ask Me Anything”), where celebrities, public figures or people in unique positions post threads that allow other Reddit users to ask them anything. In 2022, Nicolas Cage's AMA post generated over 238.5 thousand upvotes, making it the most popular AMA of the year.
Reddit users in the United States
Reddit use in the United States is more prevalent among younger online audiences. A February 2021 survey found that 36 percent of internet users aged 18 to 29 years and 22 percent of users aged 30 to 49 years used Reddit. However, the reach of the social platform strongly declines with age. Also, whilst around 23 percent of male adults in the U.S. access Reddit, only 12 percent of women do the same.
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
All posts were manually annotated as Suicidal or Non-Suicidal based on the following rules:
1. Suicidal Text
- Posts that conveyed definite signs of suicidal ideation or even showed signs of suffering extremely from mental health illnesses like depression etc. were marked in this category due to their relation with suicidal intent.
- Posts that included detailed planning of suicide or asked questions related to committing suicide, e.g. “Hello, hypothetically what would be a good way to go without loved ones knowing?”.
- Posts like "The weather today is so awful that it makes me want to kill myself hahaha" were carefully removed.
- These posts were marked as “1”.
2. Non Suicidal Text
- Posts that did not have anything related to suicide or self-harm were marked in this category.
- Posts that used words related to suicide or self-harm in the context of news or information.
- Posts that talked about suicide of some other person at some other time.
- These posts were marked as “0”. This was the default category.
Our annotators included one university professor and three university students who were very carefully instructed on how to annotate each post. The instructions are given below:
1. Select only one of the two categories mentioned above.
2. Select the default category in case of any doubt.
3. Remove any ambiguous posts that still seemed very confusing after discussion with the other annotators.
4. Annotate a maximum of 100-200 posts in one session to avoid mental fatigue.
5. Since the majority of posts in the dataset were extremely long (more than 1,000 words), a maximum of two annotation sessions were allowed in a day.
Once the annotators completed their tasks, they were divided into pairs, and each annotator verified the annotations of the other. Any disagreement was carefully resolved and the final annotation was mutually agreed upon by the pair. This helped in validating each annotation.
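The pairwise verification step could be supported by a small helper that lists the posts on which two annotators disagree, so the pair knows which labels to discuss. This is only an illustrative sketch: the function name, the dictionary layout (post ID mapped to a 0/1 label), and the sample IDs are assumptions, not the authors' tooling.

```python
def find_disagreements(labels_a, labels_b):
    """Return sorted post IDs that both annotators labeled differently.

    Each argument maps a post ID to a label: 1 (Suicidal) or 0 (Non-Suicidal).
    Posts labeled by only one annotator are ignored.
    """
    shared = labels_a.keys() & labels_b.keys()
    return sorted(pid for pid in shared if labels_a[pid] != labels_b[pid])


# Hypothetical labels from two annotators in one pair.
annotator_1 = {"p1": 1, "p2": 0, "p3": 0, "p4": 1}
annotator_2 = {"p1": 1, "p2": 1, "p3": 0, "p5": 0}

print(find_disagreements(annotator_1, annotator_2))
```

Only the posts returned by the helper would need discussion before the pair agrees on a final label.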
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset collects one year of Reddit posts, post metadata, post title sentiments, and post comment threads from the subreddits r/GME, r/superstonk, r/DDintoGME, and r/GMEJungle.
Reddit Randomness Dataset
A dataset I created because I was curious about how "random" r/random really is. This data was collected by sending GET requests to https://www.reddit.com/r/random for a few hours on September 19th, 2021. I scraped a bit of metadata about the subreddits as well. randomness_12k_clean.csv reports the random subreddits as they happened and summary.csv lists some metadata about each subreddit.
The Data
randomness_12k_clean.csv
This… See the full description on the dataset page: https://huggingface.co/datasets/davidwisdom/reddit-randomness.
The table subreddits is part of the dataset Reddit, available at https://redivis.com/datasets/prpw-49sqq9ehv. It contains 2499 rows across 7 variables.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Card for one-million-reddit-questions
Dataset Summary
This corpus contains a million posts on /r/AskReddit, annotated with their score.
Languages
Mainly English.
Dataset Structure
Data Instances
A data point is a Reddit post.
Data Fields
'type': the type of the data point. Can be 'post' or 'comment'.
'id': the base-36 Reddit ID of the data point. Unique when combined with type.
'subreddit.id': the base-36 Reddit ID of… See the full description on the dataset page: https://huggingface.co/datasets/SocialGrep/one-million-reddit-questions.
https://academictorrents.com/nolicensespecified
Reddit comments and submissions from 2023-02, collected by Pushshift. These are Zstandard-compressed NDJSON files. Example Python scripts for parsing the data are also available.
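As a minimal sketch of the NDJSON side of such parsing (not the example scripts referred to above), the generator below reads one JSON object per line from any text stream. Decompression is deliberately left out: it assumes the `.zst` file has already been opened as a text stream, for example through the third-party `zstandard` package. The record fields in the demo are hypothetical.

```python
import io
import json


def iter_ndjson(stream):
    """Yield one parsed JSON object per non-empty line of a text stream."""
    for line in stream:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines instead of aborting the whole pass


# Hypothetical records standing in for a decompressed dump.
demo = io.StringIO(
    '{"author": "alice", "subreddit": "datasets", "score": 3}\n'
    "\n"
    "not valid json\n"
    '{"author": "bob", "subreddit": "AskReddit", "score": 7}\n'
)

records = list(iter_ndjson(demo))
print(len(records), [r["author"] for r in records])
```

Streaming line by line keeps memory use flat, which matters because the monthly dumps are far too large to load at once.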
https://academictorrents.com/nolicensespecified
Reddit comments and submissions from 2005-06 to 2022-12, collected by Pushshift. These are Zstandard-compressed NDJSON files. Example Python scripts for parsing the data are also available.