Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A meta dataset of Reddit's own /r/datasets community.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We pre-processed posts and comments published during 2010-2016 on the subreddit r/depression and built a dataset from them.
https://choosealicense.com/licenses/unknown/
RedditClustering.v2, an MTEB (Massive Text Embedding Benchmark) dataset
Clustering of titles from 199 subreddits. Clustering of 25 sets, each with 10-50 classes, and each class with 100-1000 sentences.
Task category: t2c
Domains: Web, Social, Written
Reference: https://arxiv.org/abs/2104.07081
How to evaluate on this task
You can evaluate an embedding model on this dataset using the following code:
import mteb
task = mteb.get_tasks(["RedditClustering.v2"])
… See the full description on the dataset page: https://huggingface.co/datasets/mteb/reddit-clustering.
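The clustering tasks in MTEB are scored with the V-measure. As a self-contained illustration of that metric (pure Python, independent of the mteb package; function names are illustrative, not part of any library API):

```python
from collections import Counter
from math import log

def entropy(labels):
    """Shannon entropy of a label assignment."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())

def conditional_entropy(labels, given):
    """H(labels | given), computed from the joint distribution."""
    n = len(labels)
    joint = Counter(zip(given, labels))
    marg = Counter(given)
    return -sum((c / n) * log(c / marg[g]) for (g, _), c in joint.items())

def v_measure(true_labels, pred_labels):
    """Harmonic mean of homogeneity and completeness (V-measure)."""
    h_c, h_k = entropy(true_labels), entropy(pred_labels)
    homogeneity = 1.0 if h_c == 0 else 1.0 - conditional_entropy(true_labels, pred_labels) / h_c
    completeness = 1.0 if h_k == 0 else 1.0 - conditional_entropy(pred_labels, true_labels) / h_k
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)
```

A clustering that matches the true classes exactly (up to relabeling) scores 1.0, while a clustering that lumps everything together scores 0.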
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Extracting Insights from Online Discussions
Reddit is one of the largest social discussion platforms, making it a valuable source for real-time opinions, trends, sentiment analysis, and user interactions across various industries. Scraping Reddit data allows businesses, researchers, and analysts to explore public discussions, track sentiment, and gain actionable insights from user-generated content. Benefits and Impact: Trend […]
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset collects one year of Reddit posts, post metadata, post title sentiments, and post comment threads from the subreddits r/GME, r/superstonk, r/DDintoGME, and r/GMEJungle.
Reddit is a web traffic powerhouse: in March 2024, approximately 2.2 billion visits were measured to the online forum, making it one of the most-visited websites online.
The front page of the internet
Formerly known as "the front page of the internet", Reddit is an online forum platform with over 130,000 sub-forums and communities. The platform allows registered users, called Redditors, to post content. Each post is open to the entire Reddit community to vote upon, either by down- or upvotes. The most popular posts are featured directly on the front page. Subreddits are organized by category, and Redditors can follow selected subreddits relevant to their interests and control what content they see on their custom front page. Some of the most popular subreddits are r/AskReddit and r/AMA, the "Ask Me Anything" format. According to the company, Reddit hosted 1,800 AMAs in 2018, with a wide range of topics and hosts. One of the most popular Reddit AMAs of 2022 by number of upvotes was by actor Nicolas Cage, with more than 238.5 thousand upvotes.
Reddit usage
The United States accounts for the biggest share of Reddit's desktop traffic, followed by the UK and Canada. As of March 2023, Reddit ranked among the most popular social media websites in the United States.
The data are in CoNLL IOB2 format. Each instance in batches 1-8 is annotated by one annotator, while each instance in batches 9 and 10 is annotated by two annotators, followed by adjudication.
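As a sketch of how IOB2-tagged data of this kind can be consumed, the following groups (token, tag) pairs into labeled entity spans (the function name and example tokens are illustrative, not taken from this dataset):

```python
def iob2_spans(tokens, tags):
    """Group IOB2-tagged tokens into (label, token_list) entity spans."""
    spans, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # B- always starts a new span, closing any open one
            if current:
                spans.append((label, current))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            # I- continues the open span of the same label
            current.append(tok)
        else:
            # "O", or an I- tag that does not continue the open span
            if current:
                spans.append((label, current))
            current, label = [], None
    if current:
        spans.append((label, current))
    return spans
```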
Accessing this dataset implies automatic agreement to the following guidelines:
Use of the corpus is limited to research purposes only. Redistribution of the corpus without the authors’ permission is prohibited. Compliance with Reddit’s policy is mandatory.
Reddit posts, 2019-01-01 through 2019-08-01.
Source: https://console.cloud.google.com/bigquery?p=fh-bigquery&page=project
peternasser99/reddit dataset hosted on Hugging Face and contributed by the HF Datasets community
In 2023, the number of Reddit posts published on the platform during the year was estimated to have reached 470 million. The output volume is estimated to have increased steadily between 2018 and 2023, with the social and news aggregator tripling the number of posts users published in this period. Reddit, launched in 2005, is a social forum and news aggregator with high traffic volumes.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Reddit Comments Dataset for Text Style Transfer Tasks
A dataset of Reddit comments prepared for Text Style Transfer Tasks.
The dataset contains Reddit comments translated into formal language. text-davinci-003 was used for the translation. To make text-davinci-003 translate the comments into a more formal version, the following prompt was used:
"Here is some text: {original_comment} Here is a rewrite of the text, which is more neutral: {"
This prompting technique was taken from A Recipe For Arbitrary Text Style Transfer with Large Language Models.
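A sketch of how this prompt could be assembled (wrapping the comment in curly braces follows the cited recipe, where the trailing "{" cues the model to emit the rewrite and close it with "}"; the function name is illustrative):

```python
def build_rewrite_prompt(original_comment: str) -> str:
    """Build the delimiter-based style-transfer prompt described above.

    The comment is wrapped in braces, and the prompt ends with an opening
    brace so the completion is the rewrite, terminated by a closing brace.
    """
    return (
        f"Here is some text: {{{original_comment}}} "
        "Here is a rewrite of the text, which is more neutral: {"
    )
```

The model's completion would then be truncated at the first closing brace to recover the rewritten text.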
The dataset contains comments from the following Subreddits: antiwork, atheism, Conservative, conspiracy, dankmemes, gaybros, leagueoflegends, lgbt, libertarian, linguistics, MensRights, news, offbeat, PoliticalCompassMemes, politics, teenagers, TrueReddit, TwoXChromosomes, wallstreetbets, worldnews.
The quality of the formal translations was assessed with BERTScore and chrF++.
The average perplexity of the generated formal texts, calculated with GPT-2, is 123.77.
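For reference, perplexity is the exponential of the average negative log-probability per token. A minimal sketch of the computation from per-token probabilities (toy values, not GPT-2 itself):

```python
from math import exp, log

def perplexity(token_probs):
    """Perplexity from per-token probabilities assigned by a language model:
    exp of the negative mean log-probability."""
    n = len(token_probs)
    return exp(-sum(log(p) for p in token_probs) / n)
```

A uniform probability of 1/4 per token yields a perplexity of 4; lower perplexity means the model finds the text more predictable.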
The dataset consists of 3 components.
reddit_comments.csv
This file contains a collection of randomly selected comments from 20 Subreddits. For each comment, the following information was collected:
- subreddit (name of the subreddit in which the comment was posted)
- id (ID of the comment)
- submission_id (ID of the submission to which the comment was posted)
- body (the comment itself)
- created_utc (timestamp in seconds)
- parent_id (ID of the comment or submission to which the comment is a reply)
- permalink (URL of the original comment)
- token_size (number of tokens the comment is split into by the standard GPT-2 tokenizer)
- perplexity (perplexity that GPT-2 calculates for the comment)
The comments were filtered. This file contains only comments that:
- are split by the GPT-2 tokenizer into more than 10 but fewer than 512 tokens
- are not [removed] or [deleted]
- do not contain URLs
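The filter above can be sketched as follows (column names follow the field list above; the sample rows and the URL pattern are invented for illustration):

```python
import csv
import io
import re

URL_RE = re.compile(r"https?://\S+")

def keep(row):
    """Mirror the three filters described above."""
    n_tokens = int(row["token_size"])
    body = row["body"]
    return (
        10 < n_tokens < 512
        and body not in ("[removed]", "[deleted]")
        and not URL_RE.search(body)
    )

# Toy stand-in for the CSV file (subset of its columns)
sample = io.StringIO(
    "subreddit,id,body,token_size\n"
    "news,a1,this is a perfectly ordinary comment with enough tokens,12\n"
    "news,a2,[deleted],11\n"
    "news,a3,see https://example.com,14\n"
)
rows = [r for r in csv.DictReader(sample) if keep(r)]
```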
This file was used as a source for the other two file types.
Labeled Files (training_labeled.csv and eval_labeled.csv)
These files contain the formal translations of the Reddit comments.
The 150 comments with the highest GPT-2 perplexity from each Subreddit were translated into a formal version. This filter was used to translate as many comments as possible with high stylistic salience.
They are structured as follows:
- Subreddit (name of the subreddit where the comment was posted)
- Original Comment
- Formal Comment
Labeled Files with Style Examples (training_labeled_with_style_samples.json and eval_labeled_with_style_samples.json)
These files contain an original Reddit comment, three sample comments from the same subreddit, and the formal translation of the original Reddit comment.
These files can be used to train models to perform style transfers based on given examples.
The task is to transform the formal translation of the Reddit comment, using the three given examples, into the style of the examples.
An entry in this file is structured as follows:
"data":[
{
"input_sentence":"The original Reddit comment",
"style_samples":[
"sample1",
"sample2",
"sample3"
],
"results_sentence":"The formal translated input_sentence",
"subreddit":"The subreddit from which the comments originated"
},
"..."
]
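An entry of this shape can be read back with the standard library, for example (the record content below is an abridged, invented stand-in for a real entry):

```python
import json

# Minimal record in the shape described above
raw = """
{"data": [
  {"input_sentence": "The original Reddit comment",
   "style_samples": ["sample1", "sample2", "sample3"],
   "results_sentence": "The formal translated input_sentence",
   "subreddit": "news"}
]}
"""

entries = json.loads(raw)["data"]
for e in entries:
    # Each entry pairs one comment with exactly three style examples
    assert len(e["style_samples"]) == 3
```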
This dataset is aimed at building a conversational AI bot or at next-word prediction. Please upvote the dataset so that it reaches as many Kagglers as possible and helps them build a good chatbot; the dataset is 2.6 GB in size.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A suicidal ideation dataset extracted from the Reddit and Twitter social media platforms.
https://academictorrents.com/nolicensespecified
Reddit comments and submissions from 2005-06 to 2023-09, collected by Pushshift and u/RaiderBDev. These are zstandard-compressed ndjson files. Example Python scripts for parsing the data can be found here. The more recent dumps are collected by u/RaiderBDev.
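A sketch of parsing such ndjson dumps: the helper below takes any iterable of text lines and yields one object per line (the decompression step, noted in the docstring, would use the third-party zstandard package; the sample lines here are invented):

```python
import json

def iter_ndjson(lines):
    """Yield one JSON object per non-empty line (ndjson).

    For the compressed dumps, `lines` would come from a zstandard stream,
    e.g. io.TextIOWrapper(zstandard.ZstdDecompressor(max_window_size=2**31)
    .stream_reader(fh)); zstandard is a third-party package.
    """
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)

# Toy stand-in for a decompressed dump
dump = [
    '{"author": "example_user", "body": "first comment"}',
    '',
    '{"author": "another_user", "body": "second comment"}',
]
comments = list(iter_ndjson(dump))
```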
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Card for ScandiReddit
Dataset Summary
ScandiReddit is a filtered and post-processed corpus consisting of comments from Reddit. All Reddit comments from December 2005 up until October 2022 were downloaded through PushShift, after which these were filtered based on the FastText language detection model. Any comment which was classified as Danish (da), Norwegian (no), Swedish (sv) or Icelandic (is) with a confidence score above 70% was kept. The resulting comments… See the full description on the dataset page: https://huggingface.co/datasets/alexandrainst/scandi-reddit.
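The language-filtering rule described above can be sketched as a simple threshold check (in the real pipeline the label and confidence come from a FastText language-identification model; this helper only replicates the decision logic):

```python
def keep_comment(label, confidence,
                 languages=("da", "no", "sv", "is"), threshold=0.7):
    """Keep a comment only if it was classified as Danish, Norwegian,
    Swedish, or Icelandic with confidence above 70%."""
    return label in languages and confidence > threshold
```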
In the six months ending March 2024, the United States accounted for 48.46 percent of traffic to the online forum Reddit.com. The United Kingdom ranked second, accounting for 7.16 percent of web visits to the social media platform.
Reddit in the United States
In August 2023, Reddit accounted for slightly over 1.6 percent of social media website traffic in the United States. Founded in 2005, Reddit is a discussion website that enables users to aggregate news by posting links and letting other users vote and comment on them. There are thousands of subforums, called subreddits, on a wide range of topics. One of the most popular subreddits is the AMA ("Ask Me Anything"), where celebrities, public figures, or people in unique positions post threads that allow other Reddit users to ask them anything. In 2022, Nicolas Cage's AMA post generated over 238.5 thousand upvotes, making it the most popular AMA of the year.
Reddit users in the United States
Reddit use in the United States is more prevalent among younger online audiences. During a February 2021 survey, it was found that 36 percent of internet users aged 18 to 29 years and 22 percent of users aged 30 to 49 years used Reddit. However, the reach of the social platform declines strongly with age. Also, while around 23 percent of male adults in the U.S. access Reddit, only 12 percent of women do the same.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Reddit Politosphere is a large-scale resource of online political discourse covering more than 600 political discussion groups over a period of 12 years. Based on the Pushshift Reddit Dataset, it is to the best of our knowledge the largest and ideologically most comprehensive dataset of its type now available. One key feature of the Reddit Politosphere is that it consists of both text and network data. We also release annotated metadata for subreddits and users.
Documentation and scripts for easy data access are provided in an associated repository on GitHub.
https://choosealicense.com/licenses/cc0-1.0/
Dataset Card for Depression: Reddit Dataset (Cleaned)
Dataset Summary
The raw data were collected by web scraping subreddits and cleaned using multiple NLP techniques. The data are in English only. The dataset mainly targets mental health classification.
Supported Tasks and Leaderboards
[More Information Needed]
Languages
[More Information Needed]
Dataset Structure
Data Instances
[More Information Needed]
Data… See the full description on the dataset page: https://huggingface.co/datasets/hugginglearners/reddit-depression-cleaned.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Card for one-million-reddit-jokes
Dataset Summary
This corpus contains a million posts from /r/jokes. Posts are annotated with their score.
Languages
Mainly English.
Dataset Structure
Data Instances
A data point is a Reddit post.
Data Fields
'type': the type of the data point; can be 'post' or 'comment'.
'id': the base-36 Reddit ID of the data point; unique when combined with type.
… See the full description on the dataset page: https://huggingface.co/datasets/SocialGrep/one-million-reddit-jokes.
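Base-36 Reddit IDs (digits 0-9 and letters a-z) decode to integers with Python's built-in int; a minimal sketch (the function name and sample IDs are illustrative):

```python
def reddit_id_to_int(base36_id: str) -> int:
    """Decode a base-36 Reddit ID to its integer value."""
    return int(base36_id, 36)
```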