42 datasets found

Reddit AskScience Flair Analysis Dataset
kaggle.com
Updated Feb 15, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sumit Mishra (2025). Reddit AskScience Flair Analysis Dataset [Dataset]. https://www.kaggle.com/datasets/sumitm004/reddit-raskscience-flair-dataset
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Feb 15, 2025
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Sumit Mishra
License
Open Data Commons Attribution License (ODC-By) v1.0https://www.opendatacommons.org/licenses/by/1.0/
License information was derived automatically
Description
Context

Reddit is a massive platform for news, content, and discussions, hosting millions of active users daily. Among its vast number of subreddits, we focus on the r/AskScience community, where users engage in science-related discussions and questions.

Content

This dataset is derived from the r/AskScience subreddit, collected between January 1, 2016, and May 20, 2022. It includes 612,668 datapoints across 22 columns, featuring diverse information such as the content of the questions, submission descriptions, associated flairs, NSFW/SFW status, year of submission, and more. The data was extracted using Python and Pushshift's API, followed by some cleaning with NumPy and pandas. Detailed column descriptions are available for clarity.

Mendeley Data

Ideas for Usage

Flair Prediction:Train models to predict post flairs (e.g., 'Science', 'Ask', 'Discussion') to automate content categorization for platforms like Reddit.

NSFW Classification: Classify posts as SFW or NSFW based on textual content, enabling content moderation tools for online forums.

Text Mining / NLP Tasks: Apply NLP techniques like Sentiment Analysis, Topic Modeling, and Text Classification to explore the content and themes of science-related discussions.

Community Engagement Analysis: Investigate which post types or flairs generate more engagement (e.g., upvotes or comments), offering insights into user interaction.

Trend Detection in Science Topics: Identify emerging science topics and analyze shifts in interest areas, which can help predict future trends in scientific discussions.
P
REDDIT-BINARY Dataset
paperswithcode.com
opendatalab.com
Updated Nov 16, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Pinar Yanardag; S. V. N. Vishwanathan (2021). REDDIT-BINARY Dataset [Dataset]. https://paperswithcode.com/dataset/reddit-binary
Explore at:
Dataset updated
Nov 16, 2021
Authors
Pinar Yanardag; S. V. N. Vishwanathan
Description
REDDIT-BINARY consists of graphs corresponding to online discussions on Reddit. In each graph, nodes represent users, and there is an edge between them if at least one of them respond to the other’s comment. There are four popular subreddits, namely, IAmA, AskReddit, TrollXChromosomes, and atheism. IAmA and AskReddit are two question/answer based subreddits, and TrollXChromosomes and atheism are two discussion-based subreddits. A graph is labeled according to whether it belongs to a question/answer-based community or a discussion-based community.
m
Reddit r/AskScience Flair Dataset
data.mendeley.com
Updated May 23, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sumit Mishra (2022). Reddit r/AskScience Flair Dataset [Dataset]. http://doi.org/10.17632/k9r2d9z999.3
Explore at:
Unique identifier
https://doi.org/10.17632/k9r2d9z999.3
Dataset updated
May 23, 2022
Authors
Sumit Mishra
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Reddit is a social news, content rating and discussion website. It's one of the most popular sites on the internet. Reddit has 52 million daily active users and approximately 430 million users who use it once a month. Reddit has different subreddits and here We'll use the r/AskScience Subreddit.

The dataset is extracted from the subreddit /r/AskScience from Reddit. The data was collected between 01-01-2016 and 20-05-2022. It contains 612,668 Datapoints and 25 Columns. The database contains a number of information about the questions asked on the subreddit, the description of the submission, the flair of the question, NSFW or SFW status, the year of the submission, and more. The data is extracted using python and Pushshift's API. A little bit of cleaning is done using NumPy and pandas as well. (see the descriptions of individual columns below).

The dataset contains the following columns and descriptions: author - Redditor Name author_fullname - Redditor Full name contest_mode - Contest mode [implement obscured scores and randomized sorting]. created_utc - Time the submission was created, represented in Unix Time. domain - Domain of submission. edited - If the post is edited or not. full_link - Link of the post on the subreddit. id - ID of the submission. is_self - Whether or not the submission is a self post (text-only). link_flair_css_class - CSS Class used to identify the flair. link_flair_text - Flair on the post or The link flair’s text content. locked - Whether or not the submission has been locked. num_comments - The number of comments on the submission. over_18 - Whether or not the submission has been marked as NSFW. permalink - A permalink for the submission. retrieved_on - time ingested. score - The number of upvotes for the submission. description - Description of the Submission. spoiler - Whether or not the submission has been marked as a spoiler. stickied - Whether or not the submission is stickied. thumbnail - Thumbnail of Submission. question - Question Asked in the Submission. url - The URL the submission links to, or the permalink if a self post. year - Year of the Submission. banned - Banned by the moderator or not.

This dataset can be used for Flair Prediction, NSFW Classification, and different Text Mining/NLP tasks. Exploratory Data Analysis can also be done to get the insights and see the trend and patterns over the years.
d
Dataplex: Reddit Data | Consumer Behavior Data | 2.1M+ subreddits: trends,...
datarade.ai
.json, .csv
Updated Aug 14, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Dataplex (2024). Dataplex: Reddit Data | Consumer Behavior Data | 2.1M+ subreddits: trends, audience insights + more | Ideal for Interest-Based Segmentation [Dataset]. https://datarade.ai/data-products/dataplex-reddit-data-consumer-behavior-data-2-1m-subred-dataplex
Explore at:
.json, .csvAvailable download formats
Dataset updated
Aug 14, 2024
Dataset authored and provided by
Dataplex
Area covered
Saint Barthélemy, Tunisia, Togo, Cuba, Cocos (Keeling) Islands, Burkina Faso, Netherlands, Lithuania, Belize, Croatia
Description
The Reddit Subreddit Dataset by Dataplex offers a comprehensive and detailed view of Reddit’s vast ecosystem, now enhanced with appended AI-generated columns that provide additional insights and categorization. This dataset includes data from over 2.1 million subreddits, making it an invaluable resource for a wide range of analytical applications, from social media analysis to market research.

Dataset Overview:

This dataset includes detailed information on subreddit activities, user interactions, post frequency, comment data, and more. The inclusion of AI-generated columns adds an extra layer of analysis, offering sentiment analysis, topic categorization, and predictive insights that help users better understand the dynamics of each subreddit.

2.1 Million Subreddits with Enhanced AI Insights: The dataset covers over 2.1 million subreddits and now includes AI-enhanced columns that provide: - Sentiment Analysis: AI-driven sentiment scores for posts and comments, allowing users to gauge community mood and reactions. - Topic Categorization: Automated categorization of subreddit content into relevant topics, making it easier to filter and analyze specific types of discussions. - Predictive Insights: AI models that predict trends, content virality, and user engagement, helping users anticipate future developments within subreddits.

Sourced Directly from Reddit:

All data in this dataset is sourced directly from Reddit, ensuring accuracy and authenticity. The dataset is updated regularly, reflecting the latest trends and user interactions on the platform. This ensures that users have access to the most current and relevant data for their analyses.

Key Features:

Subreddit Metrics: Detailed data on subreddit activity, including the number of posts, comments, votes, and user participation.

User Engagement: Insights into how users interact with content, including comment threads, upvotes/downvotes, and participation rates.

Trending Topics: Track emerging trends and viral content across the platform, helping you stay ahead of the curve in understanding social media dynamics.

AI-Enhanced Analysis: Utilize AI-generated columns for sentiment analysis, topic categorization, and predictive insights, providing a deeper understanding of the data.

Use Cases:

Social Media Analysis: Researchers and analysts can use this dataset to study online behavior, track the spread of information, and understand how content resonates with different audiences.

Market Research: Marketers can leverage the dataset to identify target audiences, understand consumer preferences, and tailor campaigns to specific communities.

Content Strategy: Content creators and strategists can use insights from the dataset to craft content that aligns with trending topics and user interests, maximizing engagement.

Academic Research: Academics can explore the dynamics of online communities, studying everything from the spread of misinformation to the formation of online subcultures.

Data Quality and Reliability:

The Reddit Subreddit Dataset emphasizes data quality and reliability. Each record is carefully compiled from Reddit’s vast database, ensuring that the information is both accurate and up-to-date. The AI-generated columns further enhance the dataset's value, providing automated insights that help users quickly identify key trends and sentiments.

Integration and Usability:

The dataset is provided in a format that is compatible with most data analysis tools and platforms, making it easy to integrate into existing workflows. Users can quickly import, analyze, and utilize the data for various applications, from market research to academic studies.

User-Friendly Structure and Metadata:

The data is organized for easy navigation and analysis, with metadata files included to help users identify relevant subreddits and data points. The AI-enhanced columns are clearly labeled and structured, allowing users to efficiently incorporate these insights into their analyses.

Ideal For:

Data Analysts: Conduct in-depth analyses of subreddit trends, user engagement, and content virality. The dataset’s extensive coverage and AI-enhanced insights make it an invaluable tool for data-driven research.

Marketers: Use the dataset to better understand your target audience, tailor campaigns to specific interests, and track the effectiveness of marketing efforts across Reddit.

Researchers: Explore consumer behavior data of online communities, analyze the spread of ideas and information, and study the impact of digital media on public discourse, all while leveraging AI-generated insights.

This dataset is an essential resource for anyone looking to understand the intricacies of Reddit's vast ecosystem, offering the data and AI-enhanced insights needed to drive informed decisions and strategies across various fields. Whether you’re tracking emerging trends, analyzing user behavior, or conducting acade...
Reddit Submissions
kaggle.com
Updated Oct 30, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ahmad (2023). Reddit Submissions [Dataset]. https://www.kaggle.com/datasets/pypiahmad/reddit-submissions/data
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Oct 30, 2023
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Ahmad
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
The Reddit Submissions dataset encompasses submissions of Reddit posts, particularly focusing on resubmissions of the same content, along with pertinent metadata. This dataset covers a timespan from July 2008 to January 2013 and provides an insightful view into the dynamics of content sharing and engagement within the Reddit community.

Basic Statistics: - Number of Submissions (images): 132,308 - Number of Unique Images: 16,736 - Timespan: July 2008 - January 2013

Metadata: - Timestamps: The time when a post was submitted. - Upvotes/Downvotes: The number of upvotes and downvotes a post received. - Post Title: The title of the submitted post. - Subreddit: The subreddit to which the post was submitted. - Additional metadata such as total votes, Reddit ID, number of comments, and username of the submitter.

Examples: ```plaintext

image_id, unixtime, rawtime, title, total_votes, reddit_id,...

number_of_downvotes, localtime, score, number_of_comments, username 1005, 1335861624, 2012-05-01T15:40:24.968266-07:00, I immediately regret this decision, 27, t296r, 20, pics, 7, 1335886824, 13, 0, ninjaroflmaster 1005, 1336470481, 2012-05-08T16:48:01.418140-07:00, "Pushing your friend into the water, Level: 99", 18, tds4i, 16, funny, 2, 1336495681, 14, 0, hme4 1005, 1339566752, 2012-06-13T12:52:32.371941-07:00, I told him. He Didn't Listen, 6, v0cma, 4, funny, 2, 1339591952, 2, 0, HeyPatWhatsUp 1005, 1342200476, 2012-07-14T00:27:56.857805-07:00, Don't end up as this guy., 16, wjivx, 7, funny, 9, 1342225676, -2, 2, catalyst24 ```

Download Links: - Resubmissions Data (7.3MB) - Raw HTML of Resubmissions (1.8GB)

Citation: - Understanding the interplay between titles, content, and communities in social media, Himabindu Lakkaraju, Julian McAuley, Jure Leskovec, ICWSM, 2013. pdf

Use Cases: 1. Content Resubmission Analysis: Analyzing the pattern and impact of content resubmissions across different subreddits. 2. Community Engagement: Studying how different titles, content, and subreddits influence user engagement in terms of upvotes, downvotes, and comments. 3. Temporal Analysis: Investigating how the popularity of certain content changes over time and how resubmissions are accepted by the community at different time intervals. 4. Subreddit Analysis: Understanding the characteristics of different subreddits in terms of content sharing and resubmissions. 5. User Behavior Analysis: Examining user behavior in terms of content submission, resubmission, and interaction. 6. Social Media Marketing: For marketers, understanding the dynamics of content resubmission could help in optimizing the content sharing strategy on Reddit. 7. Machine Learning: Utilizing the dataset to build models that can predict the success of a post or resubmission based on various factors. 8. NLP Applications: Analyzing text data for sentiment analysis, topic modeling, and other Natural Language Processing (NLP) applications. 9. Spam Detection: Identifying spam or redundant content through the analysis of resubmissions and user behaviors.

This dataset is valuable for researchers, social media analysts, marketers, and data scientists interested in studying social media dynamics, especially on a platform like Reddit where content resubmission is common.
Z
Comprehensive dataset of over 4000 subreddits across 13 categories
data.niaid.nih.gov
Updated Oct 1, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Vinod, Dayanand (2024). Comprehensive dataset of over 4000 subreddits across 13 categories [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_13343577
Explore at:
Dataset updated
Oct 1, 2024
Dataset provided by
S, Arjhun
Vinod, Dayanand
Deepak, Pranav
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset encompasses a rich collection of 4000 subreddits organized into 13 distinct categories, providing a valuable resource for researchers and data scientists in the fields of social media analysis, natural language processing, and community dynamics. The subreddits and the respective categories were obtained here.

Each subreddit contains an average of over 400 posts and 11 million unique users.

The dataset is formatted in JSON.

The data is structured in the following manner.

id: the post's unique identifier

post_user: the post's author (anonymized)

post_time: the time at which the post was created, in unix time

post_body: the post's body

comments: a list of comments on the post, where each comment is a dictionary with the following keys:

id: the comment's unique identifier

user: the comment's author (anonymized)

time: the time at which the comment was created, in unix time

body: the comment's body

replies: a list of replies to the comment, where each reply is a dictionary with the same information as a comment.

The comments and replies are threaded within.
Z
Dataset — Make Reddit Great Again: Assessing Community Effects of Moderation...
data.niaid.nih.gov
Updated Jan 10, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Cresci, Stefano (2023). Dataset — Make Reddit Great Again: Assessing Community Effects of Moderation Interventions on r/The_Donald [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6250576
Explore at:
Dataset updated
Jan 10, 2023
Dataset provided by
Cresci, Stefano
Trujillo, Amaury
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Reddit contents and complementary data regarding the r/The_Donald community and its main moderation interventions, used for the corresponding article indicated in the title.

An accompanying R notebook can be found in: https://github.com/amauryt/make_reddit_great_again

If you use this dataset please cite the related article.

The dataset timeframe of the Reddit contents (submissions and comments) spans from 30 weeks before Quarantine (2018-11-28) to 30 weeks after Restriction (2020-09-23). The original Reddit content was collected from the Pushshift monthly data files, transformed, and loaded into two SQLite databases.

The first database, the_donald.sqlite, contains all the available content from r/The_Donald created during the dataset timeframe, with the last content being posted several weeks before the timeframe upper limit. It only has two tables: submissions and comments. It should be noted that the IDs of contents are on base 10 (numeric integer), unlike the original base 36 (alphanumeric) used on Reddit and Pushshift. This is for efficient storage and processing. If necessary, many programming languages or libraries can easily convert IDs from one base to another.

The second database, core_the_donald.sqlite, contains all the available content from core users of r/The_Donald made platform-wise (i.e., within and without the subreddit) during the dataset timeframe. Core users are defined as those who authored either a submission or a comment a week in r/The_Donald during the 30 weeks prior to the subreddit's Quarantine. The database has four tables: submissions, comments, subreddits, and perspective_scores. The subreddits table contains the names of the subreddits to which submissions and comments were made (their IDs are also on base 10). The perspective_scores table contains comment toxicity scores.

The Perspective API was used to score comments based on the attributes toxicity and severe_toxicity. It should be noted that not all of the comments in core_the_donald have a score because the comment body was blank or because the Perspective API returned a request error (after three tries). However, the percentage of missing scores is minuscule.

A third file, mbfc_scores.csv, contains the bias and factual reporting accuracy collected in October 2021 from Media Bias / Fact Check (MBFC). Both attributes are scored on a Likert-like manner. One can associate submissions to MBFC scores by doing a join by the domain column.
m
Reddit Ideological and Extreme Bias Dataset - Part 1
data.mendeley.com
Updated Feb 28, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Kamalakkannan Ravi (2024). Reddit Ideological and Extreme Bias Dataset - Part 1 [Dataset]. http://doi.org/10.17632/2tdr9sjd83.3
Explore at:
Unique identifier
https://doi.org/10.17632/2tdr9sjd83.3
Dataset updated
Feb 28, 2024
Authors
Kamalakkannan Ravi
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Data 1: Dataset with articles posted in the r/Liberal and r/Conservative subreddits. In total, we collected a corpus of 226,010 articles. We have collected news articles to understand political expression through the shared news articles. Data 2: Dataset with articles posted in the Liberal, Conservative, and Restricted (private or banned) subreddits. In total, we collected a corpus of 1.3 million articles. We have collected news articles to understand radicalized communities through the shared news articles.

Part 1 has Data 1 (all) and Data 2 (Raw and Labeled Data - Restricted.json) Part 2 has Data 2 (Raw and Labeled Data - Liberal.json, and Conservative.json) and Data 2 (Raw and Unlabeled Data - first 40 of the 76 .json files) Part 3 has Data 2 (Raw and Unlabeled Data - reamaining 36 of the 76 .json files)

WikiReddit: Tracing Information and Attention Flows Between Online Platforms...

zenodo.org

bin

Updated May 4, 2025

Facebook

Twitter

Click to copy link

Link copied

Cite

Patrick Gildersleve; Patrick Gildersleve; Anna Beers; Anna Beers; Viviane Ito; Viviane Ito; Agustin Orozco; Agustin Orozco; Francesca Tripodi; Francesca Tripodi (2025). WikiReddit: Tracing Information and Attention Flows Between Online Platforms [Dataset]. http://doi.org/10.5281/zenodo.14653265

Explore at:

binAvailable download formats

Unique identifier

https://doi.org/10.5281/zenodo.14653265

Dataset updated

May 4, 2025

Dataset provided by

Zenodohttp://zenodo.org/

Authors

Patrick Gildersleve; Patrick Gildersleve; Anna Beers; Anna Beers; Viviane Ito; Viviane Ito; Agustin Orozco; Agustin Orozco; Francesca Tripodi; Francesca Tripodi

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Time period covered

Jan 15, 2025

Description

Preprint

Gildersleve, P., Beers, A., Ito, V., Orozco, A., & Tripodi, F. (2025). WikiReddit: Tracing Information and Attention Flows Between Online Platforms. arXiv [Cs.CY]. https://doi.org/10.48550/arXiv.2502.04942

Accepted at the International AAAI Conference on Web and Social Media (ICWSM) 2025

Abstract

The World Wide Web is a complex interconnected digital ecosystem, where information and attention flow between platforms and communities throughout the globe. These interactions co-construct how we understand the world, reflecting and shaping public discourse. Unfortunately, researchers often struggle to understand how information circulates and evolves across the web because platform-specific data is often siloed and restricted by linguistic barriers. To address this gap, we present a comprehensive, multilingual dataset capturing all Wikipedia links shared in posts and comments on Reddit from 2020 to 2023, excluding those from private and NSFW subreddits. Each linked Wikipedia article is enriched with revision history, page view data, article ID, redirects, and Wikidata identifiers. Through a research agreement with Reddit, our dataset ensures user privacy while providing a query and ID mechanism that integrates with the Reddit and Wikipedia APIs. This enables extended analyses for researchers studying how information flows across platforms. For example, Reddit discussions use Wikipedia for deliberation and fact-checking which subsequently influences Wikipedia content, by driving traffic to articles or inspiring edits. By analyzing the relationship between information shared and discussed on these platforms, our dataset provides a foundation for examining the interplay between social media discourse and collaborative knowledge consumption and production.

Datasheet

Motivation

The motivations for this dataset stem from the challenges researchers face in studying the flow of information across the web. While the World Wide Web enables global communication and collaboration, data silos, linguistic barriers, and platform-specific restrictions hinder our ability to understand how information circulates, evolves, and impacts public discourse. Wikipedia and Reddit, as major hubs of knowledge sharing and discussion, offer an invaluable lens into these processes. However, without comprehensive data capturing their interactions, researchers are unable to fully examine how platforms co-construct knowledge. This dataset bridges this gap, providing the tools needed to study the interconnectedness of social media and collaborative knowledge systems.

Composition

WikiReddit, a comprehensive dataset capturing all Wikipedia mentions (including links) shared in posts and comments on Reddit from 2020 to 2023, excluding those from private and NSFW (not safe for work) subreddits. The SQL database comprises 336K total posts, 10.2M comments, 1.95M unique links, and 1.26M unique articles spanning 59 languages on Reddit and 276 Wikipedia language subdomains. Each linked Wikipedia article is enriched with its revision history and page view data within a ±10-day window of its posting, as well as article ID, redirects, and Wikidata identifiers. Supplementary anonymous metadata from Reddit posts and comments further contextualizes the links, offering a robust resource for analysing cross-platform information flows, collective attention dynamics, and the role of Wikipedia in online discourse.

Collection Process

Data was collected from the Reddit4Researchers and Wikipedia APIs. No personally identifiable information is published in the dataset. Data from Reddit to Wikipedia is linked via the hyperlink and article titles appearing in Reddit posts.

Preprocessing/cleaning/labeling

Extensive processing with tools such as regex was applied to the Reddit post/comment text to extract the Wikipedia URLs. Redirects for Wikipedia URLs and article titles were found through the API and mapped to the collected data. Reddit IDs are hashed with SHA-256 for post/comment/user/subreddit anonymity.

Uses

We foresee several applications of this dataset and preview four here. First, Reddit linking data can be used to understand how attention is driven from one platform to another. Second, Reddit linking data can shed light on how Wikipedia's archive of knowledge is used in the larger social web. Third, our dataset could provide insights into how external attention is topically distributed across Wikipedia. Our dataset can help extend that analysis into the disparities in what types of external communities Wikipedia is used in, and how it is used. Fourth, relatedly, a topic analysis of our dataset could reveal how Wikipedia usage on Reddit contributes to societal benefits and harms. Our dataset could help examine if homogeneity within the Reddit and Wikipedia audiences shapes topic patterns and assess whether these relationships mitigate or amplify problematic engagement online.

Distribution

The dataset is publicly shared with a Creative Commons Attribution 4.0 International license. The article describing this dataset should be cited: https://doi.org/10.48550/arXiv.2502.04942

Maintenance

Patrick Gildersleve will maintain this dataset, and add further years of content as and when available.

SQL Database Schema

Table: `posts`

Column Name	Type	Description
`subreddit_id`	TEXT	The unique identifier for the subreddit.
`crosspost_parent_id`	TEXT	The ID of the original Reddit post if this post is a crosspost.
`post_id`	TEXT	Unique identifier for the Reddit post.
`created_at`	TIMESTAMP	The timestamp when the post was created.
`updated_at`	TIMESTAMP	The timestamp when the post was last updated.
`language_code`	TEXT	The language code of the post.
`score`	INTEGER	The score (upvotes minus downvotes) of the post.
`upvote_ratio`	REAL	The ratio of upvotes to total votes.
`gildings`	INTEGER	Number of awards (gildings) received by the post.
`num_comments`	INTEGER	Number of comments on the post.

Table: `comments`

Column Name	Type	Description
`subreddit_id`	TEXT	The unique identifier for the subreddit.
`post_id`	TEXT	The ID of the Reddit post the comment belongs to.
`parent_id`	TEXT	The ID of the parent comment (if a reply).
`comment_id`	TEXT	Unique identifier for the comment.
`created_at`	TIMESTAMP	The timestamp when the comment was created.
`last_modified_at`	TIMESTAMP	The timestamp when the comment was last modified.
`score`	INTEGER	The score (upvotes minus downvotes) of the comment.
`upvote_ratio`	REAL	The ratio of upvotes to total votes for the comment.
`gilded`	INTEGER	Number of awards (gildings) received by the comment.

Table: `postlinks`

Column Name	Type	Description
`post_id`	TEXT	Unique identifier for the Reddit post.
`end_processed_valid`	INTEGER	Whether the extracted URL from the post resolves to a valid URL.
`end_processed_url`	TEXT	The extracted URL from the Reddit post.
`final_valid`	INTEGER	Whether the final URL from the post resolves to a valid URL after redirections.
`final_status`	INTEGER	HTTP status code of the final URL.
`final_url`	TEXT	The final URL after redirections.
`redirected`	INTEGER	Indicator of whether the posted URL was redirected (1) or not (0).
`in_title`	INTEGER	Indicator of whether the link appears in the post title (1) or post body (0).

Table: `commentlinks`

Column Name	Type	Description
`comment_id`	TEXT	Unique identifier for the Reddit comment.
`end_processed_valid`	INTEGER	Whether the extracted URL from the comment resolves to a valid URL.
`end_processed_url`	TEXT	The extracted URL from the comment.
`final_valid`	INTEGER	Whether the final URL from the comment resolves to a valid URL after redirections.
`final_status`	INTEGER	HTTP status code of the final

Reddit Italy Coffee Dataset
kaggle.com
Updated Mar 16, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Luigi Cerone (2023). Reddit Italy Coffee Dataset [Dataset]. https://www.kaggle.com/datasets/gigggi/reddit-italy-coffee-dataset/code
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Mar 16, 2023
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Luigi Cerone
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Area covered
Italy
Description
r/italy is a subreddit focused on discussions related to Italy, including news, culture, politics, and society. Users can post and comment on various topics related to Italy, including travel, language, cuisine, and more. Among the many threads that populate the subreddit, one of the most popular is the daily thread named "Caffè Italia." As the name suggests, this thread is a virtual coffeehouse where users can gather and exchange ideas on a variety of topics.

Every day, a new "Caffè Italia" thread is created, and users are encouraged to participate by sharing their opinions, asking for advice, or simply chatting with others. The topics discussed in this thread can be very diverse, ranging from Italian cuisine and travel to politics, news, and social issues.

The "Caffè Italia" thread provides an informal and friendly space where users can express themselves freely and connect with others who share their interests or concerns. It's a place where they can ask for recommendations on the best places to visit in Italy, share their thoughts on the latest news or events, or discuss cultural topics, such as literature, art, or music.

What makes the "Caffè Italia" thread so unique is its sense of community. Users feel welcome and valued, and they often return to the thread to catch up with the latest discussions or to contribute to ongoing conversations. Many users have formed friendships and connections through the thread, which has become a hub for the r/italy community.

In summary, the "Caffè Italia" thread is a daily gathering place for r/italy users to engage in conversations, share their experiences, and connect with others. Whether you're a first-time visitor to the subreddit or a seasoned member of the community, you're sure to find something interesting and engaging in the "Caffè Italia" thread.

This dataset contains several months of data scraped from it. The code used to generate it is available in my Github profile.
Z
Dataset for: The Evolution of the Manosphere Across the Web
data.niaid.nih.gov
Updated Aug 30, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Gianluca Stringhini (2020). Dataset for: The Evolution of the Manosphere Across the Web [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4007912
Explore at:
Dataset updated
Aug 30, 2020
Dataset provided by
Emiliano De Cristofaro
Gianluca Stringhini
Summer Long
Savvas Zannettou
Stephanie Greenberg
Jeremy Blackburn
Barry Bradlyn
Manoel Horta Ribeiro
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The Evolution of the Manosphere Across the Web

We make available data related to subreddit and standalone forums from the manosphere.

We also make available Perspective API annotations for all posts.

You can find the code in GitHub.

Please cite this paper if you use this data:

@article{ribeiroevolution2021, title={The Evolution of the Manosphere Across the Web}, author={Ribeiro, Manoel Horta and Blackburn, Jeremy and Bradlyn, Barry and De Cristofaro, Emiliano and Stringhini, Gianluca and Long, Summer and Greenberg, Stephanie and Zannettou, Savvas}, booktitle = {{Proceedings of the 15th International AAAI Conference on Weblogs and Social Media (ICWSM'21)}}, year={2021} }

Reddit data

We make available data for forums and for relevant subreddits (56 of them, as described in subreddit_descriptions.csv). These are available, 1 line per post in each subreddit Reddit in /ndjson/reddit.ndjson. A sample for example is:

{ "author": "Handheld_Gaming", "date_post": 1546300852, "id_post": "abcusl", "number_post": 9.0, "subreddit": "Braincels", "text_post": "Its been 2019 for almost 1 hour And I am at a party with 120 people, half of them being foids. The last year had been the best in my life. I actually was happy living hope because I was redpilled to the death.

Now that I am blackpilled I see that I am the shortest of all men and that I am the only one with a recessed jaw.

Its over. Its only thanks to my age old friendship with chads and my social skills I had developed in the past year that a lot of men like me a lot as a friend.

No leg lengthening syrgery is gonna save me. Ignorance was a bliss. Its just horror now seeing that everyone can make out wirth some slin hoe at the party.

I actually feel so unbelivably bad for turbomanlets. Life as an unattractive manlet is a pain, I cant imagine the hell being an ugly turbomanlet is like. I would have roped instsntly if I were one. Its so unfair.

Tallcels are fakecels and they all can (and should) suck my cock.

If I were 17cm taller my life would be a heaven and I would be the happiest man alive.

Just cope and wait for affordable body tranpslants.", "thread": "t3_abcusl" }

Forums

We here describe the .sqlite and .ndjson files that contain the data from the following forums.

(avfm) --- https://d2ec906f9aea-003845.vbulletin.net (incels) --- https://incels.co/ (love_shy) --- http://love-shy.com/lsbb/ (redpilltalk) --- https://redpilltalk.com/ (mgtow) --- https://www.mgtow.com/forums/ (rooshv) --- https://www.rooshvforum.com/ (pua_forum) --- https://www.pick-up-artist-forum.com/ (the_attraction) --- http://www.theattractionforums.com/

The files are in folders /sqlite/ and /ndjson.

2.1 .sqlite

All the tables in the sqlite. datasets follow a very simple {key:value} format. Each key is a thread name (for example /threads/housewife-is-like-a-job.123835/) and each value is a python dictionary or a list. This file contains three tables:

idx each key is the relative address to a thread and maps to a post. Each post is represented by a dict:

"type": (list) in some forums you can add a descriptor such as [RageFuel] to each topic, and you may also have special types of posts, like sticked/pool/locked posts.
"title": (str) title of the thread; "link": (str) link to the thread; "author_topic": (str) username that created the thread; "replies": (int) number of replies, may differ from number of posts due to difference in crawling date; "views": (int) number of views; "subforum": (str) name of the subforum; "collected": (bool) indicates if raw posts have been collected; "crawled_idx_at": (str) datetime of the collection.

processed_posts each key is the relative address to a thread and maps to a list with posts (in order). Each post is represented by a dict:

"author": (str) author's username; "resume_author": (str) author's little description; "joined_author": (str) date author joined; "messages_author": (int) number of messages the author has; "text_post": (str) text of the main post; "number_post": (int) number of the post in the thread; "id_post": (str) unique post identifier (depends), for sure unique within thread; "id_post_interaction": (list) list with other posts ids this post quoted; "date_post": (str) datetime of the post, "links": (tuple) nice tuple with the url parsed, e.g. ('https', 'www.youtube.com', '/S5t6K9iwcdw'); "thread": (str) same as key; "crawled_at": (str) datetime of the collection.

raw_posts each key is the relative address to a thread and maps to a list with unprocessed posts (in order). Each post is represented by a dict:

"post_raw": (binary) raw html binary; "crawled_at": (str) datetime of the collection.

2.2 .ndjson

Each line consists of a json object representing a different comment with the following fields:

"author": (str) author's username; "resume_author": (str) author's little description; "joined_author": (str) date author joined; "messages_author": (int) number of messages the author has; "text_post": (str) text of the main post; "number_post": (int) number of the post in the thread; "id_post": (str) unique post identifier (depends), for sure unique within thread; "id_post_interaction": (list) list with other posts ids this post quoted; "date_post": (str) datetime of the post, "links": (tuple) nice tuple with the url parsed, e.g. ('https', 'www.youtube.com', '/S5t6K9iwcdw'); "thread": (str) same as key; "crawled_at": (str) datetime of the collection.

Perspective

We also run each post and reddit post through perspective, the files are located in the /perspective/ folder. They are compressed with gzip. One example output

{ "id_post": 5200, "hate_output": { "text": "I still can\u2019t wrap my mind around both of those articles about these c~~~s sleeping with poor Haitian Men. Where\u2019s the uproar?, where the hell is the outcry?, the \u201cpig\u201d comments or the \u201ccreeper comments\u201d. F~~~ing hell, if roles were reversed and it was an article about Men going to Europe where under 18 sex in legal, you better believe they would crucify the writer of that article and DEMAND an apology by the paper that wrote it.. This is exactly what I try and explain to people about the double standards within our modern society. A bunch of older women, wanna get their kicks off by sleeping with poor Men, just before they either hit or are at menopause age. F~~~ing unreal, I\u2019ll never forget going to Sweden and Norway a few years ago with one of my buddies and his girlfriend who was from there, the legal age of consent in Norway is 16 and in Sweden it\u2019s 15. I couldn\u2019t believe it, but my friend told me \u201c hey, it\u2019s normal here\u201d . Not only that but the age wasn\u2019t a big different in other European countries as well. One thing i learned very quickly was how very Misandric Sweden as well as Denmark were.", "TOXICITY": 0.6079781, "SEVERE_TOXICITY": 0.53744453, "INFLAMMATORY": 0.7279288, "PROFANITY": 0.58842486, "INSULT": 0.5511079, "OBSCENE": 0.9830818, "SPAM": 0.17009115 } }

Working with sqlite

A nice way to read some of the files of the dataset is using SqliteDict, for example:

from sqlitedict import SqliteDict processed_posts = SqliteDict("./data/forums/incels.sqlite", tablename="processed_posts")

for key, posts in processed_posts.items(): for post in posts: # here you could do something with each post in the dataset pass

Helpers

Additionally, we provide two .sqlite files that are helpers used in the analyses. These are related to reddit, and not to the forums! They are:

channel_dict.sqlite a sqlite where each key corresponds to a subreddit and values are lists of dictionaries users who posted on it, along with timestamps.

author_dict.sqlite a sqlite where each key corresponds to an author and values are lists of dictionaries of the subreddits they posted on, along with timestamps.

These are used in the paper for the migration analyses.

Examples and particularities for forums

Although we did our best to clean the data and be consistent across forums, this is not always possible. In the following subsections we talk about the particularities of each forum, directions to improve the parsing which were not pursued as well as give some examples on how things work in each forum.

6.1 incels

Check out an archived version of the front page, the thread page and a post page, as well as a dump of the data stored for a thread page and a post page.

types: for the incel forums the special types associated with each thread in the idx table are “Sticky”, “Pool”, “Closed”, and the custom types added by users, such as [LifeFuel]. These last ones are all in brackets. You can see some examples of these in the on the example thread page.

quotes: quotes in this forum were quite nice and thus, all quotations are deterministic.

6.2 LoveShy

Check out an archived version of the front page, the thread page and a post page, as well as a dump of the data stored for a thread page and a post page.

types: no types were parsed. There are some rules in the forum, but not significant.

quotes: quotes were obtained from exact text+author match, or author match + a jaccard
Reddit users in the United States 2019-2028
statista.com
ai-chatbox.pro
Updated Jun 13, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Statista Research Department (2024). Reddit users in the United States 2019-2028 [Dataset]. https://www.statista.com/topics/3196/social-media-usage-in-the-united-states/
Explore at:
Dataset updated
Jun 13, 2024
Dataset provided by
Statistahttp://statista.com/
Authors
Statista Research Department
Area covered
United States
Description
The number of Reddit users in the United States was forecast to continuously increase between 2024 and 2028 by in total 10.3 million users (+5.21 percent). After the ninth consecutive increasing year, the Reddit user base is estimated to reach 208.12 million users and therefore a new peak in 2028. Notably, the number of Reddit users of was continuously increasing over the past years.User figures, shown here with regards to the platform reddit, have been estimated by taking into account company filings or press material, secondary research, app downloads and traffic data. They refer to the average monthly active users over the period and count multiple accounts by persons only once. Reddit users encompass both users that are logged in and those that are not.The shown data are an excerpt of Statista's Key Market Indicators (KMI). The KMI are a collection of primary and secondary indicators on the macro-economic, demographic and technological environment in up to 150 countries and regions worldwide. All indicators are sourced from international and national statistical offices, trade associations and the trade press and they are processed to generate comparable data sets (see supplementary notes under details for more information).Find more key insights for the number of Reddit users in countries like Mexico and Canada.
Data from: The Reddit Politosphere: A Large-Scale Text and Network Resource...
zenodo.org
explore.openaire.eu
+1more
bz2, csv, json
Updated Jan 16, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Valentin Hofmann; Valentin Hofmann; Hinrich Schütze; Hinrich Schütze; Janet B. Pierrehumbert; Janet B. Pierrehumbert (2022). The Reddit Politosphere: A Large-Scale Text and Network Resource of Online Political Discourse [Dataset]. http://doi.org/10.5281/zenodo.5851729
Explore at:
bz2, csv, jsonAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.5851729
Dataset updated
Jan 16, 2022
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Valentin Hofmann; Valentin Hofmann; Hinrich Schütze; Hinrich Schütze; Janet B. Pierrehumbert; Janet B. Pierrehumbert
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The Reddit Politosphere is a large-scale resource of online political discourse covering more than 600 political discussion groups over a period of 12 years. Based on the Pushshift Reddit Dataset, it is to the best of our knowledge the largest and ideologically most comprehensive dataset of its type now available. One key feature of the Reddit Politosphere is that it consists of both text and network data. We also release annotated metadata for subreddits and users.

Documentation and scripts for easy data access are provided in an associated repository on GitHub.
f
reddit user posting behavior (mid-2013)
figshare.com
application/gzip
Updated May 31, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Randy Olson (2023). reddit user posting behavior (mid-2013) [Dataset]. http://doi.org/10.6084/m9.figshare.874101.v2
Explore at:
application/gzipAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.874101.v2
Dataset updated
May 31, 2023
Dataset provided by
figshare
Authors
Randy Olson
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This file contains the posting preferences for over 850,000 active reddit users. This sample was taken in mid-2013. This data was used to generate the interactive visualization, "redditviz," and will be analyzed in detail in an upcoming research article. Please cite our paper "Navigating the massive world of reddit" if you use this data in your work. URL: http://arxiv.org/abs/1312.3387 The file is organized as follows: Each line is an entry for an anonymous user. Each user was randomly assigned a unique ID, which is what shows in the first entry of each line. Following the user ID, separated by commas, are the subreddits (i.e., interests) that the user regularly posts in. In order for a user to be considered "active" in that subreddit, they had to post or comment there at least 10 times in their last 1,000 posts and comments.
d
Reddit blackout announcements: 2023 API protest
search.dataone.org
data.niaid.nih.gov
+1more
Updated Feb 8, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ben Pettis (2024). Reddit blackout announcements: 2023 API protest [Dataset]. http://doi.org/10.5061/dryad.qfttdz0qd
Explore at:
Unique identifier
https://doi.org/10.5061/dryad.qfttdz0qd
Dataset updated
Feb 8, 2024
Dataset provided by
Dryad Digital Repository
Authors
Ben Pettis
Time period covered
Jan 1, 2024
Description
Starting June 12, 2023, many Reddit communities (subreddits) began a protest where they "went dark" - by changing to private mode - as a protest in response to Reddit's plans to change its API access policies and fee structure. Supporters of the protest criticize the planned changes for being prohibitively expensive for 3rd party apps. Beyond 3rd party apps, there is significant concern that the API changes are a move by the platform to increase monetization, degrade the user experience, and eventually kill off other custom features such as the old.reddit.com interface, the Reddit Enhancement Suite browser extension, and more. Additionally, there are concerns that the API changes will impede the ability of subreddit moderators (who are all unpaid users) to access tools to keep their communities on-topic and free of spam. This dataset includes the "stickied" posts that appeared on 5,351 subreddits on June 11, 2023 and June 12, 2023 - including many subreddits announcing their plans to pa..., The list of subreddits was created from the ist of participating subreddits that had been collated in the /r/ModCoord subreddit. An initial Python script looks at three reddit posts and grabs the list of participating subreddits:

https://www.reddit.com/r/ModCoord/comments/1401qw5/incomplete_and_growing_list_of_participating/ https://www.reddit.com/r/ModCoord/comments/143fzf6/incomplete_and_growing_list_of_participating/ https://www.reddit.com/r/ModCoord/comments/146ffpb/incomplete_and_growing_list_of_participating/

It uses the requests library to get the HTTP response body. Then it uses re to search for links that look like r/iphone, e.g. what the list looks like in the post. Next it's just a bit of string cleanup and then writing to an output file. This script does not use the Reddit API at all. It's just basic HTTP requests. A second Python script then reads that list and uses the Reddit API to request information about current posts in each subr..., , # Reddit Blackout Announcements - 2023 API Protest

Reddit Blackout Announcements - 2023 API Protest

This dataset includes the list of scraped subreddits, a single CSV file for each subreddit, and a copy of the Python scripts used to scrape the data.

Description of the data and file structure

The dataset is uploaded as a single .zip file. Once it is downloaded and decompressed, it will include several files and directories. Here is how they are organized . â””â”€â”€ subreddit-list.txt â””â”€â”€ CSVs â””â”€â”€ [subreddit-name].csv â””â”€â”€ [...] â””â”€â”€ code â””â”€â”€ [...] â””â”€â”€ parsed TXTs â””â”€â”€ API.txt â””â”€â”€ blackout.txt â””â”€â”€ community.txt â””â”€â”€ mod-team.txt â””â”€â”€ moderator.txt â””â”€â”€ platform.txt â””â”€â”€ protest.txt

Subreddit List

The subreddit-list.txt file contains a list of 5,351 subreddit names. Each appears on its own line. This list was generated using the list-subreddits.py script, as described below.

Stickied Posts - CSVs

The "CSVs" directory contains 5,351 CSV (Comma Separated Value) files, each named ...
A
‘One Million Reddit Confessions’ analyzed by Analyst-2
analyst-2.ai
Updated Feb 13, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘One Million Reddit Confessions’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-one-million-reddit-confessions-8471/839caf4b/?iid=000-410&v=presentation
Explore at:
Dataset updated
Feb 13, 2022
Dataset authored and provided by
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Analysis of ‘One Million Reddit Confessions’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/yamqwe/one-million-reddit-confessions-samplee on 13 February 2022.

--- Dataset description provided by original source is as follows ---

About this dataset

NOTICE

Due to the platform's limitations, we can only provide a sample of this dataset. Please download the full version (free, no registration) from SocialGrep.

Context

For one reason or another, people are compelled to be frank with strangers. Whether it's making a fast friend on a train ride, or posting an anonymous confession online, we just tend to find it easier to let our secrets out to someone we'll never know again. A brief, beautiful window of candid honesty is somewhere in there. That's what this dataset was inspired by.

Content

The following dataset comprises a million confession posts from Sep 30 2021 and backwards, proportionally taken from the following subreddits:

/r/trueoffmychest

/r/confession

/r/confessions

/r/offmychest

All the posts are annotated with their score.

The dataset was procured using SocialGrep.

To preserve users' anonymity and to prevent targeted harassment, the data does not include usernames.

Inspiration

In this dataset, we wanted to explore the nature of sympathy. Which confessions are met with forgiveness? Which aren't? It's our most candid corpus to date.

This dataset was created by SocialGrep and contains around 100 samples along with Subreddit.nsfw, Domain, technical information and other features such as: - Subreddit.name - Subreddit.id - and more.

How to use this dataset

Analyze Type in relation to Score

Study the influence of Selftext on Url

More datasets

Acknowledgements

If you use this dataset in your research, please credit SocialGrep

Start A New Notebook!

--- Original source retains full ownership of the source dataset ---
h
reddit-ArtistHate
huggingface.co
Updated Jun 19, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Trent Kelly (2025). reddit-ArtistHate [Dataset]. https://huggingface.co/datasets/trentmkelly/reddit-ArtistHate
Explore at:
Dataset updated
Jun 19, 2025
Authors
Trent Kelly
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
What is this dataset?

This is 831 thread, comment, comment reply triplets from r/ArtistHate. You can use this dataset to create a fine-tuned LLM that hates AI as much as the r/ArtistHate users do. Each row in this dataset has, in its system prompt, LLM-generated tone and instruction texts, allowing the resulting fine-tune to be steered. See the data explorer for examples of how to properly format the system prompt.

Notice of Soul Trappin

By permitting the inclusion of… See the full description on the dataset page: https://huggingface.co/datasets/trentmkelly/reddit-ArtistHate.
P
REDDIT-12K Dataset
paperswithcode.com
Updated Mar 15, 2021
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Pinar Yanardag; S. V. N. Vishwanathan (2021). REDDIT-12K Dataset [Dataset]. https://paperswithcode.com/dataset/reddit-12k
Explore at:
Dataset updated
Mar 15, 2021
Authors
Pinar Yanardag; S. V. N. Vishwanathan
Description
Reddit12k contains 11929 graphs each corresponding to an online discussion thread where nodes represent users, and an edge represents the fact that one of the two users responded to the comment of the other user. There is 1 of 11 graph labels associated with each of these 11929 discussion graphs, representing the category of the community.
o
Reddit: /r/Bitcoin
opendatabay.com
.undefined
Updated Jun 17, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Datasimple (2025). Reddit: /r/Bitcoin [Dataset]. https://www.opendatabay.com/data/ai-ml/afb22b14-6266-47ec-be7f-c936582d61ab
Explore at:
.undefinedAvailable download formats
Dataset updated
Jun 17, 2025
Dataset authored and provided by
Datasimple
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Area covered
Data Science and Analytics
Description
his dataset provides a window into the user perspectives on one of the world's most popular cryptocurrencies—Bitcoin. The dataset contains rich information from Reddit comments from the Bitcoin Subreddit across 2020 and beyond, letting you learn about user conversations, topics discussed and sentiments expressed in this vibrant community. Dive deep into different aspects of cryptocurrency by using this comprehensive collection of Reddit comments - Break down comments based on time, replies, score and more to gain unique insights. Follow trends over time and identify primary hot topics that excite the Bitcoin subreddit - all at your fingertips! Get a better understanding of who is driving cryptocurrency discussions today with this invaluable resource!

More Datasets For more datasets, click here.

Featured Notebooks 🚨 Your notebook can be here! 🚨! How to use the dataset This dataset contains user comments from the Bitcoin subreddit over the past year and a half, providing insight into user perspectives on the popular cryptocurrency. In order to make use of this data, it is helpful to have a working understanding of some common statistical concepts such as descriptive statistics, central tendency, and distributions. As well as basic SQL queries.

Research Ideas Sentiment analysis of Bitcoin Subreddit comments to examine the public’s perception of cryptocurrency. Identification and visualization of correlations between Reddit comments and changes in the value of Bitcoin cryptocurrency markets over time. Identifying user trends in topic preferences for Bitcoin discussions on Reddit by analyzing the body content, topics discussed and URL associated with each comment made on the subreddit Acknowledgements If you use this dataset in your research, please credit the original authors. Data Source

License

CC0

Original Data Source: Reddit: /r/Bitcoin
h
reddit_threads
huggingface.co
Updated Apr 13, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Graph Datasets (2023). reddit_threads [Dataset]. https://huggingface.co/datasets/graphs-datasets/reddit_threads
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Apr 13, 2023
Dataset authored and provided by
Graph Datasets
License
https://choosealicense.com/licenses/gpl-3.0/https://choosealicense.com/licenses/gpl-3.0/
Description
Dataset Card for Reddit threads

Dataset Summary

The Reddit threads dataset contains 'discussion and non-discussion based threads from Reddit which we collected in May 2018. Nodes are Reddit users who participate in a discussion and links are replies between them' (doc).

Supported Tasks and Leaderboards

The related task is the binary classification to predict whether a thread is discussion based or not.

External Use PyGeometric

To load in… See the full description on the dataset page: https://huggingface.co/datasets/graphs-datasets/reddit_threads.

Facebook

Twitter

Click to copy link

Link copied

Cite

Sumit Mishra (2025). Reddit AskScience Flair Analysis Dataset [Dataset]. https://www.kaggle.com/datasets/sumitm004/reddit-raskscience-flair-dataset

Reddit AskScience Flair Analysis Dataset

Dataset for Predicting Post Flair Categories on r/AskScience Subreddit

Explore at:

CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.

Dataset updated

Feb 15, 2025

Dataset provided by

Kagglehttp://kaggle.com/

Authors

Sumit Mishra

License

Open Data Commons Attribution License (ODC-By) v1.0https://www.opendatacommons.org/licenses/by/1.0/
License information was derived automatically

Description

Context

Reddit is a massive platform for news, content, and discussions, hosting millions of active users daily. Among its vast number of subreddits, we focus on the r/AskScience community, where users engage in science-related discussions and questions.

Content

This dataset is derived from the r/AskScience subreddit, collected between January 1, 2016, and May 20, 2022. It includes 612,668 datapoints across 22 columns, featuring diverse information such as the content of the questions, submission descriptions, associated flairs, NSFW/SFW status, year of submission, and more. The data was extracted using Python and Pushshift's API, followed by some cleaning with NumPy and pandas. Detailed column descriptions are available for clarity.

Mendeley Data

Ideas for Usage

Flair Prediction:Train models to predict post flairs (e.g., 'Science', 'Ask', 'Discussion') to automate content categorization for platforms like Reddit.
NSFW Classification: Classify posts as SFW or NSFW based on textual content, enabling content moderation tools for online forums.
Text Mining / NLP Tasks: Apply NLP techniques like Sentiment Analysis, Topic Modeling, and Text Classification to explore the content and themes of science-related discussions.
Community Engagement Analysis: Investigate which post types or flairs generate more engagement (e.g., upvotes or comments), offering insights into user interaction.
Trend Detection in Science Topics: Identify emerging science topics and analyze shifts in interest areas, which can help predict future trends in scientific discussions.

Clear search

Close search

Google apps

Main menu

Reddit AskScience Flair Analysis Dataset

Context

Content

Ideas for Usage

REDDIT-BINARY Dataset

Reddit r/AskScience Flair Dataset

Dataplex: Reddit Data | Consumer Behavior Data | 2.1M+ subreddits: trends,...

Reddit Submissions

image_id, unixtime, rawtime, title, total_votes, reddit_id,...

Comprehensive dataset of over 4000 subreddits across 13 categories

Dataset — Make Reddit Great Again: Assessing Community Effects of Moderation...

Reddit Ideological and Extreme Bias Dataset - Part 1

WikiReddit: Tracing Information and Attention Flows Between Online Platforms...

Preprint

Abstract

Datasheet

Motivation

Composition

Collection Process

Preprocessing/cleaning/labeling

Uses

Distribution

Maintenance

SQL Database Schema

Table: posts

Table: comments

Table: postlinks

Table: commentlinks

Reddit Italy Coffee Dataset

Dataset for: The Evolution of the Manosphere Across the Web

Reddit users in the United States 2019-2028

Data from: The Reddit Politosphere: A Large-Scale Text and Network Resource...

reddit user posting behavior (mid-2013)

Reddit blackout announcements: 2023 API protest

Reddit Blackout Announcements - 2023 API Protest

Description of the data and file structure

Subreddit List

Stickied Posts - CSVs

‘One Million Reddit Confessions’ analyzed by Analyst-2

About this dataset

NOTICE

Context

Content

Inspiration

How to use this dataset

Acknowledgements

Start A New Notebook!

reddit-ArtistHate

REDDIT-12K Dataset

Reddit: /r/Bitcoin

License

reddit_threads

Reddit AskScience Flair Analysis Dataset

Dataset for Predicting Post Flair Categories on r/AskScience Subreddit

Context

Content

Ideas for Usage

Table: `posts`

Table: `comments`

Table: `postlinks`

Table: `commentlinks`