6 datasets found

Reddit AskScience Flair Analysis Dataset
Dataset for Predicting Post Flair Categories on r/AskScience Subreddit
kaggle.com
Updated Feb 15, 2025
Cite
Sumit Mishra (2025). Reddit AskScience Flair Analysis Dataset [Dataset]. https://www.kaggle.com/datasets/sumitm004/reddit-raskscience-flair-dataset
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Sumit Mishra
License
Open Data Commons Attribution License (ODC-By) v1.0 (https://www.opendatacommons.org/licenses/by/1.0/)
License information was derived automatically
Description
Context

Reddit is a massive platform for news, content, and discussions, hosting millions of active users daily. Among its vast number of subreddits, we focus on the r/AskScience community, where users engage in science-related discussions and questions.

Content

This dataset is derived from the r/AskScience subreddit, collected between January 1, 2016, and May 20, 2022. It includes 612,668 datapoints across 22 columns, featuring diverse information such as the content of the questions, submission descriptions, associated flairs, NSFW/SFW status, year of submission, and more. The data was extracted using Python and Pushshift's API, followed by some cleaning with NumPy and pandas. Detailed column descriptions are available for clarity.

Mendeley Data

Ideas for Usage

Flair Prediction: Train models to predict post flairs (e.g., 'Science', 'Ask', 'Discussion') to automate content categorization for platforms like Reddit.

NSFW Classification: Classify posts as SFW or NSFW based on textual content, enabling content moderation tools for online forums.

Text Mining / NLP Tasks: Apply NLP techniques like Sentiment Analysis, Topic Modeling, and Text Classification to explore the content and themes of science-related discussions.

Community Engagement Analysis: Investigate which post types or flairs generate more engagement (e.g., upvotes or comments), offering insights into user interaction.

Trend Detection in Science Topics: Identify emerging science topics and analyze shifts in interest areas, which can help predict future trends in scientific discussions.
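
As a rough illustration of the community engagement idea above, the sketch below averages post scores per flair. The rows are toy data, and the column names "flair" and "score" are assumptions, since the description does not list the 22 columns.

```python
from collections import defaultdict
from statistics import mean

# Toy rows standing in for the real data. "flair" and "score" are
# hypothetical column names, not confirmed by the dataset description.
posts = [
    {"flair": "Science", "score": 120},
    {"flair": "Science", "score": 80},
    {"flair": "Ask", "score": 30},
    {"flair": "Ask", "score": 50},
    {"flair": "Discussion", "score": 10},
]


def mean_score_by_flair(rows):
    """Group rows by flair and return the average score per flair."""
    scores = defaultdict(list)
    for row in rows:
        scores[row["flair"]].append(row["score"])
    return {flair: mean(values) for flair, values in scores.items()}


engagement = mean_score_by_flair(posts)
# Print flairs from most to least engaging.
for flair, avg in sorted(engagement.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{flair}: {avg}")
```

The same grouping could be applied to comment counts instead of scores if the real data has such a column.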
1 million Reddit comments from 40 subreddits
kaggle.com
Updated Feb 3, 2020
Cite
Samuel Magnan (2020). 1 million Reddit comments from 40 subreddits [Dataset]. https://www.kaggle.com/smagnan/1-million-reddit-comments-from-40-subreddits/activity
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Samuel Magnan
License
CC0: Public Domain (https://creativecommons.org/publicdomain/zero/1.0/)
Description
Content

This data is an extract from a bigger Reddit dataset (all Reddit comments from May 2019, 157 GB of data uncompressed) that contains both more comments and more associated information (timestamps, author, flairs, etc.).

For ease of use, I picked the first 25,000 comments for each of the 40 most frequented subreddits (May 2019). This way, if anyone wants to use the subreddit as categorical data, the volumes are balanced.

I also excluded any removed comments, comments whose author was deleted, and comments deemed too short (fewer than 4 tokens), and changed the format (JSON -> CSV).
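
The selection and filtering described above can be sketched roughly as follows. This is a minimal illustration, not the author's script: the "[removed]" and "[deleted]" markers, the whitespace tokenizer, and the choice to filter before counting toward the per-subreddit cap are all assumptions.

```python
from collections import defaultdict


def select_comments(comments, per_subreddit=25_000, min_tokens=4):
    """Keep the first `per_subreddit` usable comments for each subreddit."""
    kept = defaultdict(list)
    for c in comments:
        body = c["body"]
        # Assumed markers for removed comments and deleted authors.
        if body == "[removed]" or c["author"] == "[deleted]":
            continue
        # Whitespace splitting stands in for whatever tokenizer was used.
        if len(body.split()) < min_tokens:
            continue
        # Cap each subreddit so the categories stay balanced.
        if len(kept[c["subreddit"]]) < per_subreddit:
            kept[c["subreddit"]].append(c)
    return dict(kept)


# Made-up comments for demonstration only.
comments = [
    {"subreddit": "AskReddit", "author": "a", "body": "this is a long enough comment"},
    {"subreddit": "AskReddit", "author": "b", "body": "too short"},
    {"subreddit": "AskReddit", "author": "[deleted]", "body": "author is gone from this one"},
    {"subreddit": "AskReddit", "author": "c", "body": "another perfectly fine comment here"},
    {"subreddit": "gaming", "author": "d", "body": "[removed]"},
    {"subreddit": "gaming", "author": "e", "body": "games are fun to play"},
]
selected = select_comments(comments, per_subreddit=1)
print({sub: [c["author"] for c in rows] for sub, rows in selected.items()})
```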

This is primarily an NLP dataset, but in addition to the comments I kept the 3 features I deemed the most important. I also aimed for variety in feature types.

The information kept here is:

subreddit (categorical): on which subreddit the comment was posted

body (str): comment content

controversiality (binary): a reddit aggregated metric

score (scalar): upvotes minus downvotes

Acknowledgements

The data is but a small extract of what is being collected by pushshift.io on a monthly basis. You can easily find the full data there if you want to work with more features and more comments.

What can I do with that?

Have fun! The variety of feature types should allow you to gain a few interesting insights or build some simple models.

Note

If you think the license (CC0: Public Domain) should be different, contact me.
Reddit Posts Relating to Russia-Ukraine War
kaggle.com
Updated Jul 15, 2023
Cite
Dan H (2023). Reddit Posts Relating to Russia-Ukraine War [Dataset]. https://www.kaggle.com/danhealey/russia-ukraine-sentiment-analysis/discussion
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Dan H
License
CC0: Public Domain (https://creativecommons.org/publicdomain/zero/1.0/)
Area covered
Russia, Ukraine
Description
This dataset contains data on 12K Reddit posts made to the r/UkraineRussiaReport subreddit. Information about sentiment (pro-Ukraine, pro-Russia, neither) was extracted from the post titles.

The dataset's sentiment labels are somewhat noisy because the sentiment of each post is classified by the post's own author.

Data was collected using the Pushshift Reddit API during May 2023.

Each post includes information about:
- post ID
- pov (sentiment)
- post title
- score (upvotes)
- author
- number of comments
- when the post was created
125,000 Reddit Comments about Diabetes
kaggle.com
zip
Updated Jan 31, 2017
Cite
AndrewMalinow, PhD (2017). 125,000 Reddit Comments about Diabetes [Dataset]. https://www.kaggle.com/amalinow/125000-reddit-comments-about-diabetes
Available download format: zip (17469081 bytes)
Authors
AndrewMalinow, PhD
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
The data is a pipe-delimited .txt file.
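
A pipe-delimited file can be read with Python's standard csv module by changing the delimiter. The sketch below uses an in-memory sample. The header row and the column names are hypothetical, because the description does not say which fields the file contains or whether it has a header at all.

```python
import csv
import io

# Stand-in for the real file. The header and column names are hypothetical
# because the dataset description does not list them.
sample = io.StringIO(
    "comment_id|subreddit|body\n"
    "c1|diabetes|Checked my blood sugar this morning\n"
    "c2|diabetes|Anyone else using a pump?\n"
)

# delimiter="|" splits fields on pipes instead of commas.
# QUOTE_NONE keeps quote characters inside comment text as literal text.
reader = csv.DictReader(sample, delimiter="|", quoting=csv.QUOTE_NONE)
rows = list(reader)
print(len(rows), rows[0]["body"])
```

For the actual file, replace the in-memory sample with an open file handle. If the file has no header row, pass the column names through the `fieldnames` argument of `csv.DictReader`.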
Random Acts of Pizza
kaggle.com
Updated Jan 19, 2017
Cite
Kaggle (2017). Random Acts of Pizza [Dataset]. https://www.kaggle.com/kaggle/random-acts-of-pizza/activity
Dataset authored and provided by
Kaggle
Description
Context

This dataset includes 5671 requests collected from the Reddit community Random Acts of Pizza between December 8, 2010 and September 29, 2013 (retrieved on September 30, 2013). All requests ask for the same thing: a free pizza. The outcome of each request -- whether its author received a pizza or not -- is known. Meta-data includes information such as: time of the request, activity of the requester, community-age of the requester, etc.

This dataset was featured in our completed playground competition entitled Random Acts of Pizza. The objective of the competition was to create an algorithm capable of predicting which requests will garner a cheesy (but sincere!) act of kindness.

Content

The data are stored in JSON format. Each JSON entry corresponds to one request (the first and only request by the requester on Random Acts of Pizza). We have removed fields from the test set which would not be available at the time of posting. The datasets include the following fields:

"giver_username_if_known": Reddit username of giver if known, i.e. the person satisfying the request ("N/A" otherwise).

"number_of_downvotes_of_request_at_retrieval": Number of downvotes at the time the request was collected.

"number_of_upvotes_of_request_at_retrieval": Number of upvotes at the time the request was collected.

"post_was_edited": Boolean indicating whether this post was edited (from Reddit).

"request_id": Identifier of the post on Reddit, e.g. "t3_w5491".

"request_number_of_comments_at_retrieval": Number of comments for the request at time of retrieval.

"request_text": Full text of the request.

"request_text_edit_aware": Edit-aware version of "request_text". We use a set of rules to strip edited comments indicating the success of the request, such as "EDIT: Thanks /u/foo, the pizza was delicious".

"request_title": Title of the request.

"requester_account_age_in_days_at_request": Account age of requester in days at time of request.

"requester_account_age_in_days_at_retrieval": Account age of requester in days at time of retrieval.

"requester_days_since_first_post_on_raop_at_request": Number of days between the requester's first post on RAOP and this request (zero if the requester has never posted before on RAOP).

"requester_days_since_first_post_on_raop_at_retrieval": Number of days between the requester's first post on RAOP and time of retrieval.

"requester_number_of_comments_at_request": Total number of comments on Reddit by requester at time of request.

"requester_number_of_comments_at_retrieval": Total number of comments on Reddit by requester at time of retrieval.

"requester_number_of_comments_in_raop_at_request": Total number of comments in RAOP by requester at time of request.

"requester_number_of_comments_in_raop_at_retrieval": Total number of comments in RAOP by requester at time of retrieval.

"requester_number_of_posts_at_request": Total number of posts on Reddit by requester at time of request.

"requester_number_of_posts_at_retrieval": Total number of posts on Reddit by requester at time of retrieval.

"requester_number_of_posts_on_raop_at_request": Total number of posts in RAOP by requester at time of request.

"requester_number_of_posts_on_raop_at_retrieval": Total number of posts in RAOP by requester at time of retrieval.

"requester_number_of_subreddits_at_request": The number of subreddits in which the author had already posted at the time of request.

"requester_received_pizza": Boolean indicating the success of the request, i.e., whether the requester received pizza.

"requester_subreddits_at_request": The list of subreddits in which the author had already posted at the time of request.

"requester_upvotes_minus_downvotes_at_request": Difference of total upvotes and total downvotes of requester at time of request.

"requester_upvotes_minus_downvotes_at_retrieval": Difference of total upvotes and total downvotes of requester at time of retrieval.

"requester_upvotes_plus_downvotes_at_request": Sum of total upvotes and total downvotes of requester at time of request.

"requester_upvotes_plus_downvotes_at_retrieval": Sum of total upvotes and total downvotes of requester at time of retrieval.

"requester_user_flair": Users on RAOP receive badges (Reddit calls them flairs), which are small pictures next to their usernames. In our data set the user flair is either None (neither given nor received pizza, N=4282), "shroom" (received pizza, but not given, N=1306), or "PIF" (pizza given after having received, N=83).

"requester_username": Reddit username of requester.

"unix_timestamp_of_request": Unix timestamp of request (supposedly in timezone of user, but in most cases it is equal to the UTC timestamp -- which is incorrect since most RAOP ...
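
To show how the fields listed above might be used, the following sketch parses two made-up entries, computes the share of successful requests, and recovers separate upvote and downvote totals from the stored difference and sum. It assumes the file is a JSON array of objects. The request IDs and all values are invented for illustration.

```python
import json

# Two made-up entries using field names from the list above. The real file
# is assumed here to be a JSON array of such objects.
raw = """[
  {"request_id": "t3_aaaaa", "requester_received_pizza": true,
   "requester_upvotes_minus_downvotes_at_request": 40,
   "requester_upvotes_plus_downvotes_at_request": 100},
  {"request_id": "t3_bbbbb", "requester_received_pizza": false,
   "requester_upvotes_minus_downvotes_at_request": 0,
   "requester_upvotes_plus_downvotes_at_request": 0}
]"""
requests = json.loads(raw)


def split_votes(entry, when="request"):
    """Recover (upvotes, downvotes) from the stored difference and sum."""
    diff = entry[f"requester_upvotes_minus_downvotes_at_{when}"]
    total = entry[f"requester_upvotes_plus_downvotes_at_{when}"]
    # up - down = diff and up + down = total, so up = (total + diff) / 2.
    return (total + diff) // 2, (total - diff) // 2


# Booleans sum as 0/1, so this is the fraction of successful requests.
success_rate = sum(r["requester_received_pizza"] for r in requests) / len(requests)
print(success_rate, split_votes(requests[0]))
```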
EA Star Wars Exclusive Rights - Comment
kaggle.com
Updated Jan 26, 2021
Cite
Kirbyj45 (2021). EA Star Wars Exclusive Rights - Comment [Dataset]. https://www.kaggle.com/datasets/kirbyj45/reddit-ea-star-wars-exclusive-rights-comment
Dataset provided by
Kaggle
Authors
Kirbyj45
Description
Context

I recently saw that EA lost their exclusive rights to develop Star Wars games. This will allow other studios to develop games using the Star Wars title. I thought this subject would be interesting to look into because there has been some negative press associated with some of the games that EA has created under the Star Wars franchise. To see the reaction to this news, I went to Reddit to grab comments associated with it.

Content

Column Description:
Author - Author of the comment
Upvote -
Subreddit_id - The ID of the subreddit that the comment belongs to
Score - Number of upvotes
Replies - This will be a forest of comments starting with the top-level comment
Comments - The actual comment made

I used the PRAW API that allows you to pull information from reddit. Information on the API can be found in the following location: https://praw.readthedocs.io/en/latest/index.html#

The Reddit thread used was: https://www.reddit.com/r/gaming/comments/kwi9yr/ea_will_no_longer_have_exclusive_rights_of_the/
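
Because the Replies column is described as a forest of comments, an analysis will usually need to walk it. The sketch below flattens such a forest depth-first. It assumes the replies have already been parsed into nested dictionaries keyed by the column names above. How the dataset actually serializes Replies is not stated, so this structure is hypothetical.

```python
def flatten_forest(comments, depth=0):
    """Yield (depth, author, text) for every comment, parents before replies."""
    for comment in comments:
        yield depth, comment["Author"], comment["Comments"]
        # Recurse into the replies, one level deeper.
        yield from flatten_forest(comment.get("Replies", []), depth + 1)


# Made-up thread; the nested-dict layout is an assumption for illustration.
forest = [
    {
        "Author": "u1",
        "Comments": "Finally, more studios.",
        "Replies": [{"Author": "u2", "Comments": "Agreed.", "Replies": []}],
    },
    {"Author": "u3", "Comments": "I liked some of the EA games.", "Replies": []},
]
flat = list(flatten_forest(forest))
for depth, author, text in flat:
    print("  " * depth + f"{author}: {text}")
```

Flattening first makes it straightforward to score each comment for sentiment while keeping its depth in the thread.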

Task

Determine whether users have a positive or negative reaction to the change in Star Wars rights associated with EA.