Open Data Commons Attribution License (ODC-By) v1.0 https://www.opendatacommons.org/licenses/by/1.0/
License information was derived automatically
Reddit is a massive platform for news, content, and discussions, hosting millions of active users daily. Among its vast number of subreddits, we focus on the r/AskScience community, where users engage in science-related discussions and questions.
This dataset is derived from the r/AskScience subreddit, collected between January 1, 2016, and May 20, 2022. It includes 612,668 datapoints across 22 columns, featuring diverse information such as the content of the questions, submission descriptions, associated flairs, NSFW/SFW status, year of submission, and more. The data was extracted using Python and Pushshift's API, followed by some cleaning with NumPy and pandas. Detailed column descriptions are available for clarity.
https://creativecommons.org/publicdomain/zero/1.0/
This data is an extract from a larger Reddit dataset (all Reddit comments from May 2019, 157 GB of data uncompressed) that contains both more comments and more associated information (timestamps, author, flairs, etc.).
For ease of use, I picked the first 25,000 comments for each of the 40 most frequented subreddits (May 2019), so that the volumes are balanced if anyone wants to use the subreddit as categorical data.
I also excluded removed comments, comments whose author was deleted, and comments deemed too short (fewer than 4 tokens), and converted the format from JSON to CSV.
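The filtering steps described above can be sketched roughly as follows. This is a minimal illustration, not the author's actual script: the field names `body`, `author`, and `subreddit`, the `[removed]`/`[deleted]` markers, and whitespace tokenization are all assumptions, and the selection of the 40 most frequented subreddits is left out.

```python
import csv
import io
import json
from collections import defaultdict

MIN_TOKENS = 4          # comments with fewer tokens are dropped
PER_SUBREDDIT = 25_000  # per-subreddit cap taken from the description


def filter_comments(json_lines, per_subreddit=PER_SUBREDDIT, min_tokens=MIN_TOKENS):
    """Yield kept comments from an iterable of JSON strings (one comment each)."""
    kept = defaultdict(int)
    for line in json_lines:
        comment = json.loads(line)
        body = comment.get("body", "")
        # Skip removed comments and comments whose author was deleted
        # (marker strings are assumed, not confirmed by the description).
        if body in ("[removed]", "[deleted]") or comment.get("author") == "[deleted]":
            continue
        # Whitespace splitting is an assumed stand-in for the original tokenizer.
        if len(body.split()) < min_tokens:
            continue
        sub = comment["subreddit"]
        if kept[sub] >= per_subreddit:
            continue
        kept[sub] += 1
        yield {"subreddit": sub, "body": body}


def to_csv(rows):
    """Serialize the kept rows as CSV text (the json -> csv step)."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["subreddit", "body"])
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


# Made-up sample records, for illustration only.
sample = [
    '{"subreddit": "funny", "author": "a", "body": "this is a long enough comment"}',
    '{"subreddit": "funny", "author": "[deleted]", "body": "author was deleted here ok"}',
    '{"subreddit": "funny", "author": "b", "body": "[removed]"}',
    '{"subreddit": "funny", "author": "c", "body": "too short"}',
    '{"subreddit": "news", "author": "d", "body": "another comment that is kept"}',
    '{"subreddit": "funny", "author": "e", "body": "second funny comment over the cap"}',
]
rows = list(filter_comments(sample, per_subreddit=1))
print(to_csv(rows))
```

With the cap lowered to 1 for the demo, only the first valid comment per subreddit survives. Processing the input as a stream keeps memory use flat, which matters for a dump of this size.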
This is primarily an NLP dataset, but in addition to the comments I added the 3 features I deemed most important, also aiming for variety in feature types.
The information kept here is:
The data is only a small extract of what pushshift.io collects on a monthly basis. You can easily find the full data if you want to work with more features and more comments.
Have fun! The variety of feature types should allow you to gain a few interesting insights or build some simple models.
If you think the license (CC0: Public Domain) should be different, contact me.
https://creativecommons.org/publicdomain/zero/1.0/
This dataset contains data on 12K Reddit posts made to the r/UkraineRussiaReport subreddit. Information about sentiment (pro-Ukraine, pro-Russia, neither) was extracted from the post titles.
The dataset's sentiment labels are somewhat noisy. This is because post sentiment is classified by the author of a post.
Data was collected using the Pushshift Reddit API during May 2023.
Each post includes information about:
- post ID
- pov (sentiment)
- post title
- score (upvotes)
- author
- number of comments
- when the post was created
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The data is a pipe-delimited .txt file.
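A pipe-delimited text file can be read with Python's standard `csv` module by setting the delimiter. The sketch below uses an in-memory sample; the column names `id`, `title`, and `score` are hypothetical, since the file's actual header is not given here.

```python
import csv
import io

# Hypothetical two-row sample; the real file's columns are not listed in the text.
sample = io.StringIO(
    "id|title|score\n"
    "abc123|Some post title|42\n"
    "def456|Another, with a comma|7\n"
)

# delimiter="|" makes the reader split on pipes instead of commas,
# so commas inside a field need no quoting.
reader = csv.DictReader(sample, delimiter="|")
rows = list(reader)
print(rows[1]["title"])
```

For a file on disk, the same reader works on `open(path, newline="", encoding="utf-8")`. With pandas, `pd.read_csv(path, sep="|")` is the equivalent call.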
This dataset includes 5671 requests collected from the Reddit community Random Acts of Pizza between December 8, 2010 and September 29, 2013 (retrieved on September 30, 2013). All requests ask for the same thing: a free pizza. The outcome of each request -- whether its author received a pizza or not -- is known. Meta-data includes information such as: time of the request, activity of the requester, community-age of the requester, etc.
This dataset was featured in our completed playground competition entitled Random Acts of Pizza. The objective of the competition was to create an algorithm capable of predicting which requests will garner a cheesy (but sincere!) act of kindness.
The data are stored in JSON format. Each JSON entry corresponds to one request (the first and only request by the requester on Random Acts of Pizza). We have removed fields from the test set which would not be available at the time of posting. The datasets include the following fields:
"giver_username_if_known": Reddit username of giver if known, i.e. the person satisfying the request ("N/A" otherwise).
"number_of_downvotes_of_request_at_retrieval": Number of downvotes at the time the request was collected.
"number_of_upvotes_of_request_at_retrieval": Number of upvotes at the time the request was collected.
"post_was_edited": Boolean indicating whether this post was edited (from Reddit).
"request_id": Identifier of the post on Reddit, e.g. "t3_w5491".
"request_number_of_comments_at_retrieval": Number of comments for the request at time of retrieval.
"request_text": Full text of the request.
"request_text_edit_aware": Edit-aware version of "request_text". We use a set of rules to strip edited comments indicating the success of the request, such as "EDIT: Thanks /u/foo, the pizza was delicious".
"request_title": Title of the request.
"requester_account_age_in_days_at_request": Account age of requester in days at time of request.
"requester_account_age_in_days_at_retrieval": Account age of requester in days at time of retrieval.
"requester_days_since_first_post_on_raop_at_request": Number of days between the requester's first post on RAOP and this request (zero if the requester has never posted before on RAOP).
"requester_days_since_first_post_on_raop_at_retrieval": Number of days between the requester's first post on RAOP and time of retrieval.
"requester_number_of_comments_at_request": Total number of comments on Reddit by requester at time of request.
"requester_number_of_comments_at_retrieval": Total number of comments on Reddit by requester at time of retrieval.
"requester_number_of_comments_in_raop_at_request": Total number of comments in RAOP by requester at time of request.
"requester_number_of_comments_in_raop_at_retrieval": Total number of comments in RAOP by requester at time of retrieval.
"requester_number_of_posts_at_request": Total number of posts on Reddit by requester at time of request.
"requester_number_of_posts_at_retrieval": Total number of posts on Reddit by requester at time of retrieval.
"requester_number_of_posts_on_raop_at_request": Total number of posts in RAOP by requester at time of request.
"requester_number_of_posts_on_raop_at_retrieval": Total number of posts in RAOP by requester at time of retrieval.
"requester_number_of_subreddits_at_request": The number of subreddits in which the author had already posted at the time of request.
"requester_received_pizza": Boolean indicating the success of the request, i.e., whether the requester received pizza.
"requester_subreddits_at_request": The list of subreddits in which the author had already posted at the time of request.
"requester_upvotes_minus_downvotes_at_request": Difference of total upvotes and total downvotes of requester at time of request.
"requester_upvotes_minus_downvotes_at_retrieval": Difference of total upvotes and total downvotes of requester at time of retrieval.
"requester_upvotes_plus_downvotes_at_request": Sum of total upvotes and total downvotes of requester at time of request.
"requester_upvotes_plus_downvotes_at_retrieval": Sum of total upvotes and total downvotes of requester at time of retrieval.
"requester_user_flair": Users on RAOP receive badges (Reddit calls them flairs), which are small pictures next to their username. In our data set the user flair is either None (neither given nor received pizza, N=4282), "shroom" (received pizza, but not given, N=1306), or "PIF" (pizza given after having received, N=83).
"requester_username": Reddit username of requester.
"unix_timestamp_of_request": Unix timestamp of request (supposedly in timezone of user, but in most cases it is equal to the UTC timestamp -- which is incorrect since most RAOP ...
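As a rough illustration of working with these fields, the sketch below parses a few records and computes the share of successful requests. It assumes the file is a single JSON array of request objects. The two records are made up and use only a handful of the field names listed above.

```python
import json

# Two made-up records using field names from the list above; values are illustrative only.
raw = """[
  {"request_id": "t3_aaaaa",
   "request_text_edit_aware": "Hungry student, would love a pizza.",
   "requester_account_age_in_days_at_request": 120.5,
   "requester_received_pizza": true},
  {"request_id": "t3_bbbbb",
   "request_text_edit_aware": "Lost my job this week.",
   "requester_account_age_in_days_at_request": 0.0,
   "requester_received_pizza": false}
]"""

requests = json.loads(raw)

# The boolean outcome field marks whether the requester received a pizza.
received = [r for r in requests if r["requester_received_pizza"]]
success_rate = len(received) / len(requests)
print(f"{len(requests)} requests, success rate {success_rate:.2f}")
```

For a predictive model, the "_at_request" fields and "request_text_edit_aware" are the natural inputs. The "_at_retrieval" fields describe the state after the outcome and would not be available at the time of posting.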
I recently saw that EA lost their exclusive rights to developing Star Wars games. This will allow other studios to develop games using the Star Wars title. I thought this subject would be interesting to look into because there has been some negative press associated with some of the games that EA has created under the Star Wars franchise. In order to see the feedback from this news, I went to reddit to grab comments associated with this news.
Column Description:
- Author - Author of the comment
- Upvote -
- Subreddit_id - The ID of the subreddit that the comment belongs to
- Score - Number of upvotes
- Replies - A forest of comments starting with the top-level comment
- Comments - The actual comment made
I used the PRAW API, which allows you to pull information from Reddit. Information on the API can be found at: https://praw.readthedocs.io/en/latest/index.html#
The thread used (in r/gaming) was: https://www.reddit.com/r/gaming/comments/kwi9yr/ea_will_no_longer_have_exclusive_rights_of_the/
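Because replies form a forest of nested comments, sentiment analysis usually needs them flattened into one row per comment first. The sketch below shows one way to do that with plain dictionaries. The nested structure and the keys `author`, `score`, `comment`, and `replies` are hypothetical stand-ins for illustration, not PRAW objects or the dataset's exact layout.

```python
def flatten(forest, depth=0):
    """Yield one row per comment, walking the reply forest depth-first."""
    for comment in forest:
        yield {
            "depth": depth,  # 0 for top-level comments
            "author": comment["author"],
            "score": comment["score"],
            "comment": comment["comment"],
        }
        # Recurse into the replies, one level deeper.
        yield from flatten(comment.get("replies", []), depth + 1)


# Made-up comment forest, for illustration only.
forest = [
    {"author": "user_a", "score": 12, "comment": "Finally, more studios!",
     "replies": [
         {"author": "user_b", "score": 3, "comment": "Agreed.", "replies": []},
         {"author": "user_c", "score": -1, "comment": "EA was fine.",
          "replies": [
              {"author": "user_a", "score": 2, "comment": "Hard disagree.", "replies": []},
          ]},
     ]},
    {"author": "user_d", "score": 5, "comment": "Curious who picks it up.", "replies": []},
]

rows = list(flatten(forest))
for row in rows:
    print("  " * row["depth"] + f'{row["author"]} ({row["score"]}): {row["comment"]}')
```

Keeping the depth alongside each row makes it possible to compare reactions in top-level comments with those deeper in a discussion.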
Determine whether users have a positive or negative reaction to the change in Star Wars rights associated with EA.