6 datasets found
  1. Reddit AskScience Flair Analysis Dataset

    • kaggle.com
    Updated Feb 15, 2025
    Cite
    Sumit Mishra (2025). Reddit AskScience Flair Analysis Dataset [Dataset]. https://www.kaggle.com/datasets/sumitm004/reddit-raskscience-flair-dataset
    Dataset updated
    Feb 15, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Sumit Mishra
    License

    Open Data Commons Attribution License (ODC-By) v1.0 (https://www.opendatacommons.org/licenses/by/1.0/)
    License information was derived automatically

    Description

    Context

    Reddit is a massive platform for news, content, and discussions, hosting millions of active users daily. Among its vast number of subreddits, we focus on the r/AskScience community, where users engage in science-related discussions and questions.

    Content

    This dataset is derived from the r/AskScience subreddit, collected between January 1, 2016, and May 20, 2022. It includes 612,668 datapoints across 22 columns, featuring diverse information such as the content of the questions, submission descriptions, associated flairs, NSFW/SFW status, year of submission, and more. The data was extracted using Python and Pushshift's API, followed by some cleaning with NumPy and pandas. Detailed column descriptions are available for clarity.

    Mendeley Data

    Ideas for Usage

    • Flair Prediction: Train models to predict post flairs (e.g., 'Science', 'Ask', 'Discussion') to automate content categorization for platforms like Reddit.
    • NSFW Classification: Classify posts as SFW or NSFW based on textual content, enabling content moderation tools for online forums.
    • Text Mining / NLP Tasks: Apply NLP techniques like Sentiment Analysis, Topic Modeling, and Text Classification to explore the content and themes of science-related discussions.
    • Community Engagement Analysis: Investigate which post types or flairs generate more engagement (e.g., upvotes or comments), offering insights into user interaction.
    • Trend Detection in Science Topics: Identify emerging science topics and analyze shifts in interest areas, which can help predict future trends in scientific discussions.
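
    As a rough illustration of the community engagement idea above, the sketch below averages a score per flair using only the standard library. The column names `flair` and `score` and the sample rows are assumptions for illustration; the real names should be taken from the dataset's column descriptions.

```python
import csv
import io
from collections import defaultdict

# Made-up sample rows; "flair" and "score" are hypothetical column names.
SAMPLE = """flair,score
Science,12
Ask,3
Science,8
Ask,5
Discussion,1
"""


def mean_score_by_flair(csv_file, flair_col="flair", score_col="score"):
    """Return {flair: mean score} for rows read from a CSV file object."""
    totals = defaultdict(lambda: [0.0, 0])  # flair -> [score sum, row count]
    for row in csv.DictReader(csv_file):
        entry = totals[row[flair_col]]
        entry[0] += float(row[score_col])
        entry[1] += 1
    return {flair: total / count for flair, (total, count) in totals.items()}


means = mean_score_by_flair(io.StringIO(SAMPLE))
# Print flairs from highest to lowest mean score.
for flair, mean in sorted(means.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{flair}: {mean:.1f}")
```

    For the real file, the same function could be given `open(path, newline="", encoding="utf-8")` in place of the in-memory sample.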
  2. 1 million Reddit comments from 40 subreddits

    • kaggle.com
    Updated Feb 3, 2020
    Cite
    Samuel Magnan (2020). 1 million Reddit comments from 40 subreddits [Dataset]. https://www.kaggle.com/smagnan/1-million-reddit-comments-from-40-subreddits/activity
    Dataset updated
    Feb 3, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Samuel Magnan
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description


    Content

    This data is an extract from a bigger Reddit dataset (all Reddit comments from May 2019, 157 GB of data uncompressed) that contains both more comments and more associated information (timestamps, author, flairs, etc.).

    For ease of use, I picked the first 25,000 comments for each of the 40 most frequented subreddits (May 2019), so that the volumes are balanced if anyone wants to use the subreddit as categorical data.

    I also excluded removed comments, comments whose author was deleted, and comments deemed too short (fewer than 4 tokens), and converted the format from JSON to CSV.
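
    The selection steps described above could look roughly like the sketch below. It is not the author's actual script: the `[removed]` and `[deleted]` markers, whitespace tokenization, and the dict keys are assumptions chosen for illustration.

```python
from collections import defaultdict


def select_comments(comments, per_subreddit=25_000, min_tokens=4):
    """Keep the first `per_subreddit` usable comments for each subreddit.

    `comments` is an iterable of dicts with "subreddit", "author" and "body"
    keys. The "[removed]"/"[deleted]" markers and whitespace tokenization
    are assumptions, not the author's confirmed rules.
    """
    kept = defaultdict(list)
    for c in comments:
        if c["body"] in ("[removed]", "[deleted]"):
            continue  # comment was removed
        if c["author"] == "[deleted]":
            continue  # author account was deleted
        if len(c["body"].split()) < min_tokens:
            continue  # too short: fewer than min_tokens tokens
        if len(kept[c["subreddit"]]) < per_subreddit:
            kept[c["subreddit"]].append(c)
    return dict(kept)


# Made-up comments to exercise each filter.
sample = [
    {"subreddit": "aww", "author": "a1", "body": "what a very cute dog"},
    {"subreddit": "aww", "author": "a2", "body": "so cute"},
    {"subreddit": "aww", "author": "[deleted]", "body": "this one is also adorable"},
    {"subreddit": "news", "author": "a3", "body": "[removed]"},
    {"subreddit": "news", "author": "a4", "body": "thanks for sharing this article"},
    {"subreddit": "news", "author": "a5", "body": "here is another long comment"},
]
balanced = select_comments(sample, per_subreddit=1)
print({name: len(rows) for name, rows in balanced.items()})
```

    Capping every subreddit at the same count is what keeps the classes balanced when the subreddit is used as a label.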

    This is primarily an NLP dataset, but in addition to the comments I added the 3 features I deemed the most important, also aiming for variety in feature types.

    The information kept here is:

    • subreddit (categorical): on which subreddit the comment was posted
    • body (str): comment content
    • controversiality (binary): a reddit aggregated metric
    • score (scalar): upvotes minus downvotes

    Acknowledgements

    The data is but a small extract of what is being collected by pushshift.io on a monthly basis. You can easily find the full information if you want to work with more features and more data.

    What can I do with that?

    Have fun! The variety of feature types should allow you to gain a few interesting insights or build some simple models.

    Note

    If you think the license (CC0: Public Domain) should be different, contact me.

  3. Reddit Posts Relating to Russia-Ukraine War

    • kaggle.com
    Updated Jul 15, 2023
    Cite
    Dan H (2023). Reddit Posts Relating to Russia-Ukraine War [Dataset]. https://www.kaggle.com/danhealey/russia-ukraine-sentiment-analysis/discussion
    Dataset updated
    Jul 15, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Dan H
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    Russia, Ukraine
    Description

    This dataset contains data on 12K Reddit posts made to the r/UkraineRussiaReport subreddit. Information about sentiment (pro-Ukraine, pro-Russia, neither) was extracted from the post titles.

    The dataset's sentiment labels are somewhat noisy. This is because post sentiment is classified by the author of a post.

    Data was collected using the Pushshift Reddit API during May 2023.

    Each post includes information about:

    • post ID
    • pov (sentiment)
    • post title
    • score (upvotes)
    • author
    • number of comments
    • when the post was created

  4. 125,000 Reddit Comments about Diabetes

    • kaggle.com
    zip
    Updated Jan 31, 2017
    Cite
    AndrewMalinow, PhD (2017). 125,000 Reddit Comments about Diabetes [Dataset]. https://www.kaggle.com/amalinow/125000-reddit-comments-about-diabetes
    Available download formats: zip (17469081 bytes)
    Dataset updated
    Jan 31, 2017
    Authors
    AndrewMalinow, PhD
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
    License information was derived automatically

    Description

    The data is a pipe-delimited .txt file.
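
    A pipe-delimited text file can be read with Python's standard `csv` module by setting the delimiter. The sketch below assumes the file has a header row; the column names `comment_id` and `body` and the sample rows are made up, since the description does not list the columns.

```python
import csv
import io

# Made-up sample; the real file's columns are not listed in the description.
SAMPLE = (
    "comment_id|body\n"
    "1|Checked my glucose this morning\n"
    "2|Any tips for a new diagnosis?\n"
)


def read_pipe_delimited(text_file):
    """Parse a pipe-delimited file object into a list of dicts keyed by header."""
    return list(csv.DictReader(text_file, delimiter="|"))


rows = read_pipe_delimited(io.StringIO(SAMPLE))
print(len(rows), rows[0]["body"])
```

    For the extracted .txt file, `open(path, newline="", encoding="utf-8")` could be passed instead of the in-memory sample. Comments that themselves contain a pipe character would need quoting or extra handling.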

  5. Random Acts of Pizza

    • kaggle.com
    Updated Jan 19, 2017
    Cite
    Kaggle (2017). Random Acts of Pizza [Dataset]. https://www.kaggle.com/kaggle/random-acts-of-pizza/activity
    Dataset updated
    Jan 19, 2017
    Dataset authored and provided by
    Kaggle
    Description

    Context

    This dataset includes 5671 requests collected from the Reddit community Random Acts of Pizza between December 8, 2010 and September 29, 2013 (retrieved on September 30, 2013). All requests ask for the same thing: a free pizza. The outcome of each request -- whether its author received a pizza or not -- is known. Meta-data includes information such as: time of the request, activity of the requester, community-age of the requester, etc.

    This dataset was featured in our completed playground competition entitled Random Acts of Pizza. The objective of the competition was to create an algorithm capable of predicting which requests will garner a cheesy (but sincere!) act of kindness.

    Content

    The data are stored in JSON format. Each JSON entry corresponds to one request (the first and only request by the requester on Random Acts of Pizza). We have removed fields from the test set which would not be available at the time of posting. The datasets include the following fields:

    • "giver_username_if_known": Reddit username of giver if known, i.e. the person satisfying the request ("N/A" otherwise).

    • "number_of_downvotes_of_request_at_retrieval": Number of downvotes at the time the request was collected.

    • "number_of_upvotes_of_request_at_retrieval": Number of upvotes at the time the request was collected.

    • "post_was_edited": Boolean indicating whether this post was edited (from Reddit).

    • "request_id": Identifier of the post on Reddit, e.g. "t3_w5491".

    • "request_number_of_comments_at_retrieval": Number of comments for the request at time of retrieval.

    • "request_text": Full text of the request.

    • "request_text_edit_aware": Edit aware version of "request_text". We use a set of rules to strip edited comments indicating the success of the request such as "EDIT: Thanks /u/foo, the pizza was delicous".

    • "request_title": Title of the request.

    • "requester_account_age_in_days_at_request": Account age of requester in days at time of request.

    • "requester_account_age_in_days_at_retrieval": Account age of requester in days at time of retrieval.

    • "requester_days_since_first_post_on_raop_at_request": Number of days between the requester's first post on RAOP and this request (zero if requester has never posted before on RAOP).

    • "requester_days_since_first_post_on_raop_at_retrieval": Number of days between the requester's first post on RAOP and time of retrieval.

    • "requester_number_of_comments_at_request": Total number of comments on Reddit by requester at time of request.

    • "requester_number_of_comments_at_retrieval": Total number of comments on Reddit by requester at time of retrieval.

    • "requester_number_of_comments_in_raop_at_request": Total number of comments in RAOP by requester at time of request.

    • "requester_number_of_comments_in_raop_at_retrieval": Total number of comments in RAOP by requester at time of retrieval.

    • "requester_number_of_posts_at_request": Total number of posts on Reddit by requester at time of request.

    • "requester_number_of_posts_at_retrieval": Total number of posts on Reddit by requester at time of retrieval.

    • "requester_number_of_posts_on_raop_at_request": Total number of posts in RAOP by requester at time of request.

    • "requester_number_of_posts_on_raop_at_retrieval": Total number of posts in RAOP by requester at time of retrieval.

    • "requester_number_of_subreddits_at_request": The number of subreddits in which the author had already posted at the time of request.

    • "requester_received_pizza": Boolean indicating the success of the request, i.e., whether the requester received pizza.

    • "requester_subreddits_at_request": The list of subreddits in which the author had already posted at the time of request.

    • "requester_upvotes_minus_downvotes_at_request": Difference of total upvotes and total downvotes of requester at time of request.

    • "requester_upvotes_minus_downvotes_at_retrieval": Difference of total upvotes and total downvotes of requester at time of retrieval.

    • "requester_upvotes_plus_downvotes_at_request": Sum of total upvotes and total downvotes of requester at time of request.

    • "requester_upvotes_plus_downvotes_at_retrieval": Sum of total upvotes and total downvotes of requester at time of retrieval.

    • "requester_user_flair": Users on RAOP receive badges (Reddit calls them flairs), which are small pictures next to their username. In our data set the user flair is either None (neither given nor received pizza, N=4282), "shroom" (received pizza, but not given, N=1306), or "PIF" (pizza given after having received, N=83).

    • "requester_username": Reddit username of requester.

    • "unix_timestamp_of_request": Unix timestamp of request (supposedly in timezone of user, but in most cases it is equal to the UTC timestamp -- which is incorrect since most RAOP ...
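
    As a small sketch of working with these fields, the code below computes the share of successful requests and the mean net votes per request. It assumes the file is a JSON array of objects, one per request; the sample entries and their values are made up, and only the field names come from the list above.

```python
import json

# Made-up entries; only the field names are taken from the dataset description.
SAMPLE = json.dumps([
    {"request_id": "t3_aaaaa", "requester_received_pizza": True,
     "number_of_upvotes_of_request_at_retrieval": 7,
     "number_of_downvotes_of_request_at_retrieval": 2},
    {"request_id": "t3_bbbbb", "requester_received_pizza": False,
     "number_of_upvotes_of_request_at_retrieval": 3,
     "number_of_downvotes_of_request_at_retrieval": 1},
    {"request_id": "t3_ccccc", "requester_received_pizza": False,
     "number_of_upvotes_of_request_at_retrieval": 1,
     "number_of_downvotes_of_request_at_retrieval": 0},
    {"request_id": "t3_ddddd", "requester_received_pizza": False,
     "number_of_upvotes_of_request_at_retrieval": 4,
     "number_of_downvotes_of_request_at_retrieval": 0},
])


def summarize(requests):
    """Return request count, success rate, and mean net votes per request."""
    total = len(requests)
    successes = sum(1 for r in requests if r["requester_received_pizza"])
    net_votes = [
        r["number_of_upvotes_of_request_at_retrieval"]
        - r["number_of_downvotes_of_request_at_retrieval"]
        for r in requests
    ]
    return {
        "requests": total,
        "success_rate": successes / total,
        "mean_net_votes": sum(net_votes) / total,
    }


summary = summarize(json.loads(SAMPLE))
print(summary)
```

    Because the vote counts are measured at retrieval rather than at request time, they would likely leak outcome information into a predictive model; here they are used only for descriptive statistics.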

  6. EA Star Wars Exclusive Rights - Comment

    • kaggle.com
    Updated Jan 26, 2021
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Kirbyj45 (2021). EA Star Wars Exclusive Rights - Comment [Dataset]. https://www.kaggle.com/datasets/kirbyj45/reddit-ea-star-wars-exclusive-rights-comment
    Dataset updated
    Jan 26, 2021
    Dataset provided by
    Kaggle
    Authors
    Kirbyj45
    Description

    Context

    I recently saw that EA lost their exclusive rights to developing Star Wars games. This will allow other studios to develop games using the Star Wars title. I thought this subject would be interesting to look into because there has been some negative press associated with some of the games that EA has created under the Star Wars franchise. In order to see the feedback from this news, I went to reddit to grab comments associated with this news.

    Content

    Column Description:

    • Author - Author of the comment
    • Upvote (no description given)
    • Subreddit_id - The ID of the subreddit that the comment belongs to
    • Score - Number of upvotes
    • Replies - This will be a forest of comments starting with the top-level comment
    • Comments - The actual comment made

    I used the PRAW API that allows you to pull information from reddit. Information on the API can be found in the following location: https://praw.readthedocs.io/en/latest/index.html#

    The thread used (in the r/gaming subreddit) was: https://www.reddit.com/r/gaming/comments/kwi9yr/ea_will_no_longer_have_exclusive_rights_of_the/
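
    Since the Replies column is described as a forest of comments, a depth-first walk is one way to turn it into flat rows. The sketch below does not call PRAW; it assumes each comment is a plain dict with hypothetical `author`, `text`, and `replies` keys, and the sample comments are made up.

```python
def flatten_forest(comments, depth=0):
    """Yield (depth, author, text) for every comment in a nested reply forest.

    Each comment is assumed to be a dict with "author", "text" and an
    optional "replies" list of child comments; this layout is hypothetical.
    """
    for comment in comments:
        yield depth, comment["author"], comment["text"]
        # Recurse into the replies, one level deeper.
        yield from flatten_forest(comment.get("replies", []), depth + 1)


# Made-up forest with two top-level comments.
forest = [
    {"author": "u1", "text": "Good news for other studios.", "replies": [
        {"author": "u2", "text": "Agreed.", "replies": []},
        {"author": "u3", "text": "We will see.", "replies": [
            {"author": "u1", "text": "Fair point."},
        ]},
    ]},
    {"author": "u4", "text": "About time.", "replies": []},
]

flat = list(flatten_forest(forest))
for depth, author, text in flat:
    print("  " * depth + f"{author}: {text}")
```

    When working with PRAW directly, calling `submission.comments.replace_more(limit=0)` and then `submission.comments.list()` should give a comparable flat list of comment objects.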

    Task

    Determine whether the users have a positive or negative reaction to the change in Star Wars rights associated with EA.


Reddit AskScience Flair Analysis Dataset

Dataset for Predicting Post Flair Categories on r/AskScience Subreddit
