77 datasets found
  1. Reddit Datasets

    • brightdata.com
    .json, .csv, .xlsx
    Updated Jan 11, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Bright Data (2023). Reddit Datasets [Dataset]. https://brightdata.com/products/datasets/reddit
    Explore at:
    .json, .csv, .xlsxAvailable download formats
    Dataset updated
    Jan 11, 2023
    Dataset authored and provided by
    Bright Datahttps://brightdata.com/
    License

    https://brightdata.com/licensehttps://brightdata.com/license

    Area covered
    Worldwide
    Description

    Access our extensive Reddit datasets that provide detailed information on posts, communities (subreddits), and user engagement. Gain insights into post performance, user comments, community statistics, and content trends with our ethically sourced data. Free samples are available for evaluation. 3M+ records available Price starts at $250/100K records Data formats are available in JSON, NDJSON, CSV, XLSX and Parquet. 100% ethical and compliant data collection Included datapoints:

    Post ID, Title & URL Post Description & Date Username of Poster Upvotes & Comment Count Community Name, URL & Description Community Member Count Attached Photos & Videos Full Post Comments Related Posts Post Karma Post Tags And more

  2. d

    Dataplex: Reddit Data | Global Social Media Data | 2.1M+ subreddits: trends,...

    • datarade.ai
    .json, .csv
    Updated Aug 12, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Dataplex (2024). Dataplex: Reddit Data | Global Social Media Data | 2.1M+ subreddits: trends, audience insights + more | Ideal for Interest-Based Segmentation [Dataset]. https://datarade.ai/data-products/dataplex-reddit-data-global-social-media-data-1-1m-mill-dataplex
    Explore at:
    .json, .csvAvailable download formats
    Dataset updated
    Aug 12, 2024
    Dataset authored and provided by
    Dataplex
    Area covered
    Macao, Christmas Island, Chile, Jersey, Mexico, Botswana, Gambia, Martinique, Côte d'Ivoire, Holy See
    Description

    The Reddit Subreddit Dataset by Dataplex offers a comprehensive and detailed view of Reddit’s vast ecosystem, now enhanced with appended AI-generated columns that provide additional insights and categorization. This dataset includes data from over 2.1 million subreddits, making it an invaluable resource for a wide range of analytical applications, from social media analysis to market research.

    Dataset Overview:

    This dataset includes detailed information on subreddit activities, user interactions, post frequency, comment data, and more. The inclusion of AI-generated columns adds an extra layer of analysis, offering sentiment analysis, topic categorization, and predictive insights that help users better understand the dynamics of each subreddit.

    2.1 Million Subreddits with Enhanced AI Insights: The dataset covers over 2.1 million subreddits and now includes AI-enhanced columns that provide: - Sentiment Analysis: AI-driven sentiment scores for posts and comments, allowing users to gauge community mood and reactions. - Topic Categorization: Automated categorization of subreddit content into relevant topics, making it easier to filter and analyze specific types of discussions. - Predictive Insights: AI models that predict trends, content virality, and user engagement, helping users anticipate future developments within subreddits.

    Sourced Directly from Reddit:

    All social media data in this dataset is sourced directly from Reddit, ensuring accuracy and authenticity. The dataset is updated regularly, reflecting the latest trends and user interactions on the platform. This ensures that users have access to the most current and relevant data for their analyses.

    Key Features:

    • Subreddit Metrics: Detailed data on subreddit activity, including the number of posts, comments, votes, and user participation.
    • User Engagement: Insights into how users interact with content, including comment threads, upvotes/downvotes, and participation rates.
    • Trending Topics: Track emerging trends and viral content across the platform, helping you stay ahead of the curve in understanding social media dynamics.
    • AI-Enhanced Analysis: Utilize AI-generated columns for sentiment analysis, topic categorization, and predictive insights, providing a deeper understanding of the data.

    Use Cases:

    • Social Media Analysis: Researchers and analysts can use this dataset to study online behavior, track the spread of information, and understand how content resonates with different audiences.
    • Market Research: Marketers can leverage the dataset to identify target audiences, understand consumer preferences, and tailor campaigns to specific communities.
    • Content Strategy: Content creators and strategists can use insights from the dataset to craft content that aligns with trending topics and user interests, maximizing engagement.
    • Academic Research: Academics can explore the dynamics of online communities, studying everything from the spread of misinformation to the formation of online subcultures.

    Data Quality and Reliability:

    The Reddit Subreddit Dataset emphasizes data quality and reliability. Each record is carefully compiled from Reddit’s vast database, ensuring that the information is both accurate and up-to-date. The AI-generated columns further enhance the dataset's value, providing automated insights that help users quickly identify key trends and sentiments.

    Integration and Usability:

    The dataset is provided in a format that is compatible with most data analysis tools and platforms, making it easy to integrate into existing workflows. Users can quickly import, analyze, and utilize the data for various applications, from market research to academic studies.

    User-Friendly Structure and Metadata:

    The data is organized for easy navigation and analysis, with metadata files included to help users identify relevant subreddits and data points. The AI-enhanced columns are clearly labeled and structured, allowing users to efficiently incorporate these insights into their analyses.

    Ideal For:

    • Data Analysts: Conduct in-depth analyses of subreddit trends, user engagement, and content virality. The dataset’s extensive coverage and AI-enhanced insights make it an invaluable tool for data-driven research.
    • Marketers: Use the dataset to better understand your target audience, tailor campaigns to specific interests, and track the effectiveness of marketing efforts across Reddit.
    • Researchers: Explore the social dynamics of online communities, analyze the spread of ideas and information, and study the impact of digital media on public discourse, all while leveraging AI-generated insights.

    This dataset is an essential resource for anyone looking to understand the intricacies of Reddit's vast ecosystem, offering the data and AI-enhanced insights needed to drive informed decisions and strategies across various fields. Whether you’re tracking emerging trends, analyzing user behavior, or conduc...

  3. m

    Reddit r/AskScience Flair Dataset

    • data.mendeley.com
    Updated May 23, 2022
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Sumit Mishra (2022). Reddit r/AskScience Flair Dataset [Dataset]. http://doi.org/10.17632/k9r2d9z999.3
    Explore at:
    Dataset updated
    May 23, 2022
    Authors
    Sumit Mishra
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Reddit is a social news, content rating and discussion website. It's one of the most popular sites on the internet. Reddit has 52 million daily active users and approximately 430 million users who use it once a month. Reddit has different subreddits and here We'll use the r/AskScience Subreddit.

    The dataset is extracted from the subreddit /r/AskScience from Reddit. The data was collected between 01-01-2016 and 20-05-2022. It contains 612,668 Datapoints and 25 Columns. The database contains a number of information about the questions asked on the subreddit, the description of the submission, the flair of the question, NSFW or SFW status, the year of the submission, and more. The data is extracted using python and Pushshift's API. A little bit of cleaning is done using NumPy and pandas as well. (see the descriptions of individual columns below).

    The dataset contains the following columns and descriptions: author - Redditor Name author_fullname - Redditor Full name contest_mode - Contest mode [implement obscured scores and randomized sorting]. created_utc - Time the submission was created, represented in Unix Time. domain - Domain of submission. edited - If the post is edited or not. full_link - Link of the post on the subreddit. id - ID of the submission. is_self - Whether or not the submission is a self post (text-only). link_flair_css_class - CSS Class used to identify the flair. link_flair_text - Flair on the post or The link flair’s text content. locked - Whether or not the submission has been locked. num_comments - The number of comments on the submission. over_18 - Whether or not the submission has been marked as NSFW. permalink - A permalink for the submission. retrieved_on - time ingested. score - The number of upvotes for the submission. description - Description of the Submission. spoiler - Whether or not the submission has been marked as a spoiler. stickied - Whether or not the submission is stickied. thumbnail - Thumbnail of Submission. question - Question Asked in the Submission. url - The URL the submission links to, or the permalink if a self post. year - Year of the Submission. banned - Banned by the moderator or not.

    This dataset can be used for Flair Prediction, NSFW Classification, and different Text Mining/NLP tasks. Exploratory Data Analysis can also be done to get the insights and see the trend and patterns over the years.

  4. Data from: Hybrid Approaches to Detect Comments Violating Macro Norms on...

    • zenodo.org
    csv
    Updated Jan 24, 2020
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Eshwar Chandrasekharan; Mattia Samory; Eric Gilbert; Eshwar Chandrasekharan; Mattia Samory; Eric Gilbert (2020). Hybrid Approaches to Detect Comments Violating Macro Norms on Reddit [Dataset]. http://doi.org/10.5281/zenodo.3338698
    Explore at:
    csvAvailable download formats
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Eshwar Chandrasekharan; Mattia Samory; Eric Gilbert; Eshwar Chandrasekharan; Mattia Samory; Eric Gilbert
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    [Content warning: Files may contain instances of highly inflammatory and offensive content.]


    This dataset was generated as an extension of our CSCW 2018 paper:

    Eshwar Chandrasekharan, Mattia Samory, Shagun Jhaver, Hunter Charvat, Amy Bruckman, Cliff Lampe, Jacob Eisenstein, and Eric Gilbert. 2018. The Internet’s Hidden Rules: An Empirical Study of Reddit Norm Violations at Micro, Meso, and Macro Scales. Proceedings of the ACM on Human-Computer Interaction 2, CSCW (2018), 32.

    Description:

    Working with over 2M removed comments collected from 100 different communities on Reddit (subreddit names listed in data/study-subreddits.csv), we identified 8 macro norms, i.e., norms that are widely enforced on most parts of Reddit. We extracted these macro norms by employing a hybrid approach—classification, topic modeling, and open-coding—on comments identified to be norm violations within at least 85 out of the 100 study subreddits. Finally, we labelled over 40K Reddit comments removed by moderators according to the specific type of macro norm being violated, and make this dataset publicly available (also available on Github).

    For each of the labeled topics, we identified the top 5000 removed comments that were best fit by the LDA topic model. In this way, we identified over 5000 removed comments that are examples of each type of macro norm violation described in the paper. The removed comments were sorted by their topic fit, stored into respective files based on the type of norm violation they represent, and are made available on this repo.

    Here we make the following datasets publicly available:

    * 1 file containing the log of over 2M removed comments obtained from the top 100 subreddits between May 2016 to March 2017, after filtering out the following comments: 1) comments by u/AutoModerator, 2) replies to removed comments (i.e., children of the poisoned tree - refer to the paper for more information), and 3) non-readable comments (not utf-8 encoded).

    * 8 files, each containing 5000+ removed comments obtained from Reddit, are stored in: data/macro-norm-violations/ , and they are split into different files based on the macro norm they violated. Each new line in the files represent a comment that was posted on Reddit between May 2016 to March 2017, and subsequently removed by subreddit moderators for violating community norms. All comments were preprocessed using the script in code/preprocessing-reddit-comments.py , in order to do the following: 1. remove new lines, 2. convert text to lowercase, and 3. strip numbers and punctuations from comments.

    Description of 1 file containing over 2M removed comments from 100 subreddits.

    • "reddit-removal-log.csv" - all comments that were removed from the 100 study subreddits during the study period described above (post-filtering).

    Descriptions of each file containing 5059 comments (that were removed from Reddit, and preprocessed) violating macro norms present in data/macro-norm-violations/:

    • "macro-norm-violations-n10-t0-misogynistic-slurs.csv" - Comments that use misogynistic slurs.
    • "macro-norm-violations-n15-t2-hatespeech-racist-homophobic.csv" - Comments containing hate speech that is racist or homophobic.
    • "macro-norm-violations-n10-t3-opposing-political-views-trump.csv", "macro-norm-violations-n15-t10-opposing-political-views-trump.csv" - Comments with opposing political views around Trump (depends on originating sub).
    • "macro-norm-violations-n10-t4-verbal-attacks-on-Reddit.csv" - Comments containing verbal attacks on Reddit or specific subreddits.
    • "macro-norm-violations-n10-t5-porno-links.csv" - Comments with pornographic links.
    • "macro-norm-violations-n10-t8-personal-attacks.csv", "macro-norm-violations-n10-t9-personal-attacks.csv"- Comments containing personal attacks.
    • "macro-norm-violations-n15-t3-abusing-and-criticisizing-mods.csv" - Comments abusing and criticisizng moderators.
    • "macro-norm-violations-n15-t9-namecalling-claiming-other-too-sensitive.csv" - Comments with name-calling, or claiming that the other person is too sensitive.

    More details about the dataset can be found on arXiv: https://arxiv.org/abs/1904.03596

  5. a

    Subreddit comments/submissions 2005-06 to 2024-12

    • academictorrents.com
    bittorrent
    Updated Feb 16, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Watchful1 (2025). Subreddit comments/submissions 2005-06 to 2024-12 [Dataset]. https://academictorrents.com/details/1614740ac8c94505e4ecb9d88be8bed7b6afddd4
    Explore at:
    bittorrent(3275329715321)Available download formats
    Dataset updated
    Feb 16, 2025
    Dataset authored and provided by
    Watchful1
    License

    https://academictorrents.com/nolicensespecifiedhttps://academictorrents.com/nolicensespecified

    Description

    This is the top 40,000 subreddits from reddit s history in separate files. You can use your torrent client to only download the subreddit s you re interested in. These are from the pushshift dumps from 2005-06 to 2024-12 which can be found here These are zstandard compressed ndjson files. Example python scripts for parsing the data can be found here If you have questions, please reply to this reddit post or DM u/Watchful on reddit or respond to this post

  6. h

    reddit-randomness

    • huggingface.co
    Updated Sep 19, 2021
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    David Wisdom (2021). reddit-randomness [Dataset]. https://huggingface.co/datasets/davidwisdom/reddit-randomness
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Sep 19, 2021
    Authors
    David Wisdom
    Description

    Reddit Randomness Dataset

    A dataset I created because I was curious about how "random" r/random really is. This data was collected by sending GET requests to https://www.reddit.com/r/random for a few hours on September 19th, 2021. I scraped a bit of metadata about the subreddits as well. randomness_12k_clean.csv reports the random subreddits as they happened and summary.csv lists some metadata about each subreddit.

      The Data
    
    
    
    
    
    
    
      randomness_12k_clean.csv
    

    This… See the full description on the dataset page: https://huggingface.co/datasets/davidwisdom/reddit-randomness.

  7. Subreddits

    • kaggle.com
    zip
    Updated May 12, 2018
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Severan (2018). Subreddits [Dataset]. https://www.kaggle.com/rayraegah/subreddits
    Explore at:
    zip(0 bytes)Available download formats
    Dataset updated
    May 12, 2018
    Authors
    Severan
    License

    https://www.reddit.com/wiki/apihttps://www.reddit.com/wiki/api

    Description

    Context

    Analyse the popularity of public subreddits

    Content

    The CSV contains a long list of every subreddit on Reddit. There are a total of 1067472 subreddits and the columns in the dataset are:

    • base10 id,
    • base36 reddit id,
    • creation epoch,
    • subreddit name,
    • number of subscribers

    Acknowledgements

    This dataset was originally published on /r/datasets by /u/Stuck_In_the_Matrix

    Inspiration

  8. m

    Dataplex: Reddit Data | Global Social Media Data | 2.1M+ subreddits: trends,...

    • dataplex.mydatastorefront.com
    Updated Aug 19, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Dataplex (2024). Dataplex: Reddit Data | Global Social Media Data | 2.1M+ subreddits: trends, audience insights + more | Ideal for Interest-Based Segmentation [Dataset]. https://dataplex.mydatastorefront.com/products/dataplex-reddit-data-global-social-media-data-1-1m-mill-dataplex
    Explore at:
    Dataset updated
    Aug 19, 2024
    Dataset authored and provided by
    Dataplex
    Area covered
    Palau, Turks and Caicos Islands, Brazil, South Korea, Norway, Mali, The Democratic Republic of the, Vietnam, Lithuania, Tuvalu
    Description

    The Reddit data dataset offers social media data tracking 2.1+ million subreddits with daily subscriber counts since January 2023. We also leverage AI to append subreddit attributes, allowing you to categorize and find subreddits by subject.

  9. a

    Reddit subreddits metadata, rules and wikis 2025-01

    • academictorrents.com
    bittorrent
    Updated Feb 19, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    RaiderBDev (2025). Reddit subreddits metadata, rules and wikis 2025-01 [Dataset]. https://academictorrents.com/details/5d0bf258a025a5b802572ddc29cde89bf093185c
    Explore at:
    bittorrent(2648464712)Available download formats
    Dataset updated
    Feb 19, 2025
    Dataset authored and provided by
    RaiderBDev
    License

    https://academictorrents.com/nolicensespecifiedhttps://academictorrents.com/nolicensespecified

    Description
    • subreddit about pages and metadata - includes description, subscriber count, nsfw flag, icon urls, and more - 22 million subreddits - subreddit metadata only - subreddits that could not be retrieved, but at some point appeared in the pushshift or arctic shift data dumps - metadata includes number of posts+comments and the date of the first post+comment - 1.6 million subreddits - subreddit rules - posting/commenting rules of subreddits that go beyond the site wide rules - 345k subreddits - subreddit wiki pages - wiki text contents of URLs that can be found in the pushshift or arctic shift data dumps - 323k pages Data was retrieved in January and February 2025. General documentation, APIs and other reddit related data can be found at JSON schemas specifically are at
  10. Reddit Mental Health Dataset

    • zenodo.org
    csv
    Updated Oct 16, 2020
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Daniel M. Low; Daniel M. Low; Laurie Rumker; Tanya Talker; John Torous; Guillermo Cecchi; Satrajit S. Ghosh; Laurie Rumker; Tanya Talker; John Torous; Guillermo Cecchi; Satrajit S. Ghosh (2020). Reddit Mental Health Dataset [Dataset]. http://doi.org/10.17605/osf.io/7peyq
    Explore at:
    csvAvailable download formats
    Dataset updated
    Oct 16, 2020
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Daniel M. Low; Daniel M. Low; Laurie Rumker; Tanya Talker; John Torous; Guillermo Cecchi; Satrajit S. Ghosh; Laurie Rumker; Tanya Talker; John Torous; Guillermo Cecchi; Satrajit S. Ghosh
    License

    ODC Public Domain Dedication and Licence (PDDL) v1.0http://www.opendatacommons.org/licenses/pddl/1.0/
    License information was derived automatically

    Description

    This dataset contains posts from 28 subreddits (15 mental health support groups) from 2018-2020. We used this dataset to understand the impact of COVID-19 on mental health support groups from January to April, 2020 and included older timeframes to obtain baseline posts before COVID-19.

    Please cite if you use this dataset:

    Low, D. M., Rumker, L., Torous, J., Cecchi, G., Ghosh, S. S., & Talkar, T. (2020). Natural Language Processing Reveals Vulnerable Mental Health Support Groups and Heightened Health Anxiety on Reddit During COVID-19: Observational Study. Journal of medical Internet research, 22(10), e22635.

    @article{low2020natural,
     title={Natural Language Processing Reveals Vulnerable Mental Health Support Groups and Heightened Health Anxiety on Reddit During COVID-19: Observational Study},
     author={Low, Daniel M and Rumker, Laurie and Torous, John and Cecchi, Guillermo and Ghosh, Satrajit S and Talkar, Tanya},
     journal={Journal of medical Internet research},
     volume={22},
     number={10},
     pages={e22635},
     year={2020},
     publisher={JMIR Publications Inc., Toronto, Canada}
    }


    License

    This dataset is made available under the Public Domain Dedication and License v1.0 whose full text can be found at: http://www.opendatacommons.org/licenses/pddl/1.0/

    It was downloaded using pushshift API. Re-use of this data is subject to Reddit API terms.

    Reddit Mental Health Dataset

    Contains posts and text features for the following timeframes from 28 mental health and non-mental health subreddits:

    • 15 specific mental health support groups (r/EDAnonymous, r/addiction, r/alcoholism, r/adhd, r/anxiety, r/autism, r/bipolarreddit, r/bpd, r/depression, r/healthanxiety, r/lonely, r/ptsd, r/schizophrenia, r/socialanxiety, and r/suicidewatch)
    • 2 broad mental health subreddits (r/mentalhealth, r/COVID19_support)
    • 11 non-mental health subreddits (r/conspiracy, r/divorce, r/fitness, r/guns, r/jokes, r/legaladvice, r/meditation, r/parenting, r/personalfinance, r/relationships, r/teaching).

    filenames and corresponding timeframes:

    • post: Jan 1 to April 20, 2020 (called "mid-pandemic" in manuscript; r/COVID19_support appears). Unique users: 320,364.
    • pre: Dec 2018 to Dec 2019. A full year which provides more data for a baseline of Reddit posts. Unique users: 327,289.
    • 2019: Jan 1 to April 20, 2019 (r/EDAnonymous appears). A control for seasonal fluctuations to match post data. Unique users: 282,560.
    • 2018: Jan 1 to April 20, 2018. A control for seasonal fluctuations to match post data. Unique users: 177,089

    Unique users across all time windows (pre and 2019 overlap): 826,961.

    See manuscript Supplementary Materials (https://doi.org/10.31234/osf.io/xvwcy) for more information.

    Note: if subsampling (e.g., to balance subreddits), we recommend bootstrapping analyses for unbiased results.

  11. Mental Health-related subreddits data

    • zenodo.org
    zip
    Updated Nov 11, 2020
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Fabricio Murai; Fabricio Murai; Barbara Silveira Fraga; Ana Paula Couto da Silva; Ana Paula Couto da Silva; Barbara Silveira Fraga (2020). Mental Health-related subreddits data [Dataset]. http://doi.org/10.5281/zenodo.4266616
    Explore at:
    zipAvailable download formats
    Dataset updated
    Nov 11, 2020
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Fabricio Murai; Fabricio Murai; Barbara Silveira Fraga; Ana Paula Couto da Silva; Ana Paula Couto da Silva; Barbara Silveira Fraga
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We gathered all posts, comments and metadata created during 2017 from the four mental health related Reddit communities with the largest number of publications, namely Depression, SuicideWatch, Anxiety e Bipolar. Unprocessed data is publicly available at http://files.pushshift.io/reddit. After extracting the zipped file, there will be three files for each subreddit

    If you use this dataset, please cite
    Silveira, Bárbara, Fabricio Murai, and Ana Paula Couto da Silva. "Predicting User Emotional Tone in Mental Disorder Online Communities." arXiv preprint arXiv:2005.07473 (2020).

  12. Subreddit Interactions for 25,000 Users

    • kaggle.com
    Updated Feb 19, 2017
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    colemaclean (2017). Subreddit Interactions for 25,000 Users [Dataset]. https://www.kaggle.com/colemaclean/subreddit-interactions/code
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Feb 19, 2017
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    colemaclean
    Description

    Context

    The dataset is a csv file compiled using a python scrapper developed using Reddit's PRAW API. The raw data is a list of 3-tuples of [username,subreddit,utc timestamp]. Each row represents a single comment made by the user, representing about 5 days worth of Reddit data. Note that the actual comment text is not included, only the user, subreddit and comment timestamp of the users comment. The goal of the dataset is to provide a lens in discovering user patterns from reddit meta-data alone. The original use case was to compile a dataset suitable for training a neural network in developing a subreddit recommender system. That final system can be found here

    A very unpolished EDA for the dataset can be found here. Note the published dataset is only half of the one used in the EDA and recommender system, to meet kaggle's 500MB size limitation.

    Content

    user - The username of the person submitting the comment
    subreddit - The title of the subreddit the user made the comment in
    utc_stamp - the utc timestamp of when the user made the comment

    Acknowledgements

    The dataset was compiled as part of a school project. The final project report, with my collaborators, can be found here

    Inspiration

    We were able to build a pretty cool subreddit recommender with the dataset. A blog post for it can be found here, and the stand alone jupyter notebook for it here. Our final model is very undertuned, so there's definitely improvements to be made there, but I think there are many other cool data projects and visualizations that could be built from this dataset. One example would be to analyze the spread of users through the Reddit ecosystem, whether the average user clusters in close communities, or traverses wide and far to different corners. If you do end up building something on this, please share! And have fun!

    Released under Reddit's API licence

  13. a

    Subreddit comments/submissions 2005-06 to 2022-12

    • academictorrents.com
    bittorrent
    Updated Feb 28, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Watchful1 (2023). Subreddit comments/submissions 2005-06 to 2022-12 [Dataset]. https://academictorrents.com/details/c398a571976c78d346c325bd75c47b82edf6124e
    Explore at:
    bittorrent(1657656143449)Available download formats
    Dataset updated
    Feb 28, 2023
    Dataset authored and provided by
    Watchful1
    License

    https://academictorrents.com/nolicensespecifiedhttps://academictorrents.com/nolicensespecified

    Description

    This is the top 20,000 subreddits from reddit s history in separate files. You can use your torrent client to only download the subreddit s you re interested in. These are from the pushshift dumps from 2005-06 to 2022-12 which can be found here These are zstandard compressed ndjson files. Example python scripts for parsing the data can be found here

  14. m

    Dataplex: Reddit Data | Consumer Behavior Data | 2.1M+ subreddits: trends,...

    • dataplex.mydatastorefront.com
    Updated Aug 12, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Dataplex (2024). Dataplex: Reddit Data | Consumer Behavior Data | 2.1M+ subreddits: trends, audience insights + more | Ideal for Interest-Based Segmentation [Dataset]. https://dataplex.mydatastorefront.com/products/dataplex-reddit-data-consumer-behavior-data-2-1m-subred-dataplex
    Explore at:
    Dataset updated
    Aug 12, 2024
    Dataset authored and provided by
    Dataplex
    Area covered
    American Samoa, Monaco, Ghana, Denmark, Martinique, Yemen, Christmas Island, Pakistan, Timor-Leste, Serbia
    Description

    Reddit data tracks attributes of 2.1+ million subreddits and subscriber counts daily since January 2023. Additionally, we leverage AI to append subreddit attributes that allow you to categorize and find subreddits by subject. Track consumer behavior data. Perfect for ML.

  15. Gendered terms in 100 most popular subreddits

    • figshare.com
    zip
    Updated Jun 10, 2018
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Mike Thelwall; Emma Stuart (2018). Gendered terms in 100 most popular subreddits [Dataset]. http://doi.org/10.6084/m9.figshare.6470930.v1
    Explore at:
    zipAvailable download formats
    Dataset updated
    Jun 10, 2018
    Dataset provided by
    figshare
    Figsharehttp://figshare.com/
    Authors
    Mike Thelwall; Emma Stuart
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Gendered terms in 100 most popular subreddits, as judged by a chi-squared test. Spreadsheets are organised by subjectively determined subreddit theme.This is data associated with the paper: She’s Reddit: A new source of Gendered interest information? with Emma Stuart.

  16. subreddits

    • redivis.com
    Updated Mar 6, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Redivis Demo Organization (2025). subreddits [Dataset]. https://redivis.com/datasets/prpw-49sqq9ehv
    Explore at:
    Dataset updated
    Mar 6, 2025
    Dataset provided by
    Redivis Inc.
    Authors
    Redivis Demo Organization
    Description

    The table subreddits is part of the dataset Reddit, available at https://redivis.com/datasets/prpw-49sqq9ehv. It contains 2499 rows across 7 variables.

  17. h

    reddit-r-bitcoin-data-for-jun-2022

    • huggingface.co
    • opendatalab.com
    Updated Jun 15, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    SocialGrep (2022). reddit-r-bitcoin-data-for-jun-2022 [Dataset]. https://huggingface.co/datasets/SocialGrep/reddit-r-bitcoin-data-for-jun-2022
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jun 15, 2022
    Authors
    SocialGrep
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Lite version of our Reddit /r/Bitcoin dataset - CSV of all posts & comments to the /r/Bitcoin subreddit over Jun 2022.

  18. Data from: WikiReddit: Tracing Information and Attention Flows Between...

    • zenodo.org
    bin
    Updated May 4, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Patrick Gildersleve; Patrick Gildersleve; Anna Beers; Anna Beers; Viviane Ito; Viviane Ito; Agustin Orozco; Agustin Orozco; Francesca Tripodi; Francesca Tripodi (2025). WikiReddit: Tracing Information and Attention Flows Between Online Platforms [Dataset]. http://doi.org/10.5281/zenodo.14653265
    Explore at:
    binAvailable download formats
    Dataset updated
    May 4, 2025
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Patrick Gildersleve; Patrick Gildersleve; Anna Beers; Anna Beers; Viviane Ito; Viviane Ito; Agustin Orozco; Agustin Orozco; Francesca Tripodi; Francesca Tripodi
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jan 15, 2025
    Description

    Preprint

    Gildersleve, P., Beers, A., Ito, V., Orozco, A., & Tripodi, F. (2025). WikiReddit: Tracing Information and Attention Flows Between Online Platforms. arXiv [Cs.CY]. https://doi.org/10.48550/arXiv.2502.04942
    Accepted at the International AAAI Conference on Web and Social Media (ICWSM) 2025

    Abstract

    The World Wide Web is a complex interconnected digital ecosystem, where information and attention flow between platforms and communities throughout the globe. These interactions co-construct how we understand the world, reflecting and shaping public discourse. Unfortunately, researchers often struggle to understand how information circulates and evolves across the web because platform-specific data is often siloed and restricted by linguistic barriers. To address this gap, we present a comprehensive, multilingual dataset capturing all Wikipedia links shared in posts and comments on Reddit from 2020 to 2023, excluding those from private and NSFW subreddits. Each linked Wikipedia article is enriched with revision history, page view data, article ID, redirects, and Wikidata identifiers. Through a research agreement with Reddit, our dataset ensures user privacy while providing a query and ID mechanism that integrates with the Reddit and Wikipedia APIs. This enables extended analyses for researchers studying how information flows across platforms. For example, Reddit discussions use Wikipedia for deliberation and fact-checking which subsequently influences Wikipedia content, by driving traffic to articles or inspiring edits. By analyzing the relationship between information shared and discussed on these platforms, our dataset provides a foundation for examining the interplay between social media discourse and collaborative knowledge consumption and production.

    Datasheet

    Motivation

    The motivations for this dataset stem from the challenges researchers face in studying the flow of information across the web. While the World Wide Web enables global communication and collaboration, data silos, linguistic barriers, and platform-specific restrictions hinder our ability to understand how information circulates, evolves, and impacts public discourse. Wikipedia and Reddit, as major hubs of knowledge sharing and discussion, offer an invaluable lens into these processes. However, without comprehensive data capturing their interactions, researchers are unable to fully examine how platforms co-construct knowledge. This dataset bridges this gap, providing the tools needed to study the interconnectedness of social media and collaborative knowledge systems.

    Composition

    WikiReddit, a comprehensive dataset capturing all Wikipedia mentions (including links) shared in posts and comments on Reddit from 2020 to 2023, excluding those from private and NSFW (not safe for work) subreddits. The SQL database comprises 336K total posts, 10.2M comments, 1.95M unique links, and 1.26M unique articles spanning 59 languages on Reddit and 276 Wikipedia language subdomains. Each linked Wikipedia article is enriched with its revision history and page view data within a ±10-day window of its posting, as well as article ID, redirects, and Wikidata identifiers. Supplementary anonymous metadata from Reddit posts and comments further contextualizes the links, offering a robust resource for analysing cross-platform information flows, collective attention dynamics, and the role of Wikipedia in online discourse.

    Collection Process

    Data was collected from the Reddit4Researchers and Wikipedia APIs. No personally identifiable information is published in the dataset. Data from Reddit to Wikipedia is linked via the hyperlink and article titles appearing in Reddit posts.

    Preprocessing/cleaning/labeling

    Extensive processing with tools such as regex was applied to the Reddit post/comment text to extract the Wikipedia URLs. Redirects for Wikipedia URLs and article titles were found through the API and mapped to the collected data. Reddit IDs are hashed with SHA-256 for post/comment/user/subreddit anonymity.

    Uses

    We foresee several applications of this dataset and preview four here. First, Reddit linking data can be used to understand how attention is driven from one platform to another. Second, Reddit linking data can shed light on how Wikipedia's archive of knowledge is used in the larger social web. Third, our dataset could provide insights into how external attention is topically distributed across Wikipedia. Our dataset can help extend that analysis into the disparities in what types of external communities Wikipedia is used in, and how it is used. Fourth, relatedly, a topic analysis of our dataset could reveal how Wikipedia usage on Reddit contributes to societal benefits and harms. Our dataset could help examine if homogeneity within the Reddit and Wikipedia audiences shapes topic patterns and assess whether these relationships mitigate or amplify problematic engagement online.

    Distribution

    The dataset is publicly shared with a Creative Commons Attribution 4.0 International license. The article describing this dataset should be cited: https://doi.org/10.48550/arXiv.2502.04942

    Maintenance

    Patrick Gildersleve will maintain this dataset, and add further years of content as and when available.


    SQL Database Schema

    Table: posts

    Column NameTypeDescription
    subreddit_idTEXTThe unique identifier for the subreddit.
    crosspost_parent_idTEXTThe ID of the original Reddit post if this post is a crosspost.
    post_idTEXTUnique identifier for the Reddit post.
    created_atTIMESTAMPThe timestamp when the post was created.
    updated_atTIMESTAMPThe timestamp when the post was last updated.
    language_codeTEXTThe language code of the post.
    scoreINTEGERThe score (upvotes minus downvotes) of the post.
    upvote_ratioREALThe ratio of upvotes to total votes.
    gildingsINTEGERNumber of awards (gildings) received by the post.
    num_commentsINTEGERNumber of comments on the post.

    Table: comments

    Column NameTypeDescription
    subreddit_idTEXTThe unique identifier for the subreddit.
    post_idTEXTThe ID of the Reddit post the comment belongs to.
    parent_idTEXTThe ID of the parent comment (if a reply).
    comment_idTEXTUnique identifier for the comment.
    created_atTIMESTAMPThe timestamp when the comment was created.
    last_modified_atTIMESTAMPThe timestamp when the comment was last modified.
    scoreINTEGERThe score (upvotes minus downvotes) of the comment.
    upvote_ratioREALThe ratio of upvotes to total votes for the comment.
    gildedINTEGERNumber of awards (gildings) received by the comment.

    Table: postlinks

    Column NameTypeDescription
    post_idTEXTUnique identifier for the Reddit post.
    end_processed_validINTEGERWhether the extracted URL from the post resolves to a valid URL.
    end_processed_urlTEXTThe extracted URL from the Reddit post.
    final_validINTEGERWhether the final URL from the post resolves to a valid URL after redirections.
    final_statusINTEGERHTTP status code of the final URL.
    final_urlTEXTThe final URL after redirections.
    redirectedINTEGERIndicator of whether the posted URL was redirected (1) or not (0).
    in_titleINTEGERIndicator of whether the link appears in the post title (1) or post body (0).

    Table: commentlinks

    Column NameTypeDescription
    comment_idTEXTUnique identifier for the Reddit comment.
    end_processed_validINTEGERWhether the extracted URL from the comment resolves to a valid URL.
    end_processed_urlTEXTThe extracted URL from the comment.
    final_validINTEGERWhether the final URL from the comment resolves to a valid URL after redirections.
    final_statusINTEGERHTTP status code of the final

  19. Total global visitor traffic to Reddit.com 2024

    • statista.com
    Updated Aug 20, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Statista (2025). Total global visitor traffic to Reddit.com 2024 [Dataset]. https://www.statista.com/statistics/443332/reddit-monthly-visitors/
    Explore at:
    Dataset updated
    Aug 20, 2025
    Dataset authored and provided by
    Statistahttp://statista.com/
    Time period covered
    Oct 2023 - Mar 2024
    Area covered
    Worldwide
    Description

    Reddit is a web traffic powerhouse: in March 2024 approximately 2.2 billion visits were measured to the online forum, making it one of the most-visited websites online. The front page of the internet Formerly known as “the front page of the internet”, Reddit is an online forum platform with over 130,000 sub-forums and communities. The platform allows registered users, called Redditors, to post content. Each post is open to the entire Reddit community to vote upon, either by down- or upvotes. The most popular posts are featured directly on the front page. Subreddits are available by category and Redditors can follow selected subreddits relevant to their interest and also control what content they see on their custom front page. Some of the most popular subreddits are r/AskReddit or r/AMA – the “Ask Me Anything” format. According to the company, Reddit hosted 1,800 AMAs in 2018, with a wide range of topics and hosts. One of the most popular Reddit AMA of 2022 by number of upvotes was by actor Nicolas Cagem with more than 238.5 thousand upvotes. Reddit usage The United States account for the biggest share of Reddit's desktop traffic, followed by the UK, and Canada. As of March 2023, Reddit ranked among the most popular social media websites in the United States.

  20. Reddit r/AITA post and comments

    • kaggle.com
    Updated Jan 10, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Oliver Posewitz (2022). Reddit r/AITA post and comments [Dataset]. https://www.kaggle.com/oliverposewitz/reddit-raita-post-and-comments/metadata
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jan 10, 2022
    Dataset provided by
    Kaggle
    Authors
    Oliver Posewitz
    Description

    Dataset

    This dataset was created by Oliver Posewitz

    Contents

Share
FacebookFacebook
TwitterTwitter
Email
Click to copy link
Link copied
Close
Cite
Bright Data (2023). Reddit Datasets [Dataset]. https://brightdata.com/products/datasets/reddit
Organization logo

Reddit Datasets

Explore at:
.json, .csv, .xlsxAvailable download formats
Dataset updated
Jan 11, 2023
Dataset authored and provided by
Bright Datahttps://brightdata.com/
License

https://brightdata.com/licensehttps://brightdata.com/license

Area covered
Worldwide
Description

Access our extensive Reddit datasets that provide detailed information on posts, communities (subreddits), and user engagement. Gain insights into post performance, user comments, community statistics, and content trends with our ethically sourced data. Free samples are available for evaluation. 3M+ records available Price starts at $250/100K records Data formats are available in JSON, NDJSON, CSV, XLSX and Parquet. 100% ethical and compliant data collection Included datapoints:

Post ID, Title & URL Post Description & Date Username of Poster Upvotes & Comment Count Community Name, URL & Description Community Member Count Attached Photos & Videos Full Post Comments Related Posts Post Karma Post Tags And more

Search
Clear search
Close search
Google apps
Main menu