35 datasets found
  1. Reddit Dataset

    • paperswithcode.com
    Updated Jun 9, 2017
    Cite
    William L. Hamilton; Rex Ying; Jure Leskovec (2017). Reddit Dataset [Dataset]. https://paperswithcode.com/dataset/reddit
    Explore at:
    Dataset updated
    Jun 9, 2017
    Authors
    William L. Hamilton; Rex Ying; Jure Leskovec
    Description

    The Reddit dataset is a graph dataset from Reddit posts made in the month of September, 2014. The node label in this case is the community, or “subreddit”, that a post belongs to. 50 large communities have been sampled to build a post-to-post graph, connecting posts if the same user comments on both. In total this dataset contains 232,965 posts with an average degree of 492. The first 20 days are used for training and the remaining days for testing (with 30% used for validation). For features, off-the-shelf 300-dimensional GloVe CommonCrawl word vectors are used.
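
    For reference, this graph is also distributed with PyTorch Geometric as torch_geometric.datasets.Reddit. A minimal loading sketch, assuming PyTorch Geometric is installed (the root path is an arbitrary choice, and the download is large on first use):

      from torch_geometric.datasets import Reddit

      dataset = Reddit(root="data/Reddit")
      data = dataset[0]                    # one large post-to-post graph
      print(data.num_nodes)                # 232,965 posts
      print(data.x.shape)                  # per-post feature vectors
      print(int(data.train_mask.sum()),    # temporal train/val/test split
            int(data.val_mask.sum()),
            int(data.test_mask.sum()))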

  2. Reddit Datasets

    • brightdata.com
    .json, .csv, .xlsx
    Updated Jan 11, 2023
    Cite
    Bright Data (2023). Reddit Datasets [Dataset]. https://brightdata.com/products/datasets/reddit
    Explore at:
    Available download formats: .json, .csv, .xlsx
    Dataset updated
    Jan 11, 2023
    Dataset authored and provided by
    Bright Data (https://brightdata.com/)
    License

    https://brightdata.com/license

    Area covered
    Worldwide
    Description

    Access our extensive Reddit datasets that provide detailed information on posts, communities (subreddits), and user engagement. Gain insights into post performance, user comments, community statistics, and content trends with our ethically sourced data. Free samples are available for evaluation.

    • 3M+ records available
    • Price starts at $250/100K records
    • Data formats available: JSON, NDJSON, CSV, XLSX, and Parquet
    • 100% ethical and compliant data collection

    Included datapoints:

    • Post ID, Title & URL
    • Post Description & Date
    • Username of Poster
    • Upvotes & Comment Count
    • Community Name, URL & Description
    • Community Member Count
    • Attached Photos & Videos
    • Full Post Comments
    • Related Posts
    • Post Karma
    • Post Tags
    • And more

  3. Reddit user worldwide 2024, by country

    • statista.com
    Updated Jul 10, 2025
    Cite
    Statista (2025). Reddit user worldwide 2024, by country [Dataset]. https://www.statista.com/forecasts/1174696/reddit-user-by-country
    Explore at:
    Dataset updated
    Jul 10, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    Jan 1, 2024 - Dec 31, 2024
    Area covered
    Albania
    Description

    Comparing the *** selected regions regarding the number of Reddit users, the United States leads the ranking (****** million users), followed by the United Kingdom with ***** million users. At the other end of the spectrum is Gabon with **** million users, a difference of ****** million users from the United States. User figures, shown here for the platform Reddit, have been estimated by taking into account company filings or press material, secondary research, app downloads, and traffic data. They refer to the average monthly active users over the period and count multiple accounts per person only once. Reddit users encompass both users that are logged in and those that are not. The shown data are an excerpt of Statista's Key Market Indicators (KMI). The KMI are a collection of primary and secondary indicators on the macro-economic, demographic, and technological environment in up to *** countries and regions worldwide. All indicators are sourced from international and national statistical offices, trade associations, and the trade press, and they are processed to generate comparable data sets (see supplementary notes under details for more information).

  4. Dataplex: Reddit Data | Consumer Behavior Data | 2.1M+ subreddits: trends,...

    • datarade.ai
    .json, .csv
    Updated Aug 7, 2024
    + more versions
    Cite
    Dataplex (2024). Dataplex: Reddit Data | Consumer Behavior Data | 2.1M+ subreddits: trends, audience insights + more | Ideal for Interest-Based Segmentation [Dataset]. https://datarade.ai/data-products/dataplex-reddit-data-consumer-behavior-data-2-1m-subred-dataplex
    Explore at:
    Available download formats: .json, .csv
    Dataset updated
    Aug 7, 2024
    Dataset authored and provided by
    Dataplex
    Area covered
    Saint Barthélemy, Tunisia, Cuba, Togo, Cocos (Keeling) Islands, Netherlands, Lithuania, Burkina Faso, Belize, Croatia
    Description

    The Reddit Subreddit Dataset by Dataplex offers a comprehensive and detailed view of Reddit’s vast ecosystem, now enhanced with appended AI-generated columns that provide additional insights and categorization. This dataset includes data from over 2.1 million subreddits, making it an invaluable resource for a wide range of analytical applications, from social media analysis to market research.

    Dataset Overview:

    This dataset includes detailed information on subreddit activities, user interactions, post frequency, comment data, and more. The inclusion of AI-generated columns adds an extra layer of analysis, offering sentiment analysis, topic categorization, and predictive insights that help users better understand the dynamics of each subreddit.

    2.1 Million Subreddits with Enhanced AI Insights: The dataset covers over 2.1 million subreddits and now includes AI-enhanced columns that provide:

    • Sentiment Analysis: AI-driven sentiment scores for posts and comments, allowing users to gauge community mood and reactions.
    • Topic Categorization: Automated categorization of subreddit content into relevant topics, making it easier to filter and analyze specific types of discussions.
    • Predictive Insights: AI models that predict trends, content virality, and user engagement, helping users anticipate future developments within subreddits.

    Sourced Directly from Reddit:

    All data in this dataset is sourced directly from Reddit, ensuring accuracy and authenticity. The dataset is updated regularly, reflecting the latest trends and user interactions on the platform. This ensures that users have access to the most current and relevant data for their analyses.

    Key Features:

    • Subreddit Metrics: Detailed data on subreddit activity, including the number of posts, comments, votes, and user participation.
    • User Engagement: Insights into how users interact with content, including comment threads, upvotes/downvotes, and participation rates.
    • Trending Topics: Track emerging trends and viral content across the platform, helping you stay ahead of the curve in understanding social media dynamics.
    • AI-Enhanced Analysis: Utilize AI-generated columns for sentiment analysis, topic categorization, and predictive insights, providing a deeper understanding of the data.

    Use Cases:

    • Social Media Analysis: Researchers and analysts can use this dataset to study online behavior, track the spread of information, and understand how content resonates with different audiences.
    • Market Research: Marketers can leverage the dataset to identify target audiences, understand consumer preferences, and tailor campaigns to specific communities.
    • Content Strategy: Content creators and strategists can use insights from the dataset to craft content that aligns with trending topics and user interests, maximizing engagement.
    • Academic Research: Academics can explore the dynamics of online communities, studying everything from the spread of misinformation to the formation of online subcultures.

    Data Quality and Reliability:

    The Reddit Subreddit Dataset emphasizes data quality and reliability. Each record is carefully compiled from Reddit’s vast database, ensuring that the information is both accurate and up-to-date. The AI-generated columns further enhance the dataset's value, providing automated insights that help users quickly identify key trends and sentiments.

    Integration and Usability:

    The dataset is provided in a format that is compatible with most data analysis tools and platforms, making it easy to integrate into existing workflows. Users can quickly import, analyze, and utilize the data for various applications, from market research to academic studies.

    User-Friendly Structure and Metadata:

    The data is organized for easy navigation and analysis, with metadata files included to help users identify relevant subreddits and data points. The AI-enhanced columns are clearly labeled and structured, allowing users to efficiently incorporate these insights into their analyses.

    Ideal For:

    • Data Analysts: Conduct in-depth analyses of subreddit trends, user engagement, and content virality. The dataset’s extensive coverage and AI-enhanced insights make it an invaluable tool for data-driven research.
    • Marketers: Use the dataset to better understand your target audience, tailor campaigns to specific interests, and track the effectiveness of marketing efforts across Reddit.
    • Researchers: Explore consumer behavior data of online communities, analyze the spread of ideas and information, and study the impact of digital media on public discourse, all while leveraging AI-generated insights.

    This dataset is an essential resource for anyone looking to understand the intricacies of Reddit's vast ecosystem, offering the data and AI-enhanced insights needed to drive informed decisions and strategies across various fields. Whether you’re tracking emerging trends, analyzing user behavior, or conducting acade...

  5. Reddit users in France 2020-2028

    • statista.com
    Updated Jul 9, 2025
    Cite
    Statista (2025). Reddit users in France 2020-2028 [Dataset]. https://www.statista.com/forecasts/1144401/reddit-users-in-france
    Explore at:
    Dataset updated
    Jul 9, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Area covered
    France
    Description

    The number of Reddit users in France was forecast to continuously increase between 2024 and 2028 by a total of *** million users (+**** percent). After the eighth consecutive increasing year, the Reddit user base is estimated to reach ***** million users, a new peak, in 2028. Notably, the number of Reddit users was continuously increasing over the past years. User figures, shown here for the platform Reddit, have been estimated by taking into account company filings or press material, secondary research, app downloads, and traffic data. They refer to the average monthly active users over the period and count multiple accounts per person only once. Reddit users encompass both users that are logged in and those that are not. The shown data are an excerpt of Statista's Key Market Indicators (KMI). The KMI are a collection of primary and secondary indicators on the macro-economic, demographic, and technological environment in up to *** countries and regions worldwide. All indicators are sourced from international and national statistical offices, trade associations, and the trade press, and they are processed to generate comparable data sets (see supplementary notes under details for more information). Find more key insights for the number of Reddit users in countries like the Netherlands and Luxembourg.

  6. Reddit r/AskScience Flair Dataset

    • data.mendeley.com
    Updated May 23, 2022
    + more versions
    Cite
    Sumit Mishra (2022). Reddit r/AskScience Flair Dataset [Dataset]. http://doi.org/10.17632/k9r2d9z999.3
    Explore at:
    Dataset updated
    May 23, 2022
    Authors
    Sumit Mishra
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Reddit is a social news, content rating, and discussion website, and one of the most popular sites on the internet: it has 52 million daily active users and approximately 430 million users who use it at least once a month. Reddit is divided into subreddits; here we use the r/AskScience subreddit.

    The dataset is extracted from the subreddit r/AskScience on Reddit. The data was collected between 2016-01-01 and 2022-05-20 and contains 612,668 datapoints across 25 columns. It includes information about the questions asked on the subreddit, the description of the submission, the flair of the question, NSFW or SFW status, the year of the submission, and more. The data was extracted using Python and Pushshift's API, and a little cleaning was done with NumPy and pandas as well (see the descriptions of individual columns below).

    The dataset contains the following columns and descriptions:
    • author - Redditor name
    • author_fullname - Redditor full name
    • contest_mode - Contest mode (obscured scores and randomized sorting)
    • created_utc - Time the submission was created, represented in Unix time
    • domain - Domain of the submission
    • edited - Whether the post was edited
    • full_link - Link to the post on the subreddit
    • id - ID of the submission
    • is_self - Whether the submission is a self post (text-only)
    • link_flair_css_class - CSS class used to identify the flair
    • link_flair_text - The link flair's text content
    • locked - Whether the submission has been locked
    • num_comments - Number of comments on the submission
    • over_18 - Whether the submission has been marked as NSFW
    • permalink - Permalink for the submission
    • retrieved_on - Time the submission was ingested
    • score - Number of upvotes for the submission
    • description - Description of the submission
    • spoiler - Whether the submission has been marked as a spoiler
    • stickied - Whether the submission is stickied
    • thumbnail - Thumbnail of the submission
    • question - Question asked in the submission
    • url - The URL the submission links to, or the permalink if a self post
    • year - Year of the submission
    • banned - Whether banned by a moderator

    This dataset can be used for Flair Prediction, NSFW Classification, and different Text Mining/NLP tasks. Exploratory Data Analysis can also be done to get the insights and see the trend and patterns over the years.
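
    A minimal pandas sketch for the flair-prediction use case, assuming the dataset has been downloaded as a CSV (the filename here is hypothetical); column names follow the descriptions above:

      import pandas as pd

      df = pd.read_csv("askscience.csv")  # hypothetical filename

      # Keep safe-for-work submissions that carry a flair label
      # (over_18 assumed to be a boolean column).
      labeled = df[df["link_flair_text"].notna() & ~df["over_18"]]

      X = labeled["question"]             # the question text
      y = labeled["link_flair_text"]      # the flair label to predict
      print(y.value_counts().head(10))    # most common flairs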

  7. Reddit users in Brazil 2020-2028

    • statista.com
    Updated Jul 10, 2025
    Cite
    Statista (2025). Reddit users in Brazil 2020-2028 [Dataset]. https://www.statista.com/forecasts/1145141/reddit-users-in-brazil
    Explore at:
    Dataset updated
    Jul 10, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Area covered
    Brazil
    Description

    The number of Reddit users in Brazil was forecast to continuously increase between 2024 and 2028 by a total of ************ users (+***** percent). After the ****** consecutive increasing year, the Reddit user base is estimated to reach ************ users, a new peak, in 2028. Notably, the number of Reddit users was continuously increasing over the past years. User figures, shown here for the platform Reddit, have been estimated by taking into account company filings or press material, secondary research, app downloads, and traffic data. They refer to the average monthly active users over the period and count multiple accounts per person only once. Reddit users encompass both users that are logged in and those that are not. The shown data are an excerpt of Statista's Key Market Indicators (KMI). The KMI are a collection of primary and secondary indicators on the macro-economic, demographic, and technological environment in up to *** countries and regions worldwide. All indicators are sourced from international and national statistical offices, trade associations, and the trade press, and they are processed to generate comparable data sets (see supplementary notes under details for more information).

  8. 1 million Reddit comments from 40 subreddits

    • kaggle.com
    Updated Feb 3, 2020
    Cite
    Samuel Magnan (2020). 1 million Reddit comments from 40 subreddits [Dataset]. https://www.kaggle.com/smagnan/1-million-reddit-comments-from-40-subreddits/activity
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 3, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Samuel Magnan
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description


    Content

    This data is an extract from a bigger Reddit dataset (all Reddit comments from May 2019; 157 GB of data uncompressed) that contains both more comments and more associated information (timestamps, authors, flairs, etc.).

    For ease of use, I picked the first 25,000 comments for each of the 40 most frequented subreddits (May 2019), so that if anyone wants to use the subreddit as categorical data, the volumes are balanced.

    I also excluded removed comments, comments whose author was deleted, and comments deemed too short (fewer than 4 tokens), and changed the format (JSON -> CSV).

    This is primarily an NLP dataset, but in addition to the comments I added the 3 features I deemed most important, aiming for feature-type variety.

    The information kept here is:

    • subreddit (categorical): on which subreddit the comment was posted
    • body (str): comment content
    • controversiality (binary): a reddit aggregated metric
    • score (scalar): upvotes minus downvotes
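
    A minimal sketch of working with these four columns in pandas, assuming the Kaggle file has been downloaded (the filename is hypothetical):

      import pandas as pd

      df = pd.read_csv("reddit_comments.csv")  # hypothetical filename

      # Classes are balanced by construction: ~25,000 comments per subreddit.
      print(df["subreddit"].value_counts())

      # Example: mean score for controversial vs. non-controversial comments.
      print(df.groupby("controversiality")["score"].mean())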

    Acknowledgements

    The data is but a small extract of what is collected by pushshift.io on a monthly basis. You can easily find the full dataset if you want to work with more features and more data.

    What can I do with that?

    Have fun! The variety of feature types should allow you to gain a few interesting insights or build some simple models.

    Note

    If you think the license (CC0: Public Domain) should be different, please contact me.

  9. REDDIT-BINARY Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Nov 16, 2021
    Cite
    Pinar Yanardag; S. V. N. Vishwanathan (2021). REDDIT-BINARY Dataset [Dataset]. https://paperswithcode.com/dataset/reddit-binary
    Explore at:
    Dataset updated
    Nov 16, 2021
    Authors
    Pinar Yanardag; S. V. N. Vishwanathan
    Description

    REDDIT-BINARY consists of graphs corresponding to online discussions on Reddit. In each graph, nodes represent users, and there is an edge between two users if at least one of them responds to the other's comment. There are four popular subreddits: IAmA, AskReddit, TrollXChromosomes, and atheism. IAmA and AskReddit are question/answer-based subreddits, while TrollXChromosomes and atheism are discussion-based subreddits. A graph is labeled according to whether it belongs to a question/answer-based community or a discussion-based community.
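
    REDDIT-BINARY is part of the TU Dortmund graph-benchmark collection, so one way to load it is through PyTorch Geometric's TUDataset wrapper; a minimal sketch, assuming PyTorch Geometric is installed:

      from torch_geometric.datasets import TUDataset

      dataset = TUDataset(root="data/TUDataset", name="REDDIT-BINARY")
      print(len(dataset))          # 2,000 discussion graphs
      print(dataset.num_classes)   # 2: question/answer-based vs. discussion-based
      graph = dataset[0]
      print(graph.num_nodes, graph.num_edges)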

  10. Dataset — Make Reddit Great Again: Assessing Community Effects of Moderation...

    • zenodo.org
    • data.niaid.nih.gov
    bin, csv
    Updated Jan 10, 2023
    Cite
    Amaury Trujillo; Stefano Cresci (2023). Dataset — Make Reddit Great Again: Assessing Community Effects of Moderation Interventions on r/The_Donald [Dataset]. http://doi.org/10.5281/zenodo.6250577
    Explore at:
    Available download formats: bin, csv
    Dataset updated
    Jan 10, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Amaury Trujillo; Stefano Cresci
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Reddit contents and complementary data regarding the r/The_Donald community and its main moderation interventions, used for the corresponding article indicated in the title.

    An accompanying R notebook can be found in: https://github.com/amauryt/make_reddit_great_again

    If you use this dataset please cite the related article.

    The dataset timeframe of the Reddit contents (submissions and comments) spans from 30 weeks before Quarantine (2018-11-28) to 30 weeks after Restriction (2020-09-23). The original Reddit content was collected from the Pushshift monthly data files, transformed, and loaded into two SQLite databases.

    The first database, the_donald.sqlite, contains all the available content from r/The_Donald created during the dataset timeframe, with the last content being posted several weeks before the timeframe's upper limit. It has only two tables: submissions and comments. It should be noted that content IDs are stored in base 10 (numeric integers), unlike the original base 36 (alphanumeric) used on Reddit and Pushshift; this is for efficient storage and processing. If necessary, many programming languages or libraries can easily convert IDs from one base to the other (see the sketch below).
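
    A short illustration of that conversion in Python (the example ID is made up):

      DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"

      def to_base36(n: int) -> str:
          # Convert a non-negative base-10 integer back to Reddit's base-36 form.
          out = ""
          while True:
              n, r = divmod(n, 36)
              out = DIGITS[r] + out
              if n == 0:
                  return out

      numeric_id = int("e7w6ag", 36)        # base-36 -> base-10 (Python built-in)
      assert to_base36(numeric_id) == "e7w6ag"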

    The second database, core_the_donald.sqlite, contains all the available content made platform-wide (i.e., both within and outside the subreddit) during the dataset timeframe by core users of r/The_Donald. Core users are defined as those who authored at least one submission or comment per week in r/The_Donald during the 30 weeks prior to the subreddit's Quarantine. The database has four tables: submissions, comments, subreddits, and perspective_scores. The subreddits table contains the names of the subreddits to which submissions and comments were made (their IDs are also in base 10). The perspective_scores table contains comment toxicity scores.

    The Perspective API was used to score comments based on the attributes toxicity and severe_toxicity. It should be noted that not all of the comments in core_the_donald have a score because the comment body was blank or because the Perspective API returned a request error (after three tries). However, the percentage of missing scores is minuscule.

    A third file, mbfc_scores.csv, contains the bias and factual-reporting accuracy ratings collected in October 2021 from Media Bias / Fact Check (MBFC). Both attributes are scored in a Likert-like manner. One can associate submissions with MBFC scores by joining on the domain column, as sketched below.
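
    A minimal pandas sketch of that join, assuming the stated domain column; the id column name and file paths are assumptions:

      import sqlite3
      import pandas as pd

      con = sqlite3.connect("the_donald.sqlite")
      subs = pd.read_sql_query("SELECT id, domain FROM submissions", con)  # column names assumed
      mbfc = pd.read_csv("mbfc_scores.csv")

      # Left join keeps submissions whose domain has no MBFC entry.
      scored = subs.merge(mbfc, on="domain", how="left")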

  11. MBTI type and digital footprints for reddit users

    • kaggle.com
    Updated Nov 1, 2020
    Cite
    Michael Kitchener (2020). MBTI type and digital footprints for reddit users [Dataset]. https://www.kaggle.com/michaelkitchener/mbti-type-and-digital-footprints-for-reddit-users
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 1, 2020
    Dataset provided by
    Kaggle
    Authors
    Michael Kitchener
    License

    https://www.reddit.com/wiki/api

    Description

    MBTI type and digital footprint for reddit users

    Each row contains an anonymized Reddit user's MBTI personality type, and each column represents how much that user posts or comments in a particular subreddit. Specifically, 'posts_examplesubreddit' is how many of the user's top 100 posts of all time are in r/examplesubreddit, and 'comments_examplesubreddit' is how many of the user's most recent 100 comments are in r/examplesubreddit.

    This data was obtained using PRAW (the Python Reddit API Wrapper) to scrape a list of Reddit users who comment on the r/mbti subreddit, along with their self-identified MBTI type (as indicated by their flair). Then, for each user whose MBTI type is known, their top 100 posts and newest 100 comments are traversed to record the frequency of their interactions in various subreddits, creating a user-footprint matrix (see the sketch below).
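
    A hedged sketch of that collection pipeline with PRAW (credentials are placeholders; listing limits follow the description above):

      from collections import Counter
      import praw

      reddit = praw.Reddit(client_id="...", client_secret="...",
                           user_agent="mbti-footprints (sketch)")

      # 1) Read self-identified MBTI types from r/mbti commenters' flairs.
      mbti_types = {}
      for comment in reddit.subreddit("mbti").comments(limit=1000):
          if comment.author and comment.author_flair_text:
              mbti_types[comment.author.name] = comment.author_flair_text

      # 2) Build each user's footprint from top 100 posts and 100 newest comments.
      footprints = {}
      for name in mbti_types:
          redditor = reddit.redditor(name)
          counts = Counter()
          for post in redditor.submissions.top(limit=100):
              counts[f"posts_{post.subreddit.display_name}"] += 1
          for com in redditor.comments.new(limit=100):
              counts[f"comments_{com.subreddit.display_name}"] += 1
          footprints[name] = counts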

    The purpose of this data set is to see how well MBTI personality types (or even just specific traits i.e. extraversion vs. introversion) can be predicted on the basis of a user's subreddit interactions.

    You will almost certainly need to perform some kind of dimensionality reduction in order to develop an effective classification model.
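
    For instance, a minimal sketch using scikit-learn's TruncatedSVD on the footprint matrix (the CSV filename and the label column name are hypothetical):

      import pandas as pd
      from sklearn.decomposition import TruncatedSVD

      df = pd.read_csv("mbti_footprints.csv")   # hypothetical filename
      X = df.drop(columns=["type"])             # subreddit-count columns; label column assumed
      svd = TruncatedSVD(n_components=100, random_state=0)
      X_reduced = svd.fit_transform(X)          # shape: (n_users, 100)
      print(svd.explained_variance_ratio_.sum())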

    Caveats:

    The MBTI personality test is controversial, and some consider it illegitimate. However, both extraversion/introversion and sensing/intuition correlate strongly with extraversion and openness as measured by the much more widely accepted Big Five model of personality. As such, it might be best to focus efforts on classifying these traits based on the data provided.

  12. Data from: WikiReddit: Tracing Information and Attention Flows Between...

    • zenodo.org
    bin
    Updated May 4, 2025
    Cite
    Patrick Gildersleve; Anna Beers; Viviane Ito; Agustin Orozco; Francesca Tripodi (2025). WikiReddit: Tracing Information and Attention Flows Between Online Platforms [Dataset]. http://doi.org/10.5281/zenodo.14653265
    Explore at:
    Available download formats: bin
    Dataset updated
    May 4, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Patrick Gildersleve; Anna Beers; Viviane Ito; Agustin Orozco; Francesca Tripodi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jan 15, 2025
    Description

    Preprint

    Gildersleve, P., Beers, A., Ito, V., Orozco, A., & Tripodi, F. (2025). WikiReddit: Tracing Information and Attention Flows Between Online Platforms. arXiv [Cs.CY]. https://doi.org/10.48550/arXiv.2502.04942
    Accepted at the International AAAI Conference on Web and Social Media (ICWSM) 2025

    Abstract

    The World Wide Web is a complex interconnected digital ecosystem, where information and attention flow between platforms and communities throughout the globe. These interactions co-construct how we understand the world, reflecting and shaping public discourse. Unfortunately, researchers often struggle to understand how information circulates and evolves across the web because platform-specific data is often siloed and restricted by linguistic barriers. To address this gap, we present a comprehensive, multilingual dataset capturing all Wikipedia links shared in posts and comments on Reddit from 2020 to 2023, excluding those from private and NSFW subreddits. Each linked Wikipedia article is enriched with revision history, page view data, article ID, redirects, and Wikidata identifiers. Through a research agreement with Reddit, our dataset ensures user privacy while providing a query and ID mechanism that integrates with the Reddit and Wikipedia APIs. This enables extended analyses for researchers studying how information flows across platforms. For example, Reddit discussions use Wikipedia for deliberation and fact-checking which subsequently influences Wikipedia content, by driving traffic to articles or inspiring edits. By analyzing the relationship between information shared and discussed on these platforms, our dataset provides a foundation for examining the interplay between social media discourse and collaborative knowledge consumption and production.

    Datasheet

    Motivation

    The motivations for this dataset stem from the challenges researchers face in studying the flow of information across the web. While the World Wide Web enables global communication and collaboration, data silos, linguistic barriers, and platform-specific restrictions hinder our ability to understand how information circulates, evolves, and impacts public discourse. Wikipedia and Reddit, as major hubs of knowledge sharing and discussion, offer an invaluable lens into these processes. However, without comprehensive data capturing their interactions, researchers are unable to fully examine how platforms co-construct knowledge. This dataset bridges this gap, providing the tools needed to study the interconnectedness of social media and collaborative knowledge systems.

    Composition

    WikiReddit is a comprehensive dataset capturing all Wikipedia mentions (including links) shared in posts and comments on Reddit from 2020 to 2023, excluding those from private and NSFW (not safe for work) subreddits. The SQL database comprises 336K total posts, 10.2M comments, 1.95M unique links, and 1.26M unique articles spanning 59 languages on Reddit and 276 Wikipedia language subdomains. Each linked Wikipedia article is enriched with its revision history and page view data within a ±10-day window of its posting, as well as article ID, redirects, and Wikidata identifiers. Supplementary anonymous metadata from Reddit posts and comments further contextualizes the links, offering a robust resource for analysing cross-platform information flows, collective attention dynamics, and the role of Wikipedia in online discourse.

    Collection Process

    Data was collected from the Reddit4Researchers and Wikipedia APIs. No personally identifiable information is published in the dataset. Data from Reddit to Wikipedia is linked via the hyperlink and article titles appearing in Reddit posts.

    Preprocessing/cleaning/labeling

    Extensive processing with tools such as regex was applied to the Reddit post/comment text to extract the Wikipedia URLs. Redirects for Wikipedia URLs and article titles were found through the API and mapped to the collected data. Reddit IDs are hashed with SHA-256 for post/comment/user/subreddit anonymity.
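
    A small illustration of the SHA-256 hashing step described above (the example ID is made up, and the exact input format, e.g. any salting, is not specified here):

      import hashlib

      anonymized = hashlib.sha256("t3_exampleid".encode("utf-8")).hexdigest()
      print(anonymized)  # 64-hex-character digest standing in for the raw ID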

    Uses

    We foresee several applications of this dataset and preview four here. First, Reddit linking data can be used to understand how attention is driven from one platform to another. Second, Reddit linking data can shed light on how Wikipedia's archive of knowledge is used in the larger social web. Third, our dataset could provide insights into how external attention is topically distributed across Wikipedia. Our dataset can help extend that analysis into the disparities in what types of external communities Wikipedia is used in, and how it is used. Fourth, relatedly, a topic analysis of our dataset could reveal how Wikipedia usage on Reddit contributes to societal benefits and harms. Our dataset could help examine if homogeneity within the Reddit and Wikipedia audiences shapes topic patterns and assess whether these relationships mitigate or amplify problematic engagement online.

    Distribution

    The dataset is publicly shared with a Creative Commons Attribution 4.0 International license. The article describing this dataset should be cited: https://doi.org/10.48550/arXiv.2502.04942

    Maintenance

    Patrick Gildersleve will maintain this dataset, and add further years of content as and when available.


    SQL Database Schema

    Table: posts

    subreddit_id (TEXT): The unique identifier for the subreddit.
    crosspost_parent_id (TEXT): The ID of the original Reddit post if this post is a crosspost.
    post_id (TEXT): Unique identifier for the Reddit post.
    created_at (TIMESTAMP): The timestamp when the post was created.
    updated_at (TIMESTAMP): The timestamp when the post was last updated.
    language_code (TEXT): The language code of the post.
    score (INTEGER): The score (upvotes minus downvotes) of the post.
    upvote_ratio (REAL): The ratio of upvotes to total votes.
    gildings (INTEGER): Number of awards (gildings) received by the post.
    num_comments (INTEGER): Number of comments on the post.

    Table: comments

    subreddit_id (TEXT): The unique identifier for the subreddit.
    post_id (TEXT): The ID of the Reddit post the comment belongs to.
    parent_id (TEXT): The ID of the parent comment (if a reply).
    comment_id (TEXT): Unique identifier for the comment.
    created_at (TIMESTAMP): The timestamp when the comment was created.
    last_modified_at (TIMESTAMP): The timestamp when the comment was last modified.
    score (INTEGER): The score (upvotes minus downvotes) of the comment.
    upvote_ratio (REAL): The ratio of upvotes to total votes for the comment.
    gilded (INTEGER): Number of awards (gildings) received by the comment.

    Table: postlinks

    post_id (TEXT): Unique identifier for the Reddit post.
    end_processed_valid (INTEGER): Whether the extracted URL from the post resolves to a valid URL.
    end_processed_url (TEXT): The extracted URL from the Reddit post.
    final_valid (INTEGER): Whether the final URL from the post resolves to a valid URL after redirections.
    final_status (INTEGER): HTTP status code of the final URL.
    final_url (TEXT): The final URL after redirections.
    redirected (INTEGER): Indicator of whether the posted URL was redirected (1) or not (0).
    in_title (INTEGER): Indicator of whether the link appears in the post title (1) or post body (0).

    Table: commentlinks

    comment_id (TEXT): Unique identifier for the Reddit comment.
    end_processed_valid (INTEGER): Whether the extracted URL from the comment resolves to a valid URL.
    end_processed_url (TEXT): The extracted URL from the comment.
    final_valid (INTEGER): Whether the final URL from the comment resolves to a valid URL after redirections.
    final_status (INTEGER): HTTP status code of the final URL.
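
    A minimal sketch of querying this schema with Python's sqlite3 module (the database filename is hypothetical):

      import sqlite3

      con = sqlite3.connect("wikireddit.sqlite")  # hypothetical filename
      rows = con.execute("""
          SELECT p.post_id, p.created_at, l.final_url
          FROM posts AS p
          JOIN postlinks AS l ON l.post_id = p.post_id
          WHERE l.final_valid = 1
          LIMIT 10
      """).fetchall()
      for row in rows:
          print(row)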

  13. R/The_Donald Reddit Dataset

    • figshare.com
    txt
    Updated Jun 4, 2022
    Cite
    Vivian Ferrillo (2022). R/The_Donald Reddit Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.19991777.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Jun 4, 2022
    Dataset provided by
    figshare
    Authors
    Vivian Ferrillo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains publicly available posting data on users who posted on r/The_Donald in January 2017. All data was scraped via the Pushshift.io project. The dataset contains the monthly posting data of each individual and the results of a term-frequency analysis. All sampled users were anonymized.

  14. ‘One Million Reddit Confessions’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Feb 13, 2022
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘One Million Reddit Confessions’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-one-million-reddit-confessions-8471/839caf4b/?iid=000-410&v=presentation
    Explore at:
    Dataset updated
    Feb 13, 2022
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘One Million Reddit Confessions’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/yamqwe/one-million-reddit-confessions-samplee on 13 February 2022.

    --- Dataset description provided by original source is as follows ---

    About this dataset

    NOTICE

    Due to the platform's limitations, we can only provide a sample of this dataset. Please download the full version (free, no registration) from SocialGrep.

    Context

    For one reason or another, people are compelled to be frank with strangers. Whether it's making a fast friend on a train ride, or posting an anonymous confession online, we just tend to find it easier to let our secrets out to someone we'll never know again. A brief, beautiful window of candid honesty is somewhere in there. That's what this dataset was inspired by.

    Content

    The following dataset comprises one million confession posts from September 30, 2021, and earlier, proportionally taken from the following subreddits:

    • /r/trueoffmychest
    • /r/confession
    • /r/confessions
    • /r/offmychest

    All the posts are annotated with their score.

    The dataset was procured using SocialGrep.

    To preserve users' anonymity and to prevent targeted harassment, the data does not include usernames.

    Inspiration

    In this dataset, we wanted to explore the nature of sympathy. Which confessions are met with forgiveness? Which aren't? It's our most candid corpus to date.

    This dataset was created by SocialGrep and contains around 100 samples, along with Subreddit.nsfw, Domain, technical information, and other features such as Subreddit.name, Subreddit.id, and more.

    How to use this dataset

    • Analyze Type in relation to Score
    • Study the influence of Selftext on Url
    • More datasets

    Acknowledgements

    If you use this dataset in your research, please credit SocialGrep

    Start A New Notebook!

    --- Original source retains full ownership of the source dataset ---

  15. Reddit users in Romania 2020-2028

    • statista.com
    Updated Jul 7, 2025
    Cite
    Statista (2025). Reddit users in Romania 2020-2028 [Dataset]. https://www.statista.com/forecasts/1146961/reddit-users-in-romania
    Explore at:
    Dataset updated
    Jul 7, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Area covered
    Romania
    Description

    The number of Reddit users in Romania was forecast to increase between 2024 and 2028 by a total of *** million users (+**** percent). This overall increase does not happen continuously, notably not in 2027 and 2028. The Reddit user base is estimated to amount to **** million users in 2028. Notably, the number of Reddit users was continuously increasing over the past years. User figures, shown here for the platform Reddit, have been estimated by taking into account company filings or press material, secondary research, app downloads, and traffic data. They refer to the average monthly active users over the period and count multiple accounts per person only once. Reddit users encompass both users that are logged in and those that are not. The shown data are an excerpt of Statista's Key Market Indicators (KMI). The KMI are a collection of primary and secondary indicators on the macro-economic, demographic, and technological environment in up to *** countries and regions worldwide. All indicators are sourced from international and national statistical offices, trade associations, and the trade press, and they are processed to generate comparable data sets (see supplementary notes under details for more information). Find more key insights for the number of Reddit users in countries like Bulgaria and Serbia.

  16. CMFeed: A Benchmark Dataset for Controllable Multimodal Feedback Synthesis

    • zenodo.org
    Updated May 11, 2025
    Cite
    Puneet Kumar; Sarthak Malik; Balasubramanian Raman; Xiaobai Li (2025). CMFeed: A Benchmark Dataset for Controllable Multimodal Feedback Synthesis [Dataset]. http://doi.org/10.5281/zenodo.11409612
    Explore at:
    Dataset updated
    May 11, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Puneet Kumar; Sarthak Malik; Balasubramanian Raman; Xiaobai Li
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jun 1, 2024
    Description

    Overview
    The Controllable Multimodal Feedback Synthesis (CMFeed) Dataset is designed to enable the generation of sentiment-controlled feedback from multimodal inputs, including text and images. This dataset can be used to train feedback synthesis models in both uncontrolled and sentiment-controlled manners. Serving a crucial role in advancing research, the CMFeed dataset supports the development of human-like feedback synthesis, a novel task defined by the dataset's authors. Additionally, the corresponding feedback synthesis models and benchmark results are presented in the associated code and research publication.

    Task Uniqueness: The task of controllable multimodal feedback synthesis is unique, distinct from LLMs and tasks like VisDial, and not addressed by multimodal LLMs. LLMs often exhibit errors and hallucinations owing to their auto-regressive and black-box nature, which can obscure the influence of different modalities on the generated responses [Ref1; Ref2]. Our approach includes an interpretability mechanism, detailed in the supplementary material of the corresponding research publication, demonstrating how metadata and multimodal features shape responses and learn sentiments. This controllability and interpretability aim to inspire new methodologies in related fields.

    Data Collection and Annotation
    Data was collected by crawling Facebook posts from major news outlets, adhering to ethical and legal standards. The comments were annotated using four sentiment analysis models: FLAIR, SentimentR, RoBERTa, and DistilBERT. Facebook was chosen for dataset construction because of the following factors:
    • Facebook was chosen for data collection because it uniquely provides metadata such as news article link, post shares, post reaction, comment like, comment rank, comment reaction rank, and relevance scores, not available on other platforms.
    • Facebook is the most used social media platform, with 3.07 billion monthly users, compared to 550 million on Twitter and 500 million on Reddit. [Ref]
    • Facebook is popular across all age groups (18-29, 30-49, 50-64, 65+), with at least 58% usage, compared to 6% for Twitter and 3% for Reddit. [Ref]. Trends are similar for gender, race, ethnicity, income, education, community, and political affiliation [Ref]
    • The male-to-female user ratio on Facebook is 56.3% to 43.7%; on Twitter, it's 66.72% to 23.28%; Reddit does not report this data. [Ref]

    Filtering Process: To ensure high-quality and reliable data, the dataset underwent two levels of filtering:
    a) Model Agreement Filtering: Retained only comments where at least three out of the four models agreed on the sentiment.
    b) Probability Range Safety Margin: Comments with a sentiment probability between 0.49 and 0.51, indicating low confidence in sentiment classification, were excluded.
    After filtering, 4,512 samples were marked as XX. Though these samples have been released for the reader's understanding, they were not used in training the feedback synthesis model proposed in the corresponding research paper.

    Dataset Description
    • Total Samples: 61,734
    • Total Samples Annotated: 57,222 after filtering.
    • Total Posts: 3,646
    • Average Likes per Post: 65.1
    • Average Likes per Comment: 10.5
    • Average Length of News Text: 655 words
    • Average Number of Images per Post: 3.7

    Components of the Dataset
    The dataset comprises two main components:
    CMFeed.csv File: Contains metadata, comment, and reaction details related to each post.
    Images Folder: Contains folders with images corresponding to each post.

    Data Format and Fields of the CSV File
    The dataset is structured in CMFeed.csv file along with corresponding images in related folders. This CSV file includes the following fields:
    Id: Unique identifier
    Post: The heading of the news article.
    News_text: The text of the news article.
    News_link: URL link to the original news article.
    News_Images: A path to the folder containing images related to the post.
    Post_shares: Number of times the post has been shared.
    Post_reaction: A JSON object capturing reactions (like, love, etc.) to the post and their counts.
    Comment: Text of the user comment.
    Comment_like: Number of likes on the comment.
    Comment_reaction_rank: A JSON object detailing the type and count of reactions the comment received.
    Comment_link: URL link to the original comment on Facebook.
    Comment_rank: Rank of the comment based on engagement and relevance.
    Score: Sentiment score computed based on the consensus of sentiment analysis models.
    Agreement: Indicates the consensus level among the sentiment models, ranging from -4 (all negative) to 4 (all positive); for example, three negative and one positive results in -2, and three positive and one negative results in +2 (see the small check below).
    Sentiment_class: Categorizes the sentiment of the comment into 1 (positive) or 0 (negative).
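
    A small check of the Agreement arithmetic described above (votes are 1 for positive and 0 for negative, one per sentiment model):

      def agreement(votes):
          pos = sum(votes)
          neg = len(votes) - pos
          return pos - neg              # -4 (all negative) .. +4 (all positive)

      assert agreement([1, 1, 1, 1]) == 4
      assert agreement([1, 1, 1, 0]) == 2    # three positive, one negative
      assert agreement([0, 0, 0, 1]) == -2   # three negative, one positive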

    More Considerations During Dataset Construction
    We thoroughly considered issues such as the choice of social media platform for data collection, bias and generalizability of the data, selection of news handles/websites, ethical protocols, privacy and potential misuse before beginning data collection. While achieving completely unbiased and fair data is unattainable, we endeavored to minimize biases and ensure as much generalizability as possible. Building on these considerations, we made the following decisions about data sources and handling to ensure the integrity and utility of the dataset:

    • Why not merge data from different social media platforms?
    We chose not to merge data from platforms such as Reddit and Twitter with Facebook due to the lack of comprehensive metadata, clear ethical guidelines, and control mechanisms—such as who can comment and whether users' anonymity is maintained—on these platforms other than Facebook. These factors are critical for our analysis. Our focus on Facebook alone was crucial to ensure consistency in data quality and format.

    • Choice of four news handles: We selected four news handles, BBC News, Sky News, Fox News, and NY Daily News, to ensure diversity and comprehensive regional coverage. These news outlets were chosen for their distinct regional focuses and editorial perspectives: BBC News is known for its global coverage with a centrist view, Sky News offers geographically targeted and politically varied content leaning center/right in the UK/EU/US, Fox News is recognized for its right-leaning content in the US, and NY Daily News provides left-leaning coverage in New York. Many other news handles, such as NDTV, The Hindu, Xinhua, and SCMP, are also large-scale but may contain information in regional languages (e.g., Indian languages or Chinese), hence they were not selected. This selection ensures a broad spectrum of political discourse and audience engagement.

    • Dataset Generalizability and Bias: With 3.07 billion of the total 5 billion social media users, the extensive user base of Facebook, reflective of broader social media engagement patterns, ensures that the insights gained are applicable across various platforms, reducing bias and strengthening the generalizability of our findings. Additionally, the geographic and political diversity of these news sources, ranging from local (NY Daily News) to international (BBC News), and spanning political spectra from left (NY Daily News) to right (Fox News), ensures a balanced representation of global and political viewpoints in our dataset. This approach not only mitigates regional and ideological biases but also enriches the dataset with a wide array of perspectives, further solidifying the robustness and applicability of our research.

    • Dataset size and diversity: Facebook prohibits the automatic scraping of its users' personal data. In compliance with this policy, we manually scraped publicly available data. This labor-intensive process, requiring around 800 hours of manual effort, limited our data volume but allowed for precise selection. We followed ethical protocols for scraping Facebook data, selecting 1,000 posts from each of the four news handles to enhance diversity and reduce bias. Initially, 4,000 posts were collected; after preprocessing (detailed in Section 3.1), 3,646 posts remained. We then processed all associated comments, resulting in a total of 61,734 comments. This manual method ensures adherence to Facebook's policies and the integrity of our dataset.

    Ethical considerations, data privacy and misuse prevention
    The data collection adheres to Facebook's ethical guidelines (https://developers.facebook.com/terms/).

  17. Supporting data for "A Meta-Intervention: Quantifying the Impact of Social...

    • datahub.hku.hk
    Updated May 23, 2025
    Cite
    Mingzhe Quan (2025). Supporting data for "A Meta-Intervention: Quantifying the Impact of Social Media Information on Adherence to Non-Pharmaceutical Interventions" [Dataset]. http://doi.org/10.25442/hku.29068061.v1
    Explore at:
    Dataset updated
    May 23, 2025
    Dataset provided by
    HKU Data Repository
    Authors
    Mingzhe Quan
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    This dataset supports a research project in the field of digital medicine, which aims to quantify the impact of disseminating scientific information on social media, as a form of "meta-intervention", on public adherence to Non-Pharmaceutical Interventions (NPIs) during health crises such as the COVID-19 pandemic. The research encompasses multiple sub-studies and pilot experiments, drawing data from various global and China-specific social media platforms.

    The data included in this submission has been collected from several sources:
    • From Sina Weibo and Tencent WeChat, 189 online poll datasets were collected, involving a total of 1,391,706 participants. These participants are users of Sina Weibo or Tencent WeChat.
    • From Twitter, 187 tweets published by scientists (verified with a blue checkmark) related to COVID-19 were collected.
    • From Xiaohongshu and Bilibili, textual content from 143 user posts/videos concerning COVID-19 was gathered, along with associated user comments and specific user responses to a question.

    It is important to note that while the broader research project also utilized a 3TB Reddit corpus hosted on Academic Torrents (academictorrents.com), this specific Reddit dataset is publicly available directly from Academic Torrents and is not included in this particular DataHub submission. The submitted dataset comprises publicly available data, formatted as Excel files (.xlsx), and includes the following:

    • Filename: scientists' discourse (source from screenshot of tweets)
    Description: This file contains screenshots of tweets published by scientists on Twitter concerning COVID-19 research, its current status, and related topics. It also includes a coded analysis of the textual content from these tweets. Specific details regarding the coding scheme can be found in the readme.txt file.

    • Filename: The links of online polls (Weibo & WeChat)
    Description: This data file includes information from online polls conducted on Weibo and WeChat after December 7, 2022. These polls, often initiated by verified users (who may or may not be science popularizers), aimed to track the self-reported proportion of participants testing positive for COVID-19 (via PCR or rapid antigen test) or remaining negative, particularly during periods of rapid Omicron infection spread. The file contains links to the original polls, links to the social media accounts that published these polls, and relevant metadata about both the poll-creating accounts and the online polls themselves.

    • Filename: Online posts & comments (From Xiaohongshu & Bilibili)
    Description: This file contains textual content from COVID-19 related posts and videos published by users on the Xiaohongshu and Bilibili platforms. It also includes user-generated comments reacting to these posts/videos, as well as user responses to a specific question posed within the context of the original content.

    Key features of this dataset:
    • Data type: Mixed, including textual data, screenshots of social media posts, web links to original sources, and coded metadata.
    • Source platforms: Twitter (global), Weibo/WeChat (primarily China), Xiaohongshu (China), and Bilibili (video-sharing platform, primarily China).
    • Use case: This dataset is intended for the analysis of public discourse, the dissemination of scientific information, and user engagement patterns across different cultural contexts and social media platforms, particularly in relation to public health information.

  18. Reddit social dimensions

    • figshare.com
    bz2
    Updated Nov 27, 2022
    Cite
    Luca Maria Aiello; Sagar Joglekar; Daniele Quercia (2022). Reddit social dimensions [Dataset]. http://doi.org/10.6084/m9.figshare.19918231.v1
    Explore at:
    Available download formats: bz2
    Dataset updated
    Nov 27, 2022
    Dataset provided by
    figshare
    Authors
    Luca Maria Aiello; Sagar Joglekar; Daniele Quercia
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Accompanying dataset for paper "Multidimensional Tie Strength and Economic Development"

    reddit_messages_dimensions:

    • author: Reddit user who sent the message
    • time: timestamp of the message
    • dest_author: recipient of the message
    • author_state_code: estimated state of residence for the sender
    • dest_state_code: estimated state of residence for the receiver
    • {dimension}_binary_adaptive_{threshold}: binary score indicating whether the message expresses the given dimension when filtering messages according to the specified threshold

    regression_file:

    • state_code: the 2-letter code of the state
    • state: name of the state
    • lat_centroid: latitude of the state centroid
    • lon_centroid: longitude of the state centroid
    • population_2010: resident population in 2010
    • population_2015: resident population in 2015
    • population_2019: resident population in 2019
    • gdp_per_capita_2017: GDP per capita in year 2017
    • user_count: number of Reddit users in the dataset
    • {spatial|social}_diversity_{dimension}_{threshold}: measure of spatial or social diversity calculated for a given social dimension, using the specified threshold on the dimension scores
    • capital: capital of the state
    • diversity_{social|spatial}_all_minstrength_{n}: measure of spatial or social diversity calculated using all social links and a minimum threshold of n messages sent

  19. Reddit Self-reported Depression Diagnosis (RSDD)

    • ir.cs.georgetown.edu
    json
    Updated Jun 13, 2018
    Cite
    Georgetown University Information Retrieval Lab (2018). Reddit Self-reported Depression Diagnosis (RSDD) [Dataset]. https://ir.cs.georgetown.edu/awards/
    Explore at:
    Available download formats: json
    Dataset updated
    Jun 13, 2018
    Dataset authored and provided by
    Georgetown University Information Retrieval Lab
    Description

    Posts from thousands of Reddit users who claim to have been diagnosed with depression, and carefully-selected control users.

  20. Descriptives of User Perceived Disinformation on Reddit

    • figshare.com
    txt
    Updated Dec 21, 2020
    Cite
    Vlad Achimescu (2020). Descriptives of User Perceived Disinformation on Reddit [Dataset]. http://doi.org/10.6084/m9.figshare.13174145.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Dec 21, 2020
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Vlad Achimescu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Jupyter notebook showing how the results presented in the paper "Raising the Flag: Monitoring User Perceived Disinformation on Reddit" were obtained:
    • Section 4.2: POS matching on sentences
    • Section 4.4.3: Machine learning
    • Section 5: Results, general descriptives
    • Section 5.1: Trends and peaks (including topics for peaks)
    • Section 5.5: Comparison with a fact-checking website
    Also included are the datasets needed to run some of the analyses.
