11 datasets found
  1. Data from: WikiReddit: Tracing Information and Attention Flows Between Online Platforms

    • zenodo.org
    bin
    Updated May 4, 2025
    Cite
    Patrick Gildersleve; Anna Beers; Viviane Ito; Agustin Orozco; Francesca Tripodi (2025). WikiReddit: Tracing Information and Attention Flows Between Online Platforms [Dataset]. http://doi.org/10.5281/zenodo.14653265
    Explore at:
    Available download formats: bin
    Dataset updated
    May 4, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Patrick Gildersleve; Anna Beers; Viviane Ito; Agustin Orozco; Francesca Tripodi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jan 15, 2025
    Description

    Preprint

    Gildersleve, P., Beers, A., Ito, V., Orozco, A., & Tripodi, F. (2025). WikiReddit: Tracing Information and Attention Flows Between Online Platforms. arXiv [Cs.CY]. https://doi.org/10.48550/arXiv.2502.04942
    Accepted at the International AAAI Conference on Web and Social Media (ICWSM) 2025

    Abstract

    The World Wide Web is a complex interconnected digital ecosystem, where information and attention flow between platforms and communities throughout the globe. These interactions co-construct how we understand the world, reflecting and shaping public discourse. Unfortunately, researchers often struggle to understand how information circulates and evolves across the web because platform-specific data is often siloed and restricted by linguistic barriers. To address this gap, we present a comprehensive, multilingual dataset capturing all Wikipedia links shared in posts and comments on Reddit from 2020 to 2023, excluding those from private and NSFW subreddits. Each linked Wikipedia article is enriched with revision history, page view data, article ID, redirects, and Wikidata identifiers. Through a research agreement with Reddit, our dataset ensures user privacy while providing a query and ID mechanism that integrates with the Reddit and Wikipedia APIs. This enables extended analyses for researchers studying how information flows across platforms. For example, Reddit discussions use Wikipedia for deliberation and fact-checking, which subsequently influences Wikipedia content by driving traffic to articles or inspiring edits. By analyzing the relationship between information shared and discussed on these platforms, our dataset provides a foundation for examining the interplay between social media discourse and collaborative knowledge consumption and production.

    Datasheet

    Motivation

    The motivations for this dataset stem from the challenges researchers face in studying the flow of information across the web. While the World Wide Web enables global communication and collaboration, data silos, linguistic barriers, and platform-specific restrictions hinder our ability to understand how information circulates, evolves, and impacts public discourse. Wikipedia and Reddit, as major hubs of knowledge sharing and discussion, offer an invaluable lens into these processes. However, without comprehensive data capturing their interactions, researchers are unable to fully examine how platforms co-construct knowledge. This dataset bridges this gap, providing the tools needed to study the interconnectedness of social media and collaborative knowledge systems.

    Composition

    WikiReddit is a comprehensive dataset capturing all Wikipedia mentions (including links) shared in posts and comments on Reddit from 2020 to 2023, excluding those from private and NSFW (not safe for work) subreddits. The SQL database comprises 336K total posts, 10.2M comments, 1.95M unique links, and 1.26M unique articles spanning 59 languages on Reddit and 276 Wikipedia language subdomains. Each linked Wikipedia article is enriched with its revision history and page view data within a ±10-day window of its posting, as well as article ID, redirects, and Wikidata identifiers. Supplementary anonymous metadata from Reddit posts and comments further contextualizes the links, offering a robust resource for analysing cross-platform information flows, collective attention dynamics, and the role of Wikipedia in online discourse.

    Collection Process

    Data was collected from the Reddit4Researchers and Wikipedia APIs. No personally identifiable information is published in the dataset. Reddit data is linked to Wikipedia via the hyperlinks and article titles appearing in Reddit posts.

    Preprocessing/cleaning/labeling

    Extensive processing with tools such as regex was applied to the Reddit post/comment text to extract the Wikipedia URLs. Redirects for Wikipedia URLs and article titles were found through the API and mapped to the collected data. Reddit IDs are hashed with SHA-256 for post/comment/user/subreddit anonymity.
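    As a rough illustration of that pipeline (a sketch under assumptions, not the authors' published code), the snippet below extracts Wikipedia links with a regex and hashes a Reddit ID with SHA-256. The regex pattern, function names, and example ID are hypothetical.

```python
import hashlib
import re

# Assumed pattern: matches links such as https://en.wikipedia.org/wiki/Article_Title
# on any language subdomain. The dataset's actual extraction rules are more extensive.
WIKI_URL_RE = re.compile(r"https?://([a-z\-]+)\.(?:m\.)?wikipedia\.org/wiki/([^\s\)\]\|>]+)")

def extract_wikipedia_links(text: str) -> list[tuple[str, str]]:
    """Return (language_subdomain, article_slug) pairs found in post/comment text."""
    return WIKI_URL_RE.findall(text)

def anonymize_reddit_id(raw_id: str) -> str:
    """SHA-256 hash of a Reddit ID, as the datasheet describes (salting, if any, unknown)."""
    return hashlib.sha256(raw_id.encode("utf-8")).hexdigest()

comment = "Background: https://en.wikipedia.org/wiki/Attention_economy is worth a read."
print(extract_wikipedia_links(comment))   # [('en', 'Attention_economy')]
print(anonymize_reddit_id("t1_abc123")[:16], "...")
```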

    Uses

    We foresee several applications of this dataset and preview four here. First, Reddit linking data can be used to understand how attention is driven from one platform to another. Second, Reddit linking data can shed light on how Wikipedia's archive of knowledge is used in the larger social web. Third, our dataset could provide insights into how external attention is topically distributed across Wikipedia, and could extend that analysis to disparities in which types of external communities Wikipedia is used in, and how. Fourth, relatedly, a topic analysis of our dataset could reveal how Wikipedia usage on Reddit contributes to societal benefits and harms. Our dataset could help examine whether homogeneity within the Reddit and Wikipedia audiences shapes topic patterns, and assess whether these relationships mitigate or amplify problematic engagement online.

    Distribution

    The dataset is publicly shared with a Creative Commons Attribution 4.0 International license. The article describing this dataset should be cited: https://doi.org/10.48550/arXiv.2502.04942

    Maintenance

    Patrick Gildersleve will maintain this dataset, and add further years of content as and when available.


    SQL Database Schema

    Table: posts

    | Column Name | Type | Description |
    | --- | --- | --- |
    | subreddit_id | TEXT | The unique identifier for the subreddit. |
    | crosspost_parent_id | TEXT | The ID of the original Reddit post if this post is a crosspost. |
    | post_id | TEXT | Unique identifier for the Reddit post. |
    | created_at | TIMESTAMP | The timestamp when the post was created. |
    | updated_at | TIMESTAMP | The timestamp when the post was last updated. |
    | language_code | TEXT | The language code of the post. |
    | score | INTEGER | The score (upvotes minus downvotes) of the post. |
    | upvote_ratio | REAL | The ratio of upvotes to total votes. |
    | gildings | INTEGER | Number of awards (gildings) received by the post. |
    | num_comments | INTEGER | Number of comments on the post. |

    Table: comments

    | Column Name | Type | Description |
    | --- | --- | --- |
    | subreddit_id | TEXT | The unique identifier for the subreddit. |
    | post_id | TEXT | The ID of the Reddit post the comment belongs to. |
    | parent_id | TEXT | The ID of the parent comment (if a reply). |
    | comment_id | TEXT | Unique identifier for the comment. |
    | created_at | TIMESTAMP | The timestamp when the comment was created. |
    | last_modified_at | TIMESTAMP | The timestamp when the comment was last modified. |
    | score | INTEGER | The score (upvotes minus downvotes) of the comment. |
    | upvote_ratio | REAL | The ratio of upvotes to total votes for the comment. |
    | gilded | INTEGER | Number of awards (gildings) received by the comment. |

    Table: postlinks

    | Column Name | Type | Description |
    | --- | --- | --- |
    | post_id | TEXT | Unique identifier for the Reddit post. |
    | end_processed_valid | INTEGER | Whether the extracted URL from the post resolves to a valid URL. |
    | end_processed_url | TEXT | The extracted URL from the Reddit post. |
    | final_valid | INTEGER | Whether the final URL from the post resolves to a valid URL after redirections. |
    | final_status | INTEGER | HTTP status code of the final URL. |
    | final_url | TEXT | The final URL after redirections. |
    | redirected | INTEGER | Indicator of whether the posted URL was redirected (1) or not (0). |
    | in_title | INTEGER | Indicator of whether the link appears in the post title (1) or post body (0). |

    Table: commentlinks

    | Column Name | Type | Description |
    | --- | --- | --- |
    | comment_id | TEXT | Unique identifier for the Reddit comment. |
    | end_processed_valid | INTEGER | Whether the extracted URL from the comment resolves to a valid URL. |
    | end_processed_url | TEXT | The extracted URL from the comment. |
    | final_valid | INTEGER | Whether the final URL from the comment resolves to a valid URL after redirections. |
    | final_status | INTEGER | HTTP status code of the final URL. |
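    Since the release is an SQL database, here is a minimal sketch of how one might query it, assuming an SQLite file. The filename is hypothetical; table and column names follow the schema above.

```python
import sqlite3

con = sqlite3.connect("wikireddit.db")  # hypothetical filename; use the shipped .db file

# Most-linked final Wikipedia URLs among valid, resolved post links,
# with the mean score of the posts that shared them.
query = """
SELECT pl.final_url, COUNT(*) AS n_posts, AVG(p.score) AS mean_score
FROM postlinks AS pl
JOIN posts AS p ON p.post_id = pl.post_id
WHERE pl.final_valid = 1
GROUP BY pl.final_url
ORDER BY n_posts DESC
LIMIT 10
"""
for final_url, n_posts, mean_score in con.execute(query):
    print(f"{n_posts:6d}  {mean_score:8.1f}  {final_url}")
con.close()
```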

  2. Reddit: /r/technology (Submissions & Comments)

    • kaggle.com
    Updated Dec 18, 2022
    Cite
    The Devastator (2022). Reddit: /r/technology (Submissions & Comments) [Dataset]. https://www.kaggle.com/datasets/thedevastator/uncovering-technology-insights-through-reddit-di
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 18, 2022
    Dataset provided by
    Kaggle
    Authors
    The Devastator
    License

    CC0 1.0 (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Reddit: /r/technology (Submissions & Comments)

    Title, Score, ID, URL, Comment Number, and Timestamp

    By Reddit [source]

    About this dataset

    This dataset, labeled Reddit Technology Data, provides insight into the conversations and interactions around technology-related topics on Reddit, a well-known internet discussion forum. It contains the titles of discussions, the scores contributed by Reddit users, the unique IDs of individual discussions, the URLs associated with them (if any), the comment count of each thread, and timestamps of when the conversations were initiated. This makes it useful for tech-savvy readers who want to stay up to date with developments in their field and for professionals tracking industry trends; in short, it is a repository that helps people make sense of what is happening in the technology world at large.


    How to use the dataset

    The dataset includes six columns: the title, the score, the URL linking to the discussion page on Reddit, the comment count, the creation timestamp, and the body containing the actual text of the post. Analyzing each column separately reveals what kind of information users engage with across technology-related discussions, and supports hypotheses about correlations between factors such as rating and comment count: for example, which types of posts attract the most comments, and whether high ratings coincide with long comment threads. Research along these lines can surface patterns of user interest in technology topics hidden within large social platforms such as Reddit and, if monitored periodically, can yield insights useful for research and potential business opportunities.

    Research Ideas

    • Companies can use this dataset to create targeted online marketing campaigns directed towards Reddit users interested in specific areas of technology.
    • Academic researchers can use the data to track and analyze trends in conversations related to technology on Reddit over time.
    • Technology professionals can utilize the comments and discussions on this dataset as a way of gauging public opinion and consumer sentiment towards certain technological advancements or products

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: CC0 1.0 Universal (Public Domain Dedication). No copyright: you can copy, modify, distribute, and perform the work, even for commercial purposes, all without asking permission.

    Columns

    File: technology.csv

    | Column name | Description |
    | --- | --- |
    | title | The title of the discussion. (String) |
    | score | The score of the discussion as measured by Reddit contributors. (Integer) |
    | url | The website URL associated with the discussion. (String) |
    | comms_num | The number of comments associated with the discussion. (Integer) |
    | created | The date and time the discussion was created. (DateTime) |
    | body | The body content of the discussion. (String) |
    | timestamp | The timestamp of the discussion. (Integer) |
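    A minimal pandas sketch of the kind of analysis suggested above, using the columns just listed. The file path is whatever your Kaggle download produced, and date parsing may need adjusting to the actual format.

```python
import pandas as pd

df = pd.read_csv("technology.csv")
df["created"] = pd.to_datetime(df["created"], errors="coerce")

# One question raised above: do high scores go hand in hand with long comment threads?
print(df[["score", "comms_num"]].corr(method="spearman"))

# Posting volume per month, as a quick trend view.
print(df.set_index("created").resample("MS")["title"].count().tail())
```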

    Acknowledgements

    If you use this dataset in your research, please credit the original authors and Reddit.

  3. Reddit Datasets

    • promptcloud.com
    csv
    Updated Mar 28, 2025
    Cite
    PromptCloud (2025). Reddit Datasets [Dataset]. https://www.promptcloud.com/dataset/reddit/
    Explore at:
    Available download formats: csv
    Dataset updated
    Mar 28, 2025
    Dataset authored and provided by
    PromptCloud
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Extracting Insights from Online Discussions

    Reddit is one of the largest social discussion platforms, making it a valuable source for real-time opinions, trends, sentiment analysis, and user interactions across various industries. Scraping Reddit data allows businesses, researchers, and analysts to explore public discussions, track sentiment, and gain actionable insights from user-generated content. Benefits and Impact: Trend […]

  4. Reddit: global paid subscription revenues 2018-2026

    • statista.com
    • tokrwards.com
    • +4more
    Cite
    Statista Research Department, Reddit: global paid subscription revenues 2018-2026 [Dataset]. https://www.statista.com/topics/1164/social-networks/
    Explore at:
    Dataset provided by
    Statista (http://statista.com/)
    Authors
    Statista Research Department
    Description

    In 2023, it was estimated that social forum and news aggregator Reddit saw over 26.5 million U.S. dollars in revenues from global paying users with an annual subscription. A premium Reddit subscription comes with an ad-free environment, as well as the possibility to join premium subreddits such as r/lounge. In 2022, Reddit counted approximately 530 thousand paying users. By 2026, Reddit's annual subscription revenues are forecast to reach 36.5 million U.S. dollars.

  5. Fake News

    • zenodo.org
    • data.niaid.nih.gov
    bin, png
    Updated Jul 5, 2024
    Cite
    Solomiia Fedushko (2024). Fake News [Dataset]. http://doi.org/10.5281/zenodo.11370330
    Explore at:
    Available download formats: png, bin
    Dataset updated
    Jul 5, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Solomiia Fedushko
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This data collection focuses on capturing user-generated content from the popular social network Reddit in 2024. The "Fake News" dataset comprises data collected from 3,636 Reddit users and consists of .csv, .xls, and .xlsx files containing textual data associated with fake news.

    Funded by the EU NextGeneration EU through the Recovery and Resilience Plan for Slovakia under the project No. 09I03-03-V01-000153

  6. RedditMix - Stock and investment

    • kaggle.com
    Updated Dec 10, 2023
    Cite
    AnthonyTherrien (2023). RedditMix - Stock and investment [Dataset]. http://doi.org/10.34740/kaggle/ds/4138984
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 10, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    AnthonyTherrien
    License

    https://www.reddit.com/wiki/api

    Description

    Dataset Description: Aggregated Reddit Stock Market Discussions

    Description: This dataset presents an aggregated collection of discussion threads from a variety of stock market-related subreddits, compiled into a single .json file. It offers a comprehensive overview of community-driven discussions, opinions, analyses, and sentiments about various aspects of the stock market. This dataset is a valuable resource for understanding diverse perspectives on different stocks and investment strategies.

    The single .json file contains aggregated data from the following subreddits:

    | Subreddit Name | Subreddit Name | Subreddit Name |
    | --- | --- | --- |
    | r/AlibabaStock | r/IndiaInvestments | r/StockMarket |
    | r/amcstock | r/IndianStockMarket | r/StocksAndTrading |
    | r/AMD_Stock | r/investing_discussion | r/stocks |
    | r/ATERstock | r/investing | r/StockTradingIdeas |
    | r/ausstocks | r/pennystocks | r/teslainvestorsclub |
    | r/BB_Stock | r/realestateinvesting | r/trakstocks |
    | r/Bitcoin | r/RobinHoodPennyStocks | r/UKInvesting |
    | r/Canadapennystocks | r/SOSStock | r/ValueInvesting |
    | r/CanadianInvestor | r/STOCKMARKETNEWS | |

    Dataset Format:
    • The dataset is in .json format, facilitating easy parsing and analysis.
    • Each entry in the file represents a distinct post or thread, complete with details such as title, score, number of comments, body, creation date, and comments (see the parsing sketch below).
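    A minimal parsing sketch, assuming the top-level JSON structure is a list of post objects; the filename and key names ("subreddit", "num_comments", "title") are guesses to verify against the actual file.

```python
import json
from collections import Counter

with open("redditmix_stocks.json", encoding="utf-8") as f:  # hypothetical filename
    posts = json.load(f)

# Posts per subreddit, assuming each entry carries a "subreddit" key.
print(Counter(p.get("subreddit", "unknown") for p in posts).most_common(5))

# Longest discussion threads by comment count (key name assumed).
for p in sorted(posts, key=lambda p: p.get("num_comments", 0), reverse=True)[:3]:
    print(p.get("num_comments", 0), p.get("title", "")[:80])
```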

    Potential Applications:
    • Sentiment analysis across different investment communities.
    • Comparative analysis of discussions and trends across various stocks and sectors.
    • Behavioral analysis of investors in different market scenarios.

    Caveats:
    • The content is user-generated and may contain biases or subjective opinions.
    • The data reflects specific time periods and may not represent current market sentiments or trends.

  7. MASH: A Multiplatform Annotated Dataset for Societal Impact of Hurricane

    • zenodo.org
    csv
    Updated Aug 1, 2025
    + more versions
    Cite
    Anonymous (2025). MASH: A Multiplatform Annotated Dataset for Societal Impact of Hurricane [Dataset]. http://doi.org/10.5281/zenodo.15401479
    Explore at:
    csvAvailable download formats
    Dataset updated
    Aug 1, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anonymous
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
    We present a Multiplatform Annotated Dataset for Societal Impact of Hurricane (MASH) that includes 98,662 relevant social media data posts from Reddit, X, TikTok, and YouTube.
    In addition, all relevant posts are annotated on three dimensions: Humanitarian Classes, Bias Classes, and Information Integrity Classes in a multi-modal approach that considers both textual and visual content (text, images, and videos), providing a rich labeled dataset for in-depth analysis.
    To the best of our knowledge, MASH is the first large-scale, multi-platform, multimodal, and multi-dimensionally annotated hurricane dataset. We envision that MASH can contribute to the study of hurricanes' impact on society, such as disaster severity classification, event detection, public sentiment analysis, and bias identification.

    Usage Notice

    This dataset includes four annotation files:
    • reddit_anno_publish.csv
    • tiktok_anno_publish.csv
    • twitter_anno_publish.csv
    • youtube_anno_publish.csv
    Each file contains post IDs and corresponding annotations on three dimensions: Humanitarian Classes, Bias Classes, and Information Integrity Classes.
    To protect user privacy, only post IDs are released. We recommend retrieving the full post content via the official APIs of each platform, in accordance with their respective terms of service.

    Humanitarian Classes

    Each post is annotated with seven binary humanitarian classes. For each class, the label is either:
    • True – the post contains this humanitarian information
    • False – the post does not contain this information
    These seven humanitarian classes include:
    • Casualty: The post reports people or animals who are killed, injured, or missing during the hurricane.
    • Evacuation: The post describes the evacuation, relocation, rescue, or displacement of individuals or animals due to the hurricane.
    • Damage: The post reports damage to infrastructure or public utilities caused by the hurricane.
    • Advice: The post provides advice, guidance, or suggestions related to hurricanes, including how to stay safe, protect property, or prepare for the disaster.
    • Request: The post requests help, support, or resources due to the hurricane.
    • Assistance: The post describes assistance provided, including both physical aid and emotional or psychological support from individuals, communities, or organizations.
    • Recovery: The post describes efforts or activities related to the recovery and rebuilding process after the hurricane.
    Note: A single post may be labeled as True for multiple humanitarian categories.

    Bias Classes

    Each post is annotated with five binary bias classes. For each class, the label is either:
    • True – the post contains this bias information
    • False – the post does not contain this information
    These five bias classes include:
    • Linguistic Bias: The post contains biased, inappropriate, or offensive language, with a focus on word choice, tone, or expression.
    • Political Bias: The post expresses political ideology, showing favor or disapproval toward specific political actors, parties, or policies.
    • Gender Bias: The post contains biased, stereotypical, or discriminatory language or viewpoints related to gender.
    • Hate Speech: The post contains language that expresses hatred, hostility, or dehumanization toward a specific group or individual, especially those belonging to minority or marginalized communities.
    • Racial Bias: The post contains biased, discriminatory, or stereotypical statements directed toward one or more racial or ethnic groups.
    Note: A single post may be labeled as True for multiple bias categories.

    Information Integrity Classes

    Each post is also annotated with a single information integrity class, represented by an integer:
    • -1 → False information (i.e., misinformation or disinformation)
    • 0 → Unverifiable information (unclear or lacking sufficient evidence)
    • 1 → True information (verifiable and accurate)
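    A sketch of how the annotations might be decoded with pandas; the column names below are assumptions based on the class names above, so check the CSV headers for the exact spelling.

```python
import pandas as pd

df = pd.read_csv("reddit_anno_publish.csv")

HUMANITARIAN = ["Casualty", "Evacuation", "Damage", "Advice",
                "Request", "Assistance", "Recovery"]  # assumed column names
INTEGRITY = {-1: "false", 0: "unverifiable", 1: "true"}

# Posts flagged for at least one humanitarian class. pandas parses literal
# True/False cells as booleans; map from strings first if it does not.
flagged = df[df[HUMANITARIAN].any(axis=1)]
print(f"{len(flagged)} of {len(df)} posts carry humanitarian information")

# Distribution of the single integer information-integrity label (column name assumed).
print(df["information_integrity"].map(INTEGRITY).value_counts())
```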

    Key Notes

    1. Version 1 is no longer available.

  8. Customer Support on Twitter

    • kaggle.com
    zip
    Updated Nov 27, 2017
    Cite
    Thought Vector (2017). Customer Support on Twitter [Dataset]. https://www.kaggle.com/thoughtvector/customer-support-on-twitter
    Explore at:
    Available download formats: zip (149959515 bytes)
    Dataset updated
    Nov 27, 2017
    Dataset authored and provided by
    Thought Vector
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    The Customer Support on Twitter dataset is a large, modern corpus of tweets and replies to aid innovation in natural language understanding and conversational models, and for study of modern customer support practices and impact.

    [Image: Example Analysis - Inbound Volume for the Top 20 Brands (https://i.imgur.com/nTv3Iuu.png)]

    Context

    Natural language remains the densest encoding of human experience we have, and innovation in NLP has accelerated to power understanding of that data, but the datasets driving this innovation don't match the real language in use today. The Customer Support on Twitter dataset offers a large corpus of modern English (mostly) conversations between consumers and customer support agents on Twitter, and has three important advantages over other conversational text datasets:

    • Focused - Consumers contact customer support to have a specific problem solved, and the manifold of problems to be discussed is relatively small, especially compared to unconstrained conversational datasets like the Reddit Corpus.
    • Natural - Consumers in this dataset come from a much broader segment than those in the Ubuntu Dialogue Corpus and have much more natural and recent use of typed text than the Cornell Movie Dialogs Corpus.
    • Succinct - Twitter's brevity causes more natural responses from support agents (rather than scripted), and to-the-point descriptions of problems and solutions. Also, it's convenient in allowing a relatively low message size limit for recurrent nets.

    Inspiration

    The size and breadth of this dataset inspires many interesting questions:

    • Can we predict company responses? Given the bounded set of subjects handled by each company, the answer seems like yes!
    • Do requests get stale? How quickly do the best companies respond, compared to the worst?
    • Can we learn high quality dense embeddings or representations of similarity for topical clustering?
    • How does tone affect the customer support conversation? Does saying sorry help?
    • Can we help companies identify new problems, or ones most affecting their customers?

    Content

    The dataset is a CSV, where each row is a tweet. The different columns are described below. Every conversation included has at least one request from a consumer and at least one response from a company. Which user IDs are company user IDs can be calculated using the inbound field.

    tweet_id

    A unique, anonymized ID for the Tweet. Referenced by response_tweet_id and in_response_to_tweet_id.

    author_id

    A unique, anonymized user ID. @s in the dataset have been replaced with their associated anonymized user ID.

    inbound

    Whether the tweet is "inbound" to a company doing customer support on Twitter. This feature is useful when re-organizing data for training conversational models.

    created_at

    Date and time when the tweet was sent.

    text

    Tweet content. Sensitive information like phone numbers and email addresses is replaced with mask values like _email_.

    response_tweet_id

    IDs of tweets that are responses to this tweet, comma-separated.

    in_response_to_tweet_id

    ID of the tweet this tweet is in response to, if any.
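    To make the `inbound` mechanics concrete, here is a hedged pandas sketch that derives the company accounts and reconstructs request/response pairs; the filename is assumed from the Kaggle download, and the dtype handling may need adjusting.

```python
import pandas as pd

df = pd.read_csv("twcs.csv")  # assumed filename for the single CSV

# `inbound` may parse as bool or as the strings "True"/"False"; normalize it.
df["inbound"] = df["inbound"].astype(str) == "True"

# Company user IDs fall out of the inbound flag: authors of non-inbound tweets.
replies = df[~df["inbound"]].dropna(subset=["in_response_to_tweet_id"]).copy()
replies["in_response_to_tweet_id"] = replies["in_response_to_tweet_id"].astype("int64")
print(replies["author_id"].nunique(), "support accounts")

# Pair each consumer request with the company reply that points back at it.
pairs = df[df["inbound"]].merge(
    replies,
    left_on="tweet_id",
    right_on="in_response_to_tweet_id",
    suffixes=("_request", "_response"),
)
print(len(pairs), "request/response pairs")
print(pairs[["text_request", "text_response"]].head(3))
```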

    Contributing

    Know of other brands the dataset should include? Found something that needs to be fixed? Start a discussion, or email me directly at $FIRSTNAME@$LASTNAME.com!

    Acknowledgements

    A huge thank you to my friends who helped bootstrap the list of companies that do customer support on Twitter! There are many rocks that would have been left unturned were it not for your suggestions!

    Relevant Resources

  9. Overview of summary statistics used.

    • plos.figshare.com
    xls
    Updated Aug 26, 2024
    + more versions
    Cite
    David Kilroy; Graham Healy; Simon Caton (2024). Overview of summary statistics used. [Dataset]. http://doi.org/10.1371/journal.pone.0307180.t003
    Explore at:
    Available download formats: xls
    Dataset updated
    Aug 26, 2024
    Dataset provided by
    PLOS ONE
    Authors
    David Kilroy; Graham Healy; Simon Caton
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In recent years, computational approaches for extracting customer needs from user-generated content have been proposed. However, there is a lack of studies that focus on extracting unmet needs for future popular products. Therefore, this study presents a supervised keyphrase classification model which predicts needs that will become popular in real products in the marketplace. To do this, we utilize Trending Customer Needs (TCN), a monthly dataset of trending keyphrase customer needs occurring in new products during 2011-2021 across multiple categories of Consumer Packaged Goods, e.g. toothpaste, eyeliner, beer, etc. We are the first study to use this specific dataset, and we employ it by training a time series algorithm to learn the relationship between features we generate for each candidate keyphrase on Reddit and the needs appearing in the dataset 1-3 years in the future. We show that our approach outperforms a baseline in the literature and, through Multi-Task Learning, can accurately predict needs for a category it wasn't trained on, e.g. train on toothpaste, cereal, and beer products yet still predict for shampoo products. The findings from this research could provide many advantages to businesses, such as gaining early access to markets.

  10. RealWorldQuestioning

    • huggingface.co
    Cite
    Sonal Prabhune, RealWorldQuestioning [Dataset]. https://huggingface.co/datasets/SonalPrabhune/RealWorldQuestioning
    Explore at:
    Authors
    Sonal Prabhune
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    RealWorldQuestioning Benchmark

    RealWorldQuestioning is a benchmark dataset of 400+ real-world user questions collected from public discussion forums (e.g., Reddit, Quora), designed to support evaluation of gender bias and information disparity in Large Language Models (LLMs). The dataset spans four business-relevant domains: Education, Jobs, Investment, and Health. Each question is annotated with:

    • User persona (Male or Female framing)
    • Source forum
    • Domain category

    Four anonymized… See the full description on the dataset page: https://huggingface.co/datasets/SonalPrabhune/RealWorldQuestioning.
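    Since the dataset is hosted on the Hugging Face Hub, it should be loadable with the `datasets` library; the split name and field names below are assumptions to verify against the dataset card.

```python
from datasets import load_dataset

# Repository ID taken from the dataset page above; "train" split assumed.
ds = load_dataset("SonalPrabhune/RealWorldQuestioning", split="train")

print(ds)      # features and row count
print(ds[0])   # inspect the actual field names before relying on them
```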

  11. TweepFake - Twitter deep Fake text Dataset

    • kaggle.com
    Updated Apr 29, 2021
    Cite
    Maurizio Tesconi (2021). TweepFake - Twitter deep Fake text Dataset [Dataset]. https://www.kaggle.com/mtesconi/twitter-deep-fake-text/metadata
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 29, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Maurizio Tesconi
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Context

    Social media have always been the perfect vehicle to manipulate and alter public opinion through bots, i.e. agents that behave as human users by liking, re-posting and publishing multimedia content which can be real or machine-generated. In the latter case, the spread of deep-fakes - potentially deceptive images, video, audio or text autonomously generated by a deep neural network - on social media has been sowing mistrust, hate and deceit at the expense of the public.

    As far as deep-fake texts are concerned, the great improvement in their generation has come from language models (RNN, LSTM, GPT-2, GROVER, CLTR, OPTIMUS, GPT-3): several studies - such as Ippolito et al. (2020) and Adelani et al. (2019) - have shown that humans detect those deep-fake texts as machine-generated at a rate close to chance.

    Even though examples of deep-fake messages can already be found in social media (as our dataset shows), no episode of their misuse has yet been observed; however, the generative capability of language models is deeply worrying: it is therefore necessary to quickly raise shields against this threat as well. Some deep-fake text detection techniques have already been investigated - Bakhtin et al. (2019), Grover (2019), GLTR (2019), Adelani et al. (2019), Ippolito et al. (2020) - but there is still a lack of knowledge on how those state-of-the-art deepfake text detection techniques perform in a "real-social-media setting", in which the text generation method is unknown and the text content is often short (especially on Twitter).

    A dataset of deep-fake social media messages is required to start the research. Unfortunately, to the best of our knowledge, no one has ever created a properly labelled social media dataset containing only human and deep-fake messages (thus excluding cheap-fake texts that employ simple generative techniques such as gap-filling and search-and-replace methods) that can already be found on social media timelines.

    Focusing on Twitter, we have collected human and deepfake tweets to support the research on deepfake social media text detection in a "real-setting".

    Content

    To collect machine-generated tweets, the only known way is to heuristically search for Twitter accounts on the web (especially on Github and Twitter, of course), looking for keywords related either to automatic or AI text generation, deep-fake texts/tweets, or to specific technologies such as "GPT-2", "RNN", "LSTM" and so on. After gathering some accounts, two filters were applied: first, only the bot accounts referring to autonomous text generation methods either in the Twitter description, in the profile URLs or the Github description were selected; then, the subset of accounts mimicking (often fine-tuned on) human Twitter profiles were chosen.

    (ACCOUNTS) In the end, 23 bot and 17 human accounts were collected (some human accounts are associated with more than one bot account, e.g. Trump or Elon Musk's profiles). Three main text generation technologies were identified: GPT-2 (11 accounts, 3861 tweets), RNN (7 accounts, 4181 tweets), and Others (5 accounts, 4876 tweets). "Others" covers techniques (Markov Chain, RNN + Markov Chain, LSTM, CharRNN) that are either not further specified or appear in only one or two accounts. The following table lists the articles from which the Twitter accounts were taken.

    | Generative Method | Reference Articles | Class_Type |
    | --- | --- | --- |
    | GPT-2 | https://minimaxir.com/2020/01/twitter-gpt2-bot/ - https://github.com/osirisguitar/botus-twitter - Max Woolf's Colab - https://www.reddit.com/r/MachineLearning/comments/df0q8f/p_i_finetuned_a_gpt2_language_model_to_generate/?ref=bestofml - https://www.reddit.com/r/OpenAI/comments/es0stw/weird_twitter_gpt2_bot/ | GPT-2 |
    | RNN | https://qz.com/631497/mit-built-a-donald-trump-ai-twitter-bot-that-sounds-scarily-like-him/ - https://mc.ai/develop-and-publish-text-generating-twitter-bot/ | RNN |
    | Torch RNN | https://github.com/DaveSmith227/deep-elon-tweet-generator | RNN |
    | Markov Chain | https://hackernoon.com/create-a-twitter-politician-bot-with-markov-chains-node-js-and-stdlib-14df8cc1c68a | Others |
    | CharRNN | https://github.com/Jmete/deepThorin | Others |

    (TWEETS) Via the Twitter REST API, the timelines of both deep-fake accounts and their corresponding ...

