11 datasets found
  1. Data from: WikiReddit: Tracing Information and Attention Flows Between Online Platforms

    • zenodo.org
    bin
    Updated May 4, 2025
    Cite
    Patrick Gildersleve; Anna Beers; Viviane Ito; Agustin Orozco; Francesca Tripodi (2025). WikiReddit: Tracing Information and Attention Flows Between Online Platforms [Dataset]. http://doi.org/10.5281/zenodo.14653265
    Explore at:
    Available download formats: bin
    Dataset updated
    May 4, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Patrick Gildersleve; Anna Beers; Viviane Ito; Agustin Orozco; Francesca Tripodi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jan 15, 2025
    Description

    Preprint

    Gildersleve, P., Beers, A., Ito, V., Orozco, A., & Tripodi, F. (2025). WikiReddit: Tracing Information and Attention Flows Between Online Platforms. arXiv [Cs.CY]. https://doi.org/10.48550/arXiv.2502.04942
    Accepted at the International AAAI Conference on Web and Social Media (ICWSM) 2025

    Abstract

    The World Wide Web is a complex interconnected digital ecosystem, where information and attention flow between platforms and communities throughout the globe. These interactions co-construct how we understand the world, reflecting and shaping public discourse. Unfortunately, researchers often struggle to understand how information circulates and evolves across the web because platform-specific data is often siloed and restricted by linguistic barriers. To address this gap, we present a comprehensive, multilingual dataset capturing all Wikipedia links shared in posts and comments on Reddit from 2020 to 2023, excluding those from private and NSFW subreddits. Each linked Wikipedia article is enriched with revision history, page view data, article ID, redirects, and Wikidata identifiers. Through a research agreement with Reddit, our dataset ensures user privacy while providing a query and ID mechanism that integrates with the Reddit and Wikipedia APIs. This enables extended analyses for researchers studying how information flows across platforms. For example, Reddit discussions use Wikipedia for deliberation and fact-checking, which subsequently influences Wikipedia content by driving traffic to articles or inspiring edits. By analyzing the relationship between information shared and discussed on these platforms, our dataset provides a foundation for examining the interplay between social media discourse and collaborative knowledge consumption and production.

    Datasheet

    Motivation

    The motivations for this dataset stem from the challenges researchers face in studying the flow of information across the web. While the World Wide Web enables global communication and collaboration, data silos, linguistic barriers, and platform-specific restrictions hinder our ability to understand how information circulates, evolves, and impacts public discourse. Wikipedia and Reddit, as major hubs of knowledge sharing and discussion, offer an invaluable lens into these processes. However, without comprehensive data capturing their interactions, researchers are unable to fully examine how platforms co-construct knowledge. This dataset bridges this gap, providing the tools needed to study the interconnectedness of social media and collaborative knowledge systems.

    Composition

    WikiReddit is a comprehensive dataset capturing all Wikipedia mentions (including links) shared in posts and comments on Reddit from 2020 to 2023, excluding those from private and NSFW (not safe for work) subreddits. The SQL database comprises 336K total posts, 10.2M comments, 1.95M unique links, and 1.26M unique articles spanning 59 languages on Reddit and 276 Wikipedia language subdomains. Each linked Wikipedia article is enriched with its revision history and page view data within a ±10-day window of its posting, as well as article ID, redirects, and Wikidata identifiers. Supplementary anonymous metadata from Reddit posts and comments further contextualizes the links, offering a robust resource for analysing cross-platform information flows, collective attention dynamics, and the role of Wikipedia in online discourse.

    Collection Process

    Data was collected from the Reddit4Researchers and Wikipedia APIs. No personally identifiable information is published in the dataset. Reddit data is linked to Wikipedia via the hyperlinks and article titles appearing in Reddit posts.

    Preprocessing/cleaning/labeling

    Extensive processing with tools such as regex was applied to the Reddit post/comment text to extract the Wikipedia URLs. Redirects for Wikipedia URLs and article titles were found through the API and mapped to the collected data. Reddit IDs are hashed with SHA-256 for post/comment/user/subreddit anonymity.
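    As a rough illustration of that pipeline (a sketch under assumptions, not the authors' published code), the snippet below extracts Wikipedia links with a regex and hashes a Reddit ID with SHA-256. The regex pattern, function names, and example ID are hypothetical.

```python
import hashlib
import re

# Assumed pattern: matches links such as https://en.wikipedia.org/wiki/Article_Title
# on any language subdomain. The dataset's actual extraction rules are more extensive.
WIKI_URL_RE = re.compile(r"https?://([a-z\-]+)\.(?:m\.)?wikipedia\.org/wiki/([^\s\)\]\|>]+)")

def extract_wikipedia_links(text: str) -> list[tuple[str, str]]:
    """Return (language_subdomain, article_slug) pairs found in post/comment text."""
    return WIKI_URL_RE.findall(text)

def anonymize_reddit_id(raw_id: str) -> str:
    """SHA-256 hash of a Reddit ID, as the datasheet describes (salting, if any, unknown)."""
    return hashlib.sha256(raw_id.encode("utf-8")).hexdigest()

comment = "Background: https://en.wikipedia.org/wiki/Attention_economy is worth a read."
print(extract_wikipedia_links(comment))   # [('en', 'Attention_economy')]
print(anonymize_reddit_id("t1_abc123")[:16], "...")
```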

    Uses

    We foresee several applications of this dataset and preview four here. First, Reddit linking data can be used to understand how attention is driven from one platform to another. Second, Reddit linking data can shed light on how Wikipedia's archive of knowledge is used in the larger social web. Third, our dataset could provide insights into how external attention is topically distributed across Wikipedia, and could extend that analysis to disparities in which types of external communities Wikipedia is used in, and how. Fourth, relatedly, a topic analysis of our dataset could reveal how Wikipedia usage on Reddit contributes to societal benefits and harms. Our dataset could help examine whether homogeneity within the Reddit and Wikipedia audiences shapes topic patterns, and assess whether these relationships mitigate or amplify problematic engagement online.

    Distribution

    The dataset is publicly shared with a Creative Commons Attribution 4.0 International license. The article describing this dataset should be cited: https://doi.org/10.48550/arXiv.2502.04942

    Maintenance

    Patrick Gildersleve will maintain this dataset, and add further years of content as and when available.


    SQL Database Schema

    Table: posts

    | Column Name | Type | Description |
    | --- | --- | --- |
    | subreddit_id | TEXT | The unique identifier for the subreddit. |
    | crosspost_parent_id | TEXT | The ID of the original Reddit post if this post is a crosspost. |
    | post_id | TEXT | Unique identifier for the Reddit post. |
    | created_at | TIMESTAMP | The timestamp when the post was created. |
    | updated_at | TIMESTAMP | The timestamp when the post was last updated. |
    | language_code | TEXT | The language code of the post. |
    | score | INTEGER | The score (upvotes minus downvotes) of the post. |
    | upvote_ratio | REAL | The ratio of upvotes to total votes. |
    | gildings | INTEGER | Number of awards (gildings) received by the post. |
    | num_comments | INTEGER | Number of comments on the post. |

    Table: comments

    | Column Name | Type | Description |
    | --- | --- | --- |
    | subreddit_id | TEXT | The unique identifier for the subreddit. |
    | post_id | TEXT | The ID of the Reddit post the comment belongs to. |
    | parent_id | TEXT | The ID of the parent comment (if a reply). |
    | comment_id | TEXT | Unique identifier for the comment. |
    | created_at | TIMESTAMP | The timestamp when the comment was created. |
    | last_modified_at | TIMESTAMP | The timestamp when the comment was last modified. |
    | score | INTEGER | The score (upvotes minus downvotes) of the comment. |
    | upvote_ratio | REAL | The ratio of upvotes to total votes for the comment. |
    | gilded | INTEGER | Number of awards (gildings) received by the comment. |

    Table: postlinks

    | Column Name | Type | Description |
    | --- | --- | --- |
    | post_id | TEXT | Unique identifier for the Reddit post. |
    | end_processed_valid | INTEGER | Whether the extracted URL from the post resolves to a valid URL. |
    | end_processed_url | TEXT | The extracted URL from the Reddit post. |
    | final_valid | INTEGER | Whether the final URL from the post resolves to a valid URL after redirections. |
    | final_status | INTEGER | HTTP status code of the final URL. |
    | final_url | TEXT | The final URL after redirections. |
    | redirected | INTEGER | Indicator of whether the posted URL was redirected (1) or not (0). |
    | in_title | INTEGER | Indicator of whether the link appears in the post title (1) or post body (0). |

    Table: commentlinks

    | Column Name | Type | Description |
    | --- | --- | --- |
    | comment_id | TEXT | Unique identifier for the Reddit comment. |
    | end_processed_valid | INTEGER | Whether the extracted URL from the comment resolves to a valid URL. |
    | end_processed_url | TEXT | The extracted URL from the comment. |
    | final_valid | INTEGER | Whether the final URL from the comment resolves to a valid URL after redirections. |
    | final_status | INTEGER | HTTP status code of the final URL. |
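    Since the release is an SQL database, here is a minimal sketch of how one might query it, assuming an SQLite file. The filename is hypothetical; table and column names follow the schema above.

```python
import sqlite3

con = sqlite3.connect("wikireddit.db")  # hypothetical filename; use the shipped .db file

# Most-linked final Wikipedia URLs among valid, resolved post links,
# with the mean score of the posts that shared them.
query = """
SELECT pl.final_url, COUNT(*) AS n_posts, AVG(p.score) AS mean_score
FROM postlinks AS pl
JOIN posts AS p ON p.post_id = pl.post_id
WHERE pl.final_valid = 1
GROUP BY pl.final_url
ORDER BY n_posts DESC
LIMIT 10
"""
for final_url, n_posts, mean_score in con.execute(query):
    print(f"{n_posts:6d}  {mean_score:8.1f}  {final_url}")
con.close()
```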

  2. Reddit: /r/technology (Submissions & Comments)

    • kaggle.com
    Updated Dec 18, 2022
    Cite
    The Devastator (2022). Reddit: /r/technology (Submissions & Comments) [Dataset]. https://www.kaggle.com/datasets/thedevastator/uncovering-technology-insights-through-reddit-di
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 18, 2022
    Dataset provided by
    Kaggle
    Authors
    The Devastator
    License

    CC0 1.0 (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Reddit: /r/technology (Submissions & Comments)

    Title, Score, ID, URL, Comment Number, and Timestamp

    By Reddit [source]

    About this dataset

    This dataset, labeled Reddit Technology Data, provides insight into the conversations and interactions around technology-related topics on Reddit, a well-known internet discussion forum. It contains the titles of discussions, the scores contributed by Reddit users, the unique IDs of individual discussions, the URLs associated with them (if any), the comment count of each thread, and timestamps of when the conversations were initiated. This makes it useful for tech-savvy readers who want to stay up to date with developments in their field and for professionals tracking industry trends; in short, it is a repository that helps people make sense of what is happening in the technology world at large.


    How to use the dataset

    The dataset includes six columns: the title, the score, the URL linking to the discussion page on Reddit, the comment count, the creation timestamp, and the body containing the actual text of the post. Analyzing each column separately reveals what kind of information users engage with across technology-related discussions, and supports hypotheses about correlations between factors such as rating and comment count: for example, which types of posts attract the most comments, and whether high ratings coincide with long comment threads. Research along these lines can surface patterns of user interest in technology topics hidden within large social platforms such as Reddit and, if monitored periodically, can yield insights useful for research and potential business opportunities.

    Research Ideas

    • Companies can use this dataset to create targeted online marketing campaigns directed towards Reddit users interested in specific areas of technology.
    • Academic researchers can use the data to track and analyze trends in conversations related to technology on Reddit over time.
    • Technology professionals can utilize the comments and discussions on this dataset as a way of gauging public opinion and consumer sentiment towards certain technological advancements or products

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: CC0 1.0 Universal (Public Domain Dedication). No copyright: you can copy, modify, distribute, and perform the work, even for commercial purposes, all without asking permission.

    Columns

    File: technology.csv

    | Column name | Description |
    | --- | --- |
    | title | The title of the discussion. (String) |
    | score | The score of the discussion as measured by Reddit contributors. (Integer) |
    | url | The website URL associated with the discussion. (String) |
    | comms_num | The number of comments associated with the discussion. (Integer) |
    | created | The date and time the discussion was created. (DateTime) |
    | body | The body content of the discussion. (String) |
    | timestamp | The timestamp of the discussion. (Integer) |
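    A minimal pandas sketch of the kind of analysis suggested above, using the columns just listed. The file path is whatever your Kaggle download produced, and date parsing may need adjusting to the actual format.

```python
import pandas as pd

df = pd.read_csv("technology.csv")
df["created"] = pd.to_datetime(df["created"], errors="coerce")

# One question raised above: do high scores go hand in hand with long comment threads?
print(df[["score", "comms_num"]].corr(method="spearman"))

# Posting volume per month, as a quick trend view.
print(df.set_index("created").resample("MS")["title"].count().tail())
```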

    Acknowledgements

    If you use this dataset in your research, please credit the original authors and Reddit.

  3. Reddit Datasets

    • promptcloud.com
    csv
    Updated Mar 28, 2025
    Cite
    PromptCloud (2025). Reddit Datasets [Dataset]. https://www.promptcloud.com/dataset/reddit/
    Explore at:
    Available download formats: csv
    Dataset updated
    Mar 28, 2025
    Dataset authored and provided by
    PromptCloud
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Extracting Insights from Online Discussions

    Reddit is one of the largest social discussion platforms, making it a valuable source for real-time opinions, trends, sentiment analysis, and user interactions across various industries. Scraping Reddit data allows businesses, researchers, and analysts to explore public discussions, track sentiment, and gain actionable insights from user-generated content. Benefits and Impact: Trend […]

  4. Reddit: global paid subscription revenues 2018-2026

    • statista.com
    • tokrwards.com
    • +4more
    Cite
    Statista Research Department, Reddit: global paid subscription revenues 2018-2026 [Dataset]. https://www.statista.com/topics/1164/social-networks/
    Explore at:
    Dataset provided by
    Statista (http://statista.com/)
    Authors
    Statista Research Department
    Description

    In 2023, it was estimated that social forum and news aggregator Reddit saw over 26.5 million U.S. dollars in revenues from global paying users with an annual subscription. A premium Reddit subscription comes with an ad-free environment, as well as the possibility to join premium subreddits such as r/lounge. In 2022, Reddit counted approximately 530 thousand paying users. By 2026, Reddit's annual subscription revenues are forecast to reach 36.5 million U.S. dollars.

  5. Fake News

    • zenodo.org
    • data.niaid.nih.gov
    bin, png
    Updated Jul 5, 2024
    Cite
    Solomiia Fedushko (2024). Fake News [Dataset]. http://doi.org/10.5281/zenodo.11370330
    Explore at:
    Available download formats: png, bin
    Dataset updated
    Jul 5, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Solomiia Fedushko
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This data collection focuses on capturing user-generated content from the popular social network Reddit in 2024. The "Fake News" dataset comprises data collected from 3,636 Reddit users and consists of .csv, .xls, and .xlsx files containing textual data associated with fake news.

    Funded by the EU NextGeneration EU through the Recovery and Resilience Plan for Slovakia under the project No. 09I03-03-V01-000153

  6. RedditMix - Stock and investment

    • kaggle.com
    Updated Dec 10, 2023
    Cite
    AnthonyTherrien (2023). RedditMix - Stock and investment [Dataset]. http://doi.org/10.34740/kaggle/ds/4138984
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 10, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    AnthonyTherrien
    License

    https://www.reddit.com/wiki/api

    Description

    Dataset Description: Aggregated Reddit Stock Market Discussions

    Description: This dataset presents an aggregated collection of discussion threads from a variety of stock market-related subreddits, compiled into a single .json file. It offers a comprehensive overview of community-driven discussions, opinions, analyses, and sentiments about various aspects of the stock market. This dataset is a valuable resource for understanding diverse perspectives on different stocks and investment strategies.

    The single .json file contains aggregated data from the following subreddits:

    | Subreddit Name | Subreddit Name | Subreddit Name |
    | --- | --- | --- |
    | r/AlibabaStock | r/IndiaInvestments | r/StockMarket |
    | r/amcstock | r/IndianStockMarket | r/StocksAndTrading |
    | r/AMD_Stock | r/investing_discussion | r/stocks |
    | r/ATERstock | r/investing | r/StockTradingIdeas |
    | r/ausstocks | r/pennystocks | r/teslainvestorsclub |
    | r/BB_Stock | r/realestateinvesting | r/trakstocks |
    | r/Bitcoin | r/RobinHoodPennyStocks | r/UKInvesting |
    | r/Canadapennystocks | r/SOSStock | r/ValueInvesting |
    | r/CanadianInvestor | r/STOCKMARKETNEWS | |

    Dataset Format:
    • The dataset is in .json format, facilitating easy parsing and analysis.
    • Each entry in the file represents a distinct post or thread, complete with details such as title, score, number of comments, body, creation date, and comments (see the parsing sketch below).
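    A minimal parsing sketch, assuming the top-level JSON structure is a list of post objects; the filename and key names ("subreddit", "num_comments", "title") are guesses to verify against the actual file.

```python
import json
from collections import Counter

with open("redditmix_stocks.json", encoding="utf-8") as f:  # hypothetical filename
    posts = json.load(f)

# Posts per subreddit, assuming each entry carries a "subreddit" key.
print(Counter(p.get("subreddit", "unknown") for p in posts).most_common(5))

# Longest discussion threads by comment count (key name assumed).
for p in sorted(posts, key=lambda p: p.get("num_comments", 0), reverse=True)[:3]:
    print(p.get("num_comments", 0), p.get("title", "")[:80])
```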

    Potential Applications:
    • Sentiment analysis across different investment communities.
    • Comparative analysis of discussions and trends across various stocks and sectors.
    • Behavioral analysis of investors in different market scenarios.

    Caveats:
    • The content is user-generated and may contain biases or subjective opinions.
    • The data reflects specific time periods and may not represent current market sentiments or trends.

  7. MASH: A Multiplatform Annotated Dataset for Societal Impact of Hurricane

    • zenodo.org
    csv
    Updated Aug 1, 2025
    + more versions
    Cite
    Anonymous (2025). MASH: A Multiplatform Annotated Dataset for Societal Impact of Hurricane [Dataset]. http://doi.org/10.5281/zenodo.15401479
    Explore at:
    csvAvailable download formats
    Dataset updated
    Aug 1, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anonymous
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
    We present a Multiplatform Annotated Dataset for Societal Impact of Hurricane (MASH) that includes 98,662 relevant social media data posts from Reddit, X, TikTok, and YouTube.
    In addition, all relevant posts are annotated on three dimensions: Humanitarian Classes, Bias Classes, and Information Integrity Classes in a multi-modal approach that considers both textual and visual content (text, images, and videos), providing a rich labeled dataset for in-depth analysis.
    To the best of our knowledge, MASH is the first large-scale, multi-platform, multimodal, and multi-dimensionally annotated hurricane dataset. We envision that MASH can contribute to the study of hurricanes' impact on society, such as disaster severity classification, event detection, public sentiment analysis, and bias identification.

    Usage Notice

    This dataset includes four annotation files:
    • reddit_anno_publish.csv
    • tiktok_anno_publish.csv
    • twitter_anno_publish.csv
    • youtube_anno_publish.csv
    Each file contains post IDs and corresponding annotations on three dimensions: Humanitarian Classes, Bias Classes, and Information Integrity Classes.
    To protect user privacy, only post IDs are released. We recommend retrieving the full post content via the official APIs of each platform, in accordance with their respective terms of service.

    Humanitarian Classes

    Each post is annotated with seven binary humanitarian classes. For each class, the label is either:
    • True – the post contains this humanitarian information
    • False – the post does not contain this information
    These seven humanitarian classes include:
    • Casualty: The post reports people or animals who are killed, injured, or missing during the hurricane.
    • Evacuation: The post describes the evacuation, relocation, rescue, or displacement of individuals or animals due to the hurricane.
    • Damage: The post reports damage to infrastructure or public utilities caused by the hurricane.
    • Advice: The post provides advice, guidance, or suggestions related to hurricanes, including how to stay safe, protect property, or prepare for the disaster.
    • Request: The post requests help, support, or resources due to the hurricane.
    • Assistance: The post describes assistance provided, including both physical aid and emotional or psychological support from individuals, communities, or organizations.
    • Recovery: The post describes efforts or activities related to the recovery and rebuilding process after the hurricane.
    Note: A single post may be labeled as True for multiple humanitarian categories.

    Bias Classes

    Each post is annotated with five binary bias classes. For each class, the label is either:
    • True – the post contains this bias information
    • False – the post does not contain this information
    These five bias classes include:
    • Linguistic Bias: The post contains biased, inappropriate, or offensive language, with a focus on word choice, tone, or expression.
    • Political Bias: The post expresses political ideology, showing favor or disapproval toward specific political actors, parties, or policies.
    • Gender Bias: The post contains biased, stereotypical, or discriminatory language or viewpoints related to gender.
    • Hate Speech: The post contains language that expresses hatred, hostility, or dehumanization toward a specific group or individual, especially those belonging to minority or marginalized communities.
    • Racial Bias: The post contains biased, discriminatory, or stereotypical statements directed toward one or more racial or ethnic groups.
    Note: A single post may be labeled as True for multiple bias categories.

    Information Integrity Classes

    Each post is also annotated with a single information integrity class, represented by an integer:
    • -1 → False information (i.e., misinformation or disinformation)
    • 0 → Unverifiable information (unclear or lacking sufficient evidence)
    • 1 → True information (verifiable and accurate)
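    A sketch of how the annotations might be decoded with pandas; the column names below are assumptions based on the class names above, so check the CSV headers for the exact spelling.

```python
import pandas as pd

df = pd.read_csv("reddit_anno_publish.csv")

HUMANITARIAN = ["Casualty", "Evacuation", "Damage", "Advice",
                "Request", "Assistance", "Recovery"]  # assumed column names
INTEGRITY = {-1: "false", 0: "unverifiable", 1: "true"}

# Posts flagged for at least one humanitarian class. pandas parses literal
# True/False cells as booleans; map from strings first if it does not.
flagged = df[df[HUMANITARIAN].any(axis=1)]
print(f"{len(flagged)} of {len(df)} posts carry humanitarian information")

# Distribution of the single integer information-integrity label (column name assumed).
print(df["information_integrity"].map(INTEGRITY).value_counts())
```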

    Key Notes

    1. Version 1 is no longer available.

  8. Customer Support on Twitter

    • kaggle.com
    zip
    Updated Nov 27, 2017
    Cite
    Thought Vector (2017). Customer Support on Twitter [Dataset]. https://www.kaggle.com/thoughtvector/customer-support-on-twitter
    Explore at:
    Available download formats: zip (149959515 bytes)
    Dataset updated
    Nov 27, 2017
    Dataset authored and provided by
    Thought Vector
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    The Customer Support on Twitter dataset is a large, modern corpus of tweets and replies to aid innovation in natural language understanding and conversational models, and for study of modern customer support practices and impact.

    [Image: Example Analysis - Inbound Volume for the Top 20 Brands (https://i.imgur.com/nTv3Iuu.png)]

    Context

    Natural language remains the densest encoding of human experience we have, and innovation in NLP has accelerated to power understanding of that data, but the datasets driving this innovation don't match the real language in use today. The Customer Support on Twitter dataset offers a large corpus of modern English (mostly) conversations between consumers and customer support agents on Twitter, and has three important advantages over other conversational text datasets:

    • Focused - Consumers contact customer support to have a specific problem solved, and the manifold of problems to be discussed is relatively small, especially compared to unconstrained conversational datasets like the Reddit Corpus.
    • Natural - Consumers in this dataset come from a much broader segment than those in the Ubuntu Dialogue Corpus and have much more natural and recent use of typed text than the Cornell Movie Dialogs Corpus.
    • Succinct - Twitter's brevity causes more natural responses from support agents (rather than scripted), and to-the-point descriptions of problems and solutions. Also, it's convenient in allowing a relatively low message size limit for recurrent nets.

    Inspiration

    The size and breadth of this dataset inspires many interesting questions:

    • Can we predict company responses? Given the bounded set of subjects handled by each company, the answer seems like yes!
    • Do requests get stale? How quickly do the best companies respond, compared to the worst?
    • Can we learn high quality dense embeddings or representations of similarity for topical clustering?
    • How does tone affect the customer support conversation? Does saying sorry help?
    • Can we help companies identify new problems, or ones most affecting their customers?

    Content

    The dataset is a CSV, where each row is a tweet. The different columns are described below. Every conversation included has at least one request from a consumer and at least one response from a company. Which user IDs are company user IDs can be calculated using the inbound field.

    tweet_id

    A unique, anonymized ID for the Tweet. Referenced by response_tweet_id and in_response_to_tweet_id.

    author_id

    A unique, anonymized user ID. @s in the dataset have been replaced with their associated anonymized user ID.

    inbound

    Whether the tweet is "inbound" to a company doing customer support on Twitter. This feature is useful when re-organizing data for training conversational models.

    created_at

    Date and time when the tweet was sent.

    text

    Tweet content. Sensitive information like phone numbers and email addresses is replaced with mask values like _email_.

    response_tweet_id

    IDs of tweets that are responses to this tweet, comma-separated.

    in_response_to_tweet_id

    ID of the tweet this tweet is in response to, if any.
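    To make the `inbound` mechanics concrete, here is a hedged pandas sketch that derives the company accounts and reconstructs request/response pairs; the filename is assumed from the Kaggle download, and the dtype handling may need adjusting.

```python
import pandas as pd

df = pd.read_csv("twcs.csv")  # assumed filename for the single CSV

# `inbound` may parse as bool or as the strings "True"/"False"; normalize it.
df["inbound"] = df["inbound"].astype(str) == "True"

# Company user IDs fall out of the inbound flag: authors of non-inbound tweets.
replies = df[~df["inbound"]].dropna(subset=["in_response_to_tweet_id"]).copy()
replies["in_response_to_tweet_id"] = replies["in_response_to_tweet_id"].astype("int64")
print(replies["author_id"].nunique(), "support accounts")

# Pair each consumer request with the company reply that points back at it.
pairs = df[df["inbound"]].merge(
    replies,
    left_on="tweet_id",
    right_on="in_response_to_tweet_id",
    suffixes=("_request", "_response"),
)
print(len(pairs), "request/response pairs")
print(pairs[["text_request", "text_response"]].head(3))
```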

    Contributing

    Know of other brands the dataset should include? Found something that needs to be fixed? Start a discussion, or email me directly at $FIRSTNAME@$LASTNAME.com!

    Acknowledgements

    A huge thank you to my friends who helped bootstrap the list of companies that do customer support on Twitter! There are many rocks that would have been left unturned were it not for your suggestions!

    Relevant Resources

  9. Overview of summary statistics used.

    • plos.figshare.com
    xls
    Updated Aug 26, 2024
    + more versions
    Cite
    David Kilroy; Graham Healy; Simon Caton (2024). Overview of summary statistics used. [Dataset]. http://doi.org/10.1371/journal.pone.0307180.t003
    Explore at:
    Available download formats: xls
    Dataset updated
    Aug 26, 2024
    Dataset provided by
    PLOS ONE
    Authors
    David Kilroy; Graham Healy; Simon Caton
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In recent years, computational approaches for extracting customer needs from user-generated content have been proposed. However, there is a lack of studies that focus on extracting unmet needs for future popular products. Therefore, this study presents a supervised keyphrase classification model which predicts needs that will become popular in real products in the marketplace. To do this, we utilize Trending Customer Needs (TCN), a monthly dataset of trending keyphrase customer needs occurring in new products during 2011-2021 across multiple categories of Consumer Packaged Goods, e.g. toothpaste, eyeliner, beer, etc. We are the first study to use this specific dataset, and we employ it by training a time series algorithm to learn the relationship between features we generate for each candidate keyphrase on Reddit and the needs appearing in the dataset 1-3 years in the future. We show that our approach outperforms a baseline in the literature and, through Multi-Task Learning, can accurately predict needs for a category it wasn't trained on, e.g. train on toothpaste, cereal, and beer products yet still predict for shampoo products. The findings from this research could provide many advantages to businesses, such as gaining early access to markets.

  10. RealWorldQuestioning

    • huggingface.co
    Cite
    Sonal Prabhune, RealWorldQuestioning [Dataset]. https://huggingface.co/datasets/SonalPrabhune/RealWorldQuestioning
    Explore at:
    Authors
    Sonal Prabhune
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    RealWorldQuestioning Benchmark

    RealWorldQuestioning is a benchmark dataset of 400+ real-world user questions collected from public discussion forums (e.g., Reddit, Quora), designed to support evaluation of gender bias and information disparity in Large Language Models (LLMs). The dataset spans four business-relevant domains: Education, Jobs, Investment, and Health. Each question is annotated with:

    • User persona (Male or Female framing)
    • Source forum
    • Domain category

    Four anonymized… See the full description on the dataset page: https://huggingface.co/datasets/SonalPrabhune/RealWorldQuestioning.
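    Since the dataset is hosted on the Hugging Face Hub, it should be loadable with the `datasets` library; the split name and field names below are assumptions to verify against the dataset card.

```python
from datasets import load_dataset

# Repository ID taken from the dataset page above; "train" split assumed.
ds = load_dataset("SonalPrabhune/RealWorldQuestioning", split="train")

print(ds)      # features and row count
print(ds[0])   # inspect the actual field names before relying on them
```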

  11. TweepFake - Twitter deep Fake text Dataset

    • kaggle.com
    Updated Apr 29, 2021
    Cite
    Maurizio Tesconi (2021). TweepFake - Twitter deep Fake text Dataset [Dataset]. https://www.kaggle.com/mtesconi/twitter-deep-fake-text/metadata
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 29, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Maurizio Tesconi
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Context

    Social media have always been the perfect vehicle to manipulate and alter public opinion through bots, i.e. agents that behave as human users by liking, re-posting and publishing multimedia content which can be real or machine-generated. In the latter case, the spread of deep-fakes - potentially deceptive images, video, audio or text autonomously generated by a deep neural network - on social media has been sowing mistrust, hate and deceit at the expense of the public.

    As far as deep-fake texts are concerned, the great improvement in their generation has come from language models (RNN, LSTM, GPT-2, GROVER, CLTR, OPTIMUS, GPT-3): several studies - such as Ippolito et al. (2020) and Adelani et al. (2019) - have shown that humans detect those deep-fake texts as machine-generated at a rate close to chance.

    Even though examples of deep-fake messages can already be found in social media (as our dataset shows), no episode of their misuse has yet been observed; however, the generative capability of language models is deeply worrying: it is therefore necessary to quickly raise shields against this threat as well. Some deep-fake text detection techniques have already been investigated - Bakhtin et al. (2019), Grover (2019), GLTR (2019), Adelani et al. (2019), Ippolito et al. (2020) - but there is still a lack of knowledge on how those state-of-the-art deepfake text detection techniques perform in a "real-social-media setting", in which the text generation method is unknown and the text content is often short (especially on Twitter).

    A dataset of deep-fake social media messages is required to start the research. Unfortunately, to the best of our knowledge, no one has ever created a properly labelled social media dataset containing only human and deep-fake messages (thus excluding cheap-fake texts that employ simple generative techniques such as gap-filling and search-and-replace methods) that can already be found on social media timelines.

    Focusing on Twitter, we have collected human and deepfake tweets to support the research on deepfake social media text detection in a "real-setting".

    Content

    To collect machine-generated tweets, the only known way is to heuristically search for Twitter accounts on the web (especially on Github and Twitter, of course), looking for keywords related either to automatic or AI text generation, deep-fake texts/tweets, or to specific technologies such as "GPT-2", "RNN", "LSTM" and so on. After gathering some accounts, two filters were applied: first, only the bot accounts referring to autonomous text generation methods either in the Twitter description, in the profile URLs or the Github description were selected; then, the subset of accounts mimicking (often fine-tuned on) human Twitter profiles were chosen.

    (ACCOUNTS) In the end, 23 bot and 17 human accounts were collected (some human accounts are associated with more than one bot account, e.g. Trump or Elon Musk's profiles). Three main text generation technologies were identified: GPT-2 (11 accounts, 3861 tweets), RNN (7 accounts, 4181 tweets), and Others (5 accounts, 4876 tweets). "Others" covers techniques (Markov Chain, RNN + Markov Chain, LSTM, CharRNN) that are either not further specified or appear in only one or two accounts. The following table lists the articles from which the Twitter accounts were taken.

    | Generative Method | Reference Articles | Class_Type |
    | --- | --- | --- |
    | GPT-2 | https://minimaxir.com/2020/01/twitter-gpt2-bot/ - https://github.com/osirisguitar/botus-twitter - Max Woolf's Colab - https://www.reddit.com/r/MachineLearning/comments/df0q8f/p_i_finetuned_a_gpt2_language_model_to_generate/?ref=bestofml - https://www.reddit.com/r/OpenAI/comments/es0stw/weird_twitter_gpt2_bot/ | GPT-2 |
    | RNN | https://qz.com/631497/mit-built-a-donald-trump-ai-twitter-bot-that-sounds-scarily-like-him/ - https://mc.ai/develop-and-publish-text-generating-twitter-bot/ | RNN |
    | Torch RNN | https://github.com/DaveSmith227/deep-elon-tweet-generator | RNN |
    | Markov Chain | https://hackernoon.com/create-a-twitter-politician-bot-with-markov-chains-node-js-and-stdlib-14df8cc1c68a | Others |
    | CharRNN | https://github.com/Jmete/deepThorin | Others |

    (TWEETS) Via the Twitter REST API, the timelines of both deep-fake accounts and their corresponding ...

