This table includes platform data for Facebook participants in the Deactivation experiment. Each row of the dataset corresponds to data from a participant’s Facebook user account. Each column contains a value, or set of values, that aggregates log data for this specific participant over a certain period of time.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
The LiLaH-HAG dataset (HAG is short for hate-age-gender) consists of metadata on Facebook comments on Facebook posts of mainstream media in Great Britain, Flanders, Slovenia and Croatia. The metadata available in the dataset are the hatefulness of the comment (0 is acceptable, 1 is hateful), the age bracket of the commenter (0-25, 26-35, 36-65, 65+), the gender of the commenter (M or F), and the language in which the comment was written (EN, NL, SL, HR).
The hatefulness of each comment was assigned by multiple trained annotators who read the comments in their order of appearance in a discussion thread, while the age and gender variables were estimated from the commenter's Facebook profile by a single annotator.
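As a quick illustration, here is a minimal sketch of how this metadata could be explored with pandas. The filename lilah_hag.csv and the column names (hateful, age, gender, language) are hypothetical stand-ins for the dataset's actual layout.

```python
import pandas as pd

# Hypothetical filename and column names, mirroring the metadata fields
# described above: hateful (0/1), age bracket, gender (M/F), language.
df = pd.read_csv("lilah_hag.csv")

# Share of hateful comments per language.
print(df.groupby("language")["hateful"].mean())

# Cross-tabulate hatefulness by gender for the English subset.
en = df[df["language"] == "EN"]
print(pd.crosstab(en["gender"], en["hateful"], normalize="index"))
```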
https://academictorrents.com/nolicensespecified
171 million names (100 million unique). This torrent contains:
• The URL of every searchable Facebook user's profile
• The name of every searchable Facebook user, both unique and by count (perfect for post-processing, datamining, etc.)
• Processed lists, including first names with count, last names with count, potential usernames with count, etc.
• The programs I used to generate everything
So, there you have it: lots of awesome data from Facebook. Now, I just have to find one more problem with Facebook so I can write "Revenge of the Facebook Snatchers" and complete the trilogy. Any suggestions? >:-) Limitations: So far, I have only indexed the searchable users, not their friends. Getting their friends will be significantly more data to process, and I don't have those capabilities right now. I'd like to tackle that in the future, though, so if anybody has any bandwidth they'd like to donate, all I need is an ssh account and Nmap installed. An additional limitation is that these are on
https://brightdata.com/license
Gain valuable insights with our comprehensive Social Media Dataset, designed to help businesses, marketers, and analysts track trends, monitor engagement, and optimize strategies. This dataset provides structured and reliable social media data from multiple platforms.
Dataset Features
• User Profiles: Access public social media profiles, including usernames, bios, follower counts, engagement metrics, and more. Ideal for audience analysis, influencer marketing, and competitive research.
• Posts & Content: Extract posts, captions, hashtags, media (images/videos), timestamps, and engagement metrics such as likes, shares, and comments. Useful for trend analysis, sentiment tracking, and content strategy optimization.
• Comments & Interactions: Analyze user interactions, including replies, mentions, and discussions. This data helps brands understand audience sentiment and engagement patterns.
• Hashtag & Trend Tracking: Monitor trending hashtags, topics, and viral content across platforms to stay ahead of industry trends and consumer interests.
Customizable Subsets for Specific Needs

Our Social Media Dataset is fully customizable, allowing you to filter data based on platform, region, keywords, engagement levels, or specific user profiles. Whether you need a broad dataset for market research or a focused subset for brand monitoring, we tailor the dataset to your needs.
Popular Use Cases
• Brand Monitoring & Reputation Management: Track brand mentions, customer feedback, and sentiment analysis to manage online reputation effectively.
• Influencer Marketing & Audience Analysis: Identify key influencers, analyze engagement metrics, and optimize influencer partnerships.
• Competitive Intelligence: Monitor competitor activity, content performance, and audience engagement to refine marketing strategies.
• Market Research & Consumer Insights: Analyze social media trends, customer preferences, and emerging topics to inform business decisions.
• AI & Predictive Analytics: Leverage structured social media data for AI-driven trend forecasting, sentiment analysis, and automated content recommendations.
Whether you're tracking brand sentiment, analyzing audience engagement, or monitoring industry trends, our Social Media Dataset provides the structured data you need. Get started today and customize your dataset to fit your business objectives.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This fileset contains a series of screenshots taken from our Facebook advertising account. A few days ago we noticed that some negative "SEO" tactics, for lack of a better term, were having a negative impact on the performance of ads and fan engagement on the Facebook page we've been building.

I developed a custom software package, which utilizes neural networks I've developed, to identify a target demographic and suggest advertising content for that demographic.

After a short training period we were able to create advertisements on Facebook that averaged a cost of 0.01 cents per like. We also had fan page engagement of nearly 4 times that of major brands like Wal-Mart.
Shortly after we began to see success we started noticing problems with our page. Since we have a stalker issue, we determined that the issues with our page were likely related to him.

We assumed this because we had a disproportionately high number of spammy, negative, and inappropriate comments on our posts. Offline harassment of our staff by the stalker also increased significantly during this time.

Curiously, we believe that the incident with the stalker allowed us to make some interesting observations about Facebook's algorithms, which I've outlined below.

We believe, after researching this issue, that Facebook's algorithms suffer from the following problems:

They are easily gamed. We think that Facebook's algorithms are hypersensitive to negative comments being made on a post, and likely to positive ones as well. If a post is hidden, the comments are negative, or a user otherwise interacts with the post negatively, then Facebook's algorithms will "punish" your page.

We think that a series of scripted fake bot accounts could easily cause the issues that we've been experiencing.
As you can see from the data provided, over 90% of our likes come from paid Facebook advertising. We therefore do not have a significant number of fake accounts on our page brought in by third-party advertising, because we didn't do any of that.
Moreover, we did not send any of our fans obtained via mailing lists, or offline contact to our facebook page, those fans participate with us via email and/or through our private Google+ community.
So it is safe to say that our problems have not been caused by purchasing a large amount of fake likes from any third party vendor.
In addition, because our likes were gained very quickly, at a rate of about 2.5k likes a day, we do not believe that we have suffered from changes in the general demographic of our Facebook fan base over time.
Yet almost immediately after we started experiencing trolling issues with our page, we also noticed a dip in the number of fans our posts were shown to by Facebook, and the performance of our ads began to decline, even though the content on our page had not changed.

We attributed this to holes in Facebook's algorithms, and potentially to the excessive use of fake bot accounts by Facebook itself.

We cannot prove the latter statement, but there have been similar reports before. Reference - http://www.forbes.com/sites/davidthier/2012/08/01/facebook-investigating-claims-that-80-of-ad-clicks-come-from-bots/

This article from Forbes outlines how one startup reported that up to 80% of their Facebook likes were fake bot accounts, even though they paid for advertising directly through Facebook.

Our research suggests that Facebook's advertising platform functions as follows: an advertiser pays Facebook for likes, and the quality of the content on their page is initially assessed by those who are liking the page; but once the page obtains a following, we believe that the quality of the content is assessed by how many people like the posts on the page directly after they are posted.

If a post gets hidden, marked as spam, skipped over, or the like, then we believe that Facebook kicks that post out of newsfeeds. If this happens to a significant number of posts on the page, then we believe that Facebook places the page on an advertising black-list.
Once on this black-list ads will begin to perform poorly, and content will drop out of newsfeeds causing even the most active page to go silent.
We tested this by posting pictures of attractive blond women, which with our demographic would normally have obtained a large number of likes. We struggled to get even 10 likes at over 20k page likes, when we would previously have obtained almost 100 likes without boosting at only 5k page likes.

Why this probably isn't seen more often: in most cases this probably takes a while to occur, as pages become old and fans grow bored, but in our case, because we have a stalker trolling our page with what appears to be hundreds of scripted bot accounts, the effect was seen immediately.

Our data suggests that it became a tug of war between our stalker's army of fake bot accounts (making spammy comments, hiding our posts from newsfeeds, etc.) and the real fans that actually like our page (who were voting our content up - i.e. liking it, etc.).

If you look at the graph of page likes in the figures provided, you can see that the darker purple represents the fans we obtained via Facebook advertising - well over 90%. We believe that the light purple (the "organic" fans) is mostly comprised of our stalker's fake drone accounts. We have fewer than 20 family members and friends liking our page, and when we began this experiment we asked them not to interact with our page or its content.

In conclusion: we still have a lot more work to do, but it is highly likely that many Facebook likes are either scripted bots, and/or that Facebook's "weighting" algorithms are very susceptible to gaming via negative "SEO" tactics. Conversely, they are likely sensitive to gaming via positive "SEO" tactics as well.

Of course we cannot say for certain where the Facebook accounts that like a page come from without access to their internal systems, but the evidence does strongly suggest that Facebook might be plagued with a large quantity of bot accounts, and that their algorithm has to be sensitive to actions from live users so that the quality of the content can be easily ascertained. Otherwise it would be pretty easy for an advertiser to game Facebook's system by paying for, and getting, a large quantity of likes for content that is not appealing to any significant group of people.
Again we have to reiterate that we have no solid proof of this, but our data strongly suggests that this is the case.
We have reported the issues to Facebook, but interestingly, after we made it clear that we were going to analyze and investigate the issues with our page, we have been suddenly and incessantly plagued with a never-ending stream of "technical difficulties" related to our advertising account.
If you'd like to collaborate on this project, please feel free to email me at Jamie@ITSmoleculardesign.com.
http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html
Facebook is becoming an essential tool for more than just family and friends. Discover how Cheltenham Township (USA), a diverse community just outside of Philadelphia, deals with major issues such as the Bill Cosby trial, everyday traffic issues, sewer I/I problems and lost cats and dogs. And yes, theft.
Communities work when they're connected and exchanging information. What and who are the essential forces making a positive impact, and when and how do conversational threads get directed or misdirected?
Use Any Facebook Public Group
You can leverage the examples here for any public Facebook group. For an example of the source code used to collect this data, and a quick start docker image, take a look at the following project: facebook-group-scrape.
Data Sources
There are 4 CSV files in the dataset, with data from 5 public Facebook groups. The files are described below, followed by a short loading-and-join example.
post.csv
These are the main posts you will see on the page. It might help to take a quick look at the page. Commas in the msg field have been replaced with {COMMA}, and apostrophes have been replaced with {APOST}.
comment.csv
These are comments on the main posts. Note that Facebook posts have comments, and comments on comments.
like.csv
These are likes and responses. The two keys in this file (pid,cid) will join to post and comment respectively.
member.csv
These are all the members of the group. Some members never, or rarely, post or comment. You may find multiple entries in this table for the same person: the name of the individual never changes, but their profile picture does, and each profile-picture change is captured as a separate entry. Facebook assigns the user a new id in this table when they change their profile picture.
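To make the joins concrete, here is a minimal sketch assuming the filenames and keys described above (post.csv, comment.csv, like.csv; join keys pid and cid) and the {COMMA}/{APOST} escaping noted for the msg field. That comment.csv also carries its text in a msg column is an assumption.

```python
import pandas as pd

# Load three of the four CSV files described above.
posts = pd.read_csv("post.csv")
comments = pd.read_csv("comment.csv")
likes = pd.read_csv("like.csv")

# Restore the escaped characters in the message text
# (assumes both files use a msg column).
for df in (posts, comments):
    df["msg"] = (df["msg"]
                 .str.replace("{COMMA}", ",", regex=False)
                 .str.replace("{APOST}", "'", regex=False))

# Join likes back to posts and comments via the pid/cid keys.
post_likes = likes.merge(posts, on="pid", how="inner")
comment_likes = likes.merge(comments, on="cid", how="inner")
print(len(post_likes), len(comment_likes))
```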
The metrics in this dataset measure users who viewed posts with links to civic news domains. The dataset contains domain-level metrics from Facebook activity data for adult U.S. monthly active users, aggregated over the study period. Includes content views, audience size, content attributes, user attributes.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This database contains regional estimates of Facebook users based on data from the Facebook Marketing API. It includes information on the number of individuals aged 18 and older who have accessed Facebook in the past month, with data separated by region. These estimates are intended for trend identification and triangulation purposes and are not designed to match official census data or other government sources.
This data can be used as a proxy of internet access.
It should be noted that there could be duplicates across different regions, and the data is anonymized by Meta.
The metrics in this dataset measure users who engaged with posts with links to civic news URLs and the volume of their engagement. The dataset contains URL-level metrics from Facebook activity data for adult U.S. monthly active users, aggregated over the study period. Includes content views, audience size, content attributes, user attributes.
The metrics in this dataset measure users who potentially viewed posts with links to civic news URLs that were shared by one of their connections. The dataset contains URL-level metrics from Facebook activity data for adult U.S. monthly active users, aggregated over the study period. Includes potential audience size, content attributes, user attributes, political interest.
Losing access to your Facebook account can be a stressful experience, especially if it is your primary social media platform for connecting with friends, family, or business contacts. Whether your account was hacked, disabled, or you simply forgot your login credentials, there are several ways to contact Facebook and attempt to recover your account. This guide walks through the recovery process, including how to use recovery tools and what to do if your account is hacked or disabled.

1. Common Reasons for Losing Access to a Facebook Account
Before initiating the recovery process, it's important to identify why you lost access to your account, since the reason affects how you approach Facebook:
• Forgotten password or email
• Lost access to the phone number or email linked to the account
• Hacked or compromised account
• Account disabled by Facebook for violating terms
• Suspicious activity detected
• Fake identity report
• Name policy violations
Each scenario has a different recovery method, and Facebook has dedicated tools and forms for each one.

2. First Steps Before Contacting Facebook
Before you attempt to reach Facebook's support directly, try these general steps:
• Use a known device and IP address: access Facebook from a browser or app you've used before.
• Clear cache and cookies if logging in on a web browser.
• Check whether your account is still visible by searching for your name from another Facebook profile.
• Try logging in with alternate emails or phone numbers associated with the account.
If these don't work, proceed with specific recovery steps based on your situation.

3. Recovering an Account Using the "Forgot Password" Feature
The most common way to recover a Facebook account is the "Forgot Password" tool. Steps: go to facebook.com/login and click "Forgotten password?"
The number of Facebook users in Malaysia was forecast to decrease continuously between 2024 and 2028 by a total of 2.2 million users (-9.36 percent). According to this forecast, in 2028 the Facebook user base will have decreased for the sixth consecutive year, to 21.33 million users. User figures, shown here for the Facebook platform, have been estimated by taking into account company filings or press material, secondary research, app downloads and traffic data. They refer to the average monthly active users over the period and count multiple accounts by the same person only once. The shown data are an excerpt of Statista's Key Market Indicators (KMI). The KMI are a collection of primary and secondary indicators on the macro-economic, demographic and technological environment in up to 150 countries and regions worldwide. All indicators are sourced from international and national statistical offices, trade associations and the trade press, and they are processed to generate comparable data sets (see supplementary notes under details for more information). Find further information concerning Indonesia and Singapore.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset contains a list of Facebook users who are members of the Integromat Facebook group. Integromat (now make.com) is a popular automation SaaS that allows users to design their own automation flows consisting of multiple marketing tools. Competitors of Integromat include Zapier, Integrately, etc. You can use this list to find prospects who are most likely interested in SaaS products.
Category: Lead Generation
Keywords: integromat, automation, rpa, SaaS, make.com
Records: 17,200
Price: $20.00
Context
A collection of Facebook spam and legitimate profiles with profile-based and content-based features. It can be used for classification tasks.
Content
The dataset can be used for building machine learning models. It was collected from public profiles using the Facebook API and the Facebook Graph API. There are 500 legit profiles and 100 spam profiles. The features are as follows, with Label (0 = legit, 1 = spam):
1. Number of friends
2. Number of followings
3. Number of communities
4. Age of the user account (in days)
5. Total number of posts shared
6. Total number of URLs shared
7. Total number of photos/videos shared
8. Fraction of the posts containing URLs
9. Fraction of the posts containing photos/videos
10. Average number of comments per post
11. Average number of likes per post
12. Average number of tags in a post (rate of tagging)
13. Average number of hashtags present in a post
Inspiration
This dataset helps the community understand which features can distinguish legitimate Facebook users from spam users.
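As a starting point, here is a minimal classification sketch under stated assumptions: the data is exported to a hypothetical facebook_spam_legit.csv with the 13 features above plus a Label column (0 = legit, 1 = spam).

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Hypothetical filename; Label is 0 = legit, 1 = spam.
df = pd.read_csv("facebook_spam_legit.csv")
X = df.drop(columns=["Label"])
y = df["Label"]

# Stratify to preserve the 500:100 legit-to-spam ratio in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# class_weight="balanced" compensates for the imbalanced classes.
clf = RandomForestClassifier(n_estimators=200, class_weight="balanced",
                             random_state=42)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```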
This dataset measures the ideological segregation index and favorability score of the potential, exposed and engaged audience of posts with links to domains and URLs classified as civic news. The dataset contains domain- and URL-level metrics from Facebook activity data for adult U.S. monthly active users, aggregated daily over the study period. Includes ideological segregation index, favorability score, content attributes, user attributes.
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
Insights include statistics on how many people each post reached, how many people engaged with each post, and how many people talked about the post with their friends.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overview
The Controllable Multimodal Feedback Synthesis (CMFeed) Dataset is designed to enable the generation of sentiment-controlled feedback from multimodal inputs, including text and images. This dataset can be used to train feedback synthesis models in both uncontrolled and sentiment-controlled manners. Serving a crucial role in advancing research, the CMFeed dataset supports the development of human-like feedback synthesis, a novel task defined by the dataset's authors. Additionally, the corresponding feedback synthesis models and benchmark results are presented in the associated code and research publication.
Task Uniqueness: The task of controllable multimodal feedback synthesis is unique, distinct from LLMs and tasks like VisDial, and not addressed by multi-modal LLMs. LLMs often exhibit errors and hallucinations, as evidenced by their auto-regressive and black-box nature, which can obscure the influence of different modalities on the generated responses [Ref1; Ref2]. Our approach includes an interpretability mechanism, as detailed in the supplementary material of the corresponding research publication, demonstrating how metadata and multimodal features shape responses and learn sentiments. This controllability and interpretability aim to inspire new methodologies in related fields.
Data Collection and Annotation
Data was collected by crawling Facebook posts from major news outlets, adhering to ethical and legal standards. The comments were annotated using four sentiment analysis models: FLAIR, SentimentR, RoBERTa, and DistilBERT. Facebook was chosen for dataset construction because of the following factors:
• Facebook uniquely provides metadata such as news article links, post shares, post reactions, comment likes, comment rank, comment reaction rank, and relevance scores, which are not available on other platforms.
• Facebook is the most used social media platform, with 3.07 billion monthly users, compared to 550 million Twitter and 500 million Reddit users. [Ref]
• Facebook is popular across all age groups (18-29, 30-49, 50-64, 65+), with at least 58% usage, compared to 6% for Twitter and 3% for Reddit [Ref]. Trends are similar for gender, race, ethnicity, income, education, community, and political affiliation [Ref].
• The male-to-female user ratio on Facebook is 56.3% to 43.7%; on Twitter, it's 66.72% to 23.28%; Reddit does not report this data. [Ref]
Filtering Process: To ensure high-quality and reliable data, the dataset underwent two levels of filtering:
a) Model Agreement Filtering: Retained only comments where at least three out of the four models agreed on the sentiment.
b) Probability Range Safety Margin: Comments with a sentiment probability between 0.49 and 0.51, indicating low confidence in sentiment classification, were excluded.
After filtering, 4,512 samples were marked as XX. Though these samples have been released for the reader's understanding, they were not used in training the feedback synthesis model proposed in the corresponding research paper.
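A minimal sketch of the two-level filtering follows, assuming hypothetical per-comment positive-sentiment probabilities from the four models in columns flair, sentimentr, roberta, and distilbert (the authors' actual pipeline may differ).

```python
import pandas as pd

# Hypothetical per-model columns holding each model's probability that a
# comment is positive; file and column names are assumptions.
models = ["flair", "sentimentr", "roberta", "distilbert"]
df = pd.read_csv("comment_sentiments.csv")

# b) Probability range safety margin: drop low-confidence comments whose
#    probability falls within 0.49-0.51 for any of the four models.
in_margin = df[models].apply(lambda col: col.between(0.49, 0.51)).any(axis=1)
confident = df[~in_margin]

# a) Model agreement filtering: keep comments where at least 3 of the 4
#    models agree on the class (probability >= 0.5 counts as positive).
votes = (confident[models] >= 0.5).sum(axis=1)
agreed = confident[(votes >= 3) | (votes <= 1)].copy()

# The Agreement field described below equals (positives - negatives),
# i.e. 2 * votes - 4, ranging from -4 to +4.
agreed["Agreement"] = 2 * votes[agreed.index] - 4
agreed["Sentiment_class"] = (agreed["Agreement"] > 0).astype(int)
```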
Dataset Description
• Total Samples: 61,734
• Total Samples Annotated: 57,222 after filtering.
• Total Posts: 3,646
• Average Likes per Post: 65.1
• Average Likes per Comment: 10.5
• Average Length of News Text: 655 words
• Average Number of Images per Post: 3.7
Components of the Dataset
The dataset comprises two main components:
• CMFeed.csv File: Contains metadata, comment, and reaction details related to each post.
• Images Folder: Contains folders with images corresponding to each post.
Data Format and Fields of the CSV File
The dataset is structured as a CMFeed.csv file along with corresponding images in related folders. This CSV file includes the following fields (a short loading sketch follows the list):
• Id: Unique identifier
• Post: The heading of the news article.
• News_text: The text of the news article.
• News_link: URL link to the original news article.
• News_Images: A path to the folder containing images related to the post.
• Post_shares: Number of times the post has been shared.
• Post_reaction: A JSON object capturing reactions (like, love, etc.) to the post and their counts.
• Comment: Text of the user comment.
• Comment_like: Number of likes on the comment.
• Comment_reaction_rank: A JSON object detailing the type and count of reactions the comment received.
• Comment_link: URL link to the original comment on Facebook.
• Comment_rank: Rank of the comment based on engagement and relevance.
• Score: Sentiment score computed based on the consensus of sentiment analysis models.
• Agreement: Indicates the consensus level among the sentiment models, ranging from -4 (all four negative) to +4 (all four positive). For example, three negative and one positive yields -2, while three positive and one negative yields +2.
• Sentiment_class: Categorizes the sentiment of the comment into 1 (positive) or 0 (negative).
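Here is the loading sketch referenced above, assuming CMFeed.csv uses the field names listed and that the two reaction fields are stored as valid JSON strings (if they are not, a more lenient parser would be needed).

```python
import json
import pandas as pd

df = pd.read_csv("CMFeed.csv")

# Post_reaction and Comment_reaction_rank are described as JSON objects;
# parse them into Python dicts for downstream use.
for col in ["Post_reaction", "Comment_reaction_rank"]:
    df[col] = df[col].apply(json.loads)

# Example: keep confidently labelled positive comments
# (Sentiment_class == 1) with strong model consensus (|Agreement| >= 2).
positive = df[(df["Sentiment_class"] == 1) & (df["Agreement"].abs() >= 2)]
print(positive[["Post", "Comment", "Score"]].head())
```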
More Considerations During Dataset Construction
We thoroughly considered issues such as the choice of social media platform for data collection, bias and generalizability of the data, selection of news handles/websites, ethical protocols, privacy and potential misuse before beginning data collection. While achieving completely unbiased and fair data is unattainable, we endeavored to minimize biases and ensure as much generalizability as possible. Building on these considerations, we made the following decisions about data sources and handling to ensure the integrity and utility of the dataset:
• Why not merge data from different social media platforms? We chose not to merge data from platforms such as Reddit and Twitter with Facebook due to the lack of comprehensive metadata, clear ethical guidelines, and control mechanisms—such as who can comment and whether users' anonymity is maintained—on these platforms other than Facebook. These factors are critical for our analysis. Our focus on Facebook alone was crucial to ensure consistency in data quality and format.
• Choice of four news handles: We selected four news handles - BBC News, Sky News, Fox News, and NY Daily News - to ensure diversity and comprehensive regional coverage. These news outlets were chosen for their distinct regional focuses and editorial perspectives: BBC News is known for its global coverage with a centrist view, Sky News offers geographically targeted and politically varied content leaning centre/right in the UK/EU/US, Fox News is recognized for its right-leaning content in the US, and NY Daily News provides left-leaning coverage in New York. Many other news handles, such as NDTV, The Hindu, Xinhua, and SCMP, are also large-scale but may publish content in regional languages (e.g., Indian languages or Chinese), and hence were not selected. This selection ensures a broad spectrum of political discourse and audience engagement.
• Dataset Generalizability and Bias: With 3.07 billion of the total 5 billion social media users, the extensive user base of Facebook, reflective of broader social media engagement patterns, ensures that the insights gained are applicable across various platforms, reducing bias and strengthening the generalizability of our findings. Additionally, the geographic and political diversity of these news sources, ranging from local (NY Daily News) to international (BBC News), and spanning political spectra from left (NY Daily News) to right (Fox News), ensures a balanced representation of global and political viewpoints in our dataset. This approach not only mitigates regional and ideological biases but also enriches the dataset with a wide array of perspectives, further solidifying the robustness and applicability of our research.
• Dataset size and diversity: Facebook prohibits the automatic scraping of its users' personal data. In compliance with this policy, we manually scraped publicly available data. This labor-intensive process, requiring around 800 hours of manual effort, limited our data volume but allowed for precise selection. Following ethical protocols for scraping Facebook data, we selected 1,000 posts from each of the four news handles to enhance diversity and reduce bias. Initially, 4,000 posts were collected; after preprocessing (detailed in Section 3.1), 3,646 posts remained. We then processed all associated comments, resulting in a total of 61,734 comments. This manual method ensures adherence to Facebook's policies and the integrity of our dataset.
Ethical considerations, data privacy and misuse prevention
The data collection adheres to Facebook's ethical guidelines (https://developers.facebook.com/terms/).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset comprises user feedback data collected from 15 globally acclaimed mobile applications, spanning diverse categories. The included applications are among the most downloaded worldwide, providing a rich and varied source for analysis. The dataset is particularly suitable for Natural Language Processing (NLP) applications, such as text classification and topic modeling. List of Included Applications:
TikTok, Instagram, Facebook, WhatsApp, Telegram, Zoom, Snapchat, Facebook Messenger, CapCut, Spotify, YouTube, HBO Max, Cash App, Subway Surfers, Roblox

Data Columns and Descriptions:
• review_id: Unique identifier for each user feedback/application review.
• content: User-generated feedback/review in text format.
• score: Rating or star given by the user.
• TU_count: Number of likes/thumbs-up (TU) received by the review.
• app_id: Unique identifier for each application.
• app_name: Name of the application.
• RC_ver: Version of the app when the review was created (RC).

Terms of Use: This dataset is open access for scientific research and non-commercial purposes. Users are required to acknowledge the authors' work and, in the case of scientific publication, cite the most appropriate reference: M. H. Asnawi, A. A. Pravitasari, T. Herawan, and T. Hendrawati, "The Combination of Contextualized Topic Model and MPNet for User Feedback Topic Modeling," in IEEE Access, vol. 11, pp. 130272-130286, 2023, doi: 10.1109/ACCESS.2023.3332644.
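As one example of the NLP applications mentioned above, the following sketch trains a simple sentiment classifier on the review text, deriving a binary label from the star rating. The filename user_feedback.csv is a hypothetical stand-in; the column names follow the list above.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("user_feedback.csv")  # hypothetical filename
df = df.dropna(subset=["content"])

# Derive a binary sentiment label from the star rating: 4-5 stars = positive.
y = (df["score"] >= 4).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    df["content"], y, test_size=0.2, random_state=42)

# Bag-of-words baseline: TF-IDF features plus logistic regression.
vec = TfidfVectorizer(max_features=20000, ngram_range=(1, 2))
clf = LogisticRegression(max_iter=1000)
clf.fit(vec.fit_transform(X_train), y_train)
print("accuracy:", clf.score(vec.transform(X_test), y_test))
```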
Researchers and analysts are encouraged to explore this dataset for insights into user sentiments, preferences, and trends across these top mobile applications. If you have any questions or need further information, feel free to contact the dataset authors.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This data describes the use of the social media platform Facebook (http://www.facebook.com) by five (5) Massachusetts police departments over the three (3) month period from May 1st through July 31st, 2014. The five police departments represent the towns/cities of Billerica, Burlington, Peabody, Waltham, and Wellesley. In addition to portraying these local trends, the data demonstrates a methodology for systematically measuring social media use by government agencies or other organizations. The data was taken directly from Facebook using APIs provided by Facebook. It includes all "wall posts" made by the respective police departments during this period, with variables such as the text of the posting, the number of "likes" and "shares" (likes/shares are features of the Facebook platform), information about who performed the "like" or "share", and comments others made in response to the "wall post". There are 5 data files, one for each town. The number of columns varies per town, depending on the post with the maximum count of a given feature: for example, if the top number of comments for one police department is 20 and for another is 30, the latter dataset contains 10 more columns per row to accommodate the maximum.
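Because the per-town files pad each row out to the maximum number of comments, a long format is often easier to analyze. A minimal reshaping sketch follows; the filename and all column names (comment1, comment2, ..., post_text, likes, shares) are hypothetical, since the actual headers are not specified here.

```python
import pandas as pd

# Hypothetical layout: one row per wall post, with padded columns
# comment1, comment2, ... up to the town's maximum comment count.
df = pd.read_csv("waltham.csv")  # hypothetical per-town filename

comment_cols = [c for c in df.columns if c.startswith("comment")]

# Reshape to one row per (post, comment) pair and drop the padding.
long = df.melt(id_vars=["post_text", "likes", "shares"],  # hypothetical ids
               value_vars=comment_cols,
               var_name="comment_slot", value_name="comment")
long = long.dropna(subset=["comment"])
print(long.head())
```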
Overview
This dataset of medical misinformation was collected and is published by Kempelen Institute of Intelligent Technologies (KInIT). It consists of approx. 317k news articles and blog posts on medical topics published between January 1, 1998 and February 1, 2022 from a total of 207 reliable and unreliable sources. The dataset contains full-texts of the articles, their original source URL and other extracted metadata. If a source has a credibility score available (e.g., from Media Bias/Fact Check), it is also included in the form of annotation. Besides the articles, the dataset contains around 3.5k fact-checks and extracted verified medical claims with their unified veracity ratings published by fact-checking organisations such as Snopes or FullFact. Lastly and most importantly, the dataset contains 573 manually and more than 51k automatically labelled mappings between previously verified claims and the articles; mappings consist of two values: claim presence (i.e., whether a claim is contained in the given article) and article stance (i.e., whether the given article supports or rejects the claim or provides both sides of the argument).
The dataset is primarily intended to be used as a training and evaluation set for machine learning methods for claim presence detection and article stance classification, but it enables a range of other misinformation related tasks, such as misinformation characterisation or analyses of misinformation spreading.
Its novelty and our main contributions lie in (1) the focus on medical news articles and blog posts, as opposed to social media posts or political discussions; (2) providing multiple modalities (besides full-texts of the articles, there are also images and videos), thus enabling research on multimodal approaches; (3) the mapping of articles to fact-checked claims (with manual as well as predicted labels); and (4) source credibility labels for 95% of all articles, plus other potential sources of weak labels that can be mined from the articles' content and metadata.
The dataset is associated with the research paper "Monant Medical Misinformation Dataset: Mapping Articles to Fact-Checked Claims" accepted and presented at ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '22).
The accompanying Github repository provides a small static sample of the dataset and the dataset's descriptive analysis in a form of Jupyter notebooks.
Options to access the dataset
There are two ways to access the dataset: a full static dump and a REST API.

To obtain access to the dataset (either the full static dump or the REST API), please request access by following the instructions provided below.
References
If you use this dataset in any publication, project, tool or in any other form, please, cite the following papers:
@inproceedings{SrbaMonantPlatform,
  author    = {Srba, Ivan and Moro, Robert and Simko, Jakub and Sevcech, Jakub and Chuda, Daniela and Navrat, Pavol and Bielikova, Maria},
  booktitle = {Proceedings of Workshop on Reducing Online Misinformation Exposure (ROME 2019)},
  pages     = {1--7},
  title     = {Monant: Universal and Extensible Platform for Monitoring, Detection and Mitigation of Antisocial Behavior},
  year      = {2019}
}
@inproceedings{SrbaMonantMedicalDataset,
  author    = {Srba, Ivan and Pecher, Branislav and Tomlein, Matus and Moro, Robert and Stefancova, Elena and Simko, Jakub and Bielikova, Maria},
  booktitle = {Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '22)},
  numpages  = {11},
  title     = {Monant Medical Misinformation Dataset: Mapping Articles to Fact-Checked Claims},
  year      = {2022},
  doi       = {10.1145/3477495.3531726},
  publisher = {Association for Computing Machinery},
  address   = {New York, NY, USA},
  url       = {https://doi.org/10.1145/3477495.3531726}
}
Dataset creation process
In order to create this dataset (and to continuously obtain new data), we used our research platform Monant. The Monant platform provides so-called data providers to extract news articles/blogs from news/blog sites as well as fact-checking articles from fact-checking sites. General parsers (for RSS feeds, Wordpress sites, the Google Fact Check Tool, etc.) as well as custom crawlers and parsers were implemented (e.g., for the fact-checking site Snopes.com). All data is stored in a unified format in a central data storage.
Ethical considerations
The dataset was collected and is published for research purposes only. We collected only publicly available content of news/blog articles. The dataset contains identities of authors of the articles if they were stated in the original source; we left this information, since the presence of an author's name can be a strong credibility indicator. However, we anonymised the identities of the authors of discussion posts included in the dataset.
The main identified ethical issue related to the presented dataset lies in the risk of mislabelling of an article as supporting a false fact-checked claim and, to a lesser extent, in mislabelling an article as not containing a false claim or not supporting it when it actually does. To minimise these risks, we developed a labelling methodology and require an agreement of at least two independent annotators to assign a claim presence or article stance label to an article. It is also worth noting that we do not label an article as a whole as false or true. Nevertheless, we provide partial article-claim pair veracities based on the combination of claim presence and article stance labels.
As to the veracity labels of the fact-checked claims and the credibility (reliability) labels of the articles' sources, we take these from the fact-checking sites and external listings such as Media Bias/Fact Check as they are and refer to their methodologies for more details on how they were established.
Lastly, the dataset also contains automatically predicted labels of claim presence and article stance using our baselines described in the next section. These methods have their limitations and work with certain accuracy as reported in this paper. This should be taken into account when interpreting them.
Reporting mistakes in the dataset

The means to report considerable mistakes in the raw collected data or in the manual annotations is to create a new issue in the accompanying Github repository. Alternatively, general enquiries or requests can be sent to info [at] kinit.sk.
Dataset structure
Raw data
First, the dataset contains so-called raw data, i.e., data extracted by the Web monitoring module of the Monant platform and stored in exactly the same form as it appears on the original websites. Raw data consist of articles from news sites and blogs (e.g., naturalnews.com), discussions attached to such articles, and fact-checking articles from fact-checking portals (e.g., snopes.com). In addition, the dataset contains feedback (numbers of likes, shares, and comments) provided by users on the social network Facebook, which is regularly extracted for all news/blog articles.
Raw data are contained in these CSV files (and corresponding REST API endpoints):
sources.csv
articles.csv
article_media.csv
article_authors.csv
discussion_posts.csv
discussion_post_authors.csv
fact_checking_articles.csv
fact_checking_article_media.csv
claims.csv
feedback_facebook.csv
Note: Personal information about discussion posts' authors (name, website, gravatar) is anonymised.
Annotations
Second, the dataset contains so-called annotations. Entity annotations describe individual raw-data entities (e.g., an article or a source); relation annotations describe a relation between two such entities.
Each annotation is described by the following attributes:

• category of annotation (annotation_category). Possible values: label (the annotation corresponds to ground truth determined by human experts) and prediction (the annotation was created by an AI method).
• type of annotation (annotation_type_id). Example values: Source reliability (binary), Claim presence. The list of possible values can be obtained from the enumeration in annotation_types.csv.
• method which created the annotation (method_id). Example values: Expert-based source reliability evaluation, Fact-checking article to claim transformation method. The list of possible values can be obtained from the enumeration in methods.csv.
• its value (value). The value is stored in JSON format and its structure differs according to the particular annotation type.
At the same time, annotations are associated with a particular object identified by:

• entity type (parameter entity_type in the case of entity annotations, or source_entity_type and target_entity_type in the case of relation annotations). Possible values: sources, articles, fact-checking-articles.
• entity id (parameter entity_id in the case of entity annotations, or source_entity_id and target_entity_id in the case of relation annotations).
The dataset provides specifically these entity annotations:
Source reliability (binary). Determines the reliability of a source (website) on a binary scale with two options: reliable source and unreliable source.
Article veracity. Aggregated information about veracity from article-claim pairs.
The dataset provides specifically these relation annotations:
Fact-checking article to claim mapping. Determines mapping between fact-checking article and claim.
Claim presence. Determines presence of claim in article.
Claim stance. Determines stance of an article to a claim.
Annotations are contained in these CSV files (and corresponding REST API endpoints):
entity_annotations.csv
relation_annotations.csv
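As a minimal parsing sketch, assuming entity_annotations.csv uses column names matching the annotation attributes described above (the exact CSV schema is an assumption):

```python
import json
import pandas as pd

# Column names follow the annotation attributes described above;
# the exact schema of the CSV export is an assumption.
ann = pd.read_csv("entity_annotations.csv")

# Keep only expert-assigned ground-truth labels on sources.
labels = ann[(ann["annotation_category"] == "label")
             & (ann["entity_type"] == "sources")]

# The value column is JSON whose structure depends on the annotation type.
labels = labels.assign(value=labels["value"].apply(json.loads))
print(labels[["annotation_type_id", "entity_id", "value"]].head())
```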
Note: The identification of human annotators (the email provided in the annotation app) is anonymised.