Open Data Commons Attribution License (ODC-By) v1.0: https://www.opendatacommons.org/licenses/by/1.0/
License information was derived automatically
This comprehensive dataset offers a deep dive into the social media engagement metrics of nearly 4,000 posts from four of the world's leading news channels: CNN, BBC, Al Jazeera, and Reuters. Curated to provide a holistic view of global news interaction on social media, the collection stands out for its meticulous assembly and broad spectrum of content.
Dataset Overview: Spanning various global events, topics, and narratives, this dataset is a snapshot of how news is consumed and interacted with on social media platforms. It serves as a rich resource for analyzing trends, engagement patterns, and the dissemination of information across international borders.
Data Science Applications: Ideal for researchers and enthusiasts in the fields of data science, media studies, and social analytics, this dataset opens doors to numerous explorations such as engagement analysis, trend forecasting, content strategy optimization, and the study of information flow in digital spaces. It also holds potential for machine learning projects aiming to predict engagement or classify content based on interaction metrics.
Column Descriptors:
Each record in the dataset is detailed with the following columns:
- text: The title or main content of the post.
- likes: The number of likes each post has garnered.
- comments: The number of comments left by viewers.
- shares: How many times the post has been shared.
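A minimal exploration sketch with pandas follows, assuming the posts have been exported to a CSV file containing exactly the four columns listed above; the file name news_engagement.csv is a placeholder, not part of the dataset description.

```python
# A minimal exploration sketch; "news_engagement.csv" is a hypothetical export
# containing the columns described above: text, likes, comments, shares.
import pandas as pd

df = pd.read_csv("news_engagement.csv")

# Total engagement per post as a simple sum of the three interaction counts.
df["engagement"] = df[["likes", "comments", "shares"]].sum(axis=1)

# The ten most engaged-with posts.
print(df.sort_values("engagement", ascending=False)
        .head(10)[["text", "likes", "comments", "shares", "engagement"]])
```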
Ethically Mined Data: The collection of this dataset was conducted with the highest ethical standards in mind, ensuring compliance with data privacy laws and platform policies. By anonymizing data where necessary and focusing solely on publicly available information, it respects both individual privacy and intellectual property rights.
Special thanks are extended to the Facebook platform and the respective news channels for their openness and the rich public data they provide. This dataset not only celebrates the vibrant exchange on social media but also underscores the importance of responsible data use and sharing in fostering understanding and innovation.
During a 2024 survey, 77 percent of respondents from Nigeria stated that they used social media as a source of news. In comparison, just 23 percent of Japanese respondents said the same. Large portions of social media users around the world admit that they do not trust social platforms either as media sources or as a way to get news, and yet they continue to access such networks on a daily basis.
Social media: trust and consumption
Despite the majority of adults surveyed in each country reporting that they used social networks to keep up to date with news and current affairs, a 2018 study showed that social media is the least trusted news source in the world. Less than 35 percent of adults in Europe considered social networks to be trustworthy in this respect, yet more than 50 percent of adults in Portugal, Poland, Romania, Hungary, Bulgaria, Slovakia and Croatia said that they got their news on social media.
What is clear is that we live in an era where social media is such an enormous part of daily life that consumers will still use it in spite of their doubts or reservations. Concerns about fake news and propaganda on social media have not stopped billions of users accessing their favorite networks on a daily basis.
Most Millennials in the United States use social media for news every day, and younger consumers in European countries are much more likely to use social networks for national political news than their older peers.
Like it or not, reading news on social is fast becoming the norm for younger generations, and this form of news consumption will likely increase further regardless of whether consumers fully trust their chosen network or not.
Do you ever feel like you're being inundated with news from all sides, and you can't keep up? Well, you're not alone. In today's age of social media and 24-hour news cycles, it can be difficult to know what's going on in the world. And with so many different news sources to choose from, it can be hard to know who to trust.
That's where this dataset comes in. It captures data on individuals' sentiment toward different news sources. The data was collected by administering a survey to individuals who use different news sources; the survey responses were then analyzed to obtain a sentiment score for each news source.
So if you're feeling overwhelmed by the news, don't worry – this dataset has you covered. With its insights into which news sources are trustworthy and which ones aren't, you'll be able to make informed decisions about what to read – and what to skip.
The Twitter Sentiment Analysis dataset can be used to analyze the impact of social media on news consumption. It can be used to study how individuals' sentiments toward different news sources vary based on the source they use, and how factors such as the time of day or the topic of the news affect an individual's sentiments.
File: news.csv

| Column name | Description |
|:------------|:------------|
| Title | The title of the news article. (String) |
| Date | The date the news article was published. (Date) |
| Time | The time the news article was published. (Time) |
| Score | The sentiment score of the news article. (Float) |
| Number of Comments | The number of comments on the news article. (Integer) |

File: news_api.csv

| Column name | Description |
|:------------|:------------|
| Title | The title of the news article. (String) |
| Date | The date the news article was published. (Date) |
| Source | The news source the article is from. (String) |

Files: politics.csv, sports.csv, television.csv (each uses the same columns as news.csv: Title, Date, Time, Score, Number of Comments)
File: trending.csv | Column name | Description ...
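A minimal sketch of how the files above might be combined, assuming news.csv and news_api.csv can be joined on Title and Date to attach a Source to each scored article; that join key is an assumption about how the files relate, not something stated in the description.

```python
# A minimal sketch for combining the files described above. Joining news.csv with
# news_api.csv on Title and Date is an assumption about how the files relate.
import pandas as pd

news = pd.read_csv("news.csv", parse_dates=["Date"])
sources = pd.read_csv("news_api.csv", parse_dates=["Date"])

# Attach a news source to each scored article where Title and Date match.
scored = news.merge(sources, on=["Title", "Date"], how="inner")

# Average sentiment score per source.
print(scored.groupby("Source")["Score"].mean().sort_values())
```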
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
Social media has come to dominate the Internet: people use it to get the latest news, find useful resources, meet partners, and much more. In a world where social media plays a big role in delivering news, we must also recognize that news which stirs our emotions spreads like wildfire. Based on the headline, the title, the publication date, and the social media platforms, the task is to predict the sentiment scores, i.e., the columns "SentimentTitle" and "SentimentHeadline".
This is a subset of the dataset of the same name available in the UCI Machine Learning Repository. The collected data relates to a period of 8 months, between November 2015 and July 2016, accounting for about 100,000 news items on four different topics: economy, microsoft, obama and palestine.
The attributes of the dataset are:
- IDLink (numeric): Unique identifier of the news item
- Title (string): Title of the news item according to the official media sources
- Headline (string): Headline of the news item according to the official media sources
- Source (string): Original news outlet that published the news item
- Topic (string): Query topic used to obtain the items from the official media sources
- Publish-Date (timestamp): Date and time of the news item's publication
- Facebook (numeric): Final value of the news item's popularity on Facebook
- Google-Plus (numeric): Final value of the news item's popularity on Google+
- LinkedIn (numeric): Final value of the news item's popularity on LinkedIn
- SentimentTitle (float): Sentiment score of the title; the higher the score, the more positive the sentiment, and vice versa. (Target variable 1)
- SentimentHeadline (float): Sentiment score of the text in the news item's headline; the higher the score, the more positive the sentiment. (Target variable 2)
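A minimal baseline sketch for the prediction task, not an official solution: it assumes the data sits in a hypothetical file news_popularity.csv with the columns above and predicts SentimentTitle from the title text alone.

```python
# A minimal baseline sketch (not the official solution). "news_popularity.csv" is a
# hypothetical file name; it is assumed to contain the columns listed above.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

df = pd.read_csv("news_popularity.csv").dropna(subset=["Title", "SentimentTitle"])

X_train, X_test, y_train, y_test = train_test_split(
    df["Title"], df["SentimentTitle"], test_size=0.2, random_state=42
)

# Bag-of-words regression: TF-IDF features of the title feed a ridge regressor.
model = make_pipeline(TfidfVectorizer(min_df=5, ngram_range=(1, 2)), Ridge(alpha=1.0))
model.fit(X_train, y_train)

print("MAE on held-out titles:", mean_absolute_error(y_test, model.predict(X_test)))
```

The same pipeline can be refit on the Headline column to produce a SentimentHeadline baseline.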
How many people use social media?
Social media usage is one of the most popular online activities. In 2024, over five billion people were using social media worldwide, a number projected to increase to over six billion in 2028.
Who uses social media?
Social networking is one of the most popular digital activities worldwide, and it is no surprise that social networking penetration across all regions is constantly increasing. As of January 2023, the global social media usage rate stood at 59 percent. This figure is anticipated to grow as less developed digital markets catch up with other regions in terms of infrastructure development and the availability of cheap mobile devices. In fact, most of social media's global growth is driven by the increasing usage of mobile devices. The mobile-first market of Eastern Asia topped the global ranking of mobile social networking penetration, followed by established digital powerhouses such as the Americas and Northern Europe.
How much time do people spend on social media?
Social media is an integral part of daily internet usage. On average, internet users spend 151 minutes per day on social media and messaging apps, an increase of 40 minutes since 2015. Internet users in Latin America had the highest average time spent per day on social media.
What are the most popular social media platforms?
Market leader Facebook was the first social network to surpass one billion registered accounts and currently boasts approximately 2.9 billion monthly active users, making it the most popular social network worldwide. In June 2023, the top social media apps in the Apple App Store included mobile messaging apps WhatsApp and Telegram Messenger, as well as the ever-popular app version of Facebook.
This dataset contains simulated data for social media users' demographics, behaviors, and perceptions related to political content. It includes features such as age, gender, education level, occupation, social media usage frequency, exposure to political content, and perceptions of accuracy and relevance.
The features included in the "Social Media Political Content Analysis Dataset":
Facebook received 73,390 user data requests from federal agencies and courts in the United States during the second half of 2023. The social network produced some user data in 88.84 percent of requests from U.S. federal authorities. The United States accounts for the largest share of Facebook user data requests worldwide.
How much time do people spend on social media?
As of 2024, the average daily social media usage of internet users worldwide amounted to 143 minutes per day, down from 151 minutes in the previous year. Currently, the country with the most time spent on social media per day is Brazil, with online users spending an average of three hours and 49 minutes on social media each day. In comparison, the daily time spent with social media in the U.S. was just two hours and 16 minutes.
Global social media usage
Currently, the global social network penetration rate is 62.3 percent. Northern Europe had an 81.7 percent social media penetration rate, topping the ranking of global social media usage by region. Eastern and Middle Africa closed the ranking with 10.1 and 9.6 percent usage reach, respectively.
People access social media for a variety of reasons. Users like to find funny or entertaining content and enjoy sharing photos and videos with friends, but mainly use social media to stay in touch with friends and keep up with current events.
Global impact of social media
Social media has a wide-reaching and significant impact on not only online activities but also offline behavior and life in general.
During a global online user survey in February 2019, a significant share of respondents stated that social media had increased their access to information, ease of communication, and freedom of expression. On the flip side, respondents also felt that social media had worsened their personal privacy, increased polarization in politics, and heightened everyday distractions.
Context
Social media is a vast pool of content, and among everything available for users to access, news is accessed most frequently. News items can be posted by politicians, news channels, newspaper websites, or even ordinary citizens. These posts have to be checked for authenticity, since spreading misinformation is a real concern today, and many organizations are taking steps to make people aware of the consequences of spreading misinformation. The authenticity of news posted online is hard to establish definitively, since manual classification of news is tedious, time-consuming, and subject to bias.
Content
Data preprocessing has been performed on the Getting Real about Fake News dataset, and class skew has been eliminated.
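The description does not say how the skew was removed; one common approach is random downsampling of the majority class, sketched below under the assumption of a hypothetical CSV export with text and label columns.

```python
# A minimal class-balancing sketch: one plausible way to "eliminate skew", not
# necessarily the preprocessing used here. File and column names are assumptions.
import pandas as pd

df = pd.read_csv("getting_real_about_fake_news.csv")  # hypothetical export

# Downsample every class to the size of the smallest class.
min_count = df["label"].value_counts().min()
balanced = (
    df.groupby("label", group_keys=False)
      .apply(lambda g: g.sample(n=min_count, random_state=42))
      .reset_index(drop=True)
)
print(balanced["label"].value_counts())
```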
Data Access: The data in the research collection provided may only be used for research purposes. Portions of the data are copyrighted and have commercial value as data, so you must be careful to use them only for research purposes. Due to these restrictions, the collection is not open data. Please download the Agreement at Data Sharing Agreement and send the signed form to fakenewstask@gmail.com.
Citation
Please cite our work as
@article{shahi2021overview,
title={Overview of the CLEF-2021 CheckThat! lab task 3 on fake news detection},
author={Shahi, Gautam Kishore and Stru{\ss}, Julia Maria and Mandl, Thomas},
journal={Working Notes of CLEF},
year={2021}
}
Problem Definition: Given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other (e.g., claims in dispute) and detect the topical domain of the article. This task will run in English.
Subtask 3A: Multi-class fake news detection of news articles (English). Subtask 3A frames fake news detection as a four-class classification problem. The training data will be released in batches and comprises roughly 900 articles with their respective labels. Given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other. Our definitions for the categories are as follows:
False - The main claim made in an article is untrue.
Partially False - The main claim of an article is a mixture of true and false information. The article contains partially true and partially false information but cannot be considered 100% true. It includes all articles in categories like partially false, partially true, mostly true, miscaptioned, misleading etc., as defined by different fact-checking services.
True - This rating indicates that the primary elements of the main claim are demonstrably true.
Other - An article that cannot be categorised as true, false, or partially false due to a lack of evidence about its claims. This category includes articles in dispute and unproven articles.
Subtask 3B: Topical Domain Classification of News Articles (English). Fact-checkers require background expertise to identify the truthfulness of an article, and categorisation helps to automate the sampling process from a stream of data. Given the text of a news article, determine its topical domain (English). This is a classification problem: the task is to categorise fake news articles into six topical categories (e.g., health, election, crime, climate, education). This task will be offered for a subset of the data of Subtask 3A.
Input Data
The data will be provided in the format of Id, title, text, rating, and domain; the description of the columns is as follows:
Task 3a
Task 3b
Output data format
Task 3a
Sample File
public_id, predicted_rating
1, false
2, true
Task 3b
Sample file
public_id, predicted_domain
1, health
2, crime
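A minimal sketch of writing submission files in the sample format shown above; the prediction values are placeholders and the output file names are assumptions.

```python
# A minimal sketch for writing predictions in the sample format shown above.
# The predictions are placeholders; the output file names are assumptions.
import csv

task3a_predictions = [(1, "false"), (2, "true")]    # (public_id, predicted_rating)
task3b_predictions = [(1, "health"), (2, "crime")]  # (public_id, predicted_domain)

with open("subtask_3a_predictions.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["public_id", "predicted_rating"])
    writer.writerows(task3a_predictions)

with open("subtask_3b_predictions.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["public_id", "predicted_domain"])
    writer.writerows(task3b_predictions)
```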
Additional data for Training
To train your model, participants can use additional data in a similar format; some datasets are available on the web. We do not provide the ground truth for those datasets. For testing, we will not use any articles from other datasets. Some possible sources:
IMPORTANT!
Evaluation Metrics
This task is evaluated as a classification task. We will use the F1-macro measure for the ranking of teams. There is a limit of 5 runs (in total, not per day), and only one person from a team is allowed to submit runs.
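For reference, the F1-macro measure can be computed with scikit-learn as sketched below; the label vectors are illustrative placeholders, not task data.

```python
# A minimal macro-averaged F1 sketch for the four-class setting.
from sklearn.metrics import f1_score

labels = ["true", "partially false", "false", "other"]
y_true = ["false", "true", "other", "partially false", "false"]  # placeholder gold labels
y_pred = ["false", "true", "false", "partially false", "other"]  # placeholder predictions

print(f1_score(y_true, y_pred, labels=labels, average="macro"))
```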
Submission Link: https://competitions.codalab.org/competitions/31238
Related Work
Governments may have the capacity to flood social media with fake news, but little is known about the use of flooding by ordinary voters. In this work, we identify 2,107 registered US voters that account for 80% of the fake news shared on Twitter during the 2020 US presidential election by an entire panel of 664,391 voters. We find that supersharers are important members of the network, reaching a sizable 5.2% of registered voters on the platform. Supersharers have a significant overrepresentation of women, older adults, and registered Republicans. Supersharers' massive volume does not seem automated but is rather generated through manual and persistent retweeting. These findings highlight a vulnerability of social media for democracy, where a small group of people distort the political reality for many.
This dataset contains aggregated information necessary to replicate the results reported in our work on Supersharers of Fake News on Twitter while respecting and preserving the privacy expectations of individuals included in the analysis. No individual-level data is provided as part of this dataset. The data collection process that enabled the creation of this dataset leveraged a large-scale panel of registered U.S. voters matched to Twitter accounts. We examined the activity of 664,391 panel members who were active on Twitter during the months of the 2020 U.S. presidential election (August to November 2020, inclusive), and identified a subset of 2,107 supersharers, which are the most prolific sharers of fake news in the panel that together account for 80% of fake news content shared on the platform. We rely on a source-level definition of fake news that uses the manually-labeled list of fake news sites by Grinberg et al. 2019 and an updated list based on NewsGuard ratings (commercial...
# Supersharers of Fake News on Twitter
This repository contains data and code for replication of the results presented in the paper.
The folders are mostly organized by research questions as detailed below. Each folder contains the code and publicly available data necessary for the replication of results. Importantly, no individual-level data is provided as part of this repository. De-identified individual-level data can be attained for IRB-approved uses under the terms and conditions specified in the paper. Once access is granted, the restricted-access data is expected to be located under ./restricted_data.
The folders in this repository are the following:
Code under the preprocessing folder contains the following:
By downloading the data, you agree with the terms & conditions mentioned below:
Data Access: The data in the research collection may only be used for research purposes. Portions of the data are copyrighted and have commercial value as data, so you must be careful to use them only for research purposes.
Summaries, analyses and interpretations of the linguistic properties of the information may be derived and published, provided it is impossible to reconstruct the information from these summaries. You may not try to identify the individuals whose texts are included in this dataset. You may not try to identify the original entry on the fact-checking site. You are not permitted to publish any portion of the dataset besides summary statistics or to share it with anyone else.
We grant you the right to access the collection's content as described in this agreement. You may not otherwise make unauthorised commercial use of, reproduce, prepare derivative works from, distribute copies of, perform, or publicly display the collection or parts of it. You are responsible for keeping and storing the data in a way that others cannot access it. The data is provided free of charge.
Citation
Please cite our work as
@InProceedings{clef-checkthat:2022:task3,
author = {K{\"o}hler, Juliane and Shahi, Gautam Kishore and Stru{\ss}, Julia Maria and Wiegand, Michael and Siegel, Melanie and Mandl, Thomas},
title = "Overview of the {CLEF}-2022 {CheckThat}! Lab Task 3 on Fake News Detection",
year = {2022},
booktitle = "Working Notes of CLEF 2022---Conference and Labs of the Evaluation Forum",
series = {CLEF~'2022},
address = {Bologna, Italy},
}
@article{shahi2021overview,
title={Overview of the CLEF-2021 CheckThat! lab task 3 on fake news detection},
author={Shahi, Gautam Kishore and Stru{\ss}, Julia Maria and Mandl, Thomas},
journal={Working Notes of CLEF},
year={2021}
}
Problem Definition: Given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other (e.g., claims in dispute) and detect the topical domain of the article. This task will run in English and German.
Task 3: Multi-class fake news detection of news articles (English). This subtask frames fake news detection as a four-class classification problem. Given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other. The training data will be released in batches and comprises roughly 1,264 English-language articles with their respective labels. Our definitions for the categories are as follows:
False - The main claim made in an article is untrue.
Partially False - The main claim of an article is a mixture of true and false information. The article contains partially true and partially false information but cannot be considered 100% true. It includes all articles in categories like partially false, partially true, mostly true, miscaptioned, misleading etc., as defined by different fact-checking services.
True - This rating indicates that the primary elements of the main claim are demonstrably true.
Other - An article that cannot be categorised as true, false, or partially false due to a lack of evidence about its claims. This category includes articles in dispute and unproven articles.
Cross-Lingual Task (German)
Along with the multi-class task for the English language, we have introduced a task for a lower-resourced language. We will provide the test data in German. The idea of the task is to use the English data and the concept of transfer learning to build a classification model for German.
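One possible (not prescribed) way to realize this transfer is to encode both the English training articles and the German test articles with a multilingual sentence encoder and fit a simple classifier on the English side only; the encoder checkpoint, file names, and column names below are assumptions.

```python
# A minimal cross-lingual transfer sketch (one possible approach, not an official
# baseline). File names, column names, and the encoder checkpoint are assumptions.
import pandas as pd
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

train = pd.read_csv("task3_english_train.csv")  # hypothetical file name
test = pd.read_csv("task3_german_test.csv")     # hypothetical file name

# Both languages are mapped into a shared embedding space, so a classifier fitted
# on English embeddings can be applied directly to German ones.
X_train = encoder.encode(train["text"].tolist(), show_progress_bar=False)
X_test = encoder.encode(test["text"].tolist(), show_progress_bar=False)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, train["our rating"])
test["predicted_rating"] = clf.predict(X_test)
```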
Input Data
The data will be provided in the format of Id, title, text, rating, and domain; the description of the columns is as follows:
ID- Unique identifier of the news article
Title- Title of the news article
text - Text mentioned inside the news article
our rating - Class of the news article: false, partially false, true, or other
Output data format
public_id- Unique identifier of the news article
predicted_rating- predicted class
Sample File
public_id, predicted_rating
1, false
2, true
IMPORTANT!
We have used data from 2010 to 2022, and the fake news content covers several topics, such as elections and COVID-19.
Baseline: For this task, we have created a baseline system. The baseline system can be found at https://zenodo.org/record/6362498
Related Work
Shahi GK. AMUSED: An Annotation Framework of Multi-modal Social Media Data. arXiv preprint arXiv:2010.00502. 2020 Oct 1. https://arxiv.org/pdf/2010.00502.pdf
G. K. Shahi and D. Nandini, “FakeCovid – a multilingual cross-domain fact check news dataset for covid-19,” in workshop Proceedings of the 14th International AAAI Conference on Web and Social Media, 2020. http://workshop-proceedings.icwsm.org/abstract?id=2020_14
Shahi, G. K., Dirkson, A., & Majchrzak, T. A. (2021). An exploratory study of covid-19 misinformation on twitter. Online Social Networks and Media, 22, 100104. doi: 10.1016/j.osnem.2020.100104
Shahi, G. K., Struß, J. M., & Mandl, T. (2021). Overview of the CLEF-2021 CheckThat! lab task 3 on fake news detection. Working Notes of CLEF.
Nakov, P., Da San Martino, G., Elsayed, T., Barrón-Cedeno, A., Míguez, R., Shaar, S., ... & Mandl, T. (2021, March). The CLEF-2021 CheckThat! lab on detecting check-worthy claims, previously fact-checked claims, and fake news. In European Conference on Information Retrieval (pp. 639-649). Springer, Cham.
Nakov, P., Da San Martino, G., Elsayed, T., Barrón-Cedeño, A., Míguez, R., Shaar, S., ... & Kartal, Y. S. (2021, September). Overview of the CLEF–2021 CheckThat! Lab on Detecting Check-Worthy Claims, Previously Fact-Checked Claims, and Fake News. In International Conference of the Cross-Language Evaluation Forum for European Languages (pp. 264-291). Springer, Cham.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Several media studies have investigated the news consumption of young people and discussed where they get information and what motivates them to consume news. Little is known about the structural factors that influence young people’s news consumption behavior. The aim of this paper is to fill this research gap by focusing on structural factors that play a major role in young people’s news consumption. In a mixed-methods study, we investigated Swiss youth media behavior in news consumption from 2019 to 2020 in Switzerland. The results show that news consumption of people aged 12–20 is determined by three structural factors at home and outside: 1. access to media and internet; 2. regulation by parents and teachers, and 3. routines at home or school. These three factors shape the individual media environment and are related to young people’s news consumption behavior. Changes in news consumption behavior were evident in school transitions where young people not only change teachers and get a new peer group but are often involved in a change of location. These changes can be normative transitions which have an influence on the structural factors of the individual media environment and thus influence the news consumption behavior of young people. Young Swiss people consume news via their smartphones, which are offered to them through news portals, various apps, or via social media feeds, on which they usually come across news by chance and consume it casually in their free time. Structural factors of media environments (i.e., access, regulation, and news consumption routines) play a major role in young people’s news consumption. These structural factors can be influenced by parents, teachers, and peers. For schools in particular, the paradigm that emerges from these findings is to reduce barriers to accessing news content and to rethink certain regulations, and to make recommendations and establish routines that encourage young people to consume news.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Abstract:
Analyzing the spread of information related to a specific event in the news has many potential applications. Consequently, various systems have been developed to facilitate the analysis of information spreading, such as the detection of disease propagation and the identification of the spread of fake news through social media. There are several open challenges in the process of discerning information propagation, among them the lack of resources for training and evaluation. This paper describes the process of compiling a corpus from the EventRegistry global media monitoring system. We focus on information spreading in three domains: sports (i.e., the FIFA World Cup), natural disasters (i.e., earthquakes), and climate change (i.e., global warming). This corpus is a valuable addition to the currently available datasets to examine the spreading of information about various kinds of events.
Introduction:
Domain-specific gaps in information spreading are ubiquitous and may exist due to economic conditions, political factors, or linguistic, geographical, time-zone, cultural, and other barriers. These factors potentially contribute to obstructing the flow of local as well as international news. We believe that there is a lack of research studies that examine, identify, and uncover the reasons for barriers in information spreading. Additionally, there is limited availability of datasets containing news text and metadata including time, place, source, and other relevant information. When a piece of information starts spreading, it implicitly raises questions such as: How far does the information, in the form of news, reach out to the public? Does the content of the news remain the same or change to a certain extent? Do cultural values impact the information, especially when the same news gets translated into other languages?
Statistics about datasets:
| # | Domain | Event Type | Articles Per Language | Total Articles |
|:--|:-------|:-----------|:----------------------|:---------------|
| 1 | Sports | FIFA World Cup | 983-en, 762-sp, 711-de, 10-sl, 216-pt | 2679 |
| 2 | Natural Disaster | Earthquake | 941-en, 999-sp, 937-de, 19-sl, 251-pt | 3194 |
| 3 | Climate Changes | Global Warming | 996-en, 298-sp, 545-de, 8-sl, 97-pt | 1945 |
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
Today, we are producing more information than ever before, but not all information is true. Some of it is actually malicious and harmful, and that makes it harder for us to trust any piece of information we come across. Not only that, bad actors can now use language-modelling tools like OpenAI's GPT-2 to generate fake news too. Ever since its initial release, there has been discussion of how it could be misused for generating misleading news articles, automating the production of abusive or fake content for social media, and automating the creation of spam and phishing content.
How do we figure out what is true and what is fake? Can we do something about it?
The dataset consists of around 387,000 pieces of text sourced from various news articles on the web as well as texts generated by OpenAI's GPT-2 language model.
The dataset is split into train, validation and test such that each of the sets has an equal split of the two classes.
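A minimal sketch of a human-versus-generated text classifier on such a split, assuming hypothetical train.csv and val.csv files with text and label columns (not necessarily the challenge's released format):

```python
# A minimal real-vs-generated text classification sketch. File and column names
# ("train.csv", "val.csv", "text", "label") are assumptions.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline

train = pd.read_csv("train.csv")
val = pd.read_csv("val.csv")

# Character n-grams are a simple, model-agnostic signal for separating generated
# text from human-written text.
model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4), min_df=3),
    LogisticRegression(max_iter=1000),
)
model.fit(train["text"], train["label"])

print("Validation accuracy:", accuracy_score(val["label"], model.predict(val["text"])))
```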
This dataset was published on AIcrowd as part of the KIIT AI (mini) Blitz⚡ Challenge. AI Blitz⚡ is a series of educational challenges by AIcrowd that aims to make it easy for anyone to get started with AI. This particular AI Blitz⚡ challenge was exclusive to the students and faculty of the Kalinga Institute of Industrial Technology.
https://dataverse.harvard.edu/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.7910/DVN/1EMHTK
We demonstrate that the news media causes Americans to take public stands on issues, join national policy conversations, and express themselves publicly more often than they would otherwise, all key components of democratic politics. We recruited 48 mostly small media outlets that allowed us to choose groups of outlets to write and publish articles on subjects we approved and on dates we randomly assigned. We estimate the causal effect on proximal measures, such as website pageviews and Twitter discussion of the articles' specific subjects, and distal ones, such as national Twitter conversation in broad policy areas. Our intervention increased discussion in each broad policy area by approximately 62.7% (relative to a day's volume), accounting for 13,166 additional posts, with similar effects across population subgroups.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The cross-lingual natural disaster dataset includes public tweets collected using Twitter’s public API, filtering by location-related keywords and date, without using any additional filtering (e.g., we did not restrict the query to specific languages). We considered two disaster events and two long-term natural disasters across Europe (floods and wildfires) that received substantial news coverage internationally.
Three of the top languages were common to all the studied events: English (ISO 639-1 code: en), Spanish (es), and French (fr). Additionally, we found hundreds of messages for each event in several other languages, including Arabic (ar), German (de), Japanese (ja), Indonesian (id), Italian (it), and Portuguese (pt).
After collecting the data, we labelled tweets that contained potentially informative factual information. We name this group of tweets "informative messages." Next, we used crowdsourcing to further categorize the messages into various informational categories, asking three different workers to label each informative message across languages. The target categories were based on an ontology from TREC-IS 2018, where we grouped some low-level ontology categories into higher-level ones.
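A minimal sketch of consolidating the three worker labels per message by majority vote, one common way to aggregate such crowdsourced annotations; the tweet IDs and category names below are illustrative, not the released data.

```python
# A minimal majority-vote aggregation sketch for three crowdsourced labels per tweet.
# The tweet IDs and category names are illustrative placeholders.
from collections import Counter

# tweet_id -> labels assigned by the three workers
worker_labels = {
    "tweet_001": ["Report-Location", "Report-Location", "Other"],
    "tweet_002": ["Request-Help", "Report-Location", "Request-Help"],
}

def majority_vote(labels):
    """Return the most common label; ties resolve to the earliest-counted label."""
    return Counter(labels).most_common(1)[0][0]

aggregated = {tid: majority_vote(labels) for tid, labels in worker_labels.items()}
print(aggregated)
```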
The internet and social media have led to a major problem: fake news. Fake news is false information presented as real news, often with the goal of tricking or influencing people. It's difficult to identify fake news because it can look very similar to real news. The Fake News detection dataset deals with the problem indirectly by using tabular summary statistics about each news article to attempt to predict whether the article is real or fake. This dataset is in a tabular format and contains features such as word count, sentence length, unique words, average word length, and a label indicating whether the article is fake or real.
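A minimal sketch of how summary statistics like those listed above can be derived from raw article text; the dataset's exact feature definitions may differ, so this is only an approximation of the idea.

```python
# A minimal sketch of computing summary-statistic features similar to those
# described above; the dataset's exact feature definitions may differ.
import re

def text_features(article: str) -> dict:
    words = re.findall(r"[A-Za-z']+", article)
    sentences = [s for s in re.split(r"[.!?]+", article) if s.strip()]
    return {
        "word_count": len(words),
        "avg_sentence_length": len(words) / max(len(sentences), 1),
        "unique_words": len({w.lower() for w in words}),
        "avg_word_length": sum(len(w) for w in words) / max(len(words), 1),
    }

print(text_features("Fake news spreads fast. Real news takes time to verify."))
```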
The global number of Facebook users was forecast to increase continuously between 2023 and 2027 by a total of 391 million users (+14.36 percent). After a fourth consecutive year of growth, the Facebook user base is estimated to reach 3.1 billion users, a new peak, in 2027. Notably, the number of Facebook users has been increasing continuously over the past years. User figures, shown here for the platform Facebook, have been estimated by taking into account company filings or press material, secondary research, app downloads, and traffic data. They refer to the average monthly active users over the period and count multiple accounts held by one person only once. The data shown are an excerpt of Statista's Key Market Indicators (KMI). The KMI are a collection of primary and secondary indicators on the macro-economic, demographic, and technological environment in up to 150 countries and regions worldwide. All indicators are sourced from international and national statistical offices, trade associations, and the trade press, and they are processed to generate comparable data sets (see supplementary notes under details for more information).
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Many people consume news on social media, yet the production of news items online has come under crossfire due to the common spreading of misinformation. Social media platforms police their content in various ways. Primarily they rely on crowdsourced “flags”: users signal to the platform that a specific news item might be misleading and, if they raise enough of them, the item will be fact-checked. However, real-world data show that the most flagged news sources are also the most popular and – supposedly – reliable ones. In this paper, we show this phenomenon can be explained by the unreasonable assumptions current content policing strategies make about how the online social media environment is shaped. The most realistic assumption is that confirmation bias will prevent a user from flagging a news item if they share the same political bias as the news source producing it. We show, via agent-based simulations, that a model reproducing our current understanding of the social media environment will necessarily result in the most neutral and accurate sources receiving most flags.
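A minimal agent-based sketch of the flagging dynamic described above: users never flag items from sources whose political bias they share, and more neutral sources are assumed to be both more accurate and more popular. All parameters are illustrative assumptions, not the paper's calibrated model.

```python
# A minimal agent-based sketch of confirmation-biased flagging (an illustration of
# the dynamic described above, not the paper's model). All parameters are assumptions.
import random

random.seed(42)

N_USERS, N_SOURCES, N_ITEMS, FLAG_PROB = 2000, 20, 500, 0.1

# Users and sources carry a political bias in [-1, 1]; more neutral sources are
# assumed to be more accurate and to reach a larger share of users.
users = [random.uniform(-1, 1) for _ in range(N_USERS)]
sources = []
for _ in range(N_SOURCES):
    bias = random.uniform(-1, 1)
    sources.append({
        "bias": bias,
        "accuracy": 1.0 - 0.7 * abs(bias),       # neutral -> accurate (assumption)
        "reach": 0.05 + 0.25 * (1 - abs(bias)),  # neutral -> popular (assumption)
        "flags": 0,
    })

for _ in range(N_ITEMS):
    src = random.choice(sources)
    for user_bias in users:
        if random.random() > src["reach"]:
            continue  # this user never sees the item
        if abs(user_bias - src["bias"]) < 0.5:
            continue  # confirmation bias: like-minded users never flag
        if random.random() < FLAG_PROB:
            src["flags"] += 1

# The most-flagged sources tend to be the most neutral (and most accurate) ones.
for src in sorted(sources, key=lambda s: -s["flags"])[:5]:
    print(f"bias={src['bias']:+.2f}  accuracy={src['accuracy']:.2f}  flags={src['flags']}")
```

Even in this toy setting, neutral sources collect the most flags simply because they reach more users who do not share their bias, mirroring the paper's argument.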