100+ datasets found

Social media as a news outlet worldwide 2025
statista.com
Updated Nov 19, 2025
Cite
Statista (2025). Social media as a news outlet worldwide 2025 [Dataset]. https://www.statista.com/statistics/718019/social-media-news-source/
Dataset updated
Nov 19, 2025
Dataset authored and provided by
Statista (http://statista.com/)
Time period covered
Jan 2025 - Feb 2025
Area covered
Worldwide
Description
During a 2025 survey, ** percent of respondents from Nigeria stated that they used social media as a source of news. In comparison, just ** percent of Japanese respondents said the same. Large portions of social media users around the world admit that they do not trust social platforms either as media sources or as a way to get news, and yet they continue to access such networks on a daily basis.

Social media: trust and consumption

Despite the majority of adults surveyed in each country reporting that they used social networks to keep up to date with news and current affairs, a 2018 study showed that social media is the least trusted news source in the world. Less than ** percent of adults in Europe considered social networks to be trustworthy in this respect, yet more than ** percent of adults in Portugal, Poland, Romania, Hungary, Bulgaria, Slovakia and Croatia said that they got their news on social media.

What is clear is that we live in an era where social media is such an enormous part of daily life that consumers will still use it in spite of their doubts or reservations. Concerns about fake news and propaganda on social media have not stopped billions of users accessing their favorite networks on a daily basis. Most Millennials in the United States use social media for news every day, and younger consumers in European countries are much more likely to use social networks for national political news than their older peers. Like it or not, reading news on social is fast becoming the norm for younger generations, and this form of news consumption will likely increase further regardless of whether consumers fully trust their chosen network or not.
News Datasets
brightdata.com
.json, .csv, .xlsx
Cite
Bright Data, News Datasets [Dataset]. https://brightdata.com/products/datasets/news
Available download formats: .json, .csv, .xlsx
Dataset authored and provided by
Bright Data (https://brightdata.com/)
License
https://brightdata.com/license
Area covered
Worldwide
Description
Stay ahead with our comprehensive News Dataset, designed for businesses, analysts, and researchers to track global events, monitor media trends, and extract valuable insights from news sources worldwide.

Dataset Features

News Articles: Access structured news data, including headlines, summaries, full articles, publication dates, and source details. Ideal for media monitoring and sentiment analysis.

Publisher & Source Information: Extract details about news publishers, including domain, region, and credibility indicators.

Sentiment & Topic Classification: Analyze news sentiment, categorize articles by topic, and track emerging trends in real time.

Historical & Real-Time Data: Retrieve historical archives or access continuously updated news feeds for up-to-date insights.

Customizable Subsets for Specific Needs

Our News Dataset is fully customizable, allowing you to filter data based on publication date, region, topic, sentiment, or specific news sources. Whether you need broad coverage for trend analysis or focused data for competitive intelligence, we tailor the dataset to your needs.

Popular Use Cases

Media Monitoring & Reputation Management: Track brand mentions, analyze media coverage, and assess public sentiment.

Market & Competitive Intelligence: Monitor industry trends, competitor activity, and emerging market opportunities.

AI & Machine Learning Training: Use structured news data to train AI models for sentiment analysis, topic classification, and predictive analytics.

Financial & Investment Research: Analyze news impact on stock markets, commodities, and economic indicators.

Policy & Risk Analysis: Track regulatory changes, geopolitical events, and crisis developments in real time.

Whether you're analyzing market trends, monitoring brand reputation, or training AI models, our News Dataset provides the structured data you need. Get started today and customize your dataset to fit your business objectives.
Most popular news platforms in the U.S. 2022, by age group
statista.com
Updated Nov 27, 2025
Cite
Statista (2025). Most popular news platforms in the U.S. 2022, by age group [Dataset]. https://www.statista.com/statistics/717651/most-popular-news-platforms/
Dataset updated
Nov 27, 2025
Dataset authored and provided by
Statista (http://statista.com/)
Time period covered
Aug 11, 2022 - Aug 17, 2022
Area covered
United States
Description
Social media was by far the most popular news platform among 18 to 34-year-olds in the United States, with 47 percent of respondents to a survey held in August 2022 saying that they used social networks for news on a daily basis. By comparison, adults over 65 years old mostly used network news to keep up to date.

The decline of newspapers

In the past, the reasons to regularly go out and purchase a print newspaper were many. Used not only for news but also apartment hunting, entertainment, and job searches (among other things), newspapers once served multiple purposes. This is no longer the case, with first television and then the internet taking care of consumer needs once covered by printed papers. Indeed, the paid circulation of daily weekday newspapers in the United States has fallen dramatically since the 1980s with no sign of future improvement.

News consumption habits

A survey on news consumption by gender found that 50 percent of women use either online-only news sites or social media for news each day, and 51 percent of male respondents said the same. Social media was by far the most used daily news platform among U.S. Millennials, and the same was true of Gen Z. One appeal of online news is that it often comes at no cost to the consumer. Paying for news found via digital outlets is not yet commonplace in the United States, with only 21 percent of U.S. consumers responding to a study held in early 2021 reporting having paid for online news content in the last year.
UK news headlines
kaggle.com
zip
Updated Jul 5, 2023
Cite
DeXmaSa (2023). UK news headlines [Dataset]. https://www.kaggle.com/datasets/dexmasa/uk-news-headlines
Available download formats: zip (1318144 bytes)
Dataset updated
Jul 5, 2023
Authors
DeXmaSa
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Area covered
United Kingdom
Description
This dataset contains the headlines generated by the top 15 UK news websites over a time span of roughly 20 days. The headlines were scraped from the sites' respective RSS feeds.

Time frame: 2023-02-13 to 2023-03-05

Headlines were scraped at 12-hour intervals.

The dataset consists of two files:

The scraped data consisting of the headlines

Auxiliary data containing additional information for each of the news outlets

Identification of the top 15 news websites in the UK: statista.com

Data dictionary for scraped data:

website: BBC, Sun, Mirror, Daily Mail, Independent, Telegraph, Guardian, Manchester Evening News, Sky News, Metro, Daily Express, Times, Liverpool Echo, Birmingham Live, Evening Standard.

timestamp scraped: Date and time when a particular headline was scraped.

headline: Headline of news article.

Data dictionary for compiled auxiliary data:

website: BBC, Sun, Mirror, Daily Mail, Independent, Telegraph, Guardian, Manchester Evening News, Sky News, Metro, Daily Express, Times, Liverpool Echo, Birmingham Live, Evening Standard.

RSS URL: URL to RSS feed for each of the above websites.

visitors unique monthly: In millions; taken from statista.com.

ownership: Entity owning a particular news outlet and thus the associated website. Source: General internet search.

political bias: left-center, center, right-center, right. Source: General internet search.

party support GE 2019: None, Conservative, Labour, Unknown. Source: General internet search. GE = general election.

journalism style: quality, tabloid. Source: General internet search.
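
The two data dictionaries above share the website field, so the scraped headlines can be joined to the auxiliary outlet data on that key. The following is a minimal sketch of such a join, not the dataset author's code: the in-memory CSV text, the placeholder outlet names, and the exact column headers are assumptions made for illustration, and the real file names are not given in the description.

```python
import csv
import io
from collections import Counter

# Tiny in-memory stand-ins for the two files. The outlet names are
# placeholders and the column headers are assumed from the data dictionary.
SCRAPED = """website,timestamp scraped,headline
Outlet A,2023-02-13 08:00:00,Example headline one
Outlet B,2023-02-13 08:00:00,Example headline two
Outlet A,2023-02-13 20:00:00,Example headline three
"""

AUXILIARY = """website,journalism style
Outlet A,quality
Outlet B,tabloid
"""


def load_rows(text):
    """Parse CSV text into a list of dictionaries keyed by column header."""
    return list(csv.DictReader(io.StringIO(text)))


# Index the auxiliary rows by website so each headline can look up its outlet.
aux_by_site = {row["website"]: row for row in load_rows(AUXILIARY)}

# Count scraped headlines per journalism style.
headlines_per_style = Counter()
for row in load_rows(SCRAPED):
    outlet = aux_by_site.get(row["website"])
    style = outlet["journalism style"] if outlet else "unknown"
    headlines_per_style[style] += 1

print(dict(headlines_per_style))
```

The same lookup pattern would apply to the other auxiliary fields, such as ownership or political bias, once the real headers are confirmed against the downloaded files.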

Image credit: https://unsplash.com/@siora18
Leading social networks used for news in the U.S. 2019-2025
statista.com
Updated Jul 4, 2025
Cite
Statista (2025). Leading social networks used for news in the U.S. 2019-2025 [Dataset]. https://www.statista.com/statistics/444708/social-networks-used-for-news-usa/
Dataset updated
Jul 4, 2025
Dataset authored and provided by
Statista (http://statista.com/)
Area covered
United States
Description
In 2025, Facebook remained the most-used social platform for news in the United States, with ** percent of respondents reporting they accessed news on it. YouTube followed closely at ** percent, recording a slight increase from the previous year. X (formerly Twitter) saw the most notable growth, rising by ***** percent to ** percent.
ARABIC NEWS DATASET - RESULTS FROM WEB SCRAPING
kaggle.com
zip
Updated Apr 15, 2024
Cite
Elaaatif (2024). ARABIC NEWS DATASET - RESULTS FROM WEB SCRAPING [Dataset]. https://www.kaggle.com/datasets/latif8/arabic-news-dataset-results-from-web-scraping
Available download formats: zip (10472746 bytes)
Dataset updated
Apr 15, 2024
Authors
Elaaatif
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The dataset obtained from web scraping encompasses a diverse set of news articles from prominent sources: Al Jazeera, BBC News Arabic, Fatabyyano, Verify-Sy and matsda2sh. Each article provides unique insights into various topics, ranging from global politics and current affairs to health, culture, and technology. The dataset offers a comprehensive snapshot of contemporary news coverage, allowing for in-depth analysis and exploration of different perspectives. With detailed information on article titles, categories, publication dates, and content, researchers and analysts can gain valuable insights into Arabic media trends, public discourse, and societal issues.
news-bias-full-data
huggingface.co
Updated Oct 25, 2023
Cite
News Media Biases (2023). news-bias-full-data [Dataset]. https://huggingface.co/datasets/newsmediabias/news-bias-full-data
Dataset updated
Oct 25, 2023
Dataset authored and provided by
News Media Biases
Description
Please access the latest version of the data here: https://huggingface.co/datasets/shainar/BEAD

Email shaina.raza@torontomu.ca for usage of the data.

Please cite us if you use it:

@article{raza2024beads,
  title={BEADs: Bias Evaluation Across Domains},
  author={Raza, Shaina and Rahman, Mizanur and Zhang, Michael R},
  journal={arXiv preprint arXiv:2406.04220},
  year={2024}
}

license: cc-by-nc-4.0

language: en

pretty_name: Navigating News… See the full description on the dataset page: https://huggingface.co/datasets/newsmediabias/news-bias-full-data.
How news consumption affects adults in the U.S. 2025
statista.com
Updated Sep 12, 2025
Cite
Amy Watson (2025). How news consumption affects adults in the U.S. 2025 [Dataset]. https://www.statista.com/topics/3251/fake-news/
Dataset updated
Sep 12, 2025
Dataset provided by
Statista (http://statista.com/)
Authors
Amy Watson
Area covered
United States
Description
In May 2025, a survey asked U.S. adults how they feel while consuming news. The results indicate that a majority feel informed, with 53 percent saying that news generally makes them feel this way. At the same time, 43 percent reported feeling angry, and 32 percent said they feel depressed when consuming news. In contrast, only 16 percent described feeling hopeful. These findings highlight that while staying informed is a major benefit of news consumption, negative emotional reactions—such as anger and depression—are also very common among Americans.
BBC News Dataset – February 2023 Edition
crawlfeeds.com
csv, zip
Updated Jun 14, 2025
Cite
Crawl Feeds (2025). BBC News Dataset – February 2023 Edition [Dataset]. https://crawlfeeds.com/datasets/bbc-news-dataset-feb-2023
Available download formats: zip, csv
Dataset updated
Jun 14, 2025
Dataset authored and provided by
Crawl Feeds
License
https://crawlfeeds.com/privacy_policy
Description
Get access to a comprehensive and structured dataset of BBC News articles, freshly crawled and compiled in February 2023. This collection includes 1 million records from one of the world’s most trusted news organizations — perfect for training NLP models, sentiment analysis, and trend detection across global topics.

💾 Format: CSV (available in ZIP archive)

📢 Status: Published and available for immediate access

Use Cases

Train language models to summarize or categorize news

Detect media bias and compare narrative framing

Conduct research in journalism, politics, and public sentiment

Enrich news aggregation platforms with clean metadata

Analyze content distribution across categories (e.g. health, politics, tech)

This dataset ensures reliable and high-quality information sourced from a globally respected outlet. The format is optimized for quick ingestion into your pipelines — with clean text, timestamps, image links, and more.

Need a filtered dataset or want this refreshed for a later date? We offer on-demand news scraping as well.

Fake News Detection Dataset
kaggle.com
zip
Updated Apr 27, 2025
Cite
Mahdi Mashayekhi (2025). Fake News Detection Dataset [Dataset]. https://www.kaggle.com/datasets/mahdimashayekhi/fake-news-detection-dataset
Available download formats: zip (11735585 bytes)
Dataset updated
Apr 27, 2025
Authors
Mahdi Mashayekhi
License
MIT License, https://opensource.org/licenses/MIT
License information was derived automatically
Description
📚 Fake News Detection Dataset

Overview

This dataset is designed for practicing fake news detection using machine learning and natural language processing (NLP) techniques. It includes a rich collection of 20,000 news articles, carefully generated to simulate real-world data scenarios. Each record contains metadata about the article and a label indicating whether the news is real or fake.

The dataset also intentionally includes around 5% missing values in some fields to simulate the challenges of handling incomplete data in real-life projects.

Columns Description

title: A short headline summarizing the article (around 6 words).

text: The body of the news article (200–300 words on average).

date: The publication date of the article, randomly selected over the past 3 years.

source: The media source that published the article (e.g., BBC, CNN, Al Jazeera). May contain missing values (~5%).

author: The author's full name. Some entries are missing (~5%) to simulate real-world incomplete data.

category: The general category of the article (e.g., Politics, Health, Sports, Technology).

label: The target label: real or fake news.

Why Use This Dataset?

Fake News Detection Practice: Perfect for binary classification tasks.

NLP Preprocessing: Allows users to practice text cleaning, tokenization, vectorization, etc.

Handling Missing Data: Some fields are incomplete to simulate real-world data challenges.

Feature Engineering: Encourages creating new features from text and metadata.

Balanced Labels: Realistic distribution of real and fake news for fair model training.

Potential Use Cases

Building and evaluating text classification models (e.g., Logistic Regression, Random Forests, XGBoost).

Practicing NLP techniques like TF-IDF, Word2Vec, BERT embeddings.

Performing exploratory data analysis (EDA) on news data.

Developing pipelines for dealing with missing values and feature extraction.

A Note on Data Quality

This dataset has been synthetically generated to closely resemble real news articles. The diversity in titles, text, sources, and categories ensures that models trained on this dataset can generalize well to unseen, real-world data. However, since it is synthetic, it should not be used for production models or decision-making without careful validation.

File Info

Filename: fake_news_dataset.csv

Size: 20,000 rows × 7 columns

Missing Data: ~5% missing values in the source and author columns.
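
Since the description stresses missing values in the source and author columns, a first step is usually to measure how much is missing and pick a fill policy. The sketch below illustrates this with the standard library only. The four sample rows are invented for illustration, the column names follow the description above, and the idea that missing values appear as empty strings is an assumption about how fake_news_dataset.csv encodes them.

```python
import csv
import io

# Invented sample rows; only the column names come from the description.
SAMPLE = """title,text,date,source,author,category,label
Sample title one,Body text one,2024-01-05,BBC,Jane Doe,Politics,real
Sample title two,Body text two,2023-11-20,,John Roe,Health,fake
Sample title three,Body text three,2022-08-14,CNN,,Sports,real
Sample title four,Body text four,2024-03-02,Al Jazeera,Ann Poe,Technology,fake
"""


def missing_share(rows, column):
    """Return the fraction of rows whose value in `column` is empty."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if not (row.get(column) or "").strip())
    return missing / len(rows)


rows = list(csv.DictReader(io.StringIO(SAMPLE)))

# Report the share of missing values before filling anything.
shares = {column: missing_share(rows, column) for column in ("source", "author")}
print(shares)

# One simple policy: replace empty values with an explicit placeholder so
# downstream code can treat "unknown" as its own category.
for row in rows:
    for column in ("source", "author"):
        if not row[column].strip():
            row[column] = "unknown"
```

On the real file, the reported shares should land near the stated 5% if the empty-string assumption holds. A placeholder category is only one option, and dropping rows or imputing from other fields may suit some models better.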
News Events Data in Asia ( Techsalerator)
datarade.ai
Updated Jul 9, 2024
Cite
Techsalerator (2024). News Events Data in Asia ( Techsalerator) [Dataset]. https://datarade.ai/data-products/news-events-data-in-asia-techsalerator-techsalerator
Available download formats: .json, .csv, .xls, .txt
Dataset updated
Jul 9, 2024
Dataset provided by
Techsalerator LLC
Authors
Techsalerator
Area covered
United Arab Emirates, Timor-Leste, Kyrgyzstan, Brunei Darussalam, Kazakhstan, Uzbekistan, Iran (Islamic Republic of), Maldives, China, Hong Kong
Description
Techsalerator’s News Event Data in Asia offers a detailed and expansive dataset designed to provide businesses, analysts, journalists, and researchers with comprehensive insights into significant news events across the Asian continent. This dataset captures and categorizes major events reported from a diverse range of news sources, including press releases, industry news sites, blogs, and PR platforms, offering valuable perspectives on regional developments, economic shifts, political changes, and cultural occurrences.

Key Features of the Dataset:

Extensive Coverage: The dataset aggregates news events from a wide range of sources such as company press releases, industry-specific news outlets, blogs, PR sites, and traditional media. This broad coverage ensures a diverse array of information from multiple reporting channels.

Categorization of Events: News events are categorized into various types including business and economic updates, political developments, technological advancements, legal and regulatory changes, and cultural events. This categorization helps users quickly find and analyze information relevant to their interests or sectors.

Real-Time Updates: The dataset is updated regularly to include the most current events, ensuring users have access to the latest news and can stay informed about recent developments as they happen.

Geographic Segmentation: Events are tagged with their respective countries and regions within Asia. This geographic segmentation allows users to filter and analyze news events based on specific locations, facilitating targeted research and analysis.

Event Details: Each event entry includes comprehensive details such as the date of occurrence, source of the news, a description of the event, and relevant keywords. This thorough detailing helps users understand the context and significance of each event.

Historical Data: The dataset includes historical news event data, enabling users to track trends and perform comparative analysis over time. This feature supports longitudinal studies and provides insights into the evolution of news events.

Advanced Search and Filter Options: Users can search and filter news events based on criteria such as date range, event type, location, and keywords. This functionality allows for precise and efficient retrieval of relevant information.

Asian Countries and Territories Covered:

Central Asia: Kazakhstan, Kyrgyzstan, Tajikistan, Turkmenistan, Uzbekistan

East Asia: China, Hong Kong (Special Administrative Region of China), Japan, Mongolia, North Korea, South Korea, Taiwan

South Asia: Afghanistan, Bangladesh, Bhutan, India, Maldives, Nepal, Pakistan, Sri Lanka

Southeast Asia: Brunei, Cambodia, East Timor (Timor-Leste), Indonesia, Laos, Malaysia, Myanmar (Burma), Philippines, Singapore, Thailand, Vietnam

Western Asia (Middle East): Armenia, Azerbaijan, Bahrain, Cyprus, Georgia, Iraq, Israel, Jordan, Kuwait, Lebanon, Oman, Palestine, Qatar, Saudi Arabia, Syria, Turkey (partly in Europe, but often included in Asia contextually), United Arab Emirates, Yemen

Benefits of the Dataset:

Strategic Insights: Businesses and analysts can use the dataset to gain insights into significant regional developments, economic conditions, and political changes, aiding in strategic decision-making and market analysis.

Market and Industry Trends: The dataset provides valuable information on industry-specific trends and events, helping users understand market dynamics and identify emerging opportunities.

Media and PR Monitoring: Journalists and PR professionals can track relevant news across Asia, enabling them to monitor media coverage, identify emerging stories, and manage public relations efforts effectively.

Academic and Research Use: Researchers can utilize the dataset for longitudinal studies, trend analysis, and academic research on various topics related to Asian news and events.

Techsalerator’s News Event Data in Asia is a crucial resource for accessing and analyzing significant news events across the continent. By offering detailed, categorized, and up-to-date information, it supports effective decision-making, research, and media monitoring across diverse sectors.
CT-FAN: A Multilingual dataset for Fake News Detection
data.niaid.nih.gov
zenodo.org
Updated Oct 23, 2022
Cite
Gautam Kishore Shahi; Julia Maria Struß; Thomas Mandl; Juliane Köhler; Michael Wiegand; Melanie Siegel (2022). CT-FAN: A Multilingual dataset for Fake News Detection [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4714516
Dataset updated
Oct 23, 2022
Dataset provided by
University of Klagenfurt
University of Duisburg-Essen
University of Hildesheim
Darmstadt University of Applied Sciences
University of Applied Sciences Potsdam
Authors
Gautam Kishore Shahi; Julia Maria Struß; Thomas Mandl; Juliane Köhler; Michael Wiegand; Melanie Siegel
Description
By downloading the data, you agree with the terms & conditions mentioned below:

Data Access: The data in the research collection may only be used for research purposes. Portions of the data are copyrighted and have commercial value as data, so you must be careful to use them only for research purposes.

Summaries, analyses and interpretations of the linguistic properties of the information may be derived and published, provided it is impossible to reconstruct the information from these summaries. You may not try identifying the individuals whose texts are included in this dataset. You may not try to identify the original entry on the fact-checking site. You are not permitted to publish any portion of the dataset besides summary statistics or share it with anyone else.

We grant you the right to access the collection's content as described in this agreement. You may not otherwise make unauthorised commercial use of, reproduce, prepare derivative works, distribute copies, perform, or publicly display the collection or parts of it. You are responsible for keeping and storing the data in a way that others cannot access. The data is provided free of charge.

Citation

Please cite our work as

@InProceedings{clef-checkthat:2022:task3,
  author = {K{\"o}hler, Juliane and Shahi, Gautam Kishore and Stru{\ss}, Julia Maria and Wiegand, Michael and Siegel, Melanie and Mandl, Thomas},
  title = "Overview of the {CLEF}-2022 {CheckThat}! Lab Task 3 on Fake News Detection",
  year = {2022},
  booktitle = "Working Notes of CLEF 2022---Conference and Labs of the Evaluation Forum",
  series = {CLEF~'2022},
  address = {Bologna, Italy},
}

@article{shahi2021overview,
  title={Overview of the CLEF-2021 CheckThat! lab task 3 on fake news detection},
  author={Shahi, Gautam Kishore and Stru{\ss}, Julia Maria and Mandl, Thomas},
  journal={Working Notes of CLEF},
  year={2021}
}

Problem Definition: Given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other (e.g., claims in dispute) and detect the topical domain of the article. This task will run in English and German.

Task 3: Multi-class fake news detection of news articles (English)

Sub-task A frames fake news detection as a four-class classification problem: given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other. The training data will be released in batches and comprises roughly 1,264 English-language articles with their respective labels. Our definitions for the categories are as follows:

False - The main claim made in an article is untrue.

Partially False - The main claim of an article is a mixture of true and false information. The article contains partially true and partially false information but cannot be considered 100% true. It includes all articles in categories like partially false, partially true, mostly true, miscaptioned, misleading etc., as defined by different fact-checking services.

True - This rating indicates that the primary elements of the main claim are demonstrably true.

Other - An article that cannot be categorised as true, false, or partially false due to a lack of evidence about its claims. This category includes articles in dispute and unproven articles.
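
The category definitions above imply a many-to-one mapping from the verdicts used by different fact-checking services to the four task classes. The sketch below shows one way such a mapping could be written. It is not the organisers' code: the verdict spellings in the table are taken from the wording of the definitions, and real fact-checking sites may phrase them differently. Unknown verdicts raise an error rather than being silently assigned to a class.

```python
# Hypothetical lookup table built from the category definitions above.
RATING_MAP = {
    "false": "false",
    "true": "true",
    "partially false": "partially false",
    "partially true": "partially false",
    "mostly true": "partially false",
    "miscaptioned": "partially false",
    "misleading": "partially false",
    "in dispute": "other",
    "unproven": "other",
}


def normalise_rating(raw):
    """Map a fact-checker verdict onto one of the four task classes."""
    # Lowercase and collapse whitespace so "Mostly  True" matches "mostly true".
    key = " ".join(raw.lower().split())
    if key not in RATING_MAP:
        raise ValueError(f"no mapping defined for verdict {raw!r}")
    return RATING_MAP[key]


print(normalise_rating("Mostly True"))
print(normalise_rating("unproven"))
```

Failing loudly on unmapped verdicts keeps labelling decisions explicit, which matters because the boundary between "partially false" and "other" depends on each fact-checker's own definitions.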

Cross-Lingual Task (German)

Along with the multi-class task for the English language, we have introduced a task for a low-resource language. We will provide the test data in German. The idea of the task is to use the English data and the concept of transfer to build a classification model for the German language.

Input Data

The data will be provided in the format of Id, title, text, rating, the domain; the description of the columns is as follows:

ID - Unique identifier of the news article

Title - Title of the news article

text - Text mentioned inside the news article

our rating - class of the news article as false, partially false, true, other

Output data format

public_id - Unique identifier of the news article

predicted_rating - predicted class

Sample File

public_id, predicted_rating
1, false
2, true
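
A predictions file in the output format above can be produced with the standard csv module. The following is a minimal sketch under stated assumptions: the function name write_submission is hypothetical, the four allowed ratings come from the category definitions, and plain comma separation without a trailing space is assumed, since the sample file does not make the exact delimiter clear.

```python
import csv
import io

# The four classes named in the task description.
ALLOWED_RATINGS = {"false", "partially false", "true", "other"}


def write_submission(predictions, out):
    """Write (public_id, predicted_rating) pairs as CSV to a file-like object."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["public_id", "predicted_rating"])
    for public_id, rating in predictions:
        rating = rating.strip().lower()
        # Reject labels outside the four task classes before writing anything odd.
        if rating not in ALLOWED_RATINGS:
            raise ValueError(f"unexpected rating {rating!r} for id {public_id}")
        writer.writerow([public_id, rating])


buffer = io.StringIO()
write_submission([(1, "false"), (2, "true")], buffer)
print(buffer.getvalue())
```

To write to disk instead, the same function can be given a file opened with open(path, "w", newline="").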

IMPORTANT!

We have used the data from 2010 to 2022, and the content of fake news is mixed up with several topics like elections, COVID-19 etc.

Baseline: For this task, we have created a baseline system. The baseline system can be found at https://zenodo.org/record/6362498

Related Work

Shahi GK. AMUSED: An Annotation Framework of Multi-modal Social Media Data. arXiv preprint arXiv:2010.00502. 2020 Oct 1. https://arxiv.org/pdf/2010.00502.pdf

G. K. Shahi and D. Nandini, “FakeCovid – a multilingual cross-domain fact check news dataset for covid-19,” in workshop Proceedings of the 14th International AAAI Conference on Web and Social Media, 2020. http://workshop-proceedings.icwsm.org/abstract?id=2020_14

Shahi, G. K., Dirkson, A., & Majchrzak, T. A. (2021). An exploratory study of covid-19 misinformation on twitter. Online Social Networks and Media, 22, 100104. doi: 10.1016/j.osnem.2020.100104

Shahi, G. K., Struß, J. M., & Mandl, T. (2021). Overview of the CLEF-2021 CheckThat! lab task 3 on fake news detection. Working Notes of CLEF.

Nakov, P., Da San Martino, G., Elsayed, T., Barrón-Cedeno, A., Míguez, R., Shaar, S., ... & Mandl, T. (2021, March). The CLEF-2021 CheckThat! lab on detecting check-worthy claims, previously fact-checked claims, and fake news. In European Conference on Information Retrieval (pp. 639-649). Springer, Cham.

Nakov, P., Da San Martino, G., Elsayed, T., Barrón-Cedeño, A., Míguez, R., Shaar, S., ... & Kartal, Y. S. (2021, September). Overview of the CLEF–2021 CheckThat! Lab on Detecting Check-Worthy Claims, Previously Fact-Checked Claims, and Fake News. In International Conference of the Cross-Language Evaluation Forum for European Languages (pp. 264-291). Springer, Cham.
CT-FAN-21 corpus: A dataset for Fake News Detection
zenodo.org
Updated Oct 23, 2022
Cite
Gautam Kishore Shahi; Julia Maria Struß; Thomas Mandl (2022). CT-FAN-21 corpus: A dataset for Fake News Detection [Dataset]. http://doi.org/10.5281/zenodo.4714517
Unique identifier
https://doi.org/10.5281/zenodo.4714517
Dataset updated
Oct 23, 2022
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Gautam Kishore Shahi; Julia Maria Struß; Thomas Mandl
Description
Data Access: The data in the research collection provided may only be used for research purposes. Portions of the data are copyrighted and have commercial value as data, so you must be careful to use it only for research purposes. Due to these restrictions, the collection is not open data. Please download the Data Sharing Agreement and send the signed form to fakenewstask@gmail.com.

Citation

Please cite our work as

@article{shahi2021overview,
  title={Overview of the CLEF-2021 CheckThat! lab task 3 on fake news detection},
  author={Shahi, Gautam Kishore and Stru{\ss}, Julia Maria and Mandl, Thomas},
  journal={Working Notes of CLEF},
  year={2021}
}

Problem Definition: Given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other (e.g., claims in dispute) and detect the topical domain of the article. This task will run in English.

Subtask 3A: Multi-class fake news detection of news articles (English)

Sub-task A frames fake news detection as a four-class classification problem: given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other. The training data will be released in batches and comprises roughly 900 articles with their respective labels. Our definitions for the categories are as follows:

False - The main claim made in an article is untrue.

Partially False - The main claim of an article is a mixture of true and false information. The article contains partially true and partially false information but cannot be considered 100% true. It includes all articles in categories like partially false, partially true, mostly true, miscaptioned, misleading etc., as defined by different fact-checking services.

True - This rating indicates that the primary elements of the main claim are demonstrably true.

Other - An article that cannot be categorised as true, false, or partially false due to lack of evidence about its claims. This category includes articles in dispute and unproven articles.

Subtask 3B: Topical Domain Classification of News Articles (English)

Fact-checkers require background expertise to identify the truthfulness of an article. The categorisation will help to automate the sampling process from a stream of data. Given the text of a news article, determine the topical domain of the article (English). This is a classification problem. The task is to categorise fake news articles into six topical categories like health, election, crime, climate, election, education. This task will be offered for a subset of the data of Subtask 3A.

Input Data

The data will be provided in the format of Id, title, text, rating, the domain; the description of the columns is as follows:

Task 3a

ID - Unique identifier of the news article

Title - Title of the news article

text - Text mentioned inside the news article

our rating - class of the news article as false, partially false, true, other

Task 3b

public_id - Unique identifier of the news article

Title - Title of the news article

text - Text mentioned inside the news article

domain - domain of the given news article (applicable only for task B)

Output data format

Task 3a

public_id - Unique identifier of the news article

predicted_rating - Predicted class

Sample file

public_id, predicted_rating
1, false
2, true

Task 3b

public_id - Unique identifier of the news article

predicted_domain - Predicted domain

Sample file

public_id, predicted_domain
1, health
2, crime
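As a rough illustration of the output format above, the following Python sketch writes predictions as a two-column CSV file. The function name `write_submission`, the label validation, and the example predictions are assumptions made for illustration; the exact delimiter spacing expected by the organisers is not stated in the description, so the sketch uses the default output of Python's `csv` module.

```python
import csv
import io

# Class names from the task description; the lowercase spelling follows the sample file.
RATINGS = {"false", "partially false", "true", "other"}


def write_submission(predictions, handle, label_column="predicted_rating", allowed=RATINGS):
    """Write (public_id, label) pairs as CSV rows under a two-column header.

    Hypothetical helper: it rejects labels outside the allowed set.
    """
    writer = csv.writer(handle)
    writer.writerow(["public_id", label_column])
    for public_id, label in predictions:
        if label not in allowed:
            raise ValueError(f"unexpected label: {label!r}")
        writer.writerow([public_id, label])


# Made-up predictions, mirroring the Task 3a sample file.
example = [(1, "false"), (2, "true")]
buffer = io.StringIO()
write_submission(example, buffer)
submission_text = buffer.getvalue()
```

For Task 3b, the same helper could be called with `label_column="predicted_domain"` and a set of domain names passed as `allowed`.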

Additional data for Training

To train your model, you may use additional data in a similar format; some datasets are available on the web. We do not provide the ground truth for those datasets. For testing, we will not use any articles from other datasets. Some possible sources:

Fakenews Classification Datasets

Fake News Detection Challenge KDD 2020

FakeNewsNet

IMPORTANT!

The fake news articles used for Task 3b are a subset of those used for Task 3a.

We have used data from 2010 to 2021, and the fake news content spans several topics, such as elections and COVID-19.

Evaluation Metrics

This task is evaluated as a classification task. We will use the F1-macro measure for the ranking of teams. There is a limit of 5 runs (total and not per day), and only one person from a team is allowed to submit runs.
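The macro-averaged F1 measure used for ranking can be sketched in plain Python as below. This is a minimal illustration, not the organisers' scoring script: the function name `macro_f1`, the handling of a class with no gold and no predicted instances (scored as 0 here), and the example labels are assumptions.

```python
def macro_f1(gold, predicted, labels):
    """Return the unweighted mean of the per-class F1 scores."""
    scores = []
    for label in labels:
        pairs = list(zip(gold, predicted))
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        denom = 2 * tp + fp + fn
        # Assumption: a class that never occurs in gold or predictions scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


LABELS = ["false", "partially false", "true", "other"]

# Made-up gold labels and predictions, for illustration only.
gold = ["false", "false", "true", "other", "partially false", "true"]
pred = ["false", "true", "true", "other", "false", "true"]
score = macro_f1(gold, pred, LABELS)
```

Because every class contributes equally to the mean, a system that ignores a rare class (here "partially false") is penalised more heavily than it would be under accuracy.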

Submission Link: https://competitions.codalab.org/competitions/31238

Related Work

Shahi, G. K. (2020). AMUSED: An Annotation Framework of Multi-modal Social Media Data. arXiv preprint arXiv:2010.00502. https://arxiv.org/pdf/2010.00502.pdf

Shahi, G. K., & Nandini, D. (2020). FakeCovid – a multilingual cross-domain fact check news dataset for COVID-19. In Workshop Proceedings of the 14th International AAAI Conference on Web and Social Media. http://workshop-proceedings.icwsm.org/abstract?id=2020_14

Shahi, G. K., Dirkson, A., & Majchrzak, T. A. (2021). An exploratory study of covid-19 misinformation on twitter. Online Social Networks and Media, 22, 100104. doi: 10.1016/j.osnem.2020.100104
Top Youtube News Media Statistics
kaggle.com
zip
Updated Jul 14, 2023
crxxom (2023). Top Youtube News Media Statistics [Dataset]. https://www.kaggle.com/datasets/crxxom/top-youtube-news-media-statistics/code
Available download formats: zip (901734 bytes)
Dataset updated
Jul 14, 2023
Authors
crxxom
Area covered
YouTube
Description
The dataset contains detailed information on some of the most popular English media channels on Youtube. From channel overview to statistics of the top 50 videos of each channel, here is a description of all the columns of the two datasets.

Mainstream Media Statistics

channelName: name of the channel on Youtube

id: The channel ID in Youtube

subscribers: subscriber count (up till 14/7/2023)

total views: total views of all the videos of the channel (up till 14/7/2023)

total videos: total number of videos of the channel (up till 14/7/2023)

created date: The date on which the channel was created

description: description of the channel in their description page

playlistId: The id of the channel's video list

Top50_viewed_video_from_each_channels

Video Id: The ID of the video on Youtube

Channel Title: The channel name of the video

Title: Title of the video

publishedAt: When the video is published

categoryId: The YouTube category ID (for a reference list, see https://mixedanalytics.com/blog/list-of-youtube-video-category-ids/)

description: The description of the video

viewCount: The total number of views of that video (up till 14/7/2023)

likeCount: The total number of likes of that video (up till 14/7/2023)

commentCount: The total number of comments of that video (up till 14/7/2023)

duration: The duration of that video

Inspirations

The data was scraped using the YouTube API. Feel free to use it as long as your use complies with YouTube's terms of use. For example, you could analyze which news topics interest people most, or watch some of the most viewed news videos in the world to stay in touch with society.
Bloomberg Quint news dataset
crawlfeeds.com
json, zip
Updated Sep 27, 2024
Crawl Feeds (2024). Bloomberg Quint news dataset [Dataset]. https://crawlfeeds.com/datasets/bloomberg-quint-news-dataset
Available download formats: json, zip
Dataset updated
Sep 27, 2024
Dataset authored and provided by
Crawl Feeds
License
https://crawlfeeds.com/privacy_policy
Description
Explore the "Bloomberg Quint News Dataset," a comprehensive collection of news articles from Bloomberg Quint, a leading source of financial, business, and economic news in India and around the world.

This dataset includes thousands of articles covering a wide range of topics, such as financial markets, economic policies, corporate news, technology, politics, and more. Each article in the dataset comes with detailed information, including headlines, publication dates, authors, article content, and categories, offering valuable insights for researchers, data analysts, and media professionals.

Key Features:

Extensive Coverage: Thousands of news articles from Bloomberg Quint, covering diverse topics including business, finance, economics, technology, and global news.

Detailed Metadata: Each article includes key details such as headline, publication date, author, content, and category, making it ideal for in-depth research and analysis.

Ideal for Analysis: Perfect for researchers, data scientists, and content strategists looking to analyze trends in news reporting, study media coverage, or develop content strategies.

Rich Source of Information: Provides up-to-date information on financial markets, economic policies, and global events, helping professionals stay informed and make data-driven decisions.

Whether you're researching financial trends, analyzing media coverage, or developing new content, the "Bloomberg Quint News Dataset" is an invaluable resource that offers detailed insights and extensive coverage of the latest news.
E-News Express
kaggle.com
zip
Updated Sep 28, 2023
Mariyam Al Shatta (2023). E-News Express [Dataset]. https://www.kaggle.com/datasets/mariyamalshatta/e-news-express
Available download formats: zip (925 bytes)
Dataset updated
Sep 28, 2023
Authors
Mariyam Al Shatta
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Business Context

The advent of e-news, or electronic news, portals has offered us a great opportunity to quickly get updates on the day-to-day events occurring globally. The information on these portals is retrieved electronically from online databases, processed using a variety of software, and then transmitted to the users. There are multiple advantages of transmitting news electronically, like faster access to the content and the ability to utilize different technologies such as audio, graphics, video, and other interactive elements that are either not being used or aren’t common yet in traditional newspapers.

E-news Express, an online news portal, aims to expand its business by acquiring new subscribers. With every visitor to the website taking certain actions based on their interest, the company plans to analyze these actions to understand user interests and determine how to drive better engagement. The executives at E-news Express are of the opinion that there has been a decline in new monthly subscribers compared to the past year because the current webpage is not designed well enough in terms of the outline & recommended content to keep customers engaged long enough to make a decision to subscribe.

[Companies often analyze user responses to two variants of a product to decide which of the two variants is more effective. This experimental technique, known as A/B testing, is used to determine whether a new feature attracts users based on a chosen metric.]

Objective

The design team of the company has researched and created a new landing page that has a new outline & more relevant content shown compared to the old page. In order to test the effectiveness of the new landing page in gathering new subscribers, the Data Science team conducted an experiment by randomly selecting 100 users and dividing them equally into two groups. The existing landing page was served to the first group (control group) and the new landing page to the second group (treatment group). Data regarding the interaction of users in both groups with the two versions of the landing page was collected. Being a data scientist in E-news Express, you have been asked to explore the data and perform a statistical analysis (at a significance level of 5%) to determine the effectiveness of the new landing page in gathering new subscribers for the news portal by answering the following questions:

Do the users spend more time on the new landing page than on the existing landing page?

Is the conversion rate (the proportion of users who visit the landing page and get converted) for the new page greater than the conversion rate for the old page?

Does the converted status depend on the preferred language?

Is the time spent on the new page the same for the different language users?

Data Dictionary

The data contains information regarding the interaction of users in both groups with the two versions of the landing page.

user_id - Unique user ID of the person visiting the website

group - Whether the user belongs to the first group (control) or the second group (treatment)

landing_page - Whether the landing page is new or old

time_spent_on_the_page - Time (in minutes) spent by the user on the landing page

converted - Whether the user gets converted to a subscriber of the news portal or not

language_preferred - Language chosen by the user to view the landing page
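One possible way to approach the conversion-rate question at the 5% significance level is a one-sided two-proportion z-test. The sketch below uses only the Python standard library; the choice of test, the function name `two_proportion_z_test`, and the conversion counts are assumptions for illustration and are not taken from the dataset. Only the group size of 50 users follows from the experiment design (100 users divided equally).

```python
import math


def two_proportion_z_test(conv_new, n_new, conv_old, n_old):
    """One-sided z-test of H0: p_new <= p_old against H1: p_new > p_old."""
    p_new = conv_new / n_new
    p_old = conv_old / n_old
    # Pooled conversion rate under the null hypothesis of equal proportions.
    pooled = (conv_new + conv_old) / (n_new + n_old)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_new + 1 / n_old))
    z = (p_new - p_old) / se
    # Upper-tail probability of the standard normal distribution.
    p_value = 0.5 * (1 - math.erf(z / math.sqrt(2)))
    return z, p_value


# Hypothetical conversion counts (NOT from the dataset), 50 users per group.
z, p_value = two_proportion_z_test(conv_new=33, n_new=50, conv_old=21, n_old=50)
reject_null = p_value < 0.05
```

With these made-up counts the null hypothesis would be rejected at the 5% level. The other questions call for different tests, for example a two-sample t-test for time spent or a chi-square test of independence for language versus conversion.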
News Data, Global News, Topic News, and More from Google News
openwebninja.com
json
OpenWeb Ninja, News Data, Global News, Topic News, and More from Google News [Dataset]. https://www.openwebninja.com/api/real-time-news-data
Available download formats: json
Dataset authored and provided by
OpenWeb Ninja
Area covered
Global News Coverage
Description
This dataset provides comprehensive access to news articles and headlines from Google News in real-time. Get top news globally or by specific topics, with support for geographic targeting and custom search queries. Perfect for applications requiring news monitoring, media analysis, and content aggregation. The dataset is delivered in a JSON format via REST API.
Business Today's YouTube Channel Statistics
vidiq.com
vidIQ, Business Today's YouTube Channel Statistics [Dataset]. https://vidiq.com/youtube-stats/channel/UCaPHWiExfUWaKsUtENLCv5w/
Dataset authored and provided by
vidIQ
Time period covered
Nov 1, 2025 - Nov 30, 2025
Area covered
IN, YouTube
Variables measured
subscribers, video count, video views, engagement rate, upload frequency, estimated earnings
Description
Comprehensive YouTube channel statistics for Business Today, featuring 2,970,000 subscribers and 677,782,483 total views. This dataset includes detailed performance metrics such as subscriber growth, video views, engagement rates, and estimated revenue. The channel operates in the News-&-Politics category and is based in IN. Track 30,944 videos with daily and monthly performance data, including view counts, subscriber changes, and earnings estimates. Analyze growth trends, engagement patterns, and compare performance against similar channels in the same category.
Data from: Qbias – A Dataset on Media Bias in Search Queries and Query...
data.niaid.nih.gov
Updated Mar 1, 2023
Haak, Fabian; Schaer, Philipp (2023). Qbias – A Dataset on Media Bias in Search Queries and Query Suggestions [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7682914
Dataset updated
Mar 1, 2023
Dataset provided by
Technische Hochschule Köln
Authors
Haak, Fabian; Schaer, Philipp
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
We present Qbias, two novel datasets that promote the investigation of bias in online news search as described in

Fabian Haak and Philipp Schaer. 2023. Qbias - A Dataset on Media Bias in Search Queries and Query Suggestions. In Proceedings of ACM Web Science Conference (WebSci’23). ACM, New York, NY, USA, 6 pages. https://doi.org/10.1145/3578503.3583628.

Dataset 1: AllSides Balanced News Dataset (allsides_balanced_news_headlines-texts.csv)

The dataset contains 21,747 news articles collected from AllSides balanced news headline roundups in November 2022 as presented in our publication. The AllSides balanced news roundups feature three expert-selected U.S. news articles from sources of different political views (left, right, center), often featuring spin bias, slant, and other forms of non-neutral reporting on political news. All articles are tagged with a bias label by four expert annotators based on the expressed political partisanship: left, right, or neutral. The AllSides balanced news aims to offer multiple political perspectives on important news stories, educate users on biases, and provide multiple viewpoints. Collected data further includes headlines, dates, news texts, topic tags (e.g., "Republican party", "coronavirus", "federal jobs"), and the publishing news outlet. We also include AllSides' neutral description of the topic of the articles. Overall, the dataset contains 10,273 articles tagged as left, 7,222 as right, and 4,252 as center.

To provide easier access to the most recent and complete version of the dataset for future research, we provide a scraping tool and a regularly updated version of the dataset at https://github.com/irgroup/Qbias. The repository also contains regularly updated more recent versions of the dataset with additional tags (such as the URL to the article). We chose to publish the version used for fine-tuning the models on Zenodo to enable the reproduction of the results of our study.

Dataset 2: Search Query Suggestions (suggestions.csv)

The second dataset we provide consists of 671,669 search query suggestions for root queries based on tags of the AllSides biased news dataset. We collected search query suggestions from Google and Bing for the 1,431 topic tags that have been used for tagging AllSides news at least five times, which is approximately half of the total number of topics. The topic tags include names, a wide range of political terms, agendas, and topics (e.g., "communism", "libertarian party", "same-sex marriage"), cultural and religious terms (e.g., "Ramadan", "pope Francis"), locations, and other news-relevant terms. On average, the dataset contains 469 search queries for each topic. In total, 318,185 suggestions have been retrieved from Google and 353,484 from Bing.

The file contains a "root_term" column based on the AllSides topic tags. The "query_input" column contains the search term submitted to the search engine ("search_engine"). "query_suggestion" and "rank" represent the search query suggestions and their respective positions as returned by the search engines at the given time of search ("datetime"). We scraped our data from a US server, whose location is saved in "location".

We retrieved ten search query suggestions provided by the Google and Bing search autocomplete systems for the input of each of these root queries, without performing a search. Furthermore, we extended the root queries by the letters a to z (e.g., "democrats" (root term) >> "democrats a" (query input) >> "democrats and recession" (query suggestion)) to simulate a user's input during information search and generate a total of up to 270 query suggestions per topic and search engine. The dataset we provide contains columns for root term, query input, and query suggestion for each suggested query. The location from which the search is performed is the location of the Google servers running Colab, in our case Iowa in the United States of America, which is added to the dataset.
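The root-query expansion described above can be sketched in a few lines of Python. This is an illustrative reconstruction, not the authors' scraper: the function name `build_query_inputs` is hypothetical, and the sketch only generates the query inputs without contacting any search engine.

```python
import string


def build_query_inputs(root_term):
    """Return the root term, then the root term extended by each letter a to z."""
    return [root_term] + [f"{root_term} {letter}" for letter in string.ascii_lowercase]


# Ten suggestions per query input, as stated in the description.
SUGGESTIONS_PER_INPUT = 10

inputs = build_query_inputs("democrats")
# 27 query inputs (the root plus 26 letter extensions) times 10 suggestions each.
max_suggestions = len(inputs) * SUGGESTIONS_PER_INPUT
```

The 27 query inputs per root term, at ten suggestions each, account for the stated upper bound of 270 suggestions per topic and search engine.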

AllSides Scraper

At https://github.com/irgroup/Qbias, we provide a scraping tool that allows for the automatic retrieval of all available articles from the AllSides balanced news headlines.

We want to provide an easy means of retrieving the news and all corresponding information. For many tasks it is relevant to have the most recent documents available. Thus, we provide this Python-based scraper, that scrapes all available AllSides news articles and gathers available information. By providing the scraper we facilitate access to a recent version of the dataset for other researchers.
News Events Data in Latin America( Techsalerator)
datarade.ai
Updated Mar 20, 2024
Techsalerator (2024). News Events Data in Latin America( Techsalerator) [Dataset]. https://datarade.ai/data-products/news-events-data-in-latin-america-techsalerator-techsalerator
Available download formats: .json, .csv, .xls, .txt
Dataset updated
Mar 20, 2024
Dataset provided by
Techsalerator LLC
Authors
Techsalerator
Area covered
Chile, Cuba, Martinique, Montserrat, Dominican Republic, Falkland Islands (Malvinas), French Guiana, Argentina, Aruba, Ecuador, Americas, Latin America
Description
Techsalerator’s News Event Data in Latin America offers a detailed and extensive dataset designed to provide businesses, analysts, journalists, and researchers with an in-depth view of significant news events across the Latin American region. This dataset captures and categorizes key events reported from a wide array of news sources, including press releases, industry news sites, blogs, and PR platforms, offering valuable insights into regional developments, economic changes, political shifts, and cultural events.

Key Features of the Dataset:

Comprehensive Coverage: The dataset aggregates news events from numerous sources such as company press releases, industry news outlets, blogs, PR sites, and traditional news media. This broad coverage ensures a wide range of information from multiple reporting channels.

Categorization of Events: News events are categorized into various types including business and economic updates, political developments, technological advancements, legal and regulatory changes, and cultural events. This categorization helps users quickly locate and analyze information relevant to their interests or sectors.

Real-Time Updates: The dataset is updated regularly to include the most recent events, ensuring users have access to the latest news and can stay informed about current developments.

Geographic Segmentation: Events are tagged with their respective countries and regions within Latin America. This geographic segmentation allows users to filter and analyze news events based on specific locations, facilitating targeted research and analysis.

Event Details: Each event entry includes comprehensive details such as the date of occurrence, source of the news, a description of the event, and relevant keywords. This thorough detailing helps in understanding the context and significance of each event.

Historical Data: The dataset includes historical news event data, enabling users to track trends and perform comparative analysis over time. This feature supports longitudinal studies and provides insights into how news events evolve.

Advanced Search and Filter Options: Users can search and filter news events based on criteria such as date range, event type, location, and keywords. This functionality allows for precise and efficient retrieval of relevant information.

Latin American Countries Covered:

South America: Argentina, Bolivia, Brazil, Chile, Colombia, Ecuador, Guyana, Paraguay, Peru, Suriname, Uruguay, Venezuela

Central America: Belize, Costa Rica, El Salvador, Guatemala, Honduras, Nicaragua, Panama

Caribbean: Cuba, Dominican Republic, Haiti (Note: Primarily French-speaking but included due to geographic and cultural ties), Jamaica, Trinidad and Tobago

Benefits of the Dataset:

Strategic Insights: Businesses and analysts can use the dataset to gain insights into significant regional developments, economic conditions, and political changes, aiding in strategic decision-making and market analysis.

Market and Industry Trends: The dataset provides valuable information on industry-specific trends and events, helping users understand market dynamics and emerging opportunities.

Media and PR Monitoring: Journalists and PR professionals can track relevant news across Latin America, enabling them to monitor media coverage, identify emerging stories, and manage public relations efforts effectively.

Academic and Research Use: Researchers can utilize the dataset for longitudinal studies, trend analysis, and academic research on various topics related to Latin American news and events.

Techsalerator’s News Event Data in Latin America is a crucial resource for accessing and analyzing significant news events across the region. By providing detailed, categorized, and up-to-date information, it supports effective decision-making, research, and media monitoring across diverse sectors.


Social media as a news outlet worldwide 2025 (Statista): 68 scholarly articles cite this dataset.
