Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains world news articles on science and technology, along with each article's available metadata.
Spanish Fake News Dataset
This dataset contains a structured and annotated collection of false news items in Spanish (Castilian), gathered and processed for academic research on misinformation.
Dataset Scope
The dataset represents most of the recorded false news messages and their variations up to 01.02.2021.
Content Description
The dataset includes samples of false information in various formats:
News articles and headlines
Tweets and Facebook/Instagram/Telegram posts
YouTube video captions
WhatsApp text and voice message transcripts
Transcribed video/audio fragments with false claims
Fake government documents
Captions from photos and memes
Text extracted from images using OCR
Only Spanish (Castilian) texts were used, excluding regional variants (e.g., Catalan, Basque, Galician) for consistency.
Sources
The data was collected from the following verified fact-checking initiatives:
Maldito Bulo
Newtral
AFP Factual
Fact-checkers from these organizations provide detailed articles identifying and explaining falsehoods, often including:
General context of the event
Quotes or links to false claims
Analysis and explanation of why the claims are false
Verified information or corrections
Collection Method
The dataset was built using both manual extraction (e.g., identifying and quoting false statements) and automated parsing:
MyNews service: an archive of Spanish mass media
Custom scripts: for parsing and extracting structured data
OCR tools: for extracting text from images (e.g., memes and screenshots)
Fields Description
| Column Name | Description |
| --- | --- |
| Topic | The thematic category of the news item (e.g., Politics, Health, COVID-19, Crime). Normalized and translated to English. |
| Link source | URL to the original news piece, fact-check report, or source of the claim. Invalid links were removed. |
| Media | The platform or outlet where the false claim appeared (e.g., Facebook, YouTube, WhatsApp). Normalized for consistent spelling and language. |
| Date | Publication or verification date of the news item, in YYYY-MM-DD format. |
| Author | (Optional) Author of the news or platform source, if available. May be empty. |
| Headlines | Title or summary of the news item or article containing the false information. |
| Fake statement | Quoted false claim or misinformation as cited in the verification article. |
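As a minimal sketch of how the columns described above might be read and checked, the following uses only the Python standard library. It assumes the data ships as a comma-separated file whose header matches the column names exactly; the file layout, the helper name `read_rows`, and the two sample rows are hypothetical and invented for illustration.

```python
import csv
import io
import re
from datetime import date

# Column names as listed in the field description (assumed to be the CSV header).
EXPECTED_COLUMNS = [
    "Topic", "Link source", "Media", "Date",
    "Author", "Headlines", "Fake statement",
]

DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")


def read_rows(handle):
    """Yield rows as dicts; raise ValueError on missing columns or bad dates."""
    reader = csv.DictReader(handle)
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    for line_no, row in enumerate(reader, start=2):
        # Date is documented as YYYY-MM-DD: check the shape, then the calendar value.
        if not DATE_PATTERN.fullmatch(row["Date"]):
            raise ValueError(f"line {line_no}: bad date {row['Date']!r}")
        date.fromisoformat(row["Date"])
        # Author is optional and may be empty; represent an empty value as None.
        row["Author"] = row["Author"].strip() or None
        yield row


# Hypothetical sample rows, not taken from the dataset.
SAMPLE = """Topic,Link source,Media,Date,Author,Headlines,Fake statement
Health,https://example.org/check/1,WhatsApp,2020-04-12,,Example headline,Example false claim
Politics,https://example.org/check/2,Facebook,2020-11-03,Example Author,Another headline,Another false claim
"""

rows = list(read_rows(io.StringIO(SAMPLE)))
```

Validating the date shape before parsing keeps loosely formatted values such as `2020-4-12` from slipping through.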
⚠️ Notes
The dataset was preprocessed to remove duplicates, invalid links, and non-textual clutter.
Field values were normalized to support multilingual and cross-platform analysis.
Only Castilian Spanish was retained for consistency and clarity.
📚 License & Use
This dataset is intended for non-commercial academic and research purposes. Please cite the original fact-checking organizations and this dataset if used in publications or analysis.
By downloading the data, you agree with the terms & conditions mentioned below:
Data Access: The data in the research collection may only be used for research purposes. Portions of the data are copyrighted and have commercial value as data, so you must be careful to use them only for research purposes.
Summaries, analyses and interpretations of the linguistic properties of the information may be derived and published, provided it is impossible to reconstruct the information from these summaries. You may not try identifying the individuals whose texts are included in this dataset. You may not try to identify the original entry on the fact-checking site. You are not permitted to publish any portion of the dataset besides summary statistics or share it with anyone else.
We grant you the right to access the collection's content as described in this agreement. You may not otherwise make unauthorised commercial use of, reproduce, prepare derivative works, distribute copies, perform, or publicly display the collection or parts of it. You are responsible for keeping and storing the data in a way that others cannot access. The data is provided free of charge.
Citation
Please cite our work as
@InProceedings{clef-checkthat:2022:task3,
  author = {K{\"o}hler, Juliane and Shahi, Gautam Kishore and Stru{\ss}, Julia Maria and Wiegand, Michael and Siegel, Melanie and Mandl, Thomas},
  title = "Overview of the {CLEF}-2022 {CheckThat}! Lab Task 3 on Fake News Detection",
  year = {2022},
  booktitle = "Working Notes of CLEF 2022---Conference and Labs of the Evaluation Forum",
  series = {CLEF~'2022},
  address = {Bologna, Italy},
}
@article{shahi2021overview,
  title={Overview of the CLEF-2021 CheckThat! lab task 3 on fake news detection},
  author={Shahi, Gautam Kishore and Stru{\ss}, Julia Maria and Mandl, Thomas},
  journal={Working Notes of CLEF},
  year={2021}
}
Problem Definition: Given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other (e.g., claims in dispute) and detect the topical domain of the article. This task will run in English and German.
Task 3: Multi-class fake news detection of news articles (English). Sub-task A frames fake news detection as a four-class classification problem: given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other. The training data will be released in batches and comprises roughly 1264 English-language articles with their respective labels. Our definitions for the categories are as follows:
False - The main claim made in an article is untrue.
Partially False - The main claim of an article is a mixture of true and false information. The article contains partially true and partially false information but cannot be considered 100% true. It includes all articles in categories like partially false, partially true, mostly true, miscaptioned, misleading etc., as defined by different fact-checking services.
True - This rating indicates that the primary elements of the main claim are demonstrably true.
Other - An article that cannot be categorised as true, false, or partially false due to a lack of evidence about its claims. This category includes articles in dispute and unproven articles.
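The category definitions above can be sketched as a small normalisation step that folds fact-checker verdicts into the four task classes. This is an illustrative sketch, not the organisers' code: the exact verdict strings used by each fact-checking service are assumptions, and `normalize_rating` is a hypothetical helper name.

```python
# Verdicts that the definitions above group under "partially false".
# The exact spellings are assumptions; real fact-checking sites vary.
PARTIALLY_FALSE = {
    "partially false", "partially true", "mostly true",
    "miscaptioned", "misleading",
}

# Verdicts grouped under "other" (in dispute or unproven); spellings assumed.
OTHER = {"other", "in dispute", "unproven"}


def normalize_rating(raw):
    """Map a fact-checker verdict string to one of the four task classes."""
    label = raw.strip().lower()
    if label in ("false", "true"):
        return label
    if label in PARTIALLY_FALSE:
        return "partially false"
    if label in OTHER:
        return "other"
    # Surface unmapped verdicts instead of silently guessing a class.
    raise ValueError(f"unmapped verdict: {raw!r}")
```

Raising on unknown verdicts is a deliberate choice here: a silent fallback to "other" would hide labels that still need a mapping decision.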
Cross-Lingual Task (German)
Along with the multi-class task for English, we have introduced a task for a low-resource language. The test data will be provided in German. The idea of the task is to use the English data and cross-lingual transfer to build a classification model for German.
Input Data
The data will be provided with the columns ID, title, text, rating, and domain; the columns are described as follows:
ID - Unique identifier of the news article
Title - Title of the news article
text - Text of the news article
our rating - Class of the news article: false, partially false, true, or other
Output data format
public_id - Unique identifier of the news article
predicted_rating - Predicted class
Sample File
public_id, predicted_rating
1, false
2, true
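A submission file in the format shown above could be produced along these lines. This is a hedged sketch: `write_predictions` is a hypothetical helper, and the exact whitespace expected after the comma is not specified in the task description, so the plain comma delimiter used here is an assumption.

```python
import csv
import io

# The four classes defined for the task.
VALID_RATINGS = ("false", "partially false", "true", "other")


def write_predictions(predictions, handle):
    """Write (public_id, predicted_rating) pairs with the documented header."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["public_id", "predicted_rating"])
    for public_id, rating in predictions:
        # Reject labels outside the four task classes before writing.
        if rating not in VALID_RATINGS:
            raise ValueError(f"invalid rating for {public_id}: {rating!r}")
        writer.writerow([public_id, rating])


buffer = io.StringIO()
write_predictions([(1, "false"), (2, "true")], buffer)
submission_text = buffer.getvalue()
```

In practice the handle would be an open file rather than an in-memory buffer; the buffer keeps the example self-contained.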
IMPORTANT!
The data spans 2010 to 2022, and the fake news content covers a mix of topics such as elections and COVID-19.
Baseline: For this task, we have created a baseline system. The baseline system can be found at https://zenodo.org/record/6362498
Related Work
Shahi GK. AMUSED: An Annotation Framework of Multi-modal Social Media Data. arXiv preprint arXiv:2010.00502. 2020 Oct 1. https://arxiv.org/pdf/2010.00502.pdf
G. K. Shahi and D. Nandini, “FakeCovid – a multilingual cross-domain fact check news dataset for covid-19,” in workshop Proceedings of the 14th International AAAI Conference on Web and Social Media, 2020. http://workshop-proceedings.icwsm.org/abstract?id=2020_14
Shahi, G. K., Dirkson, A., & Majchrzak, T. A. (2021). An exploratory study of covid-19 misinformation on twitter. Online Social Networks and Media, 22, 100104. doi: 10.1016/j.osnem.2020.100104
Shahi, G. K., Struß, J. M., & Mandl, T. (2021). Overview of the CLEF-2021 CheckThat! lab task 3 on fake news detection. Working Notes of CLEF.
Nakov, P., Da San Martino, G., Elsayed, T., Barrón-Cedeño, A., Míguez, R., Shaar, S., ... & Mandl, T. (2021, March). The CLEF-2021 CheckThat! lab on detecting check-worthy claims, previously fact-checked claims, and fake news. In European Conference on Information Retrieval (pp. 639-649). Springer, Cham.
Nakov, P., Da San Martino, G., Elsayed, T., Barrón-Cedeño, A., Míguez, R., Shaar, S., ... & Kartal, Y. S. (2021, September). Overview of the CLEF–2021 CheckThat! Lab on Detecting Check-Worthy Claims, Previously Fact-Checked Claims, and Fake News. In International Conference of the Cross-Language Evaluation Forum for European Languages (pp. 264-291). Springer, Cham.
https://choosealicense.com/licenses/unknown/
Dataset Card for CC-News
Dataset Summary
The CC-News dataset contains news articles from news sites all over the world. The data is available on AWS S3 in the Common Crawl bucket at /crawl-data/CC-NEWS/. This version of the dataset has been prepared using news-please, an integrated web crawler and information extractor for news. It contains 708241 English language news articles published between Jan 2017 and December 2019. It represents a small portion of the English… See the full description on the dataset page: https://huggingface.co/datasets/vblagoje/cc_news.
TabMaven/all-the-news dataset hosted on Hugging Face and contributed by the HF Datasets community
https://choosealicense.com/licenses/other/
Dataset Information
This dataset includes news data for various instruments.
Instruments Included
Stocks, ETFs, Forex, Cryptocurrencies, Commodities and more.
Dataset Columns
symbols: The symbols in the news, typically representing stock tickers or other financial instruments mentioned in the article.
datetime: The date and time when the news article was published, formatted as a string.
title: The title of the news article, providing a brief and descriptive… See the full description on the dataset page: https://huggingface.co/datasets/paperswithbacktest/All-Daily-News.
https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created by Alexey Voytsekhovskiy
Released under CC0: Public Domain
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains world news articles on COVID-19 and vaccines, along with each article's available metadata.
https://choosealicense.com/licenses/unknown/
Dataset Card for Hausa VOA News Topic Classification dataset (hausa_voa_topics)
Dataset Summary
A news headline topic classification dataset, similar to AG-news, for Hausa. The news headlines were collected from VOA Hausa.
Supported Tasks and Leaderboards
[More Information Needed]
Languages
Hausa (ISO 639-1: ha)
Dataset Structure
Data Instances
An instance consists of a news title sentence and the corresponding topic label.… See the full description on the dataset page: https://huggingface.co/datasets/UdS-LSV/hausa_voa_topics.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset - Male in the news
This dataset contains news titles and headlines from different sources on different topics. The columns are described below:
| Column | DataType | Description |
| --- | --- | --- |
| IDLink | str | Unique identifier of the row |
| Title | str | Title of the news |
| Headline | str | Headline of the news |
| Source | str | Newspaper/news-source |
| Topic | str | News-topic (values : obama, economy, microsoft, palestine) |
| PublishDate | Timestamp | publish date & time |
| Facebook | int | facebook rating |
| GooglePlus | int | google plus rating |
| LinkedIn | int | linkedin rating |
One of the main tasks that can be performed with this dataset is Sentiment Analysis: find the sentiment scores for each title and headline of the test data by applying Regression Analysis.
This graph displays the time of day when consumers check the news on a typical weekday in the United States as of ************. During the survey, it was found that ** percent of consumers check the news in the early morning of a typical weekday.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset accompanies the study Beyond Manual Media Coding: Evaluating Large Language Models and Agents for News Content Analysis.
It provides a reproducible benchmark for evaluating automated content analysis methods against human-annotated ground truth.
The dataset includes:
articles.csv
Contains the 200 news articles collected for this study, each with:
id: unique identifier
url: source URL of the original article
content: full text of the news article
codebook.json
A structured JSON file defining the 26-question analysis codebook used for annotation.
Each question entry specifies:
questionId: question ID (e.g., Q1)
prompt: annotation question text
questionAnswerType: type (SINGLE_CHOICE or MULTI_CHOICE)
eligibleQuestionAnswers: list of possible tags / codes
annotations.json
Contains the complete human annotation data.
For each article id, it provides the list of responses to all 26 codebook questions as determined by an expert annotator, establishing the ground truth labels.
Designed for research purposes including natural language understanding, content classification, and LLM evaluation.
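As a rough sketch of how automated answers might be scored against the human annotations, the following computes a per-question exact-match rate. The internal layout of annotations.json is not documented here, so the nested shape used below (article id, then question id, then a list of codes) is an assumption, as are the function name `exact_match_accuracy` and the toy data.

```python
import json


def exact_match_accuracy(gold, predicted):
    """Per question, the share of articles whose predicted code set equals the gold set."""
    hits, totals = {}, {}
    for article_id, answers in gold.items():
        for question_id, gold_codes in answers.items():
            pred_codes = predicted.get(article_id, {}).get(question_id, [])
            totals[question_id] = totals.get(question_id, 0) + 1
            # Compare as sets so the order of codes in MULTI_CHOICE answers is ignored.
            if set(pred_codes) == set(gold_codes):
                hits[question_id] = hits.get(question_id, 0) + 1
    return {q: hits.get(q, 0) / totals[q] for q in totals}


# Toy data in the assumed layout; not taken from the dataset.
gold = json.loads(
    '{"a1": {"Q1": ["X"], "Q2": ["A", "B"]}, "a2": {"Q1": ["Y"], "Q2": ["A"]}}'
)
predicted = {
    "a1": {"Q1": ["X"], "Q2": ["B", "A"]},
    "a2": {"Q1": ["X"], "Q2": ["A"]},
}
scores = exact_match_accuracy(gold, predicted)
```

Exact match is a strict criterion for MULTI_CHOICE questions; a set-overlap measure may be more informative there, depending on the evaluation goal.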
https://webtechsurvey.com/terms
A complete list of live websites using the News technology, compiled through global website indexing conducted by WebTechSurvey.
In April 2025, the news website with the most monthly visits in the United States was nytimes.com, with a total of ***** million monthly visits in that month. In second place was cnn.com with just over *** million visits, followed by foxnews.com with almost a ****** of a million.
Online news consumption in the U.S.
Americans get their news in a variety of ways, but social media is an increasingly popular option. A survey on social media news consumption revealed that ** percent of Twitter users regularly used the site for news, and Facebook and Reddit were also popular for news among their users. Interestingly though, social media is the least trusted news source in the United States.
News and trust
Trust in news sources has become increasingly important to the American news consumer amidst the spread of fake news, and the public is more vocal about whether or not it has faith in a source to report news correctly. Ongoing discussions about the credibility, accuracy and bias of news networks, anchors, TV show hosts, and news media professionals mean that those looking to keep up to date tend to be more cautious than ever before. In general, news audiences are skeptical. In 2020, just **** percent of respondents to a survey investigating the perceived objectivity of the mass media reported having a great deal of trust in the media to report news fully, accurately, and fairly.
New annotated datasets linking tweets and articles, including Tweets – PAP News Dataset, Tweets – BBC News Dataset, Cascades – PAP News Dataset, and Cascades – BBC News Dataset.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Cline Center Global News Index is a searchable database of textual features extracted from millions of news stories, specifically designed to provide comprehensive coverage of events around the world. In addition to searching documents for keywords, users can query metadata and features such as named entities extracted using Natural Language Processing (NLP) methods and variables that measure sentiment and emotional valence.
Archer is a web application purpose-built by the Cline Center to enable researchers to access data from the Global News Index. Archer provides a user-friendly interface for querying the Global News Index (with the back-end indexing still handled by Solr). By default, queries are built using icons and drop-down menus. More technically-savvy users can use Lucene/Solr query syntax via a ‘raw query’ option. Archer allows users to save and iterate on their queries, and to visualize faceted query results, which can be helpful for users as they refine their queries.
Additional Resources:
- Access to Archer and the Global News Index is limited to account-holders. If you are interested in signing up for an account, please fill out the Archer Access Request Form so we can determine if you are eligible for access or not.
- Current users who would like to provide feedback, such as reporting a bug or requesting a feature, can fill out the Archer User Feedback Form.
- The Cline Center sends out periodic email newsletters to the Archer Users Group. Please fill out this form to subscribe to it.
Citation Guidelines:
1) To cite the GNI codebook (or any other documentation associated with the Global News Index and Archer) please use the following citation: Cline Center for Advanced Social Research. 2023. Global News Index and Extracted Features Repository [codebook], v1.2.0. Champaign, IL: University of Illinois. June. XX. doi:10.13012/B2IDB-5649852_V5
2) To cite data from the Global News Index (accessed via Archer or otherwise) please use the following citation (filling in the correct date of access): Cline Center for Advanced Social Research. 2023. Global News Index and Extracted Features Repository [database], v1.2.0. Champaign, IL: University of Illinois. Jun. XX. Accessed Month, DD, YYYY. doi:10.13012/B2IDB-5649852_V5
*NOTE: V4 is suppressed and V5 is replacing V4 with updated ‘Archer’ documents.
https://brightdata.com/license
Unlock the full potential of BBC broadcast data with our comprehensive dataset featuring transcripts, program schedules, headlines, topics, and multimedia resources. This all-in-one dataset is designed to empower media analysts, researchers, journalists, and advocacy groups with actionable insights for media analysis, transparency studies, and editorial assessments.
Dataset Features
Transcripts: Access detailed broadcast transcripts, including headlines, content, author details, and publication dates. Perfect for analyzing media framing, topic frequency, and news narratives across various programs.
Program Schedules: Explore program schedules with accurate timing, show names, and related metadata to track news coverage patterns and identify trends.
Topics and Keywords: Analyze categorized topics and keywords to understand content diversity, editorial focus, and recurring themes in news broadcasts.
Multimedia Content: Gain access to videos, images, and related articles linked to each broadcast for a holistic understanding of the news presentation.
Metadata: Includes critical data points like publication dates, last updates, content URLs, and unique IDs for easier referencing and cross-analysis.
Customizable Subsets for Specific Needs
Our BBC dataset is fully customizable to match your research or analytical goals. Focus on transcripts for in-depth media framing analysis, extract multimedia for content visualization studies, or dive into program schedules for broadcast trend analysis. Tailor the dataset to ensure it aligns with your objectives for maximum efficiency and relevance.
Popular Use Cases
Media Analysis: Evaluate news framing, content diversity, and topic coverage to assess editorial direction and media focus.
Transparency Studies: Analyze journalistic standards, corrections, and retractions to assess media integrity and accountability.
Audience Engagement: Identify recurring topics and trends in news content to understand audience preferences and behavior.
Market Analysis: Track media coverage of key industries, companies, and topics to analyze public sentiment and industry relevance.
Journalistic Integrity: Use transcripts and metadata to evaluate adherence to reporting practices, fairness, and transparency in news coverage.
Research and Scholarly Studies: Leverage transcripts and multimedia to support academic studies in journalism, media criticism, and political discourse analysis.
Whether you are evaluating transparency, conducting media criticism, or tracking broadcast trends, our BBC dataset provides you with the tools and insights needed for in-depth research and strategic analysis. Customize your access to focus on the most relevant data points for your unique needs.
https://choosealicense.com/licenses/unknown/
NEWS COPY
This dataset contains the evaluation and test sets for the NEWS COPY dataset. The original source can be found on GitHub. The license is unclear. It contains the following data:
Historical Newspapers
Training datasets can be found at chenghao/NEWS-COPY-train.
Citation
@inproceedings{silcock-etal-2020-noise, title = "Noise-Robust De-Duplication at Scale", author = "Silcock, Emily and D'Amico-Wong, Luca and Yang, Jinglin and Dell, Melissa", booktitle =… See the full description on the dataset page: https://huggingface.co/datasets/chenghao/NEWS-COPY-eval.
https://dataverse.harvard.edu/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.7910/DVN/1EMHTK
We demonstrate that the news media causes Americans to take public stands on issues, join national policy conversations, and express themselves publicly more often than they would otherwise --- all key components of democratic politics. We recruited 48 mostly small media outlets that allowed us to choose groups of outlets to write and publish articles, on subjects we approved, and dates we randomly assigned. We estimate the causal effect on proximal measures, such as website pageviews and Twitter discussion of the articles' specific subjects, and distal ones, such as national Twitter conversation in broad policy areas. Our intervention increased discussion in each broad policy area by $\approx$ 62.7% (relative to a day's volume), accounting for 13,166 additional posts, with similar effects across population subgroups.