Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
10kGNAD - A German topic classification dataset. Visit the dataset page for more information: https://tblock.github.io/10kGNAD/
https://crawlfeeds.com/privacy_policy
Explore the "Largest News Articles Dataset from CNBC," a comprehensive collection of news articles published by CNBC, one of the leading global news sources for business, finance, and current affairs.
This dataset includes thousands of articles covering a wide range of topics, such as financial markets, economic trends, technology, politics, health, and more. Each article in the dataset provides detailed information, including headlines, publication dates, authors, article content, and categories, offering valuable insights for researchers, data analysts, and media professionals.
Key Features:
Whether you're conducting research on financial markets, analyzing media trends, or developing new content, the "Largest News Articles Dataset from CNBC" is an invaluable resource that provides detailed insights and comprehensive coverage of the latest news.
https://brightdata.com/license
Stay ahead with our comprehensive News Dataset, designed for businesses, analysts, and researchers to track global events, monitor media trends, and extract valuable insights from news sources worldwide.
Dataset Features
News Articles: Access structured news data, including headlines, summaries, full articles, publication dates, and source details. Ideal for media monitoring and sentiment analysis.
Publisher & Source Information: Extract details about news publishers, including domain, region, and credibility indicators.
Sentiment & Topic Classification: Analyze news sentiment, categorize articles by topic, and track emerging trends in real time.
Historical & Real-Time Data: Retrieve historical archives or access continuously updated news feeds for up-to-date insights.
Customizable Subsets for Specific Needs
Our News Dataset is fully customizable, allowing you to filter data based on publication date, region, topic, sentiment, or specific news sources. Whether you need broad coverage for trend analysis or focused data for competitive intelligence, we tailor the dataset to your needs.
Popular Use Cases
Media Monitoring & Reputation Management: Track brand mentions, analyze media coverage, and assess public sentiment.
Market & Competitive Intelligence: Monitor industry trends, competitor activity, and emerging market opportunities.
AI & Machine Learning Training: Use structured news data to train AI models for sentiment analysis, topic classification, and predictive analytics.
Financial & Investment Research: Analyze news impact on stock markets, commodities, and economic indicators.
Policy & Risk Analysis: Track regulatory changes, geopolitical events, and crisis developments in real time.
Whether you're analyzing market trends, monitoring brand reputation, or training AI models, our News Dataset provides the structured data you need. Get started today and customize your dataset to fit your business objectives.
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Article and Category Dataset
Overview
This dataset contains a collection of articles, primarily news articles, along with their respective IAB (Interactive Advertising Bureau) categories. It can be a valuable resource for various natural language processing (NLP) tasks, including text classification, text generation, and more.
Dataset Information
Number of Samples: 871,909
Number of Categories: 26
Column Information
text: The text of the article.… See the full description on the dataset page: https://huggingface.co/datasets/shishir-dwi/News-Article-Categorization_IAB.
AG is a collection of more than 1 million news articles. News articles have been gathered from more than 2000 news sources by ComeToMyHead in more than 1 year of activity. ComeToMyHead is an academic news search engine which has been running since July, 2004. The dataset is provided by the academic community for research purposes in data mining (clustering, classification, etc), information retrieval (ranking, search, etc), xml, data compression, data streaming, and any other non-commercial activity. For more information, please refer to the link http://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html . The AG's news topic classification dataset is constructed by Xiang Zhang (xiang.zhang@nyu.edu) from the dataset above. It is used as a text classification benchmark in the following paper: Xiang Zhang, Junbo Zhao, Yann LeCun. Character-level Convolutional Networks for Text Classification. Advances in Neural Information Processing Systems 28 (NIPS 2015).
https://choosealicense.com/licenses/afl-3.0/
Dataset Card for news-data
Dataset Summary
The News Dataset is an English-language dataset containing just over 4k unique news articles scraped from AriseTv, one of the most popular news television channels in Nigeria.
Supported Tasks and Leaderboards
It supports news article classification into different categories.
Languages
English
Dataset Structure
Data Instances
''' {'Title': 'Nigeria: APC Yet to Zone Party Positions Ahead of… See the full description on the dataset page: https://huggingface.co/datasets/okite97/news-data.
https://crawlfeeds.com/privacy_policy
The Fox News Dataset is a comprehensive collection of over 1 million news articles, offering an unparalleled resource for analyzing media narratives, public discourse, and political trends. Covering articles up to the year 2023, this dataset is a treasure trove for researchers, analysts, and businesses interested in gaining deeper insights into the topics and trends covered by Fox News.
This large dataset is ideal for:
Discover additional resources for your research needs by visiting our news dataset collection. These datasets are tailored to support diverse analytical applications, including sentiment analysis and trend modeling.
The Fox News Dataset is a must-have for anyone interested in exploring large-scale media data and leveraging it for advanced analysis. Ready to dive into this wealth of information? Download the dataset now in CSV format and start uncovering the stories behind the headlines.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
India
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
It is trained on data of around 45,000 news articles with a mix of real and fake news articles. The dataset is provided by the University of Victoria.
[Figure: distribution of topics]
The dataset contains two types of articles: fake news and real news. It was collected from real-world sources; the truthful articles were obtained by crawling articles from Reuters.com (a news website). The fake news articles were collected from unreliable websites that were flagged by Politifact (a fact-checking organization in the USA) and Wikipedia. The dataset contains different types of articles on different topics; however, the majority of articles focus on political and world news topics.
The dataset consists of two CSV files. The first file, named “True.csv”, contains more than 12,600 articles from Reuters.com. The second file, named “Fake.csv”, contains more than 12,600 articles from different fake news outlets. Each article contains the following information: article title, text, type, and the date the article was published. To match the fake news data collected for kaggle.com, we focused mostly on collecting articles from 2016 to 2017. The collected data were cleaned and processed; however, the punctuation and mistakes that existed in the fake news were kept in the text.
The following table gives a breakdown of the categories and number of articles per category.
News | Size (Number of articles) | Subject | Articles per subject
---|---|---|---
Real-News | 21417 | World-News | 10145
Real-News | 21417 | Politics-News | 11272
Fake-News | 23481 | Government-News | 1570
Fake-News | 23481 | Middle-east | 778
Fake-News | 23481 | US News | 783
Fake-News | 23481 | Left-news | 4459
Fake-News | 23481 | Politics | 6841
Fake-News | 23481 | News | 9050
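The two-file layout described above could be merged into a single labelled collection roughly as follows. This is a minimal sketch using only Python's standard library: the header row (`title`, `text`, `subject`, `date`), the added `label` field, and the inline sample rows are all assumptions made for illustration, standing in for the real True.csv and Fake.csv.

```python
import csv
import io
from collections import Counter

# Tiny stand-ins for True.csv and Fake.csv. The header names are assumed,
# not taken from the actual files.
TRUE_CSV = """title,text,subject,date
Example real headline,Body of a real article,World-News,"December 1, 2017"
"""
FAKE_CSV = """title,text,subject,date
Example fake headline,Body of a fake article,Politics,"March 3, 2016"
Another fake headline,Another body,News,"July 9, 2017"
"""


def load_labelled(handle, label):
    """Read one CSV file and attach a hypothetical 'label' field to every row."""
    return [dict(row, label=label) for row in csv.DictReader(handle)]


def combine(true_handle, fake_handle):
    """Merge real and fake articles into one list of dicts."""
    return load_labelled(true_handle, "real") + load_labelled(fake_handle, "fake")


articles = combine(io.StringIO(TRUE_CSV), io.StringIO(FAKE_CSV))
label_counts = Counter(article["label"] for article in articles)
print(label_counts)
```

With the real files, the `io.StringIO` objects would be replaced by handles from `open("True.csv", newline="", encoding="utf-8")` and the equivalent for Fake.csv; the encoding is also an assumption.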
Data Access: The data in the research collection provided may only be used for research purposes. Portions of the data are copyrighted and have commercial value as data, so you must be careful to use them only for research purposes. Due to these restrictions, the collection is not open data. Please fill out the form and upload the Data Sharing Agreement at Google Form.
Citation
Please cite our work as
@article{shahi2021overview, title={Overview of the CLEF-2021 CheckThat! lab task 3 on fake news detection}, author={Shahi, Gautam Kishore and Stru{\ss}, Julia Maria and Mandl, Thomas}, journal={Working Notes of CLEF}, year={2021} }
Problem Definition: Given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other (e.g., claims in dispute) and detect the topical domain of the article. This task will run in English and German.
Subtask 3: Multi-class fake news detection of news articles (English). Sub-task A frames fake news detection as a four-class classification problem. The training data will be released in batches and will comprise roughly 900 articles with their respective labels. Given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other. Our definitions for the categories are as follows:
False - The main claim made in an article is untrue.
Partially False - The main claim of an article is a mixture of true and false information. The article contains partially true and partially false information but cannot be considered 100% true. It includes all articles in categories like partially false, partially true, mostly true, miscaptioned, misleading etc., as defined by different fact-checking services.
True - This rating indicates that the primary elements of the main claim are demonstrably true.
Other - An article that cannot be categorised as true, false, or partially false due to lack of evidence about its claims. This category includes articles in dispute and unproven articles.
Input Data
The data will be provided in the format of Id, title, text, rating, the domain; the description of the columns is as follows:
Output data format
Sample File
public_id, predicted_rating
1, false
2, true
Sample file
public_id, predicted_domain
1, health
2, crime
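A file in the sample layout shown above could be produced along these lines. This is only a sketch: the function names, the `", "` separator (copied from the sample rows), and the lowercase label spellings in `ALLOWED_RATINGS` are assumptions, so the official submission instructions should take precedence.

```python
import io

# Assumed spellings of the four classes; the samples above only show "false" and "true".
ALLOWED_RATINGS = {"false", "partially false", "true", "other"}


def check_ratings(predictions):
    """Raise ValueError if any predicted rating is outside the assumed label set."""
    bad = [(pid, rating) for pid, rating in predictions if rating not in ALLOWED_RATINGS]
    if bad:
        raise ValueError(f"unexpected ratings: {bad}")


def write_predictions(handle, predictions, column="predicted_rating"):
    """Write (public_id, value) pairs in the 'public_id, <column>' layout."""
    handle.write(f"public_id, {column}\n")
    for public_id, value in predictions:
        handle.write(f"{public_id}, {value}\n")


ratings = [(1, "false"), (2, "true")]
check_ratings(ratings)
buffer = io.StringIO()
write_predictions(buffer, ratings)
print(buffer.getvalue())
```

Passing `column="predicted_domain"` gives the second sample layout; for a real submission the `io.StringIO` buffer would be swapped for an opened file.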
Additional data for Training
To train their models, participants can use additional data in a similar format; some datasets are available on the web. We don't provide the ground truth for those datasets. For testing, we will not use any articles from other datasets. Some of the possible sources:
IMPORTANT!
Evaluation Metrics
This task is evaluated as a classification task. We will use the F1-macro measure for the ranking of teams. There is a limit of 5 runs (total and not per day), and only one person from a team is allowed to submit runs.
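For reference, macro-averaged F1 is the unweighted mean of the per-class F1 scores, so rare classes count as much as frequent ones. The sketch below is a plain-Python illustration, not the organisers' scorer; in particular, averaging over the union of gold and predicted labels is an assumption.

```python
def macro_f1(gold, predicted):
    """Unweighted mean of per-class F1 over all labels seen in gold or predicted."""
    labels = sorted(set(gold) | set(predicted))
    scores = []
    for label in labels:
        pairs = list(zip(gold, predicted))
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)
    return sum(scores) / len(scores)


gold = ["true", "false", "false", "other"]
predicted = ["true", "false", "true", "other"]
print(round(macro_f1(gold, predicted), 4))
```

In this toy case the per-class F1 values are 2/3 for "true", 2/3 for "false", and 1 for "other", giving a macro average of 7/9.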
Baseline: For this task, we have created a baseline system. The baseline system can be found at https://zenodo.org/record/6362498
Submission Link: Coming soon
Related Work
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The disaster-news headline generation dataset (news_articles_and_titles) contains a set of disaster-news articles and their headlines/titles. This dataset may be used to develop a method for generating a good-quality headline for a disaster-news article.
2.7 million news articles and essays
Dataset Description
2.7 million news articles and essays from 27 American publications. Includes date, title, publication, article text, publication name, year, month, and URL (for some). Articles mostly span from 2016 to early 2020.
Type: CSV Size: 3.4 GB compressed, 8.8 GB uncompressed Created by: Andrew Thompson Date added: 4/3/2020 Date modified: 4/3/2020 source: Component one Datasets 2.7 Millions Date of Download and processed:… See the full description on the dataset page: https://huggingface.co/datasets/rjac/all-the-news-2-1-Component-one.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains the extended version of the authors' earlier work: https://zenodo.org/records/6507872, where pairs of news articles drawn from the first half of 2020 are annotated for seven aspects of similarity in the original version as well as an additional FRAME aspect:
GEO: How similar is the geographic focus (places, cities, countries, etc.) of the two articles?
ENT: How similar are the named entities (e.g., people, companies, organizations, products, named living beings), excluding previously considered locations appearing in the two articles?
TIME: Are the two articles relevant to similar time periods or describing similar time periods?
NAR: How similar are the narrative schemas presented in the two articles?
OVERALL: Overall, are the two articles covering the same substantive news story? (excluding style, framing, and tone)
STYLE: Do the articles have similar writing styles?
TONE: Do the articles have similar tones?
FRAME: Do the articles have similar framing and express similar opinions?
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This dataset contains news headlines collected from 20 major Indonesian news portals through web scraping conducted on February 23, 2025. The dataset is structured into three key components: the source of the news, the headline title, and the date of publication. By compiling headlines from multiple sources, this dataset provides a comprehensive snapshot of trending topics across different media outlets in Indonesia. It can be utilized for various analytical and research purposes, such as trending topic analysis, sentiment analysis, and natural language processing (NLP) applications. Researchers can use this dataset to track public sentiment, identify recurring themes in news coverage, and train machine learning models for text-based tasks such as classification, keyword extraction, and summarization.
With 1,174 rows and 3 columns, this dataset contains no missing values, ensuring its usability for data analysis and modeling. The three available variables are: source, which represents the name of the news portal where the headline was published; title, which contains the actual headline of the news article; and date, which indicates the publication date of each news piece. These variables make it possible to conduct media monitoring, study media bias, and compare how different news platforms report on similar topics. Additionally, the dataset is valuable for time-series analysis, allowing users to observe how news trends evolve over time.
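As a small illustration of the media-monitoring use mentioned above, headlines can be tallied per portal with the standard library alone. The column names `source`, `title`, and `date` come from the description; the sample rows, the portal names, and the date format are invented for this sketch.

```python
import csv
import io
from collections import Counter

# Invented sample rows; only the three column names come from the description.
SAMPLE = """source,title,date
Portal A,Headline one,2025-02-23
Portal B,Headline two,2025-02-23
Portal A,Headline three,2025-02-22
"""


def headlines_per_source(handle):
    """Count how many headlines each news portal contributed."""
    return Counter(row["source"] for row in csv.DictReader(handle))


counts = headlines_per_source(io.StringIO(SAMPLE))
for source, n in counts.most_common():
    print(source, n)
```

The same pattern, keyed on `row["date"]` instead of `row["source"]`, would give a per-day count as a starting point for time-series analysis.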
https://academictorrents.com/nolicensespecified
https://www.apache.org/licenses/LICENSE-2.0
A dataset of popular news articles from various sources. Crawled date: Oct, 2024. Documents count: 12,000.
https://crawlfeeds.com/privacy_policy
This dataset contains over 27,000 news articles sourced from CNN.com, including full content, metadata, and media fields. Each article is enriched with publish dates, author information, descriptions, and full raw + cleaned content—perfect for media research, sentiment analysis, topic modeling, and natural language processing (NLP) projects.
Last crawled in July 2021, this collection offers a historical snapshot of CNN’s reporting and editorial content.
News content analysis
Fake news detection & bias tracking
Topic classification and clustering
Training AI/NLP models
Historical news trend research
Media monitoring tools
Archived — no current updates, great for snapshot-based analysis
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset combines two parts: fake news and true news. The fake news was collected from Kaggle, and some of the true news was collected from IEEE DataPort. Additional true news was needed to match the amount of fake news, so I collected more true news from various trusted online sites. Finally, I concatenated the fake and true news into a single dataset to help researchers who want to study this topic further.
https://crawlfeeds.com/privacy_policy
This dataset contains more than 1 million news articles, with all the data points present on each news article page extracted. The BBC news articles were first collected in 2021 and cover all the categories present on the BBC site.
This news dataset is ideal for text classification, finding popular categories, NLP, and other research purposes.
Dataset is available in JSON format.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is part of the Monash, UEA & UCR time series regression repository. http://tseregression.org/
The goal of this dataset is to predict the sentiment score of a news headline. This dataset contains 83164 time series obtained from the News Popularity in Multiple Social Media Platforms dataset from the UCI repository. This is a large data set of news items and their respective social feedback on multiple platforms: Facebook, Google+ and LinkedIn. The collected data relates to a period of 8 months, between November 2015 and July 2016, accounting for about 100,000 news items on four different topics: economy, microsoft, obama and palestine. This data set is tailored for evaluative comparisons in predictive analytics tasks, while also allowing for tasks in other research areas such as topic detection and tracking, sentiment analysis in short text, first story detection or news recommendation. Each time series has 3 dimensions.
Please refer to https://archive.ics.uci.edu/ml/datasets/News+Popularity+in+Multiple+Social+Media+Platforms for more details
Citation request
Nuno Moniz and Luis Torgo (2018), Multi-Source Social Feedback of Online News Feeds, CoRR