100+ datasets found

Ten Thousand German News Articles Dataset
tblock.github.io
kaggle.com
csv
Updated Mar 5, 2019
Cite
T. Block (2019). Ten Thousand German News Articles Dataset [Dataset]. https://tblock.github.io/10kGNAD/
Explore at:
Available download formats: csv
Dataset updated
Mar 5, 2019
Authors
T. Block
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
10kGNAD - A German topic classification dataset. Visit the dataset page for more information: https://tblock.github.io/10kGNAD/
Largest news articles dataset from CNBC
crawlfeeds.com
csv, zip
Updated Jan 6, 2025
Cite
Crawl Feeds (2025). Largest news articles dataset from CNBC [Dataset]. https://crawlfeeds.com/datasets/cnbc-news-dataset
Explore at:
Available download formats: zip, csv
Dataset updated
Jan 6, 2025
Dataset authored and provided by
Crawl Feeds
License
https://crawlfeeds.com/privacy_policy
Description
Explore the "Largest News Articles Dataset from CNBC," a comprehensive collection of news articles published by CNBC, one of the leading global news sources for business, finance, and current affairs.

This dataset includes thousands of articles covering a wide range of topics, such as financial markets, economic trends, technology, politics, health, and more. Each article in the dataset provides detailed information, including headlines, publication dates, authors, article content, and categories, offering valuable insights for researchers, data analysts, and media professionals.

Key Features:

Extensive Coverage: Thousands of news articles from CNBC, covering a diverse array of topics including business, finance, technology, and global news.

Detailed Metadata: Each article includes essential details such as headline, publication date, author, content, and category, allowing for in-depth analysis and research.

Ideal for Analysis: Perfect for researchers, data scientists, and content creators looking to analyze trends in news reporting, study media coverage, or develop content strategies.

Up-to-Date Information: Provides a rich source of information on current events and market trends, helping professionals stay informed and make data-driven decisions.

Whether you're conducting research on financial markets, analyzing media trends, or developing new content, the "Largest News Articles Dataset from CNBC" is an invaluable resource that provides detailed insights and comprehensive coverage of the latest news.
News Datasets
brightdata.com
.json, .csv, .xlsx
Cite
Bright Data, News Datasets [Dataset]. https://brightdata.com/products/datasets/news
Explore at:
Available download formats: .json, .csv, .xlsx
Dataset authored and provided by
Bright Data (https://brightdata.com/)
License
https://brightdata.com/license
Area covered
Worldwide
Description
Stay ahead with our comprehensive News Dataset, designed for businesses, analysts, and researchers to track global events, monitor media trends, and extract valuable insights from news sources worldwide.

Dataset Features

News Articles: Access structured news data, including headlines, summaries, full articles, publication dates, and source details. Ideal for media monitoring and sentiment analysis.

Publisher & Source Information: Extract details about news publishers, including domain, region, and credibility indicators.

Sentiment & Topic Classification: Analyze news sentiment, categorize articles by topic, and track emerging trends in real time.

Historical & Real-Time Data: Retrieve historical archives or access continuously updated news feeds for up-to-date insights.

Customizable Subsets for Specific Needs

Our News Dataset is fully customizable, allowing you to filter data based on publication date, region, topic, sentiment, or specific news sources. Whether you need broad coverage for trend analysis or focused data for competitive intelligence, we tailor the dataset to your needs.

Popular Use Cases

Media Monitoring & Reputation Management: Track brand mentions, analyze media coverage, and assess public sentiment.

Market & Competitive Intelligence: Monitor industry trends, competitor activity, and emerging market opportunities.

AI & Machine Learning Training: Use structured news data to train AI models for sentiment analysis, topic classification, and predictive analytics.

Financial & Investment Research: Analyze news impact on stock markets, commodities, and economic indicators.

Policy & Risk Analysis: Track regulatory changes, geopolitical events, and crisis developments in real time.

Whether you're analyzing market trends, monitoring brand reputation, or training AI models, our News Dataset provides the structured data you need. Get started today and customize your dataset to fit your business objectives.
News-Article-Categorization_IAB
huggingface.co
Updated Nov 2, 2023
Cite
Shishir Dwivedi (2023). News-Article-Categorization_IAB [Dataset]. https://huggingface.co/datasets/shishir-dwi/News-Article-Categorization_IAB
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Nov 2, 2023
Authors
Shishir Dwivedi
License
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Article and Category Dataset

Overview

This dataset contains a collection of articles, primarily news articles, along with their respective IAB (Interactive Advertising Bureau) categories. It can be a valuable resource for various natural language processing (NLP) tasks, including text classification, text generation, and more.

Dataset Information

Number of Samples: 871,909

Number of Categories: 26

Column Information

text: The text of the article.… See the full description on the dataset page: https://huggingface.co/datasets/shishir-dwi/News-Article-Categorization_IAB.
agnewsadapted
huggingface.co
Updated Apr 13, 2023
Cite
Eduardo Ferreira Brigham (2023). agnewsadapted [Dataset]. https://huggingface.co/datasets/ebrigham/agnewsadapted
Explore at:
Dataset updated
Apr 13, 2023
Authors
Eduardo Ferreira Brigham
Description
AG is a collection of more than 1 million news articles. News articles have been gathered from more than 2000 news sources by ComeToMyHead in more than 1 year of activity. ComeToMyHead is an academic news search engine which has been running since July 2004. The dataset is provided by the academic community for research purposes in data mining (clustering, classification, etc.), information retrieval (ranking, search, etc.), XML, data compression, data streaming, and any other non-commercial activity. For more information, please refer to the link http://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html . The AG's news topic classification dataset is constructed by Xiang Zhang (xiang.zhang@nyu.edu) from the dataset above. It is used as a text classification benchmark in the following paper: Xiang Zhang, Junbo Zhao, Yann LeCun. Character-level Convolutional Networks for Text Classification. Advances in Neural Information Processing Systems 28 (NIPS 2015).
news-data
huggingface.co
Updated May 29, 2023
Cite
Okite Chimaobi Samuel (2023). news-data [Dataset]. https://huggingface.co/datasets/okite97/news-data
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
May 29, 2023
Authors
Okite Chimaobi Samuel
License
https://choosealicense.com/licenses/afl-3.0/
Description
Dataset Card for news-data

Dataset Summary

The News Dataset is an English-language dataset containing just over 4k unique news articles scraped from AriseTv, one of the most popular television news channels in Nigeria.

Supported Tasks and Leaderboards

It supports news article classification into different categories.

Languages

English

Dataset Structure

Data Instances

''' {'Title': 'Nigeria: APC Yet to Zone Party Positions Ahead of… See the full description on the dataset page: https://huggingface.co/datasets/okite97/news-data.
Fox News dataset is for analyzing media trends and narratives
crawlfeeds.com
csv, zip
Updated May 19, 2025
Cite
Crawl Feeds (2025). Fox News dataset is for analyzing media trends and narratives [Dataset]. https://crawlfeeds.com/datasets/fox-news-dataset
Explore at:
Available download formats: zip, csv
Dataset updated
May 19, 2025
Dataset authored and provided by
Crawl Feeds
License
https://crawlfeeds.com/privacy_policy
Description
The Fox News Dataset is a comprehensive collection of over 1 million news articles, offering an unparalleled resource for analyzing media narratives, public discourse, and political trends. Covering articles up to the year 2023, this dataset is a treasure trove for researchers, analysts, and businesses interested in gaining deeper insights into the topics and trends covered by Fox News.

Key Features of the Fox News Dataset

Extensive Coverage: Contains more than 1 million articles spanning various topics and events up to 2023.

Research-Ready: Perfect for text classification, natural language processing (NLP), and other research purposes.

Format: Provided in CSV format for seamless integration into analytical and research tools.

Why Use This Dataset?

This large dataset is ideal for:

Text Classification: Develop machine learning models to classify and categorize news content.

Natural Language Processing (NLP): Conduct sentiment analysis, keyword extraction, or topic modeling.

Media and Political Research: Analyze media narratives, public opinion, and political trends reflected in Fox News articles.

Trend Analysis: Identify shifts in public discourse and media focus over time.

Explore More News Datasets

Discover additional resources for your research needs by visiting our news dataset collection. These datasets are tailored to support diverse analytical applications, including sentiment analysis and trend modeling.

The Fox News Dataset is a must-have for anyone interested in exploring large-scale media data and leveraging it for advanced analysis. Ready to dive into this wealth of information? Download the dataset now in CSV format and start uncovering the stories behind the headlines.
Data from: COVID-19 News Articles
ieee-dataport.org
Updated May 18, 2022
Cite
Piyush Ghasiya (2022). COVID-19 News Articles [Dataset]. https://ieee-dataport.org/documents/covid-19-news-articles
Explore at:
Dataset updated
May 18, 2022
Authors
Piyush Ghasiya
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
India

ISOT Fake News Dataset

kaggle.com

Updated Dec 29, 2024

Cite

Rahul Goel (2024). ISOT Fake News Dataset [Dataset]. https://www.kaggle.com/datasets/rahulogoel/isot-fake-news-dataset/discussion

Explore at:

Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)

Dataset updated

Dec 29, 2024

Dataset provided by

Kaggle (http://kaggle.com/)

Authors

Rahul Goel

License

MIT License: https://opensource.org/licenses/MIT
License information was derived automatically

Description

The dataset comprises around 45,000 news articles, a mix of real and fake news. It is provided by the University of Victoria.

[Figure: distribution of topics]

The dataset contains two types of articles: fake and real news. It was collected from real-world sources; the truthful articles were obtained by crawling Reuters.com (a news website). The fake news articles were collected from unreliable websites that were flagged by Politifact (a fact-checking organization in the USA) and Wikipedia. The dataset contains different types of articles on different topics; however, the majority of articles focus on political and world news.

The dataset consists of two CSV files. The first file, “True.csv”, contains more than 12,600 articles from Reuters.com. The second file, “Fake.csv”, contains more than 12,600 articles from different fake news outlets. Each article contains the following information: article title, text, type, and the date the article was published. To match the fake news data collected for kaggle.com, we focused mostly on collecting articles from 2016 to 2017. The collected data were cleaned and processed; however, the punctuation and mistakes that existed in the fake news were kept in the text.
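As a rough illustration of how the two files described above might be combined into one labeled collection, here is a minimal Python sketch using only the standard library. The column names (`title`, `text`, `type`, `date`) are assumed from the description and may differ in the actual files; the rows are invented stand-ins, and `load_labeled` is a hypothetical helper, not part of the dataset.

```python
import csv
import io


def load_labeled(true_file, fake_file):
    """Read both CSV sources and tag each row with a 'real' or 'fake' label."""
    rows = []
    for handle, label in ((true_file, "real"), (fake_file, "fake")):
        for row in csv.DictReader(handle):
            row["label"] = label  # label comes from the file, not from a column
            rows.append(row)
    return rows


# In-memory stand-ins for True.csv and Fake.csv (invented rows, assumed columns).
true_csv = io.StringIO(
    "title,text,type,date\n"
    "Headline A,Body A,politicsNews,2017-01-01\n"
)
fake_csv = io.StringIO(
    "title,text,type,date\n"
    "Headline B,Body B,News,2016-05-02\n"
    "Headline C,Body C,politics,2016-06-03\n"
)

articles = load_labeled(true_csv, fake_csv)
print(len(articles), [a["label"] for a in articles])
```

With the real files, the two `io.StringIO` objects would be replaced by files opened with `open(..., newline="", encoding="utf-8")`.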

The following table gives a breakdown of the categories and number of articles per category.

News	Size (number of articles)	Subject type	Articles per subject
Real-News	21417	World-News	10145
		Politics-News	11272
Fake-News	23481	Government-News	1570
		Middle-east	778
		US News	783
		Left-news	4459
		Politics	6841
		News	9050
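The per-subject counts in the table can be checked against the stated totals with a few lines of Python. The dictionary names below are illustrative only; the numbers are copied from the table.

```python
# Per-subject article counts as listed in the table.
real_subjects = {"World-News": 10145, "Politics-News": 11272}
fake_subjects = {
    "Government-News": 1570,
    "Middle-east": 778,
    "US News": 783,
    "Left-news": 4459,
    "Politics": 6841,
    "News": 9050,
}

real_total = sum(real_subjects.values())
fake_total = sum(fake_subjects.values())

# Both sums match the "Size" column, and together they give the
# "around 45,000" figure quoted in the description.
print(real_total, fake_total, real_total + fake_total)
```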

Note- To cite this dataset use the information given by original authors:

Ahmed H, Traore I, Saad S. “Detecting opinion spams and fake news using text classification”, Journal of Security and Privacy, Volume 1, Issue 1, Wiley, January/February 2018.
Ahmed H, Traore I, Saad S. (2017) “Detection of Online Fake News Using N-Gram Analysis and Machine Learning Techniques”. In: Traore I., Woungang I., Awad A. (eds) Intelligent, Secure, and Dependable Systems in Distributed and Cloud Environments. ISDDC 2017. Lecture Notes in Computer Science, vol 10618. Springer, Cham (pp. 127-138)

CT-FAN-22 corpus: A Multilingual dataset for Fake News Detection
zenodo.org
Updated Oct 23, 2022
Cite
Gautam Kishore Shahi; Julia Maria Struß; Thomas Mandl (2022). CT-FAN-22 corpus: A Multilingual dataset for Fake News Detection [Dataset]. http://doi.org/10.5281/zenodo.5775511
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.5775511
Dataset updated
Oct 23, 2022
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Gautam Kishore Shahi; Julia Maria Struß; Thomas Mandl
Description
Data Access: The data in the research collection provided may only be used for research purposes. Portions of the data are copyrighted and have commercial value as data, so you must be careful to use them only for research purposes. Due to these restrictions, the collection is not open data. Please fill out the form and upload the Data Sharing Agreement at Google Form.

Citation

Please cite our work as

@article{shahi2021overview,
  title={Overview of the CLEF-2021 CheckThat! lab task 3 on fake news detection},
  author={Shahi, Gautam Kishore and Stru{\ss}, Julia Maria and Mandl, Thomas},
  journal={Working Notes of CLEF},
  year={2021}
}

Problem Definition: Given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other (e.g., claims in dispute) and detect the topical domain of the article. This task will run in English and German.

Subtask 3: Multi-class fake news detection of news articles (English). Sub-task A frames fake news detection as a four-class classification problem. The training data will be released in batches and comprise roughly 900 articles with their respective labels. Given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other. Our definitions for the categories are as follows:

False - The main claim made in an article is untrue.

Partially False - The main claim of an article is a mixture of true and false information. The article contains partially true and partially false information but cannot be considered 100% true. It includes all articles in categories like partially false, partially true, mostly true, miscaptioned, misleading etc., as defined by different fact-checking services.

True - This rating indicates that the primary elements of the main claim are demonstrably true.

Other- An article that cannot be categorised as true, false, or partially false due to lack of evidence about its claims. This category includes articles in dispute and unproven articles.

Input Data

The data will be provided with the fields ID, title, text, rating, and domain; the columns are described as follows:

ID- Unique identifier of the news article

Title- Title of the news article

text- Text mentioned inside the news article

our rating - class of the news article as false, partially false, true, other

Output data format

public_id- Unique identifier of the news article

predicted_rating- predicted class

Sample File

public_id, predicted_rating
1, false
2, true
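A submission file in the two-column shape shown above could be produced along these lines. This is only a sketch: `write_submission` is a hypothetical helper, and the exact delimiter spacing and line endings the organizers expect are assumptions, since the description shows only the sample rows.

```python
import csv
import io


def write_submission(predictions, handle):
    """Write (public_id, predicted_rating) pairs with a header row."""
    writer = csv.writer(handle)
    writer.writerow(["public_id", "predicted_rating"])
    for public_id, rating in predictions:
        writer.writerow([public_id, rating])


# Write to an in-memory buffer here; a real run would pass an open file instead.
buffer = io.StringIO()
write_submission([(1, "false"), (2, "true")], buffer)
submission_text = buffer.getvalue()
print(submission_text)
```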

Sample file

public_id, predicted_domain
1, health
2, crime

Additional data for Training

To train their models, participants can use additional data in a similar format; some datasets are available on the web. We do not provide the ground truth for those datasets. For testing, we will not use any articles from other datasets. Some of the possible sources:

Fakenews Classification Datasets

Fake News Detection Challenge KDD 2020

FakeNewsNet

IMPORTANT!

We have used the data from 2010 to 2021, and the content of fake news is mixed up with several topics like elections, COVID-19 etc.

Evaluation Metrics

This task is evaluated as a classification task. We will use the F1-macro measure for the ranking of teams. There is a limit of 5 runs (total and not per day), and only one person from a team is allowed to submit runs.
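For readers unfamiliar with the F1-macro measure mentioned above, the following Python sketch computes it from scratch: F1 is calculated per class and then averaged with equal weight. It is not the organizers' scoring script. Scoring a class with no gold and no predicted instances as 0.0 is an assumption, and the example labels are invented.

```python
def macro_f1(gold, predicted, labels):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in labels:
        pairs = list(zip(gold, predicted))
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        denom = 2 * tp + fp + fn
        # Assumption: a class that never occurs in gold or predictions scores 0.0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(labels)


labels = ["false", "partially false", "true", "other"]
gold = ["false", "true", "true", "other"]
predicted = ["false", "true", "false", "other"]

score = macro_f1(gold, predicted, labels)
print(round(score, 4))
```

Because every class counts equally, a rare class such as "other" influences the ranking as much as a frequent one.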

Baseline: For this task, we have created a baseline system. The baseline system can be found at https://zenodo.org/record/6362498

Submission Link: Coming soon

Related Work

Shahi GK. AMUSED: An Annotation Framework of Multi-modal Social Media Data. arXiv preprint arXiv:2010.00502. 2020 Oct 1. https://arxiv.org/pdf/2010.00502.pdf

G. K. Shahi and D. Nandini, “FakeCovid – a multilingual cross-domain fact check news dataset for covid-19,” in workshop Proceedings of the 14th International AAAI Conference on Web and Social Media, 2020. http://workshop-proceedings.icwsm.org/abstract?id=2020_14

Shahi, G. K., Dirkson, A., & Majchrzak, T. A. (2021). An exploratory study of covid-19 misinformation on twitter. Online Social Networks and Media, 22, 100104. doi: 10.1016/j.osnem.2020.100104

Shahi, G. K., Struß, J. M., & Mandl, T. (2021). Overview of the CLEF-2021 CheckThat! lab task 3 on fake news detection. Working Notes of CLEF.

Nakov, P., Da San Martino, G., Elsayed, T., Barrón-Cedeno, A., Míguez, R., Shaar, S., ... & Mandl, T. (2021, March). The CLEF-2021 CheckThat! lab on detecting check-worthy claims, previously fact-checked claims, and fake news. In European Conference on Information Retrieval (pp. 639-649). Springer, Cham.

Nakov, P., Da San Martino, G., Elsayed, T., Barrón-Cedeño, A., Míguez, R., Shaar, S., ... & Kartal, Y. S. (2021, September). Overview of the CLEF–2021 CheckThat! Lab on Detecting Check-Worthy Claims, Previously Fact-Checked Claims, and Fake News. In International Conference of the Cross-Language Evaluation Forum for European Languages (pp. 264-291). Springer, Cham.
A disaster-news article headline generation dataset
ieee-dataport.org
Updated Jul 8, 2024
Cite
Sumanta Banerjee (2024). A disaster-news article headline generation dataset [Dataset]. https://ieee-dataport.org/documents/disaster-news-article-headline-generation-dataset
Explore at:
Dataset updated
Jul 8, 2024
Authors
Sumanta Banerjee
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The disaster-news headline generation dataset (news_articles_and_titles) contains a set of disaster-news articles and their headlines/titles. This dataset may be used to develop a method for generating a good-quality headline for a disaster-news article.
all-the-news-2-1-Component-one
huggingface.co
Updated Jul 2, 2019
Cite
Rafael Arias Calles (2019). all-the-news-2-1-Component-one [Dataset]. https://huggingface.co/datasets/rjac/all-the-news-2-1-Component-one
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Jul 2, 2019
Authors
Rafael Arias Calles
Description
2.7 million news articles and essays

Dataset Description

2.7 million news articles and essays from 27 American publications. Includes date, title, publication, article text, publication name, year, month, and URL (for some). Articles mostly span from 2016 to early 2020.

Type: CSV
Size: 3.4 GB compressed, 8.8 GB uncompressed
Created by: Andrew Thompson
Date added: 4/3/2020
Date modified: 4/3/2020
Source: Component one Datasets 2.7 Millions
Date of Download and processed:… See the full description on the dataset page: https://huggingface.co/datasets/rjac/all-the-news-2-1-Component-one.
Multilingual news article similarity dataset
data.niaid.nih.gov
zenodo.org
Updated Jul 7, 2024
Cite
chen, xi (2024). Multilingual news article similarity dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10516515
Explore at:
Dataset updated
Jul 7, 2024
Dataset authored and provided by
chen, xi
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset contains the extended version of the authors' earlier work: https://zenodo.org/records/6507872, where pairs of news articles drawn from the first half of 2020 are annotated for seven aspects of similarity in the original version as well as an additional FRAME aspect:

GEO: How similar is the geographic focus (places, cities, countries, etc.) of the two articles?

ENT: How similar are the named entities (e.g., people, companies, organizations, products, named living beings), excluding previously considered locations appearing in the two articles?

TIME: Are the two articles relevant to similar time periods or describing similar time periods?

NAR: How similar are the narrative schemas presented in the two articles?

OVERALL: Overall, are the two articles covering the same substantive news story? (excluding style, framing, and tone)

STYLE: Do the articles have similar writing styles?

TONE: Do the articles have similar tones?

FRAME: Do the articles have similar framing and express similar opinions?
Indonesia News Portal Headlines Dataset
kaggle.com
Updated Feb 23, 2025
Cite
Mayesq Prameswari (2025). Indonesia News Portal Headlines Dataset [Dataset]. http://doi.org/10.34740/kaggle/dsv/10831968
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Unique identifier
https://doi.org/10.34740/kaggle/dsv/10831968
Dataset updated
Feb 23, 2025
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Mayesq Prameswari
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
This dataset contains news headlines collected from 20 major Indonesian news portals through web scraping conducted on February 23, 2025. The dataset is structured into three key components: the source of the news, the headline title, and the date of publication. By compiling headlines from multiple sources, this dataset provides a comprehensive snapshot of trending topics across different media outlets in Indonesia. It can be utilized for various analytical and research purposes, such as trending topic analysis, sentiment analysis, and natural language processing (NLP) applications. Researchers can use this dataset to track public sentiment, identify recurring themes in news coverage, and train machine learning models for text-based tasks such as classification, keyword extraction, and summarization.

With 1,174 rows and 3 columns, this dataset contains no missing values, ensuring its usability for data analysis and modeling. The three available variables are: source, which represents the name of the news portal where the headline was published; title, which contains the actual headline of the news article; and date, which indicates the publication date of each news piece. These variables make it possible to conduct media monitoring, study media bias, and compare how different news platforms report on similar topics. Additionally, the dataset is valuable for time-series analysis, allowing users to observe how news trends evolve over time.
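As one example of the media-monitoring use mentioned above, headlines can be tallied per portal using the three described variables (`source`, `title`, `date`). The sketch below is illustrative only: the rows are invented, and the exact header spelling in the real file is an assumption.

```python
import csv
import io
from collections import Counter

# Invented sample rows using the three described columns.
sample = io.StringIO(
    "source,title,date\n"
    "Portal A,Headline one,2025-02-23\n"
    "Portal B,Headline two,2025-02-23\n"
    "Portal A,Headline three,2025-02-23\n"
)

# Count how many headlines each news portal contributed.
counts = Counter(row["source"] for row in csv.DictReader(sample))
print(counts.most_common())
```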
Online News Popularity Data Set
academictorrents.com
bittorrent
Updated Feb 11, 2016
Cite
Kelwin Fernandes and Pedro Vinagre and Paulo Cortez and Pedro Sernadela (2016). Online News Popularity Data Set [Dataset]. https://academictorrents.com/details/95d3b03397a0bafd74a662fe13ba3550c13b7ce1
Explore at:
Available download formats: bittorrent (7476401)
Dataset updated
Feb 11, 2016
Dataset authored and provided by
Kelwin Fernandes and Pedro Vinagre and Paulo Cortez and Pedro Sernadela
License
https://academictorrents.com/nolicensespecified
Description
Data Set Information:

* The articles were published by Mashable (www.mashable.com), and their content, as well as the rights to reproduce it, belongs to them. Hence, this dataset does not share the original content but some statistics associated with it. The original content can be publicly accessed and retrieved using the provided URLs.
* Acquisition date: January 8, 2015
* The estimated relative performance values were estimated by the authors using a Random Forest classifier and a rolling window as the assessment method. See their article for more details on how the relative performance values were set.

Attribute Information:

Number of Attributes: 61 (58 predictive attributes, 2 non-predictive, 1 goal field)

0. url: URL of the article (non-predictive)
1. timedelta: Days between the article publication and the dataset acquisition (non-predictive)
2. n_tokens_title: Number of words in the title
3. n_tokens_content: Number of words in the content
4. n_unique_tokens: Rate of unique words in the conte
Popular News articles
apitube.io
Updated Oct 2, 2024
Cite
APITube (2024). Popular News articles [Dataset]. https://apitube.io/free-datasets/popular-news-articles
Explore at:
Dataset updated
Oct 2, 2024
Dataset authored and provided by
APITube
License
https://www.apache.org/licenses/LICENSE-2.0
Time period covered
Jan 1, 2020 - Present
Area covered
Global
Variables measured
Category, Language, Sentiment, News Content, News Sources, News Headlines, Publication Date, Geographic Location
Description
A dataset of popular news articles from various sources. Crawled date: Oct, 2024. Documents count: 12,000.
CNN news dataset
crawlfeeds.com
json, zip
Updated Jun 30, 2025
Cite
Crawl Feeds (2025). CNN news dataset [Dataset]. https://crawlfeeds.com/datasets/cnn-news-dataset
Explore at:
Available download formats: json, zip
Dataset updated
Jun 30, 2025
Dataset authored and provided by
Crawl Feeds
License
https://crawlfeeds.com/privacy_policy
Description
This dataset contains over 27,000 news articles sourced from CNN.com, including full content, metadata, and media fields. Each article is enriched with publish dates, author information, descriptions, and full raw + cleaned content—perfect for media research, sentiment analysis, topic modeling, and natural language processing (NLP) projects.

Last crawled in July 2021, this collection offers a historical snapshot of CNN’s reporting and editorial content.

Use Cases:

News content analysis

Fake news detection & bias tracking

Topic classification and clustering

Training AI/NLP models

Historical news trend research

Media monitoring tools

Update Frequency:

Archived — no current updates, great for snapshot-based analysis
Fake and True News Dataset
figshare.com
txt
Updated Dec 3, 2020
Cite
Abu Bakkar Siddik (2020). Fake and True News Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.13325198.v1
Explore at:
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.13325198.v1
Dataset updated
Dec 3, 2020
Dataset provided by
Figshare (http://figshare.com/)
Authors
Abu Bakkar Siddik
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset combines two parts: fake news and true news. The fake news was collected from Kaggle, and some of the true news was collected from IEEE DataPort. Because more true news data was needed to balance the fake news, I collected additional true news from various trusted online sites. Finally, I concatenated the fake and true news into a single dataset to help researchers who want to study this topic further.
BBC Latest News Dataset 2021
crawlfeeds.com
json, zip
Updated Apr 6, 2024
Cite
Crawl Feeds (2024). BBC Latest News Dataset 2021 [Dataset]. https://crawlfeeds.com/datasets/bbc-latest-news-dataset-2021
Explore at:
Available download formats: zip, json
Dataset updated
Apr 6, 2024
Dataset authored and provided by
Crawl Feeds
License
https://crawlfeeds.com/privacy_policy
Description
This dataset contains more than 1 million news articles, with all the data points present on each news article page extracted. The BBC news articles were first collected in 2021 and cover all the categories present on the BBC site.

This news dataset is ideal for text classification, finding popular categories, NLP, and other research purposes.

Dataset is available in JSON format.
News Headline Sentiment Dataset
zenodo.org
bin
Updated Mar 24, 2021
Cite
Chang Wei Tan; Christoph Bergmeir; Francois Petitjean; Geoffrey I Webb (2021). News Headline Sentiment Dataset [Dataset]. http://doi.org/10.5281/zenodo.3902718
Explore at:
Available download formats: bin
Unique identifier
https://doi.org/10.5281/zenodo.3902718
Dataset updated
Mar 24, 2021
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Chang Wei Tan; Christoph Bergmeir; Francois Petitjean; Geoffrey I Webb
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset is part of the Monash, UEA & UCR time series regression repository. http://tseregression.org/

The goal of this dataset is to predict the sentiment score of a news headline. This dataset contains 83164 time series obtained from the News Popularity in Multiple Social Media Platforms dataset from the UCI repository. This is a large data set of news items and their respective social feedback on multiple platforms: Facebook, Google+ and LinkedIn. The collected data relates to a period of 8 months, between November 2015 and July 2016, accounting for about 100,000 news items on four different topics: economy, microsoft, obama and palestine. This data set is tailored for evaluative comparisons in predictive analytics tasks, while also allowing for tasks in other research areas such as topic detection and tracking, sentiment analysis in short text, first story detection or news recommendation. The time series has 3 dimensions.

Please refer to https://archive.ics.uci.edu/ml/datasets/News+Popularity+in+Multiple+Social+Media+Platforms for more details

Citation request
Nuno Moniz and Luis Torgo (2018), Multi-Source Social Feedback of Online News Feeds, CoRR
