100+ datasets found
  1. News Datasets

    • brightdata.com
    .json, .csv, .xlsx
    Cite
    Bright Data, News Datasets [Dataset]. https://brightdata.com/products/datasets/news
    Available download formats: .json, .csv, .xlsx
    Dataset authored and provided by
    Bright Data (https://brightdata.com/)
    License

    https://brightdata.com/license

    Area covered
    Worldwide
    Description

    Stay ahead with our comprehensive News Dataset, designed for businesses, analysts, and researchers to track global events, monitor media trends, and extract valuable insights from news sources worldwide.

    Dataset Features

    • News Articles: Access structured news data, including headlines, summaries, full articles, publication dates, and source details. Ideal for media monitoring and sentiment analysis.

    • Publisher & Source Information: Extract details about news publishers, including domain, region, and credibility indicators.

    • Sentiment & Topic Classification: Analyze news sentiment, categorize articles by topic, and track emerging trends in real time.

    • Historical & Real-Time Data: Retrieve historical archives or access continuously updated news feeds for up-to-date insights.

    Customizable Subsets for Specific Needs

    Our News Dataset is fully customizable, allowing you to filter data based on publication date, region, topic, sentiment, or specific news sources. Whether you need broad coverage for trend analysis or focused data for competitive intelligence, we tailor the dataset to your needs.

    Popular Use Cases

    • Media Monitoring & Reputation Management: Track brand mentions, analyze media coverage, and assess public sentiment.

    • Market & Competitive Intelligence: Monitor industry trends, competitor activity, and emerging market opportunities.

    • AI & Machine Learning Training: Use structured news data to train AI models for sentiment analysis, topic classification, and predictive analytics.

    • Financial & Investment Research: Analyze news impact on stock markets, commodities, and economic indicators.

    • Policy & Risk Analysis: Track regulatory changes, geopolitical events, and crisis developments in real time.

    Whether you're analyzing market trends, monitoring brand reputation, or training AI models, our News Dataset provides the structured data you need. Get started today and customize your dataset to fit your business objectives.

  2. News on the Web Corpus (NOW)

    • redivis.com
    application/jsonl +7
    Updated Jun 25, 2024
    Cite
    Stanford University Libraries (2024). News on the Web Corpus (NOW) [Dataset]. http://doi.org/10.57761/75nh-0e37
    Available download formats: parquet, stata, spss, arrow, application/jsonl, csv, sas, avro
    Dataset updated
    Jun 25, 2024
    Dataset provided by
    Redivis Inc.
    Authors
    Stanford University Libraries
    Description

    Abstract

    The News on the Web (NOW) corpus contains billions of words of data from web-based newspapers and magazines from 2010 to 2022.

    Methodology

    The data were cleaned for inclusion in Data Farm. Please see News on the Web Corpus GitLab for more information.

    Bulk Data Access

    Data access is required to view this section.

  3. Fake News Classification

    • kaggle.com
    zip
    Updated Oct 8, 2023
    Cite
    Saurabh Shahane (2023). Fake News Classification [Dataset]. https://www.kaggle.com/datasets/saurabhshahane/fake-news-classification
    Available download formats: zip (96615040 bytes)
    Dataset updated
    Oct 8, 2023
    Authors
    Saurabh Shahane
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    WELFake is a dataset of 72,134 news articles, of which 35,028 are real and 37,106 are fake. The authors merged four popular news datasets (Kaggle, McIntire, Reuters, and BuzzFeed Political) to prevent over-fitting of classifiers and to provide more text data for better ML training.

    The dataset contains four columns: Serial number (starting from 0); Title (the news headline); Text (the news content); and Label (0 = fake, 1 = real).

    The CSV file contains 78,098 data entries, of which only 72,134 are accessed as per the data frame.
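
    As a minimal sketch of how the label encoding above might be checked, the snippet below tallies real and fake rows from a small inline CSV. The header names (an unnamed index column, "title", "text", "label") and the sample rows are assumptions for illustration, not taken from the actual file.

```python
import csv
import io
from collections import Counter

# Tiny inline stand-in for the WELFake CSV. The header names below
# (an unnamed index column, "title", "text", "label") are assumptions.
SAMPLE_CSV = """,title,text,label
0,Budget vote delayed,Lawmakers postponed the vote until next week.,1
1,Miracle cure found,A secret pill cures every known disease.,0
2,,Article body with a missing title.,1
"""

# Per the description: 0 = fake, 1 = real.
LABEL_NAMES = {"0": "fake", "1": "real"}


def count_labels(csv_text: str) -> Counter:
    """Count rows per label name, skipping rows whose label is not 0 or 1."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        name = LABEL_NAMES.get((row.get("label") or "").strip())
        if name is not None:
            counts[name] += 1
    return counts


counts = count_labels(SAMPLE_CSV)
print(dict(counts))
```

    Skipping rows with an unparseable label is one possible reason a reader could see fewer usable rows than raw entries, but the source does not say why the two counts differ.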

    Published in: IEEE Transactions on Computational Social Systems: pp. 1-13 (doi: 10.1109/TCSS.2021.3068519).

  4. Trust in news shared on social media in the U.S. 2025

    • statista.com
    Updated Nov 27, 2025
    Cite
    Statista (2025). Trust in news shared on social media in the U.S. 2025 [Dataset]. https://www.statista.com/statistics/1462041/trust-in-news-found-on-social-media-us/
    Dataset updated
    Nov 27, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    May 11, 2025 - May 12, 2025
    Area covered
    United States
    Description

    Data on trust in the news reported by selected social media platforms in the United States revealed that as of May 2025, news found on TikTok was considered to be the least trustworthy overall, with 23 percent of respondents saying they did not trust news they encountered on the platform. YouTube fared the best in terms of which platform was considered to have the most trustworthy news content, with 29 percent of respondents saying they felt the reporting they saw on these sites was reliable.

  5. bbc-news

    • huggingface.co
    • opendatalab.com
    Updated Jun 28, 2022
    Cite
    SetFit (2022). bbc-news [Dataset]. https://huggingface.co/datasets/SetFit/bbc-news
    Available format: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 28, 2022
    Dataset authored and provided by
    SetFit
    Description

    BBC News Topic Dataset

    Dataset on BBC News Topic Classification consisting of 2,225 articles published on the BBC News website during 2004-2005. Each article is labeled under one of 5 categories: business, entertainment, politics, sport or tech. Original source for this dataset:

    Derek Greene, Pádraig Cunningham, “Practical Solutions to the Problem of Diagonal Dominance in Kernel Document Clustering,” in Proc. 23rd International Conference on Machine learning (ICML’06)… See the full description on the dataset page: https://huggingface.co/datasets/SetFit/bbc-news.

  6. Types of news info accessed by Gen Z on social media in the UK 2024

    • statista.com
    Updated Nov 27, 2025
    Cite
    Statista (2025). Types of news info accessed by Gen Z on social media in the UK 2024 [Dataset]. https://www.statista.com/statistics/1553382/types-of-news-info-accessed-on-social-media-genz-uk/
    Dataset updated
    Nov 27, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    Jul 10, 2024 - Jul 11, 2024
    Area covered
    United Kingdom
    Description

    Data from a survey conducted in the United Kingdom in ********* shows that among Generation Z consumers who use social media for news, the majority look for information from traditional TV or press sources, such as the BBC or the Telegraph, on those networks.

  7. Global News Engagement on Social Media

    • kaggle.com
    zip
    Updated Mar 15, 2024
    Cite
    Kanchana1990 (2024). Global News Engagement on Social Media [Dataset]. https://www.kaggle.com/datasets/kanchana1990/global-news-engagement-on-social-media
    Available download formats: zip (267156 bytes)
    Dataset updated
    Mar 15, 2024
    Authors
    Kanchana1990
    License

    Open Data Commons Attribution License (ODC-By) v1.0: https://www.opendatacommons.org/licenses/by/1.0/
    License information was derived automatically

    Description

    This comprehensive dataset offers a deep dive into the social media engagement metrics of nearly 4,000 posts from four of the world's leading news channels: CNN, BBC, Al Jazeera, and Reuters. Curated to provide a holistic view of global news interaction on social media, the collection stands out for its meticulous assembly and broad spectrum of content.

    Dataset Overview: Spanning various global events, topics, and narratives, this dataset is a snapshot of how news is consumed and interacted with on social media platforms. It serves as a rich resource for analyzing trends, engagement patterns, and the dissemination of information across international borders.

    Data Science Applications: Ideal for researchers and enthusiasts in the fields of data science, media studies, and social analytics, this dataset opens doors to numerous explorations such as engagement analysis, trend forecasting, content strategy optimization, and the study of information flow in digital spaces. It also holds potential for machine learning projects aiming to predict engagement or classify content based on interaction metrics.

    Column Descriptors: Each record in the dataset is detailed with the following columns:

    • text: The title or main content of the post.

    • likes: The number of likes each post has garnered.

    • comments: The number of comments left by viewers.

    • shares: How many times the post has been shared.
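
    As a rough illustration of the engagement analysis these four columns allow, the sketch below ranks posts by a combined score. The sample rows are invented, and the unweighted sum of likes, comments and shares is an assumption rather than a metric defined by the dataset.

```python
import csv
import io

# Made-up rows using the four documented columns; values are illustrative only.
SAMPLE_CSV = """text,likes,comments,shares
Election results announced,1200,340,560
Local weather update,80,5,2
Markets close higher,400,60,90
"""


def total_engagement(row: dict) -> int:
    """Sum likes, comments and shares; blank cells count as zero."""
    return sum(int(row[col] or 0) for col in ("likes", "comments", "shares"))


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

# Sort posts from most to least engaged under the simple-sum assumption.
ranked = sorted(rows, key=total_engagement, reverse=True)
top = ranked[0]
print(top["text"], total_engagement(top))
```

    A weighted score (for example, counting a share more heavily than a like) may suit some analyses better; the simple sum is only the most direct starting point.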

    Ethically Mined Data: The collection of this dataset was conducted with the highest ethical standards in mind, ensuring compliance with data privacy laws and platform policies. By anonymizing data where necessary and focusing solely on publicly available information, it respects both individual privacy and intellectual property rights.

    Special thanks are extended to the Facebook platform and the respective news channels for their openness and the rich public data they provide. This dataset not only celebrates the vibrant exchange on social media but also underscores the importance of responsible data use and sharing in fostering understanding and innovation.

  8. Fake News Detection Dataset

    • kaggle.com
    zip
    Updated Apr 27, 2025
    Cite
    Mahdi Mashayekhi (2025). Fake News Detection Dataset [Dataset]. https://www.kaggle.com/datasets/mahdimashayekhi/fake-news-detection-dataset
    Available download formats: zip (11735585 bytes)
    Dataset updated
    Apr 27, 2025
    Authors
    Mahdi Mashayekhi
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    📚 Fake News Detection Dataset

    Overview

    This dataset is designed for practicing fake news detection using machine learning and natural language processing (NLP) techniques. It includes a rich collection of 20,000 news articles, carefully generated to simulate real-world data scenarios. Each record contains metadata about the article and a label indicating whether the news is real or fake.

    The dataset also intentionally includes around 5% missing values in some fields to simulate the challenges of handling incomplete data in real-life projects.

    Columns Description

    • title: A short headline summarizing the article (around 6 words).

    • text: The body of the news article (200–300 words on average).

    • date: The publication date of the article, randomly selected over the past 3 years.

    • source: The media source that published the article (e.g., BBC, CNN, Al Jazeera). May contain missing values (~5%).

    • author: The author's full name. Some entries are missing (~5%) to simulate real-world incomplete data.

    • category: The general category of the article (e.g., Politics, Health, Sports, Technology).

    • label: The target label: real or fake news.

    Why Use This Dataset?

    Fake News Detection Practice: Perfect for binary classification tasks.

    NLP Preprocessing: Allows users to practice text cleaning, tokenization, vectorization, etc.

    Handling Missing Data: Some fields are incomplete to simulate real-world data challenges.

    Feature Engineering: Encourages creating new features from text and metadata.

    Balanced Labels: Realistic distribution of real and fake news for fair model training.

    Potential Use Cases

    Building and evaluating text classification models (e.g., Logistic Regression, Random Forests, XGBoost).

    Practicing NLP techniques like TF-IDF, Word2Vec, BERT embeddings.

    Performing exploratory data analysis (EDA) on news data.

    Developing pipelines for dealing with missing values and feature extraction.

    A Note on Data Quality

    This dataset has been synthetically generated to closely resemble real news articles. The diversity in titles, text, sources, and categories ensures that models trained on this dataset can generalize well to unseen, real-world data. However, since it is synthetic, it should not be used for production models or decision-making without careful validation.

    File Info

    Filename: fake_news_dataset.csv

    Size: 20,000 rows × 7 columns

    Missing Data: ~5% missing values in the source and author columns.
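
    One way the missing values in the source and author columns might be measured and filled is sketched below. The sample rows are invented, and the "unknown" placeholder is an assumption; dropping the rows or imputing from other fields are equally valid choices.

```python
import csv
import io

# Illustrative rows with the seven documented columns; values are invented.
SAMPLE_CSV = """title,text,date,source,author,category,label
Council approves budget,Body one,2024-01-05,BBC,Jane Doe,Politics,real
New diet shocks doctors,Body two,2023-11-20,,John Roe,Health,fake
Team wins final,Body three,2024-03-02,CNN,,Sports,real
Chip breakthrough,Body four,2022-09-14,,,Technology,fake
"""


def missing_share(rows, column):
    """Fraction of rows whose value in `column` is empty or whitespace."""
    if not rows:
        return 0.0
    missing = sum(1 for r in rows if not (r.get(column) or "").strip())
    return missing / len(rows)


def fill_missing(rows, columns, placeholder="unknown"):
    """Return copies of the rows with empty cells in `columns` replaced."""
    filled = []
    for r in rows:
        r = dict(r)  # copy so the original rows stay untouched
        for col in columns:
            if not (r.get(col) or "").strip():
                r[col] = placeholder
        filled.append(r)
    return filled


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
shares = {col: missing_share(rows, col) for col in ("source", "author")}
clean = fill_missing(rows, ("source", "author"))
print(shares)
```

    The tiny sample deliberately has far more gaps than the roughly 5% described above, so that both helper functions have something to do.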

  9. CT-FAN: A Multilingual dataset for Fake News Detection

    • data.niaid.nih.gov
    • zenodo.org
    Updated Oct 23, 2022
    Cite
    Gautam Kishore Shahi; Julia Maria Struß; Thomas Mandl; Juliane Köhler; Michael Wiegand; Melanie Siegel (2022). CT-FAN: A Multilingual dataset for Fake News Detection [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4714516
    Dataset updated
    Oct 23, 2022
    Dataset provided by
    University of Applied Sciences Potsdam
    University of Duisburg-Essen
    University of Klagenfurt
    Darmstadt University of Applied Sciences
    University of Hildesheim
    Authors
    Gautam Kishore Shahi; Julia Maria Struß; Thomas Mandl; Juliane Köhler; Michael Wiegand; Melanie Siegel
    Description

    By downloading the data, you agree with the terms & conditions mentioned below:

    Data Access: The data in the research collection may only be used for research purposes. Portions of the data are copyrighted and have commercial value as data, so you must be careful to use them only for research purposes.

    Summaries, analyses and interpretations of the linguistic properties of the information may be derived and published, provided it is impossible to reconstruct the information from these summaries. You may not try identifying the individuals whose texts are included in this dataset. You may not try to identify the original entry on the fact-checking site. You are not permitted to publish any portion of the dataset besides summary statistics or share it with anyone else.

    We grant you the right to access the collection's content as described in this agreement. You may not otherwise make unauthorised commercial use of, reproduce, prepare derivative works, distribute copies, perform, or publicly display the collection or parts of it. You are responsible for keeping and storing the data in a way that others cannot access. The data is provided free of charge.

    Citation

    Please cite our work as

    @InProceedings{clef-checkthat:2022:task3,
     author = {K{\"o}hler, Juliane and Shahi, Gautam Kishore and Stru{\ss}, Julia Maria and Wiegand, Michael and Siegel, Melanie and Mandl, Thomas},
     title = "Overview of the {CLEF}-2022 {CheckThat}! Lab Task 3 on Fake News Detection",
     year = {2022},
     booktitle = "Working Notes of CLEF 2022---Conference and Labs of the Evaluation Forum",
     series = {CLEF~'2022},
     address = {Bologna, Italy},
    }

    @article{shahi2021overview,
     title={Overview of the CLEF-2021 CheckThat! lab task 3 on fake news detection},
     author={Shahi, Gautam Kishore and Stru{\ss}, Julia Maria and Mandl, Thomas},
     journal={Working Notes of CLEF},
     year={2021}
    }

    Problem Definition: Given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other (e.g., claims in dispute) and detect the topical domain of the article. This task will run in English and German.

    Task 3: Multi-class fake news detection of news articles (English)

    Sub-task A frames fake news detection as a four-class classification problem. Given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other. The training data will be released in batches and will comprise roughly 1,264 English-language articles with their respective labels. Our definitions for the categories are as follows:

    False - The main claim made in an article is untrue.

    Partially False - The main claim of an article is a mixture of true and false information. The article contains partially true and partially false information but cannot be considered 100% true. It includes all articles in categories like partially false, partially true, mostly true, miscaptioned, misleading etc., as defined by different fact-checking services.

    True - This rating indicates that the primary elements of the main claim are demonstrably true.

    Other - An article that cannot be categorised as true, false, or partially false due to a lack of evidence about its claims. This category includes articles in dispute and unproven articles.
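
    The category definitions above imply a mapping from the verdicts used by individual fact-checking services to the four task classes. A minimal sketch of such a mapping follows. The function name is hypothetical, the verdict list covers only the examples named above (the source ends it with "etc."), and sending every unrecognised verdict to "other" is an assumption.

```python
# Verdict strings named in the definitions above that fold into "partially false".
# The real list is longer (the source says "etc."), so this set is incomplete.
PARTIALLY_FALSE = {
    "partially false",
    "partially true",
    "mostly true",
    "miscaptioned",
    "misleading",
}


def normalise_rating(verdict: str) -> str:
    """Map a raw fact-checker verdict to one of the four task classes."""
    v = verdict.strip().lower()
    if v == "false":
        return "false"
    if v == "true":
        return "true"
    if v in PARTIALLY_FALSE:
        return "partially false"
    # Assumption: disputed, unproven, or unrecognised verdicts become "other".
    return "other"


print(normalise_rating(" Mostly True "))
```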

    Cross-Lingual Task (German)

    Along with the multi-class task for English, we have introduced a task for a low-resource language. We will provide the test data in German. The idea of the task is to use the English data and the concept of transfer to build a classification model for German.

    Input Data

    The data will be provided in the format ID, title, text, rating, domain; the columns are described as follows:

    ID - Unique identifier of the news article

    Title - Title of the news article

    text - Text of the news article

    our rating - Class of the news article: false, partially false, true, or other

    Output data format

    public_id - Unique identifier of the news article

    predicted_rating - Predicted class

    Sample File

    public_id, predicted_rating
    1, false
    2, true
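
    A submission file in the output format above could be produced along the following lines. This is a sketch under assumptions: the function name is hypothetical, and the writer emits no space after the comma, whereas the sample shows one. Whether the evaluation is sensitive to that spacing is not stated in the source.

```python
import csv
import io

# The four classes defined for the task.
ALLOWED = {"false", "partially false", "true", "other"}


def write_predictions(predictions, out):
    """Write (public_id, predicted_rating) pairs as CSV with a header row."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["public_id", "predicted_rating"])
    for public_id, rating in predictions:
        if rating not in ALLOWED:
            raise ValueError(f"unknown rating: {rating!r}")
        writer.writerow([public_id, rating])


# Write to an in-memory buffer; pass an open file object to write to disk.
buffer = io.StringIO()
write_predictions([(1, "false"), (2, "true")], buffer)
print(buffer.getvalue())
```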

    IMPORTANT!

    We have used the data from 2010 to 2022, and the content of fake news is mixed up with several topics like elections, COVID-19 etc.

    Baseline: For this task, we have created a baseline system. The baseline system can be found at https://zenodo.org/record/6362498

    Related Work

    Shahi GK. AMUSED: An Annotation Framework of Multi-modal Social Media Data. arXiv preprint arXiv:2010.00502. 2020 Oct 1. https://arxiv.org/pdf/2010.00502.pdf

    G. K. Shahi and D. Nandini, “FakeCovid – a multilingual cross-domain fact check news dataset for covid-19,” in workshop Proceedings of the 14th International AAAI Conference on Web and Social Media, 2020. http://workshop-proceedings.icwsm.org/abstract?id=2020_14

    Shahi, G. K., Dirkson, A., & Majchrzak, T. A. (2021). An exploratory study of covid-19 misinformation on twitter. Online Social Networks and Media, 22, 100104. doi: 10.1016/j.osnem.2020.100104

    Shahi, G. K., Struß, J. M., & Mandl, T. (2021). Overview of the CLEF-2021 CheckThat! lab task 3 on fake news detection. Working Notes of CLEF.

    Nakov, P., Da San Martino, G., Elsayed, T., Barrón-Cedeño, A., Míguez, R., Shaar, S., ... & Mandl, T. (2021, March). The CLEF-2021 CheckThat! lab on detecting check-worthy claims, previously fact-checked claims, and fake news. In European Conference on Information Retrieval (pp. 639-649). Springer, Cham.

    Nakov, P., Da San Martino, G., Elsayed, T., Barrón-Cedeño, A., Míguez, R., Shaar, S., ... & Kartal, Y. S. (2021, September). Overview of the CLEF–2021 CheckThat! Lab on Detecting Check-Worthy Claims, Previously Fact-Checked Claims, and Fake News. In International Conference of the Cross-Language Evaluation Forum for European Languages (pp. 264-291). Springer, Cham.

  10. Positives of getting news on social media in the U.S. 2018-2023

    • statista.com
    Updated Feb 7, 2024
    Cite
    Statista (2024). Positives of getting news on social media in the U.S. 2018-2023 [Dataset]. https://www.statista.com/statistics/1450390/us-adults-news-social-media-convenience/
    Dataset updated
    Feb 7, 2024
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    Sep 25, 2023 - Oct 1, 2023
    Area covered
    United States
    Description

    According to a survey conducted in 2023, 20 percent of adults in the United States who used social media to get news stated that convenience was their main reason for doing so. Speed and interaction with people were the two next most popular reasons for using social networking platforms as a source of news, accounting for nine and six percent of respondents, respectively. Smaller shares of adults said they liked that the news was up-to-date, the content or format, and the variety of sources or stories available. Overall, seven percent of U.S. adults who got their news on social media said they did not like anything about the experience.

  11. A dataset of domain events based on open-source military news

    • scidb.cn
    Updated Sep 6, 2022
    Cite
    Hongbin Huang; Jiao Sun; Hui Wei; Kaiming Xiao; Mao Wang; Xuan Li (2022). A dataset of domain events based on open-source military news [Dataset]. http://doi.org/10.57760/sciencedb.j00001.00486
    Available format: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 6, 2022
    Dataset provided by
    Science Data Bank
    Authors
    Hongbin Huang; Jiao Sun; Hui Wei; Kaiming Xiao; Mao Wang; Xuan Li
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Text datasets for the military field are the basis for event extraction in that field, and high-quality datasets can effectively promote the study of event extraction. However, the event extraction datasets commonly used in the real world (such as ACE2005) are oriented to the general field, and text corpus resources on military events are scarce. Therefore, we collected a large amount of military news content from public military news websites. On the basis of text content analysis, we first established an event model of military news that includes event types, entity types and entity relationship types. Second, the text data was manually labeled according to the event model, with iterative verification and correction. Finally, a dataset of 13,000 high-quality military news events with a full variety of labels was obtained. We make this military news event dataset publicly available in this paper.

  12. Social media as a news outlet worldwide 2025

    • statista.com
    Updated Nov 19, 2025
    + more versions
    Cite
    Statista (2025). Social media as a news outlet worldwide 2025 [Dataset]. https://www.statista.com/statistics/718019/social-media-news-source/
    Dataset updated
    Nov 19, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    Jan 2025 - Feb 2025
    Area covered
    Worldwide
    Description

    During a 2025 survey, ** percent of respondents from Nigeria stated that they used social media as a source of news. In comparison, just ** percent of Japanese respondents said the same. Large portions of social media users around the world admit that they do not trust social platforms either as media sources or as a way to get news, and yet they continue to access such networks on a daily basis.

    Social media: trust and consumption

    Despite the majority of adults surveyed in each country reporting that they used social networks to keep up to date with news and current affairs, a 2018 study showed that social media is the least trusted news source in the world. Less than ** percent of adults in Europe considered social networks to be trustworthy in this respect, yet more than ** percent of adults in Portugal, Poland, Romania, Hungary, Bulgaria, Slovakia and Croatia said that they got their news on social media.

    What is clear is that we live in an era where social media is such an enormous part of daily life that consumers will still use it in spite of their doubts or reservations. Concerns about fake news and propaganda on social media have not stopped billions of users accessing their favorite networks on a daily basis.

    Most Millennials in the United States use social media for news every day, and younger consumers in European countries are much more likely to use social networks for national political news than their older peers. Like it or not, reading news on social media is fast becoming the norm for younger generations, and this form of news consumption will likely increase further regardless of whether consumers fully trust their chosen network or not.

  13. News Category Dataset

    • kaggle.com
    zip
    Updated Sep 24, 2022
    Cite
    Rishabh Misra (2022). News Category Dataset [Dataset]. https://www.kaggle.com/datasets/rmisra/news-category-dataset/
    Available download formats: zip (27829769 bytes)
    Dataset updated
    Sep 24, 2022
    Authors
    Rishabh Misra
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Context

    ** Please cite the dataset using the BibTex provided in one of the following sections if you are using it in your research, thank you! **

    This dataset contains around 210k news headlines from 2012 to 2022 from HuffPost. This is one of the biggest news datasets and can serve as a benchmark for a variety of computational linguistic tasks. HuffPost stopped maintaining an extensive archive of news articles sometime after this dataset was first collected in 2018, so it is not possible to collect such a dataset in the present day. Due to changes in the website, there are about 200k headlines between 2012 and May 2018 and 10k headlines between May 2018 and 2022.

    Content

    Each record in the dataset consists of the following attributes:

    • category: category in which the article was published.

    • headline: the headline of the news article.

    • authors: list of authors who contributed to the article.

    • link: link to the original news article.

    • short_description: abstract of the news article.

    • date: publication date of the article.

    There are a total of 42 news categories in the dataset. The top-15 categories and corresponding article counts are as follows:

    • POLITICS: 35602

    • WELLNESS: 17945

    • ENTERTAINMENT: 17362

    • TRAVEL: 9900

    • STYLE & BEAUTY: 9814

    • PARENTING: 8791

    • HEALTHY LIVING: 6694

    • QUEER VOICES: 6347

    • FOOD & DRINK: 6340

    • BUSINESS: 5992

    • COMEDY: 5400

    • SPORTS: 5077

    • BLACK VOICES: 4583

    • HOME & LIVING: 4320

    • PARENTS: 3955
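
    Per-category counts like those listed above can be obtained by tallying the category attribute of each record. The sketch below assumes the records are stored as one JSON object per line, which the source does not confirm; the sample records are invented and carry only some of the documented attributes.

```python
import json
from collections import Counter

# Invented records; the one-JSON-object-per-line layout is an assumption.
SAMPLE_LINES = [
    '{"category": "POLITICS", "headline": "Senate passes bill", "date": "2018-01-02"}',
    '{"category": "WELLNESS", "headline": "Sleep and stress", "date": "2017-06-11"}',
    '{"category": "POLITICS", "headline": "Governor signs order", "date": "2016-03-09"}',
    '{"category": "TRAVEL", "headline": "Ten quiet beaches", "date": "2015-08-21"}',
    '{"category": "POLITICS", "headline": "Debate recap", "date": "2016-10-20"}',
    '{"category": "WELLNESS", "headline": "Walking more often", "date": "2014-02-14"}',
]


def top_categories(lines, n=15):
    """Return the n most frequent categories as (name, count) pairs."""
    counts = Counter(
        json.loads(line)["category"] for line in lines if line.strip()
    )
    return counts.most_common(n)


for name, count in top_categories(SAMPLE_LINES, n=2):
    print(f"{name}: {count}")
```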

    Citation

    If you're using this dataset for your work, please cite the following articles:

    Citation in text format:

    1. Misra, Rishabh. "News Category Dataset." arXiv preprint arXiv:2209.11429 (2022).

    2. Misra, Rishabh and Jigyasa Grover. "Sculpting Data for ML: The first act of Machine Learning." ISBN 9798585463570 (2021).

    Citation in BibTex format:

    @article{misra2022news,
     title={News Category Dataset},
     author={Misra, Rishabh},
     journal={arXiv preprint arXiv:2209.11429},
     year={2022}
    }

    @book{misra2021sculpting,
     author = {Misra, Rishabh and Grover, Jigyasa},
     year = {2021},
     month = {01},
     pages = {},
     title = {Sculpting Data for ML: The first act of Machine Learning},
     isbn = {9798585463570}
    }

    Please link to rishabhmisra.github.io/publications as the source of this dataset. Thanks!

    Acknowledgements

    This dataset was collected from HuffPost.

    Inspiration

    • Can you categorize news articles based on their headlines and short descriptions?

    • Do news articles from different categories have different writing styles?

    • A classifier trained on this dataset could be used on a free text to identify the type of language being used.

    Want to contribute your own datasets?

    If you are interested in learning how to collect high-quality datasets for various ML tasks and the overall importance of data in the ML ecosystem, consider reading my book Sculpting Data for ML.

    Other datasets

    Please also check out the following datasets collected by me:

  14. Average Number of Fake News Stories Shared on Facebook, by Age Group

    • evidencehub.net
    json
    Updated Feb 11, 2022
    + more versions
    Cite
    Guess, Andrew, Jonathan Nagler, Joshua Tucker. Less Than You Think: Prevalence and Predictions of Fake News Dissemination on Facebook (New York: American Association for the Advancement of Science, 2019) (2022). Average Number of Fake News Stories Shared on Facebook, by Age Group [Dataset]. https://evidencehub.net/chart/average-number-of-fake-news-stories-shared-on-facebook-by-age-group-78.0
    Available download formats: json
    Dataset updated
    Feb 11, 2022
    Dataset provided by
    The Lisbon Council
    Authors
    Guess, Andrew, Jonathan Nagler, Joshua Tucker. Less Than You Think: Prevalence and Predictions of Fake News Dissemination on Facebook (New York: American Association for the Advancement of Science, 2019)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Measurement technique
    Survey (N=5000)
    Description

    The chart shows that the oldest Americans, especially those over 65, were more likely to share fake news with their Facebook friends. This is true even when holding other characteristics (including education, ideology, and partisanship) constant. The coefficient on “Age over 65” implies that being in the oldest age group was associated with sharing nearly seven times as many articles from fake news domains on Facebook as those in the youngest age group, or about 2.3 times as many as those in the next-oldest age group, holding the effect of ideology, education, and the total number of web links shared constant.

  15. CT-FAN-22 corpus: A Multilingual dataset for Fake News Detection

    • zenodo.org
    • data.niaid.nih.gov
    Updated Oct 23, 2022
    + more versions
    Cite
    Gautam Kishore Shahi; Julia Maria Struß; Thomas Mandl (2022). CT-FAN-22 corpus: A Multilingual dataset for Fake News Detection [Dataset]. http://doi.org/10.5281/zenodo.5775511
    Explore at:
    Dataset updated
    Oct 23, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Gautam Kishore Shahi; Julia Maria Struß; Thomas Mandl
    Description

    Data Access: The data in the research collection provided may only be used for research purposes. Portions of the data are copyrighted and have commercial value as data, so you must be careful to use them only for research purposes. Due to these restrictions, the collection is not open data. Please fill out the form and upload the Data Sharing Agreement at Google Form.

    Citation

    Please cite our work as

    @article{shahi2021overview,
     title={Overview of the CLEF-2021 CheckThat! lab task 3 on fake news detection},
     author={Shahi, Gautam Kishore and Stru{\ss}, Julia Maria and Mandl, Thomas},
     journal={Working Notes of CLEF},
     year={2021}
    }

    Problem Definition: Given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other (e.g., claims in dispute) and detect the topical domain of the article. This task will run in English and German.

    Subtask 3: Multi-class fake news detection of news articles (English). Sub-task A is designed as a four-class classification problem. The training data will be released in batches and will comprise roughly 900 articles with their respective labels. Given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other. Our definitions for the categories are as follows:

    • False - The main claim made in an article is untrue.

    • Partially False - The main claim of an article is a mixture of true and false information. The article contains partially true and partially false information but cannot be considered 100% true. It includes all articles in categories like partially false, partially true, mostly true, miscaptioned, misleading etc., as defined by different fact-checking services.

    • True - This rating indicates that the primary elements of the main claim are demonstrably true.

    • Other - An article that cannot be categorised as true, false, or partially false due to lack of evidence about its claims. This category includes articles in dispute and unproven articles.
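The category definitions above can be sketched as a small lookup that collapses fact-checker verdicts into the four task classes. This is only an illustration: the verdict strings are taken from the definitions above, while the name `RATING_MAP`, the function `normalize_rating`, and the choice to send unknown verdicts to "other" are assumptions, not part of the task description.

```python
# Hypothetical mapping from raw fact-checker verdicts to the four task classes.
# The verdict names come from the category definitions; the fallback rule is an assumption.
RATING_MAP = {
    "false": "false",
    "true": "true",
    "partially false": "partially false",
    "partially true": "partially false",
    "mostly true": "partially false",
    "miscaptioned": "partially false",
    "misleading": "partially false",
    "in dispute": "other",
    "unproven": "other",
}


def normalize_rating(raw: str) -> str:
    """Map a raw verdict to one of the four classes; unknown verdicts become 'other'."""
    return RATING_MAP.get(raw.strip().lower(), "other")
```

For example, `normalize_rating("Misleading")` yields `"partially false"` under this mapping.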

    Input Data

    The data will be provided with the columns ID, title, text, rating, and domain; the columns are described as follows:

    • ID - Unique identifier of the news article
    • Title - Title of the news article
    • text - Text of the news article
    • our rating - Class of the news article: false, partially false, true, or other

    Output data format

    • public_id- Unique identifier of the news article
    • predicted_rating- predicted class

    Sample File

    public_id, predicted_rating
    1, false
    2, true

    Sample file

    public_id, predicted_domain
    1, health
    2, crime
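A minimal sketch of producing a file in the first sample format, using only the standard library. The function name `write_submission` is hypothetical, and whether the organizers require a space after the comma (as in the samples above) is not stated, so the sketch writes plain comma-separated values.

```python
import csv
import io


def write_submission(predictions, out):
    """Write (public_id, predicted_rating) pairs as CSV with a header row."""
    writer = csv.writer(out)
    writer.writerow(["public_id", "predicted_rating"])
    for public_id, rating in predictions:
        writer.writerow([public_id, rating])


# Write to an in-memory buffer here; a real run would pass an open file instead.
buffer = io.StringIO()
write_submission([(1, "false"), (2, "true")], buffer)
submission_text = buffer.getvalue()
```

The same function could write the domain file by swapping the header and values.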

    Additional data for Training

    To train their models, participants can use additional data in a similar format; some datasets are available on the web. We do not provide the ground truth for those datasets. For testing, we will not use any articles from other datasets. Some of the possible sources:

    IMPORTANT!

    1. We have used the data from 2010 to 2021, and the content of fake news is mixed up with several topics like elections, COVID-19 etc.

    Evaluation Metrics

    This task is evaluated as a classification task. We will use the F1-macro measure for the ranking of teams. There is a limit of 5 runs (total and not per day), and only one person from a team is allowed to submit runs.
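For reference, macro-averaged F1 is the unweighted mean of the per-class F1 scores. The sketch below is a plain-Python illustration, not the organizers' scoring script; the function name `macro_f1` and the convention of scoring a class with no true or predicted instances as 0 are assumptions.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores over the given labels."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); a class with no instances scores 0 here (assumption).
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


labels = ["false", "partially false", "true", "other"]
gold = ["false", "false", "true", "other"]
pred = ["false", "true", "true", "other"]
score = macro_f1(gold, pred, labels)
```

Because every class counts equally, a rare class that is never predicted correctly pulls the score down as much as a frequent one.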

    Baseline: For this task, we have created a baseline system. The baseline system can be found at https://zenodo.org/record/6362498

    Submission Link: Coming soon

    Related Work

    • Shahi GK. AMUSED: An Annotation Framework of Multi-modal Social Media Data. arXiv preprint arXiv:2010.00502. 2020 Oct 1. https://arxiv.org/pdf/2010.00502.pdf
    • G. K. Shahi and D. Nandini, “FakeCovid – a multilingual cross-domain fact check news dataset for covid-19,” in workshop Proceedings of the 14th International AAAI Conference on Web and Social Media, 2020. http://workshop-proceedings.icwsm.org/abstract?id=2020_14
    • Shahi, G. K., Dirkson, A., & Majchrzak, T. A. (2021). An exploratory study of covid-19 misinformation on twitter. Online Social Networks and Media, 22, 100104. doi: 10.1016/j.osnem.2020.100104
    • Shahi, G. K., Struß, J. M., & Mandl, T. (2021). Overview of the CLEF-2021 CheckThat! lab task 3 on fake news detection. Working Notes of CLEF.
    • Nakov, P., Da San Martino, G., Elsayed, T., Barrón-Cedeno, A., Míguez, R., Shaar, S., ... & Mandl, T. (2021, March). The CLEF-2021 CheckThat! lab on detecting check-worthy claims, previously fact-checked claims, and fake news. In European Conference on Information Retrieval (pp. 639-649). Springer, Cham.
    • Nakov, P., Da San Martino, G., Elsayed, T., Barrón-Cedeño, A., Míguez, R., Shaar, S., ... & Kartal, Y. S. (2021, September). Overview of the CLEF–2021 CheckThat! Lab on Detecting Check-Worthy Claims, Previously Fact-Checked Claims, and Fake News. In International Conference of the Cross-Language Evaluation Forum for European Languages (pp. 264-291). Springer, Cham.
  16. Data from: Qbias – A Dataset on Media Bias in Search Queries and Query...

    • data.niaid.nih.gov
    Updated Mar 1, 2023
    Cite
    Haak, Fabian; Schaer, Philipp (2023). Qbias – A Dataset on Media Bias in Search Queries and Query Suggestions [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7682914
    Explore at:
    Dataset updated
    Mar 1, 2023
    Dataset provided by
    Technische Hochschule Köln
    Authors
    Haak, Fabian; Schaer, Philipp
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We present Qbias, two novel datasets that promote the investigation of bias in online news search as described in

    Fabian Haak and Philipp Schaer. 2023. Qbias - A Dataset on Media Bias in Search Queries and Query Suggestions. In Proceedings of ACM Web Science Conference (WebSci’23). ACM, New York, NY, USA, 6 pages. https://doi.org/10.1145/3578503.3583628.

    Dataset 1: AllSides Balanced News Dataset (allsides_balanced_news_headlines-texts.csv)

    The dataset contains 21,747 news articles collected from AllSides balanced news headline roundups in November 2022 as presented in our publication. The AllSides balanced news roundups feature three expert-selected U.S. news articles from sources of different political views (left, right, center), often featuring spin bias, slant, and other forms of non-neutral reporting on political news. All articles are tagged with a bias label by four expert annotators based on the expressed political partisanship: left, right, or neutral. The AllSides balanced news aims to offer multiple political perspectives on important news stories, educate users on biases, and provide multiple viewpoints. Collected data further includes headlines, dates, news texts, topic tags (e.g., "Republican party", "coronavirus", "federal jobs"), and the publishing news outlet. We also include AllSides' neutral description of the topic of the articles. Overall, the dataset contains 10,273 articles tagged as left, 7,222 as right, and 4,252 as center.

    To provide easier access to the most recent and complete version of the dataset for future research, we provide a scraping tool and a regularly updated version of the dataset at https://github.com/irgroup/Qbias. The repository also contains more recent, regularly updated versions of the dataset with additional tags (such as the URL to the article). We chose to publish the version used for fine-tuning the models on Zenodo to enable the reproduction of the results of our study.

    Dataset 2: Search Query Suggestions (suggestions.csv)

    The second dataset we provide consists of 671,669 search query suggestions for root queries based on tags of the AllSides biased news dataset. We collected search query suggestions from Google and Bing for the 1,431 topic tags that have been used for tagging AllSides news at least five times, approximately half of the total number of topics. The topic tags include names, a wide range of political terms, agendas, and topics (e.g., "communism", "libertarian party", "same-sex marriage"), cultural and religious terms (e.g., "Ramadan", "pope Francis"), locations, and other news-relevant terms. On average, the dataset contains 469 search queries for each topic. In total, 318,185 suggestions have been retrieved from Google and 353,484 from Bing.

    The file contains a "root_term" column based on the AllSides topic tags. The "query_input" column contains the search term submitted to the search engine ("search_engine"). "query_suggestion" and "rank" represent the search query suggestions and their respective positions as returned by the search engines at the given time of search ("datetime"). We scraped our data from a US server; its location is saved in "location".
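A minimal sketch of reading a file with these columns and listing the suggestions per query input and search engine in rank order. The column names come from the description above; the column order, the datetime format, the lowercase engine names, and the sample rows (apart from the "democrats and recession" example) are invented for illustration, and `suggestions_by_input` is a hypothetical helper.

```python
import csv
import io
from collections import defaultdict

# Invented sample rows; a real run would open suggestions.csv instead.
SAMPLE = """root_term,query_input,query_suggestion,rank,search_engine,datetime,location
democrats,democrats a,democrats abroad,2,google,2022-11-01 12:00:00,US
democrats,democrats a,democrats and recession,1,google,2022-11-01 12:00:00,US
democrats,democrats,democrats news,1,bing,2022-11-01 12:00:00,US
"""


def suggestions_by_input(rows):
    """Group suggestions by (query_input, search_engine), ordered by rank."""
    grouped = defaultdict(list)
    for row in rows:
        key = (row["query_input"], row["search_engine"])
        grouped[key].append((int(row["rank"]), row["query_suggestion"]))
    return {key: [text for _, text in sorted(items)] for key, items in grouped.items()}


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
grouped = suggestions_by_input(rows)
```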

    We retrieved ten search query suggestions provided by the Google and Bing search autocomplete systems for the input of each of these root queries, without performing a search. Furthermore, we extended the root queries by the letters a to z (e.g., "democrats" (root term) >> "democrats a" (query input) >> "democrats and recession" (query suggestion)) to simulate a user's input during information search and generate a total of up to 270 query suggestions per topic and search engine. The dataset we provide contains columns for root term, query input, and query suggestion for each suggested query. The location from which the search is performed is the location of the Google servers running Colab, in our case Iowa in the United States of America, which is added to the dataset.
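The query expansion described above can be sketched as follows: one bare root query plus 26 letter-extended inputs gives 27 inputs, and at ten suggestions each that is the stated maximum of 270 per topic and search engine. The function name `build_query_inputs` is hypothetical, and the actual retrieval from the autocomplete services is not shown.

```python
import string


def build_query_inputs(root_term):
    """Return the root query followed by 'root a' ... 'root z'."""
    return [root_term] + [f"{root_term} {letter}" for letter in string.ascii_lowercase]


inputs = build_query_inputs("democrats")
# 27 query inputs, each returning up to ten suggestions.
max_suggestions = len(inputs) * 10
```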

    AllSides Scraper

    At https://github.com/irgroup/Qbias, we provide a scraping tool that allows for the automatic retrieval of all available articles at the AllSides balanced news headlines.

    We want to provide an easy means of retrieving the news and all corresponding information. For many tasks it is relevant to have the most recent documents available. Thus, we provide this Python-based scraper, which scrapes all available AllSides news articles and gathers the available information. By providing the scraper we facilitate access to a recent version of the dataset for other researchers.

  17. Same News - Different Sources

    • kaggle.com
    zip
    Updated Oct 28, 2022
    Cite
    The Devastator (2022). Same News - Different Sources [Dataset]. https://www.kaggle.com/datasets/thedevastator/same-news-different-sources
    Explore at:
    zip (262582 bytes; available download formats)
    Dataset updated
    Oct 28, 2022
    Authors
    The Devastator
    Description

    Same News Different Sources

    How different sources report on the same events

    About this dataset

    Do you ever feel like you're being inundated with news from all sides, and you can't keep up? Well, you're not alone. In today's age of social media and 24-hour news cycles, it can be difficult to know what's going on in the world. And with so many different news sources to choose from, it can be hard to know who to trust.

    That's where this dataset comes in. It captures individuals' sentiment toward different news sources. The data was collected by administering a survey to individuals who use different news sources. The survey responses were then analyzed to obtain the sentiment score for each news source.

    So if you're feeling overwhelmed by the news, don't worry – this dataset has you covered. With its insights on which news sources are trustworthy and which ones aren't, you'll be able to make informed decisions about what to read – and what to skip

    How to use the dataset

    The Twitter Sentiment Analysis dataset can be used to analyze the impact of social media on news consumption. This data can be used to study how individuals' sentiments towards different news sources vary based on the source they use. The dataset can also be used to study how different factors, such as the time of day or the topic of the news, affect an individual's sentiments.

    Research Ideas

    • Identify which news sources are most trusted by the public.
    • Understand what topics are most important to the public.
    • Understand how different news sources report on the same issue

    Columns

    File: news.csv

    | Column name        | Description                                           |
    |:-------------------|:------------------------------------------------------|
    | Title              | The title of the news article. (String)               |
    | Date               | The date the news article was published. (Date)       |
    | Time               | The time the news article was published. (Time)       |
    | Score              | The sentiment score of the news article. (Float)      |
    | Number of Comments | The number of comments on the news article. (Integer) |

    File: news_api.csv

    | Column name | Description                                     |
    |:------------|:------------------------------------------------|
    | Title       | The title of the news article. (String)         |
    | Date        | The date the news article was published. (Date) |
    | Source      | The news source the article is from. (String)   |

    File: politics.csv

    | Column name        | Description                                           |
    |:-------------------|:------------------------------------------------------|
    | Title              | The title of the news article. (String)               |
    | Date               | The date the news article was published. (Date)       |
    | Time               | The time the news article was published. (Time)       |
    | Score              | The sentiment score of the news article. (Float)      |
    | Number of Comments | The number of comments on the news article. (Integer) |

    File: sports.csv

    | Column name        | Description                                           |
    |:-------------------|:------------------------------------------------------|
    | Title              | The title of the news article. (String)               |
    | Date               | The date the news article was published. (Date)       |
    | Time               | The time the news article was published. (Time)       |
    | Score              | The sentiment score of the news article. (Float)      |
    | Number of Comments | The number of comments on the news article. (Integer) |

    File: television.csv

    | Column name        | Description                                           |
    |:-------------------|:------------------------------------------------------|
    | Title              | The title of the news article. (String)               |
    | Date               | The date the news article was published. (Date)       |
    | Time               | The time the news article was published. (Time)       |
    | Score              | The sentiment score of the news article. (Float)      |
    | Number of Comments | The number of comments on the news article. (Integer) |

    File: trending.csv | Column name | Description ...
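As a rough sketch of working with files that have the Title, Date, Time, Score, and Number of Comments columns, the snippet below averages the sentiment score per publication date. The sample rows, the date and time formats, and the helper name `mean_score_by_date` are invented for illustration; the real CSV headers may differ from the display names in the column listing (the listing hints at an extra unnamed column).

```python
import csv
import io
from statistics import mean

# Invented sample rows; a real run would open e.g. politics.csv instead.
SAMPLE = """Title,Date,Time,Score,Number of Comments
Example headline A,2022-10-01,09:30,0.25,12
Example headline B,2022-10-01,14:00,-0.75,3
Example headline C,2022-10-02,08:15,0.5,7
"""


def mean_score_by_date(rows):
    """Average the Score column for each value of the Date column."""
    by_date = {}
    for row in rows:
        by_date.setdefault(row["Date"], []).append(float(row["Score"]))
    return {date: mean(scores) for date, scores in by_date.items()}


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
daily_scores = mean_score_by_date(rows)
```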

  18. Preprocessed WELFake News Dataset

    • kaggle.com
    zip
    Updated Dec 24, 2024
    Cite
    Atharv Arya (2024). Preprocessed WELFake News Dataset [Dataset]. https://www.kaggle.com/datasets/ceasor6/preprocessed-welfake-news-dataset
    Explore at:
    zip (190989523 bytes; available download formats)
    Dataset updated
    Dec 24, 2024
    Authors
    Atharv Arya
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    This dataset is a preprocessed and enhanced version of the original WELFake News Dataset available on Kaggle. It includes engineered features and text preprocessing steps for improved performance in fake news classification tasks.

    The original dataset contains titles, text, and binary labels (0 = Real, 1 = Fake). This enhanced version introduces additional text-based features and processed columns to aid in advanced Natural Language Processing (NLP) and Machine Learning modeling.

    Dataset Details

    | Column Name             | Description                                                          |
    |:------------------------|:---------------------------------------------------------------------|
    | title                   | Headline or title of the news article.                               |
    | text                    | Body content of the news article.                                    |
    | label                   | Binary label: 1 (Fake News), 0 (Real News).                          |
    | language                | Language of the text (default: English).                             |
    | punctuation_count       | Total number of punctuation marks in the text.                       |
    | uppercase_ratio         | Ratio of uppercase letters to total characters.                      |
    | numerical_count         | Count of numerical values present in the text.                       |
    | sentiment_polarity      | Sentiment polarity score (-1 to 1) based on TextBlob analysis.       |
    | processed_title         | Preprocessed and tokenized version of the title.                     |
    | processed_text          | Preprocessed and tokenized version of the text.                      |
    | title_len               | Length of the title in terms of word count.                          |
    | text_len                | Length of the text in terms of word count.                           |
    | total_len               | Combined length (title + text).                                      |
    | combined_processed_text | Concatenated and tokenized version of title and text for NLP tasks.  |
    | combined_text_title     | Concatenated raw title and text for combined analysis.               |
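The simpler engineered columns can be approximated with the standard library. This is a sketch of one plausible reading of the column descriptions, not the dataset author's preprocessing code: the punctuation set (`string.punctuation`), the regular expression used to count numbers, whitespace-based word counts, and the function name `basic_text_features` are all assumptions. The TextBlob-based sentiment column is left out.

```python
import re
import string


def basic_text_features(title, text):
    """Compute count- and length-based features for one article."""
    punctuation_count = sum(1 for ch in text if ch in string.punctuation)
    uppercase_ratio = sum(1 for ch in text if ch.isupper()) / len(text) if text else 0.0
    # Count integers and simple decimals such as "3" or "2.5" (assumed definition).
    numerical_count = len(re.findall(r"\d+(?:\.\d+)?", text))
    title_len = len(title.split())
    text_len = len(text.split())
    return {
        "punctuation_count": punctuation_count,
        "uppercase_ratio": uppercase_ratio,
        "numerical_count": numerical_count,
        "title_len": title_len,
        "text_len": text_len,
        "total_len": title_len + text_len,
    }


sample_text = "NASA found 3 moons, and 2.5 rocks!"
features = basic_text_features("Breaking News", sample_text)
```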

    Acknowledgments

    This dataset is a modified and derived work based on the WELFake Dataset available on Kaggle.
    - Original Source: WELFake Dataset
    - License: CC BY-NC-SA 4.0 (Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International).
    - This enhanced version inherits the same license and must be used for non-commercial purposes with proper attribution.

  19. News sources most accessed on social media worldwide 2025, by network

    • statista.com
    Updated Nov 27, 2025
    Cite
    Statista (2025). News sources most accessed on social media worldwide 2025, by network [Dataset]. https://www.statista.com/statistics/1392866/social-media-news-topics-worldwide/
    Explore at:
    Dataset updated
    Nov 27, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    Jan 2025 - Feb 2025
    Area covered
    Worldwide
    Description

    A 2025 survey revealed that audiences using social networks for news engage with sources differently depending on the network; for example ** percent of X (formerly Twitter) users paid the most attention to mainstream news outlets and journalists, while only ** percent of TikTok users and ** percent of Snapchat users did the same. These latter platforms saw higher attention directed toward creators and personalities, with ** percent of TikTok users and ** percent of Snapchat users engaging more with influencers and celebrities.

  20. Data from: Supersharers of fake news on Twitter

    • datadryad.org
    • data.niaid.nih.gov
    • +1more
    zip
    Updated May 24, 2024
    Cite
    Sahar Baribi-Bartov; Briony Swire-Thompson; Nir Grinberg (2024). Supersharers of fake news on Twitter [Dataset]. http://doi.org/10.5061/dryad.44j0zpcmq
    Explore at:
    zip (available download formats)
    Dataset updated
    May 24, 2024
    Dataset provided by
    Dryad
    Authors
    Sahar Baribi-Bartov; Briony Swire-Thompson; Nir Grinberg
    Time period covered
    Feb 6, 2024
    Description

    Governments may have the capacity to flood social media with fake news, but little is known about the use of flooding by ordinary voters. In this work, we identify 2107 registered US voters that account for 80% of the fake news shared on Twitter during the 2020 US presidential election by an entire panel of 664,391 voters. We find that supersharers are important members of the network, reaching a sizable 5.2% of registered voters on the platform. Supersharers have a significant overrepresentation of women, older adults, and registered Republicans. Supersharers' massive volume does not seem automated but is rather generated through manual and persistent retweeting. These findings highlight a vulnerability of social media for democracy, where a small group of people distort the political reality for many.

