100+ datasets found
  1. CT-FAN-21 corpus: A dataset for Fake News Detection

    • zenodo.org
    Updated Oct 23, 2022
    Cite
    Gautam Kishore Shahi; Julia Maria Struß; Thomas Mandl (2022). CT-FAN-21 corpus: A dataset for Fake News Detection [Dataset]. http://doi.org/10.5281/zenodo.4714517
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Gautam Kishore Shahi; Julia Maria Struß; Thomas Mandl
    Description

    Data Access: The data in the research collection provided may only be used for research purposes. Portions of the data are copyrighted and have commercial value as data, so you must be careful to use it only for research purposes. Due to these restrictions, the collection is not open data. Please download the Agreement at Data Sharing Agreement and send the signed form to fakenewstask@gmail.com .

    Citation

    Please cite our work as

    @article{shahi2021overview,
     title={Overview of the CLEF-2021 CheckThat! lab task 3 on fake news detection},
     author={Shahi, Gautam Kishore and Stru{\ss}, Julia Maria and Mandl, Thomas},
     journal={Working Notes of CLEF},
     year={2021}
    }

    Problem Definition: Given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other (e.g., claims in dispute) and detect the topical domain of the article. This task will run in English.

    Subtask 3A: Multi-class fake news detection of news articles (English). Subtask 3A frames fake news detection as a four-class classification problem. The training data will be released in batches and comprises roughly 900 articles with their respective labels. Given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other. Our definitions for the categories are as follows:

    • False - The main claim made in an article is untrue.

    • Partially False - The main claim of an article is a mixture of true and false information. The article contains partially true and partially false information but cannot be considered 100% true. It includes all articles in categories like partially false, partially true, mostly true, miscaptioned, misleading etc., as defined by different fact-checking services.

    • True - This rating indicates that the primary elements of the main claim are demonstrably true.

    • Other- An article that cannot be categorised as true, false, or partially false due to lack of evidence about its claims. This category includes articles in dispute and unproven articles.
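The category definitions above can be read as a mapping from fact-checker verdicts to the four task labels. The sketch below is only an illustration: the "partially false" group follows the definition given above, while the remaining keys, the `RATING_TO_LABEL` and `normalize_rating` names, and the fallback to "other" for unknown verdicts are assumptions, not the organisers' actual procedure.

```python
# Hypothetical mapping from fact-checker verdicts to the four task labels.
# Only the "partially false" group comes from the definitions above;
# the other keys are illustrative assumptions.
RATING_TO_LABEL = {
    "false": "false",
    "true": "true",
    "partially false": "partially false",
    "partially true": "partially false",
    "mostly true": "partially false",
    "miscaptioned": "partially false",
    "misleading": "partially false",
    "in dispute": "other",
    "unproven": "other",
}


def normalize_rating(raw: str) -> str:
    """Map a raw verdict to a task label; unknown verdicts fall back to 'other' (an assumption)."""
    # Lowercase and collapse whitespace so "  Mostly  True " matches "mostly true".
    key = " ".join(raw.strip().lower().split())
    return RATING_TO_LABEL.get(key, "other")
```

For example, `normalize_rating("Misleading")` yields "partially false" under this mapping.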

    Subtask 3B: Topical Domain Classification of News Articles (English). Fact-checkers require background expertise to identify the truthfulness of an article. The categorisation will help to automate the sampling process from a stream of data. Given the text of a news article, determine the topical domain of the article. This is a classification problem: the task is to categorise fake news articles into six topical categories, namely health, election, crime, climate, economy, and education. This task will be offered for a subset of the data of Subtask 3A.

    Input Data

    The data will be provided in the format of Id, title, text, rating, the domain; the description of the columns is as follows:

    Task 3a

    • ID- Unique identifier of the news article
    • Title- Title of the news article
    • text- Text mentioned inside the news article
    • our rating - class of the news article as false, partially false, true, other

    Task 3b

    • public_id- Unique identifier of the news article
    • Title- Title of the news article
    • text- Text mentioned inside the news article
    • domain - domain of the given news article(applicable only for task B)

    Output data format

    Task 3a

    • public_id- Unique identifier of the news article
    • predicted_rating- predicted class

    Sample File

    public_id, predicted_rating
    1, false
    2, true

    Task 3b

    • public_id- Unique identifier of the news article
    • predicted_domain- predicted domain

    Sample file

    public_id, predicted_domain
    1, health
    2, crime
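A submission file in the format of the sample files above could be produced as follows. This is a minimal sketch using Python's standard `csv` module; the `write_submission` helper is hypothetical, and the writer emits no space after the comma, since the text does not say whether the space shown in the samples is required.

```python
import csv
import io


def write_submission(predictions, out, label_column="predicted_rating"):
    """Write (public_id, label) pairs as CSV with the header used in the sample files."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["public_id", label_column])
    for public_id, label in predictions:
        writer.writerow([public_id, label])


# Task 3a example written to an in-memory buffer; a real run would open a file instead.
buffer = io.StringIO()
write_submission([(1, "false"), (2, "true")], buffer)
submission_text = buffer.getvalue()
```

For Task 3b the same helper can be called with `label_column="predicted_domain"`.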

    Additional data for Training

    To train their models, participants can use additional data in a similar format; some datasets are available on the web. We don't provide the ground truth for those datasets. For testing, we will not use any articles from other datasets. Some possible sources:

    IMPORTANT!

    1. Fake news article used for task 3b is a subset of task 3a.
    2. We have used the data from 2010 to 2021, and the content of fake news is mixed up with several topics like election, COVID-19 etc.

    Evaluation Metrics

    This task is evaluated as a classification task. We will use the F1-macro measure for the ranking of teams. There is a limit of 5 runs (total and not per day), and only one person from a team is allowed to submit runs.
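The macro-averaged F1 score is the unweighted mean of the per-class F1 scores, so every class counts equally regardless of its size. The sketch below computes it in plain Python; the `macro_f1` function, the toy labels, and the convention that a class with no true or predicted instances scores 0 are assumptions, and the official scorer may handle that edge case differently.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # Assumed convention: a class that never occurs and is never predicted scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


LABELS = ["false", "partially false", "true", "other"]
gold = ["false", "true", "other", "false"]  # made-up gold labels
pred = ["false", "false", "other", "true"]  # made-up predictions
score = macro_f1(gold, pred, LABELS)
```

In this toy example the per-class F1 scores are 0.5, 0, 0, and 1, giving a macro F1 of 0.375.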

    Submission Link: https://competitions.codalab.org/competitions/31238

    Related Work

    • Shahi GK. AMUSED: An Annotation Framework of Multi-modal Social Media Data. arXiv preprint arXiv:2010.00502. 2020 Oct 1. https://arxiv.org/pdf/2010.00502.pdf
    • G. K. Shahi and D. Nandini, “FakeCovid – a multilingual cross-domain fact check news dataset for covid-19,” in Workshop Proceedings of the 14th International AAAI Conference on Web and Social Media, 2020. http://workshop-proceedings.icwsm.org/abstract?id=2020_14
    • Shahi, G. K., Dirkson, A., & Majchrzak, T. A. (2021). An exploratory study of covid-19 misinformation on twitter. Online Social Networks and Media, 22, 100104. doi: 10.1016/j.osnem.2020.100104
  2. modified-news-category-dataset

    • huggingface.co
    Updated Aug 13, 2022
    Cite
    Karim Tarek (2022). modified-news-category-dataset [Dataset]. https://huggingface.co/datasets/karimtarektech/modified-news-category-dataset
    Authors
    Karim Tarek
    Description

    karimtarektech/modified-news-category-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community

  3. AZERNEWSV1: AZERBAIJANI NEWS CLASSIFICATION DATASET

    • zenodo.org
    • ieee-dataport.org
    bin, csv
    Updated Feb 9, 2024
    Cite
    Samir Rustamov (2024). AZERNEWSV1: AZERBAIJANI NEWS CLASSIFICATION DATASET [Dataset]. http://doi.org/10.5281/zenodo.10638520
    Available download formats: bin, csv
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Samir Rustamov
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Our dataset encompasses a comprehensive collection of Azerbaijani news texts from the Azertac (https://azertag.az/) State Agency, drawn from a variety of news articles.

    Azertac, established on March 1, 1920, was recognized as a pioneering entity within the framework of international information agencies. It has played a pivotal role in the establishment and coordination of various associations, including the Association of National Information Agencies comprising nations affiliated with the Commonwealth of Independent States, the Association of News Agencies representing Turkish-speaking countries, and the Association of National News Agencies associated with countries participating in the Black Sea Economic Cooperation Organization. AZERTAC has engaged in collaborative endeavors with several renowned news agencies to foster global information exchange and cooperation. This extensive network of collaborations underscores Azertac's global reach and influence in international news dissemination.

    The dataset comprises approximately three million rows, with each row representing a sentence extracted from diverse Azerbaijani news sources. These sentences cover a wide spectrum of subjects, including but not limited to politics, the economy, culture, sports, technology, and health. The Labeled dataset, which has been posted and publicly shared in the link, is organized to facilitate rigorous analysis and classification tasks, with essential metadata provided for each sentence.

    The dataset is enriched with crucial metadata attributes that enhance its utility and applicability to various research tasks:

    • News Category: Each sentence is linked to a specific news category, covering subjects such as politics, economy, culture, sports, technology, and health.
    • News Subcategory: To further enhance granularity, each sentence is classified into a subcategory, enabling fine-grained analysis and specialized classification tasks.
    • News Index: A unique identifier for each news article maintains the dataset integrity and supports cross-referencing.
    • News Sentence Order: Sequential order aids in preserving sentence context, which is essential for text generation and summarization.
    • Link: Hyperlinks to original articles provide direct access for researchers to delve into the sentence context.
    • Sentence: The core textual content, which varies in length and complexity, covers a spectrum of linguistic styles and themes.
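Because each row is a single sentence, the News Index and News Sentence Order attributes can be used to reassemble full articles. The sketch below shows one way to do this with the standard library. The column names `news_index`, `sentence_order`, and `sentence` are assumptions derived from the attribute list, and the sample rows are invented; check the actual CSV header before use.

```python
import csv
import io
from collections import defaultdict

# Invented sample rows; the column names are assumptions, not the real CSV header.
SAMPLE = """news_index,sentence_order,sentence
7,2,Second sentence.
7,1,First sentence.
9,1,Another article.
"""


def rebuild_articles(csv_file):
    """Group sentence rows by article id and join them in sentence order."""
    grouped = defaultdict(list)
    for row in csv.DictReader(csv_file):
        grouped[row["news_index"]].append((int(row["sentence_order"]), row["sentence"]))
    return {
        index: " ".join(text for _, text in sorted(rows))
        for index, rows in grouped.items()
    }


articles = rebuild_articles(io.StringIO(SAMPLE))
```

With roughly three million rows, a streaming reader like this avoids building any intermediate table, though the grouped sentences themselves still need to fit in memory.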

    Instructions:

    The dataset is presented in a single CSV file.
  4. News Category Cleaned Dataset

    • kaggle.com
    zip
    Updated Apr 2, 2024
    Cite
    Tamoghna Saha (2024). News Category Cleaned Dataset [Dataset]. https://www.kaggle.com/datasets/tamoghna96saha/news-category-cleaned-dataset/suggestions
    Available download formats: zip (1754314 bytes)
    Authors
    Tamoghna Saha
    Description

    Dataset

    This dataset was created by Tamoghna Saha

    Contents

  5. MNAD Dataset

    • paperswithcode.com
    Updated May 16, 2023
    + more versions
    Cite
    (2023). MNAD Dataset [Dataset]. https://paperswithcode.com/dataset/mnad
    Description

    About the MNAD Dataset

    The MNAD corpus is a collection of over 1 million Moroccan news articles written in modern Arabic language. These news articles have been gathered from 11 prominent electronic news sources. The dataset is made available to the academic community for research purposes, such as data mining (clustering, classification, etc.), information retrieval (ranking, search, etc.), and other non-commercial activities.

    Dataset Fields

    • Title: The title of the article
    • Body: The body of the article
    • Category: The category of the article
    • Source: The electronic newspaper source of the article

    About Version 1 of the Dataset (MNAD.v1)

    Version 1 of the dataset comprises 418,563 articles classified into 19 categories. The data was collected from well-known electronic news sources, namely Akhbarona.ma, Hespress.ma, Hibapress.com, and Le360.com. The articles were stored in four separate CSV files, each corresponding to the news website source. Each CSV file contains three fields: Title, Body, and Category of the news article.

    The dataset is rich in Arabic vocabulary, with approximately 906,125 unique words. It has been utilized as a benchmark in the research paper: "A Moroccan News Articles Dataset (MNAD) For Arabic Text Categorization". In 2021 International Conference on Decision Aid Sciences and Application (DASA).

    This dataset is available for download from the following sources:

    • Kaggle Datasets: MNADv1
    • Huggingface Datasets: MNADv1

    About Version 2 of the Dataset (MNAD.v2)

    Version 2 of the MNAD dataset includes an additional 653,901 articles, bringing the total number of articles to over 1 million (1,069,489), classified into the same 19 categories as in version 1. The new documents were collected from seven additional prominent Moroccan news websites, namely al3omk.com, medi1news.com, alayam24.com, anfaspress.com, alyaoum24.com, barlamane.com, and SnrtNews.com.

    The newly collected articles have been merged with the articles from the previous version into a single CSV file named MNADv2.csv. This file includes an additional column called "Source" to indicate the source of each news article.

    Furthermore, MNAD.v2 incorporates improved pre-processing techniques and data cleaning methods. These enhancements involve removing duplicates, eliminating multiple spaces, discarding rows with NaN values, replacing new lines with " ", excluding very long and very short articles, and removing non-Arabic articles. These additions and improvements aim to enhance the usability and value of the MNAD dataset for researchers and practitioners in the field of Arabic text analysis.
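The cleaning steps listed above can be sketched as a single pass over the article bodies. This is only an illustration of those steps, not the authors' pipeline: the `clean_articles` function, the length thresholds, and the "at least half of the letters are Arabic" test are all assumptions, since the description does not give the actual criteria.

```python
import re

# Characters in the basic Arabic Unicode block.
ARABIC_CHAR = re.compile(r"[\u0600-\u06FF]")


def clean_articles(bodies, min_chars=20, max_chars=10000, min_arabic_ratio=0.5):
    """Apply the listed cleaning steps; every threshold here is an illustrative assumption."""
    seen = set()
    cleaned = []
    for body in bodies:
        if body is None:  # stands in for rows with NaN values
            continue
        text = body.replace("\n", " ")  # replace new lines with " "
        text = re.sub(r" {2,}", " ", text).strip()  # collapse multiple spaces
        if not min_chars <= len(text) <= max_chars:  # drop very short or very long articles
            continue
        letters = [ch for ch in text if ch.isalpha()]
        arabic = sum(1 for ch in letters if ARABIC_CHAR.match(ch))
        if not letters or arabic / len(letters) < min_arabic_ratio:  # drop non-Arabic articles
            continue
        if text in seen:  # drop duplicates
            continue
        seen.add(text)
        cleaned.append(text)
    return cleaned


# Toy input built from one Arabic word, written with escapes to keep the source ASCII.
word = "\u062e\u0628\u0631"
sample = [f"{word}  {word}\n{word}", f"{word} {word} {word}", None, "English only text", word]
result = clean_articles(sample, min_chars=5)
```

Deduplication runs after normalization here, so two articles that differ only in whitespace count as duplicates; whether the original pipeline did the same is not stated.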

    This dataset is available for download from the following sources:

    • Kaggle Datasets: MNADv2
    • Huggingface Datasets: MNADv2

    Citation

    If you use our data, please cite the following paper:

    @inproceedings{MNAD2021,
     author = {Mourad Jbene and Smail Tigani and Rachid Saadane and Abdellah Chehri},
     title = {A Moroccan News Articles Dataset ({MNAD}) For Arabic Text Categorization},
     year = {2021},
     publisher = {{IEEE}},
     booktitle = {2021 International Conference on Decision Aid Sciences and Application ({DASA})},
     doi = {10.1109/dasa53625.2021.9682402},
     url = {https://doi.org/10.1109/dasa53625.2021.9682402},
    }

  6. bbc-news

    • huggingface.co
    • opendatalab.com
    Updated Jun 28, 2022
    Cite
    bbc-news [Dataset]. https://huggingface.co/datasets/SetFit/bbc-news
    Dataset authored and provided by
    SetFit
    Description

    BBC News Topic Dataset

    Dataset on BBC News Topic Classification consisting of 2,225 articles published on the BBC News website during 2004-2005. Each article is labeled under one of 5 categories: business, entertainment, politics, sport or tech. Original source for this dataset:

    Derek Greene, Pádraig Cunningham, “Practical Solutions to the Problem of Diagonal Dominance in Kernel Document Clustering,” in Proc. 23rd International Conference on Machine learning… See the full description on the dataset page: https://huggingface.co/datasets/SetFit/bbc-news.

  7. Times of India News Headlines

    • dataverse.harvard.edu
    • search.dataone.org
    Updated Apr 13, 2022
    Cite
    Rohit Kulkarni (2022). Times of India News Headlines [Dataset]. http://doi.org/10.7910/DVN/DPQMQH
    Dataset provided by
    Harvard Dataverse
    Authors
    Rohit Kulkarni
    License

    CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Time period covered
    Jan 1, 2001 - Mar 31, 2022
    Dataset funded by
    Madarasah
    Description

    Format: CSV (comma-separated values)
    Events: 36,50,970
    • column 1: publish_date - Date of publishing in yyyyMMdd format
    • column 2: headline_category - Category of event in ASCII, dot-delimited values
    • column 3: headline_text - Headline of the article in English
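The three columns described above can be parsed with the standard library alone. The sketch below is an illustration under assumptions: the `parse_headline_row` helper is hypothetical and the row values are invented, not taken from the dataset.

```python
from datetime import datetime


def parse_headline_row(publish_date: str, headline_category: str, headline_text: str):
    """Parse one row: a yyyyMMdd date, a dot-delimited category path, and the headline text."""
    parsed_date = datetime.strptime(publish_date, "%Y%m%d").date()
    category_path = headline_category.split(".")
    return parsed_date, category_path, headline_text


# The row values below are made up for illustration.
row = parse_headline_row("20010102", "city.bengaluru", "Sample headline")
```

Splitting the category on dots keeps the hierarchy available, so headlines can be grouped at any level of the path.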

  8. BBC Latest News Dataset 2021

    • crawlfeeds.com
    json, zip
    Updated Apr 6, 2024
    Cite
    Crawl Feeds (2024). BBC Latest News Dataset 2021 [Dataset]. https://crawlfeeds.com/datasets/bbc-latest-news-dataset-2021
    Available download formats: zip, json
    Dataset authored and provided by
    Crawl Feeds
    License

    https://crawlfeeds.com/privacy_policy

    Description

    This dataset contains more than 1 million news articles, with all the data points present on each news article page extracted. The BBC news articles were first collected in 2021 and cover all the categories present on the BBC site.

    This news dataset is ideal for text classification, finding popular categories, NLP, and other research purposes.

    The dataset is available in JSON format.

  9. ag_news_subset

    • tensorflow.org
    Updated Dec 6, 2022
    + more versions
    Cite
    (2022). ag_news_subset [Dataset]. http://identifiers.org/arxiv:1509.01626
    Description

    AG is a collection of more than 1 million news articles. News articles have been gathered from more than 2000 news sources by ComeToMyHead in more than 1 year of activity. ComeToMyHead is an academic news search engine which has been running since July, 2004. The dataset is provided by the academic community for research purposes in data mining (clustering, classification, etc), information retrieval (ranking, search, etc), xml, data compression, data streaming, and any other non-commercial activity. For more information, please refer to the link http://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html .

    The AG's news topic classification dataset is constructed by Xiang Zhang (xiang.zhang@nyu.edu) from the dataset above. It is used as a text classification benchmark in the following paper: Xiang Zhang, Junbo Zhao, Yann LeCun. Character-level Convolutional Networks for Text Classification. Advances in Neural Information Processing Systems 28 (NIPS 2015).

    The AG's news topic classification dataset is constructed by choosing the 4 largest classes from the original corpus. Each class contains 30,000 training samples and 1,900 testing samples. The total number of training samples is 120,000 and testing 7,600.

    To use this dataset:

    import tensorflow_datasets as tfds

    # Load the training split and print the first four examples.
    ds = tfds.load('ag_news_subset', split='train')
    for ex in ds.take(4):
        print(ex)
    

    See the guide for more information on tensorflow_datasets.

  10. Ultimate Arabic News Dataset

    • data.mendeley.com
    • opendatalab.com
    • +1more
    Updated Jul 4, 2022
    + more versions
    Cite
    Ahmed Hashim Al-Dulaimi (2022). Ultimate Arabic News Dataset [Dataset]. http://doi.org/10.17632/jz56k5wxz7.2
    Authors
    Ahmed Hashim Al-Dulaimi
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Ultimate Arabic News Dataset is a collection of single-label modern Arabic texts that are used in news websites and press articles.

    Arabic news data was collected by web scraping techniques from many famous news sites such as Al-Arabiya, Al-Youm Al-Sabea (Youm7), the news published on the Google search engine and other various sources.

    • The data we collect consists of two Primary files:

    UltimateArabic: A file containing more than 193,000 original Arabic news texts, without pre-processing. The texts contain words, numbers, and symbols that can be removed using pre-processing to increase accuracy when using the dataset in various Arabic natural language processing tasks such as text classification.

    UltimateArabicPrePros: A file that contains the data from the first file after pre-processing, which reduces it to about 188,000 text documents. Stop words, non-Arabic words, symbols and numbers have been removed, so this file is ready for direct use in various Arabic natural language processing tasks, such as text classification.

    • We have added two folders containing additional detailed datasets:

    1- Sample: This folder contains samples of the results of web-scraping techniques for two popular Arab websites in two different news categories, Sports and Politics. This folder contains two datasets:

    • Sample_Youm7_Politic: An example of news in the "Politic" category collected from the Youm7 website.
    • Sample_alarabiya_Sport: An example of news in the "Sport" category collected from the Al-Arabiya website.

    2- Dataset Versions: This volume contains four different versions of the original data set, from which the appropriate version can be selected for use in text classification techniques. The first data set (Original) contains the raw data without pre-processing the data in any way, so the number of tokens in the first data set is very high. In the second data set (Original_without_Stop) the data was cleaned, such as removing symbols, numbers, and non-Arabic words, as well as stop words, so the number of symbols is greatly reduced. In the third dataset (Original_with_Stem) the data was cleaned, and text stemming technique was used to remove all additions and suffixes that might affect the accuracy of the results and to obtain the words roots. In the 4th edition of the dataset (Original_Without_Stop_Stem) all preprocessing techniques such as data cleaning, stop word removal and text stemming technique were applied, so we note that the number of tokens in the 4th edition is the lowest among all releases.

    • The data is divided into 10 different categories: Culture, Diverse, Economy, Sport, Politic, Art, Society, Technology, Medical and Religion.
  11. News Category

    • kaggle.com
    zip
    Updated Oct 19, 2020
    + more versions
    Cite
    Mohd Aquib (2020). News Category [Dataset]. https://www.kaggle.com/aquib5559/news-category
    Available download formats: zip (2809114 bytes)
    Authors
    Mohd Aquib
    Description

    Dataset

    This dataset was created by Mohd Aquib

    Contents

  12. CT-FAN: A Multilingual dataset for Fake News Detection

    • zenodo.org
    • explore.openaire.eu
    • +1more
    zip
    Updated Oct 23, 2022
    Cite
    Gautam Kishore Shahi; Julia Maria Struß; Thomas Mandl; Juliane Köhler; Michael Wiegand; Melanie Siegel (2022). CT-FAN: A Multilingual dataset for Fake News Detection [Dataset]. http://doi.org/10.5281/zenodo.6555293
    Available download formats: zip
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Gautam Kishore Shahi; Julia Maria Struß; Thomas Mandl; Juliane Köhler; Michael Wiegand; Melanie Siegel
    Description

    By downloading the data, you agree with the terms & conditions mentioned below:

    Data Access: The data in the research collection may only be used for research purposes. Portions of the data are copyrighted and have commercial value as data, so you must be careful to use them only for research purposes.

    Summaries, analyses and interpretations of the linguistic properties of the information may be derived and published, provided it is impossible to reconstruct the information from these summaries. You may not try identifying the individuals whose texts are included in this dataset. You may not try to identify the original entry on the fact-checking site. You are not permitted to publish any portion of the dataset besides summary statistics or share it with anyone else.

    We grant you the right to access the collection's content as described in this agreement. You may not otherwise make unauthorised commercial use of, reproduce, prepare derivative works, distribute copies, perform, or publicly display the collection or parts of it. You are responsible for keeping and storing the data in a way that others cannot access. The data is provided free of charge.

    Citation

    Please cite our work as

    @InProceedings{clef-checkthat:2022:task3,
    author = {K{\"o}hler, Juliane and Shahi, Gautam Kishore and Stru{\ss}, Julia Maria and Wiegand, Michael and Siegel, Melanie and Mandl, Thomas},
    title = "Overview of the {CLEF}-2022 {CheckThat}! Lab Task 3 on Fake News Detection",
    year = {2022},
    booktitle = "Working Notes of CLEF 2022---Conference and Labs of the Evaluation Forum",
    series = {CLEF~'2022},
    address = {Bologna, Italy},}
    
    @article{shahi2021overview,
     title={Overview of the CLEF-2021 CheckThat! lab task 3 on fake news detection},
     author={Shahi, Gautam Kishore and Stru{\ss}, Julia Maria and Mandl, Thomas},
     journal={Working Notes of CLEF},
     year={2021}
    }

    Problem Definition: Given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other (e.g., claims in dispute) and detect the topical domain of the article. This task will run in English and German.

    Task 3: Multi-class fake news detection of news articles (English). The task frames fake news detection as a four-class classification problem. Given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other. The training data will be released in batches and comprises roughly 1264 English-language articles with their respective labels. Our definitions for the categories are as follows:

    • False - The main claim made in an article is untrue.

    • Partially False - The main claim of an article is a mixture of true and false information. The article contains partially true and partially false information but cannot be considered 100% true. It includes all articles in categories like partially false, partially true, mostly true, miscaptioned, misleading etc., as defined by different fact-checking services.

    • True - This rating indicates that the primary elements of the main claim are demonstrably true.

    • Other- An article that cannot be categorised as true, false, or partially false due to a lack of evidence about its claims. This category includes articles in dispute and unproven articles.

    Cross-Lingual Task (German)

    Along with the multi-class task for the English language, we have introduced a task for a low-resource language. We will provide the test data in German. The idea of the task is to use the English data and the concept of transfer to build a classification model for German.

    Input Data

    The data will be provided in the format of Id, title, text, rating, the domain; the description of the columns is as follows:

    • ID- Unique identifier of the news article
    • Title- Title of the news article
    • text- Text mentioned inside the news article
    • our rating - class of the news article as false, partially false, true, other

    Output data format

    • public_id- Unique identifier of the news article
    • predicted_rating- predicted class

    Sample File

    public_id, predicted_rating
    1, false
    2, true

    IMPORTANT!

    1. We have used the data from 2010 to 2022, and the content of fake news is mixed up with several topics like elections, COVID-19 etc.

    Baseline: For this task, we have created a baseline system. The baseline system can be found at https://zenodo.org/record/6362498

    Related Work

    • Shahi GK. AMUSED: An Annotation Framework of Multi-modal Social Media Data. arXiv preprint arXiv:2010.00502. 2020 Oct 1. https://arxiv.org/pdf/2010.00502.pdf
    • G. K. Shahi and D. Nandini, “FakeCovid – a multilingual cross-domain fact check news dataset for covid-19,” in Workshop Proceedings of the 14th International AAAI Conference on Web and Social Media, 2020. http://workshop-proceedings.icwsm.org/abstract?id=2020_14
    • Shahi, G. K., Dirkson, A., & Majchrzak, T. A. (2021). An exploratory study of covid-19 misinformation on twitter. Online Social Networks and Media, 22, 100104. doi: 10.1016/j.osnem.2020.100104
    • Shahi, G. K., Struß, J. M., & Mandl, T. (2021). Overview of the CLEF-2021 CheckThat! lab task 3 on fake news detection. Working Notes of CLEF.
    • Nakov, P., Da San Martino, G., Elsayed, T., Barrón-Cedeño, A., Míguez, R., Shaar, S., ... & Mandl, T. (2021, March). The CLEF-2021 CheckThat! lab on detecting check-worthy claims, previously fact-checked claims, and fake news. In European Conference on Information Retrieval (pp. 639-649). Springer, Cham.
    • Nakov, P., Da San Martino, G., Elsayed, T., Barrón-Cedeño, A., Míguez, R., Shaar, S., ... & Kartal, Y. S. (2021, September). Overview of the CLEF–2021 CheckThat! Lab on Detecting Check-Worthy Claims, Previously Fact-Checked Claims, and Fake News. In International Conference of the Cross-Language Evaluation Forum for European Languages (pp. 264-291). Springer, Cham.
  13. CT-FAN-22 corpus: A Multilingual dataset for Fake News Detection

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 6, 2022
    Cite
    Struß Julia Maria (2022). CT-FAN-22 corpus: A Multilingual dataset for Fake News Detection [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5775507
    Dataset provided by
    Thomas Mandl
    Struß Julia Maria
    Shahi Gautam Kishore
    Description

    Data Access: The data in the research collection provided may only be used for research purposes. Portions of the data are copyrighted and have commercial value as data, so you must be careful to use it only for research purposes. Due to these restrictions, the collection is not open data. Please download the Agreement at Data Sharing Agreement and send the signed form to fakenewstask@gmail.com .

    Citation

    Please cite our work as

    @article{shahi2021overview,
     title={Overview of the CLEF-2021 CheckThat! lab task 3 on fake news detection},
     author={Shahi, Gautam Kishore and Stru{\ss}, Julia Maria and Mandl, Thomas},
     journal={Working Notes of CLEF},
     year={2021}
    }

    Problem Definition: Given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other (e.g., claims in dispute) and detect the topical domain of the article. This task will run in English and German.

    Subtask 3: Multi-class fake news detection of news articles (English). The subtask frames fake news detection as a four-class classification problem. The training data will be released in batches and comprises roughly 900 articles with their respective labels. Given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other. Our definitions for the categories are as follows:

    False - The main claim made in an article is untrue.

    Partially False - The main claim of an article is a mixture of true and false information. The article contains partially true and partially false information but cannot be considered 100% true. It includes all articles in categories like partially false, partially true, mostly true, miscaptioned, misleading etc., as defined by different fact-checking services.

    True - This rating indicates that the primary elements of the main claim are demonstrably true.

    Other - An article that cannot be categorised as true, false, or partially false due to lack of evidence about its claims. This category includes articles in dispute and unproven articles.
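
    The definitions above fold several fact-checker verdicts into the Partially False class. A minimal Python sketch of that normalisation step follows, assuming verdicts arrive as free-text strings. The function name `normalize_rating` and the fallback to `other` for any unlisted verdict are assumptions made for illustration; the task description does not specify them.

```python
# Verdicts that the task description groups under "partially false".
PARTIALLY_FALSE_VERDICTS = {
    "partially false",
    "partially true",
    "mostly true",
    "miscaptioned",
    "misleading",
}


def normalize_rating(verdict: str) -> str:
    """Map a fact-checker verdict onto one of the four task classes."""
    label = verdict.strip().lower()
    if label in ("true", "false"):
        return label
    if label in PARTIALLY_FALSE_VERDICTS:
        return "partially false"
    # Assumption: anything else (in dispute, unproven, unknown) becomes "other".
    return "other"
```

    For example, `normalize_rating(" Misleading ")` returns `"partially false"`.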

    Input Data

    The data is provided in the format ID, title, text, rating, domain; the columns are described as follows:

    Task 3

    • ID - Unique identifier of the news article
    • Title - Title of the news article
    • text - Text of the news article
    • our rating - Class of the news article: false, partially false, true, or other

    Output data format

    Task 3

    • public_id - Unique identifier of the news article
    • predicted_rating - Predicted class

    Sample File

    public_id, predicted_rating
    1, false
    2, true

    Sample file

    public_id, predicted_domain
    1, health
    2, crime

    Additional data for Training

    To train their models, participants can use additional data in a similar format; some datasets are available on the web. We do not provide the ground truth for those datasets. For testing, we will not use any articles from other datasets. Some possible sources:

    Fakenews Classification Datasets

    Fake News Detection Challenge KDD 2020

    FakeNewsNet

    IMPORTANT!

    The data spans 2010 to 2021, and the fake news content covers a mix of topics such as elections and COVID-19.

    Evaluation Metrics

    This task is evaluated as a classification task. We will use the F1-macro measure for the ranking of teams. There is a limit of 5 runs (total and not per day), and only one person from a team is allowed to submit runs.
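
    The F1-macro measure mentioned above is the unweighted mean of the per-class F1 scores, so rare classes count as much as frequent ones. The plain-Python sketch below illustrates the computation; it is not the organisers' scorer. The label list and the choice to score a class with no gold and no predicted items as 0 are assumptions.

```python
from collections import Counter

# Assumption: the four class labels, spelled as in the task description.
LABELS = ("false", "partially false", "true", "other")


def macro_f1(gold, predicted, labels=LABELS):
    """Return the unweighted mean of per-class F1 scores."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # p was predicted but is wrong
            fn[g] += 1  # g was the gold label but was missed
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # Assumption: a class that never occurs in gold or predictions scores 0.
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(labels)


gold = ["false", "true", "other", "false"]
pred = ["false", "false", "other", "true"]
print(macro_f1(gold, pred))
```

    In this toy example the per-class F1 scores are 0.5 (false), 0 (partially false), 0 (true), and 1 (other), giving a macro F1 of 0.375.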

    Submission Link: Coming soon

    Related Work

    Shahi, G. K., Struß, J. M., & Mandl, T. (2021). Overview of the CLEF-2021 CheckThat! lab task 3 on fake news detection. Working Notes of CLEF.

    Nakov, P., Da San Martino, G., Elsayed, T., Barrón-Cedeño, A., Míguez, R., Shaar, S., ... & Mandl, T. (2021, March). The CLEF-2021 CheckThat! lab on detecting check-worthy claims, previously fact-checked claims, and fake news. In European Conference on Information Retrieval (pp. 639-649). Springer, Cham.

    Nakov, P., Da San Martino, G., Elsayed, T., Barrón-Cedeño, A., Míguez, R., Shaar, S., ... & Kartal, Y. S. (2021, September). Overview of the CLEF–2021 CheckThat! Lab on Detecting Check-Worthy Claims, Previously Fact-Checked Claims, and Fake News. In International Conference of the Cross-Language Evaluation Forum for European Languages (pp. 264-291). Springer, Cham.

    Shahi GK. AMUSED: An Annotation Framework of Multi-modal Social Media Data. arXiv preprint arXiv:2010.00502. 2020 Oct 1. https://arxiv.org/pdf/2010.00502.pdf

    G. K. Shahi and D. Nandini, “FakeCovid – a multilingual cross-domain fact check news dataset for covid-19,” in Workshop Proceedings of the 14th International AAAI Conference on Web and Social Media, 2020. http://workshop-proceedings.icwsm.org/abstract?id=2020_14

    Shahi, G. K., Dirkson, A., & Majchrzak, T. A. (2021). An exploratory study of covid-19 misinformation on twitter. Online Social Networks and Media, 22, 100104. doi: 10.1016/j.osnem.2020.100104

  14. News Story Category : DistilBERT

    • kaggle.com
    zip
    Updated Jul 21, 2024
    Cite
    Gaurav Dutta (2024). News Story Category : DistilBERT [Dataset]. https://www.kaggle.com/datasets/gauravduttakiit/news-story-category-distilbert/code
    Available download formats: zip (246779655 bytes)
    Dataset updated
    Jul 21, 2024
    Authors
    Gaurav Dutta
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Dataset

    This dataset was created by Gaurav Dutta

    Released under Apache 2.0


  15. Lao News classification

    • zenodo.org
    csv
    Updated Mar 4, 2025
    Cite
    Wannaphong Phatthiyaphaibun (2025). Lao News classification [Dataset]. http://doi.org/10.5281/zenodo.14967275
    Available download formats: csv
    Dataset updated
    Mar 4, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Wannaphong Phatthiyaphaibun
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Area covered
    Laos
    Description

    Dataset Card for Lao News classification

    Lao News classification dataset

    This dataset contains Lao news articles collected from laopost.com for news classification.

    Dataset Details

    Dataset Description

    • Curated by: Wannaphong Phatthiyaphaibun
    • Language(s) (NLP): Lao
    • License: cc-by-4.0

    Uses

    Direct Use

    News classification

    Dataset Structure

    The dataset is divided into three splits: train, validation, and test. Each split contains news articles, with each article represented as a dictionary with the following fields:

    • title: The title of the news article (string).
    • text: The main content of the news article (string).
    • category: The category of the news article (string).
    • date: The publication date of the news article (string).
    • url: The URL of the news article (string).

    The dataset is structured as a DatasetDict object, which contains three Dataset objects, one for each split.

    • The train split contains 9196 news articles.
    • The validation split contains 3066 news articles.
    • The test split contains 3066 news articles.

    The three splits follow the conventional train/validation/test layout used for training, tuning, and testing machine learning models, at roughly 60/20/20 by article count. The criteria used to assign articles to each split are not stated.
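
    As a quick check on the split sizes listed above, the short Python sketch below computes each split's share of the corpus. The dictionary and function names are illustrative only; the counts are the ones given in the dataset card.

```python
# Article counts per split, as stated in the dataset card.
SPLIT_SIZES = {"train": 9196, "validation": 3066, "test": 3066}


def split_fractions(sizes):
    """Return each split's share of the total number of articles."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


fractions = split_fractions(SPLIT_SIZES)
for name, share in fractions.items():
    print(f"{name}: {share:.1%}")
```

    The total is 15328 articles, which puts the train split at about 60% and the validation and test splits at about 20% each.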

    Categories

    Train

    ຂ່າວຕ່າງປະເທດ               3417
    ຂ່າວພາຍໃນ                  3219
    ຂ່າວທ້ອງຖິ່ນ                 1307
    ຂ່າວເຫດການ                 459
    ສຸຂະພາບ ແລະ ສີ່ງແວດລ້ອມ      404
    ຂ່າວບັນເທິງ                 240
    ຂ່າວທ່ອງທ່ຽວ                150

    Validation

    ຂ່າວຕ່າງປະເທດ               1163
    ຂ່າວພາຍໃນ                  1026
    ຂ່າວທ້ອງຖິ່ນ                 449
    ຂ່າວເຫດການ                 157
    ສຸຂະພາບ ແລະ ສີ່ງແວດລ້ອມ      137
    ຂ່າວບັນເທິງ                 86
    ຂ່າວທ່ອງທ່ຽວ                48

    Test

    ຂ່າວຕ່າງປະເທດ               1185
    ຂ່າວພາຍໃນ                  1059
    ຂ່າວທ້ອງຖິ່ນ                 431
    ຂ່າວເຫດການ                 147
    ສຸຂະພາບ ແລະ ສີ່ງແວດລ້ອມ      136
    ຂ່າວບັນເທິງ                 64
    ຂ່າວທ່ອງທ່ຽວ                44
    

    Dataset Creation

    We collected the news articles and their categories from laopost.com.

    Categories

    • ຂ່າວຕ່າງປະເທດ: Foreign news
    • ຂ່າວພາຍໃນ: Laos internal news
    • ຂ່າວທ້ອງຖິ່ນ: Local news
    • ຂ່າວເຫດການ: Event news, such as accidents, crimes, illegal activities
    • ສຸຂະພາບ ແລະ ສີ່ງແວດລ້ອມ: Health and environmental news
    • ຂ່າວບັນເທິງ: Entertainment news
    • ຂ່າວທ່ອງທ່ຽວ: Travel news

    Other categories were not collected for this dataset because they contain too few articles, duplicate another category (for example, ອຸບັດເຫດແລະປາກົດການຫຍໍ້ທໍ້ and ຂ່າວເຫດການ), or are no longer updated on the website (for example, ມູມໄອທີລາວ, or IT news, last updated 22/11/2024).

    Licensing Information

    The dataset is released under the Creative Commons Attribution 4.0 International license. The use of this dataset is also subject to CommonCrawl's Terms of Use.

    Citation

    If you use this dataset in your project or research, you can cite as follows:

    BibTeX:

    @dataset{phatthiyaphaibun_2025_14967275,
     author    = {Phatthiyaphaibun, Wannaphong},
     title    = {Lao News classification},
     month    = mar,
     year     = 2025,
     publisher  = {Zenodo},
     version   = {1.0.0},
     doi     = {10.5281/zenodo.14967275},
     url     = {https://doi.org/10.5281/zenodo.14967275},
    }
    

    APA:

    Phatthiyaphaibun, W. (2025). Lao News classification (1.0.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.14967275

  16. news-category

    • huggingface.co
    Updated Jul 26, 2024
    + more versions
    Cite
    Alireza mp (2024). news-category [Dataset]. https://huggingface.co/datasets/Alirezamp/news-category
    Available in Croissant, a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Jul 26, 2024
    Authors
    Alireza mp
    Description

    Alirezamp/news-category dataset hosted on Hugging Face and contributed by the HF Datasets community

  17. Digital news categories ranked by willingness to pay in the United Kingdom...

    • statista.com
    Updated Apr 4, 2014
    Cite
    Statista (2014). Digital news categories ranked by willingness to pay in the United Kingdom (UK) 2014 [Dataset]. https://www.statista.com/statistics/296589/willingness-to-pay-for-digital-news-content-by-category-uk/
    Dataset updated
    Apr 4, 2014
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    Mar 18, 2014
    Area covered
    United Kingdom
    Description

    This statistic displays what categories of digital news internet users in the United Kingdom would consider paying for as of March 2014. During the survey, 8 percent of UK internet users reported they would consider paying for expert opinion, analysis and commentary content.

  18. Forex News Annotated Dataset for Sentiment Analysis

    • zenodo.org
    • paperswithcode.com
    • +1 more
    csv
    Updated Nov 11, 2023
    Cite
    Georgios Fatouros; Kalliopi Kouroumali (2023). Forex News Annotated Dataset for Sentiment Analysis [Dataset]. http://doi.org/10.5281/zenodo.7976208
    Available download formats: csv
    Dataset updated
    Nov 11, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Georgios Fatouros; Kalliopi Kouroumali
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This dataset contains news headlines relevant to key forex pairs: AUDUSD, EURCHF, EURUSD, GBPUSD, and USDJPY. The data was extracted from reputable platforms Forex Live and FXstreet over a period of 86 days, from January to May 2023. The dataset comprises 2,291 unique news headlines. Each headline includes an associated forex pair, timestamp, source, author, URL, and the corresponding article text. Data was collected using web scraping techniques executed via a custom service on a virtual machine. This service periodically retrieves the latest news for a specified forex pair (ticker) from each platform, parsing all available information. The collected data is then processed to extract details such as the article's timestamp, author, and URL. The URL is further used to retrieve the full text of each article. This data acquisition process repeats approximately every 15 minutes.

    To ensure the reliability of the dataset, we manually annotated each headline for sentiment. Instead of solely focusing on the textual content, we ascertained sentiment based on the potential short-term impact of the headline on its corresponding forex pair. This method recognizes the currency market's acute sensitivity to economic news, which significantly influences many trading strategies. As such, this dataset could serve as an invaluable resource for fine-tuning sentiment analysis models in the financial realm.

    We used three categories for annotation: 'positive', 'negative', and 'neutral', which correspond to bullish, bearish, and hold sentiments, respectively, for the forex pair linked to each headline. The following Table provides examples of annotated headlines along with brief explanations of the assigned sentiment.

    Examples of Annotated Headlines

    Forex Pair | Headline | Sentiment | Explanation
    GBPUSD | Diminishing bets for a move to 12400 | Neutral | Lack of strong sentiment in either direction
    GBPUSD | No reasons to dislike Cable in the very near term as long as the Dollar momentum remains soft | Positive | Positive sentiment towards GBPUSD (Cable) in the near term
    GBPUSD | When are the UK jobs and how could they affect GBPUSD | Neutral | Poses a question and does not express a clear sentiment
    JPYUSD | Appropriate to continue monetary easing to achieve 2% inflation target with wage growth | Positive | Monetary easing from Bank of Japan (BoJ) could lead to a weaker JPY in the short term due to increased money supply
    USDJPY | Dollar rebounds despite US data. Yen gains amid lower yields | Neutral | Since both the USD and JPY are gaining, the effects on the USDJPY forex pair might offset each other
    USDJPY | USDJPY to reach 124 by Q4 as the likelihood of a BoJ policy shift should accelerate Yen gains | Negative | USDJPY is expected to reach a lower value, with the USD losing value against the JPY
    AUDUSD | RBA Governor Lowe's Testimony High inflation is damaging and corrosive | Positive | Reserve Bank of Australia (RBA) expresses concerns about inflation. Typically, central banks combat high inflation with higher interest rates, which could strengthen AUD.

    Moreover, the dataset includes two columns with the predicted sentiment class and score as predicted by the FinBERT model. Specifically, the FinBERT model outputs a set of probabilities for each sentiment class (positive, negative, and neutral), representing the model's confidence in associating the input headline with each sentiment category. These probabilities are used to determine the predicted class and a sentiment score for each headline. The sentiment score is computed by subtracting the negative class probability from the positive one.
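
    The score described above is a simple difference of two probabilities. The Python sketch below shows one way to derive both columns from a probability triple. The function name, the dictionary layout, and the use of the highest-probability class as the predicted class are assumptions; the description only says the probabilities "are used to determine" the class.

```python
def sentiment_from_probs(probs):
    """Return (predicted_class, score) from class probabilities.

    probs maps "positive", "negative", and "neutral" to probabilities.
    """
    # Assumption: the predicted class is the one with the highest probability.
    predicted = max(probs, key=probs.get)
    # As described: positive probability minus negative probability.
    score = probs["positive"] - probs["negative"]
    return predicted, score


label, score = sentiment_from_probs(
    {"positive": 0.7, "negative": 0.1, "neutral": 0.2}
)
print(label, round(score, 2))
```

    The score lies in [-1, 1]: values near 1 indicate a confident positive reading, values near -1 a confident negative one, and values near 0 either a neutral headline or balanced positive and negative probabilities.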

  19. News Category Prediction Data

    • kaggle.com
    zip
    Updated Jan 12, 2019
    Cite
    murari jha (2019). News Category Prediction Data [Dataset]. https://www.kaggle.com/murarijha/mainset
    Available download formats: zip (17713743 bytes)
    Dataset updated
    Jan 12, 2019
    Authors
    murari jha
    Description

    Dataset

    This dataset was created by murari jha


  20. News Interactions on Globo.com Dataset

    • paperswithcode.com
    + more versions
    Cite
    Gabriel de Souza Pereira Moreira; Dietmar Jannach; Adilson Marques da Cunha, News Interactions on Globo.com Dataset [Dataset]. https://paperswithcode.com/dataset/news-interactions-on-globo-com
    Authors
    Gabriel de Souza Pereira Moreira; Dietmar Jannach; Adilson Marques da Cunha
    Description

    Context: This large dataset of user interaction logs (page views) from a news portal was kindly provided by Globo.com, the most popular news portal in Brazil, to support reproducibility of the experiments with CHAMELEON, a meta-architecture for contextual hybrid session-based news recommender systems. The source code is available on GitHub.

    The first version (v1) of this dataset was released for reproducibility of the experiments presented in the following paper:

    Gabriel de Souza Pereira Moreira, Felipe Ferreira, and Adilson Marques da Cunha. 2018. News Session-Based Recommendations using Deep Neural Networks. In 3rd Workshop on Deep Learning for Recommender Systems (DLRS 2018), October 6, 2018, Vancouver, BC, Canada. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3270323.3270328

    A second version (v2) of this dataset was made available for reproducibility of the experiments presented in the following paper:

    Gabriel de Souza Pereira Moreira, Dietmar Jannach, and Adilson Marques da Cunha. 2019. Contextual Hybrid Session-based News Recommendation with Recurrent Neural Networks. arXiv preprint arXiv:1904.10367, 49 pages

    Compared to v1, the only differences are:

    • Four additional user contextual attributes are included (click_os, click_country, click_region, click_referrer_type).
    • Repeated clicks (clicks on the same article) within sessions were removed. Sessions with fewer than two clicks (the minimum for the next-click prediction task) were removed.
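
    The v2 cleaning described for this dataset (removing repeated clicks within a session, then dropping sessions with fewer than two clicks) can be sketched in Python as follows. This is an illustration, not the authors' preprocessing code. The in-memory layout (a dict from session id to a list of article ids in click order), the function name, and the order in which the two filters are applied are all assumptions.

```python
def filter_sessions(sessions, min_clicks=2):
    """Deduplicate clicks per session and drop sessions that are too short.

    sessions: dict mapping session id -> list of article ids in click order.
    """
    cleaned = {}
    for session_id, clicks in sessions.items():
        seen = set()
        unique_clicks = []
        for article_id in clicks:
            # Keep only the first click on each article within the session.
            if article_id not in seen:
                seen.add(article_id)
                unique_clicks.append(article_id)
        # Next-click prediction needs at least one (input, target) pair.
        if len(unique_clicks) >= min_clicks:
            cleaned[session_id] = unique_clicks
    return cleaned


example = {"s1": [10, 10, 11], "s2": [7, 7], "s3": [1, 2, 1]}
print(filter_sessions(example))
```

    In the example, session `s2` collapses to a single click and is dropped, while `s1` and `s3` keep two distinct clicks each.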

    You are not allowed to use this dataset for commercial purposes, only with academic objectives (like education or research). If used for research, please cite the above papers.

    Content: The dataset contains a sample of user interactions (page views) on the G1 news portal from Oct. 1 to 16, 2017, including about 3 million clicks, distributed over more than 1 million sessions from 314,000 users who read more than 46,000 different news articles during that period.

    It is composed of three files/folders:

    • clicks.zip - Folder with CSV files (one per hour), containing user session interactions in the news portal.
    • articles_metadata.csv - CSV file with metadata about all (364047) published articles.
    • articles_embeddings.pickle - Pickle (Python 3) of a NumPy matrix containing the Article Content Embeddings (250-dimensional vectors), trained on the articles' text and metadata by CHAMELEON's ACR module (see paper for details) for 364047 published articles.

    P.S. The full text of the news articles could not be provided due to license restrictions, but those embeddings can be used by neural networks to represent their content. See this paper for a t-SNE visualization of these embeddings, colored by category.

    Acknowledgements: I would like to thank Globo.com for providing this dataset for this research and for the academic community, and in particular Felipe Ferreira for preparing the original dataset.

    Dataset banner photo by rawpixel on Unsplash

    Inspiration: This dataset may be useful if you want to implement and evaluate hybrid and contextual news recommender systems, using both user interactions and article content and metadata to provide recommendations. You might also use it for analytics, for example to understand how user interactions in a news portal are distributed by user, by article, or by category.

    If you are interested in a dataset of user interactions on articles with the full text provided, to experiment with different text representations using NLP, you might want to take a look at this smaller dataset.

Cite
Gautam Kishore Shahi; Julia Maria Struß; Thomas Mandl (2022). CT-FAN-21 corpus: A dataset for Fake News Detection [Dataset]. http://doi.org/10.5281/zenodo.4714517

CT-FAN-21 corpus: A dataset for Fake News Detection

Explore at:
12 scholarly articles cite this dataset (View in Google Scholar)
Dataset updated
Oct 23, 2022
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Gautam Kishore Shahi; Julia Maria Struß; Thomas Mandl
Description

Data Access: The data in this research collection may be used for research purposes only. Portions of the data are copyrighted and have commercial value as data, so take care to use it only for research. Because of these restrictions, the collection is not open data. Please download the Data Sharing Agreement and send the signed form to fakenewstask@gmail.com.

Citation

Please cite our work as

@article{shahi2021overview,
 title={Overview of the CLEF-2021 CheckThat! lab task 3 on fake news detection},
 author={Shahi, Gautam Kishore and Stru{\ss}, Julia Maria and Mandl, Thomas},
 journal={Working Notes of CLEF},
 year={2021}
}

Problem Definition: Given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other (e.g., claims in dispute) and detect the topical domain of the article. This task will run in English.

Subtask 3A: Multi-class fake news detection of news articles (English). Subtask 3A frames fake news detection as a four-class classification problem. The training data will be released in batches and comprises roughly 900 articles with their respective labels. Given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other. Our definitions for the categories are as follows:

  • False - The main claim made in an article is untrue.

  • Partially False - The main claim of an article is a mixture of true and false information. The article contains partially true and partially false information but cannot be considered 100% true. It includes all articles in categories like partially false, partially true, mostly true, miscaptioned, misleading etc., as defined by different fact-checking services.

  • True - This rating indicates that the primary elements of the main claim are demonstrably true.

  • Other - An article that cannot be categorised as true, false, or partially false due to lack of evidence about its claims. This category includes articles in dispute and unproven articles.

Subtask 3B: Topical Domain Classification of News Articles (English). Fact-checkers require background expertise to assess the truthfulness of an article, so categorising articles by domain helps to automate the sampling process from a stream of data. Given the text of a news article, determine the topical domain of the article. This is a classification problem: the task is to categorise fake news articles into six topical categories such as health, election, crime, climate, and education. This task will be offered for a subset of the data of Subtask 3A.

Input Data

The data is provided in the format ID, title, text, rating, domain; the columns are described as follows:

Task 3a

  • ID - Unique identifier of the news article
  • Title - Title of the news article
  • text - Text of the news article
  • our rating - Class of the news article: false, partially false, true, or other

Task 3b

  • public_id - Unique identifier of the news article
  • Title - Title of the news article
  • text - Text of the news article
  • domain - Domain of the given news article (applicable only for task B)

Output data format

Task 3a

  • public_id - Unique identifier of the news article
  • predicted_rating - Predicted class

Sample File

public_id, predicted_rating
1, false
2, true

Task 3b

  • public_id - Unique identifier of the news article
  • predicted_domain - Predicted domain

Sample file

public_id, predicted_domain
1, health
2, crime
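
The two sample files above share one layout: a header row followed by one `public_id, value` pair per line. A minimal Python sketch for producing such a file is shown below, using the standard `csv` module. The helper name `write_submission` is hypothetical, and the sketch writes plain comma-separated values without the space after the comma seen in the samples; whether the scorer requires that space is not stated.

```python
import csv
import io


def write_submission(rows, value_column, out):
    """Write (public_id, value) pairs as CSV with a header row."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["public_id", value_column])
    for public_id, value in rows:
        writer.writerow([public_id, value])


# Task 3a example; pass "predicted_domain" as the column name for Task 3b.
buffer = io.StringIO()
write_submission([(1, "false"), (2, "true")], "predicted_rating", buffer)
print(buffer.getvalue())
```

To write to disk instead, pass a file opened with `open(path, "w", newline="")` in place of the `StringIO` buffer.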

Additional data for Training

To train their models, participants can use additional data in a similar format; some datasets are available on the web. We do not provide the ground truth for those datasets. For testing, we will not use any articles from other datasets. Some possible sources:

IMPORTANT!

  1. Fake news article used for task 3b is a subset of task 3a.
  2. The data spans 2010 to 2021, and the fake news content covers a mix of topics such as elections and COVID-19.

Evaluation Metrics

This task is evaluated as a classification task. We will use the F1-macro measure for the ranking of teams. There is a limit of 5 runs (total and not per day), and only one person from a team is allowed to submit runs.

Submission Link: https://competitions.codalab.org/competitions/31238

Related Work

  • Shahi GK. AMUSED: An Annotation Framework of Multi-modal Social Media Data. arXiv preprint arXiv:2010.00502. 2020 Oct 1. https://arxiv.org/pdf/2010.00502.pdf
  • G. K. Shahi and D. Nandini, “FakeCovid – a multilingual cross-domain fact check news dataset for covid-19,” in Workshop Proceedings of the 14th International AAAI Conference on Web and Social Media, 2020. http://workshop-proceedings.icwsm.org/abstract?id=2020_14
  • Shahi, G. K., Dirkson, A., & Majchrzak, T. A. (2021). An exploratory study of covid-19 misinformation on twitter. Online Social Networks and Media, 22, 100104. doi: 10.1016/j.osnem.2020.100104