3 datasets found
  1. Fake News Spreader Classification - CoAID Extended dataset

    • figshare.com
    • explore.openaire.eu
    txt
    Updated Jun 7, 2023
    Cite
    Simone Leonardi; Giuseppe Rizzo; Maurizio Morisio (2023). Fake News Spreader Classification - CoAID Extended dataset [Dataset]. http://doi.org/10.6084/m9.figshare.14392859.v1
    Available download formats: txt
    Dataset updated
    Jun 7, 2023
    Dataset provided by
    figshare
    Authors
    Simone Leonardi; Giuseppe Rizzo; Maurizio Morisio
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains a gold standard for classifying users who share misinformation about COVID-19. It provides a list of user IDs (mapped to preserve privacy), the list of real tweet IDs as retrieved from Twitter, and a label classifying each tweet author as a spreader or a checker. Spreaders are users supporting fake news, while checkers are users supporting real news. The list of fake and real news comes from the CoAID dataset by Limeng Cui and Dongwon Lee. Data were retrieved from December 1, 2019 to April 1, 2021. For further details, see the paper "Fake News Spreader Automated Classification for Breaking the Misinformation Chain" in the MDPI Information journal Special Issue "News Research in Social Networks and Social Media", or open an issue in the GitHub repository.

  2. CoAID dataset with multiple extracted features (both sparse and dense)

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jun 10, 2022
    Cite
    Guillaume Bernard (2022). CoAID dataset with multiple extracted features (both sparse and dense) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6630404
    Dataset updated
    Jun 10, 2022
    Dataset authored and provided by
    Guillaume Bernard
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is a publication of the CoAID dataset, which was originally dedicated to fake news detection. Here we repurpose the dataset for event tracking in press documents.

    Cui, Limeng, and Dongwon Lee. 2020. "CoAID: COVID-19 Healthcare Misinformation Dataset". ArXiv:2006.00885 [Cs], November. http://arxiv.org/abs/2006.00885.

    In this dataset, we provide multiple features extracted from the text itself. Please note the text is missing from the dataset published in the CSV format for copyright reasons. You can download the original datasets and manually add the missing texts from the original publications.

    Features are extracted using:

    • A corpus of reference articles in multiple languages for TF-IDF weighting (features_news) [1]

    • A corpus of tweets reporting news for TF-IDF weighting (features_tweets) [1]

    • An S-BERT model [2] that uses distiluse-base-multilingual-cased-v1 (features_use) [3]

    • An S-BERT model [2] that uses paraphrase-multilingual-mpnet-base-v2 (features_mpnet) [4]

    References:

    [1]: Guillaume Bernard. (2022). Resources to compute TF-IDF weightings on press articles and tweets (1.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.6610406

    [2]: Reimers, Nils, and Iryna Gurevych. 2019. "Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks". In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 3982-92. Hong Kong, China: Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-1410.
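
The sparse features described above (features_news, features_tweets) weight terms by TF-IDF using a separate reference corpus. The following is a minimal sketch of that idea, not the dataset author's pipeline: the whitespace tokenizer, the smoothed IDF formula, the function names, and the toy corpus are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (assumption: the real tokenizer is not documented here)."""
    return text.lower().split()


def fit_idf(reference_corpus):
    """Compute smoothed inverse document frequencies from a reference corpus."""
    n_docs = len(reference_corpus)
    doc_freq = Counter()
    for doc in reference_corpus:
        # Count each term once per document.
        doc_freq.update(set(tokenize(doc)))
    # Smoothed IDF (assumed variant): log((1 + N) / (1 + df)) + 1.
    return {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}


def tfidf_vector(text, idf):
    """Return a sparse {term: weight} vector; terms absent from the reference corpus are dropped."""
    counts = Counter(tokenize(text))
    total = sum(counts.values())
    return {t: (c / total) * idf[t] for t, c in counts.items() if t in idf}


# Hypothetical reference corpus standing in for the news or tweet corpus.
reference = ["covid vaccine trial results", "vaccine rollout begins", "stock markets fall"]
idf = fit_idf(reference)
vec = tfidf_vector("vaccine trial delayed", idf)
print(vec)
```

In this sketch "trial" receives a higher weight than "vaccine" because it appears in fewer reference documents, and "delayed" is dropped because the reference corpus never saw it.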

  3. CoAID dataset texts with OCR degradations

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jun 10, 2022
    Cite
    Guillaume Bernard (2022). CoAID dataset texts with OCR degradations [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6630709
    Dataset updated
    Jun 10, 2022
    Dataset authored and provided by
    Guillaume Bernard
    Description

    This is the text of the CoAID dataset dedicated to fake news detection that has been updated to be used in event detection.

    Cui, Limeng, and Dongwon Lee. 2020. "CoAID: COVID-19 Healthcare Misinformation Dataset". ArXiv:2006.00885 [Cs], November. http://arxiv.org/abs/2006.00885.

    Guillaume Bernard. (2022). CoAID dataset with multiple extracted features (both sparse and dense) (1.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.6630405

    Some degradations are applied using the DocCreator [1] tool in order to degrade the text of the tweets and to reproduce some common errors found in OCRised documents [2].

    [1]: Journet, Nicholas, Muriel Visani, Boris Mansencal, Kieu Van-Cuong, and Antoine Billy. 2017. "DocCreator: A New Software for Creating Synthetic Ground-Truthed Document Images". Journal of Imaging 3 (4): 62. https://doi.org/10.3390/jimaging3040062.

    [2]: Linhares Pontes, Elvys, Ahmed Hamdi, Nicolas Sidere, and Antoine Doucet. 2019. "Impact of OCR Quality on Named Entity Linking". In Digital Libraries at the Crossroads of Digital Information for the Future, 11853:102-15. Lecture Notes in Computer Science. Cham: Springer International Publishing. https://doi.org/10.1007/978-3-030-34058-2_11.

    The results of the OCR degradations are as follows:

    CoAID CER/WER

    Dataset  Metric  Without  Character degradation  Phantom degradation  Bleed  Blur   All
    CoAID    CER     2.105    6.358                  2.105                2.122  2.616  7.898
    CoAID    WER     2.494    20.230                 2.496                2.580  3.726  20.230
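
For readers unfamiliar with the two metrics in the table above, the sketch below shows how character error rate (CER) and word error rate (WER) are commonly computed: edit distance between the clean and degraded text, divided by the length of the clean text. It is an illustration under assumptions, not the procedure used to produce the reported figures: the function names and sample strings are hypothetical, and expressing the result as a percentage is an assumption, since the source does not state the unit.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of tokens)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate in percent (assumes a non-empty reference)."""
    return 100.0 * edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate in percent, using whitespace tokens (assumes a non-empty reference)."""
    ref_words = reference.split()
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


# Hypothetical example: one character is corrupted in a three-word text.
reference = "covid vaccine trial"
degraded = "covid vaccinc trial"
print(cer(reference, degraded))
print(wer(reference, degraded))
```

A single wrong character invalidates the whole word that contains it, which is why WER typically rises much faster than CER under character-level degradation.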
    