12 datasets found
  1. Twitter Tweets Sentiment Dataset

    • kaggle.com
    Updated Apr 8, 2022
    Cite
    M Yasser H (2022). Twitter Tweets Sentiment Dataset [Dataset]. https://www.kaggle.com/datasets/yasserh/twitter-tweets-sentiment-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 8, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    M Yasser H
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description


    Twitter is an online social media platform where people share their thoughts as tweets. Some users misuse it to post hateful content. Twitter is trying to tackle this problem, and we can help by building a strong NLP-based classifier model that distinguishes negative tweets so they can be blocked. Can you build a strong classifier model to predict the same?

    Each row contains the text of a tweet and a sentiment label. In the training set you are provided with a word or phrase drawn from the tweet (selected_text) that encapsulates the provided sentiment.

    Make sure, when parsing the CSV, to remove the beginning / ending quotes from the text field, to ensure that you don't include them in your training.

    You're attempting to predict the word or phrase from the tweet that exemplifies the provided sentiment. The word or phrase should include all characters within that span (i.e. including commas, spaces, etc.)

    Columns:

    1. textID - unique ID for each piece of text
    2. text - the text of the tweet
    3. sentiment - the general sentiment of the tweet
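    The note above about stray quotes can be handled with a couple of lines of Python. This is a minimal sketch: the helper name and the sample string are made up, not part of the dataset.

```python
def clean_text(raw: str) -> str:
    """Strip the beginning/ending quotes left around a tweet's text field."""
    return raw.strip().strip('"').strip()

print(clean_text('"so excited for the weekend"'))
# → so excited for the weekend
```

The same helper can be applied to every row after loading the CSV with `csv.DictReader` or `pandas.read_csv`.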

    Acknowledgement:

    The dataset was downloaded from the Kaggle competition:
    https://www.kaggle.com/c/tweet-sentiment-extraction/data?select=train.csv

    Objective:

    • Understand the dataset and clean it up (if required).
    • Build classification models to predict the Twitter sentiments.
    • Compare the evaluation metrics of various classification algorithms.
  2. COVID-19 rumor dataset

    • figshare.com
    html
    Updated Jun 10, 2023
    Cite
    cheng (2023). COVID-19 rumor dataset [Dataset]. http://doi.org/10.6084/m9.figshare.14456385.v2
    Explore at:
    html (available download formats)
    Dataset updated
    Jun 10, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    cheng
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A COVID-19 misinformation / fake news / rumor / disinformation dataset collected from online social media and news websites.

    Usage notes:
    • Misinformation detection, classification, tracking, prediction.
    • Misinformation sentiment analysis.
    • Rumor veracity classification, comment stance classification.
    • Rumor tracking, social network analysis.

    Data pre-processing and data analysis code is available at https://github.com/MickeysClubhouse/COVID-19-rumor-dataset. Please see the full info in our GitHub link.

    Cite us: Cheng, Mingxi, et al. "A COVID-19 Rumor Dataset." Frontiers in Psychology 12 (2021): 1566.

    @misc{cheng2021covid,
    title={A COVID-19 Rumor Dataset},
    author={Cheng, Mingxi and Wang, Songli and Yan, Xiaofeng and Yang, Tianqi and Wang, Wenshuo and Huang, Zehao and Xiao, Xiongye and Nazarian, Shahin and Bogdan, Paul},
    journal={Frontiers in Psychology},
    volume={12},
    pages={1566},
    year={2021},
    publisher={Frontiers}
    }

  3. Gajderowicz, B., Fisher, A., Mago, V.: (preperation) "Graph pruning for...

    • zenodo.org
    • data.niaid.nih.gov
    bin, zip
    Updated May 2, 2024
    Cite
    Bart Gajderowicz (2024). Gajderowicz, B., Fisher, A., Mago, V.: (preperation) "Graph pruning for identifying COVID-19 misinformation dissemination patterns and indicators on Twitter/X" [Dataset]. http://doi.org/10.5281/zenodo.11100129
    Explore at:
    bin, zip (available download formats)
    Dataset updated
    May 2, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Bart Gajderowicz
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    May 2, 2024
    Description

    This dataset is for the repository https://github.com/bgajdero/social-graph-analysis-2024.

  4. Using social media and personality traits to assess software developers'...

    • explore.openaire.eu
    Updated Jan 1, 2022
    + more versions
    Cite
    Leo Silva; Marília Gurgel Castro; Miriam Bernardino Silva; Milena Nestor Santos; Uirá Kulesza; Margarida Lima; Henrique Madeira (2022). Using social media and personality traits to assess software developers' emotions [Dataset]. http://doi.org/10.5281/zenodo.7425721
    Explore at:
    Dataset updated
    Jan 1, 2022
    Authors
    Leo Silva; Marília Gurgel Castro; Miriam Bernardino Silva; Milena Nestor Santos; Uirá Kulesza; Margarida Lima; Henrique Madeira
    Description

    Companion DATA

    Title: Using social media and personality traits to assess software developers' emotions
    Authors: Leo Moreira Silva, Marília Gurgel Castro, Miriam Bernardino Silva, Milena Nestor Santos, Uirá Kulesza, Margarida Lima, Henrique Madeira
    Journal: PeerJ Computer Science
    GitHub: https://github.com/leosilva/peerj_computer_science_2022

    The folders contain:

    Experiment_Protocol.pdf: document presenting the protocol regarding recruitment, data collection of public posts from Twitter, criteria for manual analysis, and the assessment of Big Five factors from participants and psychologists. English version.

    /analysis
    • analyzed_tweets_by_psychologists.csv: the manual analysis done by psychologists
    • analyzed_tweets_by_participants.csv: the manual analysis done by participants
    • analyzed_tweets_by_psychologists_solved_divergencies.csv: the manual analysis done by psychologists over 51 divergent tweet classifications

    /dataset
    • alldata.json: the dataset used in the paper

    /ethics_committee
    • committee_response_english_version.pdf: acceptance response of the Research Ethics and Deontology Committee of the Faculty of Psychology and Educational Sciences of the University of Coimbra. English version.
    • committee_response_original_portuguese_version: the same acceptance response. Portuguese version.
    • committee_submission_form_english_version.pdf: the project submitted to the committee. English version.
    • committee_submission_form_original_portuguese_version.pdf: the project submitted to the committee. Portuguese version.
    • consent_form_english_version.pdf: declaration of free and informed consent filled in by participants. English version.
    • consent_form_original_portuguese_version.pdf: the same declaration. Portuguese version.
    • data_protection_declaration_english_version.pdf: personal data and privacy declaration, according to the European Union General Data Protection Regulation. English version.
    • data_protection_declaration_original_portuguese_version.pdf: the same declaration. Portuguese version.

    /notebooks
    • General - Charts.ipynb: all charts produced in the study, including those in the paper
    • Statistics - Lexicons and Ensembles.ipynb: statistics for the five lexicons and ensembles used in the study
    • Statistics - Linear Regression.ipynb: multiple linear regression results
    • Statistics - Polynomial Regression.ipynb: polynomial regression results
    • Statistics - Psychologists versus Participants.ipynb: statistics comparing the psychologists' and participants' manual analyses
    • Statistics - Working x Non-working.ipynb: statistical analysis of tweets posted during and outside working hours

    /surveys
    • Demographic_Survey_english_version.pdf: survey inviting participants to enroll in the study, collecting demographic data and participants' authorization to access their public tweets. English version.
    • Demographic_Survey_portuguese_version.pdf: the same survey. Portuguese version.
    • Demographic_Survey_answers.xlsx: participants' demographic survey answers
    • ibf_pt_br.doc: Portuguese version of the Big Five Inventory (BFI) instrument used to infer participants' Big Five personality traits
    • ibf_en.doc: English translation of the Portuguese version of the BFI instrument
    • ibf_answers.xlsx: participants' and psychologists' answers to the BFI

    We have removed any sensitive data from the dataset and from the demographic survey answers to protect participants' privacy and anonymity.

  5. Twitterstorm data: the Katie Hinde Target t-shirt saga 2017-06-11

    • figshare.com
    application/gzip
    Updated Apr 6, 2018
    Cite
    Randi Griffin (2018). Twitterstorm data: the Katie Hinde Target t-shirt saga 2017-06-11 [Dataset]. http://doi.org/10.6084/m9.figshare.6096986.v1
    Explore at:
    application/gzip (available download formats)
    Dataset updated
    Apr 6, 2018
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Randi Griffin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository contains data from the Twitterstorm that occurred from June 11-13, 2017 following a controversial tweet by scientist and public figure Katie Hinde (@Mammals_Suck on Twitter). This is a good dataset for practicing the analysis of text and/or social network data in R.

    1. tweets_raw.Rds contains raw text data for all 33,261 tweets mentioning @Mammals_Suck in the 36 hours following the original tweet. This file can be read into R using the 'readRDS' function.
    2. tweets_clean.csv contains clean data on all 4,843 quote & reply tweets responding to the original tweet over a 36-hour period, including: tweet text, user name, tweet time, # of favorites, # of retweets, # of friends of the user, # of followers of the user, user self-description, user location, and type (quote or reply).
    3. social_network.rds contains a social network for users in the twitterstorm as an 'igraph' object. Vertex names correspond to Twitter users, and edge weights are based on co-followers (i.e., the number of mutually followed accounts, which is a proxy for overlap in the interests of two users). Additional vertex attributes can be added to the graph using information about users from the 'tweets_clean.csv' file, such as the time they entered the twitterstorm, their geographic location, or the text content of their tweets. This file can also be read into R using the 'readRDS' function.

    For more information, check out the blog posts written by myself and Katie Hinde. Mine focuses on data analysis, while hers focuses on her experience and understanding of the events.

    My blog post: https://rgriff23.github.io/2017/06/29/Katie-Hinde-Twitterstorm.html
    Katie Hinde's blog post: https://mammalssuck.blogspot.co.uk/2017/06/portrait-of-unexpected-twitter-storm.html

    The R code I used to compile and analyze this data can be found in this GitHub repository: https://github.com/rgriff23/Katie_Hinde_Twitter_storm_text_analysis

    Note that the data in the GitHub repo does not exactly match the data included in this figshare repo. This is because the data provided here has been reduced to information collected from Twitter: I eliminated data columns that were produced by subsequent analysis, such as tweet classifications based on sentiment analysis or social network analysis.

  6. MigrationsKB: A Knowledge Base of Migration related annotated Tweets

    • zenodo.org
    zip
    Updated May 9, 2022
    Cite
    Yiyi Chen; Harald Sack; Mehwish Alam (2022). MigrationsKB: A Knowledge Base of Migration related annotated Tweets [Dataset]. http://doi.org/10.5281/zenodo.5206820
    Explore at:
    zip (available download formats)
    Dataset updated
    May 9, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Yiyi Chen; Harald Sack; Mehwish Alam
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    MigrationsKB (MGKB) is a public knowledge base of anonymized migration-related annotated tweets. MGKB currently contains over 200 thousand tweets, spanning more than 9 years (January 2013 to July 2021), filtered to 11 European countries: the United Kingdom, Germany, Spain, Poland, France, Sweden, Austria, Hungary, Switzerland, the Netherlands, and Italy. Metadata about the tweets, such as geo information (place name, coordinates, country code), is included. MGKB contains entities, sentiments, hate speech, topics, hashtags, and encrypted user mentions in RDF format. The schema of MGKB is an extension of TweetsKB for migration-related information. Moreover, the FIBO ontology was used to associate and represent the potential economic and social factors driving the migration flows, drawn from sources such as Eurostat and Statista. The extracted economic indicators, such as GDP growth rate, are connected with each tweet in RDF using geographical and temporal dimensions. The user IDs and the tweet texts are encrypted for privacy purposes, while the tweet IDs are preserved.

    For this version, the MGKB is delivered as a whole and separately by year. The extracted entities and topic words are also published.

    Preprint paper: https://arxiv.org/abs/2108.07593

    Online SPARQL endpoint https://mgkb.fiz-karlsruhe.de/sparql/

    For more information, please refer to the website https://migrationskb.github.io/MGKB/.

    Please contact Yiyi Chen (yiyi.chen@partner.kit.edu) for pretrained models (sentiment analysis/hate speech detection/ETM) if necessary.

    Citation:

    @misc{chen2021migrationskb,
    title={MigrationsKB: A Knowledge Base of Public Attitudes towards Migrations and their Driving Factors},
    author={Yiyi Chen and Harald Sack and Mehwish Alam},
    year={2021},
    eprint={2108.07593},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
    }

  7. Tweets about the Top Companies from 2015 to 2020

    • kaggle.com
    zip
    Updated Nov 26, 2020
    Cite
    Ömer Metin (2020). Tweets about the Top Companies from 2015 to 2020 [Dataset]. https://www.kaggle.com/omermetinn/tweets-about-the-top-companies-from-2015-to-2020
    Explore at:
    zip (291228288 bytes; available download formats)
    Dataset updated
    Nov 26, 2020
    Authors
    Ömer Metin
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Tweets about the Top Companies from 2015 to 2020

    This dataset is part of a paper published in the 2020 IEEE International Conference on Big Data, under the 6th Special Session on Intelligent Data Mining track, and was created to determine possible speculators and influencers in a stock market. Although we used both tweet data and the companies' market data in our project, we decided it was better to split our data into two datasets for sharing on Kaggle. This dataset is helpful for those interested in tweets written about Amazon, Apple, Google, Microsoft, and Tesla, identified by their respective share tickers.

    Note: For those interested in the process of evaluating speculators and influencers in a stock market, the dataset in the following link may be helpful. https://www.kaggle.com/omermetinn/values-of-top-nasdaq-copanies-from-2010-to-2020

    Content

    This dataset contains over 3 million unique tweets with their information such as tweet id, author of the tweet, post date, the text body of the tweet, and the number of comments, likes, and retweets of tweets matched with the related company.

    Acknowledgements

    Tweets are collected from Twitter by a parsing script that is based on Selenium. Note 1: For those interested in the script, please visit the following link. https://github.com/omer-metin/TweetCollector

    Note 2: For those interested in our paper that uses this dataset, please visit the following link. https://ieeexplore.ieee.org/document/9378170

    Inspiration

    Some interesting questions (tasks) that can be explored with this dataset:

    1) Determining the correlation between a company's market value and public opinion of that company.
    2) Sentiment analysis of the companies as a time series, reasoning about possible declines and rises.
    3) Evaluating troll users who try to occupy the social agenda.

  8. Twitter-analysis-of-popular-South-African-Banks

    • kaggle.com
    Updated Feb 27, 2023
    Cite
    nicholasblomerus (2023). Twitter-analysis-of-popular-South-African-Banks [Dataset]. https://www.kaggle.com/datasets/nicholasblomerus/twitter-analysis-of-popular-south-african-banks
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 27, 2023
    Dataset provided by
    Kaggle
    Authors
    nicholasblomerus
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    South Africa
    Description

    Context

    The dataset for this project was collected using the sntwitter module. A total of 5,591,765 tweets were scraped using the following search dictionary:

    bank_dict = {
        'fnb': ['fnb', 'FNBSA', 'fnbSouthAfrica'],
        'absa': ['absa', 'absaSA', 'ABSASouthAfrica'],
        'nedbank': ['nedbank', 'NEDBANKSA', 'nedbankSouthAfrica'],
        'capitec': ['capitec', 'CapitecBank', 'capitecSA', 'capitecbankSA'],
        'standard_bank': ['standard bank', 'standardbank', 'StandardbankSA', 'standardbankZA', 'standardbankSouthAfrica']
    }
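    For anyone reusing the search dictionary, inverting it into an alias-to-bank lookup takes one comprehension. This is a sketch, not part of the original scraping code, and the dictionary is abridged to two banks here.

```python
# Invert the search dictionary so each alias maps back to its bank
# (abridged to two banks; the full dictionary appears above).
bank_dict = {
    'fnb': ['fnb', 'FNBSA', 'fnbSouthAfrica'],
    'absa': ['absa', 'absaSA', 'ABSASouthAfrica'],
}

alias_to_bank = {
    alias.lower(): bank
    for bank, aliases in bank_dict.items()
    for alias in aliases
}

print(alias_to_bank['fnbsa'])  # → fnb
```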

    Content

    The tweets were collected between 1 January 2006 and 1 January 2023. Each tweet contains the following information:

    • Tweet ID
    • Tweet
    • Date and time of the tweet
    • Username
    • Number of retweets
    • Number of likes
    • Number of Replies

    Data Processing

    Code for tweet cleaning, sentiment, offensive, hate-speech and topic detection can be found here.

    Inspiration

    I am pursuing a career as a data scientist/ML engineer. This project spawned from the work of Andrew Schleiss; here is the notebook that this dataset is derived from. I would like to thank Andrew for his insightful and well-documented notebook that many of us have learned from.

  9. [Tweets] 2022 Brazilian Presidential Elections

    • zenodo.org
    zip
    Updated Feb 7, 2025
    + more versions
    Cite
    Lucas Raniére Juvino Santos; Leandro Balby Marinho; Claudio Campelo (2025). [Tweets] 2022 Brazilian Presidential Elections [Dataset]. http://doi.org/10.5281/zenodo.14834669
    Explore at:
    zip (available download formats)
    Dataset updated
    Feb 7, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Lucas Raniére Juvino Santos; Leandro Balby Marinho; Claudio Campelo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Aug 1, 2022
    Area covered
    Brazil
    Description

    2022 Brazilian Presidential Election

    This dataset contains 7,015,186 tweets from 951,602 users, extracted using 91 search terms over 36 days between August 1st and December 31st, 2022.

    All tweets in this dataset are in Brazilian Portuguese.

    Data Usage

    The dataset contains textual data from tweets, making it suitable for various NLP analyses, such as sentiment analysis, bias or stance detection, and toxic language detection. Additionally, users and tweets can be linked to create social graphs, enabling Social Network Analysis (SNA) to study polarization, communities, and other social dynamics.

    Extraction Method

    This dataset was extracted using Twitter's (now X) official API, when Academic Research API access was still available, following this pipeline:

    1. Twitter/X daily monitoring: The dataset author monitored daily political events appearing in Brazil's Trending Topics. Twitter/X has an automated system for classifying trending terms. When a term was identified as political, it was stored along with its date for later use as a search query.

    2. Tweet collection using saved search terms: Once terms and their corresponding dates were recorded, tweets were extracted from 12:00 AM to 11:59 PM on the day the term entered the Trending Topics. A language filter was applied to select only tweets in Portuguese. The extraction was performed using the official Twitter/X API.

    3. Data storage: The extracted data was organized by day and search term. If the same search term appeared in Trending Topics on consecutive days, a separate file was stored for each respective day.
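    Step 3's one-file-per-(day, term) layout can be sketched as follows. The naming scheme is an assumption for illustration, not the authors' actual code.

```python
from datetime import date

def storage_path(term: str, day: date) -> str:
    """One file per (day, search term) pair, as in step 3 above."""
    return f"{day.isoformat()}_{term.replace(' ', '_')}.json"

print(storage_path("eleicoes 2022", date(2022, 10, 2)))
# → 2022-10-02_eleicoes_2022.json
```

A term that trends on consecutive days simply yields one file per day, matching the description above.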

    Further Information

    For more details, visit:

    - The repository
    - Dataset short paper:

    ---

    DOI: 10.5281/zenodo.14834669
  10. Data for "Mapping dynamic human sentiments of heat exposure with...

    • explore.openaire.eu
    • databank.illinois.edu
    Updated Jan 1, 2024
    Cite
    Fangzheng Lyu; Lixuanwu Zhou; Jinwoo Park; Furqan Baig; Shaowen Wang (2024). Data for "Mapping dynamic human sentiments of heat exposure with location-based social media data" [Dataset]. http://doi.org/10.13012/b2idb-9405860_v1
    Explore at:
    Dataset updated
    Jan 1, 2024
    Authors
    Fangzheng Lyu; Lixuanwu Zhou; Jinwoo Park; Furqan Baig; Shaowen Wang
    Description

    This dataset contains all the data used in the study conducted for the research publication titled "Mapping dynamic human sentiments of heat exposure with location-based social media data". The paper develops a cyberGIS framework to analyze and visualize human sentiments of heat exposure dynamically, based on near real-time location-based social media (LBSM) data. Large volumes of low-cost LBSM data, together with a content analysis algorithm based on natural language processing, are used to generate heat exposure maps from human sentiments on social media.

    What's inside (a quick explanation of the components of the zip file):
    • The US folder includes the shapefile corresponding to the United States, with county as the spatial unit.
    • The Census_tract folder includes the shapefile corresponding to Cook County, with census tract as the spatial unit.
    • data/data.txt includes instructions to retrieve the sample data from either Keeling or figshare.
    • geo/data20000.txt is the heat dictionary created in this paper; please refer to the corresponding publication for the data creation process.

    The Jupyter notebook and code attached to this publication can be found at: https://github.com/cybergis/real_time_heat_exposure_with_LBSMD

  11. CMFeed: A Benchmark Dataset for Controllable Multimodal Feedback Synthesis

    • zenodo.org
    Updated May 11, 2025
    Cite
    Puneet Kumar; Sarthak Malik; Balasubramanian Raman; Xiaobai Li (2025). CMFeed: A Benchmark Dataset for Controllable Multimodal Feedback Synthesis [Dataset]. http://doi.org/10.5281/zenodo.11409612
    Explore at:
    Dataset updated
    May 11, 2025
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Puneet Kumar; Sarthak Malik; Balasubramanian Raman; Xiaobai Li
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jun 1, 2024
    Description

    Overview
    The Controllable Multimodal Feedback Synthesis (CMFeed) Dataset is designed to enable the generation of sentiment-controlled feedback from multimodal inputs, including text and images. This dataset can be used to train feedback synthesis models in both uncontrolled and sentiment-controlled manners. Serving a crucial role in advancing research, the CMFeed dataset supports the development of human-like feedback synthesis, a novel task defined by the dataset's authors. Additionally, the corresponding feedback synthesis models and benchmark results are presented in the associated code and research publication.

    Task Uniqueness: The task of controllable multimodal feedback synthesis is unique, distinct from LLMs and tasks like VisDial, and not addressed by multi-modal LLMs. LLMs often exhibit errors and hallucinations, as evidenced by their auto-regressive and black-box nature, which can obscure the influence of different modalities on the generated responses [Ref1; Ref2]. Our approach includes an interpretability mechanism, as detailed in the supplementary material of the corresponding research publication, demonstrating how metadata and multimodal features shape responses and learn sentiments. This controllability and interpretability aim to inspire new methodologies in related fields.

    Data Collection and Annotation
    Data was collected by crawling Facebook posts from major news outlets, adhering to ethical and legal standards. The comments were annotated using four sentiment analysis models: FLAIR, SentimentR, RoBERTa, and DistilBERT. Facebook was chosen for dataset construction because of the following factors:
    • Facebook uniquely provides metadata such as the news article link, post shares, post reactions, comment likes, comment rank, comment reaction rank, and relevance scores, which are not available on other platforms.
    • Facebook is the most used social media platform, with 3.07 billion monthly users, compared to 550 million Twitter and 500 million Reddit users. [Ref]
    • Facebook is popular across all age groups (18-29, 30-49, 50-64, 65+), with at least 58% usage, compared to 6% for Twitter and 3% for Reddit. [Ref]. Trends are similar for gender, race, ethnicity, income, education, community, and political affiliation [Ref]
    • The male-to-female user ratio on Facebook is 56.3% to 43.7%; on Twitter, it's 66.72% to 23.28%; Reddit does not report this data. [Ref]

    Filtering Process: To ensure high-quality and reliable data, the dataset underwent two levels of filtering:
    a) Model Agreement Filtering: Retained only comments where at least three out of the four models agreed on the sentiment.
    b) Probability Range Safety Margin: Comments with a sentiment probability between 0.49 and 0.51, indicating low confidence in sentiment classification, were excluded.
    After filtering, 4,512 samples were marked as XX. Though these samples have been released for the reader's understanding, they were not used in training the feedback synthesis model proposed in the corresponding research paper.
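    The two filtering rules can be expressed as a small predicate. This is a sketch based on the description above; the function and label names are hypothetical, not from the released code.

```python
def keep_comment(votes, probability):
    """Apply the two filters described above.

    votes: list of four labels in {'pos', 'neg'}, one per sentiment model.
    probability: the sentiment probability of the majority label.
    """
    # a) at least 3 of the 4 models must agree on the sentiment
    majority_agreement = max(votes.count('pos'), votes.count('neg')) >= 3
    # b) exclude low-confidence scores in the 0.49-0.51 safety margin
    confident = not (0.49 <= probability <= 0.51)
    return majority_agreement and confident

print(keep_comment(['pos', 'pos', 'pos', 'neg'], 0.87))  # → True
print(keep_comment(['pos', 'pos', 'neg', 'neg'], 0.87))  # → False
```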

    Dataset Description
    • Total Samples: 61,734
    • Total Samples Annotated: 57,222 after filtering.
    • Total Posts: 3,646
    • Average Likes per Post: 65.1
    • Average Likes per Comment: 10.5
    • Average Length of News Text: 655 words
    • Average Number of Images per Post: 3.7

    Components of the Dataset
    The dataset comprises two main components:
    CMFeed.csv File: Contains metadata, comment, and reaction details related to each post.
    Images Folder: Contains folders with images corresponding to each post.

    Data Format and Fields of the CSV File
    The dataset is structured in CMFeed.csv file along with corresponding images in related folders. This CSV file includes the following fields:
    Id: Unique identifier
    Post: The heading of the news article.
    News_text: The text of the news article.
    News_link: URL link to the original news article.
    News_Images: A path to the folder containing images related to the post.
    Post_shares: Number of times the post has been shared.
    Post_reaction: A JSON object capturing reactions (like, love, etc.) to the post and their counts.
    Comment: Text of the user comment.
    Comment_like: Number of likes on the comment.
    Comment_reaction_rank: A JSON object detailing the type and count of reactions the comment received.
    Comment_link: URL link to the original comment on Facebook.
    Comment_rank: Rank of the comment based on engagement and relevance.
    Score: Sentiment score computed based on the consensus of sentiment analysis models.
    Agreement: Indicates the consensus level among the sentiment models, ranging from -4 (all negative) to 4 (all positive). For example, 3 negative and 1 positive votes result in -2, and 3 positive and 1 negative result in +2.
    Sentiment_class: Categorizes the sentiment of the comment into 1 (positive) or 0 (negative).
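    The Agreement field's arithmetic can be reproduced with a one-line sum. This is illustrative only; encoding each model's vote as +1/-1 is an assumption consistent with the description of the field.

```python
def agreement(votes):
    """Sum of four model votes (+1 positive, -1 negative); ranges -4..4."""
    return sum(votes)

print(agreement([+1, +1, +1, -1]))  # 3 positive, 1 negative → 2
print(agreement([-1, -1, -1, -1]))  # all negative → -4
```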

    More Considerations During Dataset Construction
    We thoroughly considered issues such as the choice of social media platform for data collection, bias and generalizability of the data, selection of news handles/websites, ethical protocols, privacy and potential misuse before beginning data collection. While achieving completely unbiased and fair data is unattainable, we endeavored to minimize biases and ensure as much generalizability as possible. Building on these considerations, we made the following decisions about data sources and handling to ensure the integrity and utility of the dataset:

    • Why not merge data from different social media platforms?
    We chose not to merge data from platforms such as Reddit and Twitter with Facebook due to the lack of comprehensive metadata, clear ethical guidelines, and control mechanisms—such as who can comment and whether users' anonymity is maintained—on these platforms other than Facebook. These factors are critical for our analysis. Our focus on Facebook alone was crucial to ensure consistency in data quality and format.

    • Choice of four news handles: We selected four news handles—BBC News, Sky News, Fox News, and NY Daily News—to ensure diversity and comprehensive regional coverage. These news outlets were chosen for their distinct regional focuses and editorial perspectives: BBC News is known for its global coverage with a centrist view, Sky News offers geographically targeted and politically varied content leaning center/right in the UK/EU/US, Fox News is recognized for its right-leaning content in the US, and NY Daily News provides left-leaning coverage in New York. Many other news handles, such as NDTV, The Hindu, Xinhua, and SCMP, are also large-scale but may publish content in regional languages; hence, they were not selected. This selection ensures a broad spectrum of political discourse and audience engagement.

    • Dataset Generalizability and Bias: With 3.07 billion of the roughly 5 billion social media users worldwide, Facebook has an extensive user base that reflects broader social media engagement patterns, which helps ensure that the insights gained are applicable across various platforms, reducing bias and strengthening the generalizability of our findings. Additionally, the geographic and political diversity of these news sources, ranging from local (NY Daily News) to international (BBC News), and spanning political spectra from left (NY Daily News) to right (Fox News), ensures a balanced representation of global and political viewpoints in our dataset. This approach not only mitigates regional and ideological biases but also enriches the dataset with a wide array of perspectives, further solidifying the robustness and applicability of our research.

    • Dataset size and diversity: Facebook prohibits the automatic scraping of its users' personal data. In compliance with this policy, we manually scraped publicly available data. This labor-intensive process, requiring around 800 hours of manual effort, limited our data volume but allowed for precise selection. We followed ethical protocols for scraping Facebook data, selecting 1,000 posts from each of the four news handles to enhance diversity and reduce bias. Initially, 4,000 posts were collected; after preprocessing (detailed in Section 3.1), 3,646 posts remained. We then processed all associated comments, resulting in a total of 61,734 comments. This manual method ensures adherence to Facebook’s policies and the integrity of our dataset.

    Ethical considerations, data privacy and misuse prevention
    The data collection adheres to Facebook’s ethical guidelines (https://developers.facebook.com/terms/).

  12. [Tweets] 2023 Brazilian Early Political Events

    • zenodo.org
    zip
    Updated Feb 10, 2025
    Lucas Raniére Juvino Santos; Leandro Balby Marinho; Claudio Campelo (2025). [Tweets] 2023 Brazilian Early Political Events [Dataset]. http://doi.org/10.5281/zenodo.14834704
    Explore at:
    Available download formats: zip
    Dataset updated
    Feb 10, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Lucas Raniére Juvino Santos; Leandro Balby Marinho; Claudio Campelo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jan 1, 2023
    Area covered
    Brazil
    Description

    2023 Brazilian Early Political Events

    This dataset contains 13,910,048 tweets from 1,346,340 users, extracted using 157 search terms over 56 different days between January 1st and June 21st, 2023.

    All tweets in this dataset are in Brazilian Portuguese.

    Data Usage

    The dataset contains textual data from tweets, making it suitable for various NLP analyses, such as sentiment analysis, bias or stance detection, and toxic language detection. Additionally, users and tweets can be linked to create social graphs, enabling Social Network Analysis (SNA) to study polarization, communities, and other social dynamics.
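As a sketch of the social-graph use case, user interactions (e.g., mentions or replies) can be tallied before loading them into a graph library for community or polarization analysis; the (author, mentioned_user) pairs below are invented for illustration:

```python
from collections import Counter

# Illustrative (author, mentioned_user) pairs extracted from tweet text.
interactions = [
    ("alice", "bob"),
    ("alice", "carol"),
    ("bob", "carol"),
    ("dave", "alice"),
]

# In-degree (times a user is mentioned) as a simple engagement proxy;
# the same edge list can be fed into a graph library for SNA proper.
mention_counts = Counter(target for _, target in interactions)
print(mention_counts.most_common(1))  # [('carol', 2)]
```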

    Extraction Method

    This dataset was extracted using Twitter's (now X) official API—when Academic Research API access was still available—following this pipeline:

    1. Twitter/X daily monitoring: The dataset author monitored daily political events appearing in Brazil's Trending Topics. Twitter/X has an automated system for classifying trending terms. When a term was identified as political, it was stored along with its date for later use as a search query.

    2. Tweet collection using saved search terms: Once terms and their corresponding dates were recorded, tweets were extracted from 12:00 AM to 11:59 PM on the day the term entered the Trending Topics. A language filter was applied to select only tweets in Portuguese. The extraction was performed using the official Twitter/X API.

    3. Data storage: The extracted data was organized by day and search term. If the same search term appeared in Trending Topics on consecutive days, a separate file was stored for each respective day.
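The per-day, Portuguese-only window described in step 2 can be sketched as a query builder. The parameter names mirror Twitter's v2 full-archive search API (query, start_time, end_time), but the helper is illustrative and assumes the now-discontinued Academic Research access:

```python
from datetime import datetime, timezone

def build_query(term, day):
    """Build illustrative search parameters for one trending term:
    Portuguese-only tweets within that day's 24-hour window
    (12:00 AM to 11:59 PM UTC, as in the extraction pipeline)."""
    start = datetime(day.year, day.month, day.day, tzinfo=timezone.utc)
    end = start.replace(hour=23, minute=59, second=59)
    return {
        "query": f'"{term}" lang:pt',
        "start_time": start.isoformat(),
        "end_time": end.isoformat(),
    }

# Hypothetical trending term and date, for illustration only.
params = build_query("reforma tributária", datetime(2023, 1, 8).date())
print(params["query"])  # "reforma tributária" lang:pt
```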

    Further Information

    For more details, visit:

    - The repository
    - Dataset short paper:

    ---

    DOI: 10.5281/zenodo.14834704

Twitter Tweets Sentiment Dataset

Twitter Tweets Sentiment Analysis for Natural Language Processing

37 scholarly articles cite this dataset (View in Google Scholar)


Description:

Twitter is an online social media platform where people share their thoughts as tweets. It is observed that some people misuse it to tweet hateful content. Twitter is trying to tackle this problem, and we can help by creating a strong NLP-based classifier model to identify negative tweets and block them. Can you build a strong classifier model to predict the same?

Each row contains the text of a tweet and a sentiment label. In the training set you are provided with a word or phrase drawn from the tweet (selected_text) that encapsulates the provided sentiment.

Make sure, when parsing the CSV, to remove the beginning / ending quotes from the text field, to ensure that you don't include them in your training.
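For instance, with Python's standard csv module, any residual quotes around the text field can be stripped after parsing (the sample row below is merely illustrative of the dataset's format):

```python
import csv
import io

# Illustrative row resembling the dataset's format. The csv module handles
# standard CSV quoting (doubled quotes become literal quotes), and
# .strip('"') removes any leftover leading/trailing quotes in the text.
sample = (
    'textID,text,sentiment\n'
    'cb774db0d1,"""I`d have responded, if I were going""",neutral\n'
)
rows = list(csv.DictReader(io.StringIO(sample)))
cleaned = rows[0]["text"].strip('"')
print(cleaned)  # I`d have responded, if I were going
```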

You're attempting to predict the word or phrase from the tweet that exemplifies the provided sentiment. The word or phrase should include all characters within that span (i.e. including commas, spaces, etc.)
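Span predictions of this kind are commonly scored with word-level Jaccard similarity, the metric used by the underlying Kaggle competition; a minimal sketch:

```python
def jaccard(str1, str2):
    """Word-level Jaccard similarity between predicted and true spans."""
    a, b = set(str1.lower().split()), set(str2.lower().split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Two of three unique words overlap, so the score is 2/3.
print(jaccard("so happy today", "happy today"))  # 0.6666666666666666
```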

Columns:

  1. textID - unique ID for each piece of text
  2. text - the text of the tweet
  3. sentiment - the general sentiment of the tweet

Acknowledgement:

The dataset is downloaded from Kaggle Competitions:
https://www.kaggle.com/c/tweet-sentiment-extraction/data?select=train.csv

Objective:

  • Understand the Dataset & cleanup (if required).
  • Build classification models to predict the Twitter sentiments.
  • Compare the evaluation metrics of various classification algorithms.
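A minimal baseline for the second and third objectives might look like the following sketch, assuming scikit-learn is available; the toy training rows are invented, and the dataset's text and sentiment columns would be used in practice:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny illustrative training set; replace with the dataset's
# `text` and `sentiment` columns.
texts = ["I love this", "what a great day", "this is awful",
         "I hate waiting", "so happy now", "terrible experience"]
labels = ["positive", "positive", "negative",
          "negative", "positive", "negative"]

# TF-IDF features feeding a logistic-regression classifier.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)
print(model.predict(["such a great day"])[0])
```

Swapping LogisticRegression for other estimators (e.g., LinearSVC or MultinomialNB) and comparing metrics such as accuracy and F1 on a held-out split would address the comparison objective.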