70 datasets found

Twitter Tweets Sentiment Dataset
kaggle.com
opendatabay.com
Updated Apr 8, 2022
Cite
M Yasser H (2022). Twitter Tweets Sentiment Dataset [Dataset]. https://www.kaggle.com/datasets/yasserh/twitter-tweets-sentiment-dataset
Dataset updated
Apr 8, 2022
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
M Yasser H
License
https://creativecommons.org/publicdomain/zero/1.0/
Description

Description:

Twitter is an online social media platform where people share their thoughts as tweets. Some people misuse it to tweet hateful content. Twitter is trying to tackle this problem, and we shall help by creating a strong NLP-based classifier model to distinguish negative tweets and block them. Can you build a strong classifier model to predict the same?

Each row contains the text of a tweet and a sentiment label. In the training set you are provided with a word or phrase drawn from the tweet (selected_text) that encapsulates the provided sentiment.

Make sure, when parsing the CSV, to remove the beginning / ending quotes from the text field, to ensure that you don't include them in your training.
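The quote-stripping advice above can be sketched as follows. This is a minimal illustration using only the Python standard library; the helper name load_tweets and the two sample rows are made up for the example, and only the column names come from the dataset description.

```python
import csv
import io

def load_tweets(csv_text):
    """Parse CSV text and strip leftover leading/trailing quotes from text fields."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        for field in ("text", "selected_text"):
            if row.get(field) is not None:
                # Only quote characters are stripped; inner spaces and commas are kept.
                row[field] = row[field].strip('"')
        rows.append(row)
    return rows

# Invented sample rows; the second one carries literal quotes inside the field.
sample = (
    "textID,text,selected_text,sentiment\n"
    'a1,"I love this, truly","love this",positive\n'
    'b2,"""so sad today""","so sad",negative\n'
)
rows = load_tweets(sample)
```

The csv module already removes the quotes that merely delimit a field, so the extra strip only matters when quote characters survive inside the parsed value, as in the second sample row.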

You're attempting to predict the word or phrase from the tweet that exemplifies the provided sentiment. The word or phrase should include all characters within that span (i.e. including commas, spaces, etc.)

Columns:

textID - unique ID for each piece of text

text - the text of the tweet

sentiment - the general sentiment of the tweet

Acknowledgement:

The dataset is downloaded from Kaggle Competitions:
https://www.kaggle.com/c/tweet-sentiment-extraction/data?select=train.csv

Objective:

Understand the Dataset & cleanup (if required).

Build classification models to predict the Twitter sentiments.

Compare the evaluation metrics of various classification algorithms.
Sentiment Analysis Dataset
cubig.ai
Updated May 20, 2025
Cite
CUBIG (2025). Sentiment Analysis Dataset [Dataset]. https://cubig.ai/store/products/270/sentiment-analysis-dataset
Dataset updated
May 20, 2025
Dataset authored and provided by
CUBIG
License
https://cubig.ai/store/terms-of-service
Measurement technique
Privacy-preserving data transformation via differential privacy, Synthetic data generation using AI techniques for model training
Description
1) Data Introduction
• The Sentiment Analysis Dataset is a dataset for sentiment analysis. It contains large-scale tweet text collected from Twitter with a sentiment polarity label (0=negative, 2=neutral, 4=positive) for each tweet, assigned automatically based on emoticons.

2) Data Utilization
(1) Characteristics of the Sentiment Analysis Dataset:
• Each sample consists of six columns: sentiment polarity, tweet ID, date of writing, search word, author, and tweet body. It is suitable for training natural language processing and classification models using tweet text and sentiment labels.
(2) The Sentiment Analysis Dataset can be used for:
• Sentiment classification model development: using tweet text and polarity labels, you can build automatic positive, negative, and neutral classifiers with machine learning and deep learning models such as logistic regression, SVM, RNN, and LSTM.
• Analysis of social media public opinion and trends: by analyzing the distribution of sentiment over time and by keyword, you can explore changes in public opinion on specific issues or brands, positive and negative trends, and key sentiment-bearing keywords.
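As a rough sketch of how the numeric polarity codes described above could be turned into readable labels, the snippet below parses headerless six-column rows. The column names in COLUMNS and the two sample rows are hypothetical; only the six-column layout and the 0/2/4 coding come from the description.

```python
import csv
import io

# Hypothetical column names; the description only lists what each column holds.
COLUMNS = ["polarity", "tweet_id", "date", "query", "author", "text"]
POLARITY_LABELS = {0: "negative", 2: "neutral", 4: "positive"}

def parse_rows(csv_text):
    """Turn headerless six-column CSV rows into dicts with a readable label."""
    parsed = []
    for values in csv.reader(io.StringIO(csv_text)):
        record = dict(zip(COLUMNS, values))
        record["label"] = POLARITY_LABELS[int(record["polarity"])]
        parsed.append(record)
    return parsed

# Invented rows for illustration only.
sample = (
    '"0","1","2009-04-06","none","user_a","missing my friends"\n'
    '"4","2","2009-04-06","none","user_b","what a great day"\n'
)
records = parse_rows(sample)
```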
Famous Words Twitter Dataset
kaggle.com
Updated May 30, 2023
Cite
_w1998 (2023). Famous Words Twitter Dataset [Dataset]. https://www.kaggle.com/datasets/jackksoncsie/twitter-dataset-keywords-likes-and-tweets/discussion
Dataset updated
May 30, 2023
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
_w1998
License
http://www.gnu.org/licenses/agpl-3.0.html
Description
The Famous Words Twitter Dataset is a comprehensive collection of tweets associated with famous words. The dataset provides valuable insights into the social media engagement and popularity of these words on the Twitter platform. It includes three primary columns: keyword, likes, and tweets.

The keyword column represents the specific famous word or phrase associated with each tweet. It allows researchers and analysts to explore the dynamics of user interactions and discussions surrounding these popular terms on Twitter.

The likes column indicates the number of likes received by each tweet. This metric serves as an indicator of the tweet's popularity and resonance among Twitter users.

The tweet column contains the actual tweet text, capturing the content and context of user-generated messages related to the famous words. This column provides valuable qualitative data for sentiment analysis, topic modeling, and other natural language processing tasks.

Researchers, data scientists, and social media analysts can leverage this dataset to study various aspects, such as tracking trends, sentiment analysis, understanding user engagement patterns, and identifying influential topics associated with famous words on Twitter.

Topics: "COVID-19", "Vaccine", "Zoom", "Bitcoin", "Dogecoin", "NFT", "Elon Musk", "Tesla", "Amazon", "iPhone 12", "Remote work", "TikTok", "Instagram", "Facebook", "YouTube", "Netflix", "GameStop", "Super Bowl", "Olympics", "Black Lives Matter", "India vs England", "Ukraine", "Queen Elizabeth", "World Cup", "Jeffrey Dahmer", "Johnny Depp", "Will Smith", "Weather", "xvideo", "porn", "nba", "Macdonald"

The dataset contains 128,837 tweets in total. The number of tweets per keyword is plotted at https://i.imgur.com/z4xbbyt.png

Note: The dataset is carefully curated, anonymized, and stripped of any personally identifiable information to protect user privacy.
Data from: Tweet Sentiment Extraction Dataset
paperswithcode.com
Updated Mar 23, 2020
Cite
(2020). Tweet Sentiment Extraction Dataset [Dataset]. https://paperswithcode.com/dataset/tweet-sentiment-extraction
Dataset updated
Mar 23, 2020
Description
"My ridiculous dog is amazing." [sentiment: positive]

With all of the tweets circulating every second it is hard to tell whether the sentiment behind a specific tweet will impact a company, or a person's, brand for being viral (positive), or devastate profit because it strikes a negative tone. Capturing sentiment in language is important in these times where decisions and reactions are created and updated in seconds. But, which words actually lead to the sentiment description? In this competition you will need to pick out the part of the tweet (word or phrase) that reflects the sentiment.

Help build your skills in this important area with this broad dataset of tweets. Work on your technique to grab a top spot in this competition. What words in tweets support a positive, negative, or neutral sentiment? How can you help make that determination using machine learning tools?

In this competition we've extracted support phrases from Figure Eight's Data for Everyone platform. The dataset is titled "Sentiment Analysis: Emotion in Text"; it consists of tweets with existing sentiment labels and is used here under the Creative Commons Attribution 4.0 International licence. Your objective in this competition is to construct a model that can do the same: look at the labeled sentiment for a given tweet and figure out what word or phrase best supports it.

Disclaimer: The dataset for this competition contains text that may be considered profane, vulgar, or offensive.
Twitter Sentiment Classification Data
opendatabay.com
Updated Jul 2, 2025
Cite
Datasimple (2025). Twitter Sentiment Classification Data [Dataset]. https://www.opendatabay.com/data/ai-ml/89d10076-3c7d-4857-8c75-0b284a9a7f06
Dataset updated
Jul 2, 2025
Dataset authored and provided by
Datasimple
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Area covered
Social Media and Networking
Description
This dataset provides a collection of tweets, each categorised by its sentiment. It is designed to assist in developing and evaluating machine learning models, particularly for natural language processing tasks. The primary aim is to distinguish between different sentiments expressed in tweets, helping to address issues like harmful content by enabling the creation of robust classifier models. Each entry includes the tweet text and its corresponding sentiment label, with a specific focus on identifying the exact word or phrase within the tweet that encapsulates that sentiment.

Columns

textID: A unique identifier for each tweet entry.

text: The full content of the tweet.

selected_text: The specific part of the tweet that best represents the given sentiment.

sentiment: The overall sentiment expressed in the tweet, categorised as neutral, positive, or other.

Distribution

The dataset contains approximately 27,500 tweets. It is typically provided in a CSV file format. The textID and text columns each contain 27,481 unique values, while the selected_text column has 22,464 unique values. The sentiment distribution is as follows: 40% are neutral, 31% are positive, and 28% fall into other sentiment categories. When processing the data from the CSV, it is important to remove any beginning or ending quotation marks from the text fields.
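Figures like the unique-value counts and the sentiment percentages above can be recomputed from the label column. The following is a small sketch with the standard library; the function name sentiment_shares and the ten toy labels are invented for the example and do not reproduce the dataset's actual numbers.

```python
from collections import Counter

def sentiment_shares(labels):
    """Return the rounded percentage share of each sentiment label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: round(100 * n / total) for label, n in counts.items()}

# Toy label column, not real data.
labels = ["neutral"] * 4 + ["positive"] * 3 + ["other"] * 3
shares = sentiment_shares(labels)
unique_labels = len(set(labels))
```

Because each share is rounded independently, the percentages need not sum to exactly 100, which is consistent with the 40%, 31%, and 28% reported above.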

Usage

This dataset is ideally suited for tasks involving sentiment analysis and text classification. It can be used to build and train classification models that predict the sentiment of Twitter tweets. Furthermore, it allows for the comparison and evaluation of various classification algorithms based on their performance metrics in predicting sentiments. It is particularly useful for developing strong NLP-based classifier models to identify and categorise tweets by sentiment.

Coverage

The data originates from a global platform, Twitter, and the sentiment analysis is applicable across a wide range of content. The dataset's structure allows for analysis of sentiments in tweets, covering various topics and expressions globally. No specific time range or demographic scope is detailed beyond its global applicability.

License

CC0

Who Can Use It

This dataset is suitable for a diverse range of users, including beginners in data science and machine learning. It is especially beneficial for those interested in social network analysis, text classification, and natural language processing. Intended users include data scientists, researchers, and developers looking to build and test models for predicting social media sentiments or for applications like content moderation.

Dataset Name Suggestions

Twitter Tweet Sentiment Dataset

Tweet Sentiment Analysis Dataset

Social Media Sentiment Prediction Data

Twitter Sentiment Classification Data

Attributes

Original Data Source: Twitter Tweets Sentiment Dataset
Sentiment Analysis of Enhanced Twitter Data with Custom Sentiment Scoring
test.dbrepo.tuwien.ac.at
Updated Apr 29, 2025
Cite
Bouhamidi, Hachem (2025). Sentiment Analysis of Enhanced Twitter Data with Custom Sentiment Scoring [Dataset]. http://doi.org/10.82556/9rzx-7r26
Unique identifier
https://doi.org/10.82556/9rzx-7r26
Dataset updated
Apr 29, 2025
Authors
Bouhamidi, Hachem
Time period covered
2025
Description
This database contains training, validation, and test sets created for a Twitter sentiment classification project. The tweets were cleaned and improved with custom-calculated sentiment scores and magnitudes using a word-weighted dictionary. The data is split to support machine learning experiments.
Portuguese Word Sentiment Dataset
opendatabay.com
Updated Jul 4, 2025
Cite
Datasimple (2025). Portuguese Word Sentiment Dataset [Dataset]. https://www.opendatabay.com/data/ai-ml/b224dfa8-9b42-4aba-a52c-8b7e648c9474
Dataset updated
Jul 4, 2025
Dataset authored and provided by
Datasimple
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Area covered
Social Media and Networking
Description
This dataset provides a curated list of Portuguese words along with their corresponding sentiment labels. It enables comparative sentiment analysis for content sourced from both Twitter and Buscapé reviews. Each word has a human-annotated sentiment score, ranging from negative to positive with numeric values, allowing for nuanced categorisation and comparison. It serves as an invaluable resource for tasks like mining social media conversations and analysing customer feedback.

Columns

Word: The Portuguese word from the lexicon.

Sentiment_Score: The numerical sentiment label assigned to the word. Labels include -1 for negative, 0 for neutral, and +1 for positive sentiments.

Distribution

The dataset is provided as a CSV file, specifically named portuguese_lexicon.csv. It comprises a total of 114 unique words in its lexicon, each with an associated sentiment score. The dataset is derived from 3,457 tweets and 476 Buscapé reviews. Users will need an environment capable of reading CSV files that contain both text and numerical data to utilise this resource effectively.
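A word-level lexicon of this shape is typically applied by summing the scores of the words found in a text. The sketch below assumes the CSV header matches the column names Word and Sentiment_Score listed above; the three sample entries and both helper names are invented for illustration and are not taken from portuguese_lexicon.csv.

```python
import csv
import io

def load_lexicon(csv_text):
    """Read Word/Sentiment_Score rows into a lowercase word -> score mapping."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["Word"].lower(): int(row["Sentiment_Score"]) for row in reader}

def score_text(text, lexicon):
    """Sum the lexicon scores of whitespace-separated tokens; unknown words count as 0."""
    return sum(lexicon.get(token, 0) for token in text.lower().split())

# Invented entries using the -1 / 0 / +1 labels described above.
sample_csv = "Word,Sentiment_Score\nbom,1\nruim,-1\nproduto,0\n"
lexicon = load_lexicon(sample_csv)
```

For example, score_text("Produto bom", lexicon) adds 0 and 1. Whitespace tokenization is a simplification; punctuation attached to a word would prevent a match.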

Usage

This dataset is ideal for various applications in natural language processing (NLP) and sentiment analysis, including:
* Applying to machine learning models for sentiment analysis, text classification, and automated opinion summarisation.
* Comparing words or phrases within texts or across different datasets to understand expressed opinions.
* Identifying trends in customer opinions over time by comparing sentiment from Twitter and Buscapé reviews.
* Understanding how customer review sentiment compares across different Portuguese languages and dialects.
* Utilising customer feedback for analytics purposes and gaining insights into public opinion on products based on textual expressions.

Coverage

The dataset's scope covers reviews written in Portuguese from both Twitter and Buscapé, originating from Portuguese-speaking areas. It is considered to have global region relevance. No specific time range or demographic scope beyond "Portuguese-speaking areas" is detailed in the sources.

License

CC0

Who Can Use It

This dataset is suitable for:
* Data scientists and machine learning engineers working on NLP tasks.
* Researchers interested in social media analysis and cross-platform sentiment comparison.
* Businesses and analysts aiming to mine social media conversations and analyse customer feedback for decision-making.
* Anyone requiring a linguistically labeled database for Portuguese text analysis.

Dataset Name Suggestions

Portuguese Sentiment Lexicon

Portuguese Social Media Sentiment Corpus

Portuguese Word Sentiment Dataset

Twitter Buscapé Portuguese Sentiment Data

Attributes

Original Data Source: Portuguese Sentiment Corpus for Twitter and
Global Covid-19 Tweets with Sentiment Analysis
opendatabay.com
Updated Jul 3, 2025
Cite
Datasimple (2025). Global Covid-19 Tweets with Sentiment Analysis [Dataset]. https://www.opendatabay.com/data/healthcare/f445ec28-4fdd-4832-8d8e-da282f16c84b
Dataset updated
Jul 3, 2025
Dataset authored and provided by
Datasimple
Area covered
Data Science and Analytics
Description
This dataset captures Twitter activity related to Covid-19, focusing on the initial phase of the pandemic from April to June 2020 [1, 2]. It comprises 235,240 worldwide tweets in English, streamed live at a rate of approximately 10,000 tweets per day after the World Health Organisation declared Covid-19 a pandemic [1, 2]. The tweets were collected using relevant hashtags such as #covid-19, #coronavirus, #covid, #covaccine, #lockdown, #homequarantine, #quarantinecenter, #socialdistancing, #stayhome, and #staysafe [1, 2].

The data has undergone pre-processing, which involved converting all tweets to lowercase, removing extra white spaces, numbers, special characters, ASCII characters, URLs, punctuations, and stopwords [2]. Additionally, all instances of 'covid' were converted to 'covid19', and stemming was applied to reduce inflected words to their root forms [2]. Sentiment analysis has been performed on each cleaned tweet using an NLTK-based Sentiment Analyser, providing sentiment scores for positive, negative, and neutral categories, and a compound sentiment score [2]. Tweets are classified as Positive, Negative, or Neutral based on these scores [2].
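The cleaning steps listed above can be approximated with a few regular expressions. This is only a sketch under assumptions: the stopword set is a tiny stand-in for a real list, the order of operations is a guess, and stemming and the NLTK-based sentiment scoring are left out. The function name clean_tweet is hypothetical.

```python
import re

# Tiny illustrative stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"the", "a", "is", "at", "in", "of", "and", "to"}

def clean_tweet(text):
    """Lowercase, drop URLs, digits, and punctuation, remove stopwords, normalise 'covid'."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # remove URLs
    text = re.sub(r"[^a-z\s]", " ", text)       # remove digits, punctuation, other symbols
    tokens = [t for t in text.split() if t not in STOPWORDS]  # split() also collapses whitespace
    tokens = ["covid19" if t == "covid" else t for t in tokens]
    return " ".join(tokens)

cleaned = clean_tweet("Stay HOME!! Covid-19 cases at 100 in the city https://t.co/xyz #StaySafe")
```

Normalising "covid" after digits are removed means that "covid-19", "covid19", and "covid" all end up as the single token "covid19".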

Columns

id: Unique identifier for the tweet [1].

Tweet ID: Unique identifier for the tweet [2]. (Note: Appears to be the same as 'id')

created_at: The date and time when the tweet was created [1].

Creation Date & Time: The date and time when the tweet was created [2]. (Note: Appears to be the same as 'created_at')

source: The source link from which the tweet was posted [1].

Source Link: The source link from which the tweet was posted [2]. (Note: Appears to be the same as 'source')

original_text: The full text of the original tweet [1].

Original Tweet: The full text of the original tweet [2]. (Note: Appears to be the same as 'original_text')

lang: The language of the tweet [1].

favorite_count: The number of times the tweet was favourited [1].

Favorite Count: The number of times the tweet was favourited [2]. (Note: Appears to be the same as 'favorite_count')

retweet_count: The number of times the tweet was retweeted [1].

Retweet Count: The number of times the tweet was retweeted [2]. (Note: Appears to be the same as 'retweet_count')

original_author: The original author of the tweet [3].

Original Author: The original author of the tweet [2]. (Note: Appears to be the same as 'original_author')

hashtags: Hashtags included in the tweet [3].

Hashtags: Hashtags included in the tweet [2]. (Note: Appears to be the same as 'hashtags')

user_mentions: User mentions within the tweet [3].

User Mentions: User mentions within the tweet [2]. (Note: Appears to be the same as 'user_mentions')

Place: Location associated with the tweet [2].

Distribution

The dataset consists of 235,240 tweets from the first phase of collection [1, 2]. Data files are typically provided in CSV format [4]. The tweets were collected from 19th April to 20th June 2020 [1].

Usage

This dataset is ideal for various data science and analytics applications, including Natural Language Processing (NLP), Deep Learning, Text Classification, and Ensembling [2]. Its pre-processed nature and included sentiment scores make it particularly useful for sentiment analysis research related to public opinion during the Covid-19 pandemic [2].

Coverage

The dataset covers a time range from 19th April to 20th June 2020 [1]. It includes worldwide tweets [2] and is limited to English language content [2]. Tweet sources are primarily Twitter for Android (31%) and Twitter for iPhone (28%), with 41% originating from other sources [5].

License

CC-BY-SA

Who Can Use It

Data Scientists and Analysts: For conducting social media analysis, trend identification, and public sentiment tracking during the pandemic [2].

Researchers in NLP and Machine Learning: To train and evaluate text classification models, conduct deep learning experiments, and explore ensembling techniques [2].

Public Health Researchers: To understand public response, concerns, and sentiment towards Covid-19, lockdowns, and vaccines [2].

Academics and Students: For academic projects, dissertations, and learning about real-world social media data analysis and sentiment classification [2].

Dataset Name Suggestions

COVID-19 Twitter Sentiment (Apr-Jun 2020)

Pandemic Twitter Activity Dataset (Phase 1)

Global Covid-19 Tweets with Sentiment Analysis

Social Media Response to Covid-19: April-June 2020

Twitter Covid-19 Discourse (Early Pandemic)

Attributes

Original Data Source: Covid-19 Twitter Dataset
Yahoo-Yahoo Hash-Tag Tweets Using Sentiment Analysis and Opinion Mining...
data.niaid.nih.gov
Updated Aug 18, 2022
Cite
Abayomi-Alli Adebayo (2022). Yahoo-Yahoo Hash-Tag Tweets Using Sentiment Analysis and Opinion Mining Algorithms [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4748716
Dataset updated
Aug 18, 2022
Dataset provided by
Abayomi-Alli Olusola
Misra Sanjay
Fernandez-Sanz Luis
Abayomi-Alli Adebayo
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Background

Social media opinion has become a medium for quickly accessing large, valuable, and rich details on any subject matter within a short period. Twitter, being a social microblogging site, generates over 330 million tweets monthly across different countries. Analysing trending topics on Twitter presents opportunities to extract meaningful insight into different opinions on various issues.

Aim

This study aims to gain insights into the trending yahoo-yahoo topic on Twitter using content analysis of selected historical tweets.

Methodology

The widgets and workflow engine in the Orange Data Mining toolbox were employed for all the text mining tasks. 5500 tweets were collected from Twitter using the “yahoo yahoo” hashtag. The corpus was pre-processed using a pre-trained tweet tokenizer; Valence Aware Dictionary for Sentiment Reasoning (VADER) was used for the sentiment and opinion mining; Latent Dirichlet Allocation (LDA) and Latent Semantic Indexing (LSI) were used for topic modelling; and Multidimensional scaling (MDS) was used to visualize the modelled topics.

Results

Results showed that "yahoo" appeared in the corpus 9555 times, and 175 unique tweets were returned after duplicate removal. Contrary to expectation, Spain had the highest number of participants tweeting on the 'yahoo yahoo' topic within the period. The VADER sentiment analysis returned 35.85%, 24.53%, 15.09%, and 24.53% negative, neutral, no-zone, and positive sentiment tweets, respectively. The word yahoo was highly representative of the LDA topics 1, 3, 4, 6, and LSI topic 1.

Conclusion

It can be concluded that emojis convey the sentiment of tweets more representatively, and more quickly, than the textual content. Also, despite popular belief, a significant number of youths regard cybercrime as a detriment to society.
Word counts per US county in geo-tagged Tweets posted between 2015 and 2021
figshare.com
zip
Updated May 31, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Thomas Louf (2023). Word counts per US county in geo-tagged Tweets posted between 2015 and 2021 [Dataset]. http://doi.org/10.6084/m9.figshare.20630919.v2
Available download formats: zip
Unique identifier
https://doi.org/10.6084/m9.figshare.20630919.v2
Dataset updated
May 31, 2023
Dataset provided by
Figshare (http://figshare.com/)
Authors
Thomas Louf
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
United States
Description
The zip file contains fourteen Parquet [1] files of two kinds, for each of the seven years between 2015 and 2021 inclusive:

- region_counts: for every word found, gives how many times it appeared, regardless of capitalization ("count" column), how many times it appeared with at least one capitalized letter ("count_upper"), in how many different counties it appeared ("nr_cells"), and whether we considered it to be a proper noun ("is_proper")
- raw_cell_counts: gives the count for every word by county, regardless of capitalization.
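To make the meaning of the "count", "count_upper", and "nr_cells" columns concrete, here is a toy sketch that derives them from tweets grouped by county. It is not the project's pipeline: whitespace tokenization, the function name region_counts, and the sample input are all assumptions, and the "is_proper" column is omitted because its criterion is not described here.

```python
from collections import Counter, defaultdict

def region_counts(tweets_by_county):
    """Compute per-word totals, capitalized occurrences, and number of counties."""
    count = Counter()
    count_upper = Counter()
    cells = defaultdict(set)
    for county, tweets in tweets_by_county.items():
        for tweet in tweets:
            for word in tweet.split():
                key = word.lower()                      # counts ignore capitalization
                count[key] += 1
                if any(ch.isupper() for ch in word):    # at least one capital letter
                    count_upper[key] += 1
                cells[key].add(county)
    return {
        w: {"count": count[w], "count_upper": count_upper[w], "nr_cells": len(cells[w])}
        for w in count
    }

# Invented input for illustration.
sample = {"county_a": ["Hello world", "hello Texas"], "county_b": ["hello again"]}
stats = region_counts(sample)
```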

These counts were obtained from geo-tagged Tweets posted in those years within the contiguous US, which were collected through the streaming API of Twitter, more specifically using the “statuses/filter” end-point [2]. See the project's paper for more details on methodology, and the code repository to reproduce the analysis.

The two text files are our lists of excluded word forms.
International Differences in Covid-19 Vaccination
figshare.com
xlsx
Updated Mar 25, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mike Thelwall (2021). International Differences in Covid-19 Vaccination [Dataset]. http://doi.org/10.6084/m9.figshare.14308490.v1
Available download formats: xlsx
Unique identifier
https://doi.org/10.6084/m9.figshare.14308490.v1
Dataset updated
Mar 25, 2021
Dataset provided by
figshare
Authors
Mike Thelwall
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Spreadsheet of data from the paper "Can Twitter Give Insights into International Differences in Covid-19 Vaccination? Eight countries’ English tweets to 21 March 2021". Includes the gender analysis and the main analysis. A colour code key is included on the right of the worksheet for the main analysis.
Data from: RIFT: A Rule Induction Framework for Twitter Sentiment Analysis
figshare.com
html
Updated Aug 19, 2017
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Zubair Asghar; Furqan Khan; Aurangzeb Khan; Fazal Masud Kundi (2017). RIFT: A Rule Induction Framework for Twitter Sentiment Analysis [Dataset]. http://doi.org/10.6084/m9.figshare.5327065.v1
Available download formats: html
Unique identifier
https://doi.org/10.6084/m9.figshare.5327065.v1
Dataset updated
Aug 19, 2017
Dataset provided by
figshare
Authors
Zubair Asghar; Furqan Khan; Aurangzeb Khan; Fazal Masud Kundi
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The rapid evolution of microblogging and the emergence of sites such as Twitter have propelled online communities to flourish by enabling people to create, share and disseminate free-flowing messages and information globally. The exponential growth of product-based user reviews has become an ever-increasing resource playing a key role in emerging Twitter-based sentiment analysis (SA) techniques and applications to collect and analyse customer trends and reviews. Existing studies on supervised black-box sentiment analysis systems do not provide adequate information, regarding rules as to why a certain review was classified to a class or classification. The accuracy in some ways is less than our personal judgement. To address these shortcomings, alternative approaches, such as supervised white-box classification algorithms, need to be developed to improve the classification of Twitter-based microblogs. The purpose of this study was to develop a supervised white-box microblogging SA system to analyse user reviews on certain products using rough set theory (RST)-based rule induction algorithms. RST classifies microblogging reviews of products into positive, negative, or neutral class using different rules extracted from training decision tables using RST-centric rule induction algorithms. The primary focus of this study is also to perform sentiment classification of microblogs (i.e. also known as tweets) of product reviews using conventional, and RST-based rule induction algorithms. 
The proposed RST-centric rule induction algorithms, namely Learning from Examples Module version 2 (LEM2) and LEM2 + Corpus-based rules (LEM2 + CBR), which is an extension of the traditional LEM2 algorithm, are used. Corpus-based rules are generated from tweets, which are unclassified using other conventional LEM2 algorithm rules. Experimental results show the proposed method, when compared with baseline methods, is excellent, with regard to accuracy, coverage and the number of rules employed. The approach using this method achieves an average accuracy of 92.57% and an average coverage of 100%, with an average number of rules of 19.14.
English Tweet Hate Speech Classifier Data
opendatabay.com
Updated Jul 3, 2025
Cite
Datasimple (2025). English Tweet Hate Speech Classifier Data [Dataset]. https://www.opendatabay.com/data/ai-ml/32413cb6-d9db-4c1a-a3b2-23ce6e55bce2
Dataset updated
Jul 3, 2025
Dataset authored and provided by
Datasimple
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Area covered
Data Science and Analytics
Description
This dataset, named hate_speech_offensive, is a carefully assembled collection of annotated tweets designed for the purpose of detecting hate speech and offensive language. It consists primarily of English tweets and serves as a vital resource for training machine learning models and algorithms in this domain. Researchers and developers can utilise this dataset to build effective systems for identifying and classifying hateful or offensive content, contributing to safer online environments. The dataset is presented in a CSV file format, specifically 'train.csv', and includes detailed annotations for each tweet.

Columns

count: The total number of annotations provided for each individual tweet. (Integer)

hate_speech_count: The number of annotations that classified a particular tweet as hate speech. (Integer)

offensive_language_count: The number of annotations that categorised a tweet as containing offensive language. (Integer)

neither_count: The number of annotations that identified a tweet as neither hate speech nor offensive language. (Integer)

class: The classification label for the tweet.

tweet: The actual tweet content.
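The annotation-count columns above lend themselves to a simple majority vote. The sketch below shows one plausible way a single label could be derived from them; the description does not say how the class column was actually produced or how it is encoded, so the function names, the string labels, and the tie-breaking rule are all assumptions.

```python
def majority_label(hate_speech_count, offensive_language_count, neither_count):
    """Pick the category with the most annotator votes (ties go to the first listed)."""
    votes = {
        "hate_speech": hate_speech_count,
        "offensive_language": offensive_language_count,
        "neither": neither_count,
    }
    return max(votes, key=votes.get)

def annotations_consistent(count, hate_speech_count, offensive_language_count, neither_count):
    """Check that the three per-category counts add up to the total annotation count."""
    return count == hate_speech_count + offensive_language_count + neither_count
```

A check such as annotations_consistent can be a useful sanity test during exploratory analysis, on the assumption that every annotation falls into exactly one of the three categories.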

Distribution

The dataset is provided in a CSV file format, specifically 'train.csv'. It is structured with each row representing an individual tweet along with its corresponding annotations. The dataset currently comprises a single training split. There are approximately 24,783 unique tweets within the dataset.

Usage

This dataset is ideal for various applications and use cases, including:
* Training machine learning models or algorithms for automated hate speech and offensive language detection.
* Conducting Sentiment Analysis on Twitter data to understand the sentiment behind tweets and identify patterns of negative or offensive language.
* Developing and evaluating Hate Speech Detection systems that can identify and flag hate speech in real-time.
* Improving Content Moderation systems for social media platforms by automatically detecting and removing offensive or hateful content.
* Performing Exploratory Data Analysis (EDA) to gain insights into the distribution of tweet classifications, identify common words associated with each class, and analyse co-occurrences of hate speech and offensive language.

Coverage

The dataset primarily consists of English tweets. Its scope is global in potential application, aiming to address social issues and advocacy related to online discourse. While no specific time range for data collection is provided, the dataset focuses on general English tweet content.

License

CC0

Who Can Use It

This dataset is intended for:

* Researchers and developers seeking to create and improve machine learning models for detecting hate speech and offensive language on social media platforms like Twitter.
* Data scientists and analysts interested in understanding patterns of online discourse and sentiment.
* Social media platforms and their moderation teams aiming to enhance automated content moderation systems.

Dataset Name Suggestions

Twitter Hate Speech and Offensive Language Dataset

Annotated Tweet Toxicity Data

Social Media Content Moderation Tweets

English Tweet Hate Speech Classifier Data

Online Language Offensiveness Dataset

Attributes

Original Data Source: Hate Speech and Offensive Language Detection
Annotated tweet corpus in Arabizi, French and English
catalog.elra.info
Updated Apr 5, 2022
Cite
ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2022). Annotated tweet corpus in Arabizi, French and English [Dataset]. https://catalog.elra.info/en-us/repository/browse/ELRA-W0323/
Dataset updated
Apr 5, 2022
Dataset provided by
ELRA (European Language Resources Association)
ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
License
https://catalog.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
https://catalog.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
Area covered
French
Description
The annotated tweet corpus in Arabizi, French and English was built by ELDA on behalf of INSA Rouen Normandie (Normandie Université, LITIS team), in the framework of the SAPhIRS project (System for the Analysis of Information Propagation in Social Networks), funded by the DGE (Direction Générale des Entreprises, France) through the RAPID programme (2017-2020). This project aimed at studying the mechanisms of information and opinion propagation within social networks: identifying influential leaders, detecting channels for disseminating information and opinion. The purpose of the corpus constitution, completed in 2020, was to collect and annotate tweets in 3 languages (Arabizi, French and English) for 3 predefined themes (Hooliganism, Racism, Terrorism).

For the collection, a tool was developed in Python (based on the “GetOldTweets3” library) which took information such as the language (EN/FR) and a keyword list as input. With this tool, a maximum of 10,000 tweets per (keyword, language) pair were collected for English and French. For Arabizi, a specific process was set up: a vocabulary list in Arabizi was created from a corpus of Arabizi SMS (for Moroccan and Tunisian) and from the Training and test data for Arabizi detection and transliteration (available from ELRA under reference ELRA-W0126, ISLRN ID: 986-364-744-303-9) by selecting the 1000 most frequent words, and the tweets containing each word from this vocabulary and keyword list (places = Morocco, Tunisia, Algeria) were downloaded. The tweets that were kept had to contain at least 5 words in Arabizi.

For the annotation, a tool running on Django was developed in order to provide the following annotations for each tweet in a given sequence:
• Theme: with 5 possible annotations (Hooliganism, Racism, Terrorism, Others, Incomprehensible)
• Topic: the annotator can add a new topic if it does not exist in the proposed list
• Opinion: 3 possible annotations (Negative, Neutral, Positive)

In total, 17,103 sequences were annotated from 585,163 tweets (196,374 in English, 254,748 in French and 134,041 in Arabizi), including the themes “Others” and “Incomprehensible”. Among these sequences, 4,578 sequences having at least 20 tweets annotated with the 3 predefined themes (Hooliganism, Racism and Terrorism) were obtained, including 1,866 sequences with an opinion change. They are distributed as follows: 2,141 sequences in English (57,655 tweets), 1,942 sequences in French (48,854 tweets) and 495 sequences in Arabizi (21,216 tweets). A sub-corpus of 8,733 tweets (1,209 in English, 3,938 in French and 3,585 in Arabizi) annotated as “hateful”, according to topic/opinion annotations and by selecting tweets that contained insults, is also provided. The data are provided in CSV format.

Remark: this corpus includes only tweet IDs and corresponding annotations. Original tweets may be obtained by using the Twitter API.
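The per-language figures quoted in the description can be cross-checked against the stated totals. The sketch below uses only numbers from the description; the variable names are illustrative.

```python
# Figures copied from the corpus description (English, French, Arabizi).
collected_tweets = {"en": 196_374, "fr": 254_748, "arabizi": 134_041}
kept_sequences = {"en": 2_141, "fr": 1_942, "arabizi": 495}
kept_tweets = {"en": 57_655, "fr": 48_854, "arabizi": 21_216}
hateful_tweets = {"en": 1_209, "fr": 3_938, "arabizi": 3_585}


def total(per_language):
    """Sum a per-language breakdown."""
    return sum(per_language.values())


print(total(collected_tweets))  # 585163, matches the stated 585,163
print(total(kept_sequences))    # 4578, matches the stated 4,578
print(total(kept_tweets))       # 127725 tweets in the kept sequences
print(total(hateful_tweets))    # 8732, one fewer than the stated 8,733
```

The hateful sub-corpus breakdown sums to 8,732 rather than the stated 8,733; the description does not explain the difference.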
Gender Prediction from Tweet Typo Data
opendatabay.com
Updated Jul 5, 2025
Cite
Datasimple (2025). Gender Prediction from Tweet Typo Data [Dataset]. https://www.opendatabay.com/data/ai-ml/05c9578a-719d-4ab0-82cd-0aa99bfa2bbe
Dataset updated
Jul 5, 2025
Dataset authored and provided by
Datasimple
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Area covered
Social Media and Networking
Description
This dataset provides simple Twitter analytics data, focusing on user profiles and tweet content. Its primary purpose is to enable the classification of gender based on tweet characteristics, specifically exploring the likelihood of different genders committing typos on their tweets. It serves as a valuable resource for emerging Natural Language Processing (NLP) enthusiasts looking to apply basic models to real-world social media data. The dataset includes unformatted tweet text, user information, and confidence scores related to various attributes.

Columns

The dataset contains the following key columns:

* _unit_id: A unique identifier for the unit.
* Tweet ID: The unique identifier for a tweet.
* _golden: Indicates whether a user is a Golden User.
* _unit_state: The state of the tweet.
* _trusted_judgments: The level of trust associated with the judgment.
* _last_judgment_at: The timestamp of the last judgment.
* gender: The declared or inferred sex of the user.
* gender:confidence: The confidence level associated with the gender classification.
* profile_yn: A boolean indicating whether the user's profile is active or exists.
* profile_yn:confidence: The confidence level for the profile's existence.
* created: The date and time when the user's account was created.
* Label Count: A count related to various labels within the dataset.
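One common use of the confidence columns is to keep only rows whose gender label is reliable. The sketch below is illustrative only: the sample rows are invented, and it assumes that `profile_yn` is stored as `yes`/`no` and that confidence values lie between 0 and 1, neither of which is stated in the description.

```python
import csv
import io

# Hypothetical sample using column names from the description; values are invented.
SAMPLE = """_unit_id,gender,gender:confidence,profile_yn,profile_yn:confidence
1,male,1.0,yes,1.0
2,female,0.66,yes,1.0
3,male,0.34,yes,1.0
4,female,1.0,no,1.0
"""


def confident_rows(handle, min_confidence=0.9):
    """Yield rows whose profile exists and whose gender label meets the threshold."""
    for row in csv.DictReader(handle):
        # Assumption: the boolean is encoded as the strings "yes" and "no".
        if row["profile_yn"] != "yes":
            continue
        if float(row["gender:confidence"]) >= min_confidence:
            yield row


kept = list(confident_rows(io.StringIO(SAMPLE)))
kept_ids = [row["_unit_id"] for row in kept]
```

The threshold of 0.9 is arbitrary; a lower value keeps more rows at the cost of noisier labels.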

Distribution

The dataset is provided as a single data file, typically in CSV format. It comprises approximately 20,000 records. The structure includes various data types, such as IDs, boolean indicators, numerical confidence scores, and datetime stamps.

Usage

This dataset is ideal for:

* Classifying user gender based on tweet content and user profile information.
* Analysing spelling errors or typos in tweets in relation to user demographics.
* Developing and testing Natural Language Processing (NLP) models, particularly for tasks like text classification and sentiment analysis.
* Exploring patterns in social media behaviour and user characteristics on Twitter.
* Educational purposes for those new to applying machine learning techniques to real-world tweet data.

Coverage

The dataset offers global geographical coverage as indicated by its region. The time range for tweet activity appears to be concentrated around 26th to 27th October 2015. However, the account creation dates for the users span a much broader period, from 5th August 2006 to 26th October 2015. In terms of demographics, the dataset includes gender distribution, with approximately 33% female, 31% male, and 36% categorised as 'Other'.

License

CC0

Who Can Use It

This dataset is primarily intended for:

* Data scientists and analysts interested in social media analytics and user behaviour.
* Machine learning practitioners, especially those working on classification problems and NLP tasks.
* Students and researchers in fields such as computer science, linguistics, and social sciences.
* NLP enthusiasts who are developing or looking to test basic linear or naive models on real-world text data.

Dataset Name Suggestions

Twitter User Profile & Activity Data

Gender Prediction from Tweet Typo Data

Social Media Analytics: Twitter User Gender

Tweet Classification for Gender Studies

Attributes

Original Data Source: Twitter Data
Portuguese Comparative Sentences: A Collection of Labeled Sentences on...
live.european-language-grid.eu
data.niaid.nih.gov
Updated Dec 10, 2023
Cite
(2023). Portuguese Comparative Sentences: A Collection of Labeled Sentences on Twitter and Buscapé [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/7664
Available download formats: json
Dataset updated
Dec 10, 2023
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
More and more customers turn to online product reviews and comments on the Web when deciding to buy one product over another. In this context, sentiment analysis techniques constitute the traditional way to summarize users' opinions that criticize or highlight the positive aspects of a product. Sentiment analysis of reviews usually relies on extracting positive and negative aspects of products, neglecting comparative opinions. Such opinions do not directly express a positive or negative view but contrast aspects of products from different competitors. Here, we present the first effort to study comparative opinions in Portuguese, creating two new Portuguese datasets with comparative sentences labeled by three humans. This repository consists of three important files: (1) a lexicon that contains words frequently used to make a comparison in Portuguese; (2) a Twitter dataset with labeled comparative sentences; and (3) a Buscapé dataset with labeled comparative sentences.

The lexicon is a set of 176 words frequently used to express a comparative opinion in the Portuguese language. The lexicon is used as a filter to build two datasets of comparative sentences from two important contexts: (1) online social networks; and (2) product reviews.

For Twitter, we collected all Portuguese tweets published in Brazil on 2018/01/10 and filtered all tweets that contained at least one keyword present in the lexicon, obtaining 130,459 tweets. Our work is based on the sentence level. Thus, all sentences were extracted and a sample of 2,053 sentences was created, which was manually labeled by three humans, reaching 83.2% agreement with the Fleiss' Kappa coefficient. For Buscapé, a Brazilian website (https://www.buscape.com.br/) used to compare product prices on the web, the same methodology was followed, creating a set of 2,754 labeled sentences obtained from comments made in 2013. This dataset was labeled by three humans, reaching an agreement of 83.46% with the Fleiss' Kappa coefficient.

The Twitter dataset has 2,053 labeled sentences, of which 918 are comparative. The Buscapé dataset has 2,754 labeled sentences, of which 1,282 are comparative.

The datasets contain the following labeled properties:
text: the sentence extracted from the review comment.
entity_s1: the first entity compared in the sentence.
entity_s2: the second entity compared in the sentence.
keyword: the comparative keyword used in the sentence to express comparison.
preferred_entity: the preferred entity.
id_start: the starting position of the keyword in the sentence.
id_end: the keyword's final position in the sentence.
type: the sentence label, which specifies whether the sentence is a comparison.

Additional information:
1 - The sentences were separated using a sentence tokenizer.
2 - If the compared entity is not specified, the field will receive the value "_".
3 - The property type can take five different values:
0: Non-comparative (Não Comparativa).
1: Non-Equal-Gradable (Gradativa com Predileção).
2: Equative (Equitativa).
3: Superlative (Superlativa).
4: Non-Gradable (Não Gradativa).

If you use this data, please cite our paper as follows: "Daniel Kansaon, Michele A. Brandão, Julio C. S. Reis, Matheus Barbosa, Breno Matos, and Fabrício Benevenuto. 2020. Mining Portuguese Comparative Sentences in Online Reviews. In Brazilian Symposium on Multimedia and the Web (WebMedia ’20), November 30-December 4, 2020, São Luís, Brazil. ACM, New York, NY, USA, 8 pages. https://doi.org/10.1145/3428658.3431081"
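The record layout described above can be sketched as a small typed structure. This is an illustration under assumptions rather than the authors' loader: the class and method names are hypothetical, the sample record is invented, type 4 is named after its Portuguese gloss (Não Gradativa), and the description does not say whether `id_end` is inclusive, so the offsets are stored without being used for slicing.

```python
from dataclasses import dataclass
from enum import IntEnum


class SentenceType(IntEnum):
    """Values of the `type` property as listed in the description."""
    NON_COMPARATIVE = 0
    NON_EQUAL_GRADABLE = 1
    EQUATIVE = 2
    SUPERLATIVE = 3
    NON_GRADABLE = 4  # named after the Portuguese gloss "Não Gradativa"


@dataclass
class ComparativeSentence:
    text: str
    entity_s1: str
    entity_s2: str
    keyword: str
    preferred_entity: str
    id_start: int
    id_end: int
    type: SentenceType

    @classmethod
    def from_dict(cls, raw):
        """Build a record from one decoded JSON object."""
        return cls(
            text=raw["text"],
            entity_s1=raw["entity_s1"],
            entity_s2=raw["entity_s2"],
            keyword=raw["keyword"],
            preferred_entity=raw["preferred_entity"],
            id_start=int(raw["id_start"]),
            id_end=int(raw["id_end"]),
            type=SentenceType(int(raw["type"])),
        )

    @property
    def is_comparative(self):
        return self.type != SentenceType.NON_COMPARATIVE

    def entities(self):
        """Compared entities, skipping the "_" placeholder for unspecified ones."""
        return [e for e in (self.entity_s1, self.entity_s2) if e != "_"]


# Hypothetical record; the sentence and offsets are invented for illustration.
record = ComparativeSentence.from_dict({
    "text": "o celular A é melhor que o celular B",
    "entity_s1": "celular A",
    "entity_s2": "celular B",
    "keyword": "melhor",
    "preferred_entity": "celular A",
    "id_start": 14,
    "id_end": 20,
    "type": 1,
})
```

Using an `IntEnum` keeps the numeric codes from the files while giving each one a readable name.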
The Tweets of Wisdom
kaggle.com
Updated Oct 12, 2019
Cite
Hsankesara (2019). The Tweets of Wisdom [Dataset]. https://www.kaggle.com/hsankesara/the-tweets-of-wisdom/code
Dataset updated
Oct 12, 2019
Dataset provided by
Kaggle
Authors
Hsankesara
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Context

In the last few years, Twitter has become one of the most popular social media platforms. From celebrity status to government policies, Twitter can accommodate a diverse range of people and thoughts. Among this diverse set of thoughts, there are many Twitter accounts that often tweet "self-help" thoughts. These so-called "wise" thoughts are often related to improving one's life and how to excel at what you're doing. So I went down the rabbit hole to search for these sorts of tweets, and I found many common themes among them. Therefore, I decided to scrape the tweets so that you can explore the words of these "self-help" tweets and understand them much better.

Content

I scraped the data using the Tweepy API. I scraped all the tweets, retweets and retweets with a comment from 40 authors. The data contains more than 40 authors because every retweet from any of the 40 authors is stored as a tweet from the original author. Also, every retweet with a comment contains and tags. The author's comment is followed by tag and then the content of the retweet comes which is followed by . The script I used for scraping can be found here.

Acknowledgements

I would like to thank Stack Overflow, which helped me at literally every stage of this project from scraping to data analysis. Also, kudos to the Tweepy API, which made it far easier to fetch tweets.

Inspiration

I downloaded this dataset for many reasons. The most important one is that I want to know how similar these tweets are. I would also like to know what makes some tweets go viral and what factors affect a viral tweet. I explore these and many more questions in my kernel, which you can find in the kernel section.

Greta Thunberg's Twitter data mining & Frame Analysis
figshare.com
Updated Jul 22, 2022
Cite
Sílvia Díaz Pérez (2022). Greta Thunberg's Twitter data mining & Frame Analysis [Dataset]. http://doi.org/10.6084/m9.figshare.20311524.v6
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.20311524.v6
Dataset updated
Jul 22, 2022
Dataset provided by
figshare
Authors
Sílvia Díaz Pérez
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Bigram, word frequency, frame and engagement analysis of the tweets published by Greta Thunberg between 2019 and 2022.
Processed Data (Excerpts of co-occurrence analysis in WordStat) for The...
data.mendeley.com
Updated Aug 7, 2019
Cite
Tim Stevens (2019). Processed Data (Excerpts of co-occurrence analysis in WordStat) for The Emergence and Evolution of Master Terms in the Public Debate about Livestock Farming: Semantic Fields, Communication Strategies and Policy Practices [Dataset]. http://doi.org/10.17632/229cdbbfmf.1
Unique identifier
https://doi.org/10.17632/229cdbbfmf.1
Dataset updated
Aug 7, 2019
Authors
Tim Stevens
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Excerpts of the co-occurrence analysis in WordStat (including 2D co-word maps, correspondence maps and heat maps) of two cases: booster-broiler (plofkip) and mega-stable (megastal). Please see the article for more information about the data collection and method.
Tweets on the COVID-19 in Alberta
borealisdata.ca
Updated Dec 13, 2023
Cite
Geoffrey Rockwell; Bennett Kuwan Tchoh; Robert Budac (2023). Tweets on the COVID-19 in Alberta [Dataset]. http://doi.org/10.5683/SP3/HDCFNF
Unique identifier
https://doi.org/10.5683/SP3/HDCFNF
Dataset updated
Dec 13, 2023
Dataset provided by
Borealis
Authors
Geoffrey Rockwell; Bennett Kuwan Tchoh; Robert Budac
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Area covered
Alberta
Description
This dataset contains word frequency (raw frequency, relative frequency, TF-IDF) and sentiment analysis of tweets on the COVID-19 pandemic in Alberta. It also contains groups of tweets that have been made non-consumptive for copyright reasons by shuffling the words of the tweets. Our goal was to capture a representative sample of the discourse on the COVID-19 pandemic in Alberta taking place on Twitter. More details on the dataset can be found in the readme document.
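The two simplest measures named in the description, raw and relative word frequency, and the word-shuffling step can be sketched as follows. This is an illustration, not the authors' pipeline: the tokenisation (lowercasing and whitespace splitting), the function names, and the sample tweets are all assumptions, and TF-IDF is left out because the description does not state which weighting variant was used.

```python
import random
from collections import Counter


def word_frequencies(tweets):
    """Return (raw, relative) frequency maps over whitespace-separated tokens."""
    raw = Counter(word for tweet in tweets for word in tweet.lower().split())
    total = sum(raw.values())
    relative = {word: count / total for word, count in raw.items()}
    return raw, relative


def shuffle_words(tweet, rng):
    """Return the tweet with its words in random order (same words, new order)."""
    words = tweet.split()
    rng.shuffle(words)
    return " ".join(words)


# Invented sample tweets, used only to exercise the functions.
tweets = ["stay safe alberta", "alberta reports new cases", "stay home stay safe"]
raw, relative = word_frequencies(tweets)

rng = random.Random(0)  # seeded so the shuffle is repeatable
shuffled = [shuffle_words(t, rng) for t in tweets]
shuffled_raw, _ = word_frequencies(shuffled)
```

Because shuffling only reorders words within each tweet, the frequency tables computed from the shuffled text are identical to those from the original text, which is what makes the shuffled groups usable for word-level analysis.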

Cite

M Yasser H (2022). Twitter Tweets Sentiment Dataset [Dataset]. https://www.kaggle.com/datasets/yasserh/twitter-tweets-sentiment-dataset

Twitter Tweets Sentiment Dataset

Twitter Tweets Sentiment Analysis for Natural Language Processing

Dataset updated

Apr 8, 2022

Dataset provided by

Kaggle (http://kaggle.com/)

Authors

M Yasser H

License

https://creativecommons.org/publicdomain/zero/1.0/

Description

Description:

Twitter is an online social media platform where people share their thoughts as tweets. It has been observed that some people misuse it to tweet hateful content. Twitter is trying to tackle this problem, and we shall help by creating a strong NLP-based classifier model to distinguish negative tweets and block them. Can you build a strong classifier model to predict the same?

Each row contains the text of a tweet and a sentiment label. In the training set you are provided with a word or phrase drawn from the tweet (selected_text) that encapsulates the provided sentiment.

Make sure, when parsing the CSV, to remove the beginning / ending quotes from the text field, to ensure that you don't include them in your training.

You're attempting to predict the word or phrase from the tweet that exemplifies the provided sentiment. The word or phrase should include all characters within that span (i.e. including commas, spaces, etc.)

Columns:

textID - unique ID for each piece of text
text - the text of the tweet
sentiment - the general sentiment of the tweet
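The advice about stripping the enclosing quotes can be followed by letting a CSV parser handle the quoting. The sketch below uses Python's standard `csv` module on an invented two-row sample; it assumes `selected_text` is a column alongside the three listed above, and the helper names are hypothetical.

```python
import csv
import io

# Hypothetical sample in the shape described above; the rows are invented.
SAMPLE = '''textID,text,selected_text,sentiment
a1,"I love this, so much!","love this, so much",positive
b2," what a dull day ","dull",negative
'''


def load_rows(handle):
    """Parse rows with csv.DictReader, which drops the enclosing field quotes."""
    return list(csv.DictReader(handle))


def span_of(row):
    """Return (start, end) of selected_text inside text, or None if absent."""
    start = row["text"].find(row["selected_text"])
    if start < 0:
        return None
    return start, start + len(row["selected_text"])


rows = load_rows(io.StringIO(SAMPLE))
spans = [span_of(r) for r in rows]
```

Slicing `text[start:end]` with the returned offsets gives back the selected phrase with every character in the span, including commas and spaces.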

Acknowledgement:

The dataset was downloaded from Kaggle Competitions:
https://www.kaggle.com/c/tweet-sentiment-extraction/data?select=train.csv

Objective:

Understand the Dataset & cleanup (if required).
Build classification models to predict the Twitter sentiments.
Compare the evaluation metrics of various classification algorithms.
