100+ datasets found

Datasets for Sentiment Analysis
zenodo.org
csv
Updated Dec 10, 2023
Cite
Julie R. Campos Arias (2023). Datasets for Sentiment Analysis [Dataset]. http://doi.org/10.5281/zenodo.10157504
Available download formats: csv
Unique identifier
https://doi.org/10.5281/zenodo.10157504
Dataset updated
Dec 10, 2023
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Julie R. Campos Arias
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This repository was created for my Master's thesis in Computational Intelligence and Internet of Things at the University of Córdoba, Spain. The purpose of this repository is to store the datasets found that were used in some of the studies that served as research material for this Master's thesis. Also, the datasets used in the experimental part of this work are included.
Below are the datasets specified, along with the details of their references, authors, and download sources.

----------- STS-Gold Dataset ----------------
The dataset consists of 2026 tweets. The file consists of 3 columns: id, polarity, and tweet. The three columns denote the unique id, polarity index of the text and the tweet text respectively.
Reference: Saif, H., Fernandez, M., He, Y., & Alani, H. (2013). Evaluation datasets for Twitter sentiment analysis: a survey and a new dataset, the STS-Gold.
File name: sts_gold_tweet.csv
----------- Amazon Sales Dataset ----------------
This dataset contains ratings and reviews for 1K+ Amazon products, with details as listed on the official Amazon website. The data was scraped from the official Amazon website in January 2023.
Owner: Karkavelraja J., Postgraduate student at Puducherry Technological University (Puducherry, Puducherry, India)
Features:
product_id - Product ID
product_name - Name of the Product
category - Category of the Product
discounted_price - Discounted Price of the Product
actual_price - Actual Price of the Product
discount_percentage - Percentage of Discount for the Product
rating - Rating of the Product
rating_count - Number of people who voted for the Amazon rating
about_product - Description about the Product
user_id - ID of the user who wrote review for the Product
user_name - Name of the user who wrote review for the Product
review_id - ID of the user review
review_title - Short review
review_content - Long review
img_link - Image Link of the Product
product_link - Official Website Link of the Product
License: CC BY-NC-SA 4.0
File name: amazon.csv
----------- Rotten Tomatoes Reviews Dataset ----------------
This rating inference dataset is a sentiment classification dataset containing 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews. On average, these reviews consist of 21 words. The first 5,331 rows contain only negative samples and the last 5,331 rows contain only positive samples, so the data should be shuffled before use.
This data is collected from https://www.cs.cornell.edu/people/pabo/movie-review-data/ as a txt file and converted into a csv file. The file consists of 2 columns: reviews and labels (1 for fresh (good) and 0 for rotten (bad)).
Reference: Bo Pang and Lillian Lee. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 115–124, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics
File name: data_rt.csv
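Because the file stores all negative rows before all positive rows, a shuffle is needed before any train/test split. The following is a minimal sketch using only the standard library. It assumes the CSV has a header row with the columns reviews and labels; the function names and the seed value are illustrative and not part of the dataset.

```python
import csv
import random


def load_rows(path):
    """Read the CSV into a list of dicts (assumes a header row)."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def shuffle_rows(rows, seed=42):
    """Return a shuffled copy of the rows; the input list is left untouched."""
    shuffled = list(rows)
    # A fixed seed keeps the order reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    return shuffled
```

Typical use would be `shuffle_rows(load_rows("data_rt.csv"))`, followed by whatever split the experiment requires.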
----------- Preprocessed Dataset Sentiment Analysis ----------------
Preprocessed Amazon product review data for the Gen3EcoDot (Alexa), scraped entirely from amazon.in
Stemmed and lemmatized using nltk.
Sentiment labels are generated using TextBlob polarity scores.
The file consists of 4 columns: index, review (stemmed and lemmatized review using nltk), polarity (score) and division (categorical label generated using polarity score).
DOI: 10.34740/kaggle/dsv/3877817
Citation: @misc{pradeesh arumadi_2022, title={Preprocessed Dataset Sentiment Analysis}, url={https://www.kaggle.com/dsv/3877817}, DOI={10.34740/KAGGLE/DSV/3877817}, publisher={Kaggle}, author={Pradeesh Arumadi}, year={2022} }
This dataset was used in the experimental phase of my research.
File name: EcoPreprocessed.csv
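The division column in this file is described as a categorical label derived from the TextBlob polarity score. The exact cut-offs are not stated, so the sketch below is only one plausible mapping: it assumes scores lie in [-1.0, 1.0], treats zero (or a small band around it) as neutral, and uses the label names positive, neutral and negative as placeholders.

```python
def polarity_to_division(polarity, neutral_band=0.0):
    """Map a polarity score in [-1.0, 1.0] to a categorical label.

    The thresholds and label names are assumptions, not the dataset's rule.
    """
    if polarity > neutral_band:
        return "positive"
    if polarity < -neutral_band:
        return "negative"
    return "neutral"
```

Widening neutral_band (for example to 0.1) moves weakly polarised reviews into the neutral class, which changes the class balance noticeably on short reviews.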
----------- Amazon Earphones Reviews ----------------
This dataset consists of 9930 Amazon reviews and star ratings for the 10 latest (as of mid-2019) Bluetooth earphone devices, intended for learning how to train a machine for sentiment analysis.
This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.
The file consists of 5 columns: ReviewTitle, ReviewBody, ReviewStar, Product and division (manually added - categorical label generated using ReviewStar score)
License: U.S. Government Works
Source: www.amazon.in
File name (original): AllProductReviews.csv (contains 14337 reviews)
File name (edited - used for my research) : AllProductReviews2.csv (contains 9930 reviews)
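The added division column is said to be generated from the ReviewStar score, but the rule itself is not given. As a hedged illustration, the sketch below adds such a column to rows already loaded as dicts. The cut-offs (4 and above positive, 2 and below negative, otherwise neutral) and the helper names are assumptions.

```python
def star_to_division(stars):
    """Map a 1-5 star rating to a coarse label (assumed cut-offs)."""
    if stars >= 4:
        return "positive"
    if stars <= 2:
        return "negative"
    return "neutral"


def add_division(rows, rating_key="ReviewStar"):
    """Return copies of the rows with an extra 'division' column."""
    labelled = []
    for row in rows:
        new_row = dict(row)  # copy so the caller's rows are not modified
        new_row["division"] = star_to_division(float(row[rating_key]))
        labelled.append(new_row)
    return labelled
```

The rating_key parameter lets the same helper be pointed at a differently named rating column in another file.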
----------- Amazon Musical Instruments Reviews ----------------
This dataset contains 7137 comments/reviews of different musical instruments coming from Amazon.
This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.
The file consists of 10 columns: reviewerID, asin (ID of the product), reviewerName, helpful (helpfulness rating of the review), reviewText, overall (rating of the product), summary (summary of the review), unixReviewTime (time of the review - unix time), reviewTime (time of the review - raw) and division (manually added - categorical label generated using overall score).
Source: http://jmcauley.ucsd.edu/data/amazon/
File name (original): Musical_instruments_reviews.csv (contains 10261 reviews)
File name (edited - used for my research) : Musical_instruments_reviews2.csv (contains 7137 reviews)
Twitter Tweets Sentiment Dataset
kaggle.com
opendatabay.com
Updated Apr 8, 2022
Cite
M Yasser H (2022). Twitter Tweets Sentiment Dataset [Dataset]. https://www.kaggle.com/datasets/yasserh/twitter-tweets-sentiment-dataset
Dataset updated
Apr 8, 2022
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
M Yasser H
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Description:

Twitter is an online social media platform where people share their thoughts as tweets. It is observed that some people misuse it to tweet hateful content. Twitter is trying to tackle this problem, and we shall help by creating a strong NLP-based classifier model to distinguish negative tweets and block them. Can you build a strong classifier model to predict the same?

Each row contains the text of a tweet and a sentiment label. In the training set you are provided with a word or phrase drawn from the tweet (selected_text) that encapsulates the provided sentiment.

Make sure, when parsing the CSV, to remove the beginning / ending quotes from the text field, to ensure that you don't include them in your training.
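A small sketch of the quote handling mentioned above. Python's csv module already unwraps properly quoted fields, so the extra helper only matters when a literal pair of quotes survives parsing. It assumes the text column is named text, as in the column list for this dataset; the helper names are illustrative.

```python
import csv
import io


def strip_outer_quotes(text):
    """Remove one pair of matching leading/trailing double quotes, if present."""
    text = text.strip()
    if len(text) >= 2 and text[0] == '"' and text[-1] == '"':
        return text[1:-1]
    return text


def parse_tweets(csv_text):
    """Parse CSV content and clean the 'text' field of every row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [{**row, "text": strip_outer_quotes(row["text"])} for row in reader]
```

Only a matching pair is removed, so a tweet that merely contains a quotation somewhere inside is left unchanged.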

You're attempting to predict the word or phrase from the tweet that exemplifies the provided sentiment. The word or phrase should include all characters within that span (i.e. including commas, spaces, etc.)

Columns:

textID - unique ID for each piece of text

text - the text of the tweet

sentiment - the general sentiment of the tweet

Acknowledgement:

The dataset was downloaded from Kaggle Competitions:
https://www.kaggle.com/c/tweet-sentiment-extraction/data?select=train.csv

Objective:

Understand the Dataset & cleanup (if required).

Build classification models to predict the twitter sentiments.

Compare the evaluation metrics of various classification algorithms.
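To make the comparison step concrete, here is a minimal sketch of two common metrics, accuracy and macro-averaged F1, written without external libraries. The choice of these two metrics is an assumption; the dataset description does not prescribe any.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

Macro-F1 weights every class equally, which matters here because the three sentiment classes are not the same size.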
Twitter Sentiments Dataset
data.mendeley.com
Updated May 14, 2021
Cite
SHERIF HUSSEIN (2021). Twitter Sentiments Dataset [Dataset]. http://doi.org/10.17632/z9zw7nt5h2.1
Explore at:
Unique identifier
https://doi.org/10.17632/z9zw7nt5h2.1
Dataset updated
May 14, 2021
Authors
SHERIF HUSSEIN
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The dataset has three sentiment classes: negative, neutral, and positive. It contains two fields: the tweet and its label.
Brussel mobility Twitter sentiment analysis CSV Dataset
data.niaid.nih.gov
Updated May 31, 2024
Cite
van Vessem, Charlotte (2024). Brussel mobility Twitter sentiment analysis CSV Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_11401123
Explore at:
Dataset updated
May 31, 2024
Dataset provided by
Tori, Floriano
van Vessem, Charlotte
Betancur Arenas, Juliana
Ginis, Vincent
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Brussels
Description
SSH CENTRE (Social Sciences and Humanities for Climate, Energy aNd Transport Research Excellence) is a Horizon Europe project, engaging directly with stakeholders across research, policy, and business (including citizens) to strengthen social innovation, SSH-STEM collaboration, transdisciplinary policy advice, inclusive engagement, and SSH communities across Europe, accelerating the EU’s transition to carbon neutrality. SSH CENTRE is based on a range of activities related to Open Science, inclusivity and diversity, especially with regard to Southern and Eastern Europe and different career stages, including: development of novel SSH-STEM collaborations to facilitate the delivery of the EU Green Deal; SSH knowledge brokerage to support regions in transition; and the effective design of strategies for citizen engagement in EU R&I activities.
Outputs include action-led agendas and building stakeholder synergies through regular Policy Insight events. This is captured in a high-profile virtual SSH CENTRE generating and sharing best practice for SSH policy advice, overcoming fragmentation to accelerate the EU’s journey to a sustainable future.
The documents uploaded here are part of WP2, whereby novel, interdisciplinary teams were provided funding to undertake activities to develop a policy recommendation related to EU Green Deal policy. Each of these policy recommendations, and the activities that inform them, will be written up as a chapter in an edited book collection. Three books will make up this edited collection: one on climate, one on energy and one on mobility.
As part of writing a chapter for the SSH CENTRE book on ‘Mobility’, we set out to analyse the sentiment of users on Twitter regarding shared and active mobility modes in Brussels. This involved collecting tweets between 2017 and 2022.
A tweet was collected if it contained a previously defined mobility keyword (for example: metro) and either the name of a (local) politician, a neighbourhood or municipality, or a (shared) mobility provider. The file attached to this Zenodo webpage is a CSV file containing the tweets collected.
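The collection rule described above (a mobility keyword AND at least one politician, place or provider name) can be sketched as a simple predicate. This is an illustration only: the case-insensitive substring matching and the function name are assumptions, and the actual keyword lists used by the authors are not given here.

```python
def matches_collection_rule(tweet_text, mobility_keywords, context_terms):
    """True if the tweet has a mobility keyword AND at least one context term.

    context_terms stands for the combined list of politician names,
    neighbourhoods or municipalities, and mobility providers.
    """
    text = tweet_text.lower()
    has_keyword = any(keyword.lower() in text for keyword in mobility_keywords)
    has_context = any(term.lower() in text for term in context_terms)
    return has_keyword and has_context
```

Substring matching is deliberately naive; a real collection pipeline would likely need word-boundary handling and multilingual variants of each term.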
Sentiment Analysis on Financial Tweets
kaggle.com
zip
Updated Sep 5, 2019
Cite
Vivek Rathi (2019). Sentiment Analysis on Financial Tweets [Dataset]. https://www.kaggle.com/datasets/vivekrathi055/sentiment-analysis-on-financial-tweets
Available download formats: zip (2538259 bytes)
Dataset updated
Sep 5, 2019
Authors
Vivek Rathi
License
http://opendatacommons.org/licenses/dbcl/1.0/
Description
Context

The following information can also be found at https://www.kaggle.com/davidwallach/financial-tweets. Out of curiosity, I cleaned the .csv files to perform a sentiment analysis, so both .csv files in this dataset were created by me.

Everything else in this description was written by David Wallach; using this information, I performed my first ever sentiment analysis.

"I have been interested in using public sentiment and journalism to gather sentiment profiles on publicly traded companies. I first developed a Python package (https://github.com/dwallach1/Stocker) that scrapes the web for articles written about companies, and then noticed the abundance of overlap with Twitter. I then developed a NodeJS project that I have been running on my RaspberryPi to monitor Twitter for all tweets coming from those mentioned in the content section. If one of them tweeted about a company in the stocks_cleaned.csv file, then it would write the tweet to the database. Currently, the file is only from earlier today, but after about a month or two, I plan to update the tweets.csv file (hopefully closer to 50,000 entries).

I am not quite sure how this dataset will be relevant, but I hope to use these tweets and try to generate some sense of public sentiment score."

Content

This dataset has all the publicly traded companies (tickers and company names) that were used as input to fill the tweets.csv. The influencers whose tweets were monitored were: ['MarketWatch', 'business', 'YahooFinance', 'TechCrunch', 'WSJ', 'Forbes', 'FT', 'TheEconomist', 'nytimes', 'Reuters', 'GerberKawasaki', 'jimcramer', 'TheStreet', 'TheStalwart', 'TruthGundlach', 'Carl_C_Icahn', 'ReformedBroker', 'benbernanke', 'bespokeinvest', 'BespokeCrypto', 'stlouisfed', 'federalreserve', 'GoldmanSachs', 'ianbremmer', 'MorganStanley', 'AswathDamodaran', 'mcuban', 'muddywatersre', 'StockTwits', 'SeanaNSmith']

Acknowledgements

The data used here is gathered from a project I developed : https://github.com/dwallach1/StockerBot

Inspiration

I hope to develop a financial sentiment text classifier that would be able to track Twitter's (and the entire public's) feelings about any publicly traded company (and cryptocurrency)
Twitter Public Sentiment Dataset
opendatabay.com
Updated Jul 6, 2025
Cite
Datasimple (2025). Twitter Public Sentiment Dataset [Dataset]. https://www.opendatabay.com/data/ai-ml/04ea3224-1b10-48d4-871a-496c9a2633ff
Dataset updated
Jul 6, 2025
Dataset authored and provided by
Datasimple
License
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Area covered
Telecommunications & Network Data
Description
This dataset provides a collection of 1000 tweets designed for sentiment analysis. The tweets were sourced from Twitter using Python and systematically generated using various modules to ensure a balanced representation of different tweet types, user behaviours, and sentiments. This includes the use of a random module for IDs and text, a faker module for usernames and dates, and a textblob module for assigning sentiment. The dataset's purpose is to offer a robust foundation for analysing and visualising sentiment trends and patterns, aiding in the initial exploration of data and the identification of significant patterns or trends.

Columns

Tweet ID: A unique identifier assigned to each individual tweet.

Text: The actual textual content of the tweet.

User: The username of the individual who posted the tweet.

Created At: The date and time when the tweet was originally published.

Likes: The total number of likes or approvals the tweet received.

Retweets: The total count of times the tweet was shared by other users.

Sentiment: The categorised emotional tone of the tweet, typically labelled as positive, neutral, or negative.

Distribution

The dataset is provided in a CSV file format. It consists of 1000 individual tweet records, structured in a tabular layout with the columns detailed above. A sample file will be made available separately on the platform.

Usage

This dataset is ideal for:

Analysing and visualising sentiment trends and patterns in social media.

Initial data exploration to uncover insights into tweet characteristics and user emotions.

Identifying underlying patterns or trends within social media conversations.

Developing and training machine learning models for sentiment classification.

Academic research into Natural Language Processing (NLP) and social media dynamics.

Educational purposes, allowing students to practise data analysis and visualisation techniques.

Coverage

The dataset spans tweets created between January and April 2023, as observed from the included data samples. While specific geographic or demographic information for users is not available within the dataset, the nature of Twitter implies a general global scope, reflecting a variety of user behaviours and sentiments without specific regional or population group focus.

License

CC0

Who Can Use It

This dataset is valuable for:

Data Scientists and Machine Learning Engineers working on NLP tasks and model development.

Researchers in fields such as Natural Language Processing, Machine Learning Algorithms, Deep Learning, and Computer Science.

Data Analysts looking to extract insights from social media content.

Academics and Students undertaking projects related to sentiment analysis or social media studies.

Anyone interested in understanding online sentiment and user behaviour on social media platforms.

Dataset Name Suggestions

Twitter Public Sentiment Dataset

Social Media Text Sentiment Analysis

General Tweet Mood Data

Twitter Sentiment Collection 2023

Microblog Sentiment Dataset

Attributes

Original Data Source: Twitter Sentiment Analysis using Roberta and Vader
multiclass-sentiment-analysis-dataset
huggingface.co
Updated Jul 14, 2023
Cite
Shahriar Parvez (2023). multiclass-sentiment-analysis-dataset [Dataset]. https://huggingface.co/datasets/Sp1786/multiclass-sentiment-analysis-dataset
Dataset updated
Jul 14, 2023
Authors
Shahriar Parvez
License
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Dataset Card for Dataset Name

Dataset Summary

This dataset card aims to be a base template for new datasets. It has been generated using this raw template.

Supported Tasks and Leaderboards

[More Information Needed]

Languages

[More Information Needed]

Dataset Structure

Data Instances

[More Information Needed]

Data Fields

[More Information Needed]

Data Splits

[More Information Needed]

Dataset Creation… See the full description on the dataset page: https://huggingface.co/datasets/Sp1786/multiclass-sentiment-analysis-dataset.
coarse-grained sentiment analysis.csv
figshare.com
txt
Updated Nov 7, 2023
Cite
chen duan; Zhengwei Huang (2023). coarse-grained sentiment analysis.csv [Dataset]. http://doi.org/10.6084/m9.figshare.21508251.v1
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.21508251.v1
Dataset updated
Nov 7, 2023
Dataset provided by
figshare
Authors
chen duan; Zhengwei Huang
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This study selects the customer service conversation dataset from Jing Dong (JD)[1] to evaluate our proposed model. In the dataset, the number 0 represents the statement issued by the customer, the number 1 represents the statement issued by customer service and chatbots, and the "session-id" represents multi-turn sessions between the same customer service agents and customers. This paper selected about 10,000 dialogue texts for training, validation, and testing, including 5,068 sentences issued by customers and 4,942 sentences issued by customer service and chatbots.

[1] https://www.jd.com
AWARE: Dataset for Aspect-Based Sentiment Analysis of Apps Reviews
data.niaid.nih.gov
explore.openaire.eu
Updated Jan 25, 2022
Cite
Hamoud Aljamaan (2022). AWARE: Dataset for Aspect-Based Sentiment Analysis of Apps Reviews [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5528480
Explore at:
Dataset updated
Jan 25, 2022
Dataset provided by
Malak Baslyman
Hamoud Aljamaan
Nouf Alturaief
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The peer-reviewed paper on the AWARE dataset was published at ASEW 2021 and can be accessed through: http://doi.org/10.1109/ASEW52652.2021.00049. Kindly cite this paper when using the AWARE dataset.

Aspect-Based Sentiment Analysis (ABSA) aims to identify the opinion (sentiment) with respect to a specific aspect. Since there is a lack of smartphone app review datasets annotated to support the ABSA task, we present AWARE: ABSA Warehouse of Apps REviews.

AWARE contains apps reviews from three different domains (Productivity, Social Networking, and Games), as each domain has its distinct functionalities and audience. Each sentence is annotated with three labels, as follows:

Aspect Term: a term that exists in the sentence and describes an aspect of the app that is expressed by the sentiment. A term value of “N/A” means that the term is not explicitly mentioned in the sentence.

Aspect Category: one of the pre-defined set of domain-specific categories that represent an aspect of the app (e.g., security, usability, etc.).

Sentiment: positive or negative.

Note: games domain does not contain aspect terms.

We provide a comprehensive dataset of 11323 sentences from the three domains, where each sentence is additionally annotated with a Boolean value indicating whether the sentence expresses a positive/negative opinion. In addition, we provide three separate datasets, one for each domain, containing only sentences that express opinions. The file named “AWARE_metadata.csv” contains a description of the dataset’s columns.

How can AWARE be used?

We designed AWARE such that it can be used to serve various tasks. The tasks can be, but are not limited to:

Sentiment Analysis.

Aspect Term Extraction.

Aspect Category Classification.

Aspect Sentiment Analysis.

Explicit/Implicit Aspect Term Classification.

Opinion/Not-Opinion Classification.

Furthermore, researchers can experiment with and investigate the effects of different domains on users' feedback.
Processed twitter sentiment Dataset | Added Tokens
kaggle.com
Updated Aug 21, 2024
Cite
Halemo GPA (2024). Processed twitter sentiment Dataset | Added Tokens [Dataset]. http://doi.org/10.34740/kaggle/ds/5568348
Unique identifier
https://doi.org/10.34740/kaggle/ds/5568348
Dataset updated
Aug 21, 2024
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Halemo GPA
License
http://opendatacommons.org/licenses/dbcl/1.0/
Description
This dataset is a processed version of the Sentiment140 corpus, containing 1.6 million tweets with binary sentiment labels. The original data has been cleaned, tokenized, and prepared for natural language processing (NLP) and machine learning tasks. It provides a rich resource for sentiment analysis, text classification, and other NLP applications. The dataset includes the full processed corpus (train-processed.csv) and a smaller sample of 10,000 tweets (train-processed-sample.csv) for quick experimentation and model prototyping.

Key Features:

1.6 million labeled tweets

Binary sentiment classification (0 for negative, 1 for positive)

Preprocessed and tokenized text

Balanced class distribution

Suitable for various NLP tasks and model architectures

Citation

If you use this dataset in your research or project, please cite the original Sentiment140 dataset: Go, A., Bhayani, R. and Huang, L., 2009. Twitter sentiment classification using distant supervision. CS224N Project Report, Stanford, 1(2009), p.12.
Twitter Sentiment Classification Data
opendatabay.com
Updated Jul 2, 2025
Cite
Datasimple (2025). Twitter Sentiment Classification Data [Dataset]. https://www.opendatabay.com/data/ai-ml/89d10076-3c7d-4857-8c75-0b284a9a7f06
Dataset updated
Jul 2, 2025
Dataset authored and provided by
Datasimple
License
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Area covered
Social Media and Networking
Description
This dataset provides a collection of tweets, each categorised by its sentiment. It is designed to assist in developing and evaluating machine learning models, particularly for natural language processing tasks. The primary aim is to distinguish between different sentiments expressed in tweets, helping to address issues like harmful content by enabling the creation of robust classifier models. Each entry includes the tweet text and its corresponding sentiment label, with a specific focus on identifying the exact word or phrase within the tweet that encapsulates that sentiment.

Columns

textID: A unique identifier for each tweet entry.

text: The full content of the tweet.

selected_text: The specific part of the tweet that best represents the given sentiment.

sentiment: The overall sentiment expressed in the tweet, categorised as neutral, positive, or other.

Distribution

The dataset contains approximately 27,500 tweets. It is typically provided in a CSV file format. The textID and text columns each contain 27,481 unique values, while the selected_text column has 22,464 unique values. The sentiment distribution is as follows: 40% are neutral, 31% are positive, and 28% fall into other sentiment categories. When processing the data from the CSV, it is important to remove any beginning or ending quotation marks from the text fields.
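The class shares quoted above can be re-derived from the label column once the file is loaded. The sketch below is a minimal, library-free way to do that check; the function name is illustrative and the input is assumed to be a plain list of sentiment labels.

```python
from collections import Counter


def label_shares(labels):
    """Return the percentage share of each label, rounded to one decimal."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: round(100 * n / total, 1) for label, n in counts.items()}
```

Running the same check on the textID column with `len(set(ids))` is a quick way to confirm the stated number of unique values.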

Usage

This dataset is ideally suited for tasks involving sentiment analysis and text classification. It can be used to build and train classification models that predict the sentiment of Twitter tweets. Furthermore, it allows for the comparison and evaluation of various classification algorithms based on their performance metrics in predicting sentiments. It is particularly useful for developing strong NLP-based classifier models to identify and categorise tweets by sentiment.

Coverage

The data originates from a global platform, Twitter, and the sentiment analysis is applicable across a wide range of content. The dataset's structure allows for analysis of sentiments in tweets, covering various topics and expressions globally. No specific time range or demographic scope is detailed beyond its global applicability.

License

CC0

Who Can Use It

This dataset is suitable for a diverse range of users, including beginners in data science and machine learning. It is especially beneficial for those interested in social network analysis, text classification, and natural language processing. Intended users include data scientists, researchers, and developers looking to build and test models for predicting social media sentiments or for applications like content moderation.

Dataset Name Suggestions

Twitter Tweet Sentiment Dataset

Tweet Sentiment Analysis Dataset

Social Media Sentiment Prediction Data

Twitter Sentiment Classification Data

Attributes

Original Data Source: Twitter Tweets Sentiment Dataset
Twitter Sentiment Analysis Data
ieee-dataport.org
Updated Aug 6, 2024
Cite
Rabindra Lamsal (2024). Twitter Sentiment Analysis Data [Dataset]. https://ieee-dataport.org/documents/twitter-sentiment-analysis-data
Explore at:
Dataset updated
Aug 6, 2024
Authors
Rabindra Lamsal
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
because of COVID-19
Mobile review dataset for aspect level sentiment analysis
ieee-dataport.org
Updated Sep 17, 2024
Cite
Piyush Soni (2024). Mobile review dataset for aspect level sentiment analysis [Dataset]. https://ieee-dataport.org/documents/mobile-review-dataset-aspect-level-sentiment-analysis
Explore at:
Dataset updated
Sep 17, 2024
Authors
Piyush Soni
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
camera
Sentiment analysis of tech media articles using VADER package and...
live.european-language-grid.eu
csv
Updated Aug 16, 2023
Cite
(2023). Sentiment analysis of tech media articles using VADER package and co-occurrence analysis [Dataset]. https://live.european-language-grid.eu/catalogue/lcr/1351
Available download formats: csv
Dataset updated
Aug 16, 2023
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Sentiment analysis of tech media articles using VADER package and co-occurrence analysis
Sources: Above 140k articles (01.2016-03.2019):
Gigaom 0.5%
Euractiv 0.9%
The Conversation 1.3%
Politico Europe 1.3%
IEEE Spectrum 1.8%
Techforge 4.3%
Fastcompany 4.5%
The Guardian (Tech) 9.2%
Arstechnica 10.0%
Reuters 11%
Gizmodo 17.5%
ZDNet 18.3%
The Register 19.5%
Methodology
The sentiment analysis has been prepared using VADER*, an open-source lexicon and rule-based sentiment analysis tool. VADER is specifically designed for social media analysis, but can be also applied for other text sources. The sentiment lexicon was compiled using various sources (other sentiment data sets, Twitter etc.) and was validated by human input. The advantage of VADER is that the rule-based engine includes word-order sensitive relations and degree modifiers.
As VADER is more robust in the case of shorter social media texts, the analysed articles have been divided into paragraphs. The analysis has been carried out for the social issues presented in the co-occurrence exercise.
The process included the following main steps:
The 100 most frequently co-occurring terms are identified for every social issue (using the co-occurrence methodology)
The articles containing the given social issue and co-occurring term are identified
The identified articles are divided into paragraphs
Social issue and co-occurring words are removed from the paragraph
The VADER sentiment analysis is carried out for every identified and modified paragraph
The average for the given word pair is calculated for the final result
Therefore, the procedure has been repeated for 100 words for all identified social issues.
The sentiment analysis resulted in a compound score for every paragraph. The score is calculated from the sum of the valence scores of each word in the paragraph, and normalised between the values -1 (most extreme negative) and +1 (most extreme positive). Finally, the average is calculated from the paragraph results. Removal of terms is meant to exclude sentiment of the co-occurring word itself, because the word may be misleading, e.g. when some technologies or companies attempt to solve a negative issue. The neighbourhood's scores would be positive, but the negative term would bring the paragraph's score down.
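The scoring steps described above (remove the issue and co-occurring terms, sum word valences, normalise to the range -1 to +1, then average over paragraphs) can be sketched as follows. This is a simplified illustration, not the VADER engine itself: it ignores VADER's rules for negation, degree modifiers and punctuation, handles only single-word terms, and takes a toy lexicon as input. The normalisation form x / sqrt(x² + alpha) with alpha = 15 mirrors the default in the reference vaderSentiment implementation and should be treated as an assumption here.

```python
import math
import re


def normalise(valence_sum, alpha=15.0):
    """Squash an unbounded valence sum into the open interval (-1, 1)."""
    return valence_sum / math.sqrt(valence_sum * valence_sum + alpha)


def paragraph_score(paragraph, lexicon, removed_terms):
    """Compound-style score for one paragraph after removing given terms."""
    tokens = re.findall(r"[a-z']+", paragraph.lower())
    kept = [token for token in tokens if token not in removed_terms]
    return normalise(sum(lexicon.get(token, 0.0) for token in kept))


def pair_score(paragraphs, lexicon, removed_terms):
    """Average paragraph score for one (social issue, co-occurring term) pair."""
    scores = [paragraph_score(p, lexicon, removed_terms) for p in paragraphs]
    return sum(scores) / len(scores) if scores else 0.0
```

Removing the issue term before scoring is what keeps a paragraph about fixing a negative problem from being dragged down by the negative word itself.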
The presented tables include the most extreme co-occurring terms for the analysed social issue. The examples are chosen from the 30 words with the most positive and the 30 words with the most negative sentiment. The presented graphs show the evolution of sentiment for social issues. The analysed paragraphs are selected in the following way:
The articles containing the given social issue are identified
The paragraphs containing the social issue are selected for sentiment analysis
*Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text. Eighth International Conference on Weblogs and Social Media (ICWSM-14). Ann Arbor, MI, June 2014.

Files
sentiments_mod11.csv sentiment score based on chosen unigrams
sentiments_mod22.csv sentiment score based on chosen bigrams
sentiments_cooc_mod11.csv, sentiments_cooc_mod12.csv, sentiments_cooc_mod21.csv, sentiments_cooc_mod22.csv combinations of co-occurrences: unigrams-unigrams, unigrams-bigrams, bigrams-unigrams, bigrams-bigrams
Text Classification Dataset
opendatabay.com
Updated Jun 6, 2025
Opendatabay (2025). Text Classification Dataset [Dataset]. https://www.opendatabay.com/data/ai-ml/1775ad0d-be0d-49c9-bbc1-f94a8a5c8355
Dataset updated
Jun 6, 2025
Dataset authored and provided by
Opendatabay
License
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Area covered
Education & Learning Analytics
Description
A curated dataset of 241,000+ English-language comments labeled for sentiment (negative, neutral, positive). Ideal for training and evaluating NLP models in sentiment analysis.

Dataset Features

1. text: Contains individual English-language comments or posts sourced from various online platforms.

2. label: Represents the sentiment classification assigned to each comment. It uses the following encoding:

0: Negative sentiment
1: Neutral sentiment
2: Positive sentiment

Distribution

Format: CSV (Comma-Separated Values)

2 Columns:
text: The comment content
label: Sentiment classification (0 = Negative, 1 = Neutral, 2 = Positive)

File Size: Approximately 23.9 MB

Structure: Each row contains a single comment and its corresponding sentiment label.
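
A file with this layout can be read with the standard library alone. The sketch below assumes a header row with the column names `text` and `label` as listed above; the sample rows are made up and stand in for the real file, whose name is not given in the listing.

```python
import csv
import io
from collections import Counter

# Label encoding as described in the dataset features.
LABEL_NAMES = {0: "negative", 1: "neutral", 2: "positive"}


def read_labeled_comments(handle):
    """Yield (text, label_name) pairs from a CSV with `text` and `label` columns."""
    for row in csv.DictReader(handle):
        yield row["text"], LABEL_NAMES[int(row["label"])]


# Made-up sample rows standing in for the real file.
sample = io.StringIO(
    'text,label\n'
    '"great course, loved it",2\n'
    'it was fine,1\n'
    'waste of time,0\n'
)
rows = list(read_labeled_comments(sample))

# Class balance is worth checking before training a classifier.
counts = Counter(name for _, name in rows)
```

For the real file, the `io.StringIO` object would be replaced by `open(path, newline="", encoding="utf-8")`, with `path` pointing at the downloaded CSV.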

Usage

This dataset is ideal for a variety of applications:

1. Sentiment Analysis Model Training: Train machine learning or deep learning models to classify text as positive, negative, or neutral.

2. Text Classification Projects: Use as a labeled dataset for supervised learning in text classification tasks.

3. Customer Feedback Analysis: Train models to automatically interpret user reviews, support tickets, or survey responses.

Coverage

Geographic Coverage: Primarily English-language content from global online platforms

Time Range: The exact time range of data collection is unspecified; however, the dataset reflects contemporary online language patterns and sentiment trends typically observed in the 2010s to early 2020s.

Demographics: Specific demographic information (e.g., age, gender, location, industry) is not included in the dataset, as the focus is purely on textual sentiment rather than user profiling.

License

CC0

Who Can Use It

Data Scientists: For training machine learning models.

Researchers: For academic or scientific studies.

Businesses: For analysis, insights, or AI development.
NLP Preprocessed Sentiment Dataset
opendatabay.com
Updated Jul 3, 2025
Datasimple (2025). NLP Preprocessed Sentiment Dataset [Dataset]. https://www.opendatabay.com/data/ai-ml/6323a1b5-7112-49bd-ad55-c1ef6968abc3
Dataset updated
Jul 3, 2025
Dataset authored and provided by
Datasimple
License
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Area covered
Data Science and Analytics
Description
This dataset is a substantial collection of over 241,000 English-language comments, gathered from various online platforms. Each comment within the dataset has been carefully annotated with a sentiment label: 0 for negative sentiment, 1 for neutral, and 2 for positive. The primary aim of this dataset is to facilitate the training and evaluation of multi-class sentiment analysis models, designed to work effectively with real-world text data. The dataset has undergone a preprocessing stage, ensuring comments are in lowercase, and are cleaned of punctuation, URLs, numbers, and stopwords, making it readily usable for Natural Language Processing (NLP) pipelines.
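
The preprocessing described above (lowercasing, then removing URLs, numbers, punctuation and stopwords) might look roughly like the sketch below. It is not the publisher's pipeline: the stopword set is a small hypothetical sample, and the regular expressions are one reasonable choice among several.

```python
import re
import string

# Hypothetical, deliberately tiny stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"the", "a", "an", "is", "it", "and", "to", "of", "this", "at"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
NUMBER_RE = re.compile(r"\d+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess(comment: str) -> str:
    """Lowercase a comment and strip URLs, numbers, punctuation and stopwords."""
    text = comment.lower()
    # URLs go first: once punctuation is stripped they can no longer be recognised.
    text = URL_RE.sub(" ", text)
    text = NUMBER_RE.sub(" ", text)
    text = text.translate(PUNCT_TABLE)
    return " ".join(word for word in text.split() if word not in STOPWORDS)


cleaned = preprocess("This is the BEST phone of 2023!!! See https://example.com/review")
```

Since the published comments are already cleaned this way, the same steps would need to be applied to any new text before it is passed to a model trained on this data.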

Columns

Comment: This column contains the user-generated text content.

Sentiment: This column provides the corresponding sentiment label for each comment, where 0 denotes Negative, 1 denotes Neutral, and 2 denotes Positive.

Distribution

The dataset comprises over 241,000 records. While the specific file format is not detailed, such datasets are typically provided in a tabular format, often as a CSV file. It is structured with two distinct columns as described above, suitable for direct integration into machine learning workflows.

Usage

This dataset is ideally suited for a variety of applications and use cases, including:
* Training sentiment classifiers utilising advanced models such as LSTM, BiLSTM, CNN, BERT, or RoBERTa.
* Evaluating the efficacy of different preprocessing and tokenisation strategies for text data.
* Benchmarking NLP models on multi-class classification tasks to assess their performance.
* Supporting educational projects and research initiatives in the fields of opinion mining or text classification.
* Fine-tuning transformer models on a large and diverse collection of sentiment-annotated text.

Coverage

The dataset's coverage is global, comprising English-language comments. It focuses on general user-generated text content without specific demographic notes. The dataset is listed with a version of 1.0.

License

CC0

Who Can Use It

This dataset is suitable for individuals and organisations involved in data science and analytics. Intended users include:
* Data Scientists and Machine Learning Engineers for developing and deploying sentiment analysis models.
* Researchers and Academics for studies in NLP, text classification, and opinion mining.
* Students undertaking educational projects in artificial intelligence and machine learning.

Dataset Name Suggestions

Multi-class Comment Sentiment Data

User Text Sentiment Collection

Online Comment Sentiment Analysis Dataset

English Sentiment Labelled Comments

Preprocessed Sentiment Dataset

Attributes

Original Data Source: Sentiment Analysis Dataset
Sentiment-Analysis
huggingface.co
Updated Feb 19, 2025
Syed Khalid Hussain (2025). Sentiment-Analysis [Dataset]. https://huggingface.co/datasets/syedkhalid076/Sentiment-Analysis
Dataset updated
Feb 19, 2025
Authors
Syed Khalid Hussain
Description
Sentiment Analysis Dataset

Overview

This dataset is designed for sentiment analysis tasks, providing labeled examples across three sentiment categories:

0: Negative
1: Neutral
2: Positive

It is suitable for training, validating, and testing text classification models in tasks such as social media sentiment analysis, customer feedback evaluation, and opinion mining.

Dataset Details Key Features

Type: CSV Language: English Labels: 0:… See the full description on the dataset page: https://huggingface.co/datasets/syedkhalid076/Sentiment-Analysis.
financial sentiment analysis dataset
kaggle.com
Updated Nov 17, 2022
Ujjwal Chowdhury (2022). financial sentiment analysis dataset [Dataset]. https://www.kaggle.com/datasets/ujjwalchowdhury/financial-sentiment-analysis-dataset
Dataset updated
Nov 17, 2022
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Ujjwal Chowdhury
Description
Dataset

This dataset was created by Ujjwal Chowdhury

Contents
Corpus CSV
figshare.com
txt
Updated Oct 15, 2021
Vinicius Takeo Friedrich Kuwaki (2021). Corpus CSV [Dataset]. http://doi.org/10.6084/m9.figshare.16745986.v1
Unique identifier
https://doi.org/10.6084/m9.figshare.16745986.v1
Dataset updated
Oct 15, 2021
Dataset provided by
figshare
Authors
Vinicius Takeo Friedrich Kuwaki
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
This file describes the corpus in CSV format, using the pipe character as separator. The file includes the following columns:
- en: The words in English that compose the sentence;
- pt_br: The words in Portuguese that compose the sentence;
- type: The type of the sentence (OBJ for objective and SUBJ for subjective);
- pol: The polarity of the sentence if it is a subjective sentence (-1, 0 or 1);
- en_path: The path in OpenSubtitles related to the sentence in English;
- pt_br_path: The path in OpenSubtitles related to the sentence in Portuguese.
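
A pipe-separated file of this kind can be read with Python's `csv` module by setting the delimiter. The sketch below assumes that the file starts with a header row carrying the column names above and that `pol` is left empty for objective sentences; neither detail is confirmed by the description, and the two sample rows are made up.

```python
import csv
import io


def read_corpus(handle):
    """Yield one dict per sentence from a pipe-separated corpus file."""
    for row in csv.DictReader(handle, delimiter="|"):
        # Polarity is only defined for subjective sentences.
        # Assumption: it is blank (or ignorable) for objective ones.
        if row["type"] == "SUBJ" and row["pol"]:
            pol = int(row["pol"])
        else:
            pol = None
        yield {**row, "pol": pol}


# Made-up rows standing in for the real file.
sample = io.StringIO(
    "en|pt_br|type|pol|en_path|pt_br_path\n"
    "I love it|Eu adoro|SUBJ|1|en/a.xml|pt_br/a.xml\n"
    "It is noon|É meio-dia|OBJ||en/b.xml|pt_br/b.xml\n"
)
corpus_rows = list(read_corpus(sample))
```

If the subtitle text contains unescaped double quotes, passing `quoting=csv.QUOTE_NONE` to the reader may be needed; whether that applies to this file is not stated.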
imdb-dataset-sentiment-analysis-in-csv-format
kaggle.com
Updated Dec 30, 2024
love-seeker (2024). imdb-dataset-sentiment-analysis-in-csv-format [Dataset]. https://www.kaggle.com/datasets/loveseeker/imdb-dataset-sentiment-analysis-in-csv-format/data
Dataset updated
Dec 30, 2024
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
love-seeker
License
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Description
Dataset

This dataset was created by love-seeker

Released under Apache 2.0

Contents

Julie R. Repository creator - Campos Arias; Julie R. Repository creator - Campos Arias (2023). Datasets for Sentiment Analysis [Dataset]. http://doi.org/10.5281/zenodo.10157504

Datasets for Sentiment Analysis

Unique identifier

https://doi.org/10.5281/zenodo.10157504

Dataset updated

Dec 10, 2023

Dataset provided by

Zenodo (http://zenodo.org/)

Authors

Julie R. Repository creator - Campos Arias; Julie R. Repository creator - Campos Arias

License

Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically

Description

This repository was created for my Master's thesis in Computational Intelligence and Internet of Things at the University of Córdoba, Spain. The purpose of this repository is to store the datasets found that were used in some of the studies that served as research material for this Master's thesis. Also, the datasets used in the experimental part of this work are included.

Below are the datasets specified, along with the details of their references, authors, and download sources.

----------- STS-Gold Dataset ----------------

The dataset consists of 2026 tweets. The file consists of 3 columns: id, polarity, and tweet. The three columns denote the unique id, polarity index of the text and the tweet text respectively.

Reference: Saif, H., Fernandez, M., He, Y., & Alani, H. (2013). Evaluation datasets for Twitter sentiment analysis: a survey and a new dataset, the STS-Gold.

File name: sts_gold_tweet.csv

----------- Amazon Sales Dataset ----------------

This dataset contains the ratings and reviews of 1K+ Amazon products, as per the details listed on the official Amazon website. The data was scraped in January 2023 from the official Amazon website.

Owner: Karkavelraja J., Postgraduate student at Puducherry Technological University (Puducherry, Puducherry, India)

Features:

product_id - Product ID
product_name - Name of the Product
category - Category of the Product
discounted_price - Discounted Price of the Product
actual_price - Actual Price of the Product
discount_percentage - Percentage of Discount for the Product
rating - Rating of the Product
rating_count - Number of people who voted for the Amazon rating
about_product - Description about the Product
user_id - ID of the user who wrote review for the Product
user_name - Name of the user who wrote review for the Product
review_id - ID of the user review
review_title - Short review
review_content - Long review
img_link - Image Link of the Product
product_link - Official Website Link of the Product

License: CC BY-NC-SA 4.0

File name: amazon.csv

----------- Rotten Tomatoes Reviews Dataset ----------------

This rating inference dataset is a sentiment classification dataset, containing 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews. On average, these reviews consist of 21 words. The first 5331 rows contain only negative samples and the last 5331 rows contain only positive samples, so the data should be shuffled before use.

This data is collected from https://www.cs.cornell.edu/people/pabo/movie-review-data/ as a txt file and converted into a csv file. The file consists of 2 columns: reviews and labels (1 for fresh (good) and 0 for rotten (bad)).
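
Because the rows are ordered by class, a plain head/tail split would put a single label in each part. A seeded shuffle avoids that and keeps the split reproducible. The sketch below uses made-up (review, label) pairs in place of data_rt.csv; the helper name and the seed value are arbitrary choices for the example.

```python
import random


def shuffle_rows(rows, seed=0):
    """Return a shuffled copy of the rows; the input list is left unchanged."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    return shuffled


# Made-up stand-in with the same layout: all negatives (0) first, then all positives (1).
rows = [(f"negative review {i}", 0) for i in range(5)]
rows += [(f"positive review {i}", 1) for i in range(5)]

shuffled = shuffle_rows(rows, seed=0)

# Split only after shuffling, so both parts contain both classes with high probability.
cut = int(0.8 * len(shuffled))
train, test = shuffled[:cut], shuffled[cut:]
```

Using a dedicated `random.Random(seed)` instance rather than the module-level functions keeps the shuffle independent of any other random state in the program.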

Reference: Bo Pang and Lillian Lee. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 115–124, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics

File name: data_rt.csv

----------- Preprocessed Dataset Sentiment Analysis ----------------

Preprocessed Amazon product review data for the Gen3EcoDot (Alexa), scraped entirely from amazon.in.
Stemmed and lemmatized using nltk.
Sentiment labels are generated using TextBlob polarity scores.

The file consists of 4 columns: index, review (stemmed and lemmatized review using nltk), polarity (score) and division (categorical label generated using polarity score).

DOI: 10.34740/kaggle/dsv/3877817

Citation: @misc{pradeesh arumadi_2022, title={Preprocessed Dataset Sentiment Analysis}, url={https://www.kaggle.com/dsv/3877817}, DOI={10.34740/KAGGLE/DSV/3877817}, publisher={Kaggle}, author={Pradeesh Arumadi}, year={2022} }

This dataset was used in the experimental phase of my research.

File name: EcoPreprocessed.csv

----------- Amazon Earphones Reviews ----------------

This dataset consists of 9930 Amazon reviews and star ratings for the 10 latest (as of mid-2019) Bluetooth earphone devices, intended for learning how to train machine learning models for sentiment analysis.

This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.

The file consists of 5 columns: ReviewTitle, ReviewBody, ReviewStar, Product and division (manually added - categorical label generated using ReviewStar score)

License: U.S. Government Works

Source: www.amazon.in

File name (original): AllProductReviews.csv (contains 14337 reviews)

File name (edited - used for my research) : AllProductReviews2.csv (contains 9930 reviews)

----------- Amazon Musical Instruments Reviews ----------------

This dataset contains 7137 comments/reviews of different musical instruments coming from Amazon.

The file consists of 10 columns: reviewerID, asin (ID of the product), reviewerName, helpful (helpfulness rating of the review), reviewText, overall (rating of the product), summary (summary of the review), unixReviewTime (time of the review - unix time), reviewTime (time of the review - raw) and division (manually added - categorical label generated using overall score).

Source: http://jmcauley.ucsd.edu/data/amazon/

File name (original): Musical_instruments_reviews.csv (contains 10261 reviews)

File name (edited - used for my research) : Musical_instruments_reviews2.csv (contains 7137 reviews)
