100+ datasets found
  1. Datasets for Sentiment Analysis

    • zenodo.org
    csv
    Updated Dec 10, 2023
    Cite
    Julie R. Campos Arias (2023). Datasets for Sentiment Analysis [Dataset]. http://doi.org/10.5281/zenodo.10157504
    Explore at:
    Available download formats: csv
    Dataset updated
    Dec 10, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Julie R. Campos Arias (Repository creator)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository was created for my Master's thesis in Computational Intelligence and Internet of Things at the University of Córdoba, Spain. The purpose of this repository is to store the datasets found that were used in some of the studies that served as research material for this Master's thesis. Also, the datasets used in the experimental part of this work are included.

    Below are the datasets specified, along with the details of their references, authors, and download sources.

    ----------- STS-Gold Dataset ----------------

    The dataset consists of 2026 tweets. The file consists of 3 columns: id, polarity, and tweet. The three columns denote the unique id, polarity index of the text and the tweet text respectively.

    Reference: Saif, H., Fernandez, M., He, Y., & Alani, H. (2013). Evaluation datasets for Twitter sentiment analysis: a survey and a new dataset, the STS-Gold.

    File name: sts_gold_tweet.csv

    ----------- Amazon Sales Dataset ----------------

    This dataset contains the ratings and reviews of more than 1,000 Amazon products, as listed on the official Amazon website. The data was scraped from the official Amazon website in January 2023.

    Owner: Karkavelraja J., Postgraduate student at Puducherry Technological University (Puducherry, Puducherry, India)

    Features:

    • product_id - Product ID
    • product_name - Name of the Product
    • category - Category of the Product
    • discounted_price - Discounted Price of the Product
    • actual_price - Actual Price of the Product
    • discount_percentage - Percentage of Discount for the Product
    • rating - Rating of the Product
    • rating_count - Number of people who voted for the Amazon rating
    • about_product - Description about the Product
    • user_id - ID of the user who wrote review for the Product
    • user_name - Name of the user who wrote review for the Product
    • review_id - ID of the user review
    • review_title - Short review
    • review_content - Long review
    • img_link - Image Link of the Product
    • product_link - Official Website Link of the Product

    License: CC BY-NC-SA 4.0

    File name: amazon.csv

    ----------- Rotten Tomatoes Reviews Dataset ----------------

    This rating inference dataset is a sentiment classification dataset containing 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews. On average, these reviews consist of 21 words. The first 5,331 rows contain only negative samples and the last 5,331 rows contain only positive samples, so the data should be shuffled before use.

    This data is collected from https://www.cs.cornell.edu/people/pabo/movie-review-data/ as a txt file and converted into a csv file. The file consists of 2 columns: reviews and labels (1 for fresh (good) and 0 for rotten (bad)).

    Reference: Bo Pang and Lillian Lee. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 115–124, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics

    File name: data_rt.csv
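
    Because the file stores all negative rows before all positive rows, a shuffle step is needed before any train/test split. The following is a minimal sketch using only the Python standard library; the column names reviews and labels come from the description above, while the function names and the fixed seed are illustrative assumptions.

```python
import csv
import random


def load_rows(path):
    """Read the CSV into a list of dicts keyed by column name (reviews, labels)."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def shuffle_rows(rows, seed=42):
    """Return a shuffled copy of rows; the input list is left untouched.

    A fixed seed (an arbitrary choice here) keeps the order reproducible.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    return shuffled
```

    For example, shuffle_rows(load_rows("data_rt.csv")) would return the rows in a reproducible random order, assuming the file is in the working directory.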

    ----------- Preprocessed Dataset Sentiment Analysis ----------------

    Preprocessed Amazon product review data for the Gen3EcoDot (Alexa), scraped entirely from amazon.in.
    Stemmed and lemmatized using nltk.
    Sentiment labels are generated using TextBlob polarity scores.

    The file consists of 4 columns: index, review (stemmed and lemmatized review using nltk), polarity (score) and division (categorical label generated using polarity score).

    DOI: 10.34740/kaggle/dsv/3877817

    Citation: @misc{pradeesh arumadi_2022, title={Preprocessed Dataset Sentiment Analysis}, url={https://www.kaggle.com/dsv/3877817}, DOI={10.34740/KAGGLE/DSV/3877817}, publisher={Kaggle}, author={Pradeesh Arumadi}, year={2022} }

    This dataset was used in the experimental phase of my research.

    File name: EcoPreprocessed.csv
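
    The division column is described as a categorical label derived from the polarity score. The description does not state the cut-off values that were used, so the sketch below is only an assumption: it treats positive scores as "positive", negative scores as "negative", and an optional band around zero as "neutral". TextBlob polarity scores lie in the range -1.0 to 1.0; the function name, the label strings, and the neutral_band parameter are hypothetical.

```python
def polarity_to_division(polarity, neutral_band=0.0):
    """Map a polarity score in [-1.0, 1.0] to a categorical label.

    neutral_band is an assumed half-width around zero that counts as neutral;
    the dataset description does not say which thresholds were actually used.
    """
    if polarity > neutral_band:
        return "positive"
    if polarity < -neutral_band:
        return "negative"
    return "neutral"
```

    With the default band of 0.0, only a score of exactly zero is labelled neutral; widening the band makes weakly polarised reviews neutral as well.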

    ----------- Amazon Earphones Reviews ----------------

    This dataset consists of 9,930 Amazon reviews with star ratings for the 10 latest (as of mid-2019) Bluetooth earphone devices, intended for learning how to train machines for sentiment analysis.

    This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.

    The file consists of 5 columns: ReviewTitle, ReviewBody, ReviewStar, Product and division (manually added - categorical label generated using ReviewStar score)

    License: U.S. Government Works

    Source: www.amazon.in

    File name (original): AllProductReviews.csv (contains 14337 reviews)

    File name (edited - used for my research) : AllProductReviews2.csv (contains 9930 reviews)

    ----------- Amazon Musical Instruments Reviews ----------------

    This dataset contains 7137 comments/reviews of different musical instruments coming from Amazon.

    This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.

    The file consists of 10 columns: reviewerID, asin (ID of the product), reviewerName, helpful (helpfulness rating of the review), reviewText, overall (rating of the product), summary (summary of the review), unixReviewTime (time of the review - unix time), reviewTime (time of the review - raw) and division (manually added - categorical label generated using overall score).

    Source: http://jmcauley.ucsd.edu/data/amazon/

    File name (original): Musical_instruments_reviews.csv (contains 10261 reviews)

    File name (edited - used for my research) : Musical_instruments_reviews2.csv (contains 7137 reviews)

  2. Twitter Tweets Sentiment Dataset

    • kaggle.com
    • opendatabay.com
    Updated Apr 8, 2022
    Cite
    M Yasser H (2022). Twitter Tweets Sentiment Dataset [Dataset]. https://www.kaggle.com/datasets/yasserh/twitter-tweets-sentiment-dataset
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Apr 8, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    M Yasser H
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description


    Description:

    Twitter is an online social media platform where people share their thoughts as tweets. Some people misuse it to tweet hateful content. Twitter is trying to tackle this problem, and we shall help by creating a strong NLP-based classifier model to distinguish negative tweets and block them. Can you build a strong classifier model to predict the same?

    Each row contains the text of a tweet and a sentiment label. In the training set you are provided with a word or phrase drawn from the tweet (selected_text) that encapsulates the provided sentiment.

    Make sure, when parsing the CSV, to remove the beginning / ending quotes from the text field, to ensure that you don't include them in your training.

    You're attempting to predict the word or phrase from the tweet that exemplifies the provided sentiment. The word or phrase should include all characters within that span (i.e. including commas, spaces, etc.)

    Columns:

    1. textID - unique ID for each piece of text
    2. text - the text of the tweet
    3. sentiment - the general sentiment of the tweet
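
    The note above about removing beginning and ending quotes from the text field can be sketched as follows. This is a minimal example using the standard csv module; the column names textID, text and sentiment come from the list above, while the function name and the extra whitespace trimming are assumptions made for illustration.

```python
import csv
import io


def read_tweets(csv_text):
    """Parse CSV text and clean the text column of each row."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # The csv module already removes the quotes that delimit a field.
        # This extra step drops quote characters (and surrounding whitespace)
        # that survive at the start or end of the field value itself.
        row["text"] = (row.get("text") or "").strip().strip('"').strip()
        rows.append(row)
    return rows
```

    To read a file instead of a string, the same loop can be run over csv.DictReader(open(path, newline="", encoding="utf-8")).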

    Acknowledgement:

    The dataset was downloaded from Kaggle Competitions:
    https://www.kaggle.com/c/tweet-sentiment-extraction/data?select=train.csv

    Objective:

    • Understand the Dataset & cleanup (if required).
    • Build classification models to predict the twitter sentiments.
    • Compare the evaluation metrics of various classification algorithms.
  3. Twitter Sentiments Dataset

    • data.mendeley.com
    Updated May 14, 2021
    + more versions
    Cite
    SHERIF HUSSEIN (2021). Twitter Sentiments Dataset [Dataset]. http://doi.org/10.17632/z9zw7nt5h2.1
    Explore at:
    Dataset updated
    May 14, 2021
    Authors
    SHERIF HUSSEIN
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset has three sentiments namely, negative, neutral, and positive. It contains two fields for the tweet and label.

  4. Brussel mobility Twitter sentiment analysis CSV Dataset

    • data.niaid.nih.gov
    Updated May 31, 2024
    Cite
    van Vessem, Charlotte (2024). Brussel mobility Twitter sentiment analysis CSV Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_11401123
    Explore at:
    Dataset updated
    May 31, 2024
    Dataset provided by
    Tori, Floriano
    van Vessem, Charlotte
    Betancur Arenas, Juliana
    Ginis, Vincent
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Brussels
    Description

    SSH CENTRE (Social Sciences and Humanities for Climate, Energy aNd Transport Research Excellence) is a Horizon Europe project, engaging directly with stakeholders across research, policy, and business (including citizens) to strengthen social innovation, SSH-STEM collaboration, transdisciplinary policy advice, inclusive engagement, and SSH communities across Europe, accelerating the EU’s transition to carbon neutrality. SSH CENTRE is based on a range of activities related to Open Science, inclusivity and diversity – especially with regard to Southern and Eastern Europe and different career stages – including: development of novel SSH-STEM collaborations to facilitate the delivery of the EU Green Deal; SSH knowledge brokerage to support regions in transition; and the effective design of strategies for citizen engagement in EU R&I activities. Outputs include action-led agendas and building stakeholder synergies through regular Policy Insight events. This is captured in a high-profile virtual SSH CENTRE generating and sharing best practice for SSH policy advice, overcoming fragmentation to accelerate the EU’s journey to a sustainable future.

    The documents uploaded here are part of WP2, whereby novel, interdisciplinary teams were provided funding to undertake activities to develop a policy recommendation related to EU Green Deal policy. Each of these policy recommendations, and the activities that inform them, will be written up as a chapter in an edited book collection. Three books will make up this edited collection: one on climate, one on energy and one on mobility.

    As part of writing a chapter for the SSH CENTRE book on ‘Mobility’, we set out to analyse the sentiment of users on Twitter regarding shared and active mobility modes in Brussels. This involved collecting tweets between 2017 and 2022. A tweet was collected if it contained a previously defined mobility keyword (for example: metro) and either the name of a (local) politician, a neighbourhood or municipality, or a (shared) mobility provider. The file attached to this Zenodo webpage is a csv file containing the tweets collected.
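
    The collection rule described above (a mobility keyword AND at least one of a politician, place, or provider name) can be sketched as a simple filter. This is not the authors' collection code: the function name is hypothetical, matching is done by plain case-insensitive substring search, and apart from "metro" any term lists passed in are illustrative rather than taken from the dataset.

```python
def should_collect(tweet_text, mobility_keywords, context_terms):
    """Return True if the tweet has a mobility keyword AND a context term.

    context_terms stands for the combined list of politician names,
    neighbourhoods or municipalities, and mobility providers.
    """
    text = tweet_text.lower()
    has_keyword = any(k.lower() in text for k in mobility_keywords)
    has_context = any(c.lower() in text for c in context_terms)
    return has_keyword and has_context
```

    Substring matching is the simplest option and will also match inside longer words; a real pipeline might prefer token-level or regular-expression matching.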

  5. Sentiment Analysis on Financial Tweets

    • kaggle.com
    zip
    Updated Sep 5, 2019
    Cite
    Vivek Rathi (2019). Sentiment Analysis on Financial Tweets [Dataset]. https://www.kaggle.com/datasets/vivekrathi055/sentiment-analysis-on-financial-tweets
    Explore at:
    Available download formats: zip (2538259 bytes)
    Dataset updated
    Sep 5, 2019
    Authors
    Vivek Rathi
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    Context

    The following information can also be found at https://www.kaggle.com/davidwallach/financial-tweets. Out of curiosity, I just cleaned the .csv files to perform a sentiment analysis. So both the .csv files in this dataset were created by me.

    Everything else in the description was written by David Wallach, and I used all this information to perform my first ever sentiment analysis.

    "I have been interested in using public sentiment and journalism to gather sentiment profiles on publicly traded companies. I first developed a Python package (https://github.com/dwallach1/Stocker) that scrapes the web for articles written about companies, and then noticed the abundance of overlap with Twitter. I then developed a NodeJS project that I have been running on my RaspberryPi to monitor Twitter for all tweets coming from those mentioned in the content section. If one of them tweeted about a company in the stocks_cleaned.csv file, then it would write the tweet to the database. Currently, the file is only from earlier today, but after about a month or two, I plan to update the tweets.csv file (hopefully closer to 50,000 entries).

    I am not quite sure how this dataset will be relevant, but I hope to use these tweets and try to generate some sense of public sentiment score."

    Content

    This dataset has all the publicly traded companies (tickers and company names) that were used as input to fill the tweets.csv. The influencers whose tweets were monitored were: ['MarketWatch', 'business', 'YahooFinance', 'TechCrunch', 'WSJ', 'Forbes', 'FT', 'TheEconomist', 'nytimes', 'Reuters', 'GerberKawasaki', 'jimcramer', 'TheStreet', 'TheStalwart', 'TruthGundlach', 'Carl_C_Icahn', 'ReformedBroker', 'benbernanke', 'bespokeinvest', 'BespokeCrypto', 'stlouisfed', 'federalreserve', 'GoldmanSachs', 'ianbremmer', 'MorganStanley', 'AswathDamodaran', 'mcuban', 'muddywatersre', 'StockTwits', 'SeanaNSmith']

    Acknowledgements

    The data used here is gathered from a project I developed : https://github.com/dwallach1/StockerBot

    Inspiration

    I hope to develop a financial sentiment text classifier that would be able to track Twitter's (and the entire public's) feelings about any publicly traded company (and cryptocurrency)

  6. Twitter Public Sentiment Dataset

    • opendatabay.com
    Updated Jul 6, 2025
    + more versions
    Cite
    Datasimple (2025). Twitter Public Sentiment Dataset [Dataset]. https://www.opendatabay.com/data/ai-ml/04ea3224-1b10-48d4-871a-496c9a2633ff
    Explore at:
    Dataset updated
    Jul 6, 2025
    Dataset authored and provided by
    Datasimple
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Telecommunications & Network Data
    Description

    This dataset provides a collection of 1000 tweets designed for sentiment analysis. The tweets were sourced from Twitter using Python and systematically generated using various modules to ensure a balanced representation of different tweet types, user behaviours, and sentiments. This includes the use of a random module for IDs and text, a faker module for usernames and dates, and a textblob module for assigning sentiment. The dataset's purpose is to offer a robust foundation for analysing and visualising sentiment trends and patterns, aiding in the initial exploration of data and the identification of significant patterns or trends.

    Columns

    • Tweet ID: A unique identifier assigned to each individual tweet.
    • Text: The actual textual content of the tweet.
    • User: The username of the individual who posted the tweet.
    • Created At: The date and time when the tweet was originally published.
    • Likes: The total number of likes or approvals the tweet received.
    • Retweets: The total count of times the tweet was shared by other users.
    • Sentiment: The categorised emotional tone of the tweet, typically labelled as positive, neutral, or negative.

    Distribution

    The dataset is provided in a CSV file format. It consists of 1000 individual tweet records, structured in a tabular layout with the columns detailed above. A sample file will be made available separately on the platform.

    Usage

    This dataset is ideal for:

    • Analysing and visualising sentiment trends and patterns in social media.
    • Initial data exploration to uncover insights into tweet characteristics and user emotions.
    • Identifying underlying patterns or trends within social media conversations.
    • Developing and training machine learning models for sentiment classification.
    • Academic research into Natural Language Processing (NLP) and social media dynamics.
    • Educational purposes, allowing students to practise data analysis and visualisation techniques.

    Coverage

    The dataset spans tweets created between January and April 2023, as observed from the included data samples. While specific geographic or demographic information for users is not available within the dataset, the nature of Twitter implies a general global scope, reflecting a variety of user behaviours and sentiments without specific regional or population group focus.

    License

    CC0

    Who Can Use It

    This dataset is valuable for:

    • Data Scientists and Machine Learning Engineers working on NLP tasks and model development.
    • Researchers in fields such as Natural Language Processing, Machine Learning Algorithms, Deep Learning, and Computer Science.
    • Data Analysts looking to extract insights from social media content.
    • Academics and Students undertaking projects related to sentiment analysis or social media studies.
    • Anyone interested in understanding online sentiment and user behaviour on social media platforms.

    Dataset Name Suggestions

    • Twitter Public Sentiment Dataset
    • Social Media Text Sentiment Analysis
    • General Tweet Mood Data
    • Twitter Sentiment Collection 2023
    • Microblog Sentiment Dataset

    Attributes

    Original Data Source: Twitter Sentiment Analysis using Roberta and Vader

  7. multiclass-sentiment-analysis-dataset

    • huggingface.co
    Updated Jul 14, 2023
    Cite
    Shahriar Parvez (2023). multiclass-sentiment-analysis-dataset [Dataset]. https://huggingface.co/datasets/Sp1786/multiclass-sentiment-analysis-dataset
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jul 14, 2023
    Authors
    Shahriar Parvez
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset Card for Dataset Name

    Dataset Summary

    This dataset card aims to be a base template for new datasets. It has been generated using this raw template.

    Supported Tasks and Leaderboards: [More Information Needed]
    Languages: [More Information Needed]

    Dataset Structure

    Data Instances: [More Information Needed]
    Data Fields: [More Information Needed]
    Data Splits: [More Information Needed]

    Dataset Creation… See the full description on the dataset page: https://huggingface.co/datasets/Sp1786/multiclass-sentiment-analysis-dataset.
  8. coarse-grained sentiment analysis.csv

    • figshare.com
    txt
    Updated Nov 7, 2023
    Cite
    chen duan; Zhengwei Huang (2023). coarse-grained sentiment analysis.csv [Dataset]. http://doi.org/10.6084/m9.figshare.21508251.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Nov 7, 2023
    Dataset provided by
    figshare
    Authors
    chen duan; Zhengwei Huang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This study selects the customer service conversation dataset from Jing Dong (JD)[1] to evaluate our proposed model. In the dataset, the number 0 represents the statement issued by the customer, the number 1 represents the statement issued by customer service and chatbots, and the "session-id" represents multi-turn sessions between the same customer service agents and customers. This paper selected about 10,000 dialogue texts for training, validation, and testing, including 5,068 sentences issued by customers and 4,942 sentences issued by customer service and chatbots.

    [1] https://www.jd.com
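
    The speaker coding and session grouping described above can be illustrated with a short sketch. The "session-id" field and the 0/1 coding come from the description; the other column names (speaker, text) and the function names are hypothetical, since the file's actual header is not given here.

```python
from collections import defaultdict

CUSTOMER = 0  # statement issued by the customer
SERVICE = 1   # statement issued by customer service or a chatbot


def group_by_session(rows):
    """Group (speaker, text) turns by session id, keeping the original order."""
    sessions = defaultdict(list)
    for row in rows:
        sessions[row["session-id"]].append((int(row["speaker"]), row["text"]))
    return dict(sessions)


def count_speakers(rows):
    """Count how many statements each side issued."""
    counts = {CUSTOMER: 0, SERVICE: 0}
    for row in rows:
        counts[int(row["speaker"])] += 1
    return counts
```

    Applied to the full file, count_speakers would be expected to reproduce the split reported above (5,068 customer sentences and 4,942 service sentences), provided the column names are adapted to the real header.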

  9. AWARE: Dataset for Aspect-Based Sentiment Analysis of Apps Reviews

    • data.niaid.nih.gov
    • explore.openaire.eu
    Updated Jan 25, 2022
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Hamoud Aljamaan (2022). AWARE: Dataset for Aspect-Based Sentiment Analysis of Apps Reviews [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5528480
    Explore at:
    Dataset updated
    Jan 25, 2022
    Dataset provided by
    Malak Baslyman
    Hamoud Aljamaan
    Nouf Alturaief
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The peer-reviewed paper of AWARE dataset is published in ASEW 2021, and can be accessed through: http://doi.org/10.1109/ASEW52652.2021.00049. Kindly cite this paper when using AWARE dataset.

    Aspect-Based Sentiment Analysis (ABSA) aims to identify the opinion (sentiment) with respect to a specific aspect. Since there is a lack of smartphone app review datasets annotated to support the ABSA task, we present AWARE: ABSA Warehouse of Apps REviews.

    AWARE contains apps reviews from three different domains (Productivity, Social Networking, and Games), as each domain has its distinct functionalities and audience. Each sentence is annotated with three labels, as follows:

    • Aspect Term: a term that exists in the sentence and describes an aspect of the app that is expressed by the sentiment. A term value of “N/A” means that the term is not explicitly mentioned in the sentence.
    • Aspect Category: one of the pre-defined set of domain-specific categories that represent an aspect of the app (e.g., security, usability, etc.).
    • Sentiment: positive or negative.

    Note: the games domain does not contain aspect terms.

    We provide a comprehensive dataset of 11323 sentences from the three domains, where each sentence is additionally annotated with a Boolean value indicating whether the sentence expresses a positive/negative opinion. In addition, we provide three separate datasets, one for each domain, containing only sentences that express opinions. The file named “AWARE_metadata.csv” contains a description of the dataset’s columns.
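
    As an illustration of the annotation scheme described above, one annotated sentence could be modelled as below. This is only a sketch: the class and field names are hypothetical (the real column names are documented in AWARE_metadata.csv), and only the "N/A" convention, the three labels, and the Boolean opinion flag are taken from the description.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AnnotatedSentence:
    sentence: str
    is_opinion: bool                       # Boolean opinion / not-opinion flag
    aspect_term: Optional[str] = None      # "N/A" means the term is implicit
    aspect_category: Optional[str] = None  # e.g. "security", "usability"
    sentiment: Optional[str] = None        # "positive" or "negative"


def has_explicit_aspect(item):
    """True when the aspect term is actually mentioned in the sentence."""
    return item.aspect_term not in (None, "N/A")


def opinions_only(items):
    """Keep only sentences flagged as expressing an opinion."""
    return [item for item in items if item.is_opinion]
```

    The two helpers correspond to the Explicit/Implicit Aspect Term and Opinion/Not-Opinion distinctions that the dataset is annotated for.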

    How can AWARE be used?

    We designed AWARE so that it can serve various tasks. The tasks include, but are not limited to:

    • Sentiment Analysis.
    • Aspect Term Extraction.
    • Aspect Category Classification.
    • Aspect Sentiment Analysis.
    • Explicit/Implicit Aspect Term Classification.
    • Opinion/Not-Opinion Classification.

    Furthermore, researchers can experiment with and investigate the effects of different domains on users' feedback.

  10. Processed twitter sentiment Dataset | Added Tokens

    • kaggle.com
    Updated Aug 21, 2024
    Cite
    Halemo GPA (2024). Processed twitter sentiment Dataset | Added Tokens [Dataset]. http://doi.org/10.34740/kaggle/ds/5568348
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Aug 21, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Halemo GPA
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    This dataset is a processed version of the Sentiment140 corpus, containing 1.6 million tweets with binary sentiment labels. The original data has been cleaned, tokenized, and prepared for natural language processing (NLP) and machine learning tasks. It provides a rich resource for sentiment analysis, text classification, and other NLP applications. The dataset includes the full processed corpus (train-processed.csv) and a smaller sample of 10,000 tweets (train-processed-sample.csv) for quick experimentation and model prototyping.

    Key Features:

    • 1.6 million labeled tweets
    • Binary sentiment classification (0 for negative, 1 for positive)
    • Preprocessed and tokenized text
    • Balanced class distribution
    • Suitable for various NLP tasks and model architectures

    Citation

    If you use this dataset in your research or project, please cite the original Sentiment140 dataset: Go, A., Bhayani, R. and Huang, L., 2009. Twitter sentiment classification using distant supervision. CS224N Project Report, Stanford, 1(2009), p.12.

  11. Twitter Sentiment Classification Data

    • opendatabay.com
    Updated Jul 2, 2025
    Cite
    Datasimple (2025). Twitter Sentiment Classification Data [Dataset]. https://www.opendatabay.com/data/ai-ml/89d10076-3c7d-4857-8c75-0b284a9a7f06
    Explore at:
    Dataset updated
    Jul 2, 2025
    Dataset authored and provided by
    Datasimple
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Social Media and Networking
    Description

    This dataset provides a collection of tweets, each categorised by its sentiment. It is designed to assist in developing and evaluating machine learning models, particularly for natural language processing tasks. The primary aim is to distinguish between different sentiments expressed in tweets, helping to address issues like harmful content by enabling the creation of robust classifier models. Each entry includes the tweet text and its corresponding sentiment label, with a specific focus on identifying the exact word or phrase within the tweet that encapsulates that sentiment.

    Columns

    • textID: A unique identifier for each tweet entry.
    • text: The full content of the tweet.
    • selected_text: The specific part of the tweet that best represents the given sentiment.
    • sentiment: The overall sentiment expressed in the tweet, categorised as neutral, positive, or other.

    Distribution

    The dataset contains approximately 27,500 tweets. It is typically provided in a CSV file format. The textID and text columns each contain 27,481 unique values, while the selected_text column has 22,464 unique values. The sentiment distribution is as follows: 40% are neutral, 31% are positive, and 28% fall into other sentiment categories. When processing the data from the CSV, it is important to remove any beginning or ending quotation marks from the text fields.

    Usage

    This dataset is ideally suited for tasks involving sentiment analysis and text classification. It can be used to build and train classification models that predict the sentiment of Twitter tweets. Furthermore, it allows for the comparison and evaluation of various classification algorithms based on their performance metrics in predicting sentiments. It is particularly useful for developing strong NLP-based classifier models to identify and categorise tweets by sentiment.

    Coverage

    The data originates from a global platform, Twitter, and the sentiment analysis is applicable across a wide range of content. The dataset's structure allows for analysis of sentiments in tweets, covering various topics and expressions globally. No specific time range or demographic scope is detailed beyond its global applicability.

    License

    CC0

    Who Can Use It

    This dataset is suitable for a diverse range of users, including beginners in data science and machine learning. It is especially beneficial for those interested in social network analysis, text classification, and natural language processing. Intended users include data scientists, researchers, and developers looking to build and test models for predicting social media sentiments or for applications like content moderation.

    Dataset Name Suggestions

    • Twitter Tweet Sentiment Dataset
    • Tweet Sentiment Analysis Dataset
    • Social Media Sentiment Prediction Data
    • Twitter Sentiment Classification Data

    Attributes

    Original Data Source: Twitter Tweets Sentiment Dataset

  12. Twitter Sentiment Analysis Data

    • ieee-dataport.org
    Updated Aug 6, 2024
    Cite
    Rabindra Lamsal (2024). Twitter Sentiment Analysis Data [Dataset]. https://ieee-dataport.org/documents/twitter-sentiment-analysis-data
    Explore at:
    Dataset updated
    Aug 6, 2024
    Authors
    Rabindra Lamsal
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    because of COVID-19

  13. Mobile review dataset for aspect level sentiment analysis

    • ieee-dataport.org
    Updated Sep 17, 2024
    Cite
    Piyush Soni (2024). Mobile review dataset for aspect level sentiment analysis [Dataset]. https://ieee-dataport.org/documents/mobile-review-dataset-aspect-level-sentiment-analysis
    Explore at:
    Dataset updated
    Sep 17, 2024
    Authors
    Piyush Soni
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    camera

  14. Sentiment analysis of tech media articles using VADER package and...

    • live.european-language-grid.eu
    csv
    Updated Aug 16, 2023
    Cite
    (2023). Sentiment analysis of tech media articles using VADER package and co-occurrence analysis [Dataset]. https://live.european-language-grid.eu/catalogue/lcr/1351
    Explore at:
    Available download formats: csv
    Dataset updated
    Aug 16, 2023
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Sentiment analysis of tech media articles using VADER package and co-occurrence analysis

    Sources: more than 140k articles (01.2016-03.2019):

    Gigaom 0.5%

    Euractiv 0.9%

    The Conversation 1.3%

    Politico Europe 1.3%

    IEEE Spectrum 1.8%

    Techforge 4.3%

    Fastcompany 4.5%

    The Guardian (Tech) 9.2%

    Arstechnica 10.0%

    Reuters 11%

    Gizmodo 17.5%

    ZDNet 18.3%

    The Register 19.5%

    Methodology

    The sentiment analysis has been prepared using VADER*, an open-source lexicon and rule-based sentiment analysis tool. VADER is specifically designed for social media analysis, but can also be applied to other text sources. The sentiment lexicon was compiled using various sources (other sentiment data sets, Twitter etc.) and was validated by human input. The advantage of VADER is that the rule-based engine includes word-order sensitive relations and degree modifiers.

    As VADER is more robust in the case of shorter social media texts, the analysed articles have been divided into paragraphs. The analysis has been carried out for the social issues presented in the co-occurrence exercise.

    The process included the following main steps:

    1. The 100 most frequently co-occurring terms are identified for every social issue (using the co-occurrence methodology)

    2. The articles containing the given social issue and co-occurring term are identified

    3. The identified articles are divided into paragraphs

    4. Social issue and co-occurring words are removed from the paragraph

    5. The VADER sentiment analysis is carried out for every identified and modified paragraph

    6. The average for the given word pair is calculated for the final result

    Therefore, the procedure has been repeated for 100 words for all identified social issues.

    The sentiment analysis resulted in a compound score for every paragraph. The score is calculated from the sum of the valence scores of each word in the paragraph, and normalised between the values -1 (most extreme negative) and +1 (most extreme positive). Finally, the average is calculated from the paragraph results. Removal of terms is meant to exclude sentiment of the co-occurring word itself, because the word may be misleading, e.g. when some technologies or companies attempt to solve a negative issue. The neighbourhood's scores would be positive, but the negative term would bring the paragraph's score down.
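The per-pair scoring described above can be sketched roughly as follows. This is a simplified illustration, not the authors' pipeline: the toy lexicon and its valence values are invented for the example, and the function names are hypothetical. The normalisation `x / sqrt(x^2 + alpha)` with `alpha = 15` mirrors the one used in the VADER reference implementation; the real tool additionally applies rules for negation, degree modifiers and punctuation, which are omitted here.

```python
import math

# Hypothetical toy lexicon with invented valence values; the real VADER
# lexicon contains thousands of human-rated entries.
TOY_LEXICON = {"good": 1.9, "great": 3.1, "bad": -2.5, "terrible": -2.1}


def normalise(score_sum, alpha=15.0):
    """Squash an unbounded valence sum into the open interval (-1, 1)."""
    return score_sum / math.sqrt(score_sum * score_sum + alpha)


def paragraph_score(paragraph, removed_terms):
    """Compound-style score of one paragraph after dropping the given terms."""
    tokens = [t.strip(".,;:!?").lower() for t in paragraph.split()]
    kept = [t for t in tokens if t not in removed_terms]
    return normalise(sum(TOY_LEXICON.get(t, 0.0) for t in kept))


def pair_score(paragraphs, issue, cooc_term):
    """Average paragraph score for one (social issue, co-occurring term) pair."""
    removed = {issue.lower(), cooc_term.lower()}
    scores = [paragraph_score(p, removed) for p in paragraphs]
    return sum(scores) / len(scores) if scores else 0.0


paragraphs = [
    "Privacy is a terrible problem.",
    "Encryption is great for privacy.",
]
print(pair_score(paragraphs, "privacy", "encryption"))
```

Removing the issue and the co-occurring term before scoring means only the surrounding words contribute to the paragraph's score, which is the effect the description aims for.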

    The presented tables include the most extreme co-occurring terms for the analysed social issue. The examples are chosen from the lists of the 30 most positive and 30 most negative words. The presented graphs show the evolution of sentiments for social issues. The analysed paragraphs are selected in the following way:

    The articles containing the given social issue are identified

    The paragraphs containing the social issue are selected for sentiment analysis

    *Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text. Eighth International Conference on Weblogs and Social Media (ICWSM-14). Ann Arbor, MI, June 2014.

    Files

    sentiments_mod11.csv sentiment score based on chosen unigrams

    sentiments_mod22.csv sentiment score based on chosen bigrams

    sentiments_cooc_mod11.csv, sentiments_cooc_mod12.csv, sentiments_cooc_mod21.csv, sentiments_cooc_mod22.csv combinations of co-occurrences: unigrams-unigrams, unigrams-bigrams, bigrams-unigrams, bigrams-bigrams

  15. Text Classification Dataset

    • opendatabay.com
    Updated Jun 6, 2025
    Cite
    Opendatabay (2025). Text Classification Dataset [Dataset]. https://www.opendatabay.com/data/ai-ml/1775ad0d-be0d-49c9-bbc1-f94a8a5c8355
    Explore at:
    Dataset updated
    Jun 6, 2025
    Dataset authored and provided by
    Opendatabay
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Education & Learning Analytics
    Description

    A curated dataset of 241,000+ English-language comments labeled for sentiment (negative, neutral, positive). Ideal for training and evaluating NLP models in sentiment analysis.

    Dataset Features

    1. text: Contains individual English-language comments or posts sourced from various online platforms.

    2. label: Represents the sentiment classification assigned to each comment. It uses the following encoding:

    • 0: Negative sentiment
    • 1: Neutral sentiment
    • 2: Positive sentiment

    Distribution

    • Format: CSV (Comma-Separated Values)
    • 2 Columns: text (the comment content) and label (sentiment classification: 0 = Negative, 1 = Neutral, 2 = Positive)
    • File Size: Approximately 23.9 MB
    • Structure: Each row contains a single comment and its corresponding sentiment label.
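As a quick sanity check before training, the label encoding above can be decoded and the class balance counted. The sketch below assumes the CSV has a header row with the column names `text` and `label` as listed; the sample rows are invented for illustration.

```python
import csv
import io
from collections import Counter

# Encoding taken from the dataset description: 0/1/2.
LABEL_NAMES = {0: "negative", 1: "neutral", 2: "positive"}


def label_distribution(csv_file):
    """Count rows per sentiment class in a CSV with 'text' and 'label' columns."""
    reader = csv.DictReader(csv_file)
    return Counter(LABEL_NAMES[int(row["label"])] for row in reader)


# Invented sample data standing in for the real file.
sample = io.StringIO("text,label\nlove it,2\nit is fine,1\nawful,0\nbrilliant,2\n")
print(label_distribution(sample))
```

For the real file, pass an open file handle (opened with `newline=""`) instead of the in-memory sample.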

    Usage

    This dataset is ideal for a variety of applications:

    • Sentiment Analysis Model Training: Train machine learning or deep learning models to classify text as positive, negative, or neutral.

    • Text Classification Projects: Use as a labeled dataset for supervised learning in text classification tasks.

    • Customer Feedback Analysis: Train models to automatically interpret user reviews, support tickets, or survey responses.

    Coverage

    • Geographic Coverage: Primarily English-language content from global online platforms

    • Time Range: The exact time range of data collection is unspecified; however, the dataset reflects contemporary online language patterns and sentiment trends typically observed in the 2010s to early 2020s.

    • Demographics: Specific demographic information (e.g., age, gender, location, industry) is not included in the dataset, as the focus is purely on textual sentiment rather than user profiling.

    License

    CC0

    Who Can Use It

    • Data Scientists: For training machine learning models.
    • Researchers: For academic or scientific studies.
    • Businesses: For analysis, insights, or AI development.

  16. NLP Preprocessed Sentiment Dataset

    • opendatabay.com
    Updated Jul 3, 2025
    Cite
    Datasimple (2025). NLP Preprocessed Sentiment Dataset [Dataset]. https://www.opendatabay.com/data/ai-ml/6323a1b5-7112-49bd-ad55-c1ef6968abc3
    Explore at:
    Dataset updated
    Jul 3, 2025
    Dataset authored and provided by
    Datasimple
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Data Science and Analytics
    Description

    This dataset is a substantial collection of over 241,000 English-language comments, gathered from various online platforms. Each comment within the dataset has been carefully annotated with a sentiment label: 0 for negative sentiment, 1 for neutral, and 2 for positive. The primary aim of this dataset is to facilitate the training and evaluation of multi-class sentiment analysis models, designed to work effectively with real-world text data. The dataset has undergone a preprocessing stage, ensuring comments are in lowercase, and are cleaned of punctuation, URLs, numbers, and stopwords, making it readily usable for Natural Language Processing (NLP) pipelines.
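The preprocessing steps named above (lowercasing, then removal of URLs, numbers, punctuation and stopwords) might look roughly like this. It is a minimal sketch under stated assumptions, not the dataset author's actual script: the stopword list is a tiny hypothetical one, and the order of the steps is a choice made for this example.

```python
import re
import string

# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"a", "an", "the", "is", "it", "at", "and", "to", "of", "this"}

URL_RE = re.compile(r"(https?://\S+|www\.\S+)")
NUMBER_RE = re.compile(r"\d+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess(comment):
    """Lowercase a comment and strip URLs, digits, punctuation and stopwords."""
    text = comment.lower()
    text = URL_RE.sub(" ", text)      # remove URLs before punctuation is stripped
    text = NUMBER_RE.sub(" ", text)   # remove digit runs
    text = text.translate(PUNCT_TABLE)
    return " ".join(tok for tok in text.split() if tok not in STOPWORDS)


print(preprocess("This is GREAT!!! See https://example.com at 10am."))
```

URLs are removed first because stripping punctuation beforehand would break them into fragments that no longer match the pattern.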

    Columns

    • Comment: This column contains the user-generated text content.
    • Sentiment: This column provides the corresponding sentiment label for each comment, where 0 denotes Negative, 1 denotes Neutral, and 2 denotes Positive.

    Distribution

    The dataset comprises over 241,000 records. While the specific file format is not detailed, such datasets are typically provided in a tabular format, often as a CSV file. It is structured with two distinct columns as described above, suitable for direct integration into machine learning workflows.

    Usage

    This dataset is ideally suited for a variety of applications and use cases, including:

    • Training sentiment classifiers utilising advanced models such as LSTM, BiLSTM, CNN, BERT, or RoBERTa.
    • Evaluating the efficacy of different preprocessing and tokenisation strategies for text data.
    • Benchmarking NLP models on multi-class classification tasks to assess their performance.
    • Supporting educational projects and research initiatives in the fields of opinion mining or text classification.
    • Fine-tuning transformer models on a large and diverse collection of sentiment-annotated text.

    Coverage

    The dataset's coverage is global, comprising English-language comments. It focuses on general user-generated text content without specific demographic notes. The dataset is listed with a version of 1.0.

    License

    CC0

    Who Can Use It

    This dataset is suitable for individuals and organisations involved in data science and analytics. Intended users include:

    • Data Scientists and Machine Learning Engineers for developing and deploying sentiment analysis models.
    • Researchers and Academics for studies in NLP, text classification, and opinion mining.
    • Students undertaking educational projects in artificial intelligence and machine learning.

    Dataset Name Suggestions

    • Multi-class Comment Sentiment Data
    • User Text Sentiment Collection
    • Online Comment Sentiment Analysis Dataset
    • English Sentiment Labelled Comments
    • Preprocessed Sentiment Dataset

    Attributes

    Original Data Source: Sentiment Analysis Dataset

  17. Sentiment-Analysis

    • huggingface.co
    Updated Feb 19, 2025
    Cite
    Syed Khalid Hussain (2025). Sentiment-Analysis [Dataset]. https://huggingface.co/datasets/syedkhalid076/Sentiment-Analysis
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 19, 2025
    Authors
    Syed Khalid Hussain
    Description

    Sentiment Analysis Dataset

    Overview

    This dataset is designed for sentiment analysis tasks, providing labeled examples across three sentiment categories:

    • 0: Negative
    • 1: Neutral
    • 2: Positive

    It is suitable for training, validating, and testing text classification models in tasks such as social media sentiment analysis, customer feedback evaluation, and opinion mining.

    Dataset Details

    Key Features

    Type: CSV Language: English Labels: 0:… See the full description on the dataset page: https://huggingface.co/datasets/syedkhalid076/Sentiment-Analysis.

  18. financial sentiment analysis dataset

    • kaggle.com
    Updated Nov 17, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Ujjwal Chowdhury (2022). financial sentiment analysis dataset [Dataset]. https://www.kaggle.com/datasets/ujjwalchowdhury/financial-sentiment-analysis-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 17, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Ujjwal Chowdhury
    Description

    Dataset

    This dataset was created by Ujjwal Chowdhury

    Contents

  19. Corpus CSV

    • figshare.com
    txt
    Updated Oct 15, 2021
    Cite
    Vinicius Takeo Friedrich Kuwaki (2021). Corpus CSV [Dataset]. http://doi.org/10.6084/m9.figshare.16745986.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Oct 15, 2021
    Dataset provided by
    figshare
    Authors
    Vinicius Takeo Friedrich Kuwaki
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This file describes the corpus in CSV format, using the pipe character as the separator. The file includes the following columns:

    • en: The words in English that compose the sentence
    • pt_br: The words in Portuguese that compose the sentence
    • type: The type of the sentence (OBJ for objective and SUBJ for subjective)
    • pol: The polarity of the sentence if it is a subjective sentence (-1, 0 or 1)
    • en_path: The path in OpenSubtitles related to the sentence in English
    • pt_br_path: The path in OpenSubtitles related to the sentence in Portuguese
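A pipe-separated file with these columns can be read with Python's standard `csv` module by setting the delimiter. The sketch below assumes the file has a header row with the column names in the order listed and that `pol` is left empty for objective sentences; neither is confirmed by the description. The sample rows and paths are invented.

```python
import csv
import io


def read_corpus(f):
    """Read the pipe-separated corpus and convert 'pol' to an int where present."""
    rows = []
    for row in csv.DictReader(f, delimiter="|"):
        # Polarity only applies to subjective (SUBJ) sentences.
        has_pol = row["type"] == "SUBJ" and row["pol"] != ""
        row["pol"] = int(row["pol"]) if has_pol else None
        rows.append(row)
    return rows


# Invented sample standing in for the real file.
sample = io.StringIO(
    "en|pt_br|type|pol|en_path|pt_br_path\n"
    "I love it|Eu amo isso|SUBJ|1|en/a.xml|pt_br/a.xml\n"
    "It is Monday|É segunda-feira|OBJ||en/b.xml|pt_br/b.xml\n"
)
for row in read_corpus(sample):
    print(row["type"], row["pol"], row["en"])
```

If the subtitle text contains double quotes, passing `quoting=csv.QUOTE_NONE` to the reader may be needed so that quotes are treated as ordinary characters.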

  20. imdb-dataset-sentiment-analysis-in-csv-format

    • kaggle.com
    Updated Dec 30, 2024
    + more versions
    Cite
    love-seeker (2024). imdb-dataset-sentiment-analysis-in-csv-format [Dataset]. https://www.kaggle.com/datasets/loveseeker/imdb-dataset-sentiment-analysis-in-csv-format/data
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 30, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    love-seeker
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset

    This dataset was created by love-seeker

    Released under Apache 2.0

    Contents

Cite
Julie R. Repository creator - Campos Arias; Julie R. Repository creator - Campos Arias (2023). Datasets for Sentiment Analysis [Dataset]. http://doi.org/10.5281/zenodo.10157504

Datasets for Sentiment Analysis

Explore at:
Available download formats: csv
Dataset updated
Dec 10, 2023
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Julie R. Repository creator - Campos Arias; Julie R. Repository creator - Campos Arias
License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

This repository was created for my Master's thesis in Computational Intelligence and Internet of Things at the University of Córdoba, Spain. The purpose of this repository is to store the datasets found that were used in some of the studies that served as research material for this Master's thesis. Also, the datasets used in the experimental part of this work are included.

Below are the datasets specified, along with the details of their references, authors, and download sources.

----------- STS-Gold Dataset ----------------

The dataset consists of 2026 tweets. The file consists of 3 columns: id, polarity, and tweet. The three columns denote the unique id, polarity index of the text and the tweet text respectively.

Reference: Saif, H., Fernandez, M., He, Y., & Alani, H. (2013). Evaluation datasets for Twitter sentiment analysis: a survey and a new dataset, the STS-Gold.

File name: sts_gold_tweet.csv

----------- Amazon Sales Dataset ----------------

This dataset contains the ratings and reviews of 1K+ Amazon products, as listed on the official Amazon website. The data was scraped from the official Amazon website in January 2023.

Owner: Karkavelraja J., Postgraduate student at Puducherry Technological University (Puducherry, Puducherry, India)

Features:

  • product_id - Product ID
  • product_name - Name of the Product
  • category - Category of the Product
  • discounted_price - Discounted Price of the Product
  • actual_price - Actual Price of the Product
  • discount_percentage - Percentage of Discount for the Product
  • rating - Rating of the Product
  • rating_count - Number of people who voted for the Amazon rating
  • about_product - Description about the Product
  • user_id - ID of the user who wrote review for the Product
  • user_name - Name of the user who wrote review for the Product
  • review_id - ID of the user review
  • review_title - Short review
  • review_content - Long review
  • img_link - Image Link of the Product
  • product_link - Official Website Link of the Product

License: CC BY-NC-SA 4.0

File name: amazon.csv

----------- Rotten Tomatoes Reviews Dataset ----------------

This rating inference dataset is a sentiment classification dataset containing 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews. On average, these reviews consist of 21 words. The first 5,331 rows contain only negative samples and the last 5,331 rows contain only positive samples, so the data should be shuffled before use.

This data is collected from https://www.cs.cornell.edu/people/pabo/movie-review-data/ as a txt file and converted into a csv file. The file consists of 2 columns: reviews and labels (1 for fresh (good) and 0 for rotten (bad)).

Reference: Bo Pang and Lillian Lee. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 115–124, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics

File name: data_rt.csv
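Because the rows are sorted by class, a train/test split taken from the file in its original order would be badly skewed. A minimal sketch of a reproducible shuffle is shown below; the function name and the seed are illustrative choices, and the sample rows are invented stand-ins for the (reviews, labels) pairs in the file.

```python
import random


def shuffle_rows(rows, seed=42):
    """Return a shuffled copy of the rows; a fixed seed keeps runs reproducible."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    return shuffled


# Invented sample mimicking the file layout: negatives first, positives last.
rows = [(f"negative review {i}", 0) for i in range(5)]
rows += [(f"positive review {i}", 1) for i in range(5)]

for review, label in shuffle_rows(rows)[:4]:
    print(label, review)
```

Using a dedicated `random.Random` instance avoids changing the global random state used elsewhere in a program.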

----------- Preprocessed Dataset Sentiment Analysis ----------------

Preprocessed Amazon product review data of Gen3EcoDot (Alexa), scraped entirely from amazon.in.
Stemmed and lemmatized using nltk.
Sentiment labels are generated using TextBlob polarity scores.

The file consists of 4 columns: index, review (stemmed and lemmatized review using nltk), polarity (score) and division (categorical label generated using polarity score).
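The description does not say how the polarity score was turned into the division label. One plausible rule is sketched below as an assumption: TextBlob polarity lies in the range -1 to 1, and this example treats scores above zero as positive, below zero as negative and exactly zero as neutral. The label names, the function name and the optional `neutral_band` parameter are hypothetical.

```python
def division_from_polarity(polarity, neutral_band=0.0):
    """Map a polarity score in [-1, 1] to a categorical label (assumed rule)."""
    if polarity > neutral_band:
        return "positive"
    if polarity < -neutral_band:
        return "negative"
    return "neutral"


for score in (0.8, 0.0, -0.35):
    print(score, division_from_polarity(score))
```

A wider `neutral_band` (for example 0.1) would treat weakly polarised reviews as neutral; whether that matches the labels in the file would need to be checked against the data.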

DOI: 10.34740/kaggle/dsv/3877817

Citation: @misc{pradeesh arumadi_2022, title={Preprocessed Dataset Sentiment Analysis}, url={https://www.kaggle.com/dsv/3877817}, DOI={10.34740/KAGGLE/DSV/3877817}, publisher={Kaggle}, author={Pradeesh Arumadi}, year={2022} }

This dataset was used in the experimental phase of my research.

File name: EcoPreprocessed.csv

----------- Amazon Earphones Reviews ----------------

This dataset consists of 9930 Amazon reviews and star ratings for the 10 latest (as of mid-2019) Bluetooth earphone devices, intended for learning how to train machine-learning models for sentiment analysis.

This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.

The file consists of 5 columns: ReviewTitle, ReviewBody, ReviewStar, Product and division (manually added - categorical label generated using ReviewStar score)

License: U.S. Government Works

Source: www.amazon.in

File name (original): AllProductReviews.csv (contains 14337 reviews)

File name (edited - used for my research) : AllProductReviews2.csv (contains 9930 reviews)

----------- Amazon Musical Instruments Reviews ----------------

This dataset contains 7137 comments/reviews of different musical instruments coming from Amazon.

This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.

The file consists of 10 columns: reviewerID, asin (ID of the product), reviewerName, helpful (helpfulness rating of the review), reviewText, overall (rating of the product), summary (summary of the review), unixReviewTime (time of the review - unix time), reviewTime (raw time of the review) and division (manually added - categorical label generated using overall score).
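The unixReviewTime column holds seconds since 1970-01-01 UTC, so it can be converted to a calendar date with the standard library. The helper name below is hypothetical and the input values are only examples.

```python
from datetime import datetime, timezone


def review_date(unix_review_time):
    """Convert a unix timestamp (seconds) to an ISO date string in UTC."""
    moment = datetime.fromtimestamp(int(unix_review_time), tz=timezone.utc)
    return moment.date().isoformat()


print(review_date("1393545600"))
```

Converting explicitly in UTC keeps the result independent of the local time zone of the machine running the code.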

Source: http://jmcauley.ucsd.edu/data/amazon/

File name (original): Musical_instruments_reviews.csv (contains 10261 reviews)

File name (edited - used for my research) : Musical_instruments_reviews2.csv (contains 7137 reviews)
