100+ datasets found
  1. Sentiment Analysis Dataset

    • kaggle.com
    zip
    Updated May 3, 2025
    Cite
    abdelmalek eladjelet (2025). Sentiment Analysis Dataset [Dataset]. https://www.kaggle.com/datasets/abdelmalekeladjelet/sentiment-analysis-dataset
    Download format: zip (9105036 bytes)
    Authors
    abdelmalek eladjelet
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    🧠 Multi-Class Sentiment Analysis Dataset (240K+ English Comments)

    📌 Description

    This dataset is a large-scale collection of 241,000+ English-language comments sourced from various online platforms. Each comment is annotated with a sentiment label:

    • 0 — Negative
    • 1 — Neutral
    • 2 — Positive

    The data was gathered from multiple sources:

    • Hugging Face: https://huggingface.co/datasets/Sp1786/multiclass-sentiment-analysis-dataset
    • Kaggle: https://www.kaggle.com/datasets/abhi8923shriv/sentiment-analysis-dataset
    • Kaggle: https://www.kaggle.com/datasets/jp797498e/twitter-entity-sentiment-analysis
    • Kaggle: https://www.kaggle.com/datasets/crowdflower/twitter-airline-sentiment

    The goal is to enable training and evaluation of multi-class sentiment analysis models for real-world text data. The dataset is already preprocessed — lowercase, cleaned from punctuation, URLs, numbers, and stopwords — and is ready for NLP pipelines.
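
    The cleaning steps listed above (lowercasing, then removing URLs, numbers, punctuation, and stopwords) can be sketched as follows. This is a minimal illustration, not the dataset author's pipeline: the stopword list, the regular expressions, and the order of the steps are all assumptions.

```python
import re
import string

# Tiny illustrative stopword list (an assumption; the list actually used
# to build the dataset is not documented in the description).
STOPWORDS = {"the", "a", "an", "at", "of"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
NUM_RE = re.compile(r"\d+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess(text: str) -> str:
    """Lowercase, then strip URLs, numbers, punctuation, and stopwords."""
    text = text.lower()
    text = URL_RE.sub(" ", text)        # remove URLs before punctuation is stripped
    text = NUM_RE.sub(" ", text)        # remove digit runs
    text = text.translate(PUNCT_TABLE)  # remove ASCII punctuation
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    return " ".join(tokens)


print(preprocess("Check out the 2 new phones at https://example.com!!!"))
```

    URLs are removed before punctuation so that the pattern can still recognise the `://` part of each link.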

    📊 Columns

    Column - Description
    Comment - User-generated text content
    Sentiment - Sentiment label (0=Negative, 1=Neutral, 2=Positive)

    🚀 Use Cases

    • 🧠 Train sentiment classifiers using LSTM, BiLSTM, CNN, BERT, or RoBERTa
    • 🔍 Evaluate preprocessing and tokenization strategies
    • 📈 Benchmark NLP models on multi-class classification tasks
    • 🎓 Educational projects and research in opinion mining or text classification
    • 🧪 Fine-tune transformer models on a large and diverse sentiment dataset

    💬 Example

    Comment: "apple pay is so convenient secure and easy to use"
    Sentiment: 2 (Positive)
    
  2. Sentiment Analysis Dataset

    • cubig.ai
    zip
    Updated May 20, 2025
    Cite
    CUBIG (2025). Sentiment Analysis Dataset [Dataset]. https://cubig.ai/store/products/270/sentiment-analysis-dataset
    Download format: zip
    Dataset authored and provided by
    CUBIG
    License

    https://cubig.ai/store/terms-of-service

    Measurement technique
    Privacy-preserving data transformation via differential privacy, Synthetic data generation using AI techniques for model training
    Description

    1) Data Introduction
    • The Sentiment Analysis Dataset is a sentiment analysis dataset of large-scale tweet text collected from Twitter. Each tweet carries a sentiment polarity label (0=negative, 2=neutral, 4=positive) that was assigned automatically based on emoticons.

    2) Data Utilization
    (1) Characteristics of the Sentiment Analysis Dataset:
    • Each sample consists of six columns: sentiment polarity, tweet ID, date of writing, search word, author, and tweet body. It is suitable for training natural language processing and classification models on tweet text and sentiment labels.
    (2) The Sentiment Analysis Dataset can be used for:
    • Sentiment classification model development: using tweet text and sentiment polarity labels, you can build automatic positive, negative, and neutral sentiment classifiers with machine learning and deep learning models such as logistic regression, SVM, RNN, and LSTM.
    • Analysis of SNS public opinion and trends: by analyzing the distribution of sentiment over time and by keyword, you can explore changes in public opinion on specific issues or brands, positive and negative trends, and key sentiment keywords.

  3. custom_sentiment_analysis_dataset

    • huggingface.co
    Updated Sep 20, 2024
    + more versions
    Cite
    choi hyun woo (2024). custom_sentiment_analysis_dataset [Dataset]. https://huggingface.co/datasets/t7439/custom_sentiment_analysis_dataset
    Authors
    choi hyun woo
    Description

    Dataset Card for Custom Text Dataset

    Dataset Name

    Custom Text Dataset

    Overview

    This dataset contains text data for training sentiment analysis models. The data is collected from various sources, including books, articles, and web pages.

    Composition

    Number of records: 50,000
    Fields: text, label
    Size: 134 MB

    Collection Process

    The data was collected using web scraping and manual extraction from public domain sources.

    Preprocessing… See the full description on the dataset page: https://huggingface.co/datasets/t7439/custom_sentiment_analysis_dataset.
    
  4. Datasets for Sentiment Analysis

    • zenodo.org
    csv
    Updated Dec 10, 2023
    Cite
    Julie R. Campos Arias (2023). Datasets for Sentiment Analysis [Dataset]. http://doi.org/10.5281/zenodo.10157504
    Download format: csv
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Julie R. Campos Arias (repository creator)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository was created for my Master's thesis in Computational Intelligence and Internet of Things at the University of Córdoba, Spain. The purpose of this repository is to store the datasets found that were used in some of the studies that served as research material for this Master's thesis. Also, the datasets used in the experimental part of this work are included.

    Below are the datasets specified, along with the details of their references, authors, and download sources.

    ----------- STS-Gold Dataset ----------------

    The dataset consists of 2026 tweets. The file consists of 3 columns: id, polarity, and tweet. The three columns denote the unique id, polarity index of the text and the tweet text respectively.

    Reference: Saif, H., Fernandez, M., He, Y., & Alani, H. (2013). Evaluation datasets for Twitter sentiment analysis: a survey and a new dataset, the STS-Gold.

    File name: sts_gold_tweet.csv

    ----------- Amazon Sales Dataset ----------------

    This dataset contains the ratings and reviews of 1K+ Amazon products, as listed on the official Amazon website. The data was scraped from the official Amazon website in January 2023.

    Owner: Karkavelraja J., Postgraduate student at Puducherry Technological University (Puducherry, Puducherry, India)

    Features:

    • product_id - Product ID
    • product_name - Name of the Product
    • category - Category of the Product
    • discounted_price - Discounted Price of the Product
    • actual_price - Actual Price of the Product
    • discount_percentage - Percentage of Discount for the Product
    • rating - Rating of the Product
    • rating_count - Number of people who voted for the Amazon rating
    • about_product - Description about the Product
    • user_id - ID of the user who wrote review for the Product
    • user_name - Name of the user who wrote review for the Product
    • review_id - ID of the user review
    • review_title - Short review
    • review_content - Long review
    • img_link - Image Link of the Product
    • product_link - Official Website Link of the Product

    License: CC BY-NC-SA 4.0

    File name: amazon.csv

    ----------- Rotten Tomatoes Reviews Dataset ----------------

    This rating inference dataset is a sentiment classification dataset containing 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews. On average, these reviews consist of 21 words. The first 5331 rows contain only negative samples and the last 5331 rows contain only positive samples, so the data should be shuffled before use.

    This data is collected from https://www.cs.cornell.edu/people/pabo/movie-review-data/ as a txt file and converted into a csv file. The file consists of 2 columns: reviews and labels (1 for fresh (good) and 0 for rotten (bad)).

    Reference: Bo Pang and Lillian Lee. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 115–124, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics

    File name: data_rt.csv
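
    Because the rows are ordered by class, a reproducible shuffle is worth doing before any train/test split. The sketch below uses a small stand-in list in place of the rows of data_rt.csv; the function name and the seed are illustrative assumptions.

```python
import random


def shuffle_rows(rows, seed=0):
    """Return a shuffled copy of rows, leaving the input list untouched."""
    rng = random.Random(seed)  # fixed seed so the order is reproducible
    shuffled = list(rows)
    rng.shuffle(shuffled)
    return shuffled


# Stand-in for the (review, label) rows: negatives (0) first, positives (1) last,
# mirroring the ordering described for data_rt.csv.
rows = [(f"negative review {i}", 0) for i in range(5)]
rows += [(f"positive review {i}", 1) for i in range(5)]

shuffled = shuffle_rows(rows, seed=42)
print([label for _, label in shuffled])
```

    Splitting the file without shuffling would put only one class in the held-out portion.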

    ----------- Preprocessed Dataset Sentiment Analysis ----------------

    Preprocessed Amazon product review data for the Gen3EcoDot (Alexa), scraped entirely from amazon.in
    Stemmed and lemmatized using nltk.
    Sentiment labels are generated using TextBlob polarity scores.

    The file consists of 4 columns: index, review (stemmed and lemmatized review using nltk), polarity (score) and division (categorical label generated using polarity score).
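
    The description does not say which cut-offs turn a polarity score into the division label. The sketch below shows one plausible mapping, assuming a score in the range [-1, 1] (the range TextBlob reports) and a symmetric threshold; the function name, the threshold, and the label strings are assumptions, not the dataset's documented rule.

```python
def polarity_to_division(polarity: float, threshold: float = 0.0) -> str:
    """Map a polarity score in [-1, 1] to a categorical label.

    The threshold and the label names are illustrative assumptions.
    """
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"


for score in (0.5, -0.3, 0.0):
    print(score, polarity_to_division(score))
```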

    DOI: 10.34740/kaggle/dsv/3877817

    Citation: @misc{pradeesh arumadi_2022, title={Preprocessed Dataset Sentiment Analysis}, url={https://www.kaggle.com/dsv/3877817}, DOI={10.34740/KAGGLE/DSV/3877817}, publisher={Kaggle}, author={Pradeesh Arumadi}, year={2022} }

    This dataset was used in the experimental phase of my research.

    File name: EcoPreprocessed.csv

    ----------- Amazon Earphones Reviews ----------------

    This dataset consists of 9930 Amazon reviews and star ratings for the 10 latest (as of mid-2019) Bluetooth earphone devices, intended for learning how to train machine learning models for sentiment analysis.

    This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.

    The file consists of 5 columns: ReviewTitle, ReviewBody, ReviewStar, Product and division (manually added - categorical label generated using ReviewStar score)

    License: U.S. Government Works

    Source: www.amazon.in

    File name (original): AllProductReviews.csv (contains 14337 reviews)

    File name (edited - used for my research) : AllProductReviews2.csv (contains 9930 reviews)

    ----------- Amazon Musical Instruments Reviews ----------------

    This dataset contains 7137 comments/reviews of different musical instruments coming from Amazon.

    This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.

    The file consists of 10 columns: reviewerID, asin (ID of the product), reviewerName, helpful (helpfulness rating of the review), reviewText, overall (rating of the product), summary (summary of the review), unixReviewTime (time of the review - unix time), reviewTime (time of the review - raw) and division (manually added - categorical label generated using overall score).

    Source: http://jmcauley.ucsd.edu/data/amazon/

    File name (original): Musical_instruments_reviews.csv (contains 10261 reviews)

    File name (edited - used for my research) : Musical_instruments_reviews2.csv (contains 7137 reviews)

  5. Portuguese Sentiment Corpus for Twitter and

    • kaggle.com
    zip
    Updated Feb 18, 2023
    Cite
    The Devastator (2023). Portuguese Sentiment Corpus for Twitter and [Dataset]. https://www.kaggle.com/datasets/thedevastator/portuguese-sentiment-corpus-for-twitter-and-busc
    Download format: zip (934 bytes)
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Portuguese Sentiment Corpus for Twitter and Buscapé Reviews

    Accurately Labeled Word-Level Annotations

    About this dataset

    This dataset consists of a list of Portuguese words and their corresponding sentiment labels. Its finer-grained annotation allows for comparative sentiment analysis in Portuguese across Twitter and Buscapé reviews. The data was annotated by humans, so it provides a measure of the sentiment of Portuguese words in multiple contexts. The labels range from positive to negative with numeric values, allowing for more nuanced categorization and comparison between different subcategories within reviews. Whether you're mining social media conversations or using customer feedback for analytics, this labeled corpus can help inform your decision-making process.

    How to use the dataset

    This dataset, comprised of Twitter and Buscapé reviews from Portuguese-speaking areas, provides sentiment labels at the word level. This makes it easy to apply to natural language processing models for analysis. The corpus is composed of 3,457 tweets and 476 Buscapé reviews, with a total of 114 unique words in the lexicon along with associated human-annotated sentiment scores for each word.

    To use this resource for comparative sentiment analysis, you need an environment that can read CSV files containing both text and numerical data. With such a setup, you can use machine learning algorithms to compare words or phrases within texts or across different datasets and gain an understanding of the opinions expressed towards various topics, to the extent that they have been labeled in this corpus. The data has been annotated with 3 possible sentiment labels: negative (-1), neutral (0) or positive (+1).
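
    As a rough illustration of how a word-level lexicon with -1/0/+1 labels can be applied to a text, the sketch below sums the labels of the words it finds and keeps the sign of the total. The lexicon entries are hypothetical examples rather than rows from portuguese_lexicon.csv, and the sum-and-sign rule is an assumption, not a method described by the dataset.

```python
# Hypothetical lexicon entries (not taken from portuguese_lexicon.csv).
LEXICON = {"bom": 1, "ruim": -1, "produto": 0}


def score_text(text: str, lexicon: dict) -> int:
    """Return -1, 0, or +1 from the summed word-level labels in the text."""
    total = sum(lexicon.get(token, 0) for token in text.lower().split())
    if total > 0:
        return 1
    if total < 0:
        return -1
    return 0


print(score_text("Produto bom", LEXICON))
print(score_text("produto ruim", LEXICON))
```

    Words missing from the lexicon are treated as neutral here, which is a simplification.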

    To work with this dataset effectively, here are some tips:

    • Familiarize yourself with the data, which contains a list of Portuguese words and their associated sentiment labels; reading through the full list will help you understand how it works.
    • Create a visualization tool that lets you see the weight assigned to each word and run comparative analyses, such as finding differences between the same nouns used in different sentences.
    • Analyze text holistically by taking contextual information into account.
    • Experiment with methods that may increase accuracy when class imbalance leaves an unequal distribution of examples.

    Applying these measures should help you achieve reliable results with this linguistically labeled database, which is generated from two distinct corpora (tweets and Buscapé reviews) that have not previously been combined in this way. It makes it easier to gain insights into people's opinions on various products based on their textual expressions in real time.

    Research Ideas

    • Comparing the sentiment of Twitter and Buscapé reviews to identify trends in customer opinions over time.
    • Understanding how the sentiment of customer reviews compares between different Portuguese languages and dialects.
    • Utilizing the labeled corpus for training machine learning models in natural language processing tasks such as sentiment analysis, text classification, and automated opinion summarization

    Acknowledgements

    If you use this dataset in your research, please credit the original authors.

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: portuguese_lexicon.csv

  6. Data for Sentiment Analysis

    • kaggle.com
    zip
    Updated Oct 11, 2023
    Cite
    Arun Jangir (2023). Data for Sentiment Analysis [Dataset]. https://www.kaggle.com/datasets/arunjangir245/data-for-sentiment-analysis
    Download format: zip (84855679 bytes)
    Authors
    Arun Jangir
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Description
    1. Sentiment: Numeric sentiment label (0 for negative, 2 for neutral, 4 for positive).
    2. ID: Unique tweet identifier (e.g., 2087).
    3. Date: Date and time of the tweet (e.g., Sat May 16 23:58:44 UTC 2009).
    4. Query: Search term or "NO_QUERY" if not applicable.
    5. User: Twitter username (e.g., robotickilldozr).
    6. Text: The tweet content (e.g., "Lyx is cool").

    These fields contain sentiment analysis data, tweet details, and content.
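
    A minimal sketch of reading rows with these six fields and turning the numeric sentiment code into a readable label. It assumes a header-less CSV whose column order matches the list above. The sample row is assembled from the per-field examples in the description, with a sentiment code of 4 chosen purely for illustration.

```python
import csv
import io

# Column order assumed to follow the field list in the description.
FIELDS = ["sentiment", "id", "date", "query", "user", "text"]
LABELS = {0: "negative", 2: "neutral", 4: "positive"}


def read_tweets(handle):
    """Yield one dict per CSV row, with the sentiment code mapped to a label."""
    for row in csv.reader(handle):
        record = dict(zip(FIELDS, row))
        record["sentiment"] = LABELS[int(record["sentiment"])]
        yield record


# Illustrative row; the sentiment value is an assumption, not taken from the data.
sample = io.StringIO(
    '"4","2087","Sat May 16 23:58:44 UTC 2009",'
    '"NO_QUERY","robotickilldozr","Lyx is cool"\n'
)
tweets = list(read_tweets(sample))
print(tweets[0]["sentiment"], "-", tweets[0]["text"])
```

    For the real file, the `io.StringIO` object would be replaced by an opened file handle; the file's name and encoding are not given in the description.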

  7. BBC datasets for sentiment analysis

    • kaggle.com
    zip
    Updated Dec 15, 2024
    Cite
    Alan (2024). BBC datasets for sentiment analysis [Dataset]. https://www.kaggle.com/datasets/amunsentom/article-dataset-2
    Download format: zip (1921885 bytes)
    Authors
    Alan
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset Name: BBC Articles Sentiment Analysis Dataset

    Source: BBC News

    Description: This dataset consists of articles from the BBC News website, containing a diverse range of topics such as business, politics, entertainment, technology, sports, and more. The dataset includes articles from various time periods and categories, along with labels representing the sentiment of the article. The sentiment labels indicate whether the tone of the article is positive, negative, or neutral, making it suitable for sentiment analysis tasks.

    Number of Instances: [Specify the number of articles in the dataset, for example, 2,225 articles]

    Number of Features:
    1. Article Text: The content of the article (string).
    2. Sentiment Label: The sentiment classification of the article. The possible labels are:
       - Positive
       - Negative
       - Neutral

    Data Fields:
    - id: Unique identifier for each article.
    - category: The category or topic of the article (e.g., business, politics, sports).
    - title: The title of the article.
    - content: The full text of the article.
    - sentiment: The sentiment label (positive, negative, or neutral).

    Example:
    | id | category | title | content | sentiment |
    |----|----------|-------|---------|-----------|
    | 1 | Business | "Stock Market Surge" | "The stock market has surged to new highs, driven by strong earnings..." | Positive |
    | 2 | Politics | "Election Results" | "The election results were a mixed bag, with some surprises along the way." | Neutral |
    | 3 | Sports | "Team Wins Championship" | "The team won the championship after a thrilling final match." | Positive |
    | 4 | Technology | "New Smartphone Release" | "The new smartphone release has received mixed reactions from users." | Negative |

    Preprocessing Notes:
    - The text has been preprocessed to remove special characters and any HTML tags that might have been included in the original articles.
    - Tokenization or further text cleaning (e.g., lowercasing, stopword removal) may be necessary depending on the model and method used for sentiment classification.

    Use Case: This dataset is ideal for training and evaluating machine learning models for sentiment classification, where the goal is to predict the sentiment (positive, negative, or neutral) based on the article's text.

  8. Sentiment Analysis outputs based on the combination of three classifiers for...

    • data.europa.eu
    • data.niaid.nih.gov
    • +1 more
    unknown
    Updated Mar 14, 2022
    Cite
    Zenodo (2022). Sentiment Analysis outputs based on the combination of three classifiers for news headlines and body text [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-6326348?locale=hu
    Download format: unknown (5792)
    Dataset authored and provided by
    Zenodo (http://zenodo.org/)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Sentiment Analysis outputs based on the combination of three classifiers for news headlines and body text covering the Olympic legacy of Rio 2016 and London 2012. Data was searched via the Google search engine. It is composed of sentiment labels assigned to 1271 news articles in total.

    News outlets:
    • BBC
    • Daily Mail
    • The Telegraph
    • The Guardian
    • Globo
    • Estadao
    • Folha de S. Paulo

    Events covered by the articles:
    • London 2012 Olympic legacy
    • Rio 2016 Olympic legacy

    All classifiers were applied to texts in English. Texts originally published in Portuguese by the Brazilian media were automatically translated.

    Sentiment classifiers used:
    • Vader
    • BERT (Trained on Amazon data)
    • BERT (Trained on twitter data - 140)

    Each document (spreadsheet - xlsx) refers to one outlet and one event (London 2012 or Rio 2016).

    How were labels assigned to the texts?
    These labels are a combination of the three sentiment classifiers listed above. If two of them agree on the same label, that label is taken as correct. Otherwise, the label 'other' is assigned. For news article body text, the proportion of sentences of each sentiment type was used to assign a label to the whole article instead of averaging the sentence scores. For example, if the proportion of sentences with negative labels is greater than 50%, then the article is assigned a negative label.

    The documents are composed of the following columns:
    • Rank: the position of the article on Google search ranking
    • Date: date of article's publication (DD/MM/YYYY)
    • Link: article's link
    • Title: article's title
    • Sentiment_Title: final sentiment for article headline
    • Sentiment_Text: final sentiment for article's body text
    PS: Documents do not include articles' body text.

    Sentiment is presented in labels as follows:
    • Pos: Positive
    • Neg: Negative
    • Neutral: Neutral
    • other: inconclusive - if each of the 3 classifiers assigned a different label to the article, the label 'other' was used. Therefore, 'other' identifies contradictory results.
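
    The two labelling rules described for this dataset (a two-out-of-three vote across classifiers, and a greater-than-50% sentence proportion for body text) can be sketched as below. The function names are hypothetical. Returning 'other' when no sentence label passes 50% is an assumption, since the description only gives the negative-majority example.

```python
from collections import Counter


def combine_labels(a: str, b: str, c: str) -> str:
    """Return the label at least two classifiers agree on, else 'other'."""
    label, count = Counter([a, b, c]).most_common(1)[0]
    return label if count >= 2 else "other"


def article_label(sentence_labels: list) -> str:
    """Label an article by the sentiment that covers more than 50% of its sentences.

    Falling back to 'other' when no label exceeds 50% is an assumption.
    """
    if not sentence_labels:
        return "other"
    label, count = Counter(sentence_labels).most_common(1)[0]
    return label if count / len(sentence_labels) > 0.5 else "other"


print(combine_labels("Pos", "Pos", "Neg"))
print(combine_labels("Pos", "Neg", "Neutral"))
print(article_label(["Neg", "Neg", "Pos"]))
```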

  9. Multimodal Sentiment Analysis Experimental Data (Processed MOSI/MOSEI...

    • scidb.cn
    Updated Oct 24, 2025
    Cite
    Bo (2025). Multimodal Sentiment Analysis Experimental Data (Processed MOSI/MOSEI Subset) [Dataset]. http://doi.org/10.57760/sciencedb.30362
    Dataset provided by
    Science Data Bank
    Authors
    Bo
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    [Overview]This dataset targets multimodal sentiment/emotion analysis. It contains aligned and processed features and intermediate artifacts derived from the public dataset(s)

  10. RevBangla: Bangla Product Sentiment Analysis Dataset

    • data.mendeley.com
    Updated Mar 6, 2024
    + more versions
    Cite
    Saieef Sarower Sunny (2024). RevBangla: Bangla Product Sentiment Analysis Dataset [Dataset]. http://doi.org/10.17632/bnbbcdsf4m.1
    Authors
    Saieef Sarower Sunny
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Bangla Product Comments Dataset is a comprehensive collection of product reviews gathered from diverse ecommerce platforms in Bangladesh. This dataset offers a rich source of information reflecting customer opinions and sentiments towards various products available online. This dataset holds significant value for businesses, researchers, and data scientists interested in understanding consumer behavior, product perception, and sentiment analysis within the Bangladeshi ecommerce landscape. By leveraging this dataset, stakeholders can derive actionable insights to enhance product quality, marketing strategies, and overall customer satisfaction.

    Columns:

    1. Product_ID: A unique identifier for each product, facilitating organization and referencing.
    2. Date: The date when the comment was posted, providing temporal context for analysis.
    3. Customer Name: The name or identifier of the customer who submitted the comment, ensuring traceability and potential user segmentation.
    4. Rating: A numerical representation (typically on a scale of 1 to 5) reflecting the customer's overall satisfaction level with the product.
    5. Label Sentiment: A categorical label assigned to each comment indicating the sentiment expressed by the customer (e.g., positive, negative). This classification facilitates sentiment analysis tasks.
    6. Comment: The actual text of the customer's review or comment, conveying specific opinions, feedback, or experiences regarding the product.
  11. Customer Sentiment Dataset

    • kaggle.com
    zip
    Updated Nov 19, 2025
    Cite
    Kundan Sagar Bedmutha (2025). Customer Sentiment Dataset [Dataset]. https://www.kaggle.com/datasets/kundanbedmutha/customer-sentiment-dataset
    Download format: zip (296232 bytes)
    Authors
    Kundan Sagar Bedmutha
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset provides a comprehensive and realistic representation of customer sentiment across multiple online and offline shopping platforms. It contains 25,000 customer feedback records, each including demographic attributes, product categories, purchase channels, ratings, review text, and sentiment classification.

    The dataset reflects how customers express their experiences on platforms such as Amazon, Flipkart, Meesho, Facebook Marketplace, Myntra, Ajio, Nykaa, Croma, Boat, Reliance Digital, BigBasket, JioMart, Swiggy Instamart, Zepto, and many others. It captures a wide spectrum of sentiments, from highly satisfied customers praising product quality and delivery speed to dissatisfied users reporting issues such as delayed deliveries, low-quality items, or unsatisfactory support.

    Each review is paired with a star rating (1 to 5). Ratings of 4 and 5 are mapped to positive sentiment, 3 to neutral, and 1 and 2 to negative sentiment. Corresponding review text is generated to match the sentiment tone, making the dataset ideal for text and sentiment understanding.
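
    The star-to-sentiment mapping stated above is simple enough to write down directly. In this sketch the function name and the lowercase label strings are assumptions; only the 4-5 / 3 / 1-2 grouping comes from the description.

```python
def rating_to_sentiment(stars: int) -> str:
    """Map a 1-5 star rating to a sentiment label (4-5 positive, 3 neutral, 1-2 negative)."""
    if stars not in range(1, 6):
        raise ValueError(f"rating must be between 1 and 5, got {stars!r}")
    if stars >= 4:
        return "positive"
    if stars == 3:
        return "neutral"
    return "negative"


for stars in range(1, 6):
    print(stars, rating_to_sentiment(stars))
```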

    In addition to sentiment and rating, the dataset includes essential service metrics such as response time (in hours), whether the issue was resolved, and whether a formal complaint was registered. This creates a richer ecosystem of customer experience and feedback patterns.

    The dataset is suitable for a wide variety of uses, including customer insight studies, retail analytics, sentiment analysis, product review exploration, behavior understanding, or business decision making. Since the dataset is fully synthetic and free from personal identifiers, it is safe for all academic, analytical, and research purposes.

  12. Data from: Facebook Data for Sentiment Analysis

    • live.european-language-grid.eu
    binary format
    Updated Jul 16, 2013
    Cite
    (2013). Facebook Data for Sentiment Analysis [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/1057
    Download format: binary format
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    Corpus consisting of 10,000 Facebook posts manually annotated on sentiment (2,587 positive, 5,174 neutral, 1,991 negative and 248 bipolar posts). The archive contains data and statistics in an Excel file (FBData.xlsx) and gold data in two text files with posts (gold-posts.txt) and labels (gols-labels.txt) on corresponding lines.
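
    Since the gold posts and labels sit on corresponding lines of two text files, they can be paired line by line. The sketch below demonstrates this on two tiny stand-in files written to a temporary directory. The UTF-8 encoding, the stand-in file names, and the placeholder label strings are assumptions, because the archive's actual label format is not described.

```python
import os
import tempfile
from collections import Counter


def read_gold(posts_path: str, labels_path: str) -> list:
    """Pair each post with the label found on the same line number."""
    with open(posts_path, encoding="utf-8") as posts, \
         open(labels_path, encoding="utf-8") as labels:
        post_lines = [line.rstrip("\n") for line in posts]
        label_lines = [line.strip() for line in labels]
    if len(post_lines) != len(label_lines):
        raise ValueError("posts and labels must have the same number of lines")
    return list(zip(post_lines, label_lines))


# Demo on stand-in files; "p" and "n" are placeholder labels.
with tempfile.TemporaryDirectory() as tmp:
    posts_path = os.path.join(tmp, "posts.txt")
    labels_path = os.path.join(tmp, "labels.txt")
    with open(posts_path, "w", encoding="utf-8") as f:
        f.write("first post\nsecond post\n")
    with open(labels_path, "w", encoding="utf-8") as f:
        f.write("p\nn\n")
    pairs = read_gold(posts_path, labels_path)
    distribution = Counter(label for _, label in pairs)

print(pairs)
print(distribution)
```

    The length check guards against the two files drifting out of alignment, which would silently mislabel every post after the mismatch.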

  13. Data from: KurdiSent: A Corpus For Kurdish Sentiment Analysis

    • data.mendeley.com
    Updated Feb 6, 2023
    + more versions
    Cite
    Soran Badawi (2023). KurdiSent: A Corpus For Kurdish Sentiment Analysis [Dataset]. http://doi.org/10.17632/3yrkswy6ph.2
    Authors
    Soran Badawi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Kurdish language is regarded as one of the less-resourced languages. It is spoken by 30-40 million people worldwide. The language has 33 letters that are largely similar to those of Arabic. The Kurdish language has two major dialects, Sorani and Badini. The dataset is a collection of texts written in the Sorani dialect. It contains tweets collected through the Twitter API. For security reasons and in line with Twitter's policies, we removed users' identities. We collected tweets that were published during the coronavirus pandemic. The tweets are raw texts, and the content covers a varied range of topics, including politics, sports, entertainment, and social life.

    Data Labeling

    We used the Twitter developer platform (Twitter API) to mine the tweets. The dataset was annotated manually by three native Kurdish speakers. The annotators were required to identify the class and category of each text. The classes were positive, negative and neutral, and the categories were news, technology, art, social and health. Texts for which at least two annotators agreed on a specific label and category were regarded as conflict-free and accepted for further processing. Texts on which all three raters disagreed were removed from the dataset. The doccano program was used to help the annotators label each text one by one.
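
    The acceptance rule for annotations (keep a text when at least two of three annotators agree, drop it otherwise) can be sketched as a small filter. Treating the class and the category as one joint (label, category) pair is an assumption about how agreement was measured. The function name and the sample annotations are hypothetical.

```python
from collections import Counter


def resolve(annotations: list):
    """Return the (label, category) pair chosen by at least two annotators, else None."""
    pair, count = Counter(annotations).most_common(1)[0]
    return pair if count >= 2 else None


# Hypothetical annotations from three raters for two texts.
corpus = {
    "text_1": [("positive", "news"), ("positive", "news"), ("negative", "art")],
    "text_2": [("positive", "news"), ("neutral", "social"), ("negative", "health")],
}

accepted = {}
for text_id, annotations in corpus.items():
    resolved = resolve(annotations)
    if resolved is not None:
        accepted[text_id] = resolved

print(accepted)
```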

  14. Pseudo-Label Generation for Multi-Label Text Classification

    • catalog.data.gov
    • datasets.ai
    • +1more
    Updated Apr 11, 2025
    Cite
    Dashlink (2025). Pseudo-Label Generation for Multi-Label Text Classification [Dataset]. https://catalog.data.gov/dataset/pseudo-label-generation-for-multi-label-text-classification
    Explore at:
    Dataset updated
    Apr 11, 2025
    Dataset provided by
    Dashlink
    Description

    With the advent and expansion of social networking, the amount of generated text data has seen a sharp increase. In order to handle such a huge volume of text data, new and improved text mining techniques are a necessity. One of the characteristics of text data that makes text mining difficult is multi-labelity. In order to build a robust and effective text classification method, which is an integral part of text mining research, we must consider this property more closely. This property is not unique to text data, as it can be found in non-text (e.g., numeric) data as well. However, it is most prevalent in text data. This property also puts the text classification problem in the domain of multi-label classification (MLC), where each instance is associated with a subset of class labels instead of a single class, as in conventional classification. In this paper, we explore how the generation of pseudo labels (i.e., combinations of existing class labels) can help us perform better text classification, and under what circumstances. During classification, the high and sparse dimensionality of text data is also considered. Although we propose and evaluate a text classification technique here, our main focus is on handling the multi-labelity of text data while utilizing the correlation among multiple labels existing in the data set. Our text classification technique is called pseudo-LSC (pseudo-Label Based Subspace Clustering). It is a subspace clustering algorithm that considers the high and sparse dimensionality as well as the correlation among different class labels during the classification process to provide better performance than existing approaches. Results on three real-world multi-label data sets provide insight into how multi-labelity is handled in our classification process and show the effectiveness of our approach.
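To make the notion of a pseudo label concrete, the sketch below treats every pair of class labels that co-occurs often enough as a candidate pseudo label. This is only an illustration of "combinations of existing class labels"; it is not the pseudo-LSC algorithm, and the function name, the restriction to pairs, and the `min_support` threshold are assumptions.

```python
from collections import Counter
from itertools import combinations


def candidate_pseudo_labels(label_sets, min_support=2):
    """Count co-occurring label pairs and keep those seen at least `min_support` times."""
    counts = Counter()
    for labels in label_sets:
        # Sort so that each pair has one canonical ordering.
        for pair in combinations(sorted(set(labels)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


# Hypothetical multi-label documents, each given as its set of class labels.
docs = [
    {"sports", "politics"},
    {"sports", "politics", "health"},
    {"health"},
    {"politics", "health"},
]
pseudo = candidate_pseudo_labels(docs)
```

In this toy data, the pairs (politics, sports) and (health, politics) each occur twice and are kept, while (health, sports) occurs once and is discarded.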

  15. YouTube Comments Sentiment Dataset

    • kaggle.com
    zip
    Updated Feb 7, 2025
    + more versions
    Cite
    Amaan Poonawala (2025). YouTube Comments Sentiment Dataset [Dataset]. https://www.kaggle.com/datasets/amaanpoonawala/youtube-comments-sentiment-dataset
    Explore at:
    Available download formats: zip (156821847 bytes)
    Dataset updated
    Feb 7, 2025
    Authors
    Amaan Poonawala
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Area covered
    YouTube
    Description

    YouTube Comments Sentiment Analysis Dataset (1M+ Labeled Comments)

    Overview

    This dataset comprises over one million YouTube comments, each annotated with a sentiment label: Positive, Neutral, or Negative. The comments span a diverse range of topics including programming, news, sports, politics and more, and are enriched with comprehensive metadata to facilitate various NLP and sentiment analysis tasks.

    Dataset Contents

    Each record in the dataset includes the following fields:

    • CommentID: A unique identifier assigned to each YouTube comment. This allows for individual tracking and analysis of comments.
    • VideoID: The unique identifier of the YouTube video to which the comment belongs. This links each comment to its corresponding video.
    • VideoTitle: The title of the YouTube video where the comment was posted. This provides context about the video's content.
    • AuthorName: The display name of the user who posted the comment. This indicates the commenter's identity.
    • AuthorChannelID: The unique identifier of the YouTube channel of the comment's author. This allows for tracking comments across different videos from the same author.
    • CommentText: The actual text content of the YouTube comment. This is the raw data used for sentiment analysis.
    • Sentiment: The sentiment classification of the comment, typically categorized as positive, negative, or neutral. This represents the emotional tone of the comment.
    • Likes: The number of likes received by the comment. This indicates the comment's popularity or agreement from other users.
    • Replies: The number of replies to the comment. This indicates the level of engagement and discussion generated by the comment.
    • PublishedAt: The date and time when the comment was published. This allows for time-based analysis of comment trends.
    • CountryCode: The two-letter country code of the user who posted the comment. This can be used to analyze regional sentiment.
    • CategoryID: The category ID of the video that the comment was posted on. This allows for analysis of sentiment across video categories.

    Key Features:

    • Sentiment Analysis: Each comment has been categorized into positive, negative, or neutral sentiment, allowing for direct analysis of emotional tone.
    • Video and Author Metadata: The dataset includes information about the videos (title, category, ID) and authors (channel ID, name), enabling contextual analysis.
    • Engagement Metrics: Columns such as "Likes" and "Replies" provide insights into comment popularity and discussion levels.
    • Temporal and Geographical Data: "PublishedAt" and "CountryCode" columns allow for time-based and regional sentiment analysis.
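As a rough illustration of the regional analysis these columns allow, the sketch below counts sentiment labels per country using only the standard library. The sample rows are invented, and the exact spelling of the stored sentiment values ("Positive", "Negative", "Neutral") is an assumption; only the column names come from the field list above.

```python
import csv
import io
from collections import Counter, defaultdict

# Hypothetical sample rows; a real run would open the downloaded CSV instead.
SAMPLE = """CommentID,CommentText,Sentiment,Likes,CountryCode
c1,great tutorial,Positive,12,US
c2,too slow,Negative,0,US
c3,ok video,Neutral,1,IN
c4,loved it,Positive,5,IN
"""


def sentiment_by_country(rows):
    """Map each CountryCode to a Counter of its Sentiment labels."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row["CountryCode"]][row["Sentiment"]] += 1
    return counts


def positive_share(counter):
    """Fraction of comments labeled Positive (0.0 for an empty counter)."""
    total = sum(counter.values())
    return counter["Positive"] / total if total else 0.0


counts = sentiment_by_country(csv.DictReader(io.StringIO(SAMPLE)))
shares = {country: positive_share(c) for country, c in counts.items()}
```

The same grouping works for `CategoryID` or for a date bucket derived from `PublishedAt`.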

    Data Collection & Labeling Process

    • Extraction:
      Comments were gathered using the YouTube Data API, ensuring a rich and diverse collection from multiple channels and regions.
    • Sentiment Labeling:
      A combination of advanced AI (using models such as Gemini) and manual validation was used to accurately label each comment.
    • Cleaning & Preprocessing:
      Comprehensive cleaning steps were applied—removing extraneous noise like timestamps, code snippets, and special characters—to ensure high-quality, ready-to-use text.
    • Augmentation for Balance:
      To address class imbalances (especially for underrepresented negative and neutral sentiments), a comment augmentation process was implemented. This process generated synthetic variations of selected comments, increasing linguistic diversity while preserving the original sentiment, thus ensuring a more balanced dataset.

    Benefits for Users

    • Scale & Diversity:
      With over 1M comments from various domains, this dataset offers a rich resource for training and evaluating sentiment analysis models.
    • Quality & Consistency:
      Rigorous cleaning, preprocessing, and augmentation ensure that the data is both reliable and representative of real-world YouTube interactions.
    • Versatility:
      Ideal for researchers, data scientists, and developers looking to build or fine-tune large language models for sentiment analysis, content moderation, and other NLP applications.

    Uses:

    • Sentiment analysis of YouTube comments.
    • Analysis of viewer engagement and discussion patterns.
    • Exploration of sentiment trends across different video categories.
    • Regional sentiment analysis.
    • Building machine learning models for sentiment prediction.
    • Analyzing the impact of video content on viewer sentiment.

    This dataset is open-sourced to encourage collaboration and innovation. Detailed documentation and the code used for extraction, labeling, and augmentation are available in the accompanying GitHub repository.

  16. Sentiment Analysis Dataset [Consumer Reviews] – Labeled feedback for...

    • datarade.ai
    .csv, .xls, .txt
    Cite
    WiserBrand.com, Sentiment Analysis Dataset [Consumer Reviews] – Labeled feedback for training, benchmarking, and CX modeling [Dataset]. https://datarade.ai/data-products/sentiment-analysis-dataset-consumer-reviews-labeled-feedb-wiserbrand-com
    Explore at:
    Available download formats: .csv, .xls, .txt
    Dataset provided by
    WiserBrand
    Area covered
    Liechtenstein, Bulgaria, Italy, Albania, Greenland, Canada, Panama, Faroe Islands, Saint Pierre and Miquelon, Luxembourg
    Description

    This dataset provides millions of consumer reviews enriched with sentiment labels (positive, neutral, or negative), making it an essential asset for training AI models, analyzing customer satisfaction, and detecting risk signals in customer feedback.

    Collected across 970+ marketplaces (including Amazon, eBay, Temu, Flipkart, and others) and spanning 160+ industries, it reflects how consumers express delight, frustration, or dissatisfaction in real purchase and service situations.

    Each entry includes:

    • Full written review text
    • Assigned sentiment label: positive, neutral, or negative
    • Product/service category and platform (e.g., electronics on Amazon)
    • Optional metadata: review date, star rating, region, brand name

    Use this dataset to:

    • Train sentiment analysis engines and review classifiers
    • Benchmark brand perception and shifts in consumer tone over time
    • Detect complaints masked in neutral or positive ratings
    • Feed LLMs and generative AI with labeled opinion data for alignment tasks
    • Monitor market sentiment by product, platform, or geography

    Whether you're building models or measuring brand trust, this dataset offers a structured view of consumer emotion, helping you turn unstructured feedback into meaningful action.

    The more you purchase, the lower the price will be.

  17. turkish-sentiment-analysis-dataset

    • huggingface.co
    • kaggle.com
    Updated Jun 22, 2022
    Cite
    Batuhan (2022). turkish-sentiment-analysis-dataset [Dataset]. https://huggingface.co/datasets/winvoker/turkish-sentiment-analysis-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 22, 2022
    Authors
    Batuhan
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dataset

    This dataset contains positive, negative, and neutral ("notr") sentences from several data sources given in the references. In most sentiment models there are only two labels, positive and negative. However, user input can be an entirely neutral sentence, and I could find no data for such cases. Therefore I created this dataset with 3 classes. Positive and negative sentences are listed below. Neutral examples are extracted from a Turkish wiki dump. In addition, added some random text… See the full description on the dataset page: https://huggingface.co/datasets/winvoker/turkish-sentiment-analysis-dataset.

  18. Produced Data of Naive Bayes Sentiment Classifier

    • data-staging.niaid.nih.gov
    • data.niaid.nih.gov
    Updated Jul 12, 2024
    Cite
    Jeffrey Resnik (2024). Produced Data of Naive Bayes Sentiment Classifier [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_7934163
    Explore at:
    Dataset updated
    Jul 12, 2024
    Authors
    Jeffrey Resnik
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the data produced by running the Naive Bayes classifier algorithm. It lists every word in the classifier's vocabulary, together with the number of occurrences of each word and the word's likelihood ratio. Please note that the likelihood ratio is calculated by dividing the likelihood of a word given a positive label by the likelihood of that word given a negative label. This data is licensed under the CC BY 4.0 international license, and may be taken and used freely with credit given. The data was produced from two different datasets using a Naive Bayes classifier: the Polarity Review v2.0 dataset from Cornell and the Large Movie Review Dataset from Stanford.
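The likelihood ratio described above, P(word | positive) / P(word | negative), can be sketched as follows. This is a minimal illustration rather than the author's code: whitespace tokenization and add-one (Laplace) smoothing with parameter `alpha` are assumptions, added so that a word missing from one class does not cause division by zero.

```python
from collections import Counter


def likelihood_ratios(pos_docs, neg_docs, alpha=1.0):
    """Return {word: (total occurrences, P(word|pos) / P(word|neg))}."""
    pos_counts = Counter(w for d in pos_docs for w in d.split())
    neg_counts = Counter(w for d in neg_docs for w in d.split())
    vocab = set(pos_counts) | set(neg_counts)
    # Smoothed denominators: class token count plus alpha per vocabulary word.
    pos_total = sum(pos_counts.values()) + alpha * len(vocab)
    neg_total = sum(neg_counts.values()) + alpha * len(vocab)
    table = {}
    for word in vocab:
        p_pos = (pos_counts[word] + alpha) / pos_total
        p_neg = (neg_counts[word] + alpha) / neg_total
        table[word] = (pos_counts[word] + neg_counts[word], p_pos / p_neg)
    return table


# Hypothetical toy reviews standing in for the two movie-review corpora.
table = likelihood_ratios(
    ["great fun film", "great cast"],
    ["dull film", "dull dull plot"],
)
```

A ratio above 1 marks a word as more typical of positive reviews, a ratio below 1 as more typical of negative ones, and a ratio near 1 as uninformative.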

  19. Text Annotation Tool Report

    • archivemarketresearch.com
    doc, pdf, ppt
    Updated Oct 11, 2025
    Cite
    Archive Market Research (2025). Text Annotation Tool Report [Dataset]. https://www.archivemarketresearch.com/reports/text-annotation-tool-562724
    Explore at:
    Available download formats: ppt, pdf, doc
    Dataset updated
    Oct 11, 2025
    Dataset authored and provided by
    Archive Market Research
    License

    https://www.archivemarketresearch.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    Explore the surging Text Annotation Tool market, projected to reach $850 million by 2025 with an 18.5% CAGR. Discover key drivers like NLP and AI adoption, alongside market trends and competitive landscape.

  20. Data from: News sentiment analysis datasets for Serbian, Bosnian,...

    • live.european-language-grid.eu
    • clarin.si
    binary format
    Updated Nov 12, 2024
    Cite
    (2024). News sentiment analysis datasets for Serbian, Bosnian, Macedonian, Albanian and Estonian SADEmma 1.0 [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/23729
    Explore at:
    Available download formats: binary format
    Dataset updated
    Nov 12, 2024
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    We provide annotated datasets on a three-point sentiment scale (positive, neutral and negative) for Serbian, Bosnian, Macedonian, Albanian, and Estonian. For all languages except Estonian, we include pairs of source URL (where corresponding text can be found) and sentiment label.

    For Estonian, we randomly sampled 100 articles from "Ekspress news article archive (in Estonian and Russian) 1.0" (http://hdl.handle.net/11356/1408).

    The data is organized in Tab-Separated Values (TSV) format. For Serbian, Bosnian, Macedonian, and Albanian, the dataset contains two columns: sourceURL and sentiment. For Estonian, the dataset consists of three columns: text ID (from the CLARIN.SI reference above), body text, and sentiment label.
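The two TSV layouts described above can be read with one small helper. This is a sketch under stated assumptions: the files are taken to have no header row, quoting is disabled so quotation marks in article text are kept literally, and the Estonian column names `text_id` and `body` are hypothetical stand-ins for "text ID" and "body text".

```python
import csv
import io

URL_COLUMNS = ("sourceURL", "sentiment")        # Serbian, Bosnian, Macedonian, Albanian
TEXT_COLUMNS = ("text_id", "body", "sentiment")  # Estonian (names are hypothetical)


def read_sentiment_tsv(handle, columns):
    """Parse a tab-separated stream into a list of dicts keyed by `columns`."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [dict(zip(columns, row)) for row in reader if row]


# Hypothetical sample content; real use would pass an opened file instead.
url_sample = "https://example.org/a\tpositive\nhttps://example.org/b\tnegative\n"
text_sample = '42\tSome "quoted" article body\tneutral\n'

url_rows = read_sentiment_tsv(io.StringIO(url_sample), URL_COLUMNS)
text_rows = read_sentiment_tsv(io.StringIO(text_sample), TEXT_COLUMNS)
```

For the URL-only languages, the article text still has to be fetched from each `sourceURL` before any model can be trained on it.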

Cite
abdelmalek eladjelet (2025). Sentiment Analysis Dataset [Dataset]. https://www.kaggle.com/datasets/abdelmalekeladjelet/sentiment-analysis-dataset

Sentiment Analysis Dataset

Dataset for text classification

Explore at:
Available download formats: zip (9105036 bytes)
Dataset updated
May 3, 2025
Authors
abdelmalek eladjelet
License

https://creativecommons.org/publicdomain/zero/1.0/

Description

🧠 Multi-Class Sentiment Analysis Dataset (240K+ English Comments)

📌 Description

This dataset is a large-scale collection of 241,000+ English-language comments sourced from various online platforms. Each comment is annotated with a sentiment label:

  • 0 — Negative
  • 1 — Neutral
  • 2 — Positive

The data has been gathered from multiple websites, including:

  • Hugging Face: https://huggingface.co/datasets/Sp1786/multiclass-sentiment-analysis-dataset
  • Kaggle: https://www.kaggle.com/datasets/abhi8923shriv/sentiment-analysis-dataset
  • Kaggle: https://www.kaggle.com/datasets/jp797498e/twitter-entity-sentiment-analysis
  • Kaggle: https://www.kaggle.com/datasets/crowdflower/twitter-airline-sentiment

The goal is to enable training and evaluation of multi-class sentiment analysis models for real-world text data. The dataset is already preprocessed (lowercased and stripped of punctuation, URLs, numbers, and stopwords) and is ready for NLP pipelines.
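For readers who want to apply comparable cleaning to their own text before mixing it with this dataset, the listed steps might look like the sketch below. It is not the dataset author's pipeline: the regular expressions, the order of the steps, and the tiny stopword list are all assumptions.

```python
import re
import string

# Tiny illustrative stopword list; the list actually used for the dataset is not documented here.
STOPWORDS = {"the", "a", "an", "is", "and", "to", "of"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
DIGIT_RE = re.compile(r"\d+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess(text: str) -> str:
    """Lowercase, then strip URLs, numbers, punctuation, and stopwords."""
    text = text.lower()
    text = URL_RE.sub(" ", text)         # remove URLs before punctuation is stripped
    text = DIGIT_RE.sub(" ", text)       # remove numbers
    text = text.translate(PUNCT_TABLE)   # remove ASCII punctuation
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return " ".join(tokens)


cleaned = preprocess("The checkout is FAST, see https://example.com in 2 steps!")
```

URLs are removed first because stripping punctuation earlier would break them into ordinary-looking tokens.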

📊 Columns

Column | Description
Comment | User-generated text content
Sentiment | Sentiment label (0=Negative, 1=Neutral, 2=Positive)

🚀 Use Cases

  • 🧠 Train sentiment classifiers using LSTM, BiLSTM, CNN, BERT, or RoBERTa
  • 🔍 Evaluate preprocessing and tokenization strategies
  • 📈 Benchmark NLP models on multi-class classification tasks
  • 🎓 Educational projects and research in opinion mining or text classification
  • 🧪 Fine-tune transformer models on a large and diverse sentiment dataset

💬 Example

Comment: "apple pay is so convenient secure and easy to use"
Sentiment: 2 (Positive)
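
A minimal sketch of reading rows in this two-column layout and decoding the integer labels is given below. It assumes a CSV file with a header row named exactly `Comment` and `Sentiment` and labels stored as plain integers; the in-memory sample stands in for the downloaded file.

```python
import csv
import io

LABEL_NAMES = {0: "Negative", 1: "Neutral", 2: "Positive"}

# In-memory stand-in for the dataset file, using the example comment above.
SAMPLE = "Comment,Sentiment\napple pay is so convenient secure and easy to use,2\n"


def load_rows(handle):
    """Return (comment, label id, label name) tuples from a CSV stream."""
    rows = []
    for row in csv.DictReader(handle):
        label = int(row["Sentiment"])
        rows.append((row["Comment"], label, LABEL_NAMES[label]))
    return rows


rows = load_rows(io.StringIO(SAMPLE))
```

A label outside 0-2 raises a `KeyError`, which surfaces malformed rows early instead of silently passing them to a model.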