100+ datasets found
  1. Datasets for Sentiment Analysis

    • zenodo.org
    csv
    Updated Dec 10, 2023
    Cite
    Julie R. Repository creator - Campos Arias (2023). Datasets for Sentiment Analysis [Dataset]. http://doi.org/10.5281/zenodo.10157504
    Available download formats: csv
    Dataset updated: Dec 10, 2023
    Dataset provided by: Zenodo (http://zenodo.org/)
    Authors: Julie R. Repository creator - Campos Arias
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository was created for my Master's thesis in Computational Intelligence and Internet of Things at the University of Córdoba, Spain. The purpose of this repository is to store the datasets found that were used in some of the studies that served as research material for this Master's thesis. Also, the datasets used in the experimental part of this work are included.

    Below are the datasets specified, along with the details of their references, authors, and download sources.

    ----------- STS-Gold Dataset ----------------

    The dataset consists of 2026 tweets. The file consists of 3 columns: id, polarity, and tweet. The three columns denote the unique id, polarity index of the text and the tweet text respectively.
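
    As a rough illustration of the three-column layout described above, the following sketch reads a file of that shape with Python's standard csv module and counts rows per polarity value. The sample rows, the comma delimiter, and the polarity values are assumptions made for the example, not taken from sts_gold_tweet.csv.

```python
import csv
import io
from collections import Counter

# Hypothetical sample in the documented id, polarity, tweet layout.
# The delimiter and the polarity values are assumptions.
SAMPLE = '''id,polarity,tweet
1,0,"missed the bus again, great"
2,4,"loving the new album"
3,4,"what a lovely morning"
'''


def polarity_counts(handle):
    """Count rows per polarity value in an id/polarity/tweet CSV."""
    reader = csv.DictReader(handle)
    return Counter(row["polarity"] for row in reader)


counts = polarity_counts(io.StringIO(SAMPLE))
```

    For the real file, the same function could be called on an open file handle, adjusting the delimiter if the file uses a different one.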

    Reference: Saif, H., Fernandez, M., He, Y., & Alani, H. (2013). Evaluation datasets for Twitter sentiment analysis: a survey and a new dataset, the STS-Gold.

    File name: sts_gold_tweet.csv

    ----------- Amazon Sales Dataset ----------------

    This dataset contains ratings and reviews for 1K+ Amazon products, as listed on the official Amazon website. The data was scraped from the official Amazon website in January 2023.

    Owner: Karkavelraja J., Postgraduate student at Puducherry Technological University (Puducherry, Puducherry, India)

    Features:

    • product_id - Product ID
    • product_name - Name of the Product
    • category - Category of the Product
    • discounted_price - Discounted Price of the Product
    • actual_price - Actual Price of the Product
    • discount_percentage - Percentage of Discount for the Product
    • rating - Rating of the Product
    • rating_count - Number of people who voted for the Amazon rating
    • about_product - Description about the Product
    • user_id - ID of the user who wrote review for the Product
    • user_name - Name of the user who wrote review for the Product
    • review_id - ID of the user review
    • review_title - Short review
    • review_content - Long review
    • img_link - Image Link of the Product
    • product_link - Official Website Link of the Product

    License: CC BY-NC-SA 4.0

    File name: amazon.csv

    ----------- Rotten Tomatoes Reviews Dataset ----------------

    This rating inference dataset is a sentiment classification dataset, containing 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews. On average, these reviews consist of 21 words. The first 5331 rows contain only negative samples and the last 5331 rows contain only positive samples, so the data should be shuffled before use.

    This data is collected from https://www.cs.cornell.edu/people/pabo/movie-review-data/ as a txt file and converted into a csv file. The file consists of 2 columns: reviews and labels (1 for fresh (good) and 0 for rotten (bad)).
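
    Because the rows are grouped by class, a split taken without shuffling could end up containing a single label. A minimal sketch of a reproducible shuffle using only the standard library is shown below; the miniature row list and the seed are hypothetical stand-ins for the contents of data_rt.csv.

```python
import random


def shuffle_rows(rows, seed=0):
    """Return a shuffled copy of rows; the input list is left untouched."""
    shuffled = list(rows)
    # A dedicated Random instance keeps the shuffle reproducible
    # without touching the global random state.
    random.Random(seed).shuffle(shuffled)
    return shuffled


# Hypothetical miniature of the file layout: negatives (0) first, positives (1) last.
rows = [("bad film", 0)] * 5 + [("good film", 1)] * 5
mixed = shuffle_rows(rows, seed=42)
```

    Fixing the seed makes experiments repeatable; a different seed gives a different but equally valid ordering.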

    Reference: Bo Pang and Lillian Lee. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 115–124, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics

    File name: data_rt.csv

    ----------- Preprocessed Dataset Sentiment Analysis ----------------

    Preprocessed Amazon product review data for the Gen3EcoDot (Alexa), scraped entirely from amazon.in.
    Stemmed and lemmatized using nltk.
    Sentiment labels are generated using TextBlob polarity scores.

    The file consists of 4 columns: index, review (stemmed and lemmatized review using nltk), polarity (score) and division (categorical label generated using polarity score).
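
    The description does not say which cut-offs turn a polarity score into the division label. As a sketch only, the function below shows one plausible mapping for scores in the range [-1, 1] (the range TextBlob uses for polarity); the thresholds and the label names are assumptions.

```python
def division_from_polarity(polarity, neutral_band=0.0):
    """Map a polarity score in [-1, 1] to a categorical label.

    The cut-offs are assumptions; the dataset does not document its own.
    Scores within +/- neutral_band of zero are treated as neutral.
    """
    if polarity > neutral_band:
        return "positive"
    if polarity < -neutral_band:
        return "negative"
    return "neutral"
```

    With the default band only a score of exactly 0 is neutral; widening neutral_band (for example to 0.1) treats weakly polar reviews as neutral too.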

    DOI: 10.34740/kaggle/dsv/3877817

    Citation: @misc{pradeesh arumadi_2022, title={Preprocessed Dataset Sentiment Analysis}, url={https://www.kaggle.com/dsv/3877817}, DOI={10.34740/KAGGLE/DSV/3877817}, publisher={Kaggle}, author={Pradeesh Arumadi}, year={2022} }

    This dataset was used in the experimental phase of my research.

    File name: EcoPreprocessed.csv

    ----------- Amazon Earphones Reviews ----------------

    This dataset consists of 9930 Amazon reviews and star ratings for the 10 latest (as of mid-2019) Bluetooth earphone devices, intended for learning how to train a machine for sentiment analysis.

    This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.

    The file consists of 5 columns: ReviewTitle, ReviewBody, ReviewStar, Product and division (manually added - categorical label generated using ReviewStar score)
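
    How ReviewStar was turned into the division column is not specified. The sketch below shows one common convention (4-5 stars positive, 3 neutral, 1-2 negative); these thresholds and label names are assumptions, not the author's documented rule.

```python
def division_from_stars(stars):
    """Map a 1-5 star rating to a sentiment label (assumed thresholds)."""
    if not 1 <= stars <= 5:
        raise ValueError(f"star rating out of range: {stars}")
    if stars >= 4:
        return "positive"
    if stars <= 2:
        return "negative"
    return "neutral"
```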

    License: U.S. Government Works

    Source: www.amazon.in

    File name (original): AllProductReviews.csv (contains 14337 reviews)

    File name (edited - used for my research) : AllProductReviews2.csv (contains 9930 reviews)

    ----------- Amazon Musical Instruments Reviews ----------------

    This dataset contains 7137 comments/reviews of different musical instruments coming from Amazon.

    This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.

    The file consists of 10 columns: reviewerID, asin (ID of the product), reviewerName, helpful (helpfulness rating of the review), reviewText, overall (rating of the product), summary (summary of the review), unixReviewTime (time of the review - unix time), reviewTime (time of the review - raw) and division (manually added - categorical label generated using overall score).
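
    The unixReviewTime column can be converted to a calendar date with the standard library. The sketch below assumes the values are seconds since the Unix epoch and interprets them in UTC; both points are assumptions about the file.

```python
from datetime import datetime, timezone


def review_date(unix_review_time):
    """Convert a unix timestamp in seconds to an ISO date string (UTC)."""
    moment = datetime.fromtimestamp(int(unix_review_time), tz=timezone.utc)
    return moment.date().isoformat()
```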

    Source: http://jmcauley.ucsd.edu/data/amazon/

    File name (original): Musical_instruments_reviews.csv (contains 10261 reviews)

    File name (edited - used for my research) : Musical_instruments_reviews2.csv (contains 7137 reviews)

  2. Sentiment Analysis Dataset

    • kaggle.com
    zip
    Updated May 3, 2025
    Cite
    abdelmalek eladjelet (2025). Sentiment Analysis Dataset [Dataset]. https://www.kaggle.com/datasets/abdelmalekeladjelet/sentiment-analysis-dataset
    Available download formats: zip (9105036 bytes)
    Dataset updated: May 3, 2025
    Authors: abdelmalek eladjelet
    License: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    🧠 Multi-Class Sentiment Analysis Dataset (240K+ English Comments)

    📌 Description

    This dataset is a large-scale collection of 241,000+ English-language comments sourced from various online platforms. Each comment is annotated with a sentiment label:

    • 0 — Negative
    • 1 — Neutral
    • 2 — Positive

    The data has been gathered from multiple sources:

    • Hugging Face: https://huggingface.co/datasets/Sp1786/multiclass-sentiment-analysis-dataset
    • Kaggle: https://www.kaggle.com/datasets/abhi8923shriv/sentiment-analysis-dataset
    • Kaggle: https://www.kaggle.com/datasets/jp797498e/twitter-entity-sentiment-analysis
    • Kaggle: https://www.kaggle.com/datasets/crowdflower/twitter-airline-sentiment

    The goal is to enable training and evaluation of multi-class sentiment analysis models for real-world text data. The dataset is already preprocessed — lowercase, cleaned from punctuation, URLs, numbers, and stopwords — and is ready for NLP pipelines.
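
    To apply comparable cleaning to new text before scoring it with a model trained on this data, a pipeline along the following lines could be used. This is a sketch of the listed steps (lowercasing, removing URLs, numbers, punctuation, and stopwords), not the author's actual code; the regular expressions and the tiny stopword list are assumptions.

```python
import re
import string

# Tiny illustrative stopword list; the list used for the dataset is not documented.
STOPWORDS = {"the", "a", "an", "is", "to", "and", "of"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
DIGIT_RE = re.compile(r"\d+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_comment(text):
    """Lowercase, then strip URLs, digits, punctuation, and stopwords."""
    text = text.lower()
    text = URL_RE.sub(" ", text)      # URLs first, before punctuation is removed
    text = DIGIT_RE.sub(" ", text)
    text = text.translate(PUNCT_TABLE)
    return " ".join(word for word in text.split() if word not in STOPWORDS)
```

    URLs are removed before punctuation so that the pattern still sees the "://" that identifies them.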

    📊 Columns

    • Comment - User-generated text content
    • Sentiment - Sentiment label (0=Negative, 1=Neutral, 2=Positive)
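
    The numeric codes can be turned back into readable names with a small lookup. The mapping below follows the codes given above; the helper name is hypothetical.

```python
SENTIMENT_NAMES = {0: "Negative", 1: "Neutral", 2: "Positive"}


def decode_sentiment(value):
    """Turn a Sentiment cell (int or numeric string) into its label name."""
    return SENTIMENT_NAMES[int(value)]
```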

    🚀 Use Cases

    • 🧠 Train sentiment classifiers using LSTM, BiLSTM, CNN, BERT, or RoBERTa
    • 🔍 Evaluate preprocessing and tokenization strategies
    • 📈 Benchmark NLP models on multi-class classification tasks
    • 🎓 Educational projects and research in opinion mining or text classification
    • 🧪 Fine-tune transformer models on a large and diverse sentiment dataset

    💬 Example

    Comment: "apple pay is so convenient secure and easy to use"
    Sentiment: 2 (Positive)
    
  3. Sentiment Analysis Dataset

    • cubig.ai
    zip
    Updated May 20, 2025
    Cite
    CUBIG (2025). Sentiment Analysis Dataset [Dataset]. https://cubig.ai/store/products/270/sentiment-analysis-dataset
    Available download formats: zip
    Dataset updated: May 20, 2025
    Dataset authored and provided by: CUBIG
    License: https://cubig.ai/store/terms-of-service
    Measurement technique: Privacy-preserving data transformation via differential privacy; synthetic data generation using AI techniques for model training
    Description

    1) Data Introduction
    • The Sentiment Analysis Dataset is a dataset for sentiment analysis. It includes large-scale tweet text collected from Twitter and a sentiment polarity label (0=negative, 2=neutral, 4=positive) for each tweet, assigned automatically based on emoticons.

    2) Data Utilization
    (1) Characteristics of the Sentiment Analysis Dataset:
    • Each sample consists of six columns: sentiment polarity, tweet ID, date of writing, search word, author, and tweet body. It is suitable for training natural language processing and classification models using tweet text and sentiment labels.
    (2) The Sentiment Analysis Dataset can be used for:
    • Sentiment classification model development: Using tweet text and sentiment polarity labels, you can build automatic positive, negative, and neutral classification models with various machine learning and deep learning models such as logistic regression, SVM, RNN, and LSTM.
    • Analysis of SNS public opinion and trends: By analyzing the distribution of sentiment over time and by keyword, you can explore changes in public opinion on specific issues or brands, positive and negative trends, and key sentiment keywords.
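
    Many classifiers expect class indices that start at 0 and have no gaps, so the 0/2/4 polarity codes are usually remapped before training. A minimal sketch follows; the target indices and names are a convention chosen for the example.

```python
# Source codes (0, 2, 4) come from the description; the contiguous
# target indices are an assumption made for illustration.
POLARITY_TO_CLASS = {0: 0, 2: 1, 4: 2}
CLASS_NAMES = ["negative", "neutral", "positive"]


def to_class_index(polarity):
    """Map a 0/2/4 polarity code (int or numeric string) to 0/1/2."""
    return POLARITY_TO_CLASS[int(polarity)]
```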

  4. Multiple Data for Sentiment Analysis

    • kaggle.com
    zip
    Updated May 26, 2024
    Cite
    dangerous AI (2024). Multiple Data for Sentiment Analysis [Dataset]. https://www.kaggle.com/datasets/dangerousai/multiple-data-for-sentiment-analysis/code
    Available download formats: zip (72367351 bytes)
    Dataset updated: May 26, 2024
    Authors: dangerous AI
    Description

    Special Purpose

    This dataset was created to improve sentiment analysis on texts written by English beginners rather than texts from online blogs, so it may not be useful for sentiment classification of tweets or Reddit posts. There is also a somewhat larger file containing samples from Twitter, which are much more diverse but less clean.

    Methodology

    Samples from LLMs

    The texts are all generated by LLMs, including GPT-3.5-turbo and ChatGLM-4, using simple prompts. The LLMs are prompted to generate new texts on the basis of previous texts, and are strictly required to generate distinctive sentences.

    Large Scale Data

    The large data set contains more than 1.7 million diverse and clean text-sentiment pairs. The data come from the following datasets:

    • https://www.kaggle.com/datasets/saurabhshahane/twitter-sentiment-dataset
    • https://www.kaggle.com/datasets/jp797498e/twitter-entity-sentiment-analysis
    • https://www.kaggle.com/datasets/kazanova/sentiment140
    • https://www.kaggle.com/datasets/abhi8923shriv/sentiment-analysis-dataset

  5. turkish-sentiment-analysis-dataset

    • huggingface.co
    • kaggle.com
    Updated Jun 22, 2022
    Cite
    Batuhan (2022). turkish-sentiment-analysis-dataset [Dataset]. https://huggingface.co/datasets/winvoker/turkish-sentiment-analysis-dataset
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated: Jun 22, 2022
    Authors: Batuhan
    License: Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dataset

    This dataset contains positive, negative and neutral sentences from several data sources given in the references. In most sentiment models, there are only two labels: positive and negative. However, user input can be a completely neutral sentence. I could not find any data for such cases, so I created this dataset with 3 classes. Positive and negative sentences are listed below. Neutral examples are extracted from a Turkish wiki dump. In addition, added some random text… See the full description on the dataset page: https://huggingface.co/datasets/winvoker/turkish-sentiment-analysis-dataset.

  6. Hindi_Sentiment_Dataset

    • kaggle.com
    zip
    Updated Apr 11, 2024
    Cite
    Pratham R Shetty (2024). Hindi_Sentiment_Dataset [Dataset]. https://www.kaggle.com/datasets/praths71018/hindi-sentiment-dataset
    Available download formats: zip (230354 bytes)
    Dataset updated: Apr 11, 2024
    Authors: Pratham R Shetty
    Description

    The dataset contains about 8000 sentences in Hindi, each classified with one of 7 labels: 'neutral', 'surprise', 'fear', 'sadness', 'joy', 'disgust', 'anger'. The dataset can be used for sentiment analysis of Hindi sentences by applying NLP or sequential learning models.
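
    Sequential models generally need integer targets rather than label strings. The sketch below builds a lookup from the seven labels named above; the index order simply follows the order in which the labels are listed and is an assumption.

```python
LABELS = ["neutral", "surprise", "fear", "sadness", "joy", "disgust", "anger"]

# Index order follows the listing order above (an arbitrary choice).
LABEL_TO_ID = {label: index for index, label in enumerate(LABELS)}
ID_TO_LABEL = {index: label for label, index in LABEL_TO_ID.items()}


def encode(labels):
    """Convert a sequence of label strings to integer class ids."""
    return [LABEL_TO_ID[label] for label in labels]
```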

  7. Chat Sentiment Dataset

    • kaggle.com
    zip
    Updated Mar 22, 2023
    Cite
    Nursyahrina (2023). Chat Sentiment Dataset [Dataset]. https://www.kaggle.com/datasets/nursyahrina/chat-sentiment-dataset
    Available download formats: zip (7598 bytes)
    Dataset updated: Mar 22, 2023
    Authors: Nursyahrina
    Description

    Chat Sentiment Dataset

    A Simple but Rich Dataset for Sentiment Analysis of Chat Messages

    Description:

    This dataset contains a collection of chat messages that can be used to develop a sentiment analysis machine learning model to classify messages into 3 sentiment classes - positive, negative, and neutral. The messages are diverse in nature, containing not only simple text but also special characters, numbers, emoji/emoticons, and URL addresses. The dataset can be used for various natural language processing tasks related to chat analysis.

    Column Descriptions:

    1. message: the content of the chat message.
    2. sentiment: the sentiment of the chat message, can be positive, negative, or neutral.
  8. Weibo sentiment analysis validation dataset

    • figshare.com
    txt
    Updated Nov 9, 2022
    Cite
    Chenjing Fan; Zhenyu Gai; Shiqi Li; Yirui Cao; Yueying Gu; Chenxi Jin; Yiyang Zhang; Yanling Ge; Lin Zhou (2022). Weibo sentiment analysis validation dataset [Dataset]. http://doi.org/10.6084/m9.figshare.21524391.v1
    Available download formats: txt
    Dataset updated: Nov 9, 2022
    Dataset provided by: Figshare (http://figshare.com/)
    Authors: Chenjing Fan; Zhenyu Gai; Shiqi Li; Yirui Cao; Yueying Gu; Chenxi Jin; Yiyang Zhang; Yanling Ge; Lin Zhou
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Humans spend most of their time in settlements, and the built environment of settlements may affect the residents’ sentiments. Research in this field is interdisciplinary, integrating urban planning and public health. However, it has been limited by the difficulty of quantifying subjective sentiments and the small sample size. This study uses 147,613 Weibo text check-ins in Xiamen from 2017 to quantify residents' sentiments in 1,096 neighborhoods in the city. A multilevel regression model and gradient boosting decision tree (GBDT) model are used to investigate the multilevel and nonlinear effects of the built environment of neighborhoods and subdistricts on residents' sentiments. The results show the following: 1) The multilevel regression model indicates that at the neighborhood level, a high land value, low plot ratio, low population density, more security facilities, and neighborhoods close to water are more likely to improve the residents’ sentiments. At the subdistrict level, more green space and commercial land, less industry, higher building density and road density, and a smaller migrant population are more likely to promote positive sentiments. Approximately 19% of the total variance in the sentiments occurred among subdistricts. 2) The number of security facilities, the proportion of green space and commercial land, and the density of buildings and roads are linearly correlated with residents' sentiments. The land value and the number of security facilities are basic needs and exhibit nonlinear correlations with sentiments. The plot ratio, population density, and the proportions of industrial land and the migrant population are advanced needs and are nonlinearly correlated with sentiments. The quantitative analysis of sentiments enables setting a threshold of the influence of the built environment on residents' sentiments in neighborhoods and surrounding areas. 
    Our results provide data support for urban planning and implementing targeted measures to improve the living environment of residents.

  9. Sentiment Analysis Dataset [Consumer Reviews] – Labeled feedback for...

    • datarade.ai
    .csv, .xls, .txt
    Cite
    WiserBrand.com, Sentiment Analysis Dataset [Consumer Reviews] – Labeled feedback for training, benchmarking, and CX modeling [Dataset]. https://datarade.ai/data-products/sentiment-analysis-dataset-consumer-reviews-labeled-feedb-wiserbrand-com
    Available download formats: .csv, .xls, .txt
    Dataset provided by: WiserBrand
    Area covered: Bulgaria, Italy, Faroe Islands, Canada, Panama, Albania, Luxembourg, Saint Pierre and Miquelon, Greenland, Liechtenstein
    Description

    This dataset provides millions of consumer reviews enriched with sentiment labels (positive, neutral, or negative), making it an essential asset for training AI models, analyzing customer satisfaction, and detecting risk signals in customer feedback.

    Collected across 970+ marketplaces (including Amazon, eBay, Temu, Flipkart, and others) and spanning 160+ industries, it reflects how consumers express delight, frustration, or dissatisfaction in real purchase and service situations.

    Each entry includes:

    • Full written review text
    • Assigned sentiment label: positive, neutral, or negative
    • Product/service category and platform (e.g., electronics on Amazon)
    • Optional metadata: review date, star rating, region, brand name

    Use this dataset to:

    • Train sentiment analysis engines and review classifiers
    • Benchmark brand perception and shifts in consumer tone over time
    • Detect complaints masked in neutral or positive ratings
    • Feed LLMs and generative AI with labeled opinion data for alignment tasks
    • Monitor market sentiment by product, platform, or geography

    Whether you're building models or measuring brand trust, this dataset offers a structured view of consumer emotion, helping you turn unstructured feedback into meaningful action.

    The more you purchase, the lower the price will be.

  10. my_dataset

    • huggingface.co
    Updated Mar 23, 2025
    Cite
    Neil Rainsforth (2025). my_dataset [Dataset]. https://huggingface.co/datasets/wkdnev/my_dataset
    Dataset updated: Mar 23, 2025
    Authors: Neil Rainsforth
    License: MIT License, https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Test Sentiment Dataset

    A small sample dataset for text classification tasks, specifically binary sentiment analysis (positive or negative). Useful for testing, demos, or building and validating pipelines with Hugging Face Datasets.

    Dataset Summary

    This dataset contains short text samples labeled as either positive or negative. It is intended for testing purposes and includes:

    • 10 training samples
    • 4 test samples

    Each example includes:

    text: A short sentence or review… See the full description on the dataset page: https://huggingface.co/datasets/wkdnev/my_dataset.

  11. Twitter Tweets Sentiment Dataset

    • kaggle.com
    zip
    Updated Apr 8, 2022
    Cite
    M Yasser H (2022). Twitter Tweets Sentiment Dataset [Dataset]. https://www.kaggle.com/datasets/yasserh/twitter-tweets-sentiment-dataset
    Available download formats: zip (1289519 bytes)
    Dataset updated: Apr 8, 2022
    Authors: M Yasser H
    License: https://creativecommons.org/publicdomain/zero/1.0/

    Description


    Description:

    Twitter is an online social media platform where people share their thoughts as tweets. It has been observed that some people misuse it to tweet hateful content. Twitter is trying to tackle this problem, and we shall help by creating a strong NLP-based classifier model to distinguish negative tweets and block them. Can you build a strong classifier model to predict the same?

    Each row contains the text of a tweet and a sentiment label. In the training set you are provided with a word or phrase drawn from the tweet (selected_text) that encapsulates the provided sentiment.

    Make sure, when parsing the CSV, to remove the beginning / ending quotes from the text field, to ensure that you don't include them in your training.
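
    One way to follow this advice is sketched below. A standards-aware CSV parser such as Python's csv module already consumes the quoting around a field, and a small helper can remove any literal quote pair that still wraps the text. The sample row and the helper name are hypothetical.

```python
import csv
import io


def strip_outer_quotes(text):
    """Remove one pair of matching leading/trailing quote characters, if present."""
    text = text.strip()
    if len(text) >= 2 and text[0] == text[-1] and text[0] in "\"'":
        return text[1:-1]
    return text


# Hypothetical row; csv.DictReader already consumes the CSV-level quoting.
SAMPLE = 'textID,text,sentiment\nabc123,"so happy, today",positive\n'
rows = list(csv.DictReader(io.StringIO(SAMPLE)))
cleaned = [strip_outer_quotes(row["text"]) for row in rows]
```

    Splitting lines on commas by hand would both keep the quotes and break fields that contain commas, which is why a CSV parser is used here.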

    You're attempting to predict the word or phrase from the tweet that exemplifies the provided sentiment. The word or phrase should include all characters within that span (i.e. including commas, spaces, etc.)

    Columns:

    1. textID - unique ID for each piece of text
    2. text - the text of the tweet
    3. sentiment - the general sentiment of the tweet

    Acknowledgement:

    The dataset was downloaded from Kaggle Competitions:
    https://www.kaggle.com/c/tweet-sentiment-extraction/data?select=train.csv

    Objective:

    • Understand the Dataset & cleanup (if required).
    • Build classification models to predict the twitter sentiments.
    • Compare the evaluation metrics of various classification algorithms.
  12. Dataset: Towards Trustworthy Sentiment Analysis in Software Engineering:...

    • figshare.com
    xlsx
    Updated Jul 2, 2025
    Cite
    Martin Obaidi; Marc Herrmann; Jil Klünder; Kurt Schneider (2025). Dataset: Towards Trustworthy Sentiment Analysis in Software Engineering: Dataset Characteristics and Tool Selection [Dataset]. http://doi.org/10.6084/m9.figshare.29250935.v1
    Available download formats: xlsx
    Dataset updated: Jul 2, 2025
    Dataset provided by: Figshare (http://figshare.com/)
    Authors: Martin Obaidi; Marc Herrmann; Jil Klünder; Kurt Schneider
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset: Towards Trustworthy Sentiment Analysis in Software Engineering: Dataset Characteristics and Tool Selection

    Authors

    Martin Obaidi, Marc Herrmann, Jil Klünder, Kurt Schneider

    Description

    This dataset accompanies the publication: Towards Trustworthy Sentiment Analysis in Software Engineering: Dataset Characteristics and Tool Selection

    The dataset contains all coded data and annotation results from a comprehensive analysis of sentiment and linguistic characteristics in software engineering communication. The study benchmarks 14 sentiment analysis tools across 10 datasets from five major SE platforms and investigates how dataset characteristics impact tool performance and selection. The coded data underpins the development of a practical questionnaire-based recommendation approach for trustworthy and context-sensitive sentiment analysis in SE.

    Contents

    The dataset includes the following file:

    All_Sample_Sets_Coded-v04.xlsx

    Contains manually coded sample sets from five platforms (App Reviews, Code Reviews, GitHub, Jira, Stack Overflow). Each worksheet corresponds to one platform and provides:

    • The raw text of the communication sample (“Text”).
    • Gold-standard sentiment labels (“oracle”): -1 = Negative, 0 = Neutral, 1 = Positive.
    • Annotations for 13 linguistic characteristics: for each characteristic, x = present, n = not present, and an empty cell = not applicable for this item (e.g., if a characteristic is only relevant for positive statements).

    This enables detailed cross-platform analysis of both sentiment polarity and linguistic features in developer communication.

    Column details:

    • Text: Communication/document text.
    • oracle: Gold-standard sentiment label.
    • Characteristic 1 – 13: See accompanying paper for definitions. Annotation can be x, n, or empty (not applicable).

    If you use this dataset, please cite:

    Obaidi, M., Herrmann, M., Klünder, J., Schneider, K. (2025). Towards Trustworthy Sentiment Analysis in Software Engineering: Dataset Characteristics and Tool Selection. In: 2025 IEEE 33rd International Requirements Engineering Conference Workshops (REW).

    License

    This dataset is provided under the Creative Commons Attribution 4.0 International License (CC BY 4.0).

    Contact

    For questions regarding the dataset, please contact the corresponding author as listed in the publication.
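
    The three-state annotation scheme (x, n, or empty) maps naturally onto True, False, and None when the sheet is processed in code. The sketch below illustrates that mapping together with the oracle label names; the function name and the handling of stray whitespace or upper case are assumptions.

```python
ORACLE_NAMES = {-1: "Negative", 0: "Neutral", 1: "Positive"}


def parse_characteristic(cell):
    """Map an annotation cell to True (x), False (n), or None (not applicable)."""
    if cell is None:
        return None
    value = str(cell).strip().lower()
    if value == "":
        return None
    if value == "x":
        return True
    if value == "n":
        return False
    raise ValueError(f"unexpected annotation: {cell!r}")
```

    Keeping None distinct from False preserves the difference between "not present" and "not applicable" when computing per-characteristic statistics.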

  13. lm-human-preferences-sentiment

    • huggingface.co
    Updated Jan 8, 2025
    Cite
    TRL (2025). lm-human-preferences-sentiment [Dataset]. https://huggingface.co/datasets/trl-lib/lm-human-preferences-sentiment
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated: Jan 8, 2025
    Dataset authored and provided by: TRL
    Description

    LM-Human-Preferences-Sentiment Dataset

    Summary

    The LM-Human-Preferences-Sentiment dataset is a processed subset of OpenAI's LM-Human-Preferences, focusing specifically on sentiment analysis tasks. It contains pairs of text samples, each labeled as either "chosen" or "rejected," based on human preferences regarding the sentiment conveyed in the text. This dataset enables models to learn human preferences in sentiment expression, enhancing their ability to generate and… See the full description on the dataset page: https://huggingface.co/datasets/trl-lib/lm-human-preferences-sentiment.

  14. Broad-Coverage German Sentiment Classification Model and Dataset for Dialog...

    • zenodo.org
    • live.european-language-grid.eu
    zip
    Updated Jun 11, 2020
    Cite
    Oliver Guhr; Anne-Kathrin Schumann; Frank Bahrmann; Hans-Joachim Böhme (2020). Broad-Coverage German Sentiment Classification Model and Dataset for Dialog Systems [Dataset]. http://doi.org/10.5281/zenodo.3693810
    Available download formats: zip
    Dataset updated: Jun 11, 2020
    Dataset provided by: Zenodo (http://zenodo.org/)
    Authors: Oliver Guhr; Anne-Kathrin Schumann; Frank Bahrmann; Hans-Joachim Böhme
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Training a Broad-Coverage German Sentiment Classification Model for Dialog Systems

    This paper describes the training of a general-purpose German sentiment classification model. Sentiment classification is an important aspect of general text analytics. Furthermore, it plays a vital role in dialogue systems and voice interfaces that depend on the ability of the system to pick up and understand emotional signals from user utterances. The presented study outlines how we have collected a new German sentiment corpus and then combined this corpus with existing resources to train a broad-coverage German sentiment model. The resulting data set contains 5.4 million labelled samples. We have used the data to train both, a simple convolutional and a transformer-based classification model and compared the results achieved on various training configurations. The model and the data set will be published along with this paper.

    You can find the code for training and testing the models, which was published along with the paper, in this repository.

    The germansentiment Python package contains an easy-to-use interface for the model that was published with this paper.

  15. Code-Mixed Indic Languages with Emoticons for Sarcasm Detection

    • data.mendeley.com
    Updated Oct 10, 2025
    Cite
    Sarah Shaikhh (2025). Code-Mixed Indic Languages with Emoticons for Sarcasm Detection [Dataset]. http://doi.org/10.17632/bdm2y2p3rc.1
    Dataset updated: Oct 10, 2025
    Authors: Sarah Shaikhh
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset consists of code-mixed multilingual text data designed for sentiment analysis research. It captures naturally occurring code-mixed patterns combining English with ten Indian languages: Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Punjabi, Tamil, Telugu and Urdu. The dataset aims to support studies in multilingual NLP, sentiment classification, and language processing for real-world social media and conversational data.

    Dataset Description

    The dataset contains the following attributes:

    • Text: The original code-mixed text sample.
    • Sentiment: The corresponding sentiment label (positive, negative, or neutral).
    • Translated_text: English translation of the original text.
    • Cleaned_text: Text after preprocessing, including lowercasing, punctuation and stopword removal, and normalization.
    • Tokens: Tokenized representation of the cleaned text.

    Preprocessing involved cleaning (removal of punctuation, URLs, and emojis), normalization of repeated characters, language-specific stopword removal, translation to English, and token formation for downstream NLP tasks.
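
    Normalization of repeated characters is not specified further in the description. One common approach, sketched here as an assumption rather than the author's procedure, collapses any run of three or more identical characters down to a fixed number of copies.

```python
import re

# Matches any character followed by at least two more copies of itself.
REPEAT_RE = re.compile(r"(.)\1{2,}")


def normalize_repeats(text, keep=2):
    """Collapse runs of 3+ identical characters down to `keep` copies."""
    return REPEAT_RE.sub(lambda match: match.group(1) * keep, text)
```

    Keeping two copies preserves legitimate double letters (as in "cool") while still shortening elongated words; keep=1 is more aggressive and can alter correctly spelled words.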

  16. Review Dataset [Transport] – Public consumer feedback for sentiment and...

    • datarade.ai
    WiserBrand.com, Review Dataset [Transport] – Public consumer feedback for sentiment and experience [Dataset]. https://datarade.ai/data-products/review-dataset-transport-public-consumer-feedback-for-sen-wiserbrand-com
    Explore at:
    Available download formats: .json, .csv, .xls, .txt
    Dataset provided by
    WiserBrand
    Area covered
    Sweden, El Salvador, Netherlands, Honduras, Malta, United States of America, Romania, Estonia, Albania, Greece
    Description

    "This dataset includes consumer-submitted reviews from over 421 companies, covering both product- and service-based businesses. It’s built to support CX, AI, and analytics teams seeking structured insight into what real customers say, feel, and expect — across the Transport industry.

    Each review includes:

    • Authentic customer reviews (text, rating, pros and cons)
    • Labeled sentiment and tone (positive, neutral, negative)
    • Service context across industries: purchase, delivery, support, return, usage
    • Industry and company filters (fully customizable per buyer request)
    • Optional metadata: platform, review length, timestamp, geo-location

    The list may vary based on the industry and can be customized as per your request.

    Use this dataset to:

    • Track public perception trends across specific brands or verticals
    • Segment sentiment insights by industry, region, or company
    • Power NLP pipelines that require diverse tone, emotion, and domain specificity
    • Build dashboards or LLM prompts grounded in real user language
    • Train review summarization, classification, or escalation engines

    This dataset offers flexibility for custom delivery by industry, domain, or company, making it ideal for teams needing scalable consumer voice data tailored to specific strategic goals."

  17. BBC datasets for sentiment analysis

    • kaggle.com
    zip
    Updated Dec 15, 2024
    Alan (2024). BBC datasets for sentiment analysis [Dataset]. https://www.kaggle.com/datasets/amunsentom/article-dataset-2
    Explore at:
    Available download formats: zip (1921885 bytes)
    Dataset updated
    Dec 15, 2024
    Authors
    Alan
    License

    Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset Name: BBC Articles Sentiment Analysis Dataset

    Source: BBC News

    Description: This dataset consists of articles from the BBC News website, containing a diverse range of topics such as business, politics, entertainment, technology, sports, and more. The dataset includes articles from various time periods and categories, along with labels representing the sentiment of the article. The sentiment labels indicate whether the tone of the article is positive, negative, or neutral, making it suitable for sentiment analysis tasks.

    Number of Instances: [Specify the number of articles in the dataset, for example, 2,225 articles]

    Number of Features:

    1. Article Text: The content of the article (string).
    2. Sentiment Label: The sentiment classification of the article. The possible labels are:
       - Positive
       - Negative
       - Neutral

    Data Fields:

    - id: Unique identifier for each article.
    - category: The category or topic of the article (e.g., business, politics, sports).
    - title: The title of the article.
    - content: The full text of the article.
    - sentiment: The sentiment label (positive, negative, or neutral).

    Example:

    | id | category   | title                    | content                                                                     | sentiment |
    |----|------------|--------------------------|-----------------------------------------------------------------------------|-----------|
    | 1  | Business   | "Stock Market Surge"     | "The stock market has surged to new highs, driven by strong earnings..."    | Positive  |
    | 2  | Politics   | "Election Results"       | "The election results were a mixed bag, with some surprises along the way." | Neutral   |
    | 3  | Sports     | "Team Wins Championship" | "The team won the championship after a thrilling final match."              | Positive  |
    | 4  | Technology | "New Smartphone Release" | "The new smartphone release has received mixed reactions from users."       | Negative  |

    Preprocessing Notes:

    - The text has been preprocessed to remove special characters and any HTML tags that might have been included in the original articles.
    - Tokenization or further text cleaning (e.g., lowercasing, stopword removal) may be necessary depending on the model and method used for sentiment classification.

    Use Case: This dataset is ideal for training and evaluating machine learning models for sentiment classification, where the goal is to predict the sentiment (positive, negative, or neutral) based on the article's text.
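As a quick sanity check before training, the label distribution can be counted from the listed data fields. The sketch below assumes a CSV layout with a header row matching those field names (the actual file layout inside the zip is not confirmed here) and uses the four example rows from this description as an in-memory stand-in. The name `label_counts` is hypothetical.

```python
import csv
import io
from collections import Counter

# In-memory stand-in built from the example rows in the description.
SAMPLE = '''id,category,title,content,sentiment
1,Business,"Stock Market Surge","The stock market has surged to new highs, driven by strong earnings...",Positive
2,Politics,"Election Results","The election results were a mixed bag, with some surprises along the way.",Neutral
3,Sports,"Team Wins Championship","The team won the championship after a thrilling final match.",Positive
4,Technology,"New Smartphone Release","The new smartphone release has received mixed reactions from users.",Negative
'''


def label_counts(handle):
    """Count sentiment labels, normalized to lowercase."""
    reader = csv.DictReader(handle)
    return Counter(row["sentiment"].strip().lower() for row in reader)


counts = label_counts(io.StringIO(SAMPLE))
```

With a real file, the same function could be given an open file handle instead of the `io.StringIO` object.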

  18. Data for manuscript: "Longitudinal Analysis of Sentiment and Emotion in News...

    • zenodo.org
    • data.niaid.nih.gov
    bin
    Updated Sep 13, 2022
    Anonymized; Anonymized (2022). Data for manuscript: "Longitudinal Analysis of Sentiment and Emotion in News Media Headlines Using Automated Labelling with Transformer Language Models" [Dataset]. http://doi.org/10.5281/zenodo.5144113
    Explore at:
    Available download formats: bin
    Dataset updated
    Sep 13, 2022
    Dataset provided by
    Zenodo, http://zenodo.org/
    Authors
    Anonymized; Anonymized
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This data set contains automated sentiment and emotionality annotations of 23 million headlines from 47 news media outlets popular in the United States.

    The set of 47 news media outlets analysed (listed in Figure 1 of the main manuscript) was derived from the AllSides organization 2019 Media Bias Chart v1.1. The human ratings of outlets’ ideological leanings were also taken from this chart and are listed in Figure 2 of the main manuscript.

    News article headlines from the set of outlets analyzed in the manuscript are available in the outlets’ online domains and/or public cache repositories such as The Internet Wayback Machine, Google cache and Common Crawl. Article headlines were located in the articles’ raw HTML data using outlet-specific XPath expressions.
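The idea of outlet-specific XPath expressions can be illustrated with a small sketch. The outlet names and expressions below are hypothetical placeholders, not the ones used in the manuscript, and the standard-library parser used here only accepts well-formed markup and a limited XPath subset, so a production pipeline would likely need a tolerant HTML parser instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical outlet-specific expressions; the real ones are not
# given in the description above.
HEADLINE_XPATHS = {
    "outlet-a.example": ".//h1[@class='headline']",
    "outlet-b.example": ".//div[@id='title']/h2",
}


def extract_headline(outlet, markup):
    """Return the headline text for an outlet, or None if not found."""
    root = ET.fromstring(markup)  # requires well-formed markup
    node = root.find(HEADLINE_XPATHS[outlet])
    if node is None:
        return None  # the expression did not match this page layout
    # itertext() also collects text nested in child tags such as <em>.
    return "".join(node.itertext()).strip()


page = (
    "<html><body>"
    "<h1 class='headline'>Markets rally <em>again</em></h1>"
    "</body></html>"
)
headline = extract_headline("outlet-a.example", page)
```

Returning `None` on a miss makes the failure case explicit, so mismatched pages can be counted rather than silently producing empty headlines.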

    The temporal coverage of headlines across news outlets is not uniform. For some media organizations, news article availability in online domains or Internet cache repositories becomes sparse for earlier years. Furthermore, some news outlets popular in 2019, such as The Huffington Post or Breitbart, did not exist in the early 2000s. Hence, our data set is sparser in headline sample size and representativeness for earlier years in the 2000-2019 timeline. Nevertheless, 20 outlets in our data set have chronologically continuous partial or full headline data availability since the year 2000. Figure S1 in the SI reports the number of headlines per outlet and per year in our analysis.

    In a small percentage of articles, outlet-specific XPath expressions might fail to properly capture the content of the headline due to the heterogeneity of HTML elements and CSS styling combinations with which article text content is arranged in outlets’ online domains. After manual testing, we determined that the percentage of headlines falling into this category is very small. Additionally, our method might fail to detect some articles in the online domains of news outlets. To conclude, in a data analysis of over 23 million headlines, we cannot manually check the correctness of every single data instance, and one hundred percent accuracy at capturing headlines’ content is elusive due to the small number of difficult-to-detect boundary cases such as incorrect HTML markup syntax in online domains. Overall, however, we are confident that our headlines set is representative of headlines in print news media content for the studied time period and outlets analyzed.

    The compressed files in this data set are listed next:

    -analysisScripts.rar contains the analysis scripts used in the main manuscript as well as aggregated data of sentiment and emotionality automated annotations of the headlines and human annotations of a subset of headlines sentiment and emotionality used as ground truth.

    -models.rar contains the Transformer sentiment and emotion annotation models used in the analysis. Namely:

    Siebert/sentiment-roberta-large-english from https://huggingface.co/siebert/sentiment-roberta-large-english. This model is a fine-tuned checkpoint of RoBERTa-large (Liu et al. 2019). It enables reliable binary sentiment analysis for various types of English-language text. For each instance, it predicts either positive (1) or negative (0) sentiment. The model was fine-tuned and evaluated on 15 data sets from diverse text sources to enhance generalization across different types of texts (reviews, tweets, etc.). See more information from the original authors at https://huggingface.co/siebert/sentiment-roberta-large-english

    DistilbertSST2.rar is the default sentiment classification model of the HuggingFace Transformers library (https://huggingface.co/). This model is only used to replicate the results of the sentiment analysis with sentiment-roberta-large-english.

    DistilRoberta j-hartmann/emotion-english-distilroberta-base from https://huggingface.co/j-hartmann/emotion-english-distilroberta-base. The model is a fine-tuned checkpoint of DistilRoBERTa-base. The model allows annotation of English text with Ekman's 6 basic emotions, plus a neutral class. The model was trained on 6 diverse datasets. Please refer to the original author at https://huggingface.co/j-hartmann/emotion-english-distilroberta-base for an overview of the data sets used for fine-tuning.

    -headlinesDataWithSentimentLabelsAnnotationsFromSentimentRobertaLargeModel.rar URLs of headlines analyzed and the sentiment annotations of the siebert/sentiment-roberta-large-english Transformer model. https://huggingface.co/siebert/sentiment-roberta-large-english

    -headlinesDataWithSentimentLabelsAnnotationsFromDistilbertSST2.rar URLs of headlines analyzed and the sentiment annotations of the default HuggingFace sentiment analysis model fine-tuned on the SST-2 dataset. https://huggingface.co/

    -headlinesDataWithEmotionLabelsAnnotationsFromDistilRoberta.rar URLs of headlines analyzed and the emotion categories annotations of the j-hartmann/emotion-english-distilroberta-base Transformer model. https://huggingface.co/j-hartmann/emotion-english-distilroberta-base

  19. Review Dataset [Social Media and Networking] – Public consumer feedback for...

    • datarade.ai
    WiserBrand.com, Review Dataset [Social Media and Networking] – Public consumer feedback for sentiment and experience [Dataset]. https://datarade.ai/data-products/review-dataset-social-media-and-networking-public-consume-wiserbrand-com
    Explore at:
    Available download formats: .json, .csv, .xls, .txt
    Dataset provided by
    WiserBrand
    Area covered
    Nicaragua, Germany, Portugal, Norway, Monaco, Ireland, Belgium, Bosnia and Herzegovina, Belize, Netherlands
    Description

    "This dataset includes consumer-submitted reviews from over 479 companies, covering both product- and service-based businesses. It’s built to support CX, AI, and analytics teams seeking structured insight into what real customers say, feel, and expect — across the Social Media and Networking industry.

    Each review includes:

    • Authentic customer reviews (text, rating, pros and cons)
    • Labeled sentiment and tone (positive, neutral, negative)
    • Service context across industries: purchase, delivery, support, return, usage
    • Industry and company filters (fully customizable per buyer request)
    • Optional metadata: platform, review length, timestamp, geo-location

    The list may vary based on the industry and can be customized as per your request.

    Use this dataset to:

    • Track public perception trends across specific brands or verticals
    • Segment sentiment insights by industry, region, or company
    • Power NLP pipelines that require diverse tone, emotion, and domain specificity
    • Build dashboards or LLM prompts grounded in real user language
    • Train review summarization, classification, or escalation engines

    This dataset offers flexibility for custom delivery by industry, domain, or company, making it ideal for teams needing scalable consumer voice data tailored to specific strategic goals."

  20. Sentiment analysis of tech media articles using VADER package and...

    • data.europa.eu
    • live.european-language-grid.eu
    unknown
    Updated Jan 23, 2020
    Zenodo (2020). Sentiment analysis of tech media articles using VADER package and co-occurrence analysis [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-2612868?locale=pt
    Explore at:
    Available download formats: unknown (125367)
    Dataset updated
    Jan 23, 2020
    Dataset authored and provided by
    Zenodo, http://zenodo.org/
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Sentiment analysis of tech media articles using VADER package and co-occurrence analysis

    Sources

    More than 140k articles (01.2016-03.2019):

    • Gigaom 0.5%
    • Euractiv 0.9%
    • The Conversation 1.3%
    • Politico Europe 1.3%
    • IEEE Spectrum 1.8%
    • Techforge 4.3%
    • Fastcompany 4.5%
    • The Guardian (Tech) 9.2%
    • Arstechnica 10.0%
    • Reuters 11%
    • Gizmodo 17.5%
    • ZDNet 18.3%
    • The Register 19.5%

    Methodology

    The sentiment analysis was prepared using VADER*, an open-source lexicon and rule-based sentiment analysis tool. VADER is specifically designed for social media analysis, but can also be applied to other text sources. The sentiment lexicon was compiled using various sources (other sentiment data sets, Twitter etc.) and was validated by human input. The advantage of VADER is that the rule-based engine includes word-order sensitive relations and degree modifiers.

    As VADER is more robust in the case of shorter social media texts, the analysed articles were divided into paragraphs. The analysis was carried out for the social issues presented in the co-occurrence exercise. The process included the following main steps:

    1. The 100 most frequently co-occurring terms are identified for every social issue (using the co-occurrence methodology).
    2. The articles containing the given social issue and co-occurring term are identified.
    3. The identified articles are divided into paragraphs.
    4. Social issue and co-occurring words are removed from the paragraph.
    5. The VADER sentiment analysis is carried out for every identified and modified paragraph.
    6. The average for the given word pair is calculated for the final result.

    The procedure was therefore repeated for 100 words for each identified social issue.

    The sentiment analysis resulted in a compound score for every paragraph. The score is calculated from the sum of the valence scores of each word in the paragraph, and normalised between the values -1 (most extreme negative) and +1 (most extreme positive). Finally, the average is calculated from the paragraph results.

    Removal of terms is meant to exclude the sentiment of the co-occurring word itself, because the word may be misleading, e.g. when some technologies or companies attempt to solve a negative issue. The neighbourhood's scores would be positive, but the negative term would bring the paragraph's score down.

    The presented tables include the most extreme co-occurring terms for the analysed social issue. The examples are chosen from the list of words with the 30 most positive and 30 most negative sentiment scores.

    The presented graphs show the evolution of sentiments for social issues. The analysed paragraphs are selected in the following way:

    1. The articles containing the given social issue are identified.
    2. The paragraphs containing the social issue are selected for sentiment analysis.

    *Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text. Eighth International Conference on Weblogs and Social Media (ICWSM-14). Ann Arbor, MI, June 2014.

    Files

    • sentiments_mod11.csv: sentiment score based on chosen unigrams
    • sentiments_mod22.csv: sentiment score based on chosen bigrams
    • sentiments_cooc_mod11.csv, sentiments_cooc_mod12.csv, sentiments_cooc_mod21.csv, sentiments_cooc_mod22.csv: combinations of co-occurrences: unigrams-unigrams, unigrams-bigrams, bigrams-unigrams, bigrams-bigrams
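The paragraph-level procedure described in the methodology (remove the issue and co-occurring term, sum valences, normalise, then average) can be sketched in a few lines. This is a simplified illustration, not the authors' code: `TOY_LEXICON` is a hypothetical stand-in for the VADER lexicon, the rule-based adjustments (negation, degree modifiers, punctuation) are omitted, and the function names are assumptions. The normalisation x / sqrt(x² + α) with α = 15 follows the default in the VADER reference implementation.

```python
import math

# Hypothetical toy valences; the real VADER lexicon is much larger.
TOY_LEXICON = {"great": 3.1, "good": 1.9, "bad": -2.5,
               "crisis": -3.1, "solve": 1.5}


def normalize(score, alpha=15.0):
    """Map an unbounded valence sum into the open interval (-1, 1)."""
    return score / math.sqrt(score * score + alpha)


def paragraph_score(paragraph, removed_terms):
    """Compound-style score after dropping the issue and co-occurring term."""
    removed = {term.lower() for term in removed_terms}
    words = [w.strip(".,!?;:").lower() for w in paragraph.split()]
    total = sum(TOY_LEXICON.get(w, 0.0) for w in words if w not in removed)
    return normalize(total)


def pair_score(paragraphs, issue, cooc_term):
    """Average paragraph score for one (issue, co-occurring term) pair."""
    scores = [paragraph_score(p, [issue, cooc_term]) for p in paragraphs]
    return sum(scores) / len(scores) if scores else 0.0


text = "A new tool to solve the crisis."
with_removal = paragraph_score(text, ["crisis", "tool"])
without_removal = paragraph_score(text, [])
```

In this toy case the score is positive once "crisis" is removed and negative when it is kept, which is the effect the term removal step is meant to avoid.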

Julie R. Repository creator - Campos Arias; Julie R. Repository creator - Campos Arias (2023). Datasets for Sentiment Analysis [Dataset]. http://doi.org/10.5281/zenodo.10157504

Datasets for Sentiment Analysis

Explore at:
Available download formats: csv
Dataset updated
Dec 10, 2023
Dataset provided by
Zenodo, http://zenodo.org/
Authors
Julie R. Repository creator - Campos Arias; Julie R. Repository creator - Campos Arias
License

Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

This repository was created for my Master's thesis in Computational Intelligence and Internet of Things at the University of Córdoba, Spain. The purpose of this repository is to store the datasets found that were used in some of the studies that served as research material for this Master's thesis. Also, the datasets used in the experimental part of this work are included.

Below are the datasets specified, along with the details of their references, authors, and download sources.

----------- STS-Gold Dataset ----------------

The dataset consists of 2026 tweets. The file consists of 3 columns: id, polarity, and tweet. The three columns denote the unique id, polarity index of the text and the tweet text respectively.

Reference: Saif, H., Fernandez, M., He, Y., & Alani, H. (2013). Evaluation datasets for Twitter sentiment analysis: a survey and a new dataset, the STS-Gold.

File name: sts_gold_tweet.csv

----------- Amazon Sales Dataset ----------------

This dataset contains the ratings and reviews of 1K+ Amazon products, as listed on the official Amazon website. The data was scraped in January 2023 from the official Amazon website.

Owner: Karkavelraja J., Postgraduate student at Puducherry Technological University (Puducherry, Puducherry, India)

Features:

  • product_id - Product ID
  • product_name - Name of the Product
  • category - Category of the Product
  • discounted_price - Discounted Price of the Product
  • actual_price - Actual Price of the Product
  • discount_percentage - Percentage of Discount for the Product
  • rating - Rating of the Product
  • rating_count - Number of people who voted for the Amazon rating
  • about_product - Description about the Product
  • user_id - ID of the user who wrote review for the Product
  • user_name - Name of the user who wrote review for the Product
  • review_id - ID of the user review
  • review_title - Short review
  • review_content - Long review
  • img_link - Image Link of the Product
  • product_link - Official Website Link of the Product

License: CC BY-NC-SA 4.0

File name: amazon.csv

----------- Rotten Tomatoes Reviews Dataset ----------------

This rating inference dataset is a sentiment classification dataset, containing 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews. On average, these reviews consist of 21 words. The first 5,331 rows contain only negative samples and the last 5,331 rows contain only positive samples, so the data should be shuffled before use.

This data is collected from https://www.cs.cornell.edu/people/pabo/movie-review-data/ as a txt file and converted into a csv file. The file consists of 2 columns: reviews and labels (1 for fresh (good) and 0 for rotten (bad)).
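Because the rows are ordered by label, a reproducible shuffle is a sensible first step. The sketch below uses a small in-memory stand-in with the same ordering (negatives first, positives last) instead of reading data_rt.csv. The helper name `shuffle_rows` and the seed value are assumptions.

```python
import random


def shuffle_rows(rows, seed=0):
    """Return a shuffled copy; a fixed seed keeps the order reproducible."""
    shuffled = list(rows)  # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)
    return shuffled


# Stand-in rows: (review, label) with label 0 = rotten, 1 = fresh.
rows = [("neg review %d" % i, 0) for i in range(5)]
rows += [("pos review %d" % i, 1) for i in range(5)]

mixed = shuffle_rows(rows)
```

Using a dedicated `random.Random` instance avoids changing the global random state, which keeps other parts of an experiment unaffected.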

Reference: Bo Pang and Lillian Lee. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 115–124, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics

File name: data_rt.csv

----------- Preprocessed Dataset Sentiment Analysis ----------------

Preprocessed Amazon product review data for the Gen3EcoDot (Alexa), scraped entirely from amazon.in
Stemmed and lemmatized using nltk.
Sentiment labels are generated using TextBlob polarity scores.

The file consists of 4 columns: index, review (stemmed and lemmatized review using nltk), polarity (score) and division (categorical label generated using polarity score).
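How a categorical division could be derived from a polarity score can be sketched as below. TextBlob polarity lies in the range [-1, 1], but the cut-off used for this dataset is not stated, so the `threshold` parameter, the three label names, and the function name are all assumptions made for illustration.

```python
def polarity_to_division(polarity, threshold=0.0):
    """Map a polarity score in [-1, 1] to a categorical label.

    The threshold is a hypothetical choice; scores within
    [-threshold, threshold] are treated as neutral.
    """
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"


labels = [polarity_to_division(p) for p in (0.8, 0.0, -0.35)]
```

Raising the threshold (for example to 0.1) would widen the neutral band, which can reduce noise from weakly polarized reviews.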

DOI: 10.34740/kaggle/dsv/3877817

Citation: @misc{pradeesh arumadi_2022, title={Preprocessed Dataset Sentiment Analysis}, url={https://www.kaggle.com/dsv/3877817}, DOI={10.34740/KAGGLE/DSV/3877817}, publisher={Kaggle}, author={Pradeesh Arumadi}, year={2022} }

This dataset was used in the experimental phase of my research.

File name: EcoPreprocessed.csv

----------- Amazon Earphones Reviews ----------------

This dataset consists of 9930 Amazon reviews, with star ratings, for the 10 latest (as of mid-2019) Bluetooth earphone devices, intended for learning how to train a machine for sentiment analysis.

This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.

The file consists of 5 columns: ReviewTitle, ReviewBody, ReviewStar, Product and division (manually added - categorical label generated using ReviewStar score)

License: U.S. Government Works

Source: www.amazon.in

File name (original): AllProductReviews.csv (contains 14337 reviews)

File name (edited - used for my research) : AllProductReviews2.csv (contains 9930 reviews)

----------- Amazon Musical Instruments Reviews ----------------

This dataset contains 7137 comments/reviews of different musical instruments coming from Amazon.

This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.

The file consists of 10 columns: reviewerID, asin (ID of the product), reviewerName, helpful (helpfulness rating of the review), reviewText, overall (rating of the product), summary (summary of the review), unixReviewTime (time of the review, Unix time), reviewTime (time of the review, raw) and division (manually added; categorical label generated using the overall score).

Source: http://jmcauley.ucsd.edu/data/amazon/

File name (original): Musical_instruments_reviews.csv (contains 10261 reviews)

File name (edited - used for my research) : Musical_instruments_reviews2.csv (contains 7137 reviews)
