The inspiration behind creating the OYO Review Dataset for sentiment analysis was to explore the sentiment and opinions expressed in hotel reviews on the OYO Hotels platform. Analyzing the sentiment of customer reviews can provide valuable insights into the overall satisfaction of guests, identify areas for improvement, and assist in making data-driven decisions to enhance the hotel experience. By collecting and curating this dataset, Deep Patel, Nikki Patel, and Nimil aimed to contribute to the field of sentiment analysis in the context of the hospitality industry. Sentiment analysis allows us to classify the sentiment expressed in textual data, such as reviews, into positive, negative, or neutral categories. This analysis can help hotel management and stakeholders understand customer sentiments, identify common patterns, and address concerns or issues that may affect the reputation and customer satisfaction of OYO Hotels. The dataset provides a valuable resource for training and evaluating sentiment analysis models specifically tailored to the hospitality domain. Researchers, data scientists, and practitioners can utilize this dataset to develop and test various machine learning and natural language processing techniques for sentiment analysis, such as classification algorithms, sentiment lexicons, or deep learning models. Overall, the goal of creating the OYO Review Dataset for sentiment analysis was to facilitate research and analysis in the area of customer sentiments and opinions in the hotel industry. By understanding the sentiment of hotel reviews, businesses can strive to improve their services, enhance customer satisfaction, and make data-driven decisions to elevate the overall guest experience.
Deep Patel: https://www.linkedin.com/in/deep-patel-55ab48199/ Nikki Patel: https://www.linkedin.com/in/nikipatel9/ Nimil Lathiya: https://www.linkedin.com/in/nimil-lathiya-059a281b1/
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository was created for my Master's thesis in Computational Intelligence and Internet of Things at the University of Córdoba, Spain. The purpose of this repository is to store the datasets found that were used in some of the studies that served as research material for this Master's thesis. Also, the datasets used in the experimental part of this work are included.
Below are the datasets specified, along with the details of their references, authors, and download sources.
----------- STS-Gold Dataset ----------------
The dataset consists of 2026 tweets. The file consists of 3 columns: id, polarity, and tweet. The three columns denote the unique id, polarity index of the text and the tweet text respectively.
Reference: Saif, H., Fernandez, M., He, Y., & Alani, H. (2013). Evaluation datasets for Twitter sentiment analysis: a survey and a new dataset, the STS-Gold.
File name: sts_gold_tweet.csv
----------- Amazon Sales Dataset ----------------
This dataset contains the ratings and reviews of 1K+ Amazon products, as listed on the official Amazon website. The data was scraped from the official Amazon website in January 2023.
Owner: Karkavelraja J., Postgraduate student at Puducherry Technological University (Puducherry, Puducherry, India)
Features:
License: CC BY-NC-SA 4.0
File name: amazon.csv
----------- Rotten Tomatoes Reviews Dataset ----------------
This rating inference dataset is a sentiment classification dataset, containing 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews. On average, these reviews consist of 21 words. The first 5,331 rows contain only negative samples and the last 5,331 rows contain only positive samples, so the data should be shuffled before use.
This data was collected from https://www.cs.cornell.edu/people/pabo/movie-review-data/ as a txt file and converted into a csv file. The file consists of 2 columns: reviews and labels (1 for fresh (good) and 0 for rotten (bad)).
Reference: Bo Pang and Lillian Lee. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 115–124, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics
File name: data_rt.csv
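Because the rows are ordered by class, a reproducible shuffle is a sensible first step. The sketch below is only an illustration: it uses a small in-memory stand-in for data_rt.csv rather than the real file, and the helper name `shuffle_rows` and the seed value are assumptions, not part of the dataset.

```python
import random


def shuffle_rows(rows, seed=42):
    """Return a shuffled copy of rows; a fixed seed keeps the order reproducible."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    return shuffled


# Toy stand-in for data_rt.csv: (review, label) pairs, negatives (0) first,
# then positives (1), mirroring the ordering described above.
rows = [("bad film %d" % i, 0) for i in range(5)]
rows += [("good film %d" % i, 1) for i in range(5)]

shuffled = shuffle_rows(rows)
print(shuffled[:3])
```

If the file is loaded with pandas instead, `DataFrame.sample(frac=1, random_state=...)` gives an equivalent shuffled copy.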
----------- Preprocessed Dataset Sentiment Analysis ----------------
Preprocessed Amazon product review data for the 3rd-generation Echo Dot (Alexa), scraped entirely from amazon.in.
Stemmed and lemmatized using nltk.
Sentiment labels are generated using TextBlob polarity scores.
The file consists of 4 columns: index, review (stemmed and lemmatized review using nltk), polarity (score) and division (categorical label generated using polarity score).
DOI: 10.34740/kaggle/dsv/3877817
Citation: @misc{pradeesh arumadi_2022, title={Preprocessed Dataset Sentiment Analysis}, url={https://www.kaggle.com/dsv/3877817}, DOI={10.34740/KAGGLE/DSV/3877817}, publisher={Kaggle}, author={Pradeesh Arumadi}, year={2022} }
This dataset was used in the experimental phase of my research.
File name: EcoPreprocessed.csv
----------- Amazon Earphones Reviews ----------------
This dataset consists of 9930 Amazon reviews and star ratings for the 10 latest (as of mid-2019) Bluetooth earphone devices, intended for learning how to train machine learning models for sentiment analysis.
This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.
The file consists of 5 columns: ReviewTitle, ReviewBody, ReviewStar, Product and division (manually added - categorical label generated using ReviewStar score)
License: U.S. Government Works
Source: www.amazon.in
File name (original): AllProductReviews.csv (contains 14337 reviews)
File name (edited - used for my research) : AllProductReviews2.csv (contains 9930 reviews)
----------- Amazon Musical Instruments Reviews ----------------
This dataset contains 7137 comments/reviews of different musical instruments coming from Amazon.
This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.
The file consists of 10 columns: reviewerID, asin (ID of the product), reviewerName, helpful (helpfulness rating of the review), reviewText, overall (rating of the product), summary (summary of the review), unixReviewTime (time of the review - unix time), reviewTime (time of the review - raw) and division (manually added - categorical label generated using overall score).
Source: http://jmcauley.ucsd.edu/data/amazon/
File name (original): Musical_instruments_reviews.csv (contains 10261 reviews)
File name (edited - used for my research) : Musical_instruments_reviews2.csv (contains 7137 reviews)
https://creativecommons.org/publicdomain/zero/1.0/
By Huggingface Hub [source]
The Amazon Reviews Polarity Dataset discloses eighteen years of customers' ratings and reviews from Amazon.com, offering an unparalleled trove of insight and knowledge. Drawing from the immense pool of over 35 million customer reviews, this dataset presents a broad spectrum of customer opinions on products they have bought or used. This invaluable data is a gold mine for improving products and services as it contains comprehensive information regarding customers' experiences with a product including ratings, titles, and plaintext content. At the same time, this dataset contains both customer-specific data along with product information which encourages deep analytics that could lead to great advances in providing tailored solutions for customers. Has your product been favored by the majority? Are there any aspects that need extra care? Use Amazon Reviews Polarity to gain deeper insights into what your customers want - explore now!
1. Analyze customer ratings to identify trends: Look at how many customers have rated the same product or service with the same score (e.g., 4 stars). Examining the common sentiment across those reviews can show what customers like or dislike about it. Identifying these patterns can help you decide which features of your products or services to emphasize in order to boost sales and satisfaction rates.
2. Review content analysis: Analyzing review content is one of the best ways to gauge customer sentiment toward specific features or aspects of a product or service. Natural language processing tools such as Word2Vec, Latent Dirichlet Allocation (LDA), or even simple keyword search can quickly reveal the general topics discussed in relation to your product or service across multiple reviews, allowing you to quickly pinpoint areas that may need improvement for particular items within your lines of business.
3. Track associated scores over time: By tracking customer ratings over time, you may be able to better understand when there has been an issue with something specific related to your product or service, such as a negative response toward a feature that was introduced, proved unpopular among customers, and was removed shortly afterwards. This can save time and money by identifying issues before they become widespread concerns among a larger set of consumers.
4. Visualize sentiment data over time: Visualizations such as bar graphs can help identify trends across different categories more quickly than raw numbers alone. Combining numeric values with color differences between scores makes anomalies easier to spot, which speeds up working out why certain spikes occurred where others stayed stable (or vice versa) when comparing similar data points in time-series visualizations.
- Developing a customer sentiment analysis system that can be used to quickly analyze the sentiment of reviews and identify any potential areas of improvement.
- Building a product recommendation service that takes into account the ratings and reviews of customers when recommending similar products they may be interested in purchasing.
- Training a machine learning model to accurately predict customers’ ratings on new products they have not yet tried and leverage this for further product development optimization initiatives
If you use this dataset in your research, please credit the original authors.
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: train.csv

| Column name | Description |
|:------------|:------------|
| label | The sentiment of the review, either positive or negative. (String) |
| title | The title of the review. (String) |

...
This dataset was created by Kanwal Zahoor
Analyzing sentiments related to various products such as Tablet, Mobile and various other gizmos can be fun and difficult, especially when the reviews are collected across various demographics around the world. The task with this dataset is to develop a machine learning model that accurately classifies product reviews into 4 different sentiment classes based on the raw review text provided by the user. Analyzing these sentiments will not only help serve the customers better but can also reveal a lot of customer traits present/hidden in the reviews.
Sentiment analysis requires a lot to be taken into account, mainly because of the preprocessing involved in representing raw text in a machine-understandable form. Usually, we stem and lemmatize the raw information and then represent it using TF-IDF, word embeddings, etc. However, given state-of-the-art NLP models such as Transformer-based BERT models, one can skip manual feature engineering such as TF-IDF and count vectorizers.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
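To make the TF-IDF representation mentioned above concrete, here is a minimal sketch using only the standard library. It assumes whitespace tokenization and the plain, unsmoothed formula tf × log(N / df); library implementations typically use smoothed variants, so their numbers will differ. The function name `tfidf` and the sample reviews are illustrative assumptions.

```python
import math
from collections import Counter


def tfidf(docs):
    """Compute TF-IDF weights for whitespace-tokenized documents.

    tf  = term count / document length
    idf = log(number of documents / number of documents containing the term)
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: count each term once per document.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical reviews, used only to exercise the function.
reviews = ["great tablet great screen", "poor battery", "great battery life"]
vectors = tfidf(reviews)
print(vectors[1])
```

Terms that occur in few documents (such as "poor" here) receive a higher weight than terms spread across many documents, which is the property that makes TF-IDF useful as a hand-built feature.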
License information was derived automatically
Analysis of ‘uHack Sentiments 2.0: Decode Code Words’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/manishtripathi86/uhack-sentiments-20-decode-code-words on 28 January 2022.
--- Dataset description provided by original source is as follows ---
The challenge here is to analyze and deep dive into the natural language text (reviews) and bucket them based on their topics of discussion. Furthermore, analyzing the overall sentiment will also help the business to make tangible decisions.
The data set provided to you has a mix of customer reviews for products across categories and retailers. We would like you to model on the data to predict, for future reviews:
- Topics: bucket the reviews into their respective topics (Components, Delivery and Customer Support, Design and Aesthetics, Dimensions, Features, Functionality, Installation, Material, Price, Quality and Usability). Note: A review can talk about multiple topics.
- Polarity: the overall polarity (Positive/Negative sentiment).
Note: The target variables are all encoded in the train dataset for convenience. Please submit the test results in a similarly encoded fashion for us to evaluate your results.
| Field Name | Data Type | Purpose | Variable type |
| --- | --- | --- | --- |
| Id | Integer | Unique identifier for each review | Input |
| Review | String | Review written by customers on a retail website | Input |
| Components | String | 1: aspects related to components; 0: None | Target |
| Delivery and Customer Support | String | 1: some aspects related to delivery, return, exchange and customer support; 0: None | Target |
| Design and Aesthetics | String | 1: some aspects related to components; 0: None | Target |
| Dimensions | String | 1: related to product dimension and size; 0: None | Target |
| Features | String | 1: related to product features; 0: None | Target |
| Functionality | String | 1: related to working of a product; 0: None | Target |
| Installation | String | 1: related to installation of the product; 0: None | Target |
| Material | String | 1: related to material of the product; 0: None | Target |
| Price | String | 1: related to pricing details of a product; 0: None | Target |
| Quality | String | 1: related to quality aspects of a product; 0: None | Target |
| Usability | String | 1: related to usability of a product; 0: None | Target |
| Polarity | Integer | 1: Positive sentiment; 0: Negative sentiment | Target |
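Since a review can belong to several topics at once, each topic is its own 0/1 target column. The sketch below shows one way to move between a row of the table above and a list of topic names. The column names are taken from the field list; the helper names and the sample row are hypothetical.

```python
# Topic columns as named in the field table above.
TOPICS = [
    "Components", "Delivery and Customer Support", "Design and Aesthetics",
    "Dimensions", "Features", "Functionality", "Installation", "Material",
    "Price", "Quality", "Usability",
]


def active_topics(row):
    """Return the topics flagged with 1 in a row (values may be str or int)."""
    return [topic for topic in TOPICS if int(row[topic]) == 1]


def encode_topics(topics):
    """Turn a list of topic names back into the 0/1 encoding, one key per topic."""
    wanted = set(topics)
    return {topic: int(topic in wanted) for topic in TOPICS}


# Hypothetical row: a positive review that discusses price and quality.
row = {"Id": 7, "Review": "Good value and well made.",
       **{topic: "0" for topic in TOPICS},
       "Price": "1", "Quality": "1", "Polarity": 1}

print(active_topics(row))
```

Keeping the encoding as independent binary columns is what makes this a multi-label problem rather than a single multi-class one.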
Skills:
- Text pre-processing: lemmatization, tokenization, n-grams and other relevant methods
- Multi-class classification, multi-label classification
- Optimizing log loss
Overview Ugam, a Merkle company, is a leading analytics and technology services company. Our customer-centric approach delivers impactful business results for large corporations by leveraging data, technology, and expertise.
We consistently deliver superior, impactful results through the right blend of human intelligence and AI. With 3300+ people spread across locations worldwide, we successfully deploy our services to create success stories across industries like Retail & Consumer Brands, High Tech, BFSI, Distribution, and Market Research & Consulting. Over the past 21 years, Ugam has been recognized by several firms including Forrester and Gartner, named the No.1 data science company in India by Analytics Insight, and certified as a Great Place to Work®.
Problem Statement: The last two decades have witnessed a significant change in how consumers purchase products and express their experience/opinions in reviews, posts, and content across platforms. These online reviews are not only useful to reflect customers’ sentiment towards a product but also help businesses fix gaps and find potential opportunities which could further influence future purchases.
Participants need to develop a machine learning model that can analyse customers’ sentiments based on their reviews and feedback.
NOTE: The prize money will be for the interested candidates who are willing to get interviewed or hired by Ugam. Winners are requested to come to the Machine Learning Developers Summit 2022, happening at Bangalore, to receive the prize money.
dataset link: https://machinehack.com/hackathon/uhack_sentiments_20_decode_code_words/overview
--- Original source retains full ownership of the source dataset ---
https://creativecommons.org/publicdomain/zero/1.0/
The dataset has two columns: "Text" and "Label". The "Text" column contains opinions about traffic in Pakistan, including both positive and negative reviews. The "Label" column classifies each text as either positive or negative, where positive reviews are labeled "0" and negative reviews are labeled "1". This dataset can be used for sentiment analysis tasks, which involve using natural language processing techniques to analyze and classify text data based on the emotions and opinions expressed within the text. By training a machine learning model on this dataset, you can create a system that automatically classifies new traffic-related texts as either positive or negative. Some possible applications of this type of sentiment analysis include monitoring public opinion about traffic-related issues, identifying areas where improvements are needed, and evaluating the effectiveness of traffic-related policies and initiatives. Additionally, businesses in the transportation industry could use this type of analysis to understand customer feedback and improve their services accordingly.
Dataset with sentiment of Russian text
Contains an aggregated dataset of Russian texts drawn from 6 datasets.
Label meanings
0: NEUTRAL, 1: POSITIVE, 2: NEGATIVE
Datasets
Sentiment Analysis in Russian
Sentiments (positive, negative or neutral) of Russian-language news from a Kaggle competition.
Russian Language Toxic Comments
Small dataset with labeled comments from 2ch.hk and pikabu.ru.
Dataset of car reviews for machine learning (sentiment analysis)
Glazkova A.… See the full description on the dataset page: https://huggingface.co/datasets/MonoHime/ru_sentiment_dataset.
If you are interested in joining Kaggle University Club, please e-mail Jessica Li at lijessica@google.com
This Hackathon is open to all undergraduate, master, and PhD students who are part of the Kaggle University Club program. The Hackathon provides students with a chance to build capacity via hands-on ML, learn from one another, and engage in a self-defined project that is meaningful to their careers.
Teams must register via Google Form to be eligible for the Hackathon. The Hackathon starts on Monday, November 12, 2018 and ends on Monday, December 10, 2018. Teams have one month to work on a team submission. Teams must do all work within the Kernel editor and set Kernel(s) to public at all times.
The freestyle format of hackathons has time and again stimulated groundbreaking and innovative data insights and technologies. The Kaggle University Club Hackathon recreates this environment virtually on our platform. We challenge you to build a meaningful project around the UCI Machine Learning - Drug Review Dataset. Teams are free to let their creativity run and propose methods to analyze this dataset and form interesting machine learning models.
Machine learning has permeated nearly all fields and disciplines of study. One hot topic is using natural language processing and sentiment analysis to identify, extract, and make use of subjective information. The UCI ML Drug Review dataset provides patient reviews on specific drugs along with related conditions and a 10-star patient rating system reflecting overall patient satisfaction. The data was obtained by crawling online pharmaceutical review sites. This data was published in a study on sentiment analysis of drug experience over multiple facets, e.g., sentiments learned on specific aspects such as effectiveness and side effects (see the acknowledgments section to learn more).
The sky's the limit here in terms of what your team can do! Teams are free to add supplementary datasets in conjunction with the drug review dataset in their Kernel. Discussion is highly encouraged within the forum and Slack so everyone can learn from their peers.
Here are just a couple ideas as to what you could do with the data:
There is no one correct answer to this Hackathon, and teams are free to define the direction of their own project. That being said, there are certain core elements generally found across all outstanding Kernels on the Kaggle platform. The best Kernels are:
Teams with top submissions have a chance to receive exclusive Kaggle University Club swag and be featured on our official blog and across social media.
IMPORTANT: Teams must set all Kernels to public at all times. This is so we can track each team's progression, but more importantly it encourages collaboration, productive discussion, and healthy inspiration to all teams. It is not so that teams can simply copycat good ideas. If a team's Kernel isn't their own organic work, it will not be considered a top submission. Teams must come up with a project on their own.
The final Kernel submission for the Hackathon must contain the following information:
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset contains reviews of the top 10 rated airlines in 2023 sourced from the Airline Quality (https://www.airlinequality.com) website. The reviews cover various aspects of the flight experience, including seat comfort, staff service, food and beverages, inflight entertainment, value for money, and overall rating. The dataset is suitable for sentiment analysis, customer satisfaction analysis, and other similar tasks.
Usage
- Download the dataset file airlines_reviews.csv.
- Use the dataset for analysis, visualization, and machine learning tasks.

List of Airlines
1. Singapore Airlines
2. Qatar Airways
3. All Nippon Airways
4. Emirates
5. Japan Airlines
6. Turkish Airlines
7. Air France
8. Cathay Pacific Airways
9. EVA Air
10. Korean Air
This dataset is provided under the MIT License.
Overview: Analyzing sentiments related to various products such as Tablet, Mobile and various other gizmos can be fun and difficult, especially when the reviews are collected across various demographics around the world. In this weekend hackathon, we challenge the machinehackers community to develop a machine learning model to accurately classify various products into 4 different classes of sentiments based on the raw text review provided by the user. Analyzing these sentiments will not only help us serve the customers better but can also reveal a lot of customer traits present/hidden in the reviews.
Sentiment analysis requires a lot to be taken into account, mainly because of the preprocessing involved in representing raw text in a machine-understandable form. Usually, we stem and lemmatize the raw information and then represent it using TF-IDF, word embeddings, etc. However, given state-of-the-art NLP models such as Transformer-based BERT models, one can skip manual feature engineering such as TF-IDF and count vectorizers.
In this short span of time, we would encourage you to leverage the ImageNet moment (Transfer Learning) in NLP using various pre-trained models.
Dataset Description:
- Train.csv - 6364 rows x 4 columns (includes the Sentiment column as target)
- Test.csv - 2728 rows x 3 columns
- Sample Submission.csv - Please check the Evaluation section for more details on how to generate a valid submission
Attribute Description:
- Text_ID - Unique identifier
- Product_Description - Description of the product review by a user
- Product_Type - Different types of product (9 unique products)
- Class - Represents various sentiments: 0 - Cannot Say, 1 - Negative, 2 - Positive, 3 - No Sentiment

Skills:
- NLP, sentiment analysis
- Feature extraction from raw text using TF-IDF, CountVectorizer
- Using word embeddings to represent words as vectors
- Using pretrained models like Transformers, BERT
- Optimizing multi-class log loss to generalize well on unseen data
https://creativecommons.org/publicdomain/zero/1.0/
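Multi-class log loss, named above as the quantity to optimize, can be sketched as follows. This is the standard definition (mean negative log of the probability assigned to the true class), with probabilities clipped to avoid log(0); the function name, the clipping constant, and the sample predictions are assumptions for illustration, and the competition's exact evaluation script may differ.

```python
import math


def multiclass_log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-probability assigned to the true class.

    y_true: list of integer class ids (here 0..3, as in the Class column).
    y_prob: list of per-class probability lists, one list per sample.
    """
    total = 0.0
    for label, probs in zip(y_true, y_prob):
        # Clip so that a confident wrong prediction gives a large, finite loss.
        p = min(max(probs[label], eps), 1 - eps)
        total -= math.log(p)
    return total / len(y_true)


# Hypothetical predictions for three reviews over the 4 sentiment classes.
y_true = [2, 1, 3]
uniform = [[0.25, 0.25, 0.25, 0.25]] * 3
confident = [[0.05, 0.05, 0.85, 0.05],
             [0.10, 0.70, 0.10, 0.10],
             [0.10, 0.10, 0.10, 0.70]]

print(multiclass_log_loss(y_true, uniform))
print(multiclass_log_loss(y_true, confident))
```

A model that always predicts a uniform distribution over 4 classes scores log(4); any useful model should fall below that baseline.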
By imdb (From Huggingface) [source]
The IMDb Large Movie Review Dataset is a comprehensive collection of movie reviews used for sentiment classification. The dataset includes a wide range of movie reviews along with their corresponding sentiment labels, which indicate whether the review is positive or negative in nature. This invaluable dataset is aimed at facilitating sentiment analysis and classification tasks in the field of natural language processing.
The main purpose of the train.csv file within this dataset is to provide a curated collection of movie reviews, each accompanied by its respective sentiment label. This file proves particularly useful for training machine learning models to accurately predict sentiment and classify reviews based on their emotional tone.
Similarly, the test.csv file contains another set of movie reviews along with corresponding sentiment labels. Meant for testing and validating the performance of trained models, this dataset enables researchers and developers to evaluate their models' effectiveness in real-world scenarios.
Additionally, the unsupervised.csv file offers an alternative subset within the dataset. Unlike train.csv and test.csv, unsupervised.csv does not include any associated sentiment labels for individual movie reviews. This specific subset serves as a valuable resource for exploring unsupervised learning techniques within the domain of sentiment classification.
By using the IMDb Large Movie Review Dataset, researchers and data scientists can explore various aspects of analyzing sentiment in textual data. With labeled data points covering both positive and negative sentiments expressed in diverse film critiques, this dataset lets users develop machine learning algorithms that assess subjective opinions from text data.
Introduction:
Dataset Overview:
- train.csv: This file contains a set of movie reviews along with their sentiment labels. It is intended for training your sentiment analysis models.
- test.csv: This file provides another set of movie reviews along with their corresponding sentiment labels. You can use this file to evaluate the performance of your trained models.
- unsupervised.csv: This file includes movie reviews without any associated sentiment labels. It can be used for unsupervised sentiment classification tasks.

Columns in the Dataset:
- text: The main column containing the text of each movie review.
- label: The sentiment label assigned to each review, indicating whether it is positive or negative.
Guidelines for Using the Dataset:
Training Your Model:
- Begin by loading and preprocessing the data from train.csv
- Treat 'text' as your input feature and 'label' as your target variable
- Explore different machine learning or deep learning algorithms suitable for text classification
- Train your model using various techniques, such as bag-of-words, word embeddings, or transformers
- Evaluate and fine-tune your model's performance using test.csv
Evaluating Your Model:
- Load test.csv and preprocess the data similar to what you did with train.csv
- Use this preprocessed test data to evaluate the accuracy, precision, recall, F1 score or other relevant metrics of your trained model on unseen data
- Analyze these metrics to understand how well your model is performing in predicting sentiments
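The training and evaluation steps above can be sketched end to end with a small bag-of-words Naive Bayes classifier. This is an illustrative baseline, not a method prescribed by the dataset: the toy reviews stand in for rows of train.csv and test.csv, the encoding 1 = positive and 0 = negative is an assumption, and all function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_nb(texts, labels):
    """Fit a multinomial Naive Bayes model on bag-of-words counts."""
    word_counts = defaultdict(Counter)  # label -> word -> count
    class_counts = Counter(labels)
    vocab = set()
    for text, label in zip(texts, labels):
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return word_counts, class_counts, vocab


def predict_nb(model, text):
    """Return the label with the highest log-posterior (add-one smoothing)."""
    word_counts, class_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        score = math.log(class_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for token in text.lower().split():
            if token in vocab:  # words unseen in training are ignored
                score += math.log((word_counts[label][token] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy stand-ins for the 'text' and 'label' columns.
train_texts = ["a wonderful moving film", "brilliant acting and a great story",
               "a dull boring mess", "terrible plot and awful acting"]
train_labels = [1, 1, 0, 0]
test_texts = ["a great film", "boring and awful"]
test_labels = [1, 0]

model = train_nb(train_texts, train_labels)
predictions = [predict_nb(model, text) for text in test_texts]
metrics = binary_metrics(test_labels, predictions)
print(predictions, metrics)
```

On the real data, the same structure applies with the lists replaced by the columns of the two files; a held-out slice of train.csv is usually a safer place to tune the model than test.csv.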
Advancing Your Model (Unsupervised Classification):
- Utilize unsupervised.csv for unsupervised sentiment classification tasks
- Preprocess the movie reviews in this file and explore techniques like clustering, topic modeling, or self-supervised learning
- Extract patterns, themes, or sentiments from the reviews without any guidance from labeled data
Conclusion:
- Sentiment Analysis: This dataset can be used to train models for sentiment analysis, where the goal is to predict whether a movie review is positive or negative based on its text.
- NLP Research: The dataset can be used for various natural language processing (NLP) tasks such as text classification, information extraction, or named entity recognition. Researchers and practitioners can leverage this dataset to develop and evaluate new algorithms and techniques in the field of NLP.
- Recommendation Systems: The sentiment labels in this dataset can be used as a source of feedback or user preferences for recommendation systems. By analyzing the sentiments expressed in reviews,...
Airline Passenger Reviews Dataset
This dataset contains real-world airline passenger reviews gathered from various flights across different countries and airlines. Each row represents an individual passenger’s experience and feedback on a specific flight.
Dataset Overview
The dataset includes reviews on several aspects of the airline experience, such as seat comfort, food, cabin crew service, value for money, and more. It can be used for sentiment analysis, NLP-based text classification, airline performance evaluation, and other machine learning or data visualization tasks.
https://creativecommons.org/publicdomain/zero/1.0/
Emotions play a vital role in human communication, and detecting emotions from text data is a challenging task. The ability to automatically recognize emotions from text has many practical applications, such as in sentiment analysis, social media monitoring, and customer feedback analysis.
In this project, we will discuss the working principle of a text emotion recognition model and its important terminology. We will also provide a detailed description of the model architecture used and its training process. Finally, we will conclude by evaluating the model using a confusion matrix and a classification report. In the "emotions" column, 0 denotes sad and 1 denotes happy.
The slang.txt file used in the Abbreviations step can be taken from: https://www.kaggle.com/datasets/mansis97/slangs
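As a rough illustration of that evaluation step, the sketch below builds a confusion matrix and a per-class report for the two emotion labels. The label encoding (0: sad, 1: happy) comes from the description; the predictions are made up for the example and the function names are hypothetical.

```python
def confusion_matrix(y_true, y_pred, n_classes=2):
    """matrix[t][p] counts samples with true class t predicted as class p."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def classification_report(matrix, names):
    """Per-class precision, recall and support derived from the matrix."""
    report = {}
    for c, name in names.items():
        predicted = sum(row[c] for row in matrix)  # column sum
        actual = sum(matrix[c])                    # row sum
        report[name] = {
            "precision": matrix[c][c] / predicted if predicted else 0.0,
            "recall": matrix[c][c] / actual if actual else 0.0,
            "support": actual,
        }
    return report


LABELS = {0: "sad", 1: "happy"}

# Made-up labels and predictions, used only to exercise the functions.
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

matrix = confusion_matrix(y_true, y_pred)
report = classification_report(matrix, LABELS)
print(matrix)
print(report)
```

Rows of the matrix correspond to the true emotion and columns to the predicted one, so off-diagonal cells show which emotion is being mistaken for which.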
The aim is to obtain improved results with machine learning algorithms and other techniques used in data mining.
The dataset comprises two columns: the first contains comparative reviews and the second contains their polarities.
I thank my supervisor, Dr Muhammad Zubair Asghar, Assistant Professor, ICIT, Gomal University, D.I. Khan (KPK). Without his guidance, I could not have accomplished this task.
Comparative opinion mining is becoming the most popular research area in the field of Data Mining. These three comparative reviews datasets will help the researchers who are working in the area of opinion mining and sentiment analysis.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains 4,992 rows of structured information derived from a triage system designed for managing and prioritizing comments in collaborative environments. Using advanced machine learning techniques, such as GEMMA-2B for intent classification, Hugging Face models for sentiment analysis, and Latent Dirichlet Allocation (LDA) for topic modeling, each comment is analyzed across six dimensions: urgency, importance, sentiment, actionability, resolution status, and thematic relevance.
The dataset can support tasks in:
Key Features:
- Hierarchical Labels: Multi-level classifications (level_0 to level_4) for each comment.
- Priority Scores: Aggregated values representing the criticality of each comment.
- Sentiment Analysis: Positive, neutral, and negative sentiment scoring.
- LDA Topics: Thematic insights for comment context.

Metadata:
- Rows: 4,992
- Columns: 49
- Tags: NLP, Machine Learning, Sentiment Analysis, Comment Triage, Topic Modeling, Collaboration

File Details:
- Filename: triaged_comments_with_priority_and_labels_hierarchy.csv
- License: Creative Commons Attribution 4.0 International (CC BY 4.0)
Source: https://www.cs.jhu.edu/~mdredze/datasets/sentiment/
Kaggle kernels take care of the tar.gz files for you :-)
This dataset features slightly older product reviews from Amazon and derives from the Johns Hopkins University’s Department of Computer Science.
unprocessed.tar.gz processed_acl.tar.gz processed_stars.tar.gz
John Blitzer, Mark Dredze, Fernando Pereira. Biographies, Bollywood, Boom-boxes and Blenders: Domain Adaptation for Sentiment Classification. Association for Computational Linguistics (ACL), 2007.
John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jenn Wortman. Learning Bounds for Domain Adaptation. Neural Information Processing Systems (NIPS), 2008.
Mark Dredze, Koby Crammer, and Fernando Pereira. Confidence-Weighted Linear Classification. International Conference on Machine Learning (ICML), 2008.
Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Domain Adaptation with Multiple Sources. Neural Information Processing Systems (NIPS), 2009.
If you use this data for your research or a publication, please cite the first (ACL 2007) paper as the reference for the data. Also, please drop me a line so I know that you found the data useful.
The Multi-Domain Sentiment Dataset contains product reviews taken from Amazon.com from many product types (domains). Some domains (books and dvds) have hundreds of thousands of reviews. Others (musical instruments) have only a few hundred. Reviews contain star ratings (1 to 5 stars) that can be converted into binary labels if needed. This page contains some descriptions about the data. If you have questions, please email Mark Dredze or John Blitzer.
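The conversion from star ratings to binary labels can be sketched as follows. The threshold is an assumption: this sketch treats ratings above 3 as positive, ratings below 3 as negative, and drops 3-star reviews as ambiguous. The function name and sample ratings are illustrative only.

```python
def stars_to_binary(stars):
    """Map a 1-5 star rating to a binary sentiment label.

    Assumed convention: > 3 is positive, < 3 is negative,
    and exactly 3 is treated as ambiguous (returns None).
    """
    if stars > 3:
        return "positive"
    if stars < 3:
        return "negative"
    return None


# Made-up ratings for illustration.
ratings = [5.0, 1.0, 3.0, 4.0, 2.0]
labels = [stars_to_binary(r) for r in ratings]

# Keep only the reviews that received an unambiguous label.
labeled = [(r, l) for r, l in zip(ratings, labels) if l is not None]
print(labeled)
```

Dropping the middle rating keeps the two classes well separated; a different cutoff may suit other uses of the data.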
1) unprocessed.tar.gz contains the original data.
2) processed.acl.tar.gz contains the data pre-processed and balanced. That is, the format of Blitzer et al. (ACL 2007).
3) processed.realvalued.tar.gz contains the data pre-processed and balanced, but with the number of stars, rather than just positive or negative. That is, the format of Mansour et al. (NIPS 2009).
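Reading one review from the pre-processed files might look like the sketch below. The line layout is an assumption, not something stated on this page: each line is taken to be space-separated `feature:count` pairs followed by a `#label#:` token holding either a class name or a star value. The function name and sample line are hypothetical.

```python
def parse_processed_line(line):
    """Split one pre-processed review into (feature_counts, label).

    Assumed layout: 'feature:count feature:count ... #label#:value'.
    """
    features = {}
    label = None
    for token in line.split():
        # Split on the last colon so feature names may contain colons.
        name, _, value = token.rpartition(":")
        if not name:
            continue  # token without a colon, ignore it
        if name == "#label#":
            label = value
        else:
            features[name] = int(value)
    return features, label


# Made-up line in the assumed layout.
sample = "great:2 read_this:1 #label#:positive"
features, label = parse_processed_line(sample)
print(features, label)
```

For the star-valued files the label comes back as a string such as "4.0" and can be converted with `float` if the assumed layout holds.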
https://creativecommons.org/publicdomain/zero/1.0/
*Also find Metacritic Movies and Metacritic TV Shows datasets.*
This dataset contains a collection of video games and their corresponding reviews from Metacritic, a popular aggregate review site. The data provides insights into various video games across different platforms, including PC, PlayStation, Xbox, and others. Each game entry includes critical reviews, user reviews, ratings, and other relevant information that can be used for analysis, natural language processing, machine learning, and predictive modeling.
Important Note: *The games in this collection are selected from Metacritic's Best Games of All Time list, which only includes titles that have received at least 7 reviews, ensuring a minimum level of critical and user input.*
Up-to-dateness: *This dataset is accurate as of March 14, 2025, and includes the most current rankings and game details available at that time.*
The dataset contains general information and scores of 13K+ games and their corresponding 1.6M+ user/critic reviews collected by sending automated requests to Metacritic's public backend API using Python's requests and pandas libraries.
This dataset is perfect for researchers, game enthusiasts, and data scientists who are interested in exploring the gaming industry through data analysis.
Open Data Commons Attribution License (ODC-By) v1.0 https://www.opendatacommons.org/licenses/by/1.0/
License information was derived automatically
This dataset encapsulates a comprehensive collection of public sentiments expressed through comments on a pivotal interview featuring Vladimir Putin and Tucker Carlson. The conversation, set against the backdrop of a tumultuous global economy and the critical discourse surrounding Putin's maneuvers in Ukraine, offers a deep dive into the reasons and viewpoints presented by the Russian leader. With over 100,000 comments, this dataset not only reflects diverse global perspectives but also serves as a mirror to the world's reaction to key international events discussed in the interview.
This dataset was ethically sourced, respecting user privacy and adhering to platform guidelines. Personal identifiers have been anonymized to maintain confidentiality.
We extend our gratitude to YouTube for facilitating an open platform where such vibrant discussions can take place and to the data science community for providing the tools and methodologies that enable this analysis.
As data scientists, we approach this dataset with neutrality, aiming to extract insights without bias. Our role is to analyze and interpret the data impartially, contributing to a broader understanding of public sentiment on significant global issues. Let's harness this opportunity to showcase the power of data science in illuminating the complexities of human discourse.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
A collection of documents and their emotions, which helps greatly in NLP classification tasks.
A list of documents, each with an emotion flag. The dataset is split into train, test, and validation sets for building a machine learning model.
Example: i feel like I am still looking at a blank canvas blank pieces of paper;sadness
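The example line suggests that each record is a text followed by a semicolon and an emotion label. A minimal parsing sketch under that assumption is shown below; the function name is hypothetical, and splitting on the last semicolon is a choice made here so that semicolons inside the text do not break the record.

```python
def parse_example(line):
    """Split a 'text;emotion' record into (text, emotion).

    Splits on the last semicolon so the text itself may contain semicolons.
    """
    text, sep, emotion = line.rstrip("\n").rpartition(";")
    if not sep:
        raise ValueError("expected a line of the form 'text;emotion'")
    return text.strip(), emotion.strip()


# The example record quoted in the dataset description.
record = "i feel like I am still looking at a blank canvas blank pieces of paper;sadness"
text, emotion = parse_example(record)
print(emotion)
```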
Thanks to Elvis (https://lnkd.in/eXJ8QVB) and the Hugging Face team. The technique used to prepare the dataset is described at https://www.aclweb.org/anthology/D18-1404/
The dataset helps the community develop emotion classification models with an NLP-based approach.
A few questions your emotion classification model can answer based on your customer reviews:
What is the sentiment of your customer's comment? What is the mood around today's special food?