The inspiration behind creating the OYO Review Dataset for sentiment analysis was to explore the sentiment and opinions expressed in hotel reviews on the OYO Hotels platform. Analyzing the sentiment of customer reviews can provide valuable insights into the overall satisfaction of guests, identify areas for improvement, and assist in making data-driven decisions to enhance the hotel experience. By collecting and curating this dataset, Deep Patel, Nikki Patel, and Nimil aimed to contribute to the field of sentiment analysis in the context of the hospitality industry. Sentiment analysis allows us to classify the sentiment expressed in textual data, such as reviews, into positive, negative, or neutral categories. This analysis can help hotel management and stakeholders understand customer sentiments, identify common patterns, and address concerns or issues that may affect the reputation and customer satisfaction of OYO Hotels. The dataset provides a valuable resource for training and evaluating sentiment analysis models specifically tailored to the hospitality domain. Researchers, data scientists, and practitioners can utilize this dataset to develop and test various machine learning and natural language processing techniques for sentiment analysis, such as classification algorithms, sentiment lexicons, or deep learning models. Overall, the goal of creating the OYO Review Dataset for sentiment analysis was to facilitate research and analysis in the area of customer sentiments and opinions in the hotel industry. By understanding the sentiment of hotel reviews, businesses can strive to improve their services, enhance customer satisfaction, and make data-driven decisions to elevate the overall guest experience.
Deep Patel: https://www.linkedin.com/in/deep-patel-55ab48199/ Nikki Patel: https://www.linkedin.com/in/nikipatel9/ Nimil Lathiya: https://www.linkedin.com/in/nimil-lathiya-059a281b1/
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository was created for my Master's thesis in Computational Intelligence and Internet of Things at the University of Córdoba, Spain. It stores the datasets used in some of the studies that served as research material for the thesis, along with the datasets used in its experimental part.
The datasets are listed below, along with details of their references, authors, and download sources.
----------- STS-Gold Dataset ----------------
The dataset consists of 2026 tweets. The file consists of 3 columns: id, polarity, and tweet. The three columns denote the unique id, polarity index of the text and the tweet text respectively.
Reference: Saif, H., Fernandez, M., He, Y., & Alani, H. (2013). Evaluation datasets for Twitter sentiment analysis: a survey and a new dataset, the STS-Gold.
File name: sts_gold_tweet.csv
----------- Amazon Sales Dataset ----------------
This dataset contains the ratings and reviews of 1K+ Amazon products, as listed on the official Amazon website. The data was scraped from the official Amazon website in January 2023.
Owner: Karkavelraja J., Postgraduate student at Puducherry Technological University (Puducherry, Puducherry, India)
Features:
License: CC BY-NC-SA 4.0
File name: amazon.csv
----------- Rotten Tomatoes Reviews Dataset ----------------
This rating inference dataset is a sentiment classification dataset containing 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews. On average, these reviews consist of 21 words. The first 5,331 rows contain only negative samples and the last 5,331 rows contain only positive samples, so the data should be shuffled before use.
This data is collected from https://www.cs.cornell.edu/people/pabo/movie-review-data/ as a txt file and converted into a csv file. The file consists of 2 columns: reviews and labels (1 for fresh (good) and 0 for rotten (bad)).
Reference: Bo Pang and Lillian Lee. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 115–124, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics.
File name: data_rt.csv
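Because the rows in data_rt.csv are ordered by class, a train/test split taken from the file as-is would be heavily skewed. The following is a minimal sketch of a reproducible shuffle using only the standard library; the toy rows stand in for the real file, and the helper name `shuffle_rows` and the seed value are illustrative assumptions.

```python
import random

def shuffle_rows(rows, seed=42):
    """Return a shuffled copy of rows; a fixed seed keeps runs reproducible."""
    shuffled = list(rows)              # copy so the original order is untouched
    random.Random(seed).shuffle(shuffled)
    return shuffled

# Toy stand-in for data_rt.csv: negatives (label 0) first, positives (label 1) last.
rows = [(f"negative review {i}", 0) for i in range(5)] + \
       [(f"positive review {i}", 1) for i in range(5)]

shuffled = shuffle_rows(rows)
```

Shuffling once with a fixed seed, and only then splitting, keeps both classes present in every split while making experiments repeatable.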
----------- Preprocessed Dataset Sentiment Analysis ----------------
Preprocessed Amazon product review data for the Gen3EcoDot (Alexa), scraped entirely from amazon.in.
Stemmed and lemmatized using nltk.
Sentiment labels are generated using TextBlob polarity scores.
The file consists of 4 columns: index, review (stemmed and lemmatized review using nltk), polarity (score) and division (categorical label generated using polarity score).
DOI: 10.34740/kaggle/dsv/3877817
Citation: @misc{pradeesh arumadi_2022, title={Preprocessed Dataset Sentiment Analysis}, url={https://www.kaggle.com/dsv/3877817}, DOI={10.34740/KAGGLE/DSV/3877817}, publisher={Kaggle}, author={Pradeesh Arumadi}, year={2022} }
This dataset was used in the experimental phase of my research.
File name: EcoPreprocessed.csv
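The division column is described as a categorical label derived from the polarity score. The exact cut-offs and label names used by the dataset author are not stated here, so the sketch below assumes the common convention that a score above zero is positive, below zero is negative, and exactly zero is neutral. The function name and the optional `neutral_band` parameter are hypothetical.

```python
def polarity_to_division(polarity, neutral_band=0.0):
    """Map a polarity score in [-1, 1] to a categorical label.

    Assumption: scores within +/- neutral_band of zero count as neutral.
    The dataset's real thresholds and label names may differ.
    """
    if polarity > neutral_band:
        return "positive"
    if polarity < -neutral_band:
        return "negative"
    return "neutral"

labels = [polarity_to_division(p) for p in (0.8, 0.0, -0.35)]
```

Widening `neutral_band` (for example to 0.1) treats weakly polar reviews as neutral, which may be worth trying if the neutral class looks too small.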
----------- Amazon Earphones Reviews ----------------
This dataset consists of 9930 Amazon reviews with star ratings for the 10 latest (as of mid-2019) Bluetooth earphone devices, intended for learning how to train machine learning models for sentiment analysis.
This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.
The file consists of 5 columns: ReviewTitle, ReviewBody, ReviewStar, Product and division (manually added - categorical label generated using ReviewStar score)
License: U.S. Government Works
Source: www.amazon.in
File name (original): AllProductReviews.csv (contains 14337 reviews)
File name (edited - used for my research): AllProductReviews2.csv (contains 9930 reviews)
----------- Amazon Musical Instruments Reviews ----------------
This dataset contains 7137 comments/reviews of different musical instruments coming from Amazon.
This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.
The file consists of 10 columns: reviewerID, asin (ID of the product), reviewerName, helpful (helpfulness rating of the review), reviewText, overall (rating of the product), summary (summary of the review), unixReviewTime (time of the review - unix time), reviewTime (time of the review - raw) and division (manually added - categorical label generated using overall score).
Source: http://jmcauley.ucsd.edu/data/amazon/
File name (original): Musical_instruments_reviews.csv (contains 10261 reviews)
File name (edited - used for my research): Musical_instruments_reviews2.csv (contains 7137 reviews)
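The unixReviewTime column can be turned into a calendar date for time-based analysis. This is a small sketch assuming the values are seconds since the Unix epoch interpreted in UTC; the helper name and the sample timestamp are illustrative and not taken from the file.

```python
from datetime import datetime, timezone

def unix_to_date(unix_seconds):
    """Convert seconds since the Unix epoch to an ISO date string (UTC)."""
    return datetime.fromtimestamp(unix_seconds, tz=timezone.utc).date().isoformat()

# Illustrative timestamp, not a value quoted from the dataset.
example = unix_to_date(1393545600)
```

Passing an explicit time zone avoids results that depend on the machine's local settings.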
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset provides a valuable collection of customer reviews for products purchased from Flipkart, a prominent e-commerce platform. It captures the customer experience and feedback regarding specific products, including their assigned ratings. The dataset is ideal for sentiment analysis, product insights, and understanding customer satisfaction.
The dataset is typically provided in a CSV format. It contains approximately 2,303 records across its columns. While the exact file size is not specified, its structure is well-defined with distinct columns for product information, reviews, and ratings.
This dataset is highly suitable for various analytical and machine learning applications, including:
* Performing sentiment analysis on customer reviews to gauge product perception.
* Extracting product insights and understanding common customer feedback themes.
* Developing and training Natural Language Processing (NLP) models for text classification and opinion mining.
* Building recommendation systems based on user ratings and review content.
* Analysing customer satisfaction levels and identifying areas for product improvement.
The dataset focuses on customer reviews from Flipkart. While Flipkart primarily operates in India, the dataset's stated region for availability is global. There are no specific notes on time range or demographic scope within the provided information for the reviews themselves. The dataset was listed on 17/06/2025.
CC0
This dataset is intended for a wide range of users, including:
* Data scientists and machine learning engineers for building sentiment analysis models.
* NLP researchers for advancing text understanding and processing techniques.
* Product managers seeking to understand customer feedback and improve product offerings.
* Business analysts looking to derive actionable insights from customer reviews and ratings.
* Individuals interested in AI and LLM data for training and experimentation.
Original Data Source: Flipkart Reviews Sentiment Analysis
This dataset is a collection of customer reviews obtained from Amazon.com. It is designed for multilingual sentiment analysis and opinion mining, containing reviews in five different languages: Italian, German, Spanish, French, and English. The dataset is valuable for natural language processing tasks, sentiment analysis algorithms, and various machine learning applications that require diverse language data for training and evaluation. It can be used to train and fine-tune models to automatically classify sentiments, predict customer satisfaction, and extract key information from customer reviews.
The dataset is typically provided in a CSV file format. While specific total row counts are not available, examples of column value distributions are present, such as 675 total values for user names and 640 total values for star ratings, with 92% being 5/5 reviews. The dataset is structured to support various text and NLP applications.
This dataset is ideal for a range of applications, including:
* Multilingual sentiment analysis.
* Opinion mining studies.
* Developing and testing natural language processing tasks.
* Building sentiment analysis algorithms.
* Training machine learning models to classify sentiments.
* Predicting customer satisfaction from review data.
* Extracting key insights and information from customer feedback.
The dataset's coverage is global, drawing reviews from Amazon.com. It includes content in Italian, German, Spanish, French, and English, indicating its relevance to regions where these languages are spoken. The dataset contains a 'date' column for each review; however, a specific time range for the reviews themselves is not provided.
CC-BY-NC
This dataset is suitable for:
* Data Scientists and Researchers: For developing and testing machine learning models for sentiment analysis, NLP, and text classification across multiple languages.
* E-commerce Analysts: To understand customer satisfaction, product performance, and market sentiment from user reviews.
* Language Model Developers: To fine-tune large language models with diverse text data for improved natural language understanding.
* Businesses: To gain insights into customer feedback and improve product or service offerings.
Original Data Source: Amazon Review Dataset LLM
https://creativecommons.org/publicdomain/zero/1.0/
By Huggingface Hub [source]
The Amazon Reviews Polarity Dataset covers eighteen years of customers' ratings and reviews from Amazon.com. Drawing from a pool of over 35 million customer reviews, it presents a broad spectrum of customer opinions on products they have bought or used. The data is useful for improving products and services because it records customers' experiences with a product, including ratings, titles, and plain-text content. The dataset also combines customer-specific data with product information, which supports deeper analytics and more tailored solutions for customers. Has your product been favored by the majority? Are there any aspects that need extra care? Use Amazon Reviews Polarity to gain deeper insights into what your customers want.
1. Analyze customer ratings to identify trends: Look at how many customers have rated the same product or service with the same score (e.g., 4 stars). You can use this information to identify what customers like or don't like about it by examining common sentiment throughout the reviews. Identifying these patterns can help you decide which features of your products or services to emphasize in order to boost sales and satisfaction rates.
2. Review content analysis: Analyzing review content is one of the best ways to gauge customer sentiment toward specific features or aspects of a product/service. Natural language processing tools such as Word2Vec, Latent Dirichlet Allocation (LDA), or even simple keyword search algorithms can quickly reveal general topics that are discussed in relation to your product/service across multiple reviews, allowing you to quickly pinpoint areas that may need improvement for particular items within your lines of business.
3. Track associated scores over time: By tracking customer ratings over time, you may be able to better understand when there has been an issue with something specific related to your product/service, such as a negative response toward a feature that was introduced but didn't seem popular among customers and was removed shortly after introduction. This can save time and money by identifying issues before they become widespread concerns among larger sets of consumers who spend money on your company's item(s).
4. Visualize sentiment data over time: Visualizations such as bar graphs can help identify trends across different categories more quickly than raw numbers alone. Combining numeric values with color differences between scores makes anomalies easier to spot, which speeds up working out why certain spikes occurred where others stayed stable (or vice versa) when comparing similar data points through time-series visualizations.
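As an illustration of the simple keyword search idea mentioned in the list above, the sketch below counts how many reviews mention each keyword. The sample reviews, the keyword list, and the function name are made up for the example; a real analysis would read the review text from the dataset files.

```python
import re
from collections import Counter

def keyword_counts(reviews, keywords):
    """Count how many reviews mention each keyword (case-insensitive, whole words)."""
    counts = Counter()
    for text in reviews:
        tokens = set(re.findall(r"[a-z']+", text.lower()))
        for keyword in keywords:
            if keyword in tokens:
                counts[keyword] += 1
    return counts

# Made-up reviews for illustration only.
reviews = [
    "Battery life is great, but the screen scratches easily.",
    "Terrible battery. Returned it after a week.",
    "Love the screen and the price.",
]
counts = keyword_counts(reviews, ["battery", "screen", "price"])
```

Counting reviews rather than raw occurrences keeps a single long review from dominating the totals.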
- Developing a customer sentiment analysis system that can be used to quickly analyze the sentiment of reviews and identify any potential areas of improvement.
- Building a product recommendation service that takes into account the ratings and reviews of customers when recommending similar products they may be interested in purchasing.
- Training a machine learning model to accurately predict customers' ratings on new products they have not yet tried, and leveraging this for further product development optimization initiatives.
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: train.csv

| Column name | Description |
|:------------|:------------|
| label | The sentiment of the review, either positive or negative. (String) |
| title | The title of the review. (String) |

...
Analyzing sentiments related to various products such as tablets, mobiles and other gadgets can be fun and difficult, especially when the reviews are collected across various demographics around the world. The task for this dataset is to develop a machine learning model that accurately classifies product reviews into 4 different classes of sentiment based on the raw text review provided by the user. Analyzing these sentiments will not only help serve customers better but can also reveal a lot of customer traits present or hidden in the reviews.
Sentiment analysis requires a lot to be taken into account, mainly because of the preprocessing needed to represent raw text in a machine-understandable form. Usually, we stem and lemmatize the raw text and then represent it using TF-IDF, word embeddings, etc. However, with state-of-the-art NLP models such as Transformer-based BERT models, one can skip manual feature engineering like TF-IDF and count vectorizers.
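To make the TF-IDF representation mentioned above concrete, here is a minimal pure-Python sketch. It assumes length-normalized term frequency and the unsmoothed inverse document frequency log(N / df); libraries typically apply smoothed variants, so their numbers will differ. The toy documents and the function name are illustrative.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    tf  = term count / document length
    idf = log(number of documents / number of documents containing the term)
    """
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        term_counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in term_counts.items()
        })
    return vectors

# Toy, already-tokenized reviews.
docs = [
    "good phone good camera".split(),
    "bad phone".split(),
    "good tablet".split(),
]
vectors = tf_idf(docs)
```

Terms that appear in fewer documents, such as "camera" here, receive a higher weight than terms shared across reviews, which is the property that makes TF-IDF useful as a hand-built feature.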
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset features 171,000 product reviews, meticulously labelled with sentiment indicators. It includes associated metadata such as product names and prices. The core purpose of this dataset is to facilitate sentiment analysis of product reviews, allowing for the automatic classification of textual content into positive, negative, or neutral sentiments. This resource is invaluable for understanding customer perception and informing business strategies or consumer purchasing decisions.
The dataset comprises approximately 171,000 product reviews. It typically exists in a tabular structure, often suitable for formats like CSV. The price data exhibits a wide range, with a significant number of entries between 59.00 and 4405.55. Review ratings are distributed across the scale, with notable counts for ratings between 4.80-5.00 and 1.00-1.20. Sentiment labels are well-distributed across positive, neutral, and negative categories.
This dataset is ideal for:
* Developing and training machine learning algorithms for sentiment analysis.
* Automating the classification of product reviews by sentiment.
* Tracking customer sentiment trends over time for specific products or brands.
* Identifying product strengths and areas for improvement based on customer feedback.
* Empowering consumers to make informed purchasing decisions by aggregating sentiment.
The dataset's region of coverage is global. No specific time range for the reviews themselves is specified within the available information, though the dataset was listed on 17/06/2025. No specific demographic scope is provided.
CC0
Original Data Source: 171k product review with Sentiment Dataset
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset is a substantial collection of over 241,000 English-language comments, gathered from various online platforms. Each comment within the dataset has been carefully annotated with a sentiment label: 0 for negative sentiment, 1 for neutral, and 2 for positive. The primary aim of this dataset is to facilitate the training and evaluation of multi-class sentiment analysis models, designed to work effectively with real-world text data. The dataset has undergone a preprocessing stage, ensuring comments are in lowercase, and are cleaned of punctuation, URLs, numbers, and stopwords, making it readily usable for Natural Language Processing (NLP) pipelines.
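New text fed to a model trained on this data would need the same cleaning. The sketch below reproduces the steps named above (lowercasing, then removing URLs, numbers, punctuation, and stopwords) with the standard library only. The stopword set is a tiny illustrative subset and the order of operations is an assumption; the dataset's actual pipeline is not documented here.

```python
import re
import string

# Illustrative subset only; the stopword list used for the dataset is unknown.
STOPWORDS = {"a", "an", "the", "is", "it", "to", "and", "of", "at", "for", "this"}

def preprocess(comment, stopwords=STOPWORDS):
    """Lowercase a comment and strip URLs, numbers, punctuation, and stopwords."""
    text = comment.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)                # URLs
    text = re.sub(r"\d+", " ", text)                                  # numbers
    text = text.translate(str.maketrans("", "", string.punctuation))  # punctuation
    return " ".join(tok for tok in text.split() if tok not in stopwords)

cleaned = preprocess("This phone is GREAT!!! Bought it at https://example.com for 299 dollars.")
```

URLs are removed before punctuation so that stripping the dots and slashes does not leave URL fragments behind as tokens.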
The dataset comprises over 241,000 records. While the specific file format is not detailed, such datasets are typically provided in a tabular format, often as a CSV file. It is structured with two distinct columns as described above, suitable for direct integration into machine learning workflows.
This dataset is ideally suited for a variety of applications and use cases, including:
* Training sentiment classifiers utilising advanced models such as LSTM, BiLSTM, CNN, BERT, or RoBERTa.
* Evaluating the efficacy of different preprocessing and tokenisation strategies for text data.
* Benchmarking NLP models on multi-class classification tasks to assess their performance.
* Supporting educational projects and research initiatives in the fields of opinion mining or text classification.
* Fine-tuning transformer models on a large and diverse collection of sentiment-annotated text.
The dataset's coverage is global, comprising English-language comments. It focuses on general user-generated text content without specific demographic notes. The dataset is listed with a version of 1.0.
CC0
This dataset is suitable for individuals and organisations involved in data science and analytics. Intended users include:
* Data Scientists and Machine Learning Engineers for developing and deploying sentiment analysis models.
* Researchers and Academics for studies in NLP, text classification, and opinion mining.
* Students undertaking educational projects in artificial intelligence and machine learning.
Original Data Source: Sentiment Analysis Dataset
This dataset was created by Kanwal Zahoor
https://creativecommons.org/publicdomain/zero/1.0/
The dataset has two columns: "Text" and "Label". The "Text" column contains opinions about traffic in Pakistan, including both positive and negative reviews. The "Label" column classifies each entry as either positive or negative, where positive reviews are labeled "0" and negative reviews are labeled "1". This dataset can be used for sentiment analysis tasks, which involve using natural language processing techniques to analyze and classify text data based on the emotions and opinions expressed within the text. By training a machine learning model on this dataset, you can create a system that automatically classifies new traffic-related opinions as either positive or negative. Possible applications of this type of sentiment analysis include monitoring public opinion about traffic-related issues, identifying areas where improvements are needed, and evaluating the effectiveness of traffic-related policies and initiatives. Additionally, businesses in the transportation industry could use this type of analysis to understand customer feedback and improve their services accordingly.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains customer reviews for various products, including details about product categories, brands, user ratings, and sentiment analysis. It is designed for applications such as sentiment classification, product recommendation systems, and the analysis of consumer behaviour. The dataset allows users to identify trends in customer satisfaction and gain insights into consumer preferences based on brand and category.
The data file is typically available in CSV format. The dataset comprises approximately 14,221 records. Analysis of the sentiment distribution within the dataset indicates that 84% of reviews are classified as positive, while 16% are classified as negative.
This dataset is ideally suited for several applications, including:
* Performing sentiment analysis on product reviews to gauge public opinion.
* Identifying patterns and trends in customer satisfaction over time.
* Developing and improving product recommendation systems.
* Understanding consumer preferences based on specific brands and product categories.
The dataset covers a time range from 30th July 2009 to 25th July 2017. The data has a global regional scope. No specific demographic scope is detailed within the available information.
CC0
This dataset is valuable for a range of users and their specific applications:
* Data Scientists and Machine Learning Engineers: To train and evaluate sentiment analysis models, develop natural language processing (NLP) applications, and build recommendation engines.
* Marketing Professionals: To understand customer feedback, identify popular products, and assess the impact of marketing campaigns on brand perception.
* Businesses and Product Managers: To inform product development strategies, monitor customer satisfaction, and identify areas for improvement based on consumer feedback.
* Researchers: For academic studies on consumer behaviour, sentiment analysis techniques, and market trends.
Original Data Source: Consumer Sentiments and Ratings
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
This dataset contains reviews of the top 10 rated airlines in 2023 sourced from the Airline Quality (https://www.airlinequality.com) website. The reviews cover various aspects of the flight experience, including seat comfort, staff service, food and beverages, inflight entertainment, value for money, and overall rating. The dataset is suitable for sentiment analysis, customer satisfaction analysis, and other similar tasks.
Usage
- Download the dataset file airlines_reviews.csv.
- Use the dataset for analysis, visualization, and machine learning tasks.

List of Airlines
1. Singapore Airlines
2. Qatar Airways
3. All Nippon Airways
4. Emirates
5. Japan Airlines
6. Turkish Airlines
7. Air France
8. Cathay Pacific Airways
9. EVA Air
10. Korean Air
This dataset is provided under the MIT License.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset comprises a collection of product reviews gathered from prominent e-commerce platforms, including Hepsiburada, Trendyol, and N11. It provides a valuable resource for various data science and analytics tasks, offering insights into customer feedback and sentiment towards diverse products. The dataset is particularly well-suited for developing and evaluating models for classification, text mining, and natural language processing applications.
The dataset is provided in CSV UTF-8 format. It contains a total of 15,170 reviews. Within this collection, there are 6,799 positive reviews, 6,978 negative reviews, and 1,393 neutral reviews, providing a varied distribution of sentiment.
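The counts above show that the neutral class is much smaller than the other two (roughly 9% of the reviews). A short sketch of how the class shares and simple inverse-frequency class weights could be computed from those counts follows; the weighting formula is one common heuristic chosen for illustration, not something prescribed by the dataset.

```python
# Review counts as stated in the dataset description.
counts = {"positive": 6799, "negative": 6978, "neutral": 1393}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Heuristic inverse-frequency weights: total / (number of classes * class count).
weights = {label: total / (len(counts) * n) for label, n in counts.items()}
```

Weights of this kind can be passed to a classifier's loss function so that errors on the small neutral class count for more during training.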
This dataset is ideal for:
* Sentiment Analysis: Building and training models to classify product review sentiment.
* Text Classification: Developing algorithms to categorise text based on expressed opinions.
* Natural Language Processing (NLP) Research: Exploring various NLP techniques such as topic modelling, named entity recognition, and language understanding in the context of e-commerce feedback.
* Customer Feedback Analysis: Gaining insights into customer satisfaction and product performance.
The data originates from various e-commerce platforms, offering a global scope in terms of its potential applicability. The dataset was listed on 8th June 2025. Specific demographic or precise temporal ranges for the collected reviews are not detailed in the available information, though it pertains to online consumer product commentary.
CC0
This dataset is suitable for:
* Data Scientists: For machine learning projects focused on text analysis and classification.
* Machine Learning Engineers: To train and test sentiment analysis models.
* Academic Researchers: For studies in computational linguistics, natural language processing, and e-commerce analytics.
* Businesses: To understand customer opinions and improve product offerings or services.
Original Data Source: E-Ticaret Ürün Yorumları
https://creativecommons.org/publicdomain/zero/1.0/
By imdb (From Huggingface) [source]
The IMDb Large Movie Review Dataset is a comprehensive collection of movie reviews used for sentiment classification. The dataset includes a wide range of movie reviews along with their corresponding sentiment labels, which indicate whether the review is positive or negative in nature. This invaluable dataset is aimed at facilitating sentiment analysis and classification tasks in the field of natural language processing.
The main purpose of the train.csv file within this dataset is to provide a curated collection of movie reviews, each accompanied by its respective sentiment label. This file proves particularly useful for training machine learning models to accurately predict sentiment and classify reviews based on their emotional tone.
Similarly, the test.csv file contains another set of movie reviews along with corresponding sentiment labels. Meant for testing and validating the performance of trained models, this dataset enables researchers and developers to evaluate their models' effectiveness in real-world scenarios.
Additionally, the unsupervised.csv file offers an alternative subset within the dataset. Unlike train.csv and test.csv, unsupervised.csv does not include any associated sentiment labels for individual movie reviews. This specific subset serves as a valuable resource for exploring unsupervised learning techniques within the domain of sentiment classification.
By utilizing this meticulously compiled IMDb Large Movie Review Dataset, researchers and data scientists can delve into various aspects related to analyzing sentiments in textual data. With its carefully labeled data points covering both positive and negative sentiments expressed in diverse film critiques, this dataset empowers users to develop sophisticated machine learning algorithms that accurately assess subjective opinions from text data
Introduction:
Dataset Overview:
- train.csv: This file contains a set of movie reviews along with their sentiment labels. It is intended for training your sentiment analysis models.
- test.csv: This file provides another set of movie reviews along with their corresponding sentiment labels. You can use this file to evaluate the performance of your trained models.
- unsupervised.csv: This file includes movie reviews without any associated sentiment labels. It can be used for unsupervised sentiment classification tasks.

Columns in the Dataset:
- text: The main column containing the text of each movie review.
- label: The sentiment label assigned to each review, indicating whether it is positive or negative.
Guidelines for Using the Dataset:
Training Your Model:
- Begin by loading and preprocessing the data from train.csv
- Treat 'text' as your input feature and 'label' as your target variable
- Explore different machine learning or deep learning algorithms suitable for text classification
- Train your model using various techniques, such as bag-of-words, word embeddings, or transformers
- Evaluate and fine-tune your model's performance using test.csv
Evaluating Your Model:
- Load test.csv and preprocess the data similar to what you did with train.csv
- Use this preprocessed test data to evaluate the accuracy, precision, recall, F1 score or other relevant metrics of your trained model on unseen data
- Analyze these metrics to understand how well your model is performing in predicting sentiments
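The evaluation metrics named above can be computed directly from the true and predicted labels. This is a minimal sketch for the binary case; the label encoding (1 for positive, 0 for negative) and the toy label lists are assumptions made for the example.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Toy labels: 1 = positive review, 0 = negative review (assumed encoding).
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
```

Reporting precision and recall alongside accuracy shows whether a model's mistakes are mostly false positives or false negatives.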
Advancing Your Model (Unsupervised Classification):
- Utilize unsupervised.csv for unsupervised sentiment classification tasks
- Preprocess the movie reviews in this file and explore techniques like clustering, topic modeling, or self-supervised learning
- Extract patterns, themes, or sentiments from the reviews without any guidance from labeled data
Conclusion:
- Sentiment Analysis: This dataset can be used to train models for sentiment analysis, where the goal is to predict whether a movie review is positive or negative based on its text.
- NLP Research: The dataset can be used for various natural language processing (NLP) tasks such as text classification, information extraction, or named entity recognition. Researchers and practitioners can leverage this dataset to develop and evaluate new algorithms and techniques in the field of NLP.
- Recommendation Systems: The sentiment labels in this dataset can be used as a source of feedback or user preferences for recommendation systems. By analyzing the sentiments expressed in reviews,...
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of "uHack Sentiments 2.0: Decode Code Words" provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/manishtripathi86/uhack-sentiments-20-decode-code-words on 28 January 2022.
--- Dataset description provided by original source is as follows ---
The challenge here is to analyze and deep dive into the natural language text (reviews) and bucket them based on their topics of discussion. Furthermore, analyzing the overall sentiment will also help the business to make tangible decisions.
The data set provided to you has a mix of customer reviews for products across categories and retailers. We would like you to model on the data to:
- bucket future reviews into their respective topics (Note: a review can talk about multiple topics)
- predict the overall polarity (positive/negative sentiment)
Topics: Components, Delivery and Customer Support, Design and Aesthetics, Dimensions, Features, Functionality, Installation, Material, Price, Quality and Usability.
Polarity: Positive/Negative.
Note: The target variables are all encoded in the train dataset for convenience. Please submit the test results in a similarly encoded fashion for us to evaluate your results.
| Field Name | Data Type | Purpose | Variable type |
| --- | --- | --- | --- |
| Id | Integer | Unique identifier for each review | Input |
| Review | String | Review written by customers on a retail website | Input |
| Components | String | 1: aspects related to components; 0: None | Target |
| Delivery and Customer Support | String | 1: some aspects related to delivery, return, exchange and customer support; 0: None | Target |
| Design and Aesthetics | String | 1: some aspects related to components; 0: None | Target |
| Dimensions | String | 1: related to product dimension and size; 0: None | Target |
| Features | String | 1: related to product features; 0: None | Target |
| Functionality | String | 1: related to working of a product; 0: None | Target |
| Installation | String | 1: related to installation of the product; 0: None | Target |
| Material | String | 1: related to material of the product; 0: None | Target |
| Price | String | 1: related to pricing details of a product; 0: None | Target |
| Quality | String | 1: related to quality aspects of a product; 0: None | Target |
| Usability | String | 1: related to usability of a product; 0: None | Target |
| Polarity | Integer | 1: Positive sentiment; 0: Negative sentiment | Target |
Skills: Text Pre-processing (Lemmatization, Tokenization, N-Grams and other relevant methods); Multi-Class Classification; Multi-label Classification; Optimizing Log Loss
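Because a review can carry several topic flags plus a polarity flag, each target column is a separate 0/1 prediction. The sketch below scores such predictions with binary log loss averaged over every cell. This averaging is an assumption: the source names log loss as the objective but does not give the exact formula. The function name, column subset, and sample values are hypothetical.

```python
import math


def mean_column_log_loss(y_true, y_prob, eps=1e-15):
    """Average binary log loss over every (row, column) cell.

    y_true holds 0/1 flags and y_prob holds predicted probabilities of 1,
    both as lists of rows with one entry per target column.
    """
    total, cells = 0.0, 0
    for true_row, prob_row in zip(y_true, y_prob):
        for t, p in zip(true_row, prob_row):
            p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
            total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
            cells += 1
    return total / cells


# Hypothetical subset of target columns, in this order:
# Delivery and Customer Support, Price, Quality, Polarity
y_true = [[1, 0, 0, 0],
          [0, 1, 1, 1]]
uninformed = [[0.5, 0.5, 0.5, 0.5],
              [0.5, 0.5, 0.5, 0.5]]
confident = [[0.9, 0.1, 0.2, 0.1],
             [0.1, 0.8, 0.7, 0.9]]

baseline_loss = mean_column_log_loss(y_true, uninformed)
model_loss = mean_column_log_loss(y_true, confident)
print(baseline_loss, model_loss)
```

Predicting 0.5 everywhere gives a loss of ln 2, which is a useful baseline: a model is only helping if its loss falls below that.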
Overview: Ugam, a Merkle company, is a leading analytics and technology services company. Our customer-centric approach delivers impactful business results for large corporations by leveraging data, technology, and expertise.
We consistently deliver superior, impactful results through the right blend of human intelligence and AI. With 3300+ people spread across locations worldwide, we successfully deploy our services to create success stories across industries like Retail & Consumer Brands, High Tech, BFSI, Distribution, and Market Research & Consulting. Over the past 21 years, Ugam has been recognized by several firms including Forrester and Gartner, named the No.1 data science company in India by Analytics Insight, and certified as a Great Place to Work®.
Problem Statement: The last two decades have witnessed a significant change in how consumers purchase products and express their experience/opinions in reviews, posts, and content across platforms. These online reviews are not only useful to reflect customers' sentiment towards a product but also help businesses fix gaps and find potential opportunities which could further influence future purchases.
Participants need to develop a machine learning model that can analyse customers' sentiments based on their reviews and feedback.
NOTE: The prize money will be for the interested candidates who are willing to get interviewed or hired by Ugam. Winners are requested to come to the Machine Learning Developers Summit 2022, happening in Bangalore, to receive the prize money.
dataset link: https://machinehack.com/hackathon/uhack_sentiments_20_decode_code_words/overview
--- Original source retains full ownership of the source dataset ---
This dataset contains Google Play Store reviews for Nykaa, a multi-brand cosmetics e-commerce company, collected up to August 2021. It aims to provide insights into customer sentiment, categorised as positive, neutral, or negative based on star ratings. The dataset is valuable for understanding customer satisfaction and identifying key themes within app reviews.
The dataset is structured with two columns and is provided in a Fasttext-compatible format. A test split constitutes 20% of the total data, which is approximately one-quarter the size of the training data. Specific total row or record counts are not available in the provided information.
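For readers unfamiliar with the format, fastText's supervised files put one example per line, with each label written as a token carrying the `__label__` prefix (fastText's default) followed by the text. The sketch below splits such a line into its two columns; the sample line and the label name `positive` are assumptions, since the actual label spellings in this dataset are not given here.

```python
def parse_fasttext_line(line, prefix="__label__"):
    """Split one fastText-style line into (labels, text).

    Leading tokens that start with the prefix are treated as labels;
    everything from the first ordinary token onward is the review text.
    """
    labels, words = [], []
    for token in line.strip().split():
        if token.startswith(prefix) and not words:
            labels.append(token[len(prefix):])
        else:
            words.append(token)
    return labels, " ".join(words)


# Hypothetical line; real label names may differ.
sample = "__label__positive Love the app, delivery was quick"
labels, text = parse_fasttext_line(sample)
print(labels, text)
```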
This dataset is ideal for a range of natural language processing (NLP) tasks, including sentiment analysis, text classification, and customer feedback analysis. It can be utilised by data scientists and machine learning engineers to build and train models for predicting customer sentiment, identifying common complaints or praises, and gaining actionable insights into user experience for e-commerce applications.
The dataset covers Google Play Store reviews for Nykaa, collected up until August 2021. It has global regional coverage, capturing a broad spectrum of user feedback.
CC-BY-NC
Intended users include data scientists, machine learning practitioners, NLP researchers, and businesses in the e-commerce sector. They can use this dataset to develop sentiment analysis models, understand customer satisfaction trends, inform product development, and enhance user engagement strategies.
Nykaa App Reviews Sentiment, E-commerce App Review Sentiment, Nykaa Customer Review Data, Mobile App User Sentiment
Original Data Source: Nykaa App Review Sentiment
Overview: Analyzing sentiments related to various products such as tablets, mobiles and various other gizmos can be fun and difficult, especially when collected across various demographics around the world. In this weekend hackathon, we challenge the machinehackers community to develop a machine learning model to accurately classify various products into 4 different classes of sentiments based on the raw text review provided by the user. Analyzing these sentiments will not only help us serve the customers better but can also reveal a lot of customer traits present or hidden in the reviews.
Sentiment analysis requires a lot to be taken into account, mainly due to the preprocessing involved to represent raw text and make it machine-understandable. Usually, we stem and lemmatize the raw information and then represent it using TF-IDF, word embeddings, etc. However, given state-of-the-art NLP models such as Transformer-based BERT models, one can skip manual feature engineering such as TF-IDF and count vectorizers.
In this short span of time, we would encourage you to leverage the ImageNet moment (Transfer Learning) in NLP using various pre-trained models.
Dataset Description:
- Train.csv - 6364 rows x 4 columns (Includes Sentiment Columns as Target)
- Test.csv - 2728 rows x 3 columns
- Sample Submission.csv - Please check the Evaluation section for more details on how to generate a valid submission
Attribute Description:
- Text_ID - Unique Identifier
- Product_Description - Description of the product review by a user
- Product_Type - Different types of product (9 unique products)
- Class - Represents various sentiments: 0 - Cannot Say, 1 - Negative, 2 - Positive, 3 - No Sentiment
Skills:
- NLP, Sentiment Analysis
- Feature extraction from raw text using TF-IDF, CountVectorizer
- Using Word Embedding to represent words as vectors
- Using Pretrained models like Transformers, BERT
- Optimizing multi-class log loss to generalize well on unseen data
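Since the hackathon targets multi-class log loss over the four sentiment classes, a short sketch of that metric may help. It assumes each prediction is a row of four class probabilities indexed by the Class codes 0 to 3; the function name and sample rows are hypothetical, and the exact submission format should be taken from the Evaluation section rather than from this example.

```python
import math


def multiclass_log_loss(true_classes, probabilities, eps=1e-15):
    """Mean negative log-probability assigned to the true class of each row."""
    total = 0.0
    for true_class, row in zip(true_classes, probabilities):
        p = row[true_class] / sum(row)   # renormalise so the row sums to 1
        p = min(max(p, eps), 1 - eps)    # clip to avoid log(0)
        total -= math.log(p)
    return total / len(true_classes)


# Hypothetical true Class codes (0 to 3) for three reviews.
true_classes = [2, 1, 3]

# A model with no information spreads probability evenly over 4 classes.
uniform = [[0.25, 0.25, 0.25, 0.25] for _ in true_classes]

# A model that puts most of its mass on the correct class.
confident = [
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]

uniform_loss = multiclass_log_loss(true_classes, uniform)
confident_loss = multiclass_log_loss(true_classes, confident)
print(uniform_loss, confident_loss)
```

The uniform prediction scores ln 4, so that value serves as a sanity-check ceiling for any trained model.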
If you are interested in joining Kaggle University Club, please e-mail Jessica Li at lijessica@google.com
This Hackathon is open to all undergraduate, master, and PhD students who are part of the Kaggle University Club program. The Hackathon provides students with a chance to build capacity via hands-on ML, learn from one another, and engage in a self-defined project that is meaningful to their careers.
Teams must register via Google Form to be eligible for the Hackathon. The Hackathon starts on Monday, November 12, 2018 and ends on Monday, December 10, 2018. Teams have one month to work on a team submission. Teams must do all work within the Kernel editor and set Kernel(s) to public at all times.
The freestyle format of hackathons has time and again stimulated groundbreaking and innovative data insights and technologies. The Kaggle University Club Hackathon recreates this environment virtually on our platform. We challenge you to build a meaningful project around the UCI Machine Learning - Drug Review Dataset. Teams are free to let their creativity run and propose methods to analyze this dataset and form interesting machine learning models.
Machine learning has permeated nearly all fields and disciplines of study. One hot topic is using natural language processing and sentiment analysis to identify, extract, and make use of subjective information. The UCI ML Drug Review dataset provides patient reviews on specific drugs along with related conditions and a 10-star patient rating system reflecting overall patient satisfaction. The data was obtained by crawling online pharmaceutical review sites. This data was published in a study on sentiment analysis of drug experience over multiple facets, e.g., sentiments learned on specific aspects such as effectiveness and side effects (see the acknowledgments section to learn more).
The sky's the limit here in terms of what your team can do! Teams are free to add supplementary datasets in conjunction with the drug review dataset in their Kernel. Discussion is highly encouraged within the forum and Slack so everyone can learn from their peers.
Here are just a couple of ideas as to what you could do with the data:
There is no one correct answer to this Hackathon, and teams are free to define the direction of their own project. That being said, there are certain core elements generally found across all outstanding Kernels on the Kaggle platform. The best Kernels are:
Teams with top submissions have a chance to receive exclusive Kaggle University Club swag and be featured on our official blog and across social media.
IMPORTANT: Teams must set all Kernels to public at all times. This is so we can track each team's progression, but more importantly it encourages collaboration, productive discussion, and healthy inspiration to all teams. It is not so that teams can simply copycat good ideas. If a team's Kernel isn't their own organic work, it will not be considered a top submission. Teams must come up with a project on their own.
The final Kernel submission for the Hackathon must contain the following information:
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains tweets posted for various services and products along with the emotion contained in each tweet. It is designed to be used for training various machine learning models focused on analysing sentiments in tweets. The dataset includes key information such as the tweet text, the specific product or service the tweet references, and the emotion expressed within the tweet.
The dataset is typically provided in a tabular format, suitable for data analysis and machine learning tasks. While the exact number of rows or records is not specified in the provided information, it consists of a collection of tweet entries. Data files are usually in CSV format.
This dataset is ideally suited for:
- Developing and training machine learning models for sentiment analysis.
- Analysing customer feedback and public opinion towards products and services expressed on social media.
- Research into natural language processing (NLP) and text classification.
- Understanding trends in public sentiment related to specific brands or industries.
The dataset has a global coverage, making it applicable for analysis of tweets from various regions. Specific time ranges or demographic scopes are not detailed in the available information.
CC0
This dataset is intended for:
- Machine Learning Engineers and Data Scientists for model development.
- Researchers in natural language processing, social media analysis, and marketing.
- Businesses looking to analyse public sentiment regarding their products or market trends.
- Students learning about data analysis, NLP, and machine learning.
Original Data Source: Product Tweets Dataset
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset provides a collection of over 12,000 user reviews for various applications from an app store. It includes user-assigned ratings, which can be used to classify reviews as either positive or negative. The dataset is a valuable resource for conducting sentiment analysis tasks and can assist beginners in working with annotated, real-world data to understand user feedback on mobile applications. It serves as a foundation for exploring consumer sentiment and application performance insights.
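Turning star ratings into positive or negative labels needs a cut-off, which the description does not specify. The sketch below assumes one common convention: ratings of 1 to 2 count as negative, 4 to 5 as positive, and 3 is left unlabeled. The thresholds and the function name are assumptions and should be adjusted to the task.

```python
def rating_to_sentiment(rating, negative_max=2, positive_min=4):
    """Map a 1-5 star rating to 'negative', 'positive', or None (ambiguous)."""
    if rating <= negative_max:
        return "negative"
    if rating >= positive_min:
        return "positive"
    return None  # middle ratings are treated as ambiguous in this sketch


# Hypothetical ratings standing in for the dataset's rating column.
ratings = [1, 2, 3, 4, 5]
labels = [rating_to_sentiment(r) for r in ratings]
print(labels)
```

Dropping the middle rating keeps the two classes cleaner at the cost of discarding some reviews; a three-class setup would instead map it to a neutral label.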
The dataset contains over 12,000 distinct reviews, with 12,495 unique review identifiers recorded. Ratings are distributed across the 1 to 5 scale, with significant counts for scores like 1.00-1.20 (2,506 reviews), 2.00-2.20 (2,344 reviews), 3.00-3.20 (1,991 reviews), 4.00-4.20 (2,775 reviews), and 4.80-5.00 (2,879 reviews). The number of upvotes (thumbsUpCount) for reviews spans a wide range, from 0 to 397. Many reviews (17%) do not specify a version, while '1.5.11' accounts for 4% of review versions. A substantial portion of reviews (53%) have no corresponding reply content. The data is typically provided in a CSV file format.
This dataset is ideally suited for a variety of analytical and machine learning applications. It is particularly useful for:
- Performing sentiment analysis to gauge public opinion on mobile applications.
- Developing and training natural language processing (NLP) models, such as BERT-based sentiment classifiers.
- Extracting key insights and trends from user feedback to inform app development and marketing strategies.
- Educating beginners in the field of sentiment analysis and text mining using annotated, real-world data.
- Analysing user engagement and the impact of replies on review visibility.
The dataset offers a global scope, encompassing reviews from users worldwide. The time range for user-posted reviews extends from 8th February 2015 to 28th October 2020. Replies to reviews cover a slightly broader period, from 14th January 2013 to 28th October 2020. The data reflects feedback from real users of various app store applications, providing a diverse demographic perspective on mobile app usage and satisfaction.
CC0
This dataset is beneficial for a wide range of users, including:
- Data Scientists and Machine Learning Engineers: For building and evaluating sentiment analysis models, text classification systems, and other NLP applications.
- Researchers: To study user behaviour, app success factors, and the dynamics of online reviews.
- App Developers and Product Managers: To understand user feedback, identify pain points, and prioritise feature development based on sentiment.
- Market Analysts: To monitor brand perception, conduct competitor analysis, and track market trends in the app industry.
- Students: As an excellent practical resource for learning about data cleaning, text preprocessing, and sentiment analysis techniques.
Original Data Source: Google Play Store Reviews