Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository was created for my Master's thesis in Computational Intelligence and Internet of Things at the University of Córdoba, Spain. The purpose of this repository is to store the datasets found that were used in some of the studies that served as research material for this Master's thesis. Also, the datasets used in the experimental part of this work are included.
Below are the datasets specified, along with the details of their references, authors, and download sources.
----------- STS-Gold Dataset ----------------
The dataset consists of 2026 tweets. The file consists of 3 columns: id, polarity, and tweet. The three columns denote the unique id, polarity index of the text and the tweet text respectively.
Reference: Saif, H., Fernandez, M., He, Y., & Alani, H. (2013). Evaluation datasets for Twitter sentiment analysis: a survey and a new dataset, the STS-Gold.
File name: sts_gold_tweet.csv
----------- Amazon Sales Dataset ----------------
This dataset is having the data of 1K+ Amazon Product's Ratings and Reviews as per their details listed on the official website of Amazon. The data was scraped in the month of January 2023 from the Official Website of Amazon.
Owner: Karkavelraja J., Postgraduate student at Puducherry Technological University (Puducherry, Puducherry, India)
Features:
License: CC BY-NC-SA 4.0
File name: amazon.csv
----------- Rotten Tomatoes Reviews Dataset ----------------
This rating inference dataset is a sentiment classification dataset, containing 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews. On average, these reviews consist of 21 words. The first 5331 rows contains only negative samples and the last 5331 rows contain only positive samples, thus the data should be shuffled before usage.
This data is collected from https://www.cs.cornell.edu/people/pabo/movie-review-data/ as a txt file and converted into a csv file. The file consists of 2 columns: reviews and labels (1 for fresh (good) and 0 for rotten (bad)).
Reference: Bo Pang and Lillian Lee. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 115–124, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics
File name: data_rt.csv
----------- Preprocessed Dataset Sentiment Analysis ----------------
Preprocessed amazon product review data of Gen3EcoDot (Alexa) scrapped entirely from amazon.in
Stemmed and lemmatized using nltk.
Sentiment labels are generated using TextBlob polarity scores.
The file consists of 4 columns: index, review (stemmed and lemmatized review using nltk), polarity (score) and division (categorical label generated using polarity score).
DOI: 10.34740/kaggle/dsv/3877817
Citation: @misc{pradeesh arumadi_2022, title={Preprocessed Dataset Sentiment Analysis}, url={https://www.kaggle.com/dsv/3877817}, DOI={10.34740/KAGGLE/DSV/3877817}, publisher={Kaggle}, author={Pradeesh Arumadi}, year={2022} }
This dataset was used in the experimental phase of my research.
File name: EcoPreprocessed.csv
----------- Amazon Earphones Reviews ----------------
This dataset consists of a 9930 Amazon reviews, star ratings, for 10 latest (as of mid-2019) bluetooth earphone devices for learning how to train Machine for sentiment analysis.
This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.
The file consists of 5 columns: ReviewTitle, ReviewBody, ReviewStar, Product and division (manually added - categorical label generated using ReviewStar score)
License: U.S. Government Works
Source: www.amazon.in
File name (original): AllProductReviews.csv (contains 14337 reviews)
File name (edited - used for my research) : AllProductReviews2.csv (contains 9930 reviews)
----------- Amazon Musical Instruments Reviews ----------------
This dataset contains 7137 comments/reviews of different musical instruments coming from Amazon.
This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.
The file consists of 10 columns: reviewerID, asin (ID of the product), reviewerName, helpful (helpfulness rating of the review), reviewText, overall (rating of the product), summary (summary of the review), unixReviewTime (time of the review - unix time), reviewTime (time of the review (raw) and division (manually added - categorical label generated using overall score).
Source: http://jmcauley.ucsd.edu/data/amazon/
File name (original): Musical_instruments_reviews.csv (contains 10261 reviews)
File name (edited - used for my research) : Musical_instruments_reviews2.csv (contains 7137 reviews)
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Dataset Card for amazon reviews for sentiment analysis
Dataset Summary
One of the most important problems in e-commerce is the correct calculation of the points given to after-sales products. The solution to this problem is to provide greater customer satisfaction for the e-commerce site, product prominence for sellers, and a seamless shopping experience for buyers. Another problem is the correct ordering of the comments given to the products. The prominence of misleading… See the full description on the dataset page: https://huggingface.co/datasets/hugginglearners/amazon-reviews-sentiment-analysis.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
camera
chillies/course-review-multilabel-sentiment-analysis dataset hosted on Hugging Face and contributed by the HF Datasets community
Product Review Datasets: Uncover user sentiment
Harness the power of Product Review Datasets to understand user sentiment and insights deeply. These datasets are designed to elevate your brand and product feature analysis, help you evaluate your competitive stance, and assess investment risks.
Data sources:
Leave the data collection challenges to us and dive straight into market insights with clean, structured, and actionable data, including:
Choose from multiple data delivery options to suit your needs:
Why choose Oxylabs?
Fresh and accurate data: Access organized, structured, and comprehensive data collected by our leading web scraping professionals.
Time and resource savings: Concentrate on your core business goals while we efficiently handle the data extraction process at an affordable cost.
Adaptable solutions: Share your specific data requirements, and we'll craft a customized data collection approach to meet your objectives.
Legal compliance: Partner with a trusted leader in ethical data collection. Oxylabs is a founding member of the Ethical Web Data Collection Initiative, aligning with GDPR and CCPA standards.
Pricing Options:
Standard Datasets: choose from various ready-to-use datasets with standardized data schemas, priced from $1,000/month.
Custom Datasets: Tailor datasets from any public web domain to your unique business needs. Contact our sales team for custom pricing.
Experience a seamless journey with Oxylabs:
Join the ranks of satisfied customers who appreciate our meticulous attention to detail and personalized support. Experience the power of Product Review Datasets today to uncover valuable insights and enhance decision-making.
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The objective of sentiment analysis for movie reviews is to automatically analyze and categorize the sentiments expressed in reviews, providing insights into audience opinions, emotions, and reactions towards films.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
IMDB movie review sentiment classification dataset (Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. (2011). Learning Word Vectors for Sentiment Analysis. The 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011)). For more information please refer to: https://ai.stanford.edu/~amaas/data/sentiment/
The IMDB dataset was modified as follows to prepare it for use in a Galaxy Training Tutorial (https://training.galaxyproject.org/):
The top 50 words are excluded (mostly stop words). Included the next 10,000 top words. Reviews are limited to 500 words max (Longer reviews trimmed and shorter reviews are padded). 25,000 reviews are used for training and testing each. Files are in tsv (tab separated value) format to be consumed by Galaxy (www.usegalaxy.org).
FABSA, An aspect-based sentiment analysis dataset in the Customer Feedback space (Trustpilot, Google Play and Apple Store reviews).
A professionally annotated dataset released by Chattermill AI, with 8 years of experience in leveraging advanced ML analytics in the customer feedback space for high-profile clients such as Amazon and Uber.
Two annotators possess extensive experience in developing human-labeled ABSA datasets for commercial companies, while the third annotator holds a PhD in computational linguistics.
There has been a lack of high-quality ABSA datasets covering broad domains and addressing real-world applications. Academic progress has been confined to benchmarking on domain-specific, toy datasets such as restaurants and laptops, which are limited in size (e.g., SemEval Task ABSA or SentiHood).
This dataset is part of the FABSA paper, and we release it hoping to advance academic progress as tools for ingesting and analyzing customer feedback at scale improve significantly, yet evaluation datasets continue to lag. FABSA is a new, large-scale, multi-domain ABSA dataset of feedback reviews, consisting of approximately 10,500 reviews spanning 10 domains (Fashion, Consulting, Travel Booking, Ride-hailing, Banking, Trading, Streaming, Price Comparison, Information Technology, and Groceries).
Academic Paper
@article{KONTONATSIOS2023126867, title = {FABSA: An aspect-based sentiment analysis dataset of user reviews}, journal = {Neurocomputing}, volume = {562}, pages = {126867}, year = {2023}, issn = {0925-2312}, doi = {https://doi.org/10.1016/j.neucom.2023.126867}, url = {https://www.sciencedirect.com/science/article/pii/S0925231223009906}, author = {Georgios Kontonatsios and Jordan Clive and Georgia Harrison and Thomas Metcalfe and Patrycja Sliwiak and Hassan Tahir and Aji Ghose}, keywords = {ABSA, Multi-domain dataset, Deep learning}, }
The data has 67K+ reviews in Arabic for sentiment analysis Data collecting using web scraping for many companies Like ( talabat,kabiter,nasla,swifil,alsiwidiu,kilubatra,dumati,.........etc)
Coulnms
Reviews : review description rating : 1 postive , 0 neutral , -1 negative Company : continues company name for each review
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains preprocessed Amazon product review data for the Gen3EcoDot, primarily scraped from amazon.in. It is designed to facilitate the training and testing of classification models, particularly for sentiment analysis. The reviews have been stemmed and lemmatised using NLTK, and sentiment labels are generated using TextBlob polarity scores, making it ready for direct use in machine learning and natural language processing tasks.
The dataset is provided in a tabular format, typically a CSV file, and is preprocessed for immediate use. While specific row counts are not explicitly stated as a single number, the 'division' column includes approximately 4156 records, categorised into various ranges. The 'polarity' column also contains counts for sentiment ranges, summing to approximately 4084 records, indicating positive, neutral, and negative sentiments.
This dataset is ideal for a variety of applications, including: * Training and testing sentiment classification models. * Developing and evaluating Natural Language Processing (NLP) algorithms. * Conducting sentiment analysis on product reviews. * Academic research in text analytics and machine learning. * Building applications that require pre-classified text data.
The data primarily covers product reviews from amazon.in, providing a global scope for e-commerce sentiment. The listing date for the dataset is noted as 16/06/2025. No specific historical time range for the reviews themselves or demographic details are provided.
CC0
This dataset is suitable for: * Data scientists and machine learning engineers looking for preprocessed text data to train classification models. * Researchers and academics in NLP, sentiment analysis, and text mining. * Students learning about text preprocessing and sentiment modelling. * Developers building applications that require sentiment understanding from review data.
Original Data Source: Preprocessed Dataset Sentiment Analysis
MR Movie Reviews is a dataset for use in sentiment-analysis experiments. Available are collections of movie-review documents labeled with respect to their overall sentiment polarity (positive or negative) or subjective rating (e.g., "two and a half stars") and sentences labeled with respect to their subjectivity status (subjective or objective) or polarity.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The peer-reviewed paper of AWARE dataset is published in ASEW 2021, and can be accessed through: http://doi.org/10.1109/ASEW52652.2021.00049. Kindly cite this paper when using AWARE dataset.
Aspect-Based Sentiment Analysis (ABSA) aims to identify the opinion (sentiment) with respect to a specific aspect. Since there is a lack of smartphone apps reviews dataset that is annotated to support the ABSA task, we present AWARE: ABSA Warehouse of Apps REviews.
AWARE contains apps reviews from three different domains (Productivity, Social Networking, and Games), as each domain has its distinct functionalities and audience. Each sentence is annotated with three labels, as follows:
Aspect Term: a term that exists in the sentence and describes an aspect of the app that is expressed by the sentiment. A term value of “N/A” means that the term is not explicitly mentioned in the sentence.
Aspect Category: one of the pre-defined set of domain-specific categories that represent an aspect of the app (e.g., security, usability, etc.).
Sentiment: positive or negative.
Note: games domain does not contain aspect terms.
We provide a comprehensive dataset of 11323 sentences from the three domains, where each sentence is additionally annotated with a Boolean value indicating whether the sentence expresses a positive/negative opinion. In addition, we provide three separate datasets, one for each domain, containing only sentences that express opinions. The file named “AWARE_metadata.csv” contains a description of the dataset’s columns.
How AWARE can be used?
We designed AWARE such that it can be used to serve various tasks. The tasks can be, but are not limited to:
Sentiment Analysis.
Aspect Term Extraction.
Aspect Category Classification.
Aspect Sentiment Analysis.
Explicit/Implicit Aspect Term Classification.
Opinion/Not-Opinion Classification.
Furthermore, researchers can experiment with and investigate the effects of different domains on users' feedback.
"This dataset includes consumer-submitted reviews from over 160 industries, covering both product- and service-based businesses. It’s built to support CX, AI, and analytics teams seeking structured insight into what real customers say, feel, and expect — across sectors like finance, healthcare, travel, telecom, retail, and more.
Each review includes:
The list may vary based on the industry and can be customized as per your request.
Use this dataset to:
This dataset offers flexibility for custom delivery-by industry, domain, or company, making it ideal for teams needing scalable consumer voice data tailored to specific strategic goals."
https://crawlfeeds.com/privacy_policyhttps://crawlfeeds.com/privacy_policy
This dataset offers a focused and invaluable window into user perceptions and experiences with applications listed on the Apple App Store. It is a vital resource for app developers, product managers, market analysts, and anyone seeking to understand the direct voice of the customer in the dynamic mobile app ecosystem.
Dataset Specifications:
Last crawled:
(This field is blank in your provided info, which means its recency is currently unknown. If this were a real product, specifying this would be critical for its value proposition.)Richness of Detail (11 Comprehensive Fields):
Each record in this dataset provides a detailed breakdown of a single App Store review, enabling multi-dimensional analysis:
Review Content:
review
: The full text of the user's written feedback, crucial for Natural Language Processing (NLP) to extract themes, sentiment, and common keywords.title
: The title given to the review by the user, often summarizing their main point.isEdited
: A boolean flag indicating whether the review has been edited by the user since its initial submission. This can be important for tracking evolving sentiment or understanding user behavior.Reviewer & Rating Information:
username
: The public username of the reviewer, allowing for analysis of engagement patterns from specific users (though not personally identifiable).rating
: The star rating (typically 1-5) given by the user, providing a quantifiable measure of satisfaction.App & Origin Context:
app_name
: The name of the application being reviewed.app_id
: A unique identifier for the application within the App Store, enabling direct linking to app details or other datasets.country
: The country of the App Store storefront where the review was left, allowing for geographic segmentation of feedback.Metadata & Timestamps:
_id
: A unique identifier for the specific review record in the dataset.crawled_at
: The timestamp indicating when this particular review record was collected by the data provider (Crawl Feeds).date
: The original date the review was posted by the user on the App Store.Expanded Use Cases & Analytical Applications:
This dataset is a goldmine for understanding what users truly think and feel about mobile applications. Here's how it can be leveraged:
Product Development & Improvement:
review
text to identify recurring technical issues, crashes, or bugs, allowing developers to prioritize fixes based on user impact.review
text to inform future product roadmap decisions and develop features users actively desire.review
field.rating
and sentiment
after new app updates to assess the effectiveness of bug fixes or new features.Market Research & Competitive Intelligence:
Marketing & App Store Optimization (ASO):
review
and title
fields to gauge overall user satisfaction, pinpoint specific positive and negative aspects, and track sentiment shifts over time.rating
trends and identify critical reviews quickly to facilitate timely responses and proactive customer engagement.Academic & Data Science Research:
review
and title
fields are excellent for training and testing NLP models for sentiment analysis, topic modeling, named entity recognition, and text summarization.rating
distribution, isEdited
status, and date
to understand user engagement and feedback cycles.country
-specific reviews to understand regional differences in app perception, feature preferences, or cultural nuances in feedback.This App Store Reviews dataset provides a direct, unfiltered conduit to understanding user needs and ultimately driving better app performance and greater user satisfaction. Its structured format and granular detail make it an indispensable asset for data-driven decision-making in the mobile app industry.
The IMDb Movie Reviews dataset is a binary sentiment analysis dataset consisting of 50,000 reviews from the Internet Movie Database (IMDb) labeled as positive or negative. The dataset contains an even number of positive and negative reviews. Only highly polarizing reviews are considered. A negative review has a score ≤ 4 out of 10, and a positive review has a score ≥ 7 out of 10. No more than 30 reviews are included per movie. The dataset contains additional unlabeled data.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
BDFoodReview is a large-scale dataset containing 334,119 restaurant reviews collected from "Foodpanda Bangladesh". The dataset includes customer reviews in mixed languages (Bangla, English, and Banglish), translated into English, along with their corresponding ratings and sentiment labels.
Dataset Statistics Total Reviews: 334,119 Features/Columns: 19
Potential Applications Sentiment Analysis Restaurant Review Classification Customer Satisfaction Analysis Opinion Mining Natural Language Processing Research Food Service Industry Analysis
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
By imdb (From Huggingface) [source]
The IMDb Large Movie Review Dataset is a comprehensive collection of movie reviews used for sentiment classification. The dataset includes a wide range of movie reviews along with their corresponding sentiment labels, which indicate whether the review is positive or negative in nature. This invaluable dataset is aimed at facilitating sentiment analysis and classification tasks in the field of natural language processing.
The main purpose of the train.csv file within this dataset is to provide a curated collection of movie reviews, each accompanied by its respective sentiment label. This file proves particularly useful for training machine learning models to accurately predict sentiment and classify reviews based on their emotional tone.
Similarly, the test.csv file contains another set of movie reviews along with corresponding sentiment labels. Meant for testing and validating the performance of trained models, this dataset enables researchers and developers to evaluate their models' effectiveness in real-world scenarios.
Additionally, the unsupervised.csv file offers an alternative subset within the dataset. Unlike train.csv and test.csv, unsupervised.csv does not include any associated sentiment labels for individual movie reviews. This specific subset serves as a valuable resource for exploring unsupervised learning techniques within the domain of sentiment classification.
By utilizing this meticulously compiled IMDb Large Movie Review Dataset, researchers and data scientists can delve into various aspects related to analyzing sentiments in textual data. With its carefully labeled data points covering both positive and negative sentiments expressed in diverse film critiques, this dataset empowers users to develop sophisticated machine learning algorithms that accurately assess subjective opinions from text data
Introduction:
Dataset Overview: - Train.csv: This file contains a set of movie reviews along with their sentiment labels. It is intended for training your sentiment analysis models. - Test.csv: This file provides another set of movie reviews along with their corresponding sentiment labels. You can use this file to evaluate the performance of your trained models. - Unsupervised.csv: This file includes movie reviews without any associated sentiment labels. It can be used for unsupervised sentiment classification tasks.
Columns in the Dataset: - text: The main column containing the text of each movie review. - label: The sentiment label assigned to each review, indicating whether it is positive or negative.
Guidelines for Using the Dataset:
Training Your Model:
- Begin by loading and preprocessing the data from train.csv
- Treat 'text' as your input feature and 'label' as your target variable
- Explore different machine learning or deep learning algorithms suitable for text classification
- Train your model using various techniques, such as bag-of-words, word embeddings, or transformers
- Evaluate and fine-tune your model's performance using test.csv
Evaluating Your Model:
- Load test.csv and preprocess the data similar to what you did with train.csv
- Use this preprocessed test data to evaluate the accuracy, precision, recall, F1 score or other relevant metrics of your trained model on unseen data
- Analyze these metrics to understand how well your model is performing in predicting sentiments
Advancing Your Model (Unsupervised Classification):
- Utilize unsupervised.csv for unsupervised sentiment classification tasks
- Preprocess the movie reviews in this file and explore techniques like clustering, topic modeling, or self-supervised learning
- Extract patterns, themes, or sentiments from the reviews without any guidance from labeled data
Conclusion:
- Sentiment Analysis: This dataset can be used to train models for sentiment analysis, where the goal is to predict whether a movie review is positive or negative based on its text.
- NLP Research: The dataset can be used for various natural language processing (NLP) tasks such as text classification, information extraction, or named entity recognition. Researchers and practitioners can leverage this dataset to develop and evaluate new algorithms and techniques in the field of NLP.
- Recommendation Systems: The sentiment labels in this dataset can be used as a source of feedback or user preferences for recommendation systems. By analyzing the sentiments expressed in reviews,...
The inspiration behind creating the OYO Review Dataset for sentiment analysis was to explore the sentiment and opinions expressed in hotel reviews on the OYO Hotels platform. Analyzing the sentiment of customer reviews can provide valuable insights into the overall satisfaction of guests, identify areas for improvement, and assist in making data-driven decisions to enhance the hotel experience. By collecting and curating this dataset, Deep Patel, Nikki Patel, and Nimil aimed to contribute to the field of sentiment analysis in the context of the hospitality industry. Sentiment analysis allows us to classify the sentiment expressed in textual data, such as reviews, into positive, negative, or neutral categories. This analysis can help hotel management and stakeholders understand customer sentiments, identify common patterns, and address concerns or issues that may affect the reputation and customer satisfaction of OYO Hotels. The dataset provides a valuable resource for training and evaluating sentiment analysis models specifically tailored to the hospitality domain. Researchers, data scientists, and practitioners can utilize this dataset to develop and test various machine learning and natural language processing techniques for sentiment analysis, such as classification algorithms, sentiment lexicons, or deep learning models. Overall, the goal of creating the OYO Review Dataset for sentiment analysis was to facilitate research and analysis in the area of customer sentiments and opinions in the hotel industry. By understanding the sentiment of hotel reviews, businesses can strive to improve their services, enhance customer satisfaction, and make data-driven decisions to elevate the overall guest experience.
Deep Patel: https://www.linkedin.com/in/deep-patel-55ab48199/ Nikki Patel: https://www.linkedin.com/in/nikipatel9/ Nimil lathiya: https://www.linkedin.com/in/nimil-lathiya-059a281b1/
Source: https://www.cs.jhu.edu/~mdredze/datasets/sentiment/
Kaggle kernel take care of the tar.gz files for you :-)
This dataset features slightly older product reviews from Amazon and derives from the Johns Hopkins University’s Department of Computer Science.
unprocessed.tar.gz processed_acl.tar.gz processed_stars.tar.gz
John Blitzer, Mark Dredze, Fernando Pereira. Biographies, Bollywood, Boom-boxes and Blenders: Domain Adaptation for Sentiment Classification. Association of Computational Linguistics (ACL), 2007. [PDF]
John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jenn Wortman. Learning Bounds for Domain Adaptation. Neural Information Processing Systems (NIPS), 2008. [PDF]
Mark Dredze, Koby Crammer, and Fernando Pereira. Confidence-Weighted Linear Classification. International Conference on Machine Learning (ICML), 2008. [PDF]
Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Domain Adaptation with Multiple Sources. Neural Information Processing Systems (NIPS), 2009.
If you use this data for your research or a publication, please cite the first (ACL 2007) paper as the reference for the data. Also, please drop me a line so I know that you found the data useful.
The Multi-Domain Sentiment Dataset contains product reviews taken from Amazon.com from many product types (domains). Some domains (books and dvds) have hundreds of thousands of reviews. Others (musical instruments) have only a few hundred. Reviews contain star ratings (1 to 5 stars) that can be converted into binary labels if needed. This page contains some descriptions about the data. If you have questions, please email Mark Dredze or John Blitzer.
1) unprocessed.tar.gz contains the original data. 2) processed.acl.tar.gz contains the data pre-processed and balanced. That is, the format of Blitzer et al. (ACL 2007) 3) processed.realvalued.tar.gz contains the data pre-processed and balanced, but with the number of stars, rather than just positive or negative. That is, the format of Mansour et al. (NIPS 2009)
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset: Amazon Customer Review Data for sentiment analysis
Size: 60889 appox.
Format: .CSV
Period: 2013 to 2019
Categories: 5…… (Mobiles, Smart TV, Books, Mobile Accessories, Refrigerator)
Unique_ID: Customized (Primary Key)
Review_Header: user’s comment in few words
Review_Text: User’s comment in details (3-4 lines)
Rating: (1- Very Low, 2 🡪 Low, 3🡪 Avg, 4 🡪 Good, 5 - Excellent)
Posting Period: 2013 to 2019
Own_Rating: for 1-2 🡪 Negative, 3🡪 Neutral, 4-5 🡪 Positive
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository was created for my Master's thesis in Computational Intelligence and Internet of Things at the University of Córdoba, Spain. The purpose of this repository is to store the datasets found that were used in some of the studies that served as research material for this Master's thesis. Also, the datasets used in the experimental part of this work are included.
Below are the datasets specified, along with the details of their references, authors, and download sources.
----------- STS-Gold Dataset ----------------
The dataset consists of 2026 tweets. The file consists of 3 columns: id, polarity, and tweet. The three columns denote the unique id, polarity index of the text and the tweet text respectively.
Reference: Saif, H., Fernandez, M., He, Y., & Alani, H. (2013). Evaluation datasets for Twitter sentiment analysis: a survey and a new dataset, the STS-Gold.
File name: sts_gold_tweet.csv
----------- Amazon Sales Dataset ----------------
This dataset is having the data of 1K+ Amazon Product's Ratings and Reviews as per their details listed on the official website of Amazon. The data was scraped in the month of January 2023 from the Official Website of Amazon.
Owner: Karkavelraja J., Postgraduate student at Puducherry Technological University (Puducherry, Puducherry, India)
Features:
License: CC BY-NC-SA 4.0
File name: amazon.csv
----------- Rotten Tomatoes Reviews Dataset ----------------
This rating inference dataset is a sentiment classification dataset, containing 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews. On average, these reviews consist of 21 words. The first 5331 rows contains only negative samples and the last 5331 rows contain only positive samples, thus the data should be shuffled before usage.
This data is collected from https://www.cs.cornell.edu/people/pabo/movie-review-data/ as a txt file and converted into a csv file. The file consists of 2 columns: reviews and labels (1 for fresh (good) and 0 for rotten (bad)).
Reference: Bo Pang and Lillian Lee. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 115–124, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics
File name: data_rt.csv
----------- Preprocessed Dataset Sentiment Analysis ----------------
Preprocessed amazon product review data of Gen3EcoDot (Alexa) scrapped entirely from amazon.in
Stemmed and lemmatized using nltk.
Sentiment labels are generated using TextBlob polarity scores.
The file consists of 4 columns: index, review (stemmed and lemmatized review using nltk), polarity (score) and division (categorical label generated using polarity score).
DOI: 10.34740/kaggle/dsv/3877817
Citation: @misc{pradeesh arumadi_2022, title={Preprocessed Dataset Sentiment Analysis}, url={https://www.kaggle.com/dsv/3877817}, DOI={10.34740/KAGGLE/DSV/3877817}, publisher={Kaggle}, author={Pradeesh Arumadi}, year={2022} }
This dataset was used in the experimental phase of my research.
File name: EcoPreprocessed.csv
----------- Amazon Earphones Reviews ----------------
This dataset consists of a 9930 Amazon reviews, star ratings, for 10 latest (as of mid-2019) bluetooth earphone devices for learning how to train Machine for sentiment analysis.
This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.
The file consists of 5 columns: ReviewTitle, ReviewBody, ReviewStar, Product and division (manually added - categorical label generated using ReviewStar score)
License: U.S. Government Works
Source: www.amazon.in
File name (original): AllProductReviews.csv (contains 14337 reviews)
File name (edited - used for my research) : AllProductReviews2.csv (contains 9930 reviews)
----------- Amazon Musical Instruments Reviews ----------------
This dataset contains 7137 comments/reviews of different musical instruments coming from Amazon.
This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.
The file consists of 10 columns: reviewerID, asin (ID of the product), reviewerName, helpful (helpfulness rating of the review), reviewText, overall (rating of the product), summary (summary of the review), unixReviewTime (time of the review - unix time), reviewTime (time of the review (raw) and division (manually added - categorical label generated using overall score).
Source: http://jmcauley.ucsd.edu/data/amazon/
File name (original): Musical_instruments_reviews.csv (contains 10261 reviews)
File name (edited - used for my research) : Musical_instruments_reviews2.csv (contains 7137 reviews)