Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository was created for my Master's thesis in Computational Intelligence and Internet of Things at the University of Córdoba, Spain. It stores the datasets used in some of the studies that served as research material for the thesis, as well as the datasets used in the experimental part of the work.
Below are the datasets specified, along with the details of their references, authors, and download sources.
----------- STS-Gold Dataset ----------------
The dataset consists of 2026 tweets. The file consists of 3 columns: id, polarity, and tweet. The three columns denote the unique id, polarity index of the text and the tweet text respectively.
Reference: Saif, H., Fernandez, M., He, Y., & Alani, H. (2013). Evaluation datasets for Twitter sentiment analysis: a survey and a new dataset, the STS-Gold.
File name: sts_gold_tweet.csv
----------- Amazon Sales Dataset ----------------
This dataset contains ratings and reviews for 1K+ Amazon products, as listed on the official Amazon website. The data was scraped from the official Amazon website in January 2023.
Owner: Karkavelraja J., Postgraduate student at Puducherry Technological University (Puducherry, Puducherry, India)
Features:
License: CC BY-NC-SA 4.0
File name: amazon.csv
----------- Rotten Tomatoes Reviews Dataset ----------------
This rating inference dataset is a sentiment classification dataset containing 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews. On average, these reviews consist of 21 words. The first 5,331 rows contain only negative samples and the last 5,331 rows contain only positive samples, so the data should be shuffled before use.
This data is collected from https://www.cs.cornell.edu/people/pabo/movie-review-data/ as a txt file and converted into a csv file. The file consists of 2 columns: reviews and labels (1 for fresh (good) and 0 for rotten (bad)).
Reference: Bo Pang and Lillian Lee. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 115–124, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics
File name: data_rt.csv
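Because the file is ordered by class, any split taken from the unshuffled rows could contain a single label. The following is a minimal sketch of shuffling the rows with the Python standard library. The header names `reviews` and `labels` follow the column description above, but their exact spelling in data_rt.csv is an assumption, and the in-memory sample only stands in for the real file.

```python
import csv
import io
import random

def read_shuffled(csv_file, seed=0):
    """Read all rows from an open CSV file and return them in random order."""
    rows = list(csv.DictReader(csv_file))
    # A fixed seed keeps the shuffle reproducible between runs.
    random.Random(seed).shuffle(rows)
    return rows

# Tiny stand-in for data_rt.csv: negatives (0) first, then positives (1).
sample = io.StringIO(
    "reviews,labels\n"
    "dull and lifeless,0\n"
    "a tedious mess,0\n"
    "warm and funny,1\n"
    "a quiet triumph,1\n"
)
rows = read_shuffled(sample)
```

For the real file, an open handle such as `open("data_rt.csv", newline="", encoding="utf-8")` could be passed instead of the sample (the encoding is an assumption).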
----------- Preprocessed Dataset Sentiment Analysis ----------------
Preprocessed Amazon product review data for the Gen3EcoDot (Alexa), scraped entirely from amazon.in
Stemmed and lemmatized using nltk.
Sentiment labels are generated using TextBlob polarity scores.
The file consists of 4 columns: index, review (stemmed and lemmatized review using nltk), polarity (score) and division (categorical label generated using polarity score).
DOI: 10.34740/kaggle/dsv/3877817
Citation: @misc{pradeesh arumadi_2022, title={Preprocessed Dataset Sentiment Analysis}, url={https://www.kaggle.com/dsv/3877817}, DOI={10.34740/KAGGLE/DSV/3877817}, publisher={Kaggle}, author={Pradeesh Arumadi}, year={2022} }
This dataset was used in the experimental phase of my research.
File name: EcoPreprocessed.csv
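The description above does not state which cut-offs were used to turn the polarity score into the division label. As a hedged illustration only, the sketch below assumes the label follows the sign of the score, with an optional neutral band around zero. TextBlob polarity is a float between -1.0 and 1.0; the function name, the label strings and the `neutral_band` parameter are hypothetical.

```python
def polarity_to_division(polarity, neutral_band=0.0):
    """Map a polarity score in [-1.0, 1.0] to a categorical label.

    The thresholds are assumptions; the dataset's real rule is not documented here.
    """
    if polarity > neutral_band:
        return "positive"
    if polarity < -neutral_band:
        return "negative"
    return "neutral"

# With TextBlob installed, a score could be obtained as
# TextBlob(text).sentiment.polarity; plain floats are used here instead.
labels = [polarity_to_division(score) for score in (0.5, -0.3, 0.0)]
```

Widening `neutral_band` (for example to 0.1) would move weakly polar reviews into the neutral class.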
----------- Amazon Earphones Reviews ----------------
This dataset consists of 9930 Amazon reviews and star ratings for the 10 latest (as of mid-2019) Bluetooth earphone devices, intended for learning how to train a machine for sentiment analysis.
This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.
The file consists of 5 columns: ReviewTitle, ReviewBody, ReviewStar, Product and division (manually added - categorical label generated using ReviewStar score)
License: U.S. Government Works
Source: www.amazon.in
File name (original): AllProductReviews.csv (contains 14337 reviews)
File name (edited - used for my research) : AllProductReviews2.csv (contains 9930 reviews)
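The rule used to derive division from ReviewStar is not given above. The sketch below shows one common convention (1-2 stars negative, 3 neutral, 4-5 positive) purely as an assumption; the function names and the sample rows are hypothetical, and only the column names ReviewTitle, ReviewStar and division come from the description.

```python
def star_to_division(stars):
    """Map a 1-5 star rating to a coarse sentiment label (assumed cut-offs)."""
    if stars >= 4:
        return "positive"
    if stars <= 2:
        return "negative"
    return "neutral"

def add_division(rows):
    """Return new row dicts with a 'division' key derived from 'ReviewStar'."""
    return [
        {**row, "division": star_to_division(int(row["ReviewStar"]))}
        for row in rows
    ]

# Hypothetical rows shaped like the CSV (values are strings, as csv.DictReader yields).
reviews = [
    {"ReviewTitle": "Great sound", "ReviewStar": "5"},
    {"ReviewTitle": "Stopped working", "ReviewStar": "1"},
    {"ReviewTitle": "Average", "ReviewStar": "3"},
]
labelled = add_division(reviews)
```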
----------- Amazon Musical Instruments Reviews ----------------
This dataset contains 7137 comments/reviews of different musical instruments coming from Amazon.
This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.
The file consists of 10 columns: reviewerID, asin (ID of the product), reviewerName, helpful (helpfulness rating of the review), reviewText, overall (rating of the product), summary (summary of the review), unixReviewTime (time of the review - unix time), reviewTime (time of the review - raw) and division (manually added - categorical label generated using overall score).
Source: http://jmcauley.ucsd.edu/data/amazon/
File name (original): Musical_instruments_reviews.csv (contains 10261 reviews)
File name (edited - used for my research) : Musical_instruments_reviews2.csv (contains 7137 reviews)
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Dataset Card for amazon reviews for sentiment analysis
Dataset Summary
One of the most important problems in e-commerce is the correct calculation of the points given to after-sales products. The solution to this problem is to provide greater customer satisfaction for the e-commerce site, product prominence for sellers, and a seamless shopping experience for buyers. Another problem is the correct ordering of the comments given to the products. The prominence of misleading… See the full description on the dataset page: https://huggingface.co/datasets/hugginglearners/amazon-reviews-sentiment-analysis.
Product Review Datasets: Uncover user sentiment
Harness the power of Product Review Datasets to understand user sentiment and insights deeply. These datasets are designed to elevate your brand and product feature analysis, help you evaluate your competitive stance, and assess investment risks.
Data sources:
Leave the data collection challenges to us and dive straight into market insights with clean, structured, and actionable data, including:
Choose from multiple data delivery options to suit your needs:
Why choose Oxylabs?
Fresh and accurate data: Access organized, structured, and comprehensive data collected by our leading web scraping professionals.
Time and resource savings: Concentrate on your core business goals while we efficiently handle the data extraction process at an affordable cost.
Adaptable solutions: Share your specific data requirements, and we'll craft a customized data collection approach to meet your objectives.
Legal compliance: Partner with a trusted leader in ethical data collection. Oxylabs is a founding member of the Ethical Web Data Collection Initiative, aligning with GDPR and CCPA standards.
Pricing Options:
Standard Datasets: choose from various ready-to-use datasets with standardized data schemas, priced from $1,000/month.
Custom Datasets: Tailor datasets from any public web domain to your unique business needs. Contact our sales team for custom pricing.
Experience a seamless journey with Oxylabs:
Join the ranks of satisfied customers who appreciate our meticulous attention to detail and personalized support. Experience the power of Product Review Datasets today to uncover valuable insights and enhance decision-making.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset contains over 4,900 customer reviews from Amazon, including text-based feedback, star ratings, and helpfulness votes.
It can be used for:
- reviewText: Full written review
- overall: Star rating (1 to 5)
- summary: Short summary of the review
- helpful_yes: Number of users who found the review helpful
- total_vote: Total votes on helpfulness
- day_diff: Days since the review was written
This dataset is suitable for natural language processing (NLP) and supervised learning tasks.
This is a publicly available dataset for educational and research use.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains customer reviews for various products, including details about product categories, brands, user ratings, and sentiment analysis. It is designed for applications such as sentiment classification, product recommendation systems, and the analysis of consumer behaviour. The dataset allows users to identify trends in customer satisfaction and gain insights into consumer preferences based on brand and category.
The data file is typically available in CSV format. The dataset comprises approximately 14,221 records. Analysis of the sentiment distribution within the dataset indicates that 84% of reviews are classified as positive, while 16% are classified as negative.
This dataset is ideally suited for several applications, including:
* Performing sentiment analysis on product reviews to gauge public opinion.
* Identifying patterns and trends in customer satisfaction over time.
* Developing and improving product recommendation systems.
* Understanding consumer preferences based on specific brands and product categories.
The dataset covers a time range from 30th July 2009 to 25th July 2017. The data has a global regional scope. No specific demographic scope is detailed within the available information.
CC0
This dataset is valuable for a range of users and their specific applications:
* Data Scientists and Machine Learning Engineers: To train and evaluate sentiment analysis models, develop natural language processing (NLP) applications, and build recommendation engines.
* Marketing Professionals: To understand customer feedback, identify popular products, and assess the impact of marketing campaigns on brand perception.
* Businesses and Product Managers: To inform product development strategies, monitor customer satisfaction, and identify areas for improvement based on consumer feedback.
* Researchers: For academic studies on consumer behaviour, sentiment analysis techniques, and market trends.
Original Data Source: 🏬🛍️😀 Consumer Sentiments and Ratings
"This dataset includes consumer-submitted reviews from over 160 industries, covering both product- and service-based businesses. It’s built to support CX, AI, and analytics teams seeking structured insight into what real customers say, feel, and expect — across sectors like finance, healthcare, travel, telecom, retail, and more.
Each review includes:
The list may vary based on the industry and can be customized as per your request.
Use this dataset to:
This dataset offers flexibility for custom delivery by industry, domain, or company, making it ideal for teams needing scalable consumer voice data tailored to specific strategic goals."
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains beauty product reviews as text in Bahasa Indonesia. Each text in the dataset has been categorized into Price, Packaging, Product, and Aroma. Each category has also been classified as Positive, Neutral, or Negative.
https://crawlfeeds.com/privacy_policy
Unlock detailed insights with our Amazon UK Shoes Products Reviews Dataset, an invaluable resource for businesses, researchers, and data analysts. This dataset features comprehensive information, including product names, review texts, star ratings, and customer feedback for a wide range of shoe products available on Amazon UK.
Whether you're delving into customer behavior, conducting market research, or improving product offerings, this dataset empowers you to make informed decisions. By working with a dataset enriched with real-world feedback, you can:
Explore related datasets like the Amazon product review dataset, offering insights across various categories and regions. For specific needs, our curated product reviews dataset is tailored to help you gain a granular understanding of niche markets.
https://creativecommons.org/publicdomain/zero/1.0/
By Huggingface Hub [source]
The Amazon Reviews Polarity Dataset discloses eighteen years of customers' ratings and reviews from Amazon.com, offering an unparalleled trove of insight and knowledge. Drawing from the immense pool of over 35 million customer reviews, this dataset presents a broad spectrum of customer opinions on products they have bought or used. This invaluable data is a gold mine for improving products and services as it contains comprehensive information regarding customers' experiences with a product including ratings, titles, and plaintext content. At the same time, this dataset contains both customer-specific data along with product information which encourages deep analytics that could lead to great advances in providing tailored solutions for customers. Has your product been favored by the majority? Are there any aspects that need extra care? Use Amazon Reviews Polarity to gain deeper insights into what your customers want - explore now!
1. Analyze customer ratings to identify trends: Take a look at how many customers have rated the same product or service with the same score (e.g., 4 stars). You can use this information to identify what customers like or don't like about it by examining common sentiment throughout the reviews. Identifying these patterns can help you decide which features of your products or services to emphasize in order to boost sales and satisfaction rates.
2. Review content analysis: Analyzing review content is one of the best ways to gauge customer sentiment toward specific features or aspects of a product/service. Natural language processing tools such as Word2Vec, Latent Dirichlet Allocation (LDA), or even simple keyword search algorithms can quickly reveal general topics that are discussed in relation to your product/service across multiple reviews, allowing you to quickly pinpoint areas that may need improvement for particular items within your lines of business.
3. Track associated scores over time: By tracking customer ratings over time, you may be able to better understand when there has been an issue with something specific related to your product/service, such as a negative response toward a feature that was introduced, proved unpopular among customers, and was removed shortly afterwards. This can save time and money by identifying issues before they become widespread concerns among the larger set of consumers who pay for your company's item(s).
4. Visualize sentiment data over time: Visualizations such as bar graphs can help identify trends across different categories more quickly than raw numbers alone. Combining numeric values with color differences between scores makes anomalies easier to spot, allowing faster resolution when trying to figure out why certain spikes occurred where others stayed stable (or vice versa) when comparing similar data points in time-series visualizations.
- Developing a customer sentiment analysis system that can be used to quickly analyze the sentiment of reviews and identify any potential areas of improvement.
- Building a product recommendation service that takes into account the ratings and reviews of customers when recommending similar products they may be interested in purchasing.
- Training a machine learning model to accurately predict customers’ ratings on new products they have not yet tried and leverage this for further product development optimization initiatives
If you use this dataset in your research, please credit the original authors.
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: train.csv
| Column name | Description |
|:--------------|:-------------------------------------------------------------------|
| label | The sentiment of the review, either positive or negative. (String) |
| title | The title of the review. (String) |
...
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset is a collection of Indonesian product review data, meticulously annotated with emotion and sentiment labels. It was gathered from Tokopedia, a prominent e-commerce platform in Indonesia, encompassing product reviews from 29 distinct product categories. Each review is assigned a single emotion label, such as love, happiness, anger, fear, or sadness. The emotion annotation process was conducted by a group of annotators who followed specific criteria established by an expert in clinical psychology. The dataset also includes other valuable attributes related to the product reviews, including location, price, overall rating, number sold, total reviews, and customer rating, designed to facilitate further research. The data is considered clean.
While a specific original data sample is not available to list all columns in detail, based on the dataset description, the following attributes are included:
* Product Review Text: The original review content.
* Emotion Label: Categorical label indicating the primary emotion (e.g., love, happiness, anger, fear, sadness).
* Sentiment Label: Overall sentiment associated with the review.
* Location: Geographic information related to the review or product.
* Price: The price of the product reviewed.
* Overall Rating: The product's general rating.
* Number Sold: The quantity of the product sold.
* Total Review: The total number of reviews for the product.
* Customer Rating: The rating provided by the customer for the specific product.
The dataset is typically provided in a CSV file format. It contains product reviews from 29 different product categories. Specific figures for the total number of rows or records are not detailed in the provided information.
This dataset is ideally suited for various applications and research endeavours, including:
* Learning: Excellent for educational purposes in data science, natural language processing, and text analytics.
* Research: Supports in-depth studies in natural language processing (NLP), text processing, consumer emotion analysis, text mining, and sentiment analysis.
* Model Training: Can be used for training machine learning models, including large language models (LLMs), for tasks such as emotion classification, sentiment analysis, and text understanding in Indonesian.
* Application Development: Useful for developing applications that require understanding consumer feedback and emotions from product reviews.
The dataset's geographic scope is focused on Indonesia, specifically product reviews from an Indonesian e-commerce platform, Tokopedia, written in the Indonesian language. The listed date for the dataset on the platform is 08/06/2025; however, the actual time range during which the data was collected for the reviews themselves is not specified in the sources. There are no specific notes on data availability for certain demographic groups or years beyond general product review consumers in Indonesia.
CC0
This dataset is beneficial for a wide range of users, including:
* Academics and Researchers: For exploring topics in NLP, sentiment analysis, and consumer behaviour.
* Students: As a practical resource for learning about text data processing, emotion classification, and data analysis.
* Data Scientists and Machine Learning Engineers: For building and fine-tuning models capable of understanding and classifying emotions and sentiments from textual data.
* Businesses: Potentially for market research and understanding customer feedback trends, particularly within the Indonesian e-commerce sector.
Original Data Source: PRDECT-ID: Indonesian Emotion Classification
Amazon Product Review Dataset (2023)
Dataset Overview
The Amazon Product Review Dataset (2023) contains product reviews from Amazon customers. The dataset includes product information, review details, and metadata about the customers who left the reviews. This dataset can be used for various natural language processing (NLP) tasks, including sentiment analysis, review prediction, recommendation systems, and more.
Dataset Name: Amazon Product Review Dataset (2023) Dataset… See the full description on the dataset page: https://huggingface.co/datasets/kevykibbz/Consumer_goods_reviews.
-> If you use Turkish_Product_Reviews_by_Gozukara_and_Ozel_2016 dataset please cite: https://dergipark.org.tr/en/pub/cukurovaummfd/issue/28708/310341
@research article { cukurovaummfd310341, journal = {Çukurova Üniversitesi Mühendislik-Mimarlık Fakültesi Dergisi}, issn = {1019-1011}, eissn = {2564-7520}, address = {Çukurova Üniversitesi Mühendislik-Mimarlık Fakültesi Dergisi Yayın Kurulu Başkanlığı 01330 ADANA}, publisher = {Cukurova University}, year = {2016}, volume = {31}, pages = {464 - 482}, doi = {10.21605/cukurovaummfd.310341}, title = {Türkçe ve İngilizce Yorumların Duygu Analizinde Doküman Vektörü Hesaplama Yöntemleri için Bir Deneysel İnceleme}, key = {cite}, author = {Gözükara, Furkan and Özel, Selma Ayşe} }
https://doi.org/10.21605/cukurovaummfd.310341
-> Turkish_Product_Reviews_by_Gozukara_and_Ozel_2016 dataset is composed as below: ->-> The top 50 e-commerce sites in Turkey were crawled and their comments were extracted. Then 2000 comments were randomly selected and manually labelled by a field expert. ->-> After the manual labelling of the selected comments was done, 600 negative and 600 positive comments were left. ->-> This dataset contains these comments.
-> English_Movie_Reviews_by_Pang_and_Lee_2004 ->-> Pang, B., Lee, L., 2004. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts, In Proceedings of the 42nd annual meeting on Association for Computational Linguistics (p. 271). ->-> Source: https://www.cs.cornell.edu/people/pabo/movie-review-data/ | polarity dataset v2.0 - review_polarity.tar.gz
-> English_Movie_Reviews_Sentences_by_Pang_and_Lee_2005 ->-> Pang, B., Lee, L., 2005. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales, In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics (pp. 115-124), Association for Computational Linguistics ->-> Source: https://www.cs.cornell.edu/people/pabo/movie-review-data/ | sentence polarity dataset v1.0 - rt-polaritydata.tar.gz
-> English_Product_Reviews_by_Blitzer_et_al_2007 ->-> Article of the dataset: Blitzer, J., Dredze, M., Pereira, F., 2007. Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification, In ACL (Vol. 7, pp. 440-447). ->-> Source: http://www.cs.jhu.edu/~mdredze/datasets/sentiment/ | processed_acl.tar.gz
-> Turkish_Movie_Reviews_by_Demirtas_and_Pechenizkiy_2013 ->-> Demirtas, E., Pechenizkiy, M., 2013. Cross-lingual polarity detection with machine translation, In Proceedings of the Second International Workshop on Issues of Sentiment Discovery and Opinion Mining (p. 9). ACM. ->-> http://www.win.tue.nl/~mpechen/projects/smm/#Datasets Turkish_Movie_Sentiment.zip
-> The dataset files are provided as used in the article. -> Weka files are generated with the raw frequency of terms rather than the weighting schemes used in the article.
-> The folder Cross_Validation contains the files for each fold of the 10-fold cross-validation.
-> Inside the Cross_Validation folder, each turn of the cross-validation is named test_X, where X is the turn number.
-> Inside each test_X folder:
* Test_Set_Negative_RAW: Contains raw negative class Test data of that cross-validation turn
* Test_Set_Negative_Processed: Contains pre-processed negative class Test data of that cross-validation turn
* Test_Set_Positive_RAW: Contains raw positive class Test data of that cross-validation turn
* Test_Set_Positive_Processed: Contains pre-processed positive class Test data of that cross-validation turn
* Train_Set_Negative_RAW: Contains raw negative class Train data of that cross-validation turn
* Train_Set_Negative_Processed: Contains pre-processed negative class Train data of that cross-validation turn
* Train_Set_Positive_RAW: Contains raw positive class Train data of that cross-validation turn
* Train_Set_Positive_Processed: Contains pre-processed positive class Train data of that cross-validation turn
* Train_Set_For_Weka: Contains processed Train set formatted for Weka
* Test_Set_For_Weka: Contains processed Test set formatted for Weka
-> The folder Entire_Dataset contains the files for the entire dataset:
* Negative_Processed: Contains all negative comments processed data
* Positive_Processed: Contains all positive comments processed data
* Negative_RAW: Contains all negative comments RAW data
* Positive_RAW: Contains all positive comments RAW data
* Entire_Dataset_WEKA: Contains all documents processed data in WEKA format
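The folder layout described above can be walked programmatically. The sketch below builds the expected paths for one cross-validation turn from the naming pattern in the description. The root folder name, the assumption that turn numbers run from 1 to 10, and the absence of file extensions are all guesses that should be checked against the downloaded archive.

```python
from pathlib import Path

SPLITS = ("Train", "Test")
CLASSES = ("Negative", "Positive")

def fold_paths(root, fold, variant="Processed"):
    """Return {(split, class): path} for one turn, e.g. test_3.

    variant is "Processed" or "RAW", matching the names listed above.
    """
    fold_dir = Path(root) / "Cross_Validation" / f"test_{fold}"
    return {
        (split, cls): fold_dir / f"{split}_Set_{cls}_{variant}"
        for split in SPLITS
        for cls in CLASSES
    }

# "dataset_root" is a hypothetical location for the extracted dataset.
paths = fold_paths("dataset_root", 3)
train_negative = paths[("Train", "Negative")]
```

Looping `fold` over the turn numbers would yield the train and test paths for a full 10-fold run.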
Use cases/applications possible with the data:
Customer feedback analysis: Analyzing customer feedback can help businesses keep customers happy and loyal to the brand, and identify areas to improve.
Social media monitoring: With sentiment analysis, companies can monitor what's being said about them on social media and use that to figure out how people feel about their products and services and track any new trends.
Market research: Sentiment analysis can be used to analyze market trends and consumer preferences, which can help companies make informed business decisions and develop effective marketing strategies.
Financial analysis: You can use sentiment analysis to determine what people say about the stock market through news and social media, which can help you make investing decisions.
For e-commerce (Amazon, Best Buy, Home Depot, and many more), the following data fields can be included: Title, Price, Vendor Name, Ratings, Reviews, Brand, ASIN, URL, sentiment analysis for each review, and other fields as per request.
The EPRSTMT dataset, also known as EPR-sentiment, is a binary sentiment analysis dataset based on product reviews on an e-commerce platform. Each sample in the dataset is labeled as either Positive or Negative. It was collected by the ICIP Lab of Beijing Normal University and has been re-organized to make it suitable for sentiment analysis tasks.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Product Reviews and Ratings (Sentiment Analysis)’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/mafaisal007/product-reviews-and-ratings-sentiment-analysis on 13 February 2022.
--- Dataset description provided by original source is as follows ---
This dataset is from a toy store in Europe and contains customer reviews about a particular product. It is to be used for text mining and sentiment analysis.
--- Original source retains full ownership of the source dataset ---
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The Amazon Food Products Dataset is a large-scale collection of product listings, reviews, and metadata sourced from Amazon. This dataset is valuable for understanding consumer behaviour, analyzing product trends, and training machine learning models for recommendation systems and sentiment analysis. It includes various categories, providing insights into customer preferences, product ratings, and review sentiments.
Each record in the dataset contains the following key fields:
This dataset is ideal for a variety of applications:
CC0
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overview: This collection contains three synthetic datasets produced by gpt-4o-mini for sentiment analysis and PDT (Product Desirability Toolkit) testing. Each dataset contains 1000 hypothetical software product reviews with the aim to produce a diversity of sentiment and text. The datasets were created as part of the research described in:
Hastings, J.D., Weitl-Harms, S., Doty, J., Myers, Z. L., and Thompson, W., “Utilizing Large Language Models to Synthesize Product Desirability Datasets,” in Proceedings of the 2024 IEEE International Conference on Big Data (BigData-24), Workshop on Large Language and Foundation Models (WLLFM-24), Dec. 2024. https://arxiv.org/abs/2411.13485.
Briefly, each row in the datasets was produced as follows:
1) Word+Review: The LLM selected a word and synthesized a review that would align with a random target sentiment.
2) Review+Word: The LLM produced a review to align with the target sentiment score, and then selected a word appropriate for the review.
3) Supply-Word: A word was supplied to the LLM which was then scored, and a review was produced to align with that score.
For sentiment analysis and PDT testing, the two columns of main interest across the datasets are likely 'Selected Word' and 'Hypothetical Review'.
License: This data is licensed under the CC Attribution 4.0 international license, and may be taken and used freely with credit given. Cite as:
Hastings, J., Weitl-Harms, S., Doty, J., Myers, Z., & Thompson, W. (2024). Synthetic Product Desirability Datasets for Sentiment Analysis Testing (1.0.0). Zenodo. https://doi.org/10.5281/zenodo.14188456
https://crawlfeeds.com/privacy_policy
This dataset offers a focused and invaluable window into user perceptions and experiences with applications listed on the Apple App Store. It is a vital resource for app developers, product managers, market analysts, and anyone seeking to understand the direct voice of the customer in the dynamic mobile app ecosystem.
Dataset Specifications:
Last crawled: not specified (the field is blank, so the recency of the data is unknown).
Richness of Detail (11 Comprehensive Fields):
Each record in this dataset provides a detailed breakdown of a single App Store review, enabling multi-dimensional analysis:
Review Content:
- review: The full text of the user's written feedback, crucial for Natural Language Processing (NLP) to extract themes, sentiment, and common keywords.
- title: The title given to the review by the user, often summarizing their main point.
- isEdited: A boolean flag indicating whether the review has been edited by the user since its initial submission. This can be important for tracking evolving sentiment or understanding user behavior.
Reviewer & Rating Information:
- username: The public username of the reviewer, allowing for analysis of engagement patterns from specific users (though not personally identifiable).
- rating: The star rating (typically 1-5) given by the user, providing a quantifiable measure of satisfaction.
App & Origin Context:
- app_name: The name of the application being reviewed.
- app_id: A unique identifier for the application within the App Store, enabling direct linking to app details or other datasets.
- country: The country of the App Store storefront where the review was left, allowing for geographic segmentation of feedback.
Metadata & Timestamps:
- _id: A unique identifier for the specific review record in the dataset.
- crawled_at: The timestamp indicating when this particular review record was collected by the data provider (Crawl Feeds).
- date: The original date the review was posted by the user on the App Store.
Expanded Use Cases & Analytical Applications:
This dataset is a goldmine for understanding what users truly think and feel about mobile applications. Here's how it can be leveraged:
Product Development & Improvement:
- Analyze review text to identify recurring technical issues, crashes, or bugs, allowing developers to prioritize fixes based on user impact.
- Analyze review text to inform future product roadmap decisions and develop features users actively desire.
- [...] review field.
- Track rating and sentiment after new app updates to assess the effectiveness of bug fixes or new features.
Market Research & Competitive Intelligence:
Marketing & App Store Optimization (ASO):
- Analyze the review and title fields to gauge overall user satisfaction, pinpoint specific positive and negative aspects, and track sentiment shifts over time.
- Monitor rating trends and identify critical reviews quickly to facilitate timely responses and proactive customer engagement.
Academic & Data Science Research:
- The review and title fields are excellent for training and testing NLP models for sentiment analysis, topic modeling, named entity recognition, and text summarization.
- Analyze rating distribution, isEdited status, and date to understand user engagement and feedback cycles.
- Compare country-specific reviews to understand regional differences in app perception, feature preferences, or cultural nuances in feedback.
This App Store Reviews dataset provides a direct, unfiltered conduit to understanding user needs and ultimately driving better app performance and greater user satisfaction. Its structured format and granular detail make it an indispensable asset for data-driven decision-making in the mobile app industry.
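As a rough illustration of the geographic segmentation mentioned for the country and rating fields, the sketch below averages star ratings per storefront country. It assumes each review has already been loaded as a dictionary keyed by the field names listed above; the function name and the sample records are hypothetical and are not taken from the dataset.

```python
from collections import defaultdict
from statistics import mean


def mean_rating_by_country(reviews):
    """Group review records by `country` and average their `rating` values."""
    ratings = defaultdict(list)
    for record in reviews:
        # Ratings may arrive as strings when read from CSV, so coerce to float.
        ratings[record["country"]].append(float(record["rating"]))
    return {country: mean(values) for country, values in ratings.items()}


# Hypothetical records for illustration only.
sample_reviews = [
    {"country": "us", "rating": 5},
    {"country": "us", "rating": 3},
    {"country": "de", "rating": 2},
]

print(mean_rating_by_country(sample_reviews))
```

The same grouping pattern could be keyed on app_name or on a date bucket instead of country, depending on the comparison of interest.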
"This dataset includes millions of consumer reviews tagged with emotion signals, making it ideal for training AI systems to detect how people feel — not just what they say. Built for sentiment-aware product development, CX strategy, and emotional behavior modeling, it offers deep insight into real consumer experience.
Features include:
- Labeled review sentiment (positive, neutral, negative)
- Retail product and service context (e.g., delivery, pricing, quality)
- Touchpoint mapping (pre-purchase, usage, return, support)
- Optional region, channel, and timestamp data
The list may vary based on the industry and can be customized as per your request.
This dataset enables:
- Training empathetic AI agents and emotion-detecting LLMs
- Mapping customer sentiment across retail segments or journey stages
- Identifying emotional drivers behind repeat purchases and churn
- Benchmarking brand sentiment versus competitors
- Segmenting user feedback for trend and CX impact analysis
Available in clean, structured formats and optimized for large-scale NLP, this dataset is indispensable for data science, product, and CX teams focused on emotional intelligence and experience-driven growth."
The dataset contains reviews of Amazon products, web scraped with the Python library BeautifulSoup.
The columns of the dataset:
How did I label my dataset, or rather, how did I label the reviews as inconsistent (1) or consistent (0)?
To begin, the VADER Sentiment tool was utilized to extract the compound sentiment value for each text review. Subsequently, the polarity of the review's text was assigned by labeling it as 'Positive' if the review's compound value exceeded 0.05, 'Negative' if the compound value was below -0.05, and 'Neutral' otherwise. Once the text polarity had been extracted for all reviews, the star polarity for each review was determined based on the number of stars assigned. Specifically, reviews that contained a star rating of 1 or 2 were labeled as 'Negative', reviews with a rating of 3 were labeled as 'Neutral', and those with 4 or 5 stars were labeled as 'Positive'.
In order to identify inconsistencies or mismatches within a review, a comparison was made between the review's text polarity and star polarity. Reviews that had matching polarities were labeled as 'Consistent' (represented by 0 in binary). Conversely, if there was a mismatch between the two polarities, the review was labeled as 'Inconsistent' (represented by 1 in binary). This binary value was then recorded in the 'inconsistentStatus' column.
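The labelling rule described above can be sketched as follows. This is a minimal illustration, not the original labelling script: the function names are hypothetical, and the compound score is assumed to have been computed beforehand, for example with VADER's SentimentIntensityAnalyzer().polarity_scores(text)["compound"].

```python
def text_polarity(compound: float) -> str:
    """Map a VADER compound score to a polarity label using the +/-0.05 cut-offs."""
    if compound > 0.05:
        return "Positive"
    if compound < -0.05:
        return "Negative"
    return "Neutral"


def star_polarity(stars: int) -> str:
    """Map a 1-5 star rating to a polarity label."""
    if stars in (1, 2):
        return "Negative"
    if stars == 3:
        return "Neutral"
    if stars in (4, 5):
        return "Positive"
    raise ValueError(f"star rating out of range: {stars}")


def inconsistent_status(compound: float, stars: int) -> int:
    """Return 1 when text and star polarities disagree, 0 when they match."""
    return int(text_polarity(compound) != star_polarity(stars))


# Hypothetical examples: enthusiastic text with 5 stars, then with 1 star.
print(inconsistent_status(0.8, 5))  # consistent
print(inconsistent_status(0.8, 1))  # inconsistent
```

Keeping the two polarity mappings as separate functions makes it easy to swap in different thresholds or a different sentiment tool without touching the comparison step.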
FYI: You can delete the 'inconsistentStatus' column and apply your own logic for labelling the rows as consistent or inconsistent.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository was created for my Master's thesis in Computational Intelligence and Internet of Things at the University of Córdoba, Spain. The purpose of this repository is to store the datasets found that were used in some of the studies that served as research material for this Master's thesis. Also, the datasets used in the experimental part of this work are included.
Below are the datasets specified, along with the details of their references, authors, and download sources.
----------- STS-Gold Dataset ----------------
The dataset consists of 2026 tweets. The file consists of 3 columns: id, polarity, and tweet. The three columns denote the unique id, polarity index of the text and the tweet text respectively.
Reference: Saif, H., Fernandez, M., He, Y., & Alani, H. (2013). Evaluation datasets for Twitter sentiment analysis: a survey and a new dataset, the STS-Gold.
File name: sts_gold_tweet.csv
----------- Amazon Sales Dataset ----------------
This dataset contains the ratings and reviews of 1K+ Amazon products, as listed on the official Amazon website. The data was scraped from the official Amazon website in January 2023.
Owner: Karkavelraja J., Postgraduate student at Puducherry Technological University (Puducherry, Puducherry, India)
Features:
License: CC BY-NC-SA 4.0
File name: amazon.csv
----------- Rotten Tomatoes Reviews Dataset ----------------
This rating inference dataset is a sentiment classification dataset, containing 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews. On average, these reviews consist of 21 words. The first 5,331 rows contain only negative samples and the last 5,331 rows contain only positive samples, so the data should be shuffled before use.
This data was collected from https://www.cs.cornell.edu/people/pabo/movie-review-data/ as a txt file and converted into a csv file. The file consists of 2 columns: reviews and labels (1 for fresh (good) and 0 for rotten (bad)).
Reference: Bo Pang and Lillian Lee. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 115–124, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics
File name: data_rt.csv
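Because the rows of this file are ordered by class, a shuffle step is needed before splitting into training and test sets. The sketch below is one possible way to do it with the Python standard library; the function name, the fixed seed, and the in-memory sample rows are hypothetical, and the header names reviews and labels are assumed to match the column names given above.

```python
import csv
import io
import random


def load_shuffled(csv_file, seed=42):
    """Read all rows from an open CSV file and return them in shuffled order."""
    rows = list(csv.DictReader(csv_file))
    # A seeded generator keeps the shuffle reproducible across runs.
    random.Random(seed).shuffle(rows)
    return rows


# Tiny in-memory stand-in for data_rt.csv (hypothetical rows, negatives first).
sample = io.StringIO(
    "reviews,labels\n"
    "bad film,0\n"
    "dull plot,0\n"
    "great cast,1\n"
    "lovely score,1\n"
)

shuffled = load_shuffled(sample)
print([row["labels"] for row in shuffled])
```

With the real file, the same function could be called on a handle opened with open("data_rt.csv", newline="", encoding="utf-8"), assuming the file is UTF-8 encoded.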
----------- Preprocessed Dataset Sentiment Analysis ----------------
Preprocessed Amazon product review data for the Gen3EcoDot (Alexa), scraped entirely from amazon.in.
Stemmed and lemmatized using nltk.
Sentiment labels are generated using TextBlob polarity scores.
The file consists of 4 columns: index, review (stemmed and lemmatized review using nltk), polarity (score) and division (categorical label generated using polarity score).
DOI: 10.34740/kaggle/dsv/3877817
Citation: @misc{pradeesh arumadi_2022, title={Preprocessed Dataset Sentiment Analysis}, url={https://www.kaggle.com/dsv/3877817}, DOI={10.34740/KAGGLE/DSV/3877817}, publisher={Kaggle}, author={Pradeesh Arumadi}, year={2022} }
This dataset was used in the experimental phase of my research.
File name: EcoPreprocessed.csv
----------- Amazon Earphones Reviews ----------------
This dataset consists of 9930 Amazon reviews and star ratings for the 10 latest (as of mid-2019) Bluetooth earphone devices, intended for learning how to train a machine for sentiment analysis.
This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.
The file consists of 5 columns: ReviewTitle, ReviewBody, ReviewStar, Product and division (manually added - categorical label generated using ReviewStar score)
License: U.S. Government Works
Source: www.amazon.in
File name (original): AllProductReviews.csv (contains 14337 reviews)
File name (edited - used for my research): AllProductReviews2.csv (contains 9930 reviews)
----------- Amazon Musical Instruments Reviews ----------------
This dataset contains 7137 comments/reviews of different musical instruments coming from Amazon.
This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.
The file consists of 10 columns: reviewerID, asin (ID of the product), reviewerName, helpful (helpfulness rating of the review), reviewText, overall (rating of the product), summary (summary of the review), unixReviewTime (time of the review - unix time), reviewTime (time of the review, raw) and division (manually added - categorical label generated using overall score).
Source: http://jmcauley.ucsd.edu/data/amazon/
File name (original): Musical_instruments_reviews.csv (contains 10261 reviews)
File name (edited - used for my research): Musical_instruments_reviews2.csv (contains 7137 reviews)