Amazon Customer Reviews Dataset is a dataset of user-generated product reviews on the shopping website Amazon. It contains over 130 million product reviews.
This dataset contains a tiny fraction of that dataset processed and prepared specifically for language generation.
To know how the dataset is prepared, then please check the GitHub repository for this dataset. https://github.com/imdeepmind/AmazonReview-LanguageGenerationDataset
The dataset is stored in an SQLite database. The database contains one table called reviews. This table contains two columns sequence and next.
The sequence column contains sequences of characters. In this dataset, each sequence of 40 characters long.
The next column contains the next character after the sequence.
There are about 200 million samples are in the dataset.
Thanks to Amazon for making this awesome dataset. Here is the link for the dataset: https://s3.amazonaws.com/amazon-reviews-pds/readme.html
This dataset can be used for Language Generation. As it contains 200 million samples, complex Deep Learning models can be trained on this data.
Large dataset of movie reviews from the Internet Movie Database (IMDb)
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Amazon Product Reviews Dataset’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/yamqwe/amazon-product-reviews-datasete on 13 February 2022.
--- Dataset description provided by original source is as follows ---
This dataset contains 30K records of product reviews from amazon.com.
This dataset was created by PromptCloud and DataStock
This dataset contains the following:
Total Records Count: 43729
Domain Name: amazon.com
Date Range: 01st Jan 2020 - 31st Mar 2020
File Extension: CSV
Available Fields:
-- Uniq Id,
-- Crawl Timestamp,
-- Billing Uniq Id,
-- Rating,
-- Review Title,
-- Review Rating,
-- Review Date,
-- User Id,
-- Brand,
-- Category,
-- Sub Category,
-- Product Description,
-- Asin,
-- Url,
-- Review Content,
-- Verified Purchase,
-- Helpful Review Count,
-- Manufacturer Response
We wouldn't be here without the help of our in house teams at PromptCloud and DataStock. Who has put their heart and soul into this project like all other projects? We want to provide the best quality data and we will continue to do so.
The inspiration for these datasets came from research. Reviews are something that is important wit everybody across the globe. So we decided to come up with this dataset that shows us exactly how the user reviews help companies to better their products.
This dataset was created by PromptCloud and contains around 0 samples along with Billing Uniq Id, Verified Purchase, technical information and other features such as: - Crawl Timestamp - Manufacturer Response - and more.
- Analyze Helpful Review Count in relation to Sub Category
- Study the influence of Review Date on Product Description
- More datasets
If you use this dataset in your research, please credit PromptCloud
--- Original source retains full ownership of the source dataset ---
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository was created for my Master's thesis in Computational Intelligence and Internet of Things at the University of Córdoba, Spain. The purpose of this repository is to store the datasets found that were used in some of the studies that served as research material for this Master's thesis. Also, the datasets used in the experimental part of this work are included.
Below are the datasets specified, along with the details of their references, authors, and download sources.
----------- STS-Gold Dataset ----------------
The dataset consists of 2026 tweets. The file consists of 3 columns: id, polarity, and tweet. The three columns denote the unique id, polarity index of the text and the tweet text respectively.
Reference: Saif, H., Fernandez, M., He, Y., & Alani, H. (2013). Evaluation datasets for Twitter sentiment analysis: a survey and a new dataset, the STS-Gold.
File name: sts_gold_tweet.csv
----------- Amazon Sales Dataset ----------------
This dataset is having the data of 1K+ Amazon Product's Ratings and Reviews as per their details listed on the official website of Amazon. The data was scraped in the month of January 2023 from the Official Website of Amazon.
Owner: Karkavelraja J., Postgraduate student at Puducherry Technological University (Puducherry, Puducherry, India)
Features:
License: CC BY-NC-SA 4.0
File name: amazon.csv
----------- Rotten Tomatoes Reviews Dataset ----------------
This rating inference dataset is a sentiment classification dataset, containing 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews. On average, these reviews consist of 21 words. The first 5331 rows contains only negative samples and the last 5331 rows contain only positive samples, thus the data should be shuffled before usage.
This data is collected from https://www.cs.cornell.edu/people/pabo/movie-review-data/ as a txt file and converted into a csv file. The file consists of 2 columns: reviews and labels (1 for fresh (good) and 0 for rotten (bad)).
Reference: Bo Pang and Lillian Lee. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 115–124, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics
File name: data_rt.csv
----------- Preprocessed Dataset Sentiment Analysis ----------------
Preprocessed amazon product review data of Gen3EcoDot (Alexa) scrapped entirely from amazon.in
Stemmed and lemmatized using nltk.
Sentiment labels are generated using TextBlob polarity scores.
The file consists of 4 columns: index, review (stemmed and lemmatized review using nltk), polarity (score) and division (categorical label generated using polarity score).
DOI: 10.34740/kaggle/dsv/3877817
Citation: @misc{pradeesh arumadi_2022, title={Preprocessed Dataset Sentiment Analysis}, url={https://www.kaggle.com/dsv/3877817}, DOI={10.34740/KAGGLE/DSV/3877817}, publisher={Kaggle}, author={Pradeesh Arumadi}, year={2022} }
This dataset was used in the experimental phase of my research.
File name: EcoPreprocessed.csv
----------- Amazon Earphones Reviews ----------------
This dataset consists of a 9930 Amazon reviews, star ratings, for 10 latest (as of mid-2019) bluetooth earphone devices for learning how to train Machine for sentiment analysis.
This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.
The file consists of 5 columns: ReviewTitle, ReviewBody, ReviewStar, Product and division (manually added - categorical label generated using ReviewStar score)
License: U.S. Government Works
Source: www.amazon.in
File name (original): AllProductReviews.csv (contains 14337 reviews)
File name (edited - used for my research) : AllProductReviews2.csv (contains 9930 reviews)
----------- Amazon Musical Instruments Reviews ----------------
This dataset contains 7137 comments/reviews of different musical instruments coming from Amazon.
This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.
The file consists of 10 columns: reviewerID, asin (ID of the product), reviewerName, helpful (helpfulness rating of the review), reviewText, overall (rating of the product), summary (summary of the review), unixReviewTime (time of the review - unix time), reviewTime (time of the review (raw) and division (manually added - categorical label generated using overall score).
Source: http://jmcauley.ucsd.edu/data/amazon/
File name (original): Musical_instruments_reviews.csv (contains 10261 reviews)
File name (edited - used for my research) : Musical_instruments_reviews2.csv (contains 7137 reviews)
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset contains over 4,900 customer reviews from Amazon, including text-based feedback, star ratings, and helpfulness votes.
It can be used for:
reviewText
: Full written reviewoverall
: Star rating (1 to 5)summary
: Short summary of the reviewhelpful_yes
: Number of users who found the review helpfultotal_vote
: Total votes on helpfulnessday_diff
: Days since the review was writtenThis dataset is suitable for natural language processing (NLP) and supervised learning tasks.
This is a publicly available dataset for educational and research use.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘App Store Reviews’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/yamqwe/app-store-reviews on 28 January 2022.
--- Dataset description provided by original source is as follows ---
- The dataset contains scraped written reviews from the App store. This dataset was created by CrawlFeeds and contains around 10K reviews along with Country & Date and other features such as:
- User Name
- Is Edited?
- Date of crawl
- And more.
- Analyze the sentiment of the review, try to isolate the phrases associated with positive/negative reviews.
- Study the connection between country and review sentiment
- Study the connection between the time of day and sentiment
- More datasets
If you use this dataset in your research, please credit CrawlFeeds
--- Original source retains full ownership of the source dataset ---
These datasets contain reviews from the Goodreads book review website, and a variety of attributes describing the items. Critically, these datasets have multiple levels of user interaction, raging from adding to a shelf, rating, and reading.
Metadata includes
reviews
add-to-shelf, read, review actions
book attributes: title, isbn
graph of similar books
Basic Statistics:
Items: 1,561,465
Users: 808,749
Interactions: 225,394,930
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Reviews are a way to gain insight into a product/service. In machine learning tasks, text reviews play an important role in predicting/gaining insights. IMDb, one of the largest databases for films, tv-shows, and similar content, people provide their valuable opinion on them every year. These reviews eventually play a huge role in the viewer's community. However, these reviews can sometime contain spoilers, and the user also marks them considering their functionality.
The dataset is collected from IMDb and each individual data is publically available. Below is the table of content of each element dictionary element -
| Content| Details|
| --- | --- |
| review_id
| It is generated by IMBb and unique to each review |
| reviewer
| Public identity or username of the reviewer |
| movie
| It represents the name of the show (can be - movie, tv-series, etc.) |
| rating
| Rating of movie
out of 10, can be None
for older reviews |
| review_summary
| Plain summary of the review |
| review_date
| Date of the posted review |
| spoiler_tag
| If 1
= spoiler & 0
= not spoiler |
| review_detail
| Details of the review |
| helpful
| list[0]
people find the review helpful out of list[1]
|
Dataset Stats:
# total records
= 5, 571, 499
# total shows
= 453, 528
# users
= 1, 699, 310
# spoilers
= 1, 186, 611
All rights reserved to IMDb
and the user who spent valuable time providing reviewers.
If you intend to use this dataset, please cite the following -
@misc{enam biswas_2021,
title={IMDb Review Dataset - ebD},
url={https://www.kaggle.com/dsv/1836923},
DOI={10.34740/KAGGLE/DSV/1836923},
publisher={Kaggle},
author={Enam Biswas},
year={2021} }
Please feel free to contact - Enam Biswas if you have any kind of questions.
700, 000
reviews of USA neighborhood.Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Travel Review Rating Dataset’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/wirachleelakiatiwong/travel-review-rating-dataset on 30 September 2021.
--- Dataset description provided by original source is as follows ---
This data set has been sourced from the Machine Learning Repository of University of California, Irvine (UC Irvine) : Travel Review Ratings Data Set. This data set is populated by capturing user ratings from Google reviews. Reviews on attractions from 24 categories across Europe are considered. Google user rating ranges from 1 to 5 and average user rating per category is calculated.
Attribute 1 : Unique user id Attribute 2 : Average ratings on churches Attribute 3 : Average ratings on resorts Attribute 4 : Average ratings on beaches Attribute 5 : Average ratings on parks Attribute 6 : Average ratings on theatres Attribute 7 : Average ratings on museums Attribute 8 : Average ratings on malls Attribute 9 : Average ratings on zoo Attribute 10 : Average ratings on restaurants Attribute 11 : Average ratings on pubs/bars Attribute 12 : Average ratings on local services Attribute 13 : Average ratings on burger/pizza shops Attribute 14 : Average ratings on hotels/other lodgings Attribute 15 : Average ratings on juice bars Attribute 16 : Average ratings on art galleries Attribute 17 : Average ratings on dance clubs Attribute 18 : Average ratings on swimming pools Attribute 19 : Average ratings on gyms Attribute 20 : Average ratings on bakeries Attribute 21 : Average ratings on beauty & spas Attribute 22 : Average ratings on cafes Attribute 23 : Average ratings on view points Attribute 24 : Average ratings on monuments Attribute 25 : Average ratings on gardens
This data set has been sourced from the Machine Learning Repository of University of California, Irvine (UC Irvine) : Travel Review Ratings Data Set
The UCI page mentions the following publication as the original source of the data set: Renjith, Shini, A. Sreekumar, and M. Jathavedan. 2018. Evaluation of Partitioning Clustering Algorithms for Processing Social Media Data in Tourism Domain. In 2018 IEEE Recent Advances in Intelligent Computational Systems (RAICS), 12731. IEEE
I'm kind of people who love traveling. But sometimes I've problems like where should I visit? Are there somewhere interesting places matched with my lifestyle? Often I spent hours to search for interesting place to go out. Such a waste of time.
What if we can build a recommender system which can recommend you several interesting venue based on your preferences. With information from Google review, I'll try to divide Google review user into cluster of similar interest for further work of building recommender system based on thier preference.
--- Original source retains full ownership of the source dataset ---
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Reviews are a way to gain insight into a product/service. In machine learning tasks, text reviews play an important role in predicting/gaining insights. User-generated place reviews are extremely handy when it comes to choosing a neighborhood to live in. Niche has got a huge amount of review-rating for American neighborhood, which is perfect for several NLP tasks.
The dataset is collected from Niche and each individual data is publically available. Below is the overall dataset stats -
# total records
= 712, 107
# total places
= 56, 800
Some insight about data:
# guid
Generated by Niche and unique to place/entity.
# body
Actual review data.
# rating
Rating on a scale of 0 to 5.
# author
Provider of the review/rating. (aka Niche user)
# created
Timestamp.
# categories
Experience type (about the entity).
All rights reserved to Niche and the user who spent valuable time providing reviewers-ratings.
If you intend to use this dataset, please cite the following -
@misc{enam biswas_2021,
title={Place Review Dataset - Niche (USA)},
url={https://www.kaggle.com/dsv/1842046},
DOI={10.34740/KAGGLE/DSV/1842046},
publisher={Kaggle},
author={Enam Biswas},
year={2021} }
Please feel free to contact - Enam Biswas if you have any kind of questions.
Large Movie Review Dataset. This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('imdb_reviews', split='train')
for ex in ds.take(4):
print(ex)
See the guide for more informations on tensorflow_datasets.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘⭐ McDonalds Review Sentiment’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/yamqwe/mcdonalds-review-sentimente on 13 February 2022.
--- Dataset description provided by original source is as follows ---
A sentiment analysis of negative McDonald's reviews. Contributors were given reviews culled from low-rated McDonald's from random metro areas and asked to classify why the locations received low reviews. Options given were: * Rude Service
- Slow Service
- Problem with Order
- Bad Food
- Bad Neighborhood
- Dirty Location
- Cost
- Missing Item Added: March 6, 2015 by CrowdFlower | Data Rows: 1500 Download Now
Source: https://www.crowdflower.com/data-for-everyone/
This dataset was created by CrowdFlower and contains around 2000 samples along with Unit State, Policies Violated, technical information and other features such as: - Review - Policies Violated Gold - and more.
- Analyze Policies Violated:confidence in relation to City
- Study the influence of Last Judgment At on Trusted Judgments
- More datasets
If you use this dataset in your research, please credit CrowdFlower
--- Original source retains full ownership of the source dataset ---
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The FiFAR dataset, is comprised of 30K bank account opening application instances, accompanied by the judgments of a team of 50 synthetic fraud analysts with realistic decision-making properties on whether or not each instance is a fraudulent application. Each instance contains information regarding the bank account opening application and the applicant, as well as the ground truth label: 0 - legitimate, 1 - fraudulent. Furthermore, each instance contains the prediction of each of the 50 experts, following the same convention as the label. We provide every expert’s prediction for every 30K instances in the Bank Account Fraud dataset (https://www.kaggle.com/datasets/sgpjesus/bank-account-fraud-dataset-neurips-2022/versions/1?select=Base.csv) deemed fraudulent by a fraud detection model, thus simulating an “alert-review” scenario, where experts are tasked with reviewing high-risk bank account opening applications.
We introduce PDMX: a Public Domain MusicXML dataset for symbolic music processing, including over 250k musical scores in MusicXML format. PDMX is the largest publicly available, copyright-free MusicXML dataset in existence. PDMX includes genre, tag, description, and popularity metadata for every file.
This dataset contains images (scenes) containing fashion products, which are labeled with bounding boxes and links to the corresponding products.
Metadata includes
product IDs
bounding boxes
Basic Statistics:
Scenes: 47,739
Products: 38,111
Scene-Product Pairs: 93,274
These datasets contain peer-to-peer trades from various recommendation platforms.
Metadata includes
peer-to-peer trades
have and want lists
image data (tradesy)
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
While ancient scientists often had patrons to fund their work, peer review of proposals for the allocation of resources is a foundation of modern science
This is the anonymized dataset obtained from the DPR Experiment run at ESO in Fall 2018
previous work available at 10.1038/s41550-020-1038-y
We provide instructions, codes and datasets for replicating the article by Kim, Lee and McCulloch (2024), "A Topic-based Segmentation Model for Identifying Segment-Level Drivers of Star Ratings from Unstructured Text Reviews." This repository provides a user-friendly R package for any researchers or practitioners to apply A Topic-based Segmentation Model with Unstructured Texts (latent class regression with group variable selection) to their datasets. First, we provide a R code to replicate the illustrative simulation study: see file 1. Second, we provide the user-friendly R package with a very simple example code to help apply the model to real-world datasets: see file 2, Package_MixtureRegression_GroupVariableSelection.R and Dendrogram.R. Third, we provide a set of codes and instructions to replicate the empirical studies of customer-level segmentation and restaurant-level segmentation with Yelp reviews data: see files 3-a, 3-b, 4-a, 4-b. Note, due to the dataset terms of use by Yelp and the restriction of data size, we provide the link to download the same Yelp datasets (https://www.kaggle.com/datasets/yelp-dataset/yelp-dataset/versions/6). Fourth, we provided a set of codes and datasets to replicate the empirical study with professor ratings reviews data: see file 5. Please see more details in the description text and comments of each file. [A guide on how to use the code to reproduce each study in the paper] 1. Full codes for replicating Illustrative simulation study.txt -- [see Table 2 and Figure 2 in main text]: This is R source code to replicate the illustrative simulation study. Please run from the beginning to the end in R. In addition to estimated coefficients (posterior means of coefficients), indicators of variable selections, and segment memberships, you will get dendrograms of selected groups of variables in Figure 2. Computing time is approximately 20 to 30 minutes 3-a. Preprocessing raw Yelp Reviews for Customer-level Segmentation.txt: Code for preprocessing the downloaded unstructured Yelp review data and preparing DV and IVs matrix for customer-level segmentation study. 3-b. Instruction for replicating Customer-level Segmentation analysis.txt -- [see Table 10 in main text; Tables F-1, F-2, and F-3 and Figure F-1 in Web Appendix]: Code for replicating customer-level segmentation study with Yelp data. You will get estimated coefficients (posterior means of coefficients), indicators of variable selections, and segment memberships. Computing time is approximately 3 to 4 hours. 4-a. Preprocessing raw Yelp reviews_Restaruant Segmentation (1).txt: R code for preprocessing the downloaded unstructured Yelp data and preparing DV and IVs matrix for restaurant-level segmentation study. 4-b. Instructions for replicating restaurant-level segmentation analysis.txt -- [see Tables 5, 6 and 7 in main text; Tables E-4 and E-5 and Figure H-1 in Web Appendix]: Code for replicating restaurant-level segmentation study with Yelp. you will get estimated coefficients (posterior means of coefficients), indicators of variable selections, and segment memberships. Computing time is approximately 10 to 12 hours. [Guidelines for running Benchmark models in Table 6] Unsupervised Topic model: 'topicmodels' package in R -- after determining the number of topics(e.g., with 'ldatuning' R package), run 'LDA' function in the 'topicmodels'package. Then, compute topic probabilities per restaurant (with 'posterior' function in the package) which can be used as predictors. Then, conduct prediction with regression Hierarchical topic model (HDP): 'gensimr' R package -- 'model_hdp' function for identifying topics in the package (see https://radimrehurek.com/gensim/models/hdpmodel.html or https://gensimr.news-r.org/). Supervised topic model: 'lda' R package -- 'slda.em' function for training and 'slda.predict' for prediction. Aggregate regression: 'lm' default function in R. Latent class regression without variable selection: 'flexmix' function in 'flexmix' R package. Run flexmix with a certain number of segments (e.g., 3 segments in this study). Then, with estimated coefficients and memberships, conduct prediction of dependent variable per each segment. Latent class regression with variable selection: 'Unconstraind_Bayes_Mixture' function in Kim, Fong and DeSarbo(2012)'s package. Run the Kim et al's model (2012) with a certain number of segments (e.g., 3 segments in this study). Then, with estimated coefficients and memberships, we can do prediction of dependent variables per each segment. The same R package ('KimFongDeSarbo2012.zip') can be downloaded at: https://sites.google.com/scarletmail.rutgers.edu/r-code-packages/home 5. Instructions for replicating Professor ratings review study.txt -- [see Tables G-1, G-2, G-4 and G-5, and Figures G-1 and H-2 in Web Appendix]: Code to replicate the Professor ratings reviews study. Computing time is approximately 10 hours. [A list of the versions of R, packages, and computer...
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘uHack Sentiments 2.0: Decode Code Words’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/manishtripathi86/uhack-sentiments-20-decode-code-words on 28 January 2022.
--- Dataset description provided by original source is as follows ---
The challenge here is to analyze and deep dive into the natural language text (reviews) and bucket them based on their topics of discussion. Furthermore, analyzing the overall sentiment will also help the business to make tangible decisions.
The data set provided to you has a mix of customer reviews for products across categories and retailers. We would like you to model on the data
to bucket the future reviews in their respective topics (Note: A review can talk about multiple topics)
Overall polarity (positive/negative sentiment)
Topics (Components, Delivery and Customer Support, Design and Aesthetics, Dimensions, Features, Functionality, Installation, Material, Price, Quality and Usability) Polarity (Positive/Negative) Note: The target variables are all encoded in the train dataset for convenience. Please submit the test results in the similar encoded fashion for us to evaluate your results.
| | Field Name Data Type Purpose Variable type
Id Integer Unique identifier for each review Input
Review String Review written by customers on a retail website Input
Components String 1: aspects related to components Target
0: None
Delivery and Customer Support String 1: some aspects related to delivery, return, exchange and customer support Target
0: None
Design and Aesthetics String 1: some aspects related to components Target
0: None
Dimensions String 1: related to product dimension and size Target
0: None
Features String 1: related to product features Target
0 : None
Functionality String 1: related to working of a product Target
0: None
Installation String 1: related to installation of the product Target
0: None
Material String 1: related to material of the product Target
0: None
Price String 1: related to pricing details of a product Target
0: None
Quality String 1: related to quality aspects of a product Target
0: None
Usability String 1: related to usability of a product Target
0: None
Polarity Integer 1: Positive sentiment; Target
0: Negative Sentiment | |
| --- | --- |
| | | | |
| --- | --- |
| | |
Skills: Text Pre-processing – Lemmatization , Tokenization, N-Grams and other relevant methods Multi-Class Classification, Multi-label Classification Optimizing Log Loss
Overview Ugam, a Merkle company, is a leading analytics and technology services company. Our customer-centric approach delivers impactful business results for large corporations by leveraging data, technology, and expertise.
We consistently deliver superior, impactful results through the right blend of human intelligence and AI. With 3300+ people spread across locations worldwide, we successfully deploy our services to create success stories across industries like Retail & Consumer Brands, High Tech, BFSI, Distribution, and Market Research & Consulting. Over the past 21 years, Ugam has been recognized by several firms including Forrester and Gartner, named the No.1 data science company in India by Analytics Insight, and certified as a Great Place to Work®.
Problem Statement: The last two decades have witnessed a significant change in how consumers purchase products and express their experience/opinions in reviews, posts, and content across platforms. These online reviews are not only useful to reflect customers’ sentiment towards a product but also help businesses fix gaps and find potential opportunities which could further influence future purchases.
Participants need develop a machine learning model that can analyse customers’ sentiments based on their reviews and feedback.
NOTE: The prize money will be for the interested candidates who are willing to get interviewed or hired by Ugam. Winner are requested to come to the Machine Leaning Developers Summit2022, happening at Bangalore, for receiving the prize money.
dataset link: https://machinehack.com/hackathon/uhack_sentiments_20_decode_code_words/overview
--- Original source retains full ownership of the source dataset ---
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
The competition is over 2 yrs ago. I just wanna play around the dataset.
The labeled data set consists of 50,000 IMDB movie reviews, specially selected for sentiment analysis. The sentiment of reviews is binary, meaning the IMDB rating < 5 results in a sentiment score of 0, and rating >=7 have a sentiment score of 1. No individual movie has more than 30 reviews. The 25,000 review labeled training set does not include any of the same movies as the 25,000 review test set. In addition, there are another 50,000 IMDB reviews provided without any rating labels.
The origin place is here. Awesome tutorial is here, we can play with it.
Just for study and learning
Amazon Customer Reviews Dataset is a dataset of user-generated product reviews on the shopping website Amazon. It contains over 130 million product reviews.
This dataset contains a tiny fraction of that dataset processed and prepared specifically for language generation.
To know how the dataset is prepared, then please check the GitHub repository for this dataset. https://github.com/imdeepmind/AmazonReview-LanguageGenerationDataset
The dataset is stored in an SQLite database. The database contains one table called reviews. This table contains two columns sequence and next.
The sequence column contains sequences of characters. In this dataset, each sequence of 40 characters long.
The next column contains the next character after the sequence.
There are about 200 million samples are in the dataset.
Thanks to Amazon for making this awesome dataset. Here is the link for the dataset: https://s3.amazonaws.com/amazon-reviews-pds/readme.html
This dataset can be used for Language Generation. As it contains 200 million samples, complex Deep Learning models can be trained on this data.