Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Contains TripAdvisor and Yelp review data, and tweets related to points of interest in Florida and New York. Tags: twitter, yelp, Florida, New York, data mining.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The graph shows the changes in the journal's impact factor and its corresponding percentile, for comparison with the entire literature. The impact factor is the most common scientometric index; it is defined as the number of citations received in a given year by the papers published in the two preceding years, divided by the number of papers published in those two years.
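Stated as a formula (the standard two-year definition, given here for concreteness), the impact factor of a journal in year y is

    IF_y = C_y / (P_{y-1} + P_{y-2})

where C_y is the number of citations received in year y by the journal's papers from years y-1 and y-2, and P_{y-1} and P_{y-2} are the numbers of papers published in those two years.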
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset contains Yelp reviews labeled for fake review detection using opinion mining techniques. It is designed for training machine learning models to classify reviews as genuine or fake/spam.
Dataset Contents:
- Review text content from the Yelp platform
- Labels: Genuine (0) and Fake (1) classifications
- Metadata: reviewer information, ratings, timestamps
- Product/business information
- Sentiment indicators

Use Cases:
- Training supervised machine learning models for fake review detection
- Sentiment analysis and opinion mining
- Text classification and NLP research
- Spam detection systems
- E-commerce fraud prevention

Recommended Algorithms:
- Naive Bayes (Bernoulli, Multinomial)
- Support Vector Machines (SVM/LinearSVC)
- Logistic Regression
- Random Forest
- LSTM/RNN for deep learning approaches

Preprocessing Required (a minimal pipeline is sketched below):
- Text cleaning (remove stop words and punctuation)
- Tokenization
- TF-IDF or word-embedding feature extraction
- Train-test split (70-30 recommended)
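A minimal sketch of this pipeline with scikit-learn; the file and column names are assumptions, not the dataset's documented schema:

```python
# TF-IDF + Logistic Regression baseline for genuine-vs-fake classification,
# following the preprocessing steps above.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

df = pd.read_csv("yelp_fake_reviews.csv")          # hypothetical file name
X_train, X_test, y_train, y_test = train_test_split(
    df["review_text"], df["label"],                # hypothetical column names
    test_size=0.3, random_state=42)                # the recommended 70-30 split

model = make_pipeline(
    TfidfVectorizer(stop_words="english", lowercase=True),  # cleaning + TF-IDF
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```

Any of the other recommended algorithms (MultinomialNB, LinearSVC, RandomForestClassifier) can be swapped into the same pipeline.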
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Abstract: With the growth of e-commerce, more people are buying products over the internet. To increase customer satisfaction, merchants provide spaces for product and service reviews. Products with positive reviews attract customers, while products with negative reviews lose them. Exploiting this, some individuals and corporations write fake reviews to promote their own products and services or to defame those of their competitors. The difficulty in finding these reviews lies in the large amount of information available. One solution is to use data mining techniques and tools, such as classification. Addressing this situation, the present work evaluates classification techniques for identifying fake reviews of products and services on the internet. The research also presents a systematic literature review on fake reviews. Eight classification algorithms were trained and tested on a hotel database. The CONCENSO algorithm presented the best result, with 88% precision. After the first test, the algorithms classified reviews from another hotel database, and the Review Skeptic algorithm was used to compare the results of this new classification. The SVM and GLMNET algorithms showed the highest agreement with Review Skeptic, classifying 83% of reviews with the same result. The research contributes by demonstrating the algorithms' ability to identify consumers' real reviews of products and services on the internet, and by being a pioneering investigation of fake reviews in Brazil and in production engineering.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
List of Top Disciplines of Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery sorted by citations.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
The Multi-aspect Reviews dataset primarily encompasses beer review data from RateBeer and BeerAdvocate, with a focus on multiple rated dimensions providing a comprehensive insight into sensory aspects such as taste, look, feel, and smell. This dataset facilitates the analysis of different facets of reviews, thus aiding in a deeper understanding of user preferences and product characteristics.
Basic Statistics (RateBeer):
- Number of users: 40,213
- Number of items: 110,419
- Number of ratings/reviews: 2,855,232
- Timespan: Apr 2000 - Nov 2011

Metadata:
- Reviews: textual reviews provided by users.
- Aspect-specific ratings: ratings on taste, look, feel, smell, and overall impression.
- Product category: categories of beer products.
- ABV (Alcohol By Volume): indicates the alcohol content of the beer.
Examples:
- RateBeer Example
```json
{
  "beer/name": "John Harvards Simcoe IPA",
  "beer/beerId": "63836",
  "beer/brewerId": "8481",
  "beer/ABV": "5.4",
  "beer/style": "India Pale Ale (IPA)",
  "review/appearance": "4/5",
  "review/aroma": "6/10",
  "review/palate": "3/5",
  "review/taste": "6/10",
  "review/overall": "13/20",
  "review/time": "1157587200",
  "review/profileName": "hopdog",
  "review/text": "On tap at the Springfield, PA location. Poured a deep and cloudy orange (almost a copper) color with a small sized off white head. Aromas or oranges and all around citric. Tastes of oranges, light caramel and a very light grapefruit finish. I too would not believe the 80+ IBUs - I found this one to have a very light bitterness with a medium sweetness to it. Light lacing left on the glass."
}
```
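A minimal parsing sketch, assuming records are stored one JSON object per line with the field names shown in the example; the file name is hypothetical:

```python
# Parse multi-aspect review records like the example above. Fractional
# ratings such as "4/5" are normalized to a 0-1 score.
import json

def parse_rating(raw: str) -> float:
    num, denom = raw.split("/")
    return float(num) / float(denom)

with open("ratebeer_reviews.json") as f:    # hypothetical file name
    for line in f:                          # assuming one JSON record per line
        record = json.loads(line)
        print(record["beer/name"],
              parse_rating(record["review/taste"]),
              parse_rating(record["review/overall"]))
```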
Download Links:
- BeerAdvocate Data
- RateBeer Data
- Sentences with aspect labels (annotator 1)
- Sentences with aspect labels (annotator 2)

Citations:
- Learning attitudes and attributes from multi-aspect reviews. Julian McAuley, Jure Leskovec, Dan Jurafsky. International Conference on Data Mining (ICDM), 2012.
- From amateurs to connoisseurs: modeling the evolution of user expertise through online reviews. Julian McAuley, Jure Leskovec. WWW, 2013.

Use Cases:
1. Aspect-Based Sentiment Analysis (ABSA): analyzing sentiments on different aspects of beers such as taste, look, feel, and smell to gain deeper insights into user preferences and opinions.
2. Recommendation Systems: developing personalized recommendation systems that consider multiple aspects of user preferences.
3. Product Development: using feedback on various aspects to improve the product.
4. Consumer Behavior Analysis: studying how different aspects influence consumer choice and satisfaction.
5. Competitor Analysis: comparing ratings on different aspects with competitors to identify strengths and weaknesses.
6. Trend Analysis: identifying trends in consumer preferences over time across different aspects.
7. Marketing Strategies: formulating marketing strategies based on insights drawn from aspect-based reviews.
8. Natural Language Processing (NLP): developing and enhancing NLP models to understand and categorize multi-aspect reviews.
9. Learning User Expertise Evolution: studying how user expertise evolves through reviews and ratings over time.
10. Training Machine Learning Models: training supervised learning models to predict aspect-based ratings from review text.
This dataset is extremely valuable for researchers, marketers, product developers, and machine learning practitioners looking to delve into multi-dimensional review analysis and understand user-product interaction on a granular level.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Code review is an important practice that improves the overall quality of a proposed patch (i.e., a set of code changes). While much research has focused on tool-based code reviews (e.g., the Gerrit code review tool, GitHub), many traditional open-source software (OSS) projects still conduct code reviews through emails. However, due to the unstructured nature of email-based data, mining email-based code reviews can be challenging, hindering researchers from delving into the code review practice of such long-standing OSS projects. Therefore, this paper presents large-scale datasets of email-based code reviews of 167 projects across three OSS communities (i.e., Linux Kernel, OzLabs, and FFmpeg). We mined the data from Patchwork, a web-based patch-tracking system for email-based code review, and curated the data by grouping each submitted patch with its revised versions and by grouping email aliases. Our datasets include a total of 4.2M patches with 2.1M patch groups and 169K email addresses belonging to 141K individuals. Our published artefacts include the datasets as well as a tool suite to crawl, curate, and store Patchwork data. With our datasets, future work can directly delve into the email-based code review practice of large OSS projects without additional effort in data collection and curation.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset consists of reviews collected from restaurants on a Korean delivery app platform that runs review events. A total of 128,668 reviews were collected from 136 restaurants by crawling reviews with the Selenium library in Python. The 136 chosen restaurants run review events that require customers to write reviews with five stars and photos, so the data were annotated by considering 1) whether the review gives a five-star rating, and 2) whether the review contains photo(s).
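A minimal sketch of this two-condition annotation rule; the column names are hypothetical and the released dataset's actual schema may differ:

```python
# Label reviews by the two review-event conditions described above.
# Column names ("stars", "num_photos") are hypothetical.
import pandas as pd

df = pd.read_csv("delivery_app_reviews.csv")   # hypothetical file name
df["is_five_star"] = df["stars"] == 5          # condition 1: five-star rating
df["has_photo"] = df["num_photos"] > 0         # condition 2: contains photo(s)
df["review_event_like"] = df["is_five_star"] & df["has_photo"]
print(df["review_event_like"].value_counts())
```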
Contains data on compliance reviews and new entrant safety audits performed by FMCSA and State grantees.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
List of Top Authors of Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery sorted by article citations.
CC0 1.0 Public Domain: https://creativecommons.org/publicdomain/zero/1.0/
Brief Description: The Chief Marketing Officer (CMO) of Healthy Foods Inc. wants to understand customer sentiment about the specialty foods that the company offers. This information has been collected through customer reviews on the company's website; the dataset consists of about 5,000 reviews. The CMO wants answers to the following questions:
1. What are the most frequently used words in the customer reviews?
2. How can the data be prepared for text analysis?
3. What are the overall sentiments towards the products?
Steps:
- Set the working directory and read the data.
- Data cleaning. Check for missing values and data types of variables
- Load the required libraries ("tm", "SnowballC", "dplyr", "sentimentr", "wordcloud2", "RColorBrewer")
- TEXT ACQUISITION and AGGREGATION. Create corpus.
- TEXT PRE-PROCESSING. Clean the text:
- Replace special characters with " " (we use the tm_map function for this purpose)
- Convert all text to lower case
- Remove punctuation
- Remove whitespace
- Remove stop words
- Remove numbers
- Stem the document
- Create the term-document matrix
- Convert into a matrix and find the word frequencies
- Convert into a data frame
- TEXT EXPLORATION. Find the words that appear most frequently and least frequently
- Create a word cloud (a consolidated sketch of these steps follows below)
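For reference, here is the frequency-analysis portion of the pipeline re-expressed in Python; the original walkthrough uses R's tm/SnowballC/wordcloud2 packages, and the file and column names below are assumptions, so treat this as a minimal sketch rather than the original code:

```python
# Minimal Python sketch of the steps above. File and column names are
# assumptions; the stemming and word-cloud steps are omitted for brevity.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df = pd.read_csv("healthy_foods_reviews.csv")   # hypothetical file name
texts = df["review_text"].fillna("")            # hypothetical column name

# Lowercasing, punctuation/number removal, stop-word removal, and
# tokenization, followed by the term-document counts.
vectorizer = CountVectorizer(stop_words="english", lowercase=True,
                             token_pattern=r"\b[a-zA-Z]{2,}\b")
tdm = vectorizer.fit_transform(texts)

# Word frequencies: most and least frequent terms.
freqs = pd.Series(tdm.sum(axis=0).A1,
                  index=vectorizer.get_feature_names_out())
print(freqs.sort_values(ascending=False).head(5))  # top 5 most frequent words
print(freqs.sort_values().head(5))                 # least frequent words
```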
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This record contains the underlying research data for the publication "The Valuation of User-Generated Content: A Structural, Stylistic and Semantic Analysis of Online Reviews"; the full text is available from https://ink.library.smu.edu.sg/etd_coll/78. The ability and ease with which users can create and publish content has provided a vast amount of online product reviews. However, the amount of data is overwhelmingly large and unstructured, making the information difficult to quantify. This creates a challenge in understanding how online reviews affect consumers' purchase decisions. In my dissertation, I explore the structural, stylistic and semantic content of online reviews. Firstly, I present a measurement that quantifies sentiments on a multi-point scale and conduct a systematic study of the impact of online reviews on product sales. Using the sentiment metrics generated, I estimate the weight that customers place on each segment of a review and examine how these segments affect sales for a given product. The results empirically verify that sentiments influence sales in ways that ratings alone do not capture. Secondly, I propose a method to detect online review manipulation using writing style analysis and assess how consumers respond to such manipulation. Finally, I find that societal norms influence posting behavior and that significant differences exist across cultures. Users should therefore exercise care in interpreting the information from online reviews. This dissertation advances our understanding of the consumer decision-making process and sheds light on the relevance of online review ratings and sentiments over a sequential decision-making process. Having tapped into the abundant supply of online review data, the results in this work are based on large-scale datasets that extend beyond the scale of traditional word-of-mouth research.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
A dataset consisting of 751,500 English app reviews of 12 online shopping apps, scraped from the internet using a Python script. This ShoppingAppReviews dataset contains app reviews of the 12 most popular online shopping Android apps: Alibaba, Aliexpress, Amazon, Daraz, eBay, Flipkart, Lazada, Meesho, Myntra, Shein, Snapdeal and Walmart. Each review entry carries metadata such as the review score, thumbs-up count, review posting time, and reply content. The dataset is organized in a zip file containing 12 JSON files and 12 CSV files, one pair per app. This dataset can be used to obtain valuable information about customers' feedback regarding their user experience of these financially important apps.
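A minimal loading sketch under the layout described above; the archive and member file names are assumptions, so list the archive first to see the real ones:

```python
# Inspect the archive and load one app's reviews. The archive name and the
# member name "Amazon.csv" are hypothetical.
import zipfile
import pandas as pd

with zipfile.ZipFile("ShoppingAppReviews.zip") as zf:
    print(zf.namelist())              # 12 JSON + 12 CSV files per the description
    with zf.open("Amazon.csv") as f:  # hypothetical member name
        amazon = pd.read_csv(f)

print(amazon.columns)  # expect fields like review score, thumbs-up count, time
```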
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Chronic pain (CP) lasts for more than 3 months, causing prolonged physical and mental burdens to patients. According to the US Centers for Disease Control and Prevention, CP contributes more than 500 billion US dollars yearly in direct medical costs plus the associated productivity loss. CP is complex in etiology and can occur anywhere in the body, making it difficult to treat and manage. There is a pressing need for research to better summarize the common health issues faced by consumers living with CP and their experience in accessing over-the-counter analgesics or therapeutic devices. Modern online shopping platforms offer a broad array of opportunities for the secondary use of consumer-generated data in CP research. In this study, we performed an exploratory data mining study analyzing CP-related Amazon product reviews. Our descriptive analyses characterized the review language, the reviewed products, the representative topics, and the network of comorbidities mentioned in the reviews. The results indicated that most of the reviews were concise yet rich in representing the various health issues faced by people with CP. Despite the noise in the online reviews, we see potential in leveraging the data to capture certain consumer-reported outcomes or to identify shortcomings of the available products.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
List of Top Schools of Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery sorted by citations.
This dataset was created by Rafay.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data and model checkpoints for paper "Weakly Supervised Concept Map Generation through Task-Guided Graph Translation" by Jiaying Lu, Xiangjue Dong, and Carl Yang. The paper has been accepted by IEEE Transactions on Knowledge and Data Engineering (TKDE).
The GT-D2G-*.tar.gz files are model checkpoints for the GT-D2G variants; these models were trained with seed=27.
The nyt/dblp/yelp.*.win5.pickle.gz files are the initial graphs generated by the NLP pipelines.
glove.840B.restaurant.400d.vec.gz is the pre-trained embedding for the Yelp dataset.
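A minimal sketch for inspecting one of the pickled initial-graph files; the concrete file name below is a hypothetical instance of the documented nyt/dblp/yelp.*.win5.pickle.gz pattern, and the structure of the unpickled object is also an assumption:

```python
# Load a gzipped pickle of initial graphs. Inspect the object before
# relying on its structure.
import gzip
import pickle

with gzip.open("yelp.train.win5.pickle.gz", "rb") as f:  # hypothetical name
    graphs = pickle.load(f)

print(type(graphs))
```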
For more instructions, please refer to our GitHub repo.
Terms: https://crawlfeeds.com/privacy_policy
This comprehensive dataset offers a rich collection of over 5 million customer reviews for hotels and accommodations listed on Booking.com, specifically sourced from the United States. It provides invaluable insights into guest experiences, preferences, and sentiment across various properties and locations within the USA. This dataset is ideal for market research, sentiment analysis, hospitality trend identification, and building advanced recommendation systems.
A sample of 1,000+ records is available to assess the dataset's quality. For full access to the complete data, submit a request through the provider's "Booking reviews data" page.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Mining opinions from reviews has been a field of ever-growing research, including mining opinions at the document level, the sentence level, and even the aspect level of a review. While explicitly mentioned aspects in a review have been widely researched, very little work has been done on gathering opinions about aspects that are implied rather than explicitly mentioned. For example, "the flight was spacious and there was plenty of legroom" expresses opinions about the cabin and seat entities of an airline: words like "spacious" and phrases like "plenty of legroom" help identify these implied entities and the opinions attached to them. Not much research has gathered such implicit aspects and opinions for airline reviews. The present dataset is a manually annotated, domain-specific, aspect-based corpus that supports studies extracting and analyzing opinions about such implied aspects and entities of airlines.
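As a toy illustration of implicit-aspect matching (the cue-to-entity lexicon below is hypothetical and is not the corpus's annotation scheme):

```python
# Map opinion cues to the entities they imply, as in the "spacious"/"legroom"
# example above. The lexicon is a hypothetical illustration.
IMPLICIT_ASPECT_CUES = {
    "spacious": "cabin",
    "legroom": "seat",
}

def implied_entities(review: str) -> set[str]:
    tokens = review.lower().split()
    return {entity for cue, entity in IMPLICIT_ASPECT_CUES.items()
            if cue in tokens}

print(implied_entities("the flight was spacious and there was plenty of legroom"))
# -> {'cabin', 'seat'}
```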
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Argument Mining in Scientific Reviews (AMSR)
We release a new dataset of peer-reviews from different computer science conferences with annotated arguments, called AMSR (Argument Mining in Scientific Reviews).
The dataset was crawled from the OpenReview platform (https://openreview.net/) using the OpenReviewCrawler (https://openreview-py.readthedocs.io/en/latest/getting_data.html).
From the 12,135 collected papers and reviews, we sampled 77 for annotation. We use a simple argumentation scheme that distinguishes between non-arguments, supporting arguments, and attacking arguments, which we denote NON, PRO, and CON accordingly.