This Dataset is an updated version of the Amazon review dataset released in 2014. As in the previous version, this dataset includes reviews (ratings, text, helpfulness votes), product metadata (descriptions, category information, price, brand, and image features), and links (also viewed/also bought graphs). In addition, this version provides the following features:
More reviews:
New reviews:
Metadata: - We have added transaction metadata for each review shown on the review page.
If you publish articles based on this dataset, please cite the following paper:
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Amazon-Fraud is a multi-relational graph dataset built upon the Amazon review dataset, which can be used in evaluating graph-based node classification, fraud detection, and anomaly detection models. Dataset Statistics
%Fraud Nodes (Class=1) 11,944 9.5 Relation
U-P-U 175,608 U-S-U 3,566,479 U-V-U 1,036,737 All 4,398,392 Graph Construction The Amazon dataset includes product reviews under the Musical Instruments category. Similar to this paper, we label users with more than 80% helpful votes as benign entities and users with less than 20% helpful votes as fraudulent entities. we conduct a fraudulent user detection task on the Amazon-Fraud dataset, which is a binary classification task. We take 25 handcrafted features from this paper as the raw node features for Amazon-Fraud. We take users as nodes in the graph and design three relations: 1) U-P-U: it connects users reviewing at least one same product; 2) U-S-V: it connects users having at least one same star rating within one week; 3) U-V-U: it connects users with top 5% mutual review text similarities (measured by TF-IDF) among all users. To download the dataset, please visit this Github repo. For any other questions, please email ytongdou(AT)gmail.com for inquiry.
This data set comprises a labeled training set, validation samples, and testing samples for ordinal quantification. The goal of quantification is not to predict the class label of each individual instance, but the distribution of labels in unlabeled sets of data.
The data is extracted from the McAuley data set of product reviews in Amazon, where the goal is to predict the 5-star rating of each textual review. We have sampled this data according to three protocols that are suited for quantification research.
The first protocol is the artificial prevalence protocol (APP), where all possible distributions of labels are drawn with an equal probability. The second protocol, APP-OQ(50%), is a variant thereof, where only the smoothest 50% of all APP samples are considered. This variant is targeted at ordinal quantification, where classes are ordered and a similarity of neighboring classes can be assumed. 5-star ratings of product reviews lie on an ordinal scale and, hence, pose such an ordinal quantification task. The third protocol considers "real" distributions of labels. These distributions stem from actual products in the original data set.
The data is represented by a RoBERTa embedding. In our experience, logistic regression classifiers work well with this representation.
You can extract our data sets yourself, for instance, if you require a raw textual representation. The original McAuley data set is public already and we provide all of our extraction scripts.
Extraction scripts and experiments: https://github.com/mirkobunse/regularized-oq
Original data by McAuley: https://jmcauley.ucsd.edu/data/amazon/
Not seeing a result you expected?
Learn how you can add new datasets to our index.
This Dataset is an updated version of the Amazon review dataset released in 2014. As in the previous version, this dataset includes reviews (ratings, text, helpfulness votes), product metadata (descriptions, category information, price, brand, and image features), and links (also viewed/also bought graphs). In addition, this version provides the following features:
More reviews:
New reviews:
Metadata: - We have added transaction metadata for each review shown on the review page.
If you publish articles based on this dataset, please cite the following paper: