This dataset is an updated version of the Amazon review dataset released in 2014. As in the previous version, this dataset includes reviews (ratings, text, helpfulness votes), product metadata (descriptions, category information, price, brand, and image features), and links (also viewed/also bought graphs). In addition, this version provides the following features:

- More reviews:
- New reviews:
- Metadata: We have added transaction metadata for each review shown on the review page.
If you publish articles based on this dataset, please cite the following paper:
Amazon-Fraud is a multi-relational graph dataset built upon the Amazon review dataset, which can be used in evaluating graph-based node classification, fraud detection, and anomaly detection models.
Dataset Statistics
# Nodes | % Fraud Nodes (Class=1)
---|---
11,944 | 9.5

Relation | # Edges
---|---
U-P-U |
U-S-U |
U-V-U | 1,036,737
All |
Graph Construction
The Amazon dataset includes product reviews under the Musical Instruments category. Following this paper, we label users with more than 80% helpful votes as benign entities and users with less than 20% helpful votes as fraudulent entities. We conduct a fraudulent user detection task on the Amazon-Fraud dataset, which is a binary classification task. We take 25 handcrafted features from this paper as the raw node features for Amazon-Fraud. Users are the nodes in the graph, connected by three relations: 1) U-P-U, which connects users who have reviewed at least one common product; 2) U-S-U, which connects users who have given at least one identical star rating within one week; 3) U-V-U, which connects user pairs whose mutual review-text similarity (measured by TF-IDF) ranks in the top 5% among all user pairs; a sketch of this relation follows below.
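To make the U-V-U relation concrete, here is a minimal sketch that connects user pairs in the top 5% of pairwise TF-IDF cosine similarities. The function name and the input format (a dict mapping each user to the concatenation of their review texts) are assumptions for illustration; the released graph was built by the dataset authors.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def build_u_v_u_edges(user_texts, top_fraction=0.05):
    """Connect user pairs whose TF-IDF review-text similarity falls in
    the top `top_fraction` of all user pairs.

    `user_texts` is assumed to map each user id to the concatenation of
    that user's review texts.
    """
    users = list(user_texts)
    tfidf = TfidfVectorizer().fit_transform(user_texts[u] for u in users)
    sim = cosine_similarity(tfidf)

    # similarities of all distinct user pairs (upper triangle of the matrix)
    iu, ju = np.triu_indices(len(users), k=1)
    pair_sims = sim[iu, ju]
    threshold = np.quantile(pair_sims, 1.0 - top_fraction)

    keep = pair_sims >= threshold
    return {(users[i], users[j]) for i, j in zip(iu[keep], ju[keep])}
```

Note that materializing the full similarity matrix is quadratic in the number of users, so this is only an illustration of the idea, not a scalable procedure.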
To download the dataset, please visit this GitHub repo. For any other questions, please email ytongdou(AT)gmail.com.
Yelp-Fraud is a multi-relational graph dataset built upon the Yelp spam review dataset, which can be used in evaluating graph-based node classification, fraud detection, and anomaly detection models.
Dataset Statistics
# Nodes | % Fraud Nodes (Class=1)
---|---
45,954 | 14.5

Relation | # Edges
---|---
R-U-R |
R-T-R |
R-S-R | 3,402,743
All |
Graph Construction
The Yelp spam review dataset includes hotel and restaurant reviews that Yelp has filtered (spam) or recommended (legitimate). We conduct a spam review detection task on the Yelp-Fraud dataset, which is a binary classification task. We take 32 handcrafted features from the SpEagle paper as the raw node features for Yelp-Fraud. Based on previous studies showing that opinion fraudsters are connected through users, products, review text, and time, we take reviews as the nodes in the graph and design three relations: 1) R-U-R, which connects reviews posted by the same user; 2) R-S-R, which connects reviews of the same product with the same star rating (1-5 stars); 3) R-T-R, which connects two reviews of the same product posted in the same month. All three relations follow the same bucket-and-pair pattern, sketched below.
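The following sketch illustrates the bucket-and-pair pattern behind all three relations: group reviews by a key, then connect every pair within a group. The review dicts and their field names are assumptions for illustration; the released dataset ships the graph in preprocessed form.

```python
from collections import defaultdict
from itertools import combinations

def edges_by_key(reviews, key):
    """Connect every pair of reviews that fall into the same bucket."""
    buckets = defaultdict(list)
    for review in reviews:
        buckets[key(review)].append(review["id"])
    edges = set()
    for ids in buckets.values():
        edges.update(combinations(sorted(ids), 2))
    return edges

# Assumed review dicts with keys: id, user, product, stars, month.
reviews = [
    {"id": 0, "user": "a", "product": "p1", "stars": 5, "month": "2012-03"},
    {"id": 1, "user": "a", "product": "p1", "stars": 4, "month": "2012-03"},
    {"id": 2, "user": "b", "product": "p1", "stars": 5, "month": "2012-04"},
]
r_u_r = edges_by_key(reviews, lambda r: r["user"])                   # same user
r_s_r = edges_by_key(reviews, lambda r: (r["product"], r["stars"]))  # same product + rating
r_t_r = edges_by_key(reviews, lambda r: (r["product"], r["month"]))  # same product + month
```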
To download the dataset, please visit this GitHub repo. For any other questions, please email ytongdou(AT)gmail.com.
This data set comprises a labeled training set, validation samples, and testing samples for ordinal quantification. It appears in our research paper "Ordinal Quantification Through Regularization", which we published at ECML-PKDD 2022.
The data is extracted from the McAuley data set of Amazon product reviews, where the goal is to predict the 5-star rating of each textual review. We have sampled this data according to two protocols that are suited for quantification research. The goal of quantification is not to predict the star rating of each individual instance, but the distribution of ratings in sets of textual reviews. More generally, quantification aims at estimating the distribution of labels in unlabeled samples of data.
The first protocol is the artificial prevalence protocol (APP), where all possible distributions of labels are drawn with equal probability. The second protocol, APP-OQ, is a variant thereof, where only the smoothest 20% of all APP samples are considered. This variant targets ordinal quantification, where the classes are ordered and neighboring classes can be assumed to be similar. The 5-star ratings of product reviews lie on an ordinal scale and hence pose such an ordinal quantification task.
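For illustration, the following sketch draws APP prevalence vectors uniformly from the probability simplex (via Kraemer sampling: the gaps between sorted uniform draws sum to one) and then applies one plausible smoothness filter for APP-OQ. The `roughness` criterion used here, a sum of squared second differences between neighboring class prevalences, is an assumption for this sketch; consult the paper and the extraction scripts for the exact definition.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed

def sample_app_prevalences(n_samples, n_classes=5):
    """Draw label distributions uniformly from the probability simplex,
    as the artificial prevalence protocol (APP) requires."""
    u = np.sort(rng.uniform(size=(n_samples, n_classes - 1)), axis=1)
    bounds = np.hstack([np.zeros((n_samples, 1)), u, np.ones((n_samples, 1))])
    return np.diff(bounds, axis=1)  # gaps between sorted points sum to 1

def roughness(p):
    """Assumed smoothness criterion (smaller = smoother); the paper's
    exact definition may differ."""
    return np.sum(np.diff(p, n=2) ** 2)

# APP-OQ: keep only the smoothest 20% of the APP samples.
prevalences = sample_app_prevalences(1000)
scores = np.apply_along_axis(roughness, 1, prevalences)
app_oq = prevalences[scores <= np.quantile(scores, 0.2)]
```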
This data set comprises two representations of the McAuley data. The first consists of sparse TF-IDF features; the second is a dense RoBERTa embedding. In our experience, logistic regression classifiers work well with both representations, and the RoBERTa embeddings yield more accurate predictors than the TF-IDF features.
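As a minimal sketch of this pipeline, the code below trains a logistic regression classifier and turns its predictions into a prevalence estimate via classify and count, the simplest quantification baseline (not the method of our paper). The random stand-in features and labels are placeholders for the shipped TF-IDF or RoBERTa matrices.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Stand-in features and 1-5 star labels; in practice these would be the
# TF-IDF matrix or the RoBERTa embeddings shipped with the data set.
X_train, y_train = rng.normal(size=(500, 16)), rng.integers(1, 6, size=500)
X_sample = rng.normal(size=(200, 16))  # one unlabeled sample to quantify

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Classify and count: predict each review in the sample, then report
# the relative class frequencies as the estimated label distribution.
predictions = clf.predict(X_sample)
prevalence_estimate = np.bincount(predictions, minlength=6)[1:] / len(predictions)
print(prevalence_estimate)  # estimated distribution over the 5 star ratings
```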
You can also extract the data yourself, for instance if you require a raw textual representation: the original McAuley data set is already public, and we provide all of our extraction scripts.
Extraction scripts and experiments: https://github.com/mirkobunse/ecml22
Original data by McAuley: https://jmcauley.ucsd.edu/data/amazon/