63 datasets found

Language Generation Dataset: 200M Samples
kaggle.com
zip
Updated Sep 7, 2019
Cite
Abhishek Chatterjee (2019). Language Generation Dataset: 200M Samples [Dataset]. https://www.kaggle.com/datasets/imdeepmind/language-generation-dataset-200m-samples
Explore at:
Available download formats: zip (3416608411 bytes)
Dataset updated
Sep 7, 2019
Authors
Abhishek Chatterjee
Description
Context

Amazon Customer Reviews Dataset is a dataset of user-generated product reviews on the shopping website Amazon. It contains over 130 million product reviews.

This dataset contains a tiny fraction of that dataset processed and prepared specifically for language generation.

To see how the dataset was prepared, check the GitHub repository for this dataset: https://github.com/imdeepmind/AmazonReview-LanguageGenerationDataset

Content

The dataset is stored in an SQLite database. The database contains one table called reviews. This table contains two columns: sequence and next.

The sequence column contains sequences of characters. In this dataset, each sequence is 40 characters long.

The next column contains the character that follows the sequence.

There are about 200 million samples in the dataset.
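
As a rough illustration of the layout described above, the sketch below reads (sequence, next) pairs from a reviews table in batches using Python's built-in sqlite3 module. The in-memory database, the sample text, and the helper name iter_samples are assumptions made for the example; the real file name and column types are not stated in the description.

```python
import sqlite3

def iter_samples(conn, batch_size=1000):
    """Yield (sequence, next_char) pairs from the reviews table in batches."""
    # "next" is quoted defensively because it reads like a keyword.
    cur = conn.execute('SELECT sequence, "next" FROM reviews')
    while True:
        rows = cur.fetchmany(batch_size)
        if not rows:
            break
        yield from rows

# Stand-in database with the schema described above (assumption: TEXT columns).
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE reviews (sequence TEXT, "next" TEXT)')

# Hypothetical text, cut into 40-character windows plus the following character.
text = "this product works great and arrived on time, highly recommended"
window = 40
rows = [(text[i:i + window], text[i + window]) for i in range(len(text) - window)]
conn.executemany("INSERT INTO reviews VALUES (?, ?)", rows)

samples = list(iter_samples(conn, batch_size=8))
```

Reading in batches rather than with a single fetchall keeps memory use bounded, which matters for a table of about 200 million rows.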

Acknowledgements

Thanks to Amazon for making this awesome dataset. Here is the link for the dataset: https://s3.amazonaws.com/amazon-reviews-pds/readme.html

Inspiration

This dataset can be used for Language Generation. As it contains 200 million samples, complex Deep Learning models can be trained on this data.
IMDB movie review dataset
kaggle.com
zip
Updated Apr 7, 2019
Cite
Renan Machado (2019). IMDB movie review dataset [Dataset]. https://www.kaggle.com/renanmav/imdb-movie-review-dataset
Explore at:
Available download formats: zip (26909839 bytes)
Dataset updated
Apr 7, 2019
Authors
Renan Machado
Description
Large dataset of movie reviews from the Internet Movie Database (IMDb)
‘Amazon Product Reviews Dataset’ analyzed by Analyst-2
analyst-2.ai
Updated Feb 13, 2022
Cite
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘Amazon Product Reviews Dataset’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-amazon-product-reviews-dataset-7933/latest
Explore at:
Dataset updated
Feb 13, 2022
Dataset authored and provided by
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Analysis of ‘Amazon Product Reviews Dataset’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/yamqwe/amazon-product-reviews-datasete on 13 February 2022.

--- Dataset description provided by original source is as follows ---

About this dataset

This dataset contains 30K records of product reviews from amazon.com.

This dataset was created by PromptCloud and DataStock
Content

This dataset contains the following:

Total Records Count: 43729

Domain Name: amazon.com

Date Range: 01st Jan 2020 - 31st Mar 2020

File Extension: CSV

Available Fields:
-- Uniq Id,
-- Crawl Timestamp,
-- Billing Uniq Id,
-- Rating,
-- Review Title,
-- Review Rating,
-- Review Date,
-- User Id,
-- Brand,
-- Category,
-- Sub Category,
-- Product Description,
-- Asin,
-- Url,
-- Review Content,
-- Verified Purchase,
-- Helpful Review Count,
-- Manufacturer Response

Acknowledgements

We wouldn't be here without the help of our in-house teams at PromptCloud and DataStock, who have put their heart and soul into this project, as they do with all our other projects. We want to provide the best quality data and we will continue to do so.

Inspiration

The inspiration for these datasets came from research. Reviews are important to everybody across the globe, so we decided to come up with this dataset, which shows exactly how user reviews help companies improve their products.

This dataset was created by PromptCloud and contains around 0 samples along with Billing Uniq Id, Verified Purchase, technical information and other features such as: - Crawl Timestamp - Manufacturer Response - and more.

How to use this dataset

Analyze Helpful Review Count in relation to Sub Category

Study the influence of Review Date on Product Description
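
One way to approach the first suggestion above is to average Helpful Review Count per Sub Category. The sketch below uses only the standard library; the two column names come from the field list above, while the sample rows and the function name mean_helpful_by_subcategory are made up for illustration.

```python
import csv
import io
from collections import defaultdict

def mean_helpful_by_subcategory(csv_file):
    """Average 'Helpful Review Count' per 'Sub Category', skipping blank or non-numeric cells."""
    totals = defaultdict(lambda: [0.0, 0])  # sub category -> [sum, count]
    for row in csv.DictReader(csv_file):
        raw = (row.get("Helpful Review Count") or "").strip()
        try:
            value = float(raw)
        except ValueError:
            continue  # missing or malformed count
        entry = totals[row["Sub Category"]]
        entry[0] += value
        entry[1] += 1
    return {name: total / count for name, (total, count) in totals.items()}

# Hypothetical rows standing in for the real CSV file.
sample = io.StringIO(
    "Sub Category,Helpful Review Count\n"
    "Headphones,4\n"
    "Headphones,2\n"
    "Chargers,\n"
    "Chargers,5\n"
)
means = mean_helpful_by_subcategory(sample)
```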

Acknowledgements

If you use this dataset in your research, please credit PromptCloud

--- Original source retains full ownership of the source dataset ---
Datasets for Sentiment Analysis
zenodo.org
csv
Updated Dec 10, 2023
Cite
Julie R. Campos Arias (2023). Datasets for Sentiment Analysis [Dataset]. http://doi.org/10.5281/zenodo.10157504
Explore at:
Available download formats: csv
Unique identifier
https://doi.org/10.5281/zenodo.10157504
Dataset updated
Dec 10, 2023
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Julie R. Campos Arias
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This repository was created for my Master's thesis in Computational Intelligence and Internet of Things at the University of Córdoba, Spain. The purpose of this repository is to store the datasets found that were used in some of the studies that served as research material for this Master's thesis. Also, the datasets used in the experimental part of this work are included.
Below are the datasets specified, along with the details of their references, authors, and download sources.

----------- STS-Gold Dataset ----------------
The dataset consists of 2026 tweets. The file consists of 3 columns: id, polarity, and tweet. The three columns denote the unique id, polarity index of the text and the tweet text respectively.
Reference: Saif, H., Fernandez, M., He, Y., & Alani, H. (2013). Evaluation datasets for Twitter sentiment analysis: a survey and a new dataset, the STS-Gold.
File name: sts_gold_tweet.csv
----------- Amazon Sales Dataset ----------------
This dataset contains the ratings and reviews of 1K+ Amazon products, as listed on the official Amazon website. The data was scraped in January 2023 from the official Amazon website.
Owner: Karkavelraja J., Postgraduate student at Puducherry Technological University (Puducherry, Puducherry, India)
Features:
product_id - Product ID
product_name - Name of the Product
category - Category of the Product
discounted_price - Discounted Price of the Product
actual_price - Actual Price of the Product
discount_percentage - Percentage of Discount for the Product
rating - Rating of the Product
rating_count - Number of people who voted for the Amazon rating
about_product - Description about the Product
user_id - ID of the user who wrote review for the Product
user_name - Name of the user who wrote review for the Product
review_id - ID of the user review
review_title - Short review
review_content - Long review
img_link - Image Link of the Product
product_link - Official Website Link of the Product
License: CC BY-NC-SA 4.0
File name: amazon.csv
----------- Rotten Tomatoes Reviews Dataset ----------------
This rating inference dataset is a sentiment classification dataset containing 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews. On average, these reviews consist of 21 words. The first 5,331 rows contain only negative samples and the last 5,331 rows contain only positive samples, so the data should be shuffled before use.
This data is collected from https://www.cs.cornell.edu/people/pabo/movie-review-data/ as a txt file and converted into a csv file. The file consists of 2 columns: reviews and labels (1 for fresh (good) and 0 for rotten (bad)).
Reference: Bo Pang and Lillian Lee. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 115–124, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics
File name: data_rt.csv
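
Because the rows are ordered by class, a plain train/test split by position would put one label in each part. A minimal sketch of a reproducible shuffle follows; the placeholder rows and the function name shuffled are assumptions, and only the 5,331 + 5,331 layout comes from the description above.

```python
import random

def shuffled(rows, seed=0):
    """Return a shuffled copy of rows; a fixed seed keeps the order reproducible."""
    rng = random.Random(seed)
    out = list(rows)
    rng.shuffle(out)
    return out

# Placeholder rows mimicking the file layout: negatives (0) first, then positives (1).
rows = [("negative sentence %d" % i, 0) for i in range(5331)]
rows += [("positive sentence %d" % i, 1) for i in range(5331)]

mixed = shuffled(rows)
head_labels = [label for _, label in mixed[:100]]  # both classes should now appear
```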
----------- Preprocessed Dataset Sentiment Analysis ----------------
Preprocessed Amazon product review data for the Gen3EcoDot (Alexa), scraped entirely from amazon.in
Stemmed and lemmatized using nltk.
Sentiment labels are generated using TextBlob polarity scores.
The file consists of 4 columns: index, review (stemmed and lemmatized review using nltk), polarity (score) and division (categorical label generated using polarity score).
DOI: 10.34740/kaggle/dsv/3877817
Citation: @misc{pradeesh arumadi_2022, title={Preprocessed Dataset Sentiment Analysis}, url={https://www.kaggle.com/dsv/3877817}, DOI={10.34740/KAGGLE/DSV/3877817}, publisher={Kaggle}, author={Pradeesh Arumadi}, year={2022} }
This dataset was used in the experimental phase of my research.
File name: EcoPreprocessed.csv
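
The description does not say how the polarity score was turned into the division label. The sketch below shows one plausible mapping, assuming three classes and a small neutral band around zero; the thresholds, the label names, and the function name polarity_to_division are all hypothetical. TextBlob polarity lies in the range [-1.0, 1.0], which the function checks.

```python
def polarity_to_division(polarity, neutral_band=0.05):
    """Map a polarity score in [-1.0, 1.0] to a categorical label.

    The neutral band and the label names are assumptions, not the dataset's rule.
    """
    if not -1.0 <= polarity <= 1.0:
        raise ValueError("polarity must lie in [-1.0, 1.0]")
    if polarity > neutral_band:
        return "positive"
    if polarity < -neutral_band:
        return "negative"
    return "neutral"

labels = [polarity_to_division(p) for p in (0.8, 0.0, -0.4)]
```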
----------- Amazon Earphones Reviews ----------------
This dataset consists of 9930 Amazon reviews, with star ratings, for the 10 latest (as of mid-2019) Bluetooth earphone devices, intended for learning how to train a machine for sentiment analysis.
This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.
The file consists of 5 columns: ReviewTitle, ReviewBody, ReviewStar, Product and division (manually added - categorical label generated using ReviewStar score)
License: U.S. Government Works
Source: www.amazon.in
File name (original): AllProductReviews.csv (contains 14337 reviews)
File name (edited - used for my research) : AllProductReviews2.csv (contains 9930 reviews)
----------- Amazon Musical Instruments Reviews ----------------
This dataset contains 7137 comments/reviews of different musical instruments coming from Amazon.
This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.
The file consists of 10 columns: reviewerID, asin (ID of the product), reviewerName, helpful (helpfulness rating of the review), reviewText, overall (rating of the product), summary (summary of the review), unixReviewTime (time of the review, Unix time), reviewTime (time of the review, raw) and division (manually added; categorical label generated using the overall score).
Source: http://jmcauley.ucsd.edu/data/amazon/
File name (original): Musical_instruments_reviews.csv (contains 10261 reviews)
File name (edited - used for my research) : Musical_instruments_reviews2.csv (contains 7137 reviews)
Amazon Product Reviews Dataset
kaggle.com
Updated May 16, 2025
Cite
Gözde Kızılkaya Atik (2025). Amazon Product Reviews Dataset [Dataset]. https://www.kaggle.com/datasets/gzdekzlkaya/amazon-product-reviews-dataset
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
May 16, 2025
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Gözde Kızılkaya Atik
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
🛍️ Dataset Overview

This dataset contains over 4,900 customer reviews from Amazon, including text-based feedback, star ratings, and helpfulness votes.

It can be used for:

📊 Sentiment Analysis

🧠 Text Classification (Positive/Negative)

🔍 Review Score Prediction (based on reviewText)

🤖 Building Recommendation Systems

🧮 Helpfulness Scoring Models

📌 Key Columns

reviewText: Full written review

overall: Star rating (1 to 5)

summary: Short summary of the review

helpful_yes: Number of users who found the review helpful

total_vote: Total votes on helpfulness

day_diff: Days since the review was written

This dataset is suitable for natural language processing (NLP) and supervised learning tasks.
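
For the helpfulness scoring use case listed above, helpful_yes and total_vote can be combined into a score. The sketch below shows the raw ratio and the Wilson lower bound, a standard way to avoid ranking a review with 1 of 1 helpful votes above one with 90 of 100. This is an illustrative choice, not something the dataset prescribes; the function names are assumptions.

```python
import math

def helpfulness_ratio(helpful_yes, total_vote):
    """Fraction of votes that marked the review helpful (0.0 when there are no votes)."""
    return helpful_yes / total_vote if total_vote else 0.0

def wilson_lower_bound(helpful_yes, total_vote, z=1.96):
    """Lower bound of the Wilson score interval (z=1.96 is roughly 95% confidence)."""
    if total_vote == 0:
        return 0.0
    p = helpful_yes / total_vote
    denom = 1 + z * z / total_vote
    centre = p + z * z / (2 * total_vote)
    margin = z * math.sqrt((p * (1 - p) + z * z / (4 * total_vote)) / total_vote)
    return (centre - margin) / denom

few_votes = wilson_lower_bound(1, 1)       # perfect ratio, but only one vote
many_votes = wilson_lower_bound(90, 100)   # lower ratio, far more evidence
```

With the raw ratio the single-vote review would rank first; the lower bound penalizes it for having little evidence.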

📎 Note

This is a publicly available dataset for educational and research use.
‘App Store Reviews’ analyzed by Analyst-2
analyst-2.ai
Cite
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com), ‘App Store Reviews’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-app-store-reviews-5101/0b9dd0ab/?iid=005-006&v=presentation
Explore at:
Dataset authored and provided by
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Analysis of ‘App Store Reviews’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/yamqwe/app-store-reviews on 28 January 2022.

--- Dataset description provided by original source is as follows ---

About this dataset

The dataset contains scraped written reviews from the App store. This dataset was created by CrawlFeeds and contains around 10K reviews along with Country & Date and other features such as:

User Name

Is Edited?

Date of crawl

And more.

How to use this dataset

Analyze the sentiment of the review, try to isolate the phrases associated with positive/negative reviews.

Study the connection between country and review sentiment

Study the connection between the time of day and sentiment

Acknowledgements

If you use this dataset in your research, please credit CrawlFeeds

--- Original source retains full ownership of the source dataset ---
Goodreads Book Reviews
cseweb.ucsd.edu
json
Cite
UCSD CSE Research Project, Goodreads Book Reviews [Dataset]. https://cseweb.ucsd.edu/~jmcauley/datasets.html
Explore at:
Available download formats: json
Dataset authored and provided by
UCSD CSE Research Project
Description
These datasets contain reviews from the Goodreads book review website, and a variety of attributes describing the items. Critically, these datasets have multiple levels of user interaction, ranging from adding to a shelf to rating and reading.

Metadata includes

reviews

add-to-shelf, read, review actions

book attributes: title, isbn

graph of similar books

Basic Statistics:

Items: 1,561,465

Users: 808,749

Interactions: 225,394,930
IMDb Review Dataset - ebD
kaggle.com
Updated Jan 11, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Enam Biswas (2021). IMDb Review Dataset - ebD [Dataset]. http://doi.org/10.34740/kaggle/dsv/1836923
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Unique identifier
https://doi.org/10.34740/kaggle/dsv/1836923
Dataset updated
Jan 11, 2021
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Enam Biswas
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Context

Reviews are a way to gain insight into a product or service. In machine learning tasks, text reviews play an important role in prediction and in gaining insights. On IMDb, one of the largest databases for films, TV shows, and similar content, people provide their valuable opinions every year. These reviews eventually play a huge role in the viewer community. However, these reviews can sometimes contain spoilers, and users can mark them accordingly.

Content

The dataset was collected from IMDb, and each individual record is publicly available. The table below describes the contents of each record (dictionary):

| Content | Details |
| --- | --- |
| review_id | Generated by IMDb and unique to each review |
| reviewer | Public identity or username of the reviewer |
| movie | Name of the show (can be a movie, TV series, etc.) |
| rating | Rating of the movie out of 10; can be None for older reviews |
| review_summary | Plain summary of the review |
| review_date | Date the review was posted |
| spoiler_tag | 1 = spoiler, 0 = not a spoiler |
| review_detail | Details of the review |
| helpful | list[0] people found the review helpful out of list[1] |

Dataset Stats:
# total records = 5,571,499
# total shows = 453,528
# users = 1,699,310
# spoilers = 1,186,611
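
To make the field descriptions concrete, the sketch below normalizes one record. It assumes that helpful is a two-element list whose items may be integers or numeric strings with thousands separators, and that rating is either None or a number; the exact on-disk types are not stated above, and the example record and the function name parse_review are made up.

```python
def parse_review(record):
    """Normalize one review dictionary into typed fields.

    Assumes record["helpful"] holds [helpful_votes, total_votes].
    """
    helpful, total = (int(str(v).replace(",", "")) for v in record["helpful"])
    rating = record.get("rating")
    return {
        "review_id": record["review_id"],
        "rating": None if rating is None else int(rating),  # None for older reviews
        "is_spoiler": int(record["spoiler_tag"]) == 1,
        "helpful_ratio": helpful / total if total else None,
    }

# Hypothetical record using the field names listed above.
example = {
    "review_id": "rw0000001",
    "rating": None,
    "spoiler_tag": 1,
    "helpful": ["1,200", "1,500"],
}
parsed = parse_review(example)
```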

Acknowledgements

All rights reserved to IMDb and the users who spent valuable time providing reviews.

Inspiration

Can you predict spoiler given show plot, scripts?

Use transfer learning in predicting review sentiment on another domain.

Predict review usefulness/quality or impactful review.

Citation

If you intend to use this dataset, please cite the following - @misc{enam biswas_2021, title={IMDb Review Dataset - ebD}, url={https://www.kaggle.com/dsv/1836923}, DOI={10.34740/KAGGLE/DSV/1836923}, publisher={Kaggle}, author={Enam Biswas}, year={2021} } Please feel free to contact - Enam Biswas if you have any kind of questions.

Other datasets by me

Bangla Largest Newspaper Dataset - Almost 1.7M Bangla news articles.

Place Review Dataset - Niche (USA) - Over 700,000 reviews of USA neighborhoods.
‘Travel Review Rating Dataset’ analyzed by Analyst-2
analyst-2.ai
Updated Sep 30, 2021
Cite
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2021). ‘Travel Review Rating Dataset’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-travel-review-rating-dataset-d315/6c6ad6b1/?iid=003-929&v=presentation
Explore at:
Dataset updated
Sep 30, 2021
Dataset authored and provided by
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Analysis of ‘Travel Review Rating Dataset’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/wirachleelakiatiwong/travel-review-rating-dataset on 30 September 2021.

--- Dataset description provided by original source is as follows ---

Context

This data set has been sourced from the Machine Learning Repository of University of California, Irvine (UC Irvine) : Travel Review Ratings Data Set. This data set is populated by capturing user ratings from Google reviews. Reviews on attractions from 24 categories across Europe are considered. Google user rating ranges from 1 to 5 and average user rating per category is calculated.

Content

Attribute 1 : Unique user id
Attribute 2 : Average ratings on churches
Attribute 3 : Average ratings on resorts
Attribute 4 : Average ratings on beaches
Attribute 5 : Average ratings on parks
Attribute 6 : Average ratings on theatres
Attribute 7 : Average ratings on museums
Attribute 8 : Average ratings on malls
Attribute 9 : Average ratings on zoo
Attribute 10 : Average ratings on restaurants
Attribute 11 : Average ratings on pubs/bars
Attribute 12 : Average ratings on local services
Attribute 13 : Average ratings on burger/pizza shops
Attribute 14 : Average ratings on hotels/other lodgings
Attribute 15 : Average ratings on juice bars
Attribute 16 : Average ratings on art galleries
Attribute 17 : Average ratings on dance clubs
Attribute 18 : Average ratings on swimming pools
Attribute 19 : Average ratings on gyms
Attribute 20 : Average ratings on bakeries
Attribute 21 : Average ratings on beauty & spas
Attribute 22 : Average ratings on cafes
Attribute 23 : Average ratings on view points
Attribute 24 : Average ratings on monuments
Attribute 25 : Average ratings on gardens

Acknowledgements

This data set has been sourced from the Machine Learning Repository of University of California, Irvine (UC Irvine) : Travel Review Ratings Data Set

The UCI page mentions the following publication as the original source of the data set: Renjith, Shini, A. Sreekumar, and M. Jathavedan. 2018. Evaluation of Partitioning Clustering Algorithms for Processing Social Media Data in Tourism Domain. In 2018 IEEE Recent Advances in Intelligent Computational Systems (RAICS), 127–31. IEEE

Inspiration

I'm the kind of person who loves traveling. But sometimes I have problems like: where should I visit? Are there interesting places that match my lifestyle? Often I spend hours searching for an interesting place to go. Such a waste of time.

What if we could build a recommender system that recommends several interesting venues based on your preferences? With information from Google reviews, I'll try to divide Google review users into clusters of similar interests, as groundwork for building a recommender system based on their preferences.
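
The clustering idea described here can be sketched with a small k-means over per-user average ratings. This is a toy, dependency-free version under stated assumptions: the four users and three rating columns are invented, centroids are initialized from the first k points (real work would use a better initialization or a library such as scikit-learn), and the function name kmeans is just a label for the example.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain k-means: returns (assignments, centroids).

    Centroids start at the first k points, which keeps the sketch
    deterministic but is not a robust initialization.
    """
    centroids = [list(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest centroid (Euclidean distance).
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centroids

# Hypothetical users: average ratings for (beaches, museums, art galleries).
users = [
    (4.5, 1.0, 0.5),
    (0.8, 4.2, 3.9),
    (4.8, 0.7, 0.9),
    (1.1, 3.8, 4.4),
]
assignments, centroids = kmeans(users, k=2)
```

Users in the same cluster share rating patterns, so venues rated highly by one member become candidate recommendations for the others.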

--- Original source retains full ownership of the source dataset ---
Place Review Dataset - Niche (USA)
kaggle.com
Updated Jan 13, 2021
Cite
Enam Biswas (2021). Place Review Dataset - Niche (USA) [Dataset]. http://doi.org/10.34740/kaggle/dsv/1842046
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Unique identifier
https://doi.org/10.34740/kaggle/dsv/1842046
Dataset updated
Jan 13, 2021
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Enam Biswas
License
https://creativecommons.org/publicdomain/zero/1.0/
Area covered
United States
Description
Context

Reviews are a way to gain insight into a product or service. In machine learning tasks, text reviews play an important role in prediction and in gaining insights. User-generated place reviews are extremely handy when it comes to choosing a neighborhood to live in. Niche has a huge number of reviews and ratings for American neighborhoods, which is perfect for several NLP tasks.

Content

The dataset was collected from Niche, and each individual record is publicly available. Overall dataset stats:
# total records = 712,107
# total places = 56,800

Some insight about the data:
# guid: Generated by Niche and unique to each place/entity.
# body: Actual review data.
# rating: Rating on a scale of 0 to 5.
# author: Provider of the review/rating (aka Niche user).
# created: Timestamp.
# categories: Experience type (about the entity).

Acknowledgements

All rights reserved to Niche and the users who spent valuable time providing reviews and ratings.

Inspiration

Can you predict a rating for reviews that have no rating?

Use transfer learning in predicting review sentiment on another domain.

Predict review usefulness/quality or impactful review.

Citation

If you intend to use this dataset, please cite the following - @misc{enam biswas_2021, title={Place Review Dataset - Niche (USA)}, url={https://www.kaggle.com/dsv/1842046}, DOI={10.34740/KAGGLE/DSV/1842046}, publisher={Kaggle}, author={Enam Biswas}, year={2021} } Please feel free to contact - Enam Biswas if you have any kind of questions.

Other datasets by me

IMDb Largest Review Dataset - Over 5.5M reviews/ 1.2M spoilers.

Bangla Largest Newspaper Dataset - Almost 1.7M Bangla news articles.
imdb_reviews
tensorflow.org
kaggle.com
Updated Sep 20, 2024
+ more versions
Cite
(2024). imdb_reviews [Dataset]. https://www.tensorflow.org/datasets/catalog/imdb_reviews
Explore at:
Dataset updated
Sep 20, 2024
Description
Large Movie Review Dataset. This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well.

To use this dataset:

    import tensorflow_datasets as tfds

    ds = tfds.load('imdb_reviews', split='train')
    for ex in ds.take(4):
        print(ex)

See the guide for more information on tensorflow_datasets.
‘⭐ McDonalds Review Sentiment’ analyzed by Analyst-2
analyst-2.ai
Updated Feb 13, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘⭐ McDonalds Review Sentiment’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-mcdonalds-review-sentiment-6d6c/9da444f4/?iid=000-968&v=presentation
Explore at:
Dataset updated
Feb 13, 2022
Dataset authored and provided by
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Analysis of ‘⭐ McDonalds Review Sentiment’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/yamqwe/mcdonalds-review-sentimente on 13 February 2022.

--- Dataset description provided by original source is as follows ---

About this dataset

A sentiment analysis of negative McDonald's reviews. Contributors were given reviews culled from low-rated McDonald's from random metro areas and asked to classify why the locations received low reviews. Options given were:

Rude Service

Slow Service

Problem with Order

Bad Food

Bad Neighborhood

Dirty Location

Cost

Missing Item

Added: March 6, 2015 by CrowdFlower | Data Rows: 1500

Source: https://www.crowdflower.com/data-for-everyone/

This dataset was created by CrowdFlower and contains around 2000 samples along with Unit State, Policies Violated, technical information and other features such as: - Review - Policies Violated Gold - and more.

How to use this dataset

Analyze Policies Violated:confidence in relation to City

Study the influence of Last Judgment At on Trusted Judgments

Acknowledgements

If you use this dataset in your research, please credit CrowdFlower

--- Original source retains full ownership of the source dataset ---
Financial Fraud Alert Review Dataset
springernature.figshare.com
zip
Updated Apr 24, 2025
Cite
Jean V. Alves; Diogo Leitão; Sérgio Jesus; Marco O. P. Sampaio; Javier Liébana; Pedro Saleiro; Mário A. T. Figueiredo; Pedro Bizarro (2025). Financial Fraud Alert Review Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.28351172.v1
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.6084/m9.figshare.28351172.v1
Dataset updated
Apr 24, 2025
Dataset provided by
figshare
Authors
Jean V. Alves; Diogo Leitão; Sérgio Jesus; Marco O. P. Sampaio; Javier Liébana; Pedro Saleiro; Mário A. T. Figueiredo; Pedro Bizarro
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The FiFAR dataset comprises 30K bank account opening application instances, accompanied by the judgments of a team of 50 synthetic fraud analysts, with realistic decision-making properties, on whether or not each instance is a fraudulent application. Each instance contains information regarding the bank account opening application and the applicant, as well as the ground truth label: 0 - legitimate, 1 - fraudulent. Furthermore, each instance contains the prediction of each of the 50 experts, following the same convention as the label. We provide every expert's prediction for each of the 30K instances in the Bank Account Fraud dataset (https://www.kaggle.com/datasets/sgpjesus/bank-account-fraud-dataset-neurips-2022/versions/1?select=Base.csv) that were deemed fraudulent by a fraud detection model, thus simulating an "alert-review" scenario, where experts are tasked with reviewing high-risk bank account opening applications.
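
Given ground-truth labels and per-expert predictions that share the 0/1 convention described above, each synthetic analyst can be scored against the truth. The sketch below is illustrative only: the tiny label and prediction lists, the expert names, and the function name expert_metrics are invented, and the real files and column names are not specified here.

```python
def expert_metrics(labels, predictions):
    """Accuracy, false positive rate, and recall for one expert (1 = fraudulent)."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(labels, predictions):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        else:
            fn += 1
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else None,
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else None,
        "recall": tp / (tp + fn) if (tp + fn) else None,
    }

# Hypothetical labels and predictions for two of the experts.
labels = [1, 0, 0, 1, 0]
experts = {
    "expert_0": [1, 0, 1, 1, 0],
    "expert_1": [0, 0, 0, 1, 0],
}
report = {name: expert_metrics(labels, preds) for name, preds in experts.items()}
```

Both experts here have the same accuracy but make different kinds of errors, which is why the false positive rate and recall are reported separately.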
PDMX
cseweb.ucsd.edu
json
+ more versions
Cite
UCSD CSE Research Project, PDMX [Dataset]. https://cseweb.ucsd.edu/~jmcauley/datasets.html
Explore at:
Available download formats: json
Dataset authored and provided by
UCSD CSE Research Project
Description
We introduce PDMX: a Public Domain MusicXML dataset for symbolic music processing, including over 250k musical scores in MusicXML format. PDMX is the largest publicly available, copyright-free MusicXML dataset in existence. PDMX includes genre, tag, description, and popularity metadata for every file.
Pinterest Fashion Compatibility
cseweb.ucsd.edu
beta.data.urbandatacentre.ca
json
+ more versions
Cite
UCSD CSE Research Project, Pinterest Fashion Compatibility [Dataset]. https://cseweb.ucsd.edu/~jmcauley/datasets.html
Explore at:
Available download formats: json
Dataset authored and provided by
UCSD CSE Research Project
Description
This dataset contains images (scenes) containing fashion products, which are labeled with bounding boxes and links to the corresponding products.

Metadata includes

product IDs

bounding boxes

Basic Statistics:

Scenes: 47,739

Products: 38,111

Scene-Product Pairs: 93,274
Product Exchange/Bartering Data
cseweb.ucsd.edu
json
+ more versions
Cite
UCSD CSE Research Project, Product Exchange/Bartering Data [Dataset]. https://cseweb.ucsd.edu/~jmcauley/datasets.html
Explore at:
Available download formats: json
Dataset authored and provided by
UCSD CSE Research Project
Description
These datasets contain peer-to-peer trades from various recommendation platforms.

Metadata includes

peer-to-peer trades

have and want lists

image data (tradesy)
Distributed peer review anonymized dataset
kaggle.com
Updated May 5, 2021
Cite
Sirisha Siri (2021). Distributed peer review anonymized dataset [Dataset]. https://www.kaggle.com/ishadss/distributed-peer-review-anonymized-dataset
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
May 5, 2021
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Sirisha Siri
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Context

While ancient scientists often had patrons to fund their work, peer review of proposals for the allocation of resources is a foundation of modern science

Content

This is the anonymized dataset obtained from the DPR Experiment run at ESO in Fall 2018

Acknowledgements

Previous work is available at 10.1038/s41550-020-1038-y
Replication Data for: "A Topic-based Segmentation Model for Identifying...
search.dataone.org
dataverse.harvard.edu
Updated Sep 25, 2024
Cite
Kim, Sunghoon; Lee, Sanghak; McCulloch, Robert (2024). Replication Data for: "A Topic-based Segmentation Model for Identifying Segment-Level Drivers of Star Ratings from Unstructured Text Reviews" [Dataset]. http://doi.org/10.7910/DVN/EE3DE2
Explore at:
Unique identifier
https://doi.org/10.7910/DVN/EE3DE2
Dataset updated
Sep 25, 2024
Dataset provided by
Harvard Dataverse
Authors
Kim, Sunghoon; Lee, Sanghak; McCulloch, Robert
Description
We provide instructions, code and datasets for replicating the article by Kim, Lee and McCulloch (2024), "A Topic-based Segmentation Model for Identifying Segment-Level Drivers of Star Ratings from Unstructured Text Reviews." This repository provides a user-friendly R package that lets researchers and practitioners apply the Topic-based Segmentation Model with Unstructured Texts (latent class regression with group variable selection) to their own datasets.

First, we provide R code to replicate the illustrative simulation study: see file 1. Second, we provide the user-friendly R package with a very simple example code to help apply the model to real-world datasets: see file 2, Package_MixtureRegression_GroupVariableSelection.R and Dendrogram.R. Third, we provide a set of code and instructions to replicate the empirical studies of customer-level segmentation and restaurant-level segmentation with Yelp reviews data: see files 3-a, 3-b, 4-a, 4-b. Note that, due to Yelp's dataset terms of use and the restriction on data size, we provide a link to download the same Yelp datasets (https://www.kaggle.com/datasets/yelp-dataset/yelp-dataset/versions/6). Fourth, we provide a set of code and datasets to replicate the empirical study with professor ratings reviews data: see file 5. Please see more details in the description text and comments of each file.

[A guide on how to use the code to reproduce each study in the paper]

1. Full codes for replicating Illustrative simulation study.txt -- [see Table 2 and Figure 2 in main text]: This is R source code to replicate the illustrative simulation study. Please run it from beginning to end in R. In addition to estimated coefficients (posterior means of coefficients), indicators of variable selections, and segment memberships, you will get the dendrograms of selected groups of variables shown in Figure 2. Computing time is approximately 20 to 30 minutes.

3-a. Preprocessing raw Yelp Reviews for Customer-level Segmentation.txt: Code for preprocessing the downloaded unstructured Yelp review data and preparing the DV and IVs matrix for the customer-level segmentation study.

3-b. Instruction for replicating Customer-level Segmentation analysis.txt -- [see Table 10 in main text; Tables F-1, F-2, and F-3 and Figure F-1 in Web Appendix]: Code for replicating the customer-level segmentation study with Yelp data. You will get estimated coefficients (posterior means of coefficients), indicators of variable selections, and segment memberships. Computing time is approximately 3 to 4 hours.

4-a. Preprocessing raw Yelp reviews_Restaruant Segmentation (1).txt: R code for preprocessing the downloaded unstructured Yelp data and preparing the DV and IVs matrix for the restaurant-level segmentation study.

4-b. Instructions for replicating restaurant-level segmentation analysis.txt -- [see Tables 5, 6 and 7 in main text; Tables E-4 and E-5 and Figure H-1 in Web Appendix]: Code for replicating the restaurant-level segmentation study with Yelp data. You will get estimated coefficients (posterior means of coefficients), indicators of variable selections, and segment memberships. Computing time is approximately 10 to 12 hours.

[Guidelines for running Benchmark models in Table 6]

Unsupervised Topic model: 'topicmodels' package in R -- after determining the number of topics (e.g., with the 'ldatuning' R package), run the 'LDA' function in the 'topicmodels' package. Then, compute topic probabilities per restaurant (with the 'posterior' function in the package), which can be used as predictors. Then, conduct prediction with regression.

Hierarchical topic model (HDP): 'gensimr' R package -- 'model_hdp' function for identifying topics in the package (see https://radimrehurek.com/gensim/models/hdpmodel.html or https://gensimr.news-r.org/).

Supervised topic model: 'lda' R package -- 'slda.em' function for training and 'slda.predict' for prediction.

Aggregate regression: 'lm' default function in R.

Latent class regression without variable selection: 'flexmix' function in the 'flexmix' R package. Run flexmix with a certain number of segments (e.g., 3 segments in this study). Then, with the estimated coefficients and memberships, predict the dependent variable for each segment.

Latent class regression with variable selection: 'Unconstraind_Bayes_Mixture' function in Kim, Fong and DeSarbo (2012)'s package. Run the Kim et al. (2012) model with a certain number of segments (e.g., 3 segments in this study). Then, with the estimated coefficients and memberships, predict the dependent variable for each segment. The same R package ('KimFongDeSarbo2012.zip') can be downloaded at: https://sites.google.com/scarletmail.rutgers.edu/r-code-packages/home

5. Instructions for replicating Professor ratings review study.txt -- [see Tables G-1, G-2, G-4 and G-5, and Figures G-1 and H-2 in Web Appendix]: Code to replicate the Professor ratings reviews study. Computing time is approximately 10 hours.

[A list of the versions of R, packages, and computer...
‘uHack Sentiments 2.0: Decode Code Words’ analyzed by Analyst-2
analyst-2.ai
Updated Dec 28, 2021
Cite
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2021). ‘uHack Sentiments 2.0: Decode Code Words’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-uhack-sentiments-2-0-decode-code-words-ce3a/88e2b3fd/?iid=004-194&v=presentation
Explore at:
Dataset updated
Dec 28, 2021
Dataset authored and provided by
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Analysis of ‘uHack Sentiments 2.0: Decode Code Words’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/manishtripathi86/uhack-sentiments-20-decode-code-words on 28 January 2022.

--- Dataset description provided by original source is as follows ---

The challenge here is to analyze and deep dive into the natural language text (reviews) and bucket them based on their topics of discussion. Furthermore, analyzing the overall sentiment will also help the business to make tangible decisions.

The data set provided to you has a mix of customer reviews for products across categories and retailers. We would like you to build a model on the data that can:

Bucket future reviews into their respective topics (note: a review can talk about multiple topics)

Predict the overall polarity (positive/negative sentiment)

Train: 6136 rows x 14 columns

Test: 2631 rows x 14 columns

Topics: Components, Delivery and Customer Support, Design and Aesthetics, Dimensions, Features, Functionality, Installation, Material, Price, Quality and Usability

Polarity: Positive/Negative

Note: The target variables are all encoded in the train dataset for convenience. Please submit the test results in a similarly encoded fashion so that we can evaluate your results.

| Field Name | Data Type | Purpose | Variable type |
| --- | --- | --- | --- |
| Id | Integer | Unique identifier for each review | Input |
| Review | String | Review written by customers on a retail website | Input |
| Components | String | 1: aspects related to components; 0: None | Target |
| Delivery and Customer Support | String | 1: some aspects related to delivery, return, exchange and customer support; 0: None | Target |
| Design and Aesthetics | String | 1: some aspects related to components; 0: None | Target |
| Dimensions | String | 1: related to product dimension and size; 0: None | Target |
| Features | String | 1: related to product features; 0: None | Target |
| Functionality | String | 1: related to working of a product; 0: None | Target |
| Installation | String | 1: related to installation of the product; 0: None | Target |
| Material | String | 1: related to material of the product; 0: None | Target |
| Price | String | 1: related to pricing details of a product; 0: None | Target |
| Quality | String | 1: related to quality aspects of a product; 0: None | Target |
| Usability | String | 1: related to usability of a product; 0: None | Target |
| Polarity | Integer | 1: Positive sentiment; 0: Negative sentiment | Target |
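The target encoding described by the field table can be illustrated with a small Python sketch. The topic names come from the table; the helper `encode_targets` and the choice to store the flags as integers are assumptions made for illustration (the table lists the topic columns as String), and this is not the organizers' code.

```python
# Topic names are taken from the field table; everything else is illustrative.
TOPICS = [
    "Components",
    "Delivery and Customer Support",
    "Design and Aesthetics",
    "Dimensions",
    "Features",
    "Functionality",
    "Installation",
    "Material",
    "Price",
    "Quality",
    "Usability",
]


def encode_targets(topics, positive):
    """Hypothetical helper: build the 0/1 target columns for one review.

    `topics` is the set of topics the review talks about (possibly several),
    `positive` is True for positive sentiment and False for negative.
    """
    unknown = set(topics) - set(TOPICS)
    if unknown:
        raise ValueError(f"unknown topics: {sorted(unknown)}")
    # One flag per topic: 1 if the review mentions it, 0 otherwise.
    row = {name: int(name in topics) for name in TOPICS}
    # Polarity is a single binary column.
    row["Polarity"] = int(positive)
    return row


row = encode_targets({"Price", "Quality"}, positive=True)
print(row)
```

Because a review can mention several topics, each topic gets its own independent 0/1 column rather than a single categorical label.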

Skills:

Text Pre-processing – Lemmatization, Tokenization, N-Grams and other relevant methods

Multi-Class Classification, Multi-label Classification

Optimizing Log Loss
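Since the skills list mentions optimizing log loss, here is a minimal Python sketch of how log loss could be computed for multi-label predictions. The description does not state the exact evaluation formula, so averaging the binary log loss over the target columns is an assumption, and both function names are hypothetical.

```python
import math


def binary_log_loss(y_true, y_prob, eps=1e-15):
    """Mean binary log loss for one target column.

    Probabilities are clipped to [eps, 1 - eps] so log(0) is never evaluated.
    """
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


def mean_column_log_loss(true_rows, prob_rows):
    """Average the per-column log loss over all target columns (assumed metric)."""
    n_cols = len(true_rows[0])
    losses = []
    for j in range(n_cols):
        col_true = [row[j] for row in true_rows]
        col_prob = [row[j] for row in prob_rows]
        losses.append(binary_log_loss(col_true, col_prob))
    return sum(losses) / n_cols


# Two reviews, two target columns; an uninformative model predicts 0.5 everywhere.
truth = [[1, 0], [0, 1]]
probs = [[0.5, 0.5], [0.5, 0.5]]
print(mean_column_log_loss(truth, probs))
```

A model that predicts 0.5 for every label scores ln 2 (about 0.693) under this definition, which gives a simple baseline to beat.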

Overview: Ugam, a Merkle company, is a leading analytics and technology services company. Our customer-centric approach delivers impactful business results for large corporations by leveraging data, technology, and expertise.

We consistently deliver superior, impactful results through the right blend of human intelligence and AI. With 3300+ people spread across locations worldwide, we successfully deploy our services to create success stories across industries like Retail & Consumer Brands, High Tech, BFSI, Distribution, and Market Research & Consulting. Over the past 21 years, Ugam has been recognized by several firms including Forrester and Gartner, named the No.1 data science company in India by Analytics Insight, and certified as a Great Place to Work®.

Problem Statement: The last two decades have witnessed a significant change in how consumers purchase products and express their experience/opinions in reviews, posts, and content across platforms. These online reviews are not only useful to reflect customers’ sentiment towards a product but also help businesses fix gaps and find potential opportunities which could further influence future purchases.

Participants need to develop a machine learning model that can analyse customers’ sentiments based on their reviews and feedback.

NOTE: The prize money is for candidates who are willing to be interviewed or hired by Ugam. Winners are requested to come to the Machine Learning Developers Summit 2022 in Bangalore to receive the prize money.

dataset link: https://machinehack.com/hackathon/uhack_sentiments_20_decode_code_words/overview

--- Original source retains full ownership of the source dataset ---
Data from: Bag of Words Meets Bags of Popcorn
kaggle.com
zip
Updated May 18, 2017
Cite
rocha (2017). Bag of Words Meets Bags of Popcorn [Dataset]. https://www.kaggle.com/rochachan/bag-of-words-meets-bags-of-popcorn
Explore at:
zip (13788314 bytes)
Available download formats
Dataset updated
May 18, 2017
Authors
rocha
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Context

The competition ended over 2 years ago. I just want to play around with the dataset.

Content

The labeled data set consists of 50,000 IMDB movie reviews, specially selected for sentiment analysis. The sentiment of reviews is binary: an IMDB rating < 5 results in a sentiment score of 0, and a rating >= 7 results in a sentiment score of 1. No individual movie has more than 30 reviews. The 25,000-review labeled training set does not include any of the same movies as the 25,000-review test set. In addition, there are another 50,000 IMDB reviews provided without any rating labels.

id - Unique ID of each review

sentiment - Sentiment of the review; 1 for positive reviews and 0 for negative reviews

review - Text of the review
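The rule behind the sentiment column can be sketched in a few lines of Python. This is only an illustration of the thresholds quoted in the description, not the code used to build the dataset; the function name `rating_to_sentiment` is hypothetical, and returning `None` for ratings of 5 or 6 is an assumption, since the description assigns no label to those ratings.

```python
def rating_to_sentiment(rating):
    """Map an IMDB rating to the binary sentiment label described above.

    Ratings below 5 map to 0 (negative) and ratings of 7 or more map to 1
    (positive). Ratings in between get no label (assumption: returned as None).
    """
    if rating < 5:
        return 0
    if rating >= 7:
        return 1
    return None


for rating in (3, 5, 8):
    print(rating, rating_to_sentiment(rating))
```

Leaving the middle ratings unlabeled keeps the two classes clearly separated, which is consistent with the description of the reviews as specially selected for sentiment analysis.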

Acknowledgements

The original source is here. An awesome tutorial is here; we can play with it.

Inspiration

Just for study and learning
