https://creativecommons.org/publicdomain/zero/1.0/
Welcome. This is a Women’s Clothing E-Commerce dataset revolving around the reviews written by customers. Its nine supportive features offer a great environment to parse out the text through its multiple dimensions. Because this is real commercial data, it has been anonymized, and references to the company in the review text and body have been replaced with “retailer”.
This dataset includes 23486 rows and 10 feature variables. Each row corresponds to a customer review, and includes the variables:
Anonymous but real source
I look forward to some quality NLP! There are also some great opportunities for feature engineering and multivariate analysis.
Statistical Analysis on E-Commerce Reviews, with Sentiment Classification using Bidirectional Recurrent Neural Network
by Abien Fred Agarap - Github
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
MultiEmo is a new benchmark dataset for the multilingual sentiment analysis task, covering 11 languages. The collection contains consumer reviews from four domains: medicine, hotels, products and university. The original reviews in Polish comprised 8,216 documents consisting of 57,466 sentences. The reviews were manually annotated with sentiment at the level of the whole document and at the level of a sentence (3 annotators per element). We achieved a high Positive Specific Agreement value of 0.91 for texts and 0.88 for sentences. The collection was then translated automatically into English, Chinese, Italian, Japanese, Russian, German, Spanish, French, Dutch and Portuguese. MultiEmo is publicly available under a Creative Commons Attribution 4.0 International Licence.
More information: https://github.com/CLARIN-PL/multiemo
This dataset consists of a few million Amazon customer reviews (input text) and star ratings (output labels) for learning how to train fastText for sentiment analysis.
The idea here is a dataset that is more than a toy (real business data at a reasonable scale) but that a model can still be trained on in minutes on a modest laptop.
The fastText supervised learning tutorial requires data in the following format:
__label__
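As a minimal sketch of that format, assuming each review is available as a (star rating, text) pair: fastText's supervised mode reads one example per line, with each label carrying the `__label__` prefix (the library's default) followed by the text. The helper name `to_fasttext_line` and the sample reviews are hypothetical and are not taken from the dataset.

```python
def to_fasttext_line(rating: int, text: str) -> str:
    """Format one review as a single fastText supervised-training line."""
    # fastText treats each line as one example, so collapse embedded
    # newlines and repeated whitespace into single spaces.
    clean_text = " ".join(text.split())
    return f"__label__{rating} {clean_text}"


# Hypothetical sample reviews (assumption: rating is an integer star count).
reviews = [
    (5, "Great product,\nworks as described."),
    (1, "Stopped working after a week."),
]
lines = [to_fasttext_line(rating, text) for rating, text in reviews]
```

Writing `lines` to a text file, one entry per line, would give a file in the shape the tutorial expects.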
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is automatically generated by web scraping from sites such as Tripadvisor or Google Maps reviews. On these sites, users post comments with ratings, which gives us labeled data. The code that generated this dataset can be found at the following URL:
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This package contains the raw open data for the study
Marco Ortu, Giuseppe Destefanis, Daniel Graziotin, Michele Marchesi, Roberto Tonelli. 2020. How do you propose your code changes? Empirical Analysis of Affect Metrics of Pull Requests on GitHub. Under Review.
The dataset is based on the GHTorrent dataset:
Georgios Gousios. 2013. The GHTorent dataset and tool suite. In Proceedings of the 10th Working Conference on Mining Software Repositories (MSR ’13). IEEE Press, 233–236
It is released under the same license (CC BY-SA 4.0).
https://creativecommons.org/publicdomain/zero/1.0/
*Also find Metacritic Movies and Metacritic TV Shows datasets.*
This dataset contains a collection of video games and their corresponding reviews from Metacritic, a popular aggregate review site. The data provides insights into various video games across different platforms, including PC, PlayStation, Xbox, and others. Each game entry includes critical reviews, user reviews, ratings, and other relevant information that can be used for analysis, natural language processing, machine learning, and predictive modeling.
Important Note: *The games in this collection are selected from Metacritic's Best Games of All Time list, which only includes titles that have received at least 7 reviews, ensuring a minimum level of critical and user input.*
Up-to-dateness: *This dataset is accurate as of March 14, 2025, and includes the most current rankings and game details available at that time.*
The dataset contains general information and scores of 13K+ games and their corresponding 1.6M+ user/critic reviews collected by sending automated requests to Metacritic's public backend API using Python's requests and pandas libraries.
This dataset is perfect for researchers, game enthusiasts, and data scientists who are interested in exploring the gaming industry through data analysis.
https://choosealicense.com/licenses/unknown/
This dataset is a Romanian Sentiment Analysis dataset. It is present in a processed form, as used by the authors of Romanian Transformers in their examples and based on the original data present in https://github.com/katakonst/sentiment-analysis-tensorflow. The original dataset is collected from product and movie reviews in Romanian.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Animal Crossing Reviews’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/jessemostipak/animal-crossing on 28 January 2022.
--- Dataset description provided by original source is as follows ---
The data this week comes from the VillagerDB and Metacritic. VillagerDB brings info about villagers, items, crafting, accessories, including links to their images. Metacritic brings user and critic reviews of the game (scores and raw text).
Per Wikipedia:
Animal Crossing: New Horizons is a 2020 life simulation video game developed and published by Nintendo for the Nintendo Switch. It is the fifth main series title in the Animal Crossing series. New Horizons was released in all regions on March 20, 2020.
New Horizons sees the player assuming the role of a customizable character who moves to a deserted island after purchasing a package from Tom Nook, a tanuki character who has appeared in every entry in the Animal Crossing series. Taking place in real-time, the player can explore the island in a nonlinear fashion, gathering and crafting items, catching insects and fish, and developing the island into a community of anthropomorphic animals.
Animal Crossing as explained by a Polygon opinion piece.
With just a few design twists, the work behind collecting hundreds or even thousands of items over weeks and months becomes an exercise of mindfulness, predictability, and agency that many players find soothing instead of annoying.
Games that feature gentle progression give us a sense of progress and achievability, teaching us that putting in a little work consistently while taking things one step at a time can give us some fantastic results. It’s a good life lesson, as well as a way to calm yourself and others, and it’s all achieved through game design.
Some potential context for user_reviews.tsv from 538 and a point of potential strife via Animal Crossing World, and lastly a spoiler article analyzing the reviews in R by Boon Tan.
PS there is an easter egg somewhere in the readme - something to do with... turnips.
The data was downloaded and cleaned by Thomas Mock for #TidyTuesday during the week of May 4th, 2020. You can see the code used to clean the data in the #TidyTuesday GitHub repository.
Potential Analyses:
--- Original source retains full ownership of the source dataset ---
Replication materials for "A Review of Best Practice Recommendations for Text-Analysis in R (and a User Friendly App)". You can also find these materials in the GitHub repo (https://github.com/wesslen/text-analysis-org-science), and the Shiny app in its own GitHub repo (https://github.com/wesslen/topicApp).
Few Arabic datasets are available for classification comparison and other NLP tasks. This dataset is mainly a compilation of several available datasets and a sampling of 100k rows (99999 to be exact).
The dataset combines reviews from hotels, books, movies, products and a few airlines. It has three classes (Mixed, Negative and Positive). Most were mapped from reviewers' ratings with 3 being mixed, above 3 positive and below 3 negative. Each row has a label and text separated by a tab (tsv). Text (reviews) were cleaned by removing Arabic diacritics and non-Arabic characters. The dataset has no duplicate reviews.
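The labeling and cleaning rules described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' actual preprocessing: the Unicode ranges (U+064B to U+0652 for diacritics, U+0621 to U+064A for Arabic letters), the label-first column order, and all function names are assumptions.

```python
import re

# Assumed ranges: common Arabic diacritics (tashkeel) and basic Arabic letters.
DIACRITICS = re.compile(r"[\u064B-\u0652]")
NON_ARABIC = re.compile(r"[^\u0621-\u064A\s]")


def rating_to_label(rating: int) -> str:
    """Map a reviewer rating to a class: 3 is Mixed, above 3 Positive, below 3 Negative."""
    if rating > 3:
        return "Positive"
    if rating < 3:
        return "Negative"
    return "Mixed"


def clean_review(text: str) -> str:
    """Strip diacritics, replace non-Arabic characters with spaces, collapse whitespace."""
    text = DIACRITICS.sub("", text)
    text = NON_ARABIC.sub(" ", text)
    return " ".join(text.split())


def to_tsv_row(rating: int, text: str) -> str:
    """Build one tab-separated row (assumption: label first, then text)."""
    return f"{rating_to_label(rating)}\t{clean_review(text)}"
```

Replacing non-Arabic characters with a space rather than deleting them keeps words on either side of punctuation from being fused together.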
The hotels and book reviews are a subset of HARD and BRAD. The rest were selected from hadyelsahar with a little over 100 airlines reviews collected manually.
Let's jump in and use your best tools to beat the SOTA! Don't forget to show and share your work.
Attribution-NonCommercial 3.0 (CC BY-NC 3.0) https://creativecommons.org/licenses/by-nc/3.0/
License information was derived automatically
While the NoReC dataset was primarily created for training and evaluating models for document-level sentiment analysis, many other use cases are of course possible. The corpus comprises more than 35,000 full-text reviews extracted from eight different major Norwegian news sources: Dagbladet, VG, Aftenposten, Bergens Tidende, Fædrelandsvennen, Stavanger Aftenblad, DinSide.no and P3.no. The reviews cover a range of different domains, including literature, movies, video games, restaurants, music and theater, in addition to product reviews across a range of categories. Each review is labeled with a manually assigned score of 1–6, as provided by the rating of the original author. The texts have been pre-processed using UDPipe and are distributed in the CoNLL-U format. However, we also provide HTML files with the raw texts. Documentation and an accompanying Python package are provided through the following git repository: https://github.com/ltgoslo/norec
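For readers who have not worked with CoNLL-U before, here is a minimal reading sketch that does not depend on the accompanying Python package. It relies only on the general CoNLL-U conventions: ten tab-separated columns per token line, comment lines starting with `#`, and a blank line between sentences. The function name `parse_conllu` and the sample sentence are hypothetical and are not taken from the corpus.

```python
CONLLU_FIELDS = (
    "id", "form", "lemma", "upos", "xpos",
    "feats", "head", "deprel", "deps", "misc",
)


def parse_conllu(text: str) -> list:
    """Return a list of sentences, each a list of token dicts."""
    sentences, tokens = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if tokens:
                sentences.append(tokens)
                tokens = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment or metadata
        columns = line.split("\t")
        if len(columns) != len(CONLLU_FIELDS):
            raise ValueError(f"expected 10 columns, got {len(columns)}: {line!r}")
        tokens.append(dict(zip(CONLLU_FIELDS, columns)))
    if tokens:
        sentences.append(tokens)
    return sentences


# Hypothetical three-token sentence, built with explicit tabs.
SAMPLE = "\n".join([
    "# text = Filmen er bra",
    "\t".join(["1", "Filmen", "film", "NOUN", "_", "_", "3", "nsubj", "_", "_"]),
    "\t".join(["2", "er", "være", "AUX", "_", "_", "3", "cop", "_", "_"]),
    "\t".join(["3", "bra", "bra", "ADJ", "_", "_", "0", "root", "_", "_"]),
    "",
])
sentences = parse_conllu(SAMPLE)
```

For real work, the repository's own package or a dedicated CoNLL-U library is likely the better choice. The sketch is only meant to show the file layout.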
This dataset contains the reviews and ratings of Pink Floyd's The Dark Side of the Moon from users of rateyourmusic.com.
The dataset was acquired by scraping on 15 October 2021. It contains 1544 reviews and ratings (if the user rated the album).
The scraper can be found at this GitHub Repo.
The reviews can be found here.
This dataset can be used to practice data cleaning, exploratory data analysis, and sentiment analysis.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset: Towards Trustworthy Sentiment Analysis in Software Engineering: Dataset Characteristics and Tool Selection
Authors
Martin Obaidi, Marc Herrmann, Jil Klünder, Kurt Schneider
Description
This dataset accompanies the publication: Towards Trustworthy Sentiment Analysis in Software Engineering: Dataset Characteristics and Tool Selection
The dataset contains all coded data and annotation results from a comprehensive analysis of sentiment and linguistic characteristics in software engineering communication. The study benchmarks 14 sentiment analysis tools across 10 datasets from five major SE platforms and investigates how dataset characteristics impact tool performance and selection. The coded data underpins the development of a practical questionnaire-based recommendation approach for trustworthy and context-sensitive sentiment analysis in SE.
Contents
The dataset includes the following file:
All_Sample_Sets_Coded-v04.xlsx
Contains manually coded sample sets from five platforms (App Reviews, Code Reviews, GitHub, Jira, Stack Overflow).
Each worksheet corresponds to one platform and provides:
The raw text of the communication sample (“Text”).
Gold-standard sentiment labels (“oracle”): -1 = Negative, 0 = Neutral, 1 = Positive.
Annotations for 13 linguistic characteristics: for each characteristic, x = present, n = not present, and an empty cell = not applicable for this item (e.g., if a characteristic is only relevant for positive statements).
This enables detailed cross-platform analysis of both sentiment polarity and linguistic features in developer communication.
Column details:
Text: Communication/document text.
oracle: Gold-standard sentiment label.
Characteristic 1 – 13: See accompanying paper for definitions. Annotation can be x, n, or empty (not applicable).
If you use this dataset, please cite:
Obaidi, M., Herrmann, M., Klünder, J., Schneider, K. (2025). Towards Trustworthy Sentiment Analysis in Software Engineering: Dataset Characteristics and Tool Selection. In: 2025 IEEE 33rd International Requirements Engineering Conference Workshops (REW).
License
This dataset is provided under the Creative Commons Attribution 4.0 International License (CC BY 4.0).
Contact
For questions regarding the dataset, please contact the corresponding author as listed in the publication.
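The coding scheme described for the worksheet columns can be decoded along these lines. This is a sketch under assumptions: each row is taken to be a plain dictionary keyed by column header, the characteristic columns are assumed to be named "Characteristic 1", "Characteristic 2", and so on, and the sample row is hypothetical rather than taken from the spreadsheet.

```python
SENTIMENT_LABELS = {-1: "Negative", 0: "Neutral", 1: "Positive"}
# x = present, n = not present, empty cell = not applicable.
ANNOTATION_VALUES = {"x": True, "n": False, "": None}


def decode_row(row: dict) -> dict:
    """Translate one coded row into readable labels and booleans."""
    decoded = {
        "text": row["Text"],
        "sentiment": SENTIMENT_LABELS[int(row["oracle"])],
    }
    for column, value in row.items():
        # Assumed header pattern for the 13 characteristic columns.
        if column.startswith("Characteristic"):
            key = (value or "").strip().lower()
            decoded[column] = ANNOTATION_VALUES[key]
    return decoded


# Hypothetical row for illustration only.
example = decode_row({
    "Text": "LGTM, thanks!",
    "oracle": 1,
    "Characteristic 1": "x",
    "Characteristic 2": "n",
    "Characteristic 3": None,
})
```

Keeping "not applicable" as `None` rather than `False` preserves the distinction the authors draw between an absent characteristic and one that does not apply to the item.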
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
MovieReviewSentimentClassification An MTEB dataset Massive Text Embedding Benchmark
The Allociné dataset is a French-language dataset for sentiment analysis that contains movie reviews produced by the online community of the Allociné.fr website.
Task category t2c
Domains Reviews, Written
Reference https://github.com/TheophileBlard/french-sentiment-analysis-with-bert
How to evaluate on this task
You can evaluate an embedding model on this dataset using… See the full description on the dataset page: https://huggingface.co/datasets/mteb/MovieReviewSentimentClassification.
https://creativecommons.org/publicdomain/zero/1.0/
This dataset was scraped from IMDb using a Python module called Scrapset. Here is the link to the documentation: https://github.com/ibrahim-string/Scrapset
There are two CSV files in this dataset: one is cleaned and the other is uncleaned. You can explore the data and do some sentiment analysis after cleaning it "YOUR WAY", or you can use the cleaned CSV file.
A set of 19 ASC datasets (reviews of 19 products) producing a sequence of 19 tasks. Each dataset represents a task. The datasets are from 4 sources: (1) HL5Domains (Hu and Liu, 2004) with reviews of 5 products; (2) Liu3Domains (Liu et al., 2015) with reviews of 3 products; (3) Ding9Domains (Ding et al., 2008) with reviews of 9 products; and (4) SemEval14 with reviews of 2 products (SemEval 2014 Task 4 for laptop and restaurant). For (1), (2) and (3), we split off about 10% of the original data as validation data and another 10% or so as test data. For (4), we use 150 examples from the training set for validation. To be consistent with existing research (Tang et al., 2016), examples with conflicting polarity (both positive and negative sentiments expressed about an aspect term) are not used. Statistics and details of the 19 datasets are given at https://github.com/ZixuanKe/PyContinual.
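The split described for sources (1) to (3) can be illustrated with a short sketch. It is not the authors' code: the field name `polarity`, the value `"conflict"`, the fixed seed, and the use of a random shuffle are all assumptions made for the example.

```python
import random


def split_dataset(examples, val_fraction=0.1, test_fraction=0.1, seed=0):
    """Drop conflicting-polarity examples, then split into train/validation/test."""
    # Assumption: each example is a dict with a "polarity" field, and
    # conflicting examples are marked with the value "conflict".
    kept = [ex for ex in examples if ex["polarity"] != "conflict"]
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(kept)
    n_val = round(len(kept) * val_fraction)
    n_test = round(len(kept) * test_fraction)
    validation = kept[:n_val]
    test = kept[n_val:n_val + n_test]
    train = kept[n_val + n_test:]
    return train, validation, test


# Hypothetical data: 100 usable examples plus 5 with conflicting polarity.
examples = [{"id": i, "polarity": "positive"} for i in range(100)]
examples += [{"id": 100 + i, "polarity": "conflict"} for i in range(5)]
train, validation, test = split_dataset(examples)
```

Filtering before splitting keeps the 10% fractions relative to the usable examples. Whether the original work filtered before or after splitting is not stated here.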