License: Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Dataset
This dataset contains positive, negative, and neutral ("notr") sentences from several data sources given in the references. Most sentiment models use only two labels, positive and negative; however, user input can be an entirely neutral sentence, and I could find no data for such cases. Therefore I created this dataset with 3 classes. Positive and negative sentences are listed below. Neutral examples are extracted from the Turkish wiki dump. In addition, some random text was added… See the full description on the dataset page: https://huggingface.co/datasets/winvoker/turkish-sentiment-analysis-dataset.
License: CC0 1.0 Universal, https://creativecommons.org/publicdomain/zero/1.0/
This dataset is a large-scale collection of 241,000+ English-language comments sourced from various online platforms. Each comment is annotated with a sentiment label.
The data has been gathered from multiple websites, such as:
Hugging Face: https://huggingface.co/datasets/Sp1786/multiclass-sentiment-analysis-dataset
Kaggle: https://www.kaggle.com/datasets/abhi8923shriv/sentiment-analysis-dataset
Kaggle: https://www.kaggle.com/datasets/jp797498e/twitter-entity-sentiment-analysis
Kaggle: https://www.kaggle.com/datasets/crowdflower/twitter-airline-sentiment
The goal is to enable training and evaluation of multi-class sentiment analysis models on real-world text data. The dataset is already preprocessed (lowercased; punctuation, URLs, numbers, and stopwords removed) and is ready for NLP pipelines.
| Column | Description |
|---|---|
| Comment | User-generated text content |
| Sentiment | Sentiment label (0 = Negative, 1 = Neutral, 2 = Positive) |
Comment: "apple pay is so convenient secure and easy to use"
Sentiment: 2 (Positive)
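A minimal loading sketch, assuming the `datasets` library and the Hub ID linked above (inspect the reported column names before indexing, since the hosted version may name columns differently than the table):

```python
# Minimal sketch: load the dataset from the Hugging Face Hub and inspect it.
from datasets import load_dataset

ds = load_dataset("Sp1786/multiclass-sentiment-analysis-dataset", split="train")
print(ds.column_names)  # check the actual column names before indexing

# Label mapping as documented in the table above.
label_names = {0: "Negative", 1: "Neutral", 2: "Positive"}
print(ds[0])
```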
License: Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Dataset Card for amazon reviews for sentiment analysis
Dataset Summary
One of the most important problems in e-commerce is correctly calculating the ratings given to products after sale. Solving this problem means greater customer satisfaction for the e-commerce site, more prominence for sellers' products, and a seamless shopping experience for buyers. Another problem is the correct ordering of the comments given to products. The prominence of misleading… See the full description on the dataset page: https://huggingface.co/datasets/hugginglearners/amazon-reviews-sentiment-analysis.
License: unknown, https://choosealicense.com/licenses/unknown/
Dataset Card for SST-2
Dataset Summary
The Stanford Sentiment Treebank is a corpus with fully labeled parse trees that allows for a complete analysis of the compositional effects of sentiment in language. The corpus is based on the dataset introduced by Pang and Lee (2005) and consists of 11,855 single sentences extracted from movie reviews. It was parsed with the Stanford parser and includes a total of 215,154 unique phrases from those parse trees, each… See the full description on the dataset page: https://huggingface.co/datasets/stanfordnlp/sst2.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data set contains automated sentiment and emotionality annotations of 23 million headlines from 47 news media outlets popular in the United States.
The set of 47 news media outlets analyzed (listed in Figure 1 of the main manuscript) was derived from the AllSides organization's 2019 Media Bias Chart v1.1. The human ratings of outlets' ideological leanings were also taken from this chart and are listed in Figure 2 of the main manuscript.
Headlines of news articles from the analyzed outlets are available in the outlets' online domains and/or public cache repositories such as the Internet Wayback Machine, Google Cache, and Common Crawl. Article headlines were located in articles' raw HTML using outlet-specific XPath expressions.
The temporal coverage of headlines across news outlets is not uniform. For some media organizations, the availability of news articles in online domains or Internet cache repositories becomes sparse for earlier years. Furthermore, some news outlets popular in 2019, such as The Huffington Post or Breitbart, did not exist in the early 2000s. Hence, our data set is sparser in headline sample size and representativeness for earlier years in the 2000-2019 timeline. Nevertheless, 20 outlets in our data set have chronologically continuous partial or full headline data availability since the year 2000. Figure S1 in the SI reports the number of headlines per outlet and per year in our analysis.
In a small percentage of articles, outlet-specific XPath expressions might fail to properly capture the content of the headline, due to the heterogeneity of HTML elements and CSS styling combinations with which article text is arranged in outlets' online domains. After manual testing, we determined that the percentage of headlines falling in this category is very small. Additionally, our method might miss some articles in the online domains of news outlets. In a data analysis of over 23 million headlines, we cannot manually check the correctness of every single data instance, and one hundred percent accuracy at capturing headline content is elusive due to a small number of hard-to-detect boundary cases, such as incorrect HTML markup syntax in online domains. Overall, however, we are confident that our headline set is representative of print news media headlines for the studied time period and outlets analyzed.
The compressed files in this data set are listed next:
-analysisScripts.rar contains the analysis scripts used in the main manuscript, as well as aggregated data of the automated sentiment and emotionality annotations of the headlines and the human annotations of a subset of headlines used as ground truth.
-models.rar contains the Transformer sentiment and emotion annotation models used in the analysis. Namely:
siebert/sentiment-roberta-large-english from https://huggingface.co/siebert/sentiment-roberta-large-english. This model is a fine-tuned checkpoint of RoBERTa-large (Liu et al. 2019). It enables reliable binary sentiment analysis for various types of English-language text. For each instance, it predicts either positive (1) or negative (0) sentiment. The model was fine-tuned and evaluated on 15 data sets from diverse text sources to enhance generalization across different types of texts (reviews, tweets, etc.).
DistilbertSST2.rar contains the default sentiment classification model of the Hugging Face Transformers library (https://huggingface.co/). This model is only used to replicate the results of the sentiment analysis with sentiment-roberta-large-english.
j-hartmann/emotion-english-distilroberta-base from https://huggingface.co/j-hartmann/emotion-english-distilroberta-base. This model is a fine-tuned checkpoint of DistilRoBERTa-base. It allows annotation of English text with Ekman's 6 basic emotions, plus a neutral class, and was trained on 6 diverse datasets. Please refer to the original author at https://huggingface.co/j-hartmann/emotion-english-distilroberta-base for an overview of the data sets used for fine-tuning.
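The two Hub-hosted models above can be exercised through the standard `transformers` pipeline API. The following is an illustrative sketch, not the annotation code used in the analysis:

```python
# Illustrative sketch: annotate one headline with the sentiment and emotion
# models named above, via the `transformers` pipeline API.
from transformers import pipeline

sentiment = pipeline("sentiment-analysis",
                     model="siebert/sentiment-roberta-large-english")
emotion = pipeline("text-classification",
                   model="j-hartmann/emotion-english-distilroberta-base",
                   top_k=None)  # return scores for all 7 emotion classes

headline = "Markets tumble as inflation fears return"  # made-up example
print(sentiment(headline))  # e.g. [{'label': 'NEGATIVE', 'score': ...}]
print(emotion(headline))    # scores for Ekman's 6 emotions plus neutral
```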
-headlinesDataWithSentimentLabelsAnnotationsFromSentimentRobertaLargeModel.rar contains the URLs of the analyzed headlines and the sentiment annotations of the siebert/sentiment-roberta-large-english Transformer model (https://huggingface.co/siebert/sentiment-roberta-large-english).
-headlinesDataWithSentimentLabelsAnnotationsFromDistilbertSST2.rar contains the URLs of the analyzed headlines and the sentiment annotations of the default Hugging Face sentiment analysis model fine-tuned on the SST-2 dataset (https://huggingface.co/).
-headlinesDataWithEmotionLabelsAnnotationsFromDistilRoberta.rar contains the URLs of the analyzed headlines and the emotion category annotations of the j-hartmann/emotion-english-distilroberta-base Transformer model (https://huggingface.co/j-hartmann/emotion-english-distilroberta-base).
sjyuxyz/financial-sentiment-analysis dataset hosted on Hugging Face and contributed by the HF Datasets community
btwitssayan/sentiment-analysis-for-mental-health dataset hosted on Hugging Face and contributed by the HF Datasets community
Sentiment140 consists of Twitter messages with emoticons, which are used as noisy labels for sentiment classification. For more detailed information, please refer to the paper.
License: CC0 1.0 Universal, https://creativecommons.org/publicdomain/zero/1.0/
Hugging Face Hub: link
The Rotten Tomatoes Movie Review Sentiment Analysis Dataset contains 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews. Bo Pang and Lillian Lee first used this data in their paper "Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales", published in Proceedings of the ACL in 2005. All of the data fields are identical in every split. The text column contains the review itself, and the label column indicates whether the review is positive or negative.
The Performance of Sentiment Analysis
In this post we take a look at the performance of different sentiment analysis systems on a movie review dataset from Rotten Tomatoes. This data was first used in Bo Pang and Lillian Lee, "Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales", Proceedings of the ACL, 2005. The data fields are the same among all splits.
We will be using three different libraries for this post: 1) Scikit-learn, 2) NLTK, and 3) TextBlob. We will also compare the results of these systems with those from human raters. Each library takes different amounts of time and resources to run, so we will also be considering these factors in our comparisons.
NLTK
NLTK is a popular library for working with text data in Python. It includes many useful features for pre-processing text data, including tokenization, lemmatization, and part-of-speech tagging. NLTK also includes a number of helpful classes for building and evaluating predictive models (such as decision trees and maximum entropy classifiers).
TextBlob
TextBlob is a relatively new library that attempts to provide an easy-to-use interface for common text processing tasks (such as part-of-speech tagging, sentence parsing, spelling correction, etc.). TextBlob is built on top of NLTK and Pattern, another Python library for web mining.
Scikit-learn
Scikit-learn is a popular machine learning library for Python that provides efficient implementations of common algorithms such as support vector machines, random forests, and k-nearest neighbors classifiers. It also includes helpful utilities for pre-processing data and assessing model performance.
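As a taste of the rule-based side, here is a minimal sketch scoring one review with NLTK's VADER analyzer and TextBlob's polarity score (scikit-learn is omitted here because it first requires training a classifier on labeled data):

```python
# Minimal sketch: rule-based sentiment scores from NLTK (VADER) and TextBlob.
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
from textblob import TextBlob

nltk.download("vader_lexicon", quiet=True)

review = "A gripping, beautifully acted film that never loses momentum."

print(TextBlob(review).sentiment.polarity)                   # float in [-1, 1]
print(SentimentIntensityAnalyzer().polarity_scores(review))  # neg/neu/pos/compound
```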
- Identify positive and negative sentiment in movie reviews
- Categorize movie reviews by rating
- Cluster movie reviews to group together similar reviews
Hugging Face Hub: link
License
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: validation.csv
| Column name | Description |
|:--------------|:----------------------------------|
| text | The text of the review. (String) |
| label | The label of the review. (String) |

File: train.csv
| Column name | Description |
|:--------------|:----------------------------------|
| text | The text of the review. (String) |
| label | The label of the review. (String) |

File: test.csv
| Column name | Description |
|:--------------|:----------------------------------|
| text | The text of the review. (String) |
| label | The label of the review. (String) |
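Since the schema is identical across splits, loading them is a one-liner per file; a sketch with pandas, using the file names documented above:

```python
# Sketch: read the three splits described above with pandas.
import pandas as pd

splits = {name: pd.read_csv(f"{name}.csv")
          for name in ("train", "validation", "test")}
print(splits["train"].columns.tolist())         # expected: ['text', 'label']
print(splits["train"]["label"].value_counts())  # check the class balance
```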
License: Apache License v2.0, https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Multilingual Sentiments Dataset
A collection of multilingual sentiment datasets grouped into 3 classes: positive, neutral, and negative. Most multilingual sentiment datasets are either 2-class (positive or negative), 5-class product-review ratings (e.g. the Amazon multilingual dataset), or multiple classes of emotions. However, to an average person, positive, negative, and neutral classes often suffice and are more straightforward to perceive and annotate. Also, a positive/negative… See the full description on the dataset page: https://huggingface.co/datasets/tyqiangz/multilingual-sentiments.
License: Attribution 3.0 (CC BY 3.0), https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
Tweet Sentiment Multilingual is a sentiment analysis dataset of Twitter posts in 8 different languages.
License: other, https://choosealicense.com/licenses/other/
Dataset Card for "imdb"
Dataset Summary
Large Movie Review Dataset. This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well.
Supported Tasks and Leaderboards
More Information Needed
Languages
More Information Needed
Dataset Structure… See the full description on the dataset page: https://huggingface.co/datasets/stanfordnlp/imdb.
License: Apache License v2.0, https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
sweatSmile/news-sentiment-data dataset hosted on Hugging Face and contributed by the HF Datasets community
License: CC0 1.0 Universal, https://creativecommons.org/publicdomain/zero/1.0/
Three news sources have been used in creating this dataset:
1. Sun, J. (2016, August). Daily News for Stock Market Prediction, Version 1. Retrieved August 2024 from https://www.kaggle.com/aaron7sun/stocknews.
2. Aryan Singh. NYT Articles: 2.1M+ (2000-Present), Daily Updated. https://www.kaggle.com/datasets/aryansingh0909/nyt-articles-21m-2000-present.
3. Gabriel Preda. BBC News. https://www.kaggle.com/datasets/gpreda/bbc-news.
The first source covers 2008-06-08 to 2016-07-01: the top 25 news items of each day from Reddit World News. The second source is a direct import of the abstract column from New York Times articles from 2016-07-01 to 2017-09-05. The third is likewise a direct import of the description column from BBC News from 2017-09-05 to 2024-08-03. Thus, the whole coverage is from 2008-06-08 to 2024-08-03.
Three models have been used for sentiment results. NLTK VADER is applied first, as it is the most lightweight and fastest to run on large amounts of data. But since news is mostly neutral, VADER gave a 1.0 neutral score for around 25% of the data. Therefore, two more advanced models, a RoBERTa-based model and Hugging Face's distilbert-base-uncased-finetuned-sst-2-english, were applied to these neutral articles to classify them more accurately.
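A rough reconstruction of that two-stage scheme is sketched below. It is illustrative only (not the project's actual code), and for brevity shows only the named DistilBERT checkpoint as the fallback model:

```python
# Illustrative two-stage labeling: VADER first, with a transformer fallback
# for headlines that VADER scores as fully neutral.
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
from transformers import pipeline

nltk.download("vader_lexicon", quiet=True)
vader = SentimentIntensityAnalyzer()
distilbert = pipeline("sentiment-analysis",
                      model="distilbert-base-uncased-finetuned-sst-2-english")

def label_headline(text: str) -> str:
    scores = vader.polarity_scores(text)
    if scores["neu"] == 1.0:                 # VADER found no sentiment signal
        return distilbert(text)[0]["label"]  # fall back to the transformer
    return "POSITIVE" if scores["compound"] >= 0 else "NEGATIVE"

print(label_headline("Stocks rally after strong earnings report"))
```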
Part of my school project for Nanyang Polytechnic | AI & Data Engineering
skibastepan/sentiment-analysis-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community
chillies/course-review-multilabel-sentiment-analysis dataset hosted on Hugging Face and contributed by the HF Datasets community
License: CC0 1.0 Universal, https://creativecommons.org/publicdomain/zero/1.0/
By daily_dialog (from Hugging Face) [source]
The DailyDialog dataset is a meticulously curated collection of multi-turn dialogues that aims to accurately represent the way we communicate in our daily lives. It covers a wide range of topics that are relevant to our everyday experiences. What sets this dataset apart is that it includes human-written conversations, which means the language used is more natural and realistic, resulting in less noise and higher quality data.
Each dialogue in the dataset consists of two or more participants engaging in a conversation. The conversations are provided in textual form, allowing for easy analysis and processing. Alongside the dialogues, there are also corresponding labels for communication intention and emotion attached to each utterance.
The communication intention labels categorize each utterance based on its intended purpose or goal within the conversation. These categories provide valuable insights into how different participants express their intentions through speech.
In addition to the communication intention labels, there are also emotion labels assigned to each utterance in the dialogues. These emotion labels capture the emotional state or sentiment expressed by participants during various points in the conversation.
To facilitate model evaluation and testing, DailyDialog provides three separate files: validation.csv, train.csv, and test.csv. The validation set (validation.csv) contains dialogues with their respective communication intention and emotion labels for assessing model performance during development stages. The train set (train.csv) includes dialogues paired with corresponding communication intention and emotion labels for training purposes. Lastly, test.csv serves as an independent test set that enables evaluating models' proficiency by providing unseen dialogues along with their associated communication intention and emotion labels.
Overall, DailyDialog stands out as a high-quality dataset due to its accurate representation of daily-life conversations paired with comprehensive labeling of both the communication intentions and the emotions expressed throughout these dialogues. This makes it an invaluable resource for developing robust dialogue systems that understand human interactions on a deeper level, identifying the diverse intentions behind speech acts alongside the various emotional states encountered in daily exchanges.
Welcome to the DailyDialog dataset! This high-quality multi-turn dialog dataset has been curated to reflect our daily communication style and covers a wide range of topics related to our everyday lives. The dataset consists of human-written conversations, making it less noisy and more realistic. Each conversation in the dataset has been manually labeled with communication intention and emotion information, providing valuable insights into the dialogues.
To make the most of this dataset, here is a step-by-step guide on how you can use it effectively:
Understanding the columns:
- dialog: This column contains the actual conversation between two or more participants. It is in text format.
- act: The act column represents the communication intention labels for each utterance in the dialogue. These labels categorize each utterance based on its intention.
- emotion: The emotion column contains emotion labels for each utterance in the dialogue. These labels represent the emotions expressed during that particular utterance.
Familiarize yourself with validation.csv:
- The validation.csv file serves as a validation set for evaluating your model's performance. It contains pre-labeled conversations along with their corresponding communication intentions and emotion labels.
Explore train.csv for training purposes:
- The train.csv file is meant for training purposes and provides conversations along with their communication intentions and emotion labels.
Test your model using test.csv:
- The test.csv file provides unseen conversations, along with their communication intention and emotion labels, for evaluating your model's proficiency.
Finally, remember that this DailyDialog dataset offers an excellent opportunity to develop models capable of understanding multi-turn dialogues in a wide range of everyday scenarios. By utilizing both communication intention and emotion information provided, you can gain valuable insights into analyzing human conversations.
So dive into this rich resource, experiment with different techn...
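To get started, a minimal sketch of pulling DailyDialog from the Hub and walking one labeled dialogue (assumes the `datasets` library; script-based versions of this dataset may additionally require trust_remote_code=True):

```python
# Minimal sketch: load DailyDialog and print one dialogue with its labels.
from datasets import load_dataset

ds = load_dataset("daily_dialog", split="train")
sample = ds[0]
# dialog, act, and emotion are parallel lists, one entry per utterance.
for utterance, act, emotion in zip(sample["dialog"], sample["act"], sample["emotion"]):
    print(f"[act={act} emotion={emotion}] {utterance.strip()}")
```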
orYx-models/Leadership-sentiment-analysis dataset hosted on Hugging Face and contributed by the HF Datasets community
License: CC0 1.0, https://choosealicense.com/licenses/cc0-1.0/
Dataset Card for wisesight_sentiment
Dataset Summary
Wisesight Sentiment Corpus: Social media messages in Thai language with sentiment label (positive, neutral, negative, question)
Released to the public domain under the Creative Commons Zero v1.0 Universal license.
Labels: {"pos": 0, "neu": 1, "neg": 2, "q": 3}
Size: 26,737 messages
Language: Central Thai
Style: Informal and conversational, with some news headlines and advertisements.
Time period: Around 2016 to early 2019. With… See the full description on the dataset page: https://huggingface.co/datasets/pythainlp/wisesight_sentiment.