49 datasets found

BBC datasets for sentiment analysis
kaggle.com
Updated Dec 15, 2024
Cite
Alan Turner (2024). BBC datasets for sentiment analysis [Dataset]. https://www.kaggle.com/datasets/amunsentom/article-dataset-2/suggestions
Explore at:
Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Dec 15, 2024
Dataset provided by
Kaggle: http://kaggle.com/
Authors
Alan Turner
License
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Dataset Name: BBC Articles Sentiment Analysis Dataset

Source: BBC News

Description: This dataset consists of articles from the BBC News website, containing a diverse range of topics such as business, politics, entertainment, technology, sports, and more. The dataset includes articles from various time periods and categories, along with labels representing the sentiment of the article. The sentiment labels indicate whether the tone of the article is positive, negative, or neutral, making it suitable for sentiment analysis tasks.

Number of Instances: [Specify the number of articles in the dataset, for example, 2,225 articles]

Number of Features:
1. Article Text: The content of the article (string).
2. Sentiment Label: The sentiment classification of the article. The possible labels are:
   - Positive
   - Negative
   - Neutral

Data Fields:
- id: Unique identifier for each article.
- category: The category or topic of the article (e.g., business, politics, sports).
- title: The title of the article.
- content: The full text of the article.
- sentiment: The sentiment label (positive, negative, or neutral).

Example:

| id | category | title | content | sentiment |
|----|----------|-------|---------|-----------|
| 1 | Business | "Stock Market Surge" | "The stock market has surged to new highs, driven by strong earnings..." | Positive |
| 2 | Politics | "Election Results" | "The election results were a mixed bag, with some surprises along the way." | Neutral |
| 3 | Sports | "Team Wins Championship" | "The team won the championship after a thrilling final match." | Positive |
| 4 | Technology | "New Smartphone Release" | "The new smartphone release has received mixed reactions from users." | Negative |

Preprocessing Notes:
- The text has been preprocessed to remove special characters and any HTML tags that might have been included in the original articles.
- Tokenization or further text cleaning (e.g., lowercasing, stopword removal) may be necessary depending on the model and method used for sentiment classification.

Use Case: This dataset is ideal for training and evaluating machine learning models for sentiment classification, where the goal is to predict the sentiment (positive, negative, or neutral) based on the article's text.
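As a rough illustration of the data fields listed above, the following sketch reads rows in that layout with Python's standard csv module and counts the sentiment labels. The inline sample simply reuses the example rows from the description; the CSV serialisation, the column order, and the helper name load_articles are assumptions, not details confirmed by the dataset page.

```python
import csv
import io
from collections import Counter

# Hypothetical sample in the documented field layout; the real file name,
# delimiter, and column order are assumptions.
SAMPLE_CSV = """id,category,title,content,sentiment
1,Business,Stock Market Surge,"The stock market has surged to new highs, driven by strong earnings...",Positive
2,Politics,Election Results,"The election results were a mixed bag, with some surprises along the way.",Neutral
3,Sports,Team Wins Championship,The team won the championship after a thrilling final match.,Positive
4,Technology,New Smartphone Release,The new smartphone release has received mixed reactions from users.,Negative
"""

VALID_LABELS = {"positive", "negative", "neutral"}


def load_articles(handle):
    """Read article rows and normalise the sentiment label to lowercase."""
    rows = []
    for row in csv.DictReader(handle):
        label = row["sentiment"].strip().lower()
        if label not in VALID_LABELS:
            raise ValueError(f"unexpected sentiment label: {row['sentiment']!r}")
        row["sentiment"] = label
        rows.append(row)
    return rows


articles = load_articles(io.StringIO(SAMPLE_CSV))
label_counts = Counter(article["sentiment"] for article in articles)
print(label_counts)
```

Checking the label distribution first is a cheap way to spot class imbalance before training a classifier.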
Forex News Annotated Dataset for Sentiment Analysis
zenodo.org
paperswithcode.com
csv
Updated Nov 11, 2023
Cite
Georgios Fatouros; Kalliopi Kouroumali (2023). Forex News Annotated Dataset for Sentiment Analysis [Dataset]. http://doi.org/10.5281/zenodo.7976208
Explore at:
Available download formats: csv
Unique identifier
https://doi.org/10.5281/zenodo.7976208
Dataset updated
Nov 11, 2023
Dataset provided by
Zenodo: http://zenodo.org/
Authors
Georgios Fatouros; Kalliopi Kouroumali
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset contains news headlines relevant to key forex pairs: AUDUSD, EURCHF, EURUSD, GBPUSD, and USDJPY. The data was extracted from reputable platforms Forex Live and FXstreet over a period of 86 days, from January to May 2023. The dataset comprises 2,291 unique news headlines. Each headline includes an associated forex pair, timestamp, source, author, URL, and the corresponding article text. Data was collected using web scraping techniques executed via a custom service on a virtual machine. This service periodically retrieves the latest news for a specified forex pair (ticker) from each platform, parsing all available information. The collected data is then processed to extract details such as the article's timestamp, author, and URL. The URL is further used to retrieve the full text of each article. This data acquisition process repeats approximately every 15 minutes.

To ensure the reliability of the dataset, we manually annotated each headline for sentiment. Instead of solely focusing on the textual content, we ascertained sentiment based on the potential short-term impact of the headline on its corresponding forex pair. This method recognizes the currency market's acute sensitivity to economic news, which significantly influences many trading strategies. As such, this dataset could serve as an invaluable resource for fine-tuning sentiment analysis models in the financial realm.

We used three categories for annotation: 'positive', 'negative', and 'neutral', which correspond to bullish, bearish, and hold sentiments, respectively, for the forex pair linked to each headline. The following Table provides examples of annotated headlines along with brief explanations of the assigned sentiment.

Examples of Annotated Headlines

| Forex Pair | Headline | Sentiment | Explanation |
|------------|----------|-----------|-------------|
| GBPUSD | Diminishing bets for a move to 12400 | Neutral | Lack of strong sentiment in either direction |
| GBPUSD | No reasons to dislike Cable in the very near term as long as the Dollar momentum remains soft | Positive | Positive sentiment towards GBPUSD (Cable) in the near term |
| GBPUSD | When are the UK jobs and how could they affect GBPUSD | Neutral | Poses a question and does not express a clear sentiment |
| JPYUSD | Appropriate to continue monetary easing to achieve 2% inflation target with wage growth | Positive | Monetary easing from Bank of Japan (BoJ) could lead to a weaker JPY in the short term due to increased money supply |
| USDJPY | Dollar rebounds despite US data. Yen gains amid lower yields | Neutral | Since both the USD and JPY are gaining, the effects on the USDJPY forex pair might offset each other |
| USDJPY | USDJPY to reach 124 by Q4 as the likelihood of a BoJ policy shift should accelerate Yen gains | Negative | USDJPY is expected to reach a lower value, with the USD losing value against the JPY |
| AUDUSD | <p>RBA Governor Lowe’s Testimony High inflation is damaging and corrosive </p> | Positive | Reserve Bank of Australia (RBA) expresses concerns about inflation. Typically, central banks combat high inflation with higher interest rates, which could strengthen AUD. |

Moreover, the dataset includes two columns with the predicted sentiment class and score as predicted by the FinBERT model. Specifically, the FinBERT model outputs a set of probabilities for each sentiment class (positive, negative, and neutral), representing the model's confidence in associating the input headline with each sentiment category. These probabilities are used to determine the predicted class and a sentiment score for each headline. The sentiment score is computed by subtracting the negative class probability from the positive one.
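The score computation described above can be sketched in a few lines. The subtraction (positive probability minus negative probability) comes from the description; choosing the predicted class as the highest-probability label is an assumption, and the function name finbert_summary is hypothetical.

```python
def finbert_summary(p_positive, p_negative, p_neutral):
    """Return (predicted_class, sentiment_score) from three class probabilities.

    The score follows the description: positive minus negative probability.
    Taking the most probable class as the prediction is an assumption.
    """
    probabilities = {
        "positive": p_positive,
        "negative": p_negative,
        "neutral": p_neutral,
    }
    predicted_class = max(probabilities, key=probabilities.get)
    sentiment_score = p_positive - p_negative
    return predicted_class, sentiment_score


predicted, score = finbert_summary(p_positive=0.7, p_negative=0.1, p_neutral=0.2)
print(predicted, round(score, 3))
```

Under this definition the score lies in [-1, 1], and a confidently neutral headline lands near 0 because both the positive and negative probabilities are small.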
my_dataset
huggingface.co
Updated Mar 23, 2025
Cite
Neil Rainsforth (2025). my_dataset [Dataset]. https://huggingface.co/datasets/wkdnev/my_dataset
Explore at:
Dataset updated
Mar 23, 2025
Authors
Neil Rainsforth
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
Test Sentiment Dataset

A small sample dataset for text classification tasks, specifically binary sentiment analysis (positive or negative). Useful for testing, demos, or building and validating pipelines with Hugging Face Datasets.

Dataset Summary

This dataset contains short text samples labeled as either positive or negative. It is intended for testing purposes and includes:

- 10 training samples
- 4 test samples

Each example includes:

text: A short sentence or review… See the full description on the dataset page: https://huggingface.co/datasets/wkdnev/my_dataset.
Sentiment Analytics Software Market Analysis North America, Europe, APAC,...
technavio.com
Updated Dec 23, 2024
Cite
Technavio (2024). Sentiment Analytics Software Market Analysis North America, Europe, APAC, South America, Middle East and Africa - US, Germany, China, UK, India, Canada, France, Japan, Brazil, South Korea - Size and Forecast 2025-2029 [Dataset]. https://www.technavio.com/report/sentiment-analytics-software-market-industry-analysis
Explore at:
Dataset updated
Dec 23, 2024
Dataset provided by
TechNavio
Authors
Technavio
Time period covered
2021 - 2025
Area covered
United Kingdom, United States, Germany, Global
Description
What is the Sentiment Analytics Software Market Size?

The sentiment analytics software market size is forecast to increase by USD 2.34 billion, at a CAGR of 16.6% between 2024 and 2029. The market is experiencing significant growth due to the increasing use of social media and the rising internet penetration in North America. Businesses are leveraging sentiment analysis to gain insights into customer opinions and feedback. A key trend in the market is the integration of generative AI to improve the accuracy and context-dependence of sentiment analysis. However, challenges such as context-dependent errors and the need for large amounts of data to train AI models persist. To stay competitive, market participants must focus on addressing these challenges and continuously improving the accuracy and reliability of their sentiment analysis solutions. This market analysis report provides an in-depth examination of the growth drivers, trends, and challenges shaping the sentiment analytics software market.

What will be the size of Market during the forecast period?

Market Segmentation

The market report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD million' for the period 2025-2029, as well as historical data from 2019 - 2023 for the following segments.

Deployment
- On-premises
- Cloud-based

End-user
- Retail
- BFSI
- Healthcare
- Others

Geography
- North America: US
- Europe: Germany, UK
- APAC: China, India
- South America
- Middle East and Africa

Which is the largest segment driving market growth?

The on-premises segment is estimated to witness significant growth during the forecast period. In the realm of data analysis, sentiment analytics software plays a pivotal role in understanding public perception toward brands, services, and entities. For organizations in the healthcare sector, reputation management is of utmost importance. Sentiment analytics software deployed on-premises offers several benefits. With on-premises deployment, organizations retain complete control over their data, ensuring privacy and compliance with healthcare regulations. This setup allows for customization to meet specific business needs and seamless integration with existing systems.

The on-premises segment was valued at USD 788.40 million in 2019. Furthermore, the use of dedicated infrastructure results in superior performance and faster processing times. Government institutions, media, telecom, and other industries also reap the benefits of on-premises sentiment analytics software. Data from surveys, social media, and other sources undergoes text analysis to uncover valuable insights. By staying informed of public sentiment, organizations can make data-driven decisions, respond to crises, and improve their offerings. Sentiment analysis is not limited to text data from surveys and social media. Media mentions and customer interactions through phone and email are also valuable sources of data. By harnessing the power of on-premises sentiment analytics software, organizations can gain a competitive edge and maintain a strong reputation.

Which region is leading the market?

North America is estimated to contribute 38% to the growth of the global market during the forecast period. Technavio's analysts have elaborately explained the regional trends and drivers that shape the market during the forecast period. In North America, sentiment analytics software has gained significant traction due to the region's high internet penetration and prioritization of enhancing customer experiences. By 2024, internet usage in North America reached nearly 97%, creating a solid base for the implementation of sentiment analysis tools. Companies in the US and Canada are investing heavily in advanced technologies to personalize customer interactions and improve overall satisfaction.

Further, Natural Language Processing (NLP) plays a crucial role in sentiment analysis, enabling businesses to understand and respond effectively to customer opinions. By staying attuned to customer sentiments, North American businesses can foster brand reputation, enhance customer satisfaction, and make data-driven decisions.

How do company ranking index and market positioning come to your aid?

Companies are implementing various strategies, such as strategic alliances, partnerships, mergers and acquisitions, geographical expansion, and product/service launches, to enhance their presence in the market.

Alphabet Inc.: The company offers sentiment analytics software that supports multiple languages and can be integrated into various applications for real-time analysis.
News sentiment analysis datasets for Serbian, Bosnian, Macedonian, Albanian...
live.european-language-grid.eu
clarin.si
binary format
Updated Nov 12, 2024
Cite
(2024). News sentiment analysis datasets for Serbian, Bosnian, Macedonian, Albanian and Estonian SADEmma 1.0 [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/23729
Explore at:
Available download formats: binary format
Dataset updated
Nov 12, 2024
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
We provide annotated datasets on a three-point sentiment scale (positive, neutral and negative) for Serbian, Bosnian, Macedonian, Albanian, and Estonian. For all languages except Estonian, we include pairs of source URL (where corresponding text can be found) and sentiment label.

For Estonian, we randomly sampled 100 articles from "Ekspress news article archive (in Estonian and Russian) 1.0" (http://hdl.handle.net/11356/1408).

The data is organized in Tab-Separated Values (TSV) format. For Serbian, Bosnian, Macedonian, and Albanian, the dataset contains two columns: sourceURL and sentiment. For Estonian, the dataset consists of three columns: text ID (from the CLARIN.SI reference above), body text, and sentiment label.
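A minimal sketch of reading the two TSV layouts described above with Python's standard csv module. The names sourceURL and sentiment come from the description; the Estonian column names (text_id, body_text), the absence of a header row, and the sample rows are assumptions made for illustration.

```python
import csv
import io

# Column layouts from the description. The Estonian column names and the
# assumption that the files have no header row are hypothetical.
URL_COLUMNS = ["sourceURL", "sentiment"]
ESTONIAN_COLUMNS = ["text_id", "body_text", "sentiment"]


def read_tsv(handle, columns):
    """Parse tab-separated rows into dictionaries keyed by the given columns."""
    reader = csv.DictReader(
        handle,
        fieldnames=columns,
        delimiter="\t",
        quoting=csv.QUOTE_NONE,  # treat quote characters in body text literally
    )
    return list(reader)


# Made-up rows that only mimic the documented structure.
url_sample = io.StringIO(
    "https://example.org/news/1\tpositive\n"
    "https://example.org/news/2\tnegative\n"
)
estonian_sample = io.StringIO('42\tPlaceholder "body" text\tneutral\n')

url_rows = read_tsv(url_sample, URL_COLUMNS)
estonian_rows = read_tsv(estonian_sample, ESTONIAN_COLUMNS)
print(url_rows[0]["sentiment"], estonian_rows[0]["text_id"])
```

For the URL-only languages the article text is not part of the file, so it has to be retrieved separately from each sourceURL.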
Data for manuscript: "Longitudinal Analysis of Sentiment and Emotion in News...
zenodo.org
data.niaid.nih.gov
bin
Updated Sep 13, 2022
Cite
Anonymized (2022). Data for manuscript: "Longitudinal Analysis of Sentiment and Emotion in News Media Headlines Using Automated Labelling with Transformer Language Models" [Dataset]. http://doi.org/10.5281/zenodo.5144113
Explore at:
Available download formats: bin
Unique identifier
https://doi.org/10.5281/zenodo.5144113
Dataset updated
Sep 13, 2022
Dataset provided by
Zenodo: http://zenodo.org/
Authors
Anonymized
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This data set contains automated sentiment and emotionality annotations of 23 million headlines from 47 news media outlets popular in the United States.

The set of 47 news media outlets analysed (listed in Figure 1 of the main manuscript) was derived from the AllSides organization 2019 Media Bias Chart v1.1. The human ratings of outlets’ ideological leanings were also taken from this chart and are listed in Figure 2 of the main manuscript.

News article headlines from the set of outlets analyzed in the manuscript are available in the outlets’ online domains and/or public cache repositories such as The Internet Wayback Machine, Google cache and Common Crawl. Headlines were located in the articles’ raw HTML using outlet-specific XPath expressions.
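To make the idea of outlet-specific XPath expressions concrete, here is a minimal sketch using Python's standard xml.etree.ElementTree. The outlet names, the XPath strings, and the HTML snippets are all hypothetical; the expressions actually used are not given in this description. ElementTree supports only a subset of XPath and requires well-formed markup, so real pages would normally need an HTML-tolerant parser such as lxml.

```python
import xml.etree.ElementTree as ET

# Hypothetical per-outlet expressions; the real ones are not listed here.
OUTLET_XPATHS = {
    "example-outlet-a": ".//h1[@class='headline']",
    "example-outlet-b": ".//header/h2",
}


def extract_headline(html, outlet):
    """Return the headline text for one page, or None if the XPath misses."""
    root = ET.fromstring(html)  # requires well-formed markup
    node = root.find(OUTLET_XPATHS[outlet])
    if node is None:
        return None
    # itertext() also collects text nested in inline tags such as <em>.
    return "".join(node.itertext()).strip()


page_a = (
    "<html><body><h1 class='headline'>Markets <em>rally</em> on earnings</h1>"
    "</body></html>"
)
page_b = "<html><body><header><h2>Vote count continues</h2></header></body></html>"

headline_a = extract_headline(page_a, "example-outlet-a")
headline_b = extract_headline(page_b, "example-outlet-b")
missed = extract_headline(page_b, "example-outlet-a")
print(headline_a, headline_b, missed)
```

Returning None on a miss makes it easy to count how often an outlet's expression fails to capture a headline.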

The temporal coverage of headlines across news outlets is not uniform. For some media organizations, news article availability in online domains or Internet cache repositories becomes sparse for earlier years. Furthermore, some news outlets popular in 2019, such as The Huffington Post or Breitbart, did not exist in the early 2000s. Hence, our data set is sparser in headline sample size and representativeness for earlier years in the 2000-2019 timeline. Nevertheless, 20 outlets in our data set have chronologically continuous partial or full headline data availability since the year 2000. Figure S 1 in the SI reports the number of headlines per outlet and per year in our analysis.

In a small percentage of articles, outlet-specific XPath expressions might fail to properly capture the content of the headline due to the heterogeneity of HTML elements and CSS styling combinations with which article text is arranged in outlets’ online domains. After manual testing, we determined that the percentage of headlines falling into this category is very small. Additionally, our method might fail to detect some articles in the online domains of news outlets. To conclude, in a data analysis of over 23 million headlines, we cannot manually check the correctness of every single data instance, and one hundred percent accuracy at capturing headlines’ content is elusive due to the small number of difficult-to-detect boundary cases such as incorrect HTML markup syntax in online domains. Overall, however, we are confident that our headlines set is representative of headlines in print news media content for the studied time period and outlets analyzed.

The compressed files in this data set are listed next:

-analysisScripts.rar contains the analysis scripts used in the main manuscript as well as aggregated data of sentiment and emotionality automated annotations of the headlines and human annotations of a subset of headlines sentiment and emotionality used as ground truth.

-models.rar contains the Transformer sentiment and emotion annotation models used in the analysis. Namely:

Siebert/sentiment-roberta-large-english from https://huggingface.co/siebert/sentiment-roberta-large-english. This model is a fine-tuned checkpoint of RoBERTa-large (Liu et al. 2019). It enables reliable binary sentiment analysis for various types of English-language text. For each instance, it predicts either positive (1) or negative (0) sentiment. The model was fine-tuned and evaluated on 15 data sets from diverse text sources to enhance generalization across different types of texts (reviews, tweets, etc.). See more information from the original authors at https://huggingface.co/siebert/sentiment-roberta-large-english

DistilbertSST2.rar is the default sentiment classification model of the HuggingFace Transformers library (https://huggingface.co/). This model is only used to replicate the results of the sentiment analysis with sentiment-roberta-large-english.

DistilRoberta j-hartmann/emotion-english-distilroberta-base from https://huggingface.co/j-hartmann/emotion-english-distilroberta-base. The model is a fine-tuned checkpoint of DistilRoBERTa-base. The model allows annotation of English text with Ekman's 6 basic emotions, plus a neutral class. The model was trained on 6 diverse datasets. Please refer to the original author at https://huggingface.co/j-hartmann/emotion-english-distilroberta-base for an overview of the data sets used for fine-tuning.

-headlinesDataWithSentimentLabelsAnnotationsFromSentimentRobertaLargeModel.rar URLs of headlines analyzed and the sentiment annotations of the siebert/sentiment-roberta-large-english Transformer model. https://huggingface.co/siebert/sentiment-roberta-large-english

-headlinesDataWithSentimentLabelsAnnotationsFromDistilbertSST2.rar URLs of headlines analyzed and the sentiment annotations of the default HuggingFace sentiment analysis model fine-tuned on the SST-2 dataset. https://huggingface.co/

-headlinesDataWithEmotionLabelsAnnotationsFromDistilRoberta.rar URLs of headlines analyzed and the emotion categories annotations of the j-hartmann/emotion-english-distilroberta-base Transformer model. https://huggingface.co/j-hartmann/emotion-english-distilroberta-base
SB10k Dataset
paperswithcode.com
Updated Apr 7, 2024
Cite
SB10k Dataset [Dataset]. https://paperswithcode.com/dataset/sb10k
Explore at:
Dataset updated
Apr 7, 2024
Authors
Mark Cieliebak; Jan Milan Deriu; Dominic Egger; Fatih Uzdilli
Description
The SB10k dataset is a valuable resource for sentiment analysis in German. Here are the key details:

- Corpus Size: It contains approximately 10,000 German tweets¹.
- Language: German.
- Task: Text classification, specifically sentiment analysis.
- Multilinguality: Monolingual (German only).
- Size Category: Falls within the range of 1K to 10K examples.
- Tags: Sentiment analysis.
- License: CC-BY-4.0.

The dataset was created by annotating German tweets, with each tweet labeled by three annotators. Researchers have used SB10k to benchmark various machine learning classifiers, including convolutional neural networks (CNNs) and feature-based support vector machines (SVMs) for sentiment analysis²³.

(1) Alienmaster/SB10k · Datasets at Hugging Face. https://huggingface.co/datasets/Alienmaster/SB10k
(2) A Twitter Corpus and Benchmark Resources for German Sentiment Analysis. https://aclanthology.org/W17-1106/
(3) A Twitter Corpus and Benchmark Resources for German Sentiment Analysis. https://aclanthology.org/W17-1106.pdf
Data from: Facebook Data for Sentiment Analysis
live.european-language-grid.eu
lindat.mff.cuni.cz
binary format
Updated Jul 16, 2013
Cite
(2013). Facebook Data for Sentiment Analysis [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/1057
Explore at:
Available download formats: binary format
Dataset updated
Jul 16, 2013
License
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Description
Corpus consisting of 10,000 Facebook posts manually annotated on sentiment (2,587 positive, 5,174 neutral, 1,991 negative and 248 bipolar posts). The archive contains data and statistics in an Excel file (FBData.xlsx) and gold data in two text files with posts (gold-posts.txt) and labels (gols-labels.txt) on corresponding lines.
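Because the gold posts and gold labels are stored on corresponding lines of two separate files, they can be paired line by line. The sketch below does that with in-memory stand-ins for the two files; the sample posts, the label strings, and the helper name read_gold are hypothetical, since the actual label encoding is not stated in this description.

```python
import io
from collections import Counter


def read_gold(posts_handle, labels_handle):
    """Pair each post with the label found on the same line number."""
    posts = [line.rstrip("\n") for line in posts_handle]
    labels = [line.strip() for line in labels_handle]
    if len(posts) != len(labels):
        raise ValueError(
            f"line count mismatch: {len(posts)} posts vs {len(labels)} labels"
        )
    return list(zip(posts, labels))


# In-memory stand-ins for the two gold files; contents are made up.
posts_file = io.StringIO("Great service today!\nOpening hours are 9 to 5.\n")
labels_file = io.StringIO("positive\nneutral\n")

pairs = read_gold(posts_file, labels_file)
distribution = Counter(label for _, label in pairs)

# The class sizes quoted in the description add up to the corpus size.
reported_total = 2587 + 5174 + 1991 + 248
print(pairs[0], distribution, reported_total)
```

The explicit length check guards against silently misaligned labels if one of the files has a stray blank line.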
Broad-Coverage German Sentiment Classification Model and Dataset for Dialog...
live.european-language-grid.eu
data.niaid.nih.gov
Updated Oct 10, 2023
Cite
Broad-Coverage German Sentiment Classification Model and Dataset for Dialog Systems [Dataset]. https://live.european-language-grid.eu/catalogue/ld/7957
Explore at:
Dataset updated
Oct 10, 2023
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Training a Broad-Coverage German Sentiment Classification Model for Dialog Systems.
This paper describes the training of a general-purpose German sentiment classification model. Sentiment classification is an important aspect of general text analytics. Furthermore, it plays a vital role in dialogue systems and voice interfaces that depend on the ability of the system to pick up and understand emotional signals from user utterances. The presented study outlines how we have collected a new German sentiment corpus and then combined this corpus with existing resources to train a broad-coverage German sentiment model. The resulting data set contains 5.4 million labelled samples. We have used the data to train both a simple convolutional and a transformer-based classification model and compared the results achieved on various training configurations. The model and the data set will be published along with this paper.
You can find the code for training and testing the models, which was published along with the paper, in this repository.
The germansentiment Python package contains an easy-to-use interface for the model that was published with this paper.
Sentiment analysis of OCR text!!
kaggle.com
Updated Jul 9, 2020
Cite
Somnath chatterjee (2020). Sentiment analysis of OCR text!! [Dataset]. https://www.kaggle.com/somnath796/sentiment-analysis-of-ocr-text/metadata
Explore at:
Dataset updated
Jul 9, 2020
Dataset provided by
Kaggle: http://kaggle.com/
Authors
Somnath chatterjee
Description
You work as a social media moderator for your firm. Your key responsibility is to tag uploaded content (images) during Pride Month based on its sentiment (positive, negative, or random) and categorize them for internal reference and SEO optimization.

Task: Your task is to build an engine that combines the concepts of OCR and NLP that accepts a .jpg file as input, extracts the text, if any, and classifies sentiment as positive or negative. If the text sentiment is neutral or an image file does not have any text, then it is classified as random.
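The tagging rule in the task statement can be written as a small decision function. This is only a sketch of the final labelling step: the OCR and sentiment-model stages are assumed to have already produced extracted_text and text_sentiment, and the function name tag_image is hypothetical.

```python
def tag_image(extracted_text, text_sentiment):
    """Map OCR output and a text sentiment to positive, negative, or random.

    extracted_text: text recovered from the image by an OCR step (may be
        empty or None when the image contains no text).
    text_sentiment: "positive", "negative", or "neutral", as produced by
        some sentiment model (not shown here).
    """
    # Images without any text are tagged as random.
    if extracted_text is None or not extracted_text.strip():
        return "random"
    # Neutral text is also tagged as random.
    if text_sentiment == "neutral":
        return "random"
    if text_sentiment in ("positive", "negative"):
        return text_sentiment
    raise ValueError(f"unexpected sentiment: {text_sentiment!r}")


print(tag_image("Love wins", "positive"))
print(tag_image("   ", "positive"))
print(tag_image("Parade starts at noon", "neutral"))
```

Keeping this rule separate from the OCR and classification steps makes it straightforward to test on its own.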

Data: You must use an external dataset to train your model. The attached dataset link contains the sample data of each category [Positive | Negative | Random] and test data.
Replication Data for: Sentiment is Not Stance: Target-Aware Opinion...
dataone.org
dataverse.harvard.edu
Updated Nov 12, 2023
Cite
Bestvater, Samuel; Monroe, Burt (2023). Replication Data for: Sentiment is Not Stance: Target-Aware Opinion Classification for Political Text Analysis [Dataset]. http://doi.org/10.7910/DVN/MUYYG4
Explore at:
Unique identifier
https://doi.org/10.7910/DVN/MUYYG4
Dataset updated
Nov 12, 2023
Dataset provided by
Harvard Dataverse
Authors
Bestvater, Samuel; Monroe, Burt
Description
Sentiment analysis techniques have a long history in natural language processing and have become a standard tool in the analysis of political texts, promising a conceptually straightforward automated method of extracting meaning from textual data by scoring documents on a scale from positive to negative. However, while these kinds of sentiment scores can capture the overall tone of a document, the underlying concept of interest for political analysis is often actually the document's stance with respect to a given target--how positively or negatively it frames a specific idea, individual, or group--as this reflects the author's underlying political attitudes. In this paper we question the validity of approximating author stance through sentiment scoring in the analysis of political texts, and advocate for greater attention to be paid to the conceptual distinction between a document's sentiment and its stance. Using examples from open-ended survey responses and from political discussions on social media, we demonstrate that in many political text analysis applications, sentiment and stance do not necessarily align, and therefore sentiment analysis methods fail to reliably capture ground-truth document stance, amplifying noise in the data and leading to faulty conclusions.
SAT Questions and Answers for LLM 🏛️
kaggle.com
Updated Oct 16, 2023
Cite
Training Data (2023). SAT Questions and Answers for LLM 🏛️ [Dataset]. https://www.kaggle.com/datasets/trainingdatapro/sat-history-questions-and-answers/code
Explore at:
Dataset updated
Oct 16, 2023
Dataset provided by
Kaggle: http://kaggle.com/
Authors
Training Data
License
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Description
SAT History Questions and Answers 🏛️ - Text Classification Dataset

This dataset contains a collection of questions and answers for the SAT Subject Test in World History and US History. Each question is accompanied by its answer options and the correct response.

The dataset includes questions from various topics, time periods, and regions on both World History and US History.

💴 For Commercial Usage: To discuss your requirements, learn about the price and buy the dataset, leave a request on TrainingData to buy the dataset

OTHER DATASETS FOR THE TEXT ANALYSIS:

Google Play Messengers - 6,000 Reviews ⭐️

20,000 Customers Reviews on Banks ⭐️

Amazon Reviews Dataset

Content

For each question, we extracted:
- id: number of the question,
- subject: SAT subject (World History or US History),
- prompt: text of the question,
- A: answer A,
- B: answer B,
- C: answer C,
- D: answer D,
- E: answer E,
- answer: letter of the correct answer to the question

💴 Buy the Dataset: This is just an example of the data. Leave a request on https://trainingdata.pro/datasets to discuss your requirements, learn about the price and buy the dataset

TrainingData provides high-quality data annotation tailored to your needs

keywords: answer questions, sat, gpa, university, school, exam, college, web scraping, parsing, online database, text dataset, sentiment analysis, llm dataset, language modeling, large language models, text classification, text mining dataset, natural language texts, nlp, nlp open-source dataset, text data, machine learning
Sentiment Analysis outputs based on the combination of three classifiers for...
zenodo.org
data.niaid.nih.gov
bin
Updated Mar 4, 2022
Cite
Caio Mello; Gullal S. Cheema (2022). Sentiment Analysis outputs based on the combination of three classifiers for news headlines and body text [Dataset]. http://doi.org/10.5281/zenodo.6326348
Explore at:
Available download formats: bin
Unique identifier
https://doi.org/10.5281/zenodo.6326348
Dataset updated
Mar 4, 2022
Dataset provided by
Zenodo: http://zenodo.org/
Authors
Caio Mello; Gullal S. Cheema
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Sentiment Analysis outputs based on the combination of three classifiers for news headlines and body text covering the Olympic legacy of Rio 2016 and London 2012. Data was searched via Google search engine. It is composed of sentiment labels assigned to 1271 news articles in total.

News outlets:

BBC

Daily Mail

The Telegraph

The Guardian

Globo

Estadao

Folha de S. Paulo

Events covered by the articles:

London 2012 Olympic legacy

Rio 2016 Olympic legacy

All classifiers were applied to texts in English. Texts originally published in Portuguese by the Brazilian media were automatically translated.

Sentiment classifiers used:

VADER

BERT (trained on Amazon data)

BERT (trained on Twitter data - 140)

Each document (spreadsheet - xlsx) refers to one outlet and one event (London 2012 or Rio 2016).

How were labels assigned to the texts?

These labels are a combination of the three sentiment classifiers listed above. If at least two of them agree on a label, that label is taken as the final one. Otherwise, the label 'other' is assigned.

For news article body text: the proportion of sentences of each sentiment type was used to assign labels to the whole article instead of averaging the sentence scores. For example, if the proportion of sentences with negative labels is greater than 50%, then the article is assigned a negative label.
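The two rules above can be sketched in a few lines of Python. This is only an illustration of the described logic, not the authors' code: the function names are hypothetical, the 50% threshold is assumed to apply to every label (the description only gives the negative case), and falling back to 'other' when no label passes the threshold is an assumption.

```python
from collections import Counter


def combine_labels(labels):
    """Return the label at least two of the three classifiers agree on, else 'other'."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= 2 else "other"


def article_label(sentence_labels, threshold=0.5, fallback="other"):
    """Label an article by the share of its sentences carrying each label.

    Assumption: the same threshold applies to every label, and the
    fallback value is used when no label exceeds it.
    """
    if not sentence_labels:
        return fallback
    label, count = Counter(sentence_labels).most_common(1)[0]
    return label if count / len(sentence_labels) > threshold else fallback
```

For example, `combine_labels(["Pos", "Pos", "Neg"])` gives `"Pos"`, while three different labels give `"other"`.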

The documents are composed of the following columns:

Rank: the position of the article on Google search ranking

Date: date of article's publication (DD/MM/YYYY)

Link: article's link

Title: article's title

Sentiment_Title: final sentiment for article headline

Sentiment_Text: final sentiment for article's body text

PS: Documents do not include articles' body text.

Sentiment is presented in labels as follows:

Pos: Positive

Neg: Negative

Neutral: Neutral

other: inconclusive - if each of the 3 classifiers assigned a different label to the article, the label 'other' was used. Therefore, 'other' identifies contradictory results.
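
As a rough sketch of how one spreadsheet row could be normalised once it has been read into a dictionary, the snippet below parses the DD/MM/YYYY date and expands the short sentiment labels. The function name and the sample row in the usage note are invented for illustration, and the code assumes the cells arrive as plain strings (a spreadsheet reader may already return dates as date objects).

```python
from datetime import datetime

# Short labels used in the spreadsheets, mapped to their meaning.
LABEL_NAMES = {
    "Pos": "Positive",
    "Neg": "Negative",
    "Neutral": "Neutral",
    "other": "inconclusive",
}


def parse_row(row):
    """Convert one row (a dict keyed by the column names above) to typed values."""
    return {
        "rank": int(row["Rank"]),
        # Dates are documented as DD/MM/YYYY.
        "date": datetime.strptime(row["Date"], "%d/%m/%Y").date(),
        "link": row["Link"],
        "title": row["Title"],
        "sentiment_title": LABEL_NAMES[row["Sentiment_Title"]],
        "sentiment_text": LABEL_NAMES[row["Sentiment_Text"]],
    }
```

Calling `parse_row` on a made-up row such as `{"Rank": "3", "Date": "04/03/2022", "Link": "https://example.org/article", "Title": "Example headline", "Sentiment_Title": "Pos", "Sentiment_Text": "other"}` yields a date of 4 March 2022 and the labels "Positive" and "inconclusive".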
Sample of Malay sentiment lexicon.
plos.figshare.com
xls
Updated Jun 1, 2023
Cite
Ahmed Al-Saffar; Suryanti Awang; Hai Tao; Nazlia Omar; Wafaa Al-Saiagh; Mohammed Al-bared (2023). Sample of Malay sentiment lexicon. [Dataset]. http://doi.org/10.1371/journal.pone.0194852.t002
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0194852.t002
Dataset updated
Jun 1, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Ahmed Al-Saffar; Suryanti Awang; Hai Tao; Nazlia Omar; Wafaa Al-Saiagh; Mohammed Al-bared
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Sample of Malay sentiment lexicon.
Sample of the Malay stop words.
plos.figshare.com
figshare.com
xls
Updated Jun 1, 2023
Cite
Ahmed Al-Saffar; Suryanti Awang; Hai Tao; Nazlia Omar; Wafaa Al-Saiagh; Mohammed Al-bared (2023). Sample of the Malay stop words. [Dataset]. http://doi.org/10.1371/journal.pone.0194852.t003
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0194852.t003
Dataset updated
Jun 1, 2023
Dataset provided by
PLOS ONE
Authors
Ahmed Al-Saffar; Suryanti Awang; Hai Tao; Nazlia Omar; Wafaa Al-Saiagh; Mohammed Al-bared
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Sample of the Malay stop words.
Product sentiment analysis
kaggle.com
zip
Updated Sep 4, 2020
Cite
ask9 (2020). Product sentiment analysis [Dataset]. https://www.kaggle.com/arbazkhan971/product-sentiment-analysis
Explore at:
Available download formats: zip (406932 bytes)
Dataset updated
Sep 4, 2020
Authors
ask9
Description
Overview: Analyzing sentiments related to various products such as tablets, mobiles, and other gizmos can be fun and difficult, especially when the reviews are collected across various demographics around the world. In this weekend hackathon, we challenge the machinehackers community to develop a machine learning model that accurately classifies various products into 4 different classes of sentiments based on the raw text review provided by the user. Analyzing these sentiments will not only help us serve the customers better but can also reveal a lot of customer traits present or hidden in the reviews.

Sentiment analysis requires a lot to be taken into account, mainly because of the preprocessing needed to represent raw text in a machine-understandable form. Usually, we stem and lemmatize the raw text and then represent it using TF-IDF, word embeddings, etc. However, given state-of-the-art NLP models such as Transformer-based BERT models, one can skip manual feature engineering such as TF-IDF and count vectorizers.
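
To make the TF-IDF representation mentioned above concrete, here is a minimal pure-Python sketch. It uses one common textbook variant (term frequency times log of inverse document frequency, without smoothing); libraries such as scikit-learn apply a smoothed formula, so their numbers will differ. The function name and the toy documents in the usage note are illustrative only.

```python
import math
from collections import Counter


def tfidf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            # Relative term frequency times unsmoothed inverse document frequency.
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors
```

With `docs = [["good", "tablet"], ["bad", "tablet"]]`, the word "tablet" appears in every document and so receives weight 0, while "good" and "bad" keep a positive weight.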

In this short span of time, we would encourage you to leverage the ImageNet moment (Transfer Learning) in NLP using various pre-trained models.

Dataset Description:

Train.csv - 6364 rows x 4 columns (includes the sentiment column as target)

Test.csv - 2728 rows x 3 columns

Sample Submission.csv - Please check the Evaluation section for more details on how to generate a valid submission

Attribute Description:

Text_ID - Unique identifier

Product_Description - Description of the product review by a user

Product_Type - Different types of product (9 unique products)

Class - Represents various sentiments: 0 - Cannot Say, 1 - Negative, 2 - Positive, 3 - No Sentiment

Skills:

NLP, sentiment analysis

Feature extraction from raw text using TF-IDF, CountVectorizer

Using word embeddings to represent words as vectors

Using pretrained models like Transformers, BERT

Optimizing multi-class log loss to generalize well on unseen data
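
The multi-class log loss named above is the average negative log of the probability a model assigns to the true class. The sketch below follows that standard definition; the function name and the clipping constant are choices made here for illustration, and the hackathon's own evaluation script may differ in detail.

```python
import math


def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-probability of the true class.

    y_true: list of integer class ids (0-3 for the four sentiment classes).
    probs:  list of per-class probability lists, one per sample.
    eps:    clipping constant to avoid log(0) (an illustrative choice).
    """
    total = 0.0
    for label, row in zip(y_true, probs):
        p = min(max(row[label], eps), 1 - eps)
        total += math.log(p)
    return -total / len(y_true)
```

A model that predicts 0.25 for each of the four classes on every sample scores log(4), roughly 1.386; lower values mean better-calibrated predictions.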
Pierce County Sentiment
open.piercecountywa.gov
internal.open.piercecountywa.gov
application/rdfxml +5
Updated Feb 13, 2020
Cite
Kyle (2020). Pierce County Sentiment [Dataset]. https://open.piercecountywa.gov/dataset/Pierce-County-Sentiment/nqup-7zcn
Explore at:
Available download formats: tsv, csv, application/rssxml, json, xml, application/rdfxml
Dataset updated
Feb 13, 2020
Dataset authored and provided by
Kyle
Area covered
Pierce County
Description
The fundamental task in brand sentiment analysis is text classification: deciding whether the opinion expressed in a given text or document is positive, negative, or neutral. Around 800 documents per second pass through our platform from different media sources and providers. We use Natural Language Processing (NLP) to judge which group (positive, negative, or neutral) the content belongs to.

Meltwater’s Natural Language Processing model is supported by AI and machine learning algorithms. Using this model, we take individual words into account. Each document, for example, a tweet, is analysed based on the words it contains. Then we map the words to a set of predefined data to see the number of occurrences where they match up.
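
The word-matching idea described above can be illustrated with a simple lexicon count. This is a generic sketch, not Meltwater's model: the word lists, the function name, and the tie-breaking rule (equal counts mean neutral) are all invented for the example.

```python
import re

# Tiny illustrative word lists; a real lexicon would be far larger.
POSITIVE = {"great", "good", "love", "excellent"}
NEGATIVE = {"terrible", "awful", "bad", "poor"}


def lexicon_sentiment(text):
    """Classify text by counting matches against the two word lists."""
    words = re.findall(r"[a-z']+", text.lower())
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"
```

For instance, a tweet containing one positive word and two negative words is labelled negative, and text with no matches at all is labelled neutral.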
Data_Sheet_4_Topic evolution and sentiment comparison of user reviews on an...
frontiersin.figshare.com
docx
Updated Jun 2, 2023
Link copied
Cite
Chaoyang Li; Shengyu Li; Jianfeng Yang; Jingmei Wang; Yiqing Lv (2023). Data_Sheet_4_Topic evolution and sentiment comparison of user reviews on an online medical platform in response to COVID-19: taking review data of Haodf.com as an example.DOCX [Dataset]. http://doi.org/10.3389/fpubh.2023.1088119.s004
Explore at:
Available download formats: docx
Unique identifier
https://doi.org/10.3389/fpubh.2023.1088119.s004
Dataset updated
Jun 2, 2023
Dataset provided by
Frontiers
Authors
Chaoyang Li; Shengyu Li; Jianfeng Yang; Jingmei Wang; Yiqing Lv
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Introduction: Throughout the COVID-19 pandemic, many patients have sought medical advice on online medical platforms. Review data have become an essential reference point for supporting users in selecting doctors. As the research object, this study considered Haodf.com, a well-known e-consultation website in China.

Methods: This study examines the topics and sentimental change rules of user review texts from a temporal perspective. We also compared the topics and sentimental change characteristics of user review texts before and after the COVID-19 pandemic. First, 323,519 review data points about 2,122 doctors on Haodf.com were crawled using Python from 2017 to 2022. Subsequently, we employed the latent Dirichlet allocation method to cluster topics and the ROST content mining software to analyze user sentiments. Second, according to the results of the perplexity calculation, we divided text data into five topics: diagnosis and treatment attitude, medical skills and ethics, treatment effect, treatment scheme, and treatment process. Finally, we identified the most important topics and their trends over time.

Results: Users primarily focused on diagnosis and treatment attitude, with medical skills and ethics being the second-most important topic among users. As time progressed, the attention paid by users to diagnosis and treatment attitude increased, especially during the COVID-19 outbreak in 2020, when attention to diagnosis and treatment attitude increased significantly. User attention to the topic of medical skills and ethics began to decline during the COVID-19 outbreak, while attention to treatment effect and scheme generally showed a downward trend from 2017 to 2022. User attention to the treatment process exhibited a declining tendency before the COVID-19 outbreak, but increased after. Regarding sentiment analysis, most users exhibited a high degree of satisfaction for online medical services. However, positive user sentiments showed a downward trend over time, especially after the COVID-19 outbreak.

Discussion: This study has reference value for assisting user choice regarding medical treatment, decision-making by doctors, and online medical platform design.
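
For readers unfamiliar with the perplexity calculation mentioned in the methods, the usual definition for topic models such as latent Dirichlet allocation is given below. The abstract does not state the exact formula the authors used, so this is the standard textbook form rather than a description of their procedure. Here M is the number of held-out documents, \mathbf{w}_d the words of document d, and N_d its length; a lower value indicates a better fit, and the number of topics is typically chosen where perplexity stops improving.

```latex
\mathrm{perplexity}(D_{\mathrm{test}})
  = \exp\!\left( - \frac{\sum_{d=1}^{M} \log p(\mathbf{w}_d)}{\sum_{d=1}^{M} N_d} \right)
```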
ACOS Dataset
paperswithcode.com
Updated Dec 4, 2020
Cite
ACOS Dataset [Dataset]. https://paperswithcode.com/dataset/acos
Explore at:
Dataset updated
Dec 4, 2020
Authors
Hongjie Cai; Yaofeng Tu; Xiangsheng Zhou; Jianfei Yu; Rui Xia
Description
Most aspect-based sentiment analysis research aims at identifying the sentiment polarities toward explicit aspect terms while ignoring implicit aspects in text. To capture both explicit and implicit aspects, we focus on aspect-category based sentiment analysis, which involves joint aspect category detection and category-oriented sentiment classification. However, only a few simple studies have so far focused on this problem. Shortcomings in the way they defined the task make it difficult for their approaches to effectively learn the inner-relations between categories and the inter-relations between categories and sentiments. In this work, we re-formalize the task as a category-sentiment hierarchy prediction problem, which contains a hierarchy output structure to first identify multiple aspect categories in a piece of text, and then predict the sentiment for each of the identified categories. Specifically, we propose a Hierarchical Graph Convolutional Network (Hier-GCN), where a lower-level GCN models the inner-relations among multiple categories, and a higher-level GCN captures the inter-relations between aspect categories and sentiments. Extensive evaluations demonstrate that our hierarchy output structure is superior to existing ones, and that the Hier-GCN model consistently achieves the best results on four benchmarks.
MuSe-Sent: Multimodal Sentiment Classification in-the-Wild (MuSe2021)
data.niaid.nih.gov
zenodo.org
Updated Apr 8, 2022
Cite
Schuller, Björn (2022). MuSe-Sent: Multimodal Sentiment Classification in-the-Wild (MuSe2021) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4654370
Explore at:
Dataset updated
Apr 8, 2022
Dataset provided by
Stappen, Lukas
Baird, Alice
Schuller, Björn
Description
MuSe-Sent of the 2nd Multimodal Sentiment in-the-Wild Challenge! Predicting five advanced intensity classes for each of the emotional dimensions (valence, arousal) for segments of audio-video-text data. This package includes only MuSe-Sent features (all partitions) and labels of the training and development set (test scoring via the MuSe website). More: https://www.muse-challenge.org/muse2021

General: The purpose of the Multimodal Sentiment Analysis in Real-life media Challenge and Workshop (MuSe) is to bring together communities from different disciplines. We introduce the novel dataset MuSe-CAR that covers the range of aforementioned desiderata. MuSe-CAR is a large (>36h), multimodal dataset which has been gathered in-the-wild with the intention of further understanding Multimodal Sentiment Analysis in-the-wild, e.g., the emotional engagement that takes place during product reviews (i.e., automobile reviews) where a sentiment is linked to a topic or entity.

We have designed MuSe-CAR to be of high voice and video quality, as informative video social media content, as well as everyday recording devices have improved in recent years. This enables robust learning, even with a high degree of novel, in-the-wild characteristics, for example as related to: i) Video: Shot size (a mix of close-up, medium, and long shots), face-angle (side, eye, low, high), camera motion (free, free but stable, and free but unstable, switch, e.g., zoom, fixed), reviewer visibility (full body, half-body, face only, and hands only), highly varying backgrounds, and people interacting with objects (car parts). ii) Audio: Ambient noises (car noises, music), narrator and host diarisation, diverse microphone types, and speaker locations. iii) Text: Colloquialisms, and domain-specific terms.


BBC datasets for sentiment analysis


Preprocessing Notes:

The text has been preprocessed to remove special characters and any HTML tags that might have been included in the original articles.

Tokenization or further text cleaning (e.g., lowercasing, stopword removal) may be necessary depending on the model and method used for sentiment classification.
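
A minimal sketch of the further cleaning mentioned above (lowercasing, tokenization, stopword removal) might look like the following. The function name and the very short stopword list are placeholders chosen for illustration; a real pipeline would normally use a fuller list from an NLP library.

```python
import re

# Deliberately tiny stopword list, for illustration only.
STOPWORDS = {"the", "a", "an", "to", "of", "and", "by", "has", "is", "in"}


def clean_text(text):
    """Lowercase the text, split it into alphanumeric tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]
```

Applied to "The stock market has surged to new highs", this returns `["stock", "market", "surged", "new", "highs"]`.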

Use Case: This dataset is ideal for training and evaluating machine learning models for sentiment classification, where the goal is to predict the sentiment (positive, negative, or neutral) based on the article's text.
