49 datasets found
  1. BBC datasets for sentiment analysis

    • kaggle.com
    Updated Dec 15, 2024
    Cite
    Alan Turner (2024). BBC datasets for sentiment analysis [Dataset]. https://www.kaggle.com/datasets/amunsentom/article-dataset-2/suggestions
    Dataset updated
    Dec 15, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Alan Turner
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Dataset Name: BBC Articles Sentiment Analysis Dataset

    Source: BBC News

    Description: This dataset consists of articles from the BBC News website, containing a diverse range of topics such as business, politics, entertainment, technology, sports, and more. The dataset includes articles from various time periods and categories, along with labels representing the sentiment of the article. The sentiment labels indicate whether the tone of the article is positive, negative, or neutral, making it suitable for sentiment analysis tasks.

    Number of Instances: [Specify the number of articles in the dataset, for example, 2,225 articles]

    Number of Features:
    1. Article Text: The content of the article (string).
    2. Sentiment Label: The sentiment classification of the article. The possible labels are:
       - Positive
       - Negative
       - Neutral

    Data Fields:
    - id: Unique identifier for each article.
    - category: The category or topic of the article (e.g., business, politics, sports).
    - title: The title of the article.
    - content: The full text of the article.
    - sentiment: The sentiment label (positive, negative, or neutral).

    Example:

    | id | category   | title                    | content                                                                     | sentiment |
    |----|------------|--------------------------|-----------------------------------------------------------------------------|-----------|
    | 1  | Business   | "Stock Market Surge"     | "The stock market has surged to new highs, driven by strong earnings..."    | Positive  |
    | 2  | Politics   | "Election Results"       | "The election results were a mixed bag, with some surprises along the way." | Neutral   |
    | 3  | Sports     | "Team Wins Championship" | "The team won the championship after a thrilling final match."              | Positive  |
    | 4  | Technology | "New Smartphone Release" | "The new smartphone release has received mixed reactions from users."       | Negative  |

    Preprocessing Notes:
    - The text has been preprocessed to remove special characters and any HTML tags that might have been included in the original articles.
    - Tokenization or further text cleaning (e.g., lowercasing, stopword removal) may be necessary depending on the model and method used for sentiment classification.
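
    The further cleaning mentioned in the preprocessing notes (lowercasing, tokenization, stopword removal) could look roughly like the sketch below. The stopword list is a tiny hypothetical one chosen for illustration and is not part of the dataset; a real pipeline would likely use a fuller list and a tokenizer suited to the chosen model.

```python
import re

# Hypothetical, deliberately small stopword list (an assumption for this sketch).
STOPWORDS = {"the", "a", "an", "to", "of", "and", "has", "by", "with", "were", "was", "after"}


def preprocess(text: str) -> list[str]:
    """Lowercase the text, split it into alphanumeric tokens, and drop stopwords."""
    lowered = text.lower()
    tokens = re.findall(r"[a-z0-9]+", lowered)
    return [token for token in tokens if token not in STOPWORDS]


# Applied to the content of the third example row above.
tokens = preprocess("The team won the championship after a thrilling final match.")
print(tokens)
```

    Whether stopword removal helps depends on the model: bag-of-words classifiers often benefit, while transformer models are usually fed the raw text.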

    Use Case: This dataset is ideal for training and evaluating machine learning models for sentiment classification, where the goal is to predict the sentiment (positive, negative, or neutral) based on the article's text.

  2. Forex News Annotated Dataset for Sentiment Analysis

    • zenodo.org
    • paperswithcode.com
    • +1more
    csv
    Updated Nov 11, 2023
    Cite
    Georgios Fatouros; Kalliopi Kouroumali (2023). Forex News Annotated Dataset for Sentiment Analysis [Dataset]. http://doi.org/10.5281/zenodo.7976208
    Available download formats: csv
    Dataset updated
    Nov 11, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Georgios Fatouros; Kalliopi Kouroumali
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This dataset contains news headlines relevant to key forex pairs: AUDUSD, EURCHF, EURUSD, GBPUSD, and USDJPY. The data was extracted from reputable platforms Forex Live and FXstreet over a period of 86 days, from January to May 2023. The dataset comprises 2,291 unique news headlines. Each headline includes an associated forex pair, timestamp, source, author, URL, and the corresponding article text. Data was collected using web scraping techniques executed via a custom service on a virtual machine. This service periodically retrieves the latest news for a specified forex pair (ticker) from each platform, parsing all available information. The collected data is then processed to extract details such as the article's timestamp, author, and URL. The URL is further used to retrieve the full text of each article. This data acquisition process repeats approximately every 15 minutes.

    To ensure the reliability of the dataset, we manually annotated each headline for sentiment. Instead of solely focusing on the textual content, we ascertained sentiment based on the potential short-term impact of the headline on its corresponding forex pair. This method recognizes the currency market's acute sensitivity to economic news, which significantly influences many trading strategies. As such, this dataset could serve as an invaluable resource for fine-tuning sentiment analysis models in the financial realm.

    We used three categories for annotation: 'positive', 'negative', and 'neutral', which correspond to bullish, bearish, and hold sentiments, respectively, for the forex pair linked to each headline. The following Table provides examples of annotated headlines along with brief explanations of the assigned sentiment.

    Examples of Annotated Headlines

    | Forex Pair | Headline | Sentiment | Explanation |
    |------------|----------|-----------|-------------|
    | GBPUSD | Diminishing bets for a move to 12400 | Neutral | Lack of strong sentiment in either direction |
    | GBPUSD | No reasons to dislike Cable in the very near term as long as the Dollar momentum remains soft | Positive | Positive sentiment towards GBPUSD (Cable) in the near term |
    | GBPUSD | When are the UK jobs and how could they affect GBPUSD | Neutral | Poses a question and does not express a clear sentiment |
    | JPYUSD | Appropriate to continue monetary easing to achieve 2% inflation target with wage growth | Positive | Monetary easing from Bank of Japan (BoJ) could lead to a weaker JPY in the short term due to increased money supply |
    | USDJPY | Dollar rebounds despite US data. Yen gains amid lower yields | Neutral | Since both the USD and JPY are gaining, the effects on the USDJPY forex pair might offset each other |
    | USDJPY | USDJPY to reach 124 by Q4 as the likelihood of a BoJ policy shift should accelerate Yen gains | Negative | USDJPY is expected to reach a lower value, with the USD losing value against the JPY |
    | AUDUSD | RBA Governor Lowe’s Testimony High inflation is damaging and corrosive | Positive | Reserve Bank of Australia (RBA) expresses concerns about inflation. Typically, central banks combat high inflation with higher interest rates, which could strengthen AUD. |

    Moreover, the dataset includes two columns with the predicted sentiment class and score as predicted by the FinBERT model. Specifically, the FinBERT model outputs a set of probabilities for each sentiment class (positive, negative, and neutral), representing the model's confidence in associating the input headline with each sentiment category. These probabilities are used to determine the predicted class and a sentiment score for each headline. The sentiment score is computed by subtracting the negative class probability from the positive one.
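
    As a rough illustration of the score described above, the sketch below derives a predicted class and a sentiment score from a set of class probabilities. The probabilities are made-up values, not taken from the dataset, and picking the class with the highest probability is an assumption; the description only states that the score is the positive probability minus the negative one.

```python
def summarize_probabilities(probs: dict[str, float]) -> tuple[str, float]:
    """Return (predicted_class, sentiment_score) for one headline.

    The predicted class is taken as the most probable one (an assumption);
    the score is P(positive) - P(negative), as described in the text.
    """
    predicted_class = max(probs, key=probs.get)
    sentiment_score = probs["positive"] - probs["negative"]
    return predicted_class, sentiment_score


# Hypothetical model output for a single headline.
example_probs = {"positive": 0.70, "negative": 0.10, "neutral": 0.20}
predicted, score = summarize_probabilities(example_probs)
print(predicted, round(score, 2))
```

    With this definition the score lies between -1 and 1, and a headline that is confidently neutral ends up with a score near 0.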

  3. my_dataset

    • huggingface.co
    Updated Mar 23, 2025
    Cite
    Neil Rainsforth (2025). my_dataset [Dataset]. https://huggingface.co/datasets/wkdnev/my_dataset
    Dataset updated
    Mar 23, 2025
    Authors
    Neil Rainsforth
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Test Sentiment Dataset

    A small sample dataset for text classification tasks, specifically binary sentiment analysis (positive or negative). Useful for testing, demos, or building and validating pipelines with Hugging Face Datasets.

    Dataset Summary

    This dataset contains short text samples labeled as either positive or negative. It is intended for testing purposes and includes:

    - 10 training samples
    - 4 test samples

    Each example includes:

    text: A short sentence or review… See the full description on the dataset page: https://huggingface.co/datasets/wkdnev/my_dataset.

  4. Sentiment Analytics Software Market Analysis North America, Europe, APAC,...

    • technavio.com
    Updated Dec 23, 2024
    Cite
    Technavio (2024). Sentiment Analytics Software Market Analysis North America, Europe, APAC, South America, Middle East and Africa - US, Germany, China, UK, India, Canada, France, Japan, Brazil, South Korea - Size and Forecast 2025-2029 [Dataset]. https://www.technavio.com/report/sentiment-analytics-software-market-industry-analysis
    Dataset updated
    Dec 23, 2024
    Dataset provided by
    TechNavio
    Authors
    Technavio
    Time period covered
    2021 - 2025
    Area covered
    United Kingdom, United States, Germany, Global
    Description

    What is the Sentiment Analytics Software Market Size?

    The sentiment analytics software market size is forecast to increase by USD 2.34 billion, at a CAGR of 16.6% between 2024 and 2029. The market is experiencing significant growth due to the increasing use of social media and the rising internet penetration in North America. Businesses are leveraging sentiment analysis to gain insights into customer opinions and feedback. A key trend in the market is the integration of generative AI to improve the accuracy and context-dependence of sentiment analysis. However, challenges such as context-dependent errors and the need for large amounts of data to train AI models persist. To stay competitive, market participants must focus on addressing these challenges and continuously improving the accuracy and reliability of their sentiment analysis solutions. This market analysis report provides an in-depth examination of the growth drivers, trends, and challenges shaping the sentiment analytics software market.
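
    To make the relationship between the forecast increase and the CAGR concrete, the sketch below works through the arithmetic. It assumes annual compounding over five periods (2024 to 2029) and that the USD 2.34 billion figure is the difference between the end and start values; the implied start value is a derived quantity under those assumptions, not a number reported in the source.

```python
def implied_base(increase: float, cagr: float, years: int) -> float:
    """Start value implied by an absolute increase and a compound annual growth rate.

    From end = start * (1 + cagr) ** years and increase = end - start:
        start = increase / ((1 + cagr) ** years - 1)
    """
    growth_factor = (1.0 + cagr) ** years
    return increase / (growth_factor - 1.0)


# Figures from the text: USD 2.34 billion increase at a 16.6% CAGR.
# The five-year horizon is an assumption of this sketch.
base = implied_base(increase=2.34, cagr=0.166, years=5)
end = base * (1.0 + 0.166) ** 5
print(f"implied start: {base:.2f} bn, implied end: {end:.2f} bn")
```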

    Market Segmentation

    The market report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD million' for the period 2025-2029, as well as historical data from 2019 - 2023 for the following segments.

    Deployment
      On-premises
      Cloud-based

    End-user
      Retail
      BFSI
      Healthcare
      Others

    Geography
      North America
        US
      Europe
        Germany
        UK
      APAC
        China
        India
      South America
      Middle East and Africa

    Which is the largest segment driving market growth?

    The on-premises segment is estimated to witness significant growth during the forecast period. In the realm of data analysis, sentiment analytics software plays a pivotal role in understanding public perception toward brands, services, and entities. For organizations in the healthcare sector, reputation management is of utmost importance. Sentiment analytics software deployed on-premises offers several benefits. With on-premises deployment, organizations retain complete control over their data, ensuring privacy and compliance with healthcare regulations. This setup allows for customization to meet specific business needs and seamless integration with existing systems.

    The on-premises segment was valued at USD 788.40 million in 2019. Furthermore, the use of dedicated infrastructure results in superior performance and faster processing times. Government institutions, media, telecom, and other industries also reap the benefits of on-premises sentiment analytics software. Data from surveys, social media, and other sources undergoes text analysis to uncover valuable insights. By staying informed of public sentiment, organizations can make data-driven decisions, respond to crises, and improve their offerings. Sentiment analysis is not limited to text data from surveys and social media. Media mentions and customer interactions through phone and email are also valuable sources of data. By harnessing the power of on-premises sentiment analytics software, organizations can gain a competitive edge and maintain a strong reputation.

    Which region is leading the market?

    North America is estimated to contribute 38% to the growth of the global market during the forecast period. Technavio's analysts have elaborately explained the regional trends and drivers that shape the market during the forecast period. In North America, sentiment analytics software has gained significant traction due to the region's high internet penetration and prioritization of enhancing customer experiences. By 2024, internet usage in North America reached nearly 97%, creating a solid base for the implementation of sentiment analysis tools. Companies in the US and Canada are investing heavily in advanced technologies to personalize customer interactions and improve overall satisfaction.

    Further, Natural Language Processing (NLP) plays a crucial role in sentiment analysis, enabling businesses to understand and respond effectively to customer opinions. By staying attuned to customer sentiments, North American businesses can foster brand reputation, enhance customer satisfaction, and make data-driven decisions.

    How do company ranking index and market positioning come to your aid?

    Companies are implementing various strategies, such as strategic alliances, partnerships, mergers and acquisitions, geographical expansion, and product/service launches, to enhance their presence in the market.

    Alphabet Inc.: The company offers sentiment analytics software that supports multiple languages and can be integrated into various applications for real-time analysis.

  5. News sentiment analysis datasets for Serbian, Bosnian, Macedonian, Albanian...

    • live.european-language-grid.eu
    • clarin.si
    binary format
    Updated Nov 12, 2024
    Cite
    (2024). News sentiment analysis datasets for Serbian, Bosnian, Macedonian, Albanian and Estonian SADEmma 1.0 [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/23729
    Available download formats: binary format
    Dataset updated
    Nov 12, 2024
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
    License information was derived automatically

    Description

    We provide annotated datasets on a three-point sentiment scale (positive, neutral and negative) for Serbian, Bosnian, Macedonian, Albanian, and Estonian. For all languages except Estonian, we include pairs of source URL (where corresponding text can be found) and sentiment label.

    For Estonian, we randomly sampled 100 articles from "Ekspress news article archive (in Estonian and Russian) 1.0" (http://hdl.handle.net/11356/1408).

    The data is organized in Tab-Separated Values (TSV) format. For Serbian, Bosnian, Macedonian, and Albanian, the dataset contains two columns: sourceURL and sentiment. For Estonian, the dataset consists of three columns: text ID (from the CLARIN.SI reference above), body text, and sentiment label.
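
    The two TSV layouts described above could be read along the following lines. This is a sketch under assumptions: the files are taken to have no header row, the sample rows and label strings are invented, and the Estonian column names (text_id, body_text) are hypothetical since only sourceURL and sentiment are named in the description.

```python
import csv
import io


def read_tsv(stream, columns):
    """Read tab-separated rows and map each row onto the given column names."""
    # QUOTE_NONE keeps quotation marks inside article text as literal characters.
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [dict(zip(columns, row)) for row in reader if row]


# Two-column layout (Serbian, Bosnian, Macedonian, Albanian); sample rows are made up.
url_sample = io.StringIO(
    "https://example.org/article-1\tpositive\n"
    "https://example.org/article-2\tneutral\n"
)
url_rows = read_tsv(url_sample, ["sourceURL", "sentiment"])

# Three-column layout (Estonian); column names here are hypothetical.
estonian_sample = io.StringIO("1001\tExample body text.\tnegative\n")
estonian_rows = read_tsv(estonian_sample, ["text_id", "body_text", "sentiment"])

print(url_rows[0]["sentiment"], estonian_rows[0]["text_id"])
```

    For the URL-based languages the article text is not included, so a further retrieval step would be needed before any model training.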

  6. Data for manuscript: "Longitudinal Analysis of Sentiment and Emotion in News...

    • zenodo.org
    • data.niaid.nih.gov
    bin
    Updated Sep 13, 2022
    Cite
    Anonymized (2022). Data for manuscript: "Longitudinal Analysis of Sentiment and Emotion in News Media Headlines Using Automated Labelling with Transformer Language Models" [Dataset]. http://doi.org/10.5281/zenodo.5144113
    Available download formats: bin
    Dataset updated
    Sep 13, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anonymized
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This data set contains automated sentiment and emotionality annotations of 23 million headlines from 47 news media outlets popular in the United States.

    The set of 47 news media outlets analysed (listed in Figure 1 of the main manuscript) was derived from the AllSides organization 2019 Media Bias Chart v1.1. The human ratings of outlets’ ideological leanings were also taken from this chart and are listed in Figure 2 of the main manuscript.

    News article headlines from the set of outlets analyzed in the manuscript are available in the outlets’ online domains and/or public cache repositories such as The Internet Wayback Machine, Google cache and Common Crawl. Headlines were located in the articles’ raw HTML using outlet-specific XPath expressions.
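
    A minimal sketch of locating a headline with an outlet-specific XPath expression is given below. The markup, the outlet name, and the expression are all hypothetical; the actual expressions used for the 47 outlets are not given here. The sketch relies on the standard library's ElementTree, which supports only a subset of XPath and requires well-formed markup, so real-world HTML would likely need a dedicated HTML parser.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Hypothetical, well-formed page markup used only for illustration.
PAGE = """<html><body>
<div class="article"><h1 class="headline">Example headline text</h1><p>Body text.</p></div>
</body></html>"""

# Hypothetical per-outlet expressions; each outlet would need its own entry.
OUTLET_XPATHS = {"example-outlet": ".//h1[@class='headline']"}


def extract_headline(page_source: str, outlet: str) -> Optional[str]:
    """Return the headline text for the outlet, or None if the expression finds nothing."""
    root = ET.fromstring(page_source)
    node = root.find(OUTLET_XPATHS[outlet])
    if node is None or node.text is None:
        return None
    return node.text.strip()


headline = extract_headline(PAGE, "example-outlet")
print(headline)
```

    Returning None when the expression does not match makes it straightforward to count the pages on which an outlet-specific expression fails.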

    The temporal coverage of headlines across news outlets is not uniform. For some media organizations, news article availability in online domains or Internet cache repositories becomes sparse for earlier years. Furthermore, some news outlets popular in 2019, such as The Huffington Post or Breitbart, did not exist in the early 2000s. Hence, our data set is sparser in headline sample size and representativeness for earlier years in the 2000-2019 timeline. Nevertheless, 20 outlets in our data set have chronologically continuous partial or full headline data availability since the year 2000. Figure S1 in the SI reports the number of headlines per outlet and per year in our analysis.

    In a small percentage of articles, outlet-specific XPath expressions might fail to properly capture the content of the headline due to the heterogeneity of HTML elements and CSS styling combinations with which article text content is arranged in outlets’ online domains. After manual testing, we determined that the percentage of headlines falling into this category is very small. Additionally, our method might miss some articles in the online domains of news outlets. To conclude, in a data analysis of over 23 million headlines, we cannot manually check the correctness of every single data instance, and one hundred percent accuracy at capturing headlines’ content is elusive due to the small number of difficult-to-detect boundary cases such as incorrect HTML markup syntax in online domains. Overall, however, we are confident that our headline set is representative of headlines in print news media content for the studied time period and outlets analyzed.

    The compressed files in this data set are listed next:

    -analysisScripts.rar contains the analysis scripts used in the main manuscript as well as aggregated data of sentiment and emotionality automated annotations of the headlines and human annotations of a subset of headlines sentiment and emotionality used as ground truth.

    -models.rar contains the Transformer sentiment and emotion annotation models used in the analysis. Namely:

    Siebert/sentiment-roberta-large-english from https://huggingface.co/siebert/sentiment-roberta-large-english. This model is a fine-tuned checkpoint of RoBERTa-large (Liu et al. 2019). It enables reliable binary sentiment analysis for various types of English-language text. For each instance, it predicts either positive (1) or negative (0) sentiment. The model was fine-tuned and evaluated on 15 data sets from diverse text sources to enhance generalization across different types of texts (reviews, tweets, etc.). See more information from the original authors at https://huggingface.co/siebert/sentiment-roberta-large-english

    DistilbertSST2.rar is the default sentiment classification model of the HuggingFace Transformer library https://huggingface.co/ This model is only used to replicate the results of the sentiment analysis with sentiment-roberta-large-english

    DistilRoberta j-hartmann/emotion-english-distilroberta-base from https://huggingface.co/j-hartmann/emotion-english-distilroberta-base. The model is a fine-tuned checkpoint of DistilRoBERTa-base. The model allows annotation of English text with Ekman's 6 basic emotions, plus a neutral class. The model was trained on 6 diverse datasets. Please refer to the original author at https://huggingface.co/j-hartmann/emotion-english-distilroberta-base for an overview of the data sets used for fine tuning. https://huggingface.co/j-hartmann/emotion-english-distilroberta-base

    -headlinesDataWithSentimentLabelsAnnotationsFromSentimentRobertaLargeModel.rar URLs of headlines analyzed and the sentiment annotations of the siebert/sentiment-roberta-large-english Transformer model. https://huggingface.co/siebert/sentiment-roberta-large-english

    -headlinesDataWithSentimentLabelsAnnotationsFromDistilbertSST2.rar URLs of headlines analyzed and the sentiment annotations of the default HuggingFace sentiment analysis model fine-tuned on the SST-2 dataset. https://huggingface.co/

    -headlinesDataWithEmotionLabelsAnnotationsFromDistilRoberta.rar URLs of headlines analyzed and the emotion categories annotations of the j-hartmann/emotion-english-distilroberta-base Transformer model. https://huggingface.co/j-hartmann/emotion-english-distilroberta-base

  7. SB10k Dataset

    • paperswithcode.com
    Updated Apr 7, 2024
    Cite
    SB10k Dataset [Dataset]. https://paperswithcode.com/dataset/sb10k
    Dataset updated
    Apr 7, 2024
    Authors
    Mark Cieliebak; Jan Milan Deriu; Dominic Egger; Fatih Uzdilli
    Description

    The SB10k dataset is a valuable resource for sentiment analysis in German. Here are the key details:

    - Corpus Size: It contains approximately 10,000 German tweets¹.
    - Language: German.
    - Task: Text classification, specifically sentiment analysis.
    - Multilinguality: Monolingual (German only).
    - Size Category: Falls within the range of 1K to 10K examples.
    - Tags: Sentiment analysis.
    - License: CC-BY-4.0.

    The dataset was created by annotating German tweets, with each tweet labeled by three annotators. Researchers have used SB10k to benchmark various machine learning classifiers, including convolutional neural networks (CNNs) and feature-based support vector machines (SVMs) for sentiment analysis²³.

    (1) Alienmaster/SB10k · Datasets at Hugging Face. https://huggingface.co/datasets/Alienmaster/SB10k.
    (2) A Twitter Corpus and Benchmark Resources for German Sentiment Analysis. https://aclanthology.org/W17-1106/.
    (3) A Twitter Corpus and Benchmark Resources for German Sentiment Analysis. https://aclanthology.org/W17-1106.pdf.

  8. Data from: Facebook Data for Sentiment Analysis

    • live.european-language-grid.eu
    • lindat.mff.cuni.cz
    • +1more
    binary format
    Updated Jul 16, 2013
    Cite
    (2013). Facebook Data for Sentiment Analysis [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/1057
    Available download formats: binary format
    Dataset updated
    Jul 16, 2013
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0) (https://creativecommons.org/licenses/by-sa/3.0/)
    License information was derived automatically

    Description

    Corpus consisting of 10,000 Facebook posts manually annotated on sentiment (2,587 positive, 5,174 neutral, 1,991 negative and 248 bipolar posts). The archive contains data and statistics in an Excel file (FBData.xlsx) and gold data in two text files with posts (gold-posts.txt) and labels (gols-labels.txt) on corresponding lines.
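
    Because the gold posts and labels sit on corresponding lines of two text files, they can be paired line by line. The sketch below shows one way to do this; it writes two small stand-in files so that it runs on its own. The UTF-8 encoding, the stand-in file names, and the single-letter label codes are assumptions, not details confirmed by the description.

```python
import tempfile
from pathlib import Path


def load_parallel(posts_path, labels_path):
    """Pair each post with the label found on the same line of the labels file."""
    posts = Path(posts_path).read_text(encoding="utf-8").splitlines()
    labels = Path(labels_path).read_text(encoding="utf-8").splitlines()
    if len(posts) != len(labels):
        raise ValueError(f"line count mismatch: {len(posts)} posts vs {len(labels)} labels")
    return list(zip(posts, labels))


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in files with invented content and hypothetical label codes.
    posts_file = Path(tmp) / "posts.txt"
    labels_file = Path(tmp) / "labels.txt"
    posts_file.write_text("First hypothetical post\nSecond hypothetical post\n", encoding="utf-8")
    labels_file.write_text("p\nn\n", encoding="utf-8")
    pairs = load_parallel(posts_file, labels_file)

print(pairs)
```

    The explicit length check guards against silently misaligned labels, which is the main risk with line-parallel files.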

  9. Broad-Coverage German Sentiment Classification Model and Dataset for Dialog...

    • live.european-language-grid.eu
    • data.niaid.nih.gov
    • +1more
    Updated Oct 10, 2023
    Cite
    Broad-Coverage German Sentiment Classification Model and Dataset for Dialog Systems [Dataset]. https://live.european-language-grid.eu/catalogue/ld/7957
    Dataset updated
    Oct 10, 2023
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Training a Broad-Coverage German Sentiment Classification Model for Dialog Systems.

    This paper describes the training of a general-purpose German sentiment classification model. Sentiment classification is an important aspect of general text analytics. Furthermore, it plays a vital role in dialogue systems and voice interfaces that depend on the ability of the system to pick up and understand emotional signals from user utterances. The presented study outlines how we have collected a new German sentiment corpus and then combined this corpus with existing resources to train a broad-coverage German sentiment model. The resulting data set contains 5.4 million labelled samples. We have used the data to train both a simple convolutional and a transformer-based classification model, and compared the results achieved on various training configurations. The model and the data set will be published along with this paper.

    The code for training and testing the models, published along with the paper, is available in this repository.

    The germansentiment Python package contains an easy-to-use interface for the model that was published with this paper.

  10. Sentiment analysis of OCR text!!

    • kaggle.com
    Updated Jul 9, 2020
    Cite
    Somnath chatterjee (2020). Sentiment analysis of OCR text!! [Dataset]. https://www.kaggle.com/somnath796/sentiment-analysis-of-ocr-text/metadata
    Dataset updated
    Jul 9, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Somnath chatterjee
    Description

    You work as a social media moderator for your firm. Your key responsibility is to tag uploaded content (images) during Pride Month based on its sentiment (positive, negative, or random) and categorize them for internal reference and SEO optimization.

    Task: Your task is to build an engine that combines the concepts of OCR and NLP that accepts a .jpg file as input, extracts the text, if any, and classifies sentiment as positive or negative. If the text sentiment is neutral or an image file does not have any text, then it is classified as random.

    Data: You must use an external dataset to train your model. The attached dataset link contains the sample data of each category [Positive | Negative | Random] and test data.
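
    The tagging rule in the task description can be sketched as a small decision function. The OCR step is left out: the function takes the already extracted text (or None when no text was found) together with any sentiment classifier. The keyword-based classifier below is a hypothetical stand-in used only to make the sketch runnable, not a suggested model.

```python
from typing import Callable, Optional


def tag_image_text(extracted_text: Optional[str], classify: Callable[[str], str]) -> str:
    """Apply the rule: positive/negative pass through, neutral or no text becomes 'random'."""
    if extracted_text is None or not extracted_text.strip():
        return "random"  # the image contained no text
    label = classify(extracted_text)
    if label in ("positive", "negative"):
        return label
    return "random"  # neutral sentiment is also tagged as random


def toy_classifier(text: str) -> str:
    """Hypothetical stand-in for a trained sentiment model."""
    lowered = text.lower()
    if "love" in lowered:
        return "positive"
    if "hate" in lowered:
        return "negative"
    return "neutral"


print(tag_image_text("Love wins", toy_classifier))
print(tag_image_text(None, toy_classifier))
```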

  11. Replication Data for: Sentiment is Not Stance: Target-Aware Opinion...

    • dataone.org
    • dataverse.harvard.edu
    Updated Nov 12, 2023
    Cite
    Bestvater, Samuel; Monroe, Burt (2023). Replication Data for: Sentiment is Not Stance: Target-Aware Opinion Classification for Political Text Analysis [Dataset]. http://doi.org/10.7910/DVN/MUYYG4
    Dataset updated
    Nov 12, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Bestvater, Samuel; Monroe, Burt
    Description

    Sentiment analysis techniques have a long history in natural language processing and have become a standard tool in the analysis of political texts, promising a conceptually straightforward automated method of extracting meaning from textual data by scoring documents on a scale from positive to negative. However, while these kinds of sentiment scores can capture the overall tone of a document, the underlying concept of interest for political analysis is often actually the document's stance with respect to a given target--how positively or negatively it frames a specific idea, individual, or group--as this reflects the author's underlying political attitudes. In this paper we question the validity of approximating author stance through sentiment scoring in the analysis of political texts, and advocate for greater attention to be paid to the conceptual distinction between a document's sentiment and its stance. Using examples from open-ended survey responses and from political discussions on social media, we demonstrate that in many political text analysis applications, sentiment and stance do not necessarily align, and therefore sentiment analysis methods fail to reliably capture ground-truth document stance, amplifying noise in the data and leading to faulty conclusions.

  12. SAT Questions and Answers for LLM 🏛️

    • kaggle.com
    Updated Oct 16, 2023
    Cite
    Training Data (2023). SAT Questions and Answers for LLM 🏛️ [Dataset]. https://www.kaggle.com/datasets/trainingdatapro/sat-history-questions-and-answers/code
    Dataset updated
    Oct 16, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Training Data
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0) (https://creativecommons.org/licenses/by-nc-nd/4.0/)
    License information was derived automatically

    Description

    SAT History Questions and Answers 🏛️ - Text Classification Dataset

    This dataset contains a collection of questions and answers for the SAT Subject Test in World History and US History. Each question is accompanied by its answer options and the correct response.

    The dataset includes questions from various topics, time periods, and regions on both World History and US History.

    💴 For Commercial Usage: To discuss your requirements, learn about the price and buy the dataset, leave a request on TrainingData to buy the dataset

    Content

    For each question, we extracted:
    - id: number of the question,
    - subject: SAT subject (World History or US History),
    - prompt: text of the question,
    - A: answer A,
    - B: answer B,
    - C: answer C,
    - D: answer D,
    - E: answer E,
    - answer: letter of the correct answer to the question
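
    One way to represent a record with the fields listed above is a small data class, sketched below. The field names follow the list; the field types and the example record are assumptions made for illustration and do not come from the dataset.

```python
from dataclasses import dataclass


@dataclass
class SatQuestion:
    id: int
    subject: str
    prompt: str
    A: str
    B: str
    C: str
    D: str
    E: str
    answer: str  # letter of the correct option, "A" through "E"

    def correct_text(self) -> str:
        """Return the text of the option named by the answer letter."""
        if self.answer not in ("A", "B", "C", "D", "E"):
            raise ValueError(f"unexpected answer letter: {self.answer!r}")
        return getattr(self, self.answer)


# Hypothetical record with placeholder content.
question = SatQuestion(
    id=1,
    subject="World History",
    prompt="Placeholder question text",
    A="Option A", B="Option B", C="Option C", D="Option D", E="Option E",
    answer="C",
)
print(question.correct_text())
```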

    💴 Buy the Dataset: This is just an example of the data. Leave a request on https://trainingdata.pro/datasets to discuss your requirements, learn about the price and buy the dataset

    TrainingData provides high-quality data annotation tailored to your needs

    keywords: answer questions, sat, gpa, university, school, exam, college, web scraping, parsing, online database, text dataset, sentiment analysis, llm dataset, language modeling, large language models, text classification, text mining dataset, natural language texts, nlp, nlp open-source dataset, text data, machine learning

  13. Sentiment Analysis outputs based on the combination of three classifiers for...

    • zenodo.org
    • data.niaid.nih.gov
    bin
    Updated Mar 4, 2022
    Cite
    Caio Mello; Gullal S. Cheema (2022). Sentiment Analysis outputs based on the combination of three classifiers for news headlines and body text [Dataset]. http://doi.org/10.5281/zenodo.6326348
    Explore at:
    Available download formats: bin
    Dataset updated
    Mar 4, 2022
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Caio Mello; Gullal S. Cheema
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Sentiment analysis outputs based on the combination of three classifiers for news headlines and body text covering the Olympic legacy of Rio 2016 and London 2012. Articles were found via the Google search engine. The dataset is composed of sentiment labels assigned to 1271 news articles in total.

    News outlets:

    • BBC
    • Daily Mail
    • The Telegraph
    • The Guardian
    • Globo
    • Estadao
    • Folha de S. Paulo

    Events covered by the articles:

    • London 2012 Olympic legacy
    • Rio 2016 Olympic legacy

    All classifiers were applied to texts in English. Texts originally published in Portuguese by the Brazilian media were automatically translated.

    Sentiment classifiers used:

    • VADER
    • BERT (trained on Amazon data)
    • BERT (trained on Twitter data - 140)

    Each document (spreadsheet - xlsx) refers to one outlet and one event (London 2012 or Rio 2016).

    How were labels assigned to the texts?

    These labels are a combination of the three sentiment classifiers listed above. If at least two of them agree on a label, that label is taken as the final one. Otherwise, the label 'other' is assigned.

    For news article body text: the proportion of sentences of each sentiment type was used to assign labels to the whole article instead of averaging the sentence scores. For example, if the proportion of sentences with negative labels is greater than 50%, then the article is assigned a negative label.
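
    The two labelling rules described above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the function names are hypothetical, and returning 'other' when no sentiment type exceeds the 50% share is an assumption, since the description only gives the "greater than 50%" example.

```python
from collections import Counter


def combine_labels(vader, bert_amazon, bert_twitter):
    """Return the label at least two classifiers agree on, else 'other'."""
    votes = Counter([vader, bert_amazon, bert_twitter])
    label, count = votes.most_common(1)[0]
    return label if count >= 2 else "other"


def article_label(sentence_labels, threshold=0.5):
    """Label an article by the share of sentences of each sentiment type.

    Assumption: if no label's share exceeds the threshold (or there are
    no sentences), 'other' is returned. The source does not specify this.
    """
    if not sentence_labels:
        return "other"
    label, count = Counter(sentence_labels).most_common(1)[0]
    return label if count / len(sentence_labels) > threshold else "other"


print(combine_labels("Pos", "Pos", "Neg"))
print(combine_labels("Pos", "Neg", "Neutral"))
print(article_label(["Neg", "Neg", "Neg", "Pos", "Neutral"]))
```

    Voting on proportions rather than averaging sentence scores means a few strongly scored sentences cannot outweigh a majority of mildly scored ones.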

    The documents are composed of the following columns:

    • Rank: the position of the article on Google search ranking
    • Date: date of article's publication (DD/MM/YYYY)
    • Link: article's link
    • Title: article's title
    • Sentiment_Title: final sentiment for article headline
    • Sentiment_Text: final sentiment for article's body text

    PS: Documents do not include articles' body text.

    Sentiment is presented in labels as follows:

    • Pos: Positive
    • Neg: Negative
    • Neutral: Neutral
    • other: inconclusive - if each of the 3 classifiers assigned a different label to the article, the label 'other' was used. Therefore, 'other' identifies contradictory results.

  14. Sample of Malay sentiment lexicon.

    • plos.figshare.com
    xls
    Updated Jun 1, 2023
    Cite
    Ahmed Al-Saffar; Suryanti Awang; Hai Tao; Nazlia Omar; Wafaa Al-Saiagh; Mohammed Al-bared (2023). Sample of Malay sentiment lexicon. [Dataset]. http://doi.org/10.1371/journal.pone.0194852.t002
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS: http://plos.org/
    Authors
    Ahmed Al-Saffar; Suryanti Awang; Hai Tao; Nazlia Omar; Wafaa Al-Saiagh; Mohammed Al-bared
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Sample of Malay sentiment lexicon.

  15. Sample of the Malay stop words.

    • plos.figshare.com
    • figshare.com
    xls
    Updated Jun 1, 2023
    Cite
    Ahmed Al-Saffar; Suryanti Awang; Hai Tao; Nazlia Omar; Wafaa Al-Saiagh; Mohammed Al-bared (2023). Sample of the Malay stop words. [Dataset]. http://doi.org/10.1371/journal.pone.0194852.t003
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Ahmed Al-Saffar; Suryanti Awang; Hai Tao; Nazlia Omar; Wafaa Al-Saiagh; Mohammed Al-bared
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Sample of the Malay stop words.

  16. Product sentiment analysis

    • kaggle.com
    zip
    Updated Sep 4, 2020
    + more versions
    Cite
    ask9 (2020). Product sentiment analysis [Dataset]. https://www.kaggle.com/arbazkhan971/product-sentiment-analysis
    Explore at:
    Available download formats: zip (406932 bytes)
    Dataset updated
    Sep 4, 2020
    Authors
    ask9
    Description

    Overview

    Analyzing sentiments related to various products such as tablets, mobiles, and other gizmos can be fun and difficult, especially when the data is collected across various demographics around the world. In this weekend hackathon, we challenge the machinehackers community to develop a machine learning model to accurately classify various products into 4 different classes of sentiments based on the raw text review provided by the user. Analyzing these sentiments will not only help us serve the customers better but can also reveal a lot of customer traits present or hidden in the reviews.

    Sentiment analysis requires a lot to be taken into account, mainly because of the preprocessing needed to represent raw text in a machine-understandable form. Usually, we stem and lemmatize the raw text and then represent it using TF-IDF, word embeddings, etc. However, given state-of-the-art NLP models such as Transformer-based BERT models, one can skip manual feature engineering such as TF-IDF and count vectorizers.
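
    For readers unfamiliar with the TF-IDF representation mentioned above, here is a minimal pure-Python sketch. It assumes the plain textbook definition (term frequency times log(N / document frequency)); library implementations usually add smoothing and normalization, so their values will differ. The `tf_idf` helper and the two sample reviews are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Compute TF-IDF weights for a list of tokenized documents.

    tf  = term count / document length
    idf = log(number of documents / documents containing the term)
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Invented sample reviews, already tokenized by whitespace.
reviews = [
    "great tablet great screen".split(),
    "poor battery on this tablet".split(),
]
weights = tf_idf(reviews)
for w in weights:
    print(w)
```

    A term that appears in every document, such as "tablet" here, gets a weight of zero under this definition, which is the sense in which TF-IDF downweights uninformative words.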

    In this short span of time, we would encourage you to leverage the ImageNet moment (Transfer Learning) in NLP using various pre-trained models.

    Dataset Description:

    • Train.csv - 6364 rows x 4 columns (includes the sentiment column as target)
    • Test.csv - 2728 rows x 3 columns
    • Sample Submission.csv - Please check the Evaluation section for more details on how to generate a valid submission

    Attribute Description:

    • Text_ID - Unique identifier
    • Product_Description - Description of the product review by a user
    • Product_Type - Different types of product (9 unique products)
    • Class - Represents various sentiments:
      • 0 - Cannot Say
      • 1 - Negative
      • 2 - Positive
      • 3 - No Sentiment

    Skills:

    • NLP, sentiment analysis
    • Feature extraction from raw text using TF-IDF, CountVectorizer
    • Using word embeddings to represent words as vectors
    • Using pretrained models like Transformers, BERT
    • Optimizing multi-class log loss to generalize well on unseen data
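
    Multi-class log loss, the metric named above, can be sketched in a few lines. This assumes the standard definition (the mean negative log-probability assigned to the true class) with the four class codes from the attribute description; the hackathon's exact evaluation script is not given here, and `multiclass_log_loss` and the sample predictions are hypothetical.

```python
import math

# Class codes as listed in the attribute description.
CLASS_NAMES = {0: "Cannot Say", 1: "Negative", 2: "Positive", 3: "No Sentiment"}


def multiclass_log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    total = 0.0
    for label, probs in zip(y_true, y_prob):
        # Clip to avoid log(0) when a model is overconfident.
        p = min(max(probs[label], eps), 1 - eps)
        total += -math.log(p)
    return total / len(y_true)


# Invented predictions: one fairly confident, one uniform guess.
y_true = [2, 1]
y_prob = [
    [0.1, 0.1, 0.7, 0.1],
    [0.25, 0.25, 0.25, 0.25],
]
print(round(multiclass_log_loss(y_true, y_prob), 4))
```

    Because the loss depends on the predicted probabilities rather than only on the top class, a submission needs one probability per class for every row.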

  17. Pierce County Sentiment

    • open.piercecountywa.gov
    • internal.open.piercecountywa.gov
    application/rdfxml +5
    Updated Feb 13, 2020
    Cite
    Kyle (2020). Pierce County Sentiment [Dataset]. https://open.piercecountywa.gov/dataset/Pierce-County-Sentiment/nqup-7zcn
    Explore at:
    Available download formats: tsv, csv, application/rssxml, json, xml, application/rdfxml
    Dataset updated
    Feb 13, 2020
    Dataset authored and provided by
    Kyle
    Area covered
    Pierce County
    Description

    The fundamental task in brand sentiment analysis is text classification: classifying the polarity of a given text, that is, whether the opinion expressed in a document is positive, negative, or neutral. Around 800 documents pass through our platform per second from different media sources and providers. We use Natural Language Processing (NLP) to judge which group (positive, negative or neutral) the content belongs to.

    Meltwater’s Natural Language Processing model is supported by AI and machine learning algorithms. Using this model, we take individual words into account. Each document, for example, a tweet, is analysed based on the words it contains. Then we map the words to a set of predefined data to see the number of occurrences where they match up.
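
    The word-matching idea described above can be illustrated with a toy lexicon lookup. This is only a sketch of the general technique, not Meltwater's model: the `LEXICON` word sets are invented, and deciding the label by comparing match counts is an assumption made for the example.

```python
import re
from collections import Counter

# Hypothetical mini-lexicon; NOT Meltwater's actual predefined data.
LEXICON = {
    "positive": {"great", "love", "excellent"},
    "negative": {"bad", "hate", "terrible"},
}


def count_matches(text):
    """Count how many words of the text match each lexicon group."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for token in tokens:
        for label, words in LEXICON.items():
            if token in words:
                counts[label] += 1
    return counts


def classify(text):
    """Assumed decision rule: the group with more matches wins, ties are neutral."""
    counts = count_matches(text)
    pos, neg = counts["positive"], counts["negative"]
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"


print(classify("I love this great park"))
print(classify("The council met today"))
```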

  18. Data_Sheet_4_Topic evolution and sentiment comparison of user reviews on an...

    • frontiersin.figshare.com
    docx
    Updated Jun 2, 2023
    + more versions
    Cite
    Chaoyang Li; Shengyu Li; Jianfeng Yang; Jingmei Wang; Yiqing Lv (2023). Data_Sheet_4_Topic evolution and sentiment comparison of user reviews on an online medical platform in response to COVID-19: taking review data of Haodf.com as an example.DOCX [Dataset]. http://doi.org/10.3389/fpubh.2023.1088119.s004
    Explore at:
    Available download formats: docx
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    Frontiers
    Authors
    Chaoyang Li; Shengyu Li; Jianfeng Yang; Jingmei Wang; Yiqing Lv
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Introduction: Throughout the COVID-19 pandemic, many patients have sought medical advice on online medical platforms. Review data have become an essential reference point for supporting users in selecting doctors. As the research object, this study considered Haodf.com, a well-known e-consultation website in China.

    Methods: This study examines the topics and sentimental change rules of user review texts from a temporal perspective. We also compared the topics and sentimental change characteristics of user review texts before and after the COVID-19 pandemic. First, 323,519 review data points about 2,122 doctors on Haodf.com were crawled using Python from 2017 to 2022. Subsequently, we employed the latent Dirichlet allocation method to cluster topics and the ROST content mining software to analyze user sentiments. Second, according to the results of the perplexity calculation, we divided text data into five topics: diagnosis and treatment attitude, medical skills and ethics, treatment effect, treatment scheme, and treatment process. Finally, we identified the most important topics and their trends over time.

    Results: Users primarily focused on diagnosis and treatment attitude, with medical skills and ethics being the second-most important topic among users. As time progressed, the attention paid by users to diagnosis and treatment attitude increased—especially during the COVID-19 outbreak in 2020, when attention to diagnosis and treatment attitude increased significantly. User attention to the topic of medical skills and ethics began to decline during the COVID-19 outbreak, while attention to treatment effect and scheme generally showed a downward trend from 2017 to 2022. User attention to the treatment process exhibited a declining tendency before the COVID-19 outbreak, but increased after. Regarding sentiment analysis, most users exhibited a high degree of satisfaction for online medical services. However, positive user sentiments showed a downward trend over time, especially after the COVID-19 outbreak.

    Discussion: This study has reference value for assisting user choice regarding medical treatment, decision-making by doctors, and online medical platform design.

  19. ACOS Dataset

    • paperswithcode.com
    Updated Dec 4, 2020
    Cite
    ACOS Dataset [Dataset]. https://paperswithcode.com/dataset/acos
    Explore at:
    Dataset updated
    Dec 4, 2020
    Authors
    Hongjie Cai; Yaofeng Tu; Xiangsheng Zhou; Jianfei Yu; Rui Xia
    Description

    Most aspect-based sentiment analysis research aims at identifying the sentiment polarities toward explicit aspect terms while ignoring implicit aspects in text. To capture both explicit and implicit aspects, we focus on aspect-category based sentiment analysis, which involves joint aspect category detection and category-oriented sentiment classification. However, currently only a few simple studies have focused on this problem. The shortcomings in the way they defined the task make it difficult for their approaches to effectively learn the inner-relations between categories and the inter-relations between categories and sentiments. In this work, we re-formalize the task as a category-sentiment hierarchy prediction problem, which contains a hierarchy output structure to first identify multiple aspect categories in a piece of text, and then predict the sentiment for each of the identified categories. Specifically, we propose a Hierarchical Graph Convolutional Network (Hier-GCN), where a lower-level GCN models the inner-relations among multiple categories, and the higher-level GCN captures the inter-relations between aspect categories and sentiments. Extensive evaluations demonstrate that our hierarchy output structure is superior to existing ones, and the Hier-GCN model can consistently achieve the best results on four benchmarks.

  20. MuSe-Sent: Multimodal Sentiment Classification in-the-Wild (MuSe2021)

    • data.niaid.nih.gov
    • zenodo.org
    Updated Apr 8, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Schuller, Björn (2022). MuSe-Sent: Multimodal Sentiment Classification in-the-Wild (MuSe2021) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4654370
    Explore at:
    Dataset updated
    Apr 8, 2022
    Dataset provided by
    Stappen, Lukas
    Baird, Alice
    Schuller, Björn
    Description

    MuSe-Sent of the 2nd Multimodal Sentiment in-the-Wild Challenge! Predicting five advanced intensity classes for each of the emotional dimensions (valence, arousal) for segments of audio-video-text data. This package includes only MuSe-Sent features (all partitions) and labels of the training and development set (test scoring via the MuSe website). More: https://www.muse-challenge.org/muse2021

    General: The purpose of the Multimodal Sentiment Analysis in Real-life media Challenge and Workshop (MuSe) is to bring together communities from different disciplines. We introduce the novel dataset MuSe-CAR that covers the range of aforementioned desiderata. MuSe-CAR is a large (>36h), multimodal dataset which has been gathered in-the-wild with the intention of further understanding Multimodal Sentiment Analysis in-the-wild, e.g., the emotional engagement that takes place during product reviews (i.e., automobile reviews) where a sentiment is linked to a topic or entity.

    We have designed MuSe-CAR to be of high voice and video quality, as informative video social media content, as well as everyday recording devices have improved in recent years. This enables robust learning, even with a high degree of novel, in-the-wild characteristics, for example as related to: i) Video: Shot size (a mix of close-up, medium, and long shots), face-angle (side, eye, low, high), camera motion (free, free but stable, and free but unstable, switch, e.g., zoom, fixed), reviewer visibility (full body, half-body, face only, and hands only), highly varying backgrounds, and people interacting with objects (car parts). ii) Audio: Ambient noises (car noises, music), narrator and host diarisation, diverse microphone types, and speaker locations. iii) Text: Colloquialisms, and domain-specific terms.

Cite
Alan Turner (2024). BBC datasets for sentiment analysis [Dataset]. https://www.kaggle.com/datasets/amunsentom/article-dataset-2/suggestions

BBC datasets for sentiment analysis

Explore at:
Croissant: Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Dec 15, 2024
Dataset provided by
Kaggle: http://kaggle.com/
Authors
Alan Turner
License

Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically

Description

Dataset Name: BBC Articles Sentiment Analysis Dataset

Source: BBC News

Description: This dataset consists of articles from the BBC News website, containing a diverse range of topics such as business, politics, entertainment, technology, sports, and more. The dataset includes articles from various time periods and categories, along with labels representing the sentiment of the article. The sentiment labels indicate whether the tone of the article is positive, negative, or neutral, making it suitable for sentiment analysis tasks.

Number of Instances: [Specify the number of articles in the dataset, for example, 2,225 articles]

Number of Features:
1. Article Text: The content of the article (string).
2. Sentiment Label: The sentiment classification of the article. The possible labels are:
   - Positive
   - Negative
   - Neutral

Data Fields:
- id: Unique identifier for each article.
- category: The category or topic of the article (e.g., business, politics, sports).
- title: The title of the article.
- content: The full text of the article.
- sentiment: The sentiment label (positive, negative, or neutral).

Example:

| id | category   | title                    | content                                                                     | sentiment |
|----|------------|--------------------------|-----------------------------------------------------------------------------|-----------|
| 1  | Business   | "Stock Market Surge"     | "The stock market has surged to new highs, driven by strong earnings..."    | Positive  |
| 2  | Politics   | "Election Results"       | "The election results were a mixed bag, with some surprises along the way." | Neutral   |
| 3  | Sports     | "Team Wins Championship" | "The team won the championship after a thrilling final match."              | Positive  |
| 4  | Technology | "New Smartphone Release" | "The new smartphone release has received mixed reactions from users."       | Negative  |

Preprocessing Notes:
- The text has been preprocessed to remove special characters and any HTML tags that might have been included in the original articles.
- Tokenization or further text cleaning (e.g., lowercasing, stopword removal) may be necessary depending on the model and method used for sentiment classification.
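
The optional cleaning steps mentioned above (lowercasing, tokenization, stopword removal) might look like the following minimal sketch. The `STOPWORDS` set is a deliberately tiny, hypothetical list and `clean_text` is an illustrative helper; a real pipeline would typically use a fuller stopword list and a tokenizer suited to the chosen model.

```python
import re

# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"a", "an", "the", "to", "of", "by", "has", "and"}


def clean_text(text):
    """Lowercase, tokenize on alphanumeric runs, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [token for token in tokens if token not in STOPWORDS]


# Content string taken from the first row of the example table.
sample = "The stock market has surged to new highs, driven by strong earnings..."
print(clean_text(sample))
```

Transformer-based models usually ship their own tokenizers, in which case lowercasing and stopword removal of this kind are often unnecessary.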

Use Case: This dataset is ideal for training and evaluating machine learning models for sentiment classification, where the goal is to predict the sentiment (positive, negative, or neutral) based on the article's text.
