100+ datasets found
  1. Twitter Tweets Sentiment Dataset

    • kaggle.com
    Updated Apr 8, 2022
    Cite
    M Yasser H (2022). Twitter Tweets Sentiment Dataset [Dataset]. https://www.kaggle.com/datasets/yasserh/twitter-tweets-sentiment-dataset
    Dataset updated
    Apr 8, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    M Yasser H
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Description:

    Twitter is an online social media platform where people share their thoughts as tweets. It has been observed that some people misuse it to tweet hateful content. Twitter is trying to tackle this problem, and we shall help by creating a strong NLP-based classifier model to distinguish negative tweets so that such tweets can be blocked. Can you build a strong classifier model to predict the same?

    Each row contains the text of a tweet and a sentiment label. In the training set you are provided with a word or phrase drawn from the tweet (selected_text) that encapsulates the provided sentiment.

    Make sure, when parsing the CSV, to remove the beginning / ending quotes from the text field, to ensure that you don't include them in your training.

    You're attempting to predict the word or phrase from the tweet that exemplifies the provided sentiment. The word or phrase should include all characters within that span (i.e. including commas, spaces, etc.)

    Columns:

    1. textID - unique ID for each piece of text
    2. text - the text of the tweet
    3. sentiment - the general sentiment of the tweet
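
    The quote-stripping advice above can be sketched as follows. This is a minimal illustration using only Python's standard library; the sample rows and IDs are made up, and only the column names textID, text, and sentiment come from the description.

```python
import csv
import io

def load_tweets(csv_text):
    """Parse CSV text into row dicts, removing stray quotes from the text field."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # The csv module already removes the outer quoting layer;
        # strip any quote characters and padding that survive it.
        row["text"] = row["text"].strip().strip('"').strip()
        rows.append(row)
    return rows

# Hypothetical sample data in the same three-column layout.
sample = (
    "textID,text,sentiment\n"
    'abc123," I love this ",positive\n'
    'xyz789,"""so sad""",negative\n'
)
rows = load_tweets(sample)
```

    Stripping after parsing, rather than with string replacement on the raw file, leaves quotes inside a tweet untouched.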

    Acknowledgement:

    The dataset was downloaded from Kaggle Competitions:
    https://www.kaggle.com/c/tweet-sentiment-extraction/data?select=train.csv

    Objective:

    • Understand the Dataset & cleanup (if required).
    • Build classification models to predict the twitter sentiments.
    • Compare the evaluation metrics of various classification algorithms.
  2. Sentiment Analysis for Social Media Monitoring Dataset

    • paperswithcode.com
    Updated Mar 6, 2025
    Cite
    (2025). Sentiment Analysis for Social Media Monitoring Dataset [Dataset]. https://paperswithcode.com/dataset/sentiment-analysis-for-social-media
    Dataset updated
    Mar 6, 2025
    Description

    Problem Statement

    A global consumer goods company struggled to understand customer sentiment across various social media platforms. With millions of posts, reviews, and comments generated daily, manually tracking and analyzing public opinion was inefficient. The company needed an automated solution to monitor brand perception, address negative feedback promptly, and leverage insights for marketing strategies.

    Challenge

    Analyzing social media sentiment posed the following challenges:

    • Processing vast amounts of unstructured text data from multiple platforms like Twitter, Facebook, and Instagram.
    • Accurately interpreting slang, emojis, and nuanced language used by social media users.
    • Identifying trends and actionable insights in real-time to respond to potential crises or opportunities effectively.

    Solution Provided

    An advanced sentiment analysis system was developed using Natural Language Processing (NLP) and sentiment analysis algorithms. The solution was designed to:

    • Classify social media posts into positive, negative, and neutral sentiments.
    • Extract key topics and trends related to the brand and its products.
    • Provide real-time dashboards for monitoring customer sentiment and identifying areas of improvement.

    Development Steps

    Data Collection

    Aggregated data from major social media platforms using APIs, focusing on brand mentions, hashtags, and product keywords.

    Preprocessing

    Cleaned and normalized text data, including handling slang, emojis, and misspellings, to prepare it for analysis.

    Model Training

    Trained NLP models for sentiment classification using supervised learning. Implemented topic modeling algorithms to identify recurring themes and discussions.

    Validation

    Tested the sentiment analysis models on labeled datasets to ensure high accuracy and relevance in classifying social media posts.

    Deployment

    Integrated the sentiment analysis system with a real-time analytics dashboard, enabling the marketing and customer support teams to track trends and respond proactively.

    Monitoring & Improvement

    Established a continuous feedback mechanism to refine models based on evolving language patterns and new social media trends.

    Results

    Gained Actionable Insights

    The system provided detailed insights into customer opinions, helping the company identify strengths and areas for improvement.

    Improved Brand Reputation Management

    Real-time monitoring enabled swift responses to negative feedback, mitigating potential reputation risks.

    Informed Marketing Strategies

    Insights from sentiment analysis guided targeted marketing campaigns, resulting in higher engagement and ROI.

    Enhanced Customer Relationships

    Proactive engagement with customers based on sentiment analysis improved customer satisfaction and loyalty.

    Scalable Monitoring Solution

    The system scaled efficiently to analyze data across multiple languages and platforms, broadening the company’s reach and understanding.

  3. Sentiment Analysis Dataset

    • cubig.ai
    Updated May 20, 2025
    Cite
    CUBIG (2025). Sentiment Analysis Dataset [Dataset]. https://cubig.ai/store/products/270/sentiment-analysis-dataset
    Dataset updated
    May 20, 2025
    Dataset authored and provided by
    CUBIG
    License

    https://cubig.ai/store/terms-of-service

    Measurement technique
    Privacy-preserving data transformation via differential privacy, Synthetic data generation using AI techniques for model training
    Description

    1) Data Introduction
    • The Sentiment Analysis Dataset is a dataset for sentiment analysis. It contains large-scale tweet text collected from Twitter with a sentiment polarity label for each tweet (0 = negative, 2 = neutral, 4 = positive); the labels were assigned automatically based on emoticons.

    2) Data Utilization
    (1) Characteristics of the Sentiment Analysis Dataset:
    • Each sample consists of six columns: sentiment polarity, tweet ID, date of writing, search word, author, and tweet body. The dataset is suitable for training natural language processing and classification models using tweet text and sentiment labels.
    (2) The Sentiment Analysis Dataset can be used for:
    • Sentiment classification model development: using tweet text and polarity labels, positive, negative, and neutral classifiers can be built with various machine learning and deep learning models such as logistic regression, SVM, RNN, and LSTM.
    • Analysis of SNS public opinion and trends: by analyzing the distribution of sentiment over time and by keyword, you can explore changes in public opinion on specific issues or brands, positive and negative trends, and key sentiment keywords.
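
    As a small illustration of the 0/2/4 polarity coding described above, the sketch below decodes raw codes into class names and counts them. The function names and sample values are hypothetical; only the code-to-class mapping comes from the description.

```python
from collections import Counter

# Polarity codes as listed in the dataset description.
POLARITY_NAMES = {0: "negative", 2: "neutral", 4: "positive"}

def decode_polarity(code):
    """Map a raw polarity code (int or numeric string) to its class name."""
    return POLARITY_NAMES[int(code)]

def class_distribution(codes):
    """Count how many samples fall into each named class."""
    return Counter(decode_polarity(code) for code in codes)

# Hypothetical polarity values as they might be read from a file.
sample_codes = ["0", "4", "4", "2", "0", "4"]
distribution = class_distribution(sample_codes)
```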

  4. Twitter Sentiment Analysis Dataset

    • paperswithcode.com
    Cite
    Twitter Sentiment Analysis Dataset [Dataset]. https://paperswithcode.com/dataset/twitter-sentiment-analysis
    Description

    This is an entity-level Twitter Sentiment Analysis dataset. For each message, the task is to judge the sentiment of the entire sentence towards a given entity. For example, "A outperforms B" is positive for entity A but negative for entity B. The dataset contains ~70K labeled training messages and 1K labeled validation messages. It is available online for free on Kaggle.

  5. A Labelled Dataset for Sentiment Analysis of Videos on YouTube, TikTok, and...

    • search.dataone.org
    Updated Sep 24, 2024
    Cite
    Thakur, Nirmalya; Su, Vanessa; Shao, Mingchen; Patel, Kesha A.; Jeong, Hongseok; Knieling, Victoria; Bian, Andrew (2024). A Labelled Dataset for Sentiment Analysis of Videos on YouTube, TikTok, and Other Sources about the 2024 Outbreak of Measles [Dataset]. http://doi.org/10.7910/DVN/QTJ9HC
    Dataset updated
    Sep 24, 2024
    Dataset provided by
    Harvard Dataverse
    Authors
    Thakur, Nirmalya; Su, Vanessa; Shao, Mingchen; Patel, Kesha A.; Jeong, Hongseok; Knieling, Victoria; Bian, Andrew
    Time period covered
    Jan 1, 2024 - May 31, 2024
    Area covered
    YouTube
    Description

    Please cite the following paper when using this dataset: N. Thakur, V. Su, M. Shao, K. Patel, H. Jeong, V. Knieling, and A. Bian, “A labelled dataset for sentiment analysis of videos on YouTube, TikTok, and other sources about the 2024 outbreak of measles,” arXiv [cs.CY], 2024. Available: http://arxiv.org/abs/2406.07693

    Abstract

    This dataset contains the data of 4011 videos about the ongoing outbreak of measles published on 264 websites on the internet between January 1, 2024, and May 31, 2024. These websites primarily include YouTube and TikTok, which account for 48.6% and 15.2% of the videos, respectively. The remainder of the websites include Instagram and Facebook as well as the websites of various global and local news organizations. For each of these videos, the URL of the video, title of the post, description of the post, and the date of publication of the video are presented as separate attributes in the dataset.

    After developing this dataset, sentiment analysis (using VADER), subjectivity analysis (using TextBlob), and fine-grain sentiment analysis (using DistilRoBERTa-base) of the video titles and video descriptions were performed. This included classifying each video title and video description into (i) one of the sentiment classes, i.e. positive, negative, or neutral, (ii) one of the subjectivity classes, i.e. highly opinionated, neutral opinionated, or least opinionated, and (iii) one of the fine-grain sentiment classes, i.e. fear, surprise, joy, sadness, anger, disgust, or neutral. These results are presented as separate attributes in the dataset for the training and testing of machine learning algorithms for performing sentiment analysis or subjectivity analysis in this field as well as for other applications.

    The paper associated with this dataset (please see the above-mentioned citation) also presents a list of open research questions that may be investigated using this dataset.

  6. AI Training Data | Audio Data| Unique Consumer Sentiment Data: Recordings of...

    • datarade.ai
    .mp3, .wav
    Updated Dec 8, 2023
    Cite
    WiserBrand.com (2023). AI Training Data | Audio Data| Unique Consumer Sentiment Data: Recordings of the calls between consumers and companies [Dataset]. https://datarade.ai/data-products/ai-training-data-audio-data-unique-consumer-sentiment-data-wiserbrand-com
    Available download formats: .mp3, .wav
    Dataset updated
    Dec 8, 2023
    Dataset provided by
    WiserBrand.com
    Area covered
    United States of America
    Description

    WiserBrand offers a unique dataset of real consumer-to-business phone conversations. These high-quality audio recordings capture authentic interactions between consumers and support agents across industries. Unlike synthetic data or scripted samples, our dataset reflects natural speech patterns, emotion, intent, and real-world phrasing — making it ideal for:

    • Training ASR (Automatic Speech Recognition) systems
    • Improving voice assistants and LLM audio understanding
    • Enhancing call center AI tools (e.g., sentiment analysis, intent detection)
    • Benchmarking conversational AI performance with real-world noise and context

    We ensure strict data privacy: all personally identifiable information (PII) is removed before delivery. Recordings are produced on demand and can be tailored by vertical (e.g., telecom, finance, e-commerce) or use case.

    Whether you're building next-gen voice technology or need realistic conversational datasets to test models, this dataset provides what synthetic corpora lack — realism, variation, and authenticity.

  7. CustomerTicketSyntheticData

    • huggingface.co
    Updated Sep 1, 2024
    Cite
    Bharath Varma Kantheti (2024). CustomerTicketSyntheticData [Dataset]. https://huggingface.co/datasets/BharathBOLT/CustomerTicketSyntheticData
    Dataset updated
    Sep 1, 2024
    Authors
    Bharath Varma Kantheti
    Description

    Customer Support Ticket Sentiment Analysis (Synthetic Data)

    Overview

    This dataset contains synthetically generated customer support tickets with corresponding sentiment labels. It is designed to simulate real-world customer interactions across various industries, providing a balanced distribution of sentiment classes. This dataset is ideal for training and testing machine learning models for sentiment analysis in customer support scenarios.

    Dataset Details… See the full description on the dataset page: https://huggingface.co/datasets/BharathBOLT/CustomerTicketSyntheticData.

  8. Training Data for German Sentiment Analysis of Political Communication (SUF...

    • b2find.eudat.eu
    Updated Nov 26, 2020
    Cite
    The citation is currently not available for this dataset.
    Dataset updated
    Nov 26, 2020
    Description

    Full edition for scientific use. The dataset contains 125871 sentences extracted from Austrian parliamentary debates and party press releases. Press releases were collected under the auspices of the Austrian National Election Study (AUTNES) and cover the 6 weeks prior to each national election 1995-2013. Data from parliamentary debates stem from a random sample of sentences drawn from sessions of the Austrian National Council (1995-2013). The sentiment of the sentences was crowdcoded on a five-point scale ranging from 0 “Not negative” to 5 “Very strongly negative”. As each sentence has been coded by ten coders, there are multiple codingids for each unitid (sentence).

  9. Broad-Coverage German Sentiment Classification Model and Dataset for Dialog...

    • live.european-language-grid.eu
    • data.niaid.nih.gov
    • +1more
    Updated Oct 10, 2023
    Cite
    (2023). Broad-Coverage German Sentiment Classification Model and Dataset for Dialog Systems [Dataset]. https://live.european-language-grid.eu/catalogue/ld/7957
    Dataset updated
    Oct 10, 2023
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Training a Broad-Coverage German Sentiment Classification Model for Dialog Systems.

    This paper describes the training of a general-purpose German sentiment classification model. Sentiment classification is an important aspect of general text analytics. Furthermore, it plays a vital role in dialogue systems and voice interfaces that depend on the ability of the system to pick up and understand emotional signals from user utterances. The presented study outlines how we have collected a new German sentiment corpus and then combined this corpus with existing resources to train a broad-coverage German sentiment model. The resulting data set contains 5.4 million labelled samples. We have used the data to train both a simple convolutional and a transformer-based classification model and compared the results achieved with various training configurations. The model and the data set will be published along with this paper.

    The code for training and testing the models, published along with the paper, is available in this repository.

    The germansentiment Python package contains an easy-to-use interface for the model that was published with this paper.

  10. Twitter Sentiment Analysis

    • kaggle.com
    Updated Aug 9, 2021
    Cite
    passionate-nlp (2021). Twitter Sentiment Analysis [Dataset]. https://www.kaggle.com/jp797498e/twitter-entity-sentiment-analysis/tasks
    Dataset updated
    Aug 9, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    passionate-nlp
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Twitter Sentiment Analysis Dataset

    Overview

    This is an entity-level sentiment analysis dataset of Twitter messages. Given a message and an entity, the task is to judge the sentiment of the message about the entity. There are three classes in this dataset: Positive, Negative, and Neutral. We regard messages that are not relevant to the entity (i.e. Irrelevant) as Neutral.

    Usage

    Please use twitter_training.csv as the training set and twitter_validation.csv as the validation set. Top 1 classification accuracy is used as the metric.
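
    A minimal sketch of the metric described above: top-1 accuracy with Irrelevant folded into Neutral before comparison. The function names and sample labels are illustrative, and applying the same folding to predictions is an assumption rather than something the dataset page specifies.

```python
def normalize_label(label):
    """Treat 'Irrelevant' as 'Neutral', as the dataset overview describes."""
    return "Neutral" if label == "Irrelevant" else label

def top1_accuracy(y_true, y_pred):
    """Fraction of samples whose predicted class equals the true class."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(
        normalize_label(t) == normalize_label(p)
        for t, p in zip(y_true, y_pred)
    )
    return correct / len(y_true)

# Hypothetical labels for four validation messages.
y_true = ["Positive", "Irrelevant", "Negative", "Neutral"]
y_pred = ["Positive", "Neutral", "Positive", "Neutral"]
accuracy = top1_accuracy(y_true, y_pred)
```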

  11. Sentiment-Analysis-Over-sampled

    • huggingface.co
    Updated Dec 3, 2024
    Cite
    Syed Khalid Hussain (2024). Sentiment-Analysis-Over-sampled [Dataset]. https://huggingface.co/datasets/syedkhalid076/Sentiment-Analysis-Over-sampled
    Dataset updated
    Dec 3, 2024
    Authors
    Syed Khalid Hussain
    Description

    Sentiment Analysis Dataset

    Overview

    This dataset is designed for sentiment analysis tasks, offering a balanced and pre-processed collection of labeled text data. The dataset includes three sentiment labels:

    0: Negative
    1: Neutral
    2: Positive

    The training dataset has been oversampled to ensure balanced label distribution, making it suitable for training robust sentiment analysis models. The validation and test datasets remain unaltered to preserve the original… See the full description on the dataset page: https://huggingface.co/datasets/syedkhalid076/Sentiment-Analysis-Over-sampled.
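
    The dataset page does not say which oversampling method was used. One common approach, random oversampling with replacement, could look like the sketch below; the function name, the `label` key, and the toy rows are assumptions for illustration only.

```python
import random
from collections import defaultdict

def oversample(rows, label_key="label", seed=0):
    """Duplicate minority-class rows at random until all classes match the largest."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Draw the missing rows with replacement from the same class.
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced

# Toy training rows: 0 = Negative, 1 = Neutral, 2 = Positive.
train = (
    [{"text": f"neg {i}", "label": 0} for i in range(2)]
    + [{"text": f"neu {i}", "label": 1} for i in range(5)]
    + [{"text": f"pos {i}", "label": 2} for i in range(3)]
)
balanced_train = oversample(train)
```

    Only the training split is resampled here, matching the description's note that validation and test data remain unaltered.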

  12. custom_sentiment_analysis_dataset

    • huggingface.co
    Updated Sep 20, 2024
    Cite
    choi hyun woo (2024). custom_sentiment_analysis_dataset [Dataset]. https://huggingface.co/datasets/t7439/custom_sentiment_analysis_dataset
    Dataset updated
    Sep 20, 2024
    Authors
    choi hyun woo
    Description

    Dataset Card for Custom Text Dataset

    Dataset Name

    Custom Text Dataset

    Overview

    This dataset contains text data for training sentiment analysis models. The data is collected from various sources, including books, articles, and web pages.

    Composition

    • Number of records: 50,000
    • Fields: text, label
    • Size: 134 MB

    Collection Process

    The data was collected using web scraping and manual extraction from public domain sources.… See the full description on the dataset page: https://huggingface.co/datasets/t7439/custom_sentiment_analysis_dataset.

  13. Pakistani Traffic Sentiment Analysis

    • kaggle.com
    Updated Feb 24, 2023
    Cite
    Muhammad Altaf Khan (2023). Pakistani Traffic Sentiment Analysis [Dataset]. https://www.kaggle.com/datasets/altafk/pakistani-traffic-sentiment-analysis
    Dataset updated
    Feb 24, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Muhammad Altaf Khan
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    Pakistan
    Description

    The dataset has two columns: "Text" and "Label". The "Text" column contains text expressing sentiment about Pakistani traffic, including both positive and negative reviews. The "Label" column classifies each entry as either positive or negative, where positive reviews are labeled "0" and negative reviews are labeled "1".

    This dataset can be used for sentiment analysis tasks, which involve using natural language processing techniques to analyze and classify text data based on the emotions and opinions expressed within the text. By training a machine learning model on this dataset, you can create a system that automatically classifies new traffic sentiments as either positive or negative.

    Some possible applications of this type of sentiment analysis include monitoring public opinion about traffic-related issues, identifying areas where improvements are needed, and evaluating the effectiveness of traffic-related policies and initiatives. Additionally, businesses in the transportation industry could use this type of analysis to understand customer feedback and improve their services accordingly.

  14. AI Training Data | US Transcription Data| Unique Consumer Sentiment Data:...

    • datarade.ai
    Updated Jan 13, 2025
    Cite
    WiserBrand.com (2025). AI Training Data | US Transcription Data| Unique Consumer Sentiment Data: Transcription of the calls to the companies [Dataset]. https://datarade.ai/data-products/wiserbrand-ai-training-data-us-transcription-data-unique-wiserbrand-com
    Available download formats: .csv, .xls, .txt, .json
    Dataset updated
    Jan 13, 2025
    Dataset provided by
    WiserBrand.com
    Area covered
    United States
    Description

    WiserBrand's Comprehensive Customer Call Transcription Dataset: Tailored Insights

    WiserBrand offers a customizable dataset comprising transcribed customer call records, meticulously tailored to your specific requirements. This extensive dataset includes:

    • User ID and Firm Name: Identify and categorize calls by unique user IDs and company names.
    • Call Duration: Analyze engagement levels through call lengths.
    • Geographical Information: Detailed data on city, state, and country for regional analysis.
    • Call Timing: Track peak interaction times with precise timestamps.
    • Call Reason and Group: Categorised reasons for calls, helping to identify common customer issues.
    • Device and OS Types: Information on the devices and operating systems used for technical support analysis.
    • Transcriptions: Full-text transcriptions of each call, enabling sentiment analysis, keyword extraction, and detailed interaction reviews.

    Our dataset is designed for businesses aiming to enhance customer service strategies, develop targeted marketing campaigns, and improve product support systems. Gain actionable insights into customer needs and behavior patterns with this comprehensive collection, particularly useful for Consumer Data, Consumer Behavior Data, Consumer Sentiment Data, Consumer Review Data, AI Training Data, Textual Data, and Transcription Data applications.

    WiserBrand's dataset is essential for companies looking to leverage Consumer Data and B2B Marketing Data to drive their strategic initiatives in the English-speaking markets of the USA, UK, and Australia. By accessing this rich dataset, businesses can uncover trends and insights critical for improving customer engagement and satisfaction.

    Cases:

    1. Training Speech Recognition (Speech-to-Text) and Speech Synthesis (Text-to-Speech) Models

    WiserBrand's Comprehensive Customer Call Transcription Dataset is an excellent resource for training and improving speech recognition models (Speech-to-Text, STT) and speech synthesis systems (Text-to-Speech, TTS). Here’s how this dataset can contribute to these tasks:

    Enriching STT Models: The dataset includes a wide variety of real-world customer service calls with diverse accents, tones, and terminologies. This makes it highly valuable for training speech-to-text models to better recognize different dialects, regional speech patterns, and industry-specific jargon. It could help improve accuracy in transcribing conversations in customer service, sales, or technical support.

    Contextualized Speech Recognition: Given the contextual information (e.g., reasons for calls, call categories, etc.), it can help models differentiate between various types of conversations (technical support vs. sales queries), which would improve the model’s ability to transcribe in a more contextually relevant manner.

    Improving TTS Systems: The transcriptions, along with their associated metadata (such as call duration, timing, and call reason), can aid in training Text-to-Speech models that mimic natural conversation patterns, including pauses, tone variation, and proper intonation. This is especially beneficial for developing conversational agents that sound more natural and human-like in their responses.

    Noise and Speech Quality Handling: Real-world customer service calls often contain background noise, overlapping speech, and interruptions, which are crucial elements for training speech models to handle real-life scenarios more effectively.

    2. Training AI Agents for Replacing Customer Service Representatives

    WiserBrand’s dataset can be incredibly valuable for businesses looking to develop AI-powered customer support agents that can replace or augment human customer service representatives. Here’s how this dataset supports AI agent training:

    Customer Interaction Simulation: The transcriptions provide a comprehensive view of real customer interactions, including common queries, complaints, and support requests. By training AI models on this data, businesses can equip their virtual agents with the ability to understand customer concerns, follow up on issues, and provide meaningful solutions, all while mimicking human-like conversational flow.

    Sentiment Analysis and Emotional Intelligence: The full-text transcriptions, along with associated call metadata (e.g., reason for the call, call duration, and geographical data), allow for sentiment analysis, enabling AI agents to gauge the emotional tone of customers. This helps the agents respond appropriately, whether it’s providing reassurance during frustrating technical issues or offering solutions in a polite, empathetic manner. Such capabilities are essential for improving customer satisfaction in automated systems.

    Customizable Dialogue Systems: The dataset allows for categorizing and identifying recurring call patterns and issues. This means AI agents can be trained to recognize the types of queries that come up frequently, allowing them to automate routine tasks such as ...

  15. Sentiment analysis in Galaxy with IMDB movie review dataset

    • data.niaid.nih.gov
    • zenodo.org
    Updated Aug 4, 2022
    Cite
    Kaivan Kamali (2022). Sentiment analysis in Galaxy with IMDB movie review dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4477880
    Dataset updated
    Aug 4, 2022
    Dataset authored and provided by
    Kaivan Kamali
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    IMDB movie review sentiment classification dataset (Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. (2011). Learning Word Vectors for Sentiment Analysis. The 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011)). For more information please refer to: https://ai.stanford.edu/~amaas/data/sentiment/

    The IMDB dataset was modified as follows to prepare it for use in a Galaxy Training Tutorial (https://training.galaxyproject.org/):

    • The top 50 words are excluded (mostly stop words).
    • The next 10,000 most frequent words are included.
    • Reviews are limited to 500 words (longer reviews are trimmed and shorter reviews are padded).
    • 25,000 reviews each are used for training and testing.
    • Files are in TSV (tab-separated values) format to be consumed by Galaxy (www.usegalaxy.org).
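
    The preprocessing steps above can be approximated with the sketch below. It is not the tutorial's actual pipeline: the function names are hypothetical, and dropping out-of-vocabulary words, reserving ID 0 for padding, and padding at the end of a review are all assumptions.

```python
from collections import Counter

def build_vocab(tokenized_reviews, skip_top=50, vocab_size=10_000):
    """Rank words by frequency, skip the most frequent ones, keep the next block."""
    counts = Counter(tok for review in tokenized_reviews for tok in review)
    ranked = [word for word, _ in counts.most_common()]
    kept = ranked[skip_top:skip_top + vocab_size]
    # ID 0 is reserved for padding (an assumption of this sketch).
    return {word: idx + 1 for idx, word in enumerate(kept)}

def encode(review, vocab, max_len=500, pad_id=0):
    """Map words to IDs, drop unknown words, then trim or pad to max_len."""
    ids = [vocab[tok] for tok in review if tok in vocab][:max_len]
    return ids + [pad_id] * (max_len - len(ids))

# Toy corpus with small limits so the effect is visible.
corpus = [["the", "the", "the", "film", "film", "great"]]
vocab = build_vocab(corpus, skip_top=1, vocab_size=2)
padded = encode(["the", "great", "film", "film"], vocab, max_len=5)
trimmed = encode(["the", "great", "film", "film"], vocab, max_len=2)
```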

  16. Capriccio Dataset

    • paperswithcode.com
    Updated Sep 15, 2022
    Cite
    Jie You; Jae-Won Chung; Mosharaf Chowdhury (2022). Capriccio Dataset [Dataset]. https://paperswithcode.com/dataset/capriccio
    Dataset updated
    Sep 15, 2022
    Authors
    Jie You; Jae-Won Chung; Mosharaf Chowdhury
    Description

    Capriccio is a sentiment classification dataset on tweets that simulates data drift. It is created by slicing the Sentiment140 dataset (homepage, Huggingface datasets) with a sliding window of 500,000 tweets, resulting in 38 slices. Thus, each slice can be used to represent the training/validation dataset of a sentiment classification model that is re-trained every day. Each slice has 425,000 tweets for training (file named %d_train.json) and 75,000 tweets for validation (file named %d_val.json).
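
    The slicing idea can be sketched as below. The description gives the window size (500,000) and the per-slice split (425,000 / 75,000) but not the stride between windows or how each slice is divided, so `stride` is left as a parameter and taking the first part of each slice for training is an assumption. Toy sizes are used in place of the real ones.

```python
def sliding_slices(items, window, stride):
    """Return every full window of `window` items, advancing by `stride` each time."""
    last_start = len(items) - window
    return [items[s:s + window] for s in range(0, last_start + 1, stride)]

def split_slice(slice_items, train_size):
    """Split one slice into a training part and a validation part."""
    return slice_items[:train_size], slice_items[train_size:]

# Toy stand-in for a time-ordered list of tweets.
tweets = list(range(10))
slices = sliding_slices(tweets, window=4, stride=2)
train, val = split_slice(slices[0], train_size=3)
```

    With the real data, the same calls would use `window=500_000` and `train_size=425_000`.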

    The name comes from the adjective capricious.

  17. Training Data for German Sentiment Analysis of Political Communication (SUF...

    • data.aussda.at
    bin, pdf, tsv
    Updated Nov 26, 2020
    Cite
    Martin Haselmayer; Marcelo Jenny (2020). Training Data for German Sentiment Analysis of Political Communication (SUF edition) [Dataset]. http://doi.org/10.11587/EOPCOB
    Available download formats: pdf(86728), tsv(448), tsv(23371430), bin(9134185)
    Dataset updated
    Nov 26, 2020
    Dataset provided by
    AUSSDA
    Authors
    Martin Haselmayer; Marcelo Jenny
    License

    https://data.aussda.at/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.11587/EOPCOB

    Area covered
    Austria
    Dataset funded by
    Austrian Science Fund
    Vienna Anniversary Foundation for Higher Education
    Description

    Full edition for scientific use. The dataset contains 125,871 sentences extracted from Austrian parliamentary debates and party press releases. Press releases were collected under the auspices of the Austrian National Election Study (AUTNES) and cover the 6 weeks prior to each national election from 1995 to 2013. Data from parliamentary debates stem from a random sample of sentences drawn from sessions of the Austrian National Council (1995-2013). The sentiment of the sentences was crowdcoded on a five-point scale ranging from 0 “Not negative” to 5 “Very strongly negative”. As each sentence has been coded by ten coders, there are multiple codingids for each unitid (sentence).
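
    Because each unitid carries several codings, a per-sentence score has to be aggregated before use. The sketch below takes the mean per unitid; unitid and codingid are named in the description, while the "score" field, the function name, and the choice of the mean are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean


def mean_score_per_unit(codings):
    """Average all codings that share a unitid (one unitid = one sentence)."""
    scores_by_unit = defaultdict(list)
    for row in codings:
        # "score" is a hypothetical field name; check the actual TSV header.
        scores_by_unit[row["unitid"]].append(row["score"])
    return {unit: mean(scores) for unit, scores in scores_by_unit.items()}


# Made-up rows, only to show the shape of the data.
toy_codings = [
    {"unitid": 1, "codingid": 10, "score": 0},
    {"unitid": 1, "codingid": 11, "score": 2},
    {"unitid": 2, "codingid": 12, "score": 4},
]
toy_means = mean_score_per_unit(toy_codings)
```

    Other aggregates (median, majority vote) may suit a given analysis better; the mean is just the simplest choice.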

  18. financial-tweets-sentiment

    • huggingface.co
    Updated Dec 15, 2023
    Tim Koornstra (2023). financial-tweets-sentiment [Dataset]. https://huggingface.co/datasets/TimKoornstra/financial-tweets-sentiment
    Dataset updated
    Dec 15, 2023
    Authors
    Tim Koornstra
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Financial Sentiment Analysis Dataset

    Overview

    This dataset is a comprehensive collection of tweets focused on financial topics, meticulously curated to assist in sentiment analysis in the domain of finance and stock markets. It serves as a valuable resource for training machine learning models to understand and predict sentiment trends based on social media discourse, particularly within the financial sector.

    Data Description

    The dataset comprises tweets… See the full description on the dataset page: https://huggingface.co/datasets/TimKoornstra/financial-tweets-sentiment.

  19. Tweets_Sentiments

    • kaggle.com
    Updated Aug 18, 2024
    amir meymandi (2024). Tweets_Sentiments [Dataset]. https://www.kaggle.com/datasets/amirmeymandi/tweets-sentiments
    Dataset updated
    Aug 18, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    amir meymandi
    Description

    Overview

    This dataset contains text samples labeled with sentiment categories, including positive, negative, and neutral sentiments. It is designed for sentiment analysis and can be used to train and evaluate machine learning models aimed at understanding the emotional tone of text data.

    Content

    • Text Data: The dataset includes tweets.
    • Sentiment Labels: Each text sample is annotated with a sentiment label: Positive, Negative, or Neutral.
    • Number of Records: The dataset consists of 543 text samples.

    Usage

    This dataset can be used for training sentiment analysis models, evaluating model performance, and conducting research on natural language processing (NLP) and sentiment classification. It is suitable for machine learning projects focusing on sentiment detection, opinion mining, and text classification.

    Format

    The dataset is provided in CSV format with the following columns:
    • tweet: the tweet text, e.g. "I’m so proud of my team for finishing the project ahead of schedule!"
    • label: Positive, Negative, or Neutral
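
    A file with this layout could be read as sketched below, using only the standard library. The tweet and label column names and the three label values come from the description; the function name is hypothetical, and the second sample row is a made-up placeholder, not taken from the dataset.

```python
import csv
import io
from collections import Counter

ALLOWED_LABELS = {"Positive", "Negative", "Neutral"}

# In-memory stand-in for the CSV file; only the first data row is quoted from the description.
SAMPLE_CSV = """tweet,label
I’m so proud of my team for finishing the project ahead of schedule!,Positive
placeholder tweet text,Neutral
"""


def load_rows(fileobj):
    """Read (tweet, label) pairs and reject labels outside the three documented values."""
    rows = []
    for row in csv.DictReader(fileobj):
        label = row["label"].strip()
        if label not in ALLOWED_LABELS:
            raise ValueError(f"unexpected label: {label!r}")
        rows.append((row["tweet"].strip(), label))
    return rows


sample_rows = load_rows(io.StringIO(SAMPLE_CSV))
label_counts = Counter(label for _, label in sample_rows)
```

    For the real file, open it with open(path, newline="", encoding="utf-8") and pass the handle to load_rows.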

  20. Deeply Vocal Characterizer Dataset - AI & ML Training Data, South Korea

    • datarade.ai
    Updated Dec 30, 2020
    Deeply (2020). Deeply Vocal Characterizer Dataset - AI & ML Training Data, South Korea [Dataset]. https://datarade.ai/data-products/vocal-characterizer-dataset-deeply
    Dataset updated
    Dec 30, 2020
    Dataset authored and provided by
    Deeply
    Area covered
    South Korea
    Description

    The Vocal Characterizer Dataset is a human nonverbal vocal sound dataset consisting of 56.7 hours of short clips from 1419 speakers, crowdsourced from the general public in South Korea and validated by the AI data platform. The dataset also includes metadata such as age, sex, noise level, and quality of utterance. The 16 classes of human nonverbal sound are ‘teeth-chattering’, ‘teeth-grinding’, ‘tongue-clicking’, ‘nose-blowing’, ‘coughing’, ‘yawning’, ‘throat-clearing’, ‘sighing’, ‘lip-popping’, ‘lip-smacking’, ‘panting’, ‘crying’, ‘laughing’, ‘sneezing’, ‘moaning’, and ‘screaming’.

    The dataset is described as the first of its kind, owing to its large volume, its variety of nonverbal vocal cues, and its range of participants.

    We expect that using this dataset will enable precise detection of nonverbal vocal cues and a better understanding of human conversation.

    We're ready to deliver further information, statistics, or samples upon request. Don't hesitate to reach out!

    The dataset can be delivered as either original WAV files (44,100 Hz, 16-bit PCM, 1 channel) or a single compressed h5 file (resampled to 16,000 Hz).
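
    A delivered WAV file can be checked against the stated format with Python's standard wave module. The sketch below builds a short silent clip in memory as a stand-in for a real file, so nothing here reflects actual dataset contents; the function names are hypothetical.

```python
import io
import wave


def make_silent_wav(framerate=44_100, sampwidth=2, nchannels=1, n_frames=441):
    """Build a tiny silent WAV clip in memory (a stand-in for a dataset file)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(nchannels)
        wav.setsampwidth(sampwidth)  # bytes per sample: 2 bytes = 16 bits
        wav.setframerate(framerate)
        wav.writeframes(b"\x00" * (n_frames * sampwidth * nchannels))
    return buffer.getvalue()


def wav_format(data):
    """Return (sample rate in Hz, bits per sample, channel count) from WAV bytes."""
    with wave.open(io.BytesIO(data), "rb") as wav:
        return wav.getframerate(), wav.getsampwidth() * 8, wav.getnchannels()


# The format stated in the description: 44,100 Hz, 16-bit PCM, 1 channel.
EXPECTED_FORMAT = (44_100, 16, 1)
clip_format = wav_format(make_silent_wav())
```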

M Yasser H (2022). Twitter Tweets Sentiment Dataset [Dataset]. https://www.kaggle.com/datasets/yasserh/twitter-tweets-sentiment-dataset
Twitter Tweets Sentiment Dataset

Twitter Tweets Sentiment Analysis for Natural Language Processing

37 scholarly articles cite this dataset (View in Google Scholar)
Dataset updated
Apr 8, 2022
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
M Yasser H
License

https://creativecommons.org/publicdomain/zero/1.0/

Description

Twitter is an online social media platform where people share their thoughts as tweets. Some people misuse it to tweet hateful content. Twitter is trying to tackle this problem, and we shall help by creating a strong NLP-based classifier model that identifies negative tweets so they can be blocked. Can you build a strong classifier model to predict the same?

Each row contains the text of a tweet and a sentiment label. In the training set you are provided with a word or phrase drawn from the tweet (selected_text) that encapsulates the provided sentiment.

Make sure, when parsing the CSV, to remove the beginning / ending quotes from the text field, to ensure that you don't include them in your training.

You're attempting to predict the word or phrase from the tweet that exemplifies the provided sentiment. The word or phrase should include all characters within that span (i.e. including commas, spaces, etc.)
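
The two points above (removing the surrounding quotes and treating the answer as an exact character span) can be sketched as follows. The helper names and the example tweet are made up for illustration; note that Python's csv module already removes standard CSV quoting, so the first helper only matters when literal quote characters survive parsing.

```python
def strip_outer_quotes(text):
    """Remove one pair of surrounding double quotes, if present."""
    text = text.strip()
    if len(text) >= 2 and text[0] == '"' and text[-1] == '"':
        return text[1:-1]
    return text


def find_span(text, selected_text):
    """Return (start, end) character offsets of selected_text in text, or None."""
    start = text.find(selected_text)
    if start == -1:
        return None
    return start, start + len(selected_text)


# Made-up example, not a row from the dataset.
raw_text = '"I am so sad, really"'
clean_text = strip_outer_quotes(raw_text)
span = find_span(clean_text, "so sad,")  # the span keeps the comma
```

Slicing the cleaned text with the returned offsets gives back the selected phrase with every character in the span, punctuation and spaces included.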

Columns:

  1. textID - unique ID for each piece of text
  2. text - the text of the tweet
  3. sentiment - the general sentiment of the tweet

Acknowledgement:

The dataset was downloaded from Kaggle Competitions:
https://www.kaggle.com/c/tweet-sentiment-extraction/data?select=train.csv

Objective:

  • Understand the dataset and clean it up (if required).
  • Build classification models to predict the sentiment of tweets.
  • Compare the evaluation metrics of various classification algorithms.
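
For the comparison step, the sketch below computes two common metrics in plain Python. The page does not say which metrics to use, so accuracy and macro-averaged F1 are assumptions, and the label lists are made-up toy values rather than model output.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denominator = 2 * tp + fp + fn
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)


# Toy labels for two hypothetical classifiers.
y_true = ["positive", "negative", "neutral", "positive"]
predictions = {
    "model_a": ["positive", "negative", "positive", "positive"],
    "model_b": ["positive", "negative", "neutral", "negative"],
}
report = {name: (accuracy(y_true, y_pred), macro_f1(y_true, y_pred))
          for name, y_pred in predictions.items()}
```

Macro F1 weights the three sentiment classes equally, which can matter when one class is much rarer than the others.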