48 datasets found
  1. Twitter Tweets Sentiment Dataset

    • kaggle.com
    Updated Apr 8, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    M Yasser H (2022). Twitter Tweets Sentiment Dataset [Dataset]. https://www.kaggle.com/datasets/yasserh/twitter-tweets-sentiment-dataset
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Apr 8, 2022
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    M Yasser H
    License

    https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

    Description

    https://raw.githubusercontent.com/Masterx-AI/Project_Twitter_Sentiment_Analysis_/main/twitt.jpg" alt="">

    Description:

    Twitter is an online Social Media Platform where people share their their though as tweets. It is observed that some people misuse it to tweet hateful content. Twitter is trying to tackle this problem and we shall help it by creating a strong NLP based-classifier model to distinguish the negative tweets & block such tweets. Can you build a strong classifier model to predict the same?

    Each row contains the text of a tweet and a sentiment label. In the training set you are provided with a word or phrase drawn from the tweet (selected_text) that encapsulates the provided sentiment.

    Make sure, when parsing the CSV, to remove the beginning / ending quotes from the text field, to ensure that you don't include them in your training.

    You're attempting to predict the word or phrase from the tweet that exemplifies the provided sentiment. The word or phrase should include all characters within that span (i.e. including commas, spaces, etc.)

    Columns:

    1. textID - unique ID for each piece of text
    2. text - the text of the tweet
    3. sentiment - the general sentiment of the tweet

    Acknowledgement:

    The dataset is download from Kaggle Competetions:
    https://www.kaggle.com/c/tweet-sentiment-extraction/data?select=train.csv

    Objective:

    • Understand the Dataset & cleanup (if required).
    • Build classification models to predict the twitter sentiments.
    • Compare the evaluation metrics of vaious classification algorithms.
  2. A

    ‘Sentiment Analysis Dataset’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Oct 17, 2016
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2016). ‘Sentiment Analysis Dataset’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-sentiment-analysis-dataset-caeb/f26f1fc2/?iid=004-932&v=presentation
    Explore at:
    Dataset updated
    Oct 17, 2016
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘Sentiment Analysis Dataset’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/sonaam1234/sentimentdata on 28 January 2022.

    --- Dataset description provided by original source is as follows ---

    Data for sentiment analysis

    --- Original source retains full ownership of the source dataset ---

  3. A

    ‘Financial Sentiment Analysis’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Aug 4, 2020
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2020). ‘Financial Sentiment Analysis’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-financial-sentiment-analysis-5b39/latest
    Explore at:
    Dataset updated
    Aug 4, 2020
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘Financial Sentiment Analysis’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/sbhatti/financial-sentiment-analysis on 13 February 2022.

    --- Dataset description provided by original source is as follows ---

    Data

    The following data is intended for advancing financial sentiment analysis research. It's two datasets (FiQA, Financial PhraseBank) combined into one easy-to-use CSV file. It provides financial sentences with sentiment labels.

    Citations

    Malo, Pekka, et al. "Good debt or bad debt: Detecting semantic orientations in economic texts." Journal of the Association for Information Science and Technology 65.4 (2014): 782-796.

    --- Original source retains full ownership of the source dataset ---

  4. Twitter Sentiment Analysis Dataset

    • kaggle.com
    zip
    Updated Feb 13, 2021
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Zohair Ahmed (2021). Twitter Sentiment Analysis Dataset [Dataset]. https://www.kaggle.com/zohairahmed007/twitter-sentiment-analysis-dataset
    Explore at:
    zip(38737743 bytes)Available download formats
    Dataset updated
    Feb 13, 2021
    Authors
    Zohair Ahmed
    Description

    Dataset

    This dataset was created by Zohair Ahmed

    Contents

  5. A

    ‘Product Reviews and Ratings (Sentiment Analysis)’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Feb 13, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘Product Reviews and Ratings (Sentiment Analysis)’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-product-reviews-and-ratings-sentiment-analysis-fb82/latest
    Explore at:
    Dataset updated
    Feb 13, 2022
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘Product Reviews and Ratings (Sentiment Analysis)’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/mafaisal007/product-reviews-and-ratings-sentiment-analysis on 13 February 2022.

    --- Dataset description provided by original source is as follows ---

    Context

    This dataset is from a toy store in Europe that contains customer reviews about a particular prodcut it is to be used for text mining and sentiment anlaysis.

    --- Original source retains full ownership of the source dataset ---

  6. Human Written Text

    • kaggle.com
    Updated May 13, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Youssef Elebiary (2025). Human Written Text [Dataset]. https://www.kaggle.com/datasets/youssefelebiary/human-written-text
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    May 13, 2025
    Dataset provided by
    Kaggle
    Authors
    Youssef Elebiary
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Overview

    This dataset contains 20000 pieces of text collected from Wikipedia, Gutenberg, and CNN/DailyMail. The text is cleaned by replacing symbols such as (.*?/) with a white space using automatic scripts and regex.

    Data Source Distribution

    1. 10,000 Wikipedia Articles: From the 20220301 dump.
    2. 3,000 Gutenberg Books: Via the GutenDex API.
    3. 7,000 CNN/DailyMail News Articles: From the CNN/DailyMail 3.0.0 dataset.

    Why These Sources

    The data was collected from these source to ensure the highest level of integrity against AI generated text. * Wikipedia: The 20220301 dataset was chosen to minimize the chance of including articles generated or heavily edited by AI. * Gutenberg: Books from this source are guaranteed to be written by real humans and span various genres and time periods. * CNN/DailyMail: These news articles were written by professional journalists and cover a variety of topics, ensuring diversity in writing style and subject matter.

    Dataset Structure

    The dataset consists of 5 CSV files. 1. CNN_DailyMail.csv: Contains all processed news articles. 2. Gutenberg.csv: Contains all processed books. 3. Wikipedia.csv: Contains all processed Wikipedia articles. 4. Human.csv: Combines all three datasets in order. 5. Shuffled_Human.csv: This is the randomly shuffled version of Human.csv.

    Each file has 2 columns: - Title: The title of the item. - Text: The content of the item.

    Uses

    This dataset is suitable for a wide range of NLP tasks, including: - Training models to distinguish between human-written and AI-generated text (Human/AI classifiers). - Training LSTMs or Transformers for chatbots, summarization, or topic modeling. - Sentiment analysis, genre classification, or linguistic research.

    Disclaimer

    While the data was collected from such sources, the data may not be 100% pure from AI generated text. Wikipedia articles may reflect systemic biases in contributor demographics. CNN/DailyMail articles may focus on specific news topics or regions.

    For details on how the dataset was created, click here to view the Kaggle notebook used.

    Licensing

    This dataset is published under the MIT License, allowing free use for both personal and commercial purposes. Attribution is encouraged but not required.

  7. f

    Twitter dataset

    • figshare.com
    csv
    Updated Feb 11, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Shreyas Poojary; Mohammed Riza; Rashmi Laxmikant Malghan (2025). Twitter dataset [Dataset]. http://doi.org/10.6084/m9.figshare.28390334.v2
    Explore at:
    csvAvailable download formats
    Dataset updated
    Feb 11, 2025
    Dataset provided by
    figshare
    Authors
    Shreyas Poojary; Mohammed Riza; Rashmi Laxmikant Malghan
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains tweets labeled for sentiment analysis, categorized into Positive, Negative, and Neutral sentiments. The dataset includes tweet IDs, user metadata, sentiment labels, and tweet text, making it suitable for Natural Language Processing (NLP), machine learning, and AI-based sentiment classification research. Originally sourced from Kaggle, this dataset is curated for improved usability in social media sentiment analysis.

  8. A

    ‘Sentiment Analysis of Commodity News (Gold)’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Sep 27, 2021
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2021). ‘Sentiment Analysis of Commodity News (Gold)’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-sentiment-analysis-of-commodity-news-gold-732f/e3232de2/?iid=002-045&v=presentation
    Explore at:
    Dataset updated
    Sep 27, 2021
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘Sentiment Analysis of Commodity News (Gold)’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/ankurzing/sentiment-analysis-in-commodity-market-gold on 14 February 2022.

    --- Dataset description provided by original source is as follows ---

    Context

    This is a news dataset for the commodity market where we have manually annotated 11,412 news headlines across multiple dimensions into various classes. The dataset has been sampled from a period of 20+ years (2000-2021).

    Content

    The dataset has been collected from various news sources and annotated by three human annotators who were subject experts. Each news headline was evaluated on various dimensions, for instance - if a headline is a price related news then what is the direction of price movements it is talking about; whether the news headline is talking about the past or future; whether the news item is talking about asset comparison; etc.

    Acknowledgements

    Sinha, Ankur, and Tanmay Khandait. "Impact of News on the Commodity Market: Dataset and Results." In Future of Information and Communication Conference, pp. 589-601. Springer, Cham, 2021.

    https://arxiv.org/abs/2009.04202 Sinha, Ankur, and Tanmay Khandait. "Impact of News on the Commodity Market: Dataset and Results." arXiv preprint arXiv:2009.04202 (2020)

    We would like to acknowledge the financial support provided by the India Gold Policy Centre (IGPC).

    Inspiration

    Commodity prices are known to be quite volatile. Machine learning models that understand the commodity news well, will be able to provide an additional input to the short-term and long-term price forecasting models. The dataset will also be useful in creating news-based indicators for commodities.

    Apart from researchers and practitioners working in the area of news analytics for commodities, the dataset will also be useful for researchers looking to evaluate their models on classification problems in the context of text-analytics. Some of the classes in the dataset are highly imbalanced and may pose challenges to the machine learning algorithms.

    --- Original source retains full ownership of the source dataset ---

  9. Article Dataset (Mini)

    • kaggle.com
    Updated Oct 18, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Sani Kamal (2024). Article Dataset (Mini) [Dataset]. https://www.kaggle.com/datasets/sanikamal/article-50
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Oct 18, 2024
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Sani Kamal
    License

    https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Overview

    This dataset contains 50 articles sourced from Medium, focusing on AI-related content. It is designed for business owners, content creators, and AI developers looking to analyze successful articles, improve engagement, and fine-tune AI language models (LLMs). The data can be used to explore what makes articles perform well, including sentiment analysis, follower counts, and headline effectiveness.

    Dataset Contents

    • articles_50.db - Sample database with 50 articles(Free Version)

    The database includes pre-analyzed data such as sentiment scores, follower counts, and headline metadata, helping users gain insights into high-performing content.

    Use Cases

    • Content Strategy Optimization: Identify trends in successful AI-related articles to enhance your content approach.
    • Headline Crafting: Study patterns in top-performing headlines to create more compelling article titles.
    • LLM Fine-Tuning: Utilize the dataset to fine-tune AI models with real-world data on content performance.
    • Sentiment-Driven Content: Create content that resonates with readers by aligning with sentiment insights.

    This dataset is a valuable tool for anyone aiming to harness the power of data-driven insights to enhance their content or AI models.

  10. Stress Analysis Dataset

    • kaggle.com
    Updated Mar 29, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    shubham agarwal (2025). Stress Analysis Dataset [Dataset]. https://www.kaggle.com/datasets/shubham803/stress-level-dataset/discussion
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Mar 29, 2025
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    shubham agarwal
    License

    https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The dataset contains 10,000 records and 7 columns, aimed at analyzing stress levels among employees based on their text messages and roles. Here's a brief overview:

    Employee_ID: Unique identifier for each employee.

    Message: Text message or expression shared by the employee.

    Word_Count: Number of words in the message.

    Sentiment_Score: Numerical value indicating the sentiment of the message (higher = more positive).

    Employee_Role: Job title or role of the employee.

    Department: Department where the employee works (e.g., IT, HR, Sales).

    Stress_Level: Categorical label of the employee's stress (e.g., Low, Medium, High).

    The dataset appears suitable for sentiment analysis, stress level prediction, or employee well-being research.

  11. Stock Market Dataset for Predictive Analysis

    • kaggle.com
    Updated Feb 24, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    WARNER (2025). Stock Market Dataset for Predictive Analysis [Dataset]. https://www.kaggle.com/datasets/s3programmer/stock-market-dataset-for-predictive-analysis
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Feb 24, 2025
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    WARNER
    License

    https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This Stock Market Dataset is designed for predictive analysis and machine learning applications in financial markets. It includes 13647 records of simulated stock trading data with features commonly used in stock price forecasting.

    🔹 Key Features Date – Trading day timestamps (business days only) Open, High, Low, Close – Simulated stock prices Volume – Trading volume per day RSI (Relative Strength Index) – Measures market momentum MACD (Moving Average Convergence Divergence) – Trend-following momentum indicator Sentiment Score – Simulated market sentiment from financial news & social media Target – Binary label (1: Price goes up, 0: Price goes down) for next-day prediction This dataset is useful for training hybrid deep learning models such as LSTM, CNN, and Attention-based networks for stock market forecasting. It enables financial analysts, traders, and AI researchers to experiment with market trends, technical analysis, and sentiment-based predictions.

  12. A

    ‘uHack Sentiments 2.0: Decode Code Words’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Dec 28, 2021
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2021). ‘uHack Sentiments 2.0: Decode Code Words’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-uhack-sentiments-2-0-decode-code-words-ce3a/88e2b3fd/?iid=004-193&v=presentation
    Explore at:
    Dataset updated
    Dec 28, 2021
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘uHack Sentiments 2.0: Decode Code Words’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/manishtripathi86/uhack-sentiments-20-decode-code-words on 28 January 2022.

    --- Dataset description provided by original source is as follows ---

    The challenge here is to analyze and deep dive into the natural language text (reviews) and bucket them based on their topics of discussion. Furthermore, analyzing the overall sentiment will also help the business to make tangible decisions.

    The data set provided to you has a mix of customer reviews for products across categories and retailers. We would like you to model on the data

    to bucket the future reviews in their respective topics (Note: A review can talk about multiple topics)

    Overall polarity (positive/negative sentiment)

    Train: 6136 rows x 14 columns

    Test: 2631 rows x 14 columns

    Topics (Components, Delivery and Customer Support, Design and Aesthetics, Dimensions, Features, Functionality, Installation, Material, Price, Quality and Usability) Polarity (Positive/Negative) Note: The target variables are all encoded in the train dataset for convenience. Please submit the test results in the similar encoded fashion for us to evaluate your results.

    | | Field Name Data Type Purpose Variable type Id Integer Unique identifier for each review Input Review String Review written by customers on a retail website Input Components String 1: aspects related to components Target 0: None Delivery and Customer Support String 1: some aspects related to delivery, return, exchange and customer support Target 0: None Design and Aesthetics String 1: some aspects related to components Target 0: None Dimensions String 1: related to product dimension and size Target 0: None Features String 1: related to product features Target 0 : None
    Functionality String 1: related to working of a product Target 0: None Installation String 1: related to installation of the product Target 0: None Material String 1: related to material of the product Target 0: None Price String 1: related to pricing details of a product Target 0: None Quality String 1: related to quality aspects of a product Target 0: None Usability String 1: related to usability of a product Target 0: None Polarity Integer 1: Positive sentiment; Target 0: Negative Sentiment | | | --- | --- | | | | | | | --- | --- | | | |

    Skills: Text Pre-processing – Lemmatization , Tokenization, N-Grams and other relevant methods Multi-Class Classification, Multi-label Classification Optimizing Log Loss

    Overview Ugam, a Merkle company, is a leading analytics and technology services company. Our customer-centric approach delivers impactful business results for large corporations by leveraging data, technology, and expertise.

    We consistently deliver superior, impactful results through the right blend of human intelligence and AI. With 3300+ people spread across locations worldwide, we successfully deploy our services to create success stories across industries like Retail & Consumer Brands, High Tech, BFSI, Distribution, and Market Research & Consulting. Over the past 21 years, Ugam has been recognized by several firms including Forrester and Gartner, named the No.1 data science company in India by Analytics Insight, and certified as a Great Place to Work®.

    Problem Statement: The last two decades have witnessed a significant change in how consumers purchase products and express their experience/opinions in reviews, posts, and content across platforms. These online reviews are not only useful to reflect customers’ sentiment towards a product but also help businesses fix gaps and find potential opportunities which could further influence future purchases.

    Participants need develop a machine learning model that can analyse customers’ sentiments based on their reviews and feedback.

    NOTE: The prize money will be for the interested candidates who are willing to get interviewed or hired by Ugam. Winner are requested to come to the Machine Leaning Developers Summit2022, happening at Bangalore, for receiving the prize money.

    dataset link: https://machinehack.com/hackathon/uhack_sentiments_20_decode_code_words/overview

    --- Original source retains full ownership of the source dataset ---

  13. A

    AI Training Dataset Report

    • datainsightsmarket.com
    doc, pdf, ppt
    Updated Apr 30, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Data Insights Market (2025). AI Training Dataset Report [Dataset]. https://www.datainsightsmarket.com/reports/ai-training-dataset-1501897
    Explore at:
    pdf, doc, pptAvailable download formats
    Dataset updated
    Apr 30, 2025
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policyhttps://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The AI training dataset market is experiencing robust growth, driven by the increasing adoption of artificial intelligence across diverse sectors. The market's expansion is fueled by the urgent need for high-quality data to train sophisticated AI models capable of handling complex tasks. Key application areas, such as autonomous vehicles in the automotive industry, advanced medical diagnosis in healthcare, and personalized experiences in retail and e-commerce, are significantly contributing to this market's upward trajectory. The prevalence of text, image/video, and audio data types further diversifies the market, offering opportunities for specialized dataset providers. While the market faces challenges like data privacy concerns and the high cost of data annotation, the overall trajectory remains positive, with a projected Compound Annual Growth Rate (CAGR) exceeding 20% for the forecast period (2025-2033). This growth is further supported by advancements in deep learning techniques that demand increasingly larger and more diverse datasets for optimal performance. Leading companies like Google, Amazon, and Microsoft are actively investing in this space, expanding their dataset offerings and fostering competition within the market. Furthermore, the emergence of specialized data annotation providers caters to the specific needs of various industries, ensuring accurate and reliable data for AI model development. The geographic distribution of the market reveals strong presence in North America and Europe, driven by early adoption of AI technologies and the presence of major technology players. However, Asia Pacific is projected to witness significant growth in the coming years, propelled by increasing digitalization and a burgeoning AI ecosystem in countries like China and India. Government initiatives promoting AI development in various regions are also expected to stimulate demand for high-quality training datasets. While challenges related to data security and ethical considerations remain, the long-term outlook for the AI training dataset market is exceptionally promising, fueled by the continued evolution of artificial intelligence and its increasing integration into various aspects of modern life. The market segmentation by application and data type allows for granular analysis and targeted investments for businesses operating in this rapidly expanding sector.

  14. A

    ‘⭐ McDonalds Review Sentiment’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Feb 13, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘⭐ McDonalds Review Sentiment’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-mcdonalds-review-sentiment-6d6c/9da444f4/?iid=000-968&v=presentation
    Explore at:
    Dataset updated
    Feb 13, 2022
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘⭐ McDonalds Review Sentiment’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/yamqwe/mcdonalds-review-sentimente on 13 February 2022.

    --- Dataset description provided by original source is as follows ---

    About this dataset

    A sentiment analysis of negative McDonald's reviews. Contributors were given reviews culled from low-rated McDonald's from random metro areas and asked to classify why the locations received low reviews. Options given were: * Rude Service

    • Slow Service
    • Problem with Order
    • Bad Food
    • Bad Neighborhood
    • Dirty Location
    • Cost
    • Missing Item Added: March 6, 2015 by CrowdFlower | Data Rows: 1500 Download Now

    Source: https://www.crowdflower.com/data-for-everyone/

    This dataset was created by CrowdFlower and contains around 2000 samples along with Unit State, Policies Violated, technical information and other features such as: - Review - Policies Violated Gold - and more.

    How to use this dataset

    • Analyze Policies Violated:confidence in relation to City
    • Study the influence of Last Judgment At on Trusted Judgments
    • More datasets

    Acknowledgements

    If you use this dataset in your research, please credit CrowdFlower

    Start A New Notebook!

    --- Original source retains full ownership of the source dataset ---

  15. IMDB 50K Movie Reviews (TEST your BERT)

    • kaggle.com
    Updated Dec 19, 2019
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Atul Anand {Jha} (2019). IMDB 50K Movie Reviews (TEST your BERT) [Dataset]. https://www.kaggle.com/atulanandjha/imdb-50k-movie-reviews-test-your-bert/discussion
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Dec 19, 2019
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Atul Anand {Jha}
    License

    http://www.gnu.org/licenses/lgpl-3.0.htmlhttp://www.gnu.org/licenses/lgpl-3.0.html

    Description

    Context

    Large Movie Review Dataset v1.0 . 😃

    https://static.amazon.jobs/teams/53/images/IMDb_Header_Page.jpg?1501027252" alt="IMDB wall">

    This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. Provided a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well. Raw text and already processed bag of words formats are provided.

    In the entire collection, no more than 30 reviews are allowed for any given movie because reviews for the same movie tend to have correlated ratings. Further, the train and test sets contain a disjoint set of movies, so no significant performance is obtained by memorising movie-unique terms and their associated with observed labels. In the labelled train/test sets, a negative review has a score <= 4 out of 10, and a positive review has a score >= 7 out of 10. Thus reviews with more neutral ratings are not included in the train/test sets. In the unsupervised set, reviews of any rating are included and there are an even number of reviews > 5 and <= 5.

    Reference: http://ai.stanford.edu/~amaas/data/sentiment/

    NOTE

    A starter kernel is here : https://www.kaggle.com/atulanandjha/bert-testing-on-imdb-dataset-starter-kernel

    A kernel to expose Dataset collection :

    Content

    Now let’s understand the task in hand: given a movie review, predict whether it’s positive or negative.

    The dataset we use is 50,000 IMDB reviews (25K for train and 25K for test) from the PyTorch-NLP library.

    Each review is tagged pos or neg .

    There are 50% positive reviews and 50% negative reviews both in train and test sets.

    Columns:

    text : Reviews from people.

    Sentiment : Negative or Positive tag on the review/feedback (Boolean).

    Acknowledgements

    When using this Dataset Please Cite this ACL paper using :

    @InProceedings{

    maas-EtAl:2011:ACL-HLT2011,

    author = {Maas, Andrew L. and Daly, Raymond E. and Pham, Peter T. and Huang, Dan and Ng, Andrew Y. and Potts, Christopher},

    title = {Learning Word Vectors for Sentiment Analysis},

    booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},

    month = {June},

    year = {2011},

    address = {Portland, Oregon, USA},

    publisher = {Association for Computational Linguistics},

    pages = {142--150},

    url = {http://www.aclweb.org/anthology/P11-1015}

    }

    Link to ref Dataset: https://pytorchnlp.readthedocs.io/en/latest/_modules/torchnlp/datasets/imdb.html

    https://www.samyzaf.com/ML/imdb/imdb.html

    Inspiration

    BERT and other Transformer Architecture models have always been on hype recently due to a great breakthrough by introducing Transfer Learning in NLP. So, Let's use this simple yet efficient Data-set to Test these models, and also compare our results with theirs. Also, I invite fellow researchers to try out their State of the Art Algorithms on this data-set.

  16. IMDB Large Movie Reviews Sentiment Dataset

    • kaggle.com
    zip
    Updated Nov 18, 2019
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Jan Christian Blaise Cruz (2019). IMDB Large Movie Reviews Sentiment Dataset [Dataset]. https://www.kaggle.com/jcblaise/imdb-sentiments
    Explore at:
    zip(38677807 bytes)Available download formats
    Dataset updated
    Nov 18, 2019
    Authors
    Jan Christian Blaise Cruz
    Description

    IMDB Movie Reviews Sentiment Dataset

    This dataset contains CSV versions of the Large Movie Review dataset by Maas, et al. (2011) from its original Stanford AI Repository. It contains 50k highly polar movie reviews, evenly split to 25k positives and 25k negatives. Each sample is labeled with a 0 (positive) or 1 (negative). The additional ~11k unlabeled review data has also been included in CSV format for your convenience.

    Citations

    Works using this dataset must use the appropriate citations via this bibtex entry:

    @InProceedings{maas-EtAl:2011:ACL-HLT2011,
     author  = {Maas, Andrew L. and Daly, Raymond E. and Pham, Peter T. and Huang, Dan and Ng, Andrew Y. and Potts, Christopher},
     title   = {Learning Word Vectors for Sentiment Analysis},
     booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},
     month   = {June},
     year   = {2011},
     address  = {Portland, Oregon, USA},
     publisher = {Association for Computational Linguistics},
     pages   = {142--150},
     url    = {http://www.aclweb.org/anthology/P11-1015}
    }
    
  17. AI-Generated Tech News Summaries

    • kaggle.com
    zip
    Updated Mar 20, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Parth Tyagi (2025). AI-Generated Tech News Summaries [Dataset]. https://www.kaggle.com/datasets/tyagi586/ai-generated-tech-news-summaries
    Explore at:
    zip(0 bytes)Available download formats
    Dataset updated
    Mar 20, 2025
    Authors
    Parth Tyagi
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This dataset contains 200+ summarized tech news articles covering AI, machine learning, robotics, cybersecurity, and more. Each entry includes: ✅ Headline (Original news title) ✅ Source & Publication Date ✅ News Summary (AI-generated short version) ✅ Category (AI, Cybersecurity, Startups, etc.) ✅ Sentiment Analysis (Positive, Neutral, Negative) ✅ Keywords (Key topics covered) ✅ Original Article Link

    🔹 Perfect for NLP projects, sentiment analysis, and trend analysis!

  18. A

    ‘Data for Aspect Based Sentimental Analysis (ABSA)’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Jan 28, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘Data for Aspect Based Sentimental Analysis (ABSA)’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-data-for-aspect-based-sentimental-analysis-absa-ccb8/010e645a/?iid=000-672&v=presentation
    Explore at:
    Dataset updated
    Jan 28, 2022
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘Data for Aspect Based Sentimental Analysis (ABSA)’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/kandhalkhandeka/data-for-aspect-based-sentimental-analysis on 28 January 2022.

    --- Dataset description provided by original source is as follows ---

    This data consists of reviews about an app along with a feature consisting of a word from the review. We can use aspect based sentimental analysis to check for the sentiment of the word w.r.t the text in the review!

    --- Original source retains full ownership of the source dataset ---

  19. AI vs Human Generated Contents

    • kaggle.com
    Updated Oct 6, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Asfaq Ahmed 456 (2024). AI vs Human Generated Contents [Dataset]. https://www.kaggle.com/datasets/asfaqahmed456/ai-vs-human-generated-contents
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Oct 6, 2024
    Dataset provided by
    Kaggle
    Authors
    Asfaq Ahmed 456
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    A dataset with 10 text samples. Each sample is labeled as either AI-generated (1) or human-generated (0). This dataset is suitable for text classification tasks such as detecting AI-generated content.

    This file contains text samples that are either generated by AI models or written by humans. Each entry is labeled to indicate whether the content is AI-generated or human-generated. This dataset can be used for various natural language processing tasks such as text classification, content analysis, and AI content detection. ** Column 1: text** Description: "The actual content (text data), which may be a short paragraph or sentence. This is the primary feature for analysis." Data Type: String (Text) Column 2: label Description: "Binary label indicating whether the content is AI-generated or human-generated. '0' represents human-generated, and '1' represents AI-generated." Data Type: Integer (0 or 1)

    The AI-generated content was created using advanced language models such as GPT-4, which were instructed to write text on various topics. The human-generated content was sourced from publicly available texts, including articles, blogs, and creative writing samples found on the internet. Care has been taken to ensure that all human-generated content is in the public domain or shared with permission, without any identifiable information

    This dataset is static and will not receive regular updates. However, future versions may be released if new data becomes available or if users contribute additional examples to enhance the dataset.

  20. AI medical chatbot

    • kaggle.com
    Updated Aug 15, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Yousef Saeedian (2024). AI medical chatbot [Dataset]. https://www.kaggle.com/datasets/yousefsaeedian/ai-medical-chatbot
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Aug 15, 2024
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Yousef Saeedian
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset Description:

    This dataset comprises transcriptions of conversations between doctors and patients, providing valuable insights into the dynamics of medical consultations. It includes a wide range of interactions, covering various medical conditions, patient concerns, and treatment discussions. The data is structured to capture both the questions and concerns raised by patients, as well as the medical advice, diagnoses, and explanations provided by doctors.

    Key Features:

    • Doctor and Patient Roles: Each conversation is annotated with the role of the speaker (doctor or patient), making it easy to analyze communication patterns.
    • Medical Context: The dataset includes diverse scenarios, from routine check-ups to more complex medical discussions, offering a broad spectrum of healthcare dialogues.
    • Natural Language: The conversations are presented in natural language, allowing for the development and testing of NLP models focused on healthcare communication.
    • Applications: This dataset can be used for various applications, such as building dialogue systems, analyzing communication efficacy, developing medical NLP models, and enhancing patient care through better understanding of doctor-patient interactions.

    Potential Use Cases:

    • NLP Model Training: Train models to understand and generate medical dialogues.
    • Healthcare Communication Studies: Analyze communication strategies between doctors and patients to improve healthcare delivery.
    • Medical Chatbots: Develop intelligent medical chatbots that can simulate doctor-patient conversations.
    • Patient Experience Enhancement: Identify common patient concerns and doctor responses to enhance patient care strategies.

    This dataset is a valuable resource for researchers, data scientists, and healthcare professionals interested in the intersection of technology and medicine, aiming to improve healthcare communication through data-driven approaches.

Share
FacebookFacebook
TwitterTwitter
Email
Click to copy link
Link copied
Close
Cite
M Yasser H (2022). Twitter Tweets Sentiment Dataset [Dataset]. https://www.kaggle.com/datasets/yasserh/twitter-tweets-sentiment-dataset
Organization logo

Twitter Tweets Sentiment Dataset

Twitter Tweets Sentiment Analysis for Natural Language Processing

Explore at:
37 scholarly articles cite this dataset (View in Google Scholar)
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Apr 8, 2022
Dataset provided by
Kagglehttp://kaggle.com/
Authors
M Yasser H
License

https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

Description

https://raw.githubusercontent.com/Masterx-AI/Project_Twitter_Sentiment_Analysis_/main/twitt.jpg" alt="">

Description:

Twitter is an online Social Media Platform where people share their their though as tweets. It is observed that some people misuse it to tweet hateful content. Twitter is trying to tackle this problem and we shall help it by creating a strong NLP based-classifier model to distinguish the negative tweets & block such tweets. Can you build a strong classifier model to predict the same?

Each row contains the text of a tweet and a sentiment label. In the training set you are provided with a word or phrase drawn from the tweet (selected_text) that encapsulates the provided sentiment.

Make sure, when parsing the CSV, to remove the beginning / ending quotes from the text field, to ensure that you don't include them in your training.

You're attempting to predict the word or phrase from the tweet that exemplifies the provided sentiment. The word or phrase should include all characters within that span (i.e. including commas, spaces, etc.)

Columns:

  1. textID - unique ID for each piece of text
  2. text - the text of the tweet
  3. sentiment - the general sentiment of the tweet

Acknowledgement:

The dataset is download from Kaggle Competetions:
https://www.kaggle.com/c/tweet-sentiment-extraction/data?select=train.csv

Objective:

  • Understand the Dataset & cleanup (if required).
  • Build classification models to predict the twitter sentiments.
  • Compare the evaluation metrics of vaious classification algorithms.
Search
Clear search
Close search
Google apps
Main menu