Database Contents License (DbCL) v1.0 http://opendatacommons.org/licenses/dbcl/1.0/
This dataset is a processed version of the Sentiment140 corpus, containing 1.6 million tweets with binary sentiment labels. The original data has been cleaned, tokenized, and prepared for natural language processing (NLP) and machine learning tasks. It provides a rich resource for sentiment analysis, text classification, and other NLP applications. The dataset includes the full processed corpus (train-processed.csv) and a smaller sample of 10,000 tweets (train-processed-sample.csv) for quick experimentation and model prototyping. Key Features:
- 1.6 million labeled tweets
- Binary sentiment classification (0 for negative, 1 for positive)
- Preprocessed and tokenized text
- Balanced class distribution
- Suitable for various NLP tasks and model architectures
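A minimal loading sketch for quick prototyping with the sample file; the column names are assumptions, since the header is not listed here:

import pandas as pd

df = pd.read_csv("train-processed-sample.csv")
print(df.shape)
print(df.head())

# If the binary label column is named "sentiment" (an assumption), check class balance:
# print(df["sentiment"].value_counts(normalize=True))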
Citation If you use this dataset in your research or project, please cite the original Sentiment140 dataset: Go, A., Bhayani, R. and Huang, L., 2009. Twitter sentiment classification using distant supervision. CS224N Project Report, Stanford, 1(2009), p.12.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository was created for my Master's thesis in Computational Intelligence and Internet of Things at the University of Córdoba, Spain. Its purpose is to store the datasets used in some of the studies that served as research material for the thesis, as well as the datasets used in its experimental part.
Below are the datasets specified, along with the details of their references, authors, and download sources.
----------- STS-Gold Dataset ----------------
The dataset consists of 2026 tweets. The file consists of 3 columns: id, polarity, and tweet. The three columns denote the unique id, polarity index of the text and the tweet text respectively.
Reference: Saif, H., Fernandez, M., He, Y., & Alani, H. (2013). Evaluation datasets for Twitter sentiment analysis: a survey and a new dataset, the STS-Gold.
File name: sts_gold_tweet.csv
----------- Amazon Sales Dataset ----------------
This dataset contains ratings and reviews for 1K+ Amazon products, as listed on the official Amazon website. The data was scraped from Amazon's official website in January 2023.
Owner: Karkavelraja J., Postgraduate student at Puducherry Technological University (Puducherry, Puducherry, India)
Features:
License: CC BY-NC-SA 4.0
File name: amazon.csv
----------- Rotten Tomatoes Reviews Dataset ----------------
This rating inference dataset is a sentiment classification dataset containing 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews. On average, these reviews consist of 21 words. The first 5,331 rows contain only negative samples and the last 5,331 rows contain only positive samples, so the data should be shuffled before usage.
This data is collected from https://www.cs.cornell.edu/people/pabo/movie-review-data/ as a txt file and converted into a csv file. The file consists of 2 columns: reviews and labels (1 for fresh (good) and 0 for rotten (bad)).
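Since the classes are stored in contiguous blocks, a minimal shuffling sketch in pandas (column names taken from the description above):

import pandas as pd

df = pd.read_csv("data_rt.csv")  # columns: reviews, labels
df = df.sample(frac=1, random_state=42).reset_index(drop=True)  # shuffle the rows
print(df["labels"].head())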
Reference: Bo Pang and Lillian Lee. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 115–124, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics
File name: data_rt.csv
----------- Preprocessed Dataset Sentiment Analysis ----------------
Preprocessed Amazon product review data for the Gen3EcoDot (Alexa), scraped entirely from amazon.in
Stemmed and lemmatized using nltk.
Sentiment labels are generated using TextBlob polarity scores.
The file consists of 4 columns: index, review (stemmed and lemmatized review using nltk), polarity (score) and division (categorical label generated using polarity score).
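For illustration, a minimal sketch of deriving a categorical label from a TextBlob polarity score; the cutoff values are assumptions, not necessarily the thresholds used to build the division column:

from textblob import TextBlob

def polarity_to_division(text, pos_cut=0.05, neg_cut=-0.05):
    # TextBlob polarity ranges from -1 (negative) to 1 (positive).
    score = TextBlob(text).sentiment.polarity
    if score > pos_cut:
        return "positive"
    if score < neg_cut:
        return "negative"
    return "neutral"

print(polarity_to_division("The Echo Dot sounds great and setup was easy"))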
DOI: 10.34740/kaggle/dsv/3877817
Citation: @misc{pradeesh arumadi_2022, title={Preprocessed Dataset Sentiment Analysis}, url={https://www.kaggle.com/dsv/3877817}, DOI={10.34740/KAGGLE/DSV/3877817}, publisher={Kaggle}, author={Pradeesh Arumadi}, year={2022} }
This dataset was used in the experimental phase of my research.
File name: EcoPreprocessed.csv
----------- Amazon Earphones Reviews ----------------
This dataset consists of 9,930 Amazon reviews and star ratings for the 10 latest (as of mid-2019) Bluetooth earphone devices, intended for learning how to train machine learning models for sentiment analysis.
This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.
The file consists of 5 columns: ReviewTitle, ReviewBody, ReviewStar, Product and division (manually added - categorical label generated using ReviewStar score)
License: U.S. Government Works
Source: www.amazon.in
File name (original): AllProductReviews.csv (contains 14337 reviews)
File name (edited - used for my research): AllProductReviews2.csv (contains 9930 reviews)
----------- Amazon Musical Instruments Reviews ----------------
This dataset contains 7137 comments/reviews of different musical instruments coming from Amazon.
This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.
The file consists of 10 columns: reviewerID, asin (ID of the product), reviewerName, helpful (helpfulness rating of the review), reviewText, overall (rating of the product), summary (summary of the review), unixReviewTime (time of the review - unix time), reviewTime (time of the review, raw) and division (manually added - categorical label generated using overall score).
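A small pandas sketch for working with these columns: converting unixReviewTime to a datetime and deriving a division label from the overall rating. The rating thresholds are assumptions, not the exact rule used for the added column.

import pandas as pd

df = pd.read_csv("Musical_instruments_reviews.csv")

# unixReviewTime is in seconds since the epoch.
df["reviewDate"] = pd.to_datetime(df["unixReviewTime"], unit="s")

# Hypothetical mapping from the 1-5 star overall rating to a categorical division.
df["division"] = pd.cut(df["overall"], bins=[0, 2, 3, 5],
                        labels=["negative", "neutral", "positive"])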
Source: http://jmcauley.ucsd.edu/data/amazon/
File name (original): Musical_instruments_reviews.csv (contains 10261 reviews)
File name (edited - used for my research): Musical_instruments_reviews2.csv (contains 7137 reviews)
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Sentiment Analysis of Commodity News (Gold)’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/ankurzing/sentiment-analysis-in-commodity-market-gold on 14 February 2022.
--- Dataset description provided by original source is as follows ---
This is a news dataset for the commodity market where we have manually annotated 11,412 news headlines across multiple dimensions into various classes. The dataset has been sampled from a period of 20+ years (2000-2021).
The dataset has been collected from various news sources and annotated by three human annotators who were subject experts. Each news headline was evaluated on various dimensions, for instance: if a headline is price-related news, which direction of price movement it refers to; whether the news headline refers to the past or the future; whether the news item discusses asset comparison; etc.
Sinha, Ankur, and Tanmay Khandait. "Impact of News on the Commodity Market: Dataset and Results." In Future of Information and Communication Conference, pp. 589-601. Springer, Cham, 2021.
https://arxiv.org/abs/2009.04202 Sinha, Ankur, and Tanmay Khandait. "Impact of News on the Commodity Market: Dataset and Results." arXiv preprint arXiv:2009.04202 (2020)
We would like to acknowledge the financial support provided by the India Gold Policy Centre (IGPC).
Commodity prices are known to be quite volatile. Machine learning models that understand commodity news well will be able to provide an additional input to short-term and long-term price forecasting models. The dataset will also be useful in creating news-based indicators for commodities.
Apart from researchers and practitioners working in the area of news analytics for commodities, the dataset will also be useful for researchers looking to evaluate their models on classification problems in the context of text-analytics. Some of the classes in the dataset are highly imbalanced and may pose challenges to the machine learning algorithms.
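Because some classes are highly imbalanced, class weighting is one simple mitigation; a hedged sketch with scikit-learn (the label values are toy placeholders, not the dataset's actual class names):

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy stand-in for the annotated headline classes.
labels = np.array(["price-up", "price-up", "price-up", "price-down", "neutral"])

classes = np.unique(labels)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=labels)
print(dict(zip(classes, weights)))  # rarer classes receive larger weights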
--- Original source retains full ownership of the source dataset ---
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains tweets labeled for sentiment analysis, categorized into Positive, Negative, and Neutral sentiments. The dataset includes tweet IDs, user metadata, sentiment labels, and tweet text, making it suitable for Natural Language Processing (NLP), machine learning, and AI-based sentiment classification research. Originally sourced from Kaggle, this dataset is curated for improved usability in social media sentiment analysis.
A Simple but Rich Dataset for Sentiment Analysis of Chat Messages
This dataset contains a collection of chat messages that can be used to develop a sentiment analysis machine learning model to classify messages into 3 sentiment classes - positive, negative, and neutral. The messages are diverse in nature, containing not only simple text but also special characters, numbers, emoji/emoticons, and URL addresses. The dataset can be used for various natural language processing tasks related to chat analysis.
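A minimal preprocessing sketch for messages containing URLs, numbers, and emoji; the placeholder tokens and patterns are illustrative assumptions, not the dataset authors' pipeline:

import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")

def clean_message(text: str) -> str:
    text = URL_RE.sub(" <url> ", text)        # replace URLs with a placeholder
    text = re.sub(r"\d+", " <num> ", text)    # replace standalone numbers
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace

print(clean_message("Check https://example.com now!! 😊 order #1234"))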
Dataset with sentiment of Russian text
Contains an aggregated dataset of Russian texts from 6 datasets.
Labels meaning
0: NEUTRAL
1: POSITIVE
2: NEGATIVE
Datasets
Sentiment Analysis in Russian
Sentiments (positive, negative, or neutral) of news in the Russian language, from a Kaggle competition.
Russian Language Toxic Comments
Small dataset with labeled comments from 2ch.hk and pikabu.ru.
Dataset of car reviews for machine learning (sentiment analysis)
Glazkova A.… See the full description on the dataset page: https://huggingface.co/datasets/MonoHime/ru_sentiment_dataset.
Analyzing sentiments related to various products such as tablets, mobiles, and other gizmos can be fun and difficult, especially when the reviews are collected across various demographics around the world. With this dataset, develop a machine learning model to accurately classify product reviews into 4 different sentiment classes based on the raw text review provided by the user. Analyzing these sentiments will not only help serve the customers better but can also reveal a lot of customer traits present or hidden in the reviews.
Sentiment analysis requires a lot to be taken into account, mainly due to the preprocessing involved in representing raw text and making it machine-understandable. Usually, we stem and lemmatize the raw text and then represent it using TF-IDF, Word Embeddings, etc. However, with state-of-the-art NLP models such as Transformer-based BERT models, one can skip manual feature engineering like TF-IDF and Count Vectorizers.
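For comparison with the manual route described above, a minimal TF-IDF baseline sketch; the texts, labels, and class encoding below are toy assumptions:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-ins for review texts and their 4 sentiment classes
# (0 - Cannot Say, 1 - Negative, 2 - Positive, 3 - No Sentiment).
texts = ["great tablet, love it", "battery died in a week",
         "no strong feelings either way", "arrived on monday"]
labels = [2, 1, 3, 0]

baseline = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                         LogisticRegression(max_iter=1000))
baseline.fit(texts, labels)
print(baseline.predict(["decent phone for the price"]))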
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Deep-NLP’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/samdeeplearning/deepnlp on 28 January 2022.
--- Dataset description provided by original source is as follows ---
Sheet_1.csv contains 80 user responses, in the response_text column, to a therapy chatbot. Bot said: 'Describe a time when you have acted as a resource for someone else'. User responded. If a response is 'not flagged', the user can continue talking to the bot. If it is 'flagged', the user is referred to help.
Sheet_2.csv contains 125 resumes, in the resume_text column. Resumes were queried from Indeed.com with keyword 'data scientist', location 'Vermont'. If a resume is 'not flagged', the applicant can submit a modified resume version at a later date. If it is 'flagged', the applicant is invited to interview.
Classify new resumes/responses as flagged or not flagged.
There are two sets of data here - resumes and responses. Split the data into a train set and a test set to test the accuracy of your classifier. Bonus points for using the same classifier for both problems.
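A minimal split sketch, assuming a label column named "class" (the actual header may differ):

import pandas as pd
from sklearn.model_selection import train_test_split

responses = pd.read_csv("Sheet_1.csv")  # response_text column
resumes = pd.read_csv("Sheet_2.csv")    # resume_text column

# "class" is an assumed name for the flagged/not-flagged column.
resp_train, resp_test = train_test_split(responses, test_size=0.2, random_state=0,
                                         stratify=responses["class"])
res_train, res_test = train_test_split(resumes, test_size=0.2, random_state=0,
                                       stratify=resumes["class"])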
Good luck.
Thank you to Parsa Ghaffari (Aylien), without whom these visuals (the cover photo is from Parsa Ghaffari's excellent LinkedIn article on English, Spanish, and German positive v. negative sentiment analysis) would not exist.
You can use any of the code in that kernel anywhere, on or off Kaggle. Ping me at @_samputnam for questions.
--- Original source retains full ownership of the source dataset ---
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
If you use the dataset, cite the paper: https://doi.org/10.1016/j.eswa.2022.117541
The most comprehensive dataset to date regarding climate change and human opinions via Twitter. It has the heftiest temporal coverage, spanning over 13 years, includes over 15 million tweets spatially distributed across the world, and provides the geolocation of most tweets. Seven dimensions of information are tied to each tweet, namely geolocation, user gender, climate change stance and sentiment, aggressiveness, deviations from historic temperature, and topic modeling, while accompanied by environmental disaster events information. These dimensions were produced by testing and evaluating a plethora of state-of-the-art machine learning algorithms and methods, both supervised and unsupervised, including BERT, RNN, LSTM, CNN, SVM, Naive Bayes, VADER, Textblob, Flair, and LDA.
The following columns are in the dataset:
➡ created_at: The timestamp of the tweet.
➡ id: The unique id of the tweet.
➡ lng: The longitude the tweet was written.
➡ lat: The latitude the tweet was written.
➡ topic: Categorization of the tweet in one of ten topics namely, seriousness of gas emissions, importance of human intervention, global stance, significance of pollution awareness events, weather extremes, impact of resource overconsumption, Donald Trump versus science, ideological positions on global warming, politics, and undefined.
➡ sentiment: A score on a continuous scale. This scale ranges from -1 to 1 with values closer to 1 being translated to positive sentiment, values closer to -1 representing a negative sentiment while values close to 0 depicting no sentiment or being neutral.
➡ stance: That is if the tweet supports the belief of man-made climate change (believer), if the tweet does not believe in man-made climate change (denier), and if the tweet neither supports nor refuses the belief of man-made climate change (neutral).
➡ gender: Whether the user that made the tweet is male, female, or undefined.
➡ temperature_avg: The temperature deviation in Celsius and relative to the January 1951-December 1980 average at the time and place the tweet was written.
➡ aggressiveness: That is if the tweet contains aggressive language or not.
Since Twitter forbids making the text of the tweets public, you need to go through a process called hydrating to retrieve it. Tools such as Twarc or Hydrator can be used to hydrate tweets.
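A hydration sketch, assuming the twarc v2 Python client and a valid API bearer token (API access terms may have changed since the dataset was published):

from twarc import Twarc2

client = Twarc2(bearer_token="YOUR_BEARER_TOKEN")  # placeholder credential

tweet_ids = ["1234567890123456789"]  # values from the dataset's id column
for page in client.tweet_lookup(tweet_ids):
    for tweet in page.get("data", []):
        print(tweet["id"], tweet["text"])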
The inspiration behind creating the OYO Review Dataset for sentiment analysis was to explore the sentiment and opinions expressed in hotel reviews on the OYO Hotels platform. Analyzing the sentiment of customer reviews can provide valuable insights into the overall satisfaction of guests, identify areas for improvement, and assist in making data-driven decisions to enhance the hotel experience. By collecting and curating this dataset, Deep Patel, Nikki Patel, and Nimil aimed to contribute to the field of sentiment analysis in the context of the hospitality industry. Sentiment analysis allows us to classify the sentiment expressed in textual data, such as reviews, into positive, negative, or neutral categories. This analysis can help hotel management and stakeholders understand customer sentiments, identify common patterns, and address concerns or issues that may affect the reputation and customer satisfaction of OYO Hotels. The dataset provides a valuable resource for training and evaluating sentiment analysis models specifically tailored to the hospitality domain. Researchers, data scientists, and practitioners can utilize this dataset to develop and test various machine learning and natural language processing techniques for sentiment analysis, such as classification algorithms, sentiment lexicons, or deep learning models. Overall, the goal of creating the OYO Review Dataset for sentiment analysis was to facilitate research and analysis in the area of customer sentiments and opinions in the hotel industry. By understanding the sentiment of hotel reviews, businesses can strive to improve their services, enhance customer satisfaction, and make data-driven decisions to elevate the overall guest experience.
Deep Patel: https://www.linkedin.com/in/deep-patel-55ab48199/
Nikki Patel: https://www.linkedin.com/in/nikipatel9/
Nimil Lathiya: https://www.linkedin.com/in/nimil-lathiya-059a281b1/
Overview
Analyzing sentiments related to various products such as tablets, mobiles, and other gizmos can be fun and difficult, especially when the reviews are collected across various demographics around the world. In this weekend hackathon, we challenge the machinehackers community to develop a machine learning model to accurately classify product reviews into 4 different sentiment classes based on the raw text review provided by the user. Analyzing these sentiments will not only help us serve the customers better but can also reveal a lot of customer traits present or hidden in the reviews.
Sentiment analysis requires a lot to be taken into account, mainly due to the preprocessing involved in representing raw text and making it machine-understandable. Usually, we stem and lemmatize the raw text and then represent it using TF-IDF, Word Embeddings, etc. However, with state-of-the-art NLP models such as Transformer-based BERT models, one can skip manual feature engineering like TF-IDF and Count Vectorizers.
In this short span of time, we would encourage you to leverage the ImageNet moment (Transfer Learning) in NLP using various pre-trained models.
Dataset Description:
Train.csv - 6364 rows x 4 columns (includes the Sentiment column as target)
Test.csv - 2728 rows x 3 columns
Sample Submission.csv - Please check the Evaluation section for more details on how to generate a valid submission
Attribute Description:
Text_ID - Unique Identifier
Product_Description - Description of the product review by a user
Product_Type - Different types of product (9 unique products)
Class - Represents various sentiments: 0 - Cannot Say, 1 - Negative, 2 - Positive, 3 - No Sentiment
Skills:
NLP, Sentiment Analysis
Feature extraction from raw text using TF-IDF, CountVectorizer
Using Word Embedding to represent words as vectors
Using Pretrained models like Transformers, BERT
Optimizing multi-class log loss to generalize well on unseen data
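A short sketch of the multi-class log loss metric mentioned above, using scikit-learn:

import numpy as np
from sklearn.metrics import log_loss

# Toy example: true classes (0-3) and predicted probabilities for three samples.
y_true = [2, 1, 3]
y_prob = np.array([
    [0.05, 0.05, 0.80, 0.10],
    [0.10, 0.70, 0.10, 0.10],
    [0.25, 0.25, 0.25, 0.25],
])
print(log_loss(y_true, y_prob, labels=[0, 1, 2, 3]))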
Attribution 3.0 (CC BY 3.0) https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
This is the readme for the supplemental data for our ICDAR 2019 paper.
You can read our paper via IEEE here: https://ieeexplore.ieee.org/document/8978202
If you found this dataset useful, please consider citing our paper:
@inproceedings{DBLP:conf/icdar/MorrisTE19,
author = {David Morris and
Peichen Tang and
Ralph Ewerth},
title = {A Neural Approach for Text Extraction from Scholarly Figures},
booktitle = {2019 International Conference on Document Analysis and Recognition,
{ICDAR} 2019, Sydney, Australia, September 20-25, 2019},
pages = {1438--1443},
publisher = {{IEEE}},
year = {2019},
url = {https://doi.org/10.1109/ICDAR.2019.00231},
doi = {10.1109/ICDAR.2019.00231},
timestamp = {Tue, 04 Feb 2020 13:28:39 +0100},
biburl = {https://dblp.org/rec/conf/icdar/MorrisTE19.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
This work was financially supported by the German Federal Ministry of Education and Research (BMBF) and European Social Fund (ESF) (InclusiveOCW project, no. 01PE17004).
We used different sources of data for testing, validation, and training. Our testing set was assembled from the work by Böschen et al. that we cited. We excluded the DeGruyter dataset and used it as our validation dataset.
These datasets contain a readme with license information. Further information about the associated project can be found in the authors' published work we cited: https://doi.org/10.1007/978-3-319-51811-4_2
The DeGruyter dataset does not include the labeled images due to license restrictions. As of writing, the images can still be downloaded from DeGruyter via the links in the readme. Note that depending on what program you use to strip the images out of the PDF they are provided in, you may have to re-number the images.
We used label_generator's generated dataset, which the author made available on a requester-pays amazon s3 bucket. We also used the Multi-Type Web Images dataset, which is mirrored here.
We have made our code available in code.zip. We will upload code, announce further news, and field questions via the github repo.
Our text detection network is adapted from Argman's EAST implementation. The EAST/checkpoints/ours subdirectory contains the trained weights we used in the paper.
We used a tesseract script to run text extraction from detected text rows. This is inside our code code.tar as text_recognition_multipro.py.
We used a Java program provided by Falk Böschen and adapted it to our file structure. We included this as evaluator.jar.
Parameter sweeps are automated by param_sweep.rb. This file also shows how to invoke all of these components.
This dataset contains text samples labeled with sentiment categories, including positive, negative, and neutral sentiments. It is designed for sentiment analysis and can be used to train and evaluate machine learning models aimed at understanding the emotional tone of text data.
This dataset can be used for training sentiment analysis models, evaluating model performance, and conducting research on natural language processing (NLP) and sentiment classification. It is suitable for machine learning projects focusing on sentiment detection, opinion mining, and text classification.
The dataset is provided in CSV format with the following columns:
- tweet: the text of the post, e.g. "I'm so proud of my team for finishing the project ahead of schedule!"
- label: Positive, Negative, or Neutral
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This study significantly contributes to the sphere of educational technology by deploying state-of-the-art machine learning and deep learning strategies for meaningful changes in education. The hybrid stacking approach did an excellent implementation using Decision Trees, Random Forest, and XGBoost as base learners with Gradient Boosting as a meta-learner, which managed to record an accuracy of 90%. That indeed puts into great perspective the huge potential it possesses for accuracy measures while predicting in educational setups. The CNN model, which predicted with an accuracy of 89%, showed quite impressive capability in sentiment analysis to acquire further insight into the emotional status of the students. RCNN, Random Forests, and Decision Trees contribute to the possibility of educational data complexity with valuable insight into the complex interrelationships within ML models and educational contexts. The application of the bagging XGBoost algorithm, which attained a high accuracy of 88%, further stamps its utility toward enhancement of academic performance through strong robust techniques of model aggregation.

The dataset that was used in this study was sourced from Kaggle, with 1205 entries of 14 attributes concerning adaptability, sentiment, and academic performance; the reliability and richness of the analytical basis are high. The dataset allows rigorous modeling and validation to be done to ensure the findings are considered robust.

This study has several implications for education and develops on the key dimensions: teacher effectiveness, educational leadership, and well-being of the students. From the obtained information about student adaptability and sentiment, the developed system helps educators to make modifications in instructional strategy more efficiently for a particular student to enhance effectiveness in teaching. All these aspects could provide critical insights for the educational leadership to devise data-driven strategies that would enhance the overall school-wide academic performance, as well as create a caring learning atmosphere. The integration of sentiment analysis within the structure of education brings an inclusive, responsive attitude toward ensuring students' well-being and, thus, a caring educational environment.

The study is closely aligned with sustainable ICT in education objectives and offers a transformative approach to integrating AI-driven insights with practice in this field. By integrating notorious ML and DL methodologies with educational challenges, the research puts the basis for future innovations and technology in this area. Ultimately, it contributes to sustainable improvement in the educational system.
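A minimal sketch of the stacking arrangement described above (Decision Tree, Random Forest, and XGBoost base learners with a Gradient Boosting meta-learner); hyperparameters and data handling are placeholders, not the study's exact configuration:

from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier  # requires the xgboost package

stack = StackingClassifier(
    estimators=[
        ("dt", DecisionTreeClassifier(max_depth=5)),
        ("rf", RandomForestClassifier(n_estimators=200)),
        ("xgb", XGBClassifier(n_estimators=200, eval_metric="logloss")),
    ],
    final_estimator=GradientBoostingClassifier(),
    cv=5,
)
# stack.fit(X_train, y_train)      # X_train/y_train: features and labels from the Kaggle dataset
# print(stack.score(X_test, y_test))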
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Data Source: https://www.kaggle.com/datasets/gufukuro/movie-scripts-corpus
Data Description: Movie Scripts Corpus
This corpus was collected for screenplay analysis with machine learning methods. The corpus includes movie scripts crawled from different sources, their annotations by script structural elements, and movie metadata.
Screenplay data consists of:
- Movie script TXT documents with raw full text (2858 docs)
- Movie script TXT documents with full-text lemmas (2858 docs)
- Manual annotation TXT documents for some movie scripts (33 docs, more than 6000 annotated rows)
- Movie script annotation TXT documents obtained by BERT
- Movie script annotation JSON documents obtained by the rule-based annotator ScreenPy
Movie metadata consists of:
- Cut versions of movie reviews and scores from Metacritic (number of reviews: 21025; number of movies with reviews: 2038)
- Metadata for movies, including: title, akas, launch year, score from Metacritic, IMDb user rating and number of votes from imdb.com, movie awards, opening weekend, producers, budget, script department, production companies, writers, directors, cast info, countries involved in production, age restriction, plot (with outline), keywords, genres, taglines, critics' synopsis
- Screenplay awards information: Academy Awards adapted screenplay, Academy Awards original screenplay, BAFTA, Golden Globe Award for Best Screenplay, Writers Guild Awards Winners & Nominees 2020-2013; nomination information for 462 movies in total
Movie character data consists of:
- Script text fragments with dialogs and scene descriptions for characters, gathered with annotators: 2153 movies and text fragments for 32114 characters in total
- Gender labels for 4792 characters
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description: This resource consists of calibration sets (N=100) for 4 publicly available datasets, created using the LLM-Assisted Content Analysis (LACA) method. Each dataset contains the following columns:
- text_id: Unique ID for each text document
- code_id: Unique ID for each code category
- text: Document text that has been coded
- original_code: Coded response from the original datasets
- replicated_code: Coded response from an independent coding exercise by our study team
- model_code: Coded response generated by the LLM (GPT-3.5-turbo)
- reason: LLM-generated reason for the coding decision
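A minimal agreement check between the LLM codes and the original codes, assuming the column names above (the file name is hypothetical):

import pandas as pd
from sklearn.metrics import cohen_kappa_score

df = pd.read_csv("calibration_set.csv")  # hypothetical file name
kappa = cohen_kappa_score(df["original_code"], df["model_code"])
print(f"LLM vs. original coding agreement (Cohen's kappa): {kappa:.2f}")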
Additional details on methods and definitions of individual code categories are available in the following paper:
Chew, R., Bollenbacher, J., Speer, J., Wenger, M., & Kim, A. (2023). LLM-Assisted Content Analysis: Using Large Language Models to Support Deductive Coding.
Trump Tweets
Citation: Coe, Kevin, Berger, Julia, Blumling, Allison, Brooks, Katelyn, Giorgi, Elizabeth, Jackson, Jennifer, … Wellman, Mariah. Quantitative Content Analysis of Donald Trump’s Twitter, 2017-2019. Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributor], 2020-04-01. https://doi.org/10.3886/E118603V1 Source: https://www.openicpsr.org/openicpsr/project/118603/version/V1/view
BBC News
Citation: Greene, D., & Cunningham, P. (2006). Practical solutions to the problem of diagonal dominance in kernel document clustering. In Proceedings of the 23rd international conference on Machine learning (pp. 377-384). Source: https://www.kaggle.com/datasets/shivamkushwaha/bbc-full-text-document-classification
Contrarian Claims
Citation: Coan, T. G., Boussalis, C., Cook, J., & Nanko, M. O. (2021). Computer-assisted classification of contrarian claims about climate change. Scientific reports, 11(1), 22320. Source: https://socialanalytics.ex.ac.uk/cards/data.zip
Ukraine Water Problems
Citation: Afanasyev S, N. B, Bodnarchuk T, S. V, M. V, T. V, Yu V, K. G, V. D, Konovalenko O, O. K, E. K, Lietytska O, O. L, V. M, Marushevska O, Mokin V, K. M, Osadcha N, O. I (2013) River Basin Management Plan for Pivdenny Bug: river basin analysis and measures Source: https://www.kaggle.com/datasets/vbmokin/nlp-reports-news-classification
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
This Stock Market Dataset is designed for predictive analysis and machine learning applications in financial markets. It includes 13647 records of simulated stock trading data with features commonly used in stock price forecasting.
🔹 Key Features
- Date – Trading day timestamps (business days only)
- Open, High, Low, Close – Simulated stock prices
- Volume – Trading volume per day
- RSI (Relative Strength Index) – Measures market momentum
- MACD (Moving Average Convergence Divergence) – Trend-following momentum indicator
- Sentiment Score – Simulated market sentiment from financial news & social media
- Target – Binary label (1: Price goes up, 0: Price goes down) for next-day prediction
This dataset is useful for training hybrid deep learning models such as LSTM, CNN, and Attention-based networks for stock market forecasting. It enables financial analysts, traders, and AI researchers to experiment with market trends, technical analysis, and sentiment-based predictions.
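For reference, one common way to compute a 14-period RSI in pandas; this is an illustrative convention, not necessarily how the dataset's RSI column was generated:

import pandas as pd

def rsi(close: pd.Series, period: int = 14) -> pd.Series:
    delta = close.diff()
    gains = delta.clip(lower=0).rolling(period).mean()
    losses = (-delta.clip(upper=0)).rolling(period).mean()
    rs = gains / losses
    return 100 - 100 / (1 + rs)

# df = pd.read_csv("stock_dataset.csv")  # hypothetical file name with a Close column
# df["RSI_check"] = rsi(df["Close"])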
Airline Passenger Reviews Dataset
This dataset contains real-world airline passenger reviews gathered from various flights across different countries and airlines. Each row represents an individual passenger's experience and feedback on a specific flight.
Dataset Overview
The dataset includes reviews on several aspects of the airline experience, such as seat comfort, food, cabin crew service, value for money, and more. It can be used for sentiment analysis, NLP-based text classification, airline performance evaluation, and other machine learning or data visualization tasks.