https://creativecommons.org/publicdomain/zero/1.0/
Twitter is an online social media platform where people share their thoughts as tweets. It has been observed that some people misuse it to tweet hateful content. Twitter is trying to tackle this problem, and we can help by creating a strong NLP-based classifier model to identify and block negative tweets. Can you build a strong classifier model to predict the same?
Each row contains the text of a tweet and a sentiment label. In the training set you are provided with a word or phrase drawn from the tweet (selected_text) that encapsulates the provided sentiment.
Make sure, when parsing the CSV, to remove the beginning / ending quotes from the text field, to ensure that you don't include them in your training.
You're attempting to predict the word or phrase from the tweet that exemplifies the provided sentiment. The word or phrase should include all characters within that span (i.e. including commas, spaces, etc.)
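Following the parsing note above, a minimal sketch of loading the CSV and stripping stray quotes from the text fields (the sample data and the `load_tweets` helper are illustrative, not part of the competition kit):

```python
import csv
import io

def load_tweets(csv_text):
    """Parse the competition CSV and strip any leftover leading/trailing
    quotes from the text fields, as the note above advises."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        row["text"] = row["text"].strip().strip('"')
        row["selected_text"] = row["selected_text"].strip().strip('"')
        rows.append(row)
    return rows

# Tiny illustrative sample; the real train.csv has the same columns:
# textID, text, selected_text, sentiment.
sample = (
    "textID,text,selected_text,sentiment\n"
    'ab12,"""my dog is amazing""",amazing,positive\n'
)
tweets = load_tweets(sample)
print(tweets[0]["text"])  # -> my dog is amazing
```

The `csv` module already unescapes doubled quotes inside quoted fields; the extra `strip('"')` removes the literal quotes that sometimes remain embedded in the tweet text itself.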
The dataset was downloaded from Kaggle Competitions:
https://www.kaggle.com/c/tweet-sentiment-extraction/data?select=train.csv
https://cubig.ai/store/terms-of-service
1) Data Introduction
• The Sentiment Analysis Dataset is a dataset for sentiment analysis, comprising large-scale tweet text collected from Twitter with an emotional-polarity label (0=negative, 2=neutral, 4=positive) for each tweet; labels were assigned automatically based on emoticons.
2) Data Utilization
(1) The Sentiment Analysis Dataset has the following characteristics:
• Each sample consists of six columns: emotional polarity, tweet ID, date of writing, search word, author, and tweet body, making it suitable for training natural language processing and classification models on tweet text and emotion labels.
(2) The Sentiment Analysis Dataset can be used for:
• Sentiment classification model development: using tweet text and polarity labels, automatic positive/negative/neutral classifiers can be built with various machine learning and deep learning models such as logistic regression, SVM, RNN, and LSTM.
• Analysis of social media opinion and trends: by analyzing the distribution of sentiment over time and by keyword, one can explore changes in public opinion on specific issues or brands, positive and negative trends, and key sentiment keywords.
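The emoticon-based automatic labeling mentioned above can be sketched as follows; the emoticon lists are illustrative only, and the polarity codes follow the dataset's 0/2/4 convention:

```python
# Illustrative emoticon sets; the dataset's actual labeling rules
# are not published here, so treat this as an assumption.
POSITIVE = {":)", ":-)", ":D", "=)"}
NEGATIVE = {":(", ":-(", ":'("}

def auto_label(tweet):
    """Assign a polarity code (0=negative, 2=neutral, 4=positive)
    based on which emoticons appear in the tweet."""
    tokens = tweet.split()
    if any(t in POSITIVE for t in tokens):
        return 4
    if any(t in NEGATIVE for t in tokens):
        return 0
    return 2

print(auto_label("loving the new phone :)"))     # 4
print(auto_label("my flight got cancelled :("))  # 0
```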
http://www.gnu.org/licenses/agpl-3.0.html
The Famous Words Twitter Dataset is a comprehensive collection of tweets associated with famous words. The dataset provides valuable insights into the social media engagement and popularity of these words on the Twitter platform. It includes three primary columns: keyword, likes, and tweets.
The keyword column represents the specific famous word or phrase associated with each tweet. It allows researchers and analysts to explore the dynamics of user interactions and discussions surrounding these popular terms on Twitter.
The likes column indicates the number of likes received by each tweet. This metric serves as an indicator of the tweet's popularity and resonance among Twitter users.
The tweet column contains the actual tweet text, capturing the content and context of user-generated messages related to the famous words. This column provides valuable qualitative data for sentiment analysis, topic modeling, and other natural language processing tasks.
Researchers, data scientists, and social media analysts can leverage this dataset to study various aspects, such as tracking trends, sentiment analysis, understanding user engagement patterns, and identifying influential topics associated with famous words on Twitter.
Topics:
"COVID-19",
"Vaccine",
"Zoom",
"Bitcoin",
"Dogecoin",
"NFT",
"Elon Musk",
"Tesla",
"Amazon",
"iPhone 12",
"Remote work",
"TikTok",
"Instagram",
"Facebook",
"YouTube",
"Netflix",
"GameStop",
"Super Bowl",
"Olympics",
"Black Lives Matter",
"India vs England",
"Ukraine",
"Queen Elizabeth",
"World Cup",
"Jeffrey Dahmer",
"Johnny Depp",
"Will Smith",
"Weather",
"xvideo",
"porn",
"nba",
"Macdonald"
The dataset contains 128,837 tweets in total; the plot below shows the number of tweets for each keyword.
Plot: number of tweets per keyword (https://i.imgur.com/z4xbbyt.png)
Note: The dataset is carefully curated, anonymized, and stripped of any personally identifiable information to protect user privacy.
"My ridiculous dog is amazing." [sentiment: positive]
With all the tweets circulating every second, it is hard to tell whether the sentiment behind a specific tweet will boost a company's, or a person's, brand by going viral (positive), or devastate profit because it strikes a negative tone. Capturing sentiment in language is important in these times, when decisions and reactions are created and updated in seconds. But which words actually lead to the sentiment description? In this competition you will need to pick out the part of the tweet (word or phrase) that reflects the sentiment.
Help build your skills in this important area with this broad dataset of tweets. Work on your technique to grab a top spot in this competition. What words in tweets support a positive, negative, or neutral sentiment? How can you help make that determination using machine learning tools?
In this competition we've extracted support phrases from Figure Eight's Data for Everyone platform. The dataset, titled "Sentiment Analysis: Emotion in Text", consists of tweets with existing sentiment labels, used here under the Creative Commons Attribution 4.0 International license. Your objective in this competition is to construct a model that can do the same: look at the labeled sentiment for a given tweet and figure out what word or phrase best supports it.
Disclaimer: The dataset for this competition contains text that may be considered profane, vulgar, or offensive.
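Extraction tasks of this kind are commonly scored with a word-level Jaccard similarity between the predicted and ground-truth support phrases; a minimal sketch (the metric choice and whitespace tokenization are assumptions, not stated above):

```python
def word_jaccard(pred, truth):
    """Word-level Jaccard similarity between a predicted and a
    ground-truth support phrase (1.0 = identical word sets)."""
    a, b = set(pred.lower().split()), set(truth.lower().split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Using the example tweet above: predicting only part of the phrase
# earns partial credit.
print(word_jaccard("is amazing", "ridiculous dog is amazing"))  # 0.5
```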
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset provides a collection of tweets, each categorised by its sentiment. It is designed to assist in developing and evaluating machine learning models, particularly for natural language processing tasks. The primary aim is to distinguish between different sentiments expressed in tweets, helping to address issues like harmful content by enabling the creation of robust classifier models. Each entry includes the tweet text and its corresponding sentiment label, with a specific focus on identifying the exact word or phrase within the tweet that encapsulates that sentiment.
The dataset contains approximately 27,500 tweets. It is typically provided in a CSV file format. The textID and text columns each contain 27,481 unique values, while the selected_text column has 22,464 unique values. The sentiment distribution is as follows: 40% are neutral, 31% are positive, and 28% fall into other sentiment categories. When processing the data from the CSV, it is important to remove any beginning or ending quotation marks from the text fields.
This dataset is ideally suited for tasks involving sentiment analysis and text classification. It can be used to build and train classification models that predict the sentiment of Twitter tweets. Furthermore, it allows for the comparison and evaluation of various classification algorithms based on their performance metrics in predicting sentiments. It is particularly useful for developing strong NLP-based classifier models to identify and categorise tweets by sentiment.
The data originates from a global platform, Twitter, and the sentiment analysis is applicable across a wide range of content. The dataset's structure allows for analysis of sentiments in tweets, covering various topics and expressions globally. No specific time range or demographic scope is detailed beyond its global applicability.
CC0
This dataset is suitable for a diverse range of users, including beginners in data science and machine learning. It is especially beneficial for those interested in social network analysis, text classification, and natural language processing. Intended users include data scientists, researchers, and developers looking to build and test models for predicting social media sentiments or for applications like content moderation.
Original Data Source: Twitter Tweets Sentiment Dataset
This database contains training, validation, and test sets created for a Twitter sentiment classification project. The tweets were cleaned and improved with custom-calculated sentiment scores and magnitudes using a word-weighted dictionary. The data is split to support machine learning experiments.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset provides a curated list of Portuguese words along with their corresponding sentiment labels. It enables comparative sentiment analysis for content sourced from both Twitter and Buscapé reviews. Each word has a human-annotated sentiment score, ranging from negative to positive with numeric values, allowing for nuanced categorisation and comparison. It serves as an invaluable resource for tasks like mining social media conversations and analysing customer feedback.
The dataset is provided as a CSV file, specifically named portuguese_lexicon.csv. It comprises a total of 114 unique words in its lexicon, each with an associated sentiment score. The dataset is derived from 3,457 tweets and 476 Buscapé reviews. Users will need an environment capable of reading CSV files that contain both text and numerical data to utilise this resource effectively.
This dataset is ideal for various applications in natural language processing (NLP) and sentiment analysis, including: * Applying to machine learning models for sentiment analysis, text classification, and automated opinion summarisation. * Comparing words or phrases within texts or across different datasets to understand expressed opinions. * Identifying trends in customer opinions over time by comparing sentiment from Twitter and Buscapé reviews. * Understanding how customer review sentiment compares across different Portuguese languages and dialects. * Utilising customer feedback for analytics purposes and gaining insights into public opinion on products based on textual expressions.
The dataset's scope covers reviews written in Portuguese from both Twitter and Buscapé, originating from Portuguese-speaking areas. It is considered to have global region relevance. No specific time range or demographic scope beyond "Portuguese-speaking areas" is detailed in the sources.
CC0
This dataset is suitable for: * Data scientists and machine learning engineers working on NLP tasks. * Researchers interested in social media analysis and cross-platform sentiment comparison. * Businesses and analysts aiming to mine social media conversations and analyse customer feedback for decision-making. * Anyone requiring a linguistically labeled database for Portuguese text analysis.
Original Data Source: Portuguese Sentiment Corpus for Twitter and
This dataset captures Twitter activity related to Covid-19, focusing on the initial phase of the pandemic from April to June 2020 [1, 2]. It comprises 235,240 worldwide tweets in English, streamed live at a rate of approximately 10,000 tweets per day after the World Health Organisation declared Covid-19 a pandemic [1, 2]. The tweets were collected using relevant hashtags such as #covid-19, #coronavirus, #covid, #covaccine, #lockdown, #homequarantine, #quarantinecenter, #socialdistancing, #stayhome, and #staysafe [1, 2].
The data has undergone pre-processing, which involved converting all tweets to lowercase, removing extra white spaces, numbers, special characters, ASCII characters, URLs, punctuations, and stopwords [2]. Additionally, all instances of 'covid' were converted to 'covid19', and stemming was applied to reduce inflected words to their root forms [2]. Sentiment analysis has been performed on each cleaned tweet using an NLTK-based Sentiment Analyser, providing sentiment scores for positive, negative, and neutral categories, and a compound sentiment score [2]. Tweets are classified as Positive, Negative, or Neutral based on these scores [2].
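A rough sketch of the pre-processing steps listed above (lowercasing, removing URLs, numbers, punctuation and stopwords, and mapping 'covid' to 'covid19'); the stopword list is a tiny illustrative stand-in, and stemming and the NLTK sentiment step are omitted for brevity:

```python
import re

STOPWORDS = {"the", "a", "is", "in", "to", "of"}  # illustrative subset only

def clean_tweet(text):
    """Approximate the cleaning pipeline described above."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # drop URLs
    text = re.sub(r"[^a-z\s]", " ", text)       # drop numbers, punctuation, non-ASCII
    tokens = [t for t in text.split() if t not in STOPWORDS]
    tokens = ["covid19" if t == "covid" else t for t in tokens]
    return " ".join(tokens)

print(clean_tweet("Stay safe in the #Covid lockdown! https://t.co/x 2020"))
# -> stay safe covid19 lockdown
```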
The dataset consists of 235,240 tweets from the first phase of collection [1, 2]. Data files are typically provided in CSV format [4]. The tweets were collected from 19th April to 20th June 2020 [1].
This dataset is ideal for various data science and analytics applications, including Natural Language Processing (NLP), Deep Learning, Text Classification, and Ensembling [2]. Its pre-processed nature and included sentiment scores make it particularly useful for sentiment analysis research related to public opinion during the Covid-19 pandemic [2].
The dataset covers a time range from 19th April to 20th June 2020 [1]. It includes worldwide tweets [2] and is limited to English language content [2]. Tweet sources are primarily Twitter for Android (31%) and Twitter for iPhone (28%), with 41% originating from other sources [5].
CC-BY-SA
Original Data Source: Covid-19 Twitter Dataset
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background
Social media opinion has become a medium for quickly accessing a large, valuable, and rich body of information on any subject within a short period. Twitter, a social microblogging site, generates over 330 million tweets monthly across different countries. Analysing trending topics on Twitter presents opportunities to extract meaningful insight into different opinions on various issues.
Aim
This study aims to gain insights into the trending yahoo-yahoo topic on Twitter using content analysis of selected historical tweets.
Methodology
The widgets and workflow engine in the Orange data mining toolbox were employed for all the text mining tasks. 5,500 tweets were collected from Twitter using the "yahoo yahoo" hashtag. The corpus was pre-processed with a pre-trained tweet tokenizer; the Valence Aware Dictionary for Sentiment Reasoning (VADER) was used for sentiment and opinion mining; Latent Dirichlet Allocation (LDA) and Latent Semantic Indexing (LSI) were used for topic modelling; and multidimensional scaling (MDS) was used to visualize the modelled topics.
Results
Results showed that "yahoo" appeared in the corpus 9,555 times; 175 unique tweets were returned after duplicate removal. Contrary to expectation, Spain had the highest number of participants tweeting on the 'yahoo yahoo' topic within the period. VADER sentiment analysis returned 35.85% negative, 24.53% neutral, 15.09% no-zone, and 24.53% positive sentiment tweets. The word "yahoo" was highly representative of LDA topics 1, 3, 4 and 6, and LSI topic 1.
Conclusion
It can be concluded that emojis convey the sentiment of a tweet even faster than its textual content. Also, despite popular belief, a significant number of youths regard cybercrime as a detriment to society.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The zip file contains fourteen Parquet [1] files of two kinds, one pair for each of the seven years from 2015 to 2021 inclusive:
- region_counts: for every word found, gives how many times it appeared regardless of capitalization ("count" column), how many times it appeared with at least one capitalized letter ("count_upper"), in how many different counties it appeared ("nr_cells"), and whether we considered it to be a proper noun ("is_proper").
- raw_cell_counts: gives the count for every word by county, regardless of capitalization.
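One way the region_counts columns can be related is a capitalization-ratio heuristic for the proper-noun flag; the 0.5 threshold and the sample counts below are assumptions for illustration, not the paper's actual rule:

```python
# Toy stand-in for the region_counts table:
# word -> (count, count_upper, nr_cells)
region_counts = {
    "boston": (1200, 1150, 310),
    "coffee": (5000, 400, 2900),
}

def looks_proper(word, threshold=0.5):
    """Flag a word as a likely proper noun when most of its
    occurrences carry at least one capitalized letter."""
    count, count_upper, _ = region_counts[word]
    return count_upper / count > threshold

print(looks_proper("boston"))  # True
print(looks_proper("coffee"))  # False
```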
These counts were obtained from geo-tagged tweets posted in those years within the contiguous US, which were collected through the streaming API of Twitter, more specifically using the "statuses/filter" endpoint [2]. See the project's paper for more details on methodology, and the code repository to reproduce the analysis.
The two text files are our lists of excluded word forms.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Spreadsheet of data from the paper "Can Twitter Give Insights into International Differences in Covid-19 Vaccination? Eight countries' English tweets to 21 March 2021". Includes the gender analysis and the main analysis. A colour code key is included on the right of the worksheet for the main analysis.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The rapid evolution of microblogging and the emergence of sites such as Twitter have propelled online communities to flourish by enabling people to create, share and disseminate free-flowing messages and information globally. The exponential growth of product-based user reviews has become an ever-increasing resource playing a key role in emerging Twitter-based sentiment analysis (SA) techniques and applications to collect and analyse customer trends and reviews. Existing studies on supervised black-box sentiment analysis systems do not provide adequate information regarding the rules that explain why a certain review was assigned to a particular class, and their accuracy is in some respects below human judgement. To address these shortcomings, alternative approaches, such as supervised white-box classification algorithms, need to be developed to improve the classification of Twitter-based microblogs. The purpose of this study was to develop a supervised white-box microblogging SA system to analyse user reviews on certain products using rough set theory (RST)-based rule induction algorithms. RST classifies microblogging reviews of products into a positive, negative, or neutral class using rules extracted from training decision tables with RST-centric rule induction algorithms. A primary focus of this study is also to perform sentiment classification of microblogs (also known as tweets) of product reviews using conventional and RST-based rule induction algorithms.
The proposed RST-centric rule induction algorithms, namely Learning from Examples Module version 2 (LEM2) and LEM2 + Corpus-based rules (LEM2 + CBR), the latter an extension of the traditional LEM2 algorithm, are used. Corpus-based rules are generated from tweets that remain unclassified by the conventional LEM2 rules. Experimental results show that the proposed method, when compared with baseline methods, excels with regard to accuracy, coverage and the number of rules employed. The approach achieves an average accuracy of 92.57% and an average coverage of 100%, with an average of 19.14 rules.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset, named hate_speech_offensive, is a carefully assembled collection of annotated tweets designed for the purpose of detecting hate speech and offensive language. It consists primarily of English tweets and serves as a vital resource for training machine learning models and algorithms in this domain. Researchers and developers can utilise this dataset to build effective systems for identifying and classifying hateful or offensive content, contributing to safer online environments. The dataset is presented in a CSV file format, specifically 'train.csv', and includes detailed annotations for each tweet.
The dataset is provided in a CSV file format, specifically 'train.csv'. It is structured with each row representing an individual tweet along with its corresponding annotations. The dataset currently comprises a single training split. There are approximately 24,783 unique tweets within the dataset.
This dataset is ideal for various applications and use cases, including: * Training machine learning models or algorithms for automated hate speech and offensive language detection. * Conducting Sentiment Analysis on Twitter data to understand the sentiment behind tweets and identify patterns of negative or offensive language. * Developing and evaluating Hate Speech Detection systems that can identify and flag hate speech in real-time. * Improving Content Moderation systems for social media platforms by automatically detecting and removing offensive or hateful content. * Performing Exploratory Data Analysis (EDA) to gain insights into the distribution of tweet classifications, identify common words associated with each class, and analyse co-occurrences of hate speech and offensive language.
The dataset primarily consists of English tweets. Its scope is global in potential application, aiming to address social issues and advocacy related to online discourse. While no specific time range for data collection is provided, the dataset focuses on general English tweet content.
CC0
This dataset is intended for: * Researchers and developers seeking to create and improve machine learning models for detecting hate speech and offensive language on social media platforms like Twitter. * Data scientists and analysts interested in understanding patterns of online discourse and sentiment. * Social media platforms and their moderation teams aiming to enhance automated content moderation systems.
Original Data Source: Hate Speech and Offensive Language Detection
https://catalog.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
https://catalog.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
The annotated tweet corpus in Arabizi, French and English was built by ELDA on behalf of INSA Rouen Normandie (Normandie Université, LITIS team), in the framework of the SAPhIRS project (System for the Analysis of Information Propagation in Social Networks), funded by the DGE (Direction Générale des Entreprises, France) through the RAPID programme (2017-2020). This project aimed at studying the mechanisms of information and opinion propagation within social networks: identifying influential leaders and detecting channels for disseminating information and opinion. The purpose of the corpus constitution, completed in 2020, was to collect and annotate tweets in 3 languages (Arabizi, French and English) for 3 predefined themes (Hooliganism, Racism, Terrorism). For the collection, a tool was developed in Python (based on the "GetOldTweets3" library) which used information such as the language (EN/FR) and a keyword list as input. With this tool, a maximum of 10,000 tweets per (keyword, language) pair were collected for English and French. For Arabizi, a specific process was set up: a vocabulary list in Arabizi was created from a corpus of Arabizi SMS (for Moroccan and Tunisian) and from the Training and test data for Arabizi detection and transliteration (available from ELRA under reference ELRA-W0126, ISLRN ID: 986-364-744-303-9) by selecting the 1,000 most frequent words, and the tweets containing each word from this vocabulary and keyword list were downloaded (places = Morocco, Tunisia, Algeria).
The tweets that were kept had to contain at least 5 words in Arabizi. For the annotation, a tool running on Django was developed to provide the following annotations for each tweet in a given sequence:
• Theme: 5 possible annotations (Hooliganism, Racism, Terrorism, Others, Incomprehensible)
• Topic: the annotator can add a new topic if it does not exist in the proposed list
• Opinion: 3 possible annotations (Negative, Neutral, Positive)
In total, 17,103 sequences were annotated from 585,163 tweets (196,374 in English, 254,748 in French and 134,041 in Arabizi), including the themes "Others" and "Incomprehensible". Among these sequences, 4,578 sequences having at least 20 tweets annotated with the 3 predefined themes (Hooliganism, Racism and Terrorism) were obtained, including 1,866 sequences with an opinion change. They are distributed as follows: 2,141 sequences in English (57,655 tweets), 1,942 sequences in French (48,854 tweets) and 495 sequences in Arabizi (21,216 tweets). A sub-corpus of 8,733 tweets (1,209 in English, 3,938 in French and 3,585 in Arabizi) annotated as "hateful", according to topic/opinion annotations and by selecting tweets that contained insults, is also provided. The data are provided in CSV format. Remark: this corpus includes only tweet IDs and corresponding annotations. Original tweets may be obtained by using the Twitter API.
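The frequency-based vocabulary selection used to seed the Arabizi collection can be sketched as follows; the SMS corpus shown is a fake three-line stand-in for illustration:

```python
from collections import Counter

def top_vocabulary(corpus_lines, n=1000):
    """Select the n most frequent word forms from a corpus, as done
    above to build the Arabizi seed vocabulary."""
    counts = Counter(word for line in corpus_lines for word in line.split())
    return [word for word, _ in counts.most_common(n)]

# Illustrative stand-in for the Arabizi SMS corpus.
sms_corpus = ["salam labas", "labas hamdullah", "salam salam"]
print(top_vocabulary(sms_corpus, n=2))  # ['salam', 'labas']
```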
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset provides simple Twitter analytics data, focusing on user profiles and tweet content. Its primary purpose is to enable the classification of gender based on tweet characteristics, specifically exploring the likelihood of different genders committing typos on their tweets. It serves as a valuable resource for emerging Natural Language Processing (NLP) enthusiasts looking to apply basic models to real-world social media data. The dataset includes unformatted tweet text, user information, and confidence scores related to various attributes.
The dataset contains the following key columns: * _unit_id: A unique identifier for the unit. * Tweet ID: The unique identifier for a tweet. * _golden: Indicates whether a user is a Golden User. * _unit_state: The state of the tweet. * _trusted_judgments: The level of trust associated with the judgment. * _last_judgment_at: The timestamp of the last judgment. * gender: The declared or inferred sex of the user. * gender:confidence: The confidence level associated with the gender classification. * profile_yn: A boolean indicating whether the user's profile is active or exists. * profile_yn:confidence: The confidence level for the profile's existence. * created: The date and time when the user's account was created. * Label Count: A count related to various labels within the dataset.
The dataset is provided as a single data file, typically in CSV format. It comprises approximately 20,000 records. The structure includes various data types, such as IDs, boolean indicators, numerical confidence scores, and datetime stamps.
This dataset is ideal for: * Classifying user gender based on tweet content and user profile information. * Analysing spelling errors or typos in tweets in relation to user demographics. * Developing and testing Natural Language Processing (NLP) models, particularly for tasks like text classification and sentiment analysis. * Exploring patterns in social media behaviour and user characteristics on Twitter. * Educational purposes for those new to applying machine learning techniques to real-world tweet data.
The dataset offers global geographical coverage as indicated by its region. The time range for tweet activity appears to be concentrated around 26th to 27th October 2015. However, the account creation dates for the users span a much broader period, from 5th August 2006 to 26th October 2015. In terms of demographics, the dataset includes gender distribution, with approximately 33% female, 31% male, and 36% categorised as 'Other'.
CC0
This dataset is primarily intended for: * Data scientists and analysts interested in social media analytics and user behaviour. * Machine learning practitioners, especially those working on classification problems and NLP tasks. * Students and researchers in fields such as computer science, linguistics, and social sciences. * NLP enthusiasts who are developing or looking to test basic linear or naive models on real-world text data.
Original Data Source: Twitter Data
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
More and more customers rely on online reviews of products and comments on the Web to decide whether to buy one product over another. In this context, sentiment analysis techniques constitute the traditional way to summarize a user's opinions that criticize or highlight the positive aspects of a product. Sentiment analysis of reviews usually relies on extracting positive and negative aspects of products, neglecting comparative opinions. Such opinions do not directly express a positive or negative view but contrast aspects of products from different competitors. Here, we present the first effort to study comparative opinions in Portuguese, creating two new Portuguese datasets with comparative sentences marked by three human annotators. This repository consists of three important files: (1) a lexicon of words frequently used to make a comparison in Portuguese; (2) a Twitter dataset with labeled comparative sentences; and (3) a Buscapé dataset with labeled comparative sentences. The lexicon is a set of 176 words frequently used to express a comparative opinion in the Portuguese language. The lexicon is used as a filter to build two sets of data with comparative sentences from two important contexts: (1) online social networks; and (2) product reviews. For Twitter, we collected all Portuguese tweets published in Brazil on 2018/01/10 and filtered the tweets that contained at least one keyword present in the lexicon, obtaining 130,459 tweets. Our work operates at the sentence level: all sentences were extracted and a sample of 2,053 sentences was created, which was labeled by three human annotators, reaching 83.2% agreement by Fleiss' Kappa coefficient. For Buscapé, a Brazilian website (https://www.buscape.com.br/) used to compare product prices on the web, the same methodology was followed, creating a set of 2,754 labeled sentences obtained from comments made in 2013.
This dataset was labeled by three humans, reaching an agreement of 83.46% by the Fleiss Kappa coefficient. The Twitter dataset has 2,053 labeled sentences, of which 918 are comparative. The Buscapé dataset has 2,754 labeled sentences, of which 1,282 are comparative. The datasets contain the following labeled properties:
text: the sentence extracted from the review comment.
entity_s1: the first entity compared in the sentence.
entity_s2: the second entity compared in the sentence.
keyword: the comparative keyword used in the sentence to express comparison.
preferred_entity: the preferred entity.
id_start: the starting position of the keyword in the sentence.
id_end: the final position of the keyword in the sentence.
type: the sentence label, which specifies whether the phrase is a comparison.
Additional information:
1 - The sentences were separated using a sentence tokenizer.
2 - If the compared entity is not specified, the field receives the value "_".
3 - The type property can contain five different values:
0: Non-comparative (Não Comparativa).
1: Non-Equal-Gradable (Gradativa com Predileção).
2: Equative (Equitativa).
3: Superlative (Superlativa).
4: Non-Gradable (Não Gradativa).
If you use this data, please cite our paper as follows: "Daniel Kansaon, Michele A. Brandão, Julio C. S. Reis, Matheus Barbosa, Breno Matos, and Fabrício Benevenuto. 2020. Mining Portuguese Comparative Sentences in Online Reviews. In Brazilian Symposium on Multimedia and the Web (WebMedia '20), November 30-December 4, 2020, São Luís, Brazil. ACM, New York, NY, USA, 8 pages. https://doi.org/10.1145/3428658.3431081"
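The lexicon-based filtering step described above can be sketched like this; the keywords shown are a few illustrative comparative words, not the actual 176-word lexicon:

```python
# Illustrative subset of a comparative-keyword lexicon (Portuguese).
LEXICON = {"melhor", "pior", "mais", "menos"}

def has_comparative_keyword(sentence):
    """Keep only sentences containing at least one lexicon keyword."""
    return any(tok in LEXICON for tok in sentence.lower().split())

sentences = [
    "este celular é melhor que o outro",   # comparative ("melhor")
    "gostei muito do produto",             # no comparative keyword
]
kept = [s for s in sentences if has_comparative_keyword(s)]
print(len(kept))  # 1
```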
https://creativecommons.org/publicdomain/zero/1.0/
In the last few years, Twitter has become one of the most popular social media platforms. From celebrity status to government policy, Twitter accommodates a diverse range of people and thoughts. Within this diversity, there are many Twitter accounts that frequently tweet "self-help" thoughts. These so-called "wise" thoughts are often about improving one's life and excelling at what you do. So I went down the rabbit hole in search of these sorts of tweets and found many common themes among them. Therefore, I decided to scrape the tweets so that you can explore the words of these "self-help" tweets and understand them much better.
I scraped the data using the Tweepy API, collecting all tweets, retweets, and retweets-with-comment from 40 authors. The data contains more than 40 authors because every retweet by any of the 40 authors is stored as a tweet from the original author. Also, every retweet with a comment is wrapped in a pair of opening and closing tags: the author's comment is followed by the opening tag, then the content of the retweet, which is followed by the closing tag. The script I used for scraping can be found here.
I would like to thank Stack Overflow, which helped me at literally every stage of this project, from scraping to data analysis. Kudos also to the Tweepy API, which made fetching tweets far easier.
I created this dataset for many reasons. The most important one is that I want to know how similar these tweets are. I would also like to know what makes some tweets go viral and which factors affect a viral tweet. I explore these and many more questions in my kernel, which you can find in the kernels section.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Bigram, word-frequency, frame, and engagement analysis of the tweets published by Greta Thunberg between 2019 and 2022.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Excerpts of the co-occurrence analysis in WordStat (including 2D co-word maps, correspondence maps and heat maps) of two cases: booster-broiler (plofkip) and mega-stable (megastal). Please see the article for more information about the data collection and method.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains word frequencies (raw frequency, relative frequency, TF-IDF) and sentiment analysis of tweets on the COVID-19 pandemic in Alberta. It also contains groups of tweets that have been made non-consumptive for copyright reasons by shuffling the words of the tweets. Our goal was to capture a representative sample of the discourse on the COVID-19 pandemic in Alberta taking place on Twitter. More details on the dataset can be found in the readme document.
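The three frequency measures named above follow a standard convention; the dataset's readme has the authoritative definitions. A minimal sketch of one common formulation, using invented toy tweets:

```python
import math
from collections import Counter

# Toy corpus standing in for the tweets (invented, not from the dataset)
docs = [
    "masks required in alberta stores",
    "alberta reports new covid cases",
    "stay home stay safe",
]

# Raw frequency: term count within each document
raw = [Counter(d.split()) for d in docs]

# Relative frequency: raw count divided by document length
rel = [{w: c / sum(tf.values()) for w, c in tf.items()} for tf in raw]

# IDF over the corpus: log(N / document frequency)
n_docs = len(docs)
df = Counter(w for tf in raw for w in tf)
idf = {w: math.log(n_docs / df[w]) for w in df}

# TF-IDF: relative frequency weighted by IDF
tfidf = [{w: rel_tf[w] * idf[w] for w in rel_tf} for rel_tf in rel]
```

Terms appearing in every document get an IDF of zero, so TF-IDF down-weights corpus-wide words while preserving document-specific ones.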
https://creativecommons.org/publicdomain/zero/1.0/
Twitter is an online social media platform where people share their thoughts as tweets. It is observed that some people misuse it to tweet hateful content. Twitter is trying to tackle this problem, and we shall help by creating a strong NLP-based classifier model to distinguish negative tweets and block them. Can you build a strong classifier model to predict the same?
Each row contains the text of a tweet and a sentiment label. In the training set you are provided with a word or phrase drawn from the tweet (selected_text) that encapsulates the provided sentiment.
Make sure, when parsing the CSV, to remove the beginning / ending quotes from the text field, to ensure that you don't include them in your training.
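The quote-stripping step above can be sketched as follows; the column names (textID, text, selected_text, sentiment) are taken from the competition's data page, and the sample row is invented:

```python
import csv
import io

# Invented sample mimicking the train.csv layout; "" inside a quoted
# CSV field encodes a literal " character
sample = io.StringIO(
    'textID,text,selected_text,sentiment\n'
    'abc123,"""I love this!""","""I love this""",positive\n'
)

rows = []
for row in csv.DictReader(sample):
    # Remove the leading/trailing quotes left inside the text fields
    # so they don't leak into training
    row["text"] = row["text"].strip('"')
    row["selected_text"] = row["selected_text"].strip('"')
    rows.append(row)

print(rows[0]["text"])  # -> I love this!
```

`csv.DictReader` already handles the CSV-level quoting; the extra `strip('"')` removes the stray quotes embedded in the field values themselves.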
You're attempting to predict the word or phrase from the tweet that exemplifies the provided sentiment. The predicted span should include all characters within it (i.e., including commas, spaces, etc.).
The dataset was downloaded from the Kaggle competition:
https://www.kaggle.com/c/tweet-sentiment-extraction/data?select=train.csv