Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Please cite the following paper when using this dataset:
N. Thakur, “A Large-Scale Dataset of Twitter Chatter about Online Learning during the Current COVID-19 Omicron Wave,” Preprints, 2022, DOI: 10.20944/preprints202206.0146.v1
Abstract
The COVID-19 Omicron variant, reported to be the most immune-evasive variant of COVID-19, is resulting in a surge of COVID-19 cases globally. This has caused schools, colleges, and universities in different parts of the world to transition to online learning. As a result, social media platforms such as Twitter are seeing an increase in conversations related to online learning, centered around information seeking and sharing. Mining such conversations (Tweets) to develop a dataset can serve as a data resource for interdisciplinary research related to the analysis of interest, views, opinions, perspectives, attitudes, and feedback towards online learning during the current surge of COVID-19 cases caused by the Omicron variant. Therefore, this work presents a large-scale public Twitter dataset of conversations about online learning since the first detected case of the COVID-19 Omicron variant in November 2021. The dataset files contain a raw version that comprises 52,868 Tweet IDs (corresponding to the same number of Tweets) and a cleaned and preprocessed version that contains 46,208 unique Tweet IDs. The dataset is compliant with the privacy policy, developer agreement, and guidelines for content redistribution of Twitter, as well as the FAIR principles (Findability, Accessibility, Interoperability, and Reusability) for scientific data management.
Data Description
The dataset comprises 7 .txt files. The raw version of this dataset comprises 6 .txt files (TweetIDs_Corona Virus.txt, TweetIDs_Corona.txt, TweetIDs_Coronavirus.txt, TweetIDs_Covid.txt, TweetIDs_Omicron.txt, and TweetIDs_SARS CoV2.txt) that contain Tweet IDs grouped together based on certain synonyms or terms that were used to refer to online learning and the Omicron variant of COVID-19 in the respective tweets. Table 1 shows the list of all the synonyms or terms that were used for the dataset development. The cleaned and preprocessed version of this dataset is provided in the .txt file - TweetIDs_Duplicates_Removed.txt. A description of these dataset files is provided in Table 2.
The dataset contains only Tweet IDs, in compliance with the terms and conditions mentioned in the privacy policy, developer agreement, and guidelines for content redistribution of Twitter. The Tweet IDs need to be hydrated before use. For hydrating this dataset, the Hydrator application may be used (a download link and a step-by-step tutorial on how to use Hydrator are available).
Table 1. List of commonly used synonyms, terms, and phrases for online learning and COVID-19 that were used for the dataset development
Terminology | List of synonyms and terms
COVID-19 | Omicron, COVID, COVID19, coronavirus, coronaviruspandemic, COVID-19, corona, coronaoutbreak, omicron variant, SARS CoV-2, corona virus
online learning | online education, online learning, remote education, remote learning, e-learning, elearning, distance learning, distance education, virtual learning, virtual education, online teaching, remote teaching, virtual teaching, online class, online classes, remote class, remote classes, distance class, distance classes, virtual class, virtual classes, online course, online courses, remote course, remote courses, distance course, distance courses, virtual course, virtual courses, online school, virtual school, remote school, online college, online university, virtual college, virtual university, remote college, remote university, online lecture, virtual lecture, remote lecture, online lectures, virtual lectures, remote lectures
Table 2: Description of the dataset files along with the information about the number of Tweet IDs in each of them
Filename | No. of Tweet IDs | Description
TweetIDs_Corona Virus.txt | 321 | Tweet IDs correspond to tweets that comprise the keyword "corona virus" and one or more keywords/terms that refer to online learning
TweetIDs_Corona.txt | 1,819 | Tweet IDs correspond to tweets that comprise the keyword "corona" or "coronaoutbreak" and one or more keywords/terms that refer to online learning
TweetIDs_Coronavirus.txt | 1,429 | Tweet IDs correspond to tweets that comprise the keyword "coronavirus" or "coronaviruspandemic" and one or more keywords/terms that refer to online learning
TweetIDs_Covid.txt | 41,088 | Tweet IDs correspond to tweets that comprise the keyword "COVID", "COVID19", or "COVID-19" and one or more keywords/terms that refer to online learning
TweetIDs_Omicron.txt | 8,198 | Tweet IDs correspond to tweets that comprise the keyword "omicron" or "omicron variant" and one or more keywords/terms that refer to online learning
TweetIDs_SARS CoV2.txt | 13 | Tweet IDs correspond to tweets that comprise the keyword "SARS-CoV-2" and one or more keywords/terms that refer to online learning
TweetIDs_Duplicates_Removed.txt | 46,208 | A collection of unique Tweet IDs from all the 6 .txt files mentioned above after data preprocessing, data cleaning, and removal of duplicate tweets
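The cleaned file can be reproduced from the six raw files by merging their Tweet IDs and dropping duplicates. A minimal sketch (file handling only; the dataset's full preprocessing also involved cleaning steps not shown here):

```python
# Merge several one-ID-per-line Tweet ID files and drop duplicates,
# mirroring how TweetIDs_Duplicates_Removed.txt was derived.
def merge_unique_ids(paths):
    seen = set()
    ordered = []
    for path in paths:
        with open(path) as fh:
            for line in fh:
                tweet_id = line.strip()
                if tweet_id and tweet_id not in seen:
                    seen.add(tweet_id)
                    ordered.append(tweet_id)
    return ordered
```

Calling `merge_unique_ids` on the six raw files listed in Table 2 yields the deduplicated ID list.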
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset contains two data sheets summarizing the results of a class activity involving the use of Twitter to share information and to learn. Table S1 is a summary of the tweets produced by the user GeoHumana_UV (teacher/subject account) between 2020-09-16 and 2020-12-15, plus the interactions recorded for each tweet (including number of impressions, URL clicks, likes, retweets, and replies); all data were provided by Twitter Analytics. Table S2 is a summary of student activity related to the class activity for the same period of time. It includes anonymized data on tweeting activity, level of interaction, performance, and self-assessment by each participating student. Data were provided by Twitter Analytics and the participants.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains 15,000 Arabic tweets annotated for depression detection and includes linguistic feature augmentations to support research in natural language processing (NLP), sentiment analysis, and mental health detection. The dataset was curated to enable studies on automatic depression detection in Arabic social media and to support machine learning and deep learning approaches in the domain of computational mental health.
Contents
The dataset consists of the following columns:
- tweet: The original Arabic tweet text.
- label: Binary label indicating whether the tweet expresses signs of depression (1 = Depression, 0 = Non-depression).
- negation_flag: Indicates presence (1) or absence (0) of negation in the tweet.
- intensifier_flag: Indicates presence (1) or absence (0) of intensifiers (words that strengthen the degree of emotion).
- Class: Textual label corresponding to the binary label (Depression or Non-depression); redundant but included for convenience.
- Binary Classification: Contains the count of instances in each class (appears as an artifact in the provided file).
Key Features
- Language: Arabic (varied dialects and Modern Standard Arabic).
- Source: Publicly available tweets collected from Twitter (X).
- Annotation: Manual labeling by native Arabic speakers trained in psychology and linguistics.
- Linguistic augmentation: Flags for negation and intensifier usage are included to support linguistically informed NLP models.
Potential Use Cases
- Depression detection models for Arabic texts.
- Linguistic analysis of depression expression in Arabic social media.
- Cross-lingual studies comparing depression signals across languages.
- Development of clinical decision support systems leveraging social media data.
Licensing & Ethical Considerations
The dataset consists of public social media posts. Researchers are advised to use it strictly for research purposes, respecting privacy and ethical guidelines. No personally identifiable information (PII) is included.
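The linguistic flags can be combined with the binary label for slicing the data. A minimal sketch of working with rows in the documented schema (the row values here are made up; only the column names come from the dataset description):

```python
# Illustrative rows following the documented columns; real data would be
# loaded from the distributed file, whose name is not given here.
rows = [
    {"tweet": "...", "label": 1, "negation_flag": 1, "intensifier_flag": 0},
    {"tweet": "...", "label": 0, "negation_flag": 0, "intensifier_flag": 1},
    {"tweet": "...", "label": 1, "negation_flag": 0, "intensifier_flag": 1},
]

# Example: select depression-labeled tweets that contain an intensifier,
# a slice linguistically informed models may treat separately.
intensified_depression = [
    r for r in rows if r["label"] == 1 and r["intensifier_flag"] == 1
]
```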
Citation If you use this dataset, please cite it appropriately in your research publications and acknowledge the creators.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Description
The Twitter Financial News dataset is an English-language dataset containing an annotated corpus of finance-related tweets. This dataset is used to classify finance-related tweets for their topic.
The dataset holds 21,107 documents annotated with 20 labels:
topics = { "LABEL_0": "Analyst Update", "LABEL_1": "Fed | Central Banks", "LABEL_2": "Company | Product News", "LABEL_3": "Treasuries | Corporate Debt", "LABEL_4": "Dividend"… See the full description on the dataset page: https://huggingface.co/datasets/zeroshot/twitter-financial-news-topic.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The data collection was performed using the standard Twitter API on Arabic tweets and code-mixed datasets. The collection was carried out over a duration of three months, from April 2023 to June 2023, via a combination of keyword-based, thread-based, and profile-based search approaches. A total of 120 terms, including various spelling variants, were used to identify tweets containing code-mixing related to regional hate speech. For the thread-based search, we incorporated hashtags related to contentious subjects that are deemed essential markers of hateful speech. Throughout the data-gathering phase, we monitored Twitter trends and designated ten hashtags for information retrieval. Given that hateful tweets are usually less common than regular tweets, we expanded our dataset and improved the representation of the hate class by incorporating the most impactful terms from a lexicon of religious hate terms (Albadi et al., 2018). We gathered exclusively original Arabic tweets for all queries, excluding retweets and non-Arabic tweets. In all, we obtained 200,000 tweets, of which we sampled 35k for annotation.
https://creativecommons.org/publicdomain/zero/1.0/
By hate_speech_offensive (From Huggingface) [source]
This dataset, named hate_speech_offensive, is a meticulously curated collection of annotated tweets with the specific purpose of detecting hate speech and offensive language. The dataset primarily consists of English tweets and is designed to train machine learning models or algorithms in the task of hate speech detection. It should be noted that the dataset has not been divided into multiple subsets, and only the train split is currently available for use.
The dataset includes several columns that provide valuable information for understanding each tweet's classification. The column count represents the total number of annotations provided for each tweet, whereas hate_speech_count signifies how many annotations classified a particular tweet as hate speech. On the other hand, offensive_language_count indicates the number of annotations categorizing a tweet as containing offensive language. Additionally, neither_count denotes how many annotations identified a tweet as neither hate speech nor offensive language.
For researchers and developers aiming to create effective models or algorithms capable of detecting hate speech and offensive language on Twitter, this comprehensive dataset offers a rich resource for training and evaluation purposes.
Introduction:
Dataset Overview:
- The dataset is presented in a CSV file format named 'train.csv'.
- It consists of annotated tweets with information about their classification as hate speech, offensive language, or neither.
- Each row represents a tweet along with the corresponding annotations provided by multiple annotators.
- The main columns that will be essential for your analysis are: count (total number of annotations), hate_speech_count (number of annotations classifying a tweet as hate speech), offensive_language_count (number of annotations classifying a tweet as offensive language), neither_count (number of annotations classifying a tweet as neither hate speech nor offensive language).
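The count columns above can be collapsed into a single class per tweet by majority vote. A minimal sketch (the function name and the tie-breaking order, which favors the first listed class, are illustrative assumptions):

```python
# Derive a single class per tweet from the annotation counts
# (column names follow the dataset card; ties resolve to the
# first maximal class in the dict's insertion order).
def majority_class(hate_speech_count, offensive_language_count, neither_count):
    counts = {
        "hate_speech": hate_speech_count,
        "offensive_language": offensive_language_count,
        "neither": neither_count,
    }
    return max(counts, key=counts.get)
```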
Data Collection Methodology: The data collection methodology used to create this dataset involved obtaining tweets from Twitter's public API using specific search terms related to hate speech and offensive language. These tweets were then manually labeled by multiple annotators who reviewed them for classification purposes.
Data Quality: Although efforts have been made to ensure the accuracy of the data, it is important to acknowledge that annotations are subjective opinions provided by individual annotators. As such, there may be variations in classifications between annotators.
Preprocessing Techniques: Prior to training machine learning models or algorithms on this dataset, it is recommended to apply standard preprocessing techniques such as removing URLs, usernames/handles, special characters/punctuation marks, stop words removal, tokenization, stemming/lemmatization etc., depending on your analysis requirements.
Exploratory Data Analysis (EDA): Conducting EDA on the dataset will help you gain insights and understand the underlying patterns in hate speech and offensive language. Some potential analysis ideas include:
- Distribution of tweet counts per classification category (hate speech, offensive language, neither).
- Most common words/phrases associated with each class.
- Co-occurrence analysis to identify correlations between hate speech and offensive language.
Building Machine Learning Models: To train models for automatic detection of hate speech and offensive language, you can follow these steps: a) Split the dataset into training and testing sets for model evaluation purposes. b) Choose appropriate features/
- Sentiment Analysis: This dataset can be used to train models for sentiment analysis on Twitter data. By classifying tweets as hate speech, offensive language, or neither, the dataset can help in understanding the sentiment behind different tweets and identifying patterns of negative or offensive language.
- Hate Speech Detection: The dataset can be used to develop models that automatically detect hate speech on Twitter. By training machine learning algorithms on this annotated dataset, it becomes possible to create systems that can identify and flag hate speech in real-time, making social media platforms safer and more inclusive.
- Content Moderation: Social media platforms can use this dataset to improve their content m...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The ISCA project compiled this dataset using an annotation portal, which was used to label tweets as either biased or non-biased, among other labels. Note that the annotation was done on live data, including images and context, such as threads. The original data comes from annotationportal.com. They include representative samples of live tweets from the years 2020 and 2021 with the keywords "Asians, Blacks, Jews, Latinos, and Muslims".
A random sample of 600 tweets per year was drawn for each of the keywords. This includes retweets. Due to a sampling error, the sample for the year 2021 for the keyword "Jews" has only 453 tweets from 2021 and 147 from the first eight months of 2022 and it includes some tweets from the query with the keyword "Israel." The tweets were divided into six samples of 100 tweets, which were then annotated by three to seven students in the class "Researching White Supremacism and Antisemitism on Social Media" taught by Gunther Jikeli, Elisha S. Breton, and Seth Moller at Indiana University in the fall of 2022, see this report. Annotators used a scale from 1 to 5 (confident not biased, probably not biased, don't know, probably biased, confident biased). The definitions of bias against each minority group used for annotation are also included in the report.
If a tweet called out or denounced bias against the minority in question, it was labeled as "calling out bias."
The labels of whether a tweet is biased or calls out bias are based on a 75% majority vote. We considered "probably biased" and "confident biased" as biased and "confident not biased," "probably not biased," and "don't know" as not biased.
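The 75% majority rule over the 1-to-5 annotator scale can be sketched as follows (labels and threshold come from the description above; the function name is illustrative):

```python
# Aggregate 1-5 annotator ratings into a binary "biased" label.
# Ratings 4 ("probably biased") and 5 ("confident biased") count as biased;
# ratings 1-3 count as not biased. A tweet is labeled biased when at least
# 75% of annotators rated it as biased.
def is_biased(ratings, threshold=0.75):
    biased_votes = sum(1 for r in ratings if r >= 4)
    return biased_votes / len(ratings) >= threshold
```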
The types of stereotypes vary widely across the different categories of prejudice. While about a third of all biased tweets were classified as "hate" against the minority, the stereotypes in the tweets often matched common stereotypes about the minority. Asians were blamed for the Covid pandemic. Blacks were seen as inferior and associated with crime. Jews were seen as powerful and held collectively responsible for the actions of the State of Israel. Some tweets denied the Holocaust. Hispanics/Latines were portrayed as being in the country illegally and as "invaders," in addition to stereotypical accusations of being lazy, stupid, or having too many children. Muslims, on the other hand, were often collectively blamed for terrorism and violence, though often in conversations about Muslims in India.
This dataset contains 5880 tweets that cover a wide range of topics common in conversations about Asians, Blacks, Jews, Latines, and Muslims. 357 tweets (6.1%) are labeled as biased and 5523 (93.9%) are labeled as not biased. 1365 tweets (23.2%) are labeled as calling out or denouncing bias.
- 1180 out of 5880 tweets (20.1%) contain the keyword "Asians"; 590 were posted in 2020 and 590 in 2021. 39 tweets (3.3%) are biased against Asian people; 370 tweets (31.4%) call out bias against Asians.
- 1160 out of 5880 tweets (19.7%) contain the keyword "Blacks"; 578 were posted in 2020 and 582 in 2021. 101 tweets (8.7%) are biased against Black people; 334 tweets (28.8%) call out bias against Blacks.
- 1189 out of 5880 tweets (20.2%) contain the keyword "Jews"; 592 were posted in 2020, 451 in 2021, and, as mentioned above, 146 in 2022. 83 tweets (7%) are biased against Jewish people; 220 tweets (18.5%) call out bias against Jews.
- 1169 out of 5880 tweets (19.9%) contain the keyword "Latinos"; 584 were posted in 2020 and 585 in 2021. 29 tweets (2.5%) are biased against Latines; 181 tweets (15.5%) call out bias against Latines.
- 1182 out of 5880 tweets (20.1%) contain the keyword "Muslims"; 593 were posted in 2020 and 589 in 2021. 105 tweets (8.9%) are biased against Muslims; 260 tweets (22%) call out bias against Muslims.
The dataset is provided in a csv file format, with each row representing a single message, including replies, quotes, and retweets. The file contains the following columns:
'TweetID': Represents the tweet ID.
'Username': Represents the username that published the tweet (if it is a retweet, it will be the user who retweeted the original tweet).
'Text': Represents the full text of the tweet (not pre-processed).
'CreateDate': Represents the date the tweet was created.
'Biased': Represents the label assigned by our annotators indicating whether the tweet is biased (1) or not (0).
'Calling_Out': Represents the label assigned by our annotators indicating whether the tweet is calling out bias against minority groups (1) or not (0).
'Keyword': Represents the keyword that was used in the query. The keyword can be in the text, including mentioned names, or the username.
Data is published under the terms of the "Creative Commons Attribution 4.0 International" licence (https://creativecommons.org/licenses/by/4.0)
We are grateful for the technical collaboration with Indiana University's Observatory on Social Media (OSoMe). We thank all class participants for the annotations and contributions, including Kate Baba, Eleni Ballis, Garrett Banuelos, Savannah Benjamin, Luke Bianco, Zoe Bogan, Elisha S. Breton, Aidan Calderaro, Anaye Caldron, Olivia Cozzi, Daj Crisler, Jenna Eidson, Ella Fanning, Victoria Ford, Jess Gruettner, Ronan Hancock, Isabel Hawes, Brennan Hensler, Kyra Horton, Maxwell Idczak, Sanjana Iyer, Jacob Joffe, Katie Johnson, Allison Jones, Kassidy Keltner, Sophia Knoll, Jillian Kolesky, Emily Lowrey, Rachael Morara, Benjamin Nadolne, Rachel Neglia, Seungmin Oh, Kirsten Pecsenye, Sophia Perkovich, Joey Philpott, Katelin Ray, Kaleb Samuels, Chloe Sherman, Rachel Weber, Molly Winkeljohn, Ally Wolfgang, Rowan Wolke, Michael Wong, Jane Woods, Kaleb Woodworth, and Aurora Young. This work used Jetstream2 at Indiana University through allocation HUM200003 from the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program, which is supported by National Science Foundation grants #2138259, #2138286, #2138307, #2137603, and #2138296.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Financial Sentiment Analysis Dataset
Overview
This dataset is a comprehensive collection of tweets focused on financial topics, meticulously curated to assist in sentiment analysis in the domain of finance and stock markets. It serves as a valuable resource for training machine learning models to understand and predict sentiment trends based on social media discourse, particularly within the financial sector.
Data Description
The dataset comprises tweets… See the full description on the dataset page: https://huggingface.co/datasets/TimKoornstra/financial-tweets-sentiment.
https://choosealicense.com/licenses/unknown/
Dataset Card for tweet_eval
Dataset Summary
TweetEval consists of seven heterogeneous tasks in Twitter, all framed as multi-class tweet classification. The tasks include irony, hate, offensive, stance, emoji, emotion, and sentiment. All tasks have been unified into the same benchmark, with each dataset presented in the same format and with fixed training, validation, and test splits.
Supported Tasks and Leaderboards
text_classification: The dataset can be… See the full description on the dataset page: https://huggingface.co/datasets/cardiffnlp/tweet_eval.
https://creativecommons.org/publicdomain/zero/1.0/
Collection of 13M tweets divided into training, validation, and test sets for the purposes of predicting emoji based on text and/or images.
The data provides the tweet status ID and the emoji annotations associated with it. In the case of image-containing subsets, the image URL is also listed.
The Full, unbalanced dataset consists of a random test and validation sets of 1M tweets, with the remainder in the training set.
The Balanced test set is a subset of the test set chosen to improve emoji class balance.
The Image subsets are image-containing tweets.
Finally, emoji_map_1791.csv provides information regarding the emoji labels and potential metadata.
URL to get the tweet based on ID: `https://twitter.com/anyuser/status/<tweet ID>`
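Since Twitter resolves the status from the numeric ID regardless of the username in the path, a viewing URL can be built from the ID alone. A minimal sketch:

```python
# Build a viewable URL from a tweet status ID; Twitter redirects the
# "anyuser" placeholder to the correct account based on the numeric ID.
def tweet_url(status_id):
    return f"https://twitter.com/anyuser/status/{status_id}"
```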
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Overview
This is an entity-level sentiment analysis dataset of Twitter. Given a message and an entity, the task is to judge the sentiment of the message about the entity. There are three classes in this dataset: Positive, Negative, and Neutral. We regard messages that are not relevant to the entity (i.e., Irrelevant) as Neutral.
Usage Please use twitter_training.csv as the training set and twitter_validation.csv as the validation set. Top 1 classification accuracy is used as the metric.
Original Data Source: Twitter Sentiment Analysis
StEduCov is a dataset annotated for stances toward online education during the COVID-19 pandemic. StEduCov has 17,097 tweets gathered over 15 months, from March 2020 to May 2021, using the Twitter API. The tweets are manually annotated into agree, disagree, or neutral classes. We used a set of relevant hashtags and keywords; specifically, we utilised combinations of hashtags, such as '#COVID 19' or '#Coronavirus', with keywords, such as 'education', 'online learning', 'distance learning', and 'remote learning'. To ensure high annotation quality, each tweet was annotated by three different annotators and revised by at least one of three judges. Annotators were guided by instructions, for example that for the disagree class there should be a clear negative statement about online education or its impact, with guidance also covering tweets that are negative but refer to other people (e.g., 'my children hate online learning').
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
From conspiracy theories to fake cures and fake treatments, COVID-19 has become a hot-bed for the spread of misinformation online. It is more important than ever to identify methods to debunk and correct false information online. Detection and characterization of misinformation requires an availability of annotated datasets. Most of the published COVID-19 Twitter datasets are generic, lack annotations or labels, employ automated annotations using transfer learning or semi-supervised methods, or are not specifically designed for misinformation. Annotated datasets are either only focused on "fake news", are small in size, or have less diversity in terms of classes.
Here, we present a novel Twitter misinformation dataset called "CMU-MisCov19" with 4573 annotated tweets over 17 themes around the COVID-19 discourse. We also present our annotation codebook for the different COVID-19 themes on Twitter, along with their descriptions and examples, for the community to use for collecting further annotations. Further details related to the dataset and our analysis based on this dataset can be found at https://arxiv.org/abs/2008.00791. In adherence to Twitter's terms and conditions, we do not provide the full tweet JSONs but provide a ".csv" file with the tweet IDs so that the tweets can be rehydrated. We also provide the annotations and the date of creation for each tweet for the reproduction of the results of our analyses.
Note: If for any reason, you are not able to rehydrate all the tweets, reach out to Shahan Ali Memon at (shahan@nyu.edu).
If you use this data, please cite our paper as follows:
"Shahan Ali Memon and Kathleen M. Carley. Characterizing COVID-19 Misinformation Communities Using a Novel Twitter Dataset, In Proceedings of The 5th International Workshop on Mining Actionable Insights from Social Networks (MAISoN 2020), co-located with CIKM, virtual event due to COVID-19, 2020."
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Disaster Tweets Dataset For Binary Classification
This dataset contains tweets classified as either disastrous (label 1) or not disastrous (label 0). It is designed to train and evaluate machine learning models for disaster-related tweet classification.
Files Included
train.csv: Contains 7,613 tweets with their respective labels. test.csv: Contains 3,263 tweets without labels.
Columns
Each CSV file contains the following columns:
id – Unique identifier for… See the full description on the dataset page: https://huggingface.co/datasets/mhamza-007/binary-class-tweets-dataset.
Data Collection
I streamed live tweets from Twitter after the WHO declared Covid-19 a pandemic. Since the Covid-19 epidemic has affected the entire world, I collected worldwide Covid-19-related English tweets at a rate of almost 10k per day in three phases: April-June 2020, August-October 2020, and April-June 2021. I prepared the first-phase dataset of about 235k tweets collected from 19th April to 20th June 2020. After one month, I started collecting tweets again, as at that time the pandemic was spreading with fatal intensity; I collected almost 320k tweets in the period 20 August to 20 October 2020 for the second-phase dataset. Finally, after six months, I collected almost 489k tweets in the period 26th April to 27th June 2021 for the third-phase dataset.
Content
The datasets I developed contain important information about most of the tweets and their attributes. The main attributes of these datasets are:
Tweet ID, Creation Date & Time, Source Link, Original Tweet, Favorite Count, Retweet Count, Original Author, Hashtags, User Mentions, and Place.
Finally, I collected 235,240, 320,316, and 489,269 tweets for the first, second, and third phase datasets, containing hash-tagged keywords such as #covid-19, #coronavirus, #covid, #covaccine, #lockdown, #homequarantine, #quarantinecenter, #socialdistancing, #stayhome, #staysafe, etc. Here I present an overview of the collected dataset.
Data Pre-Processing
I pre-processed the collected data by developing a user-defined pre-processing function based on NLTK (Natural Language Toolkit, a Python library for NLP). At the initial stage, it converts all the tweets into lowercase. Then it removes all extra white spaces, numbers, special characters, non-ASCII characters, URLs, punctuation, and stopwords from the tweets. It then converts all 'covid' words into 'covid19', as all numbers have already been removed from the tweets. Using stemming, the pre-processing function reduces inflected words to their word stems.
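A minimal sketch of such a pipeline (illustrative only: the stop-word list is a tiny placeholder for NLTK's full list, the stemming step is omitted, and this is not the author's exact function):

```python
import re

STOP_WORDS = {"the", "is", "a", "an", "of", "are"}  # placeholder; NLTK ships a full list

def preprocess(tweet):
    text = tweet.lower()                          # 1. lowercase
    text = re.sub(r"https?://\S+", " ", text)     # 2. remove URLs
    text = re.sub(r"[^a-z\s]", " ", text)         # 3. numbers, punctuation, special chars
    text = re.sub(r"\s+", " ", text).strip()      # 4. collapse extra white space
    text = re.sub(r"\bcovid\b", "covid19", text)  # 5. normalize 'covid' to 'covid19'
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return " ".join(tokens)
```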
Sentiment Analysis
I calculated the sentiment polarity of each cleaned and pre-processed tweet using the NLTK-based sentiment analyzer, obtaining sentiment scores for the positive, negative, and neutral categories and computing a compound sentiment score for each tweet. I classified the tweets into three classes, Positive, Negative, and Neutral, on the basis of the compound sentiment score. I then assigned the sentiment polarity rating for each tweet based on the following algorithm:
Algorithm Sentiment Classification of Tweets (compound, sentiment):
for each tweet in the dataset:
    if tweet[compound] < 0:
        tweet[sentiment] = 0.0  # assigned 0.0 for Negative Tweets
    elif tweet[compound] > 0:
        tweet[sentiment] = 1.0  # assigned 1.0 for Positive Tweets
    else:
        tweet[sentiment] = 0.5  # assigned 0.5 for Neutral Tweets
end

Acknowledgements
I wouldn't be here without the help of my project guide Dr. Anup Kumar Kolya, Assistant Professor, Dept. of Computer Science and Engineering, RCCIIT, whose kind and valuable suggestions and excellent guidance gave me the best opportunity in preparing these datasets.
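A runnable version of the mapping in the algorithm above (the compound score itself would come from NLTK's VADER SentimentIntensityAnalyzer, which is omitted here):

```python
def sentiment_rating(compound):
    """Map a compound sentiment score to the dataset's polarity rating."""
    if compound < 0:
        return 0.0  # Negative
    elif compound > 0:
        return 1.0  # Positive
    return 0.5      # Neutral
```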
These datasets are part of the publications entitled:
Chakraborty, A. K., Das, D., & Kolya, A. K. (2023). Sentiment Analysis on Large-Scale Covid-19 Tweets using Hybrid Convolutional LSTM Based on Naïve Bayes Sentiment Modeling. ECTI Transactions on Computer and Information Technology (ECTI-CIT), 17(3), 343–357. https://doi.org/10.37936/ecti-cit.2023173.252549
Chakraborty, A. K., & Das, S. (2023). A comparative study of a novel approach with baseline attributes leading to sentiment analysis of Covid-19 tweets. In Elsevier eBooks (pp. 179–208). https://doi.org/10.1016/b978-0-32-390535-0.00013-6
Chakraborty, A. K., Das, S., & Kolya, A. K. (2021). Sentiment analysis of COVID-19 tweets using Evolutionary Classification-Based LSTM model. In Advances in Intelligent Systems and Computing (pp. 75–86). https://doi.org/10.1007/978-981-16-1543-6_7
Original Data Source: Covid-19 Twitter Dataset
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This dataset contains a set of domain sharing actions that occurred on Twitter during the month of June 2017. Each domain sharing action can be thought of as a triple (twitter user id, tweet id, domain). The tweets in the dataset were collected through Twitter's decahose API. Each user in the dataset was responsible for sharing at least one news article, and at least one article that can be labeled as misinformation.
The data is distributed in domain-shares.data in the following JSON format:
{
  "<twitter user id>": {
    "<tweet id>": ["<domain>", ...],
    ...
  },
  ...
}
For example:
{
"2274775459": {
"880214796779573249": ["palmerreport.com"],
"870801925054373888": ["reuters.com", "abcn.ws"],
"879899831808012292": ["mobile.nytimes.com"]
},
"2909434459": {
"879856813755256832": ["dallasnews.com"]
}
}
In addition, a TAB-separated version with (twitter user id, tweet id, domain) triples is also available.
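The nested JSON mapping can be flattened into the same (twitter user id, tweet id, domain) triples as the TAB-separated version. A minimal sketch, using an inline sample in place of the actual domain-shares.data file:

```python
import json

# Inline sample mirroring the example above; the real data lives in domain-shares.data
raw = """{
  "2274775459": {
    "880214796779573249": ["palmerreport.com"],
    "870801925054373888": ["reuters.com", "abcn.ws"]
  },
  "2909434459": {
    "879856813755256832": ["dallasnews.com"]
  }
}"""

def to_triples(data):
    """Flatten the user -> tweet -> domains mapping into (user, tweet, domain) triples."""
    return [(user, tweet, domain)
            for user, tweets in data.items()
            for tweet, domains in tweets.items()
            for domain in domains]

triples = to_triples(json.loads(raw))
```

Writing each triple as a TAB-joined line then yields the TAB-separated distribution of the same data.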
For further information on how the dataset was constructed and on analyses that have been conducted on it, please refer to the accompanying Github repository at https://github.com/dimitargnikolov/twitter-bias.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Wiki-MID Dataset
Wiki-MID is a LOD-compliant multi-domain interest dataset for training and testing recommender systems. Our English dataset includes an average of 90 multi-domain preferences per user on music, books, movies, celebrities, sport, politics, and much more, for about half a million Twitter users traced during six months in 2017. Preferences are either extracted from messages of users who use Spotify, Goodreads, and other similar content sharing platforms, or induced from their "topical" friends, i.e., followees representing an interest rather than a social relation between peers. In addition, preferred items are matched with the Wikipedia articles describing them. This unique feature of our dataset provides a means to categorize preferred items by exploiting available semantic resources linked to Wikipedia, such as the Wikipedia Category Graph, DBpedia, BabelNet, and others.
Data model: Our resource is designed on top of the Semantically-Interlinked Online Communities (SIOC) core ontology. The SIOC ontology favors the inclusion of data mined from social network communities into the Linked Open Data (LOD) cloud. We represent Twitter users as instances of the SIOC UserAccount class. Topical users and message-based user interests are then associated, through the Simple Knowledge Organization System (SKOS) predicate relatedMatch, with a corresponding Wikipedia page as a result of our automated mapping methodology.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This study seeks to develop a method for identifying the occurrences and proportions of researchers, media, and other professionals active in Twitter discussions. As a case example, an anonymised dataset from Twitter vaccine discussions is used. The study proposes a method of using keywords as strings within lists to identify classes from user biographies. This provides a way to apply multiple classification principles to a set of Twitter biographies using semantic rules through the Python programming language. The script used for the study is deposited here.
Method development for Twitter biography classification concerning occurrences of academics, academically related groups and individuals, media, other groups and members of the general public. Written in the Python programming language.
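The keyword-list approach described above can be sketched as follows; the class names and keyword lists here are illustrative, not the ones used in the deposited script:

```python
# Illustrative keyword lists; the deposited script defines its own, more extensive lists.
CLASSES = {
    "academic": ["professor", "phd", "researcher", "lecturer"],
    "media": ["journalist", "reporter", "editor", "news"],
}

def classify_bio(bio):
    """Return every class whose keyword list matches the biography text."""
    bio = bio.lower()
    labels = [name for name, keywords in CLASSES.items()
              if any(keyword in bio for keyword in keywords)]
    return labels or ["general public"]  # fall back when no keyword matches
```

A biography can match several classes at once, which reflects the study's goal of applying multiple classification principles to the same set of biographies.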
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset was built to analyze the spread of misinformation about CoronaVac in Brazil by using data from Twitter for two specific events: the approval for emergency use in adults over 18 years old (January 17, 2021) and the approval for use in children aged 6 to 17 years (January 20, 2022).
We chose to label the original tweets with at least one retweet in the analyzed period. The manual labeling of such tweets was initially performed by two annotators with deep knowledge of the dataset and the considered context. In cases where there was no agreement between the two annotators, a third annotator was brought in to define the class of the tweet.
The final dataset contains 1,010 tweets from January 17, 2021, and 816 tweets from January 20, 2022.
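The tie-breaking protocol described above can be sketched as a small helper; the label names used here are illustrative:

```python
def resolve_label(annotator1, annotator2, annotator3=None):
    """Keep the label when the two primary annotators agree;
    otherwise the third annotator's label defines the class."""
    if annotator1 == annotator2:
        return annotator1
    return annotator3
```

With this rule, the third annotator is only consulted for the tweets on which the two primary annotators disagree.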
This dataset was originally built for a conference paper accepted at BraSNAM 2022. If you make use of the dataset, please also cite the following paper:
Gabriel P. Oliveira, Beatriz F. Paiva, Ana Paula Couto da Silva, and Mirella M. Moro. Characterizing the Diffusion of Misinformation Regarding the CoronaVac Vaccine in Brazil. In Proceedings of the XI Brazilian Workshop on Social Network Analysis and Mining (BraSNAM 2022), 2022.
@inproceedings{brasnam/OliveiraPSM22,
title = {Characterizing the Diffusion of Misinformation Regarding the CoronaVac Vaccine in Brazil},
author = {Gabriel P. Oliveira and
Beatriz F. Paiva and
Ana Paula Couto da Silva and
Mirella M. Moro},
booktitle = {Proceedings of the XI Brazilian Workshop on Social Network Analysis and Mining (BraSNAM)},
year = {2022}
}
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset was created as part of the Master's thesis titled "Multi-Class Depression Detection Through Tweets Using Artificial Intelligence." It contains tweets labeled for five types of depression (Bipolar, Major, Psychotic, Atypical, and Postpartum) using lexicons verified by psychiatrists.
Purpose: Designed for multi-class classification of depression using AI, focusing on Explainable AI for highlighting key words in the tweets influencing the predictions.
Applications: The dataset is suitable for research in natural language processing, sentiment analysis, mental health prediction, and Explainable AI.
This dataset is shared under the Creative Commons Attribution 4.0 International (CC BY) license, requiring proper attribution for any use or modification.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Please cite the following paper when using this dataset:
N. Thakur, “A Large-Scale Dataset of Twitter Chatter about Online Learning during the Current COVID-19 Omicron Wave,” Preprints, 2022, DOI: 10.20944/preprints202206.0146.v1
Abstract
The COVID-19 Omicron variant, reported to be the most immune-evasive variant of COVID-19, is resulting in a surge of COVID-19 cases globally. This has caused schools, colleges, and universities in different parts of the world to transition to online learning. As a result, social media platforms such as Twitter are seeing an increase in conversations related to online learning, centered around information seeking and sharing. Mining such conversations (i.e., Tweets) to develop a dataset can serve as a data resource for interdisciplinary research on the interest, views, opinions, perspectives, attitudes, and feedback towards online learning during the current surge of COVID-19 cases caused by the Omicron variant. Therefore, this work presents a large-scale public Twitter dataset of conversations about online learning since the first detected case of the COVID-19 Omicron variant in November 2021. The dataset files comprise a raw version containing 52,868 Tweet IDs (corresponding to the same number of Tweets) and a cleaned and preprocessed version containing 46,208 unique Tweet IDs. The dataset is compliant with the privacy policy, developer agreement, and guidelines for content redistribution of Twitter, as well as the FAIR (Findability, Accessibility, Interoperability, and Reusability) principles for scientific data management.
Data Description
The dataset comprises 7 .txt files. The raw version of this dataset comprises 6 .txt files (TweetIDs_Corona Virus.txt, TweetIDs_Corona.txt, TweetIDs_Coronavirus.txt, TweetIDs_Covid.txt, TweetIDs_Omicron.txt, and TweetIDs_SARS CoV2.txt) that contain Tweet IDs grouped together based on certain synonyms or terms that were used to refer to online learning and the Omicron variant of COVID-19 in the respective tweets. Table 1 shows the list of all the synonyms or terms that were used for the dataset development. The cleaned and preprocessed version of this dataset is provided in the .txt file - TweetIDs_Duplicates_Removed.txt. A description of these dataset files is provided in Table 2.
The dataset contains only Tweet IDs, in compliance with the terms and conditions mentioned in the privacy policy, developer agreement, and guidelines for content redistribution of Twitter. The Tweet IDs need to be hydrated before use; for hydrating this dataset, the Hydrator application may be used.
Table 1. List of commonly used synonyms, terms, and phrases for online learning and COVID-19 that were used for the dataset development
Terminology | List of synonyms and terms
COVID-19 | Omicron, COVID, COVID19, coronavirus, coronaviruspandemic, COVID-19, corona, coronaoutbreak, omicron variant, SARS CoV-2, corona virus
online learning | online education, online learning, remote education, remote learning, e-learning, elearning, distance learning, distance education, virtual learning, virtual education, online teaching, remote teaching, virtual teaching, online class, online classes, remote class, remote classes, distance class, distance classes, virtual class, virtual classes, online course, online courses, remote course, remote courses, distance course, distance courses, virtual course, virtual courses, online school, virtual school, remote school, online college, online university, virtual college, virtual university, remote college, remote university, online lecture, virtual lecture, remote lecture, online lectures, virtual lectures, remote lectures
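Based on Table 1, a tweet was included when it contained at least one COVID-19 term and at least one online learning term. A minimal sketch of that inclusion rule follows; the term lists are abbreviated, and the lowercase substring matching is an assumption about the actual collection pipeline:

```python
# Term lists from Table 1 (abbreviated; "covid" also covers COVID19 and COVID-19)
COVID_TERMS = ["omicron", "covid", "coronavirus", "coronaviruspandemic",
               "corona", "coronaoutbreak", "sars cov-2", "corona virus"]
LEARNING_TERMS = ["online education", "online learning", "remote education",
                  "remote learning", "e-learning", "elearning",
                  "distance learning", "virtual learning", "online class"]

def is_candidate(text):
    """True when the tweet text mentions both a COVID-19 term
    and an online learning term."""
    t = text.lower()
    return (any(term in t for term in COVID_TERMS)
            and any(term in t for term in LEARNING_TERMS))
```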
Table 2: Description of the dataset files along with the information about the number of Tweet IDs in each of them
Filename | No. of Tweet IDs | Description
TweetIDs_Corona Virus.txt | 321 | Tweet IDs of tweets that comprise the keyword "corona virus" and one or more keywords/terms that refer to online learning
TweetIDs_Corona.txt | 1819 | Tweet IDs of tweets that comprise the keyword "corona" or "coronaoutbreak" and one or more keywords/terms that refer to online learning
TweetIDs_Coronavirus.txt | 1429 | Tweet IDs of tweets that comprise the keyword "coronavirus" or "coronaviruspandemic" and one or more keywords/terms that refer to online learning
TweetIDs_Covid.txt | 41088 | Tweet IDs of tweets that comprise the keyword "COVID", "COVID19", or "COVID-19" and one or more keywords/terms that refer to online learning
TweetIDs_Omicron.txt | 8198 | Tweet IDs of tweets that comprise the keyword "omicron" or "omicron variant" and one or more keywords/terms that refer to online learning
TweetIDs_SARS CoV2.txt | 13 | Tweet IDs of tweets that comprise the keyword "SARS-CoV-2" and one or more keywords/terms that refer to online learning
TweetIDs_Duplicates_Removed.txt | 46208 | A collection of unique Tweet IDs from all six raw .txt files mentioned above after data preprocessing, data cleaning, and removal of duplicate tweets
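The relationship between the six raw files and TweetIDs_Duplicates_Removed.txt can be sketched as a simple merge-and-deduplicate step; file handling is omitted, and the toy lists below stand in for the per-keyword ID files:

```python
def deduplicate(id_lists):
    """Merge per-keyword Tweet ID lists, keeping the first occurrence of each ID."""
    seen, unique = set(), []
    for ids in id_lists:
        for tweet_id in ids:
            if tweet_id not in seen:
                seen.add(tweet_id)
                unique.append(tweet_id)
    return unique

# Toy example; the real input is the six raw TweetIDs_*.txt files
merged = deduplicate([["111", "222"], ["222", "333"], ["111", "444"]])
```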