CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Collection of 13M tweets divided into training, validation, and test sets for the purpose of predicting emoji from text and/or images. The data provides the tweet status ID and the emoji annotations associated with it. In the case of image-containing subsets, the image URL is also listed. The Full, unbalanced dataset consists of random test and validation sets of 1M tweets, with the remainder in the training set. The Balanced test set is a subset of the test set chosen to improve emoji class balance. The Image subsets are image-containing tweets. Finally, emoji_map_1791.csv provides information regarding the emoji labels and potential metadata.
The share of posts on microblogging platform Twitter that contain emojis has increased significantly over the past ten years. In July 2013, 4.25 percent of tweets contained at least one emoji. Just under one decade later, in March 2023, 26.7 percent of tweets contained an emoji. The most common reason for using emojis, according to users in the United States, was to make conversations more fun.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Coincidence matrix for tweets with emojis.
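As a rough illustration of what such a matrix holds, the following minimal sketch builds a coincidence matrix in the sense commonly used for agreement analysis: every item labeled twice contributes both ordered pairs of its labels, so the matrix is symmetric. The label set {-1, 0, +1}, the function name, and the sample pairs are assumptions made for illustration and are not taken from the dataset.

```python
from collections import Counter


def coincidence_matrix(pairs, labels):
    """Build a symmetric coincidence matrix from doubly-labeled items.

    pairs  : iterable of (label_a, label_b), one tuple per item
    labels : ordered list of the possible label values
    """
    counts = Counter()
    for a, b in pairs:
        # Count both ordered pairs so that the matrix is symmetric.
        counts[(a, b)] += 1
        counts[(b, a)] += 1
    return [[counts[(row, col)] for col in labels] for row in labels]


# Hypothetical labels: -1 = negative, 0 = neutral, +1 = positive.
labels = [-1, 0, 1]
pairs = [(1, 1), (1, 0), (0, 0), (-1, -1), (-1, 0)]

matrix = coincidence_matrix(pairs, labels)
for row in matrix:
    print(row)
```

The diagonal counts agreements (twice per agreeing item) and the off-diagonal cells count disagreements, so the matrix total equals twice the number of doubly-labeled items.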
The Twitter emoji dataset obtained from CodaLab comprises 50 thousand tweets along with the associated emoji labels. Each tweet in the dataset has a corresponding numerical label which maps to a specific emoji. The emojis are the 20 most frequent ones, and hence the labels range from 0 to 19.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
For each set, the mean, sd and sem are computed from the distribution of negative, neutral, and positive tweets.
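One plausible reading of this computation is sketched below, assuming that the negative, neutral, and positive tweets are scored as -1, 0, and +1, that sd is the population standard deviation, and that sem is sd divided by the square root of the number of tweets. The function name and the counts are hypothetical.

```python
import math


def sentiment_stats(n_neg, n_neu, n_pos):
    """Return (mean, sd, sem) for tweets scored -1, 0 and +1."""
    n = n_neg + n_neu + n_pos
    # Mean of the scores: only negative and positive tweets contribute.
    mean = (n_pos - n_neg) / n
    # Variance as E[c^2] - mean^2; c^2 is 1 for non-neutral tweets, else 0.
    variance = (n_neg + n_pos) / n - mean ** 2
    sd = math.sqrt(variance)
    # Standard error of the mean.
    sem = sd / math.sqrt(n)
    return mean, sd, sem


# Hypothetical counts of negative, neutral and positive tweets.
mean, sd, sem = sentiment_stats(20, 50, 30)
print(f"mean={mean:.3f} sd={sd:.3f} sem={sem:.3f}")
```

If the sample standard deviation is intended instead, the variance would be scaled by n / (n - 1) before taking the square root.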
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
The ITAmoji dataset collects 275,000 tweets that each contain one and only one emoji from among the 25 most frequent emojis. The dataset was created and used in the context of the ITAmoji task (https://sites.google.com/view/itamoji/), organised as part of EVALITA 2018 (http://www.evalita.it/2018). The task challenged participants to develop automatic systems that predict, given an Italian tweet, its most likely associated emoji, selected from a wide and heterogeneous emoji space. The dataset is split into a training set (250,000 tweets) and a test set (25,000 tweets).
In order to comply with GDPR privacy rules and Twitter’s policies, the identifiers of tweets and users have been anonymized and replaced by unique identifiers.
In January 2022, the face with tears of joy emoji was the most used emoji on Twitter, with a usage rate of 1.81 for every ten thousand tweets. Loudly crying face emoji followed, with a usage rate of 1.78. Other popular emojis on Twitter included sparkles, rolling on the floor laughing, pleading face, and the red heart emoji.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains 100-dimensional word embeddings trained on 48M Italian tweets using fastText and employed by our team to predict emojis during the ITAmoji competition of the EVALITA 2018 Evaluation Campaign.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset consists of 479 193 tweets, each of which contains one of the 31 gesture emoji (different hand configurations) and its skin tone modifier options (e.g. 🙏🙏🏿🙏🏾🙏🏽🙏🏼🙏🏻), posted in English within 250 km of San Jose, CA, and within 200 km of Los Angeles, CA, during May-August 2021. The dataset can be used to investigate the use of gesture emoji by English-speaking California Twitter users. Python libraries used for collecting and preprocessing tweets: tweepy, re, preprocessor, emoji, regex, string, nltk.
The dataset contains 12 columns:
tweet_original
original text of the tweet
preprocessed
preprocessed text of the tweet (4 steps)
all_emoji
lists all emoji in a given tweet
hashtags
lists all hashtags in a given tweet
user_encoded
encoded Twitter user name: the first 3 characters of the user name and the first 3 characters of the user's location
location_encoded
location of the user: "los_angeles", "san_diego", "san_jose", "san_francisco", "fresno", "long_beach", "sacramento", "oakland", "bakersfield", "anaheim", or "other"
mention_present
checks whether each tweet contains mentions
url_present
checks whether each tweet contains a URL
preprocess_tweet
preprocessing step 1: tokenizing mentions, urls, and hashtags
lowercase_tweet
preprocessing step 2: lowercasing
remove_punct_tweet
preprocessing step 3: removing punctuation
tokenize_tweet
preprocessing step 4: tokenizing
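The four preprocessing steps listed above can be sketched roughly as follows. This is a minimal illustration using only the standard-library re and string modules, not the authors' actual pipeline: the regular expressions, the placeholder words URL, MENTION, and HASHTAG, and whitespace tokenization are all assumptions.

```python
import re
import string

# Assumed patterns; the original pipeline may detect these differently.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess(tweet):
    """Apply the four steps and return each intermediate result."""
    # Step 1: replace URLs, mentions and hashtags with placeholder words.
    step1 = URL_RE.sub("URL", tweet)
    step1 = MENTION_RE.sub("MENTION", step1)
    step1 = HASHTAG_RE.sub("HASHTAG", step1)
    # Step 2: lowercase.
    step2 = step1.lower()
    # Step 3: remove ASCII punctuation (emoji are left untouched).
    step3 = step2.translate(PUNCT_TABLE)
    # Step 4: tokenize on whitespace.
    step4 = step3.split()
    return {
        "preprocess_tweet": step1,
        "lowercase_tweet": step2,
        "remove_punct_tweet": step3,
        "tokenize_tweet": step4,
    }


result = preprocess("Thanks @alice! 🙏 See https://example.com #grateful")
for column, value in result.items():
    print(column, "->", value)
```

The dictionary keys mirror the four step columns of the dataset so that each intermediate stage can be inspected separately.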
Further information on the research project can be found here: https://github.com/mzhukovaucsb/emoji_gestures/
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Coincidence matrix for tweets without emojis.
In July 2021, 20.69 percent of monitored tweets contained at least one emoji, up from 20.15 percent in July of the previous year. Between 2016 and 2021, emoji usage on the micro-blogging platform increased by over 42 percent. Overall, 2018 to 2019 saw the largest year-on-year increase in emoji usage on Twitter.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset consists of 48 838 tweets, each of which contains one of the 31 gesture emoji (different hand configurations) and its skin tone modifier options (e.g. 🙏🙏🏿🙏🏾🙏🏽🙏🏼🙏🏻), posted in Russian within 50 km of Moscow, Russia, during May-August 2021. The dataset can be used to investigate the use of gesture emoji by Russian users of the Twitter platform. Python libraries used for collecting and preprocessing tweets: tweepy, re, preprocessor, emoji, regex, string, nltk.
The dataset contains 12 columns:
tweet_original
original text of the tweet
preprocessed
preprocessed text of the tweet (4 steps)
all_emoji
lists all emoji in a given tweet
hashtags
lists all hashtags in a given tweet
user_encoded
encoded Twitter user name: the first 3 characters of the user name and the first 3 characters of the user's location
location_encoded
location of the user: "moscow", "moscow_region", or "other"
mention_present
checks whether each tweet contains mentions
url_present
checks whether each tweet contains a URL
preprocess_tweet
preprocessing step 1: tokenizing mentions, urls, and hashtags
lowercase_tweet
preprocessing step 2: lowercasing
remove_punct_tweet
preprocessing step 3: removing punctuation
tokenize_tweet
preprocessing step 4: tokenizing
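The user_encoded column described above combines the first 3 characters of the user name with the first 3 characters of the user's location. A minimal sketch of such an encoding is shown below; how the two parts are joined is not stated in the description, so the underscore separator and the function name are assumptions.

```python
def encode_user(user_name, location, sep="_"):
    """Join the first 3 characters of the user name and of the location.

    The separator is a hypothetical choice; slicing tolerates inputs
    shorter than 3 characters.
    """
    return f"{user_name[:3]}{sep}{location[:3]}"


# Hypothetical user name and location.
encoded = encode_user("example_user", "Moscow, Russia")
print(encoded)
```

Because only 3 characters of each field are kept, distinct users can share the same encoded value, so the column works as a coarse pseudonym rather than a unique key.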
Further information on the research project can be found here: https://github.com/mzhukovaucsb/emoji_gestures/
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Emojis are ordered by the number of occurrences N. The average position ranges from 0 (the beginning of the tweets) to 1 (the end of the tweets). p_c, c ∈ {−1, 0, +1}, are the negativity, neutrality, and positivity, respectively. is the sentiment score.
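A position scale from 0 to 1 can be obtained, for instance, by dividing the character index of each emoji occurrence by the index of the last character of the tweet. The sketch below takes that approach; the normalization, the function names, and the sample tweets are assumptions, and the code handles only emoji that are a single code point.

```python
def relative_positions(text, emoji_char):
    """Return the 0-to-1 position of every occurrence of emoji_char."""
    # Index of the last character; at least 1 to avoid division by zero.
    last = max(len(text) - 1, 1)
    return [i / last for i, ch in enumerate(text) if ch == emoji_char]


def average_position(tweets, emoji_char):
    """Average the relative positions over all occurrences in all tweets."""
    positions = [p for t in tweets for p in relative_positions(t, emoji_char)]
    return sum(positions) / len(positions) if positions else None


# Hypothetical tweets: one starts with the emoji, one ends with it.
tweets = ["😂 so funny", "so funny 😂"]
avg = average_position(tweets, "😂")
print(avg)
```

An emoji used mostly at the end of tweets would yield an average close to 1, and one used mostly at the start an average close to 0.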
Tweet IDs used to study emoji syntax - Covid
According to:
Pereira, A., & Pestana, G. (2022). Is There Meaning in the Emoji Sequences Used on Social Media? The Architecture of a Model for Emoji Sequences Analysis. In World Conference on Information Systems and Technologies (pp. 279–292). https://doi.org/10.1007/978-3-031-04819-7_28
Pereira, A., & Pestana, G. (2024). Syntax in Emoji Sequences on Social Media Posts. In World Conference on Information Systems and Technologies (pp. 97–107).
Pereira, A., Leite, M. C., & Pestana, G. (2024) [Forthcoming]. Analyzing Syntactic Patterns in Emoji Sequences on Social Media.
https://choosealicense.com/licenses/unknown/
Dataset Card for tweet_eval
Dataset Summary
TweetEval consists of seven heterogeneous tasks on Twitter, all framed as multi-class tweet classification. The tasks are irony, hate, offensive, stance, emoji, emotion, and sentiment. All tasks have been unified into the same benchmark, with each dataset presented in the same format and with fixed training, validation and test splits.
Supported Tasks and Leaderboards
text_classification: The dataset can be… See the full description on the dataset page: https://huggingface.co/datasets/cardiffnlp/tweet_eval.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Context I wrote a quick script to scrape 100k tweets that mentioned the keywords "New Year". I pulled these tweets from the Twitter API over the span of a couple of hours so there wouldn't be a clustering of tweets from a single timezone/country.
Content These tweets were all scraped from the Twitter API during the evening and night of December 31st, 2021. I ignored all tweets that were just retweets or quote tweets of other users.
Column 1 This column just keeps track of the tweet number in this dataset. Since the id column tracks the tweet id from Twitter and those numbers are quite large, I wanted something smaller to keep track of ids in this scope.
author_id This column is the unique id of the author of the tweet.
id This column is the tweet id provided by Twitter.
text The text of the tweet. Some tweets contain emojis, links, and mentions.
username The username of the author of the tweet.
Acknowledgements This dataset would not exist without the Twitter API.
Inspiration One of my main ideas for this data is a sentiment analysis of how people were feeling about the new year starting.
CC0
Original Data Source: New Years 2021 Tweets
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The agreement is computed in terms of three measures over a subset of tweets that were labeled by two different annotators.
https://creativecommons.org/publicdomain/zero/1.0/
This dataset consists of tweets scraped from Twitter containing the hashtag "videogames". There are 1135 tweets from August 2020 to December 2020. For simplicity of use, I have added another column that consists of clean tweets in tokenized form.
This dataset consists of 1135 tweets and 5 columns:
timestamp: Contains both the date in YYYY-MM-DD format and the time in HH:MM:SS format, from August 2020 to December 2020.
text: Tweets in their raw text format.
likes: Number of likes the tweet received.
retweets: Number of times the tweet was retweeted.
clean_text: Tweets after they were cleaned (punctuation, stopwords, emojis and URLs removed, lemmatized, tokenized).
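A cleaning pass of the kind described for clean_text might look roughly like the sketch below. It uses only the standard library, so several parts are simplifying assumptions: the stopword list is a tiny hypothetical one, emoji removal is approximated by dropping all non-ASCII characters, and lemmatization is omitted because it needs an external resource (for example, a lemmatizer from NLTK).

```python
import re
import string

# Tiny hypothetical stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"a", "an", "the", "is", "are", "this", "to", "of", "and", "in"}
URL_RE = re.compile(r"https?://\S+")
NON_ASCII_RE = re.compile(r"[^\x00-\x7F]+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_text(text):
    """Return a list of cleaned tokens (lemmatization not included)."""
    # Remove URLs first, before punctuation removal breaks them apart.
    text = URL_RE.sub(" ", text)
    # Crude emoji removal: drop every non-ASCII character.
    text = NON_ASCII_RE.sub(" ", text)
    # Lowercase and strip ASCII punctuation (this also removes '#').
    text = text.lower().translate(PUNCT_TABLE)
    # Tokenize on whitespace and drop stopwords.
    return [tok for tok in text.split() if tok not in STOPWORDS]


tokens = clean_text("Loving the new trailer! 🎮 #videogames https://example.com")
print(tokens)
```

Dropping all non-ASCII characters also removes accented letters, so a dedicated emoji pattern would be preferable for multilingual text.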
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Tweet IDs used to study emoji syntax - description
Tweet IDs used to study emoji syntax - Climate change
According to:
Pereira, A., & Pestana, G. (2022). Is There Meaning in the Emoji Sequences Used on Social Media? The Architecture of a Model for Emoji Sequences Analysis. In World Conference on Information Systems and Technologies (pp. 279–292). https://doi.org/10.1007/978-3-031-04819-7_28
Pereira, A., & Pestana, G. (2024). Syntax in Emoji Sequences on Social Media Posts. In World Conference on Information Systems and Technologies (pp. 97–107).
Pereira, A., Leite, M. C., & Pestana, G. (2024) [Forthcoming]. Analyzing Syntactic Patterns in Emoji Sequences on Social Media.