51 datasets found

Sentiment Analysis on Financial Tweets
kaggle.com
zip
Updated Sep 5, 2019
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Vivek Rathi (2019). Sentiment Analysis on Financial Tweets [Dataset]. https://www.kaggle.com/datasets/vivekrathi055/sentiment-analysis-on-financial-tweets
Explore at:
zip(2538259 bytes)Available download formats
Dataset updated
Sep 5, 2019
Authors
Vivek Rathi
License
http://opendatacommons.org/licenses/dbcl/1.0/http://opendatacommons.org/licenses/dbcl/1.0/
Description
Context

The following information can also be found at https://www.kaggle.com/davidwallach/financial-tweets. Out of curosity, I just cleaned the .csv files to perform a sentiment analysis. So both the .csv files in this dataset are created by me.

Anything you read in the description is written by David Wallach and using all this information, I happen to perform my first ever sentiment analysis.

"I have been interested in using public sentiment and journalism to gather sentiment profiles on publicly traded companies. I first developed a Python package (https://github.com/dwallach1/Stocker) that scrapes the web for articles written about companies, and then noticed the abundance of overlap with Twitter. I then developed a NodeJS project that I have been running on my RaspberryPi to monitor Twitter for all tweets coming from those mentioned in the content section. If one of them tweeted about a company in the stocks_cleaned.csv file, then it would write the tweet to the database. Currently, the file is only from earlier today, but after about a month or two, I plan to update the tweets.csv file (hopefully closer to 50,000 entries.

I am not quite sure how this dataset will be relevant, but I hope to use these tweets and try to generate some sense of public sentiment score."

Content

This dataset has all the publicly traded companies (tickers and company names) that were used as input to fill the tweets.csv. The influencers whose tweets were monitored were: ['MarketWatch', 'business', 'YahooFinance', 'TechCrunch', 'WSJ', 'Forbes', 'FT', 'TheEconomist', 'nytimes', 'Reuters', 'GerberKawasaki', 'jimcramer', 'TheStreet', 'TheStalwart', 'TruthGundlach', 'Carl_C_Icahn', 'ReformedBroker', 'benbernanke', 'bespokeinvest', 'BespokeCrypto', 'stlouisfed', 'federalreserve', 'GoldmanSachs', 'ianbremmer', 'MorganStanley', 'AswathDamodaran', 'mcuban', 'muddywatersre', 'StockTwits', 'SeanaNSmith'

Acknowledgements

The data used here is gathered from a project I developed : https://github.com/dwallach1/StockerBot

Inspiration

I hope to develop a financial sentiment text classifier that would be able to track Twitter's (and the entire public's) feelings about any publicly traded company (and cryptocurrency)
Z
Data from: IA Tweets Analysis Dataset (Spanish)
data.niaid.nih.gov
Updated Aug 3, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Serrano-Fernández, Alejandro (2024). IA Tweets Analysis Dataset (Spanish) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10821484
Explore at:
Dataset updated
Aug 3, 2024
Dataset provided by
Guerrero-Contreras, Gabriel
Balderas-Díaz, Sara
Muñoz, Andrés
Serrano-Fernández, Alejandro
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
General Description

This dataset comprises 4,038 tweets in Spanish, related to discussions about artificial intelligence (AI), and was created and utilized in the publication "Enhancing Sentiment Analysis on Social Media: Integrating Text and Metadata for Refined Insights," (10.1109/IE61493.2024.10599899) presented at the 20th International Conference on Intelligent Environments. It is designed to support research on public perception, sentiment, and engagement with AI topics on social media from a Spanish-speaking perspective. Each entry includes detailed annotations covering sentiment analysis, user engagement metrics, and user profile characteristics, among others.

Data Collection Method

Tweets were gathered through the Twitter API v1.1 by targeting keywords and hashtags associated with artificial intelligence, focusing specifically on content in Spanish. The dataset captures a wide array of discussions, offering a holistic view of the Spanish-speaking public's sentiment towards AI.

Dataset Content

ID: A unique identifier for each tweet.

text: The textual content of the tweet. It is a string with a maximum allowed length of 280 characters.

polarity: The tweet's sentiment polarity (e.g., Positive, Negative, Neutral).

favorite_count: Indicates how many times the tweet has been liked by Twitter users. It is a non-negative integer.

retweet_count: The number of times this tweet has been retweeted. It is a non-negative integer.

user_verified: When true, indicates that the user has a verified account, which helps the public recognize the authenticity of accounts of public interest. It is a boolean data type with two allowed values: True or False.

user_default_profile: When true, indicates that the user has not altered the theme or background of their user profile. It is a boolean data type with two allowed values: True or False.

user_has_extended_profile: When true, indicates that the user has an extended profile. An extended profile on Twitter allows users to provide more detailed information about themselves, such as an extended biography, a header image, details about their location, website, and other additional data. It is a boolean data type with two allowed values: True or False.

user_followers_count: The current number of followers the account has. It is a non-negative integer.

user_friends_count: The number of users that the account is following. It is a non-negative integer.

user_favourites_count: The number of tweets this user has liked since the account was created. It is a non-negative integer.

user_statuses_count: The number of tweets (including retweets) posted by the user. It is a non-negative integer.

user_protected: When true, indicates that this user has chosen to protect their tweets, meaning their tweets are not publicly visible without their permission. It is a boolean data type with two allowed values: True or False.

user_is_translator: When true, indicates that the user posting the tweet is a verified translator on Twitter. This means they have been recognized and validated by the platform as translators of content in different languages. It is a boolean data type with two allowed values: True or False.

Cite as

Guerrero-Contreras, G., Balderas-Díaz, S., Serrano-Fernández, A., & Muñoz, A. (2024, June). Enhancing Sentiment Analysis on Social Media: Integrating Text and Metadata for Refined Insights. In 2024 International Conference on Intelligent Environments (IE) (pp. 62-69). IEEE.

Potential Use Cases

This dataset is aimed at academic researchers and practitioners with interests in:

Sentiment analysis and natural language processing (NLP) with a focus on AI discussions in the Spanish language.

Social media analysis on public engagement and perception of artificial intelligence among Spanish speakers.

Exploring correlations between user engagement metrics and sentiment in discussions about AI.

Data Format and File Type

The dataset is provided in CSV format, ensuring compatibility with a wide range of data analysis tools and programming environments.

License

The dataset is available under the Creative Commons Attribution 4.0 International (CC BY 4.0) license, permitting sharing, copying, distribution, transmission, and adaptation of the work for any purpose, including commercial, provided proper attribution is given.
X/Twitter: Countries with the largest audience 2025
statista.com
Updated Jun 19, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Statista (2025). X/Twitter: Countries with the largest audience 2025 [Dataset]. https://www.statista.com/statistics/242606/number-of-active-twitter-users-in-selected-countries/
Explore at:
Dataset updated
Jun 19, 2025
Dataset authored and provided by
Statistahttp://statista.com/
Time period covered
Feb 2025
Area covered
Worldwide
Description
Social network X/Twitter is particularly popular in the United States, and as of February 2025, the microblogging service had an audience reach of 103.9 million users in the country. Japan and the India were ranked second and third with more than 70 million and 25 million users respectively. Global Twitter usage As of the second quarter of 2021, X/Twitter had 206 million monetizable daily active users worldwide. The most-followed Twitter accounts include figures such as Elon Musk, Justin Bieber and former U.S. president Barack Obama. X/Twitter and politics X/Twitter has become an increasingly relevant tool in domestic and international politics. The platform has become a way to promote policies and interact with citizens and other officials, and most world leaders and foreign ministries have an official Twitter account. Former U.S. president Donald Trump used to be a prolific Twitter user before the platform permanently suspended his account in January 2021. During an August 2018 survey, 61 percent of respondents stated that Trump's use of Twitter as President of the United States was inappropriate.
u
Data from: IA Tweets Analysis Dataset (Spanish)
produccioncientifica.uca.es
Updated 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Guerrero-Contreras, Gabriel; Balderas-Díaz, Sara; Serrano-Fernández, Alejandro; Muñoz, Andrés; Guerrero-Contreras, Gabriel; Balderas-Díaz, Sara; Serrano-Fernández, Alejandro; Muñoz, Andrés (2024). IA Tweets Analysis Dataset (Spanish) [Dataset]. https://produccioncientifica.uca.es/documentos/67321e53aea56d4af04854c2
Explore at:
Dataset updated
2024
Authors
Guerrero-Contreras, Gabriel; Balderas-Díaz, Sara; Serrano-Fernández, Alejandro; Muñoz, Andrés; Guerrero-Contreras, Gabriel; Balderas-Díaz, Sara; Serrano-Fernández, Alejandro; Muñoz, Andrés
Description
Cite as

Guerrero-Contreras, G., Balderas-Díaz, S., Serrano-Fernández, A., & Muñoz, A. (2024, June). Enhancing Sentiment Analysis on Social Media: Integrating Text and Metadata for Refined Insights. In 2024 International Conference on Intelligent Environments (IE) (pp. 62-69). IEEE.

General Description

This dataset comprises 4,038 tweets in Spanish, related to discussions about artificial intelligence (AI), and was created and utilized in the publication "Enhancing Sentiment Analysis on Social Media: Integrating Text and Metadata for Refined Insights," (10.1109/IE61493.2024.10599899) presented at the 20th International Conference on Intelligent Environments. It is designed to support research on public perception, sentiment, and engagement with AI topics on social media from a Spanish-speaking perspective. Each entry includes detailed annotations covering sentiment analysis, user engagement metrics, and user profile characteristics, among others.

Data Collection Method

Tweets were gathered through the Twitter API v1.1 by targeting keywords and hashtags associated with artificial intelligence, focusing specifically on content in Spanish. The dataset captures a wide array of discussions, offering a holistic view of the Spanish-speaking public's sentiment towards AI.

Dataset Content

ID: A unique identifier for each tweet.

text: The textual content of the tweet. It is a string with a maximum allowed length of 280 characters.

polarity: The tweet's sentiment polarity (e.g., Positive, Negative, Neutral).

favorite_count: Indicates how many times the tweet has been liked by Twitter users. It is a non-negative integer.

retweet_count: The number of times this tweet has been retweeted. It is a non-negative integer.

user_verified: When true, indicates that the user has a verified account, which helps the public recognize the authenticity of accounts of public interest. It is a boolean data type with two allowed values: True or False.

user_default_profile: When true, indicates that the user has not altered the theme or background of their user profile. It is a boolean data type with two allowed values: True or False.

user_has_extended_profile: When true, indicates that the user has an extended profile. An extended profile on Twitter allows users to provide more detailed information about themselves, such as an extended biography, a header image, details about their location, website, and other additional data. It is a boolean data type with two allowed values: True or False.

user_followers_count: The current number of followers the account has. It is a non-negative integer.

user_friends_count: The number of users that the account is following. It is a non-negative integer.

user_favourites_count: The number of tweets this user has liked since the account was created. It is a non-negative integer.

user_statuses_count: The number of tweets (including retweets) posted by the user. It is a non-negative integer.

user_protected: When true, indicates that this user has chosen to protect their tweets, meaning their tweets are not publicly visible without their permission. It is a boolean data type with two allowed values: True or False.

user_is_translator: When true, indicates that the user posting the tweet is a verified translator on Twitter. This means they have been recognized and validated by the platform as translators of content in different languages. It is a boolean data type with two allowed values: True or False.

Potential Use Cases

This dataset is aimed at academic researchers and practitioners with interests in:

Sentiment analysis and natural language processing (NLP) with a focus on AI discussions in the Spanish language.

Social media analysis on public engagement and perception of artificial intelligence among Spanish speakers.

Exploring correlations between user engagement metrics and sentiment in discussions about AI.

Data Format and File Type

The dataset is provided in CSV format, ensuring compatibility with a wide range of data analysis tools and programming environments.

License

The dataset is available under the Creative Commons Attribution 4.0 International (CC BY 4.0) license, permitting sharing, copying, distribution, transmission, and adaptation of the work for any purpose, including commercial, provided proper attribution is given.
X/Twitter: number of worldwide users 2019-2024
statista.com
Updated Dec 13, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Statista (2022). X/Twitter: number of worldwide users 2019-2024 [Dataset]. https://www.statista.com/statistics/303681/twitter-users-worldwide/
Explore at:
Dataset updated
Dec 13, 2022
Dataset authored and provided by
Statistahttp://statista.com/
Time period covered
Dec 2022
Area covered
Worldwide
Description
As of December 2022, X/Twitter's audience accounted for over *** million monthly active users worldwide. This figure was projected to ******** to approximately *** million by 2024, a ******* of around **** percent compared to 2022.
Z
Data from: GeoCoV19: A Dataset of Hundreds of Millions of Multilingual...
data.niaid.nih.gov
zenodo.org
Updated Jun 16, 2020
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Muhammad Imran (2020). GeoCoV19: A Dataset of Hundreds of Millions of Multilingual COVID-19 Tweets with Location Information [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3878598
Explore at:
Dataset updated
Jun 16, 2020
Dataset provided by
Umair Qazi
Muhammad Imran
Ferda Ofli
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
We present GeoCoV19, a large-scale Twitter dataset related to the ongoing COVID-19 pandemic. The dataset has been collected over a period of 90 days from February 1 to May 1, 2020 and consists of more than 524 million multilingual tweets. As the geolocation information is essential for many tasks such as disease tracking and surveillance, we employed a gazetteer-based approach to extract toponyms from user location and tweet content to derive their geolocation information using the Nominatim (Open Street Maps) data at different geolocation granularity levels. In terms of geographical coverage, the dataset spans over 218 countries and 47K cities in the world. The tweets in the dataset are from more than 43 million Twitter users, including around 209K verified accounts. These users posted tweets in 62 different languages.
o
Twitter User Profile Data
opendatabay.com
.undefined
Updated Jul 6, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Datasimple (2025). Twitter User Profile Data [Dataset]. https://www.opendatabay.com/data/ai-ml/68edb66a-4b34-4f73-a3ff-4b45a9bcc140
Explore at:
.undefinedAvailable download formats
Dataset updated
Jul 6, 2025
Dataset authored and provided by
Datasimple
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Area covered
Social Media and Networking
Description
This dataset provides detailed information on social media users, specifically those from Twitter. It was created using the Tweepy API and is a foundational resource for understanding user behaviour and network characteristics. The dataset is suitable for analysing user profiles and their publicly available activities, offering insights into various user attributes.

Columns

Index: A unique identifier for each user.

Name: The name of the user, which has been hidden due to privacy reasons.

Followers_count: The total number of followers associated with the user's account.

Verified: A boolean indicator specifying whether the user's account is verified.

Protected: A boolean indicator showing if the user's account is private or public.

Location: The self-reported geographical location of the user.

Status_count: The total number of tweets posted by the user.

Description: The biographical information or profile description provided by the user.

Distribution

The data files are typically provided in CSV format. This dataset contains approximately 2,065 individual user records. Sample files will be updated separately to the platform.

Usage

This dataset is ideal for: * Analysing social media user behaviour, patterns, and trends. * Developing and testing Natural Language Processing (NLP) models, particularly on user biographies and profile descriptions. * Conducting research into user verification statuses and account privacy settings. * Exploring the geographical distribution and self-reported locations of users. * Building comprehensive user profiles for targeted analysis or application development.

Coverage

The dataset's geographic scope is global, encompassing users from various regions around the world. Notable geographic data includes specific locations reported by users, with India being an example of a significant represented region. The dataset represents a snapshot of user information; the exact time range of data capture is not specified. Demographic coverage is limited to publicly accessible user profile information.

License

CC0

Who Can Use It

This dataset is valuable for: * Data Scientists who aim to build models related to user engagement and behaviour. * Researchers focusing on online social networks and digital demographics. * Developers requiring user profile information for integration into applications. * Students learning about data analysis, social media data, and NLP techniques.

Dataset Name Suggestions

Twitter User Profile Data

Social Media User Analytics

Tweepy User Statistics

Global Twitter User Data

Attributes

Original Data Source: Twitter Dataset
o
Twitter Dataset - Over 200,000 Tweets containing the word "Vaccine" for...
explore.openaire.eu
Updated Jan 1, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Aldo Averza (2022). Twitter Dataset - Over 200,000 Tweets containing the word "Vaccine" for research porpuses [Dataset]. http://doi.org/10.5281/zenodo.5888306
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.5888306
Dataset updated
Jan 1, 2022
Authors
Aldo Averza
Description
This dataset contains 220,085 tweets containing the word vaccine between December 9th and December 18th 2021 at different times during each day, extracted using the Twitter API v2. Each tweet was extracted at least 3 days after its initial posting time in order to register 3 days of engagements, and it doesn't include retweets. Includes: Tweet ID Text Author ID Date Like count Retweet count Quote count Reply count User data (Followers, Following, Tweet count, Account creation date, Verified status) Usernames are hidden for privacy reasons
Elon Musk Tweets
kaggle.com
Updated Jun 13, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Gabriel Preda (2023). Elon Musk Tweets [Dataset]. http://doi.org/10.34740/kaggle/dsv/5917319
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.34740/kaggle/dsv/5917319
Dataset updated
Jun 13, 2023
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Gabriel Preda
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Context

Tweets by Elon Musk are very popular, he is currently one of the most followed users on Twitter, with >100M followers. He is also constantly tweeting, and therefore the content generated is really interesting.

https://storage.googleapis.com/kaggle-datasets-images/2342642/4438151/919dc4a0c5b82411dcb1a7064b67855c/dataset-cover.png?t=2022-11-03-08-15-31" alt="">

Content

This dataset is collected daily using tweepy and Twitter API The source of the dataset is: public tweets by Elon Musk

Data columns

The following columns are included: - ID
- User name
- User location
- User description
- User created
- User followers
- User friends
- User favorites
- User verified
- Date
- Text
- Hashtags
- Source
- Retweets
- Is retweet

Ideas for analysis

You can use this dataset (daily updated) to test your skills with NLP tools and techniques.
o
Capitol Protest Social Media Dataset
opendatabay.com
.undefined
Updated Jul 6, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Datasimple (2025). Capitol Protest Social Media Dataset [Dataset]. https://www.opendatabay.com/data/ai-ml/f0aef256-125d-4396-8e94-aee208b5281d
Explore at:
.undefinedAvailable download formats
Dataset updated
Jul 6, 2025
Dataset authored and provided by
Datasimple
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Area covered
Social Media and Networking
Description
This dataset provides over 80,000 tweets from 6th January 2021, the day of the Capitol Hill riots. Created using the Twitter Developer API and Tweepy, it offers valuable social media data for analysis. While not as extensive as Parler data dumps, it is well-suited for Natural Language Processing (NLP) tasks. The tweets have had mentions, hyperlinks, emojis, and punctuation removed, and all text is converted to lowercase for consistency. Some tweets include geographical coordinates if users had geotagging enabled. Information on verified users is included via their usernames, and user location is provided based on their self-reported profile details, with blanks for locations outside of US states or DC.

Columns

id: A unique identifier for each tweet.

text: The content of the tweet.

query: The search query used to retrieve the tweet.

usr_id: A unique identifier for the user who posted the tweet.

username: The username associated with the tweet's author.

followers: The number of followers the user has.

tweet count: The total number of tweets posted by the user.

number of likes: The total count of likes the tweet received.

number of retweets: The total count of retweets the tweet received.

location: The user's self-reported location from their profile.

Distribution

The dataset is provided as a CSV file and contains over 80,000 individual tweet records. Each record is structured according to the columns listed above, offering a clear tabular format for data manipulation and analysis.

Usage

This dataset is ideal for a range of analytical applications, particularly for those with NLP experience. Potential use cases include: * Sentiment analysis of public opinion surrounding the Capitol Riot. * Topic modelling to identify key themes and narratives in the social discourse. * Trend analysis of how discussions evolved throughout the day. * Social network analysis focusing on user behaviour, verified accounts, and location-based insights related to the event. * Academic research into political events and social media's role.

Coverage

The dataset's time range is strictly limited to 6th January 2021. Geographic coverage is primarily based on user-reported locations within the US (including DC), with some tweets containing precise coordinates if geotagging was active. Demographic scope includes information on verified users. Notably, the raw text data has been pre-processed, meaning mentions, hyperlinks, emojis, and punctuation have been removed, and all text is in lowercase.

License

CC0

Who Can Use It

This dataset is suitable for: * NLP practitioners and data scientists seeking real-world social media data for model training and linguistic analysis. * Researchers and academics in fields such as political science, sociology, and media studies, for investigating public discourse during significant events. * Journalists and analysts interested in understanding the social media landscape of the Capitol Riot.

Dataset Name Suggestions

Capitol Riot Tweets 2021

January 6th Twitter Data

US Capitol Protest Social Media

2021 Capitol Hill Tweets Dataset

Attributes

Original Data Source: Capitol Riot Tweets
o
FPL Tweets Dataset
opendatabay.com
.undefined
Updated Jun 10, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Datasimple (2025). FPL Tweets Dataset [Dataset]. https://www.opendatabay.com/data/web-social/b10e09b2-82f1-47a3-912d-17aaacea2633
Explore at:
.undefinedAvailable download formats
Dataset updated
Jun 10, 2025
Dataset authored and provided by
Datasimple
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Area covered
Social Media and Networking
Description
What is FPL? Fantasy Premier League (FPL) is an online fantasy football game based on the English Premier League. In the game, participants select a squad of real-life Premier League players and earn points based on their performances in actual matches.

Here are some facts about FPL:

FPL has over 9 million registered users worldwide, making it one of the most popular fantasy sports games in the world. The budget for each FPL team is £100.0m, with the most expensive player being Mohamed Salah at £13 million for the current season. The highest-scoring FPL player of all time is again Mohamed Salah, who scored 303 points in the 2017/18 season. Content This dataset contains a collection of tweets with keywords Fantasy Premier League and FPL. The tweets were scraped using the snscrape library. Check out the Tutorial Notebook

The dataset includes the following information for each tweet:

ID: The unique identifier for the tweet. Timestamp: The date and time when the tweet was posted. User: The Twitter handle of the user who posted the tweet. Text: The content of the tweet. Hashtag: The hashtags included in the tweet, if any. Retweets: The number of times the tweet has been retweeted as of the time it was scraped. Likes: The number of likes the tweet has received as of the time it was scraped. Replies: The number of replies to the tweet as of the time it was scraped. Source: The source application or device used to post the tweet. Location: The location listed on the user's Twitter profile, if any. Verified_Account: A Boolean value indicating whether the user's Twitter account has been verified. Followers: The number of followers the user has as of the time the tweet was scraped. Following: The number of accounts the user is following as of the time the tweet was scraped The dataset provides a glimpse into the online chatter related to Fantasy Premier League and can be used for various natural language processing and machine learning tasks, such as sentiment analysis, topic modeling, and more. It allows an understanding of the community, the level of interest, and the experience of playing FPL.

Original Data Source: FPL Tweets Dataset
Twitter users in Brazil 2019-2028
statista.com
Updated Jul 9, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Statista (2025). Twitter users in Brazil 2019-2028 [Dataset]. https://www.statista.com/forecasts/1146589/twitter-users-in-brazil
Explore at:
Dataset updated
Jul 9, 2025
Dataset authored and provided by
Statistahttp://statista.com/
Area covered
Brazil
Description
The number of Twitter users in Brazil was forecast to continuously increase between 2024 and 2028 by in total *** million users (+***** percent). After the ninth consecutive increasing year, the Twitter user base is estimated to reach ***** million users and therefore a new peak in 2028. Notably, the number of Twitter users of was continuously increasing over the past years.User figures, shown here regarding the platform twitter, have been estimated by taking into account company filings or press material, secondary research, app downloads and traffic data. They refer to the average monthly active users over the period.The shown data are an excerpt of Statista's Key Market Indicators (KMI). The KMI are a collection of primary and secondary indicators on the macro-economic, demographic and technological environment in up to *** countries and regions worldwide. All indicators are sourced from international and national statistical offices, trade associations and the trade press and they are processed to generate comparable data sets (see supplementary notes under details for more information).
#ChatGPT 1000 Daily 🐦 Tweets
kaggle.com
Updated May 14, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Enric Domingo (2023). #ChatGPT 1000 Daily 🐦 Tweets [Dataset]. http://doi.org/10.34740/kaggle/dsv/5685262
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.34740/kaggle/dsv/5685262
Dataset updated
May 14, 2023
Dataset provided by
Kaggle
Authors
Enric Domingo
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
UPDATE: Due to new Twitter API conditions changed by Elon Musk, now it's no longer free to use the Twitter (X) API and the pricing is 100 $/month in the hobby plan. So my automated ETL notebook stopped from updating new tweets to this dataset on May 13th 2023.

This dataset is was updated everyday with the addition of 1000 tweets/day containing any of the words "ChatGPT", "GPT3", or "GPT4", starting from the 3rd of April 2023. Everyday's tweets are uploaded 24-72h later, so the counter on tweets' likes, retweets, messages and impressions gets enough time to be relevant. Tweets are from any language selected randomly from all hours of the day. There are some basic filters applied trying to discard sensitive tweets and spam.

This dataset can be used for many different applications regarding to Data Analysis and Visualization but also NLP Sentiment Analysis techniques and more.

Consider upvoting this Dataset and the ETL scheduled Notebook providing new data everyday into it if you found them interesting, thanks! 🤗

Columns Description:

tweet_id: Integer. unique identifier for each tweet. Older tweets have smaller IDs.

tweet_created: Timestamp. Time of the tweet's creation.

tweet_extracted: Timestamp. The UTC time when the ETL pipeline pulled the tweet and its metadata (likes count, retweets count, etc).

text: String. The raw payload text from the tweet.

lang: String. Short name for the Tweet text's language.

user_id: Integer. Twitter's unique user id.

user_name: String. The author's public name on Twitter.

user_username: String. The author's Twitter account username (@example)

user_location: String. The author's public location.

user_description: String. The author's public profile's bio.

user_created: Timestamp. Timestamp of user's Twitter account creation.

user_followers_count: Integer. The number of followers of the author's account at the moment of the tweet extraction

user_following_count: Integer. The number of followed accounts from the author's account at the moment of the Tweet extraction

user_tweet_count: Integer. The number of Tweets that the author has published at the moment of the Tweet extraction.

user_verified: Boolean. True if the user is verified (blue mark).

source: The device/app used to publish the tweet (Apparently not working, all values are Nan so far).

retweet_count: Integer. Number of retweets to the Tweet at the moment of the Tweet extraction.

like_count: Integer. Number of Likes to the Tweet at the moment of the Tweet extraction.

reply_count: Integer. Number of reply messages to the Tweet.

impression_count: Integer. Number of times the Tweet has been seen at the moment of the Tweet extraction.

More info: Tweets API info definition: https://developer.twitter.com/en/docs/twitter-api/data-dictionary/object-model/tweet Users API info definition: https://developer.twitter.com/en/docs/twitter-api/data-dictionary/object-model/user
o
Data from: Twitter data reveal six distinct environmental personas
explore.openaire.eu
search.dataone.org
+1more
Updated May 9, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Charlotte Chang; Paul Armsworth; Yuta Masuda (2022). Twitter data reveal six distinct environmental personas [Dataset]. http://doi.org/10.5061/dryad.79cnp5ht0
Explore at:
Unique identifier
https://doi.org/10.5061/dryad.79cnp5ht0
Dataset updated
May 9, 2022
Authors
Charlotte Chang; Paul Armsworth; Yuta Masuda
Description
Replication materials documentation for "Twitter data reveal six distinct environmental personas" This replication code and dataset accompanies the manuscript linked at: https://doi.org/10.1002/fee.2510. I provide a description of the replication datasets below and include a SHA256 checksum that you can use to ensure the integrity of the downloaded file (please execute shasum -a 256 FILENAME in the command line to verify, or use some other utility to find the SHA256 checksum for each file). ### Datasets * MeanViewpoints.tsv contains the mean and standard error of the mean (SEM) for the issue viewpoints shown in Figure 1 for the six personas in a "long data" format * Columns: * variable: name of the environmental issue * mean: persona-level mean viewpoint value for that issue * SEM: standard error of the mean * Persona: abbreviated name for the six personas (SMA: Smart alecks, GEN: Generalists, STE: Stewards, CLC: Climate concerned, TEC: Technocrats, RES: Reserved) * SHA256: 5d540edcb39c8d7a14db315b5eaeed83689021ec43cbe16a1c7eb4467c943098 * UserTweetIDs.txt contains one tweet ID per user of the 1+ million users in our sample. These tweet IDs can be "hydrated" and used to find the users sampled in our study. * TweetID: single column listing one tweet ID per user * SHA256: fbe1da240a5ab9d9aebac0aabbde247e6eeebfa77c5471cdb6136f45110b1111 * EnvironmentalPundits.tsv contains the user names and IDs for the environmental pundits whose timelines were used as the data source to train the probabilistic latent Dirichlet allocation topic model. * Columns: * Screenname: User name (e.g. GretaThunberg, which you can use to navigate to twitter.com/GretaThunberg) * ID: User ID * SHA256: a6a987d934dea75e8ba2329820d6cfe354af0991f2bdbd4746b0f83ad6dafaa3 * Persona_PoliticalIdeology.tsv provides the mean political ideology score for the six personas * Columns: * mean: mean political ideology score * SEM: standard error of the mean * Persona: abbreviated name for the six personas * SHA256: e568d9737cbd7c0b1b1ce61a6c9c8294f14a62d934446cc0d618ebf091bf1a13 * US_geography.tsv shows the state-level ranks for each persona * Columns: * name: State name * Persona: abbreviated name for the six personas * Rank: Rank for the 50 states (+ Washington DC) * SHA256: 8ba7e0ca437639656e25a473c4aec281e828e59941af847d8865bf4eddf1371d ### Code * Scraper.py provides code that can be used to obtain user information from the UserTweetIDs.txt data file above to reproduce the user set in our analysis. * Plotting.R provides code to reproduce the plots in the main text. Effective digital environmental communication is integral to galvanizing public support for conservation in the age of social media. Environmental advocates require messaging strategies suited to social media platforms, including ways to identify, target, and mobilize distinct audiences. Here, we provide – to the best of our knowledge – the first systematic characterization of environmental personas on social media. Beginning with 1 million environmental nongovernmental organization (NGO) followers on Twitter, of which 500,000 users met data quality criteria, we identified six personas that differ in their expression of 21 environmental issues. General consistency in the proportional composition of personas was detected across 14 countries with sufficiently large samples. Within the US, although the six personas varied in their mean political ideology, we did not observe that the personas split along political party lines. Our results pave the way for environmental advocates – including NGOs, public agencies, and researchers – to use audience segmentation methods like the one discussed here to target and tailor messages to distinct constituencies at speed and scale. This repository contains several tabular files that can be used to query user data from Twitter or reproduce the main results in the main text of the article. These data may only be used for publically accessible research and may not be used for private, for-profit use. This replication code and dataset accompanies the manuscript linked at: https://doi.org/10.1002/fee.2510. Please see the main text of the article and Supplementary Information for more details on how the data were gathered and processed.
X/Twitter users in the United Kingdom 2019-2028
statista.com
Updated Jan 13, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Statista Research Department (2025). X/Twitter users in the United Kingdom 2019-2028 [Dataset]. https://www.statista.com/topics/11843/x-formerly-twitter-in-the-united-kingdom-uk/
Explore at:
Dataset updated
Jan 13, 2025
Dataset provided by
Statistahttp://statista.com/
Authors
Statista Research Department
Area covered
United Kingdom
Description
The number of Twitter users in the United Kingdom was forecast to continuously increase between 2024 and 2028 by in total 0.9 million users (+5.1 percent). After the ninth consecutive increasing year, the Twitter user base is estimated to reach 18.55 million users and therefore a new peak in 2028. Notably, the number of Twitter users of was continuously increasing over the past years.User figures, shown here regarding the platform twitter, have been estimated by taking into account company filings or press material, secondary research, app downloads and traffic data. They refer to the average monthly active users over the period.The shown data are an excerpt of Statista's Key Market Indicators (KMI). The KMI are a collection of primary and secondary indicators on the macro-economic, demographic and technological environment in up to 150 countries and regions worldwide. All indicators are sourced from international and national statistical offices, trade associations and the trade press and they are processed to generate comparable data sets (see supplementary notes under details for more information).
m
Tracking the Global Pulse: The first public Twitter dataset from FIFA World...
data.mendeley.com
Updated May 27, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
kheir eddine daouadi (2025). Tracking the Global Pulse: The first public Twitter dataset from FIFA World Cup [Dataset]. http://doi.org/10.17632/gw3mcnbkwr.2
Explore at:
Unique identifier
https://doi.org/10.17632/gw3mcnbkwr.2
Dataset updated
May 27, 2025
Authors
kheir eddine daouadi
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
World
Description
The first public large-scale multilingual Twitter dataset related to the FIFA World Cup 2022, comprising over 28 million posts in 69 unique spoken languages, including Arabic, English, Spanish, French, and many others. This dataset aims to facilitate research in future sentiment analysis, cross-linguistic studies, event-based analytics, meme and hate speech detection, fake news detection, and social manipulation detection.

The file 🚨Qatar22WC.csv🚨 🚀Codebook for | Column Name | Description| |-------------------------------- | day, | hou, | age_of_the_user_account | tweet_count | location | follower_count | following_count | follower_to_Following | favouite_count | verified | Avg_tweet_count | list_count | Tweet_Id | is_reply_tweet | is_quote | retid | lang | hashtags | is_image, | is_video |------------------------ contains tweet-level and user-level metadata for our collected tweets. FIFA World Cup 2022 Twitter Dataset🚀 |----------------------------------------------------------------------------------------| month, year | The date where the tweet posted | min, sec | Hour, minute, and second of tweet timestamp | | User Account age in days | | Total number of tweets posted by the user | | User-defined location field | | Number of followers the user has | | Number of accounts the user is following | | Follower-following ratio | | Number of likes the user did| | Boolean indicating if the user is verified (1 = Verified, 0 = Not Verified) | | Average tweets per day for the user activity| | Number of lists the user is a member | | Tweet ID | | ID of the tweet being replied to (if applicable) | | boolean representing if the tweet is a quote | | Retweet ID if it's a retweet; NaN otherwise | | Language of the tweet | | The keyword or hashtag used to collect the tweet | | Boolean indicating if the tweet associated with image| | Boolean indicating if the tweet associated with video | -------|----------------------------------------------------------------------------------------|

Examples of use case queries are described in the file 🚨fifa_wc_qatar22_examples_of_use_case_queries.ipynb🚨 and accessible via: https://github.com/khairied/Qata_FIFA_World_Cup_22

🚀 Please Cite This as: Daouadi, K. E., Boualleg, Y., Guehairia, O. & Taleb-Ahmed, A. (2025). Tracking the Global Pulse: The first public Twitter dataset from FIFA World Cup, Journal of Computational Social Science.
h
Supporting data for "A Meta-Intervention: Quantifying the Impact of Social...
datahub.hku.hk
Updated May 23, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mingzhe Quan (2025). Supporting data for "A Meta-Intervention: Quantifying the Impact of Social Media Information on Adherence to Non-Pharmaceutical Interventions" [Dataset]. http://doi.org/10.25442/hku.29068061.v1
Explore at:
Unique identifier
https://doi.org/10.25442/hku.29068061.v1
Dataset updated
May 23, 2025
Dataset provided by
HKU Data Repository
Authors
Mingzhe Quan
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
This dataset supports a research project in the field of digital medicine, which aims to quantify the impact of disseminating scientific information on social media—as a form of "meta-intervention"—on public adherence to Non-Pharmaceutical Interventions (NPIs) during health crises such as the COVID-19 pandemic. The research encompasses multiple sub-studies and pilot experiments, drawing data from various global and China-specific social media platforms.The data included in this submission has been collected from several sources:From Sina Weibo and Tencent WeChat, 189 online poll datasets were collected, involving a total of 1,391,706 participants. These participants are users of Sina Weibo or Tencent WeChat.From Twitter, 187 tweets published by scientists (verified with a blue checkmark) related to COVID-19 were collected.From Xiaohongshu and Bilibili, textual content from 143 user posts/videos concerning COVID-19, along with associated user comments and specific user responses to a question, were gathered.It is important to note that while the broader research project also utilized a 3TB Reddit corpus hosted on Academic Torrents (academictorrents.com), this specific Reddit dataset is publicly available directly from Academic Torrents and is not included in this particular DataHub submission. The submitted dataset comprises publicly available data, formatted as Excel files (.xlsx), and includes the following:Filename: scientists' discourse (source from screenshot of tweets)Description: This file contains screenshots of tweets published by scientists on Twitter concerning COVID-19 research, its current status, and related topics. It also includes a coded analysis of the textual content from these tweets. Specific details regarding the coding scheme can be found in the readme.txt file.Filename: The links of online polls (Weibo & WeChat)Description: This data file includes information from online polls conducted on Weibo and WeChat after December 7, 2022. These polls, often initiated by verified users (who may or may not be science popularizers), aimed to track the self-reported proportion of participants testing positive for COVID-19 (via PCR or rapid antigen test) or remaining negative, particularly during periods of rapid Omicron infection spread. The file contains links to the original polls, links to the social media accounts that published these polls, and relevant metadata about both the poll-creating accounts and the online polls themselves.Filename: Online posts & comments (From Xiaohongshu & Bilibili)Description: This file contains textual content from COVID-19 related posts and videos published by users on the Xiaohongshu and Bilibili platforms. It also includes user-generated comments reacting to these posts/videos, as well as user responses to a specific question posed within the context of the original content.Key Features of this Dataset:Data Type: Mixed, including textual data, screenshots of social media posts, web links to original sources, and coded metadata.Source Platforms: Twitter (global), Weibo/WeChat (primarily China), Xiaohongshu (China), and Bilibili (video-sharing platform, primarily China).Use Case: This dataset is intended for the analysis of public discourse, the dissemination of scientific information, and user engagement patterns across different cultural contexts and social media platforms, particularly in relation to public health information.
Verification Account Management System (VAMS)
catalog.data.gov
datasets.ai
+2more
Updated Jul 4, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Social Security Administration (2025). Verification Account Management System (VAMS) [Dataset]. https://catalog.data.gov/dataset/verification-account-management-system-vams
Explore at:
Dataset updated
Jul 4, 2025
Dataset provided by
Social Security Administrationhttp://ssa.gov/
Description
The Verification Account Management System (VAMS) is the centralized location for maintaining SSA's verification and data exchange accounts. VAMS account management functionalities include: creating new accounts, selecting account parameters, searching for existing accounts, updating current accounts and generating MI reports. All accounts in VAMS are issued a unique Verification Account Number (VAN). This VAN is used to determine account status prior to processing the verification or data exchange request. Currently all requests from Enumeration Verification Systems (EVS), Numident Online Verification Utility (NOVU), State Verification Exchange System (SVES), and State Online Query System (SOLQ) are verified in VAMS. In addition, VAMS interfaces with the Data Exchanges and Verifications Online (DEVO) application as it stores the account parameters used for parameter driven processing.
H
Data from: Database of Indian Social Media Influencers on Twitter
dataverse.harvard.edu
search.dataone.org
Updated Mar 23, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Arshia Arya; Dibyendu Mishra; Gazal Shekhawat; Ankur Sharma; Anmol Panda; Faisal M Lalani; Parantak Singh; Ramaravind Kommiya Mothilal; Rynaa Grover; Sachita Nishal; Saloni Dash; Shehla Rashid Shora; Syeda Zainab Akbar; Joyojeet Pal (2022). Database of Indian Social Media Influencers on Twitter [Dataset]. http://doi.org/10.7910/DVN/T2CFHO
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.7910/DVN/T2CFHO
Dataset updated
Mar 23, 2022
Dataset provided by
Harvard Dataverse
Authors
Arshia Arya; Dibyendu Mishra; Gazal Shekhawat; Ankur Sharma; Anmol Panda; Faisal M Lalani; Parantak Singh; Ramaravind Kommiya Mothilal; Rynaa Grover; Sachita Nishal; Saloni Dash; Shehla Rashid Shora; Syeda Zainab Akbar; Joyojeet Pal
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Area covered
India
Description
Databases of highly networked individuals have been indispensable in studying narratives and influence on social media. To support studies on Twitter in India, we present a systematically categorized database of accounts of influence on Twitter in India, identified and annotated through an iterative process of friends, networks, and self-described profile information, verified manually. We built an initial set of accounts based on the friend network of a seed set of accounts based on real-world renown in various fields, and then snowballed friends of friends" multiple times, and rank ordered individuals based on the number of in-group connections, and overall followers. We then manually classified identified accounts under the categories of entertainment, sports, business, government, institutions, journalism, civil society accounts that have independent standing outside of social media, as well as a category ofdigital first" referring to accounts that derive their primary influence from online activity. Overall, we annotated 11580 unique accounts across all categories. The database is useful studying various questions related to the role of influencers in polarisation, misinformation, extreme speech, political discourse etc.

Data from: Analyzing Mentions of Death in Covid-19 Tweets

zenodo.org

Updated Jul 6, 2024

Facebook

Twitter

Click to copy link

Link copied

Cite

Divya Mani Adhikari; Divya Mani Adhikari; Muhammad Imran; Muhammad Imran; Umair Qazi; Umair Qazi; Ingmar Weber; Ingmar Weber (2024). Analyzing Mentions of Death in Covid-19 Tweets [Dataset]. http://doi.org/10.5281/zenodo.10839649

Explore at:

Unique identifier

https://doi.org/10.5281/zenodo.10839649

Dataset updated

Jul 6, 2024

Dataset provided by

Zenodohttp://zenodo.org/

Authors

Divya Mani Adhikari; Divya Mani Adhikari; Muhammad Imran; Muhammad Imran; Umair Qazi; Umair Qazi; Ingmar Weber; Ingmar Weber

License

Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically

Description

Dataset preparation and annotation

The dataset is a subset of the TBCOV dataset collected at QCRI filtered for mentions of personally related COVID-19 deaths. The filtering was done using regular expressions such as my * passed, my * died, my * succumbed & lost * battle. A sample of the dataset was annotated on Appen. Please see 'annotation-instructions.txt' for the full instructions provided to the annotators.

Dataset description

The "classifier_filtered_english.csv" file contains 33k deduplicated and classifier-filtered tweets (following X's content redistribution policy). for the 6 countries (Australia, Canada, India, Italy, United Kingdom, and United States) from March 2020 to March 2021 with classifier-labeled death labels, regular expression-filtered gender and relationship labels, and the user device label. The full 57k regex-filtered collection of tweets can be made available on special cases for Academics and Researchers.

date: the date of the tweet

country_name: the country name from Nominatim API

tweet_id: the ID of the tweet

url: the full URL of the tweet

full_text: the full-text content of the tweet (also includes the URL of any media attached)

does_the_tweet_refer_to_the_covidrelated_death_of_one_or_more_individuals_personally_known_to_the_tweets_author: the classifier predicted label for the death (also includes the original labels for the annotated samples)

what_is_the_relationship_between_the_tweets_author_and_the_victim_mentioned: the annotated relationship labels

relative_to_the_time_of_the_tweet_when_did_the_mentioned_death_occur: the annotated relative time labels

user_is_verified: if the user is verified or not

user_gender: the gender of the Twitter user (from the user profile)

user_device: the Twitter client the user uses

has_media: if the tweet has any attached media

has_url: if the tweet text contains a URL

matched_device: the device (Apple or Android) based on the Twitter client

regex_gender: the gender inferred from regular expression-based filtering

regex_relationship: the relationship label from regular expression-based filtering

Inferring gender using regular expressions

We first determine the mapping between different relationship labels mentioned in the tweet to the gender. We do not use any relationship like "cousin" from which we cannot easily infer the gender.

Male relationships: 'father', 'dad', 'daddy', 'papa', 'pop', 'pa', 'son', 'brother', 'uncle', 'nephew', 'grandfather', 'grandpa', 'gramps', 'husband', 'boyfriend', 'fiancé', 'groom', 'partner', 'beau', 'friend', 'buddy', 'pal', 'mate', 'companion', 'boy', 'gentleman', 'man', 'father-in-law', 'brother-in-law', 'stepfather', 'stepbrother'

Female relationships: 'mother', 'mom', 'mama', 'mum', 'ma', 'daughter', 'sister', 'aunt', 'niece', 'grandmother', 'grandma', 'granny', 'wife', 'girlfriend', 'fiancée', 'bride', 'partner', 'girl', 'lady', 'woman', 'miss', 'mother-in-law', 'sister-in-law', 'stepmother', 'stepsister'

Based on these mappings, we used the following regex for each gender label to determine the gender of the deceased mentioned in the tweet.

"[m|M]y\s(" + "|".join([r + "s?" for r in relationships]) + ")\s(died|succumbed|deceased)"

Age groups from relationship labels

First, we get the relationship labels using regex filtering, and then we group them into different age-group categories as shown in the following table. The UK and the US use different age groups because of the different age group definitions in the official data.

Category	Relationship (from tweets)	Age Group (UK)	Age Group (US)
Grandparents	grandfather, grandmother	65+	65+
Parents	father, mother, uncle, aunt	45-64	35-64
Siblings	brother, sister, cousin	15-44	15-34
Children	son, daughter, nephew, niece	0-14	0-14

Training the classifier

The 'english-training.csv' file contains about 13k deduplicated human-annotated tweets. We use a random seed (42) to create the train/test split. The model Covid-Bert-V2 was fine-tuned on the training set for 2 epochs with the following hyperparameters (obtained using 10-fold CV): random_seed: 42, batch_size: 32, dropout: 0.1. We obtained a F1-score of 0.81 on the test set. We used about 5% (671) of the combined and deduplicated annotated tweets as the test set, about 2% (255) as the validation set, and the remaining 12,494 tweets were used for fine-tuning the model. The tweets were preprocessed to replace mentions, URLs, emojis, etc with generic keywords. The model was trained on a system with a single Nvidia A4000 16GB GPU. The fine-tuned model is also available as the 'model.bin' file. The code for finetuning the model as well as reproducing the experiments are available in this GitHub repository.

Datasheet

We also include a datasheet for the dataset following the recommendation of "Datasheets for Datasets" (Gebru et. al.) which provides more information about how the dataset was created and how it can be used. Please see "Datasheet.pdf".

NOTE: We recommend that researchers try to rehydrate the individual tweets to ensure that the user has not deleted the tweet since posting. This gives users a mechanism to opt out of having their data analyzed.

Please only use your institutional email when requesting the dataset as anything else (like gmail.com) will be rejected. The dataset will only be made available on reasonable request for Academics and Researchers. Please mention why you need the dataset and how you plan to use the dataset when making a request.

Facebook

Twitter

Click to copy link

Link copied

Cite

Vivek Rathi (2019). Sentiment Analysis on Financial Tweets [Dataset]. https://www.kaggle.com/datasets/vivekrathi055/sentiment-analysis-on-financial-tweets

Sentiment Analysis on Financial Tweets

Tweets from verified users(Cleaner version of ....davidwallach/financial-tweets)

Explore at:

zip(2538259 bytes)Available download formats

Dataset updated

Sep 5, 2019

Authors

Vivek Rathi

License

http://opendatacommons.org/licenses/dbcl/1.0/http://opendatacommons.org/licenses/dbcl/1.0/

Description

Context

The following information can also be found at https://www.kaggle.com/davidwallach/financial-tweets. Out of curosity, I just cleaned the .csv files to perform a sentiment analysis. So both the .csv files in this dataset are created by me.

Anything you read in the description is written by David Wallach and using all this information, I happen to perform my first ever sentiment analysis.

"I have been interested in using public sentiment and journalism to gather sentiment profiles on publicly traded companies. I first developed a Python package (https://github.com/dwallach1/Stocker) that scrapes the web for articles written about companies, and then noticed the abundance of overlap with Twitter. I then developed a NodeJS project that I have been running on my RaspberryPi to monitor Twitter for all tweets coming from those mentioned in the content section. If one of them tweeted about a company in the stocks_cleaned.csv file, then it would write the tweet to the database. Currently, the file is only from earlier today, but after about a month or two, I plan to update the tweets.csv file (hopefully closer to 50,000 entries.

I am not quite sure how this dataset will be relevant, but I hope to use these tweets and try to generate some sense of public sentiment score."

Content

This dataset has all the publicly traded companies (tickers and company names) that were used as input to fill the tweets.csv. The influencers whose tweets were monitored were: ['MarketWatch', 'business', 'YahooFinance', 'TechCrunch', 'WSJ', 'Forbes', 'FT', 'TheEconomist', 'nytimes', 'Reuters', 'GerberKawasaki', 'jimcramer', 'TheStreet', 'TheStalwart', 'TruthGundlach', 'Carl_C_Icahn', 'ReformedBroker', 'benbernanke', 'bespokeinvest', 'BespokeCrypto', 'stlouisfed', 'federalreserve', 'GoldmanSachs', 'ianbremmer', 'MorganStanley', 'AswathDamodaran', 'mcuban', 'muddywatersre', 'StockTwits', 'SeanaNSmith'

Acknowledgements

The data used here is gathered from a project I developed : https://github.com/dwallach1/StockerBot

Inspiration

I hope to develop a financial sentiment text classifier that would be able to track Twitter's (and the entire public's) feelings about any publicly traded company (and cryptocurrency)

Clear search

Close search

Google apps

Main menu

Sentiment Analysis on Financial Tweets

Context

Content

Acknowledgements

Inspiration

Data from: IA Tweets Analysis Dataset (Spanish)

X/Twitter: Countries with the largest audience 2025

Data from: IA Tweets Analysis Dataset (Spanish)

X/Twitter: number of worldwide users 2019-2024

Data from: GeoCoV19: A Dataset of Hundreds of Millions of Multilingual...

Twitter User Profile Data

Columns

Distribution

Usage

Coverage

License

Who Can Use It

Dataset Name Suggestions

Attributes

Twitter Dataset - Over 200,000 Tweets containing the word "Vaccine" for...

Elon Musk Tweets

Context

Content

Data columns

Ideas for analysis

Capitol Protest Social Media Dataset

Columns

Distribution

Usage

Coverage

License

Who Can Use It

Dataset Name Suggestions

Attributes

FPL Tweets Dataset

Twitter users in Brazil 2019-2028

#ChatGPT 1000 Daily 🐦 Tweets

Columns Description:

Data from: Twitter data reveal six distinct environmental personas

X/Twitter users in the United Kingdom 2019-2028

Tracking the Global Pulse: The first public Twitter dataset from FIFA World...

Supporting data for "A Meta-Intervention: Quantifying the Impact of Social...

Verification Account Management System (VAMS)

Data from: Database of Indian Social Media Influencers on Twitter

Data from: Analyzing Mentions of Death in Covid-19 Tweets

Dataset preparation and annotation

Dataset description

Inferring gender using regular expressions

Age groups from relationship labels

Training the classifier

Datasheet

Sentiment Analysis on Financial Tweets

Tweets from verified users(Cleaner version of ....davidwallach/financial-tweets)

Context

Content

Acknowledgements

Inspiration