Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These are the key Twitter user statistics that you need to know.
https://brightdata.com/licensehttps://brightdata.com/license
Utilize our Twitter dataset for diverse applications to enrich business strategies and market insights. Analyzing this dataset provides a comprehensive understanding of social media trends, empowering organizations to refine their communication and marketing strategies. Access the entire dataset or customize a subset to fit your needs. Popular use cases include market research to identify trending topics and hashtags, AI training by reviewing factors such as tweet content, retweets, and user interactions for predictive analytics, and trend forecasting by examining correlations between specific themes and user engagement to uncover emerging social media preferences.
The number of Twitter users in the United States was forecast to continuously increase between 2024 and 2028 by in total 4.3 million users (+5.32 percent). After the ninth consecutive increasing year, the Twitter user base is estimated to reach 85.08 million users and therefore a new peak in 2028. Notably, the number of Twitter users of was continuously increasing over the past years.User figures, shown here regarding the platform twitter, have been estimated by taking into account company filings or press material, secondary research, app downloads and traffic data. They refer to the average monthly active users over the period.The shown data are an excerpt of Statista's Key Market Indicators (KMI). The KMI are a collection of primary and secondary indicators on the macro-economic, demographic and technological environment in up to 150 countries and regions worldwide. All indicators are sourced from international and national statistical offices, trade associations and the trade press and they are processed to generate comparable data sets (see supplementary notes under details for more information).Find more key insights for the number of Twitter users in countries like Canada and Mexico.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These Twitter user statistics will give you the complete story of where Twitter is at today and what the future looks like for the social media company.
https://academictorrents.com/nolicensespecifiedhttps://academictorrents.com/nolicensespecified
Twitter is a social news website. It can be viewed as a hybrid of email, instant messaging and sms messaging all rolled into one neat and simple package. It s a new and easy way to discover the latest news related to subjects you care about. |Attribute|Value| |-|-| |Number of Nodes: |11316811| |Number of Edges: |85331846| |Missing Values? |no| |Source:| N/A| ##Data Set Information: 1. nodes.csv — it s the file of all the users. This file works as a dictionary of all the users in this data set. It s useful for fast reference. It contains all the node ids used in the dataset 2. edges.csv — this is the friendship/followership network among the users. The friends/followers are represented using edges. Edges are directed. Here is an example. 1,2 This means user with id "1" is followering user with id "2". ##Attribute Information: Twitter is a social news website. It can be viewed as a hybrid of email, instant messaging and sms messaging all rolled into one ne
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
The dataset consists of various columns containing information related to tweets posted on Twitter. Each row in the dataset represents a single tweet. Here's an explanation of the columns in the dataset from a third-person perspective:
Tweet: This column contains the actual text content of the tweet. It includes the message that the user posted on Twitter. Tweets can vary in length from a few characters to the maximum allowed by Twitter.
Sentiment: This column indicates the sentiment or emotional tone of the tweet. Sentiment can be classified into categories such as positive, negative, or neutral. It reflects the overall opinion or attitude expressed in the tweet.
Username: This column contains the username of the Twitter account that posted the tweet. Each Twitter user has a unique username that identifies their account.
Timestamp: This column contains the timestamp indicating when the tweet was posted. It includes information about the date and time when the tweet was published on Twitter.
Retweets: This column represents the number of times the tweet has been retweeted by other Twitter users. A retweet is when a user shares another user's tweet with their followers.
Likes: This column indicates the number of likes or favorites received by the tweet. Users can express their appreciation for a tweet by liking it.
Hashtags: This column contains any hashtags included in the tweet. Hashtags are keywords or phrases preceded by the "#" symbol, used to categorize or label tweets and make them more discoverable.
Mentions: This column includes any Twitter usernames mentioned in the tweet. Mentions are when a user tags another user in their tweet by including their username preceded by the "@" symbol.
Location: This column provides information about the location associated with the tweet. It may include details such as the city, state, country, or geographical coordinates from which the tweet was posted, if available.
Source: This column specifies the source or platform used to post the tweet. It indicates whether the tweet was posted from the Twitter website, a mobile app, or a third-party application.
https://www.gesis.org/en/institute/data-usage-termshttps://www.gesis.org/en/institute/data-usage-terms
At the end of October 2022, Elon Musk concluded his acquisition of Twitter. In the weeks and months before that, several questions were publicly discussed that were not only of interest to the platform's future buyers, but also of high relevance to the Computational Social Science research community. For example, how many active users does the platform have? What percentage of accounts on the site are bots? And, what are the dominating topics and sub-topical spheres on the platform? In a globally coordinated effort of 80 scholars to shed light on these questions, and to offer a dataset that will equip other researchers to do the same, we have collected 375 million tweets published within a 24-hour time period starting on September 21, 2022. To the best of our knowledge, this is the first complete 24-hour Twitter dataset that is available for the research community. With it, the present work aims to accomplish two goals. First, we seek to answer the aforementioned questions and provide descriptive metrics about Twitter that can serve as references for other researchers. Second, we create a baseline dataset for future research that can be used to study the potential impact of the platform's ownership change.
Due to the relevance of the COVID-19 global pandemic, we are releasing our dataset of tweets acquired from the Twitter Stream related to COVID-19 chatter. The first 9 weeks of data (from January 1st, 2020 to March 11th, 2020) contain very low tweet counts as we filtered other data we were collecting for other research purposes, however, one can see the dramatic increase as the awareness for the virus spread. Dedicated data gathering started from March 11th to March 22nd which yielded over 4 million tweets a day.
The data collected from the stream captures all languages, but the higher prevalence are: English, Spanish, and French. We release all tweets and retweets on the full_dataset.tsv file (40,823,816 unique tweets), and a cleaned version with no retweets on the full_dataset-clean.tsv file (7,479,940 unique tweets). There are several practical reasons for us to leave the retweets, tracing important tweets and their dissemination is one of them. For NLP tasks we provide the top 1000 frequent terms in frequent_terms.csv, the top 1000 bigrams in frequent_bigrams.csv, and the top 1000 trigrams in frequent_trigrams.csv. Some general statistics per day are included for both datasets in the statistics-full_dataset.tsv and statistics-full_dataset-clean.tsv files.
More details can be found (and will be updated faster at: https://github.com/thepanacealab/covid19_twitter)
As always, the tweets distributed here are only tweet identifiers (with date and time added) due to the terms and conditions of Twitter to re-distribute Twitter data. The need to be hydrated to be used.
http://rightsstatements.org/vocab/InC/1.0/http://rightsstatements.org/vocab/InC/1.0/
This dataset comprises a set of information cascades generated by Singapore Twitter users. Here a cascade is defined as a set of tweets about the same topic. This dataset was collected via the Twitter REST and streaming APIs in the following way. Starting from popular seed users (i.e., users having many followers), we crawled their follow, retweet, and user mention links. We then added those followers/followees, retweet sources, and mentioned users who state Singapore in their profile location. With this, we have a total of 184,794 Twitter user accounts. Then tweets are crawled from these users from 1 April to 31 August 2012. In all, we got 32,479,134 tweets. To identify cascades, we extracted all the URL links and hashtags from the above tweets. And these URL links and hashtags are considered as the identities of cascades. In other words, all the tweets which contain the same URL link (or the same hashtag) represent a cascade. Mathematically, a cascade is represented as a set of user-timestamp pairs. Figure 1 provides an example, i.e. cascade C = {< u1, t1 >, < u2, t2 >, < u1, t3 >, < u3, t4 >, < u4, t5 >}. For evaluation, the dataset was split into two parts: four months data for training and the last one month data for testing. Table 1summarizes the basic (count) statistics of the dataset. Each line in each file represents a cascade. The first term in each line is a hashtag or URL, the second term is a list of user-timestamp pairs. Due to privacy concerns, all user identities are anonymized.
https://brightdata.com/licensehttps://brightdata.com/license
Leverage our Twitter profiles dataset for a wide range of applications to enhance business strategies and market insights. Analyzing this dataset offers a deep understanding of user demographics, engagement patterns, and online behavior, enabling organizations to optimize their communication and marketing strategies. Access the complete dataset or tailor a subset to meet your specific requirements. Popular use cases include market research to identify influential profiles and emerging audiences, AI training by analyzing follower demographics and engagement data for predictive modeling, and trend forecasting by examining correlations between user bios, activity levels, and growth metrics to uncover evolving social media dynamics.
Author: Víctor Yeste. Universitat Politècnica de Valencia.The object of this study is the design of a cybermetric methodology whose objectives are to measure the success of the content published in online media and the possible prediction of the selected success variables.In this case, due to the need to integrate data from two separate areas, such as web publishing and the analysis of their shares and related topics on Twitter, has opted for programming as you access both the Google Analytics v4 reporting API and Twitter Standard API, always respecting the limits of these.The website analyzed is hellofriki.com. It is an online media whose primary intention is to solve the need for information on some topics that provide daily a vast number of news in the form of news, as well as the possibility of analysis, reports, interviews, and many other information formats. All these contents are under the scope of the sections of cinema, series, video games, literature, and comics.This dataset has contributed to the elaboration of the PhD Thesis:Yeste Moreno, VM. (2021). Diseño de una metodología cibermétrica de cálculo del éxito para la optimización de contenidos web [Tesis doctoral]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/176009Data have been obtained from each last-minute news article published online according to the indicators described in the doctoral thesis. All related data are stored in a database, divided into the following tables:tesis_followers: User ID list of media account followers.tesis_hometimeline: data from tweets posted by the media account sharing breaking news from the web.status_id: Tweet IDcreated_at: date of publicationtext: content of the tweetpath: URL extracted after processing the shortened URL in textpost_shared: Article ID in WordPress that is being sharedretweet_count: number of retweetsfavorite_count: number of favoritestesis_hometimeline_other: data from tweets posted by the media account that do not share breaking news from the web. Other typologies, automatic Facebook shares, custom tweets without link to an article, etc. With the same fields as tesis_hometimeline.tesis_posts: data of articles published by the web and processed for some analysis.stats_id: Analysis IDpost_id: Article ID in WordPresspost_date: article publication date in WordPresspost_title: title of the articlepath: URL of the article in the middle webtags: Tags ID or WordPress tags related to the articleuniquepageviews: unique page viewsentrancerate: input ratioavgtimeonpage: average visit timeexitrate: output ratiopageviewspersession: page views per sessionadsense_adunitsviewed: number of ads viewed by usersadsense_viewableimpressionpercent: ad display ratioadsense_ctr: ad click ratioadsense_ecpm: estimated ad revenue per 1000 page viewstesis_stats: data from a particular analysis, performed at each published breaking news item. Fields with statistical values can be computed from the data in the other tables, but total and average calculations are saved for faster and easier further processing.id: ID of the analysisphase: phase of the thesis in which analysis has been carried out (right now all are 1)time: "0" if at the time of publication, "1" if 14 days laterstart_date: date and time of measurement on the day of publicationend_date: date and time when the measurement is made 14 days latermain_post_id: ID of the published article to be analysedmain_post_theme: Main section of the published article to analyzesuperheroes_theme: "1" if about superheroes, "0" if nottrailer_theme: "1" if trailer, "0" if notname: empty field, possibility to add a custom name manuallynotes: empty field, possibility to add personalized notes manually, as if some tag has been removed manually for being considered too generic, despite the fact that the editor put itnum_articles: number of articles analysednum_articles_with_traffic: number of articles analysed with traffic (which will be taken into account for traffic analysis)num_articles_with_tw_data: number of articles with data from when they were shared on the media’s Twitter accountnum_terms: number of terms analyzeduniquepageviews_total: total page viewsuniquepageviews_mean: average page viewsentrancerate_mean: average input ratioavgtimeonpage_mean: average duration of visitsexitrate_mean: average output ratiopageviewspersession_mean: average page views per sessiontotal: total of ads viewedadsense_adunitsviewed_mean: average of ads viewedadsense_viewableimpressionpercent_mean: average ad display ratioadsense_ctr_mean: average ad click ratioadsense_ecpm_mean: estimated ad revenue per 1000 page viewsTotal: total incomeretweet_count_mean: average incomefavorite_count_total: total of favoritesfavorite_count_mean: average of favoritesterms_ini_num_tweets: total tweets on the terms on the day of publicationterms_ini_retweet_count_total: total retweets on the terms on the day of publicationterms_ini_retweet_count_mean: average retweets on the terms on the day of publicationterms_ini_favorite_count_total: total of favorites on the terms on the day of publicationterms_ini_favorite_count_mean: average of favorites on the terms on the day of publicationterms_ini_followers_talking_rate: ratio of followers of the media Twitter account who have recently published a tweet talking about the terms on the day of publicationterms_ini_user_num_followers_mean: average followers of users who have spoken of the terms on the day of publicationterms_ini_user_num_tweets_mean: average number of tweets published by users who spoke about the terms on the day of publicationterms_ini_user_age_mean: average age in days of users who have spoken of the terms on the day of publicationterms_ini_ur_inclusion_rate: URL inclusion ratio of tweets talking about terms on the day of publicationterms_end_num_tweets: total tweets on terms 14 days after publicationterms_ini_retweet_count_total: total retweets on terms 14 days after publicationterms_ini_retweet_count_mean: average retweets on terms 14 days after publicationterms_ini_favorite_count_total: total bookmarks on terms 14 days after publicationterms_ini_favorite_count_mean: average of favorites on terms 14 days after publicationterms_ini_followers_talking_rate: ratio of media Twitter account followers who have recently posted a tweet talking about the terms 14 days after publicationterms_ini_user_num_followers_mean: average followers of users who have spoken of the terms 14 days after publicationterms_ini_user_num_tweets_mean: average number of tweets published by users who have spoken about the terms 14 days after publicationterms_ini_user_age_mean: the average age in days of users who have spoken of the terms 14 days after publicationterms_ini_ur_inclusion_rate: URL inclusion ratio of tweets talking about terms 14 days after publication.tesis_terms: data of the terms (tags) related to the processed articles.stats_id: Analysis IDtime: "0" if at the time of publication, "1" if 14 days laterterm_id: Term ID (tag) in WordPressname: Name of the termslug: URL of the termnum_tweets: number of tweetsretweet_count_total: total retweetsretweet_count_mean: average retweetsfavorite_count_total: total of favoritesfavorite_count_mean: average of favoritesfollowers_talking_rate: ratio of followers of the media Twitter account who have recently published a tweet talking about the termuser_num_followers_mean: average followers of users who were talking about the termuser_num_tweets_mean: average number of tweets published by users who were talking about the termuser_age_mean: average age in days of users who were talking about the termurl_inclusion_rate: URL inclusion ratio
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
If you use the dataset, cite the paper: https://doi.org/10.1016/j.eswa.2022.117541
The most comprehensive dataset to date regarding climate change and human opinions via Twitter. It has the heftiest temporal coverage, spanning over 13 years, includes over 15 million tweets spatially distributed across the world, and provides the geolocation of most tweets. Seven dimensions of information are tied to each tweet, namely geolocation, user gender, climate change stance and sentiment, aggressiveness, deviations from historic temperature, and topic modeling, while accompanied by environmental disaster events information. These dimensions were produced by testing and evaluating a plethora of state-of-the-art machine learning algorithms and methods, both supervised and unsupervised, including BERT, RNN, LSTM, CNN, SVM, Naive Bayes, VADER, Textblob, Flair, and LDA.
The following columns are in the dataset:
➡ created_at: The timestamp of the tweet. ➡ id: The unique id of the tweet. ➡ lng: The longitude the tweet was written. ➡ lat: The latitude the tweet was written. ➡ topic: Categorization of the tweet in one of ten topics namely, seriousness of gas emissions, importance of human intervention, global stance, significance of pollution awareness events, weather extremes, impact of resource overconsumption, Donald Trump versus science, ideological positions on global warming, politics, and undefined. ➡ sentiment: A score on a continuous scale. This scale ranges from -1 to 1 with values closer to 1 being translated to positive sentiment, values closer to -1 representing a negative sentiment while values close to 0 depicting no sentiment or being neutral. ➡ stance: That is if the tweet supports the belief of man-made climate change (believer), if the tweet does not believe in man-made climate change (denier), and if the tweet neither supports nor refuses the belief of man-made climate change (neutral). ➡ gender: Whether the user that made the tweet is male, female, or undefined. ➡ temperature_avg: The temperature deviation in Celsius and relative to the January 1951-December 1980 average at the time and place the tweet was written. ➡ aggressiveness: That is if the tweet contains aggressive language or not.
Since Twitter forbids making public the text of the tweets, in order to retrieve it you need to do a process called hydrating. Tools such as Twarc or Hydrator can be used to hydrate tweets.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains statistics related to the Unleashed Twitter account (@SAUnleashed). Unleashed is an open data competition, an initiative of the Office for Digital Government, Department of the Premier and Cabinet. The data is used to monitor the level of engagement activity with the audience, and make the communication effective in regards to the event.
Due to the relevance of the COVID-19 global pandemic, we are releasing our dataset of tweets acquired from the Twitter Stream related to COVID-19 chatter. Since our first release we have received additional data from our new collaborators, allowing this resource to grow to its current size. Dedicated data gathering started from March 11th yielding over 4 million tweets a day. We have added additional data provided by our new collaborators from January 27th to March 27th, to provide extra longitudinal coverage.
The data collected from the stream captures all languages, but the higher prevalence are: English, Spanish, and French. We release all tweets and retweets on the full_dataset.tsv file (230,961,781 unique tweets), and a cleaned version with no retweets on the full_dataset-clean.tsv file (52,026,197 unique tweets). There are several practical reasons for us to leave the retweets, tracing important tweets and their dissemination is one of them. For NLP tasks we provide the top 1000 frequent terms in frequent_terms.csv, the top 1000 bigrams in frequent_bigrams.csv, and the top 1000 trigrams in frequent_trigrams.csv. Some general statistics per day are included for both datasets in the statistics-full_dataset.tsv and statistics-full_dataset-clean.tsv files. For more statistics and some visualizations visit: http://www.panacealab.org/covid19/
More details can be found (and will be updated faster at: https://github.com/thepanacealab/covid19_twitter) and our pre-print about the dataset (https://arxiv.org/abs/2004.03688)
As always, the tweets distributed here are only tweet identifiers (with date and time added) due to the terms and conditions of Twitter to re-distribute Twitter data ONLY for research purposes. The need to be hydrated to be used.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains 653 996 tweets related to the Coronavirus topic and highlighted by hashtags such as: #COVID-19, #COVID19, #COVID, #Coronavirus, #NCoV and #Corona. The tweets' crawling period started on the 27th of February and ended on the 25th of March 2020, which is spread over four weeks.
The tweets were generated by 390 458 users from 133 different countries and were written in 61 languages. English being the most used language with almost 400k tweets, followed by Spanish with around 80k tweets.
The data is stored in as a CSV file, where each line represents a tweet. The CSV file provides information on the following fields:
Author: the user who posted the tweet
Recipient: contains the name of the user in case of a reply, otherwise it would have the same value as the previous field
Tweet: the full content of the tweet
Hashtags: the list of hashtags present in the tweet
Language: the language of the tweet
Relationship: gives information on the type of the tweet, whether it is a retweet, a reply, a tweet with a mention, etc.
Location: the country of the author of the tweet, which is unfortunately not always available
Date: the publication date of the tweet
Source: the device or platform used to send the tweet
The dataset can as well be used to construct a social graph since it includes the relations "Replies to", "Retweet", "MentionsInRetweet" and "Mentions".
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
General Description
This dataset comprises 4,038 tweets in Spanish, related to discussions about artificial intelligence (AI), and was created and utilized in the publication "Enhancing Sentiment Analysis on Social Media: Integrating Text and Metadata for Refined Insights," (10.1109/IE61493.2024.10599899) presented at the 20th International Conference on Intelligent Environments. It is designed to support research on public perception, sentiment, and engagement with AI topics on social media from a Spanish-speaking perspective. Each entry includes detailed annotations covering sentiment analysis, user engagement metrics, and user profile characteristics, among others.
Data Collection Method
Tweets were gathered through the Twitter API v1.1 by targeting keywords and hashtags associated with artificial intelligence, focusing specifically on content in Spanish. The dataset captures a wide array of discussions, offering a holistic view of the Spanish-speaking public's sentiment towards AI.
Dataset Content
ID: A unique identifier for each tweet.
text: The textual content of the tweet. It is a string with a maximum allowed length of 280 characters.
polarity: The tweet's sentiment polarity (e.g., Positive, Negative, Neutral).
favorite_count: Indicates how many times the tweet has been liked by Twitter users. It is a non-negative integer.
retweet_count: The number of times this tweet has been retweeted. It is a non-negative integer.
user_verified: When true, indicates that the user has a verified account, which helps the public recognize the authenticity of accounts of public interest. It is a boolean data type with two allowed values: True or False.
user_default_profile: When true, indicates that the user has not altered the theme or background of their user profile. It is a boolean data type with two allowed values: True or False.
user_has_extended_profile: When true, indicates that the user has an extended profile. An extended profile on Twitter allows users to provide more detailed information about themselves, such as an extended biography, a header image, details about their location, website, and other additional data. It is a boolean data type with two allowed values: True or False.
user_followers_count: The current number of followers the account has. It is a non-negative integer.
user_friends_count: The number of users that the account is following. It is a non-negative integer.
user_favourites_count: The number of tweets this user has liked since the account was created. It is a non-negative integer.
user_statuses_count: The number of tweets (including retweets) posted by the user. It is a non-negative integer.
user_protected: When true, indicates that this user has chosen to protect their tweets, meaning their tweets are not publicly visible without their permission. It is a boolean data type with two allowed values: True or False.
user_is_translator: When true, indicates that the user posting the tweet is a verified translator on Twitter. This means they have been recognized and validated by the platform as translators of content in different languages. It is a boolean data type with two allowed values: True or False.
Cite as
Guerrero-Contreras, G., Balderas-Díaz, S., Serrano-Fernández, A., & Muñoz, A. (2024, June). Enhancing Sentiment Analysis on Social Media: Integrating Text and Metadata for Refined Insights. In 2024 International Conference on Intelligent Environments (IE) (pp. 62-69). IEEE.
Potential Use Cases
This dataset is aimed at academic researchers and practitioners with interests in:
Sentiment analysis and natural language processing (NLP) with a focus on AI discussions in the Spanish language.
Social media analysis on public engagement and perception of artificial intelligence among Spanish speakers.
Exploring correlations between user engagement metrics and sentiment in discussions about AI.
Data Format and File Type
The dataset is provided in CSV format, ensuring compatibility with a wide range of data analysis tools and programming environments.
License
The dataset is available under the Creative Commons Attribution 4.0 International (CC BY 4.0) license, permitting sharing, copying, distribution, transmission, and adaptation of the work for any purpose, including commercial, provided proper attribution is given.
https://cubig.ai/store/terms-of-servicehttps://cubig.ai/store/terms-of-service
1) Data introduction • Twitter-tweets-sentiment dataset is a dataset that aims to analyze tweet sentiment for Twitter and natural language processing.
2) Data utilization (1)Twitter-tweets-sentiment data has characteristics that: • The data consists of three columns, including emotion and text, and aims to block negative tweets through a powerful classification model. (2) Twitter-tweets-sentiment data can be used to: • Social Media Monitoring: Businesses and organizations can use data to monitor social media platforms and gauge public sentiment about a brand, product, event, or social issue. • Sentiment analysis: This dataset can be used to train models that classify the sentiment of tweets, which can help companies and researchers understand public opinion on a variety of topics.
Due to the relevance of the COVID-19 global pandemic, we are releasing our dataset of tweets acquired from the Twitter Stream related to COVID-19 chatter. The first 9 weeks of data (from January 1st, 2020 to March 11th, 2020) contain very low tweet counts as we filtered other data we were collecting for other research purposes, however, one can see the dramatic increase as the awareness for the virus spread. Dedicated data gathering started from March 11th to March 30th which yielded over 4 million tweets a day. We have added additional data provided by our new collaborators from January 27th to February 27th, to provide extra longitudinal coverage.
The data collected from the stream captures all languages, but the higher prevalence are: English, Spanish, and French. We release all tweets and retweets on the full_dataset.tsv file (101,400,452 unique tweets), and a cleaned version with no retweets on the full_dataset-clean.tsv file (20,244,746 unique tweets). There are several practical reasons for us to leave the retweets, tracing important tweets and their dissemination is one of them. For NLP tasks we provide the top 1000 frequent terms in frequent_terms.csv, the top 1000 bigrams in frequent_bigrams.csv, and the top 1000 trigrams in frequent_trigrams.csv. Some general statistics per day are included for both datasets in the statistics-full_dataset.tsv and statistics-full_dataset-clean.tsv files.
More details can be found (and will be updated faster at: https://github.com/thepanacealab/covid19_twitter)
As always, the tweets distributed here are only tweet identifiers (with date and time added) due to the terms and conditions of Twitter to re-distribute Twitter data. The need to be hydrated to be used.
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset includes news articles from US major news outlets and associated sharing activities on Twitter, covering the content of the sharing tweets and details of the users. The dataset is conceived to highlight users’ involvement in the process of news dissemination as we believe that understanding news sharing behaviours can provide further insights on detecting users’ opinions, stance and communities. In particular, we describe a practical usage of our dataset in the context of political stance classification. The original dataset is composed of collections stored in a mongoDB environment. Their schema is described in ".rtf" files. Due to Twitter developers terms we can only provide ids for users and tweets, that can be used to retrieve the original data throught the Twitter API. For additional details please refer to "News Sharing User Behaviour on Twitter: A Comprehensive Data Collection of News Articles and Social Interactions", Brena et al. ICWSM'19 (2019)
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The US has historically been the target country for Twitter since its launch in 2006. This is the full breakdown of Twitter users by country.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These are the key Twitter user statistics that you need to know.