Author: Víctor Yeste. Universitat Politècnica de València.

The object of this study is the design of a cybermetric methodology whose objectives are to measure the success of the content published in online media and the possible prediction of the selected success variables.

In this case, due to the need to integrate data from two separate areas (web publishing and the analysis of shares and related topics on Twitter), a programmatic approach was chosen, accessing both the Google Analytics v4 Reporting API and the Twitter Standard API, always respecting their rate limits.

The website analyzed is hellofriki.com. It is an online media outlet whose primary intention is to meet the demand for information on its topics, publishing a large volume of daily news as well as analyses, reports, interviews, and many other information formats. All this content falls under the sections of cinema, series, video games, literature, and comics.

This dataset has contributed to the elaboration of the PhD thesis:

Yeste Moreno, VM. (2021). Diseño de una metodología cibermétrica de cálculo del éxito para la optimización de contenidos web [Tesis doctoral]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/176009

Data have been obtained from each breaking news article published online, according to the indicators described in the doctoral thesis.
All related data are stored in a database, divided into the following tables:

tesis_followers: user ID list of media account followers.

tesis_hometimeline: data from tweets posted by the media account sharing breaking news from the web.
- status_id: tweet ID
- created_at: date of publication
- text: content of the tweet
- path: URL extracted after processing the shortened URL in text
- post_shared: article ID in WordPress that is being shared
- retweet_count: number of retweets
- favorite_count: number of favorites

tesis_hometimeline_other: data from tweets posted by the media account that do not share breaking news from the web (other typologies: automatic Facebook shares, custom tweets without a link to an article, etc.), with the same fields as tesis_hometimeline.

tesis_posts: data of articles published by the web and processed for some analysis.
- stats_id: analysis ID
- post_id: article ID in WordPress
- post_date: article publication date in WordPress
- post_title: title of the article
- path: URL of the article on the media website
- tags: IDs of the WordPress tags related to the article
- uniquepageviews: unique page views
- entrancerate: entrance rate
- avgtimeonpage: average time on page
- exitrate: exit rate
- pageviewspersession: page views per session
- adsense_adunitsviewed: number of ads viewed by users
- adsense_viewableimpressionpercent: viewable ad impression rate
- adsense_ctr: ad click-through rate
- adsense_ecpm: estimated ad revenue per 1,000 page views

tesis_stats: data from a particular analysis, performed at each published breaking news item.
Fields with statistical values can be computed from the data in the other tables, but total and average calculations are saved for faster and easier further processing.
- id: ID of the analysis
- phase: phase of the thesis in which the analysis was carried out (currently all are 1)
- time: "0" if at the time of publication, "1" if 14 days later
- start_date: date and time of the measurement on the day of publication
- end_date: date and time of the measurement made 14 days later
- main_post_id: ID of the published article to be analysed
- main_post_theme: main section of the published article to be analysed
- superheroes_theme: "1" if about superheroes, "0" if not
- trailer_theme: "1" if about a trailer, "0" if not
- name: empty field, allowing a custom name to be added manually
- notes: empty field, allowing personalized notes to be added manually (e.g., that a tag was removed manually for being considered too generic, despite the fact that the editor added it)
- num_articles: number of articles analysed
- num_articles_with_traffic: number of articles analysed with traffic (which will be taken into account for traffic analysis)
- num_articles_with_tw_data: number of articles with data from when they were shared on the media's Twitter account
- num_terms: number of terms analysed
- uniquepageviews_total: total unique page views
- uniquepageviews_mean: average unique page views
- entrancerate_mean: average entrance rate
- avgtimeonpage_mean: average time on page
- exitrate_mean: average exit rate
- pageviewspersession_mean: average page views per session
- adsense_adunitsviewed_total: total ads viewed
- adsense_adunitsviewed_mean: average ads viewed
- adsense_viewableimpressionpercent_mean: average viewable ad impression rate
- adsense_ctr_mean: average ad click-through rate
- adsense_ecpm_mean: average estimated ad revenue per 1,000 page views
- retweet_count_total: total retweets
- retweet_count_mean: average retweets
- favorite_count_total: total favorites
- favorite_count_mean: average favorites
- terms_ini_num_tweets: total tweets on the terms on the day of publication
- terms_ini_retweet_count_total: total retweets on the terms on the day of publication
- terms_ini_retweet_count_mean: average retweets on the terms on the day of publication
- terms_ini_favorite_count_total: total favorites on the terms on the day of publication
- terms_ini_favorite_count_mean: average favorites on the terms on the day of publication
- terms_ini_followers_talking_rate: ratio of followers of the media Twitter account who have recently published a tweet talking about the terms on the day of publication
- terms_ini_user_num_followers_mean: average followers of users who have talked about the terms on the day of publication
- terms_ini_user_num_tweets_mean: average number of tweets published by users who have talked about the terms on the day of publication
- terms_ini_user_age_mean: average age in days of users who have talked about the terms on the day of publication
- terms_ini_url_inclusion_rate: URL inclusion rate of tweets talking about the terms on the day of publication
- terms_end_num_tweets: total tweets on the terms 14 days after publication
- terms_end_retweet_count_total: total retweets on the terms 14 days after publication
- terms_end_retweet_count_mean: average retweets on the terms 14 days after publication
- terms_end_favorite_count_total: total favorites on the terms 14 days after publication
- terms_end_favorite_count_mean: average favorites on the terms 14 days after publication
- terms_end_followers_talking_rate: ratio of followers of the media Twitter account who have recently published a tweet talking about the terms 14 days after publication
- terms_end_user_num_followers_mean: average followers of users who have talked about the terms 14 days after publication
- terms_end_user_num_tweets_mean: average number of tweets published by users who have talked about the terms 14 days after publication
- terms_end_user_age_mean: average age in days of users who have talked about the terms 14 days after publication
- terms_end_url_inclusion_rate: URL inclusion rate of tweets talking about the terms 14 days after publication

tesis_terms: data of the terms (tags) related to the processed articles.
- stats_id: analysis ID
- time: "0" if at the time of publication, "1" if 14 days later
- term_id: term (tag) ID in WordPress
- name: name of the term
- slug: URL slug of the term
- num_tweets: number of tweets
- retweet_count_total: total retweets
- retweet_count_mean: average retweets
- favorite_count_total: total favorites
- favorite_count_mean: average favorites
- followers_talking_rate: ratio of followers of the media Twitter account who have recently published a tweet talking about the term
- user_num_followers_mean: average followers of users who were talking about the term
- user_num_tweets_mean: average number of tweets published by users who were talking about the term
- user_age_mean: average age in days of users who were talking about the term
- url_inclusion_rate: URL inclusion rate
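The relational layout described above can be sketched as a small SQLite schema. This is a hedged illustration: the column types are assumptions (the description only names the fields), and only two of the five tables are shown.

```python
import sqlite3

# Sketch of two of the tables described above; types are assumed.
schema = """
CREATE TABLE tesis_posts (
    stats_id            INTEGER,  -- analysis ID
    post_id             INTEGER,  -- article ID in WordPress
    post_date           TEXT,     -- publication date
    post_title          TEXT,
    path                TEXT,     -- URL of the article
    tags                TEXT,     -- related WordPress tag IDs
    uniquepageviews     INTEGER,
    entrancerate        REAL,
    avgtimeonpage       REAL,
    exitrate            REAL,
    pageviewspersession REAL
);
CREATE TABLE tesis_terms (
    stats_id               INTEGER,  -- analysis ID
    time                   INTEGER,  -- 0 = at publication, 1 = 14 days later
    term_id                INTEGER,  -- tag ID in WordPress
    name                   TEXT,
    slug                   TEXT,
    num_tweets             INTEGER,
    retweet_count_total    INTEGER,
    retweet_count_mean     REAL,
    favorite_count_total   INTEGER,
    favorite_count_mean    REAL,
    followers_talking_rate REAL,
    url_inclusion_rate     REAL
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(schema)

# As the description notes, mean fields are derivable from the totals:
conn.execute("INSERT INTO tesis_terms VALUES "
             "(1, 0, 42, 'cine', 'cine', 10, 50, 5.0, 20, 2.0, 0.1, 0.3)")
row = conn.execute(
    "SELECT retweet_count_total * 1.0 / num_tweets FROM tesis_terms"
).fetchone()
print(row[0])  # 5.0
```

The query at the end illustrates why the precomputed means are a convenience rather than new information.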
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset was used in the manuscript "Scaling laws and dynamics of hashtags on Twitter".
The Twitter data was obtained from a sample of 10% of all public tweets, provided by the Twitter streaming application programming interface. We extracted the hashtags from each tweet and counted how many times they were used in different time intervals. Time intervals of three different lengths were used: days, hours, and minutes. The tweets were published between November 1st 2015 and November 30th 2016, but not all time intervals between these dates are available.
Each of the four files in this dataset corresponds to one folder (archived using tar). Each folder contains .csv files compressed using gzip. The contents of the .csv files in each folder are:
hashtags_frequency_day.tar Counts of hashtags in each day. The name of each file in the folder indicates the date (GMT). The entries in each file are the hashtag and the count in the interval.
hashtags_frequency_hour.tar Counts of hashtags in each hour. The name of each file in the folder indicates the date (GMT). The entries in each file are the hashtag and the count in the interval.
hashtags_frequency_minutes.tar Counts of hashtags in each minute. The name of each file in the folder indicates the date (GMT, only a fraction of all days is available). The entries in each file are the hashtag and the count in the interval.
number_of_tweets.tar Counts of the number of tweets in each minute. The name of each file in the folder indicates the day. The entries in each file are the minute in the day (GMT) and count of tweets in our dataset.
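Once extracted, a file from any of the hashtag-frequency folders can be read with the standard library alone. The payload below is hypothetical, mimicking one hourly file with one `<hashtag>,<count>` row per line as described above.

```python
import csv
import gzip
import io

# Hypothetical contents of one hourly file after untarring the archive.
payload = gzip.compress(b"music,120\nnews,87\nfootball,45\n")

counts = {}
with gzip.open(io.BytesIO(payload), "rt") as f:
    for hashtag, count in csv.reader(f):
        counts[hashtag] = int(count)

print(counts["news"])        # 87
print(sum(counts.values()))  # 252
```

In practice the `io.BytesIO` wrapper would be replaced by the path of a `.csv.gz` file extracted from one of the tar archives.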
In the digital age, every minute counts as billions of users engage with online platforms worldwide. The year 2024 saw an astounding 251.1 million emails sent, 138.9 million Reels played on Facebook and Instagram, and 5.9 million Google searches conducted every 60 seconds.

Social media's continued dominance

Social media platforms remain at the forefront of online interactions, with Facebook leading the pack at over three billion monthly active users. The broader Meta ecosystem, including Instagram and WhatsApp, further solidifies its position in the digital landscape. TikTok, a relative newcomer, has rapidly gained traction, generating 186 million downloads in the fourth quarter of 2024 alone.

Evolving digital consumption patterns

While traditional streaming services like Netflix continue to dominate, with 362,962 hours streamed every minute, the digital media landscape is experiencing shifts in user preferences. Netflix recorded over 300 million paid subscribers worldwide as of the fourth quarter of 2024.
This dataset contains the tweet ids of 7,275,228 tweets related to the Women's March on January 21, 2017. They were collected between December 19, 2016 and January 23, 2017 from the Twitter API using Social Feed Manager, using the POST statuses/filter method of the Twitter Stream API. There is a README.txt file containing additional documentation on how the dataset was collected.

The GET statuses/lookup method supports retrieving the complete tweet for a tweet id (known as hydrating). Tools such as Twarc or Hydrator can be used to hydrate tweets. When hydrating, be aware that:
- Twitter limits hydration to 900 requests of 100 tweet ids per 15-minute window per set of user credentials.
- The Twitter API will not return tweets that have been deleted or belong to accounts that have been suspended, deleted, or made private. You should expect a large number of these tweets to be unavailable.

For tweets collected from the Twitter filter stream, this is not a complete set of tweets that match the filter. Gaps may exist because:
- Twitter limits the number of tweets returned by the filter at any point in time.
- Social Feed Manager stops and starts the Twitter filter stream every 30 minutes.
- In Social Feed Manager, collecting is turned off while a user is making changes to the collection criteria.
- There were some operational issues, e.g., network interruptions, during the collection period.

Per Twitter's Developer Policy, tweet ids may be publicly shared; tweets may not. Questions about this dataset can be sent to sfm@gwu.edu. George Washington University researchers should contact us for access to the tweets. This work is supported by grant #NARDI-14-50017-14 from the National Historical Publications and Records Commission.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We collected Twitter user data using Tweepy to access the Twitter API. We crawled the list of each user account's followers. Twitter allowed requests of a maximum of 200 tweets per time window, and because of Twitter API rate limits we could only make a request every 15 minutes. Next, we obtained the most recent tweets of each user in the study. We extracted the most common hashtags used in the sample tweets and crawled the 50 most recent tweets that contained each hashtag, as well as tweets that mentioned a particular user, for example '@username'. Initially, we chose 101 user accounts and documented the attributes of each user's account (number of followers, a list of followers, and the recent tweets of each follower).
https://creativecommons.org/publicdomain/zero/1.0/
This dataset comprises seven days of geo-tagged Tweets from the contiguous United States, collected between January 12 and January 18, 2013. Each Tweet includes exact GPS coordinates (longitude and latitude) and a timestamp (hour, minute, second) reported in Central Standard Time (CST). The data is suitable for tasks such as classification, regression, and clustering, offering insights into spatiotemporal trends in social media activity.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains the tweet ids of approximately 280 million tweets related to the 2016 United States presidential election. They were collected between July 13, 2016 and November 10, 2016 from the Twitter API using Social Feed Manager. These tweet ids are broken up into 12 collections, each collected either from the GET statuses/user_timeline method of the Twitter REST API or the POST statuses/filter method of the Twitter Stream API. The collections are:
- Candidates and key election hashtags (Twitter filter): election-filter[1-6].txt
- Democratic candidates (Twitter user timeline): democratic-candidate-timelines.txt
- Democratic Convention (Twitter filter): democratic-convention-filter.txt
- Democratic Party (Twitter user timeline): democratic-party-timelines.txt
- Election Day (Twitter filter): election-day.txt
- First presidential debate (Twitter filter): first-debate.txt
- GOP Convention (Twitter filter): republican-convention-filter.txt
- Republican candidates (Twitter user timeline): republican-candidate-timelines.txt
- Republican Party (Twitter user timeline): republican-party-timelines.txt
- Second presidential debate (Twitter filter): second-debate.txt
- Third presidential debate (Twitter filter): third-debate.txt
- Vice Presidential debate (Twitter filter): vp-debate.txt

There is also a README.txt file for each collection containing additional documentation on how it was collected.

The GET statuses/lookup method supports retrieving the complete tweet for a tweet id (known as hydrating). Tools such as Twarc or Hydrator can be used to hydrate tweets. When hydrating, be aware that:
- Twitter limits hydration to 900 requests of 100 tweet ids per 15-minute window per set of user credentials. This works out to 8,640,000 tweets per day, so hydrating this entire dataset will take 32 days.
- The Twitter API will not return tweets that have been deleted or belong to accounts that have been suspended, deleted, or made private. You should expect a large number of these tweets to be unavailable.
- There may be duplicate tweets across collections. Also, according to the Twitter documentation, duplicate tweets are possible for tweets collected from the Twitter filter stream.

For tweets collected from the Twitter filter stream, this is not a complete set of tweets that match the filter. Gaps may exist because:
- Twitter limits the number of tweets returned by the filter at any point in time.
- Social Feed Manager stops and starts the Twitter filter stream every 30 minutes.
- In Social Feed Manager, collecting is turned off while a user is making changes to the collection criteria.
- There were some operational issues, e.g., network interruptions, during the collection period.

Since some of the terms used to collect from the Twitter filter stream were broad (e.g., "election"), the dataset may contain tweets from elections other than the U.S. presidential election, including state elections, local elections, or elections in other countries.

Per Twitter's Developer Policy, tweet ids may be publicly shared; tweets may not. Questions about this dataset can be sent to sfm@gwu.edu. George Washington University researchers should contact us for access to the tweets. This work is supported by grant #NARDI-14-50017-14 from the National Historical Publications and Records Commission.
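The hydration estimate above can be checked with a little arithmetic:

```python
# Back-of-the-envelope check of the hydration rate limits described above.
requests_per_window = 900        # requests per 15-minute window
ids_per_request = 100            # tweet ids per statuses/lookup request
windows_per_day = 24 * 60 // 15  # 96 fifteen-minute windows per day

tweets_per_day = requests_per_window * ids_per_request * windows_per_day
print(tweets_per_day)  # 8640000

dataset_size = 280_000_000  # approximate number of tweet ids in this dataset
print(round(dataset_size / tweets_per_day, 1))  # 32.4
```

So with one set of credentials, hydrating the full dataset takes roughly 32 days of continuous requests, as stated.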
According to a survey conducted in June 2023, adults in the United States spent more time per day on TikTok than on any other leading social media platform. Overall, respondents reported spending an average of 53.8 minutes per day on the social video app. YouTube and Twitter ranked second and third, with respective averages of 48 and 34 minutes spent on the platforms per day.
U.S. teens have time for certain platforms
Different social media platforms attract different demographics, with teenagers in the United States being more drawn to TikTok and YouTube than to Facebook. In 2023, teenagers in the United States spent an average of almost two hours on YouTube and 1.5 hours on TikTok every day, while Facebook was used by teens for less than half an hour per day. Furthermore, social media habits differ between genders, as teen girls were more likely than boys to spend more time on Instagram.
TikTok is king for teens and Gen Z
Although spending 1.5 hours on the Generation Z app of choice may sound rather modest, some TikTok users devote much more of their time to the platform. According to a survey conducted in the United States in 2022, around eight percent of teenagers in the United States spent over five hours a day on TikTok, whereas another 22 percent reported spending between two and three hours daily on the video-based app.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We collected the data for our analysis by utilising the academic Twitter API (V2). The four-letter acronyms associated with the Myers-Briggs Type Indicator (MBTI) give people a short categorisation of their personality that is easily self-reported on social media in the form of a regular expression. As a result, people are much more likely to self-report their categorical MBTI rather than other personality types. The four letter MBTI acronyms are also unique to the Myers-Briggs questionnaire, meaning they can be easily queried using the Twitter API. This also means these personality types won't be confused with any other acronym or word, reducing the likelihood we incorrectly classify any users. When we initially explored Twitter, we found that some users self-reported their personality type in their biography and other users would self-report their personality types in their tweets. As a result, we formulated two methods for querying and labelling the Myers-Briggs personality type of accounts. We describe the two methods below:
Firstly, we used Tweepy's 'search_users' endpoint to obtain the set of users who currently self-report their MBTI in their username or biography. Due to the rate limits associated with this endpoint, we were limited to obtaining no more than 1000 users for each unique search query. Secondly, we used the Twitter API's 'full_archive_search' endpoint to obtain the set of users who self-reported their Myers-Briggs personality type in a Tweet since Twitter's creation (March 26, 2006). We searched for users who tweeted any of the three regular expressions, followed by their personality type: 'I am...', 'I am a...' or 'I am an...'. Note that we only searched for self-reports in Tweets and excluded Retweets, Quotes and Replies in our query, due to these having a much higher potential of incorrectly labelling an account. Furthermore, we were bound by rate limits of 300 requests per 15-minute window; however, there were no hard bounds on the number of tweets or users we could obtain. As a result, we ran this query for each personality type until the search was exhausted.
Note that in both cases, the queries were not case-sensitive. In the attached dataset, we provide both the Twitter User IDs and the Myers-Briggs Personality Types associated with the 68,958 users obtained using the two methods discussed above. We provide this dataset prior to any preprocessing steps performed in our paper.
https://www.etalab.gouv.fr/licence-ouverte-open-licence
Public tweets from the account @rerb on Twitter.
- 'ID': unique identifier of the tweet
- 'created_at': date, hour, minute, and second of the tweet in the UTC time zone
- 'text': tweet text
- 'retweet_count': number of retweets of the tweet
- 'favorite_count': number of times the tweet was added to favorites
- 'tweet_mentionne_excuse': whether the tweet mentions the word "excuse" (0 or 1)
- 'tweet_mentionne_regulation': whether the tweet mentions the word "regulation" (0 or 1)
- 'tweet_mentionne_bon_courage': whether the tweet mentions the phrase "bon courage" (0 or 1)
https://creativecommons.org/publicdomain/zero/1.0/
[READ THIS FIRST! DATASETS FOR Academic/Learning/Non-commercial purpose]
The US Election 2020 is very interesting to look into, as it was an election in the middle of a pandemic. My teammate and I created a Twitter crawler using the Twitter API and Tweepy for my Artificial Intelligence coursework. We chose Donald Trump as the subject of interest, as President Trump was known for his Twitter interaction.
I decided to deploy my crawler on post-voting day to conduct a sentiment analysis.
The tweet text in this dataset is suitable for sentiment analysis.
This raw dataset was crawled using the Tweepy library and the Twitter API, gathering 2,500 tweets per 15 minutes. There are a total of 247,500 rows and 13 columns, for a total of 3,217,500 cells of data. Data cleaning is needed before doing any analysis.
Dataset date range: 4th November 2020 - 11th November 2020. Tweets containing "Trump", "DonalTrump", or "realDonalTrump" were captured.
(The user = the user of the particular row)
- username: Twitter user handle
- accDesc: description on the user's profile
- location: location of the tweet
- following: total number of accounts the user is following
- followers: total number of followers of the user
- totaltweets: total tweets created by the user
- usercreated: date the user registered their Twitter account
- tweetcreated: date the tweet was created
- favouritecount: the tweet's favorite count (equivalent to a like on Facebook)
- retweetcount: the tweet's total retweets (equivalent to a share on Facebook)
- text: text body of the tweet
- tweetsource: device used to create the tweet
- hashtags: hashtags of the tweet in JSON format
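Since the hashtags column stores JSON, it needs decoding before analysis. A minimal sketch, assuming the column holds Tweepy's entity format (a list of objects with a "text" key); the row below is hypothetical:

```python
import json

# Hypothetical row from the dataset, using the field names listed above.
row = {
    "username": "example_user",
    "tweetcreated": "2020-11-05 10:15:00",
    "retweetcount": "12",
    "hashtags": '[{"text": "Election2020", "indices": [0, 13]}]',
}

# Decode the JSON hashtags column into a plain list of hashtag strings,
# and cast the numeric column, which is read from CSV as a string.
tags = [h["text"] for h in json.loads(row["hashtags"])]
retweets = int(row["retweetcount"])

print(tags)      # ['Election2020']
print(retweets)  # 12
```

This kind of decoding and casting is part of the data cleaning the description says is needed before analysis.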
Banner and thumbnail courtesy of > visuals < from unsplash.com
Many thanks to my teammates Jiacheng Loh and ChenZhen Li for their efforts.
Please do not use this dataset for any malicious purposes; I am not responsible for any damage done.
This dataset was gathered for learning purposes, not for commercial ones.
The data were publicly available, so I assume they are open for all.
The data were gathered at intervals of at least 15 minutes, so the distribution of tweet creation dates is not uniform and may not include all tweets created within the date range.
https://creativecommons.org/publicdomain/zero/1.0/
People across India scrambled for life-saving oxygen supplies on Friday and patients lay dying outside hospitals as the capital recorded the equivalent of one death from COVID-19 every five minutes.
For the second day running, the country’s overnight infection total was higher than ever recorded anywhere in the world since the pandemic began last year, at 332,730.
India’s second wave has hit with such ferocity that hospitals are running out of oxygen, beds, and anti-viral drugs. Many patients have been turned away because there was no space for them, doctors in Delhi said.
Mass cremations have been taking place as the crematoriums have run out of space. Ambulance sirens sounded throughout the day in the deserted streets of the capital, one of India’s worst-hit cities, where a lockdown is in place to try and stem the transmission of the virus.
The dataset consists of tweets made with the #IndiaWantsOxygen hashtag, covering tweets from the past week. It currently contains 25,440 tweets and will be updated on a daily basis.
The description of the features is given below:

| No | Column | Description |
| -- | -- | -- |
| 1 | user_name | The name of the user, as they’ve defined it. |
| 2 | user_location | The user-defined location for this account’s profile. |
| 3 | user_description | The user-defined UTF-8 string describing their account. |
| 4 | user_created | Time and date when the account was created. |
| 5 | user_followers | The number of followers the account currently has. |
| 6 | user_friends | The number of friends the account currently has. |
| 7 | user_favourites | The number of favorites the account currently has. |
| 8 | user_verified | When true, indicates that the user has a verified account. |
| 9 | date | UTC time and date when the Tweet was created. |
| 10 | text | The actual UTF-8 text of the Tweet. |
| 11 | hashtags | All the other hashtags posted in the tweet along with #IndiaWantsOxygen. |
| 12 | source | Utility used to post the Tweet; Tweets from the Twitter website have a source value of "web". |
| 13 | is_retweet | Indicates whether this Tweet has been Retweeted by the authenticating user. |
https://globalnews.ca/news/7785122/india-covid-19-hospitals-record/ Image courtesy: BBC and Reuters
The past few days have been really depressing after seeing these incidents. These tweets are the voice of Indians requesting help, and of people all over the globe asking their own countries to support India by providing oxygen tanks.
And I strongly believe that this is not just some data, but the pure emotions of people and their call for help. And I hope we as data scientists could contribute on this front by providing valuable information and insights.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains data for temporal validity change prediction, an NLP task that will be defined in an upcoming publication. The dataset consists of five columns.
The duration labels (context_only_tv, combined_tv) are class indices of the following class distribution:
[no time-sensitive information, less than one minute, 1-5 minutes, 5-15 minutes, 15-45 minutes, 45 minutes - 2 hours, 2-6 hours, more than 6 hours, 1-3 days, 3-7 days, 1-4 weeks, more than one month]
Different dataset splits are provided.
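A sketch of mapping a duration to the class indices listed above. The bin edges and edge-inclusion rules are assumptions (the source only lists the labels), and `None` is taken to mean "no time-sensitive information":

```python
# Upper bound (in minutes, exclusive) of each duration class; the edges
# are assumed from the labels listed above.
BINS = [
    1,             # class 1: less than one minute
    5,             # class 2: 1-5 minutes
    15,            # class 3: 5-15 minutes
    45,            # class 4: 15-45 minutes
    120,           # class 5: 45 minutes - 2 hours
    360,           # class 6: 2-6 hours
    24 * 60,       # class 7: more than 6 hours (up to one day, assumed)
    3 * 24 * 60,   # class 8: 1-3 days
    7 * 24 * 60,   # class 9: 3-7 days
    28 * 24 * 60,  # class 10: 1-4 weeks
]

def duration_class(minutes):
    """Map a duration in minutes (or None) to a class index 0-11."""
    if minutes is None:
        return 0  # no time-sensitive information
    for i, upper in enumerate(BINS, start=1):
        if minutes < upper:
            return i
    return 11  # more than one month

print(duration_class(None))         # 0
print(duration_class(10))           # 3  (5-15 minutes)
print(duration_class(2 * 24 * 60))  # 8  (1-3 days)
```

The class indices in context_only_tv and combined_tv would then index into this same 12-label distribution.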
This chart shows the time slots in which the number of comments per minute about a TV programme was highest on Twitter in France between 2012 and 2013. It shows that 9 p.m. was the hour at which the most users posted tweets.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The first public large-scale multilingual Twitter dataset related to the FIFA World Cup 2022, comprising over 28 million posts in 69 unique spoken languages, including Arabic, English, Spanish, French, and many others. This dataset aims to facilitate future research in sentiment analysis, cross-linguistic studies, event-based analytics, meme and hate speech detection, fake news detection, and social manipulation detection.
The file 🚨Qatar22WC.csv🚨 contains tweet-level and user-level metadata for our collected tweets.
🚀Codebook for FIFA World Cup 2022 Twitter Dataset🚀
| Column Name | Description |
| -- | -- |
| day, month, year | The date when the tweet was posted |
| hou, min, sec | Hour, minute, and second of the tweet timestamp |
| age_of_the_user_account | User account age in days |
| tweet_count | Total number of tweets posted by the user |
| location | User-defined location field |
| follower_count | Number of followers the user has |
| following_count | Number of accounts the user is following |
| follower_to_Following | Follower-to-following ratio |
| favouite_count | Number of tweets the user has liked |
| verified | Boolean indicating if the user is verified (1 = verified, 0 = not verified) |
| Avg_tweet_count | Average tweets per day of user activity |
| list_count | Number of lists the user is a member of |
| Tweet_Id | Tweet ID |
| is_reply_tweet | ID of the tweet being replied to (if applicable) |
| is_quote | Boolean indicating if the tweet is a quote |
| retid | Retweet ID if it is a retweet; NaN otherwise |
| lang | Language of the tweet |
| hashtags | The keyword or hashtag used to collect the tweet |
| is_image | Boolean indicating if the tweet has an associated image |
| is_video | Boolean indicating if the tweet has an associated video |
Examples of use case queries are described in the file 🚨fifa_wc_qatar22_examples_of_use_case_queries.ipynb🚨 and accessible via: https://github.com/khairied/Qata_FIFA_World_Cup_22
🚀 Please Cite This as: Daouadi, K. E., Boualleg, Y., Guehairia, O. & Taleb-Ahmed, A. (2025). Tracking the Global Pulse: The first public Twitter dataset from FIFA World Cup, Journal of Computational Social Science.
The number of Twitter users in the United States was forecast to continuously increase between 2024 and 2028 by a total of 4.3 million users (+5.32 percent). After the ninth consecutive increasing year, the Twitter user base is estimated to reach 85.08 million users, and therefore a new peak, in 2028. Notably, the number of Twitter users was continuously increasing over the past years.

User figures, shown here for the platform Twitter, have been estimated by taking into account company filings or press material, secondary research, app downloads, and traffic data. They refer to the average monthly active users over the period.

The shown data are an excerpt of Statista's Key Market Indicators (KMI). The KMI are a collection of primary and secondary indicators on the macro-economic, demographic, and technological environment in up to 150 countries and regions worldwide. All indicators are sourced from international and national statistical offices, trade associations, and the trade press, and they are processed to generate comparable data sets (see supplementary notes under details for more information). Find more key insights for the number of Twitter users in countries like Canada and Mexico.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Supporting figures and table. Figure S1, Tweet volume per minute: number of tweets per minute in the 12 datasets. (a–d) The six hours during the four debate events (“DEB”). For the other categories, we plot the six-hour volume centering around the peak within the data range: (e–h) normal period prior to the debate evenings (“PRE”); (i,j) national convention events, including the RNC and DNC (“CONV”); (k,l) breaking political news events, including the Benghazi attack and Romney's 47-percent video (“NEWS”). Figure S2, Changes in communication volume: diamond shapes indicate the mean value of each category; this figure shows the ratio of tweets mentioning a user to the total tweets at the peak hour. Figure S3, Lorenz curves for cumulative degree distributions of activity: increasing equality converges toward the diagonal line from the origin to the upper right, and increasing inequality converges toward a hyperbola rising to 100% of volume at the 100th percentile. Figure S4, Connectivity-concentration state spaces: for each of the twelve observed events, the Gini coefficient for the network's degree distribution is plotted against the average degree of the network. Table S1, Kolmogorov-Smirnov test (K-S test) for comparing the PRE curves with the remaining three curves in other conditions. (PDF)
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Two datasets are published as part of my Bachelor's final thesis on hate speech, titled Hate Speech on Twitter: Analysis of LGBTIQ-phobia Before and After Elon Musk:
Both datasets aim to provide a detailed view of interactions on Twitter on the specified days.
The columns include: id, createdAt, source, lang, retweetCount, replyCount, likeCount, quoteCount, viewCount, bookmarkCount, isReply, conversationId, author_verified, author_blue_verified, author_followers, author_following, author_tweets, author_createdAt, hashtags, author_isAutomated, author_fastFollowersCount, author_favouritesCount, texto_analisis, toxicity, severe_toxicity, identity_attack, insult, profanity, threat. The 'texto_analisis' column contains the content of the tweet, with all user mentions removed to comply with privacy regulations such as GDPR. The 'toxicity', 'severe_toxicity', 'identity_attack', 'insult', 'profanity', and 'threat' columns have values ranging from 0 to 1, where 0 indicates the attribute is not present and 1 indicates it is strongly present. The 'createdAt' column represents the tweet's publication date.
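As a minimal sketch of how the toxicity columns can be used, the snippet below flags tweets whose Perspective 'toxicity' score exceeds a threshold. The 0.5 cutoff and the inline sample rows are illustrative assumptions, not part of the dataset.

```python
# Sketch: flagging high-toxicity tweets by their Perspective API score.
# The 0.5 threshold and the sample rows are illustrative assumptions.
def flag_toxic(rows, threshold=0.5):
    """Return rows whose 'toxicity' score (0 to 1) exceeds the threshold."""
    return [r for r in rows if float(r["toxicity"]) > threshold]

sample = [
    {"id": "1", "texto_analisis": "first example tweet", "toxicity": "0.82"},
    {"id": "2", "texto_analisis": "second example tweet", "toxicity": "0.10"},
]
print([r["id"] for r in flag_toxic(sample)])  # → ['1']
```

The same pattern applies to the other attribute columns ('severe_toxicity', 'identity_attack', 'insult', 'profanity', 'threat'), which share the 0-to-1 scale.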
For further details, you can find the code for processing and analysis in the project's GitHub repository.
Acknowledgements
We would like to acknowledge the use of tools and support provided by twitterapi.io for data extraction, as well as the Perspective API, which played a crucial role in analyzing tweet toxicity. These resources were indispensable for the successful completion of this project.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Each spreadsheet in the Excel file corresponds to one of the tested episodes and contains minute-by-minute values of Twitter volume (raw), TV viewership (raw), and EEG metrics (pre-processed) associated with Attention, Motivation, and Memory, as well as the composite EEG metric (the average of the other three metrics). Ad breaks and missing values are indicated in the Notes field of each spreadsheet. (XLSX)
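Since each sheet aligns Twitter volume, viewership, and the EEG metrics minute by minute, a natural first analysis is a correlation between two of the series. The sketch below computes a Pearson correlation on illustrative sample values; the actual sheet layout is the one described above.

```python
# Sketch: Pearson correlation between two minute-by-minute series,
# e.g. Twitter volume vs. the composite EEG metric. The sample values
# are illustrative, not taken from the dataset.
from statistics import mean

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

twitter_volume = [120, 150, 90, 200, 170]       # tweets per minute (illustrative)
eeg_composite = [0.42, 0.51, 0.33, 0.70, 0.58]  # composite EEG metric (illustrative)
print(round(pearson(twitter_volume, eeg_composite), 3))
```

Minutes flagged as ad breaks or missing in the Notes field would need to be dropped from both series before correlating.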
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A set of tweets collected over a few hours on one day, 26 Oct 2011. These were tweets that mentioned either Rajoy or Rubalcaba. The set includes files with the timestamps of all tweets, as well as processed tweet counts, total and per minute.
Author: Víctor Yeste. Universitat Politècnica de València.

The object of this study is the design of a cybermetric methodology whose objectives are to measure the success of the content published in online media and the possible prediction of the selected success variables.

In this case, given the need to integrate data from two separate areas, web publishing and the analysis of its shares and related topics on Twitter, programmatic access was chosen for both the Google Analytics v4 Reporting API and the Twitter Standard API, always respecting their limits.

The website analyzed is hellofriki.com, an online media outlet whose primary aim is to meet the demand for information on a set of topics by publishing a large number of daily items in the form of news, as well as analysis pieces, reports, interviews, and many other formats. All these contents fall under the sections of cinema, series, video games, literature, and comics.

This dataset has contributed to the elaboration of the PhD thesis: Yeste Moreno, VM. (2021). Diseño de una metodología cibermétrica de cálculo del éxito para la optimización de contenidos web [Tesis doctoral]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/176009

Data have been obtained from each breaking news article published online, according to the indicators described in the doctoral thesis.
All related data are stored in a database, divided into the following tables:

tesis_followers: user ID list of the media account's followers.

tesis_hometimeline: data from tweets posted by the media account sharing breaking news from the web.
- status_id: tweet ID
- created_at: date of publication
- text: content of the tweet
- path: URL extracted after processing the shortened URL in text
- post_shared: WordPress ID of the article being shared
- retweet_count: number of retweets
- favorite_count: number of favorites

tesis_hometimeline_other: data from tweets posted by the media account that do not share breaking news from the web (other typologies: automatic Facebook shares, custom tweets without a link to an article, etc.). Same fields as tesis_hometimeline.

tesis_posts: data of articles published on the web and processed for some analysis.
- stats_id: analysis ID
- post_id: article ID in WordPress
- post_date: article publication date in WordPress
- post_title: title of the article
- path: URL of the article on the media website
- tags: IDs of the WordPress tags related to the article
- uniquepageviews: unique page views
- entrancerate: entrance rate
- avgtimeonpage: average time on page
- exitrate: exit rate
- pageviewspersession: page views per session
- adsense_adunitsviewed: number of ads viewed by users
- adsense_viewableimpressionpercent: viewable ad impression percentage
- adsense_ctr: ad click-through rate
- adsense_ecpm: estimated ad revenue per 1000 page views

tesis_stats: data from a particular analysis, performed for each published breaking news item.
Fields with statistical values can be computed from the data in the other tables, but totals and averages are saved for faster and easier further processing.
- id: ID of the analysis
- phase: phase of the thesis in which the analysis was carried out (currently all are 1)
- time: "0" if at the time of publication, "1" if 14 days later
- start_date: date and time of the measurement on the day of publication
- end_date: date and time of the measurement made 14 days later
- main_post_id: ID of the published article to be analyzed
- main_post_theme: main section of the published article to analyze
- superheroes_theme: "1" if about superheroes, "0" if not
- trailer_theme: "1" if a trailer, "0" if not
- name: empty field, allows adding a custom name manually
- notes: empty field, allows adding personalized notes manually, e.g. if some tag was removed manually for being considered too generic despite the editor having set it
- num_articles: number of articles analyzed
- num_articles_with_traffic: number of analyzed articles with traffic (taken into account for traffic analysis)
- num_articles_with_tw_data: number of articles with data from when they were shared on the media's Twitter account
- num_terms: number of terms analyzed
- uniquepageviews_total: total unique page views
- uniquepageviews_mean: average unique page views
- entrancerate_mean: average entrance rate
- avgtimeonpage_mean: average time on page
- exitrate_mean: average exit rate
- pageviewspersession_mean: average page views per session
- total: total ads viewed
- adsense_adunitsviewed_mean: average ads viewed
- adsense_viewableimpressionpercent_mean: average viewable ad impression percentage
- adsense_ctr_mean: average ad click-through rate
- adsense_ecpm_mean: average estimated ad revenue per 1000 page views
- retweet_count_total: total retweets
- retweet_count_mean: average retweets
- favorite_count_total: total favorites
- favorite_count_mean: average favorites
- terms_ini_num_tweets: total tweets on the terms on the day of publication
- terms_ini_retweet_count_total: total retweets on the terms on the day of publication
- terms_ini_retweet_count_mean: average retweets on the terms on the day of publication
- terms_ini_favorite_count_total: total favorites on the terms on the day of publication
- terms_ini_favorite_count_mean: average favorites on the terms on the day of publication
- terms_ini_followers_talking_rate: ratio of followers of the media Twitter account who have recently published a tweet talking about the terms on the day of publication
- terms_ini_user_num_followers_mean: average followers of users who have talked about the terms on the day of publication
- terms_ini_user_num_tweets_mean: average number of tweets published by users who have talked about the terms on the day of publication
- terms_ini_user_age_mean: average age in days of users who have talked about the terms on the day of publication
- terms_ini_url_inclusion_rate: URL inclusion rate of tweets talking about the terms on the day of publication
- terms_end_num_tweets: total tweets on the terms 14 days after publication
- terms_end_retweet_count_total: total retweets on the terms 14 days after publication
- terms_end_retweet_count_mean: average retweets on the terms 14 days after publication
- terms_end_favorite_count_total: total favorites on the terms 14 days after publication
- terms_end_favorite_count_mean: average favorites on the terms 14 days after publication
- terms_end_followers_talking_rate: ratio of followers of the media Twitter account who have recently posted a tweet talking about the terms 14 days after publication
- terms_end_user_num_followers_mean: average followers of users who have talked about the terms 14 days after publication
- terms_end_user_num_tweets_mean: average number of tweets published by users who have talked about the terms 14 days after publication
- terms_end_user_age_mean: average age in days of users who have talked about the terms 14 days after publication
- terms_end_url_inclusion_rate: URL inclusion rate of tweets talking about the terms 14 days after publication

tesis_terms: data of the terms (tags) related to the processed articles.
- stats_id: analysis ID
- time: "0" if at the time of publication, "1" if 14 days later
- term_id: term (tag) ID in WordPress
- name: name of the term
- slug: URL slug of the term
- num_tweets: number of tweets
- retweet_count_total: total retweets
- retweet_count_mean: average retweets
- favorite_count_total: total favorites
- favorite_count_mean: average favorites
- followers_talking_rate: ratio of followers of the media Twitter account who have recently published a tweet talking about the term
- user_num_followers_mean: average followers of users who were talking about the term
- user_num_tweets_mean: average number of tweets published by users who were talking about the term
- user_age_mean: average age in days of users who were talking about the term
- url_inclusion_rate: URL inclusion rate
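As a minimal sketch of how the schema above can be queried, the snippet below computes a mean of unique page views over articles that actually received traffic, mirroring the num_articles_with_traffic distinction in tesis_stats. The in-memory SQLite database and the sample rows are assumptions for illustration; the thesis data may live in another engine.

```python
# Sketch: computing an average of uniquepageviews over articles with traffic,
# using the tesis_posts fields described above. The in-memory SQLite database
# and the sample rows are illustrative assumptions.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE tesis_posts (post_id INTEGER, post_title TEXT, uniquepageviews INTEGER)"
)
con.executemany(
    "INSERT INTO tesis_posts VALUES (?, ?, ?)",
    [(1, "Post A", 500), (2, "Post B", 300), (3, "Post C", 0)],
)

# Only articles with traffic count toward the mean, as with num_articles_with_traffic.
(mean_views,) = con.execute(
    "SELECT AVG(uniquepageviews) FROM tesis_posts WHERE uniquepageviews > 0"
).fetchone()
print(mean_views)  # → 400.0
```

The same pattern extends to the Twitter-side tables, e.g. averaging retweet_count over tesis_hometimeline rows joined to tesis_posts via post_shared.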