38 datasets found
  1. A Twitter Dataset of 70+ million tweets related to COVID-19

    • zenodo.org
    csv, tsv, zip
    Updated Apr 17, 2023
    + more versions
    Cite
    Juan M. Banda; Ramya Tekumalla; Gerardo Chowell (2023). A Twitter Dataset of 70+ million tweets related to COVID-19 [Dataset]. http://doi.org/10.5281/zenodo.3732460
    Explore at:
    csv, tsv, zip (available download formats)
    Dataset updated
    Apr 17, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Juan M. Banda; Ramya Tekumalla; Gerardo Chowell
    Description

    Due to the relevance of the COVID-19 global pandemic, we are releasing our dataset of tweets acquired from the Twitter Stream related to COVID-19 chatter. The first 9 weeks of data (from January 1st, 2020 to March 11th, 2020) contain very low tweet counts as we filtered other data we were collecting for other research purposes; however, one can see the dramatic increase as awareness of the virus spread. Dedicated data gathering ran from March 11th to March 29th and yielded over 4 million tweets a day.

    The data collected from the stream captures all languages, but the most prevalent are English, Spanish, and French. We release all tweets and retweets in the full_dataset.tsv file (70,569,368 unique tweets), and a cleaned version with no retweets in the full_dataset-clean.tsv file (13,535,912 unique tweets). There are several practical reasons for us to keep the retweets; tracing important tweets and their dissemination is one of them. For NLP tasks we provide the top 1000 frequent terms in frequent_terms.csv, the top 1000 bigrams in frequent_bigrams.csv, and the top 1000 trigrams in frequent_trigrams.csv. Some general statistics per day are included for both datasets in the statistics-full_dataset.tsv and statistics-full_dataset-clean.tsv files.

    More details can be found (and will be updated faster) at: https://github.com/thepanacealab/covid19_twitter

    As always, the tweets distributed here are only tweet identifiers (with date and time added) due to Twitter's terms and conditions on re-distributing Twitter data. They need to be hydrated to be used.
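
    Because the released files hold only tweet identifiers with a date and time, a common first step is to read the IDs and group them into batches for a hydration tool. The sketch below assumes a tab-separated layout with a header row and columns named `tweet_id`, `date` and `time`; those column names, the batch size, and the sample rows are illustrative assumptions and are not taken from the dataset documentation.

```python
import csv
import io
from itertools import islice

# Hypothetical sample in the ASSUMED layout: tweet_id <TAB> date <TAB> time.
SAMPLE_TSV = (
    "tweet_id\tdate\ttime\n"
    "1237000000000000001\t2020-03-09\t12:00:01\n"
    "1237800000000000002\t2020-03-11\t08:15:30\n"
    "1238000000000000003\t2020-03-12\t23:59:59\n"
)

def read_tweet_ids(handle, start_date=None):
    """Yield tweet IDs as strings, optionally keeping rows on or after start_date (YYYY-MM-DD)."""
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        # ISO-formatted dates compare correctly as plain strings.
        if start_date is None or row["date"] >= start_date:
            yield row["tweet_id"]

def batched(ids, size=100):
    """Group an iterable of IDs into lists of at most `size` items."""
    iterator = iter(ids)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk

ids = list(read_tweet_ids(io.StringIO(SAMPLE_TSV), start_date="2020-03-11"))
batches = list(batched(ids, size=2))
print(batches)
```

    IDs are kept as strings so that no precision is lost if they later pass through tools that treat large integers as floats.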

  2. Twitter Friends

    • kaggle.com
    Updated Sep 2, 2016
    Cite
    Hubert Wassner (2016). Twitter Friends [Dataset]. https://www.kaggle.com/datasets/hwassner/TwitterFriends/discussion?sortBy=recent
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 2, 2016
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Hubert Wassner
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Twitter Friends and hashtags

    Context

    This dataset is an extract of a wider database aimed at collecting Twitter users' friends (the other accounts one follows). The overall goal is to study users' interests through who they follow and how that connects to the hashtags they have used.

    Content

    It is a list of Twitter user information. In the JSON format, each Twitter user is stored as one object in a list of more than 40,000 objects. Each object holds:

    • avatar: URL of the profile picture

    • followerCount: the number of followers of this user

    • friendsCount: the number of accounts this user follows

    • friendName: stores the @name (without the '@') of the user (beware: this name can be changed by the user)

    • id: user ID; this number cannot change (you can retrieve the screen name with this service: https://tweeterid.com/)

    • friends: the list of IDs of the users followed by this user

    • lang: the language declared by the user (in this dataset there is only "en" (English))

    • lastSeen: the timestamp of the user's last posted tweet

    • tags: the hashtags (with or without #) used by the user. These are the "trending topics" the user tweeted about.

    • tweetID: ID of the last tweet posted by this user

    You also have the CSV format, which uses the same naming convention.

    These users were selected because they tweeted on Twitter trending topics. I selected users that have at least 100 followers and follow at least 100 other accounts (in order to filter out spam and non-informative/empty accounts).
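
    As a rough illustration of the record layout and the selection rule described above, the sketch below parses a small JSON list and applies the 100-follower / 100-followed threshold. The field names come from the list above; the sample values, the string type of the IDs, and the helper names `normalize_tag` and `passes_selection` are hypothetical.

```python
import json

# Hypothetical records using the field names listed above; all values are made up.
SAMPLE_JSON = """[
  {"id": "11", "friendName": "alice", "followerCount": 250, "friendsCount": 180,
   "friends": ["21", "22"], "lang": "en", "tags": ["#nba", "python"]},
  {"id": "12", "friendName": "bob", "followerCount": 40, "friendsCount": 500,
   "friends": ["21"], "lang": "en", "tags": ["#NBA"]}
]"""

def normalize_tag(tag):
    """Strip an optional leading '#' and lowercase, since tags appear with or without it."""
    return tag.lstrip("#").lower()

def passes_selection(user, minimum=100):
    """Apply the stated rule: at least `minimum` followers and `minimum` followed accounts."""
    return user["followerCount"] >= minimum and user["friendsCount"] >= minimum

users = json.loads(SAMPLE_JSON)
selected = [u["friendName"] for u in users if passes_selection(u)]
tags = sorted({normalize_tag(t) for u in users for t in u["tags"]})
print(selected, tags)
```

    Normalizing tags before counting avoids treating `#nba`, `#NBA` and `nba` as three different hashtags.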

    Acknowledgements

    This dataset was built by Hubert Wassner (me) using the Twitter public API. More data can be obtained on request (hubert.wassner AT gmail.com); at this time I've collected over 5 million profiles in different languages. Some more information can be found here (in French only): http://wassner.blogspot.fr/2016/06/recuperer-des-profils-twitter-par.html

    Past Research

    No public research has been done (until now) on this dataset. I made a private application, described here (in French): http://wassner.blogspot.fr/2016/09/twitter-profiling.html, which uses the full dataset (millions of full profiles).

    Inspiration

    One can analyse a lot of things with this dataset:

    • stats about followers & followings
    • manifold learning or unsupervised learning from friend list
    • hashtag prediction from friend list
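
    To make the last idea concrete, here is one very simple baseline for hashtag prediction from a friend list: return the tags of the labelled user whose friend list overlaps most, measured by Jaccard similarity. This is only an illustrative sketch, not a method described by the dataset author; the toy IDs and tags are made up.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def predict_tags(friends, labelled_users):
    """Return the tags of the labelled user whose friend list overlaps most with `friends`."""
    best = max(labelled_users, key=lambda u: jaccard(friends, u["friends"]))
    return best["tags"]

# Toy data: two labelled users with known friend lists and tags.
labelled = [
    {"friends": [1, 2, 3, 4], "tags": ["nba"]},
    {"friends": [7, 8, 9], "tags": ["python"]},
]
predicted = predict_tags([2, 3, 4, 5], labelled)
print(predicted)
```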

    Contact

    Feel free to ask any question (or help request) via Twitter : @hwassner

    Enjoy! ;)

  3. A Twitter Dataset of 150+ million tweets related to COVID-19 for open...

    • zenodo.org
    application/gzip, csv +1
    Updated Apr 17, 2023
    Cite
    Juan M. Banda; Ramya Tekumalla; Guanyu Wang; Jingyuan Yu; Tuo Liu; Yuning Ding; Gerardo Chowell (2023). A Twitter Dataset of 150+ million tweets related to COVID-19 for open research [Dataset]. http://doi.org/10.5281/zenodo.3738018
    Explore at:
    application/gzip, csv, tsv (available download formats)
    Dataset updated
    Apr 17, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Juan M. Banda; Ramya Tekumalla; Guanyu Wang; Jingyuan Yu; Tuo Liu; Yuning Ding; Gerardo Chowell
    Description

    Due to the relevance of the COVID-19 global pandemic, we are releasing our dataset of tweets acquired from the Twitter Stream related to COVID-19 chatter. Since our first release we have received additional data from our new collaborators, allowing this resource to grow to its current size. Dedicated data gathering started on March 11th, yielding over 4 million tweets a day. We have added data provided by our new collaborators covering January 27th to March 27th, to provide extra longitudinal coverage.

    The data collected from the stream captures all languages, but the most prevalent are English, Spanish, and French. We release all tweets and retweets in the full_dataset.tsv file (152,920,832 unique tweets), and a cleaned version with no retweets in the full_dataset-clean.tsv file (30,990,645 unique tweets). There are several practical reasons for us to keep the retweets; tracing important tweets and their dissemination is one of them. For NLP tasks we provide the top 1000 frequent terms in frequent_terms.csv, the top 1000 bigrams in frequent_bigrams.csv, and the top 1000 trigrams in frequent_trigrams.csv. Some general statistics per day are included for both datasets in the statistics-full_dataset.tsv and statistics-full_dataset-clean.tsv files.
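
    For readers who want to produce comparable term, bigram and trigram counts from their own hydrated tweets, a minimal sketch follows. The tokenization (lowercasing and splitting on whitespace) is an assumption; the description does not say how the published frequency files were computed.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the n-token tuples of a token list, in order of appearance."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def top_ngrams(texts, n, k):
    """Count n-grams over lowercased, whitespace-split texts and return the k most common."""
    counts = Counter()
    for text in texts:
        counts.update(ngrams(text.lower().split(), n))
    return counts.most_common(k)

# Made-up example texts standing in for hydrated tweets.
texts = ["stay home stay safe", "Stay home and wash hands"]
top_bigram = top_ngrams(texts, 2, 1)
print(top_bigram)
```

    Passing `n=1` or `n=3` gives unigram or trigram counts with the same function.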

    More details can be found (and will be updated faster) at: https://github.com/thepanacealab/covid19_twitter

    As always, the tweets distributed here are only tweet identifiers (with date and time added) due to Twitter's terms and conditions, which allow re-distribution of Twitter data ONLY for research purposes. They need to be hydrated to be used.

  4. A Twitter Dataset of 40+ million tweets related to COVID-19

    • zenodo.org
    • explore.openaire.eu
    csv, tsv
    Updated Apr 17, 2023
    Cite
    Juan M. Banda; Ramya Tekumalla (2023). A Twitter Dataset of 40+ million tweets related to COVID-19 [Dataset]. http://doi.org/10.5281/zenodo.3723940
    Explore at:
    tsv, csv (available download formats)
    Dataset updated
    Apr 17, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Juan M. Banda; Ramya Tekumalla
    Description

    Due to the relevance of the COVID-19 global pandemic, we are releasing our dataset of tweets acquired from the Twitter Stream related to COVID-19 chatter. The first 9 weeks of data (from January 1st, 2020 to March 11th, 2020) contain very low tweet counts as we filtered other data we were collecting for other research purposes; however, one can see the dramatic increase as awareness of the virus spread. Dedicated data gathering ran from March 11th to March 22nd and yielded over 4 million tweets a day.

    The data collected from the stream captures all languages, but the most prevalent are English, Spanish, and French. We release all tweets and retweets in the full_dataset.tsv file (40,823,816 unique tweets), and a cleaned version with no retweets in the full_dataset-clean.tsv file (7,479,940 unique tweets). There are several practical reasons for us to keep the retweets; tracing important tweets and their dissemination is one of them. For NLP tasks we provide the top 1000 frequent terms in frequent_terms.csv, the top 1000 bigrams in frequent_bigrams.csv, and the top 1000 trigrams in frequent_trigrams.csv. Some general statistics per day are included for both datasets in the statistics-full_dataset.tsv and statistics-full_dataset-clean.tsv files.

    More details can be found (and will be updated faster) at: https://github.com/thepanacealab/covid19_twitter

    As always, the tweets distributed here are only tweet identifiers (with date and time added) due to Twitter's terms and conditions on re-distributing Twitter data. They need to be hydrated to be used.

  5. Unleashed Twitter Statistics

    • data.wu.ac.at
    • researchdata.edu.au
    csv
    Updated Oct 27, 2016
    + more versions
    Cite
    South Australian Governments (2016). Unleashed Twitter Statistics [Dataset]. https://data.wu.ac.at/odso/data_gov_au/ZjZhMmUyNmYtMzg3MC00MmNiLWE2MzktZjI4NmFmMTVmYTYy
    Explore at:
    csv (available download formats)
    Dataset updated
    Oct 27, 2016
    Dataset provided by
    South Australian Governments
    License

    Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Description

    This dataset contains statistics related to the Unleashed Twitter account (@SAUnleashed). Unleashed is an open data competition, an initiative of the Office for Digital Government, Department of the Premier and Cabinet. The data is used to monitor the level of engagement with the audience and to make communication about the event effective.

  6. Data from: A large-scale COVID-19 Twitter chatter dataset for open...

    • zenodo.org
    • explore.openaire.eu
    • +1more
    application/gzip, csv +1
    Updated Apr 17, 2023
    Cite
    Juan M. Banda; Ramya Tekumalla; Guanyu Wang; Jingyuan Yu; Tuo Liu; Yuning Ding; Gerardo Chowell (2023). A large-scale COVID-19 Twitter chatter dataset for open scientific research - an international collaboration [Dataset]. http://doi.org/10.5281/zenodo.3766929
    Explore at:
    application/gzip, csv, tsv (available download formats)
    Dataset updated
    Apr 17, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Juan M. Banda; Ramya Tekumalla; Guanyu Wang; Jingyuan Yu; Tuo Liu; Yuning Ding; Gerardo Chowell
    Description

    Due to the relevance of the COVID-19 global pandemic, we are releasing our dataset of tweets acquired from the Twitter Stream related to COVID-19 chatter. Since our first release we have received additional data from our new collaborators, allowing this resource to grow to its current size. Dedicated data gathering started on March 11th, yielding over 4 million tweets a day. We have added data provided by our new collaborators covering January 27th to March 27th, to provide extra longitudinal coverage.

    The data collected from the stream captures all languages, but the most prevalent are English, Spanish, and French. We release all tweets and retweets in the full_dataset.tsv file (230,961,781 unique tweets), and a cleaned version with no retweets in the full_dataset-clean.tsv file (52,026,197 unique tweets). There are several practical reasons for us to keep the retweets; tracing important tweets and their dissemination is one of them. For NLP tasks we provide the top 1000 frequent terms in frequent_terms.csv, the top 1000 bigrams in frequent_bigrams.csv, and the top 1000 trigrams in frequent_trigrams.csv. Some general statistics per day are included for both datasets in the statistics-full_dataset.tsv and statistics-full_dataset-clean.tsv files. For more statistics and some visualizations visit: http://www.panacealab.org/covid19/

    More details can be found (and will be updated faster) at https://github.com/thepanacealab/covid19_twitter and in our pre-print about the dataset (https://arxiv.org/abs/2004.03688).

    As always, the tweets distributed here are only tweet identifiers (with date and time added) due to Twitter's terms and conditions, which allow re-distribution of Twitter data ONLY for research purposes. They need to be hydrated to be used.

  7. Covid-19 Twitter chatter dataset for scientific use

    • marketplace.sshopencloud.eu
    Updated Apr 24, 2020
    + more versions
    Cite
    (2020). Covid-19 Twitter chatter dataset for scientific use [Dataset]. https://marketplace.sshopencloud.eu/dataset/JicoPW
    Explore at:
    Dataset updated
    Apr 24, 2020
    Description

    Due to the relevance of the COVID-19 global pandemic, we are releasing our dataset of tweets acquired from the Twitter Stream related to COVID-19 chatter. The first 9 weeks of data (from January 1st, 2020 to March 11th, 2020) contain very low tweet counts as we filtered other data we were collecting for other research purposes; however, one can see the dramatic increase as awareness of the virus spread. Dedicated data gathering started on March 11th, yielding over 4 million tweets a day. The data collected from the stream captures all languages, but the most prevalent are English, Spanish, and French. We release all tweets and retweets in the full dataset, and a cleaned version with no retweets. There are several practical reasons for us to keep the retweets; tracing important tweets and their dissemination is one of them. For NLP tasks we provide the top 1000 frequent terms, the top 1000 bigrams, and the top 1000 trigrams. Some general statistics per day are included for both datasets. We will continue to update the dataset every two days here and weekly in Zenodo. For more information on processing and visualizations please visit: www.panacealab.org/covid19

  8. October 2022 Covid-19 Twitter Streaming Dataset

    • figshare.com
    application/gzip
    Updated Nov 1, 2022
    + more versions
    Cite
    Social Media Lab (2022). October 2022 Covid-19 Twitter Streaming Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.21442044.v1
    Explore at:
    application/gzip (available download formats)
    Dataset updated
    Nov 1, 2022
    Dataset provided by
    figshare
    Authors
    Social Media Lab
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The file contains Tweet IDs* for COVID-19 related tweets collected in October 2022 from Twitter's COVID-19 Streaming Endpoint via a custom script developed by the Social Media Lab (https://socialmedialab.ca/). Visit our interactive dashboard at https://stream.covid19misinfo.org/ for a preview and some general stats about this COVID-19 Twitter streaming dataset. For more info about Twitter's COVID-19 Streaming Endpoint, visit https://developer.twitter.com/en/docs/labs/covid19-stream/overview

    *Note: In accordance with Twitter API Terms, the dataset only includes Tweet IDs (as opposed to the actual tweets and associated metadata). To recollect tweets contained in this dataset, you can use programs such as Hydrator (https://github.com/DocNow/hydrator/) or the Python library Twarc (https://github.com/DocNow/twarc/).
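
    Before handing the IDs to a hydration tool, it can help to unpack the gzip archive into smaller plain-text ID lists. The sketch below assumes the archive holds one numeric Tweet ID per line; that layout, the output file naming, and the batch size are assumptions, and the helper `write_id_batches` is hypothetical. It uses only the Python standard library and does not call Hydrator or Twarc.

```python
import gzip
import os
import tempfile

def write_id_batches(gz_path, out_dir, batch_size=50000):
    """Split a gzip file of one-ID-per-line text into plain-text batch files; return their paths."""
    paths, batch = [], []

    def flush():
        # Write the current batch to the next numbered file and reset it.
        path = os.path.join(out_dir, f"ids_{len(paths):04d}.txt")
        with open(path, "w", encoding="utf-8") as out:
            out.write("\n".join(batch) + "\n")
        paths.append(path)
        batch.clear()

    with gzip.open(gz_path, "rt", encoding="utf-8") as handle:
        for line in handle:
            tweet_id = line.strip()
            if tweet_id.isdigit():  # skip blank lines or a header, if any
                batch.append(tweet_id)
                if len(batch) == batch_size:
                    flush()
    if batch:
        flush()
    return paths

# Demonstration on a small temporary archive with made-up IDs.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample_ids.txt.gz")
    with gzip.open(sample, "wt", encoding="utf-8") as handle:
        handle.write("101\n102\n103\n")
    created = write_id_batches(sample, tmp, batch_size=2)
    batch_names = [os.path.basename(p) for p in created]
    print(batch_names)
```

    Smaller batch files make it easier to resume a long hydration run after an interruption.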

  9. COVID-19 Twitter chatter

    • kaggle.com
    zip
    Updated Jan 9, 2021
    Cite
    Data Storm (2021). COVID-19 Twitter chatter [Dataset]. https://www.kaggle.com/paulrohan2020/covid19-twitter-chatter
    Explore at:
    zip, 7099304839 bytes (available download formats)
    Dataset updated
    Jan 9, 2021
    Authors
    Data Storm
    Description

    Source

    Data gathering started from March 11th yielding over 4 million tweets a day.

    The data collected from the stream captures all languages, but the most prevalent are English, Spanish, and French. We release all tweets and retweets in the full_dataset.tsv file (891,324,837 unique tweets), and a cleaned version with no retweets in the full_dataset-clean.tsv file (223,249,143 unique tweets). There are several practical reasons for us to keep the retweets; tracing important tweets and their dissemination is one of them. For NLP tasks we provide the top 1000 frequent terms in frequent_terms.csv, the top 1000 bigrams in frequent_bigrams.csv, and the top 1000 trigrams in frequent_trigrams.csv. Some general statistics per day are included for both datasets in the full_dataset-statistics.tsv and full_dataset-clean-statistics.tsv files. For more statistics and some visualizations visit: http://www.panacealab.org/covid19/

    More details can be found (and will be updated faster) at https://github.com/thepanacealab/covid19_twitter and in our pre-print about the dataset (https://arxiv.org/abs/2004.03688).

    As always, the tweets distributed here are only tweet identifiers (with date and time added) due to Twitter's terms and conditions, which allow re-distribution of Twitter data ONLY for research purposes. They need to be hydrated to be used.

  10. Twitter users in the United States 2019-2028

    • statista.com
    Updated Jul 30, 2025
    Cite
    Statista Research Department (2025). Twitter users in the United States 2019-2028 [Dataset]. https://www.statista.com/topics/3196/social-media-usage-in-the-united-states/
    Explore at:
    Dataset updated
    Jul 30, 2025
    Dataset provided by
    Statista (http://statista.com/)
    Authors
    Statista Research Department
    Area covered
    United States
    Description

    The number of Twitter users in the United States was forecast to increase continuously between 2024 and 2028 by a total of 4.3 million users (+5.32 percent). After the ninth consecutive year of growth, the Twitter user base is estimated to reach 85.08 million users, a new peak, in 2028. Notably, the number of Twitter users has increased continuously over the past years. User figures, shown here for the platform Twitter, have been estimated by taking into account company filings or press material, secondary research, app downloads and traffic data. They refer to the average monthly active users over the period. The data shown are an excerpt of Statista's Key Market Indicators (KMI). The KMI are a collection of primary and secondary indicators on the macro-economic, demographic and technological environment in up to 150 countries and regions worldwide. All indicators are sourced from international and national statistical offices, trade associations and the trade press, and they are processed to generate comparable data sets (see supplementary notes under details for more information). Find more key insights for the number of Twitter users in countries like Canada and Mexico.
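
    The three figures quoted above can be checked against each other with a little arithmetic: subtracting the forecast increase from the 2028 peak gives the implied 2024 base, and the increase divided by that base gives the growth rate. The sketch uses only the rounded values from the description, so the implied base is an approximation rather than a published figure.

```python
peak_2028 = 85.08  # million users forecast for 2028 (from the description)
increase = 4.3     # million users added between 2024 and 2028 (from the description)

base_2024 = peak_2028 - increase         # implied 2024 user base, in millions
growth_pct = increase / base_2024 * 100  # relative growth over the period

print(f"implied 2024 base: {base_2024:.2f} million, growth: {growth_pct:.2f}%")
```

    With these inputs the growth rate rounds to 5.32 percent, which agrees with the quoted figure.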

  11. Tweets discussing the Russia/Ukraine War

    • figshare.com
    txt
    Updated May 31, 2023
    Cite
    Joshua Watt; Bridget Smart (2023). Tweets discussing the Russia/Ukraine War [Dataset]. http://doi.org/10.6084/m9.figshare.20486910.v5
    Explore at:
    txt (available download formats)
    Dataset updated
    May 31, 2023
    Dataset provided by
    figshare
    Authors
    Joshua Watt; Bridget Smart
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Russia, Ukraine
    Description

    We used the Twitter API (V2) to collect all tweets, retweets, quotes and replies containing case-insensitive versions of the hashtags: #(I)StandWithPutin, #(I)StandWithRussia, #(I)SupportRussia, #(I)StandWithUkraine, #(I)StandWithZelenskyy and #(I)SupportUkraine. These were obtained from February 23rd 2022 00:00:00 UTC until March 8th 2022 23:59:59 UTC, the fortnight after Russia invaded Ukraine. We queried the hashtags with and without the 'I', a total of 12 query hashtags, collecting 5,203,746 tweets. The data collected predates the beginning of the Russian invasion by one day. These hashtags were chosen as they were found to be the most trending hashtags related to the Russia/Ukraine war which could be easily identified with a particular side in the conflict. We calculated Botometer results for 483,100 accounts (26.5%). These accounts were randomly sampled from a list of all unique users in our dataset who posted in English. This random sample leads to an approximately uniform frequency of tweets from accounts with Botometer labels across the time frame we considered. We include the language-dependent and language-independent results from Botometer, including the Complete Automation Probabilities (CAP) and each of the sub-category scores for different bot types. Moreover, we include the display scores and raw scores from Botometer for each account. More information about the Botometer scores can be found at this link: https://rapidapi.com/OSoMe/api/botometer-pro/details. You can find our paper here: https://arxiv.org/abs/2208.07038
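
    The count of 12 query hashtags follows from six stems, each used with and without a leading "I". The sketch below builds that list and shows a case-insensitive local matcher for tweet text. It illustrates the matching rule only; it is not the authors' collection script and makes no Twitter API calls, and the example tweet is made up.

```python
import re

STEMS = ["StandWithPutin", "StandWithRussia", "SupportRussia",
         "StandWithUkraine", "StandWithZelenskyy", "SupportUkraine"]

# Each stem is queried with and without the leading "I": 6 stems x 2 = 12 hashtags.
QUERY_HASHTAGS = [f"#{prefix}{stem}" for stem in STEMS for prefix in ("", "I")]

# Case-insensitive matcher; the trailing (?!\w) avoids matching longer hashtags.
PATTERN = re.compile(r"#I?(?:" + "|".join(STEMS) + r")(?!\w)", re.IGNORECASE)

def matched_hashtags(text):
    """Return the query hashtags found in a tweet's text, lowercased."""
    return [m.group(0).lower() for m in PATTERN.finditer(text)]

found = matched_hashtags("We #standwithukraine and #ISupportUkraine!")
print(len(QUERY_HASHTAGS), found)
```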

  12. text-stats

    • huggingface.co
    Updated Dec 14, 2024
    Cite
    Alan Tseng (2024). text-stats [Dataset]. https://huggingface.co/datasets/agentlans/text-stats
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 14, 2024
    Authors
    Alan Tseng
    Description

    Text statistics

    This dataset is a combination of the following datasets:

    • agentlans/text-quality-v2
    • agentlans/readability
    • agentlans/twitter-sentiment-meta-analysis

    The main purpose is to collect the large data into one place for easy training and evaluation.

    Data Preparation and Transformation

    Quality Score Normalization

    The dataset was enhanced with additional columns, and quality scores (n = 909 533) were normalized using Ordered Quantile… See the full description on the dataset page: https://huggingface.co/datasets/agentlans/text-stats.

  13. Analytic dataset used for the study.

    • plos.figshare.com
    • datasetcatalog.nlm.nih.gov
    xlsx
    Updated Feb 8, 2024
    Cite
    Michael Mazzeffi; Lindsay Strickland; Zachary Coffman; Braden Miller; Ebony Hilton; Lynn Kohan; Ryan Keneally; Peggy McNaull; Nabil Elkassabany (2024). Analytic dataset used for the study. [Dataset]. http://doi.org/10.1371/journal.pone.0298741.s001
    Explore at:
    xlsx (available download formats)
    Dataset updated
    Feb 8, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Michael Mazzeffi; Lindsay Strickland; Zachary Coffman; Braden Miller; Ebony Hilton; Lynn Kohan; Ryan Keneally; Peggy McNaull; Nabil Elkassabany
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Twitter (recently renamed X) is used by academic anesthesiology departments as a social media platform for various purposes. We hypothesized that Twitter (X) use would be prevalent among academic anesthesiology departments and that the number of tweets would vary by region, physician faculty size, and National Institutes of Health (NIH) research funding rank. We performed a descriptive study of Twitter (X) use by academic anesthesiology departments (i.e. those with a residency program) in 2022. Original tweets were collected using a Twitter (X) analytics tool. Summary statistics were reported for tweet number and content. The median number of tweets was compared after stratifying by region, physician faculty size, and NIH funding rank. Among 166 academic anesthesiology departments, there were 73 (44.0%) that had a Twitter (X) account in 2022. There were 3,578 original tweets during the study period and the median number of tweets per department was 21 (25th-75th = 0, 75) with most tweets (55.8%) announcing general departmental news and a smaller number highlighting social events (12.5%), research (11.1%), recruiting (7.1%), DEI activities (5.2%), and trainee experiences (4.1%). There was no significant difference in the median number of tweets by region (P = 0.81). The median number of tweets differed significantly by physician faculty size (P

  14. A Large Dataset of Tweets on the 2023 Presidential Elections in Nigeria for...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Sep 17, 2023
    Cite
    Abayomi-Alli Adebayo (2023). A Large Dataset of Tweets on the 2023 Presidential Elections in Nigeria for Natural Language Processing Tasks [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8347220
    Explore at:
    Dataset updated
    Sep 17, 2023
    Dataset provided by
    Abayomi-Alli Adebayo
    Odeyinka, Abiola Michael
    Arogundade Oluwasefunmi Tale
    Abayomi-Alli Ayomide
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Nigeria
    Description

    The dataset contains tweets related to the 2023 presidential elections in Nigeria. The data was retrieved from the social media network Twitter (now X) between February 4th, 2023 and April 4th, 2023. The hashtags from the official handles, and other popular hashtags endorsed by and/or representing the candidates of each party, were used to retrieve election-related tweets through a Twitter API. Three major political parties in Nigeria were considered, and they have been labelled as Party A, Party L and Party P in this dataset. The group called "General" contains tweets collected with Independent National Electoral Commission (INEC) handles and hashtags such as @inecnigeria and #2023election, which are not directly tied to any political party.

    The dataset has been lightly pre-processed to make it useful to researchers for a wide range of natural language processing tasks such as sentiment analysis, topic modelling, fake news detection, emotion detection, election stance, etc.

    Details of the dataset collection, such as hashtags, retrieved tweets, duplicates removed, and the remaining unique tweets, are presented in Table 1.

    Table 1: Tweets collection and duplicates removal

    S/N  Party  Hashtags                                                                       Retrieved tweets  Duplicate tweets  Unique tweets
    1    X      @inecnigeria, 2023election                                                     64,496            47,275            17,195
    2    A      TinubuIsComing, emilokan, jagabanarmy, RenewedHope, BATKSM2023                 263,870           231,036           32,832
    3    L      VoteLP, NigeriaMustBeBright, PeterObiForPresident2023, ObiDatti2023, PeterObi  664,083           310,857           353,226
    4    P      NigeriaDecides, VotePDP, AtikuOkowa2023, FinalPushToVictory, RecoverNigeria    387,450           318,425           66,227
    Total                                                                                      1,379,899         907,593           468,480
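
    The figures in Table 1 can be cross-checked by summing each column and comparing the result with the final row, which is assumed here to be a totals row. The sketch copies the numbers exactly as listed and makes no attempt to correct them.

```python
# Rows as (group, retrieved, duplicates, unique), copied from Table 1 as listed.
ROWS = [
    ("X", 64_496, 47_275, 17_195),
    ("A", 263_870, 231_036, 32_832),
    ("L", 664_083, 310_857, 353_226),
    ("P", 387_450, 318_425, 66_227),
]
# Final row of the table, ASSUMED to be column totals.
LISTED_TOTALS = (1_379_899, 907_593, 468_480)

column_sums = tuple(sum(row[i] for row in ROWS) for i in (1, 2, 3))
for name, listed, computed in zip(("retrieved", "duplicates", "unique"),
                                  LISTED_TOTALS, column_sums):
    status = "matches" if listed == computed else f"differs by {computed - listed:+,}"
    print(f"{name}: listed {listed:,}, column sum {computed:,} ({status})")
```

    On the figures as listed, the retrieved and duplicate totals match their column sums, while the unique-tweet column sums to 469,480 rather than the listed 468,480. The description does not say whether this reflects a typo or an additional filtering step.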

    To encourage NLP tasks, we uploaded in this Version One the following files:

    The combined dataset with pre-processed tweets and their metadata, with duplicates removed, is in the file labelled “Combined Dataset Pre-processed without duplicates.csv”

    General statistics on each corpus are in the file labelled “Dataset Statistics.xlsx”

    The preprocessed corpus from the general group with the tweet contents only is in the file labelled “Preprocessed_Tweet only_GENERAL.xlsx”

    The preprocessed corpus from Party A with the tweet contents only is in the file labelled “Preprocessed_Tweet only_Party A.xlsx”

    The preprocessed corpus from Party L with the tweet contents only is in the file labelled “Preprocessed_Tweet only_Party L.xlsx”

    The preprocessed corpus from Party P with the tweet contents only is in the file labelled “Preprocessed_Tweet only_Party P.xlsx”

    The top 100 frequent tokens are in the file labelled “Top 100 Tokens and weights.xlsx”

    The top frequent bigrams and their weights are in the file labelled “Top 100 Bigrams and weights.xlsx”

    The top frequent trigrams and their weights are in the file labelled “Top 100 Trigrams and weights.xlsx”

  15. Latest statistics on milk utilisation by dairies - official statistics

    • s3.amazonaws.com
    • gov.uk
    • +1more
    Updated May 4, 2022
    + more versions
    Cite
    Department for Environment, Food & Rural Affairs (2022). Latest statistics on milk utilisation by dairies - official statistics [Dataset]. https://s3.amazonaws.com/thegovernmentsays-files/content/180/1807491.html
    Explore at:
    Dataset updated
    May 4, 2022
    Dataset provided by
    GOV.UK (http://gov.uk/)
    Authors
    Department for Environment, Food & Rural Affairs
    Description

    Please note that this Official Statistics publication is no longer updated. Latest statistics on milk utilisation by dairies - national statistics replaced this publication in 2017. Historical publications can be accessed in Milk utilisation by dairies.

    This monthly official statistics notice includes information on the volume of milk used by dairies in England and Wales in the production of drinking milk and milk products. The monthly official statistics on the use of milk by dairies in England and Wales are combined with similar information from Scotland and Northern Ireland to produce a dataset for the UK as a whole. This gives UK milk availability and disposals and the production of liquid drinking milk and milk products such as cheese, butter and milk powders.

    Additional information

    UK supplies of milk products

    Production and overseas trade are brought together in the quarterly milk product supplies dataset. This provides information on how much butter, cheese, cream, condensed milk and milk powders is available for use in the UK, and gives a measure of UK self-sufficiency for these products.

    Due to significant revisions to Northern Ireland data for 2016, the quarterly supplies dataset has been re-issued to maintain comparability with the monthly production data.

    Structure of the UK dairy industry

    Tables showing the size and structure of the UK dairy industry, both in terms of the number of enterprises producing milk products and also in terms of the volumes of production of milk, butter and cheese can be found here.

    Important usage information

    • not designated as national statistics
    • if you require datasets in another format such as Excel, please get in touch; contact details are given below

    Next update: see the statistics release calendar

    For further information please contact:
    Julie.Rumsey@defra.gsi.gov.uk
    Twitter: @DefraStats (https://www.twitter.com/@defrastats)

  16. Social Power NBA

    • kaggle.com
    Updated Aug 1, 2017
    Cite
    Noah Gift (2017). Social Power NBA [Dataset]. https://www.kaggle.com/noahgift/social-power-nba/kernels
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 1, 2017
    Dataset provided by
    Kaggle
    Authors
    Noah Gift
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Context

    This data set contains combined on-court performance data for NBA players in the 2016-2017 season, alongside salary, Twitter engagement, and Wikipedia traffic data.

    Further information can be found in a series of articles for IBM Developerworks: "Explore valuation and attendance using data science and machine learning" and "Exploring the individual NBA players".

    Slides from a talk about this dataset, given at Strata in March 2018:

    https://www.slideshare.net/noahgift/social-power-andinfluenceinthenba-89807740?qid=3f9f835a-f3d7-4174-8a8c-c97f9c82e614&v=&b=&from_search=1

    Further reading on this dataset is in Chapter 6 of the book Pragmatic AI: An introduction to Cloud-based Machine Learning (or the full book), and in lesson 9 of Essential Machine Learning and AI with Python and Jupyter Notebook.

    Followup Items

    Acknowledgement

    Data sources include ESPN, Basketball-Reference, Twitter, Five-ThirtyEight, and Wikipedia. The source code for this dataset (in Python and R) can be found on GitHub. Links to more writing can be found at noahgift.com.

    Inspiration

    • Do NBA fans know more about who the best players are, or do owners?
    • What is the true worth of the social media presence of athletes in the NBA?
  17. Dataset for "Geospatial analysis of toponyms in geotagged social media...

    • zenodo.org
    zip
    Updated Oct 1, 2024
    Cite
    Takayuki Hiraoka; Takashi Kirimura; Naoya Fujiwara (2024). Dataset for "Geospatial analysis of toponyms in geotagged social media posts" [Dataset]. http://doi.org/10.5281/zenodo.13860969
    Explore at:
    Available download formats: zip
    Dataset updated
    Oct 1, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Takayuki Hiraoka; Takashi Kirimura; Naoya Fujiwara
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Geotagged Twitter posts dataset

    Dataset used for the research presented in the following paper: Takayuki Hiraoka, Takashi Kirimura, Naoya Fujiwara (2024) "Geospatial analysis of toponyms in geo-tagged social media posts".

    We collected georeferenced Twitter posts tagged to coordinates inside the bounding box of Japan between 2012 and 2018. The present dataset represents the spatial distributions of all geotagged posts, as well as of posts whose text contains each of 24 domestic toponyms, 12 common nouns, and 6 foreign toponyms. The code used to analyze the data is available on GitHub.

    Data description

    • selected_geotagged_tweet_data/: Number of geotagged Twitter posts in each grid cell. Each csv file under this directory associates each grid cell (spanning 30 seconds of latitude and 45 seconds of longitude, approximately a 1 km x 1 km square, and identified by an 8-digit code m3code) with the number of geotagged tweets tagged to coordinates inside that cell (tweetcount). file_names.json maps each of the toponyms studied in this work to the corresponding data file (all denotes the full data).
    • population/population_center_2020.xlsx: Center of population of each municipality based on the 2020 census. Derived from data published by the Statistics Bureau of Japan on their website (Japanese)
    • population/census2015mesh3_totalpop_setai.csv: Resident population in each grid cell based on the 2015 census. Derived from data published by the Statistics Bureau of Japan on e-stat (Japanese)
    • population/economiccensus2016mesh3_jigyosyo_jugyosya.csv: Employed population in each grid cell based on the 2016 Economic Census. Derived from data published by the Statistics Bureau of Japan on e-stat (Japanese)
    • japan_MetropolitanEmploymentArea2015map/: Shape file for the boundaries of Metropolitan Employment Areas (MEA) in Japan. See this website for details of MEA.
    • ward_shapefiles/: Shape files for the boundaries of wards in large cities, published by the Statistics Bureau of Japan on e-stat
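
    As a rough illustration of how the per-cell count files described above might be read, the following Python sketch loads m3code and tweetcount columns and converts a code to the south-west corner of its cell. It assumes that m3code follows the standard Japanese third-level mesh layout (which matches the 30-second by 45-second cell size stated above but is not confirmed by the description), and that each csv file is comma-separated with a header row. The sample rows are made up for illustration and are not taken from the dataset.

```python
import csv
import io


def m3code_to_southwest_corner(m3code: str) -> tuple[float, float]:
    """Return (lat, lon) in degrees of the south-west corner of a grid cell.

    Assumption: m3code uses the standard third-level mesh layout, where
    digits 1-2 and 3-4 locate the primary cell (40' lat x 1 degree lon),
    digits 5-6 the secondary cell (5' x 7.5') and
    digits 7-8 the third-level cell (30" x 45").
    """
    if len(m3code) != 8 or not m3code.isdigit():
        raise ValueError(f"expected an 8-digit code, got {m3code!r}")
    lat = int(m3code[0:2]) / 1.5 + int(m3code[4]) * 5 / 60 + int(m3code[6]) * 30 / 3600
    lon = int(m3code[2:4]) + 100 + int(m3code[5]) * 7.5 / 60 + int(m3code[7]) * 45 / 3600
    return lat, lon


def load_tweet_counts(csv_file) -> dict[str, int]:
    """Map each m3code to its tweetcount (header row assumed)."""
    reader = csv.DictReader(csv_file)
    return {row["m3code"]: int(row["tweetcount"]) for row in reader}


# Made-up sample rows standing in for one file of the dataset.
sample = io.StringIO("m3code,tweetcount\n53394611,1200\n53394612,800\n")
counts = load_tweet_counts(sample)

# Locate the cell with the most geotagged tweets in the sample.
busiest = max(counts, key=counts.get)
lat, lon = m3code_to_southwest_corner(busiest)
print(busiest, counts[busiest], round(lat, 4), round(lon, 4))
```

    With real files, the io.StringIO object would be replaced by an open file handle for one of the csv files listed in file_names.json.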
  18. September 2022 Covid-19 Vaccines Twitter Streaming Dataset

    • figshare.com
    application/gzip
    Updated Oct 1, 2022
    + more versions
    Cite
    Social Media Lab (2022). September 2022 Covid-19 Vaccines Twitter Streaming Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.21257091.v1
    Explore at:
    Available download formats: application/gzip
    Dataset updated
    Oct 1, 2022
    Dataset provided by
    figshare
    Authors
    Social Media Lab
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The file contains Tweet IDs* for COVID-19 related tweets containing at least one vaccine-related word (i.e., words that start with vaccin*, vacin*, or vax*) collected in September, 2022 from Twitter's COVID-19 Streaming Endpoint via a custom script developed by the Social Media Lab (https://socialmedialab.ca/).

    Visit our interactive dashboard at https://stream.covid19misinfo.org/ for a preview and some general stats about this COVID-19 Twitter streaming dataset.

    For more info about Twitter's COVID-19 Streaming Endpoint, visit https://developer.twitter.com/en/docs/labs/covid19-stream/overview

    *Note: In accordance with Twitter API Terms, the dataset only includes Tweet IDs (as opposed to the actual tweets and associated metadata). To recollect tweets contained in this dataset, you can use programs such as Hydrator (https://github.com/DocNow/hydrator/) or the Python library Twarc (https://github.com/DocNow/twarc/).
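
    The vaccine-word criterion described above (words that start with vaccin, vacin, or vax) can be approximated with a regular expression. The Python sketch below is only an illustration and is not the Social Media Lab's script. It assumes that a "word" begins at a regex word boundary and that matching is case-insensitive; the function names are hypothetical.

```python
import re

# Assumption: a word starts at a regex word boundary (\b) and case is ignored.
# The Lab's own script may tokenize tweets differently.
VACCINE_WORD = re.compile(r"\b(?:vaccin|vacin|vax)\w*", re.IGNORECASE)


def vaccine_words(text: str) -> list[str]:
    """Return every word in text that starts with vaccin, vacin, or vax."""
    return VACCINE_WORD.findall(text)


def is_vaccine_related(text: str) -> bool:
    """True if text contains at least one such word."""
    return VACCINE_WORD.search(text) is not None


# Made-up example texts, not taken from the dataset.
examples = [
    "Got my vaccination today",
    "Vaxxed and relaxed",
    "Nothing relevant here",
]
for text in examples:
    print(is_vaccine_related(text), vaccine_words(text))
```

    Because the pattern anchors on the start of a word, a string such as "antivax" written as one word would not match under this sketch, whereas "anti-vax" would.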

  19. Historical statistics notices and dataset on monthly wholesale fruit and...

    • gov.uk
    Updated Sep 1, 2016
    Cite
    Department for Environment, Food & Rural Affairs (2016). Historical statistics notices and dataset on monthly wholesale fruit and vegetable prices 2016 [Dataset]. https://www.gov.uk/government/statistics/historic-statistics-notices-on-wholesale-fruit-and-vegetable-prices-2016
    Explore at:
    Dataset updated
    Sep 1, 2016
    Dataset provided by
    GOV.UK
    Authors
    Department for Environment, Food & Rural Affairs
    Description

    This publication gives previously published copies of the monthly National Statistics publication on wholesale fruit and vegetable prices that showed figures for 2016. Each publication gives the figures available at that time. The figures are subject to revision each month as new information becomes available. This publication also contains the previously published monthly dataset on wholesale fruit and vegetable prices which gives prices up to July 2016.

    The latest weekly data sets are available here.

    The publications give the average wholesale prices of selected home-grown horticultural produce. The prices are national averages of the most usual prices charged by wholesalers for selected home-grown fruit and vegetables at the wholesale markets in Birmingham, Bristol, Liverpool and New Spitalfields. For selected home-grown cut flowers and flowering pot plants the average also includes information from the wholesale market at New Covent Garden up to February 2016.

    Defra statistics: prices

    Email: prices@defra.gov.uk

    You can also contact us via Twitter: https://twitter.com/DefraStats
    

  20. Twitter users in Indonesia 2019-2028

    • statista.com
    Updated Mar 27, 2025
    Cite
    Statista Research Department (2025). Twitter users in Indonesia 2019-2028 [Dataset]. https://www.statista.com/topics/8306/social-media-in-indonesia/
    Explore at:
    Dataset updated
    Mar 27, 2025
    Dataset provided by
    Statista (http://statista.com/)
    Authors
    Statista Research Department
    Area covered
    Indonesia
    Description

    The number of Twitter users in Indonesia was forecast to increase continuously between 2024 and 2028, by a total of 1.4 million users (+6.14 percent). After the ninth consecutive year of growth, the Twitter user base is estimated to reach 24.25 million users in 2028, a new peak. Notably, the number of Twitter users has increased continuously over the past years.

    User figures, shown here for the platform Twitter, have been estimated by taking into account company filings or press material, secondary research, app downloads and traffic data. They refer to the average monthly active users over the period.

    The data shown are an excerpt of Statista's Key Market Indicators (KMI). The KMI are a collection of primary and secondary indicators on the macro-economic, demographic and technological environment in up to 150 countries and regions worldwide. All indicators are sourced from international and national statistical offices, trade associations and the trade press, and they are processed to generate comparable data sets (see supplementary notes under details for more information).

    Find more key insights for the number of Twitter users in countries like Malaysia and Singapore.
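
    The quoted figures can be cross-checked with a little arithmetic. The Python sketch below assumes that the percentage is measured relative to the 2024 user base and that this base equals the 2028 forecast minus the stated increase; neither assumption is confirmed by the description. The variable names are illustrative.

```python
# Figures quoted in the description (both are rounded in the source).
forecast_2028_millions = 24.25
increase_millions = 1.4

# Assumption: the 2024 base is the 2028 forecast minus the stated increase.
implied_2024_millions = forecast_2028_millions - increase_millions

# Assumption: the quoted percentage is relative to that 2024 base.
implied_growth_percent = increase_millions / implied_2024_millions * 100

print(f"implied 2024 base: {implied_2024_millions:.2f} million users")
print(f"implied growth 2024-2028: {implied_growth_percent:.2f} percent")
```

    The result lands close to the quoted +6.14 percent; a small gap is expected because the published increase of 1.4 million is itself rounded.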
