Due to the relevance of the COVID-19 global pandemic, we are releasing our dataset of tweets acquired from the Twitter Stream related to COVID-19 chatter. The first 9 weeks of data (from January 1st, 2020 to March 11th, 2020) contain very low tweet counts, as they were filtered from data we were collecting for other research purposes; however, one can see the dramatic increase as awareness of the virus spread. Dedicated data gathering ran from March 11th to March 29th and yielded over 4 million tweets a day.
The data collected from the stream captures all languages, but the most prevalent are English, Spanish, and French. We release all tweets and retweets in the full_dataset.tsv file (70,569,368 unique tweets), and a cleaned version with no retweets in the full_dataset-clean.tsv file (13,535,912 unique tweets). There are several practical reasons for us to keep the retweets; tracing important tweets and their dissemination is one of them. For NLP tasks we provide the top 1000 frequent terms in frequent_terms.csv, the top 1000 bigrams in frequent_bigrams.csv, and the top 1000 trigrams in frequent_trigrams.csv. Some general statistics per day are included for both datasets in the statistics-full_dataset.tsv and statistics-full_dataset-clean.tsv files.
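The frequent-term, bigram, and trigram files above can be approximated from hydrated tweet text with a simple n-gram count. A minimal sketch (the naive tokenizer and the inline sample tweets are illustrative assumptions, not the authors' exact pipeline):

```python
from collections import Counter
import re

def top_ngrams(texts, n=1, k=1000):
    """Return the k most frequent n-grams across a list of tweet texts."""
    counts = Counter()
    for text in texts:
        # Naive tokenizer: lowercase words, numbers, hashtags, and mentions.
        tokens = re.findall(r"[a-z0-9#@']+", text.lower())
        counts.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return counts.most_common(k)

# Tiny inline example instead of millions of hydrated tweets:
tweets = ["COVID-19 cases rise", "covid-19 vaccine news", "cases rise again"]
print(top_ngrams(tweets, n=2, k=3))
```

The same function with `n=1` and `n=3` would produce term and trigram counts analogous to frequent_terms.csv and frequent_trigrams.csv.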
More details can be found (and will be updated faster) at: https://github.com/thepanacealab/covid19_twitter
As always, the tweets distributed here are only tweet identifiers (with date and time added), per Twitter's terms and conditions on re-distributing Twitter data. They need to be hydrated to be used.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This dataset is an extract of a wider database aimed at collecting Twitter users' friends (the other accounts one follows). The overall goal is to study users' interests through who they follow and the connection to the hashtags they have used.
It is a list of Twitter user records. In the JSON format, each Twitter user is stored as one object in a list of more than 40,000 objects. Each object holds:
avatar: URL of the profile picture
followerCount: the number of followers of this user
friendsCount: the number of accounts this user follows
friendName: the @name (without the '@') of the user (beware: this name can be changed by the user)
id: the user ID; this number cannot change (you can retrieve the screen name with this service: https://tweeterid.com/)
friends: the list of IDs the user follows (i.e., the IDs of users followed by this user)
lang: the language declared by the user (in this dataset there is only "en" (English))
lastSeen: the timestamp of this user's most recent tweet
tags: the hashtags (with or without '#') used by the user; these are the trending topics the user tweeted about
tweetID: the ID of the last tweet posted by this user
You also have the CSV format, which uses the same naming convention.
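A minimal sketch of loading and filtering the JSON variant described above (the file name and the inline sample records are assumptions for illustration):

```python
import json

def load_profiles(path):
    """Load the list of user objects described above from a JSON file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def active_profiles(profiles, min_followers=100):
    """Keep users with at least min_followers followers,
    mirroring the dataset's own selection filter."""
    return [p for p in profiles if p.get("followerCount", 0) >= min_followers]

# Tiny inline example instead of the real 40,000-object file:
sample = [{"friendName": "alice", "id": 1, "followerCount": 250,
           "friends": [2, 3], "lang": "en", "tags": ["python"]},
          {"friendName": "bob", "id": 2, "followerCount": 12,
           "friends": [], "lang": "en", "tags": []}]
print([p["friendName"] for p in active_profiles(sample)])  # ['alice']
```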
These users were selected because they tweeted on Twitter trending topics; I selected users that have at least 100 followers and follow at least 100 other accounts (in order to filter out spam and non-informative/empty accounts).
This dataset was built by Hubert Wassner (me) using the Twitter public API. More data can be obtained on request (hubert.wassner AT gmail.com); at this time I have collected over 5 million profiles in different languages. Some more information can be found here (in French only): http://wassner.blogspot.fr/2016/06/recuperer-des-profils-twitter-par.html
No public research has been done (so far) on this dataset. I made a private application, described here (in French): http://wassner.blogspot.fr/2016/09/twitter-profiling.html, which uses the full dataset (millions of full profiles).
One can analyze many things with this dataset.
Feel free to ask any question (or help request) via Twitter : @hwassner
Enjoy! ;)
Due to the relevance of the COVID-19 global pandemic, we are releasing our dataset of tweets acquired from the Twitter Stream related to COVID-19 chatter. Since our first release we have received additional data from our new collaborators, allowing this resource to grow to its current size. Dedicated data gathering started on March 11th, yielding over 4 million tweets a day. We have added additional data provided by our new collaborators, covering January 27th to March 27th, to provide extra longitudinal coverage.
The data collected from the stream captures all languages, but the most prevalent are English, Spanish, and French. We release all tweets and retweets in the full_dataset.tsv file (152,920,832 unique tweets), and a cleaned version with no retweets in the full_dataset-clean.tsv file (30,990,645 unique tweets). There are several practical reasons for us to keep the retweets; tracing important tweets and their dissemination is one of them. For NLP tasks we provide the top 1000 frequent terms in frequent_terms.csv, the top 1000 bigrams in frequent_bigrams.csv, and the top 1000 trigrams in frequent_trigrams.csv. Some general statistics per day are included for both datasets in the statistics-full_dataset.tsv and statistics-full_dataset-clean.tsv files.
More details can be found (and will be updated faster) at: https://github.com/thepanacealab/covid19_twitter
As always, the tweets distributed here are only tweet identifiers (with date and time added), per Twitter's terms and conditions, which allow re-distribution of Twitter data ONLY for research purposes. They need to be hydrated to be used.
Due to the relevance of the COVID-19 global pandemic, we are releasing our dataset of tweets acquired from the Twitter Stream related to COVID-19 chatter. The first 9 weeks of data (from January 1st, 2020 to March 11th, 2020) contain very low tweet counts, as they were filtered from data we were collecting for other research purposes; however, one can see the dramatic increase as awareness of the virus spread. Dedicated data gathering ran from March 11th to March 22nd and yielded over 4 million tweets a day.
The data collected from the stream captures all languages, but the most prevalent are English, Spanish, and French. We release all tweets and retweets in the full_dataset.tsv file (40,823,816 unique tweets), and a cleaned version with no retweets in the full_dataset-clean.tsv file (7,479,940 unique tweets). There are several practical reasons for us to keep the retweets; tracing important tweets and their dissemination is one of them. For NLP tasks we provide the top 1000 frequent terms in frequent_terms.csv, the top 1000 bigrams in frequent_bigrams.csv, and the top 1000 trigrams in frequent_trigrams.csv. Some general statistics per day are included for both datasets in the statistics-full_dataset.tsv and statistics-full_dataset-clean.tsv files.
More details can be found (and will be updated faster) at: https://github.com/thepanacealab/covid19_twitter
As always, the tweets distributed here are only tweet identifiers (with date and time added), per Twitter's terms and conditions on re-distributing Twitter data. They need to be hydrated to be used.
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
This dataset contains statistics related to the Unleashed Twitter account (@SAUnleashed). Unleashed is an open data competition, an initiative of the Office for Digital Government, Department of the Premier and Cabinet. The data is used to monitor the level of engagement with the audience and to make communication about the event effective.
Due to the relevance of the COVID-19 global pandemic, we are releasing our dataset of tweets acquired from the Twitter Stream related to COVID-19 chatter. Since our first release we have received additional data from our new collaborators, allowing this resource to grow to its current size. Dedicated data gathering started on March 11th, yielding over 4 million tweets a day. We have added additional data provided by our new collaborators, covering January 27th to March 27th, to provide extra longitudinal coverage.
The data collected from the stream captures all languages, but the most prevalent are English, Spanish, and French. We release all tweets and retweets in the full_dataset.tsv file (230,961,781 unique tweets), and a cleaned version with no retweets in the full_dataset-clean.tsv file (52,026,197 unique tweets). There are several practical reasons for us to keep the retweets; tracing important tweets and their dissemination is one of them. For NLP tasks we provide the top 1000 frequent terms in frequent_terms.csv, the top 1000 bigrams in frequent_bigrams.csv, and the top 1000 trigrams in frequent_trigrams.csv. Some general statistics per day are included for both datasets in the statistics-full_dataset.tsv and statistics-full_dataset-clean.tsv files. For more statistics and some visualizations visit: http://www.panacealab.org/covid19/
More details can be found (and will be updated faster) at https://github.com/thepanacealab/covid19_twitter, along with our pre-print about the dataset: https://arxiv.org/abs/2004.03688
As always, the tweets distributed here are only tweet identifiers (with date and time added), per Twitter's terms and conditions, which allow re-distribution of Twitter data ONLY for research purposes. They need to be hydrated to be used.
Due to the relevance of the COVID-19 global pandemic, we are releasing our dataset of tweets acquired from the Twitter Stream related to COVID-19 chatter. The first 9 weeks of data (from January 1st, 2020 to March 11th, 2020) contain very low tweet counts, as they were filtered from data we were collecting for other research purposes; however, one can see the dramatic increase as awareness of the virus spread. Dedicated data gathering started on March 11th, yielding over 4 million tweets a day. The data collected from the stream captures all languages, but the most prevalent are English, Spanish, and French. We release all tweets and retweets in the full dataset, and a cleaned version with no retweets. There are several practical reasons for us to keep the retweets; tracing important tweets and their dissemination is one of them. For NLP tasks we provide the top 1000 frequent terms, the top 1000 bigrams, and the top 1000 trigrams. Some general statistics per day are included for both datasets. We will continue to update the dataset every two days here and weekly on Zenodo. For more information on processing and visualizations please visit: www.panacealab.org/covid19
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The file contains Tweet IDs* for COVID-19 related tweets collected in October 2022 from Twitter's COVID-19 Streaming Endpoint via a custom script developed by the Social Media Lab (https://socialmedialab.ca/). Visit our interactive dashboard at https://stream.covid19misinfo.org/ for a preview and some general stats about this COVID-19 Twitter streaming dataset. For more info about Twitter's COVID-19 Streaming Endpoint, visit https://developer.twitter.com/en/docs/labs/covid19-stream/overview. Note: In accordance with Twitter API Terms, the dataset only includes Tweet IDs (as opposed to the actual tweets and associated metadata). To recollect tweets contained in this dataset, you can use programs such as Hydrator (https://github.com/DocNow/hydrator/) or the Python library Twarc (https://github.com/DocNow/twarc/).
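Hydration tools such as Twarc work by re-requesting tweets in batches; the Twitter v2 tweet lookup endpoint accepts up to 100 IDs per call. A hedged sketch of just the batching step, independent of any particular client or credentials:

```python
def batch_ids(tweet_ids, size=100):
    """Split a list of tweet IDs into lookup-sized batches of at most `size`.
    100 is the per-request limit of the v2 tweet lookup endpoint."""
    return [tweet_ids[i:i + size] for i in range(0, len(tweet_ids), size)]

# Example: 250 IDs become three batches of 100, 100, and 50.
ids = [str(n) for n in range(250)]
batches = batch_ids(ids)
print(len(batches), len(batches[-1]))  # 3 50
```

Each batch would then be passed to a hydration client (e.g. Twarc) alongside valid API credentials.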
Data gathering started on March 11th, yielding over 4 million tweets a day.
The data collected from the stream captures all languages, but the most prevalent are English, Spanish, and French. We release all tweets and retweets in the full_dataset.tsv file (891,324,837 unique tweets), and a cleaned version with no retweets in the full_dataset-clean.tsv file (223,249,143 unique tweets). There are several practical reasons for us to keep the retweets; tracing important tweets and their dissemination is one of them. For NLP tasks we provide the top 1000 frequent terms in frequent_terms.csv, the top 1000 bigrams in frequent_bigrams.csv, and the top 1000 trigrams in frequent_trigrams.csv. Some general statistics per day are included for both datasets in the full_dataset-statistics.tsv and full_dataset-clean-statistics.tsv files. For more statistics and some visualizations visit: http://www.panacealab.org/covid19/
More details can be found (and will be updated faster) at https://github.com/thepanacealab/covid19_twitter, along with our pre-print about the dataset: https://arxiv.org/abs/2004.03688
As always, the tweets distributed here are only tweet identifiers (with date and time added), per Twitter's terms and conditions, which allow re-distribution of Twitter data ONLY for research purposes. They need to be hydrated to be used.
The number of Twitter users in the United States was forecast to increase continuously between 2024 and 2028 by a total of 4.3 million users (+5.32 percent). After a ninth consecutive year of growth, the Twitter user base is estimated to reach 85.08 million users, a new peak, in 2028. Notably, the number of Twitter users has been continuously increasing over the past years. User figures, shown here for the platform Twitter, have been estimated by taking into account company filings or press material, secondary research, app downloads and traffic data. They refer to the average monthly active users over the period. The data shown are an excerpt of Statista's Key Market Indicators (KMI). The KMI are a collection of primary and secondary indicators on the macro-economic, demographic and technological environment in up to 150 countries and regions worldwide. All indicators are sourced from international and national statistical offices, trade associations and the trade press, and they are processed to generate comparable datasets (see supplementary notes under details for more information). Find more key insights for the number of Twitter users in countries like Canada and Mexico.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We used the Twitter API (v2) to collect all tweets, retweets, quotes and replies containing case-insensitive versions of the hashtags #(I)StandWithPutin, #(I)StandWithRussia, #(I)SupportRussia, #(I)StandWithUkraine, #(I)StandWithZelenskyy and #(I)SupportUkraine. These were obtained from February 23rd 2022 00:00:00 UTC until March 8th 2022 23:59:59 UTC, the fortnight after Russia invaded Ukraine. We queried the hashtags with and without the 'I', a total of 12 query hashtags, collecting 5,203,746 tweets. The data collected predates the beginning of the Russian invasion by one day. These hashtags were chosen because they were the most trending hashtags related to the Russia/Ukraine war that could be easily identified with a particular side in the conflict. We calculated Botometer results for 483,100 (26.5%) of accounts. These accounts were randomly sampled from a list of all unique users in our dataset who posted in English. This random sample leads to an approximately uniform frequency of tweets from accounts with Botometer labels across the time frame we considered. We include the language-dependent and language-independent results from Botometer, including the Complete Automation Probabilities (CAP) and each of the sub-category scores for different bot types. Moreover, we include the display scores and raw scores from Botometer for each account. More information about the Botometer scores can be found at this link: https://rapidapi.com/OSoMe/api/botometer-pro/details. You can find our paper here: https://arxiv.org/abs/2208.07038
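The 12 query hashtags can be enumerated from the six base tags; since the query itself is case-insensitive, only the optional 'I' prefix needs expanding. A small sketch:

```python
# The six base hashtags named in the description above.
bases = ["StandWithPutin", "StandWithRussia", "SupportRussia",
         "StandWithUkraine", "StandWithZelenskyy", "SupportUkraine"]

# Expand each base with and without the leading "I" -> 12 query hashtags.
query_hashtags = [f"#{prefix}{base}" for base in bases for prefix in ("", "I")]
print(len(query_hashtags))  # 12
```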
Text statistics
This dataset is a combination of the following datasets:
agentlans/text-quality-v2 agentlans/readability agentlans/twitter-sentiment-meta-analysis
The main purpose is to collect this large body of data in one place for easy training and evaluation.
Data Preparation and Transformation
Quality Score Normalization
The dataset was enhanced with additional columns, and quality scores (n = 909 533) were normalized using Ordered Quantile… See the full description on the dataset page: https://huggingface.co/datasets/agentlans/text-stats.
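Ordered Quantile normalization maps values to standard-normal quantiles through their ranks. A minimal rank-based sketch of the idea (not the exact implementation used to prepare this dataset):

```python
from statistics import NormalDist

def ordered_quantile(values):
    """Rank-based inverse-normal transform: order is preserved and the
    output is approximately standard normal."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    out = [0.0] * n
    for rank, i in enumerate(order, start=1):
        # The 0.5 offset keeps the quantile argument strictly inside (0, 1).
        out[i] = NormalDist().inv_cdf((rank - 0.5) / n)
    return out

print(ordered_quantile([10.0, 2.0, 7.0]))
```

The median maps to 0 and the extremes map to symmetric positive/negative quantiles, which is the property such quality-score normalization relies on.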
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Twitter (recently renamed X) is used by academic anesthesiology departments as a social media platform for various purposes. We hypothesized that Twitter (X) use would be prevalent among academic anesthesiology departments and that the number of tweets would vary by region, physician faculty size, and National Institutes of Health (NIH) research funding rank. We performed a descriptive study of Twitter (X) use by academic anesthesiology departments (i.e., those with a residency program) in 2022. Original tweets were collected using a Twitter (X) analytics tool. Summary statistics were reported for tweet number and content. The median number of tweets was compared after stratifying by region, physician faculty size, and NIH funding rank. Among 166 academic anesthesiology departments, 73 (44.0%) had a Twitter (X) account in 2022. There were 3,578 original tweets during the study period, and the median number of tweets per department was 21 (25th to 75th percentile: 0 to 75), with most tweets (55.8%) announcing general departmental news and smaller numbers highlighting social events (12.5%), research (11.1%), recruiting (7.1%), DEI activities (5.2%), and trainee experiences (4.1%). There was no significant difference in the median number of tweets by region (P = 0.81). The median number of tweets differed significantly by physician faculty size (P
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset contains tweets related to the 2023 presidential elections in Nigeria. The data was retrieved from the social media network Twitter (now X) between February 4th, 2023 and April 4th, 2023. Hashtags from the official handles, and other popular hashtags endorsed by and/or representing the candidates of each party, were used to retrieve election-related tweets via the Twitter API. Three major political parties in Nigeria were considered, and they have been labelled Party A, Party L and Party P in this dataset. The party or group called "General" contains tweets from Independent National Electoral Commission (INEC) hashtags such as @inecnigeria and #2023election, which are not directly for any political party.
The dataset has been lightly pre-processed to make it useful to researchers for a wide range of natural language processing tasks such as sentiment analysis, topic modelling, fake news detection, emotion detection, election stance, etc.
Details of the dataset collection, such as hashtags, retrieved tweets, duplicates removed, and the remaining unique tweets, are presented in Table 1.
Table 1: Tweets collection and duplicates removal

| S/N   | Party | Hash tags    | Retrieved tweets | Duplicate tweets | Unique tweets |
| 1     | X     | @inecnigeria | 64,496           | 47,275           | 17,195        |
| 2     | A     |              | 263,870          | 231,036          | 32,832        |
| 3     | L     |              | 664,083          | 310,857          | 353,226       |
| 4     | P     |              | 387,450          | 318,425          | 66,227        |
| Total |       |              | 1,379,899        | 907,593          | 468,480       |
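Duplicate counts like those in Table 1 come from collapsing repeated tweet text. A minimal sketch of one common approach, exact matching after light normalization (the real preprocessing pipeline may differ):

```python
def dedupe(tweets):
    """Drop tweets whose normalized text was already seen,
    keeping the first occurrence of each."""
    seen, unique = set(), []
    for t in tweets:
        key = " ".join(t.lower().split())  # case-fold and collapse whitespace
        if key not in seen:
            seen.add(key)
            unique.append(t)
    return unique

# Tiny illustrative sample, not real dataset rows:
sample = ["Vote wisely!", "vote  wisely!", "INEC announces dates"]
print(dedupe(sample))  # ['Vote wisely!', 'INEC announces dates']
```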
To encourage NLP tasks, we uploaded the following files in this Version One:
The combined dataset with pre-processed tweets and their metadata, with duplicates removed, is in the file labelled “Combined Dataset Pre-processed without duplicates.csv”
General statistics on each corpus are in the file labelled “Dataset Statistics.xlsx”
The preprocessed corpus from the General group, with the tweet contents only, is in the file labelled “Preprocessed_Tweet only_GENERAL.xlsx”
The preprocessed corpus from Party A, with the tweet contents only, is in the file labelled “Preprocessed_Tweet only_Party A.xlsx”
The preprocessed corpus from Party L, with the tweet contents only, is in the file labelled “Preprocessed_Tweet only_Party L.xlsx”
The preprocessed corpus from Party P, with the tweet contents only, is in the file labelled “Preprocessed_Tweet only_Party P.xlsx”
The top 100 frequent tokens are in the file labelled “Top 100 Tokens and weights.xlsx”
The top frequent bigrams and their weights are in the file labelled “Top 100 Bigrams and weights.xlsx”
The top frequent trigrams and their weights are in the file labelled “Top 100 Trigrams and weights.xlsx”
Please note that this Official Statistics publication is no longer updated. Latest statistics on milk utilisation by dairies - national statistics replaced this publication in 2017. Historical publications can be accessed in Milk utilisation by dairies.
This monthly official statistics notice includes information on the volume of milk used by dairies in England and Wales in the production of drinking milk and milk products. The monthly official statistics on the use of milk by dairies in England and Wales are combined with similar information from Scotland and Northern Ireland to produce a dataset for the UK as a whole. This gives UK milk availability and disposals and the production of liquid drinking milk and milk products such as cheese, butter and milk powders.
Production and overseas trade are brought together in the quarterly milk product supplies dataset. This provides information on how much butter, cheese, cream, condensed milk and milk powders is available for use in the UK, and gives a measure of UK self-sufficiency for these products.
Due to significant revisions to Northern Ireland data for 2016, the quarterly supplies dataset has been re-issued to maintain comparability with the monthly production data.
Tables showing the size and structure of the UK dairy industry, both in terms of the number of enterprises producing milk products and also in terms of the volumes of production of milk, butter and cheese can be found here.
Next update: see the statistics release calendar
For further information please contact:
Julie.Rumsey@defra.gsi.gov.uk
Twitter: @DefraStats (https://twitter.com/DefraStats)
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This data set contains combined on-court performance data for NBA players in the 2016-2017 season, alongside salary, Twitter engagement, and Wikipedia traffic data.
Further information can be found in a series of articles for IBM Developerworks: "Explore valuation and attendance using data science and machine learning" and "Exploring the individual NBA players".
Slides from a talk about this dataset, given at Strata in March 2018, are available.
Further reading on this dataset is in Chapter 6 of the book Pragmatic AI: An Introduction to Cloud-Based Machine Learning; also watch lesson 9 in Essential Machine Learning and AI with Python and Jupyter Notebook.
You can watch a breakdown of using cluster analysis on the Pragmatic AI YouTube channel
Learn to deploy a Kaggle project into a production machine learning service (sklearn + Flask + container) by reading Python for DevOps: Learn Ruthlessly Effective Automation, Chapter 14: MLOps and Machine Learning Engineering.
Use social media to predict a winning season with this notebook: https://github.com/noahgift/core-stats-datascience/blob/master/Lesson2_7_Trends_Supervized_Learning.ipynb
Learn to use the cloud for data analysis.
Data sources include ESPN, Basketball-Reference, Twitter, Five-ThirtyEight, and Wikipedia. The source code for this dataset (in Python and R) can be found on GitHub. Links to more writing can be found at noahgift.com.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset used for the research presented in the following paper: Takayuki Hiraoka, Takashi Kirimura, Naoya Fujiwara (2024) "Geospatial analysis of toponyms in geo-tagged social media posts".
We collected georeferenced Twitter posts tagged to coordinates inside the bounding box of Japan between 2012 and 2018. The present dataset represents the spatial distributions of all geotagged posts, as well as posts whose text contains each of 24 domestic toponyms, 12 common nouns, and 6 foreign toponyms. The code used to analyze the data is available on GitHub.
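The 30-by-45-arcsecond grid cells correspond to Japan's standard "level 3" (roughly 1 km) mesh, whose 8-digit code can be derived from a latitude/longitude pair. A sketch following the JIS X 0410 scheme (an illustration, not necessarily the authors' exact code):

```python
def mesh3_code(lat, lon):
    """8-digit level-3 mesh code for a point in Japan (JIS X 0410 scheme)."""
    p, a = divmod(lat * 1.5, 1)    # primary mesh row (40' of latitude)
    u, b = divmod(lon - 100.0, 1)  # primary mesh column (1 degree of longitude)
    q, c = divmod(a * 8, 1)        # secondary row (5')
    v, d = divmod(b * 8, 1)        # secondary column (7.5')
    r = int(c * 10)                # tertiary row (30")
    w = int(d * 10)                # tertiary column (45")
    return f"{int(p):02d}{int(u):02d}{int(q)}{int(v)}{r}{w}"

print(mesh3_code(35.681, 139.767))  # '53394611' (around Tokyo Station)
```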
selected_geotagged_tweet_data/: Number of geotagged Twitter posts in each grid cell. Each CSV file under this directory associates each grid cell (spanning 30 seconds of latitude and 45 seconds of longitude, approximately a 1 km x 1 km square, identified by an 8-digit code, m3code) with the number of geotagged tweets tagged to coordinates inside that cell (tweetcount).
file_names.json: relates each of the toponyms studied in this work to the corresponding data file (all denotes the full data).
population/population_center_2020.xlsx: Center of population of each municipality based on the 2020 census. Derived from data published by the Statistics Bureau of Japan on their website (Japanese).
population/census2015mesh3_totalpop_setai.csv: Resident population in each grid cell based on the 2015 census. Derived from data published by the Statistics Bureau of Japan on e-stat (Japanese).
population/economiccensus2016mesh3_jigyosyo_jugyosya.csv: Employed population in each grid cell based on the 2016 Economic Census. Derived from data published by the Statistics Bureau of Japan on e-stat (Japanese).
japan_MetropolitanEmploymentArea2015map/: Shapefile for the boundaries of Metropolitan Employment Areas (MEA) in Japan. See this website for details of MEA.
ward_shapefiles/: Shapefiles for the boundaries of wards in large cities, published by the Statistics Bureau of Japan on e-stat.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The file contains Tweet IDs* for COVID-19 related tweets containing at least one vaccine-related word (i.e., words that start with vaccin*, vacin*, or vax*) collected in September 2022 from Twitter's COVID-19 Streaming Endpoint via a custom script developed by the Social Media Lab (https://socialmedialab.ca/). Visit our interactive dashboard at https://stream.covid19misinfo.org/ for a preview and some general stats about this COVID-19 Twitter streaming dataset. For more info about Twitter's COVID-19 Streaming Endpoint, visit https://developer.twitter.com/en/docs/labs/covid19-stream/overview. Note: In accordance with Twitter API Terms, the dataset only includes Tweet IDs (as opposed to the actual tweets and associated metadata). To recollect tweets contained in this dataset, you can use programs such as Hydrator (https://github.com/DocNow/hydrator/) or the Python library Twarc (https://github.com/DocNow/twarc/).
This publication gives previously published copies of the monthly National Statistics publication on wholesale fruit and vegetable prices that showed figures for 2016. Each publication gives the figures available at that time. The figures are subject to revision each month as new information becomes available. This publication also contains the previously published monthly dataset on wholesale fruit and vegetable prices which gives prices up to July 2016.
The latest weekly data sets are available here.
The publications give the average wholesale prices of selected home-grown horticultural produce. The prices are national averages of the most usual prices charged by wholesalers for selected home-grown fruit and vegetables at the wholesale markets in Birmingham, Bristol, Liverpool and New Spitalfields. For selected home-grown cut flowers and flowering pot plants the average also includes information from the wholesale market at New Covent Garden up to February 2016.
Defra statistics: prices
Email: prices@defra.gov.uk
You can also contact us via Twitter: https://twitter.com/DefraStats
The number of Twitter users in Indonesia was forecast to increase continuously between 2024 and 2028 by a total of 1.4 million users (+6.14 percent). After a ninth consecutive year of growth, the Twitter user base is estimated to reach 24.25 million users, a new peak, in 2028. Notably, the number of Twitter users has been continuously increasing over the past years. User figures, shown here for the platform Twitter, have been estimated by taking into account company filings or press material, secondary research, app downloads and traffic data. They refer to the average monthly active users over the period. The data shown are an excerpt of Statista's Key Market Indicators (KMI). The KMI are a collection of primary and secondary indicators on the macro-economic, demographic and technological environment in up to 150 countries and regions worldwide. All indicators are sourced from international and national statistical offices, trade associations and the trade press, and they are processed to generate comparable datasets (see supplementary notes under details for more information). Find more key insights for the number of Twitter users in countries like Malaysia and Singapore.