Due to the relevance of the COVID-19 global pandemic, we are releasing our dataset of tweets acquired from the Twitter Stream related to COVID-19 chatter. The first 9 weeks of data (from January 1st, 2020 to March 11th, 2020) contain very low tweet counts, as they were filtered from data we were collecting for other research purposes; however, one can see the dramatic increase as awareness of the virus spread. Dedicated data gathering ran from March 11th to March 29th and yielded over 4 million tweets a day.
The data collected from the stream captures all languages, but the most prevalent are English, Spanish, and French. We release all tweets and retweets in the full_dataset.tsv file (70,569,368 unique tweets), and a cleaned version with no retweets in the full_dataset-clean.tsv file (13,535,912 unique tweets). There are several practical reasons for us to keep the retweets; tracing important tweets and their dissemination is one of them. For NLP tasks we provide the top 1000 frequent terms in frequent_terms.csv, the top 1000 bigrams in frequent_bigrams.csv, and the top 1000 trigrams in frequent_trigrams.csv. Some general statistics per day are included for both datasets in the statistics-full_dataset.tsv and statistics-full_dataset-clean.tsv files.
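The frequent-term, bigram, and trigram files above can be approximated from hydrated tweet text with a simple n-gram count. A minimal sketch (the naive tokenizer and the inline sample tweets are illustrative assumptions, not the authors' exact pipeline):

```python
from collections import Counter
import re

def top_ngrams(texts, n=1, k=1000):
    """Return the k most frequent n-grams across a list of tweet texts."""
    counts = Counter()
    for text in texts:
        # Naive tokenizer: lowercase words, numbers, hashtags, and mentions.
        tokens = re.findall(r"[a-z0-9#@']+", text.lower())
        counts.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return counts.most_common(k)

# Tiny inline example instead of millions of hydrated tweets:
tweets = ["COVID-19 cases rise", "covid-19 vaccine news", "cases rise again"]
print(top_ngrams(tweets, n=2, k=3))
```

The same function with `n=1` and `n=3` would produce term and trigram counts analogous to frequent_terms.csv and frequent_trigrams.csv.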
More details can be found (and will be updated faster) at: https://github.com/thepanacealab/covid19_twitter
As always, the tweets distributed here are only tweet identifiers (with date and time added), per Twitter's terms and conditions on re-distributing Twitter data. They need to be hydrated to be used.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This dataset is an extract of a wider database aimed at collecting Twitter users' friends (the other accounts one follows). The overall goal is to study users' interests through who they follow and the connection to the hashtags they have used.
It is a list of Twitter user records. In the JSON format, each Twitter user is stored as one object in a list of more than 40,000 objects. Each object holds:
avatar: URL of the profile picture
followerCount: the number of followers of this user
friendsCount: the number of accounts this user follows
friendName: the @name (without the '@') of the user (beware: this name can be changed by the user)
id: the user ID; this number cannot change (you can retrieve the screen name with this service: https://tweeterid.com/)
friends: the list of IDs the user follows (i.e., the IDs of users followed by this user)
lang: the language declared by the user (in this dataset there is only "en" (English))
lastSeen: the timestamp of this user's most recent tweet
tags: the hashtags (with or without '#') used by the user; these are the trending topics the user tweeted about
tweetID: the ID of the last tweet posted by this user
You also have the CSV format, which uses the same naming convention.
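A minimal sketch of loading and filtering the JSON variant described above (the file name and the inline sample records are assumptions for illustration):

```python
import json

def load_profiles(path):
    """Load the list of user objects described above from a JSON file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def active_profiles(profiles, min_followers=100):
    """Keep users with at least min_followers followers,
    mirroring the dataset's own selection filter."""
    return [p for p in profiles if p.get("followerCount", 0) >= min_followers]

# Tiny inline example instead of the real 40,000-object file:
sample = [{"friendName": "alice", "id": 1, "followerCount": 250,
           "friends": [2, 3], "lang": "en", "tags": ["python"]},
          {"friendName": "bob", "id": 2, "followerCount": 12,
           "friends": [], "lang": "en", "tags": []}]
print([p["friendName"] for p in active_profiles(sample)])  # ['alice']
```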
These users were selected because they tweeted on Twitter trending topics; I selected users that have at least 100 followers and follow at least 100 other accounts (in order to filter out spam and non-informative/empty accounts).
This dataset was built by Hubert Wassner (me) using the Twitter public API. More data can be obtained on request (hubert.wassner AT gmail.com); at this time I have collected over 5 million profiles in different languages. Some more information can be found here (in French only): http://wassner.blogspot.fr/2016/06/recuperer-des-profils-twitter-par.html
No public research has been done (so far) on this dataset. I made a private application, described here (in French): http://wassner.blogspot.fr/2016/09/twitter-profiling.html, which uses the full dataset (millions of full profiles).
One can analyze many things with this dataset.
Feel free to ask any question (or help request) via Twitter : @hwassner
Enjoy! ;)
Due to the relevance of the COVID-19 global pandemic, we are releasing our dataset of tweets acquired from the Twitter Stream related to COVID-19 chatter. Since our first release we have received additional data from our new collaborators, allowing this resource to grow to its current size. Dedicated data gathering started on March 11th, yielding over 4 million tweets a day. We have added additional data provided by our new collaborators, covering January 27th to March 27th, to provide extra longitudinal coverage.
The data collected from the stream captures all languages, but the most prevalent are English, Spanish, and French. We release all tweets and retweets in the full_dataset.tsv file (152,920,832 unique tweets), and a cleaned version with no retweets in the full_dataset-clean.tsv file (30,990,645 unique tweets). There are several practical reasons for us to keep the retweets; tracing important tweets and their dissemination is one of them. For NLP tasks we provide the top 1000 frequent terms in frequent_terms.csv, the top 1000 bigrams in frequent_bigrams.csv, and the top 1000 trigrams in frequent_trigrams.csv. Some general statistics per day are included for both datasets in the statistics-full_dataset.tsv and statistics-full_dataset-clean.tsv files.
More details can be found (and will be updated faster) at: https://github.com/thepanacealab/covid19_twitter
As always, the tweets distributed here are only tweet identifiers (with date and time added), per Twitter's terms and conditions, which allow re-distribution of Twitter data ONLY for research purposes. They need to be hydrated to be used.
Due to the relevance of the COVID-19 global pandemic, we are releasing our dataset of tweets acquired from the Twitter Stream related to COVID-19 chatter. The first 9 weeks of data (from January 1st, 2020 to March 11th, 2020) contain very low tweet counts, as they were filtered from data we were collecting for other research purposes; however, one can see the dramatic increase as awareness of the virus spread. Dedicated data gathering ran from March 11th to March 22nd and yielded over 4 million tweets a day.
The data collected from the stream captures all languages, but the most prevalent are English, Spanish, and French. We release all tweets and retweets in the full_dataset.tsv file (40,823,816 unique tweets), and a cleaned version with no retweets in the full_dataset-clean.tsv file (7,479,940 unique tweets). There are several practical reasons for us to keep the retweets; tracing important tweets and their dissemination is one of them. For NLP tasks we provide the top 1000 frequent terms in frequent_terms.csv, the top 1000 bigrams in frequent_bigrams.csv, and the top 1000 trigrams in frequent_trigrams.csv. Some general statistics per day are included for both datasets in the statistics-full_dataset.tsv and statistics-full_dataset-clean.tsv files.
More details can be found (and will be updated faster) at: https://github.com/thepanacealab/covid19_twitter
As always, the tweets distributed here are only tweet identifiers (with date and time added), per Twitter's terms and conditions on re-distributing Twitter data. They need to be hydrated to be used.
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
This dataset contains statistics related to the Unleashed Twitter account (@SAUnleashed). Unleashed is an open data competition, an initiative of the Office for Digital Government, Department of the Premier and Cabinet. The data is used to monitor the level of engagement with the audience and to make communication about the event effective.
Due to the relevance of the COVID-19 global pandemic, we are releasing our dataset of tweets acquired from the Twitter Stream related to COVID-19 chatter. Since our first release we have received additional data from our new collaborators, allowing this resource to grow to its current size. Dedicated data gathering started on March 11th, yielding over 4 million tweets a day. We have added additional data provided by our new collaborators, covering January 27th to March 27th, to provide extra longitudinal coverage.
The data collected from the stream captures all languages, but the most prevalent are English, Spanish, and French. We release all tweets and retweets in the full_dataset.tsv file (230,961,781 unique tweets), and a cleaned version with no retweets in the full_dataset-clean.tsv file (52,026,197 unique tweets). There are several practical reasons for us to keep the retweets; tracing important tweets and their dissemination is one of them. For NLP tasks we provide the top 1000 frequent terms in frequent_terms.csv, the top 1000 bigrams in frequent_bigrams.csv, and the top 1000 trigrams in frequent_trigrams.csv. Some general statistics per day are included for both datasets in the statistics-full_dataset.tsv and statistics-full_dataset-clean.tsv files. For more statistics and some visualizations visit: http://www.panacealab.org/covid19/
More details can be found (and will be updated faster) at https://github.com/thepanacealab/covid19_twitter, along with our pre-print about the dataset: https://arxiv.org/abs/2004.03688
As always, the tweets distributed here are only tweet identifiers (with date and time added), per Twitter's terms and conditions, which allow re-distribution of Twitter data ONLY for research purposes. They need to be hydrated to be used.
Due to the relevance of the COVID-19 global pandemic, we are releasing our dataset of tweets acquired from the Twitter Stream related to COVID-19 chatter. The first 9 weeks of data (from January 1st, 2020 to March 11th, 2020) contain very low tweet counts, as they were filtered from data we were collecting for other research purposes; however, one can see the dramatic increase as awareness of the virus spread. Dedicated data gathering started on March 11th, yielding over 4 million tweets a day. The data collected from the stream captures all languages, but the most prevalent are English, Spanish, and French. We release all tweets and retweets in the full dataset, and a cleaned version with no retweets. There are several practical reasons for us to keep the retweets; tracing important tweets and their dissemination is one of them. For NLP tasks we provide the top 1000 frequent terms, the top 1000 bigrams, and the top 1000 trigrams. Some general statistics per day are included for both datasets. We will continue to update the dataset every two days here and weekly on Zenodo. For more information on processing and visualizations please visit: www.panacealab.org/covid19
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The file contains Tweet IDs* for COVID-19 related tweets collected in October 2022 from Twitter's COVID-19 Streaming Endpoint via a custom script developed by the Social Media Lab (https://socialmedialab.ca/). Visit our interactive dashboard at https://stream.covid19misinfo.org/ for a preview and some general stats about this COVID-19 Twitter streaming dataset. For more info about Twitter's COVID-19 Streaming Endpoint, visit https://developer.twitter.com/en/docs/labs/covid19-stream/overview. Note: In accordance with Twitter API Terms, the dataset only includes Tweet IDs (as opposed to the actual tweets and associated metadata). To recollect tweets contained in this dataset, you can use programs such as Hydrator (https://github.com/DocNow/hydrator/) or the Python library Twarc (https://github.com/DocNow/twarc/).
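Hydration tools such as Twarc work by re-requesting tweets in batches; the Twitter v2 tweet lookup endpoint accepts up to 100 IDs per call. A hedged sketch of just the batching step, independent of any particular client or credentials:

```python
def batch_ids(tweet_ids, size=100):
    """Split a list of tweet IDs into lookup-sized batches of at most `size`.
    100 is the per-request limit of the v2 tweet lookup endpoint."""
    return [tweet_ids[i:i + size] for i in range(0, len(tweet_ids), size)]

# Example: 250 IDs become three batches of 100, 100, and 50.
ids = [str(n) for n in range(250)]
batches = batch_ids(ids)
print(len(batches), len(batches[-1]))  # 3 50
```

Each batch would then be passed to a hydration client (e.g. Twarc) alongside valid API credentials.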
Data gathering started on March 11th, yielding over 4 million tweets a day.
The data collected from the stream captures all languages, but the most prevalent are English, Spanish, and French. We release all tweets and retweets in the full_dataset.tsv file (891,324,837 unique tweets), and a cleaned version with no retweets in the full_dataset-clean.tsv file (223,249,143 unique tweets). There are several practical reasons for us to keep the retweets; tracing important tweets and their dissemination is one of them. For NLP tasks we provide the top 1000 frequent terms in frequent_terms.csv, the top 1000 bigrams in frequent_bigrams.csv, and the top 1000 trigrams in frequent_trigrams.csv. Some general statistics per day are included for both datasets in the full_dataset-statistics.tsv and full_dataset-clean-statistics.tsv files. For more statistics and some visualizations visit: http://www.panacealab.org/covid19/
More details can be found (and will be updated faster) at https://github.com/thepanacealab/covid19_twitter, along with our pre-print about the dataset: https://arxiv.org/abs/2004.03688
As always, the tweets distributed here are only tweet identifiers (with date and time added), per Twitter's terms and conditions, which allow re-distribution of Twitter data ONLY for research purposes. They need to be hydrated to be used.
The number of Twitter users in the United States was forecast to increase continuously between 2024 and 2028 by a total of 4.3 million users (+5.32 percent). After a ninth consecutive year of growth, the Twitter user base is estimated to reach 85.08 million users, a new peak, in 2028. Notably, the number of Twitter users has been continuously increasing over the past years. User figures, shown here for the platform Twitter, have been estimated by taking into account company filings or press material, secondary research, app downloads and traffic data. They refer to the average monthly active users over the period. The data shown are an excerpt of Statista's Key Market Indicators (KMI). The KMI are a collection of primary and secondary indicators on the macro-economic, demographic and technological environment in up to 150 countries and regions worldwide. All indicators are sourced from international and national statistical offices, trade associations and the trade press, and they are processed to generate comparable datasets (see supplementary notes under details for more information). Find more key insights for the number of Twitter users in countries like Canada and Mexico.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We used the Twitter API (v2) to collect all tweets, retweets, quotes and replies containing case-insensitive versions of the hashtags #(I)StandWithPutin, #(I)StandWithRussia, #(I)SupportRussia, #(I)StandWithUkraine, #(I)StandWithZelenskyy and #(I)SupportUkraine. These were obtained from February 23rd 2022 00:00:00 UTC until March 8th 2022 23:59:59 UTC, the fortnight after Russia invaded Ukraine. We queried the hashtags with and without the 'I', a total of 12 query hashtags, collecting 5,203,746 tweets. The data collected predates the beginning of the Russian invasion by one day. These hashtags were chosen because they were the most trending hashtags related to the Russia/Ukraine war that could be easily identified with a particular side in the conflict. We calculated Botometer results for 483,100 (26.5%) of accounts. These accounts were randomly sampled from a list of all unique users in our dataset who posted in English. This random sample leads to an approximately uniform frequency of tweets from accounts with Botometer labels across the time frame we considered. We include the language-dependent and language-independent results from Botometer, including the Complete Automation Probabilities (CAP) and each of the sub-category scores for different bot types. Moreover, we include the display scores and raw scores from Botometer for each account. More information about the Botometer scores can be found at this link: https://rapidapi.com/OSoMe/api/botometer-pro/details. You can find our paper here: https://arxiv.org/abs/2208.07038
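The 12 query hashtags can be enumerated from the six base tags; since the query itself is case-insensitive, only the optional 'I' prefix needs expanding. A small sketch:

```python
# The six base hashtags named in the description above.
bases = ["StandWithPutin", "StandWithRussia", "SupportRussia",
         "StandWithUkraine", "StandWithZelenskyy", "SupportUkraine"]

# Expand each base with and without the leading "I" -> 12 query hashtags.
query_hashtags = [f"#{prefix}{base}" for base in bases for prefix in ("", "I")]
print(len(query_hashtags))  # 12
```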
Text statistics
This dataset is a combination of the following datasets:
agentlans/text-quality-v2 agentlans/readability agentlans/twitter-sentiment-meta-analysis
The main purpose is to collect this large body of data in one place for easy training and evaluation.
Data Preparation and Transformation
Quality Score Normalization
The dataset was enhanced with additional columns, and quality scores (n = 909 533) were normalized using Ordered Quantile… See the full description on the dataset page: https://huggingface.co/datasets/agentlans/text-stats.
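Ordered Quantile normalization maps values to standard-normal quantiles through their ranks. A minimal rank-based sketch of the idea (not the exact implementation used to prepare this dataset):

```python
from statistics import NormalDist

def ordered_quantile(values):
    """Rank-based inverse-normal transform: order is preserved and the
    output is approximately standard normal."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    out = [0.0] * n
    for rank, i in enumerate(order, start=1):
        # The 0.5 offset keeps the quantile argument strictly inside (0, 1).
        out[i] = NormalDist().inv_cdf((rank - 0.5) / n)
    return out

print(ordered_quantile([10.0, 2.0, 7.0]))
```

The median maps to 0 and the extremes map to symmetric positive/negative quantiles, which is the property such quality-score normalization relies on.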
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Twitter (recently renamed X) is used by academic anesthesiology departments as a social media platform for various purposes. We hypothesized that Twitter (X) use would be prevalent among academic anesthesiology departments and that the number of tweets would vary by region, physician faculty size, and National Institutes of Health (NIH) research funding rank. We performed a descriptive study of Twitter (X) use by academic anesthesiology departments (i.e., those with a residency program) in 2022. Original tweets were collected using a Twitter (X) analytics tool. Summary statistics were reported for tweet number and content. The median number of tweets was compared after stratifying by region, physician faculty size, and NIH funding rank. Among 166 academic anesthesiology departments, 73 (44.0%) had a Twitter (X) account in 2022. There were 3,578 original tweets during the study period, and the median number of tweets per department was 21 (25th to 75th percentile: 0 to 75), with most tweets (55.8%) announcing general departmental news and smaller numbers highlighting social events (12.5%), research (11.1%), recruiting (7.1%), DEI activities (5.2%), and trainee experiences (4.1%). There was no significant difference in the median number of tweets by region (P = 0.81). The median number of tweets differed significantly by physician faculty size (P
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset contains tweets related to the 2023 presidential elections in Nigeria. The data was retrieved from the social media network Twitter (now X) between February 4th, 2023 and April 4th, 2023. Hashtags from the official handles, and other popular hashtags endorsed by and/or representing the candidates of each party, were used to retrieve election-related tweets via the Twitter API. Three major political parties in Nigeria were considered, and they have been labelled Party A, Party L and Party P in this dataset. The party or group called "General" contains tweets from Independent National Electoral Commission (INEC) hashtags such as @inecnigeria and #2023election, which are not directly for any political party.
The dataset has been lightly pre-processed to make it useful to researchers for a wide range of natural language processing tasks such as sentiment analysis, topic modelling, fake news detection, emotion detection, election stance, etc.
Details of the dataset collection, such as hashtags, retrieved tweets, duplicates removed, and the remaining unique tweets, are presented in Table 1.
Table 1: Tweets collection and duplicates removal

| S/N   | Party | Hash tags    | Retrieved tweets | Duplicate tweets | Unique tweets |
| 1     | X     | @inecnigeria | 64,496           | 47,275           | 17,195        |
| 2     | A     |              | 263,870          | 231,036          | 32,832        |
| 3     | L     |              | 664,083          | 310,857          | 353,226       |
| 4     | P     |              | 387,450          | 318,425          | 66,227        |
| Total |       |              | 1,379,899        | 907,593          | 468,480       |
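Duplicate counts like those in Table 1 come from collapsing repeated tweet text. A minimal sketch of one common approach, exact matching after light normalization (the real preprocessing pipeline may differ):

```python
def dedupe(tweets):
    """Drop tweets whose normalized text was already seen,
    keeping the first occurrence of each."""
    seen, unique = set(), []
    for t in tweets:
        key = " ".join(t.lower().split())  # case-fold and collapse whitespace
        if key not in seen:
            seen.add(key)
            unique.append(t)
    return unique

# Tiny illustrative sample, not real dataset rows:
sample = ["Vote wisely!", "vote  wisely!", "INEC announces dates"]
print(dedupe(sample))  # ['Vote wisely!', 'INEC announces dates']
```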
To encourage NLP tasks, we uploaded the following files in this Version One:
The combined dataset with pre-processed tweets and their metadata, with duplicates removed, is in the file labelled “Combined Dataset Pre-processed without duplicates.csv”
General statistics on each corpus are in the file labelled “Dataset Statistics.xlsx”
The preprocessed corpus from the General group, with the tweet contents only, is in the file labelled “Preprocessed_Tweet only_GENERAL.xlsx”
The preprocessed corpus from Party A, with the tweet contents only, is in the file labelled “Preprocessed_Tweet only_Party A.xlsx”
The preprocessed corpus from Party L, with the tweet contents only, is in the file labelled “Preprocessed_Tweet only_Party L.xlsx”
The preprocessed corpus from Party P, with the tweet contents only, is in the file labelled “Preprocessed_Tweet only_Party P.xlsx”
The top 100 frequent tokens are in the file labelled “Top 100 Tokens and weights.xlsx”
The top frequent bigrams and their weights are in the file labelled “Top 100 Bigrams and weights.xlsx”
The top frequent trigrams and their weights are in the file labelled “Top 100 Trigrams and weights.xlsx”
Please note that this Official Statistics publication is no longer updated. Latest statistics on milk utilisation by dairies - national statistics replaced this publication in 2017. Historical publications can be accessed in Milk utilisation by dairies.
This monthly official statistics notice includes information on the volume of milk used by dairies in England and Wales in the production of drinking milk and milk products. The monthly official statistics on the use of milk by dairies in England and Wales are combined with similar information from Scotland and Northern Ireland to produce a dataset for the UK as a whole. This gives UK milk availability and disposals and the production of liquid drinking milk and milk products such as cheese, butter and milk powders.
Production and overseas trade are brought together in the quarterly milk product supplies dataset. This provides information on how much butter, cheese, cream, condensed milk and milk powders is available for use in the UK, and gives a measure of UK self-sufficiency for these products.
Due to significant revisions to Northern Ireland data for 2016, the quarterly supplies dataset has been re-issued to maintain comparability with the monthly production data.
Tables showing the size and structure of the UK dairy industry, both in terms of the number of enterprises producing milk products and also in terms of the volumes of production of milk, butter and cheese can be found here.
Next update: see the statistics release calendar
For further information please contact:
Julie.Rumsey@defra.gsi.gov.uk
Twitter: @DefraStats (https://twitter.com/DefraStats)
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This data set contains combined on-court performance data for NBA players in the 2016-2017 season, alongside salary, Twitter engagement, and Wikipedia traffic data.
Further information can be found in a series of articles for IBM Developerworks: "Explore valuation and attendance using data science and machine learning" and "Exploring the individual NBA players".
Slides from a talk about this dataset, given at Strata in March 2018, are available.
Further reading on this dataset is in Chapter 6 of the book Pragmatic AI: An Introduction to Cloud-Based Machine Learning; also watch lesson 9 in Essential Machine Learning and AI with Python and Jupyter Notebook.
You can watch a breakdown of using cluster analysis on the Pragmatic AI YouTube channel
Learn to deploy a Kaggle project into a production machine learning service (sklearn + Flask + container) by reading Python for DevOps: Learn Ruthlessly Effective Automation, Chapter 14: MLOps and Machine Learning Engineering.
Use social media to predict a winning season with this notebook: https://github.com/noahgift/core-stats-datascience/blob/master/Lesson2_7_Trends_Supervized_Learning.ipynb
Learn to use the cloud for data analysis.
Data sources include ESPN, Basketball-Reference, Twitter, Five-ThirtyEight, and Wikipedia. The source code for this dataset (in Python and R) can be found on GitHub. Links to more writing can be found at noahgift.com.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset used for the research presented in the following paper: Takayuki Hiraoka, Takashi Kirimura, Naoya Fujiwara (2024) "Geospatial analysis of toponyms in geo-tagged social media posts".
We collected georeferenced Twitter posts tagged to coordinates inside the bounding box of Japan between 2012 and 2018. The present dataset represents the spatial distributions of all geotagged posts, as well as posts whose text contains each of 24 domestic toponyms, 12 common nouns, and 6 foreign toponyms. The code used to analyze the data is available on GitHub.
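The 30-by-45-arcsecond grid cells correspond to Japan's standard "level 3" (roughly 1 km) mesh, whose 8-digit code can be derived from a latitude/longitude pair. A sketch following the JIS X 0410 scheme (an illustration, not necessarily the authors' exact code):

```python
def mesh3_code(lat, lon):
    """8-digit level-3 mesh code for a point in Japan (JIS X 0410 scheme)."""
    p, a = divmod(lat * 1.5, 1)    # primary mesh row (40' of latitude)
    u, b = divmod(lon - 100.0, 1)  # primary mesh column (1 degree of longitude)
    q, c = divmod(a * 8, 1)        # secondary row (5')
    v, d = divmod(b * 8, 1)        # secondary column (7.5')
    r = int(c * 10)                # tertiary row (30")
    w = int(d * 10)                # tertiary column (45")
    return f"{int(p):02d}{int(u):02d}{int(q)}{int(v)}{r}{w}"

print(mesh3_code(35.681, 139.767))  # '53394611' (around Tokyo Station)
```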
selected_geotagged_tweet_data/: Number of geotagged Twitter posts in each grid cell. Each CSV file under this directory associates each grid cell (spanning 30 seconds of latitude and 45 seconds of longitude, approximately a 1 km x 1 km square, identified by an 8-digit code, m3code) with the number of geotagged tweets tagged to coordinates inside that cell (tweetcount).
file_names.json: relates each of the toponyms studied in this work to the corresponding data file (all denotes the full data).
population/population_center_2020.xlsx: Center of population of each municipality based on the 2020 census. Derived from data published by the Statistics Bureau of Japan on their website (Japanese).
population/census2015mesh3_totalpop_setai.csv: Resident population in each grid cell based on the 2015 census. Derived from data published by the Statistics Bureau of Japan on e-stat (Japanese).
population/economiccensus2016mesh3_jigyosyo_jugyosya.csv: Employed population in each grid cell based on the 2016 Economic Census. Derived from data published by the Statistics Bureau of Japan on e-stat (Japanese).
japan_MetropolitanEmploymentArea2015map/: Shapefile for the boundaries of Metropolitan Employment Areas (MEA) in Japan. See this website for details of MEA.
ward_shapefiles/: Shapefiles for the boundaries of wards in large cities, published by the Statistics Bureau of Japan on e-stat.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The file contains Tweet IDs* for COVID-19 related tweets containing at least one vaccine-related word (i.e., words that start with vaccin*, vacin*, or vax*) collected in September 2022 from Twitter's COVID-19 Streaming Endpoint via a custom script developed by the Social Media Lab (https://socialmedialab.ca/). Visit our interactive dashboard at https://stream.covid19misinfo.org/ for a preview and some general stats about this COVID-19 Twitter streaming dataset. For more info about Twitter's COVID-19 Streaming Endpoint, visit https://developer.twitter.com/en/docs/labs/covid19-stream/overview. Note: In accordance with Twitter API Terms, the dataset only includes Tweet IDs (as opposed to the actual tweets and associated metadata). To recollect tweets contained in this dataset, you can use programs such as Hydrator (https://github.com/DocNow/hydrator/) or the Python library Twarc (https://github.com/DocNow/twarc/).
This publication gives previously published copies of the monthly National Statistics publication on wholesale fruit and vegetable prices that showed figures for 2016. Each publication gives the figures available at that time. The figures are subject to revision each month as new information becomes available. This publication also contains the previously published monthly dataset on wholesale fruit and vegetable prices which gives prices up to July 2016.
The latest weekly data sets are available here.
The publications give the average wholesale prices of selected home-grown horticultural produce. The prices are national averages of the most usual prices charged by wholesalers for selected home-grown fruit and vegetables at the wholesale markets in Birmingham, Bristol, Liverpool and New Spitalfields. For selected home-grown cut flowers and flowering pot plants the average also includes information from the wholesale market at New Covent Garden up to February 2016.
Defra statistics: prices
Email: prices@defra.gov.uk
You can also contact us via Twitter: https://twitter.com/DefraStats
The number of Twitter users in Indonesia was forecast to increase continuously between 2024 and 2028 by a total of 1.4 million users (+6.14 percent). After a ninth consecutive year of growth, the Twitter user base is estimated to reach 24.25 million users, a new peak, in 2028. Notably, the number of Twitter users has been continuously increasing over the past years. User figures, shown here for the platform Twitter, have been estimated by taking into account company filings or press material, secondary research, app downloads and traffic data. They refer to the average monthly active users over the period. The data shown are an excerpt of Statista's Key Market Indicators (KMI). The KMI are a collection of primary and secondary indicators on the macro-economic, demographic and technological environment in up to 150 countries and regions worldwide. All indicators are sourced from international and national statistical offices, trade associations and the trade press, and they are processed to generate comparable datasets (see supplementary notes under details for more information). Find more key insights for the number of Twitter users in countries like Malaysia and Singapore.