Due to the relevance of the COVID-19 global pandemic, we are releasing our dataset of tweets acquired from the Twitter Stream related to COVID-19 chatter. The first 9 weeks of data (from January 1st, 2020 to March 11th, 2020) contain very low tweet counts, as we were filtering for other data we were collecting for other research purposes; however, one can see the dramatic increase as awareness of the virus spread. Dedicated data gathering ran from March 11th to March 30th and yielded over 4 million tweets a day. We have added additional data provided by our new collaborators from January 27th to February 27th to provide extra longitudinal coverage.
The data collected from the stream captures all languages, but the most prevalent are English, Spanish, and French. We release all tweets and retweets in the full_dataset.tsv file (101,400,452 unique tweets) and a cleaned version with no retweets in the full_dataset-clean.tsv file (20,244,746 unique tweets). There are several practical reasons for us to keep the retweets; tracing important tweets and their dissemination is one of them. For NLP tasks we provide the top 1,000 frequent terms in frequent_terms.csv, the top 1,000 bigrams in frequent_bigrams.csv, and the top 1,000 trigrams in frequent_trigrams.csv. Some general statistics per day are included for both datasets in the statistics-full_dataset.tsv and statistics-full_dataset-clean.tsv files.
More details can be found (and will be updated faster) at https://github.com/thepanacealab/covid19_twitter.
As always, due to Twitter's terms and conditions on redistributing Twitter data, the tweets distributed here are only tweet identifiers (with date and time added). They need to be hydrated before they can be used.
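A minimal hydration sketch (not part of the released tooling) using the twarc library; the API credentials are placeholders, and the assumption that the tweet ID sits in the first tab-separated column of the TSV files is ours, not the authors':

```python
# Hedged sketch: hydrate the released tweet IDs with twarc (v1-style API).
# Credentials and the TSV column layout are assumptions, not part of the dataset.
from twarc import Twarc

t = Twarc("CONSUMER_KEY", "CONSUMER_SECRET", "ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")

def tweet_ids(path):
    """Yield the first tab-separated field of each line (assumed to be the tweet ID)."""
    with open(path) as f:
        for line in f:
            yield line.split("\t")[0].strip()

for tweet in t.hydrate(tweet_ids("full_dataset-clean.tsv")):
    print(tweet["id_str"], tweet["created_at"])
```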
The number of Twitter users in the United States was forecast to increase continuously between 2024 and 2028 by a total of 4.3 million users (+5.32 percent). After the ninth consecutive year of growth, the Twitter user base is estimated to reach 85.08 million users, a new peak, in 2028. Notably, the number of Twitter users has been increasing continuously over the past years. User figures, shown here for the platform Twitter, have been estimated by taking into account company filings or press material, secondary research, app downloads, and traffic data. They refer to the average monthly active users over the period. The data shown are an excerpt of Statista's Key Market Indicators (KMI). The KMI are a collection of primary and secondary indicators on the macro-economic, demographic, and technological environment in up to 150 countries and regions worldwide. All indicators are sourced from international and national statistical offices, trade associations, and the trade press, and they are processed to generate comparable data sets (see supplementary notes under details for more information). Find more key insights for the number of Twitter users in countries like Canada and Mexico.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Disclaimer: This dataset is distributed by Daniel Gayo-Avello, an associate professor at the Department of Computer Science at the University of Oviedo, for the sole purpose of non-commercial research, and it only includes tweet IDs.
The dataset contains tweet IDs for all the published tweets (in any language) between March 21, 2006 and July 31, 2009, thus comprising the first three whole years of Twitter from its creation, that is, about 1.5 billion tweets (see file Twitter-historical-20060321-20090731.zip).
It covers several defining developments in Twitter, such as the invention of hashtags, retweets and trending topics, and it includes tweets related to the 2008 US Presidential Elections, Obama's first inauguration speech, and the 2009 Iran Election protests (one of the so-called Twitter Revolutions).
Finally, it contains tweets in many major languages (mainly English, Portuguese, Japanese, Spanish, German and French), so it should be possible, at least in theory, to analyze international events from different cultural perspectives.
The dataset was completed in November 2016 and, therefore, the tweet IDs it contains were publicly available at that moment. This means that there may be tweets that were public during that period but do not appear in the dataset, and also that a substantial part of the tweets in the dataset have been deleted (or locked) since 2016.
To make the decay of tweet IDs in the dataset easier to understand, a number of representative samples (99% confidence level and ±0.5 confidence interval) are provided.
In general terms, 85.5% ±0.5 of the historical tweets were still available as of May 19, 2020 (see file Twitter-historical-20060321-20090731-sample.txt). However, since the number of tweets varies greatly throughout the three-year period covered by the dataset, additional representative samples are provided for 90-day intervals (see the file 90-day-samples.zip).
In that regard, the ratio of publicly available tweets (as of May 19, 2020) is as follows:
March 21, 2006 to June 18, 2006: 88.4% ±0.5 (from 5,512 tweets).
June 18, 2006 to September 16, 2006: 82.7% ±0.5 (from 14,820 tweets).
September 16, 2006 to December 15, 2006: 85.7% ±0.5 (from 107,975 tweets).
December 15, 2006 to March 15, 2007: 88.2% ±0.5 (from 852,463 tweets).
March 15, 2007 to June 13, 2007: 89.6% ±0.5 (from 6,341,665 tweets).
June 13, 2007 to September 11, 2007: 88.6% ±0.5 (from 11,171,090 tweets).
September 11, 2007 to December 10, 2007: 87.9% ±0.5 (from 15,545,532 tweets).
December 10, 2007 to March 9, 2008: 89.0% ±0.5 (from 23,164,663 tweets).
March 9, 2008 to June 7, 2008: 66.5% ±0.5 (from 56,416,772 tweets; see below for more details on this).
June 7, 2008 to September 5, 2008: 78.3% ±0.5 (from 62,868,189 tweets; see below for more details on this).
September 5, 2008 to December 4, 2008: 87.3% ±0.5 (from 89,947,498 tweets).
December 4, 2008 to March 4, 2009: 86.9% ±0.5 (from 169,762,425 tweets).
March 4, 2009 to June 2, 2009: 86.4% ±0.5 (from 474,581,170 tweets).
June 2, 2009 to July 31, 2009: 85.7% ±0.5 (from 589,116,341 tweets).
The apparent drop in available tweets from March 9, 2008 to September 5, 2008 has an easy, although embarrassing, explanation.
When cleaning all the data to publish this dataset there seemed to be a gap between April 1, 2008 and July 7, 2008 (actually, the data was not missing but in a different backup). Since tweet IDs are easy to regenerate for that Twitter era (source code is provided in generate-ids.m), I simply produced all those that were created between those two dates. All those tweets actually existed, but a number of them were obviously private and not crawlable. For those regenerated IDs the actual ratio of public tweets (as of May 19, 2020) is 62.3% ±0.5.
In other words, what you see in that period (April to July 2008) is not actually a huge number of tweets having been deleted but a combination of deleted and non-public tweets (whose IDs, ideally, should not be in the dataset because they add overhead when rehydrating it).
Additionally, given that not everybody will need the whole period of time, the earliest tweet ID for each date is provided in the file date-tweet-id.tsv.
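As an illustration of how date-tweet-id.tsv might be used, here is a hedged sketch (not the author's code) that slices the full ID list down to a single day; the two-column tab-separated layout of the TSV, the date format, and the plain-text ID file name are assumptions:

```python
# Hedged sketch: keep only the tweet IDs that fall inside a date window, using
# the per-day boundary IDs from date-tweet-id.tsv.
# Assumptions: the TSV has two tab-separated columns (date, earliest tweet ID),
# no header row, dates formatted as YYYY-MM-DD, and the full ID list has been
# unzipped to a plain text file with one ID per line.
import csv

def id_bounds(tsv_path, start_date, end_date):
    """Return (lower, upper) tweet-ID bounds for the half-open window [start_date, end_date)."""
    lower = upper = None
    with open(tsv_path, newline="") as f:
        for date, first_id in csv.reader(f, delimiter="\t"):
            if date == start_date:
                lower = int(first_id)
            elif date == end_date:
                upper = int(first_id)
    return lower, upper

lower, upper = id_bounds("date-tweet-id.tsv", "2008-11-04", "2008-11-05")
with open("Twitter-historical-20060321-20090731.txt") as ids, open("subset.txt", "w") as out:
    for line in ids:
        if lower <= int(line) < upper:  # IDs from this era increase with time
            out.write(line)
```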
For additional details regarding this dataset please see: Gayo-Avello, Daniel. "How I Stopped Worrying about the Twitter Archive at the Library of Congress and Learned to Build a Little One for Myself." arXiv preprint arXiv:1611.08144 (2016).
If you use this dataset in any way please cite that preprint (in addition to the dataset itself).
If you need to contact me you can find me as @PFCdgayo on Twitter.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We are releasing a Twitter dataset connected to our project Digital Narratives of Covid-19 (DHCOVID), which, among other goals, aims to explore over one year (May 2020 to May 2021) the narratives behind data about the coronavirus pandemic.
In this first version, we deliver a Twitter dataset organized as follows:
For English, we collect all tweets with the following keywords and hashtags: covid, coronavirus, pandemic, quarantine, stayathome, outbreak, lockdown, socialdistancing. For Spanish, we search for: covid, coronavirus, pandemia, quarentena, confinamiento, quedateencasa, desescalada, distanciamiento social.
The corpus of tweets consists of a list of tweet IDs; to obtain the original tweets, you can use a tweet "hydrator", which takes the IDs and downloads all the metadata for you into a CSV file.
We started collecting this Twitter dataset on April 24th, 2020 and we are adding daily data to our GitHub repository. There is a known problem with the file 2020-04-24/dhcovid_2020-04-24_es.txt, for which we could not gather the data due to technical reasons.
For more information about our project visit https://covid.dh.miami.edu/
For more updated datasets and detailed criteria, check our GitHub Repository: https://github.com/dh-miami/narratives_covid19/
The number of Twitter users in Brazil was forecast to increase continuously between 2024 and 2028 by a total of 3.4 million users (+15.79 percent). After the ninth consecutive year of growth, the Twitter user base is estimated to reach 24.96 million users, a new peak, in 2028. Notably, the number of Twitter users has been increasing continuously over the past years. User figures, shown here for the platform Twitter, have been estimated by taking into account company filings or press material, secondary research, app downloads, and traffic data. They refer to the average monthly active users over the period. The data shown are an excerpt of Statista's Key Market Indicators (KMI). The KMI are a collection of primary and secondary indicators on the macro-economic, demographic, and technological environment in up to 150 countries and regions worldwide. All indicators are sourced from international and national statistical offices, trade associations, and the trade press, and they are processed to generate comparable data sets (see supplementary notes under details for more information).
https://creativecommons.org/publicdomain/zero/1.0/
The image at the top of the page is a frame from today's (7/26/2016) Isis #TweetMovie from Twitter, a "normal" day when two Isis operatives murdered a priest saying mass in a French church. (You can see this in the center left.) A selection of data from this site is being made available here to Kaggle users.
UPDATE: An excellent study by Audrey Alexander titled Digital Decay? is now available, which traces the "change over time among English-language Islamic State sympathizers on Twitter."
This data set is intended to be a counterpoise to the How Isis Uses Twitter data set. That data set contains 17k tweets alleged to originate with "100+ pro-ISIS fanboys". This new set contains 122k tweets collected on two separate days, 7/4/2016 and 7/11/2016, which contained any of the following terms, with no further editing or selection:
This is not a perfect counterpoise as it almost surely contains a small number of pro-Isis fanboy tweets. However, unless some entity, such as Kaggle, is willing to expend significant resources on a service something like an expert level Mechanical Turk or Zooniverse, a high quality counterpoise is out of reach.
A counterpoise provides a balance or backdrop against which to measure a primary object, in this case the original pro-Isis data. So if you want to discriminate between pro-Isis tweets and other tweets concerning Isis, you will need to model the original pro-Isis data (the signal) against the counterpoise, which is signal plus noise. Further background and some analysis can be found in this forum thread.
This data comes from postmodernnews.com/token-tv.aspx which daily collects about 25MB of Isis tweets for the purposes of graphical display. PLEASE NOTE: This server is not currently active.
There are several differences between the format of this data set and the pro-ISIS fanboy dataset:
1. All the Twitter t.co tags have been expanded where possible.
2. There are no "description, location, followers, numberstatuses" data columns.
I have also included my version of the original pro-ISIS fanboy set. This version has all the t.co links expanded where possible.
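For reference, expanding t.co short links as described above can be done by simply following HTTP redirects; a small, hedged sketch (not the script actually used for this dataset):

```python
# Hedged sketch: resolve a t.co short link by following redirects and keeping
# the final URL; falls back to the original link if the request fails.
import requests

def expand(url, timeout=10):
    try:
        return requests.head(url, allow_redirects=True, timeout=timeout).url
    except requests.RequestException:
        return url

print(expand("https://t.co/example"))  # hypothetical short link
```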
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background
The Digital Humanities 2016 conference took place in Kraków, Poland, between Sunday 11 July and Saturday 16 July 2016. #DH2016 was the conference's official hashtag.

What This Output Is
This is a CSV file containing a total of 3717 Tweets publicly published with the hashtag #DH2016 on Thursday 14 July 2016 GMT. The archive starts with a Tweet published on Thursday 14 July 2016 at 00:01:04 +0000 and ends with a Tweet published on Thursday 14 July 2016 at 23:49:14 +0000 (GMT). Previous days have been shared as separate outputs. A breakdown of Tweets per day so far:
Sunday 10 July 2016: 179 Tweets
Monday 11 July 2016: 981 Tweets
Tuesday 12 July 2016: 2318 Tweets
Wednesday 13 July 2016: 4175 Tweets
Thursday 14 July 2016: 3717 Tweets

Methodology and Limitations
The Tweets contained in this file were collected by Ernesto Priego using Martin Hawksey's TAGS 6.0. Only users with at least 1 follower were included in the archive. Retweets have been included (Retweets count as Tweets). The collection spreadsheet was customised to reflect the time zone and geographical location of the conference. The profile_image_url and entities_str metadata were removed before public sharing in this archive. Please bear in mind that the conference hashtag has been spammed, so some Tweets collected may be from spam accounts. Some automated refining has been performed to remove Tweets not related to the conference, but the data is likely to require further refining and deduplication. Both research and experience show that the Twitter search API is not 100% reliable. Large Tweet volumes affect the search collection process. The API might "over-represent the more central users", not offering "an accurate picture of peripheral activity" (Gonzalez-Bailon, Sandra, et al. 2012). Apart from the filters and limitations already declared, it cannot be guaranteed that this file contains each and every Tweet tagged with #dh2016 during the indicated period, and the dataset is shared for archival, comparative and indicative educational research purposes only. Only content from public accounts is included, obtained from the Twitter Search API. The shared data is also publicly available to all Twitter users via the Twitter Search API and available to anyone with an Internet connection via the Twitter and Twitter Search web client and mobile apps without the need of a Twitter account. Each Tweet and its contents were published openly on the Web with the queried hashtag and are the responsibility of their original authors. Original Tweets are likely to be copyright of their individual authors, but please check individually. No private personal information is shared in this dataset. The collection and sharing of this dataset is enabled and allowed by Twitter's Privacy Policy. The sharing of this dataset complies with Twitter's Developer Rules of the Road. This dataset is shared to archive, document and encourage open educational research into scholarly activity on Twitter.

Other Considerations
Tweets published publicly by scholars during academic conferences are often tagged (labeled) with a hashtag dedicated to the conference in question. The purpose and function of hashtags is to organise and describe information/outputs under the relevant label in order to enhance the discoverability of the labeled information/outputs (Tweets in this case). A hashtag is metadata users choose freely to use so their content is associated with, directly linked to and categorised with the chosen hashtag. Though every reason for Tweeters' use of hashtags cannot be generalised nor predicted, it can be argued that scholarly Twitter users form specialised, self-selecting public professional networks that tend to observe scholarly practices and accepted modes of social and professional behaviour.
In general terms it can be argued that scholarly Twitter users willingly and consciously tag their public Tweets with a conference hashtag as a means to network and to promote, report from, reflect on, comment on and generally contribute publicly to the scholarly conversation around conferences. As Twitter users, conference Twitter hashtag contributors have agreed to Twitter's Privacy and data sharing policies. Professional associations like the Modern Language Association recognise Tweets as citeable scholarly outputs. Archiving scholarly Tweets is a means to preserve this form of rapid online scholarship that otherwise can very likely become unretrievable as time passes; Twitter's search API has well-known temporal limitations for retrospective historical search and collection. Beyond individual Tweets as scholarly outputs, the collective scholarly activity on Twitter around a conference, academic project or event can provide interesting insights for the contemporary history of scholarly communications. To date, collecting in real time is the only relatively accurate method to archive Tweets at a small scale. Though these datasets have limitations and are not thoroughly systematic, it is hoped they can contribute to developing new insights into the discipline's presence on Twitter over time. The CC-BY license has been applied to the output in the repository as a curated dataset. Authorial/curatorial/collection work has been performed on the file in order to make it available as part of the scholarly record. The data contained in the deposited file is otherwise freely available elsewhere through different methods, and anyone not wishing to attribute the data to the creator of this output is, needless to say, free to do their own collection and clean their own data.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains 7,015,186 tweets from 951,602 users, extracted using 91 search terms over 36 days between August 1st and December 31st, 2022.
All tweets in this dataset are in Brazilian Portuguese.
The dataset contains textual data from tweets, making it suitable for various NLP analyses, such as sentiment analysis, bias or stance detection, and toxic language detection. Additionally, users and tweets can be linked to create social graphs, enabling Social Network Analysis (SNA) to study polarization, communities, and other social dynamics.
This data set was extracted using Twitter's (now X) official API—when Academic Research API access was still available—following the pipeline:
1. Twitter/X daily monitoring: The dataset author monitored daily political events appearing in Brazil's Trending Topics. Twitter/X has an automated system for classifying trending terms. When a term was identified as political, it was stored along with its date for later use as a search query.
2. Tweet collection using saved search terms: Once terms and their corresponding dates were recorded, tweets were extracted from 12:00 AM to 11:59 PM on the day the term entered the Trending Topics. A language filter was applied to select only tweets in Portuguese. The extraction was performed using the official Twitter/X API.
3. Data storage: The extracted data was organized by day and search term. If the same search term appeared in Trending Topics on consecutive days, a separate file was stored for each respective day.
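A minimal sketch of the collection step (step 2) under stated assumptions: it uses tweepy with an Academic Research bearer token, and the search term and date are illustrative placeholders rather than values from the dataset:

```python
# Hedged sketch of step 2: pull Portuguese-language tweets for one trending
# term on the day it trended, via the (now discontinued) full-archive search.
# The bearer token, search term, and date below are placeholders.
import tweepy

client = tweepy.Client(bearer_token="ACADEMIC_RESEARCH_BEARER_TOKEN", wait_on_rate_limit=True)

query = '"termo de exemplo" lang:pt'  # hypothetical trending term, Portuguese only
for page in tweepy.Paginator(
    client.search_all_tweets,
    query=query,
    start_time="2022-10-02T00:00:00Z",  # 12:00 AM on the trending day
    end_time="2022-10-02T23:59:59Z",    # 11:59 PM on the same day
    tweet_fields=["created_at", "author_id", "lang"],
    max_results=500,
):
    for tweet in page.data or []:
        print(tweet.id, tweet.created_at, tweet.text[:80])
```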
For more details, visit:
- The repository
- Dataset short paper:
---
DOI: 10.5281/zenodo.14834669
The number of Twitter users in Indonesia was forecast to increase continuously between 2024 and 2028 by a total of 1.4 million users (+6.14 percent). After the ninth consecutive year of growth, the Twitter user base is estimated to reach 24.25 million users, a new peak, in 2028. Notably, the number of Twitter users has been increasing continuously over the past years. User figures, shown here for the platform Twitter, have been estimated by taking into account company filings or press material, secondary research, app downloads, and traffic data. They refer to the average monthly active users over the period. The data shown are an excerpt of Statista's Key Market Indicators (KMI). The KMI are a collection of primary and secondary indicators on the macro-economic, demographic, and technological environment in up to 150 countries and regions worldwide. All indicators are sourced from international and national statistical offices, trade associations, and the trade press, and they are processed to generate comparable data sets (see supplementary notes under details for more information). Find more key insights for the number of Twitter users in countries like Malaysia and Singapore.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
rxivist.org allowed readers to sort and filter the tens of thousands of preprints posted to bioRxiv and medRxiv. Rxivist used a custom web crawler to index all papers posted to those two websites; this is a snapshot of the Rxivist production database. The version number indicates the date on which the snapshot was taken. See the included "README.md" file for instructions on how to use the "rxivist.backup" file to import data into a PostgreSQL database server.
Please note this is a different repository than the one used for the Rxivist manuscript—that is in a separate Zenodo repository. You're welcome (and encouraged!) to use this data in your research, but please cite our paper, now published in eLife.
Previous versions are also available pre-loaded into Docker images, available at blekhmanlab/rxivist_data.
Version notes:
https://creativecommons.org/publicdomain/zero/1.0/
These tweets were collected using the Twitter API and a Python script. A query for this high-frequency hashtag (#covid19) is run on a daily basis over a period of time to collect a larger sample of tweets.
The collection script can be found here: https://github.com/gabrielpreda/covid-19-tweets
The tweets carry the #covid19 hashtag. Collection started on 25/7/2020 with an initial batch of 17k tweets and will continue on a daily basis.
You can use this data to dive into the subjects discussed under this hashtag, look at the geographical distribution, evaluate sentiment, and look at trends.
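For example, a quick look at the daily trend could start from something like the sketch below; the CSV file name and the "date" column are assumptions about the released file, not guaranteed:

```python
# Hedged sketch: count tweets per day in the released CSV.
# File name and column name are assumptions and may need adjusting.
import pandas as pd

df = pd.read_csv("covid19_tweets.csv", parse_dates=["date"])
daily = df.set_index("date").resample("D").size()
print(daily.tail())
```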
How much time do people spend on social media? As of 2024, the average daily social media usage of internet users worldwide amounted to 143 minutes per day, down from 151 minutes in the previous year. Currently, the country with the most time spent on social media per day is Brazil, with online users spending an average of three hours and 49 minutes on social media each day. In comparison, the daily time spent with social media in the U.S. was just two hours and 16 minutes.

Global social media usage: Currently, the global social network penetration rate is 62.3 percent. Northern Europe had an 81.7 percent social media penetration rate, topping the ranking of global social media usage by region. Eastern and Middle Africa closed the ranking with 10.1 and 9.6 percent usage reach, respectively. People access social media for a variety of reasons. Users like to find funny or entertaining content and enjoy sharing photos and videos with friends, but mainly use social media to stay in touch with friends and to follow current events.

Global impact of social media: Social media has a wide-reaching and significant impact not only on online activities but also on offline behavior and life in general. During a global online user survey in February 2019, a significant share of respondents stated that social media had increased their access to information, ease of communication, and freedom of expression. On the flip side, respondents also felt that social media had worsened their personal privacy, increased political polarization, and heightened everyday distractions.
The hashtag كفايه_بقى_ياسيسى# ("That's enough, Sisi") was trending at number one in Egypt on Monday, hours after Egyptian actor Mohamed Ali posted a video online calling on Egyptians to post on every social media platform asking Abdel Fattah el-Sisi to resign. In the opposite direction, the hashtag #هنكمل_مشوارنا_معاك_ياسيسي ("We will continue our journey with you, Sisi") was also trending for a couple of hours the same day.
This data set should help you understand Egyptians' behavior around a political trend, on both the supporting and the opposing sides.
Data collection date (from Twitter API): September 16, 2019
Dimensions of the data set: 600K rows and 8 columns
Sentiment analysis library used: Stanford CoreNLP
Hashtag filters include:
كفاية_بقي_ياسيسي
كفايه_بقي_ياسيسي
هنكمل_مشوارنا_معاك_ياسيسي
ارحل_يا_سيسي
وائل غنيم
عدي المليون
Data format: JSON Lines. The data are split into small files of ~30 MB each.
Tweet links and owners are hidden to keep everything anonymous. Please get in touch with me if you have a use case that requires them.
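To work with the split JSON Lines files, something like the following hedged sketch can stream every record without loading a whole split into memory (the folder name and *.jsonl file-name pattern are assumptions about the released layout):

```python
# Hedged sketch: iterate over all records in a folder of split JSON Lines files.
# The folder name and the *.jsonl pattern are assumptions, not the dataset's documented layout.
import glob
import json

def iter_tweets(folder):
    for path in sorted(glob.glob(f"{folder}/*.jsonl")):
        with open(path, encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    yield json.loads(line)

print(sum(1 for _ in iter_tweets("sisi_hashtags")))  # hypothetical folder name
```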
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Social Media platforms in Cyberspace provide communication channels for individuals, businesses, as well as state and non-state actors (i.e., individuals and groups) to conduct messaging campaigns. What are the spheres of influence that arose around the keyword #Munich on Twitter following an active shooter event at a Munich shopping mall in July 2016? To answer that question in this work, we capture tweets utilizing #Munich beginning 1 h after the shooting was reported, and the data collection ends approximately 1 month later. We construct both daily networks and a cumulative network from this data. We analyze community evolution using the standard Louvain algorithm, and how the communities change over time, to study how they both encourage and discourage the effectiveness of an information messaging campaign. We conclude that the large communities observed in the early stage of the data disappear from the #Munich conversation within 7 days. The politically charged nature of many of these communities suggests their activity migrated to other Twitter hashtags (i.e., conversation topics). Future analysis of Twitter activity might focus on tracking communities across topics and time.
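As a rough illustration of the approach described (daily interaction networks plus Louvain community detection), here is a hedged sketch; the edge definition (user to mentioned user) and the v1.1 tweet JSON layout are assumptions, not the authors' exact method:

```python
# Hedged sketch: build a mention network for one day of #Munich tweets and
# extract Louvain communities (requires networkx >= 2.8).
import networkx as nx
from networkx.algorithms import community

def mention_graph(tweets):
    """tweets: iterable of v1.1-style tweet dicts (layout assumed)."""
    G = nx.Graph()
    for tw in tweets:
        src = tw["user"]["screen_name"]
        for m in tw.get("entities", {}).get("user_mentions", []):
            G.add_edge(src, m["screen_name"])
    return G

# day_tweets would be one day's worth of collected #Munich tweets:
# communities = community.louvain_communities(mention_graph(day_tweets), seed=42)
```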
https://creativecommons.org/publicdomain/zero/1.0/
Ethereum Classic is an open-source, public, blockchain-based distributed computing platform featuring smart contract (scripting) functionality. It provides a decentralized Turing-complete virtual machine, the Ethereum Virtual Machine (EVM), which can execute scripts using an international network of public nodes. Ethereum Classic and Ethereum have a value token called "ether", which can be transferred between participants, stored in a cryptocurrency wallet and is used to compensate participant nodes for computations performed in the Ethereum Platform.
Ethereum Classic came into existence when some members of the Ethereum community rejected the DAO hard fork on the grounds of "immutability", the principle that the blockchain cannot be changed, and decided to keep using the unforked version of Ethereum. To this day, Ethereum Classic runs the original Ethereum chain.
In this dataset, you will have access to Ethereum Classic (ETC) historical block data along with transactions and traces. You can access the data from BigQuery in your notebook via the bigquery-public-data.crypto_ethereum_classic dataset.
You can use the BigQuery Python client library to query tables in this dataset in Kernels. Note that methods available in Kernels are limited to querying data. Tables are at bigquery-public-data.crypto_ethereum_classic.[TABLENAME]. Fork this kernel to get started.
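A hedged sketch of such a query using the BigQuery Python client; the transactions table and its block_timestamp column are assumed from the standard Blockchain ETL schema mentioned below, so adjust the names if they differ:

```python
# Hedged sketch: daily transaction counts from the public ETC dataset.
# Table and column names follow the usual Blockchain ETL schema (assumption).
from google.cloud import bigquery

client = bigquery.Client()
sql = """
    SELECT DATE(block_timestamp) AS day, COUNT(*) AS tx_count
    FROM `bigquery-public-data.crypto_ethereum_classic.transactions`
    GROUP BY day
    ORDER BY day DESC
    LIMIT 7
"""
for row in client.query(sql).result():
    print(row.day, row.tx_count)
```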
This dataset wouldn't be possible without the help of Allen Day, Evgeny Medvedev and Yaz Khoury. This dataset uses Blockchain ETL. Special thanks to ETC community member @donsyang for the banner image.
One of the main questions we wanted to answer concerned the Gini coefficient of the ETC data. We also wanted to analyze the DAO smart contract before and after the DAO hack and the resulting hard fork. We also wanted to analyze the network during the famous 51% attack and see what sort of patterns we could spot about the attacker.
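For reference, the Gini coefficient itself is straightforward to compute once per-address balances have been derived from the tables; a minimal sketch (the balances below are purely illustrative toy values):

```python
# Hedged sketch: Gini coefficient of a list of non-negative balances.
import numpy as np

def gini(values):
    x = np.sort(np.asarray(values, dtype=float))
    n = x.size
    if n == 0 or x.sum() == 0.0:
        return 0.0
    ranks = np.arange(1, n + 1)
    return (2.0 * np.sum(ranks * x)) / (n * x.sum()) - (n + 1.0) / n

print(gini([90.0, 5.0, 1.0, 0.5, 0.2]))  # toy balances, heavily concentrated
```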
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
We study the relationship between the sentiment levels of Twitter users and the evolving network structure that the users created by @-mentioning each other. We use a large dataset of tweets to which we apply three sentiment scoring algorithms, including the open source SentiStrength program. Specifically we make three contributions. Firstly, we find that people who have potentially the largest communication reach (according to a dynamic centrality measure) use sentiment differently than the average user: for example, they use positive sentiment more often and negative sentiment less often. Secondly, we find that when we follow structurally stable Twitter communities over a period of months, their sentiment levels are also stable, and sudden changes in community sentiment from one day to the next can in most cases be traced to external events affecting the community. Thirdly, based on our findings, we create and calibrate a simple agent-based model that is capable of reproducing measures of emotive response comparable with those obtained from our empirical dataset.
The number of Twitter users in Africa was forecast to increase continuously between 2024 and 2028 by a total of 28.1 million users (+100.75 percent). After the ninth consecutive year of growth, the Twitter user base is estimated to reach 55.96 million users, a new peak, in 2028. Notably, the number of Twitter users has been increasing continuously over the past years. User figures, shown here for the platform Twitter, have been estimated by taking into account company filings or press material, secondary research, app downloads, and traffic data. They refer to the average monthly active users over the period. The data shown are an excerpt of Statista's Key Market Indicators (KMI). The KMI are a collection of primary and secondary indicators on the macro-economic, demographic, and technological environment in up to 150 countries and regions worldwide. All indicators are sourced from international and national statistical offices, trade associations, and the trade press, and they are processed to generate comparable data sets (see supplementary notes under details for more information). Find more key insights for the number of Twitter users in countries like Australia & Oceania and North America.
The number of Reddit users in the United States was forecast to increase continuously between 2024 and 2028 by a total of 10.3 million users (+5.21 percent). After the ninth consecutive year of growth, the Reddit user base is estimated to reach 208.12 million users, a new peak, in 2028. Notably, the number of Reddit users has been increasing continuously over the past years. User figures, shown here for the platform Reddit, have been estimated by taking into account company filings or press material, secondary research, app downloads, and traffic data. They refer to the average monthly active users over the period and count multiple accounts held by one person only once. Reddit users encompass both users that are logged in and those that are not. The data shown are an excerpt of Statista's Key Market Indicators (KMI). The KMI are a collection of primary and secondary indicators on the macro-economic, demographic, and technological environment in up to 150 countries and regions worldwide. All indicators are sourced from international and national statistical offices, trade associations, and the trade press, and they are processed to generate comparable data sets (see supplementary notes under details for more information). Find more key insights for the number of Reddit users in countries like Mexico and Canada.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This paper investigates the influence of the relative size of speech communities on language use in multilingual regions and cities. Due to peoples' everyday mobility inside a city, it is still unclear whether the size of a population matters for language use on a sub-city scale. By testing the correlation between the size of a population and language use on various spatial scales, this study will contribute to a better understanding of the extent to which sociodemographic factors influence language use. The present study investigates two particular phenomena that are common to multilingual speakers, namely language mixing or Code-Switching and using multiple languages without mixing. Demographic information from a Canadian census will be used to make predictions about the intensity of Code-Switching and language use by multilinguals in cities of Quebec and neighborhoods of Montreal. Geolocated tweets will be used to identify where these linguistic phenomena occur the most and the least. My results show that the intensity of Code-Switching and the use of English by bilinguals are influenced by the size of anglophone and francophone populations on various spatial scales such as the city level, land use level (city center vs. periphery of Montreal), and large urban zones on the sub-city level, namely the western and eastern urban zones of Montreal. However, the correlation between population figures and language use is difficult to measure and evaluate on a much smaller sub-urban scale such as the city block scale, due to factors such as population figures missing from the census and people's mobility. A qualitative evaluation of language use on a small spatial scale seems to suggest that other social influences such as the location context or topic of discussion are much more important predictors of language use than population figures. Methods will be suggested for testing this hypothesis in future research. I conclude that geographic space can provide us with information about the relation between language use in multilingual cities and sociodemographic factors such as a speech community's size, and that social media is a valuable alternative data source for sociolinguistic research that offers new insights into the mechanisms of language use such as Code-Switching.
Construal level theory proposes that events that are temporally proximate are represented more concretely than events that are temporally distant. We tested this prediction using two large natural language text corpora. In study 1 we examined posts on Twitter that referenced the future, and found that tweets mentioning temporally proximate dates used more concrete words than those mentioning distant dates. In study 2 we obtained all New York Times articles that referenced U.S. presidential elections between 1987 and 2007. We found that the concreteness of the words in these articles increased with the temporal proximity to their corresponding election. Additionally, the reduction in concreteness after the election was much greater than the increase in concreteness leading up to the election, though both changes in concreteness were well described by an exponential function. We replicated this finding with New York Times articles referencing US public holidays. Overall, our results provide strong support for the predictions of construal level theory, and additionally illustrate how large natural language datasets can be used to inform psychological theory.

This network project brings together economists, psychologists, computer and complexity scientists from three leading centres for behavioural social science at Nottingham, Warwick and UEA. This group will lead a research programme with two broad objectives: to develop and test cross-disciplinary models of human behaviour and behaviour change; to draw out their implications for the formulation and evaluation of public policy. Foundational research will focus on three inter-related themes: understanding individual behaviour and behaviour change; understanding social and interactive behaviour; rethinking the foundations of policy analysis. The project will explore implications of the basic science for policy via a series of applied projects connecting naturally with the three themes. These will include: the determinants of consumer credit behaviour; the formation of social values; strategies for evaluation of policies affecting health and safety. The research will integrate theoretical perspectives from multiple disciplines and utilise a wide range of complementary methodologies including: theoretical modeling of individuals, groups and complex systems; conceptual analysis; lab and field experiments; analysis of large data sets. The Network will promote high quality cross-disciplinary research and serve as a policy forum for understanding behaviour and behaviour change.

Experimental data. In study 1, we collected and analyzed millions of time-indexed posts on Twitter. In this study we obtained a large number of tweets that referenced dates in the future, and were able to use these tweets to determine the concreteness of the language used to describe events at these dates. This allowed us to observe how psychological distance influences everyday discourse, and put the key assumptions of the CLT to a real-world test. In study 2, we analyzed word concreteness in news articles using the New York Times (NYT) Annotated Corpus (Sandhaus, 2008). This corpus contains over 1.8 million NYT articles written between 1987 and 2007. Importantly for our purposes, these articles are tagged with keywords describing the topics of the articles. In this study we obtained all NYT articles written before and after the 1988, 1992, 1996, 2000, and 2004 US Presidential elections, which were tagged as pertaining to these elections.
We subsequently tested how the concreteness of the words used in the articles varied as a function of temporal distance to the election they reference. We also performed this analysis with NYT articles referencing three popular public holidays. Unlike study 1 and prior work (such as Snefjella & Kuperman, 2015), study 2 allowed us to examine the influence of temporal distance in the past and in the future, while controlling for the exact time when specific events occurred.
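A schematic sketch of the word-scoring step under stated assumptions: a word-level concreteness lexicon is available as a CSV, and the file's column headers ("word", "concreteness") are hypothetical, not taken from the study:

```python
# Hedged sketch: average word concreteness of a text, given a lexicon CSV with
# hypothetical "word" and "concreteness" columns.
import csv

def load_norms(path):
    with open(path, newline="", encoding="utf-8") as f:
        return {row["word"].lower(): float(row["concreteness"]) for row in csv.DictReader(f)}

def mean_concreteness(text, norms):
    scores = [norms[w] for w in text.lower().split() if w in norms]
    return sum(scores) / len(scores) if scores else None
```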