Due to the relevance of the COVID-19 global pandemic, we are releasing our dataset of tweets acquired from the Twitter stream related to COVID-19 chatter. The first 9 weeks of data (January 1st to March 11th, 2020) contain very low tweet counts, as we were filtering data collected for other research purposes; even so, the dramatic increase is visible as awareness of the virus spread. Dedicated data gathering ran from March 11th to March 30th and yielded over 4 million tweets a day. We have added data provided by our new collaborators, covering January 27th to February 27th, for extra longitudinal coverage.
The data collected from the stream captures all languages, but the most prevalent are English, Spanish, and French. We release all tweets and retweets in the full_dataset.tsv file (101,400,452 unique tweets) and a cleaned version with no retweets in the full_dataset-clean.tsv file (20,244,746 unique tweets). There are several practical reasons to keep the retweets; tracing important tweets and their dissemination is one of them. For NLP tasks we provide the top 1000 frequent terms in frequent_terms.csv, the top 1000 bigrams in frequent_bigrams.csv, and the top 1000 trigrams in frequent_trigrams.csv. General per-day statistics for both datasets are included in the statistics-full_dataset.tsv and statistics-full_dataset-clean.tsv files.
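The frequent-term files amount to n-gram frequency counts over the (hydrated) tweet text. A minimal sketch of how such counts could be reproduced, assuming a naive regex tokenizer (the authors' exact tokenization is not specified):

```python
from collections import Counter
import re

def top_ngrams(texts, n=1, k=1000):
    """Count the k most frequent n-grams across a list of tweet texts."""
    counts = Counter()
    for text in texts:
        # Naive tokenizer: lowercase words, hashtags, and mentions.
        tokens = re.findall(r"[a-z0-9#@']+", text.lower())
        counts.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return counts.most_common(k)

tweets = ["COVID cases rise again", "covid cases fall", "stay home stay safe"]
print(top_ngrams(tweets, n=2, k=3))
```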
More details can be found (and will be updated faster) at https://github.com/thepanacealab/covid19_twitter.
As always, the tweets distributed here are only tweet identifiers (with date and time added), due to Twitter's terms and conditions on redistributing Twitter data. They need to be hydrated before use.
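Hydration means looking the IDs back up against the Twitter API. A sketch of the idea using the public v2 tweet-lookup endpoint, which accepts at most 100 IDs per request; the bearer token is a placeholder and error handling is omitted:

```python
import json
import urllib.request

def chunks(ids, size=100):
    """The v2 tweet-lookup endpoint accepts at most 100 IDs per request."""
    for i in range(0, len(ids), size):
        yield ids[i:i + size]

def hydrate(ids, bearer_token):
    """Yield hydrated tweet objects for a list of tweet ID strings."""
    for batch in chunks(ids):
        url = "https://api.twitter.com/2/tweets?ids=" + ",".join(batch)
        req = urllib.request.Request(
            url, headers={"Authorization": f"Bearer {bearer_token}"}
        )
        with urllib.request.urlopen(req) as resp:
            yield from json.load(resp).get("data", [])
```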
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Disclaimer: This dataset is distributed by Daniel Gayo-Avello, an associate professor in the Department of Computer Science at the University of Oviedo, for the sole purpose of non-commercial research, and it includes only tweet IDs.
The dataset contains tweet IDs for all tweets published (in any language) between March 21, 2006 and July 31, 2009, comprising the first three full years of Twitter from its creation: about 1.5 billion tweets (see file Twitter-historical-20060321-20090731.zip).
It covers several defining moments in Twitter's history, such as the invention of hashtags, retweets and trending topics, and it includes tweets related to the 2008 US presidential election, Barack Obama's first inauguration speech, and the 2009 Iran election protests (one of the so-called Twitter Revolutions).
Finally, it contains tweets in many major languages (mainly English, Portuguese, Japanese, Spanish, German and French), so it should be possible, at least in theory, to analyze international events from different cultural perspectives.
The dataset was completed in November 2016, so the tweet IDs it contains were publicly available at that moment. This means there may be tweets that were public during that period but do not appear in the dataset, and that a substantial share of the tweets in the dataset have been deleted (or locked) since 2016.
To make the decay of tweet IDs in the dataset easier to understand, a number of representative samples (99% confidence level, ±0.5 confidence interval) are provided.
In general terms, 85.5% ±0.5 of the historical tweets were available as of May 19, 2020 (see file Twitter-historical-20060321-20090731-sample.txt). However, since the number of tweets varies greatly across the three years covered by the dataset, additional representative samples are provided for 90-day intervals (see the file 90-day-samples.zip).
In that regard, the ratio of publicly available tweets (as of May 19, 2020) is as follows:
March 21, 2006 to June 18, 2006: 88.4% ±0.5 (from 5,512 tweets).
June 18, 2006 to September 16, 2006: 82.7% ±0.5 (from 14,820 tweets).
September 16, 2006 to December 15, 2006: 85.7% ±0.5 (from 107,975 tweets).
December 15, 2006 to March 15, 2007: 88.2% ±0.5 (from 852,463 tweets).
March 15, 2007 to June 13, 2007: 89.6% ±0.5 (from 6,341,665 tweets).
June 13, 2007 to September 11, 2007: 88.6% ±0.5 (from 11,171,090 tweets).
September 11, 2007 to December 10, 2007: 87.9% ±0.5 (from 15,545,532 tweets).
December 10, 2007 to March 9, 2008: 89.0% ±0.5 (from 23,164,663 tweets).
March 9, 2008 to June 7, 2008: 66.5% ±0.5 (from 56,416,772 tweets; see below for more details on this).
June 7, 2008 to September 5, 2008: 78.3% ±0.5 (from 62,868,189 tweets; see below for more details on this).
September 5, 2008 to December 4, 2008: 87.3% ±0.5 (from 89,947,498 tweets).
December 4, 2008 to March 4, 2009: 86.9% ±0.5 (from 169,762,425 tweets).
March 4, 2009 to June 2, 2009: 86.4% ±0.5 (from 474,581,170 tweets).
June 2, 2009 to July 31, 2009: 85.7% ±0.5 (from 589,116,341 tweets).
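For context, the ±0.5 margins quoted above are consistent with a standard normal-approximation confidence interval at the 99% level. A quick check of the calculation (the sample size below is illustrative, not the dataset's actual one):

```python
import math

def margin_of_error(p, n, z=2.576):
    """Half-width of a normal-approximation confidence interval, in
    percentage points. z = 2.576 corresponds to a 99% confidence level."""
    return 100 * z * math.sqrt(p * (1 - p) / n)

# A sample of roughly this size makes an 85.5% availability estimate
# carry about a +/-0.5 point margin at 99% confidence.
print(round(margin_of_error(0.855, 33000), 2))
```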
The apparent drop in available tweets from March 9, 2008 to September 5, 2008 has an easy, although embarrassing, explanation.
While cleaning the data to publish this dataset, there appeared to be a gap from April 1, 2008 to July 7, 2008 (in fact, the data was not missing, just stored in a different backup). Since tweet IDs are easy to regenerate for that Twitter era (source code is provided in generate-ids.m), I simply produced all the IDs created between those two dates. All those tweets actually existed, but a number of them were private and not crawlable. For those regenerated IDs, the actual ratio of public tweets (as of May 19, 2020) is 62.3% ±0.5.
In other words, what you see in that period (April to July 2008) is not actually a huge number of deleted tweets but a combination of deleted and non-public tweets (whose IDs would ideally be excluded from the dataset, for performance when rehydrating it).
Additionally, since not everybody will need the whole time period, the earliest tweet ID for each date is provided in the file date-tweet-id.tsv.
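Because that file maps each date to its earliest tweet ID, a date range can be turned into an ID range. A small sketch, assuming a date<TAB>ID column layout and using made-up IDs for illustration:

```python
import csv
import io

def id_bounds(tsv_text, start_date, end_date):
    """Return the earliest tweet ID on start_date and on end_date,
    assuming each row is 'YYYY-MM-DD<TAB>earliest_tweet_id'."""
    first = {
        row[0]: int(row[1])
        for row in csv.reader(io.StringIO(tsv_text), delimiter="\t")
    }
    return first[start_date], first[end_date]

# Illustrative rows only; the real date-tweet-id.tsv holds actual IDs.
sample = "2008-04-01\t783000000\n2008-04-02\t784100000\n"
print(id_bounds(sample, "2008-04-01", "2008-04-02"))
```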
For additional details regarding this dataset please see: Gayo-Avello, Daniel. "How I Stopped Worrying about the Twitter Archive at the Library of Congress and Learned to Build a Little One for Myself." arXiv preprint arXiv:1611.08144 (2016).
If you use this dataset in any way please cite that preprint (in addition to the dataset itself).
If you need to contact me, you can find me as @PFCdgayo on Twitter.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We are releasing a Twitter dataset connected to our project Digital Narratives of Covid-19 (DHCOVID), which, among other goals, aims to explore over one year (May 2020 to May 2021) the narratives behind data about the coronavirus pandemic.
In this first version, we deliver a Twitter dataset organized as follows:
For English, we collect all tweets with the following keywords and hashtags: covid, coronavirus, pandemic, quarantine, stayathome, outbreak, lockdown, socialdistancing. For Spanish, we search for: covid, coronavirus, pandemia, quarentena, confinamiento, quedateencasa, desescalada, distanciamiento social.
The corpus of tweets consists of a list of tweet IDs; to obtain the original tweets, you can use a hydrator tool, which takes the IDs and downloads all their metadata into a CSV file.
We started collecting this Twitter dataset on April 24th, 2020, and we add data daily to our GitHub repository. There is one known problem: for the file 2020-04-24/dhcovid_2020-04-24_es.txt, we could not gather the data due to technical reasons.
For more information about our project visit https://covid.dh.miami.edu/
For more updated datasets and detailed criteria, check our GitHub Repository: https://github.com/dh-miami/narratives_covid19/
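The per-day files in the repository follow the naming pattern shown above (e.g. 2020-04-24/dhcovid_2020-04-24_es.txt). A sketch for enumerating the expected paths over a date range, assuming English and Spanish daily files:

```python
from datetime import date, timedelta

def daily_paths(start, end, langs=("en", "es")):
    """Enumerate per-day, per-language ID files following the
    '<date>/dhcovid_<date>_<lang>.txt' pattern from the repository."""
    day = start
    while day <= end:
        for lang in langs:
            yield f"{day.isoformat()}/dhcovid_{day.isoformat()}_{lang}.txt"
        day += timedelta(days=1)

paths = list(daily_paths(date(2020, 4, 24), date(2020, 4, 25)))
print(paths)
```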
This dataset consists of IDs of geotagged Twitter posts from within the United States. They are provided as files per day and state as well as per day and county. In addition, files containing the aggregated number of hashtags from these tweets are provided per day and state and per day and county. The data is organized as one ZIP file per month, containing ZIP files per day, which hold the TXT files with the ID/hashtag information.
Also part of the dataset are two shapefiles for the US counties and states and Python scripts for the data collection and sorting geotags into counties.
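Sorting geotags into counties boils down to point-in-polygon tests against the shapefile geometries. The dataset ships its own Python scripts; purely as an illustration of the underlying test, here is a minimal ray-casting implementation:

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test: does (lon, lat) fall inside a polygon given as a
    list of (lon, lat) vertices? Points exactly on an edge may go either way."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Count edge crossings of a ray extending east from the point.
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside

square = [(0, 0), (10, 0), (10, 10), (0, 10)]
print(point_in_polygon(5, 5, square), point_in_polygon(15, 5, square))
```

Real county boundaries are multi-part polygons with holes, so a production version would use a geometry library rather than this bare test.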
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background

The Digital Humanities 2016 conference took place in Kraków, Poland, between Sunday 11 July and Saturday 16 July 2016. #DH2016 was the conference's official hashtag.

What This Output Is

This is a CSV file containing a total of 3717 Tweets publicly published with the hashtag #DH2016 on Thursday 14 July 2016 (GMT). The archive starts with a Tweet published on Thursday 14 July 2016 at 00:01:04 +0000 and ends with a Tweet published on Thursday 14 July 2016 at 23:49:14 +0000 (GMT). Previous days have been shared as different outputs. A breakdown of Tweets per day so far:

Sunday 10 July 2016: 179 Tweets
Monday 11 July 2016: 981 Tweets
Tuesday 12 July 2016: 2318 Tweets
Wednesday 13 July 2016: 4175 Tweets
Thursday 14 July 2016: 3717 Tweets

Methodology and Limitations

The Tweets contained in this file were collected by Ernesto Priego using Martin Hawksey's TAGS 6.0. Only users with at least one follower were included in the archive. Retweets have been included (retweets count as Tweets). The collection spreadsheet was customised to reflect the time zone and geographical location of the conference. The profile_image_url and entities_str metadata were removed before public sharing in this archive. Please bear in mind that the conference hashtag has been spammed, so some Tweets collected may be from spam accounts. Some automated refining has been performed to remove Tweets not related to the conference, but the data is likely to require further refining and deduplication. Both research and experience show that the Twitter search API is not 100% reliable. Large Tweet volumes affect the search collection process. The API might "over-represent the more central users", not offering "an accurate picture of peripheral activity" (González-Bailón, Sandra, et al. 2012). Apart from the filters and limitations already declared, it cannot be guaranteed that this file contains each and every Tweet tagged with #dh2016 during the indicated period, and the dataset is shared for archival, comparative and indicative educational research purposes only. Only content from public accounts is included and was obtained from the Twitter Search API.
The shared data is also publicly available to all Twitter users via the Twitter Search API, and available to anyone with an Internet connection via the Twitter and Twitter Search web clients and mobile apps, without the need of a Twitter account. Each Tweet and its contents were published openly on the Web with the queried hashtag and are the responsibility of their original authors. Original Tweets are likely to be copyright of their individual authors, but please check individually. No private personal information is shared in this dataset. The collection and sharing of this dataset is enabled and allowed by Twitter's Privacy Policy. The sharing of this dataset complies with Twitter's Developer Rules of the Road. This dataset is shared to archive, document and encourage open educational research into scholarly activity on Twitter.

Other Considerations

Tweets published publicly by scholars during academic conferences are often tagged (labeled) with a hashtag dedicated to the conference in question. The purpose and function of hashtags is to organise and describe information/outputs under the relevant label in order to enhance the discoverability of the labeled information/outputs (Tweets in this case). A hashtag is metadata users choose freely so their content is associated with, directly linked to and categorised under the chosen hashtag. Though every reason for Tweeters' use of hashtags cannot be generalised nor predicted, it can be argued that scholarly Twitter users form specialised, self-selecting public professional networks that tend to observe scholarly practices and accepted modes of social and professional behaviour. In general terms it can be argued that scholarly Twitter users willingly and consciously tag their public Tweets with a conference hashtag as a means to network and to promote, report from, reflect on, comment on and generally contribute publicly to the scholarly conversation around conferences.
As Twitter users, conference Twitter hashtag contributors have agreed to Twitter's privacy and data-sharing policies. Professional associations like the Modern Language Association recognise Tweets as citeable scholarly outputs. Archiving scholarly Tweets is a means to preserve this form of rapid online scholarship that otherwise can very likely become unretrievable as time passes; Twitter's search API has well-known temporal limitations for retrospective historical search and collection. Beyond individual tweets as scholarly outputs, the collective scholarly activity on Twitter around a conference or academic project or event can provide interesting insights for the contemporary history of scholarly communications. To date, collecting in real time is the only relatively accurate method to archive tweets at a small scale. Though these datasets have limitations and are not thoroughly systematic, it is hoped they can contribute to developing new insights into the discipline's presence on Twitter over time. The CC-BY license has been applied to the output in the repository as a curated dataset. Authorial/curatorial/collection work has been performed on the file in order to make it available as part of the scholarly record. The data contained in the deposited file is otherwise freely available elsewhere through different methods, and anyone not wishing to attribute the data to the creator of this output is, needless to say, free to do their own collection and clean their own data.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains 7,015,186 tweets from 951,602 users, extracted using 91 search terms over 36 days between August 1st and December 31st, 2022.
All tweets in this dataset are in Brazilian Portuguese.
The dataset contains textual data from tweets, making it suitable for various NLP analyses, such as sentiment analysis, bias or stance detection, and toxic language detection. Additionally, users and tweets can be linked to create social graphs, enabling Social Network Analysis (SNA) to study polarization, communities, and other social dynamics.
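As an illustration of the SNA use case, a weighted retweet edge list (retweeter to original author) can be aggregated directly from the hydrated records. The field names below are assumptions for the sketch, not the dataset's actual schema:

```python
from collections import Counter

def retweet_edges(tweets):
    """Aggregate a weighted edge list (retweeter -> original author) from
    tweet records; 'user' and 'retweeted_user' are illustrative field names."""
    edges = Counter()
    for t in tweets:
        if t.get("retweeted_user"):
            edges[(t["user"], t["retweeted_user"])] += 1
    return edges

sample = [
    {"user": "a", "retweeted_user": "b"},
    {"user": "a", "retweeted_user": "b"},
    {"user": "c", "retweeted_user": None},  # an original tweet, no edge
]
print(retweet_edges(sample))
```

The resulting edge list can be loaded straight into a graph library for community detection or polarization analysis.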
This dataset was extracted using the official Twitter (now X) API, when Academic Research API access was still available, following this pipeline:
1. Twitter/X daily monitoring: The dataset author monitored daily political events appearing in Brazil's Trending Topics. Twitter/X has an automated system for classifying trending terms. When a term was identified as political, it was stored along with its date for later use as a search query.
2. Tweet collection using saved search terms: Once terms and their corresponding dates were recorded, tweets were extracted from 12:00 AM to 11:59 PM on the day the term entered the Trending Topics. A language filter was applied to select only tweets in Portuguese. The extraction was performed using the official Twitter/X API.
3. Data storage: The extracted data was organized by day and search term. If the same search term appeared in Trending Topics on consecutive days, a separate file was stored for each respective day.
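Step 2 above amounts to issuing one day-bounded, language-filtered query per trending term. A sketch of the request parameters as they would look for the v2 full-archive search endpoint (the authors' exact client code is not reproduced here):

```python
from datetime import date, datetime, time, timedelta, timezone

def search_params(term, day):
    """Build v2 full-archive search parameters covering one calendar day
    (UTC), filtered to Portuguese, for a trending term."""
    start = datetime.combine(day, time.min, tzinfo=timezone.utc)
    end = start + timedelta(days=1) - timedelta(seconds=1)  # 23:59:59
    return {
        "query": f"{term} lang:pt",
        "start_time": start.isoformat(),
        "end_time": end.isoformat(),
        "max_results": 500,
    }

print(search_params("#eleicoes", date(2022, 10, 2)))
```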
For more details, visit:
- The repository
- Dataset short paper:
---
DOI: 10.5281/zenodo.14834669
Monthly Tweetreach reports monitoring the online metrics for the Mayor of London's 'Ask Boris' Twitter sessions. Each report is generated against the search term, hashtag #askboris and is run from the day before the session starts to the end of the day the session takes place. Data includes:
Reach: The number of unique Twitter accounts that received tweets about the session
Exposure: The number of impressions generated by tweets in the report
Activity: Total number of tweets, contributors, time period and volume
Type: Number of tweets, retweets and replies
Timeline: A full list of tweets
Notes: A full description of Tweetreach analytics and descriptors is available on www.tweetreach.com or in the article Understanding the TweetReach snapshot report
Please note that due to limitations with the listening tool not all tweets from Ask Boris sessions are captured in the reports.
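For orientation, the Reach and Exposure metrics above can be approximated from raw tweet data as unique authors and summed follower counts respectively; TweetReach's exact methodology may differ:

```python
def reach_and_exposure(tweets):
    """Reach: unique accounts tweeting. Exposure: total potential
    impressions, approximated as the sum of each tweet's author followers."""
    reach = len({t["user"] for t in tweets})
    exposure = sum(t["followers"] for t in tweets)
    return reach, exposure

sample = [
    {"user": "a", "followers": 100},
    {"user": "a", "followers": 100},  # same account tweeting twice
    {"user": "b", "followers": 50},
]
print(reach_and_exposure(sample))
```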
Tweet Reach report - 28 June 2012
Tweet Reach report - 20 July 2012
Tweet Reach report - 30 August 2012
Tweet Reach report - 28 September 2012
Tweet Reach report - 29 October 2012
Tweet Reach report - 23 November 2012
Tweet Reach report - 20 December 2012
Tweet Reach report - 18 January 2013
Tweet Reach report - 25 February 2013
Tweet Reach report - 22 March 2013
Tweet Reach report - 26 April 2013
Tweet Reach report - 20 June 2013
Tweet Reach report - 18 July 2013
Tweet Reach report - 29 August 2013
Tweet Reach report - 23 September 2013
Tweet Reach report - 22 October 2013
Tweet Reach report - 25 November 2013
Tweet Reach report - 13 December 2013
Tweet Reach report - 15 January 2014
Tweet Reach report - 13 February 2014
Tweet Reach report - 27 March 2014
Tweet Reach report - 29 May 2014
Tweet Reach report - 26 June 2014
Tweet Reach report - 16 July 2014
Tweet Reach report - 5 August 2014
Tweet Reach report - 11 September 2014
Tweet Reach report - 20 October 2014
Tweet Reach report - 10 November 2014
https://creativecommons.org/publicdomain/zero/1.0/
The image at the top of the page is a frame from today's (7/26/2016) Isis #TweetMovie from twitter, a "normal" day when two Isis operatives murdered a priest saying mass in a French church. (You can see this in the center left). A selection of data from this site is being made available here to Kaggle users.
UPDATE: An excellent study by Audrey Alexander titled "Digital Decay?" is now available, which traces the "change over time among English-language Islamic State sympathizers on Twitter."
This data set is intended to be a counterpoise to the How Isis Uses Twitter data set. That data set contains 17k tweets alleged to originate with "100+ pro-ISIS fanboys". This new set contains 122k tweets collected on two separate days, 7/4/2016 and 7/11/2016, which contained any of the following terms, with no further editing or selection:
This is not a perfect counterpoise, as it almost surely contains a small number of pro-Isis fanboy tweets. However, unless some entity, such as Kaggle, is willing to expend significant resources on something like an expert-level Mechanical Turk or Zooniverse service, a high-quality counterpoise is out of reach.
A counterpoise provides a balance or backdrop against which to measure a primary object, in this case the original pro-Isis data. So anyone who wants to discriminate between pro-Isis tweets and other tweets concerning Isis will need to model the original pro-Isis data (the signal) against the counterpoise, which is signal plus noise. Further background and some analysis can be found in this forum thread.
This data comes from postmodernnews.com/token-tv.aspx which daily collects about 25MB of Isis tweets for the purposes of graphical display. PLEASE NOTE: This server is not currently active.
There are several differences between the format of this data set and the pro-ISIS fanboy dataset:
1. All the twitter t.co tags have been expanded where possible.
2. There are no "description, location, followers, numberstatuses" data columns.
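Expanding t.co tags means following the shortener's redirect to the final URL. A sketch of the rewriting step with the resolver left pluggable (a real resolver would issue an HTTP request and read the redirect target):

```python
import re

TCO = re.compile(r"https?://t\.co/\w+")

def expand_links(text, resolve):
    """Replace every t.co link in a tweet with resolve(link); resolve would
    normally follow HTTP redirects (e.g. via urllib with a HEAD request)."""
    return TCO.sub(lambda m: resolve(m.group(0)), text)

# A fake resolver standing in for real redirect-following.
fake = {"https://t.co/abc123": "https://example.org/article"}
print(expand_links("read this https://t.co/abc123 now", fake.get))
```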
I have also included my version of the original pro-ISIS fanboy set. This version has all the t.co links expanded where possible.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Bittensor Subnet 13 X (Twitter) Dataset
Dataset Summary
This dataset is part of the Bittensor Subnet 13 decentralized network, containing preprocessed data from X (formerly Twitter). The data is continuously updated by network miners, providing a real-time stream of tweets for various analytical and machine learning tasks. For more information about the dataset, please visit the official repository.
Supported Tasks
The versatility of this… See the full description on the dataset page: https://huggingface.co/datasets/icedwind/x_dataset_34576.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Datasets and details for the paper "Based and confused: Tracing the political connotations of a memetic phrase across the web"
Datasets
Datasets for the case study on the spread of the vernacular term "based" across 4chan/pol/, Reddit, and Twitter. Data was gathered in November 2021. All files are anonymised as much as possible. They contain:
Queries
Table 2 below details the queries we carried out for the collection of the initial datasets. For all platforms, we chose to retain non-English languages since the diffusion of the term in other languages was also deemed relevant.
source | query | query type
Twitter | (#based OR (based (pilled OR pill OR redpilled OR redpill OR chad OR virgin OR cringe OR cringy OR triggered OR trigger OR tbh OR lol OR lmao OR wtf OR swag OR nigga OR finna OR bitch OR rare) ) OR " is based" OR "that\'s based" OR "based as fuck" OR "based af" OR "too based" OR "fucking based" "extremely based" OR "totally based" OR "incredibly based" OR "very based" OR "so based" OR "pretty based" OR "quite based" OR "kinda based" OR "kind of based" OR "fairly based" OR "based ngl" OR "as based as" OR "thank you based " OR "stay based" OR "based god") -"based in"-"based off"-"based * off"-"based around"-"based * around"-"based on"-"based * on"-"based out of"-"based upon"-"based * upon"-"based at"-"based from"-"is based by"-"is based of"-"on which * is based"-"upon which * is based"-"which is based there"-"is based all over"-"based more on"-"plant based"-"text based"-"turn based"-"need based"-"evidence based"-"community based" -"web based" -is:retweet -is:nullcast | Twitter v2 API
Reddit | based -"based in" -"based off" -"based around" -"based on" -"based them on" -"based it on" -"evidence based" | Pushshift API
4chan/pol/ | lower(body) LIKE '%based%' AND lower(body) NOT SIMILAR TO '%(-based|debased|based in |based off |based around |based on |based them on|based it on|based her on|based him on|based only on|based completely on|based solely on|based purely on|based entirely on|based not on |based not simply on|based entirely around|based out of|based upon |based at |is based by |is based of|on which it is based|on which this is based|which is based there|is based all over|which it is based|is based of |based firmly on|based off |based solely off|based more on|plant based|text based|turn based|need based|evidence based|community based|home based|internet based|web based|physics based)%' | PostgreSQL |
Data gaps
There were some data gaps for 4chan/pol/ and Reddit. /pol/ data was missing because of gaps in the archives (mostly due to outages). The following time periods are incomplete or missing entirely:
15 - 16 April 2019
14 - 15 December 2019
3 - 10 December 2020
29 March 2021
10 - 12 April 2021
16 - 18 August 2021
11 October 2021
The 4plebs archive moreover only started in November 2013, meaning the first two years of /pol/’s existence are missing.
The Pushshift API did not return posts for certain dates. We somewhat mitigated this by also retrieving data through the new beta endpoint. However, data was still missing for the following time periods:
1 - 30 September 2017
1 February - 31 March 2018
5 - 6 November 2020
23 - 27 March 2021
10 - 13 April 2021
Filtering
After initial data collection, we carried out several rounds of filtering to get rid of remaining false positives. For 4chan/pol/, we only needed to do this filtering once (attaining 0.95 precision), while for Twitter we carried out eight rounds (0.92 precision). For Reddit, we formulated nearly 500 exclusions but failed to reach precision above 0.9, so we had to do more rigorous filtering. We observed that longer comments were more likely to be false positives, so we removed all comments over 350 characters long. We settled on this number on the basis of our first sample; almost no true positives were over 350 characters long. Furthermore, we removed all comments except those in which "based" was used as a standalone word (thus excluding e.g. "plant-based"), at the start or end of a sentence, in capitals, or in conjunction with certain keywords or in certain phrases (e.g. "kinda based"). We also deleted posts by bot accounts by (rather crudely) removing posts from usernames including 'bot' or 'auto'. This finally led to a precision of 0.9.
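The final filtering pass described above can be sketched roughly as follows; the compound list here is a small illustrative subset of the nearly 500 exclusions actually used:

```python
import re

# Illustrative subset of compound exclusions ("plant-based", "web based", ...).
COMPOUND = re.compile(r"\b(plant|text|turn|need|evidence|community|web)[- ]based\b", re.I)
STANDALONE = re.compile(r"\bbased\b", re.I)

def keep_comment(text, max_len=350):
    """Keep a comment only if it is short and uses 'based' as a standalone
    term rather than inside a compound like 'plant-based'."""
    if len(text) > max_len:
        return False
    stripped = COMPOUND.sub("", text)  # drop compound uses first
    return bool(STANDALONE.search(stripped))

print(keep_comment("that take is so based"), keep_comment("I prefer a plant-based diet"))
```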
Exclusion filter lists (raw, pipe-separated):

-based|location based |
@-mentions with “based” "on which "where "wherever #based #customer| alkaline based| anime based | are based near | astrology based | at the based of| b0Iuip5wnA| based economy| based game | based locally| based my name | based near | based not upon| based points| based purely off| based quite near | based solely off| based soy source| based upstairs| blast based| class based| clearly based of this| combat based| condition based| dos based| emotional based| eth based| fact based| gender based| he based his | he's based in | indian based| is based for fans| is based lies| is based near | is based not around | is based not on | is based once again on | is based there| is based within| issue based| jersey based| listen to 01 we rare| music based| oil based| on which it's based| page based 1000| paper based| park based | pc based| pic based| pill based regimen| puzzle based| sex based | she based her | she's based in | skill based| story based| they based their | they're based in| toronto based| trigger on a new yoga 2| u.s. 
based| universal press| us based| value based| we're based in | where you based?| you're based in |#alkaline #based|#apps #based|#based #acidic|#flash #based|#home #based|#miami #based|#piano #based|#value #based|american based|australia based|australian based|based my decision|based entirely around|based entirely on|based exactly on |based her announcement|based her decision|based her off|based him off|based his announcement|based his decision|based largely on|based less on|based mostly on|based my guess|based only around|based only on|based partly on|based partly upon|based purely on |based solely around|based solely on|based strictly on|based the announcement|based the decision|based their announcement|based their decision|based, not upon|battery based|behavior based|behaviour based|blockchain based|book based series|canon based|character based|cloud based|commision based|component based|computer based|confusion based|content based|depression based|dev based|dnd based|factually based|faith based|fear based|flash based|flintstones based|flour based|home based|homin based|i based my|interaction based|is based circa|is based competely on|is based entirely off|is based here|is based more on|is based outta|is based totally on |is based up here|is based way more on|live conferences with r3|living based of|london based|luck based|malex based|market based|miami based|needs based|nyc based|on which the film is based|opinion based|piano based|point based|potato based|premise is based|region based|religious based|science based|she is based there|slavery based show|softball based|thanks richard clark|u.k. based|uk based|vendor based|vodka based|volunteer based|water based|where he is based|where the disney film is based|where the military is based|who are based there|who is based there|wordpress cms |
Allowed all posts:
The research project, SPARTA (Society, Politics, and Risk with Twitter Analysis), funded by dtec.bw (which is funded by the European Union – NextGenerationEU), monitors the 2023 state election campaign in Bavaria live as it unfolds on Twitter/X. From September 4 to the election day on October 8, 2023, we collect and analyze all German-language posts and reposts related to the election and its central actors in real time. We publish the results in a nowcasting fashion on the project’s WebApp (https://dtecbw.de/sparta/). Among other findings, we present the stances expressed toward the main parties and their leading candidates. We also illustrate the salient issues discussed as well as the most frequently used hashtags by the election Twittersphere (for example, all tweets addressing the election and its central actors), political parties, leading candidates, and candidates for a mandate in the state parliament. We also measure the extent of negative campaigning and personalization. To enable real-time analyses of the election campaign, we created a dataset with the Twitter/X handles of all candidates for a mandate in the state parliament in August 2023. The dataset contains the Twitter/X handles and additional information about the candidates from six parties: CSU, Bündnis 90/Die Grünen, Freie Wähler, AfD, SPD, and FDP.
https://creativecommons.org/publicdomain/zero/1.0/
One of the greatest unbeaten streaks in the history of the English Premier League had to end somewhere; the dream of going 50 games unbeaten and taking Arsenal's place in the record books is over. Stopped by an astonishing performance from a Watford side that began the day in 19th place. The clock stops for Liverpool at 44 games undefeated, going back to January 3rd last year, and this is the end of a remarkable 18 straight wins in the league.
Data collection date (from Twitter API): Feb 29, 2020
Dimensions of the data set: 260K rows and 4 columns in CSV format.
Hashtag filters include:
Liverpool
WATLIV
ليفربول (Arabic for "Liverpool")
LFC
Tweet links and owners are hidden to keep everything anonymous. Please get in touch with me if you have a use case that requires them.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
'Building Scholarly Resources for Wider Public Engagement' was a full-day workshop that took place at the Radcliffe Observatory Quarter, Oxford University, Oxford, on Friday 13 June 2014. The hashtag for the event was #DHCOxf. It was organised by DHCrowdscribe, the online hub for the output of the AHRC-funded Collaborative Skills Project 'Promoting Interdisciplinary Engagement in the Digital Humanities'. Speakers were: Matt Vitins and Anna Crowe (Legal and Ethical Issues in the Digital Humanities), Dr Stuart Dunn (Crowdsourcing), Dr Robert Simpson (Zooniverse), Dr Ernesto Priego and Dr James Baker (Sharing Data from a Researcher's Perspective), Michael Popham, Dr Ylva Berglund Prytz (Digitising the Humanities and Engaging with the Public), Judith Siefring (Early English Books Online Text Creation Partnership), David Tomkins (Bodleian Digital Library), Dr Robert Mcnamee (Electronic Enlightenment Project), Dr Stewart Brookes ('Getting Medieval, Getting Palaeography: The DigiPal Database of Anglo-Saxon Manuscripts'), Dr Michael Athanson (ArcGIS and Mapping the Humanities), Professor David de Roure (Scholarly Social Machines), and Professor Howard Hotson. This .XLS file contains Tweets tagged with #DHCOxf (case not sensitive). The archive shared here contains 692 Tweets dated 13 June 2014 (the day the event took place). There were definitely more Tweets tagged #DHCOxf, but this was the closest I got to compiling a more or less complete set dated 13 June 2014. The Tweets contained in this file were collected using Martin Hawksey's TAGS 5.1. The file contains two sheets: Sheet 0, the 'Cite Me' sheet, including provenance of the file, citation information, information about its contents, the methods employed and some context; and Sheet 1, the archive containing 692 Tweets dated 13 June 2014. To avoid spam, only users with at least 2 followers were included in the archive. Retweets have been included.
Please note that both research and experience show that the Twitter search API isn't 100% reliable. Large tweet volumes affect the search collection process, and the API might "over-represent the more central users", not offering "an accurate picture of peripheral activity" (González-Bailón, Sandra, et al. 2012). Therefore, it cannot be guaranteed that this file contains each and every tweet tagged with #DHCOxf during the indicated period. Some deduplication and refining have been performed to remove spam tweets and duplicates; even so, some characters in some Tweets' text might not have been decoded correctly, and the data in this file is likely to require further refining and even further deduplication. The data is shared as is. If you use or refer to this data in any way, please cite and link back using the citation information above. All the data collected in this small dataset was willingly made freely, openly and publicly available online by users via Twitter, and therefore was and still is openly and freely available through several other methods and services. It has been shared here in a curated form for educational and research use, and no copyright or privacy infringement is intended or should be inferred. This file was created and shared by Ernesto Priego (Centre for Information Science, City University London) under a Creative Commons Attribution licence (CC-BY).
[Please make sure you are looking at the latest version of the file as earlier versions contained unfortunate typos].
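The deduplication and spam-filtering steps described above can be sketched in a few lines of pandas. This is a minimal, hypothetical illustration: the column names (`id_str`, `from_user`, `user_followers_count`, `text`) follow a typical TAGS-style export but are assumptions here, not the verified headers of the shared .XLS file.

```python
import pandas as pd

# Toy rows in the spirit of a TAGS export; column names are assumed,
# not taken from the actual archive.
tweets = pd.DataFrame({
    "id_str": ["1", "2", "2", "3"],
    "from_user": ["alice", "bob", "bob", "spambot"],
    "user_followers_count": [150, 12, 12, 1],
    "text": ["#DHCOxf talk!", "RT #DHCOxf", "RT #DHCOxf", "buy now"],
})

# Drop exact duplicates by tweet id, then apply the "at least
# 2 followers" anti-spam rule described above.
deduped = tweets.drop_duplicates(subset="id_str")
kept = deduped[deduped["user_followers_count"] >= 2]
print(len(kept))  # 2
```

Filtering on follower count is a crude but cheap spam heuristic; as the note above says, the result may still need further refining.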
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The spreadsheets contain aggregate statistics for abusive language found in tweets to UK politicians in 2019. An overview spreadsheet is provided for each of the months of January to November ("per-mp-xxx-2019.csv", where xxx is the abbreviation for the month), with one row per MP. A spreadsheet with data per day is provided for the campaign period of the UK 2019 general election ("campaign-period-per-cand-per-day.csv"), with one row per candidate, starting at the beginning of November and finishing on December 15th, a few days after the election. These spreadsheets list, for each individual: gender, party, the start and end times of the counts, tweets authored, retweets by the individual, replies by the individual, the number of times the individual was retweeted, replies received by the individual ("replyTo"), abusive tweets received in total, and abusive tweets received in each of the categories sexist, racist and political.

Two additional spreadsheets focus on topics: "topics-of-cands.csv" and "topics-of-replies.csv". In the first, counts of tweets mentioning each of a set of topics are given, alongside counts of abusive tweets mentioning each topic, in tweets by each candidate. In the second, the counts are of replies received when a candidate mentions a topic, alongside abusive replies received when they mentioned that topic.

The data complement the forthcoming paper "Which Politicians Receive Abuse? Four Factors Illuminated in the UK General Election 2019", by Genevieve Gorrell, Mehmet E Bakir, Ian Roberts, Mark A Greenwood and Kalina Bontcheva. The way the data were acquired is described more fully in the paper. Ethics approval was granted to collect the data through application 25371 at the University of Sheffield.
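As a minimal sketch of how the per-MP overview spreadsheets might be aggregated, the toy rows below mimic the shape described above; the column names used here ("party", "abusive", etc.) are assumptions inferred from the description, so check the actual CSV headers before adapting this.

```python
import pandas as pd
from io import StringIO

# Toy stand-in for one of the "per-mp-xxx-2019.csv" files; the
# headers are hypothetical, inferred from the dataset description.
csv = StringIO(
    "name,gender,party,tweets,abusive,sexist,racist,political\n"
    "MP A,F,Labour,120,30,5,2,23\n"
    "MP B,M,Conservative,90,45,1,3,41\n"
    "MP C,F,Labour,60,10,2,1,7\n"
)
per_mp = pd.read_csv(csv)

# Total abusive tweets received, aggregated by party.
abuse_by_party = per_mp.groupby("party")["abusive"].sum()
print(abuse_by_party.to_dict())  # {'Conservative': 45, 'Labour': 40}
```

For the real files, replace the `StringIO` buffer with the spreadsheet path; the same `groupby` pattern works for the per-category counts (sexist, racist, political).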