Due to the relevance of the COVID-19 global pandemic, we are releasing our dataset of tweets acquired from the Twitter stream related to COVID-19 chatter. The first 9 weeks of data (January 1st to March 11th, 2020) contain very low tweet counts, as we were filtering data collected for other research purposes; even so, the dramatic increase is visible as awareness of the virus spread. Dedicated data gathering ran from March 11th to March 30th and yielded over 4 million tweets a day. We have added data provided by our new collaborators, covering January 27th to February 27th, for extra longitudinal coverage.
The data collected from the stream captures all languages, but the most prevalent are English, Spanish, and French. We release all tweets and retweets in the full_dataset.tsv file (101,400,452 unique tweets) and a cleaned version with no retweets in the full_dataset-clean.tsv file (20,244,746 unique tweets). There are several practical reasons to keep the retweets; tracing important tweets and their dissemination is one of them. For NLP tasks we provide the top 1000 frequent terms in frequent_terms.csv, the top 1000 bigrams in frequent_bigrams.csv, and the top 1000 trigrams in frequent_trigrams.csv. General per-day statistics for both datasets are included in the statistics-full_dataset.tsv and statistics-full_dataset-clean.tsv files.
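The frequent-term files amount to n-gram frequency counts over the (hydrated) tweet text. A minimal sketch of how such counts could be reproduced, assuming a naive regex tokenizer (the authors' exact tokenization is not specified):

```python
from collections import Counter
import re

def top_ngrams(texts, n=1, k=1000):
    """Count the k most frequent n-grams across a list of tweet texts."""
    counts = Counter()
    for text in texts:
        # Naive tokenizer: lowercase words, hashtags, and mentions.
        tokens = re.findall(r"[a-z0-9#@']+", text.lower())
        counts.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return counts.most_common(k)

tweets = ["COVID cases rise again", "covid cases fall", "stay home stay safe"]
print(top_ngrams(tweets, n=2, k=3))
```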
More details can be found (and will be updated faster) at https://github.com/thepanacealab/covid19_twitter.
As always, the tweets distributed here are only tweet identifiers (with date and time added), due to Twitter's terms and conditions on redistributing Twitter data. They need to be hydrated before use.
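Hydration means looking the IDs back up against the Twitter API. A sketch of the idea using the public v2 tweet-lookup endpoint, which accepts at most 100 IDs per request; the bearer token is a placeholder and error handling is omitted:

```python
import json
import urllib.request

def chunks(ids, size=100):
    """The v2 tweet-lookup endpoint accepts at most 100 IDs per request."""
    for i in range(0, len(ids), size):
        yield ids[i:i + size]

def hydrate(ids, bearer_token):
    """Yield hydrated tweet objects for a list of tweet ID strings."""
    for batch in chunks(ids):
        url = "https://api.twitter.com/2/tweets?ids=" + ",".join(batch)
        req = urllib.request.Request(
            url, headers={"Authorization": f"Bearer {bearer_token}"}
        )
        with urllib.request.urlopen(req) as resp:
            yield from json.load(resp).get("data", [])
```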
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Disclaimer: This dataset is distributed by Daniel Gayo-Avello, an associate professor in the Department of Computer Science at the University of Oviedo, for the sole purpose of non-commercial research, and it includes only tweet IDs.
The dataset contains tweet IDs for all tweets published (in any language) between March 21, 2006 and July 31, 2009, comprising the first three full years of Twitter from its creation: about 1.5 billion tweets (see file Twitter-historical-20060321-20090731.zip).
It covers several defining moments in Twitter's history, such as the invention of hashtags, retweets and trending topics, and it includes tweets related to the 2008 US presidential election, Barack Obama's first inauguration speech, and the 2009 Iran election protests (one of the so-called Twitter Revolutions).
Finally, it contains tweets in many major languages (mainly English, Portuguese, Japanese, Spanish, German and French), so it should be possible, at least in theory, to analyze international events from different cultural perspectives.
The dataset was completed in November 2016, so the tweet IDs it contains were publicly available at that moment. This means there may be tweets that were public during that period but do not appear in the dataset, and that a substantial share of the tweets in the dataset have been deleted (or locked) since 2016.
To make the decay of tweet IDs in the dataset easier to understand, a number of representative samples (99% confidence level, ±0.5 confidence interval) are provided.
In general terms, 85.5% ±0.5 of the historical tweets were available as of May 19, 2020 (see file Twitter-historical-20060321-20090731-sample.txt). However, since the number of tweets varies greatly across the three years covered by the dataset, additional representative samples are provided for 90-day intervals (see the file 90-day-samples.zip).
In that regard, the ratio of publicly available tweets (as of May 19, 2020) is as follows:
March 21, 2006 to June 18, 2006: 88.4% ±0.5 (from 5,512 tweets).
June 18, 2006 to September 16, 2006: 82.7% ±0.5 (from 14,820 tweets).
September 16, 2006 to December 15, 2006: 85.7% ±0.5 (from 107,975 tweets).
December 15, 2006 to March 15, 2007: 88.2% ±0.5 (from 852,463 tweets).
March 15, 2007 to June 13, 2007: 89.6% ±0.5 (from 6,341,665 tweets).
June 13, 2007 to September 11, 2007: 88.6% ±0.5 (from 11,171,090 tweets).
September 11, 2007 to December 10, 2007: 87.9% ±0.5 (from 15,545,532 tweets).
December 10, 2007 to March 9, 2008: 89.0% ±0.5 (from 23,164,663 tweets).
March 9, 2008 to June 7, 2008: 66.5% ±0.5 (from 56,416,772 tweets; see below for more details on this).
June 7, 2008 to September 5, 2008: 78.3% ±0.5 (from 62,868,189 tweets; see below for more details on this).
September 5, 2008 to December 4, 2008: 87.3% ±0.5 (from 89,947,498 tweets).
December 4, 2008 to March 4, 2009: 86.9% ±0.5 (from 169,762,425 tweets).
March 4, 2009 to June 2, 2009: 86.4% ±0.5 (from 474,581,170 tweets).
June 2, 2009 to July 31, 2009: 85.7% ±0.5 (from 589,116,341 tweets).
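For context, the ±0.5 margins quoted above are consistent with a standard normal-approximation confidence interval at the 99% level. A quick check of the calculation (the sample size below is illustrative, not the dataset's actual one):

```python
import math

def margin_of_error(p, n, z=2.576):
    """Half-width of a normal-approximation confidence interval, in
    percentage points. z = 2.576 corresponds to a 99% confidence level."""
    return 100 * z * math.sqrt(p * (1 - p) / n)

# A sample of roughly this size makes an 85.5% availability estimate
# carry about a +/-0.5 point margin at 99% confidence.
print(round(margin_of_error(0.855, 33000), 2))
```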
The apparent drop in available tweets from March 9, 2008 to September 5, 2008 has an easy, although embarrassing, explanation.
While cleaning the data to publish this dataset, there appeared to be a gap from April 1, 2008 to July 7, 2008 (in fact, the data was not missing, just stored in a different backup). Since tweet IDs are easy to regenerate for that Twitter era (source code is provided in generate-ids.m), I simply produced all the IDs created between those two dates. All those tweets actually existed, but a number of them were private and not crawlable. For those regenerated IDs, the actual ratio of public tweets (as of May 19, 2020) is 62.3% ±0.5.
In other words, what you see in that period (April to July 2008) is not actually a huge number of deleted tweets but a combination of deleted and non-public tweets (whose IDs would ideally be excluded from the dataset, for performance when rehydrating it).
Additionally, since not everybody will need the whole time period, the earliest tweet ID for each date is provided in the file date-tweet-id.tsv.
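Because that file maps each date to its earliest tweet ID, a date range can be turned into an ID range. A small sketch, assuming a date<TAB>ID column layout and using made-up IDs for illustration:

```python
import csv
import io

def id_bounds(tsv_text, start_date, end_date):
    """Return the earliest tweet ID on start_date and on end_date,
    assuming each row is 'YYYY-MM-DD<TAB>earliest_tweet_id'."""
    first = {
        row[0]: int(row[1])
        for row in csv.reader(io.StringIO(tsv_text), delimiter="\t")
    }
    return first[start_date], first[end_date]

# Illustrative rows only; the real date-tweet-id.tsv holds actual IDs.
sample = "2008-04-01\t783000000\n2008-04-02\t784100000\n"
print(id_bounds(sample, "2008-04-01", "2008-04-02"))
```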
For additional details regarding this dataset please see: Gayo-Avello, Daniel. "How I Stopped Worrying about the Twitter Archive at the Library of Congress and Learned to Build a Little One for Myself." arXiv preprint arXiv:1611.08144 (2016).
If you use this dataset in any way please cite that preprint (in addition to the dataset itself).
If you need to contact me, you can find me as @PFCdgayo on Twitter.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We are releasing a Twitter dataset connected to our project Digital Narratives of Covid-19 (DHCOVID), which, among other goals, aims to explore over one year (May 2020 to May 2021) the narratives behind data about the coronavirus pandemic.
In this first version, we deliver a Twitter dataset organized as follows:
For English, we collect all tweets with the following keywords and hashtags: covid, coronavirus, pandemic, quarantine, stayathome, outbreak, lockdown, socialdistancing. For Spanish, we search for: covid, coronavirus, pandemia, quarentena, confinamiento, quedateencasa, desescalada, distanciamiento social.
The corpus of tweets consists of a list of tweet IDs; to obtain the original tweets, you can use a hydrator tool, which takes the IDs and downloads all their metadata into a CSV file.
We started collecting this Twitter dataset on April 24th, 2020, and we add data daily to our GitHub repository. There is one known problem: for the file 2020-04-24/dhcovid_2020-04-24_es.txt, we could not gather the data due to technical reasons.
For more information about our project visit https://covid.dh.miami.edu/
For more updated datasets and detailed criteria, check our GitHub Repository: https://github.com/dh-miami/narratives_covid19/
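The per-day files in the repository follow the naming pattern shown above (e.g. 2020-04-24/dhcovid_2020-04-24_es.txt). A sketch for enumerating the expected paths over a date range, assuming English and Spanish daily files:

```python
from datetime import date, timedelta

def daily_paths(start, end, langs=("en", "es")):
    """Enumerate per-day, per-language ID files following the
    '<date>/dhcovid_<date>_<lang>.txt' pattern from the repository."""
    day = start
    while day <= end:
        for lang in langs:
            yield f"{day.isoformat()}/dhcovid_{day.isoformat()}_{lang}.txt"
        day += timedelta(days=1)

paths = list(daily_paths(date(2020, 4, 24), date(2020, 4, 25)))
print(paths)
```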
This dataset consists of IDs of geotagged Twitter posts from within the United States. They are provided as files per day and state as well as per day and county. In addition, files containing the aggregated number of hashtags from these tweets are provided per day and state and per day and county. The data is organized as one ZIP file per month, containing ZIP files per day, which hold the TXT files with the ID/hashtag information.
Also part of the dataset are two shapefiles for the US counties and states and Python scripts for the data collection and sorting geotags into counties.
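Sorting geotags into counties boils down to point-in-polygon tests against the shapefile geometries. The dataset ships its own Python scripts; purely as an illustration of the underlying test, here is a minimal ray-casting implementation:

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test: does (lon, lat) fall inside a polygon given as a
    list of (lon, lat) vertices? Points exactly on an edge may go either way."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Count edge crossings of a ray extending east from the point.
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside

square = [(0, 0), (10, 0), (10, 10), (0, 10)]
print(point_in_polygon(5, 5, square), point_in_polygon(15, 5, square))
```

Real county boundaries are multi-part polygons with holes, so a production version would use a geometry library rather than this bare test.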
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background

The Digital Humanities 2016 conference took place in Kraków, Poland, between Sunday 11 July and Saturday 16 July 2016. #DH2016 was the conference's official hashtag.

What This Output Is

This is a CSV file containing a total of 3717 Tweets publicly published with the hashtag #DH2016 on Thursday 14 July 2016 (GMT). The archive starts with a Tweet published on Thursday 14 July 2016 at 00:01:04 +0000 and ends with a Tweet published on Thursday 14 July 2016 at 23:49:14 +0000 (GMT). Previous days have been shared as different outputs. A breakdown of Tweets per day so far:

Sunday 10 July 2016: 179 Tweets
Monday 11 July 2016: 981 Tweets
Tuesday 12 July 2016: 2318 Tweets
Wednesday 13 July 2016: 4175 Tweets
Thursday 14 July 2016: 3717 Tweets

Methodology and Limitations

The Tweets contained in this file were collected by Ernesto Priego using Martin Hawksey's TAGS 6.0. Only users with at least one follower were included in the archive. Retweets have been included (retweets count as Tweets). The collection spreadsheet was customised to reflect the time zone and geographical location of the conference. The profile_image_url and entities_str metadata were removed before public sharing in this archive. Please bear in mind that the conference hashtag has been spammed, so some Tweets collected may be from spam accounts. Some automated refining has been performed to remove Tweets not related to the conference, but the data is likely to require further refining and deduplication. Both research and experience show that the Twitter search API is not 100% reliable. Large Tweet volumes affect the search collection process. The API might "over-represent the more central users", not offering "an accurate picture of peripheral activity" (González-Bailón, Sandra, et al. 2012). Apart from the filters and limitations already declared, it cannot be guaranteed that this file contains each and every Tweet tagged with #dh2016 during the indicated period, and the dataset is shared for archival, comparative and indicative educational research purposes only. Only content from public accounts is included and was obtained from the Twitter Search API.
The shared data is also publicly available to all Twitter users via the Twitter Search API, and available to anyone with an Internet connection via the Twitter and Twitter Search web clients and mobile apps, without the need of a Twitter account. Each Tweet and its contents were published openly on the Web with the queried hashtag and are the responsibility of their original authors. Original Tweets are likely to be copyright of their individual authors, but please check individually. No private personal information is shared in this dataset. The collection and sharing of this dataset is enabled and allowed by Twitter's Privacy Policy. The sharing of this dataset complies with Twitter's Developer Rules of the Road. This dataset is shared to archive, document and encourage open educational research into scholarly activity on Twitter.

Other Considerations

Tweets published publicly by scholars during academic conferences are often tagged (labeled) with a hashtag dedicated to the conference in question. The purpose and function of hashtags is to organise and describe information/outputs under the relevant label in order to enhance the discoverability of the labeled information/outputs (Tweets in this case). A hashtag is metadata users choose freely so their content is associated with, directly linked to and categorised under the chosen hashtag. Though every reason for Tweeters' use of hashtags cannot be generalised nor predicted, it can be argued that scholarly Twitter users form specialised, self-selecting public professional networks that tend to observe scholarly practices and accepted modes of social and professional behaviour. In general terms it can be argued that scholarly Twitter users willingly and consciously tag their public Tweets with a conference hashtag as a means to network and to promote, report from, reflect on, comment on and generally contribute publicly to the scholarly conversation around conferences.
As Twitter users, conference Twitter hashtag contributors have agreed to Twitter's privacy and data-sharing policies. Professional associations like the Modern Language Association recognise Tweets as citeable scholarly outputs. Archiving scholarly Tweets is a means to preserve this form of rapid online scholarship that otherwise can very likely become unretrievable as time passes; Twitter's search API has well-known temporal limitations for retrospective historical search and collection. Beyond individual tweets as scholarly outputs, the collective scholarly activity on Twitter around a conference or academic project or event can provide interesting insights for the contemporary history of scholarly communications. To date, collecting in real time is the only relatively accurate method to archive tweets at a small scale. Though these datasets have limitations and are not thoroughly systematic, it is hoped they can contribute to developing new insights into the discipline's presence on Twitter over time. The CC-BY license has been applied to the output in the repository as a curated dataset. Authorial/curatorial/collection work has been performed on the file in order to make it available as part of the scholarly record. The data contained in the deposited file is otherwise freely available elsewhere through different methods, and anyone not wishing to attribute the data to the creator of this output is, needless to say, free to do their own collection and clean their own data.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains 7,015,186 tweets from 951,602 users, extracted using 91 search terms over 36 days between August 1st and December 31st, 2022.
All tweets in this dataset are in Brazilian Portuguese.
The dataset contains textual data from tweets, making it suitable for various NLP analyses, such as sentiment analysis, bias or stance detection, and toxic language detection. Additionally, users and tweets can be linked to create social graphs, enabling Social Network Analysis (SNA) to study polarization, communities, and other social dynamics.
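As an illustration of the SNA use case, a weighted retweet edge list (retweeter to original author) can be aggregated directly from the hydrated records. The field names below are assumptions for the sketch, not the dataset's actual schema:

```python
from collections import Counter

def retweet_edges(tweets):
    """Aggregate a weighted edge list (retweeter -> original author) from
    tweet records; 'user' and 'retweeted_user' are illustrative field names."""
    edges = Counter()
    for t in tweets:
        if t.get("retweeted_user"):
            edges[(t["user"], t["retweeted_user"])] += 1
    return edges

sample = [
    {"user": "a", "retweeted_user": "b"},
    {"user": "a", "retweeted_user": "b"},
    {"user": "c", "retweeted_user": None},  # an original tweet, no edge
]
print(retweet_edges(sample))
```

The resulting edge list can be loaded straight into a graph library for community detection or polarization analysis.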
This dataset was extracted using the official Twitter (now X) API, when Academic Research API access was still available, following this pipeline:
1. Twitter/X daily monitoring: The dataset author monitored daily political events appearing in Brazil's Trending Topics. Twitter/X has an automated system for classifying trending terms. When a term was identified as political, it was stored along with its date for later use as a search query.
2. Tweet collection using saved search terms: Once terms and their corresponding dates were recorded, tweets were extracted from 12:00 AM to 11:59 PM on the day the term entered the Trending Topics. A language filter was applied to select only tweets in Portuguese. The extraction was performed using the official Twitter/X API.
3. Data storage: The extracted data was organized by day and search term. If the same search term appeared in Trending Topics on consecutive days, a separate file was stored for each respective day.
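Step 2 above amounts to issuing one day-bounded, language-filtered query per trending term. A sketch of the request parameters as they would look for the v2 full-archive search endpoint (the authors' exact client code is not reproduced here):

```python
from datetime import date, datetime, time, timedelta, timezone

def search_params(term, day):
    """Build v2 full-archive search parameters covering one calendar day
    (UTC), filtered to Portuguese, for a trending term."""
    start = datetime.combine(day, time.min, tzinfo=timezone.utc)
    end = start + timedelta(days=1) - timedelta(seconds=1)  # 23:59:59
    return {
        "query": f"{term} lang:pt",
        "start_time": start.isoformat(),
        "end_time": end.isoformat(),
        "max_results": 500,
    }

print(search_params("#eleicoes", date(2022, 10, 2)))
```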
For more details, visit:
- The repository
- Dataset short paper:
---
DOI: 10.5281/zenodo.14834669
Monthly Tweetreach reports monitoring the online metrics for the Mayor of London's 'Ask Boris' Twitter sessions. Each report is generated against the search term, hashtag #askboris and is run from the day before the session starts to the end of the day the session takes place. Data includes:
Reach: The number of unique Twitter accounts that received tweets about the session
Exposure: The number of impressions generated by tweets in the report
Activity: Total number of tweets, contributors, time period and volume
Type: Number of tweets, retweets and replies
Timeline: A full list of tweets
Notes: A full description of Tweetreach analytics and descriptors is available on www.tweetreach.com or in the article Understanding the TweetReach snapshot report
Please note that due to limitations with the listening tool not all tweets from Ask Boris sessions are captured in the reports.
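For orientation, the Reach and Exposure metrics above can be approximated from raw tweet data as unique authors and summed follower counts respectively; TweetReach's exact methodology may differ:

```python
def reach_and_exposure(tweets):
    """Reach: unique accounts tweeting. Exposure: total potential
    impressions, approximated as the sum of each tweet's author followers."""
    reach = len({t["user"] for t in tweets})
    exposure = sum(t["followers"] for t in tweets)
    return reach, exposure

sample = [
    {"user": "a", "followers": 100},
    {"user": "a", "followers": 100},  # same account tweeting twice
    {"user": "b", "followers": 50},
]
print(reach_and_exposure(sample))
```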
Tweet Reach report - 28 June 2012
Tweet Reach report - 20 July 2012
Tweet Reach report - 30 August 2012
Tweet Reach report - 28 September 2012
Tweet Reach report - 29 October 2012
Tweet Reach report - 23 November 2012
Tweet Reach report - 20 December 2012
Tweet Reach report - 18 January 2013
Tweet Reach report - 25 February 2013
Tweet Reach report - 22 March 2013
Tweet Reach report - 26 April 2013
Tweet Reach report - 20 June 2013
Tweet Reach report - 18 July 2013
Tweet Reach report - 29 August 2013
Tweet Reach report - 23 September 2013
Tweet Reach report - 22 October 2013
Tweet Reach report - 25 November 2013
Tweet Reach report - 13 December 2013
Tweet Reach report - 15 January 2014
Tweet Reach report - 13 February 2014
Tweet Reach report - 27 March 2014
Tweet Reach report - 29 May 2014
Tweet Reach report - 26 June 2014
Tweet Reach report - 16 July 2014
Tweet Reach report - 5 August 2014
Tweet Reach report - 11 September 2014
Tweet Reach report - 20 October 2014
Tweet Reach report - 10 November 2014
https://creativecommons.org/publicdomain/zero/1.0/
The image at the top of the page is a frame from today's (7/26/2016) Isis #TweetMovie from twitter, a "normal" day when two Isis operatives murdered a priest saying mass in a French church. (You can see this in the center left). A selection of data from this site is being made available here to Kaggle users.
UPDATE: An excellent study by Audrey Alexander titled "Digital Decay?" is now available, which traces the "change over time among English-language Islamic State sympathizers on Twitter."
This data set is intended to be a counterpoise to the How Isis Uses Twitter data set. That data set contains 17k tweets alleged to originate with "100+ pro-ISIS fanboys". This new set contains 122k tweets collected on two separate days, 7/4/2016 and 7/11/2016, which contained any of the following terms, with no further editing or selection:
This is not a perfect counterpoise, as it almost surely contains a small number of pro-Isis fanboy tweets. However, unless some entity, such as Kaggle, is willing to expend significant resources on something like an expert-level Mechanical Turk or Zooniverse service, a high-quality counterpoise is out of reach.
A counterpoise provides a balance or backdrop against which to measure a primary object, in this case the original pro-Isis data. So anyone who wants to discriminate between pro-Isis tweets and other tweets concerning Isis will need to model the original pro-Isis data (the signal) against the counterpoise, which is signal plus noise. Further background and some analysis can be found in this forum thread.
This data comes from postmodernnews.com/token-tv.aspx which daily collects about 25MB of Isis tweets for the purposes of graphical display. PLEASE NOTE: This server is not currently active.
There are several differences between the format of this data set and the pro-ISIS fanboy dataset:
1. All the twitter t.co tags have been expanded where possible.
2. There are no "description, location, followers, numberstatuses" data columns.
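Expanding t.co tags means following the shortener's redirect to the final URL. A sketch of the rewriting step with the resolver left pluggable (a real resolver would issue an HTTP request and read the redirect target):

```python
import re

TCO = re.compile(r"https?://t\.co/\w+")

def expand_links(text, resolve):
    """Replace every t.co link in a tweet with resolve(link); resolve would
    normally follow HTTP redirects (e.g. via urllib with a HEAD request)."""
    return TCO.sub(lambda m: resolve(m.group(0)), text)

# A fake resolver standing in for real redirect-following.
fake = {"https://t.co/abc123": "https://example.org/article"}
print(expand_links("read this https://t.co/abc123 now", fake.get))
```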
I have also included my version of the original pro-ISIS fanboy set. This version has all the t.co links expanded where possible.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Bittensor Subnet 13 X (Twitter) Dataset
Dataset Summary
This dataset is part of the Bittensor Subnet 13 decentralized network, containing preprocessed data from X (formerly Twitter). The data is continuously updated by network miners, providing a real-time stream of tweets for various analytical and machine learning tasks. For more information about the dataset, please visit the official repository.
Supported Tasks
The versatility of this… See the full description on the dataset page: https://huggingface.co/datasets/icedwind/x_dataset_34576.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Datasets and details for the paper "Based and confused: Tracing the political connotations of a memetic phrase across the web"
Datasets
Datasets for the case study on the spread of the vernacular term "based" across 4chan/pol/, Reddit, and Twitter. Data was gathered in November 2021. All files are anonymised as much as possible. They contain:
Queries
Table 2 below details the queries we carried out for the collection of the initial datasets. For all platforms, we chose to retain non-English languages since the diffusion of the term in other languages was also deemed relevant.
source | query | query type
Twitter | (#based OR (based (pilled OR pill OR redpilled OR redpill OR chad OR virgin OR cringe OR cringy OR triggered OR trigger OR tbh OR lol OR lmao OR wtf OR swag OR nigga OR finna OR bitch OR rare) ) OR " is based" OR "that\'s based" OR "based as fuck" OR "based af" OR "too based" OR "fucking based" "extremely based" OR "totally based" OR "incredibly based" OR "very based" OR "so based" OR "pretty based" OR "quite based" OR "kinda based" OR "kind of based" OR "fairly based" OR "based ngl" OR "as based as" OR "thank you based " OR "stay based" OR "based god") -"based in"-"based off"-"based * off"-"based around"-"based * around"-"based on"-"based * on"-"based out of"-"based upon"-"based * upon"-"based at"-"based from"-"is based by"-"is based of"-"on which * is based"-"upon which * is based"-"which is based there"-"is based all over"-"based more on"-"plant based"-"text based"-"turn based"-"need based"-"evidence based"-"community based" -"web based" -is:retweet -is:nullcast | Twitter v2 API
Reddit | based -"based in" -"based off" -"based around" -"based on" -"based them on" -"based it on" -"evidence based" | Pushshift API
4chan/pol/ | lower(body) LIKE '%based%' AND lower(body) NOT SIMILAR TO '%(-based|debased|based in |based off |based around |based on |based them on|based it on|based her on|based him on|based only on|based completely on|based solely on|based purely on|based entirely on|based not on |based not simply on|based entirely around|based out of|based upon |based at |is based by |is based of|on which it is based|on which this is based|which is based there|is based all over|which it is based|is based of |based firmly on|based off |based solely off|based more on|plant based|text based|turn based|need based|evidence based|community based|home based|internet based|web based|physics based)%' | PostgreSQL |
Data gaps
There were some data gaps for 4chan/pol/ and Reddit. /pol/ data was missing because of gaps in the archives (mostly due to outages). The following time periods are incomplete or missing entirely:
15 - 16 April 2019
14 - 15 December 2019
3 - 10 December 2020
29 March 2021
10 - 12 April 2021
16 - 18 August 2021
11 October 2021
The 4plebs archive moreover only started in November 2013, meaning the first two years of /pol/’s existence are missing.
The Pushshift API did not return posts for certain dates. We somewhat mitigated this by also retrieving data through the new beta endpoint. However, data was still missing for the following time periods:
1 - 30 September 2017
1 February - 31 March 2018
5 - 6 November 2020
23 - 27 March 2021
10 - 13 April 2021
Filtering
After initial data collection, we carried out several rounds of filtering to get rid of remaining false positives. For 4chan/pol/, we only needed to do this filtering once (attaining 0.95 precision), while for Twitter we carried out eight rounds (0.92 precision). For Reddit, we formulated nearly 500 exclusions but failed to reach precision above 0.9, so we had to do more rigorous filtering. We observed that longer comments were more likely to be false positives, so we removed all comments over 350 characters long. We settled on this number on the basis of our first sample; almost no true positives were over 350 characters long. Furthermore, we removed all comments except those in which "based" was used as a standalone word (thus excluding e.g. "plant-based"), at the start or end of a sentence, in capitals, or in conjunction with certain keywords or in certain phrases (e.g. "kinda based"). We also deleted posts by bot accounts by (rather crudely) removing posts from usernames including 'bot' or 'auto'. This finally led to a precision of 0.9.
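The final filtering pass described above can be sketched roughly as follows; the compound list here is a small illustrative subset of the nearly 500 exclusions actually used:

```python
import re

# Illustrative subset of compound exclusions ("plant-based", "web based", ...).
COMPOUND = re.compile(r"\b(plant|text|turn|need|evidence|community|web)[- ]based\b", re.I)
STANDALONE = re.compile(r"\bbased\b", re.I)

def keep_comment(text, max_len=350):
    """Keep a comment only if it is short and uses 'based' as a standalone
    term rather than inside a compound like 'plant-based'."""
    if len(text) > max_len:
        return False
    stripped = COMPOUND.sub("", text)  # drop compound uses first
    return bool(STANDALONE.search(stripped))

print(keep_comment("that take is so based"), keep_comment("I prefer a plant-based diet"))
```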
Exclusion filter lists (raw, pipe-separated):

-based|location based |
@-mentions with “based” "on which "where "wherever #based #customer| alkaline based| anime based | are based near | astrology based | at the based of| b0Iuip5wnA| based economy| based game | based locally| based my name | based near | based not upon| based points| based purely off| based quite near | based solely off| based soy source| based upstairs| blast based| class based| clearly based of this| combat based| condition based| dos based| emotional based| eth based| fact based| gender based| he based his | he's based in | indian based| is based for fans| is based lies| is based near | is based not around | is based not on | is based once again on | is based there| is based within| issue based| jersey based| listen to 01 we rare| music based| oil based| on which it's based| page based 1000| paper based| park based | pc based| pic based| pill based regimen| puzzle based| sex based | she based her | she's based in | skill based| story based| they based their | they're based in| toronto based| trigger on a new yoga 2| u.s. 
based| universal press| us based| value based| we're based in | where you based?| you're based in |#alkaline #based|#apps #based|#based #acidic|#flash #based|#home #based|#miami #based|#piano #based|#value #based|american based|australia based|australian based|based my decision|based entirely around|based entirely on|based exactly on |based her announcement|based her decision|based her off|based him off|based his announcement|based his decision|based largely on|based less on|based mostly on|based my guess|based only around|based only on|based partly on|based partly upon|based purely on |based solely around|based solely on|based strictly on|based the announcement|based the decision|based their announcement|based their decision|based, not upon|battery based|behavior based|behaviour based|blockchain based|book based series|canon based|character based|cloud based|commision based|component based|computer based|confusion based|content based|depression based|dev based|dnd based|factually based|faith based|fear based|flash based|flintstones based|flour based|home based|homin based|i based my|interaction based|is based circa|is based competely on|is based entirely off|is based here|is based more on|is based outta|is based totally on |is based up here|is based way more on|live conferences with r3|living based of|london based|luck based|malex based|market based|miami based|needs based|nyc based|on which the film is based|opinion based|piano based|point based|potato based|premise is based|region based|religious based|science based|she is based there|slavery based show|softball based|thanks richard clark|u.k. based|uk based|vendor based|vodka based|volunteer based|water based|where he is based|where the disney film is based|where the military is based|who are based there|who is based there|wordpress cms |
Allowed all posts:
The research project, SPARTA (Society, Politics, and Risk with Twitter Analysis), funded by dtec.bw (which is funded by the European Union – NextGenerationEU), monitors the 2023 state election campaign in Bavaria live as it unfolds on Twitter/X. From September 4 to the election day on October 8, 2023, we collect and analyze all German-language posts and reposts related to the election and its central actors in real time. We publish the results in a nowcasting fashion on the project’s WebApp (https://dtecbw.de/sparta/). Among other findings, we present the stances expressed toward the main parties and their leading candidates. We also illustrate the salient issues discussed as well as the most frequently used hashtags by the election Twittersphere (for example, all tweets addressing the election and its central actors), political parties, leading candidates, and candidates for a mandate in the state parliament. We also measure the extent of negative campaigning and personalization. To enable real-time analyses of the election campaign, we created a dataset with the Twitter/X handles of all candidates for a mandate in the state parliament in August 2023. The dataset contains the Twitter/X handles and additional information about the candidates from six parties: CSU, Bündnis 90/Die Grünen, Freie Wähler, AfD, SPD, and FDP.
https://creativecommons.org/publicdomain/zero/1.0/
One of the greatest unbeaten streaks in the history of the English Premier League had to end somewhere; the dream of going 50 games unbeaten and taking Arsenal's place in the record books is over. Stopped by an astonishing performance from a Watford side that began the day in 19th place. The clock stops for Liverpool at 44 games undefeated, going back to January 3rd last year, and this is the end of a remarkable 18 straight wins in the league.
Data collection date (from Twitter API): Feb 29, 2020
Dimensions of the data set: 260K rows and 4 columns in CSV format.
Hashtag filters include:
Liverpool
WATLIV
ليفربول (Arabic for "Liverpool")
LFC
Tweet links and owners are hidden to keep everything anonymous. Please get in touch with me if you have a use case that requires them.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
'Building Scholarly Resources for Wider Public Engagement' was a full-day workshop that took place at the Radcliffe Observatory Quarter, Oxford University, Oxford, on Friday 13 June 2014. The hashtag for the event was #DHCOxf. It was organised by DHCrowdscribe, the online hub for the output of the AHRC-funded Collaborative Skills Project 'Promoting Interdisciplinary Engagement in the Digital Humanities'. Speakers were: Matt Vitins and Anna Crowe (Legal and Ethical Issues in the Digital Humanities), Dr Stuart Dunn (Crowdsourcing), Dr Robert Simpson (Zooniverse), Dr Ernesto Priego and Dr James Baker (Sharing Data from a Researcher's Perspective), Michael Popham, Dr Ylva Berglund Prytz (Digitising the Humanities and Engaging with the Public), Judith Siefring (Early English Books Online Text Creation Partnership), David Tomkins (Bodleian Digital Library), Dr Robert Mcnamee (Electronic Enlightenment Project), Dr Stewart Brookes ('Getting Medieval, Getting Palaeography: The DigiPal Database of Anglo-Saxon Manuscripts'), Dr Michael Athanson (ArcGIS and Mapping the Humanities), Professor David de Roure (Scholarly Social Machines), and Professor Howard Hotson. This .XLS file contains Tweets tagged with #DHCOxf (case not sensitive). The archive shared here contains 692 Tweets dated 13 June 2014 (the day the event took place). There were definitely more Tweets tagged #DHCOxf, but this was the closest I got to compiling a more or less complete set dated 13 June 2014. The Tweets contained in this file were collected using Martin Hawksey's TAGS 5.1. The file contains two sheets: Sheet 0, the 'Cite Me' sheet, including provenance of the file, citation information, information about its contents, the methods employed and some context; and Sheet 1, the archive containing 692 Tweets dated 13 June 2014. To avoid spam, only users with at least 2 followers were included in the archive. Retweets have been included.
Please note that both research and experience show that the Twitter search API isn't 100% reliable. Large tweet volumes affect the search collection process, and the API might "over-represent the more central users", not offering "an accurate picture of peripheral activity" (González-Bailón, Sandra, et al. 2012). Therefore, it cannot be guaranteed that this file contains each and every tweet tagged with #DHCOxf during the indicated period. Some deduplication and refining have been performed to remove spam tweets and duplicates; even so, some characters in some Tweets' text might not have been decoded correctly, and the data in this file is likely to require further refining and even further deduplication. The data is shared as is. If you use or refer to this data in any way, please cite and link back using the citation information above. All the data collected in this small dataset was willingly made freely, openly and publicly available online by users via Twitter, and therefore was and still is openly and freely available through several other methods and services. It has been shared here in a curated form for educational and research use, and no copyright or privacy infringement is intended or should be inferred. This file was created and shared by Ernesto Priego (Centre for Information Science, City University London) under a Creative Commons Attribution licence (CC-BY).
[Please make sure you are looking at the latest version of the file as earlier versions contained unfortunate typos].
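The deduplication and spam-filtering steps described above can be sketched in a few lines of pandas. This is a minimal, hypothetical illustration: the column names (`id_str`, `from_user`, `user_followers_count`, `text`) follow a typical TAGS-style export but are assumptions here, not the verified headers of the shared .XLS file.

```python
import pandas as pd

# Toy rows in the spirit of a TAGS export; column names are assumed,
# not taken from the actual archive.
tweets = pd.DataFrame({
    "id_str": ["1", "2", "2", "3"],
    "from_user": ["alice", "bob", "bob", "spambot"],
    "user_followers_count": [150, 12, 12, 1],
    "text": ["#DHCOxf talk!", "RT #DHCOxf", "RT #DHCOxf", "buy now"],
})

# Drop exact duplicates by tweet id, then apply the "at least
# 2 followers" anti-spam rule described above.
deduped = tweets.drop_duplicates(subset="id_str")
kept = deduped[deduped["user_followers_count"] >= 2]
print(len(kept))  # 2
```

Filtering on follower count is a crude but cheap spam heuristic; as the note above says, the result may still need further refining.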
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The spreadsheets contain aggregate statistics for abusive language found in tweets to UK politicians in 2019. An overview spreadsheet is provided for each of the months of January to November ("per-mp-xxx-2019.csv", where xxx is the abbreviation for the month), with one row per MP. A spreadsheet with data per day is provided for the campaign period of the UK 2019 general election ("campaign-period-per-cand-per-day.csv"), with one row per candidate, starting at the beginning of November and finishing on December 15th, a few days after the election. These spreadsheets list, for each individual: gender, party, the start and end times of the counts, tweets authored, retweets by the individual, replies by the individual, the number of times the individual was retweeted, replies received by the individual ("replyTo"), abusive tweets received in total, and abusive tweets received in each of the categories sexist, racist and political.

Two additional spreadsheets focus on topics: "topics-of-cands.csv" and "topics-of-replies.csv". In the first, counts of tweets mentioning each of a set of topics are given, alongside counts of abusive tweets mentioning each topic, in tweets by each candidate. In the second, the counts are of replies received when a candidate mentions a topic, alongside abusive replies received when they mentioned that topic.

The data complement the forthcoming paper "Which Politicians Receive Abuse? Four Factors Illuminated in the UK General Election 2019", by Genevieve Gorrell, Mehmet E Bakir, Ian Roberts, Mark A Greenwood and Kalina Bontcheva. The way the data were acquired is described more fully in the paper. Ethics approval was granted to collect the data through application 25371 at the University of Sheffield.
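As a minimal sketch of how the per-MP overview spreadsheets might be aggregated, the toy rows below mimic the shape described above; the column names used here ("party", "abusive", etc.) are assumptions inferred from the description, so check the actual CSV headers before adapting this.

```python
import pandas as pd
from io import StringIO

# Toy stand-in for one of the "per-mp-xxx-2019.csv" files; the
# headers are hypothetical, inferred from the dataset description.
csv = StringIO(
    "name,gender,party,tweets,abusive,sexist,racist,political\n"
    "MP A,F,Labour,120,30,5,2,23\n"
    "MP B,M,Conservative,90,45,1,3,41\n"
    "MP C,F,Labour,60,10,2,1,7\n"
)
per_mp = pd.read_csv(csv)

# Total abusive tweets received, aggregated by party.
abuse_by_party = per_mp.groupby("party")["abusive"].sum()
print(abuse_by_party.to_dict())  # {'Conservative': 45, 'Labour': 40}
```

For the real files, replace the `StringIO` buffer with the spreadsheet path; the same `groupby` pattern works for the per-category counts (sexist, racist, political).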