25 datasets found
  1. Data from: Twitter historical dataset: March 21, 2006 (first tweet) to July 31, 2009 (3 years, 1.5 billion tweets)

    • data.niaid.nih.gov
    • live.european-language-grid.eu
    • +2 more
    Updated May 20, 2020
    Cite
    Gayo-Avello, Daniel (2020). Twitter historical dataset: March 21, 2006 (first tweet) to July 31, 2009 (3 years, 1.5 billion tweets) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3833781
    Explore at:
    Dataset updated
    May 20, 2020
    Dataset authored and provided by
    Gayo-Avello, Daniel
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Disclaimer: This dataset is distributed by Daniel Gayo-Avello, an associate professor at the Department of Computer Science in the University of Oviedo, for the sole purpose of non-commercial research and it just includes tweet ids.

    The dataset contains tweet IDs for all the published tweets (in any language) between March 21, 2006 and July 31, 2009, thus comprising the first three whole years of Twitter from its creation, that is, about 1.5 billion tweets (see file Twitter-historical-20060321-20090731.zip).

    It covers several defining moments in Twitter's history, such as the invention of hashtags, retweets and trending topics, and it includes tweets related to the 2008 US Presidential Elections, Obama's first inauguration speech, and the 2009 Iran Election protests (one of the so-called Twitter Revolutions).

    Finally, the dataset contains tweets in many major languages (mainly English, Portuguese, Japanese, Spanish, German, and French), so it should be possible, at least in theory, to analyze international events from different cultural perspectives.

    The dataset was completed in November 2016, and therefore the tweet IDs it contains were publicly available at that moment. This means that tweets that were public during that period may be missing from the dataset, and that a substantial share of the tweets in the dataset have been deleted (or locked) since 2016.

    To make the decay of tweet IDs in the dataset easier to understand, a number of representative samples (99% confidence level and a ±0.5 margin of error) are provided.

    In general terms, 85.5% ±0.5 of the historical tweets are available as of May 19, 2020 (see file Twitter-historical-20060321-20090731-sample.txt). However, since the number of tweets varies greatly throughout the three years covered by the dataset, additional representative samples are provided for 90-day intervals (see the file 90-day-samples.zip).

    In that regard, the ratio of publicly available tweets (as of May 19, 2020) is as follows:

    March 21, 2006 to June 18, 2006: 88.4% ±0.5 (from 5,512 tweets).

    June 18, 2006 to September 16, 2006: 82.7% ±0.5 (from 14,820 tweets).

    September 16, 2006 to December 15, 2006: 85.7% ±0.5 (from 107,975 tweets).

    December 15, 2006 to March 15, 2007: 88.2% ±0.5 (from 852,463 tweets).

    March 15, 2007 to June 13, 2007: 89.6% ±0.5 (from 6,341,665 tweets).

    June 13, 2007 to September 11, 2007: 88.6% ±0.5 (from 11,171,090 tweets).

    September 11, 2007 to December 10, 2007: 87.9% ±0.5 (from 15,545,532 tweets).

    December 10, 2007 to March 9, 2008: 89.0% ±0.5 (from 23,164,663 tweets).

    March 9, 2008 to June 7, 2008: 66.5% ±0.5 (from 56,416,772 tweets; see below for more details on this).

    June 7, 2008 to September 5, 2008: 78.3% ±0.5 (from 62,868,189 tweets; see below for more details on this).

    September 5, 2008 to December 4, 2008: 87.3% ±0.5 (from 89,947,498 tweets).

    December 4, 2008 to March 4, 2009: 86.9% ±0.5 (from 169,762,425 tweets).

    March 4, 2009 to June 2, 2009: 86.4% ±0.5 (from 474,581,170 tweets).

    June 2, 2009 to July 31, 2009: 85.7% ±0.5 (from 589,116,341 tweets).

    The apparent drop in available tweets from March 9, 2008 to September 5, 2008 has an easy, although embarrassing, explanation.

    While cleaning the data to publish this dataset there seemed to be a gap between April 1, 2008 and July 7, 2008 (in fact, the data was not missing but stored in a different backup). Since tweet IDs are easy to regenerate for that Twitter era (source code is provided in generate-ids.m), I simply produced all the IDs created between those two dates. All those tweets actually existed, but a number of them were private and hence not crawlable. For those regenerated IDs the actual ratio of public tweets (as of May 19, 2020) is 62.3% ±0.5.

    In other words, what you see in that period (April to July 2008) is not a huge number of deleted tweets, but a combination of deleted and non-public tweets (whose IDs, strictly speaking, should not be in the dataset, for performance reasons when rehydrating it).
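    The regeneration trick described above works because tweet IDs of that era were plain sequential integers, so every candidate ID between two known boundary tweets can be enumerated. A minimal Python sketch of the idea (the dataset's own generator is generate-ids.m; the boundary IDs below are hypothetical):

```python
# Pre-Snowflake Twitter assigned plain sequential integer IDs, so all
# candidate tweet IDs in a gap can be enumerated from two known boundary
# IDs (the boundary values below are hypothetical).
def regenerate_ids(first_known_id, last_known_id):
    """Yield every candidate tweet ID between the bounds, inclusive."""
    yield from range(first_known_id, last_known_id + 1)

gap = list(regenerate_ids(790000000, 790000004))
```

    As the description notes, some enumerated IDs will belong to tweets that were private and never crawlable, which deflates the apparent availability ratio for the regenerated period.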

    Additionally, since not everybody will need the whole time period, the earliest tweet ID for each date is provided in the file date-tweet-id.tsv.
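    Because date-tweet-id.tsv maps each date to its earliest tweet ID, the ID range for any date window can be derived from it. A minimal Python sketch, assuming a two-column tab-separated layout (ISO date, earliest tweet ID); the sample values are hypothetical:

```python
import csv
import io

# Hypothetical sample mimicking the assumed two-column layout of
# date-tweet-id.tsv: ISO date <TAB> earliest tweet ID for that date.
sample_tsv = """2009-06-01\t1990000000
2009-06-02\t1998000000
2009-06-03\t2006000000
2009-06-04\t2014000000
"""

def id_range_for_window(tsv_text, start_date, end_date):
    """Return (inclusive lower, exclusive upper) tweet-ID bounds for a date window."""
    earliest = {}
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if len(row) == 2:
            earliest[row[0]] = int(row[1])
    dates = sorted(earliest)
    lo = earliest[start_date]
    # Upper bound: earliest ID of the first date after the window, if any.
    later = [d for d in dates if d > end_date]
    hi = earliest[later[0]] if later else None
    return lo, hi

lo, hi = id_range_for_window(sample_tsv, "2009-06-02", "2009-06-03")
```

    With such bounds one can slice the full ID dump without scanning it entirely.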

    For additional details regarding this dataset please see: Gayo-Avello, Daniel. "How I Stopped Worrying about the Twitter Archive at the Library of Congress and Learned to Build a Little One for Myself." arXiv preprint arXiv:1611.08144 (2016).

    If you use this dataset in any way please cite that preprint (in addition to the dataset itself).

    If you need to contact me you can find me as @PFCdgayo on Twitter.

  2. Famous Keyword Twitter Replies Dataset

    • paperswithcode.com
    Updated Jun 16, 2023
    + more versions
    Cite
    (2023). Famous Keyword Twitter Replies Dataset [Dataset]. https://paperswithcode.com/dataset/famous-keyword-twitter-replies
    Explore at:
    Dataset updated
    Jun 16, 2023
    Description

    The "Famous Keyword Twitter Replies Dataset" is a comprehensive collection of Twitter data that focuses on popular keywords and their associated replies. This dataset contains five essential columns that provide valuable insights into the Twitter conversation dynamics:

    Keyword: This column represents the specific keyword or topic of interest that generated the original tweet. It helps identify the context or subject matter around which the conversation revolves.

    Main_tweet: The main_tweet column contains the original tweet related to the keyword. It serves as the starting point or focal point of the conversation and often provides essential information or opinions on the given topic.

    Main_likes: This column provides the number of likes received by the main_tweet. Likes serve as a measure of engagement and indicate the level of popularity or resonance of the original tweet within the Twitter community.

    Reply: The reply column consists of the replies or responses to the main_tweet. These replies may include comments, opinions, additional information, or discussions related to the keyword or the original tweet itself. The replies help capture the diverse perspectives and conversations that emerge in response to the main_tweet.

    Reply_likes: This column records the number of likes received by each reply. Similar to the main_likes column, the reply_likes column measures the level of engagement and popularity of individual replies. It enables the identification of particularly noteworthy or well-received replies within the dataset.

    By analyzing this "Famous Keyword Twitter Replies Dataset," researchers, analysts, and data scientists can gain valuable insights into how popular keywords spark discussions on Twitter and how these discussions evolve through replies.

    The dataset's information on likes allows for the evaluation of tweet and reply popularity, helping to identify influential or impactful content.

    This dataset serves as a valuable resource for various applications, including sentiment analysis, trend identification, opinion mining, and understanding social media dynamics.

    In total, the dataset contains 17,255 tweet/reply pairs.
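    Given the five columns described above, a natural first analysis is finding the most-liked reply per keyword. A minimal pure-Python sketch; the sample rows below are hypothetical:

```python
# Hypothetical sample rows with the dataset's five columns.
rows = [
    {"keyword": "ai",    "main_tweet": "t1", "main_likes": 120, "reply": "r1", "reply_likes": 10},
    {"keyword": "ai",    "main_tweet": "t1", "main_likes": 120, "reply": "r2", "reply_likes": 300},
    {"keyword": "space", "main_tweet": "t2", "main_likes": 45,  "reply": "r3", "reply_likes": 5},
    {"keyword": "space", "main_tweet": "t2", "main_likes": 45,  "reply": "r4", "reply_likes": 80},
]

# Keep, for each keyword, the reply row with the highest reply_likes.
best = {}
for r in rows:
    kw = r["keyword"]
    if kw not in best or r["reply_likes"] > best[kw]["reply_likes"]:
        best[kw] = r
```

    The same per-keyword reduction extends directly to the other analyses the description mentions, such as comparing main_likes against reply_likes to spot replies that outperform their original tweet.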

  3. Data from: Google Analytics & Twitter dataset from a movies, TV series and videogames website

    • figshare.com
    • portalcientificovalencia.univeuropea.com
    txt
    Updated Feb 7, 2024
    Cite
    Víctor Yeste (2024). Google Analytics & Twitter dataset from a movies, TV series and videogames website [Dataset]. http://doi.org/10.6084/m9.figshare.16553061.v4
    Explore at:
    Available download formats: txt
    Dataset updated
    Feb 7, 2024
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Víctor Yeste
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Author: Víctor Yeste. Universitat Politècnica de València.

    The object of this study is the design of a cybermetric methodology whose objectives are to measure the success of content published in online media and the possible prediction of the selected success variables. Due to the need to integrate data from two separate areas, web publishing and the analysis of shares and related topics on Twitter, programmatic access was used for both the Google Analytics v4 Reporting API and the Twitter Standard API, always respecting their limits.

    The website analyzed is hellofriki.com. It is an online media outlet whose primary intention is to meet the need for information on topics that generate a vast amount of daily news, as well as analysis, reports, interviews, and many other information formats. All these contents fall under the sections of cinema, series, video games, literature, and comics.

    This dataset has contributed to the elaboration of the PhD thesis: Yeste Moreno, VM. (2021). Diseño de una metodología cibermétrica de cálculo del éxito para la optimización de contenidos web [Doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/176009

    Data have been obtained from each last-minute news article published online, according to the indicators described in the doctoral thesis.
    All related data are stored in a database, divided into the following tables:

    tesis_followers: user ID list of the media account's followers.

    tesis_hometimeline: data from tweets posted by the media account sharing breaking news from the web.
      status_id: tweet ID
      created_at: date of publication
      text: content of the tweet
      path: URL extracted after processing the shortened URL in text
      post_shared: article ID in WordPress of the article being shared
      retweet_count: number of retweets
      favorite_count: number of favorites

    tesis_hometimeline_other: data from tweets posted by the media account that do not share breaking news from the web (other typologies: automatic Facebook shares, custom tweets without a link to an article, etc.). Same fields as tesis_hometimeline.

    tesis_posts: data of articles published by the website and processed for analysis.
      stats_id: analysis ID
      post_id: article ID in WordPress
      post_date: article publication date in WordPress
      post_title: title of the article
      path: URL of the article on the media website
      tags: tag IDs (WordPress tags) related to the article
      uniquepageviews: unique page views
      entrancerate: entrance rate
      avgtimeonpage: average visit time
      exitrate: exit rate
      pageviewspersession: page views per session
      adsense_adunitsviewed: number of ads viewed by users
      adsense_viewableimpressionpercent: ad display ratio
      adsense_ctr: ad click ratio
      adsense_ecpm: estimated ad revenue per 1000 page views

    tesis_stats: data from a particular analysis, performed for each published breaking news item. Fields with statistical values can be computed from the data in the other tables, but total and average calculations are saved for faster and easier further processing.
      id: ID of the analysis
      phase: phase of the thesis in which the analysis was carried out (currently all are 1)
      time: "0" if at the time of publication, "1" if 14 days later
      start_date: date and time of the measurement on the day of publication
      end_date: date and time when the measurement is made 14 days later
      main_post_id: ID of the published article to be analyzed
      main_post_theme: main section of the published article to analyze
      superheroes_theme: "1" if about superheroes, "0" if not
      trailer_theme: "1" if a trailer, "0" if not
      name: empty field, for adding a custom name manually
      notes: empty field, for adding personalized notes manually (e.g. that some tag was removed manually for being considered too generic, even though the editor used it)
      num_articles: number of articles analyzed
      num_articles_with_traffic: number of articles analyzed with traffic (which will be taken into account for traffic analysis)
      num_articles_with_tw_data: number of articles with data from when they were shared on the media's Twitter account
      num_terms: number of terms analyzed
      uniquepageviews_total: total page views
      uniquepageviews_mean: average page views
      entrancerate_mean: average entrance rate
      avgtimeonpage_mean: average duration of visits
      exitrate_mean: average exit rate
      pageviewspersession_mean: average page views per session
      total: total ads viewed
      adsense_adunitsviewed_mean: average ads viewed
      adsense_viewableimpressionpercent_mean: average ad display ratio
      adsense_ctr_mean: average ad click ratio
      adsense_ecpm_mean: estimated ad revenue per 1000 page views
      Total: total income
      retweet_count_mean: average income
      favorite_count_total: total favorites
      favorite_count_mean: average favorites
      terms_ini_num_tweets: total tweets on the terms on the day of publication
      terms_ini_retweet_count_total: total retweets on the terms on the day of publication
      terms_ini_retweet_count_mean: average retweets on the terms on the day of publication
      terms_ini_favorite_count_total: total favorites on the terms on the day of publication
      terms_ini_favorite_count_mean: average favorites on the terms on the day of publication
      terms_ini_followers_talking_rate: ratio of followers of the media Twitter account who have recently published a tweet talking about the terms on the day of publication
      terms_ini_user_num_followers_mean: average followers of users who have spoken of the terms on the day of publication
      terms_ini_user_num_tweets_mean: average number of tweets published by users who spoke about the terms on the day of publication
      terms_ini_user_age_mean: average age in days of users who have spoken of the terms on the day of publication
      terms_ini_ur_inclusion_rate: URL inclusion ratio of tweets talking about the terms on the day of publication
      terms_end_num_tweets: total tweets on the terms 14 days after publication
      terms_ini_retweet_count_total: total retweets on the terms 14 days after publication
      terms_ini_retweet_count_mean: average retweets on the terms 14 days after publication
      terms_ini_favorite_count_total: total favorites on the terms 14 days after publication
      terms_ini_favorite_count_mean: average favorites on the terms 14 days after publication
      terms_ini_followers_talking_rate: ratio of media Twitter account followers who have recently posted a tweet talking about the terms 14 days after publication
      terms_ini_user_num_followers_mean: average followers of users who have spoken of the terms 14 days after publication
      terms_ini_user_num_tweets_mean: average number of tweets published by users who have spoken about the terms 14 days after publication
      terms_ini_user_age_mean: average age in days of users who have spoken of the terms 14 days after publication
      terms_ini_ur_inclusion_rate: URL inclusion ratio of tweets talking about the terms 14 days after publication

    tesis_terms: data of the terms (tags) related to the processed articles.
      stats_id: analysis ID
      time: "0" if at the time of publication, "1" if 14 days later
      term_id: term ID (tag) in WordPress
      name: name of the term
      slug: URL slug of the term
      num_tweets: number of tweets
      retweet_count_total: total retweets
      retweet_count_mean: average retweets
      favorite_count_total: total favorites
      favorite_count_mean: average favorites
      followers_talking_rate: ratio of followers of the media Twitter account who have recently published a tweet talking about the term
      user_num_followers_mean: average followers of users who were talking about the term
      user_num_tweets_mean: average number of tweets published by users who were talking about the term
      user_age_mean: average age in days of users who were talking about the term
      url_inclusion_rate: URL inclusion ratio
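    As a rough illustration of how these tables relate, the sketch below joins tesis_posts with tesis_hometimeline via post_shared = post_id (both hold the WordPress article ID). It uses an in-memory SQLite database with simplified schemas and hypothetical values; the project's actual database engine is not specified here:

```python
import sqlite3

# In-memory sketch of two of the described tables (schemas simplified,
# values hypothetical).
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE tesis_posts (post_id INTEGER, post_title TEXT, uniquepageviews INTEGER);
CREATE TABLE tesis_hometimeline (status_id INTEGER, post_shared INTEGER,
                                 retweet_count INTEGER, favorite_count INTEGER);
INSERT INTO tesis_posts VALUES (1, 'Trailer news', 500), (2, 'Review', 120);
INSERT INTO tesis_hometimeline VALUES (10, 1, 7, 15), (11, 2, 2, 3);
""")

# Join each article with the tweet that shared it, pairing web traffic
# with Twitter engagement.
rows = con.execute("""
SELECT p.post_title, p.uniquepageviews, t.retweet_count, t.favorite_count
FROM tesis_posts p
JOIN tesis_hometimeline t ON t.post_shared = p.post_id
ORDER BY p.post_id
""").fetchall()
```

    This is the kind of join the aggregated tesis_stats fields (totals and means) pre-compute for faster processing.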

  4. (🌇Sunset) 🇺🇦 Ukraine Conflict Twitter Dataset

    • kaggle.com
    zip
    Updated Apr 2, 2024
    Cite
    BwandoWando (2024). (🌇Sunset) 🇺🇦 Ukraine Conflict Twitter Dataset [Dataset]. https://www.kaggle.com/datasets/bwandowando/ukraine-russian-crisis-twitter-dataset-1-2-m-rows
    Explore at:
    Available download formats: zip (18,174,367,560 bytes)
    Dataset updated
    Apr 2, 2024
    Authors
    BwandoWando
    License

    CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    Ukraine
    Description

    IMPORTANT (02-Apr-2024)

    Kaggle has fixed the issue with gzip files and Version 510 should now reflect properly working files

    IMPORTANT (28-Mar-2024)

    Please use version 508 of the dataset, as 509 is broken. See the link below for the properly working version: https://www.kaggle.com/datasets/bwandowando/ukraine-russian-crisis-twitter-dataset-1-2-m-rows/versions/508

    Context

    The context and history of the current ongoing conflict can be found https://en.wikipedia.org/wiki/2022_Russian_invasion_of_Ukraine.

    Announcement

    [Jun 16] (🌇Sunset) Twitter has finally pulled the plug on all of my remaining Twitter API accounts as part of their effort to migrate developers to the new API. The last tweets I pulled were dated Jun 14, and there is no more data from Jun 15 onwards. It was fun while it lasted, and I hope that this dataset was able, and will continue, to help a lot. I'll leave the dataset here for future download and reference. Thank you all!

    [Apr 19] Two additional developer accounts have been permanently suspended; expect a lower throughput in the next few weeks. I will pull data until they ban my last account.

    [Apr 08] I woke up this morning and saw that Twitter has banned/permanently suspended 4 of my developer accounts. I have a few more, but it is just a matter of time until all my accounts are banned as well. This was a fun project that I maintained for as long as I could. I will pull data until my last account gets banned.

    [Feb 26] I've started to pull in RETWEETS again, so I am expecting a significant increase in tweet throughput on top of the dedicated processes that fetch NON-RETWEETS. If you don't want RETWEETS, just filter them out.

    [Feb 24] It's been a year since I started collecting tweets about this conflict, and I had no idea that a year later it would still be ongoing. Almost everyone assumed that Ukraine would crumble in a matter of days, but that is not the case. To those who have been using my dataset, I hope that I am helping all of you in one way or another. I'll do my best to keep updating this dataset as long as I can.

    [Feb 02] I seem to be getting fewer tweets as my crawlers are being throttled; I used to get 2500 tweets per 15 minutes, but around 2-3 of my crawlers are getting throttling-limit errors. Twitter may have changed something about rate limits or similar. I will try to find ways to increase the throughput again.

    [Jan 02] All new datasets will now be prefixed by the year, so for Jan 01, 2023, the file will be 20230101_XXXX.

    [Dec 28] For those looking for a cleaned version of my dataset, with the retweets from before Aug 08 removed, here is a dataset by @vbmokin: https://www.kaggle.com/datasets/vbmokin/russian-invasion-ukraine-without-retweets

    [Nov 19] I noticed that one of my developer accounts, which ISN'T TWEETING ANYTHING and just pulls data out of Twitter, has been permanently banned by Twitter.com, hence the decrease in unique tweets. I will try to come up with a solution to increase my throughput and sign up for a new developer account.

    [Oct 19] I just noticed that this dataset finally reached "GOLD", roughly seven months after I first uploaded my gzipped csv files.

    [Oct 11] Sudden spike in the number of tweets, revolving around the most recent developments: the Kerch Bridge explosion and the response from Russia.

    [Aug 19 - IMPORTANT] I raised the missing-dataset issue to the Kaggle team, and they confirmed it was a bug introduced by a ReactJS upgrade; the conversation and details can be seen here: https://www.kaggle.com/discussions/product-feedback/345915 . It has been fixed, and I've re-uploaded all the gzipped files that were lost PLUS the new files generated AFTER the issue was identified.

    [Aug 17] The latest version of my dataset lost around 100+ files. The good thing is that this dataset is versioned, so one can just go back to the previous version(s) and download them. Version 188 HAS ALL THE LOST FILES. I won't be re-uploading all datasets, as it would be tedious; I've already deleted them locally, and I only store the latest 2-3 days.

    [Aug 10] 3/5 of my Python processes errored out, resulting in around 10-12 hours of NO data gathering for those processes, hence the sharp decrease in tweets for the Aug 09 dataset. I've added exception/error checking to prevent this from happening again.

    [Aug 09] Significant drop in tweets extracted, but I am now getting ORIGINAL/ NON-RETWEETS.

    [Aug 08] I've noticed that I had a spike of Tweets extracted, but they are literally thousands of retweets of a single original tweet. I also noticed that my crawlers seem to deviate because of this tactic being used by some Twitter users where they flood Twitter w...
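    As the [Feb 26] note suggests, retweets can simply be filtered out when they are not wanted. A minimal sketch, assuming each record exposes a text field and that retweets carry the conventional "RT @" prefix:

```python
# Drop retweets from a batch of tweet records (sample values hypothetical).
def drop_retweets(records):
    """Keep only records whose text does not start with the 'RT @' prefix."""
    return [r for r in records if not r["text"].startswith("RT @")]

sample = [
    {"text": "RT @someone: breaking news"},
    {"text": "Original report from Kyiv"},
]
kept = drop_retweets(sample)
```

    If the files expose a dedicated retweet flag or a retweeted_status field instead, filtering on that field is more robust than prefix matching.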

  5. An Archive of #DH2016 Tweets Published on Thursday 14 July 2016 GMT

    • city.figshare.com
    html
    Updated May 31, 2023
    + more versions
    Cite
    Ernesto Priego (2023). An Archive of #DH2016 Tweets Published on Thursday 14 July 2016 GMT [Dataset]. http://doi.org/10.6084/m9.figshare.3487103.v1
    Explore at:
    Available download formats: html
    Dataset updated
    May 31, 2023
    Dataset provided by
    City, University of London
    Authors
    Ernesto Priego
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background

    The Digital Humanities 2016 conference took place in Kraków, Poland, between Monday 11 July and Saturday 16 July 2016. #DH2016 was the conference's official hashtag.

    What This Output Is

    This is a CSV file containing a total of 3717 Tweets publicly published with the hashtag #DH2016 on Thursday 14 July 2016 GMT. The archive starts with a Tweet published on Thursday 14 July 2016 at 00:01:04 +0000 and ends with a Tweet published on Thursday 14 July 2016 at 23:49:14 +0000 (GMT). Previous days have been shared as separate outputs. A breakdown of Tweets per day so far:

    Sunday 10 July 2016: 179 Tweets
    Monday 11 July 2016: 981 Tweets
    Tuesday 12 July 2016: 2318 Tweets
    Wednesday 13 July 2016: 4175 Tweets
    Thursday 14 July 2016: 3717 Tweets

    Methodology and Limitations

    The Tweets contained in this file were collected by Ernesto Priego using Martin Hawksey's TAGS 6.0. Only users with at least 1 follower were included in the archive. Retweets have been included (Retweets count as Tweets). The collection spreadsheet was customised to reflect the time zone and geographical location of the conference. The profile_image_url and entities_str metadata were removed before public sharing in this archive. Please bear in mind that the conference hashtag has been spammed, so some Tweets collected may be from spam accounts. Some automated refining has been performed to remove Tweets not related to the conference, but the data is likely to require further refining and deduplication. Both research and experience show that the Twitter Search API is not 100% reliable. Large Tweet volumes affect the search collection process. The API might "over-represent the more central users", not offering "an accurate picture of peripheral activity" (Gonzalez-Bailon, Sandra, et al. 2012). Apart from the filters and limitations already declared, it cannot be guaranteed that this file contains each and every Tweet tagged with #dh2016 during the indicated period, and the dataset is shared for archival, comparative and indicative educational research purposes only.

    Only content from public accounts is included, obtained from the Twitter Search API. The shared data is also publicly available to all Twitter users via the Twitter Search API, and available to anyone with an Internet connection via the Twitter and Twitter Search web clients and mobile apps without the need for a Twitter account. Each Tweet and its contents were published openly on the Web with the queried hashtag and are the responsibility of their original authors. Original Tweets are likely to be copyright of their individual authors, but please check individually. No private personal information is shared in this dataset. The collection and sharing of this dataset is enabled and allowed by Twitter's Privacy Policy. The sharing of this dataset complies with Twitter's Developer Rules of the Road. This dataset is shared to archive, document and encourage open educational research into scholarly activity on Twitter.

    Other Considerations

    Tweets published publicly by scholars during academic conferences are often tagged (labeled) with a hashtag dedicated to the conference in question. The purpose and function of hashtags is to organise and describe information/outputs under the relevant label in order to enhance the discoverability of the labeled information/outputs (Tweets in this case). A hashtag is metadata users choose freely to use so their content is associated with, directly linked to and categorised under the chosen hashtag. Though every reason for Tweeters' use of hashtags cannot be generalised nor predicted, it can be argued that scholarly Twitter users form specialised, self-selecting public professional networks that tend to observe scholarly practices and accepted modes of social and professional behaviour. In general terms, it can be argued that scholarly Twitter users willingly and consciously tag their public Tweets with a conference hashtag as a means to network and to promote, report from, reflect on, comment on and generally contribute publicly to the scholarly conversation around conferences. As Twitter users, conference hashtag contributors have agreed to Twitter's Privacy and data-sharing policies. Professional associations like the Modern Language Association recognise Tweets as citeable scholarly outputs.

    Archiving scholarly Tweets is a means to preserve this form of rapid online scholarship that otherwise can very likely become unretrievable as time passes; Twitter's Search API has well-known temporal limitations for retrospective historical search and collection. Beyond individual Tweets as scholarly outputs, the collective scholarly activity on Twitter around a conference, academic project or event can provide interesting insights for the contemporary history of scholarly communications. To date, collecting in real time is the only relatively accurate method of archiving Tweets at a small scale. Though these datasets have limitations and are not thoroughly systematic, it is hoped they can contribute to developing new insights into the discipline's presence on Twitter over time.

    The CC-BY license has been applied to the output in the repository as a curated dataset. Authorial/curatorial/collection work has been performed on the file in order to make it available as part of the scholarly record. The data contained in the deposited file is otherwise freely available elsewhere through different methods, and anyone not wishing to attribute the data to the creator of this output is, needless to say, free to do their own collection and clean their own data.
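    A per-day breakdown like the one given above can be recomputed from the CSV itself. A minimal sketch, assuming a TAGS-style created_at column in Twitter's timestamp format; the sample rows below are hypothetical:

```python
import csv
import io
from collections import Counter
from datetime import datetime

# Hypothetical inline sample in the assumed TAGS-style layout, with a
# created_at column in Twitter's timestamp format.
sample_csv = """id_str,from_user,created_at,text
1,alice,Wed Jul 13 23:59:10 +0000 2016,#DH2016 keynote!
2,bob,Thu Jul 14 00:01:04 +0000 2016,RT @alice: #DH2016 keynote!
3,carol,Thu Jul 14 12:30:00 +0000 2016,Slides up #DH2016
"""

per_day = Counter()
for row in csv.DictReader(io.StringIO(sample_csv)):
    dt = datetime.strptime(row["created_at"], "%a %b %d %H:%M:%S %z %Y")
    per_day[dt.date().isoformat()] += 1
```

    Since the archive was cut at GMT day boundaries, parsing the +0000 offset keeps the counts aligned with the per-day totals reported in the description.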

  6. A Twitter Dataset for Spatial Infectious Disease Surveillance

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 6, 2021
    Cite
    Horta Ribeiro, Manoel (2021). A Twitter Dataset for Spatial Infectious Disease Surveillance [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_2541439
    Explore at:
    Dataset updated
    Jan 6, 2021
    Dataset provided by
    M. Assuncao, Renato
    Horta Ribeiro, Manoel
    C.S.N.P. Souza, Roberto
    Meira Jr., Wagner
    dos Santos, Walter
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dengue is a mosquito-borne viral disease which infects millions of people every year, especially in developing countries. Some of the main challenges facing the disease are reporting risk indicators and rapidly detecting outbreaks. Traditional surveillance systems rely on passive reporting from health-care facilities, often ignoring human mobility and locating each individual by their home address. Yet geolocated data are becoming commonplace in social media, which is widely used as a means to discuss a large variety of health topics, including the users' health status. In this dataset paper, we make available two large collections of dengue-related labeled Twitter data. One is a set of tweets obtained through the Streaming API using the keywords dengue and aedes from 2010 to 2016. The other is the set of all geolocated tweets in Brazil during 2015 (also obtained through the Streaming API). We detail the process of collecting and labeling each tweet containing keywords related to dengue into one of 5 categories: personal experience, information, opinion, campaign, and joke. This dataset can be useful for the development of models for spatial disease surveillance, but also for scenarios such as understanding health-related content in a language other than English, and studying human mobility.

  7. Current Employment Statistics (CES), Annual Average

    • data.ca.gov
    csv
    Updated Jul 24, 2023
    + more versions
    Cite
    California Employment Development Department (2023). Current Employment Statistics (CES), Annual Average [Dataset]. https://data.ca.gov/dataset/current-employment-statistics-ces-annual-average
    Explore at:
    Available download formats: csv (15,980,721 bytes)
    Dataset updated
    Jul 24, 2023
    Dataset provided by
    Employment Development Departmenthttp://www.edd.ca.gov/
    Authors
    California Employment Development Department
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains annual average CES data for California statewide and areas from 1990 to 2023.

    The Current Employment Statistics (CES) program is a Federal-State cooperative effort in which monthly surveys are conducted to provide estimates of employment, hours, and earnings based on payroll records of business establishments. The CES survey is based on approximately 119,000 businesses and government agencies representing approximately 629,000 individual worksites throughout the United States.

    CES data reflect the number of nonfarm payroll jobs. This includes the total number of persons on establishment payrolls, employed full- or part-time, who received pay (whether they worked or not) for any part of the pay period that includes the 12th day of the month. Temporary and intermittent employees are included, as are employees on paid sick leave or paid holiday. Persons on the payroll of more than one establishment are counted in each establishment. CES data exclude proprietors, the self-employed, unpaid family or volunteer workers, farm workers, and household workers. Government employment covers only civilian employees; it excludes uniformed members of the armed services.

    The Bureau of Labor Statistics (BLS) of the U.S. Department of Labor is responsible for the concepts, definitions, technical procedures, validation, and publication of the estimates that State workforce agencies prepare under agreement with BLS.

  8. Orphan Drugs - Dataset 1: Twitter issue-networks as excluded publics

    • orda.shef.ac.uk
    txt
    Updated Oct 22, 2021
    Cite
    Orphan Drugs - Dataset 1: Twitter issue-networks as excluded publics [Dataset]. https://orda.shef.ac.uk/articles/dataset/Orphan_Drugs_-_Dataset_1_Twitter_issue-networks_as_excluded_publics/16447326
    Explore at:
    txt
    Available download formats
    Dataset updated
    Oct 22, 2021
    Dataset provided by
    The University of Sheffield
    Authors
    Matthew Hanchard
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset comprises two .csv-format files used within workstream 2 of the Wellcome Trust-funded ‘Orphan drugs: High prices, access to medicines and the transformation of biopharmaceutical innovation’ project (219875/Z/19/Z). They appear in various outputs, e.g. publications and presentations.

    The deposited data were gathered using the University of Amsterdam Digital Methods Initiative’s ‘Twitter Capture and Analysis Toolset’ (DMI-TCAT) before being processed and extracted from Gephi. DMI-TCAT queries Twitter’s streaming Application Programming Interface (API) on a pre-set text query, stores the returned data in a MySQL database, and allows that data to be output in various formats. This process aligns fully with Twitter’s terms of service.

    The query for the deposited dataset gathered a 1% random sample of all public tweets posted between 10-Feb-2021 and 10-Mar-2021 containing the text ‘Rare Diseases’ and/or ‘Rare Disease Day’, storing it in a local MySQL database managed by the University of Sheffield School of Sociological Studies (http://dmi-tcat.shef.ac.uk/analysis/index.php), accessible only via a valid VPN such as FortiClient and through a permitted active-directory user profile. The dataset was output from the MySQL database raw as a .gexf-format file, suitable for social network analysis (SNA). It was then opened in the Gephi (0.9.2) data-visualisation software and anonymised/pseudonymised as per the ethical approval granted by the University of Sheffield School of Sociological Studies Research Ethics Committee on 02-Jun-201 (reference: 039187).

    The deposited dataset comprises two anonymised/pseudonymised social network analysis .csv files extracted from Gephi, one containing node data (Issue-networks as excluded publics – Nodes.csv) and the other containing edge data (Issue-networks as excluded publics – Edges.csv). Where participants explicitly provided consent, their original username has been retained. Where they consented on the basis that they not be identifiable, their username has been replaced with an appropriate pseudonym. All other usernames have been anonymised with a randomly generated 16-digit key. The level of anonymity for each Twitter user is given in column C of the deposited file ‘Issue-networks as excluded publics – Nodes.csv’.

    This dataset was created and deposited onto the University of Sheffield Online Research Data repository (ORDA) on 26-Aug-2021 by Dr. Matthew S. Hanchard, Research Associate at the University of Sheffield iHuman institute/School of Sociological Studies. ORDA has full permission to store this dataset and to make it open access for public re-use without restriction under a CC BY license, in line with the Wellcome Trust commitment to making all research data Open Access.

    The University of Sheffield is the designated data controller for this dataset.

  9. Data from: Temporally-Informed Analysis of Named Entity Recognition

    • live.european-language-grid.eu
    • data.niaid.nih.gov
    json
    Updated Aug 29, 2022
    Cite
    (2022). Temporally-Informed Analysis of Named Entity Recognition [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/7805
    Explore at:
    json
    Available download formats
    Dataset updated
    Aug 29, 2022
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository contains the data set developed for the paper:

    “Shruti Rijhwani and Daniel Preoțiuc-Pietro. Temporally-Informed Analysis of Named Entity Recognition. In Proceedings of the Association for Computational Linguistics (ACL). 2020.”

    It includes 12,000 tweets annotated for the named entity recognition task. The tweets are uniformly distributed over the years 2014-2019, with 2,000 tweets from each year. The goal is to have a temporally diverse corpus to account for data drift over time when building NER models.

    The entity types annotated are locations (LOC), persons (PER) and organizations (ORG). The tweets are preprocessed to replace usernames and URLs with a unique token. Hashtags are left intact and can be annotated as named entities.

    Format

    The repository contains the annotations in JSON format.

    Each year-wise file has the tweet IDs along with token-level annotations. The public Twitter Search API (https://developer.twitter.com/en/docs/tweets/search) can be used to retrieve the text of the tweet corresponding to each tweet ID.
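    Since only tweet IDs are distributed, the text must be re-fetched ("hydrated"). A hedged sketch of the batching step, assuming a lookup endpoint that accepts up to 100 IDs per request; the API call itself, shown commented out, would need a client library such as tweepy and valid credentials:

```python
def batches(ids, size=100):
    """Yield successive batches of tweet IDs for a lookup endpoint."""
    for i in range(0, len(ids), size):
        yield ids[i:i + size]

# Hypothetical hydration loop (requires credentials; tweepy >= 4 assumed):
# import tweepy
# client = tweepy.Client(bearer_token="...")
# for batch in batches(tweet_ids):
#     response = client.get_tweets(ids=batch)

tweet_ids = list(range(250))  # placeholder IDs
grouped = list(batches(tweet_ids))  # 3 batches: 100, 100, 50
```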

    Data Splits

    Typically, NER models are trained and evaluated on annotations available at model-building time, but are then used to make predictions on data from a future time period. This setup makes the model susceptible to temporal data drift, leading to lower performance on future data compared to the test set.

    To examine this effect, we use tweets from the years 2014-2018 as the training set and random splits of the 2019 tweets as the development and test sets. These splits simulate the scenario of making predictions on data from a future time period.

    The development and test splits are provided in the JSON format.
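    A deterministic sketch of such a split; the seed and the 50/50 dev/test proportion are assumptions for illustration, not the paper's published splits:

```python
import random

def temporal_split(tweets_by_year, dev_frac=0.5, seed=0):
    """Train on 2014-2018; randomly split the 2019 tweets into dev and test."""
    train = [t for year in range(2014, 2019) for t in tweets_by_year[year]]
    future = list(tweets_by_year[2019])
    random.Random(seed).shuffle(future)  # deterministic shuffle
    cut = int(len(future) * dev_frac)
    return train, future[:cut], future[cut:]

# Toy data: 2,000 placeholder tweets per year, matching the corpus size.
data = {year: [f"tweet-{year}-{i}" for i in range(2000)] for year in range(2014, 2020)}
train, dev, test = temporal_split(data)
```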

    Use

    Please cite the data set and the accompanying paper if you found the resources in this repository useful.

  10. #nowplaying

    • explore.openaire.eu
    • data.niaid.nih.gov
    • +1more
    Updated Mar 15, 2019
    Cite
    Eva Zangerle (2019). #nowplaying [Dataset]. http://doi.org/10.5281/zenodo.2594482
    Explore at:
    Dataset updated
    Mar 15, 2019
    Authors
    Eva Zangerle
    Description

    This dataset contains a dump of the #nowplaying dataset, which comprises so-called listening events of users who publish the music they are currently listening to on Twitter. In particular, it includes tracks that have been tweeted using the hashtags #nowplaying, #listento or #listeningto. For each listening event we provide the track and artist along with metadata on the tweet (date sent, user, source), as well as a mapping of tracks to their respective MusicBrainz identifiers. The dataset features a total of 126 million listening events.

    This archive contains nowplaying.csv, the main file, which contains the following fields:

    • user id (each user is identified by a unique hash value)

    • source of the tweet (how it was sent, as provided by the Twitter API)

    • timestamp of the time the tweet underlying the listening event was sent

    • track title

    • artist name

    • MusicBrainz identifier of the recording (cf. https://musicbrainz.org/)

    In case you make use of our dataset in a scientific setting, we kindly ask you to cite the following paper: Eva Zangerle, Martin Pichl, Wolfgang Gassler, and Günther Specht. 2014. #nowplaying Music Dataset: Extracting Listening Behavior from Twitter. In Proceedings of the First International Workshop on Internet-Scale Multimedia Management (WISMM '14). ACM, New York, NY, USA, 21-26.

    If you have any questions or suggestions regarding the dataset, please do not hesitate to contact Eva Zangerle (eva.zangerle@uibk.ac.at).
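    A sketch of reading the listening events, assuming nowplaying.csv is comma-separated with the six fields in the order listed above (the delimiter and quoting should be verified against the actual file; the sample rows are illustrative):

```python
import csv
import io

FIELDS = ["user_id", "source", "timestamp", "track", "artist", "musicbrainz_id"]

# Illustrative two-row sample in the assumed field order.
sample = io.StringIO(
    "u1hash,web,2014-01-01 12:00:00,Yellow Submarine,The Beatles,mbid-123\n"
    "u2hash,android,2014-01-01 12:05:00,Imagine,John Lennon,mbid-456\n"
)
events = [dict(zip(FIELDS, row)) for row in csv.reader(sample)]
```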

  11. Information Seeking in Academic Conferences

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Jan 24, 2020
    Cite
    Xidao Wen; Yu-Ru Lin; Xidao Wen; Yu-Ru Lin (2020). Information Seeking in Academic Conferences [Dataset]. http://doi.org/10.5281/zenodo.819537
    Explore at:
    zip
    Available download formats
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Xidao Wen; Yu-Ru Lin; Xidao Wen; Yu-Ru Lin
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    The data sets released here have been used in our study on longitudinal information-seeking and social-networking behaviors across academic communities. Social media like Twitter have been widely used at physical gatherings, such as conferences and sports events, as a "backchannel" to facilitate conversations among participants. It has remained largely unexplored, though, how event participants seek information in those situations.

    There are three key results:

    (1) Our study takes the first initiative to characterize the information seeking and responding networks in a concrete context---academic conferences---as one example of physical gatherings. By studying over 190 thousand tweets posted by 66 academic communities over five years, we unveil the landscape of information-seeking activities and the associated social and temporal contexts during the conferences.

    (2) We leverage crowdsourcing and machine learning techniques to identify distinct types of information-seeking tweets in academic communities. We show that the information needs can be differentiated by their posted time and content, as well as how they were responded to. Interestingly, users' tendencies of posting certain types of information needs can be inferred by prior tweeting activities and network positions.

    (3) Moreover, our results suggest it is also possible to predict the potential respondents to different types of information needs. Our study was based on two data sets: (1) a long-term collection of tweets posted by 66 academic communities over five years, and (2) a subset of information-seeking tweets with human annotated labels (the types of questions). We are making the data sets available for academic researchers and public use, to enable the discovery of new insights and development of better techniques to facilitate information seeking.

    Dataset (1):

    The conference tweets were collected through keyword search using the Topsy API in 2014. The keywords vary by conference and year, but typically consist of two parts and follow the format "Conference Acronym" + "Year". For example, the International World Wide Web Conference in 2013 would have the hashtag "www2013".

    Duration: 2008 to 2013 Total number of tweets: 334,507

    Dataset (2):

    We further identify information-seeking tweets by checking whether a tweet contains a question mark (?) in its text. We then design the information-seeking question categorization and develop a codebook to help human subjects identify the question type. The human annotations are obtained from Amazon Mechanical Turk. Based on these annotations, we train machine classifiers to identify the question types for the remaining information-seeking tweets.

    Duration: 2008 to 2013

    Total number of labeled information seeking tweets: 1,899 Total number of unlabeled information seeking tweets: 9,967
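    The first filtering step above, selecting candidate information-seeking tweets by the presence of a question mark, is simple enough to sketch directly (the sample tweets are illustrative):

```python
def is_information_seeking(tweet: str) -> bool:
    """Candidate filter from the study: the tweet contains a question mark."""
    return "?" in tweet

tweets = [
    "Anyone know the wifi password at #www2013?",
    "Great keynote this morning #www2013",
]
candidates = [t for t in tweets if is_information_seeking(t)]
```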

    Publication:

    If you make use of this data set, please cite:

    Wen, X., & Lin, Y. R. (2015, November). Information Seeking and Responding Networks in Physical Gatherings: A Case Study of Academic Conferences in Twitter. In Proceedings of the 2015 ACM on Conference on Online Social Networks (pp. 197-208). ACM.

  12. Following/Followers and Tags on 0.1 million Twitter Users

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip
    Updated Jan 24, 2020
    Cite
    Mitsuo Yoshida; Yuto Yamaguchi; Mitsuo Yoshida; Yuto Yamaguchi (2020). Following/Followers and Tags on 0.1 million Twitter Users [Dataset]. http://doi.org/10.5281/zenodo.13966
    Explore at:
    application/gzip
    Available download formats
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Mitsuo Yoshida; Yuto Yamaguchi; Mitsuo Yoshida; Yuto Yamaguchi
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Abstract (our paper)

    Why does Smith follow Johnson on Twitter? In most cases, the reason why users follow other users is unavailable. In this work, we answer this question by proposing TagF, which analyzes the who-follows-whom network (matrix) and the who-tags-whom network (tensor) simultaneously. Concretely, our method decomposes a coupled tensor constructed from these matrix and tensor. The experimental results on million-scale Twitter networks show that TagF uncovers different, but explainable reasons why users follow other users.

    Data

    coupled_tensor:
    The first column is the source user id (from user id), the second column is the destination user id (to user id), and the third column is the tag id.

    users.id:
    The first column is the user id for coupled_tensor, and the second column is the user id on Twitter.

    tags.id:
    The first column is the tag id for coupled_tensor, and the second column is the tag (i.e. slug or list name) on Twitter. On the tags, ###follow### and ###friend### are special tags expressing follower and following.

    Publication

    This dataset was created for our study. If you make use of this dataset, please cite:
    Yuto Yamaguchi, Mitsuo Yoshida, Christos Faloutsos, Hiroyuki Kitagawa. Why Do You Follow Him? Multilinear Analysis on Twitter. Proceedings of the 24th International Conference on World Wide Web (WWW '15 Companion). pp.137-138, 2015.
    http://doi.org/10.1145/2740908.2742715

    Code

    The code used to produce our experimental results is available at:
    https://github.com/yamaguchiyuto/tagf

    Note

    If you would like to use a larger dataset, the dataset on 1 million seed users is available at:
    http://dx.doi.org/10.5281/zenodo.16267
    (The dataset on 0.1 million seed users is not a subset of the dataset on 1 million seed users.)

  13. Life Satisfaction and the Pursuit of Happiness on Twitter

    • plos.figshare.com
    tiff
    Updated May 30, 2023
    Cite
    Chao Yang; Padmini Srinivasan (2023). Life Satisfaction and the Pursuit of Happiness on Twitter [Dataset]. http://doi.org/10.1371/journal.pone.0150881
    Explore at:
    tiff
    Available download formats
    Dataset updated
    May 30, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Chao Yang; Padmini Srinivasan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Life satisfaction refers to a somewhat stable cognitive assessment of one’s own life. Life satisfaction is an important component of subjective well-being, the scientific term for happiness. The other component is affect: the balance between the presence of positive and negative emotions in daily life. While affect has been studied using social media datasets (particularly from Twitter), life satisfaction has received little to no attention. Here, we examine trends in posts about life satisfaction from a two-year sample of Twitter data. We apply a surveillance methodology to extract expressions of both satisfaction and dissatisfaction with life. A noteworthy result is that, consistent with their definitions, trends in life-satisfaction posts are immune to external events (political, seasonal, etc.), unlike the affect trends reported by previous researchers. Comparing users, we find differences between satisfied and dissatisfied users in several linguistic, psychosocial and other features; for example, the latter post more tweets expressing anger, anxiety, depression and sadness, and more tweets about death. We also study users who change their status over time from satisfied with life to dissatisfied, or vice versa. Notably, the psychosocial tweet features of users who change from satisfied to dissatisfied are quite different from those of users who stay satisfied over time. Overall, our observations are consistent with intuition and with findings in social science research. This research contributes to the study of the subjective well-being of individuals through social media.

  14. Electoral violence incident dataset 2015-2016

    • datacatalogue.cessda.eu
    Updated Mar 19, 2025
    Cite
    Birch, S; Ounis, I; Macdonald, C; Yang, X (2025). Electoral violence incident dataset 2015-2016 [Dataset]. http://doi.org/10.5255/UKDA-SN-853262
    Explore at:
    Dataset updated
    Mar 19, 2025
    Dataset provided by
    King
    University of Glasgow
    Authors
    Birch, S; Ounis, I; Macdonald, C; Yang, X
    Time period covered
    Sep 1, 2016 - Mar 31, 2018
    Area covered
    Venezuela, Ghana, Philippines
    Variables measured
    Event/process
    Measurement technique
    Machine learning; we collected Twitter posts published by Twitter users during the period of one month before and after each election date. To permit human assessors to identify relevant (election-related) tweets without having to judge millions of tweets, we adopted a TREC-style pooling methodology. We targeted (1) the 2015 Venezuela parliamentary election, held on 6 December 2015 to elect the 164 deputies and three indigenous representatives of the National Assembly, (2) the 2016 Ghana general election, held on 7 December 2016 to elect a President and Members of Parliament, and (3) the 2016 Philippines general election, held on 9 May 2016 for the executive and legislative branches at all levels of government (national, provincial, and local), except for the barangay officials.
    Description

    We collected Twitter posts that are topically related to three selected elections: the 2015 Venezuela parliamentary election, the 2016 Philippines general election and the 2016 Ghana general election. Using human annotators and trained classifiers, we built two datasets, at tweet level and at incident level. The tweet-level dataset consists of annotated tweets, while the incident-level dataset contains grouped tweets together with the incident details reported by each group of tweets. Electoral violence is a common theme in developing countries around the world, where it undermines basic standards for democratic elections. Violence against candidates, voters, journalists and election officials can reduce voters’ choices and suppress the vote. Nowadays, social media platforms such as Twitter are popular as a medium for reporting and discussing current news and events, including political events. In particular, a comparison of Twitter and newswire for breaking news found that Twitter leads newswire in reporting political events. This indicates that Twitter is useful for monitoring and studying political events, including elections. Our datasets enable further studies of electoral violence based on social media data, which can provide valuable insights for explaining and mitigating electoral violence.

    Elections are a means of adjudicating political differences through peaceful, fair, democratic mechanisms. When elections are beset by violence, these aims are compromised and political crises often result. Despite the undisputed importance of understanding electoral violence, there has been only a limited body of systematic comparative research on this topic. If scholars and practitioners are to gain insight into the dynamics of electoral violence and develop superior strategies for deterring it, better data and more sophisticated theories are required. The aim of this project is to develop conceptual, methodological and practical tools to facilitate an enhanced understanding of electoral violence and the behavioural interventions best suited to preventing it, with a view to sustaining fair and vibrant societies. The project will involve the construction of two databases of electoral violence and will make these data available to those engaged in electoral assistance, electoral administration and electoral observation as well as academic and other researchers. The project will also use the resulting data to develop and test a series of theoretically-driven propositions about the causes of electoral violence and to assess a range of interventions designed to prevent violent behaviours. Finally, the project will generate an online electoral violence early warning tool that can be used to provide relevant information about current electoral risks. The project will be of considerable use both to academic students of election and conflict and to practitioners in the fields of contentious politics, electoral assistance, electoral observation, electoral administration, human rights, international relations, criminology and development studies. Electoral violence is frequently an aspect of contentious politics. 
Though contentious politics can play an important role in the democratic process, it raises problems for democracy both when it generates violence and when it disrupts key phases of the electoral cycle. Given the centrality of both contentious politics and elections to our understanding of contemporary political processes, this study promises to yield considerable benefits to a wide range of academic fields. In addition to scholars, many actors with a stake in peaceful elections urgently require superior means of averting disruptive forms of violence that threaten political stability, state-building and development. Since the violent interlude that followed the Kenyan elections of 2007, there has been an increased focus on the topic of electoral violence and a heightened sense of urgency in the international community's search for remedies, as exemplified by the 2012 final report of the Global Commission on Elections, Democracy and Security, chaired by Kofi Annan. One of the key recommendations of this report was 'to develop institutions, processes, and networks that deter election-related violence and, should deterrence fail, hold perpetrators accountable'. The proposed research is intended to make a substantial contribution towards this aim, which has become all the more urgent following the recent increase in violent behaviours in the Middle East and elsewhere. Finally, the project will innovate methodologically by integrating 'big data' retrieval methods into political science. Political scientists have to date made scant use of the possibilities represented by current data retrieval techniques; by enabling collaboration between political scientists and computer scientists, this project will facilitate the collection of a dataset of unprecedented size in the study of electoral violence, and it will allow...

  15. Newly Emerged Rumors in Twitter

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 24, 2020
    Cite
    Amirhosein Bodaghi (2020). Newly Emerged Rumors in Twitter [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_2563863
    Explore at:
    Dataset updated
    Jan 24, 2020
    Dataset authored and provided by
    Amirhosein Bodaghi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    *** Newly Emerged Rumors in Twitter ***

    These 12 datasets are the results of an empirical study of the spreading process of newly emerged rumors on Twitter. Newly emerged rumors are those whose rise and fall happen in a short period of time, in contrast to long-standing rumors. In particular, we have focused on those newly emerged rumors that simultaneously gave rise to an anti-rumor spreading against them. The story of each rumor is as follows:

    1- Dataset_R1 : The National Football League team in Washington D.C. changed its name to Redhawks.

    2- Dataset_R2 : A Muslim waitress refused to seat a church group at a restaurant, claiming "religious freedom" allowed her to do so.

    3- Dataset_R3 : Facebook CEO Mark Zuckerberg bought a "super-yacht" for $150 million.

    4- Dataset_R4 : Actor Denzel Washington said electing President Trump saved the U.S. from becoming an "Orwellian police state."

    5- Dataset_R5 : Joy Behar of "The View" sent a crass tweet about a fatal fire in Trump Tower.

    6- Dataset_R6 : Harley-Davidson's chief executive officer Matthew Levatich called President Trump "a moron."

    7- Dataset_R7 : The animated children's program 'VeggieTales' introduced a cannabis character in August 2018.

    8- Dataset_R8 : Michael Jordan resigned from the board at Nike and took his Air Jordan line of apparel with him.

    9- Dataset_R9 : In September 2018, the University of Alabama football program ended its uniform contract with Nike, in response to Nike's endorsement deal with Colin Kaepernick.

    10- Dataset_R10 : During confirmation hearings for Supreme Court nominee Brett Kavanaugh, congressional Democrats demanded that the nominee undergo DNA testing to prove he is not Adolf Hitler.

    11- Dataset_R11 : Singer Michael Bublé's upcoming album will be his last, as he is retiring from making music.

    12- Dataset_R12 : A screenshot from MyLife.com confirms that mail bomb suspect Cesar Sayoc was registered as a Democrat.

    The structure of the Excel files for each dataset is as follows:

    • Each row corresponds to one captured tweet/retweet related to the rumor, and each column presents a specific piece of information about that tweet/retweet. From left to right, the columns are:

    • User ID (user who has posted the current tweet/retweet)

    • The description sentence in the profile of the user who has published the tweet/retweet

    • The number of tweets/retweets published by the user at the time of posting the current tweet/retweet

    • Date and time of creation of the account from which the current tweet/retweet was posted

    • Language of the tweet/retweet

    • Number of followers

    • Number of followings (friends)

    • Date and time of posting the current tweet/retweet

    • Number of likes (favorites) the current tweet had acquired before it was crawled

    • Number of times the current tweet had been retweeted before crawling it

    • Whether another tweet is embedded in the current tweet/retweet (for example, when the current tweet is a quote, reply or retweet)

    • The source (OS) of device by which the current tweet/retweet was posted

    • Tweet/Retweet ID

    • Retweet ID (if the post is a retweet then this feature gives the ID of the tweet that is retweeted by the current post)

    • Quote ID (if the post is a quote then this feature gives the ID of the tweet that is quoted by the current post)

    • Reply ID (if the post is a reply then this feature gives the ID of the tweet that is replied by the current post)

    • Frequency of tweet occurrences, i.e. the number of times the current tweet is repeated in the dataset (for example, the number of times a tweet appears in the dataset as a retweet posted by others)

    • State of the tweet, which can be one of the following (reached by agreement between the annotators):

       r : The tweet/retweet is a rumor post
      
       a : The tweet/retweet is an anti-rumor post
      
       q : The tweet/retweet is a question about the rumor, however neither confirm nor deny it
      
   n : The tweet/retweet is not related to the rumor (even though it contains queries related to the rumor, it does not refer to the rumor)
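    Once a sheet has been read into rows, the annotation states can be tallied; a pure-Python sketch with hypothetical example rows (in the real files the state is the last column):

```python
from collections import Counter

# Hypothetical (tweet_id, state) pairs; states are r, a, q or n as defined above.
rows = [
    ("111", "r"),
    ("112", "a"),
    ("113", "r"),
    ("114", "q"),
    ("115", "n"),
]
state_counts = Counter(state for _, state in rows)
rumor_share = state_counts["r"] / len(rows)  # fraction of rumor posts
```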
      
  16. A Large Dataset of Tweets on the 2023 Presidential Elections in Nigeria for Natural Language Processing Tasks

    • data.niaid.nih.gov
    Updated Sep 17, 2023
    Cite
    Arogundade Oluwasefunmi Tale (2023). A Large Dataset of Tweets on the 2023 Presidential Elections in Nigeria for Natural Language Processing Tasks [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8347220
    Explore at:
    Dataset updated
    Sep 17, 2023
    Dataset provided by
    Odeyinka, Abiola Michael
    Abayomi-Alli Adebayo
    Abayomi-Alli Ayomide
    Arogundade Oluwasefunmi Tale
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Nigeria
    Description

    The dataset contains tweets related to the 2023 presidential elections in Nigeria. The data was retrieved from the social media network, Twitter (Now X) between February 4th, 2023 and April 4th, 2023. The hashtags from the official handles and other popular hashtags endorsed and/or representing the candidates of each party were considered for retrieving election related tweets using an API from Twitter social media platform. Three major political parties in Nigeria were considered and they have been labelled as Party A, Party L and Party P in this dataset. The party or group called "General" contains tweets from the Independent National Electoral Commission (INEC) hashtags such as @inecnigeria and #2023election which is not directly for any political party.

    The dataset has been pre-processed lightly to make it very useful to researcher for a wide range of natural language processing tasks like sentiment analysis, topic modelling, fake news detection, emotion detection, election stance, etc.

    Details of the dataset collection, such as hashtags, retrieved tweets, duplicates removed, and the remaining unique tweets, are presented in Table 1.

    Table 1: Tweets collection and duplicates removal

    S/N | Party | Hashtags | Retrieved tweets | Duplicate tweets | Unique tweets
    ----|-------|----------|------------------|------------------|--------------
    1 | General (X) | @inecnigeria, #2023election | 64,496 | 47,275 | 17,195
    2 | A | #TinubuIsComing, #emilokan, #jagabanarmy, #RenewedHope, #BATKSM2023 | 263,870 | 231,036 | 32,832
    3 | L | #VoteLP, #NigeriaMustBeBright, #PeterObiForPresident2023, #ObiDatti2023, #PeterObi | 664,083 | 310,857 | 353,226
    4 | P | #NigeriaDecides, #VotePDP, #AtikuOkowa2023, #FinalPushToVictory, #RecoverNigeria | 387,450 | 318,425 | 66,227
    Total | | | 1,379,899 | 907,593 | 468,480
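The duplicate-removal bookkeeping behind Table 1 can be sketched in a few lines. The function name and the exact-text duplicate criterion are illustrative assumptions, since the authors' deduplication procedure is not detailed here:

```python
def dedup_stats(tweets):
    """Return (retrieved, duplicates, unique) counts for a list of tweet texts,
    treating exact-text matches as duplicates (an assumption, not the
    authors' documented criterion)."""
    unique = len(set(tweets))
    return len(tweets), len(tweets) - unique, unique

print(dedup_stats(["#VoteLP now", "#VoteLP now", "ObiDatti2023!"]))  # (3, 1, 2)
```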

    To encourage NLP tasks, we uploaded in this Version One the following files:

    The combined dataset with pre-processed tweets and their metadata, with duplicates removed, is in the file labelled “Combined Dataset Pre-processed without duplicates.csv”

    General statistics on each corpus are in the file labelled “Dataset Statistics.xlsx”

    The preprocessed corpus from the general group with the tweet contents only is in file labelled “Preprocessed_Tweet only_GENERAL.xlsx”

    The preprocessed corpus from Party A with the tweet contents only is in file labelled “Preprocessed_Tweet only_Party A.xlsx”

    The preprocessed corpus from Party L with the tweet contents only is in file labelled “Preprocessed_Tweet only_Party L.xlsx”

    The preprocessed corpus from Party P with the tweet contents only is in file labelled “Preprocessed_Tweet only_Party P.xlsx”

    The top 100 frequent tokens are in the file labelled “Top 100 Tokens and weights.xlsx”

    The top frequent bigrams and their weights are in the file labelled “Top 100 Bigrams and weights.xlsx”

    The top frequent trigrams and their weights are in the file labelled “Top 100 Trigrams and weights.xlsx”

  17. Z

    Portuguese Comparative Sentences: A Collection of Labeled Sentences on...

    • data.niaid.nih.gov
    • live.european-language-grid.eu
    • +1more
    Updated Apr 19, 2021
    Cite
    Fabrício Benevenuto (2021). Portuguese Comparative Sentences: A Collection of Labeled Sentences on Twitter and Buscapé [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4124409
    Explore at:
    Dataset updated
    Apr 19, 2021
    Dataset provided by
    Breno Matos
    Julio C. S. Reis
    Matheus Barbosa
    Fabrício Benevenuto
    Daniel Kansaon
    Michele A. Brandão
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    More and more customers rely on online reviews of products and comments on the Web to decide between one product and another. In this context, sentiment analysis techniques constitute the traditional way to summarize a user's opinion, which criticizes or highlights the positive aspects of a product. Sentiment analysis of reviews usually relies on extracting positive and negative aspects of products, neglecting comparative opinions. Such opinions do not directly express a positive or negative view but contrast aspects of products from different competitors.

    Here, we present the first effort to study comparative opinions in Portuguese, creating two new Portuguese datasets with comparative sentences marked by three humans. This repository consists of three important files: (1) lexicon that contains words frequently used to make a comparison in Portuguese; (2) Twitter dataset with labeled comparative sentences; and (3) Buscapé dataset with labeled comparative sentences.

    The lexicon is a set of 176 words frequently used to express a comparative opinion in Portuguese. It is used as a filter to build two datasets of comparative sentences from two important contexts: (1) online social networks; and (2) product reviews.

    For Twitter, we collected all Portuguese tweets published in Brazil on 2018/01/10 and filtered those containing at least one keyword from the lexicon, obtaining 130,459 tweets. Our work is at the sentence level, so all sentences were extracted and a sample of 2,053 sentences was created, which was labeled by three human annotators, reaching 83.2% agreement with Fleiss' kappa coefficient. For Buscapé, a Brazilian website (https://www.buscape.com.br/) used to compare product prices on the web, the same methodology was followed, creating a set of 2,754 labeled sentences obtained from comments made in 2013. This dataset was also labeled by three humans, reaching an agreement of 83.46% with Fleiss' kappa coefficient.

    The Twitter dataset has 2,053 labeled sentences, of which 918 are comparative. The Buscapé dataset has 2,754 labeled sentences, of which 1,282 are comparative.
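The inter-annotator agreement cited above uses Fleiss' kappa, which generalizes agreement measures to three or more raters. A minimal sketch of the standard formula follows; the function name and matrix layout are my own, not taken from the dataset's codebase:

```python
def fleiss_kappa(ratings):
    """ratings[i][j] = number of raters assigning item i to category j.
    Every item must be rated by the same number of raters."""
    N = len(ratings)                 # number of items
    k = len(ratings[0])              # number of categories
    n = sum(ratings[0])              # raters per item
    # proportion of all assignments falling in each category
    p_j = [sum(row[j] for row in ratings) / (N * n) for j in range(k)]
    # per-item observed agreement
    P_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in ratings]
    P_bar = sum(P_i) / N             # mean observed agreement
    P_e = sum(p * p for p in p_j)    # chance agreement
    return (P_bar - P_e) / (1 - P_e)

# Three annotators, two categories (comparative / non-comparative):
print(fleiss_kappa([[3, 0], [0, 3], [3, 0]]))  # perfect agreement -> 1.0
```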

    The datasets contain these labeled properties:

    text: the sentence extracted from the review comment.

    entity_s1: the first entity compared in the sentence.

    entity_s2: the second entity compared in the sentence.

    keyword: the comparative keyword used in the sentence to express comparison.

    preferred_entity: the preferred entity.

    id_start: the keyword's initial position in the sentence.

    id_end: the keyword's final position in the sentence.

    type: the sentence label, which specifies whether the phrase is a comparison.

    Additional Information:

    1 - The sentences were separated using a sentence tokenizer.

    2 - If the compared entity is not specified, the field will receive a value: "_".

    3 - The property "type" can contain five values, they are:

    0: Non-comparative (Não Comparativa).

    1: Non-Equal-Gradable (Gradativa com Predileção).

    2: Equative (Equitativa).

    3: Superlative (Superlativa).

    4: Non-Gradable (Não Gradativa).

    If you use this data, please cite our paper as follows:

    "Daniel Kansaon, Michele A. Brandão, Julio C. S. Reis, Matheus Barbosa,Breno Matos, and Fabrício Benevenuto. 2020. Mining Portuguese Comparative Sentences in Online Reviews. In Brazilian Symposium on Multimedia and the Web (WebMedia ’20), November 30-December 4, 2020, São Luís, Brazil. ACM, New York, NY, USA, 8 pages. https://doi.org/10.1145/3428658.3431081"

    Further Information:

    We make the raw sentences available in the dataset to allow future work to test different pre-processing steps. Then, if you want to obtain the exact sentences used in the paper above, you must reproduce the pre-processing step described in the paper (Figure 2).

    For each sentence with more than one keyword in the dataset:

    You need to extract three words before and three words after the comparative keyword, creating a new sentence that will receive the existing value in the “type” field as a label;

    The original sentence is divided into n new sentences, where n is the number of keywords in the sentence;

    Stopwords are not counted as part of this range of three words;

    Note that the final processed sentence can therefore have more than six words, because stopwords are not counted as part of the range.
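The windowing step described above can be sketched as follows. The stopword list and function names are illustrative assumptions, not the authors' actual implementation:

```python
STOPWORDS = {"a", "o", "de", "que", "e"}  # tiny illustrative Portuguese stopword set

def keyword_windows(sentence, keywords, width=3):
    """For each comparative keyword, build a sub-sentence with `width`
    non-stopword tokens on each side (stopwords are kept but not counted)."""
    tokens = sentence.split()
    out = []
    for i, tok in enumerate(tokens):
        if tok.lower() not in keywords:
            continue
        # walk left until `width` non-stopwords are collected
        left, count = i, 0
        while left > 0 and count < width:
            left -= 1
            if tokens[left].lower() not in STOPWORDS:
                count += 1
        # walk right the same way
        right, count = i, 0
        while right < len(tokens) - 1 and count < width:
            right += 1
            if tokens[right].lower() not in STOPWORDS:
                count += 1
        out.append(" ".join(tokens[left:right + 1]))
    return out

print(keyword_windows("u v w x maior y z", {"maior"}))  # ['v w x maior y z']
```

A sentence with several keywords yields one sub-sentence per keyword, each inheriting the original "type" label.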

  18. b

    Visdoorgangen - Fish passage places in Flanders, Belgium - Dataset - Belgian...

    • data.biodiversity.be
    Updated Aug 20, 2024
    + more versions
    Cite
    (2024). Visdoorgangen - Fish passage places in Flanders, Belgium - Dataset - Belgian biodiversity data portal [Dataset]. https://data.biodiversity.be/dataset/5d637678-cb64-4863-a12b-78b4e1a56628
    Explore at:
    Dataset updated
    Aug 20, 2024
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Belgium, Flanders
    Description

    Migration of fish in streams and rivers is prevented by various structures (mills, weirs, locks, ...). To make the migration of those fish species possible again, various projects have been realized in Flanders. This dataset contains information on the functioning of the fish passages. To allow anyone to use this dataset, we have released the data to the public domain under a Creative Commons Zero waiver (http://creativecommons.org/publicdomain/zero/1.0/). We would appreciate it, however, if you read and follow these norms for data use (http://www.inbo.be/en/norms-for-data-use) and provide a link to the original dataset (https://doi.org/10.15468/92ylpd) whenever possible. If you use these data for a scientific paper, please cite the dataset following the applicable citation norms and/or consider us for co-authorship. We are always interested to know how you have used or visualized the data, or to provide more information, so please contact us via the contact information provided in the metadata, opendata@inbo.be or https://twitter.com/LifeWatchINBO.

  19. f

    Data_Sheet_3_Probing sociodemographic influence on code-switching and...

    • figshare.com
    • frontiersin.figshare.com
    pdf
    Updated Jun 2, 2023
    + more versions
    Cite
    Olga Kellert (2023). Data_Sheet_3_Probing sociodemographic influence on code-switching and language choice in Quebec with geolocation of tweets.PDF [Dataset]. http://doi.org/10.3389/fpsyg.2023.1137038.s003
    Explore at:
    Available download formats: pdf
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    Frontiers
    Authors
    Olga Kellert
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Quebec
    Description

    This paper investigates the influence of the relative size of speech communities on language use in multilingual regions and cities. Due to peoples’ everyday mobility inside a city, it is still unclear whether the size of a population matters for language use on a sub-city scale. By testing the correlation between the size of a population and language use on various spatial scales, this study will contribute to a better understanding of the extent to which sociodemographic factors influence language use. The present study investigates two particular phenomena that are common to multilingual speakers, namely language mixing or Code-Switching and using multiple languages without mixing. Demographic information from a Canadian census will make predictions about the intensity of Code-Switching and language use by multilinguals in cities of Quebec and neighborhoods of Montreal. Geolocated tweets will be used to identify where these linguistic phenomena occur the most and the least. My results show that the intensity of Code-Switching and the use of English by bilinguals is influenced by the size of anglophone and francophone populations on various spatial scales such as the city level, land use level (city center vs. periphery of Montreal), and large urban zones on the sub-city level, namely the western and eastern urban zones of Montreal. However, the correlation between population figures and language use is difficult to measure and evaluate on a much smaller sub-urban scale such as the city block scale due to factors such as population figures missing from the census and people’s mobility. A qualitative evaluation of language use on a small spatial scale seems to suggest that other social influences such as the location context or topic of discussion are much more important predictors for language use than population figures. Methods will be suggested for testing this hypothesis in future research. 
I conclude that geographic space can provide information about the relation between language use in multilingual cities and sociodemographic factors such as a speech community's size, and that social media is a valuable alternative data source for sociolinguistic research, offering new insights into the mechanisms of language use such as Code-Switching.

  20. Z

    Data from: Detecting East Asian Prejudice on Social Media

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 22, 2024
    Cite
    Bertie Vidgen (2024). Detecting East Asian Prejudice on Social Media [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3816666
    Explore at:
    Dataset updated
    Jul 22, 2024
    Dataset provided by
    David Broniatowski
    Zeerak Waseem
    Rebekah Tromble
    Matthew Hall
    Bertie Vidgen
    Helen Margetts
    Ella Guest
    Austin Botelho
    Scott Hale
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    East Asia
    Description

    This repository contains:

    A deep learning model which distinguishes between Hostility against East Asia, Criticism of East Asia, Discussion of East Asian prejudice and Neutral content. Its F1 score is 0.83.

    A detailed annotation codebook used for marking up the tweets.

    A labelled dataset with 20,000 entries.

    A dataset with all 40,000 annotations, which can be used to investigate annotation processes for abusive content moderation.

    A list of thematic hashtag replacements.

    Three sets of annotations for the 1,000 most used hashtags in the original database of COVID-19 related tweets. Hashtags were annotated for COVID-19 relevance, East Asian relevance and stance.

    The outbreak of COVID-19 has transformed societies across the world as governments tackle the health, economic and social costs of the pandemic. It has also raised concerns about the spread of hateful language and prejudice online, especially hostility directed against East Asia. This data repository is for a classifier that detects and categorizes social media posts from Twitter into four classes: Hostility against East Asia, Criticism of East Asia, Meta-discussions of East Asian prejudice and a neutral class. The classifier achieves an F1 score of 0.83 across all four classes. We provide our final model (coded in Python), as well as a new 20,000 tweet training dataset used to make the classifier, two analyses of hashtags associated with East Asian prejudice and the annotation codebook. The classifier can be implemented by other researchers, assisting with both online content moderation processes and further research into the dynamics, prevalence and impact of East Asian prejudice online during this global pandemic.

    This work is a collaboration between The Alan Turing Institute and the Oxford Internet Institute. It was funded by the Criminal Justice theme of The Alan Turing Institute under Wave 1 of The UKRI Strategic Priorities Fund, EPSRC Grant EP/T001569/1.

Cite
Gayo-Avello, Daniel (2020). Twitter historical dataset: March 21, 2006 (first tweet) to July 31, 2009 (3 years, 1.5 billion tweets) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3833781

Data from: Twitter historical dataset: March 21, 2006 (first tweet) to July 31, 2009 (3 years, 1.5 billion tweets)

Related Article
Explore at:
Dataset updated
May 20, 2020
Dataset authored and provided by
Gayo-Avello, Daniel
License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Disclaimer: This dataset is distributed by Daniel Gayo-Avello, an associate professor at the Department of Computer Science of the University of Oviedo, for the sole purpose of non-commercial research, and it includes only tweet IDs.

The dataset contains tweet IDs for all the published tweets (in any language) between March 21, 2006 and July 31, 2009, thus comprising the first three whole years of Twitter from its creation, that is, about 1.5 billion tweets (see file Twitter-historical-20060321-20090731.zip).

It covers several defining moments in Twitter's history, such as the invention of hashtags, retweets and trending topics, and it includes tweets related to the 2008 US Presidential Elections, Obama's first inauguration speech, and the 2009 Iran Election protests (one of the so-called Twitter Revolutions).

Finally, it contains tweets in many major languages (mainly English, Portuguese, Japanese, Spanish, German and French), so it should be possible, at least in theory, to analyze international events from different cultural perspectives.

The dataset was completed in November 2016 and, therefore, the tweet IDs it contains were publicly available at that moment. This means that there could be tweets public during that period that do not appear in the dataset and also that a substantial part of tweets in the dataset has been deleted (or locked) since 2016.

To make it easier to understand the decay of tweet IDs in the dataset, a number of representative samples (99% confidence level and ±0.5 confidence interval) are provided.

In general terms, 85.5% ±0.5 of the historical tweets were available as of May 19, 2020 (see file Twitter-historical-20060321-20090731-sample.txt). However, since the number of tweets varies greatly across the three years covered by the dataset, additional representative samples are provided for 90-day intervals (see the file 90-day-samples.zip).

In that regard, the ratio of publicly available tweets (as of May 19, 2020) is as follows:

March 21, 2006 to June 18, 2006: 88.4% ±0.5 (from 5,512 tweets).

June 18, 2006 to September 16, 2006: 82.7% ±0.5 (from 14,820 tweets).

September 16, 2006 to December 15, 2006: 85.7% ±0.5 (from 107,975 tweets).

December 15, 2006 to March 15, 2007: 88.2% ±0.5 (from 852,463 tweets).

March 15, 2007 to June 13, 2007: 89.6% ±0.5 (from 6,341,665 tweets).

June 13, 2007 to September 11, 2007: 88.6% ±0.5 (from 11,171,090 tweets).

September 11, 2007 to December 10, 2007: 87.9% ±0.5 (from 15,545,532 tweets).

December 10, 2007 to March 9, 2008: 89.0% ±0.5 (from 23,164,663 tweets).

March 9, 2008 to June 7, 2008: 66.5% ±0.5 (from 56,416,772 tweets; see below for more details on this).

June 7, 2008 to September 5, 2008: 78.3% ±0.5 (from 62,868,189 tweets; see below for more details on this).

September 5, 2008 to December 4, 2008: 87.3% ±0.5 (from 89,947,498 tweets).

December 4, 2008 to March 4, 2009: 86.9% ±0.5 (from 169,762,425 tweets).

March 4, 2009 to June 2, 2009: 86.4% ±0.5 (from 474,581,170 tweets).

June 2, 2009 to July 31, 2009: 85.7% ±0.5 (from 589,116,341 tweets).
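Assuming the "0.5 confidence interval" means a ±0.5 percentage-point margin of error, the size of each representative sample follows from the standard sample-size formula for a proportion (with finite-population correction). This is a textbook reconstruction, not the author's documented procedure:

```python
import math

def sample_size(population, z=2.576, margin=0.005, p=0.5):
    """Sample size for estimating a proportion at a given z-score and
    margin of error, with finite-population correction."""
    n0 = z ** 2 * p * (1 - p) / margin ** 2
    return math.ceil(n0 / (1 + (n0 - 1) / population))

# 99% confidence (z ~ 2.576) and a ±0.5 percentage-point interval:
print(sample_size(5_512))         # early-2006 interval: sample most of it
print(sample_size(589_116_341))   # mid-2009 interval: ~66k tweets suffice
```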

The apparent drop in available tweets from March 9, 2008 to September 5, 2008 has an easy, although embarrassing, explanation.

While cleaning the data to publish this dataset there seemed to be a gap between April 1, 2008 and July 7, 2008 (actually, the data was not missing but stored in a different backup). Since tweet IDs are easy to regenerate for that Twitter era (source code is provided in generate-ids.m), I simply produced all those that were created between those two dates. All those tweets actually existed, but a number of them were obviously private and not crawlable. For those regenerated IDs the actual ratio of public tweets (as of May 19, 2020) is 62.3% ±0.5.

In other words, what you see in that period (April to July, 2008) is not actually a huge number of tweets having been deleted but the combination of deleted and non-public tweets (whose IDs should not be in the dataset for performance purposes when rehydrating the dataset).
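The author's generate-ids.m is not reproduced here, but the idea behind it can be sketched: in that pre-Snowflake era tweet IDs were sequential integers, so every candidate ID in a gap can be enumerated between two known boundary IDs (the boundary values below are illustrative):

```python
def regenerate_ids(last_id_before_gap, first_id_after_gap):
    """Yield every candidate tweet ID strictly inside the gap.
    Many of these will be private or deleted tweets, as noted above."""
    for tweet_id in range(last_id_before_gap + 1, first_id_after_gap):
        yield tweet_id

print(list(regenerate_ids(1000, 1005)))  # [1001, 1002, 1003, 1004]
```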

Additionally, given that not everybody will need the whole period of time, the earliest tweet ID for each date is provided in the file date-tweet-id.tsv.
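Because date-tweet-id.tsv maps each date to its earliest tweet ID, the dataset can be sliced by date range with a binary search over those boundaries. The file layout assumed below (one date and one first-ID per row) and the sample numbers are illustrative, not real values from the dataset:

```python
import bisect

def id_range(rows, start_date, end_date):
    """rows: sorted (ISO-date, first_tweet_id) pairs, as in date-tweet-id.tsv.
    Return the [lo, hi) tweet-ID range covering start_date <= d < end_date."""
    dates = [d for d, _ in rows]
    i = bisect.bisect_left(dates, start_date)
    j = bisect.bisect_left(dates, end_date)
    lo = rows[i][1]
    hi = rows[j][1] if j < len(rows) else float("inf")
    return lo, hi

# Illustrative numbers only -- not real boundary IDs from the dataset:
rows = [("2008-11-03", 900), ("2008-11-04", 978), ("2008-11-05", 1500)]
print(id_range(rows, "2008-11-04", "2008-11-05"))  # (978, 1500)
```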

For additional details regarding this dataset please see: Gayo-Avello, Daniel. "How I Stopped Worrying about the Twitter Archive at the Library of Congress and Learned to Build a Little One for Myself." arXiv preprint arXiv:1611.08144 (2016).

If you use this dataset in any way please cite that preprint (in addition to the dataset itself).

If you need to contact me you can find me as @PFCdgayo in Twitter.
