Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
The Customer Support on Twitter dataset is a large, modern corpus of tweets and replies intended to aid innovation in natural language understanding and conversational models, and to support the study of modern customer support practices and their impact.
![Example Analysis - Inbound Volume for the Top 20 Brands](https://i.imgur.com/nTv3Iuu.png)
Natural language remains the densest encoding of human experience we have, and innovation in NLP has accelerated to power understanding of that data, but the datasets driving this innovation don't match the real language in use today. The Customer Support on Twitter dataset offers a large corpus of modern (mostly English) conversations between consumers and customer support agents on Twitter, and has three important advantages over other conversational text datasets:
The size and breadth of this dataset inspires many interesting questions:
The dataset is a CSV in which each row is a tweet; the columns are described below. Every conversation included has at least one request from a consumer and at least one response from a company. Company user IDs can be identified using the inbound field.
tweet_id: A unique, anonymized ID for the tweet. Referenced by response_tweet_id and in_response_to_tweet_id.
author_id: A unique, anonymized user ID. @mentions in the dataset have been replaced with the associated anonymized user IDs.
inbound: Whether the tweet is "inbound" to a company doing customer support on Twitter. This feature is useful when re-organizing data for training conversational models.
created_at: Date and time when the tweet was sent.
text: Tweet content. Sensitive information like phone numbers and email addresses is replaced with mask values like _email_.
response_tweet_id: IDs of tweets that are responses to this tweet, comma-separated.
in_response_to_tweet_id: ID of the tweet this tweet is in response to, if any.
Know of other brands the dataset should include? Found something that needs to be fixed? Start a discussion, or email me directly at $FIRSTNAME@$LASTNAME.com!
A huge thank you to my friends who helped bootstrap the list of companies that do customer support on Twitter! There are many rocks that would have been left un-turned were it not for your suggestions!
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
Kaggle has fixed the issue with gzip files, and Version 510 should now contain properly working files.
Please use version 508 of the dataset, as 509 is broken. A properly working version is linked here: https://www.kaggle.com/datasets/bwandowando/ukraine-russian-crisis-twitter-dataset-1-2-m-rows/versions/508
The context and history of the ongoing conflict can be found at https://en.wikipedia.org/wiki/2022_Russian_invasion_of_Ukraine.
[Jun 16] (🌇Sunset) Twitter has finally pulled the plug on all of my remaining Twitter API accounts as part of its push for developers to migrate to the new API. The last tweets I pulled were dated Jun 14, and there is no more data from Jun 15 onwards. It was fun while it lasted, and I hope this dataset has helped, and will continue to help, a lot of you. I'll leave the dataset here for future download and reference. Thank you all!
[Apr 19] Two additional developer accounts have been permanently suspended; expect a lower throughput in the next few weeks. I will pull data until they ban my last account.
[Apr 08] I woke up this morning and saw that Twitter has banned/permanently suspended 4 of my developer accounts. I have a few more, but it is just a matter of time until all my accounts are banned as well. This was a fun project that I maintained for as long as I could. I will pull data until my last account gets banned.
[Feb 26] I've started to pull in RETWEETS again, so I expect a significantly higher throughput of tweets on top of the dedicated processes that gather NON-RETWEETS. If you don't want RETWEETS, just filter them out.
[Feb 24] It's been a year since I started collecting tweets about this conflict, and I had no idea that a year later it would still be ongoing. Almost everyone assumed Ukraine would crumble in a matter of days, but that has not been the case. To those who have been using my dataset, I hope it is helping you in one way or another. I'll do my best to keep updating this dataset for as long as I can.
[Feb 02] I seem to be getting fewer tweets as my crawlers are being throttled; I used to get 2,500 tweets per 15 minutes, but around 2-3 of my crawlers are now hitting throttling-limit errors. Twitter may have made some kind of update to its rate limits. I will try to find ways to increase the throughput again.
[Jan 02] All new dataset files will now be prefixed with the date; for Jan 01, 2023, the file will be 20230101_XXXX.
[Dec 28] For those looking for a cleaned version of my dataset, with the retweets from before Aug 08 removed, here is a dataset by @vbmokin: https://www.kaggle.com/datasets/vbmokin/russian-invasion-ukraine-without-retweets
[Nov 19] I noticed that one of my developer accounts, which ISN'T TWEETING ANYTHING and is just pulling data out of Twitter, has been permanently banned by Twitter.com, hence the decrease in unique tweets. I will try to come up with a solution to increase my throughput and sign up for a new developer account.
[Oct 19] I just noticed that this dataset is finally "GOLD", after roughly seven months since I first uploaded my gzipped csv files.
[Oct 11] Sudden spike in number of tweets revolving around most recent development(s) about the Kerch Bridge explosion and the response from Russia.
[Aug 19 - IMPORTANT] I raised the missing-dataset issue with the Kaggle team, and they confirmed it was a bug introduced by a ReactJS upgrade; the conversation and details can be seen here: https://www.kaggle.com/discussions/product-feedback/345915 . It has already been fixed, and I've re-uploaded all the gzipped files that were lost PLUS the new files that were generated AFTER the issue was identified.
[Aug 17] The latest version of my dataset seems to have lost around 100+ files. Fortunately, this dataset is versioned, so you can go back to previous version(s) and download them. Version 188 HAS ALL THE LOST FILES. I won't be re-uploading all the datasets, as it would be tedious and I've already deleted them locally (I only store the latest 2-3 days).
[Aug 10] 3/5 of my Python processes errored out, resulting in around 10-12 hours of NO data gathering for those processes, hence the sharp decrease in tweets for the Aug 09 dataset. I've added exception/error handling to prevent this from happening again.
[Aug 09] Significant drop in tweets extracted, but I am now getting ORIGINAL/ NON-RETWEETS.
[Aug 08] I've noticed that I had a spike of Tweets extracted, but they are literally thousands of retweets of a single original tweet. I also noticed that my crawlers seem to deviate because of this tactic being used by some Twitter users where they flood Twitter w...
*** Fake News on Twitter ***
These 5 datasets are the results of an empirical study of the spreading process of newly emerged fake news on Twitter. In particular, we focused on fake news stories that gave rise to a simultaneous spread of the truth against them. The story behind each fake news item is as follows:
1- FN1: A Muslim waitress refused to seat a church group at a restaurant, claiming "religious freedom" allowed her to do so.
2- FN2: Actor Denzel Washington said electing President Trump saved the U.S. from becoming an "Orwellian police state."
3- FN3: Joy Behar of "The View" sent a crass tweet about a fatal fire in Trump Tower.
4- FN4: The animated children's program 'VeggieTales' introduced a cannabis character in August 2018.
5- FN5: In September 2018, the University of Alabama football program ended its uniform contract with Nike, in response to Nike's endorsement deal with Colin Kaepernick.
The data collection was done in two stages, each of which produced a dataset: 1) obtaining the Dataset of Diffusion (DD), which includes information on fake news/truth tweets and retweets; 2) querying the neighbors of the tweet spreaders, which provides the Dataset of Graph (DG).
DD
DD for each fake news story is an Excel file named FNx_DD, where x is the number of the fake news story. The structure of the Excel files is as follows:
Each row corresponds to one captured tweet/retweet related to the rumor, and each column presents a specific piece of information about it. From left to right, the columns are:
User ID (the user who posted the current tweet/retweet)
The description sentence in the profile of the user who published the tweet/retweet
The number of tweets/retweets published by the user at the time of posting the current tweet/retweet
Date and time of creation of the account from which the current tweet/retweet was posted
Language of the tweet/retweet
Number of followers
Number of followings (friends)
Date and time of posting the current tweet/retweet
Number of likes (favorites) the current tweet had acquired before it was crawled
Number of times the current tweet had been retweeted before it was crawled
Whether another tweet is embedded in the current tweet/retweet (for example, when the current tweet is a quote, reply, or retweet)
The source (OS) of the device from which the current tweet/retweet was posted
Tweet/Retweet ID
Retweet ID (if the post is a retweet, the ID of the tweet retweeted by the current post)
Quote ID (if the post is a quote, the ID of the tweet quoted by the current post)
Reply ID (if the post is a reply, the ID of the tweet replied to by the current post)
Frequency of tweet occurrences, i.e., the number of times the current tweet is repeated in the dataset (for example, the number of times a tweet appears in the dataset as a retweet posted by others)
State of the tweet, which takes one of the following values (reached by agreement between the annotators):
r : The tweet/retweet is a fake news post
a : The tweet/retweet is a truth post
q : The tweet/retweet questions the fake news and neither confirms nor denies it
n : The tweet/retweet is not related to the fake news (it contains queries related to the rumor but does not refer to the given fake news)
DG
DG for each fake news contains two files:
A file in graph format (.graph) that contains the graph information, such as who is linked to whom (named FNx_DG.graph, where x is the number of the fake news story)
A file in JSONL format (.jsonl) that contains the real user IDs of the nodes in the graph file (named FNx_Labels.jsonl, where x is the number of the fake news story)
In the graph file, the label of each node is the order in which it entered the graph. For example, if the node with user ID 12345637 is the first node entered into the graph file, its label in the graph is 0 and its real ID (12345637) is at row number 1 of the JSONL file (row number 0 holds the column labels); the remaining node IDs follow in subsequent rows, one user ID per row. Therefore, to find the user ID of, say, node 200 (labeled 200 in the graph), look at row number 201 of the JSONL file.
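A minimal sketch of this label-to-ID lookup (the per-line JSON shape and the sample IDs below are assumptions for illustration; check them against the actual FNx_Labels.jsonl files):

```python
import json

# Invented JSONL content: a header row of column labels, then one user ID
# per line, in node-label order.
jsonl_text = """{"label": "user_id"}
12345637
98765432
"""

lines = jsonl_text.splitlines()
# Skip the header row so that position k in `ids` is the ID of node label k.
ids = [json.loads(line) for line in lines[1:]]

def user_id_of(node_label):
    """Return the real (anonymized) user ID for a graph node label."""
    return ids[node_label]

print(user_id_of(0))  # 12345637
```

With the header skipped, node label k maps to file row k + 1, matching the description above.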
The user IDs of the spreaders in DG (those who have a post in DD) are available in DD, which provides extra information about them and their tweets/retweets. The other user IDs in DG are the neighbors of these spreaders and may not exist in DD.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Many studies of political phenomena in social media require operationalizing the notion of the political stance of the users and content involved. Relevant examples include the study of segregation and polarization online, the study of political diversity in social media content diets, and AI explainability. While many research designs rely on operationalizations best suited to the US setting, few allow for more general designs in which users and content may take stances on multiple ideological and issue dimensions, going beyond the traditional Liberal-Conservative or Left-Right scales. To advance the study of more general online ecosystems, we present a dataset of the population of users in the French political Twittersphere on X/Twitter, and of web domains, embedded in a political space spanned by dimensions measuring attitudes towards immigration, the EU, liberal values, elites and institutions, nationalism, and the environment. We provide several benchmarks validating the positions of these entities (based on both LLM and human annotations) and discuss several applications of this dataset.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
*** Newly Emerged Rumors in Twitter ***
These 12 datasets are the results of an empirical study of the spreading process of newly emerged rumors on Twitter. Newly emerged rumors are those whose rise and fall happen within a short period of time, in contrast to long-standing rumors. In particular, we focused on newly emerged rumors that gave rise to a simultaneous anti-rumor spreading against them. The story behind each rumor is as follows:
1- Dataset_R1 : The National Football League team in Washington D.C. changed its name to Redhawks.
2- Dataset_R2 : A Muslim waitress refused to seat a church group at a restaurant, claiming "religious freedom" allowed her to do so.
3- Dataset_R3 : Facebook CEO Mark Zuckerberg bought a "super-yacht" for $150 million.
4- Dataset_R4 : Actor Denzel Washington said electing President Trump saved the U.S. from becoming an "Orwellian police state."
5- Dataset_R5 : Joy Behar of "The View" sent a crass tweet about a fatal fire in Trump Tower.
6- Dataset_R6 : Harley-Davidson's chief executive officer Matthew Levatich called President Trump "a moron."
7- Dataset_R7 : The animated children's program 'VeggieTales' introduced a cannabis character in August 2018.
8- Dataset_R8 : Michael Jordan resigned from the board at Nike and took his Air Jordan line of apparel with him.
9- Dataset_R9 : In September 2018, the University of Alabama football program ended its uniform contract with Nike, in response to Nike's endorsement deal with Colin Kaepernick.
10- Dataset_R10 : During confirmation hearings for Supreme Court nominee Brett Kavanaugh, congressional Democrats demanded that the nominee undergo DNA testing to prove he is not Adolf Hitler.
11- Dataset_R11 : Singer Michael Bublé's upcoming album will be his last, as he is retiring from making music.
12- Dataset_R12 : A screenshot from MyLife.com confirms that mail bomb suspect Cesar Sayoc was registered as a Democrat.
The structure of the Excel files for each dataset is as follows:
- Each row corresponds to one captured tweet/retweet related to the rumor, and each column presents a specific piece of information about it. From left to right, the columns are:
- User ID (the user who posted the current tweet/retweet)
- The description sentence in the profile of the user who published the tweet/retweet
- The number of tweets/retweets published by the user at the time of posting the current tweet/retweet
- Date and time of creation of the account from which the current tweet/retweet was posted
- Language of the tweet/retweet
- Number of followers
- Number of followings (friends)
- Date and time of posting the current tweet/retweet
- Number of likes (favorites) the current tweet had acquired before it was crawled
- Number of times the current tweet had been retweeted before it was crawled
- Whether another tweet is embedded in the current tweet/retweet (for example, when the current tweet is a quote, reply, or retweet)
- The source (OS) of the device from which the current tweet/retweet was posted
- Tweet/Retweet ID
- Retweet ID (if the post is a retweet, the ID of the tweet retweeted by the current post)
- Quote ID (if the post is a quote, the ID of the tweet quoted by the current post)
- Reply ID (if the post is a reply, the ID of the tweet replied to by the current post)
- Frequency of tweet occurrences, i.e., the number of times the current tweet is repeated in the dataset (for example, the number of times a tweet appears in the dataset as a retweet posted by others)
- State of the tweet, which takes one of the following values (reached by agreement between the annotators):
r : The tweet/retweet is a rumor post
a : The tweet/retweet is an anti-rumor post
q : The tweet/retweet questions the rumor and neither confirms nor denies it
n : The tweet/retweet is not related to the rumor (it contains queries related to the rumor but does not refer to this specific rumor)
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
We acquired the data from the George Washington University Libraries Dataverse: the Climate Change Tweets Ids [Data set]. This dataset was collected from the Twitter API using Social Feed Manager and totalled 39,622,026 tweets related to climate change. The tweets were collected between September 21, 2017 and May 17, 2019; however, there is a gap in data collection between January 7, 2019 and April 17, 2019. Tweets with the following hashtags and keywords were scraped: climatechange, #climatechangeisreal, #actonclimate, #globalwarming, #climatechangehoax, #climatedeniers, #climatechangeisfalse, #globalwarminghoax, #climatechangenotreal, climate change, global warming, climate hoax. Due to Twitter's Developer Policy, only the tweet IDs were shared in the database, not the full tweets, so we had to hydrate the tweet IDs using the Hydrator application. Hydration was carried out by us in June 2020 and yielded 22,564,380 tweets (some tweets or user accounts are deleted or suspended by Twitter in its standard maintenance procedures). The main challenge during hydration was dealing with those deleted tweets and suspended accounts; we recovered as much data as possible within the constraints of Twitter's Developer Policy.

In order to comprehensively diagnose Polish social networks, and to enable automated classification of Twitter users by their attitude towards vaccinations, we collected a balanced, importance-wise database of Twitter users for manual annotation. The most important keywords used by groups that spread anti-vaccination propaganda were identified. Using our programming pipeline, databases of Polish social media on the topic of the pandemic and attitudes towards vaccinations were obtained.
The raw data contained over 5 million tweets from almost 3,600 users with the following hashtags related to the COVID-19 pandemic in Poland and the war in Ukraine: stopsegregacjisanitarnej, nieszczepimysie, szczepimysie, szczepienie, szczepienia, koronawirus, koronawiruswpolsce, koronawiruspolska, rozliczymysanitarystow, stopss, covid, covid19, sanitaryzm, epidemia, pandemia, plandemia, zelensky, zelenski, wojna, muremzabraunem, konfederacja, wojnanaukrainie, putin, ukraina, ukraine, rosja, russia, wolyn, bandera, upa. Twelve annotators rated the scraped Twitter users based on their posts on a nine-point Likert scale. The samples evaluated by the annotators partially overlapped in order to examine their consistency and reliability. Statistical tests performed on the data before and after binning (in three- and two-category versions) confirmed significant annotator agreement: Fleiss' kappa, Randolph's kappa, Krippendorff's alpha, and intraclass correlation coefficients all indicate non-random agreement among the competent judges (annotators).

Our initial data acquisition based on the abovementioned hashtags yielded 5,308,997 posts. To focus specifically on discussions related to COVID-19 and the war in Ukraine, we implemented a filtering process using Polish word stems relevant to these topics, which reduced our dataset to 4,840,446 posts. The filtering was performed using regular expressions based on lemmatized versions of key terms: for war-related content, stems such as 'wojna' (war), 'inwazj' (invasion), 'ukrai' (Ukraine), and 'putin'; for COVID-related content, stems such as 'mask' (mask), 'szczepi' (vaccine), and 'koronawirus' (coronavirus). This approach captures the various grammatical forms of these words. Following this initial filtering, we removed three users who had no posts related to either COVID-19 or the war in Ukraine, leaving 3,597 users and 4,839,995 posts.
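The stem-based filter described above can be sketched roughly as follows (this is an illustrative reconstruction, not the authors' code; the stem lists are taken from the text, the sample posts are invented):

```python
import re

# Polish word stems named in the text (a subset, for illustration).
war_stems = ["wojna", "inwazj", "ukrai", "putin"]
covid_stems = ["mask", "szczepi", "koronawirus"]

# One case-insensitive pattern matching any stem anywhere in a post.
pattern = re.compile("|".join(war_stems + covid_stems), re.IGNORECASE)

posts = [
    "Wojna na Ukrainie trwa",      # war-related: matches 'wojna' and 'ukrai'
    "Szczepienia przeciw COVID",   # vaccine-related: matches 'szczepi'
    "Dzisiaj ladna pogoda",        # unrelated: no stem matches
]

topical = [p for p in posts if pattern.search(p)]
print(len(topical))  # 2
```

Matching on stems rather than full words is what lets a single pattern catch the many inflected forms Polish words take.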
Finally, to ensure consistency in our analysis, we selected only posts in the Polish language, resulting in a dataset of 3,577,040 posts from 3,597 users. Before the content analysis was performed, the text was lemmatized, and special characters, links, and low-importance words from a stop list (e.g. conjunctions) were removed. Data preprocessing was carried out in Python using specific libraries and our original code.

The hydrated tweets were further cleaned by removing duplicates and all tweets without an English language label. Some characters and technical expressions were then replaced with natural-language terms (e.g., changing "&" into "and"). We also created several versions of the database for various purposes: in some we replaced emoji pictures with their descriptions (using the demoji library and our original code); in others we removed the emojis, hyperlinks, and special characters. The resulting dataset comprises 24,083,452 tweets (7,741,602 tweets without retweets), which makes it the biggest database of social media data on climate change analyzed to date.

We created the social network directed graph with the RAPIDS cuGraph library in Python for most of the network-statistics calculations, and also with graph-tool. The final graph visualization was created with Gephi after preparing and filtering the data in Python.
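The text-cleaning steps described above (replacing "&" with "and", dropping hyperlinks and special characters) can be sketched with stdlib regexes alone; this is a simplified stand-in for the authors' pipeline and omits lemmatization, stop-word removal, and emoji handling:

```python
import re

def clean(text):
    """Simplified tweet cleaner: '&' -> 'and', strip URLs and punctuation."""
    text = text.replace("&", "and")            # technical expression -> word
    text = re.sub(r"https?://\S+", "", text)   # drop hyperlinks
    text = re.sub(r"[^\w\s]", "", text)        # drop special characters
    return " ".join(text.split()).lower()      # normalize whitespace and case

print(clean("Climate change & CO2: read https://example.com NOW!"))
# -> climate change and co2 read now
```

In the actual pipeline, emoji handling would additionally use the demoji library as the text notes.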
The final graph had 4,398,368 nodes and 18,595,472 edges after removing duplicates and self-loops. The final label of "believer," "denier," or "neutral/unknown" was assigned to each annotated user by averaging the results from multiple annotators.

In the Ukraine dataset, the 'anti' group refers to various tactics of information warfare aimed at discrediting Ukraine's sovereignty and legitimacy, whereas the 'pro' group consists of tweets that support Ukraine's sovereignty and legitimacy. In the Vaccine dataset, 'anti' denotes users who publish tweets against vaccination, while 'pro' users advocate for vaccination programs. In the Climate Change dataset, 'denier' users dismiss climate change as a conspiracy theory, while 'believer' users perceive it as a serious threat to the future of humanity.

For the Climate Change dataset, creationdate indicates when the connection between two users was established. The user1 and user2 fields are anonymized unique IDs representing the source and target users, respectively. user1status denotes whether user1 is a believer (1), neutral (2), or denier (3). creationday is a numeric value tied to the creation date. The onset and terminus fields mark the first and last days of any recorded interaction between user1 and user2, and duration captures the total time they have interacted. Finally, the w field indicates the number of interactions (such as replies, retweets, or direct messages) exchanged between them on Twitter.

In the Ukraine war and Vaccine datasets, createdate indicates the date of the interaction. The likecount, retweetcount, replycount, and quotecount columns capture engagement metrics on Twitter: how many times a tweet is liked, retweeted, replied to, or quoted. The user1 and user2 fields store unique user IDs, whereas user1proukraine, user1provaccine, user2proukraine, and user2provaccine denote each user's stance (e.g., pro, anti, or unknown) on Ukraine and vaccines. creationday is a numeric value corresponding to the creation date, while onset and terminus mark the first and last recorded interactions between user1 and user2. Finally, duration shows the total time span across which these interactions took place.
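A small sketch of reading the edge-level fields described above (column names follow the text; the sample rows and values are invented):

```python
import csv
import io

# Invented edge rows using the documented Climate Change columns.
edges_csv = """user1,user2,user1status,onset,terminus,w
a1,b2,1,3,10,5
c3,d4,3,7,7,1
"""

rows = list(csv.DictReader(io.StringIO(edges_csv)))

# duration = terminus - onset, per the field descriptions above.
durations = [int(r["terminus"]) - int(r["onset"]) for r in rows]
print(durations)  # [7, 0]
```

The same DictReader pattern applies to the Ukraine war and Vaccine edge lists, just with their stance columns (user1proukraine, user1provaccine, etc.) in place of user1status.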
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
We estimated ideological preferences of 3.8 million Twitter users and, using a dataset of 150 million tweets concerning 12 political and non-political issues, explored whether online communication resembles an “echo chamber” due to selective exposure and ideological segregation or a “national conversation.” We observed that information was exchanged primarily among individuals with similar ideological preferences for political issues (e.g., presidential election, government shutdown) but not for many other current events (e.g., Boston marathon bombing, Super Bowl). Discussion of the Newtown shootings in 2012 reflected a dynamic process, beginning as a “national conversation” before being transformed into a polarized exchange. With respect to political and non-political issues, liberals were more likely than conservatives to engage in cross-ideological dissemination, highlighting an important asymmetry with respect to the structure of communication that is consistent with psychological theory and research. We conclude that previous work may have overestimated the degree of ideological segregation in social media usage.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The dataset consists of three files:

1. anonymized_shares.json: A collection of sharing actions, each corresponding to a tweet posted in June 2017 in which one or more URLs were shared. The format is as follows:

{ uid1: [ {"domains": ["domain1", "domain2"]}, {"domains": ["domain3"], "retweeted": retweeted_uid}, {"domains": ["domain4"], "quoted": quoted_uid}, ... ], uid2: [ ... ], ... }

That is, for each sharing action we have: the list of domains from which links were shared; if the tweet was a retweet, the ID of the user who created the original tweet; and if the tweet was quoting another tweet, the ID of the user being quoted. All user IDs were anonymized and are not traceable to Twitter user IDs. The tweets were collected from the Social Media Observatory at Indiana University.

2. anonymized_friends.json: For each user in the dataset, the list of their friends (followees) as given by the friends/ids Twitter API endpoint. The format is as follows:

{ uid1: [friend_uid1, friend_uid2, ...], uid2: [...], ... }

3. measures.tab: A TAB-separated file with partisanship and misinformation scores for each anonymized user. All user IDs were anonymized and are not traceable to Twitter user IDs.
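A minimal sketch of consuming the anonymized_shares.json structure described above (the inline sample mirrors the documented format with made-up IDs and domains; it is not real data):

```python
import json
from collections import Counter

shares = json.loads("""
{
  "uid1": [
    {"domains": ["domain1", "domain2"]},
    {"domains": ["domain3"], "retweeted": "uid9"}
  ],
  "uid2": [
    {"domains": ["domain1"], "quoted": "uid1"}
  ]
}
""")

# Count how often each domain was shared across all users' actions.
domain_counts = Counter(
    d for actions in shares.values() for a in actions for d in a["domains"]
)
print(domain_counts["domain1"])  # 2
```

The optional "retweeted"/"quoted" keys can be read with `a.get("retweeted")` since not every action has them.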
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
The image at the top of the page is a frame from today's (7/26/2016) Isis #TweetMovie from Twitter, a "normal" day on which two Isis operatives murdered a priest saying mass in a French church (you can see this in the center left). A selection of data from this site is being made available here to Kaggle users.
UPDATE: An excellent study by Audrey Alexander titled "Digital Decay?" is now available, which traces the change over time among English-language Islamic State sympathizers on Twitter.
This data set is intended to be a counterpoise to the How Isis Uses Twitter data set. That data set contains 17k tweets alleged to originate with "100+ pro-ISIS fanboys". This new set contains 122k tweets collected on two separate days, 7/4/2016 and 7/11/2016, which contained any of the following terms, with no further editing or selection:
This is not a perfect counterpoise as it almost surely contains a small number of pro-Isis fanboy tweets. However, unless some entity, such as Kaggle, is willing to expend significant resources on a service something like an expert level Mechanical Turk or Zooniverse, a high quality counterpoise is out of reach.
A counterpoise provides a balance or backdrop against which to measure a primary object, in this case the original pro-Isis data. So if you want to discriminate between pro-Isis tweets and other tweets concerning Isis, you will need to model the original pro-Isis data (the signal) against the counterpoise, which is signal + noise. Further background and some analysis can be found in this forum thread.
This data comes from postmodernnews.com/token-tv.aspx which daily collects about 25MB of Isis tweets for the purposes of graphical display. PLEASE NOTE: This server is not currently active.
There are several differences between the format of this data set and the pro-ISIS fanboy dataset:
1. All the Twitter t.co tags have been expanded where possible.
2. There are no "description, location, followers, numberstatuses" data columns.
I have also included my version of the original pro-ISIS fanboy set. This version has all the t.co links expanded where possible.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overview
The Controllable Multimodal Feedback Synthesis (CMFeed) Dataset is designed to enable the generation of sentiment-controlled feedback from multimodal inputs, including text and images. This dataset can be used to train feedback synthesis models in both uncontrolled and sentiment-controlled manners. Serving a crucial role in advancing research, the CMFeed dataset supports the development of human-like feedback synthesis, a novel task defined by the dataset's authors. Additionally, the corresponding feedback synthesis models and benchmark results are presented in the associated code and research publication.
Task Uniqueness: The task of controllable multimodal feedback synthesis is unique, distinct from LLMs and tasks like VisDial, and not addressed by multi-modal LLMs. LLMs often exhibit errors and hallucinations, owing to their auto-regressive and black-box nature, which can obscure the influence of different modalities on the generated responses [Ref1; Ref2]. Our approach includes an interpretability mechanism, as detailed in the supplementary material of the corresponding research publication, demonstrating how metadata and multimodal features shape responses and learn sentiments. This controllability and interpretability aim to inspire new methodologies in related fields.
Data Collection and Annotation
Data was collected by crawling Facebook posts from major news outlets, adhering to ethical and legal standards. The comments were annotated using four sentiment analysis models: FLAIR, SentimentR, RoBERTa, and DistilBERT. Facebook was chosen for dataset construction for the following reasons:
• Facebook uniquely provides metadata such as the news article link, post shares, post reactions, comment likes, comment rank, comment reaction rank, and relevance scores, which are not available on other platforms.
• Facebook is the most used social media platform, with 3.07 billion monthly users, compared to 550 million on Twitter and 500 million on Reddit. [Ref]
• Facebook is popular across all age groups (18-29, 30-49, 50-64, 65+), with at least 58% usage, compared to 6% for Twitter and 3% for Reddit. [Ref]. Trends are similar for gender, race, ethnicity, income, education, community, and political affiliation [Ref]
• The male-to-female user ratio on Facebook is 56.3% to 43.7%; on Twitter, it's 66.72% to 23.28%; Reddit does not report this data. [Ref]
Filtering Process: To ensure high-quality and reliable data, the dataset underwent two levels of filtering:
a) Model Agreement Filtering: Retained only comments where at least three out of the four models agreed on the sentiment.
b) Probability Range Safety Margin: Comments with a sentiment probability between 0.49 and 0.51, indicating low confidence in sentiment classification, were excluded.
After filtering, 4,512 samples were marked as XX. Though these samples have been released for the reader's understanding, they were not used in training the feedback synthesis model proposed in the corresponding research paper.
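The two-stage filtering described above can be sketched as follows. This is a minimal illustration, assuming each comment carries one positive-sentiment probability per model; the released dataset already contains the resulting labels, so this is not the authors' exact implementation.

```python
def filter_comment(model_probs, margin=(0.49, 0.51)):
    """Apply the two filtering stages to one comment.

    model_probs: four positive-sentiment probabilities, one per model
    (a hypothetical representation of the model outputs).
    Returns 1 (positive), 0 (negative), or None if excluded.
    """
    # Stage b) probability safety margin: drop low-confidence scores
    lo, hi = margin
    if any(lo <= p <= hi for p in model_probs):
        return None
    # Stage a) model agreement: at least 3 of the 4 models must agree
    votes = [1 if p > 0.5 else 0 for p in model_probs]
    positives = sum(votes)
    if positives >= 3:
        return 1
    if positives <= 1:   # i.e. at least 3 negative votes
        return 0
    return None          # 2-2 split: no consensus

# Three confident positives, one negative -> kept as positive
print(filter_comment([0.9, 0.8, 0.7, 0.2]))   # 1
# A 0.50 probability falls inside the safety margin -> excluded
print(filter_comment([0.9, 0.8, 0.50, 0.2]))  # None
```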
Dataset Description
• Total Samples: 61,734
• Total Samples Annotated: 57,222 after filtering.
• Total Posts: 3,646
• Average Likes per Post: 65.1
• Average Likes per Comment: 10.5
• Average Length of News Text: 655 words
• Average Number of Images per Post: 3.7
Components of the Dataset
The dataset comprises two main components:
• CMFeed.csv File: Contains metadata, comment, and reaction details related to each post.
• Images Folder: Contains folders with images corresponding to each post.
Data Format and Fields of the CSV File
The dataset is structured in the CMFeed.csv file, with corresponding images in related folders. The CSV file includes the following fields:
• Id: Unique identifier
• Post: The heading of the news article.
• News_text: The text of the news article.
• News_link: URL link to the original news article.
• News_Images: A path to the folder containing images related to the post.
• Post_shares: Number of times the post has been shared.
• Post_reaction: A JSON object capturing reactions (like, love, etc.) to the post and their counts.
• Comment: Text of the user comment.
• Comment_like: Number of likes on the comment.
• Comment_reaction_rank: A JSON object detailing the type and count of reactions the comment received.
• Comment_link: URL link to the original comment on Facebook.
• Comment_rank: Rank of the comment based on engagement and relevance.
• Score: Sentiment score computed based on the consensus of sentiment analysis models.
• Agreement: Indicates the consensus level among the sentiment models, ranging from -4 (all four negative) to 4 (all four positive). For example, three negative and one positive results in -2, and three positive and one negative results in +2.
• Sentiment_class: Categorizes the sentiment of the comment into 1 (positive) or 0 (negative).
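Rows of CMFeed.csv can be read with standard tooling, keeping in mind that the reaction fields hold JSON objects. A minimal sketch using the field names documented above; the values below are illustrative stand-ins, not real data:

```python
import csv, io, json

# In-memory stand-in for a slice of CMFeed.csv (illustrative values only).
sample = io.StringIO(
    'Id,Comment,Post_reaction,Agreement,Sentiment_class\n'
    '1,"Great reporting!","{""like"": 120, ""love"": 30}",4,1\n'
    '2,"This is misleading.","{""like"": 12, ""angry"": 44}",-2,0\n'
)

rows = list(csv.DictReader(sample))
for row in rows:
    # Post_reaction is stored as a JSON object of reaction counts
    reactions = json.loads(row['Post_reaction'])
    # Agreement ranges from -4 (all models negative) to 4 (all positive);
    # its sign matches Sentiment_class (1 positive, 0 negative)
    agreement = int(row['Agreement'])
    label = 'positive' if agreement > 0 else 'negative'
    print(row['Id'], label, sum(reactions.values()))
```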
More Considerations During Dataset Construction
We thoroughly considered issues such as the choice of social media platform for data collection, bias and generalizability of the data, selection of news handles/websites, ethical protocols, privacy and potential misuse before beginning data collection. While achieving completely unbiased and fair data is unattainable, we endeavored to minimize biases and ensure as much generalizability as possible. Building on these considerations, we made the following decisions about data sources and handling to ensure the integrity and utility of the dataset:
• Why not merge data from different social media platforms? We chose not to merge data from platforms such as Reddit and Twitter with Facebook due to the lack of comprehensive metadata, clear ethical guidelines, and control mechanisms—such as who can comment and whether users' anonymity is maintained—on these platforms other than Facebook. These factors are critical for our analysis. Our focus on Facebook alone was crucial to ensure consistency in data quality and format.
• Choice of four news handles: We selected four news handles (BBC News, Sky News, Fox News, and NY Daily News) to ensure diversity and comprehensive regional coverage. These outlets were chosen for their distinct regional focuses and editorial perspectives: BBC News is known for its global coverage with a centrist view; Sky News offers geographically targeted and politically varied content, leaning center/right across the UK/EU/US; Fox News is recognized for its right-leaning content in the US; and NY Daily News provides left-leaning coverage in New York. Many other news handles, such as NDTV, The Hindu, Xinhua, and SCMP, are also large-scale but may publish content in regional languages such as Hindi or Chinese, and hence were not selected. This selection ensures a broad spectrum of political discourse and audience engagement.
• Dataset Generalizability and Bias: With 3.07 billion of the total 5 billion social media users, the extensive user base of Facebook, reflective of broader social media engagement patterns, ensures that the insights gained are applicable across various platforms, reducing bias and strengthening the generalizability of our findings. Additionally, the geographic and political diversity of these news sources, ranging from local (NY Daily News) to international (BBC News), and spanning political spectra from left (NY Daily News) to right (Fox News), ensures a balanced representation of global and political viewpoints in our dataset. This approach not only mitigates regional and ideological biases but also enriches the dataset with a wide array of perspectives, further solidifying the robustness and applicability of our research.
• Dataset size and diversity: Facebook prohibits the automatic scraping of its users' personal data. In compliance with this policy, we manually scraped publicly available data. This labor-intensive process, requiring around 800 hours of manual effort, limited our data volume but allowed for precise selection. We followed ethical protocols for scraping Facebook data, selecting 1,000 posts from each of the four news handles to enhance diversity and reduce bias. Initially, 4,000 posts were collected; after preprocessing (detailed in Section 3.1), 3,646 posts remained. We then processed all associated comments, resulting in a total of 61,734 comments. This manual method ensures adherence to Facebook’s policies and the integrity of our dataset.
Ethical considerations, data privacy and misuse prevention
The data collection adheres to Facebook’s ethical guidelines (https://developers.facebook.com/terms/).
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains the data set developed for the paper:
“Shruti Rijhwani and Daniel Preoțiuc-Pietro. Temporally-Informed Analysis of Named Entity Recognition. In Proceedings of the Association for Computational Linguistics (ACL). 2020.”
It includes 12,000 tweets annotated for the named entity recognition task. The tweets are uniformly distributed over the years 2014-2019, with 2,000 tweets from each year. The goal is to have a temporally diverse corpus to account for data drift over time when building NER models.
The entity types annotated are locations (LOC), persons (PER) and organizations (ORG). The tweets are preprocessed to replace usernames and URLs with a unique token. Hashtags are left intact and can be annotated as named entities.
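The preprocessing described above can be approximated with a small regex pass. The exact placeholder tokens used in the released data are not specified here, so `@USER` and `HTTPURL` are assumptions for illustration:

```python
import re

def preprocess(tweet: str) -> str:
    """Replace usernames and URLs with placeholder tokens, leaving
    hashtags intact (placeholder tokens @USER and HTTPURL are assumed)."""
    tweet = re.sub(r'https?://\S+', 'HTTPURL', tweet)
    tweet = re.sub(r'@\w+', '@USER', tweet)
    return tweet

print(preprocess('Thanks @BBCNews! Story at https://t.co/abc #London'))
# -> Thanks @USER! Story at HTTPURL #London
```

Note that hashtags pass through untouched, so they remain available for annotation as named entities.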
Format
The repository contains the annotations in JSON format.
Each year-wise file has the tweet IDs along with token-level annotations. The Public Twitter Search API (https://developer.twitter.com/en/docs/tweets/search) can be used to extract the text of the tweet corresponding to each tweet ID.
Data Splits
Typically, NER models are trained and evaluated on annotations available at the model building time, but are used to make predictions on data from a future time period. This setup makes the model susceptible to temporal data drift, leading to lower performance on future data as compared to the test set.
To examine this effect, we use tweets from the years 2014-2018 as the training set and random splits of the 2019 tweets as the development and test sets. These splits simulate the scenario of making predictions on data from a future time period.
The development and test splits are provided in the JSON format.
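Assembling the temporal setup described above might look like the following sketch; the dictionary layout keyed by year is an assumption (the repository ships its own year-wise JSON files and fixed dev/test splits, which should be preferred for reproducibility):

```python
import random

def temporal_splits(annotations_by_year, seed=0):
    """2014-2018 -> training set; the 2019 tweets are shuffled and
    split in half into development and test sets (illustrative split,
    not the repository's official one)."""
    train = []
    for year in range(2014, 2019):
        train.extend(annotations_by_year[year])
    future = list(annotations_by_year[2019])
    random.Random(seed).shuffle(future)
    half = len(future) // 2
    return train, future[:half], future[half:]

# Toy example with two tweet IDs per year:
toy = {y: [f'{y}_a', f'{y}_b'] for y in range(2014, 2020)}
train, dev, test = temporal_splits(toy)
print(len(train), len(dev), len(test))  # 10 1 1
```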
Use
Please cite the data set and the accompanying paper if you found the resources in this repository useful.
The number of Instagram users in the United Kingdom was forecast to increase continuously between 2024 and 2028 by a total of 2.1 million users (+7.02 percent). After the ninth consecutive year of growth, the Instagram user base is estimated to reach 32 million users, a new peak, in 2028. Notably, the number of Instagram users has increased continuously over the past years. User figures, shown here for the platform Instagram, have been estimated by taking into account company filings or press material, secondary research, app downloads, and traffic data. They refer to the average monthly active users over the period and count multiple accounts held by one person only once. The data shown are an excerpt of Statista's Key Market Indicators (KMI). The KMI are a collection of primary and secondary indicators on the macro-economic, demographic, and technological environment in up to 150 countries and regions worldwide. All indicators are sourced from international and national statistical offices, trade associations, and the trade press, and they are processed to generate comparable data sets (see supplementary notes under details for more information).
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
During the 2019 Australian election I noticed that almost everything I was seeing on Twitter was unusually left-wing. So I decided to scrape some data and investigate. Unfortunately my sentiment analysis has so far been too inaccurate to come to any useful conclusions. I decided to share the data so that others may be able to help with the sentiment or any other interesting analysis.
Over 180,000 tweets collected using Twitter API keyword search between 10.05.2019 and 20.05.2019. Columns are as follows:
The latitude and longitude of user_location is also available in location_geocode.csv. This information was retrieved using the Google Geocode API.
Thanks to Twitter for providing the free API.
There are a lot of interesting things that could be investigated with this data. Primarily I was interested to do sentiment analysis, before and after the election results were known, to determine whether Twitter users are indeed a left-leaning bunch. Did the tweets become more negative as the results were known?
Other ideas for investigation include:
Take into account retweets and favourites to weight overall sentiment analysis.
Which parts of the world are interested in (i.e., tweet about) the Australian elections, apart from Australia?
How do the users who tweet about this sort of thing tend to describe themselves?
Is there a correlation between when the user joined Twitter and their political views (this assumes the sentiment analysis is already working well)?
Predict gender from username/screen name and segment tweet count and sentiment by gender
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset includes partition identifiers obtained from retweet networks collected before the 2019 Finnish Parliamentary Elections and before the 2023 Finnish Parliamentary Elections from conversation streams related to Finnish political parties (SDP, National Coalition, Finns Party, Green Party, Left Party, Center Party) and salient topics (economic policy, social security, immigration, climate, education).
The column headers in the two files are the same and indicate the topic of the retweet network. Each row in one file represents all the memberships of one particular Twitter account across the corresponding networks. Memberships are obtained by partitioning each retweet network with the METIS algorithm into two similarly sized groups with minimal retweet activity between them. For each column, one partition includes all users marked with 0, whereas the other partition includes all users marked with 1. The order of the numbering is random. Missing values mean that the account was not found in the corresponding network.
The dataset does not contain any identifying information or original raw data from the Twitter platform. Anonymization was achieved by randomly shuffling the order of the users so that, within each file, a row index identifies a user. Importantly, no correspondence can be inferred between users in 2019 and users in 2023: in general, the user at row 1 in the 2019 dataset is not the same as the user at row 1 in the 2023 dataset.
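Because the 0/1 numbering is random per column, comparing two networks requires measuring agreement up to a label swap. A minimal sketch, assuming the files are CSVs with topic-named columns (the illustrative values below are made up):

```python
import csv, io

# Illustrative stand-in for one partition file: each row is one
# anonymized account, each column one retweet network; values are 0/1
# partition labels, empty when the account is absent from that network.
sample = io.StringIO(
    'immigration,climate\n'
    '0,1\n'
    '0,1\n'
    '1,0\n'
    '1,\n'
)
rows = list(csv.DictReader(sample))

def alignment(rows, col_a, col_b):
    """Fraction of shared accounts placed consistently across two
    networks, measured up to swapping one column's 0/1 labels."""
    pairs = [(r[col_a], r[col_b]) for r in rows if r[col_a] and r[col_b]]
    same = sum(a == b for a, b in pairs) / len(pairs)
    return max(same, 1 - same)

print(alignment(rows, 'immigration', 'climate'))  # 1.0
```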
The monthly summary report is intended to provide the user with a quick overview of the status of the PFL program at the state level. This summary report contains monthly information on claims activities, average weekly benefit amounts, average duration of claims, and benefits authorized. This data is used in budgetary and administrative planning, program evaluation, and reports to the Legislature and the public.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Table caption: on the left, values for all rx, ry pairs in our datasets; the highest value per column is in bold.
The number of Pinterest users in the United Kingdom was forecast to increase continuously between 2024 and 2028 by a total of 0.3 million users (+3.14 percent). After the ninth consecutive year of growth, the Pinterest user base is estimated to reach 9.88 million users, a new peak, in 2028. Notably, the number of Pinterest users has increased continuously over the past years. User figures, shown here for the platform Pinterest, have been estimated by taking into account company filings or press material, secondary research, app downloads, and traffic data. They refer to the average monthly active users over the period and count multiple accounts held by one person only once. The data shown are an excerpt of Statista's Key Market Indicators (KMI). The KMI are a collection of primary and secondary indicators on the macro-economic, demographic, and technological environment in up to 150 countries and regions worldwide. All indicators are sourced from international and national statistical offices, trade associations, and the trade press, and they are processed to generate comparable data sets (see supplementary notes under details for more information).
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is based on a content analysis of 14,807 Tweets concentrated on the main Twitter accounts run by nine alternative media sites between 2015 and 2018. Our sample was drawn from four periods: 6–25 October 2015; 9–29 October 2016; 30 April–7 June 2017 (the UK general election); and 8–28 October 2018. We chose 2015–2018 as this would provide some insight into the patterns of Twitter use and behaviour on either side of a general election (in June 2017). Tweets were collected using Twitter’s Full Archive Search API, which we accessed using Twurl to collect JSON files, subsequently converted into Excel files ready for manual coding. Our sample, therefore, represents the “full” content from each account, excluding deleted content, but including all Tweet types. In total, there were 9,284 standard Tweets, 634 quote Tweets, 1,443 reply Tweets, and 3,446 Retweets. Besides quantifying Tweets, we coded each type and its share metrics (as of August 2019). We examined the purpose of Tweets in the following ways: to share content (e.g., links to articles, videos, or images produced by the outlet that is Tweeting); to share content from other media publications; to share opinion, conjecture, speculation, viewpoints, hypotheses, or predictions; to share information (e.g., a fact, figure, report, announcement, or event); to share ad hominem, dismissive, inflammatory, sarcastic, or insulting content aimed at others; and other purposes, including promoting individuals or organisations, appeals for subscribers, running polls, etc. In practice, coding was straightforward, since the limit of characters naturally restricts Tweets from performing many functions simultaneously. Accordingly, there was no double coding, and where there was a decision to make, the more dominant Tweet function was chosen. If an opinion, for example, was in any way inflammatory and specifically directed, then this was coded as “attack”, rather than as the sharing of less contentious and less targeted opinion.
Most often, Tweets simply share online content, and this was straightforward to code. Political reference and sentiment: where a Tweet referred to a UK political party, politician, or representative, or made general references to the “left”, the “right”, or “the government”. We coded as “positive” anything supportive of a party or its associated ideology, including the validity of its policies or the behaviour of those representing it, and as “negative” anything interpreted as critical, such as a policy failing or suggestions of poor practice or corruption. Whenever there was no evaluative judgement, we coded “neutral”. Manual coding enabled us to capture nuanced versions of these categories, including sarcasm or more oblique references that could nonetheless clearly be assigned as either “positive” or “negative”. Media reference and sentiment: where a Tweet referred to the BBC, a UK media outlet or journalist, the “mainstream media”, or other alternative media outlets. As before, we coded as “positive” anything supportive of legacy media, such as the quality of their journalism, and as “negative” anything interpreted as critical of legacy media brands, perhaps pointing to “biased” coverage or no coverage of a particular issue at all; “neutral” was coded in the absence of any evaluation. The data was analysed by three coders, and intercoder reliability tests were performed on 1,485 Tweets (10% of the sample). Levels of agreement for all variables were between 89.7% and 94.3%, and Krippendorff's alpha scores ranged between 0.83 and 0.91, indicating a robust and repeatable framework and a reliable coding process.
The number of LinkedIn users in the United Kingdom was forecast to increase continuously between 2024 and 2028 by a total of 1.5 million users (+4.51 percent). After the eighth consecutive year of growth, the LinkedIn user base is estimated to reach 34.7 million users, a new peak, in 2028. User figures, shown here for the platform LinkedIn, have been estimated by taking into account company filings or press material, secondary research, app downloads, and traffic data. They refer to the average monthly active users over the period and count multiple accounts held by one person only once. The data shown are an excerpt of Statista's Key Market Indicators (KMI). The KMI are a collection of primary and secondary indicators on the macro-economic, demographic, and technological environment in up to 150 countries and regions worldwide. All indicators are sourced from international and national statistical offices, trade associations, and the trade press, and they are processed to generate comparable data sets (see supplementary notes under details for more information).
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
When do UK Twitter users (2017 to 2022) turn their central heating fully off, by month? This informal periodic survey on social media suggests that a substantial fraction of respondents (up to 10%) leave their central heating on year-round, which may lead to unnecessary energy consumption and carbon emissions.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
The Customer Support on Twitter dataset is a large, modern corpus of tweets and replies to aid innovation in natural language understanding and conversational models, and for study of modern customer support practices and impact.
Example Analysis - Inbound Volume for the Top 20 Brands: https://i.imgur.com/nTv3Iuu.png
Natural language remains the densest encoding of human experience we have, and innovation in NLP has accelerated to power understanding of that data, but the datasets driving this innovation don't match the real language in use today. The Customer Support on Twitter dataset offers a large corpus of modern (mostly English) conversations between consumers and customer support agents on Twitter, and has three important advantages over other conversational text datasets:
The size and breadth of this dataset inspires many interesting questions:
The dataset is a CSV, where each row is a tweet. The different columns are described below. Every conversation included has at least one request from a consumer and at least one response from a company. Which user IDs belong to companies can be determined using the inbound field.
tweet_id: A unique, anonymized ID for the Tweet. Referenced by response_tweet_id and in_response_to_tweet_id.
author_id: A unique, anonymized user ID. @mentions in the dataset have been replaced with their associated anonymized user IDs.
inbound: Whether the tweet is "inbound" to a company doing customer support on Twitter. This feature is useful when re-organizing data for training conversational models.
created_at: Date and time when the tweet was sent.
text: Tweet content. Sensitive information like phone numbers and email addresses is replaced with mask values like _email_.
response_tweet_id: IDs of tweets that are responses to this tweet, comma-separated.
in_response_to_tweet_id: ID of the tweet this tweet is in response to, if any.
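The schema above supports reconstructing full conversation threads. A minimal sketch under the documented column names, with company accounts inferred from the inbound flag as the description suggests (the two rows below are made-up illustrative values):

```python
import csv, io

# Illustrative rows in the documented schema (values are made up).
sample = io.StringIO(
    'tweet_id,author_id,inbound,created_at,text,'
    'response_tweet_id,in_response_to_tweet_id\n'
    '1,115712,True,Tue Oct 31 22:10 2017,@sprintcare my phone is broken,2,\n'
    '2,sprintcare,False,Tue Oct 31 22:11 2017,@115712 please DM us,,1\n'
)
rows = {r['tweet_id']: r for r in csv.DictReader(sample)}

# Company accounts are exactly the authors of non-inbound tweets.
companies = {r['author_id'] for r in rows.values() if r['inbound'] == 'False'}

def thread(tweet_id):
    """Walk in_response_to_tweet_id links back to the conversation root."""
    chain = []
    while tweet_id:
        row = rows[tweet_id]
        chain.append(row['text'])
        tweet_id = row['in_response_to_tweet_id']
    return list(reversed(chain))

print(companies)    # which user IDs do support
print(thread('2'))  # consumer request, then company response
```

The same walk in the opposite direction (following response_tweet_id, which may be comma-separated) yields the full reply tree rather than a single chain.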
Know of other brands the dataset should include? Found something that needs to be fixed? Start a discussion, or email me directly at $FIRSTNAME@$LASTNAME.com!
A huge thank you to my friends who helped bootstrap the list of companies that do customer support on Twitter! There are many rocks that would have been left unturned were it not for your suggestions!