Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset comprises 10 JSON files, each containing geographic metadata and a sentiment score collected from tweets between March 20, 2020 and December 1, 2020 pertaining to the COVID-19 global pandemic for ten of the most populous cities in the United States and Canada.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Charlottesville is home to a statue of Robert E. Lee which is slated to be removed. (For those unfamiliar with American history, Robert E. Lee was a US Army general who defected to the Confederacy during the American Civil War and was considered to be one of their best military leaders.) While many Americans support the move, believing the main purpose of the Confederacy was to defend the institution of slavery, many others do not share this view. Furthermore, believing Confederate symbols to be merely an expression of Southern pride, many have not taken its planned removal lightly.
As a result, many people--including white nationalists and neo-Nazis--have descended on Charlottesville to protest its removal. This in turn attracted many counter-protestors. Tragically, one of the counter-protestors--Heather Heyer--was killed and many others injured after a man intentionally rammed his car into them. In response, President Trump blamed "both sides" for the chaos in Charlottesville, leading many Americans to denounce him for what they see as a soft-handed approach to what some have called an act of "domestic terrorism."
The dataset below captures the discussion--and copious amounts of anger--revolving around this past week's events.
This data set consists of a random sample of 50,000 tweets per day (in accordance with the Twitter Developer Agreement) mentioning Charlottesville or containing "#charlottesville", extracted via the Twitter Streaming API, starting on August 15. The files were copied from a large Postgres database containing--currently--over 2 million tweets. Finally, a table of tweet counts per timestamp was created using the whole database (not just the Kaggle sample). The data description PDF provides a full summary of the attributes found in the CSV files.
Note: While the tweet timestamps are in UTC, the daily cutoffs were based on US Eastern time (UTC-4 in August, when daylight saving time is in effect), so the August 16 file will have timestamps ranging from 2017-08-16 4:00:00 UTC to 2017-08-17 4:00:00 UTC.
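The UTC range quoted in the note can be reproduced with a short sketch. It assumes a fixed UTC-4 offset for the daily cutoff, which is what the 4:00:00 UTC boundaries imply; the function name `utc_window` is hypothetical and is not part of the dataset's tooling.

```python
from datetime import datetime, timedelta, timezone

# Assumption: daily cutoffs fall at local midnight with a fixed UTC-4 offset,
# as implied by the 4:00:00 UTC boundaries quoted in the note.
CUTOFF_OFFSET = timezone(timedelta(hours=-4))


def utc_window(year, month, day):
    """Return the (start, end) UTC timestamps covered by one daily file."""
    start_local = datetime(year, month, day, tzinfo=CUTOFF_OFFSET)
    end_local = start_local + timedelta(days=1)
    return start_local.astimezone(timezone.utc), end_local.astimezone(timezone.utc)


start, end = utc_window(2017, 8, 16)
print(start.isoformat(), end.isoformat())
```

A tweet belongs to the August 16 file when its UTC timestamp `t` satisfies `start <= t < end`.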
The dataset is available as either separate CSV files or a single SQLite database.
I'm releasing the dataset under the CC BY-SA 4.0 license. Furthermore, because this data was extracted via the Twitter Streaming API, its use must abide by the Twitter Developer Agreement. Most notably, the display of individual tweets should satisfy these requirements. More information can be found in the data description file, or on Twitter's website.
Obviously, I would like to thank Twitter for providing a fast and reliable streaming service. I'd also like to thank the developers of the Python programming language, psycopg2, and Postgres for creating amazing software without which this data set would not exist.
The banner above is a personal modification of these images:
I almost removed the header "inspiration" from this section, because this is a rather sad and dark data set. However, this is precisely why this is an important data set to analyze. Good history books have never shied away from unpleasant events, and neither should we.
This data set provides a rich opportunity for many types of research, including:
Furthermore, given the political nature of this dataset, there are a lot of social science questions that can potentially be answered, or at least piqued, by this data.
How many people use social media?
Social media usage is one of the most popular online activities. In 2024, over five billion people were using social media worldwide, a number projected to increase to over six billion in 2028.
Who uses social media?
Social networking is one of the most popular digital activities worldwide and it is no surprise that social networking penetration across all regions is constantly increasing. As of January 2023, the global social media usage rate stood at 59 percent. This figure is anticipated to grow as lesser developed digital markets catch up with other regions when it comes to infrastructure development and the availability of cheap mobile devices. In fact, most of social media’s global growth is driven by the increasing usage of mobile devices. Mobile-first market Eastern Asia topped the global ranking of mobile social networking penetration, followed by established digital powerhouses such as the Americas and Northern Europe.
How much time do people spend on social media?
Social media is an integral part of daily internet usage. On average, internet users spend 151 minutes per day on social media and messaging apps, an increase of 40 minutes since 2015. Internet users in Latin America spent the most time per day on social media.
What are the most popular social media platforms?
Market leader Facebook was the first social network to surpass one billion registered accounts and currently boasts approximately 2.9 billion monthly active users, making it the most popular social network worldwide. In June 2023, the top social media apps in the Apple App Store included mobile messaging apps WhatsApp and Telegram Messenger, as well as the ever-popular app version of Facebook.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the dataset, trained model, and software companion for the paper titled: Characterizing Anti-Asian Rhetoric During The COVID-19 Pandemic: A Sentiment Analysis Case Study on Twitter accepted for the Workshop on Data for the Wellbeing of Most Vulnerable of the ICWSM 2022 conference.
The COVID-19 pandemic has shown a measurable increase in the usage of sinophobic comments or terms on online social media platforms. In the United States, Asian Americans have been primarily targeted by violence and hate speech stemming from negative sentiments about the origins of the novel SARS-CoV-2 virus. While most published research focuses on extracting these sentiments from social media data, it does not connect the specific news events during the pandemic with changes in negative sentiment on social media platforms. In this work we combine and enhance publicly available resources with our own manually annotated set of tweets to create machine learning classification models to characterize the sinophobic behavior. We then apply our classifier to a pre-filtered longitudinal dataset spanning two years of pandemic-related tweets and overlay our findings with relevant news events.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This dataset comprises all tweets that contained the hashtag #ALittleLife or the official account @alittlelifebook and were posted in March or April 2015. The dataset was generated in February 2023 using snscrape (https://github.com/JustAnotherArchivist/snscrape/).
The dataset is discussed in my article, "What We Can('t) Know Before We Read: Towards a Theory of the Pre-Reading Environment", Book History 27.2 (2024).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset enumerates the number of geocoded tweets captured in geographic rectangular bounding boxes around the metropolitan statistical areas (MSAs) defined for 49 American cities, during a four-week period in 2012 (between April and June), through the Twitter Streaming API. More information on MSA definitions: https://www.census.gov/population/metro/
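Counting geocoded tweets per rectangular bounding box can be sketched as follows. The box coordinates and the names `BOXES` and `count_by_box` are hypothetical placeholders for illustration; they are not the MSA boxes used to build the dataset.

```python
from collections import Counter

# Hypothetical bounding boxes as (west, south, east, north) in decimal degrees.
# These are illustrative only, not the dataset's actual MSA rectangles.
BOXES = {
    "city_a": (-74.5, 40.4, -73.5, 41.0),
    "city_b": (-88.5, 41.4, -87.3, 42.2),
}


def count_by_box(points, boxes):
    """Count (lon, lat) points that fall inside each rectangular box."""
    counts = Counter({name: 0 for name in boxes})
    for lon, lat in points:
        for name, (west, south, east, north) in boxes.items():
            if west <= lon <= east and south <= lat <= north:
                counts[name] += 1
    return counts


sample = [(-74.0, 40.7), (-87.6, 41.9), (0.0, 0.0)]
print(count_by_box(sample, BOXES))
```

Points outside every box, such as the last sample point, are simply not counted.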
This dataset contains the tweet ids of approximately 132,907,659 tweets related to the announcement of the American Health Care Act (AHCA). They were collected between March 9, 2017 and April 13, 2018 from the Twitter API using Social Feed Manager. These tweet ids are broken up into 2 collections. Each collection was collected either from the GET statuses/search method of the Twitter REST API (retrieved on a weekly schedule) or the POST statuses/filter method of the Twitter Stream API. The collections are: Healthcare filter (Twitter filter): healthcare-filter_ids.txt.[00-13]; Healthcare search (Twitter search): healthcare-search_ids.txt. There is a README.txt file for each collection containing additional documentation on how it was collected, including the keywords used in each collection. The GET statuses/lookup method supports retrieving the complete tweet for a tweet id (known as hydrating). Tools such as Twarc or Hydrator can be used to hydrate tweets. Per Twitter’s Developer Policy, tweet ids may be publicly shared for academic purposes; tweets may not. Questions about this dataset can be sent to sfm@gwu.edu. George Washington University researchers should contact us for access to the tweets.
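Before hydrating, the id files have to be enumerated and read in batches. The following is a minimal sketch of that preparation step only; it does not call the Twitter API, Twarc, or Hydrator. The helper names `shard_names` and `batched_ids` are hypothetical, and the batch size of 100 is an assumption that should be checked against the documentation of whichever lookup tool is used.

```python
from itertools import islice


def shard_names(prefix="healthcare-filter_ids.txt", first=0, last=13):
    """List the shard file names implied by the pattern prefix.[00-13]."""
    return [f"{prefix}.{i:02d}" for i in range(first, last + 1)]


def batched_ids(lines, size=100):
    """Yield lists of at most `size` tweet ids, skipping blank lines.

    `size=100` is an assumed batch size, not a value taken from the dataset.
    """
    ids = (line.strip() for line in lines if line.strip())
    while True:
        batch = list(islice(ids, size))
        if not batch:
            return
        yield batch


print(shard_names()[:2])
print(list(batched_ids(["1\n", "2\n", "\n", "3\n"], size=2)))
```

In practice `lines` would be an open file handle for one shard, so the ids are streamed rather than loaded into memory at once.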
The global number of Facebook users was forecast to increase continuously between 2023 and 2027, by a total of 391 million users (+14.36 percent). After the fourth consecutive year of increase, the Facebook user base is estimated to reach 3.1 billion users, a new peak, in 2027. Notably, the number of Facebook users increased continuously over the past years. User figures, shown here regarding the platform Facebook, have been estimated by taking into account company filings or press material, secondary research, app downloads and traffic data. They refer to the average monthly active users over the period and count multiple accounts by persons only once. The shown data are an excerpt of Statista's Key Market Indicators (KMI). The KMI are a collection of primary and secondary indicators on the macro-economic, demographic and technological environment in up to 150 countries and regions worldwide. All indicators are sourced from international and national statistical offices, trade associations and the trade press and they are processed to generate comparable data sets (see supplementary notes under details for more information).
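As a quick consistency check on the forecast figures, the absolute and relative increases can be combined to recover the implied starting value. The 2023 base is not stated in the text; it is derived here from the two quoted numbers, and `implied_base` is a hypothetical helper name.

```python
def implied_base(increase, pct_change):
    """Starting value implied by an absolute increase and its percent change."""
    return increase / (pct_change / 100.0)


increase_m = 391.0   # forecast increase, in millions of users
pct_change = 14.36   # the same increase expressed in percent

base_m = implied_base(increase_m, pct_change)  # derived, not quoted in the text
peak_m = base_m + increase_m

print(f"implied 2023 base: {base_m / 1000:.2f} billion")
print(f"implied 2027 peak: {peak_m / 1000:.2f} billion")
```

The implied peak rounds to 3.1 billion, which agrees with the figure quoted for 2027.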
Social media companies are starting to offer users the option to subscribe to their platforms in exchange for monthly fees. Until recently, social media has been predominantly free to use, with tech companies relying on advertising as their main revenue generator. However, advertising revenues have been dropping following the COVID-induced boom. As of July 2023, Meta Verified is the most costly of the subscription services, setting users back almost 15 U.S. dollars per month on iOS or Android. Twitter Blue costs between eight and 11 U.S. dollars per month and ensures users will receive the blue check mark, and have the ability to edit tweets and have NFT profile pictures. Snapchat+, drawing in four million users as of the second quarter of 2023, boasts a Story re-watch function, custom app icons, and a Snapchat+ badge.
More than 250 million tweets in Spanish from 331 Spanish-speaking cities in Latin America, Spain and the United States were compiled from Twitter. In this data set, a column is provided with the 5000 most frequent words and one with their corresponding frequencies (the number of times the word was produced in that city) for each of the 331 cities. The reported data correspond to the years 2009 to 2016.
A list of 10,538 Twitter IDs for tweets harvested between 4 January at 11am and 9 January at 11am using Social Feed Manager. As this used the search API, the 4 January at 11am crawl went back about 5-9 days. Tweet IDs are included, as is a log of the decisions made to curate this dataset.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
As global political preeminence gradually shifted from the United Kingdom to the United States, so did the capacity to culturally influence the rest of the world. In this work, we analyze how the world-wide varieties of written English are evolving. We study both the spatial and temporal variations of vocabulary and spelling of English using a large corpus of geolocated tweets and the Google Books datasets corresponding to books published in the US and the UK. The advantage of our approach is that we can address both standard written language (Google Books) and the more colloquial forms of microblogging messages (Twitter). We find that American English is the dominant form of English outside the UK and that its influence is felt even within the UK borders. Finally, we analyze how this trend has evolved over time and the impact that some cultural events have had in shaping it.
The global social media penetration rate was forecast to increase continuously between 2024 and 2028, by a total of 11.6 percentage points (+18.19 percent). After the ninth consecutive year of increase, the penetration rate is estimated to reach 75.31 percent, a new peak, in 2028. Notably, the social media penetration rate increased continuously over the past years.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Appendix S1-S3, Table S1 and Software S1. Appendix S1. Term list. List of all words considered in our main analysis. Appendix S2. Term examples. Examples for each term considered in our analysis. Appendix S3. Data Procedures. Description of the procedures used for data processing, including Twitter data acquisition, geocoding, content filtering, word filtering, and text processing. Table S1. Term annotations. Tab-separated file describing annotations of each term as entities, foreign-language, or acceptable for analysis. Software S1. Preprocessing software. Source code for data preprocessing. (ZIP)
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
(A) Total number of tweets and (B) proportion of tweets made by each of the 10 Granger linked accounts from 01/01/2015 to 02/26/2017. The vertical lines represent significant events in the election cycle. From left to right: Trump announces candidacy (black, 06/16/2015), Trump calls for a ban on the immigration of Muslims after San Bernardino shooting (red, 12/07/2015), Trump declared Republican nominee (red, 06/19/2016), Hillary declared Democratic nominee (blue, 07/28/2016), 3 presidential primary debates (green, 09/26, 10/09, 10/19/2016), election day (black, 11/08/2016), and Trump’s inauguration (black, 01/20/2017).
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Invasive species - American bullfrog (Lithobates catesbeianus) in Flanders, Belgium (Post 2018) is a species occurrence dataset published by the Research Institute for Nature and Forest (INBO). The dataset contains over 24600 occurrences (40 % of which are American bullfrogs) sampled from 2019 until now, in the months April to October. The occurrences were collected through fieldwork and the framework of bullfrog management. Captured bullfrogs were almost always removed from the environment and humanely killed, while the other occurrences are recorded bycatch, which were released upon catch (see bibliography for detailed descriptions of the methods). Therefore, caution is advised when using these data for trend analysis, distribution range calculation, or other purposes. Issues with the dataset can be reported at https://github.com/inbo/sk-analyse
We strongly believe an open attitude is essential for tackling the IAS problem (Groom et al. 2015). To allow anyone to use this dataset, we have released the data to the public domain under a Creative Commons Zero waiver (http://creativecommons.org/publicdomain/zero/1.0/). We would appreciate it however if you read and follow these norms for data use (http://www.inbo.be/en/norms-for-data-use) and provide a link to the original dataset (https://doi.org/10.15468/daf62d) whenever possible. If you use these data for a scientific paper, please cite the dataset following the applicable citation norms and/or consider us for co-authorship. We are always interested to know how you have used or visualized the data, or to provide more information, so please contact us via the contact information provided in the metadata, opendata@inbo.be or https://twitter.com/LifeWatchINBO.
Data from 2010 to 2018 can be found here: https://doi.org/10.15468/2hqkqn
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The English language has evolved dramatically throughout its lifespan, to the extent that a modern speaker of Old English would be incomprehensible without translation. One concrete indicator of this process is the movement from irregular to regular (-ed) forms for the past tense of verbs. In this study we quantify the extent of verb regularization using two vastly disparate datasets: (1) Six years of published books scanned by Google (2003–2008), and (2) A decade of social media messages posted to Twitter (2008–2017). We find that the extent of verb regularization is greater on Twitter, taken as a whole, than in English Fiction books. Regularization is also greater for tweets geotagged in the United States relative to American English books, but the opposite is true for tweets geotagged in the United Kingdom relative to British English books. We also find interesting regional variations in regularization across counties in the United States. However, once differences in population are accounted for, we do not identify strong correlations with socio-demographic variables such as education or income.
Facebook received 73,390 user data requests from federal agencies and courts in the United States during the second half of 2023. The social network produced some user data in 88.84 percent of requests from U.S. federal authorities. The United States accounts for the largest share of Facebook user data requests worldwide.
How much time do people spend on social media?
As of 2024, the average daily social media usage of internet users worldwide amounted to 143 minutes per day, down from 151 minutes in the previous year. Currently, the country with the most time spent on social media per day is Brazil, with online users spending an average of three hours and 49 minutes on social media each day. In comparison, the daily time spent with social media in the U.S. was just two hours and 16 minutes.
Global social media usage
Currently, the global social network penetration rate is 62.3 percent. Northern Europe had an 81.7 percent social media penetration rate, topping the ranking of global social media usage by region. Eastern and Middle Africa closed the ranking with 10.1 and 9.6 percent usage reach, respectively.
People access social media for a variety of reasons. Users like to find funny or entertaining content and enjoy sharing photos and videos with friends, but mainly use social media to stay in touch with current events and friends.
Global impact of social media
Social media has a wide-reaching and significant impact on not only online activities but also offline behavior and life in general.
During a global online user survey in February 2019, a significant share of respondents stated that social media had increased their access to information, ease of communication, and freedom of expression. On the flip side, respondents also felt that social media had worsened their personal privacy, increased a polarization in politics and heightened everyday distractions.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The feature groups are defined in Table 3; “population” refers to “raw diff log population.” Average accuracy predicting links between MSA pairs, and its Monte Carlo standard error (calculated from simulation samples).