Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Social network services such as Twitter are important venues that can be used as rich data sources to mine public opinion on various topics. In this study, we used Twitter to collect data on one of the fastest-growing theories in education, Self-Regulated Learning (SRL), and analysed it to investigate what Twitter says about SRL. This work uses three main analysis methods: descriptive analysis, topic modeling, and geocoding analysis. The collected dataset consists of 54,070 relevant SRL tweets posted between 2011 and 2021. The descriptive analysis uncovers a growing discussion of SRL on Twitter from 2011 to 2018, followed by a marked decrease until the collection day. For topic modeling, Latent Dirichlet Allocation (LDA), a text mining technique, was applied and revealed insights into computationally processed topics. Finally, the geocoding analysis uncovers a diverse community from all over the world, although users from the Global North are more densely represented. Further implications are discussed in the paper.
This submission consists of 12 data sets containing Twitter IDs pertaining to 6 AI controversies identified by UK-based experts in AI and Society as especially significant during the period 2012-2021. The data sets were collected by researchers at the University of Warwick as part of the 3-year international project “Shaping AI”, which mapped controversies about “Artificial Intelligence” (AI) during 2012-2022. Research teams in the UK, France, Germany and Canada analysed controversies about AI in their countries across different spheres (research, policy and the media) during this 10-year period. The UK team at the University of Warwick designed and undertook an analysis of research controversies about AI in the relevant period following a standpoint methodology. Our study began with an online consultation in the autumn of 2021, in which we asked UK-based experts in AI from across disciplines to identify the most important concerns, disputes and problematics that have arisen in the last 10 years in relation to AI as a strategic area of research.
Based on the responses to this expert consultation—described in detail in Marres et al (2024) and Poletti et al (forthcoming)—we identified a broad range of relevant controversy topics, objects and problems. To select controversies for further analysis, we considered their research intensity, in the form of a frequency count of research publications mentioned by respondents in relation to controversy topics.
On this basis, we selected 6 AI research controversies for further research: COMPAS; NHS+Deepmind; Gaydar; Facial recognition; Stochastic Parrots (LLMs); and Deeplearning as a solution for AI. For each of these controversies, we collected Twitter data by submitting queries to Twitter's academic API using TWARC between January 2022 and June 2022. Further details of the methods of data collection and curation can be found in the methods file, and further detail of the queries in the ReadMe file.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We present GeoCoV19, a large-scale Twitter dataset related to the ongoing COVID-19 pandemic. The dataset was collected over a period of 90 days, from February 1 to May 1, 2020, and consists of more than 524 million multilingual tweets. As geolocation information is essential for many tasks such as disease tracking and surveillance, we employed a gazetteer-based approach to extract toponyms from user location and tweet content and derive their geolocation information using Nominatim (OpenStreetMap) data at different geolocation granularity levels. In terms of geographical coverage, the dataset spans 218 countries and 47K cities. The tweets in the dataset are from more than 43 million Twitter users, including around 209K verified accounts. These users posted tweets in 62 different languages.
http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html
Ever wondered what people are saying about certain countries? Whether it's in a positive/negative light? What are the most commonly used phrases/words to describe the country? In this dataset I present tweets where a certain country gets mentioned in the hashtags (e.g. #HongKong, #NewZealand). It contains around 150 countries in the world. I've added an additional field called polarity which has the sentiment computed from the text field. Feel free to explore! Feedback is much appreciated!
Each row represents a tweet. Creation dates of tweets range from 12/07/2020 to 25/07/2020. Will update on a monthly cadence.
- The country can be derived from the file_name field (this field is very Tableau friendly when it comes to plotting maps).
- The date at which the tweet was created is in the created_at field.
- The search query used to query the Twitter Search Engine is in the search_query field.
- The tweet's full text is in the text field.
- The sentiment is in the polarity field (I've used the Vader model from NLTK to compute this).
There may be slight duplication of tweet IDs before 22/07/2020. I have since fixed this bug.
Thanks to the tweepy package for making the data extraction via Twitter API so easy.
Feel free to check out my blog if you want to learn how I built the datalake via AWS or for other data shenanigans.
Here's an App I built using a live version of this data.
Sample of tweets generated within a USA bounding box between August 2019 and April 2020. The data was used for the paper titled: Enhanced geocoding precision for location inference of tweet text using spaCy, Nominatim and Google Maps. A comparative analysis of the influence of data selection. The paper was submitted to the journal PLoS ONE. Two datasets have been submitted. 1. Dataset A: Consisting of 133,577 geocoded tweets 2. Dataset B: Consisting of 133,587 geocoded tweets
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Only the tweet IDs are shared. The tweet IDs in this dataset belong to tweets that were created with an exact location.
Train.csv contains tweets, and the labels are emojis. You can find the emoji-label mapping in Mapping.csv. Predict the emojis to use for the test set.
Best method among those tried was Bi-directional LSTM with Glove embeddings (42B)
Belongs to the original author on Twitter
U.S. Government Workshttps://www.usa.gov/government-works
License information was derived automatically
Updated daily
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This web map is part of SDGs Today. Please see sdgstoday.org. Promoting well-being is one of the key targets of the United Nations Sustainable Development Goals. Many governments worldwide are incorporating subjective well-being (SWB) indicators to complement traditional objective and economic metrics. Our Twitter Sentiment Geographical Index (TSGI) can provide a high-granularity monitor of well-being worldwide. This dataset is a joint effort of the Sustainable Urbanization Lab at MIT and the Center for Geographic Analysis at Harvard.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains a dump of the #nowplaying dataset, which contains so-called listening events of users who publish the music they are currently listening to on Twitter. In particular, this dataset includes tracks which have been tweeted using the hashtags #nowplaying, #listento or #listeningto. In this dataset, we provide the track and artist of a listening event and metadata on the tweet (date sent, user, source). Furthermore, we provide a mapping of tracks to their respective MusicBrainz identifiers. The dataset features a total of 126 million listening events.
This archive contains the nowplaying.csv file, the main file which contains the following fields:
In case you make use of our dataset in a scientific setting, we kindly ask you to cite the following paper:
Eva Zangerle, Martin Pichl, Wolfgang Gassler, and Günther Specht. 2014. #nowplaying Music Dataset: Extracting Listening Behavior from Twitter. In Proceedings of the First International Workshop on Internet-Scale Multimedia Management (WISMM '14). ACM, New York, NY, USA, 21-26.
If you have any questions or suggestions regarding the dataset, please do not hesitate to contact Eva Zangerle (eva.zangerle@uibk.ac.at).
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
The dataset is a subset of the TBCOV dataset collected at QCRI filtered for mentions of personally related COVID-19 deaths. The filtering was done using regular expressions such as my * passed, my * died, my * succumbed & lost * battle. A sample of the dataset was annotated on Appen. Please see 'annotation-instructions.txt' for the full instructions provided to the annotators.
The "classifier_filtered_english.csv" file contains 33k deduplicated and classifier-filtered tweets (following X's content redistribution policy) for the 6 countries (Australia, Canada, India, Italy, United Kingdom, and United States) from March 2020 to March 2021 with classifier-labeled death labels, regular expression-filtered gender and relationship labels, and the user device label. The full 57k regex-filtered collection of tweets can be made available in special cases for Academics and Researchers.
date: the date of the tweet
country_name: the country name from Nominatim API
tweet_id: the ID of the tweet
url: the full URL of the tweet
full_text: the full-text content of the tweet (also includes the URL of any media attached)
does_the_tweet_refer_to_the_covidrelated_death_of_one_or_more_individuals_personally_known_to_the_tweets_author: the classifier predicted label for the death (also includes the original labels for the annotated samples)
what_is_the_relationship_between_the_tweets_author_and_the_victim_mentioned: the annotated relationship labels
relative_to_the_time_of_the_tweet_when_did_the_mentioned_death_occur: the annotated relative time labels
user_is_verified: if the user is verified or not
user_gender: the gender of the Twitter user (from the user profile)
user_device: the Twitter client the user uses
has_media: if the tweet has any attached media
has_url: if the tweet text contains a URL
matched_device: the device (Apple or Android) based on the Twitter client
regex_gender: the gender inferred from regular expression-based filtering
regex_relationship: the relationship label from regular expression-based filtering
We first determine the mapping between different relationship labels mentioned in the tweet to the gender. We do not use any relationship like "cousin" from which we cannot easily infer the gender.
Male relationships: 'father', 'dad', 'daddy', 'papa', 'pop', 'pa', 'son', 'brother', 'uncle', 'nephew', 'grandfather', 'grandpa', 'gramps', 'husband', 'boyfriend', 'fiancé', 'groom', 'partner', 'beau', 'friend', 'buddy', 'pal', 'mate', 'companion', 'boy', 'gentleman', 'man', 'father-in-law', 'brother-in-law', 'stepfather', 'stepbrother'
Female relationships: 'mother', 'mom', 'mama', 'mum', 'ma', 'daughter', 'sister', 'aunt', 'niece', 'grandmother', 'grandma', 'granny', 'wife', 'girlfriend', 'fiancée', 'bride', 'partner', 'girl', 'lady', 'woman', 'miss', 'mother-in-law', 'sister-in-law', 'stepmother', 'stepsister'
Based on these mappings, we used the following regex for each gender label to determine the gender of the deceased mentioned in the tweet.
"[m|M]y\s(" + "|".join([r + "s?" for r in relationships]) + ")\s(died|succumbed|deceased)"
First, we get the relationship labels using regex filtering, and then we group them into different age-group categories as shown in the following table. The UK and the US use different age groups because of the different age group definitions in the official data.
Category | Relationship (from tweets) | Age Group (UK) | Age Group (US) |
Grandparents | grandfather, grandmother | 65+ | 65+ |
Parents | father, mother, uncle, aunt | 45-64 | 35-64 |
Siblings | brother, sister, cousin | 15-44 | 15-34 |
Children | son, daughter, nephew, niece | 0-14 | 0-14 |
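The regex filtering and the age-group lookup described above can be sketched together as follows. This is a minimal illustration, not the authors' code: the relationship lists are shortened subsets, the function names `match_death_mention` and `age_group` are hypothetical, raw strings are used for the pattern, and the character class is written as `[mM]` (the `[m|M]` form in the expression above would also accept a literal `|`).

```python
import re

# Shortened subsets of the relationship lists above, for illustration only.
MALE = ["father", "dad", "son", "brother", "uncle", "nephew",
        "grandfather", "grandpa", "husband"]
FEMALE = ["mother", "mom", "daughter", "sister", "aunt", "niece",
          "grandmother", "grandma", "wife"]

# Category -> (relationships, UK age group, US age group), copied from the table.
AGE_GROUPS = {
    "Grandparents": (("grandfather", "grandmother"), "65+", "65+"),
    "Parents": (("father", "mother", "uncle", "aunt"), "45-64", "35-64"),
    "Siblings": (("brother", "sister", "cousin"), "15-44", "15-34"),
    "Children": (("son", "daughter", "nephew", "niece"), "0-14", "0-14"),
}


def build_pattern(relationships):
    # Same shape as the expression above: "my <relationship>[s] <death verb>".
    alternatives = "|".join(re.escape(r) + "s?" for r in relationships)
    return re.compile(r"[mM]y\s(" + alternatives + r")\s(died|succumbed|deceased)")


PATTERNS = {"male": build_pattern(MALE), "female": build_pattern(FEMALE)}


def match_death_mention(text):
    """Return (gender, matched relationship) for the first pattern that hits, else None.

    The matched relationship keeps an optional plural "s" if one was present.
    """
    for gender, pattern in PATTERNS.items():
        match = pattern.search(text)
        if match:
            return gender, match.group(1)
    return None


def age_group(relationship, country):
    """Return (category, age group) for a relationship label, or None if unmapped.

    `country` must be "UK" or "US".
    """
    column = {"UK": 1, "US": 2}[country]
    for category, row in AGE_GROUPS.items():
        if relationship in row[0]:
            return category, row[column]
    return None


print(match_death_mention("My uncle died yesterday"))
print(age_group("uncle", "US"))
```

A tweet such as "My cousin died" returns `None` from `match_death_mention` because "cousin" carries no gender information, even though the age-group table does place it under Siblings.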
The 'english-training.csv' file contains about 13k deduplicated human-annotated tweets. We use a random seed (42) to create the train/test split. The model Covid-Bert-V2 was fine-tuned on the training set for 2 epochs with the following hyperparameters (obtained using 10-fold CV): random_seed: 42, batch_size: 32, dropout: 0.1. We obtained an F1-score of 0.81 on the test set. We used about 5% (671) of the combined and deduplicated annotated tweets as the test set, about 2% (255) as the validation set, and the remaining 12,494 tweets for fine-tuning the model. The tweets were preprocessed to replace mentions, URLs, emojis, etc. with generic keywords. The model was trained on a system with a single Nvidia A4000 16GB GPU. The fine-tuned model is also available as the 'model.bin' file. The code for fine-tuning the model and reproducing the experiments is available in this GitHub repository.
We also include a datasheet for the dataset following the recommendation of "Datasheets for Datasets" (Gebru et al.), which provides more information about how the dataset was created and how it can be used. Please see "Datasheet.pdf".
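A seeded split with the sizes reported above could look like the following sketch. The description does not say how the seed was applied, so the shuffle-then-slice approach and the function name `split_dataset` are assumptions for illustration; only the seed (42) and the split sizes (671 test, 255 validation, 12,494 training) come from the text.

```python
import random


def split_dataset(items, n_test=671, n_val=255, seed=42):
    """Shuffle deterministically, then slice off test and validation sets."""
    rng = random.Random(seed)      # local generator, leaves global state untouched
    shuffled = list(items)
    rng.shuffle(shuffled)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


# Stand-in for the annotated tweets: 12,494 + 255 + 671 = 13,420 items.
train, val, test = split_dataset(range(13420))
print(len(train), len(val), len(test))
```

Running the function again with the same seed returns the same three subsets, which is what makes the reported test-set score reproducible.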
NOTE: We recommend that researchers try to rehydrate the individual tweets to ensure that the user has not deleted the tweet since posting. This gives users a mechanism to opt out of having their data analyzed.
Please only use your institutional email when requesting the dataset as anything else (like gmail.com) will be rejected. The dataset will only be made available on reasonable request for Academics and Researchers. Please mention why you need the dataset and how you plan to use the dataset when making a request.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Most of the available datasets are not particularly adapted to our target application: geolocate natural disasters from social networks. First, social media posts are largely underrepresented in these datasets, and the only Twitter dataset lacks Entity-Linking annotations. Second, none of the datasets focuses on a crisis or natural disaster event.
To mitigate these issues, we extracted a collection of French tweets written during earthquakes and major floods that have occurred in France in recent years. We set up Label-Studio in order to annotate these tweets. A total of 4617 tweets were annotated, including 1678 tweets posted during earthquakes and 2939 during floods. For each annotated tweet, mentions were annotated using the set of labels described earlier in the paper as well as, when possible, the target Wikipedia title.
Named “RéSoCIO” in reference to the research project in which it was carried out, the dataset resulting from this work contains a total of 12,828 annotated mentions and 1,513 distinct Wikipedia entities. 85% of mentions were associated with a Wikipedia page, and 94% if we ignore the RISKNAT and DAMAGES labels, which are often difficult to map to an existing entity.
Labels | #Mentions | #Linked | #Entities |
PERSON | 315 | 263 | 136 |
ORG | 863 | 790 | 281 |
GEOLOC | 4375 | 4234 | 701 |
TRANSPORT | 250 | 203 | 101 |
EVENT | 35 | 21 | 16 |
FACILITY | 129 | 94 | 49 |
RISKNAT | 5502 | 4994 | 128 |
DAMAGES | 1136 | 121 | 56 |
OTHER | 223 | 200 | 46 |
Total | 12828 | 10920 | 1513 |
Overview of the mentions annotated in the Twitter dataset. #Mentions shows the total number of mentions per label, #Linked the number of mentions linked to an entity and #Entities the number of distinct entities per label present in the dataset.
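As a quick check of the linking rates quoted for the Twitter dataset (85% overall, 94% without RISKNAT and DAMAGES), the per-label counts from the table can be summed directly. The helper name `linked_share` is hypothetical; the numbers are the per-label #Mentions and #Linked values listed above.

```python
# (label, mentions, linked) copied from the per-label rows of the table above.
ROWS = [
    ("PERSON", 315, 263),
    ("ORG", 863, 790),
    ("GEOLOC", 4375, 4234),
    ("TRANSPORT", 250, 203),
    ("EVENT", 35, 21),
    ("FACILITY", 129, 94),
    ("RISKNAT", 5502, 4994),
    ("DAMAGES", 1136, 121),
    ("OTHER", 223, 200),
]


def linked_share(rows, exclude=()):
    """Fraction of mentions linked to an entity, skipping the excluded labels."""
    mentions = sum(m for label, m, _ in rows if label not in exclude)
    linked = sum(l for label, _, l in rows if label not in exclude)
    return linked / mentions


overall = linked_share(ROWS)
without_hard_labels = linked_share(ROWS, exclude=("RISKNAT", "DAMAGES"))
print(f"{overall:.1%}")              # 85.1%
print(f"{without_hard_labels:.1%}")  # 93.8%
```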
Labels | #Mentions | #Linked | #Entities |
PERSON | 1100102 | 1098406 | 557697 |
ORG | 750925 | 749504 | 130394 |
GEOLOC | 2729702 | 2728296 | 215924 |
TRANSPORT | 161539 | 160487 | 53405 |
EVENT | 798433 | 798251 | 86471 |
FACILITY | 258835 | 258513 | 109867 |
RISKNAT | 5502 | 4994 | 127 |
DAMAGES | 1136 | 121 | 56 |
OTHER | 4340621 | 4339658 | 682458 |
Total | 10146795 | 10138230 | 1836399 |
Overview of the mentions annotated in the full dataset. #Mentions shows the total number of mentions per label, #Linked the number of mentions linked to an entity and #Entities the number of distinct entities per label present in the dataset.
Open Database License (ODbL) v1.0https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
Recent baseball scholarship has drawn attention to U.S. professional baseball’s complex twentieth century labor dynamics and expanding global presence. From debates around desegregation to discussions about the sport’s increasingly multicultural identity and global presence, the cultural politics of U.S. professional baseball is connected to the problem of baseball labor. However, most scholars address these topics by focusing on Major League Baseball (MLB), ignoring other teams and leagues—Minor League Baseball (MiLB)—that develop players for Major League teams. Considering Minor League Baseball is critical to understanding the professional game in the United States, since players who populate Major League rosters constitute a fraction of U.S. professional baseball’s entire labor force. As a digital humanities dissertation on baseball labor and globalization, this project uses digital humanities approaches and tools to analyze and visualize a quantitative data set, exploring how Minor League Baseball relates to and complicates MLB-dominated narratives around globalization and diversity in U.S. professional baseball labor. This project addresses how MiLB demographics and global dimensions shifted over time, as well as how the timeline and movement of foreign-born players through the Minor Leagues differs from their U.S.-born counterparts. This project emphasizes the centrality and necessity of including MiLB data in studies of baseball’s labor and ideological significance or cultural meaning, making that argument by drawing on data analysis, visualization, and mapping to address how MiLB labor complicates or supplements existing understandings of the relationship between U.S. professional baseball’s global reach and “national pastime” claims.
The LGBTQI+ Dataset 2020-2022_es is a collection of 410,015 original tweets extracted from the social network Twitter between January 1, 2020, and December 31, 2022. To ensure data quality and relevance, retweets, replies, and other duplicate content were excluded, retaining only original tweets. The tweets were collected by Jacinto Mata (University of Huelva, I2C/CITES) with the support of the Python programming language and using the twarc2 tool and the Academic API v2 of Twitter. This data collection is part of the project “Conspiracy Theories and Hate Speech Online: Comparison of patterns in narratives and social networks about COVID-19, immigrants and refugees and LGBTI people [NON-CONSPIRA-HATE!]”, PID2021-123983OB-I00, funded by MCIN/AEI/10.13039/501100011033/ by FEDER/EU.
The search criteria (words and hashtags) used for the data collection followed the objectives of the aforementioned project and were defined by Estrella Gualda, Francisco Javier Santos Fernández and Jacinto Mata (University of Huelva, Spain). Terms and hashtags used for the search and extraction of tweets were: #orgullogay, #orgullotrans, #OrgulloLGTB, #OrgulloLGTBI, #Díadelorgullo, #TRANSFOBIA, #transexuales, #LGTB, #LGTBI, #LGTBIQ, #LGTBQ, #LGTBQ+, anti-gay, "anti gay", anti-trans, "anti trans", "Ley Anti-LGTB", "ley trans", "anti-ley trans".
This dataset, collected within the framework of the NON-CONSPIRA-HATE! project, aims to identify and map online hate speech narratives and conspiracy theories directed at LGBTIQ+ people and the LGBTIQ+ community. Additionally, the dataset is intended to support comparison of communication patterns in social media (rhetoric, language, micro-discourses, semantic networks, emotions, etc.) across the different datasets collected in this project. This dataset also contributes to mapping the actors, communities, and networks that spread hate messages and conspiracy theories, aiming to understand the patterns and strategies implemented by extremist sectors on social media. The dataset includes messages that address a wide range of topics related to the LGBTQI+ community, such as rights, visibility, the fight against discrimination and transphobia, as well as debates surrounding the Trans Law and other related issues. It includes expressions of support and celebration of Pride as well as hate speech and opposition to LGBTQI+ rights, along with debates and controversies surrounding these issues.
This dataset offers a wide range of possibilities for research in various disciplines, as the following examples show:
Social Sciences & Digital Humanities:
- Analysis of opinions, attitudes, and trends toward LGBTIQ+ people and the LGBTIQ+ community.
- Studies on the evolution of public discourse and polarization around issues such as transphobia, hate speech, disinformation, LGBTIQ+ rights and pride, and others.
- Analysis of social and political actors, leaders or organizations disseminating diverse narratives on LGBTIQ+.
- Research on the impact of specific events (e.g., Pride Day) on social media conversations.
- Investigations of social and semantic networks around LGBTIQ+ people and the LGBTIQ+ community.
- Analysis of narratives, discourses and rhetoric around gender identity and sexual diversity.
- Comparative studies on the representation of LGBTIQ+ people and the LGBTIQ+ community in different cultural or geographic contexts.
Computer Science and Artificial Intelligence:
- Development of algorithms for the automatic detection of hate speech, discriminatory language, or offensive content.
- Training natural language processing (NLP) models to analyze sentiments and emotions in texts related to LGBTIQ+ people and the LGBTIQ+ community.
For more information on other technical details of the dataset and the structure of the .jsonl data, see the “Readme.txt” file.
Social network X/Twitter is particularly popular in the United States, and as of February 2025, the microblogging service had an audience reach of 103.9 million users in the country. Japan and India were ranked second and third with more than 70 million and 25 million users respectively.
Global Twitter usage
As of the second quarter of 2021, X/Twitter had 206 million monetizable daily active users worldwide. The most-followed Twitter accounts include figures such as Elon Musk, Justin Bieber and former U.S. president Barack Obama.
X/Twitter and politics
X/Twitter has become an increasingly relevant tool in domestic and international politics. The platform has become a way to promote policies and interact with citizens and other officials, and most world leaders and foreign ministries have an official Twitter account. Former U.S. president Donald Trump used to be a prolific Twitter user before the platform permanently suspended his account in January 2021. During an August 2018 survey, 61 percent of respondents stated that Trump's use of Twitter as President of the United States was inappropriate.
Live Maps is a configurable app template that provides the ability to consume live data feeds from a variety of sources.
Use Cases
- Provide a map that shows locations of health care facilities and the reported cases of influenza.
- Present the locations of political campaign events with related tweets.
Configurable Options
Live Maps is used to combine social media feeds with your operational content. It can be configured using the following options:
- Map: Choose the web map used in your application.
- Title: The application name displayed in the header.
- Subtitle: The application subtitle displayed in the header.
- Color: Choose the color scheme for the application.
- Feed: The live feed to use in the application; currently supports Twitter, Flickr, SickWeather.
- Keyword: Optional search keyword for feeds like Twitter and Flickr.
- Interval: The interval in minutes to switch between records.
- Refresh interval: The interval in minutes to refresh the feed.
Supported Devices
This application is responsively designed to support use in browsers on desktops, mobile phones, and tablets.
Data Requirements
This application has no data requirements.
Get Started
This application can be created in the following ways:
- Click the Create a Web App button on this page
- Share a map and choose to Create a Web App
- On the Content page, click Create - App - From Template
Click the Download button to access the source code. Do this if you want to host the app on your own server and optionally customize it to add features or change styling.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Interactions between key concepts mentioned on Twitter in tweets containing words from the field of ecology. See the URL for more details on the methodology. These data come from a series of relatively short sampling sessions.
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains the tweets, mentions, identities of mentioned persons, hashtags and X URLs posted by OECD leaders during the study period.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Researcher(s): Alexandros Mokas, Eleni Kamateri
Supervisor: Ioannis Tsampoulatidis
This repository contains 3 social media datasets:
2 Post-processing datasets: These datasets contain post-processing data extracted from the analysis of social media posts collected for two different use cases during the first two years of the DeepCube project. More specifically, these include:
The UC2 dataset, containing the post-processing analysis of the Twitter data collected for the DeepCube use case (UC2) dealing with climate-induced migration in Africa. This dataset contains a total of 5,695,253 social media posts collected from the Twitter platform, based on the initial version of the search criteria relevant to UC2 defined by Universitat De Valencia. It focuses on the regions of Ethiopia and Somalia and covers the period from 26 June 2021 to March 2023.
The UC5 dataset, containing the post-processing analysis of the Twitter and Instagram data collected for the DeepCube use case (UC5) related to sustainable and environmentally friendly tourism. This dataset contains a total of 58,143 social media posts collected from the Twitter and Instagram platforms (12,881 from Twitter and 45,262 from Instagram), based on the initial version of the search criteria relevant to UC5 defined by MURMURATION SAS. It focuses on Brazil and covers the period from 26 June 2021 to March 2023.
1 Annotated dataset: An additional annotated dataset was created that contains post-processing data along with annotations of Twitter posts collected for UC2 for the years 2010-2022. More specifically, it includes:
The UC2 dataset, containing the post-processing of the Twitter data collected for the DeepCube use case (UC2) dealing with climate-induced migration in Africa. This dataset contains a total of 1,721 annotated social media posts (412 relevant and 1,309 irrelevant) collected from the Twitter platform. It focuses on the region of Somalia and covers the period from 1 January 2010 to 31 December 2022.
For every social media post retrieved from Twitter and Instagram, a preprocessing step was performed. This involved a three-step analysis of each post using the appropriate web service. First, the location of the post was automatically extracted from the text using a location extraction service. Second, the images included in the post were analyzed using a concept extraction service, which identified and provided the top ten concepts that best described the image. These concepts included items such as "person," "building," "drought," "sun," and so on. Finally, the sentiment expressed in the post's text was determined by using a sentiment analysis service. The sentiment was classified as either positive, negative, or neutral.
After the social media posts were preprocessed, they were visualized using the Social Media Web Application. This intuitive, user-friendly online application was designed for both expert and non-expert users and offers a web-based user interface for filtering and visualizing the collected social media data. The application provides various filtering options, an interactive map, a timeline, and a collection of graphs to help users analyze the data. Moreover, this application provides users with the option to download aggregated data for specific periods by applying filters and clicking the "Download Posts" button. This feature allows users to easily extract and analyze social media data outside of the web application, providing greater flexibility and control over data analysis.
The dataset is provided by INFALIA. INFALIA, being a spin-off of the CERTH institute and a partner in an EU research project, releases this dataset containing Tweet IDs and post pre-processing data for the sole purpose of enabling the validation of the research conducted within DeepCube. Moreover, Twitter Content provided in this dataset to third parties remains subject to the Twitter Policy, and those third parties must agree to the Twitter Terms of Service, Privacy Policy, Developer Agreement, and Developer Policy (https://developer.twitter.com/en/developer-terms) before receiving this download.