65 datasets found

Data_Sheet_5_What Does Twitter Say About Self-Regulated Learning? Mapping...
frontiersin.figshare.com
txt
Updated Jun 15, 2023
Cite
Mohammad Khalil; Gleb Belokrys (2023). Data_Sheet_5_What Does Twitter Say About Self-Regulated Learning? Mapping Tweets From 2011 to 2021.CSV [Dataset]. http://doi.org/10.3389/fpsyg.2022.820813.s005
Available download formats: txt
Unique identifier
https://doi.org/10.3389/fpsyg.2022.820813.s005
Dataset updated
Jun 15, 2023
Dataset provided by
Frontiers
Authors
Mohammad Khalil; Gleb Belokrys
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Social network services such as Twitter are important venues that can be used as rich data sources to mine public opinions about various topics. In this study, we used Twitter to collect data on one of the fastest-growing theories in education, namely Self-Regulated Learning (SRL), and carried out further analysis to investigate what Twitter says about SRL. This work uses three main analysis methods: descriptive, topic modeling, and geocoding analysis. The collected dataset consists of 54,070 relevant SRL tweets posted between 2011 and 2021. The descriptive analysis uncovers a growing discussion on SRL on Twitter from 2011 until 2018, followed by a marked decrease until the collection day. For topic modeling, the text mining technique of Latent Dirichlet Allocation (LDA) was applied and revealed insights on computationally processed topics. Finally, the geocoding analysis uncovers a diverse community from all over the world, yet a higher density of users from the Global North was identified. Further implications are discussed in the paper.
Data from: Mapping English-Language AI Research Controversies on Twitter,...
beta.ukdataservice.ac.uk
Updated 2025
Cite
Noortje Suzanne Marres (2025). Mapping English-Language AI Research Controversies on Twitter, 2022 [Dataset]. http://doi.org/10.5255/ukda-sn-857742
Unique identifier
https://doi.org/10.5255/ukda-sn-857742
Dataset updated
2025
Dataset provided by
DataCite: https://www.datacite.org/
UK Data Service: https://ukdataservice.ac.uk/
Authors
Noortje Suzanne Marres
Description
This submission consists of 12 data sets containing Twitter IDs pertaining to 6 AI controversies identified by UK-based experts in AI and Society as especially significant during the period 2012-2021. The data sets were collected by researchers at the University of Warwick as part of the 3-year international project “Shaping AI” which mapped controversies about “Artificial Intelligence” (AI) during 2012-2022. Research teams in the UK, France, Germany and Canada analysed controversies about AI in their countries across different spheres: research, policy and the media during this 10-year period. The UK team at the University of Warwick designed and undertook an analysis of research controversies about AI in the relevant period following a standpoint methodology. Our study began with an online consultation that took place in the Autumn of 2021, in which we asked UK-based experts in AI from across disciplines to identify what are the most important concerns, disputes and problematics that have arisen in the last 10 years in relation to AI as a strategic area of research.

Based on the responses to this expert consultation—described in detail in Marres et al (2024) and Poletti et al (forthcoming)—we identified a broad range of relevant controversy topics, objects and problems. To select controversies for further analysis, we considered their research intensity, in the form of a frequency count of research publications mentioned by respondents in relation to controversy topics.

On this basis, we selected 6 AI research controversies for further research: COMPAS; NHS+Deepmind; Gaydar; Facial recognition; Stochastic Parrots (LLMs) & Deeplearning as a solution for AI. For each of these controversies, we collected Twitter data by submitting queries to Twitter's academic API using TWARC between January 2022 and June 2022. Further details of the methods of data collection and curation can be found in the methods file with further detail of the queries in the ReadMe file.
Data from: GeoCoV19: A Dataset of Hundreds of Millions of Multilingual...
zenodo.org
data.niaid.nih.gov
zip
Updated Jun 16, 2020
Cite
Umair Qazi; Muhammad Imran; Ferda Ofli (2020). GeoCoV19: A Dataset of Hundreds of Millions of Multilingual COVID-19 Tweets with Location Information [Dataset]. http://doi.org/10.5281/zenodo.3878599
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.3878599
Dataset updated
Jun 16, 2020
Dataset provided by
Zenodo: http://zenodo.org/
Authors
Umair Qazi; Muhammad Imran; Ferda Ofli
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
We present GeoCoV19, a large-scale Twitter dataset related to the ongoing COVID-19 pandemic. The dataset has been collected over a period of 90 days from February 1 to May 1, 2020 and consists of more than 524 million multilingual tweets. As the geolocation information is essential for many tasks such as disease tracking and surveillance, we employed a gazetteer-based approach to extract toponyms from user location and tweet content to derive their geolocation information using the Nominatim (Open Street Maps) data at different geolocation granularity levels. In terms of geographical coverage, the dataset spans over 218 countries and 47K cities in the world. The tweets in the dataset are from more than 43 million Twitter users, including around 209K verified accounts. These users posted tweets in 62 different languages.
Data from: GeoCoV19: A Dataset of Hundreds of Millions of Multilingual...
ieee-dataport.org
Updated Jun 24, 2020
Cite
Muhammad Imran (2020). GeoCoV19: A Dataset of Hundreds of Millions of Multilingual COVID-19 Tweets with Location Information [Dataset]. https://ieee-dataport.org/open-access/geocov19-dataset-hundreds-millions-multilingual-covid-19-tweets-location-information
Dataset updated
Jun 24, 2020
Authors
Muhammad Imran
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
We present GeoCoV19
World - Twitter Sentiment By Country
kaggle.com
Updated Nov 10, 2020
Cite
William Jiang (2020). World - Twitter Sentiment By Country [Dataset]. https://www.kaggle.com/wjia26/twittersentimentbycountry/code
Dataset updated
Nov 10, 2020
Dataset provided by
Kaggle: http://kaggle.com/
Authors
William Jiang
License
http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html
Area covered
World
Description

Introduction

Ever wondered what people are saying about certain countries? Whether it's in a positive/negative light? What are the most commonly used phrases/words to describe the country? In this dataset I present tweets where a certain country gets mentioned in the hashtags (e.g. #HongKong, #NewZealand). It contains around 150 countries in the world. I've added an additional field called polarity which has the sentiment computed from the text field. Feel free to explore! Feedback is much appreciated!

Content

Each row represents a tweet. Creation dates of tweets range from 12/07/2020 to 25/07/2020. Will update on a monthly cadence.
- The country can be derived from the file_name field (this field is very Tableau-friendly when it comes to plotting maps).
- The date at which the tweet was created can be got from the created_at field.
- The search query used to query the Twitter Search Engine can be got from the search_query field.
- The full text of the tweet can be got from the text field.
- The sentiment can be got from the polarity field (I've used the Vader model from NLTK to compute this).
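
As a rough illustration of how the fields above might be used, the sketch below averages polarity per country. The sample rows, the date format, and the assumption that file_name holds the country name directly are hypothetical and may not match the real files.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows; the real files may be formatted differently.
SAMPLE = """file_name,created_at,search_query,text,polarity
HongKong,2020-07-12,#HongKong,Lovely skyline tonight,0.6
HongKong,2020-07-13,#HongKong,Traffic was awful,-0.4
NewZealand,2020-07-12,#NewZealand,Great hiking trip,0.8
"""

def mean_polarity_by_country(handle):
    """Average the polarity column for each distinct file_name value."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in csv.DictReader(handle):
        totals[row["file_name"]] += float(row["polarity"])
        counts[row["file_name"]] += 1
    return {country: totals[country] / counts[country] for country in totals}

result = mean_polarity_by_country(io.StringIO(SAMPLE))
```

To run this on a downloaded file, pass an open file handle instead of the in-memory sample.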

Notes

There may be slight duplications in tweet IDs before 22/07/2020. I have since fixed this bug.

Acknowledgements

Thanks to the tweepy package for making the data extraction via Twitter API so easy.

Shameless Plug

Feel free to check out my blog if you want to learn how I built the datalake via AWS or for other data shenanigans.

Here's an App I built using a live version of this data.
Replication Data for: Analysing the performance of a location inference...
search.dataone.org
dataverse.harvard.edu
Updated Nov 8, 2023
Cite
Serere, Helen Ngonidzashe (2023). Replication Data for: Analysing the performance of a location inference method on various Twitter source distribution [Dataset]. http://doi.org/10.7910/DVN/LOTEGM
Unique identifier
https://doi.org/10.7910/DVN/LOTEGM
Dataset updated
Nov 8, 2023
Dataset provided by
Harvard Dataverse
Authors
Serere, Helen Ngonidzashe
Description
Sample of tweets generated within a USA bounding box between August 2019 and April 2020. The data was used for the paper titled: Enhanced geocoding precision for location inference of tweet text using spaCy, Nominatim and Google Maps. A comparative analysis of the influence of data selection. The paper was submitted to the journal PLoS ONE. Two datasets have been submitted:
1. Dataset A: consisting of 133,577 geocoded tweets
2. Dataset B: consisting of 133,587 geocoded tweets
Coronavirus (COVID-19) Geo-tagged Tweets Dataset
ieee-dataport.org
Updated May 18, 2025
Cite
Rabindra Lamsal (2025). Coronavirus (COVID-19) Geo-tagged Tweets Dataset [Dataset]. https://ieee-dataport.org/open-access/coronavirus-covid-19-geo-tagged-tweets-dataset
Dataset updated
May 18, 2025
Authors
Rabindra Lamsal
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
only the tweet IDs are shared. The tweet IDs in this dataset belong to the tweets created providing an exact location.
Twitter Emoji Prediction
kaggle.com
Updated Feb 10, 2019
Cite
HariAS (2019). Twitter Emoji Prediction [Dataset]. https://www.kaggle.com/hariharasudhanas/twitter-emoji-prediction/code
Dataset updated
Feb 10, 2019
Dataset provided by
Kaggle
Authors
HariAS
Description
Content

Train.csv contains tweets, and the labels are emojis. You can find the emoji-label mapping in Mapping.csv. Predict the emojis to use for the test set.

Approaches

The best method among those tried was a bi-directional LSTM with GloVe embeddings (42B).

License

Belongs to the original author on Twitter
Street and Traffic SRs Web/Twitter Activity Map
data.wu.ac.at
csv, json, xml
Updated Apr 15, 2014
Cite
KCMO Information Technology People Soft CRM cases (2014). Street and Traffic SRs Web/Twitter Activity Map [Dataset]. https://data.wu.ac.at/odso/data_kcmo_org/bTliay1ua3k1
Available download formats: json, xml, csv
Dataset updated
Apr 15, 2014
Dataset provided by
KCMO Information Technology People Soft CRM cases
License
U.S. Government Works: https://www.usa.gov/government-works
License information was derived automatically
Description
Updated daily
Twitter Sentiment Geographical Index (MIT & Harvard)
sdgstoday-sdsn.hub.arcgis.com
Updated May 31, 2023
Cite
Sustainable Development Solutions Network (2023). Twitter Sentiment Geographical Index (MIT & Harvard) [Dataset]. https://sdgstoday-sdsn.hub.arcgis.com/maps/a49e84eca1694e6fad9eda6e8ecc86af
Dataset updated
May 31, 2023
Dataset authored and provided by
Sustainable Development Solutions Network
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
This web map is part of SDGs Today. Please see sdgstoday.org.

Promoting well-being is one of the key targets of Sustainable Development Goals at the United Nations. Many governments worldwide are incorporating subjective well-being (SWB) indicators to complement traditional objective and economic metrics. Our Twitter Sentiment Geographical Index (TSGI) can provide a high-granularity monitor of well-being worldwide.

This dataset is a joint effort of the Sustainable Urbanization Lab at MIT and Center for Geographic Analysis at Harvard.
#nowplaying
zenodo.org
explore.openaire.eu
application/gzip
Updated Jan 24, 2020
Cite
Eva Zangerle (2020). #nowplaying [Dataset]. http://doi.org/10.5281/zenodo.2594483
Available download formats: application/gzip
Unique identifier
https://doi.org/10.5281/zenodo.2594483
Dataset updated
Jan 24, 2020
Dataset provided by
Zenodo: http://zenodo.org/
Authors
Eva Zangerle
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset contains a dump of the #nowplaying dataset, which contains so-called listening events of users who publish the music they are currently listening to on Twitter. In particular, this dataset includes tracks which have been tweeted using the hashtags #nowplaying, #listento or #listeningto. In this dataset, we provide the track and artist of a listening event and metadata on the tweet (date sent, user, source). Furthermore, we provide a mapping of tracks to their respective MusicBrainz identifiers. The dataset features a total of 126 million listening events.

This archive contains the nowplaying.csv file, the main file which contains the following fields:

user id (each user is identified by a unique hash value)

source of the tweet (how it was sent; as provided by the Twitter API)

timestamp of the time the tweet underlying the listening event was sent

track title

artist name

musicbrainz identifier of the recording (cf. https://musicbrainz.org/)
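
A minimal sketch of how the fields listed above could be read and aggregated, here counting listening events per artist. The column names, their order, the comma delimiter, the absence of a header row, and the sample rows are all assumptions made for illustration; check the actual file before relying on them.

```python
import csv
import io
from collections import Counter

# Assumed column order, following the field list above. The real file may
# use different header names, another delimiter, or include a header row.
FIELDS = ["user_id", "source", "timestamp", "track_title", "artist_name", "musicbrainz_id"]

# Hypothetical sample rows standing in for nowplaying.csv.
SAMPLE = (
    "a1f3,Twitter for iPhone,2014-01-01 10:00:00,Song A,Artist X,mbid-1\n"
    "a1f3,Twitter Web Client,2014-01-01 10:05:00,Song B,Artist Y,mbid-2\n"
    "9c2e,Twitter for Android,2014-01-02 08:30:00,Song A,Artist X,mbid-1\n"
)

def events_per_artist(handle):
    """Count listening events per artist name."""
    reader = csv.DictReader(handle, fieldnames=FIELDS)
    return Counter(row["artist_name"] for row in reader)

counts = events_per_artist(io.StringIO(SAMPLE))
```

Because the reader streams one row at a time, the same function should also work on the full file without loading it into memory.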

In case you make use of our dataset in a scientific setting, we kindly ask you to cite the following paper:

Eva Zangerle, Martin Pichl, Wolfgang Gassler, and Günther Specht. 2014. #nowplaying Music Dataset: Extracting Listening Behavior from Twitter. In Proceedings of the First International Workshop on Internet-Scale Multimedia Management (WISMM '14). ACM, New York, NY, USA, 21-26.

If you have any questions or suggestions regarding the dataset, please do not hesitate to contact Eva Zangerle (eva.zangerle@uibk.ac.at).

Data from: Analyzing Mentions of Death in Covid-19 Tweets

zenodo.org

Updated Jul 6, 2024

Cite

Divya Mani Adhikari; Muhammad Imran; Umair Qazi; Ingmar Weber (2024). Analyzing Mentions of Death in Covid-19 Tweets [Dataset]. http://doi.org/10.5281/zenodo.10839649

Unique identifier

https://doi.org/10.5281/zenodo.10839649

Dataset updated

Jul 6, 2024

Dataset provided by

Zenodo: http://zenodo.org/

Authors

Divya Mani Adhikari; Muhammad Imran; Umair Qazi; Ingmar Weber

License

Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically

Description

Dataset preparation and annotation

The dataset is a subset of the TBCOV dataset collected at QCRI filtered for mentions of personally related COVID-19 deaths. The filtering was done using regular expressions such as my * passed, my * died, my * succumbed & lost * battle. A sample of the dataset was annotated on Appen. Please see 'annotation-instructions.txt' for the full instructions provided to the annotators.

Dataset description

The "classifier_filtered_english.csv" file contains 33k deduplicated and classifier-filtered tweets (following X's content redistribution policy) for the 6 countries (Australia, Canada, India, Italy, United Kingdom, and United States) from March 2020 to March 2021, with classifier-labeled death labels, regular expression-filtered gender and relationship labels, and the user device label. The full 57k regex-filtered collection of tweets can be made available in special cases to academics and researchers.

date: the date of the tweet

country_name: the country name from Nominatim API

tweet_id: the ID of the tweet

url: the full URL of the tweet

full_text: the full-text content of the tweet (also includes the URL of any media attached)

does_the_tweet_refer_to_the_covidrelated_death_of_one_or_more_individuals_personally_known_to_the_tweets_author: the classifier predicted label for the death (also includes the original labels for the annotated samples)

what_is_the_relationship_between_the_tweets_author_and_the_victim_mentioned: the annotated relationship labels

relative_to_the_time_of_the_tweet_when_did_the_mentioned_death_occur: the annotated relative time labels

user_is_verified: if the user is verified or not

user_gender: the gender of the Twitter user (from the user profile)

user_device: the Twitter client the user uses

has_media: if the tweet has any attached media

has_url: if the tweet text contains a URL

matched_device: the device (Apple or Android) based on the Twitter client

regex_gender: the gender inferred from regular expression-based filtering

regex_relationship: the relationship label from regular expression-based filtering

Inferring gender using regular expressions

We first determine the mapping between different relationship labels mentioned in the tweet to the gender. We do not use any relationship like "cousin" from which we cannot easily infer the gender.

Male relationships: 'father', 'dad', 'daddy', 'papa', 'pop', 'pa', 'son', 'brother', 'uncle', 'nephew', 'grandfather', 'grandpa', 'gramps', 'husband', 'boyfriend', 'fiancé', 'groom', 'partner', 'beau', 'friend', 'buddy', 'pal', 'mate', 'companion', 'boy', 'gentleman', 'man', 'father-in-law', 'brother-in-law', 'stepfather', 'stepbrother'

Female relationships: 'mother', 'mom', 'mama', 'mum', 'ma', 'daughter', 'sister', 'aunt', 'niece', 'grandmother', 'grandma', 'granny', 'wife', 'girlfriend', 'fiancée', 'bride', 'partner', 'girl', 'lady', 'woman', 'miss', 'mother-in-law', 'sister-in-law', 'stepmother', 'stepsister'

Based on these mappings, we used the following regex for each gender label to determine the gender of the deceased mentioned in the tweet.

"[m|M]y\s(" + "|".join([r + "s?" for r in relationships]) + ")\s(died|succumbed|deceased)"

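The expression above can be tried out with a short sketch like the following. The relationship lists are shortened, illustrative subsets of the ones given earlier, and the helper names build_pattern and infer_gender are hypothetical; this is not the authors' code. Note that, as written, [m|M] is a character class, so it also accepts a literal "|" in addition to "m" and "M".

```python
import re

# Shortened, illustrative subsets of the relationship lists (assumption).
MALE = ["father", "dad", "brother", "uncle", "grandfather", "husband", "son"]
FEMALE = ["mother", "mom", "sister", "aunt", "grandmother", "wife", "daughter"]

def build_pattern(relationships):
    # Same structure as the expression in the description, using raw strings.
    # [m|M] is kept as given; it matches "m", "M", or a literal "|".
    return re.compile(
        r"[m|M]y\s("
        + "|".join(r + "s?" for r in relationships)
        + r")\s(died|succumbed|deceased)"
    )

PATTERNS = {"male": build_pattern(MALE), "female": build_pattern(FEMALE)}

def infer_gender(text):
    """Return 'male', 'female', or None, using the first pattern that matches."""
    for label, pattern in PATTERNS.items():
        if pattern.search(text):
            return label
    return None
```

Only the verbs in the final group (died, succumbed, deceased) trigger a match, so a phrase such as "my mother passed away" is not labeled by this particular expression.
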
Age groups from relationship labels

First, we get the relationship labels using regex filtering, and then we group them into different age-group categories as shown in the following table. The UK and the US use different age groups because of the different age group definitions in the official data.

Category	Relationship (from tweets)	Age Group (UK)	Age Group (US)
Grandparents	grandfather, grandmother	65+	65+
Parents	father, mother, uncle, aunt	45-64	35-64
Siblings	brother, sister, cousin	15-44	15-34
Children	son, daughter, nephew, niece	0-14	0-14
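
The table above can be expressed as a simple lookup, sketched below. The dictionary and function names are hypothetical, and the values are transcribed from the table as shown rather than taken from the authors' code.

```python
# Relationship -> category, transcribed from the table above.
CATEGORY = {
    "grandfather": "Grandparents", "grandmother": "Grandparents",
    "father": "Parents", "mother": "Parents", "uncle": "Parents", "aunt": "Parents",
    "brother": "Siblings", "sister": "Siblings", "cousin": "Siblings",
    "son": "Children", "daughter": "Children", "nephew": "Children", "niece": "Children",
}

# Category -> age group for each country, transcribed from the table above.
AGE_GROUP = {
    "Grandparents": {"UK": "65+", "US": "65+"},
    "Parents": {"UK": "45-64", "US": "35-64"},
    "Siblings": {"UK": "15-44", "US": "15-34"},
    "Children": {"UK": "0-14", "US": "0-14"},
}

def age_group(relationship, country):
    """Return the age group for a relationship label, or None if unmapped."""
    category = CATEGORY.get(relationship.lower())
    if category is None:
        return None
    return AGE_GROUP[category].get(country)
```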

Training the classifier

The 'english-training.csv' file contains about 13k deduplicated human-annotated tweets. We use a random seed (42) to create the train/test split. The model Covid-Bert-V2 was fine-tuned on the training set for 2 epochs with the following hyperparameters (obtained using 10-fold CV): random_seed: 42, batch_size: 32, dropout: 0.1. We obtained an F1-score of 0.81 on the test set. We used about 5% (671) of the combined and deduplicated annotated tweets as the test set, about 2% (255) as the validation set, and the remaining 12,494 tweets were used for fine-tuning the model. The tweets were preprocessed to replace mentions, URLs, emojis, etc. with generic keywords. The model was trained on a system with a single Nvidia A4000 16GB GPU. The fine-tuned model is also available as the 'model.bin' file. The code for fine-tuning the model and for reproducing the experiments is available in this GitHub repository.
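
One way a seeded split with these sizes could be produced is sketched below. The description does not say how the authors implemented their split, so the shuffle-and-slice approach and the function name split_indices are assumptions; only the seed (42) and the set sizes (671, 255, 12,494) come from the text.

```python
import random

def split_indices(n_total, n_test, n_val, seed=42):
    """Shuffle row indices with a fixed seed and slice them into three sets."""
    indices = list(range(n_total))
    random.Random(seed).shuffle(indices)
    test_idx = indices[:n_test]
    val_idx = indices[n_test:n_test + n_val]
    train_idx = indices[n_test + n_val:]
    return train_idx, val_idx, test_idx

# Sizes reported in the description: 671 test, 255 validation, 12,494 training.
train_idx, val_idx, test_idx = split_indices(671 + 255 + 12494, n_test=671, n_val=255)
```

Fixing the seed means the same rows land in the same set on every run, which is what makes the reported test score reproducible.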

Datasheet

We also include a datasheet for the dataset following the recommendation of "Datasheets for Datasets" (Gebru et al.), which provides more information about how the dataset was created and how it can be used. Please see "Datasheet.pdf".

NOTE: We recommend that researchers try to rehydrate the individual tweets to ensure that the user has not deleted the tweet since posting. This gives users a mechanism to opt out of having their data analyzed.

Please only use your institutional email when requesting the dataset as anything else (like gmail.com) will be rejected. The dataset will only be made available on reasonable request for Academics and Researchers. Please mention why you need the dataset and how you plan to use the dataset when making a request.

Data from: French Entity-Linking dataset between annotated tweets collected...
data.niaid.nih.gov
Updated Mar 25, 2023
Cite
Caillaut (2023). French Entity-Linking dataset between annotated tweets collected during major crises in France and French Wikipedia corpus [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7767293
Dataset updated
Mar 25, 2023
Dataset authored and provided by
Caillaut
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
France, French
Description
Most of the available datasets are not particularly adapted to our target application: geolocate natural disasters from social networks. First, social media posts are largely underrepresented in these datasets, and the only Twitter dataset lacks Entity-Linking annotations. Second, none of the datasets focuses on a crisis or natural disaster event.

To mitigate these issues, we extracted a collection of French tweets written during earthquakes and major floods that have occurred in France in recent years. We set up Label-Studio in order to annotate these tweets. A total of 4617 tweets were annotated, including 1678 tweets posted during earthquakes and 2939 during floods. For each annotated tweet, mentions were annotated using the set of labels described earlier in the paper as well as, when possible, the target Wikipedia title.

Named “RéSoCIO” in reference to the research project in which it was carried out, the dataset resulting from this work contains a total of 12 828 annotated mentions and 1 513 distinct Wikipedia entities. 85% of mentions were associated with a Wikipedia page, and 94% if we ignore the RISKNAT and DAMAGES labels, which are often difficult to map to an existing entity.

Labels	#Mentions	#Linked	#Entities
PERSON	315	263	136
ORG	863	790	281
GEOLOC	4375	4234	701
TRANSPORT	250	203	101
EVENT	35	21	16
FACILITY	129	94	49
RISKNAT	5502	4994	128
DAMAGES	1136	121	56
OTHER	223	200	46
Total	12828	1322	1513

Overview of the mentions annotated in the Twitter dataset. #Mentions shows the total number of mentions per label, #Linked the number of mentions linked to an entity and #Entities the number of distinct entities per label present in the dataset.

Labels	#Mentions	#Linked	#Entities
PERSON	1100102	1098406	557697
ORG	750925	749504	130394
GEOLOC	2729702	2728296	215924
TRANSPORT	161539	160487	53405
EVENT	798433	798251	86471
FACILITY	258835	258513	109867
RISKNAT	5502	4994	127
DAMAGES	1136	121	56
OTHER	4340621	4339658	682458
Total	10146795	10138230	1836399

Overview of the mentions annotated in the full dataset. #Mentions shows the total number of mentions per label, #Linked the number of mentions linked to an entity and #Entities the number of distinct entities per label present in the dataset.
Twitter data for "Remapping and visualizing baseball labor"
iro.uiowa.edu
zip
Updated Dec 13, 2017
Cite
Katherine Walden (2017). Twitter data for "Remapping and visualizing baseball labor" [Dataset]. https://iro.uiowa.edu/esploro/outputs/dataset/Twitter-data-for-Remapping-and-visualizing/9983736668802771
Available download formats: zip (470983 bytes)
Dataset updated
Dec 13, 2017
Dataset provided by
University of Iowa
Authors
Katherine Walden
License
Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
Time period covered
2019
Description
Recent baseball scholarship has drawn attention to U.S. professional baseball’s complex twentieth century labor dynamics and expanding global presence. From debates around desegregation to discussions about the sport’s increasingly multicultural identity and global presence, the cultural politics of U.S. professional baseball is connected to the problem of baseball labor. However, most scholars address these topics by focusing on Major League Baseball (MLB), ignoring other teams and leagues—Minor League Baseball (MiLB)—that develop players for Major League teams. Considering Minor League Baseball is critical to understanding the professional game in the United States, since players who populate Major League rosters constitute a fraction of U.S. professional baseball’s entire labor force. As a digital humanities dissertation on baseball labor and globalization, this project uses digital humanities approaches and tools to analyze and visualize a quantitative data set, exploring how Minor League Baseball relates to and complicates MLB-dominated narratives around globalization and diversity in U.S. professional baseball labor. This project addresses how MiLB demographics and global dimensions shifted over time, as well as how the timeline and movement of foreign-born players through the Minor Leagues differs from their U.S.-born counterparts. This project emphasizes the centrality and necessity of including MiLB data in studies of baseball’s labor and ideological significance or cultural meaning, making that argument by drawing on data analysis, visualization, and mapping to address how MiLB labor complicates or supplements existing understandings of the relationship between U.S. professional baseball’s global reach and “national pastime” claims.
A dataset of Spanish tweets on people and communities LGBTQI+ during the...
produccioncientifica.uhu.es
zenodo.org
Updated 2025
Cite
Mata, Jacinto; Gualda, Estrella (2025). A dataset of Spanish tweets on people and communities LGBTQI+ during the COVID-19 pandemic 2020-2022 [LGBTQI+ Dataset 2020-2022_es] [Dataset]. https://produccioncientifica.uhu.es/documentos/67bc32b7478fbf5d29390ca9
Dataset updated
2025
Authors
Mata, Jacinto; Gualda, Estrella
Description
The LGBTQI+ Dataset 2020-2022_es is a collection of 410,015 original tweets extracted from the social network Twitter between January 1, 2020, and December 31, 2022. To ensure data quality and relevance, retweets, replies, and other duplicate content were excluded, retaining only original tweets. The tweets were collected by Jacinto Mata (University of Huelva, I2C/CITES) with the support of the Python programming language and using the twarc2 tool and the Academic API v2 of Twitter. This data collection is part of the project “Conspiracy Theories and Hate Speech Online: Comparison of patterns in narratives and social networks about COVID-19, immigrants and refugees and LGBTI people [NON-CONSPIRA-HATE!]”, PID2021-123983OB-I00, funded by MCIN/AEI/10.13039/501100011033/ by FEDER/EU.

The search criteria (words and hashtags) used for the data collection followed the objectives of the aforementioned project and were defined by Estrella Gualda, Francisco Javier Santos Fernández and Jacinto Mata (University of Huelva, Spain). Terms and hashtags used for the search and extraction of tweets were: #orgullogay, #orgullotrans, #OrgulloLGTB, #OrgulloLGTBI, #Díadelorgullo, #TRANSFOBIA, #transexuales, #LGTB, #LGTBI, #LGTBIQ, #LGTBQ, #LGTBQ+, anti-gay, "anti gay", anti-trans, "anti trans", "Ley Anti-LGTB", "ley trans", "anti-ley trans".

This dataset, collected within the framework of the NON-CONSPIRA-HATE! project, aims to identify and map online hate speech narratives and conspiracy theories towards LGBTIQ+ people and communities. Additionally, the dataset is intended to support comparison of communication patterns in social media (rhetoric, language, micro-discourses, semantic networks, emotions, etc.) across the different datasets collected in this project. This dataset also contributes to mapping the actors, communities, and networks that spread hate messages and conspiracy theories, aiming to understand the patterns and strategies implemented by extremist sectors on social media. The dataset includes messages that address a wide range of topics related to the LGBTQI+ community, such as rights, visibility, the fight against discrimination and transphobia, as well as debates surrounding the Trans Law and other related issues. It includes expressions of support and celebration of Pride as well as hate speech and opposition to LGBTQI+ rights, along with debates and controversies surrounding these issues.

This dataset offers a wide range of possibilities for research in various disciplines, as the following examples express:

Social Sciences & Digital Humanities:
- Analysis of opinions, attitudes, and trends toward LGBTIQ+ people and communities.
- Studies on the evolution of public discourse and polarization around issues such as transphobia, hate speech, disinformation, LGBTIQ+ rights and pride, and others.
- Analysis of social and political actors, leaders or organizations disseminating diverse narratives on LGBTIQ+ topics.
- Research on the impact of specific events (e.g., Pride Day) on social media conversations.
- Investigations of social and semantic networks around LGBTIQ+ people and communities.
- Analysis of narratives, discourses and rhetoric around gender identity and sexual diversity.
- Comparative studies on the representation of LGBTIQ+ people and communities in different cultural or geographic contexts.

Computer Science and Artificial Intelligence:
- Development of algorithms for the automatic detection of hate speech, discriminatory language, or offensive content.
- Training natural language processing (NLP) models to analyze sentiments and emotions in texts related to LGBTIQ+ people and communities.

For more information on other technical details of the dataset and the structure of the .jsonl data, see the “Readme.txt” file.
X/Twitter: Countries with the largest audience 2025
statista.com
Updated Jun 19, 2025
Cite
Statista (2025). X/Twitter: Countries with the largest audience 2025 [Dataset]. https://www.statista.com/statistics/242606/number-of-active-twitter-users-in-selected-countries/
Dataset updated
Jun 19, 2025
Dataset authored and provided by
Statista: http://statista.com/
Time period covered
Feb 2025
Area covered
Worldwide
Description
Social network X/Twitter is particularly popular in the United States: as of February 2025, the microblogging service had an audience reach of 103.9 million users in the country. Japan and India ranked second and third, with more than 70 million and 25 million users respectively.

Global Twitter usage
As of the second quarter of 2021, X/Twitter had 206 million monetizable daily active users worldwide. The most-followed Twitter accounts include figures such as Elon Musk, Justin Bieber and former U.S. president Barack Obama.

X/Twitter and politics
X/Twitter has become an increasingly relevant tool in domestic and international politics. The platform has become a way to promote policies and interact with citizens and other officials, and most world leaders and foreign ministries have an official Twitter account. Former U.S. president Donald Trump used to be a prolific Twitter user before the platform permanently suspended his account in January 2021. In an August 2018 survey, 61 percent of respondents stated that Trump's use of Twitter as President of the United States was inappropriate.
Live Maps (Mature)
data-salemva.opendata.arcgis.com
Updated Jun 16, 2016
Cite
esri_en (2016). Live Maps (Mature) [Dataset]. https://data-salemva.opendata.arcgis.com/items/d74ca978920a4f31aab9fdbc4ff1ef1a
Dataset updated
Jun 16, 2016
Dataset provided by
Esri (http://esri.com/)
Authors
esri_en
Description
Live Maps is a configurable app template that provides the ability to consume live data feeds from a variety of sources.

Use Cases
- Provide a map that shows locations of health care facilities and the reported cases of influenza.
- Present the locations of political campaign events with related tweets.

Configurable Options
Live Maps is used to combine social media feeds with your operational content. It can be configured using the following options:
- Map: Choose the web map used in your application.
- Title: The application name displayed in the header.
- Subtitle: The application subtitle displayed in the header.
- Color: Choose the color scheme for the application.
- Feed: The live feed to use in the application; currently supports Twitter, Flickr, and SickWeather.
- Keyword: Optional search keyword for feeds like Twitter and Flickr.
- Interval: The interval in minutes to switch between records.
- Refresh interval: The interval in minutes to refresh the feed.

Supported Devices
This application is responsively designed to support use in browsers on desktops, mobile phones, and tablets.

Data Requirements
This application has no data requirements.

Get Started
This application can be created in the following ways:
- Click the Create a Web App button on this page
- Share a map and choose to Create a Web App
- On the Content page, click Create - App - From Template

Click the Download button to access the source code. Do this if you want to host the app on your own server and optionally customize it to add features or change styling.
Mapping ecological concepts using twitter
figshare.com
xml
Updated May 30, 2023
Cite
Timothée Poisot (2023). Mapping ecological concepts using twitter [Dataset]. http://doi.org/10.6084/m9.figshare.827286.v1
Available download formats: xml
Unique identifier
https://doi.org/10.6084/m9.figshare.827286.v1
Dataset updated
May 30, 2023
Dataset provided by
figshare
Authors
Timothée Poisot
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Interactions between key concepts mentioned on Twitter in tweets containing words from the field of ecology. See the URL for more details on the methodology. These data come from a series of relatively short sampling sessions.
Replication Tweet Data of "Does the rich man’s club employ social media to...
dataverse.harvard.edu
Updated Jan 20, 2025
Cite
Md. Saidur Rahman Khan (2025). Replication Tweet Data of "Does the rich man’s club employ social media to advance digital public diplomacy? Mapping the interactional network dynamics of OECD leaders’ cross-border communication on X (formerly Twitter)" [Dataset]. http://doi.org/10.7910/DVN/STHR49
Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
Unique identifier
https://doi.org/10.7910/DVN/STHR49
Dataset updated
Jan 20, 2025
Dataset provided by
Harvard Dataverse
Authors
Md. Saidur Rahman Khan
License
CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
This dataset contains the tweets, mentions, identities of mentioned persons, hashtags, and X URLs posted by OECD leaders during the study period.
DeepCube: Post-processing and annotated datasets of social media data
data.niaid.nih.gov
Updated Mar 15, 2024
Cite
Alexandros Mokas (2024). DeepCube: Post-processing and annotated datasets of social media data [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7732930
Dataset updated
Mar 15, 2024
Dataset provided by
Eleni Kamateri
Giannis Tsampoulatidis
Alexandros Mokas
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
Researcher(s): Alexandros Mokas, Eleni Kamateri

Supervisor: Ioannis Tsampoulatidis

This repository contains 3 social media datasets:

2 Post-processing datasets: These datasets contain post-processing data extracted from the analysis of social media posts collected for two different use cases during the first two years of the DeepCube project. More specifically, these include:

The UC2 dataset contains the post-processing analysis of the Twitter data collected for the DeepCube use case (UC2) dealing with climate-induced migration in Africa. It contains a total of 5,695,253 social media posts collected from Twitter, based on the initial version of the UC2 search criteria defined by Universitat De Valencia, focused on the regions of Ethiopia and Somalia, and covering the period from 26 June 2021 to March 2023.

The UC5 dataset contains the post-processing analysis of the Twitter and Instagram data collected for the DeepCube use case (UC5) related to sustainable and environmentally friendly tourism. It contains a total of 58,143 social media posts (12,881 collected from Twitter and 45,262 collected from Instagram), based on the initial version of the UC5 search criteria defined by MURMURATION SAS, focused on Brazil, and covering the period from 26 June 2021 to March 2023.

1 Annotated dataset: An additional annotated dataset was created that contains post-processing data along with annotations of Twitter posts collected for UC2 for the years 2010-2022. More specifically, it includes:

The UC2 dataset contains the post-processing of the Twitter data collected for the DeepCube use case (UC2) dealing with climate-induced migration in Africa. It contains a total of 1,721 annotated social media posts (412 relevant and 1,309 irrelevant) collected from Twitter, focused on the region of Somalia, and covering the period from 1 January 2010 to 31 December 2022.

For every social media post retrieved from Twitter and Instagram, a preprocessing step was performed. This involved a three-step analysis of each post using the appropriate web service. First, the location of the post was automatically extracted from the text using a location extraction service. Second, the images included in the post were analyzed using a concept extraction service, which identified and provided the top ten concepts that best described the image. These concepts included items such as "person," "building," "drought," "sun," and so on. Finally, the sentiment expressed in the post's text was determined by using a sentiment analysis service. The sentiment was classified as either positive, negative, or neutral.
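The three-step analysis described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the DeepCube implementation: the actual web services are not named in this description, so `preprocess`, the `Post` and `ProcessedPost` types, and the three toy service functions are all hypothetical stand-ins.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

# The three sentiment classes named in the description.
SENTIMENT_LABELS = {"positive", "negative", "neutral"}


@dataclass
class Post:
    text: str
    image_urls: List[str] = field(default_factory=list)


@dataclass
class ProcessedPost:
    location: Optional[str]
    concepts: List[List[str]]  # one list of up to ten concepts per image
    sentiment: str


def preprocess(
    post: Post,
    extract_location: Callable[[str], Optional[str]],
    extract_concepts: Callable[[str], List[str]],
    classify_sentiment: Callable[[str], str],
    top_k: int = 10,
) -> ProcessedPost:
    """Run location, concept, and sentiment extraction on one post."""
    # Step 1: location is extracted from the post text.
    location = extract_location(post.text)
    # Step 2: each image is described by at most top_k concepts.
    concepts = [extract_concepts(url)[:top_k] for url in post.image_urls]
    # Step 3: the text sentiment must be one of the three classes.
    sentiment = classify_sentiment(post.text)
    if sentiment not in SENTIMENT_LABELS:
        raise ValueError(f"unexpected sentiment label: {sentiment!r}")
    return ProcessedPost(location, concepts, sentiment)


# Hypothetical toy services, used only to make the sketch runnable.
def toy_location(text: str) -> Optional[str]:
    for place in ("Somalia", "Ethiopia"):
        if place in text:
            return place
    return None


def toy_concepts(image_url: str) -> List[str]:
    return ["person", "building", "drought", "sun"]


def toy_sentiment(text: str) -> str:
    return "neutral"


result = preprocess(
    Post("Drought reported in Somalia", ["https://example.org/a.jpg"]),
    toy_location,
    toy_concepts,
    toy_sentiment,
)
print(result)
```

Passing the services in as functions keeps the pipeline independent of any particular provider, so each toy function could be swapped for a real service client without changing `preprocess`.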

After the social media posts were preprocessed, they were visualized using the Social Media Web Application. This intuitive, user-friendly online application was designed for both expert and non-expert users and offers a web-based user interface for filtering and visualizing the collected social media data. The application provides various filtering options, an interactive map, a timeline, and a collection of graphs to help users analyze the data. Moreover, this application provides users with the option to download aggregated data for specific periods by applying filters and clicking the "Download Posts" button. This feature allows users to easily extract and analyze social media data outside of the web application, providing greater flexibility and control over data analysis.

The dataset is provided by INFALIA. INFALIA, a spin-off of the CERTH institute and a partner in an EU research project, releases this dataset containing Tweet IDs and post-processing data for the sole purpose of enabling validation of the research conducted within the DeepCube project. Moreover, Twitter Content provided in this dataset to third parties remains subject to the Twitter Policy, and those third parties must agree to the Twitter Terms of Service, Privacy Policy, Developer Agreement, and Developer Policy (https://developer.twitter.com/en/developer-terms) before receiving this download.
