100+ datasets found
  1. Just Another Day on Twitter: A Complete 24 Hours of Twitter Data

    • search.gesis.org
    Updated Oct 16, 2022
    Cite
    Pfeffer, Jürgen (2022). Just Another Day on Twitter: A Complete 24 Hours of Twitter Data [Dataset]. https://search.gesis.org/research_data/SDN-10.7802-2516
    Dataset provided by
    GESIS search
    GESIS, Köln
    Authors
    Pfeffer, Jürgen
    License

    https://www.gesis.org/en/institute/data-usage-terms

    Description

    At the end of October 2022, Elon Musk concluded his acquisition of Twitter. In the weeks and months before that, several questions were publicly discussed that were not only of interest to the platform's future buyers, but also of high relevance to the Computational Social Science research community. For example, how many active users does the platform have? What percentage of accounts on the site are bots? And, what are the dominating topics and sub-topical spheres on the platform? In a globally coordinated effort of 80 scholars to shed light on these questions, and to offer a dataset that will equip other researchers to do the same, we have collected 375 million tweets published within a 24-hour time period starting on September 21, 2022. To the best of our knowledge, this is the first complete 24-hour Twitter dataset that is available for the research community. With it, the present work aims to accomplish two goals. First, we seek to answer the aforementioned questions and provide descriptive metrics about Twitter that can serve as references for other researchers. Second, we create a baseline dataset for future research that can be used to study the potential impact of the platform's ownership change.

  2. Gender Prediction from Tweet Typo Data

    • opendatabay.com
    Updated Jul 5, 2025
    Cite
    Datasimple (2025). Gender Prediction from Tweet Typo Data [Dataset]. https://www.opendatabay.com/data/ai-ml/05c9578a-719d-4ab0-82cd-0aa99bfa2bbe
    Dataset authored and provided by
    Datasimple
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Area covered
    Social Media and Networking
    Description

    This dataset provides simple Twitter analytics data, focusing on user profiles and tweet content. Its primary purpose is to enable the classification of gender based on tweet characteristics, specifically exploring how likely users of different genders are to make typos in their tweets. It is a useful resource for Natural Language Processing (NLP) beginners looking to apply basic models to real-world social media data. The dataset includes unformatted tweet text, user information, and confidence scores related to various attributes.

    Columns

    The dataset contains the following key columns:

    • _unit_id: A unique identifier for the unit.
    • Tweet ID: The unique identifier for a tweet.
    • _golden: Indicates whether a user is a Golden User.
    • _unit_state: The state of the tweet.
    • _trusted_judgments: The level of trust associated with the judgment.
    • _last_judgment_at: The timestamp of the last judgment.
    • gender: The declared or inferred sex of the user.
    • gender:confidence: The confidence level associated with the gender classification.
    • profile_yn: A boolean indicating whether the user's profile is active or exists.
    • profile_yn:confidence: The confidence level for the profile's existence.
    • created: The date and time when the user's account was created.
    • Label Count: A count related to various labels within the dataset.

    Distribution

    The dataset is provided as a single data file, typically in CSV format. It comprises approximately 20,000 records. The structure includes various data types, such as IDs, boolean indicators, numerical confidence scores, and datetime stamps.
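
    As a rough illustration of working with a CSV of this shape, the sketch below keeps only rows whose gender label meets a confidence threshold. The column names gender and gender:confidence come from the listing above; the sample values, the date format, the 0.9 threshold, and the helper name confident_rows are assumptions made for the example.

```python
import csv
import io

# Hypothetical three-row sample; only the column names come from the listing.
SAMPLE = """_unit_id,gender,gender:confidence,profile_yn,created
1,female,1.0,yes,2013-12-05 01:48:00
2,male,0.66,yes,2012-10-01 13:51:00
3,female,0.93,yes,2014-11-28 11:30:00
"""


def confident_rows(handle, threshold=0.9):
    """Yield rows whose gender:confidence is at least `threshold`."""
    for row in csv.DictReader(handle):
        try:
            confidence = float(row["gender:confidence"])
        except (KeyError, TypeError, ValueError):
            continue  # skip rows with a missing or malformed score
        if confidence >= threshold:
            yield row


rows = list(confident_rows(io.StringIO(SAMPLE)))
unit_ids = [row["_unit_id"] for row in rows]
print(unit_ids)
```

    With a real file, io.StringIO(SAMPLE) would be replaced by an open file handle; the threshold is a modelling choice rather than something the dataset prescribes.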

    Usage

    This dataset is ideal for:

    • Classifying user gender based on tweet content and user profile information.
    • Analysing spelling errors or typos in tweets in relation to user demographics.
    • Developing and testing Natural Language Processing (NLP) models, particularly for tasks like text classification and sentiment analysis.
    • Exploring patterns in social media behaviour and user characteristics on Twitter.
    • Educational purposes for those new to applying machine learning techniques to real-world tweet data.

    Coverage

    The dataset offers global geographical coverage as indicated by its region. The time range for tweet activity appears to be concentrated around 26th to 27th October 2015. However, the account creation dates for the users span a much broader period, from 5th August 2006 to 26th October 2015. In terms of demographics, the dataset includes gender distribution, with approximately 33% female, 31% male, and 36% categorised as 'Other'.

    License

    CC0

    Who Can Use It

    This dataset is primarily intended for:

    • Data scientists and analysts interested in social media analytics and user behaviour.
    • Machine learning practitioners, especially those working on classification problems and NLP tasks.
    • Students and researchers in fields such as computer science, linguistics, and social sciences.
    • NLP enthusiasts who are developing or looking to test basic linear or naive models on real-world text data.

    Dataset Name Suggestions

    • Twitter User Profile & Activity Data
    • Gender Prediction from Tweet Typo Data
    • Social Media Analytics: Twitter User Gender
    • Tweet Classification for Gender Studies

    Attributes

    Original Data Source: Twitter Data

  3. Famous Words Twitter Dataset

    • kaggle.com
    Updated May 30, 2023
    Cite
    _w1998 (2023). Famous Words Twitter Dataset [Dataset]. https://www.kaggle.com/datasets/jackksoncsie/twitter-dataset-keywords-likes-and-tweets/discussion
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    _w1998
    License

    http://www.gnu.org/licenses/agpl-3.0.html

    Description

    The Famous Words Twitter Dataset is a comprehensive collection of tweets associated with famous words. The dataset provides valuable insights into the social media engagement and popularity of these words on the Twitter platform. It includes three primary columns: keyword, likes, and tweets.

    The keyword column represents the specific famous word or phrase associated with each tweet. It allows researchers and analysts to explore the dynamics of user interactions and discussions surrounding these popular terms on Twitter.

    The likes column indicates the number of likes received by each tweet. This metric serves as an indicator of the tweet's popularity and resonance among Twitter users.

    The tweet column contains the actual tweet text, capturing the content and context of user-generated messages related to the famous words. This column provides valuable qualitative data for sentiment analysis, topic modeling, and other natural language processing tasks.

    Researchers, data scientists, and social media analysts can leverage this dataset to study various aspects, such as tracking trends, sentiment analysis, understanding user engagement patterns, and identifying influential topics associated with famous words on Twitter.
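
    A minimal sketch of the kind of engagement analysis described above: averaging likes per keyword. The rows are invented for illustration, and the field names keyword, likes, and tweet follow the column descriptions in this listing; the function name mean_likes_by_keyword is an assumption.

```python
from collections import defaultdict

# Hypothetical rows shaped like the three described columns.
sample_rows = [
    {"keyword": "Bitcoin", "likes": 10, "tweet": "example text a"},
    {"keyword": "Bitcoin", "likes": 30, "tweet": "example text b"},
    {"keyword": "Tesla", "likes": 5, "tweet": "example text c"},
]


def mean_likes_by_keyword(rows):
    """Return a dict mapping each keyword to its average like count."""
    totals = defaultdict(lambda: [0, 0])  # keyword -> [sum of likes, tweet count]
    for row in rows:
        bucket = totals[row["keyword"]]
        bucket[0] += int(row["likes"])
        bucket[1] += 1
    return {keyword: total / count for keyword, (total, count) in totals.items()}


averages = mean_likes_by_keyword(sample_rows)
print(averages)
```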

    Topics: "COVID-19", "Vaccine", "Zoom", "Bitcoin", "Dogecoin", "NFT", "Elon Musk", "Tesla", "Amazon", "iPhone 12", "Remote work", "TikTok", "Instagram", "Facebook", "YouTube", "Netflix", "GameStop", "Super Bowl", "Olympics", "Black Lives Matter", "India vs England", "Ukraine", "Queen Elizabeth", "World Cup", "Jeffrey Dahmer", "Johnny Depp", "Will Smith", "Weather", "xvideo", "porn", "nba", "Macdonald".

    The dataset contains 128,837 tweets in total. The following plot shows the number of tweets for each keyword:

    https://i.imgur.com/z4xbbyt.png

    Note: The dataset is carefully curated, anonymized, and stripped of any personally identifiable information to protect user privacy.

  4. Data from: IA Tweets Analysis Dataset (Spanish)

    • zenodo.org
    • data.niaid.nih.gov
    csv
    Updated Aug 3, 2024
    Cite
    Gabriel Guerrero-Contreras; Sara Balderas-Díaz; Alejandro Serrano-Fernández; Andrés Muñoz (2024). IA Tweets Analysis Dataset (Spanish) [Dataset]. http://doi.org/10.5281/zenodo.10821485
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Gabriel Guerrero-Contreras; Sara Balderas-Díaz; Alejandro Serrano-Fernández; Andrés Muñoz
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    General Description

    This dataset comprises 4,038 tweets in Spanish, related to discussions about artificial intelligence (AI), and was created and utilized in the publication "Enhancing Sentiment Analysis on Social Media: Integrating Text and Metadata for Refined Insights," (10.1109/IE61493.2024.10599899) presented at the 20th International Conference on Intelligent Environments. It is designed to support research on public perception, sentiment, and engagement with AI topics on social media from a Spanish-speaking perspective. Each entry includes detailed annotations covering sentiment analysis, user engagement metrics, and user profile characteristics, among others.

    Data Collection Method

    Tweets were gathered through the Twitter API v1.1 by targeting keywords and hashtags associated with artificial intelligence, focusing specifically on content in Spanish. The dataset captures a wide array of discussions, offering a holistic view of the Spanish-speaking public's sentiment towards AI.

    Dataset Content

    • ID: A unique identifier for each tweet.
    • text: The textual content of the tweet. It is a string with a maximum allowed length of 280 characters.
    • polarity: The tweet's sentiment polarity (e.g., Positive, Negative, Neutral).
    • favorite_count: Indicates how many times the tweet has been liked by Twitter users. It is a non-negative integer.
    • retweet_count: The number of times this tweet has been retweeted. It is a non-negative integer.
    • user_verified: When true, indicates that the user has a verified account, which helps the public recognize the authenticity of accounts of public interest. It is a boolean data type with two allowed values: True or False.
    • user_default_profile: When true, indicates that the user has not altered the theme or background of their user profile. It is a boolean data type with two allowed values: True or False.
    • user_has_extended_profile: When true, indicates that the user has an extended profile. An extended profile on Twitter allows users to provide more detailed information about themselves, such as an extended biography, a header image, details about their location, website, and other additional data. It is a boolean data type with two allowed values: True or False.
    • user_followers_count: The current number of followers the account has. It is a non-negative integer.
    • user_friends_count: The number of users that the account is following. It is a non-negative integer.
    • user_favourites_count: The number of tweets this user has liked since the account was created. It is a non-negative integer.
    • user_statuses_count: The number of tweets (including retweets) posted by the user. It is a non-negative integer.
    • user_protected: When true, indicates that this user has chosen to protect their tweets, meaning their tweets are not publicly visible without their permission. It is a boolean data type with two allowed values: True or False.
    • user_is_translator: When true, indicates that the user posting the tweet is a verified translator on Twitter. This means they have been recognized and validated by the platform as translators of content in different languages. It is a boolean data type with two allowed values: True or False.
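
    The field constraints listed above (text of at most 280 characters, non-negative integer counts, booleans limited to True or False) can be checked while loading rows. The sketch below covers only a subset of the fields and assumes the CSV stores booleans as the literal strings "True" and "False"; the class and helper names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TweetRecord:
    """A subset of the documented fields, with the documented types."""
    id: str
    text: str
    polarity: str
    favorite_count: int
    retweet_count: int
    user_verified: bool


def parse_bool(value: str) -> bool:
    # Assumption: booleans are serialized as the strings "True" / "False".
    if value not in ("True", "False"):
        raise ValueError(f"expected 'True' or 'False', got {value!r}")
    return value == "True"


def parse_count(value: str) -> int:
    number = int(value)
    if number < 0:
        raise ValueError(f"count must be non-negative, got {number}")
    return number


def parse_record(row: dict) -> TweetRecord:
    text = row["text"]
    if len(text) > 280:
        raise ValueError("text exceeds the 280-character limit")
    return TweetRecord(
        id=row["ID"],
        text=text,
        polarity=row["polarity"],
        favorite_count=parse_count(row["favorite_count"]),
        retweet_count=parse_count(row["retweet_count"]),
        user_verified=parse_bool(row["user_verified"]),
    )


# Hypothetical row for illustration only.
record = parse_record({
    "ID": "1", "text": "La IA avanza", "polarity": "Positive",
    "favorite_count": "3", "retweet_count": "0", "user_verified": "False",
})
print(record)
```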

    Cite as

    Guerrero-Contreras, G., Balderas-Díaz, S., Serrano-Fernández, A., & Muñoz, A. (2024, June). Enhancing Sentiment Analysis on Social Media: Integrating Text and Metadata for Refined Insights. In 2024 International Conference on Intelligent Environments (IE) (pp. 62-69). IEEE.

    Potential Use Cases

    This dataset is aimed at academic researchers and practitioners with interests in:

    • Sentiment analysis and natural language processing (NLP) with a focus on AI discussions in the Spanish language.
    • Social media analysis on public engagement and perception of artificial intelligence among Spanish speakers.
    • Exploring correlations between user engagement metrics and sentiment in discussions about AI.

    Data Format and File Type

    The dataset is provided in CSV format, ensuring compatibility with a wide range of data analysis tools and programming environments.

    License

    The dataset is available under the Creative Commons Attribution 4.0 International (CC BY 4.0) license, permitting sharing, copying, distribution, transmission, and adaptation of the work for any purpose, including commercial, provided proper attribution is given.

  5. Twitter cascade dataset

    • researchdata.smu.edu.sg
    • figshare.com
    pdf
    Updated May 31, 2023
    Cite
    Living Analytics Research Centre (2023). Twitter cascade dataset [Dataset]. http://doi.org/10.25440/smu.12062709.v1
    Dataset provided by
    SMU Research Data Repository (RDR)
    Authors
    Living Analytics Research Centre
    License

    http://rightsstatements.org/vocab/InC/1.0/

    Description

    This dataset comprises a set of information cascades generated by Singapore Twitter users. Here a cascade is defined as a set of tweets about the same topic. The dataset was collected via the Twitter REST and streaming APIs as follows. Starting from popular seed users (i.e., users with many followers), we crawled their follow, retweet, and user mention links. We then added those followers/followees, retweet sources, and mentioned users who state Singapore in their profile location, giving a total of 184,794 Twitter user accounts. Tweets were then crawled from these users from 1 April to 31 August 2012, yielding 32,479,134 tweets. To identify cascades, we extracted all URL links and hashtags from these tweets and treated each one as the identity of a cascade. In other words, all the tweets that contain the same URL link (or the same hashtag) represent a cascade. Mathematically, a cascade is represented as a set of user-timestamp pairs. Figure 1 provides an example, i.e., cascade C = {< u1, t1 >, < u2, t2 >, < u1, t3 >, < u3, t4 >, < u4, t5 >}. For evaluation, the dataset was split into two parts: four months of data for training and the last month of data for testing. Table 1 summarizes the basic (count) statistics of the dataset. Each line in each file represents a cascade. The first term in each line is a hashtag or URL, and the second term is a list of user-timestamp pairs. Due to privacy concerns, all user identities are anonymized.
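
    The cascade construction described above can be sketched as follows: group tweets by each hashtag or URL they contain, store each cascade as time-ordered user-timestamp pairs, then split at a cutoff date. The input tuple layout, the sample values, and the rule of splitting a cascade's pairs at the cutoff are assumptions; the description does not say how cascades spanning the boundary are handled, and the released files use their own line format.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical tweets: (anonymized user, timestamp, hashtags/URLs in the tweet).
tweets = [
    ("u1", datetime(2012, 4, 2), ["#sg"]),
    ("u2", datetime(2012, 5, 9), ["#sg", "http://example.com/a"]),
    ("u1", datetime(2012, 8, 3), ["#sg"]),
    ("u3", datetime(2012, 8, 20), ["http://example.com/a"]),
]


def build_cascades(tweets):
    """Map each hashtag or URL to its time-ordered (user, timestamp) pairs."""
    cascades = defaultdict(list)
    for user, timestamp, identifiers in tweets:
        for identifier in identifiers:
            cascades[identifier].append((user, timestamp))
    for pairs in cascades.values():
        pairs.sort(key=lambda pair: pair[1])
    return dict(cascades)


def split_by_time(cascades, cutoff):
    """Split each cascade's pairs into before-cutoff and from-cutoff parts."""
    train, test = {}, {}
    for identifier, pairs in cascades.items():
        early = [pair for pair in pairs if pair[1] < cutoff]
        late = [pair for pair in pairs if pair[1] >= cutoff]
        if early:
            train[identifier] = early
        if late:
            test[identifier] = late
    return train, test


cascades = build_cascades(tweets)
# April to July for training, August for testing.
train, test = split_by_time(cascades, datetime(2012, 8, 1))
print(len(train["#sg"]), len(test["#sg"]))
```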

  6. Twitter Dataset

    • brightdata.com
    .json, .csv, .xlsx
    Updated May 18, 2025
    Cite
    Bright Data (2025). Twitter Dataset [Dataset]. https://brightdata.com/products/datasets/twitter
    Dataset authored and provided by
    Bright Data (https://brightdata.com/)
    License

    https://brightdata.com/license

    Area covered
    Worldwide
    Description

    Utilize our Twitter dataset for diverse applications to enrich business strategies and market insights. Analyzing this dataset provides a comprehensive understanding of social media trends, empowering organizations to refine their communication and marketing strategies. Access the entire dataset or customize a subset to fit your needs. Popular use cases include market research to identify trending topics and hashtags, AI training by reviewing factors such as tweet content, retweets, and user interactions for predictive analytics, and trend forecasting by examining correlations between specific themes and user engagement to uncover emerging social media preferences.

  7. Data from: Google Analytics & Twitter dataset from a movies, TV series and...

    • portalcientificovalencia.univeuropea.com
    • figshare.com
    Updated 2024
    Cite
    Google Analytics & Twitter dataset from a movies, TV series and videogames website [Dataset]. https://portalcientificovalencia.univeuropea.com/documentos/67321ed3aea56d4af0485dc8
    Authors
    Yeste, Víctor
    Description

    Author: Víctor Yeste. Universitat Politècnica de Valencia.

    The object of this study is the design of a cybermetric methodology that measures the success of content published in online media and explores whether the selected success variables can be predicted.

    Because the study integrates data from two separate areas (web publishing, and the analysis of shares and related topics on Twitter), the data were collected programmatically through both the Google Analytics v4 reporting API and the Twitter Standard API, always respecting their limits.

    The website analyzed is hellofriki.com, an online media outlet that covers topics generating a large volume of daily news and that also publishes analyses, reports, interviews, and many other information formats. All of this content falls under the sections of cinema, series, video games, literature, and comics.

    This dataset contributed to the PhD thesis:

    Yeste Moreno, VM. (2021). Diseño de una metodología cibermétrica de cálculo del éxito para la optimización de contenidos web [Tesis doctoral]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/176009

    Data were obtained for each breaking news article published online, according to the indicators described in the doctoral thesis. All related data are stored in a database, divided into the following tables:

    tesis_followers: User ID list of media account followers.

    tesis_hometimeline: data from tweets posted by the media account sharing breaking news from the web.

    • status_id: Tweet ID
    • created_at: date of publication
    • text: content of the tweet
    • path: URL extracted after processing the shortened URL in text
    • post_shared: Article ID in WordPress that is being shared
    • retweet_count: number of retweets
    • favorite_count: number of favorites

    tesis_hometimeline_other: data from tweets posted by the media account that do not share breaking news from the web: other typologies, automatic Facebook shares, custom tweets without a link to an article, etc. It has the same fields as tesis_hometimeline.

    tesis_posts: data of articles published by the web and processed for some analysis.

    • stats_id: Analysis ID
    • post_id: Article ID in WordPress
    • post_date: article publication date in WordPress
    • post_title: title of the article
    • path: URL of the article on the media website
    • tags: Tags ID or WordPress tags related to the article
    • uniquepageviews: unique page views
    • entrancerate: entrance rate
    • avgtimeonpage: average visit time
    • exitrate: exit rate
    • pageviewspersession: page views per session
    • adsense_adunitsviewed: number of ads viewed by users
    • adsense_viewableimpressionpercent: ad display ratio
    • adsense_ctr: ad click ratio
    • adsense_ecpm: estimated ad revenue per 1000 page views

    tesis_stats: data from a particular analysis, performed at each published breaking news item. Fields with statistical values can be computed from the data in the other tables, but total and average calculations are saved for faster and easier further processing.

    • id: ID of the analysis
    • phase: phase of the thesis in which the analysis has been carried out (right now all are 1)
    • time: "0" if at the time of publication, "1" if 14 days later
    • start_date: date and time of measurement on the day of publication
    • end_date: date and time when the measurement is made 14 days later
    • main_post_id: ID of the published article to be analysed
    • main_post_theme: main section of the published article to analyze
    • superheroes_theme: "1" if about superheroes, "0" if not
    • trailer_theme: "1" if trailer, "0" if not
    • name: empty field, available for adding a custom name manually
    • notes: empty field, available for adding personalized notes manually, for example if a tag has been removed manually for being considered too generic even though the editor added it
    • num_articles: number of articles analysed
    • num_articles_with_traffic: number of articles analysed with traffic (which will be taken into account for traffic analysis)
    • num_articles_with_tw_data: number of articles with data from when they were shared on the media's Twitter account
    • num_terms: number of terms analyzed
    • uniquepageviews_total: total page views
    • uniquepageviews_mean: average page views
    • entrancerate_mean: average entrance rate
    • avgtimeonpage_mean: average duration of visits
    • exitrate_mean: average exit rate
    • pageviewspersession_mean: average page views per session
    • total: total of ads viewed
    • adsense_adunitsviewed_mean: average of ads viewed
    • adsense_viewableimpressionpercent_mean: average ad display ratio
    • adsense_ctr_mean: average ad click ratio
    • adsense_ecpm_mean: estimated ad revenue per 1000 page views
    • Total: total income
    • retweet_count_mean: average income
    • favorite_count_total: total of favorites
    • favorite_count_mean: average of favorites
    • terms_ini_num_tweets: total tweets on the terms on the day of publication
    • terms_ini_retweet_count_total: total retweets on the terms on the day of publication
    • terms_ini_retweet_count_mean: average retweets on the terms on the day of publication
    • terms_ini_favorite_count_total: total of favorites on the terms on the day of publication
    • terms_ini_favorite_count_mean: average of favorites on the terms on the day of publication
    • terms_ini_followers_talking_rate: ratio of followers of the media Twitter account who have recently published a tweet talking about the terms on the day of publication
    • terms_ini_user_num_followers_mean: average followers of users who have spoken of the terms on the day of publication
    • terms_ini_user_num_tweets_mean: average number of tweets published by users who spoke about the terms on the day of publication
    • terms_ini_user_age_mean: average age in days of users who have spoken of the terms on the day of publication
    • terms_ini_ur_inclusion_rate: URL inclusion ratio of tweets talking about terms on the day of publication
    • terms_end_num_tweets: total tweets on terms 14 days after publication
    • terms_ini_retweet_count_total: total retweets on terms 14 days after publication
    • terms_ini_retweet_count_mean: average retweets on terms 14 days after publication
    • terms_ini_favorite_count_total: total of favorites on terms 14 days after publication
    • terms_ini_favorite_count_mean: average of favorites on terms 14 days after publication
    • terms_ini_followers_talking_rate: ratio of media Twitter account followers who have recently posted a tweet talking about the terms 14 days after publication
    • terms_ini_user_num_followers_mean: average followers of users who have spoken of the terms 14 days after publication
    • terms_ini_user_num_tweets_mean: average number of tweets published by users who have spoken about the terms 14 days after publication
    • terms_ini_user_age_mean: average age in days of users who have spoken of the terms 14 days after publication
    • terms_ini_ur_inclusion_rate: URL inclusion ratio of tweets talking about terms 14 days after publication

    tesis_terms: data of the terms (tags) related to the processed articles.

    • stats_id: Analysis ID
    • time: "0" if at the time of publication, "1" if 14 days later
    • term_id: Term ID (tag) in WordPress
    • name: name of the term
    • slug: URL of the term
    • num_tweets: number of tweets
    • retweet_count_total: total retweets
    • retweet_count_mean: average retweets
    • favorite_count_total: total of favorites
    • favorite_count_mean: average of favorites
    • followers_talking_rate: ratio of followers of the media Twitter account who have recently published a tweet talking about the term
    • user_num_followers_mean: average followers of users who were talking about the term
    • user_num_tweets_mean: average number of tweets published by users who were talking about the term
    • user_age_mean: average age in days of users who were talking about the term
    • url_inclusion_rate: URL inclusion ratio
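
    The description notes that the stored totals and averages can be recomputed from the other tables. A minimal sketch of that idea, using a few tesis_posts field names from the description: the sample rows are invented, and averaging only over articles with traffic is an assumption based on the note attached to num_articles_with_traffic, not a documented rule.

```python
from statistics import mean

# Hypothetical tesis_posts rows belonging to one analysis.
posts = [
    {"post_id": 101, "uniquepageviews": 120, "avgtimeonpage": 35.0},
    {"post_id": 102, "uniquepageviews": 80, "avgtimeonpage": 45.0},
    {"post_id": 103, "uniquepageviews": 0, "avgtimeonpage": 0.0},
]


def summarize(posts):
    """Recompute a few tesis_stats-style aggregates from per-article rows."""
    # Assumption: only articles with traffic enter the traffic averages.
    with_traffic = [post for post in posts if post["uniquepageviews"] > 0]
    views = [post["uniquepageviews"] for post in with_traffic]
    times = [post["avgtimeonpage"] for post in with_traffic]
    return {
        "num_articles": len(posts),
        "num_articles_with_traffic": len(with_traffic),
        "uniquepageviews_total": sum(views),
        "uniquepageviews_mean": mean(views) if views else 0.0,
        "avgtimeonpage_mean": mean(times) if times else 0.0,
    }


stats = summarize(posts)
print(stats)
```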

  8. Twitter Tweets Sentiment Dataset

    • kaggle.com
    • opendatabay.com
    Updated Apr 8, 2022
    Cite
    M Yasser H (2022). Twitter Tweets Sentiment Dataset [Dataset]. https://www.kaggle.com/datasets/yasserh/twitter-tweets-sentiment-dataset
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    M Yasser H
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    https://raw.githubusercontent.com/Masterx-AI/Project_Twitter_Sentiment_Analysis_/main/twitt.jpg

    Description:

    Twitter is an online social media platform where people share their thoughts as tweets. Some people misuse it to tweet hateful content. Twitter is trying to tackle this problem, and we can help by building a strong NLP-based classifier that identifies negative tweets so that they can be blocked. Can you build a strong classifier model to predict the same?

    Each row contains the text of a tweet and a sentiment label. In the training set you are provided with a word or phrase drawn from the tweet (selected_text) that encapsulates the provided sentiment.

    Make sure, when parsing the CSV, to remove the beginning / ending quotes from the text field, to ensure that you don't include them in your training.
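
    One way to follow the quoting advice above is to let a CSV parser handle standard double-quoted fields and then strip any quote characters that still wrap the text. The sketch below uses Python's csv module; the sample rows and the helper name strip_outer_quotes are assumptions for illustration.

```python
import csv
import io

# Hypothetical sample: one standard double-quoted field, one field wrapped
# in single quotes that the CSV parser leaves untouched.
SAMPLE = '''textID,text,sentiment
a1,"so happy, really",positive
b2,'sad day',negative
'''


def strip_outer_quotes(text):
    """Remove one matching pair of quotes wrapping the whole string."""
    if len(text) >= 2 and text[0] == text[-1] and text[0] in "\"'":
        return text[1:-1]
    return text


# csv.DictReader already removes the double quotes used for CSV quoting.
texts = [strip_outer_quotes(row["text"])
         for row in csv.DictReader(io.StringIO(SAMPLE))]
print(texts)
```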

    You're attempting to predict the word or phrase from the tweet that exemplifies the provided sentiment. The word or phrase should include all characters within that span (i.e. including commas, spaces, etc.)

    Columns:

    1. textID - unique ID for each piece of text
    2. text - the text of the tweet
    3. sentiment - the general sentiment of the tweet

    Acknowledgement:

    The dataset was downloaded from Kaggle Competitions:
    https://www.kaggle.com/c/tweet-sentiment-extraction/data?select=train.csv

    Objective:

    • Understand the Dataset & cleanup (if required).
    • Build classification models to predict the twitter sentiments.
    • Compare the evaluation metrics of various classification algorithms.

  9. Information Diffusion Dataset on Twitter with User Tweets

    • ieee-dataport.org
    Updated Dec 3, 2023
    Cite
    Zejian Wang (2023). Information Diffusion Dataset on Twitter with User Tweets [Dataset]. https://ieee-dataport.org/documents/information-diffusion-dataset-twitter-user-tweets
    Authors
    Zejian Wang
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    We looked at 10

  10. Twitter bot profiling

    • researchdata.smu.edu.sg
    • smu.edu.sg
    pdf
    Updated May 31, 2023
    Cite
    Living Analytics Research Centre (2023). Twitter bot profiling [Dataset]. http://doi.org/10.25440/smu.12062706.v1
    Dataset provided by
    SMU Research Data Repository (RDR)
    Authors
    Living Analytics Research Centre
    License

    http://rightsstatements.org/vocab/InC/1.0/

    Description

    This dataset comprises a set of Twitter accounts in Singapore that are used for social bot profiling research conducted by the Living Analytics Research Centre (LARC) at Singapore Management University (SMU). Here a bot is defined as a Twitter account that generates content and/or interacts with other users automatically (at least according to human judgment). In this research, Twitter bots have been categorized into three major types:

    • Broadcast bot. This bot aims to disseminate information to a general audience by providing, e.g., benign links to news, blogs, or sites. Such a bot is often managed by an organization or a group of people (e.g., bloggers).
    • Consumption bot. The main purpose of this bot is to aggregate content from various sources and/or provide update services (e.g., horoscope reading, weather update) for personal consumption or use.
    • Spam bot. This type of bot posts malicious content (e.g., to trick people by hijacking a certain account or redirecting them to malicious sites), or aggressively promotes harmless but invalid/irrelevant content.

    This categorization is general enough to accommodate new, emerging types of bots (e.g., chatbots can be viewed as a special type of broadcast bot). The dataset was collected from 1 January to 30 April 2014 via the Twitter REST and streaming APIs. Starting from popular seed users (i.e., users with many followers), their follow, retweet, and user mention links were crawled. Data collection proceeded by adding those followers/followees, retweet sources, and mentioned users who state Singapore in their profile location. Using this procedure, a total of 159,724 accounts were collected.

    To identify bots, the first step was to select active accounts that tweeted at least 15 times within the month of April 2014. These accounts were then manually checked and labelled, and 589 bots were found. Because many more human users are expected in the Twitter population, the remaining accounts were randomly sampled and manually checked, which identified 1,024 human accounts. In total, this results in 1,613 labelled accounts.

    Related Publication: R. J. Oentaryo, A. Murdopo, P. K. Prasetyo, and E.-P. Lim. (2016). On profiling bots in social media. Proceedings of the International Conference on Social Informatics (SocInfo’16), 92-109. Bellevue, WA. https://doi.org/10.1007/978-3-319-47880-7_6
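
    The activity filter described above (at least 15 tweets within April 2014) can be sketched as below. The input layout, a mapping from account ID to tweet dates, and the sample accounts are assumptions; the manual labelling step itself cannot be reproduced in code, so only the reported label counts are summed.

```python
from datetime import date


def active_accounts(tweet_dates_by_account, year=2014, month=4, min_tweets=15):
    """Return accounts with at least `min_tweets` tweets in the given month."""
    active = []
    for account, dates in tweet_dates_by_account.items():
        in_month = sum(1 for d in dates if d.year == year and d.month == month)
        if in_month >= min_tweets:
            active.append(account)
    return sorted(active)


# Hypothetical accounts: acct_a has 15 April tweets, acct_b only 14.
sample = {
    "acct_a": [date(2014, 4, day) for day in range(1, 16)],
    "acct_b": [date(2014, 4, 1)] * 14 + [date(2014, 3, 31)],
}

active = active_accounts(sample)
labelled_total = 589 + 1024  # bots + humans, as reported in the description
print(active, labelled_total)
```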

  11. Twitter dataset

    • figshare.com
    csv
    Updated Feb 11, 2025
    Cite
    Shreyas Poojary; Mohammed Riza; Rashmi Laxmikant Malghan (2025). Twitter dataset [Dataset]. http://doi.org/10.6084/m9.figshare.28390334.v2
    Available download formats: csv
    Dataset updated
    Feb 11, 2025
    Dataset provided by
    figshare
    Authors
    Shreyas Poojary; Mohammed Riza; Rashmi Laxmikant Malghan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains tweets labeled for sentiment analysis, categorized into Positive, Negative, and Neutral sentiments. The dataset includes tweet IDs, user metadata, sentiment labels, and tweet text, making it suitable for Natural Language Processing (NLP), machine learning, and AI-based sentiment classification research. Originally sourced from Kaggle, this dataset is curated for improved usability in social media sentiment analysis.

  12. Data from: IA Tweets Analysis Dataset (Spanish)

    • produccioncientifica.uca.es
    Updated 2024
    Cite
    Guerrero-Contreras, Gabriel; Balderas-Díaz, Sara; Serrano-Fernández, Alejandro; Muñoz, Andrés (2024). IA Tweets Analysis Dataset (Spanish) [Dataset]. https://produccioncientifica.uca.es/documentos/67321e53aea56d4af04854c2
    Dataset updated
    2024
    Authors
    Guerrero-Contreras, Gabriel; Balderas-Díaz, Sara; Serrano-Fernández, Alejandro; Muñoz, Andrés
    Description

    Cite as

    Guerrero-Contreras, G., Balderas-Díaz, S., Serrano-Fernández, A., & Muñoz, A. (2024, June). Enhancing Sentiment Analysis on Social Media: Integrating Text and Metadata for Refined Insights. In 2024 International Conference on Intelligent Environments (IE) (pp. 62-69). IEEE.

    General Description

    This dataset comprises 4,038 tweets in Spanish related to discussions about artificial intelligence (AI). It was created and used in the publication "Enhancing Sentiment Analysis on Social Media: Integrating Text and Metadata for Refined Insights" (DOI: 10.1109/IE61493.2024.10599899), presented at the 20th International Conference on Intelligent Environments. It is designed to support research on public perception, sentiment, and engagement with AI topics on social media from a Spanish-speaking perspective. Each entry includes annotations covering sentiment, user engagement metrics, and user profile characteristics, among others.

    Data Collection Method

    Tweets were gathered through the Twitter API v1.1 by targeting keywords and hashtags associated with artificial intelligence, focusing specifically on content in Spanish. The dataset captures a wide array of discussions, offering a holistic view of the Spanish-speaking public's sentiment towards AI.

    Dataset Content

    ID: A unique identifier for each tweet.

    text: The textual content of the tweet. It is a string with a maximum allowed length of 280 characters.

    polarity: The tweet's sentiment polarity (e.g., Positive, Negative, Neutral).

    favorite_count: Indicates how many times the tweet has been liked by Twitter users. It is a non-negative integer.

    retweet_count: The number of times this tweet has been retweeted. It is a non-negative integer.

    user_verified: When true, indicates that the user has a verified account, which helps the public recognize the authenticity of accounts of public interest. It is a boolean data type with two allowed values: True or False.

    user_default_profile: When true, indicates that the user has not altered the theme or background of their user profile. It is a boolean data type with two allowed values: True or False.

    user_has_extended_profile: When true, indicates that the user has an extended profile. An extended profile on Twitter allows users to provide more detailed information about themselves, such as an extended biography, a header image, details about their location, website, and other additional data. It is a boolean data type with two allowed values: True or False.

    user_followers_count: The current number of followers the account has. It is a non-negative integer.

    user_friends_count: The number of users that the account is following. It is a non-negative integer.

    user_favourites_count: The number of tweets this user has liked since the account was created. It is a non-negative integer.

    user_statuses_count: The number of tweets (including retweets) posted by the user. It is a non-negative integer.

    user_protected: When true, indicates that this user has chosen to protect their tweets, meaning their tweets are not publicly visible without their permission. It is a boolean data type with two allowed values: True or False.

    user_is_translator: When true, indicates that the user posting the tweet is a verified translator on Twitter. This means they have been recognized and validated by the platform as translators of content in different languages. It is a boolean data type with two allowed values: True or False.

    Potential Use Cases

    This dataset is aimed at academic researchers and practitioners with interests in:

    Sentiment analysis and natural language processing (NLP) with a focus on AI discussions in the Spanish language.

    Social media analysis on public engagement and perception of artificial intelligence among Spanish speakers.

    Exploring correlations between user engagement metrics and sentiment in discussions about AI.

    Data Format and File Type

    The dataset is provided in CSV format, ensuring compatibility with a wide range of data analysis tools and programming environments.
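
    As a rough illustration of how the column types listed under Dataset Content might be applied when reading the CSV, here is a minimal sketch using only the Python standard library. It assumes that the file's header names match the column names above exactly and that booleans are serialized as the strings True/False; the helper names `parse_row` and `load_tweets` and the sample row are hypothetical.

```python
import csv
import io

# Column groups taken from the dataset description.
BOOL_COLUMNS = [
    "user_verified", "user_default_profile", "user_has_extended_profile",
    "user_protected", "user_is_translator",
]
INT_COLUMNS = [
    "favorite_count", "retweet_count", "user_followers_count",
    "user_friends_count", "user_favourites_count", "user_statuses_count",
]


def parse_row(row):
    """Convert one CSV row (a dict of strings) to typed Python values."""
    parsed = dict(row)
    for name in BOOL_COLUMNS:
        # Assumption: booleans are stored as the text "True" or "False".
        parsed[name] = row[name].strip().lower() == "true"
    for name in INT_COLUMNS:
        value = int(row[name])
        if value < 0:
            raise ValueError(f"{name} must be non-negative, got {value}")
        parsed[name] = value
    return parsed


def load_tweets(handle):
    """Read every row from an open CSV file object."""
    return [parse_row(row) for row in csv.DictReader(handle)]


# Hypothetical one-row sample standing in for the real file.
SAMPLE = (
    "ID,text,polarity,favorite_count,retweet_count,user_verified,"
    "user_default_profile,user_has_extended_profile,user_followers_count,"
    "user_friends_count,user_favourites_count,user_statuses_count,"
    "user_protected,user_is_translator\n"
    "1,La IA avanza,Positive,3,1,False,True,False,120,80,45,900,False,False\n"
)

tweets = load_tweets(io.StringIO(SAMPLE))
print(tweets[0]["polarity"], tweets[0]["favorite_count"])  # Positive 3
```

    With the real file, `load_tweets(open(path, newline="", encoding="utf-8"))` would replace the in-memory sample; the encoding is likewise an assumption.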

    License

    The dataset is available under the Creative Commons Attribution 4.0 International (CC BY 4.0) license, permitting sharing, copying, distribution, transmission, and adaptation of the work for any purpose, including commercial, provided proper attribution is given.

  13. Twitter Public Sentiment Dataset

    • opendatabay.com
    Updated Jul 6, 2025
    + more versions
    Cite
    Datasimple (2025). Twitter Public Sentiment Dataset [Dataset]. https://www.opendatabay.com/data/ai-ml/04ea3224-1b10-48d4-871a-496c9a2633ff
    Dataset updated
    Jul 6, 2025
    Dataset authored and provided by
    Datasimple
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Telecommunications & Network Data
    Description

    This dataset provides a collection of 1000 tweets designed for sentiment analysis. The tweets were sourced from Twitter using Python and systematically generated using various modules to ensure a balanced representation of different tweet types, user behaviours, and sentiments. This includes the use of a random module for IDs and text, a faker module for usernames and dates, and a textblob module for assigning sentiment. The dataset's purpose is to offer a robust foundation for analysing and visualising sentiment trends and patterns, aiding in the initial exploration of data and the identification of significant patterns or trends.

    Columns

    • Tweet ID: A unique identifier assigned to each individual tweet.
    • Text: The actual textual content of the tweet.
    • User: The username of the individual who posted the tweet.
    • Created At: The date and time when the tweet was originally published.
    • Likes: The total number of likes or approvals the tweet received.
    • Retweets: The total count of times the tweet was shared by other users.
    • Sentiment: The categorised emotional tone of the tweet, typically labelled as positive, neutral, or negative.

    Distribution

    The dataset is provided in a CSV file format. It consists of 1000 individual tweet records, structured in a tabular layout with the columns detailed above. A sample file will be made available separately on the platform.

    Usage

    This dataset is ideal for:
    • Analysing and visualising sentiment trends and patterns in social media.
    • Initial data exploration to uncover insights into tweet characteristics and user emotions.
    • Identifying underlying patterns or trends within social media conversations.
    • Developing and training machine learning models for sentiment classification.
    • Academic research into Natural Language Processing (NLP) and social media dynamics.
    • Educational purposes, allowing students to practise data analysis and visualisation techniques.

    Coverage

    The dataset spans tweets created between January and April 2023, as observed from the included data samples. While specific geographic or demographic information for users is not available within the dataset, the nature of Twitter implies a general global scope, reflecting a variety of user behaviours and sentiments without specific regional or population group focus.

    License

    CC0

    Who Can Use It

    This dataset is valuable for:
    • Data Scientists and Machine Learning Engineers working on NLP tasks and model development.
    • Researchers in fields such as Natural Language Processing, Machine Learning Algorithms, Deep Learning, and Computer Science.
    • Data Analysts looking to extract insights from social media content.
    • Academics and Students undertaking projects related to sentiment analysis or social media studies.
    • Anyone interested in understanding online sentiment and user behaviour on social media platforms.

    Dataset Name Suggestions

    • Twitter Public Sentiment Dataset
    • Social Media Text Sentiment Analysis
    • General Tweet Mood Data
    • Twitter Sentiment Collection 2023
    • Microblog Sentiment Dataset

    Attributes

    Original Data Source: Twitter Sentiment Analysis using Roberta and Vader

  14. Data from: GeoCoV19: A Dataset of Hundreds of Millions of Multilingual...

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Jun 16, 2020
    + more versions
    Cite
    Umair Qazi; Muhammad Imran; Ferda Ofli (2020). GeoCoV19: A Dataset of Hundreds of Millions of Multilingual COVID-19 Tweets with Location Information [Dataset]. http://doi.org/10.5281/zenodo.3878599
    Available download formats: zip
    Dataset updated
    Jun 16, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Umair Qazi; Muhammad Imran; Ferda Ofli
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We present GeoCoV19, a large-scale Twitter dataset related to the ongoing COVID-19 pandemic. The dataset was collected over a period of 90 days, from February 1 to May 1, 2020, and consists of more than 524 million multilingual tweets. Because geolocation information is essential for many tasks such as disease tracking and surveillance, we employed a gazetteer-based approach to extract toponyms from user location and tweet content, and derived their geolocation using Nominatim (OpenStreetMap) data at different levels of geographic granularity. In terms of geographical coverage, the dataset spans 218 countries and 47K cities worldwide. The tweets in the dataset are from more than 43 million Twitter users, including around 209K verified accounts. These users posted tweets in 62 different languages.
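
    The stated collection window can be checked with a short calculation. The sketch below only uses figures from the description (the two dates and the "more than 524 million" lower bound); the variable names are illustrative, and the per-day figure is a lower-bound average, not a number reported by the authors.

```python
from datetime import date

start = date(2020, 2, 1)
end = date(2020, 5, 1)

# 2020 is a leap year: 29 (Feb) + 31 (Mar) + 30 (Apr) days.
collection_days = (end - start).days
print(collection_days)  # 90

# Lower bound from the description: "more than 524 million" tweets.
TWEETS_LOWER_BOUND = 524_000_000
average_per_day = TWEETS_LOWER_BOUND / collection_days
print(f"at least {average_per_day / 1e6:.1f} million tweets per day on average")
```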

  15. Data from: Twitter Dataset on the Russo-Ukrainian War

    • data.niaid.nih.gov
    • zenodo.org
    Updated Oct 20, 2023
    Cite
    Shevtsov, Alexander (2023). Twitter Dataset on the Russo-Ukrainian War [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8431046
    Dataset updated
    Oct 20, 2023
    Dataset provided by
    Shevtsov, Alexander
    Pratikakis, Polyvios
    Lamprou, Ioannis
    Antonakaki, Despoina
    Ioannidis, Sotiris
    Area covered
    Ukraine
    Description

    On 24 February 2022, Russia invaded Ukraine, an event now known as the Russo-Ukrainian War. We obtained our dataset through the Twitter API from 23 February 2022 until 23 June 2023. The collected dataset has 127,275,386 tweets, shared in the form of anonymized text, where the tweet/user IDs and user mentions are anonymized and do not provide any personal information. The dataset contains user discussion in more than 70 languages, of which the 20 most popular are: 'eng', 'fr', 'de', 'mix', 'it', 'es', 'ja', 'ru', 'pl', 'uk', 'tr', 'th', 'hi', 'qme', 'qht', 'nl', 'fi', 'ar', 'zh' and 'pt'. To preserve information integrity, tweets are separated and stored in different files ordered by creation date. The dataset is shared for further research purposes. Additionally, we provide the list of tweet IDs in the GitHub repository, which can be retrieved via the Twitter API. We also carried out some initial analyses, covering volume/activity, hashtag popularity, sentiment, and military intelligence, and published the results in the web portal.

  16. The Climate Change Twitter Dataset

    • data.mendeley.com
    • kaggle.com
    Updated May 19, 2022
    + more versions
    Cite
    Dimitrios Effrosynidis (2022). The Climate Change Twitter Dataset [Dataset]. http://doi.org/10.17632/mw8yd7z9wc.2
    Dataset updated
    May 19, 2022
    Authors
    Dimitrios Effrosynidis
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    If you use the dataset, cite the paper: https://doi.org/10.1016/j.eswa.2022.117541

    The most comprehensive dataset to date regarding climate change and human opinions via Twitter. It has the heftiest temporal coverage, spanning over 13 years, includes over 15 million tweets spatially distributed across the world, and provides the geolocation of most tweets. Seven dimensions of information are tied to each tweet, namely geolocation, user gender, climate change stance and sentiment, aggressiveness, deviations from historic temperature, and topic modeling, while accompanied by environmental disaster events information. These dimensions were produced by testing and evaluating a plethora of state-of-the-art machine learning algorithms and methods, both supervised and unsupervised, including BERT, RNN, LSTM, CNN, SVM, Naive Bayes, VADER, Textblob, Flair, and LDA.

    The following columns are in the dataset:

    ➡ created_at: The timestamp of the tweet.
    ➡ id: The unique id of the tweet.
    ➡ lng: The longitude at which the tweet was written.
    ➡ lat: The latitude at which the tweet was written.
    ➡ topic: Categorization of the tweet into one of ten topics, namely seriousness of gas emissions, importance of human intervention, global stance, significance of pollution awareness events, weather extremes, impact of resource overconsumption, Donald Trump versus science, ideological positions on global warming, politics, and undefined.
    ➡ sentiment: A score on a continuous scale from -1 to 1. Values closer to 1 indicate positive sentiment, values closer to -1 indicate negative sentiment, and values close to 0 indicate no sentiment or a neutral tone.
    ➡ stance: Whether the tweet supports the belief in man-made climate change (believer), rejects it (denier), or neither supports nor rejects it (neutral).
    ➡ gender: Whether the user who posted the tweet is male, female, or undefined.
    ➡ temperature_avg: The temperature deviation in Celsius, relative to the January 1951-December 1980 average, at the time and place the tweet was written.
    ➡ aggressiveness: Whether or not the tweet contains aggressive language.
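
    The continuous sentiment column described above is often easier to work with as discrete labels. The sketch below shows one way to bucket it. The function name `sentiment_label` and the `neutral_band` width of 0.1 are assumptions chosen for illustration; the dataset description does not define any cut-off values.

```python
def sentiment_label(score, neutral_band=0.1):
    """Map a sentiment score in [-1, 1] to a coarse label.

    Scores within +/- `neutral_band` of zero are treated as neutral.
    The band width is an illustrative assumption, not part of the dataset.
    """
    if not -1.0 <= score <= 1.0:
        raise ValueError(f"score must lie in [-1, 1], got {score}")
    if score > neutral_band:
        return "positive"
    if score < -neutral_band:
        return "negative"
    return "neutral"


for score in (0.8, -0.4, 0.02):
    print(score, sentiment_label(score))
```

    Any analysis that depends on the labels should report the band it used, since the proportion of neutral tweets changes directly with that choice.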

    Because Twitter forbids publishing the text of tweets, the text must be retrieved through a process called hydrating. Tools such as Twarc or Hydrator can be used to hydrate tweets.

  17. Cashtag Piggybacking dataset - Twitter dataset enriched with financial data

    • zenodo.org
    • data.niaid.nih.gov
    Updated Jan 24, 2020
    Cite
    Stefano Cresci; Fabrizio Lillo; Daniele Regoli; Serena Tardelli; Maurizio Tesconi (2020). Cashtag Piggybacking dataset - Twitter dataset enriched with financial data [Dataset]. http://doi.org/10.5281/zenodo.2686862
    Available download formats: zip, application/x-troff-me
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Stefano Cresci; Fabrizio Lillo; Daniele Regoli; Serena Tardelli; Maurizio Tesconi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is composed of

    • Twitter dataset of ~9M tweets mentioning stocks (cashtags) traded on the most important US markets, shared between May and September 2017 (users data enriched with bot classification label)
    • Financial information about ~30k companies found in those tweets, retrieved from Google Finance

    Refer to the paper below for more details.

    Cresci, S., Lillo, F., Regoli, D., Tardelli, S., & Tesconi, M. (2019). Cashtag Piggybacking: Uncovering Spam and Bot Activity in Stock Microblogs on Twitter. ACM Transactions on the Web (TWEB), 13(2), 11.

  18. A Twitter Dataset on Tweets about People who Got Lost due to Dementia

    • figshare.com
    application/gzip
    Updated Jan 16, 2018
    Cite
    Kelvin KF Tsoi; Nicholas B Chan; Felix CH Chan; Lingling Zhang; Annisa CH Lee; Helen ML Meng (2018). A Twitter Dataset on Tweets about People who Got Lost due to Dementia [Dataset]. http://doi.org/10.6084/m9.figshare.5788125.v1
    Available download formats: application/gzip
    Dataset updated
    Jan 16, 2018
    Dataset provided by
    figshare
    Authors
    Kelvin KF Tsoi; Nicholas B Chan; Felix CH Chan; Lingling Zhang; Annisa CH Lee; Helen ML Meng
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the dataset used and analyzed in the paper "How can we Better Use Twitter to find a Person who Got Lost due to Dementia?". A total of five tables are included.

    1. raw_tweets.rds: All tweets that mentioned (i) "Dementia" or "Alzheimer"; and (ii) "Lost" or "Missing", which were crawled from Twitter from April to May 2017.
    2. raw_userinfo.rds: The corresponding Twitter user info of Tweets.
    3. filtered_tweets.csv: Tweets that were included in the study. Details (age, gender, place, etc.) of the corresponding lost person mentioned in each tweet were appended in this table.
    4. filtered_userinfo.csv: The corresponding Twitter user info of Tweets that were included in the study. Occupation (police / media / others) of each user was appended in this table.
    5. cleansed_lostcases.csv: A cleansed table that shows several features of the lost cases.

  19. Population of X/Twitter users and web domains embedded in a multidimensional...

    • data.sciencespo.fr
    tsv
    Updated Mar 14, 2025
    Cite
    Antoine Vendeville; Jimena Royo-Letelier; Duncan Cassells; Jean-Philippe Cointet; Maxime Crépel; Tim Faverjon; Théophile Lenoir; Béatrice Mazoyer; Benjamin Ooghe-Tabanou; Armin Pournaki; Hiroki Yamashita; Pedro Ramaciotti (2025). Population of X/Twitter users and web domains embedded in a multidimensional political opinion space [Dataset]. http://doi.org/10.21410/7E4/QPECFF
    Available download formats: tsv(100846), tsv(106000433), tsv(177962), tsv(32523281), tsv(146217)
    Dataset updated
    Mar 14, 2025
    Dataset provided by
    data.sciencespo
    Authors
    Antoine Vendeville; Jimena Royo-Letelier; Duncan Cassells; Jean-Philippe Cointet; Maxime Crépel; Tim Faverjon; Théophile Lenoir; Béatrice Mazoyer; Benjamin Ooghe-Tabanou; Armin Pournaki; Hiroki Yamashita; Pedro Ramaciotti
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Many studies of political phenomena in social media require operationalizing the political stance of the users and content involved. Relevant examples include the study of segregation and polarization online, the study of political diversity in content diets in social media, and AI explainability. While many research designs rely on operationalizations best suited to the US setting, few can address more general designs, in which users and content might take stances on multiple ideology and issue dimensions, going beyond traditional Liberal-Conservative or Left-Right scales. To advance the study of more general online ecosystems, we present a dataset of the X/Twitter population of users in the French political Twittersphere and of web domains, embedded in a political space spanned by dimensions measuring attitudes towards immigration, the EU, liberal values, elites and institutions, nationalism and the environment. We provide several benchmarks validating the positions of these entities (based on both LLM and human annotations), and discuss several applications for this dataset.

  20. NVIDIA Twitter Mentions Dataset

    • opendatabay.com
    Updated Jul 5, 2025
    Cite
    Datasimple (2025). NVIDIA Twitter Mentions Dataset [Dataset]. https://www.opendatabay.com/data/ai-ml/9b095d5b-5edb-4cab-bf53-57796695b8c5
    Dataset updated
    Jul 5, 2025
    Dataset authored and provided by
    Datasimple
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Social Media and Networking
    Description

    This dataset provides 100,000 tweets that mention NVIDIA, collected during 2022 and 2023. It is designed to support sentiment analysis on natural language sentences, which can significantly enhance the accuracy of market prediction. The dataset acknowledges that financial markets are notably influenced by investor sentiments, and many investment decisions are based on information from public sources or intuitive judgements.

    Columns

    • Datetime: This column indicates the date and time when the tweet was posted.
    • Tweet Id: This column contains the unique identification number for each tweet.
    • Text: This column holds the full content of the tweet.
    • Username: This column specifies the Twitter username of the individual who sent the tweet.

    Distribution

    The data file is typically provided in a CSV format. This dataset comprises 100,000 tweets. While the total number of rows is approximately 100,000, specific daily record counts are available, showing variations in tweet volume across different dates. The dataset covers a time span from 21st November 2022 to 6th February 2023.
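
    The daily record counts mentioned above could be derived from the CSV along the following lines. This is a sketch under assumptions: the header names are taken to match the column names listed under Columns, the Datetime values are assumed to be in an ISO-like format such as `2022-11-21 08:15:00+00:00`, and the function name `tweets_per_day` and the three sample rows are made up for illustration.

```python
import csv
import io
from collections import Counter
from datetime import date, datetime


def tweets_per_day(handle):
    """Count rows per calendar day from an open CSV file object."""
    counts = Counter()
    for row in csv.DictReader(handle):
        # Assumption: Datetime is ISO-formatted; adjust the parsing otherwise.
        posted = datetime.fromisoformat(row["Datetime"])
        counts[posted.date()] += 1
    return counts


# Hypothetical sample rows standing in for the real file.
SAMPLE = (
    "Datetime,Tweet Id,Text,Username\n"
    "2022-11-21 08:15:00+00:00,1,NVIDIA earnings soon,user_a\n"
    "2022-11-21 19:02:00+00:00,2,New GPU announced,user_b\n"
    "2023-02-06 10:30:00+00:00,3,NVIDIA stock up,user_c\n"
)

counts = tweets_per_day(io.StringIO(SAMPLE))
for day in sorted(counts):
    print(day, counts[day])

# Inclusive length of the observed date range, in days.
span_days = (max(counts) - min(counts)).days + 1
print(span_days)  # 78
```

    The sample rows use the first and last dates given in the description, so the 78-day span printed here corresponds to the stated range of 21 November 2022 to 6 February 2023.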

    Usage

    This dataset is ideally suited for various applications, including:
    • Conducting sentiment analysis to gauge public opinion and investor sentiment towards NVIDIA.
    • Developing and testing models for market prediction, particularly in relation to technology stocks.
    • Performing natural language processing (NLP) tasks such as text classification, topic modelling, and entity recognition on social media data.
    • Applying time series analysis to understand trends and patterns in tweet volumes and sentiment over time.

    Coverage

    The dataset has a global reach, collecting tweets without specific geographic limitations. It includes tweets from a time range spanning from November 2022 to February 2023. No specific demographic information about the tweeters is provided, as it covers a general user base on Twitter mentioning NVIDIA.

    License

    CC0

    Who Can Use It

    • Data Scientists and Machine Learning Engineers: For building and refining sentiment analysis models and NLP applications.
    • Financial Analysts and Traders: To inform investment decisions by integrating social media sentiment into their market prediction strategies.
    • Academic Researchers: For studies on social media influence, market dynamics, and the application of natural language processing in finance.
    • Business Intelligence Professionals: To monitor brand perception and public sentiment regarding NVIDIA.

    Dataset Name Suggestions

    • NVIDIA Twitter Mentions Dataset
    • NVIDIA Stock Sentiment Tweets
    • Social Media Sentiment for NVIDIA
    • NVIDIA Financial Market Tweets
    • NVIDIA Tweet Activity 2022-2023

    Attributes

    Original Data Source: 100K Nvidia Tweets

