43 datasets found
  1. US_Congressional_Tweets_Dataset

    • kaggle.com
    zip
    Updated Jan 4, 2024
    Cite
    Oscar Yáñez Feijóo (2024). US_Congressional_Tweets_Dataset [Dataset]. https://www.kaggle.com/datasets/oscaryezfeijo/us-congressional-tweets-dataset
    Explore at:
    Available download formats: zip (243754786 bytes)
    Dataset updated
    Jan 4, 2024
    Authors
    Oscar Yáñez Feijóo
    License

    Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Area covered
    United States
    Description

    The "US Congressional Tweets Dataset" is a comprehensive collection of tweets from US Congressional members spanning from 2008 to 2017. This dataset is valuable for organizations like Lobbyists4America, which aims to gain insights into legislative trends and influences for effective lobbying strategies. The dataset is structured into two primary components: users_df and tweets_df.

    Dataset Structure:

    1. users_df: This DataFrame provides detailed information about the Twitter accounts of various congressional members. It includes a range of attributes such as:

      • Account creation date (created_at), follower and friend counts (followers_count, friends_count).
      • Profile-related information like description, location, and verification status.
      • Various Twitter-specific features like contributors_enabled, default_profile, is_translator, etc.
    2. tweets_df: This DataFrame contains the actual tweet data from these congressional accounts. Key columns include:

      • created_at: The timestamp of the tweet.
      • favorite_count and retweet_count: Indicators of the tweet's popularity.
      • text: The text content of the tweet.
      • Metadata such as user_id, lang (language), and source (device/app used for tweeting).
      • Other attributes like possibly_sensitive, quoted_status_id, and engagement-related fields.

    Analysis Performed:

    The dataset is utilized for various analyses, including:

    1. Network Analysis: Exploring the connections and interactions between different congressional members on Twitter, potentially revealing influential figures or groups within Congress.

    2. Sentiment Analysis: Using libraries like TextBlob and NLTK, this analysis assesses the sentiment (positive, negative, neutral) of tweets to understand the general tone and stance of congressional members on various issues.

    3. Correlation Analysis: Investigating relationships between different numerical features in the dataset, such as whether higher tweet frequencies correlate with more followers.

    4. Word Clustering/Topic Modeling: Utilizing NMF (Non-Negative Matrix Factorization) from scikit-learn to cluster words and identify major themes or topics discussed in the tweets.

    5. Time Series Analysis: Observing trends and patterns in tweeting behavior over time, such as increased activity around elections or significant political events.
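
    The correlation analysis in item 3 can be sketched with the standard library alone. This is a minimal illustration, not the dataset author's code: the two small lists are made-up stand-ins for users_df and tweets_df, the "id" key on users is an assumption (the description only names user_id and followers_count), and the pearson helper is written here for the example.

```python
from collections import Counter
from math import sqrt

# Made-up stand-ins for users_df and tweets_df (not real dataset rows).
# "id" on users is an assumed key; user_id and followers_count come from
# the dataset description.
users = [
    {"id": 1, "followers_count": 1000},
    {"id": 2, "followers_count": 5000},
    {"id": 3, "followers_count": 6000},
]
tweets = [
    {"user_id": 1},
    {"user_id": 2}, {"user_id": 2},
    {"user_id": 3}, {"user_id": 3}, {"user_id": 3},
]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Tweet frequency per account, aligned with the order of the users list.
tweet_counts = Counter(t["user_id"] for t in tweets)
frequencies = [tweet_counts[u["id"]] for u in users]
followers = [u["followers_count"] for u in users]

r = pearson(frequencies, followers)
print(f"tweet frequency vs. followers: r = {r:.3f}")
```

    On the real data the same idea would apply after grouping tweets_df by user_id and joining the counts to users_df; a high r would indicate association only, not that tweeting more causes follower growth.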

    Python Libraries Used:

    • Pandas: For data manipulation and analysis.
    • Matplotlib: For visualizing the data.
    • TextBlob and NLTK: For processing textual data and performing sentiment analysis.
    • scikit-learn (sklearn): For machine learning tasks like NMF for topic modeling.
    • spaCy: An advanced natural language processing library.
    • NetworkX: For conducting network analysis.
    • ipywidgets and pytz: For creating interactive elements and handling time zones in the data, respectively.

    Conclusion:

    The "US Congressional Tweets Dataset" is a rich source for analyzing the digital footprint of US Congressional members. Through the application of various data science techniques, Lobbyists4America can extract meaningful insights about political sentiments, networking patterns, and topical trends among lawmakers. This information is crucial for tailoring lobbying efforts and understanding the legislative landscape.

  2. Twitter data for movie Extraction 2

    • kaggle.com
    zip
    Updated Jun 28, 2023
    Cite
    MT9899 (2023). Twitter data for movie Extraction 2 [Dataset]. https://www.kaggle.com/datasets/mt9899/twitter-data-for-movie-extraction-2
    Explore at:
    Available download formats: zip (649271 bytes)
    Dataset updated
    Jun 28, 2023
    Authors
    MT9899
    Description

    This data frame was created using Python. The extraction process is explained in the article "How to Extract Data from Twitter for Sentiment Analysis".

    The text cleaning and preprocessing steps are explained in the article "A Comprehensive Guide to Text Preprocessing for Twitter Data: Getting Ready for Sentiment Analysis".

    My next target is to perform sentiment analysis. Read more articles on Medium.com and follow me to get in touch regarding hands-on data science projects.

  3. Dallas Mavericks Twitter Data for 2021-2022 Games

    • kaggle.com
    zip
    Updated Jun 14, 2022
    Cite
    Alex Huggler (2022). Dallas Mavericks Twitter Data for 2021-2022 Games [Dataset]. https://www.kaggle.com/alexhuggler/dallas-mavericks-twitter-data-for-20212022-games
    Explore at:
    Available download formats: zip (29010517 bytes)
    Dataset updated
    Jun 14, 2022
    Authors
    Alex Huggler
    License

    CC0 1.0 (Public Domain Dedication) https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Twitter data compiled from all Tweets containing the key phrase "Dallas Mavericks" between Sep 30, 2021 and June 1, 2022. This period covers the 2021-2022 NBA season and postseason, in which the Dallas Mavericks finished 4th in the Western Conference. The dataset was created to capture fan sentiment on Twitter about the Mavericks' regular-season and postseason performances.

  4. Replication Data for: Project Homonationalism

    • dataone.org
    • dataverse.harvard.edu
    Updated Nov 8, 2023
    Cite
    Vossen, Job P.H. (2023). Replication Data for: Project Homonationalism [Dataset]. http://doi.org/10.7910/DVN/FK03A6
    Explore at:
    Dataset updated
    Nov 8, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Vossen, Job P.H.
    Description

    The various documents were compiled and formed the empirical input for a word-embedding-based dictionary analysis for the project on homonationalism in the Low Countries.

  5. Twiter Dataset on climate change discussions: COP27, IPCC, climate refugees...

    • data.niaid.nih.gov
    Updated Nov 12, 2024
    Cite
    Vicens, Julian; Massachs, Joan (2024). Twiter Dataset on climate change discussions: COP27, IPCC, climate refugees and Doñana - Clint project [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_14095217
    Explore at:
    Dataset updated
    Nov 12, 2024
    Dataset provided by
    Eurecat, Centre Tecnològic de Catalunya
    Authors
    Vicens, Julian; Massachs, Joan
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    CLINT Data

    This repository contains the data used in the CLINT project and the paper "A cross-platform analysis of polarization and echo chambers in climate change discussions".

    Open Twitter Data

    We used Twitter's search to gather historical tweets, and the streaming API to follow specified accounts and to collect, in real time, tweets that mention specific keywords. To comply with Twitter's Terms of Service, we are publicly releasing only the tweet IDs of the collected tweets. The data is released for non-commercial research use.

    With Twitter's changes to its Academic API policies, it is no longer possible to collect or rehydrate tweets as we previously did. However, we are releasing the data in case it becomes feasible again at some point.

                        IPCC         Doñana       Climate Refugees   COP27
    Number of tweets    352,723      1,487,425    1,938,932          6,225,508
    Number of authors   157,056      290,782      841,454            1,351,903
    First tweet date    2023-03-18   2019-01-01   2008-03-10         2022-09-01
    Last tweet date     2023-03-26   2023-04-30   2022-12-31         2022-11-27
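
    As a quick worked reading of the figures above, the sketch below derives the average number of tweets per author for each topic and the total tweet count. The numbers are copied from the table; the variable names are only illustrative. Author counts are deliberately not summed, since the same account may appear under several topics.

```python
# (number of tweets, number of authors) per topic, copied from the table.
stats = {
    "IPCC": (352_723, 157_056),
    "Doñana": (1_487_425, 290_782),
    "Climate Refugees": (1_938_932, 841_454),
    "COP27": (6_225_508, 1_351_903),
}

# Average tweets per author within each topic.
tweets_per_author = {
    topic: tweets / authors for topic, (tweets, authors) in stats.items()
}

# Total tweet IDs across topics (a tweet could in principle match more than
# one topic, so this is an upper bound on distinct tweets).
total_tweets = sum(tweets for tweets, _ in stats.values())

for topic, ratio in tweets_per_author.items():
    print(f"{topic}: {ratio:.2f} tweets per author")
print(f"Total tweets: {total_tweets:,}")
```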

  6. "Stronger In". A Dataset of 1,005 Tweets by StrongerIn with Archive Summary,...

    • city.figshare.com
    bin
    Updated May 31, 2023
    Cite
    Ernesto Priego (2023). "Stronger In". A Dataset of 1,005 Tweets by StrongerIn with Archive Summary, Sources and Corpus Terms and Collocates Counts and Trends [Dataset]. http://doi.org/10.6084/m9.figshare.3456617.v1
    Explore at:
    Available download formats: bin
    Dataset updated
    May 31, 2023
    Dataset provided by
    City, University of London
    Authors
    Ernesto Priego
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is an Excel spreadsheet file containing an archive of 1,005 @StrongerIn Tweets publicly published by the queried account between 12/06/2016 13:34:35 and 21/06/2016 13:11:34 BST. The spreadsheet contains four more sheets containing a data summary from the archive, a table of tweets' sources, and tables of corpus term and trend counts and collocate counts.

    The Tweets contained in the Archive sheet were collected using Martin Hawksey's TAGS 6.0. The profile_image_url column has been removed. The text analysis was performed using Stéfan Sinclair's & Geoffrey Rockwell's Voyant Tools (c 2016).

    The data is shared as is. The sharing of this dataset complies with Twitter's Developer Rules of the Road.

    Please note that both research and experience show that the Twitter search API is not 100% reliable. Large Tweet volumes affect the search collection process. The API might "over-represent the more central users", not offering "an accurate picture of peripheral activity" (Gonzalez-Bailon, Sandra, et al. 2012). Therefore it cannot be guaranteed this file contains each and every Tweet actually published by the queried Twitter account during the indicated period, and it is shared for comparative and indicative educational research purposes only.

    Only content from public accounts is included and was obtained from the Twitter Search API. The shared data is also publicly available to all Twitter users via the Twitter Search API and available to anyone with an Internet connection via the Twitter and Twitter Search web client and mobile apps without the need of a Twitter account. Each Tweet and its contents were published openly on the Web; they were explicitly meant for public consumption and distribution and are the responsibility of the original authors. Any copyright belongs to its original authors.

    No Personally Identifiable Information (PII) or Sensitive Personal Information (SPI) was collected or is contained in this dataset.

    This dataset is shared as a sample and as an act of citizen scholarship in order to archive, document and encourage open educational and historical research and analysis.

  7. Kaggle Tweets

    • kaggle.com
    zip
    Updated Apr 18, 2021
    Cite
    Tensor Girl (2021). Kaggle Tweets [Dataset]. https://www.kaggle.com/usharengaraju/kaggle-tweets-2010-2021
    Explore at:
    Available download formats: zip (38796821 bytes)
    Dataset updated
    Apr 18, 2021
    Authors
    Tensor Girl
    Description

    Context

    Kaggle, a subsidiary of Google LLC, is an online community of data scientists and machine learning practitioners. Kaggle allows users to find and publish data sets, explore and build models in a web-based data-science environment, work with other data scientists and machine learning engineers, and enter competitions to solve data science challenges.

    Source : https://en.wikipedia.org/wiki/Kaggle

    Content

    The dataset contains tweets regarding "Kaggle" from verified Twitter accounts.

    Acknowledgements

    "Kaggle" Tweets are scraped using Twint.

    Twint is an advanced Twitter scraping tool written in Python that allows for scraping Tweets from Twitter profiles without using Twitter's API.

    https://pypi.org/project/twint/

  8. "Vote Leave". A Dataset of 1,100 Tweets by vote_leave with Archive Summary,...

    • city.figshare.com
    bin
    Updated May 30, 2023
    Cite
    Ernesto Priego (2023). "Vote Leave". A Dataset of 1,100 Tweets by vote_leave with Archive Summary, Sources and Corpus Terms and Collocates Counts and Trends [Dataset]. http://doi.org/10.6084/m9.figshare.3452834.v1
    Explore at:
    Available download formats: bin
    Dataset updated
    May 30, 2023
    Dataset provided by
    City, University of London
    Authors
    Ernesto Priego
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is an Excel spreadsheet file containing an archive of 1,100 @vote_leave Tweets publicly published by the queried account between 12/06/2016 09:06:22 and 21/06/2016 09:29:29 BST. The spreadsheet contains four more sheets containing a data summary from the archive, a table of tweets' sources, and tables of corpus term and trend counts and collocate counts.

    The Tweets contained in the Archive sheet were collected using Martin Hawksey's TAGS 6.0. The profile_image_url column has been removed. The text analysis was performed using Stéfan Sinclair's & Geoffrey Rockwell's Voyant Tools (c 2016).

    The data is shared as is. The sharing of this dataset complies with Twitter's Developer Rules of the Road.

    Please note that both research and experience show that the Twitter search API is not 100% reliable. Large Tweet volumes affect the search collection process. The API might "over-represent the more central users", not offering "an accurate picture of peripheral activity" (Gonzalez-Bailon, Sandra, et al. 2012). Therefore it cannot be guaranteed this file contains each and every Tweet actually published by the queried Twitter account during the indicated period, and it is shared for comparative and indicative educational research purposes only.

    Only content from public accounts is included and was obtained from the Twitter Search API. The shared data is also publicly available to all Twitter users via the Twitter Search API and available to anyone with an Internet connection via the Twitter and Twitter Search web client and mobile apps without the need of a Twitter account. Each Tweet and its contents were published openly on the Web; they were explicitly meant for public consumption and distribution and are the responsibility of the original authors. Any copyright belongs to its original authors.

    No Personally Identifiable Information (PII) or Sensitive Personal Information (SPI) was collected or is contained in this dataset.

    This dataset is shared as a sample and as an act of citizen scholarship in order to archive, document and encourage open educational and historical research and analysis.

  9. Replication Data for: Don't @ Me: Experimentally Reducing Partisan...

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 22, 2023
    Cite
    Munger, Kevin (2023). Replication Data for: Don't @ Me: Experimentally Reducing Partisan Incivility on Twitter [Dataset]. http://doi.org/10.7910/DVN/OUYTUP
    Explore at:
    Dataset updated
    Nov 22, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Munger, Kevin
    Description

    Replication Data for Don't @ Me: Experimentally Reducing Partisan Incivility on Twitter

  10. Data from: A dataset of Spanish tweets on people and communities LGBTQI+...

    • zenodo.org
    • produccioncientifica.uhu.es
    • +1 more
    Updated Mar 23, 2025
    Cite
    Jacinto Mata; Estrella Gualda (2025). A dataset of Spanish tweets on people and communities LGBTQI+ during the COVID-19 pandemic 2020-2022 [LGBTQI+ Dataset 2020-2022_es] [Dataset]. http://doi.org/10.5281/zenodo.15071096
    Explore at:
    Dataset updated
    Mar 23, 2025
    Dataset provided by
    Zenodo http://zenodo.org/
    Authors
    Jacinto Mata; Estrella Gualda
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Feb 16, 2025
    Description

    The LGBTQI+ Dataset 2020-2022_es is a collection of 410,015 original tweets extracted from the social network Twitter between January 1, 2020, and December 31, 2022. To ensure data quality and relevance, retweets, replies, and other duplicate content were excluded, retaining only original tweets. The tweets were collected by Jacinto Mata (University of Huelva, I2C/CITES) with the support of the Python programming language and using the twarc2 tool and the Academic API v2 of Twitter. This data collection is part of the project “Conspiracy Theories and Hate Speech Online: Comparison of patterns in narratives and social networks about COVID-19, immigrants and refugees and LGBTI people [NON-CONSPIRA-HATE!]”, PID2021-123983OB-I00, funded by MCIN/AEI/10.13039/501100011033/ by FEDER/EU.

    The search criteria (words and hashtags) used for the data collection followed the objectives of the aforementioned project and were defined by Estrella Gualda, Francisco Javier Santos Fernández and Jacinto Mata (University of Huelva, Spain). Terms and hashtags used for the search and extraction of tweets were: #orgullogay, #orgullotrans, #OrgulloLGTB, #OrgulloLGTBI, #Díadelorgullo, #TRANSFOBIA, #transexuales, #LGTB, #LGTBI, #LGTBIQ, #LGTBQ, #LGTBQ+, anti-gay, "anti gay", anti-trans, "anti trans", "Ley Anti-LGTB", "ley trans", "anti-ley trans".

    This dataset, collected in the frame of the NON-CONSPIRA-HATE! project, aims to identify and map online hate speech narratives and conspiracy theories towards LGBTIQ+ people and community. Additionally, the dataset is intended to compare communication patterns in social media (rhetoric, language, micro-discourses, semantic networks, emotions, etc.) deployed in different datasets collected in this project. This dataset also contributes to mapping the actors, communities, and networks that spread hate messages and conspiracy theories, aiming to understand the patterns and strategies implemented by extremist sectors on social media. The dataset includes messages that address a wide range of topics related to the LGBTQI+ community, such as rights, visibility, the fight against discrimination and transphobia, as well as debates surrounding the Trans Law and other related issues. It includes expressions of support and celebration of Pride as well as hate speech and opposition to LGBTQI+ rights, along with debates and controversies surrounding these issues.

    This dataset offers a wide range of possibilities for research in various disciplines, as the following examples express:

    Social Sciences & Digital Humanities:
    - Analysis of opinions, attitudes, and trends toward the LGBTIQ+ people and community.
    - Studies on the evolution of public discourse and polarization around issues such as transphobia, hate speech, disinformation, LGBTIQ+ rights and pride, and others.
    - Analysis of social and political actors, leaders or organizations disseminating diverse narratives on LGBTIQ+.
    - Research on the impact of specific events (e.g., Pride Day) on social media conversations.
    - Investigations on social and semantic networks around LGBTIQ+ people and community.
    - Analysis of narratives, discourses and rhetoric around gender identity and sexual diversity.
    - Comparative studies on the representation of the LGBTIQ+ people and community in different cultural or geographic contexts.

    Computer Science and Artificial Intelligence:
    - Development of algorithms for the automatic detection of hate speech, discriminatory language, or offensive content.
    - Training natural language processing (NLP) models to analyze sentiments and emotions in texts related to the LGBTIQ+ people and community.

    For more information on other technical details of the dataset and the structure of the .jsonl data, see the “Readme.txt” file.
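
    As a starting point for the event-impact studies mentioned above, the sketch below counts tweets per month from JSON Lines input. It is a hedged illustration only: the actual record layout is documented in the dataset's "Readme.txt" and may differ, so the created_at field (an ISO 8601 timestamp, as in Twitter API v2 tweet objects) is an assumption, and the three sample records are invented.

```python
import io
import json
from collections import Counter

# Invented sample records standing in for the .jsonl file; the field names
# are assumptions, not taken from the dataset's Readme.txt.
sample_jsonl = io.StringIO(
    '{"id": "1", "created_at": "2020-06-28T10:00:00.000Z", "text": "example a"}\n'
    '{"id": "2", "created_at": "2020-06-29T18:30:00.000Z", "text": "example b"}\n'
    '{"id": "3", "created_at": "2021-06-28T09:15:00.000Z", "text": "example c"}\n'
)


def tweets_per_month(lines):
    """Count records per calendar month (YYYY-MM), one JSON object per line."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        tweet = json.loads(line)
        counts[tweet["created_at"][:7]] += 1  # "YYYY-MM" prefix of ISO date
    return counts


monthly = tweets_per_month(sample_jsonl)
for month, n in sorted(monthly.items()):
    print(month, n)
```

    With the real file, the StringIO object would be replaced by an open file handle, so the data is streamed line by line rather than loaded into memory at once.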

  11. Twitter US Airline Sentiment ✈️

    • kaggle.com
    zip
    Updated Mar 10, 2025
    Cite
    Maruf Hossain (2025). Twitter US Airline Sentiment ✈️ [Dataset]. https://www.kaggle.com/datasets/marufnthewindows/twitter-us-airline-sentiment
    Explore at:
    Available download formats: zip (1134990 bytes)
    Dataset updated
    Mar 10, 2025
    Authors
    Maruf Hossain
    License

    Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This dataset originates from Crowdflower's "Data for Everyone" library and focuses on sentiment analysis related to major U.S. airlines. The data was collected from Twitter in February 2015, where contributors classified tweets into positive, negative, or neutral sentiments. Additionally, negative tweets were further categorized based on reasons such as "late flight" or "rude service".

    Dataset Features

      • 🗂️ Includes both CSV and SQLite database formats.
      • 🔍 Provides insights into the sentiment of tweets (positive, neutral, or negative) for six major U.S. airlines.
      • 🔗 The dataset has undergone slight reformatting, and the code for transformations is available on GitHub.

    This dataset is an excellent resource for sentiment analysis, text classification, and NLP practice, making it ideal for data science and machine learning projects. 🚀✈️
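
    A first look at the CSV version could be a per-airline tally of sentiment labels. The sketch below is illustrative only: the column names airline, airline_sentiment and negativereason are assumptions about the file's header, and the four rows (including the placeholder airline names) are invented rather than taken from the data.

```python
import csv
import io
from collections import Counter

# Invented rows; column names are assumed, airline names are placeholders.
sample_csv = io.StringIO(
    "airline,airline_sentiment,negativereason\n"
    "Airline A,negative,late flight\n"
    "Airline A,positive,\n"
    "Airline B,neutral,\n"
    "Airline B,negative,rude service\n"
)

rows = list(csv.DictReader(sample_csv))

# Number of tweets for each (airline, sentiment) pair.
sentiment_counts = Counter(
    (row["airline"], row["airline_sentiment"]) for row in rows
)

# Reasons attached to negative tweets only.
negative_reasons = Counter(
    row["negativereason"]
    for row in rows
    if row["airline_sentiment"] == "negative"
)

for (airline, sentiment), n in sorted(sentiment_counts.items()):
    print(f"{airline:10s} {sentiment:9s} {n}")
```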

  12. Harvard CGA Streaming Billion Geotweet Dataset

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 21, 2023
    Cite
    CGA, Harvard (2023). Harvard CGA Streaming Billion Geotweet Dataset [Dataset]. http://doi.org/10.7910/DVN/3FDVCA
    Explore at:
    Dataset updated
    Nov 21, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    CGA, Harvard
    Description

    Funded by a grant from the Sloan Foundation, and with support from Massachusetts Open Cloud, the Center for Geographic Analysis (CGA) at Harvard developed a remotely hosted, real-time-updated “big geodata” dataset. It is a prototype for a new data type that is hosted outside Dataverse, supports streaming updates, and is accessed via an API. The CGA developed 1) the software and hardware platform to support interactive exploration of a billion spatio-temporal objects, nicknamed the "BOP" (billion object platform); 2) an API to provide query access to the archive from Dataverse; and 3) client-side tools for querying and visualizing the contents of the archive and extracting data subsets. This project is no longer active. For more information please see: http://gis.harvard.edu/services/project-consultation/project-resume/billion-object-platform-bop. “Geotweets” are tweets containing a GPS coordinate from the originating device. Currently 1-2% of tweets are geotweets, about 8 million per day. The CGA has been harvesting geotweets since 2012.

  13. ALLINTERACT_RawData1_v3.01

    • data.europa.eu
    unknown
    Updated Jun 7, 2022
    + more versions
    Cite
    Zenodo (2022). ALLINTERACT_RawData1_v3.01 [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-7948071?locale=hr
    Explore at:
    Available download formats: unknown (5785)
    Dataset updated
    Jun 7, 2022
    Dataset authored and provided by
    Zenodo http://zenodo.org/
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is part of EC Horizon 2020 project ALLINTERACT Widening and diversifying citizen engagement in science (872396). It contains the raw data obtained from the fieldwork, which consists of: 1) Literature Review, 2) Social Media Analytics, 3) Focus Groups, 4) Survey and 5) Social Media Communicative Observation.

    1) Literature Review

    The objective of the literature review was to address the following topics in gender and education: a) How citizens benefit from scientific research, b) Citizen awareness of the impact of scientific research, c) Awareness-raising initiatives succeeding at engaging citizens in scientific participation, including the Open Access movement and citizen science initiatives, d) Awareness-raising actions that foster the recruitment of new talent in sciences and e) Policies that promote awareness-raising actions and citizen engagement in science. In order to do so, the searches were carried out in the top scientific databases, namely Web of Science (mainly in those journals indexed in Journal Citation Reports) and Scopus. The articles were published between 2010-2021 in journals indexed Q1 or Q2 in JCR or in Q1 journals indexed in Scopus. Relevant reports from EU-funded research projects and official EU documents were also included. We provide one Word file with the following information on each topic (a-e) in gender and education:

      • Keywords used
      • Criteria of selection
      • Identified sources
      • Outcomes
      • Annexes: Grids with the details of the identified sources

    2) Social Media Analytics

    This is the raw data obtained from social media interactions (Twitter, Facebook, Instagram and Reddit) among citizens about citizen participation in science and research with social impact related to two Sustainable Development Goals: Quality Education and Gender Equality. The data collection followed a twofold strategy: 1) Top-Down, in which researchers identified and selected relevant Twitter and Instagram hashtags and Facebook and Reddit pages, and 2) Bottom-Up, in which Twitter hashtags were selected based on daily Trending Topics. The data was collected between March 9th and March 16th 2021 and has been obtained, cleaned and anonymized following Allinteract - Social Media Analytics Protocol (Flecha & Pulido, 2021). We provide five Excel files (one for each social network explored). Each file contains the main information of the extracted messages; however, the information extracted in each case is slightly different:

      • Twitter: Tweet ID, Time, Tweet Type, Retweeted By, Number of Retweets, Hashtags
      • Facebook: Post ID, Video, Type, Likes, Created Time, Updated Time, Comment ID, Comment Likes, Comment Time, Page Likes
      • Instagram: Likes, comments, date
      • Reddit: Row ID, sub_id, sub_title, sub_score, sub_date, comment_id, comment_score, comment_date

    3) Focus Groups

    This data file contains the pseudonymized transcription of a total of 6 focus groups in gender and 6 in education, which were conducted between October 2021 and February 2022. These focus groups are the pre-test and therefore, the groups are distributed in control group or experimental group. The participants of the gender focus groups were women (including vulnerable women) from a women’s group, members of an LGBTQI group and women (including young women) from a women’s group. The participants of the education focus groups were parents, teachers and students. We provide a Word file with the literal transcriptions of the focus groups in the language in which the focus groups were conducted (English, Spanish or Portuguese).

    4) Survey

    This data file contains the anonymous answers of the survey conducted with participants from 12 countries, through a CATI/CAWI method. The survey was conducted between November 2021 and February 2022 and consists of 59 questions. The exploitation of this data has been carried out with the SPSS software. We provide an Excel file with the 59 questions and the answers of 7507 participants.

    5) Social Media Communicative Observation

    The Social Media Communicative Observation aims to explore the effects of introducing scientific pieces of evidence in social media interactions as an initiative to increase participation through awareness. In order to do so, scientific evidence on gender and education was introduced in 10 Facebook groups (5 related to gender and 5 to education), 10 Reddit communities (5 related to gender and 5 to education) and 2 Social Impact Platforms (Sappho and Adhyayana). We provide an Excel file with the anonymized interactions among users around the introduced piece of evidence. This Excel file contains the following information: Group of documents, document name, code, start, final, weight, segment, changed by, changed, created, comment, area and percentage (%).

    6) Focus Group – Post test

    This data file contains the pseudonymized transcription of a total of 6 focus groups post test.

    Funding: We acknowledge support of this work by the project "ALLINTERACT Widening and diversifying

  14. Making Climate Social: Tweets Related to Climate Change, 2019

    • datacatalogue.cessda.eu
    • datacatalogue.ukdataservice.ac.uk
    Updated Sep 26, 2025
    + more versions
    Cite
    Pearce, W (2025). Making Climate Social: Tweets Related to Climate Change, 2019 [Dataset]. http://doi.org/10.5255/UKDA-SN-855230
    Explore at:
    Dataset updated
    Sep 26, 2025
    Dataset provided by
    University of Sheffield
    Authors
    Pearce, W
    Time period covered
    Mar 25, 2019 - Mar 26, 2019
    Area covered
    United Kingdom
    Variables measured
    Time unit, Text unit
    Measurement technique
    Tweets containing the keywords [#climate] OR [climate change] OR [global warming] OR [climatechange] OR [globalwarming] were retrieved and collected between the given dates using DMI-TCAT (https://github.com/digitalmethodsinitiative/dmi-tcat). In line with Twitter terms and conditions, tweet IDs are deposited here.
    Description

    Tweets related to climate change were collected. The data deposited is a list of 53,296 tweet IDs, which can be used to retrieve tweets.
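
    The keyword query listed under "Measurement technique" can be approximated with a simple text filter. The sketch below is a naive illustration that assumes case-insensitive substring matching; DMI-TCAT's actual matching rules may differ, and the function name and example sentences are made up for this example.

```python
# Keywords copied from the "Measurement technique" field, lowercased.
KEYWORDS = [
    "#climate",
    "climate change",
    "global warming",
    "climatechange",
    "globalwarming",
]


def matches_query(text):
    """Return True if any keyword occurs in the text (case-insensitive).

    Substring matching is an assumption made for this sketch; note that
    "#climate" also matches longer hashtags such as "#climatechange".
    """
    lowered = text.lower()
    return any(keyword in lowered for keyword in KEYWORDS)


examples = [
    "Global Warming is accelerating",
    "New report out today #ClimateChange",
    "Nice weather this afternoon",
]
matched = [text for text in examples if matches_query(text)]
print(matched)
```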

    Social media is a transformative digital technology, collapsing the "six degrees of separation" which have previously characterised many social networks, and breaking down many of the barriers to individuals communicating with each other. Some commentators suggest that this is having profound effects across society, that social media has revolutionised the communication of controversial public issues such as climate change, and that this has significantly increased the volume and variety of scientists, politicians, journalists, non-governmental organisations, think tanks and members of the public in contact with each other. For example, in 2012 over 4000 tweets about climate change were sent every day.

    Social media communication can act as a trusted source of public information about climate change, foster public participation in climate science, be a campaigning tool and trigger polarising events with far-reaching effects (e.g. Climategate). However, despite these broad changes in the communication environment, we lack a detailed understanding of the characteristics of social media climate change communications, the wider contexts for these communications, and what the social media revolution means for the relationship between science, politics and publics. Using an innovative interdisciplinary methodological approach that combines social media big data analysis with fine grained ethnographic description, this project aims to: 1) discover the key contributors to social media climate change communication, the content they discuss, and how these change over time and space; 2) locate the connections between contributors, explore how social media usage is influenced by personal, professional and intellectual backgrounds, and how these influences vary over time and space; 3) identify the opportunities and challenges presented by social media for future public discussions of climate change. In this way, Making Climate Social will establish the contributors, content, connections and contexts which make up social media climate change communications, how these change over time and space, and what they mean for future public discussions of the science and politics of climate change.

    The Met Office is a project partner, hosting a knowledge exchange visit by the PI, where he will interact with key climate scientists, the Communications Team and Customer Centre and give a seminar to research staff in both climate and weather research. The PI will also meet regularly with the Met Office, Department for Energy and Climate Change and the cross-sector project Advisory Board to ensure that research findings reach and affect relevant audiences: i) academic audiences in science and technology studies, climate change communication and social media researchers; ii) publics interested in climate change and/or social media usage; iii) government, scientific organisations and universities with responsibility for supporting social media usage by climate change researchers. The project will achieve this through: i) high-quality research articles published in leading journals across a range of specialist academic journals; ii) a dedicated project blog, Twitter account @MakCliSoc, and series of Guardian blogposts to build awareness with, and disseminate findings to, a broad range of stakeholders and publics; iii) the Climate Change Social Radar: an innovative and interactive collaboration with digital developers to provide an engaging web interface through which to explore project data and reflect on broader ethical issues of social media; iv) succinct policy briefings tailored for key stakeholders and written in plain English.

    The long term goal of this project is to make Making Climate Social a trusted source of information that tracks the dynamics of social media climate change communications, providing a counterpart to the Media and Climate Change Observatory (Colorado) which focuses on traditional media coverage of climate change.

  15. Sentiment analysis (supervised and unsupervised classification) of original...

    • data.ncl.ac.uk
    • resodate.org
    xlsx
    Updated May 31, 2023
    Cite
    Diana Contreras; Javier Hervas; Nipun Balan; Philip James (2023). Sentiment analysis (supervised and unsupervised classification) of original Twitter data posted in English about the 10th anniversary of the 2010 Haiti Earthquake [Dataset]. http://doi.org/10.25405/data.ncl.19478021.v1
    Explore at:
    Available download formats: xlsx
    Dataset updated
    May 31, 2023
    Dataset provided by
    Newcastle University
    Authors
    Diana Contreras; Javier Hervas; Nipun Balan; Philip James
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Haiti
    Description

    This database contains the sentiment analysis (SA) of original tweets posted in English by users about the 10th anniversary of the 2010 Haitian earthquake. It includes both supervised and unsupervised classification. The latter was performed using MonkeyLearn, a no-code machine learning (ML) platform for text analysis, comparing the classification's confidence and accuracy (ACC) when training the algorithm with 1, 5 and 10 per cent of the tweets. Both the confidence and the ACC of the classification increase with the number of tweets used for training.
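    For readers unfamiliar with the accuracy (ACC) measure mentioned here, a minimal sketch follows. It only shows how accuracy is computed from predicted and reference labels; the label values and the function name `accuracy` are hypothetical and are not taken from the dataset or from MonkeyLearn.

```python
def accuracy(predicted, reference):
    """Fraction of items whose predicted label equals the reference label."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        raise ValueError("cannot compute accuracy on an empty sample")
    correct = sum(p == r for p, r in zip(predicted, reference))
    return correct / len(reference)


# Toy labels, invented for illustration only.
reference = ["negative", "positive", "neutral", "negative"]
predicted = ["negative", "positive", "negative", "negative"]
acc = accuracy(predicted, reference)  # 3 of 4 labels agree
```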

  16. Justin Trudeau Tweets

    • kaggle.com
    zip
    Updated Jul 21, 2018
    Cite
    MC (2018). Justin Trudeau Tweets [Dataset]. https://www.kaggle.com/datascienceai/justin-trudeau-tweets
    Explore at:
    Available download formats: zip (65047 bytes)
    Dataset updated
    Jul 21, 2018
    Authors
    MC
    Description

    Dataset

    This dataset was created by MC

    Contents

  17. Top 1000 Twitter Celebrity Tweets And Embeddings

    • kaggle.com
    zip
    Updated Jul 12, 2022
    Cite
    Ahmed Shahriar Sakib (2022). Top 1000 Twitter Celebrity Tweets And Embeddings [Dataset]. https://www.kaggle.com/datasets/ahmedshahriarsakib/top-1000-twitter-celebrity-tweets-embeddings
    Explore at:
    Available download formats: zip (179140866 bytes)
    Dataset updated
    Jul 12, 2022
    Authors
    Ahmed Shahriar Sakib
    Description

    Context

    This dataset contains tweets and embeddings of the top 1000 Twitter celebrity accounts

    Content

    1. Tweets -

    2. Embeddings -

    NB: - Almost 10% of the Twitter accounts were private, had changed their username, or were suspended, so the final number of users is 915. - There are some unofficial celebrity accounts (e.g. twitter.com/sonunigam) with very few tweets. These users can be filtered out based on their tweet count. A good research paper on this topic is "25 Tweets to Know You: A New Model to Predict Personality with Social Media".
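    The tweet-count filter suggested in the note can be sketched as follows. This is a minimal illustration that assumes one username per tweet row; the function name `users_with_min_tweets`, the threshold, and the sample values are hypothetical, since the dataset's actual column names are not given here.

```python
from collections import Counter


def users_with_min_tweets(tweet_usernames, min_tweets):
    """Return the set of usernames that appear in at least `min_tweets` rows."""
    counts = Counter(tweet_usernames)
    return {user for user, n in counts.items() if n >= min_tweets}


# Toy data: one username per tweet row (invented values).
sample = ["alice", "alice", "alice", "bob", "carol", "carol"]
kept = users_with_min_tweets(sample, min_tweets=2)
```

    Choosing the threshold is a judgment call; the paper title cited in the note suggests that a few dozen tweets per account may be a reasonable starting point.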

    Featured Notebook

    Live App

    Live in Streamlit

    GitHub Project

    Download

    kaggle API Command

    !kaggle datasets download -d ahmedshahriarsakib/top-1000-twitter-celebrity-tweets-embeddings
    

    Disclaimer

    All of the scraped tweets are publicly available, and the dataset is intended for educational purposes only.

    Acknowledgement

    Cover image credit - bestfunquiz- Which Celebrity On Twitter Should Follow You

  18. Interviews regarding data curation for qualitative data reuse and big social...

    • data.qdr.syr.edu
    bin, pdf, txt
    Updated Apr 26, 2023
    Cite
    Sara Mannheimer (2023). Interviews regarding data curation for qualitative data reuse and big social research [Dataset]. http://doi.org/10.5064/F6GWMU4O
    Explore at:
    Available download formats: pdf(111223), pdf(170851), pdf(174860), pdf(220706), pdf(181317), pdf(155781), pdf(176948), pdf(186400), pdf(216506), pdf(186156), pdf(166627), pdf(204315), pdf(120883), pdf(223955), pdf(197623), pdf(209721), pdf(212401), pdf(111468), pdf(175067), pdf(194133), pdf(194606), bin(254918656), pdf(174896), txt(8346), pdf(180451), pdf(192049), pdf(119959), pdf(214380), bin(2258685), pdf(547705), pdf(189347), pdf(196971), pdf(115127), pdf(213879), pdf(146828), pdf(195493), pdf(177017), pdf(189665), pdf(149437), pdf(183110), pdf(221008), pdf(200024)
    Dataset updated
    Apr 26, 2023
    Dataset provided by
    Qualitative Data Repository
    Authors
    Sara Mannheimer
    License

    https://qdr.syr.edu/policies/qdr-standard-access-conditions

    Time period covered
    Mar 1, 2019 - Jun 1, 2023
    Area covered
    United States
    Description

    Project Overview

    Trends toward open science practices, along with advances in technology, have promoted increased data archiving in recent years, thus bringing new attention to the reuse of archived qualitative data. Qualitative data reuse can increase efficiency and reduce the burden on research subjects, since new studies can be conducted without collecting new data. Qualitative data reuse also supports larger-scale, longitudinal research by combining datasets to analyze more participants. At the same time, qualitative research data can increasingly be collected from online sources. Social scientists can access and analyze personal narratives and social interactions through social media such as blogs, vlogs, online forums, and posts and interactions from social networking sites like Facebook and Twitter. These big social data have been celebrated as an unprecedented source of data analytics, able to produce insights about human behavior on a massive scale. However, both types of research also present key epistemological, ethical, and legal issues. This study explores the issues of context, data quality and trustworthiness, data comparability, informed consent, privacy and confidentiality, and intellectual property and data ownership, with a focus on data curation strategies. The research suggests that connecting qualitative researchers, big social researchers, and curators can enhance responsible practices for qualitative data reuse and big social research.

    This study addressed the following research questions:

    RQ1: How is big social data curation similar to and different from qualitative data curation?
    RQ1a: How are epistemological, ethical, and legal issues different or similar for qualitative data reuse and big social research?
    RQ1b: How can data curation practices such as metadata and archiving support and resolve some of these epistemological and ethical issues?
    RQ2: What are the implications of these similarities and differences for big social data curation and qualitative data curation, and what can we learn from combining these two conversations?

    Data Description and Collection Overview

    The data in this study was collected using semi-structured interviews that centered around specific incidents of qualitative data archiving or reuse, big social research, or data curation. The participants for the interviews were therefore drawn from three categories: researchers who have used big social data, qualitative researchers who have published or reused qualitative data, and data curators who have worked with one or both types of data. Six key issues were identified in a literature review, and were then used to structure three interview guides for the semi-structured interviews. The six issues are context, data quality and trustworthiness, data comparability, informed consent, privacy and confidentiality, and intellectual property and data ownership.

    Participants were limited to those working in the United States. Ten participants from each of the three target populations (big social researchers, qualitative researchers who had published or reused data, and data curators) were interviewed. The interviews were conducted between March 11 and October 6, 2021. When scheduling the interviews, participants received an email asking them to identify a critical incident prior to the interview. The “incident” in critical incident interviewing technique is a specific example that focuses a participant’s answers to the interview questions.

    The participants were asked their permission to have the interviews recorded, which was completed using the built-in recording technology of Zoom videoconferencing software. The author also took notes during the interviews. Otter.ai speech-to-text software was used to create initial transcriptions of the interview recordings. A hired undergraduate student hand-edited the transcripts for accuracy. The transcripts were manually de-identified.

    The author analyzed the interview transcripts using a qualitative content analysis approach. This involved using a combination of inductive and deductive coding approaches. After reviewing the research questions, the author used NVivo software to identify chunks of text in the interview transcripts that represented key themes of the research. Because the interviews were structured around each of the six key issues that had been identified in the literature review, the author deductively created a parent code for each of the six key issues. These parent codes were context, data quality and trustworthiness, data comparability, informed consent, privacy and confidentiality, and intellectual property and data ownership. The author then used inductive coding to create sub-codes beneath each of the parent codes for these key issues.

    Selection and Organization of Shared Data

    The data files consist of 28 of the interview transcripts themselves – transcripts from Big Science Researchers (BSR), Data Curators (DC), and Qualitative Researchers (QR)...

  19. Replication Data for: Papua Disinformation Project

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 8, 2023
    Cite
    McRae, Dave; Quiroga, Mar; Russo-Batterham, Daniel; Doyle, Kim (2023). Replication Data for: Papua Disinformation Project [Dataset]. http://doi.org/10.7910/DVN/OK7YOO
    Explore at:
    Dataset updated
    Nov 8, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    McRae, Dave; Quiroga, Mar; Russo-Batterham, Daniel; Doyle, Kim
    Description

    Due to restrictions from Twitter API v2 for Academic Research, we are only able to provide a severely limited version of the dataset used in this project. The dataset contains tweet IDs and the component numbers assigned to sets of duplicate or near-duplicate tweets, as detailed in the project readme files.
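    To show how such a file might be used, the sketch below groups tweet IDs by their component number so that each set of duplicate or near-duplicate tweets can be inspected together. The row layout (a tweet ID paired with a component number) and all values are assumptions for illustration; the project readme files define the actual format.

```python
from collections import defaultdict


def group_by_component(rows):
    """Map each component number to the list of tweet IDs assigned to it."""
    groups = defaultdict(list)
    for tweet_id, component in rows:
        groups[component].append(tweet_id)
    return dict(groups)


# Invented rows: (tweet_id, component_number).
rows = [("1001", 0), ("1002", 0), ("1003", 1)]
groups = group_by_component(rows)
```

    Tweet IDs are kept as strings here because they can exceed the range that some tools handle safely as numbers.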

  20. D1.1.ALLINTERACT_RawData

    • data.niaid.nih.gov
    • data.europa.eu
    Updated May 18, 2023
    Cite
    Marta Soler-Gallart (2023). D1.1.ALLINTERACT_RawData [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4729724
    Explore at:
    Dataset updated
    May 18, 2023
    Dataset provided by
    University of Barcelona
    Authors
    Marta Soler-Gallart
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains the raw data obtained from social media interactions (Twitter, Facebook, Instagram and Reddit) among citizens about citizen participation in science and research with social impact related to two Sustainable Development Goals: Quality Education and Gender Equality. The data collection followed a twofold strategy: 1) Top-Down, in which researchers identified and selected relevant Twitter and Instagram hashtags and Facebook and Reddit pages; and 2) Bottom-Up, in which Twitter hashtags were selected based on daily Trending Topics.

    The data was collected between March 9th and March 16th 2021 and has been obtained, cleaned and anonymized following ALLINTERACT Protocol for Social Media Analytics. This dataset is part of EC Horizon 2020 project ALLINTERACT Widening and diversifying citizen engagement in science (872396).

    We provide five Excel files (one for each social network explored). Each file contains the main information of the extracted messages; however, the information extracted in each case is slightly different.

    Twitter: Row ID, Tweet ID, Tweet, Time, Tweet Type, Retweeted By, Number of Retweets, Hashtags, Number of Tweets, Number of Followers, Number Following

    Facebook: Row ID, Post ID, Post, Link, Link Name, Link Caption, Link Description, Video, Type, Likes, Created Time, Updated Time, Comment ID, Comment Text, Comment Likes, Comment Time, Page Likes

    Instagram: Url, content, likes, comments, date

    Reddit: Row ID, sub_id, sub_title, sub_text, sub_score, sub_date, sub_link, comment_id, comment_body, comment_score, comment_date, comment_link

    Funding: We acknowledge support of this work by the project "ALLINTERACT Widening and diversifying citizen engagement in science" (872396) from the European Commission Horizon 2020 programme.

Cite
Oscar Yáñez Feijóo (2024). US_Congressional_Tweets_Dataset [Dataset]. https://www.kaggle.com/datasets/oscaryezfeijo/us-congressional-tweets-dataset

US_Congressional_Tweets_Dataset

As used in Coursera's SQL for Data Science Capstone Project (UC Davis)

Explore at:
Available download formats: zip (243754786 bytes)
Dataset updated
Jan 4, 2024
Authors
Oscar Yáñez Feijóo
License

Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically

Area covered
United States
Description

The "US Congressional Tweets Dataset" is a comprehensive collection of tweets from US Congressional members spanning from 2008 to 2017. This dataset is valuable for organizations like Lobbyists4America, which aims to gain insights into legislative trends and influences for effective lobbying strategies. The dataset is structured into two primary components: users_df and tweets_df.

Dataset Structure:

  1. users_df: This DataFrame provides detailed information about the Twitter accounts of various congressional members. It includes a range of attributes such as:

    • Account creation date (created_at), follower and friend counts (followers_count, friends_count).
    • Profile-related information like description, location, and verification status.
    • Various Twitter-specific features like contributors_enabled, default_profile, is_translator, etc.
  2. tweets_df: This DataFrame contains the actual tweet data from these congressional accounts. Key columns include:

    • created_at: The timestamp of the tweet.
    • favorite_count and retweet_count: Indicators of the tweet's popularity.
    • text: The text content of the tweet.
    • Metadata such as user_id, lang (language), and source (device/app used for tweeting).
    • Other attributes like possibly_sensitive, quoted_status_id, and engagement-related fields.

Analysis Performed:

The dataset is utilized for various analyses, including:

  1. Network Analysis: Exploring the connections and interactions between different congressional members on Twitter, potentially revealing influential figures or groups within Congress.

  2. Sentiment Analysis: Using libraries like TextBlob and NLTK, this analysis assesses the sentiment (positive, negative, neutral) of tweets to understand the general tone and stance of congressional members on various issues.

  3. Correlation Analysis: Investigating relationships between different numerical features in the dataset, such as whether higher tweet frequencies correlate with more followers.

  4. Word Clustering/Topic Modeling: Utilizing NMF (Non-Negative Matrix Factorization) from scikit-learn to cluster words and identify major themes or topics discussed in the tweets.

  5. Time Series Analysis: Observing trends and patterns in tweeting behavior over time, such as increased activity around elections or significant political events.
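Two of the analyses listed above, the time series view and the correlation check, can be sketched with the standard library alone. This is an illustrative sketch rather than the dataset author's notebook code: it assumes `created_at` values have already been converted to ISO-8601 strings, and the helper names `tweets_per_month` and `pearson` and all sample numbers are hypothetical.

```python
from collections import Counter
from datetime import datetime
from math import sqrt


def tweets_per_month(timestamps):
    """Count tweets per 'YYYY-MM' bucket from ISO-8601 timestamp strings."""
    return Counter(datetime.fromisoformat(ts).strftime("%Y-%m") for ts in timestamps)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Invented sample: three tweet timestamps around an election week.
monthly = tweets_per_month(
    ["2016-11-07T09:00:00", "2016-11-08T21:30:00", "2016-12-01T12:00:00"]
)

# Invented sample: tweets posted vs. follower counts for four accounts.
r = pearson([10, 20, 30, 40], [100, 210, 290, 405])
```

With real data, the same two steps would typically be done in Pandas (listed among the libraries used), for example by grouping on a month column and calling a correlation method on the numeric columns.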

Python Libraries Used:

  • Pandas: For data manipulation and analysis.
  • Matplotlib: For visualizing the data.
  • TextBlob and NLTK: For processing textual data and performing sentiment analysis.
  • scikit-learn (sklearn): For machine learning tasks like NMF for topic modeling.
  • spaCy: An advanced natural language processing library.
  • NetworkX: For conducting network analysis.
  • ipywidgets and pytz: For creating interactive elements and handling time zones in the data, respectively.

Conclusion:

The "US Congressional Tweets Dataset" is a rich source for analyzing the digital footprint of US Congressional members. Through the application of various data science techniques, Lobbyists4America can extract meaningful insights about political sentiments, networking patterns, and topical trends among lawmakers. This information is crucial for tailoring lobbying efforts and understanding the legislative landscape.
