61 datasets found
  1. Social Media Extremism Detection Dataset

    • kaggle.com
    zip
    Updated Nov 23, 2025
    Cite
    Aditya Suresh (2025). Social Media Extremism Detection Dataset [Dataset]. https://www.kaggle.com/datasets/adityasureshgithub/digital-extremism-detection-curated-dataset
    Explore at:
    zip (121048 bytes)
    Dataset updated
    Nov 23, 2025
    Authors
    Aditya Suresh
    License

    https://cdla.io/permissive-1-0/

    Description

    NEW UPDATE:

    Show off your skills in the Social Media Extremism Challenge at https://www.kaggle.com/competitions/social-media-extremism-detection-challenge! Try your luck at tackling this challenging classification problem! After the competition is completed, we will be adding 200+ hand-labelled entries to this dataset, so stay tuned!

    We would like to thank Assistant Professor Leilani H. Gilpin (UC Santa Cruz) and the AIEA Lab for their guidance and support in the development of this dataset. —*Aditya Suresh, Anthony Lu, Vishnu Iyer*

    About this data: Social media has seen a rise in both the quantity and intensity of extremist content across many different services. With cases such as white supremacist movements around the world, recruitment for terrorist organizations through affiliated accounts, and a general climate of hate emerging from the modern era of polarization, it becomes increasingly vital to recognize these patterns and adequately combat the harms of digital extremism on a global scale.

    Citations: Our dataset would not have been possible without the aid of an existing Kaggle dataset, Version 1 of "Hate Speech Detection curated Dataset🤬" by Alban Nyantudre (2023), available at https://www.kaggle.com/datasets/waalbannyantudre/hate-speech-detection-curated-dataset/data. Accessed in 2025, it was essential to our work: with over 400,000 real, cleaned posts, we could not have sourced and labelled our data points without this crucial resource.

    Classification: Our team hand-labelled nearly 3,000 pieces of data from our sourced database of posts, sorting every one of them into one of two blanket tags, "EXTREMIST" or "NON_EXTREMIST." Because many online messages rely on context to spread harmful rhetoric, we followed a general rule of classifying content as extremist so long as it "provoked harm to a person or a group of people, whether it be through advocacy for violence, discrimination, or other hurtful sentiments, based off of a characteristic of the group."

    Value of the data: This dataset can be used to build extremist sentiment analysis systems and machine learning models, as it reflects current online language use drawn from the source material for the data points themselves. It can also serve as a benchmark for comparison with other extremism datasets and extremist sentiment analysis systems.

    Potential Errors: Although we are confident in our labelling, incorrectly labelled data points are possible: the posts lack quantifiable identifiers, so human error cannot be ruled out. We do not believe this happens often, but in full transparency it is an issue we intend to resolve in subsequent updates.
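
    A minimal baseline sketch for working with this dataset, assuming a CSV file named extremism_dataset.csv with a text column "message" and a label column "label" holding the EXTREMIST / NON_EXTREMIST tags described above (all of these names are assumptions, not confirmed by the dataset page):

    ```python
    # Hypothetical baseline: TF-IDF + logistic regression on the binary labels.
    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import classification_report
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("extremism_dataset.csv")  # hypothetical file name
    X_train, X_test, y_train, y_test = train_test_split(
        df["message"], df["label"], test_size=0.2, random_state=42, stratify=df["label"]
    )

    vec = TfidfVectorizer(min_df=2, ngram_range=(1, 2))
    clf = LogisticRegression(max_iter=1000)
    clf.fit(vec.fit_transform(X_train), y_train)
    print(classification_report(y_test, clf.predict(vec.transform(X_test))))
    ```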

  2. Twitter user data

    • kaggle.com
    zip
    Updated Aug 23, 2020
    Cite
    BARKHA VERMA (2020). Twitter user data [Dataset]. https://www.kaggle.com/barkhaverma/twitter-user-data
    Explore at:
    zip (3163744 bytes)
    Dataset updated
    Aug 23, 2020
    Authors
    BARKHA VERMA
    Description

    Context

    Twitter User Data is a dataset of 20,000 rows that includes, for each user: the user name, a random tweet, account profile details, profile image, and location information.

    Content

    The dataset contains the following fields:

    unit_id: a unique id for user

    golden: whether the user was included in the gold standard for the model; TRUE or FALSE

    unit_state: state of the observation; one of finalized (for contributor-judged) or golden (for gold standard observations)

    trusted_judgments: number of trusted judgments (int); always 3 for non-golden, and what may be a unique id for gold standard observations

    last_judgment_at: date and time of last contributor judgment; blank for gold standard observations

    gender: one of male, female, or brand (for non-human profiles)

    gender:confidence: a float representing confidence in the provided gender

    profile_yn: "no" here seems to mean that the profile was meant to be part of the dataset but was not available when contributors went to judge it

    profile_yn:confidence: confidence in the existence/non-existence of the profile

    created: date and time when the profile was created

    description: the user's profile description

    fav_number: number of tweets the user has favorited

    gender_gold: if the profile is golden, what is the gender?

    link_color: the link color on the profile, as a hex value

    name: the user's name

    profile_yn_gold: whether the profile y/n value is golden

    profileimage: a link to the profile image

    retweet_count: number of times the user has retweeted (or possibly, been retweeted)

    sidebar_color: color of the profile sidebar, as a hex value

    text: text of a random one of the user's tweets

    tweet_coord: if the user has location turned on, the coordinates as a string with the format "[latitude, longitude]"

    tweet_count: number of tweets that the user has posted

    tweet_created: when the random tweet (in the text column) was created

    tweet_id: the tweet id of the random tweet

    tweet_location: location of the tweet; seems to not be particularly normalized

    user_timezone: the timezone of the user
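
    A minimal sketch of how these fields might be used, assuming the data ships as a single CSV (the file name and encoding below are assumptions; the column names come from the field list above):

    ```python
    # Hypothetical filter: keep profiles that exist and have a confident human gender label.
    import pandas as pd

    df = pd.read_csv("twitter_user_data.csv", encoding="latin-1")  # hypothetical name/encoding
    confident = df[
        (df["profile_yn"] == "yes")
        & (df["gender"].isin(["male", "female"]))
        & (df["gender:confidence"] >= 0.8)
    ]
    print(confident["gender"].value_counts())
    ```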

    Acknowledgements

    https://data.world/data-society/twitter-user-data

  3. Social Networks (ISSP 2017) - Czech Republic - Dataset - B2FIND

    • demo-b2find.dkrz.de
    Updated Nov 11, 2025
    + more versions
    Cite
    (2025). Social Networks (ISSP 2017) - Czech Republic - Dataset - B2FIND [Dataset]. http://demo-b2find.dkrz.de/dataset/52fbb3da-063d-5d67-a42a-db835ab57633
    Explore at:
    Dataset updated
    Nov 11, 2025
    Area covered
    Czechia
    Description

    The International Social Survey Program (ISSP) is a long-term international research project that has existed since 1983. It is based on international and cross-project collaboration in social research and focuses on attitudes towards important societal issues. The ISSP combines two levels of comparison: an international perspective and a time perspective. Research activities are carried out continuously in annual cycles. The topic of Social Networks has been studied three times since the ISSP began: in 1986, 2001 and 2017. The latest module slightly changed its name to "Social Networks and Social Resources". Investigations are always partly a replication of previous studies. The ISSP Social Networks modules essentially deal with issues such as the nature of and contacts with family members and friends, participation in associations and groups, duties and rights in social networks, and social trust.

  4. Data from: Analysis of the Quantitative Impact of Social Networks General...

    • figshare.com
    • produccioncientifica.ucm.es
    doc
    Updated Oct 14, 2022
    Cite
    David Parra; Santiago Martínez Arias; Sergio Mena Muñoz (2022). Analysis of the Quantitative Impact of Social Networks General Data.doc [Dataset]. http://doi.org/10.6084/m9.figshare.21329421.v1
    Explore at:
    doc
    Dataset updated
    Oct 14, 2022
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    David Parra; Santiago Martínez Arias; Sergio Mena Muñoz
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    General data collected for the study "Analysis of the Quantitative Impact of Social Networks on Web Traffic of Cybermedia in the 27 Countries of the European Union". Four research questions are posed: what percentage of the total web traffic generated by cybermedia in the European Union comes from social networks? Is that percentage higher or lower than the share provided through direct traffic and through the use of search engines via SEO positioning? Which social networks have a greater impact? And is there any degree of relationship between the specific weight of social networks in the web traffic of a cybermedia outlet and circumstances such as the average duration of the user's visit, the number of page views, or the bounce rate, understood in its formal sense of not performing any kind of interaction on the visited page beyond reading its content? To answer these questions, we first selected the cybermedia with the highest web traffic in the 27 countries that are currently part of the European Union after the United Kingdom left on December 31, 2020. In each nation we selected five media outlets using a combination of the global web traffic metrics provided by the tools Alexa (https://www.alexa.com/), which ceased to be operational on May 1, 2022, and SimilarWeb (https://www.similarweb.com/). We did not use local metrics by country, since the results obtained with these first two tools were sufficiently significant and our objective is not to establish a ranking of cybermedia by nation but to examine the relevance of social networks in their web traffic. In all cases, cybermedia owned by a journalistic company have been selected, ruling out those belonging to telecommunications portals or service providers; some correspond to classic news companies (both newspapers and television) while others are digital natives, without this circumstance affecting the nature of the proposed research.
    We then examined the web traffic data of these cybermedia. The period covering the months of October, November and December 2021 and January, February and March 2022 was selected. We believe that this six-month stretch smooths out possible one-off monthly variations, reinforcing the precision of the data obtained. To obtain this data we used the SimilarWeb tool, currently the most precise tool available for examining the web traffic of a portal, although it is limited to traffic coming from desktops and laptops, without taking into account traffic from mobile devices, which is currently impossible to determine with the measurement tools existing on the market. It includes:

    • Web traffic general data: average visit duration, pages per visit and bounce rate
    • Web traffic origin by country
    • Percentage of traffic generated from social media over total web traffic
    • Distribution of web traffic generated from social networks
    • Comparison of web traffic generated from social networks with direct and search procedures

  5. Data from: Testing and Estimation of Social Network Dependence With Time to...

    • tandf.figshare.com
    txt
    Updated Feb 15, 2024
    Cite
    Lin Su; Wenbin Lu; Rui Song; Danyang Huang (2024). Testing and Estimation of Social Network Dependence With Time to Event Data [Dataset]. http://doi.org/10.6084/m9.figshare.8132456.v4
    Explore at:
    txt
    Dataset updated
    Feb 15, 2024
    Dataset provided by
    Taylor & Francis (https://taylorandfrancis.com/)
    Authors
    Lin Su; Wenbin Lu; Rui Song; Danyang Huang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Nowadays, events spread rapidly along social networks. We are interested in whether people's responses to an event are affected by their friends' characteristics. For example, how soon will a person start playing a game given that his/her friends like it? Studying social network dependence is an emerging research area. In this work, we propose a novel latent spatial autocorrelation Cox model to study social network dependence with time-to-event data. The proposed model introduces a latent indicator to characterize whether a person's survival time might be affected by his or her friends' features. We first propose a score-type test for detecting the existence of social network dependence. If it exists, we further develop an EM-type algorithm to estimate the model parameters. The performance of the proposed test and estimators is illustrated by simulation studies and an application to a time-to-event dataset about playing a popular mobile game from one of the largest online social network platforms. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.

  6. Competition between Homophily and Information Entropy Maximization in Social...

    • plos.figshare.com
    pdf
    Updated Jun 3, 2023
    Cite
    Jichang Zhao; Xiao Liang; Ke Xu (2023). Competition between Homophily and Information Entropy Maximization in Social Networks [Dataset]. http://doi.org/10.1371/journal.pone.0136896
    Explore at:
    pdf
    Dataset updated
    Jun 3, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Jichang Zhao; Xiao Liang; Ke Xu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In social networks, it is conventionally thought that two individuals with more overlapping friends tend to establish a new friendship, which can be stated as homophily breeding new connections. Meanwhile, the recent hypothesis of maximum information entropy has been presented as a possible origin of effective navigation in small-world networks. Through both theoretical and experimental analysis, we find that there is a competition between information entropy maximization and homophily in local structure. This competition suggests that a newly built relationship between two individuals with more common friends would lead to less information entropy gain for them. We demonstrate that both assumptions coexist in the evolution of the social network. The rule of maximum information entropy produces weak ties in the network, while the law of homophily makes the network highly clustered locally, giving individuals strong and trusted ties. A toy model is also presented to demonstrate the competition and evaluate the roles of the different rules in the evolution of real networks. Our findings could shed light on social network modeling from a new perspective.

  7. (🌅 Sunset) Kaggle Users' Country + Regions Info

    • kaggle.com
    zip
    Updated Feb 14, 2024
    Cite
    BwandoWando (2024). (🌅 Sunset) Kaggle Users' Country + Regions Info [Dataset]. https://www.kaggle.com/datasets/bwandowando/kaggle-user-country-regions
    Explore at:
    zip (2376511 bytes)
    Dataset updated
    Feb 14, 2024
    Authors
    BwandoWando
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    [Context]

    The official Meta-Kaggle dataset includes the Users.csv file, which contains Username, DisplayName, RegisterDate, and PerformanceTier fields but no location data for Kaggle users. This dataset augments that data with additional country and region information.

    [Note]

    I haven't included the username and displayname values on purpose, just the userid to be joined back to the Meta-Kaggle official Users.csv file.
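
    A minimal sketch of that join, assuming the augmenting file is a CSV keyed by a userid column (the file names, the "userid" join key, and the "Country" column below are assumptions; the official Meta-Kaggle Users.csv is assumed to key on "Id"):

    ```python
    # Hypothetical join of the country/region data back onto Meta-Kaggle's Users.csv.
    import pandas as pd

    users = pd.read_csv("Users.csv")                  # official Meta-Kaggle file
    regions = pd.read_csv("kaggle_user_regions.csv")  # hypothetical file name for this dataset

    merged = users.merge(regions, left_on="Id", right_on="userid", how="left")
    print(merged[["Id", "Country"]].head())  # "Country" column name is an assumption
    ```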

    [Limitations]

    It is possible that some users haven't inputted their details when the scraper went through their accounts and thus have missing data. Another possibility is that users may have updated their info after the scraper went through their accounts, thus resulting in inconsistencies.

    [How I defined active in this dataset]

    • Users that have received an upvote in the forums, datasets, or notebooks
    • Users that have given an upvote in the forums, datasets, or notebooks
    • Users that have created a thread, a forum post, a notebook, or a dataset
    • Users that made a competition submission
    • Users that exist in the Meta-Kaggle Users dataset
    • Date cut-off of Jan 01, 2019

    [Update]

    • 15-Feb-2024: Since the Kaggle member profile page update, the scrapers aren't working anymore because the UI layout has changed. Will fix this when we get the time.

  8. Doctor Who dataset

    • kaggle.com
    zip
    Updated Apr 19, 2021
    Cite
    manuel-dileo (2021). Doctor Who dataset [Dataset]. https://www.kaggle.com/manueldileo/doctor-who-dataset
    Explore at:
    zip (44809 bytes)
    Dataset updated
    Apr 19, 2021
    Authors
    manuel-dileo
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The Doctor Who simple graph is an undirected weighted graph whose nodes are characters (with information about their roles in the series); an edge exists between nodes i and j with weight w iff i and j appear together in w episodes.

    More information is available on GitHub: https://github.com/manuel-dileo/doctor-who-dataset
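
    A minimal sketch of loading such a co-appearance graph with networkx, assuming a hypothetical edge-list CSV with "source", "target" and "weight" columns (the actual file layout is documented in the GitHub repository):

    ```python
    # Hypothetical loading of the weighted co-appearance graph and a weighted-degree ranking.
    import pandas as pd
    import networkx as nx

    edges = pd.read_csv("doctor_who_edges.csv")  # hypothetical file name
    G = nx.from_pandas_edgelist(edges, source="source", target="target", edge_attr="weight")

    # Characters ranked by total shared episodes (weighted degree).
    ranking = sorted(G.degree(weight="weight"), key=lambda kv: kv[1], reverse=True)
    print(ranking[:10])
    ```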

  9. Top 50 trending topics (trends) of Twitter for 2018 (one hour interval)

    • data.mendeley.com
    Updated Feb 16, 2019
    Cite
    Issa Annamoradnejad (2019). Top 50 trending topics (trends) of Twitter for 2018 (one hour interval) [Dataset]. http://doi.org/10.17632/d4ccnh588k.1
    Explore at:
    Dataset updated
    Feb 16, 2019
    Authors
    Issa Annamoradnejad
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains the top 50 trending topics (trends) on Twitter, obtained from the Twitter Trends API at an hourly rate. For each hour, there is a row in the dataset containing the date, time, trending topic, and the related tweet count (if available). The data covers more than 97% of 2018, the period during which our script was running.
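
    A minimal sketch of reading the hourly records, assuming a CSV export with "date", "time", "trend" and "tweet_count" columns (all of these names are assumptions based on the description above):

    ```python
    # Hypothetical count of how many hours each topic trended during 2018.
    import pandas as pd

    df = pd.read_csv("twitter_trends_2018.csv")  # hypothetical file name
    df["timestamp"] = pd.to_datetime(df["date"] + " " + df["time"])
    hours_trending = df.groupby("trend")["timestamp"].nunique().sort_values(ascending=False)
    print(hours_trending.head(20))
    ```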

  10. Tweets With Emoji

    • kaggle.com
    zip
    Updated Apr 12, 2023
    Cite
    ericwang2001 (2023). Tweets With Emoji [Dataset]. https://www.kaggle.com/datasets/ericwang1011/tweets-with-emoji/discussion
    Explore at:
    zip (48238750 bytes)
    Dataset updated
    Apr 12, 2023
    Authors
    ericwang2001
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The data was obtained using snscrape. The retrieval query was based on individual emojis. Relevant data was identified and then assessed for the presence of emojis and for the sentence's adherence to English language conventions. The language detection analysis was conducted with pycld3, inspired by the paper "The WiLI benchmark dataset for written language identification." Each CSV file consists of 20,000 distinct data entries. The file names are derived from the emoji package (emoji.EMOJI_DATA) in Python.

    It should be noted that given the possible occurrence of small errors associated with pycld3, along with the potential for multiple emojis per data entry, there may exist instances of non-English tweets or duplicated tweets across different CSV files.
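
    A minimal sketch of the kind of post-hoc cleanup the note above suggests, re-checking the language with pycld3 and dropping duplicates (the file name and the "text" column are assumptions):

    ```python
    # Hypothetical re-filtering of one per-emoji CSV: keep reliably-English rows, drop duplicate tweets.
    import pandas as pd
    import cld3  # from the pycld3 package

    df = pd.read_csv("grinning_face.csv")  # hypothetical per-emoji file name

    def is_english(text: str) -> bool:
        pred = cld3.get_language(str(text))
        return pred is not None and pred.language == "en" and pred.is_reliable

    df = df[df["text"].map(is_english)].drop_duplicates(subset="text")
    print(len(df))
    ```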

  11. COSGDD

    • kaggle.com
    zip
    Updated Nov 27, 2024
    Cite
    Akhilchhh (2024). COSGDD [Dataset]. https://www.kaggle.com/datasets/akhilchhh/cosgdd
    Explore at:
    zip (536051 bytes)
    Dataset updated
    Nov 27, 2024
    Authors
    Akhilchhh
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Consolidated Open Source Global Development Dataset (COSGDD)

    Executive Summary

    Motivation

    Welcome to the Consolidated Open Source Global Development Dataset (COSGDD)!

    The Consolidated Open Source Global Development Dataset (COSGDD) was created to address the growing need for accessible, consolidated, and diverse global datasets for education, research, and policy-making. By combining data from publicly available, open-source datasets, COSGDD provides a one-stop resource for analyzing key socio-economic, environmental, and governance indicators across the globe.

    Streamlit Dashboard Link (The LIME explanation graph will take time to load) - https://cosgdd.streamlit.app/ Github Code Repo Link - https://github.com/AkhilByteWrangler/Consolidated-Open-Source-Global-Development-Dataset

    Overview

    Imagine having a magical map of the world that shows you not just the roads and mountains but also how happy people are, how much money they make, how clean the air is, and how fair their governments are. This dataset is that magical map - but in the form of organized data!

    It combines facts and figures from trusted sources to help researchers, governments, companies, and YOU understand how the world works and how to make it better.

    Why Does This Dataset Exist?

    The world is complicated. Happiness doesn’t depend on just one thing like money; it’s also about health, fairness, relationships, and even how clean the air is. But these pieces of the puzzle are scattered across many places. This dataset brings everything together in one place, making it easier to:
    - Answer big questions like:
    - What makes people happy?
    - Is wealth or freedom more important for well-being?
    - How does urbanization affect happiness?
    - Find patterns and trends across countries.
    - Make smart decisions based on real-world data.

    Who Should Use This Dataset?

    This dataset is for anyone curious about the world, including:
    - Researchers: Study connections between happiness, governance, and sustainability.
    - Policy Makers: Design better policies to improve quality of life.
    - Data Enthusiasts: Explore trends and patterns using statistics or machine learning.
    - Businesses: Understand societal needs to improve Corporate Social Responsibility (CSR).

    Description of Data

    This dataset consolidates data from well-established sources such as the World Happiness Report, The Economist Democracy Index, environmental databases, and more. It includes engineered features to deepen understanding of well-being and sustainability.

    Core Features

    • Happiness Metrics:
      • Life Ladder: Self-reported happiness scores.
    • Economic Indicators:
      • Log GDP per capita: Log-transformed measure of wealth.
      • Tax Revenue: Government revenue as a share of GDP.
    • Social Indicators:
      • Social support: Proportion of people with reliable social networks.
      • Freedom to make life choices: Self-reported freedom levels.
    • Environmental Metrics:
      • Total Emissions: Aggregated greenhouse gas emissions.
      • Renewables Production: Share of renewable energy production.
    • Governance Indicators:
      • Democracy_Index: Quantitative measure of democratic governance.
      • Rule_of_Law_Index: Assessment of the legal system’s strength.
    • Engineered Features:
      • Freedom_Index: Combines wealth and freedom.
      • Generosity_Per_Dollar: Normalized generosity against GDP.
      • Environmental_Bonus: Evaluates environmental efficiency relative to economic output.
      • See full documentation for more.
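
    As an illustration of working with these features, a minimal sketch below checks how strongly Life Ladder tracks Log GDP per capita (the file name is an assumption; the column names follow the feature list above, though their exact spelling in the file may differ):

    ```python
    # Hypothetical correlation check between subjective well-being and (log) wealth.
    import pandas as pd

    df = pd.read_csv("cosgdd.csv")  # hypothetical file name
    print(df[["Life Ladder", "Log GDP per capita"]].corr())
    ```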

    Core Columns

    1. Country

    • Unit: Country name as a string.
    • Source: Sourced from all contributing datasets (e.g., World Happiness Report, UN datasets).
    • Significance:
      Identifies the geographic region for the data. Essential for country-specific analyses, comparisons, and aggregations.

    2. Year

    • Unit: Year as an integer (e.g., 2024).
    • Source: Included across all datasets.
    • Significance:
      Indicates the time frame of the data. Vital for studying trends, changes over time, and time-series modeling.

    Happiness Metrics

    3. Life Ladder

    • Unit: Scale from 0 (worst possible life) to 10 (best possible life).
    • Source: World Happiness Report.
    • Significance:
      Captures subjective well-being based on self-reported happiness. A central measure for studying the quality of life globally.

    4. Log GDP per Capita

    • Unit: Logarithmic transformation of GDP per capita in constant international dollars.
    • Source: World Happiness Report, based on World Bank data.
    • Significance:
      Provides a no...
  12. Geolytica POIData.xyz Points of Interest (POI) Geo Data - Sweden

    • datarade.ai
    .csv
    Updated Apr 2, 2022
    + more versions
    Cite
    Geolytica (2022). Geolytica POIData.xyz Points of Interest (POI) Geo Data - Sweden [Dataset]. https://datarade.ai/data-products/geolytica-poidata-xyz-points-of-interest-poi-geo-data-sweden-geolytica
    Explore at:
    .csv
    Dataset updated
    Apr 2, 2022
    Dataset authored and provided by
    Geolytica
    Area covered
    Sweden
    Description

    https://store.poidata.xyz/se

    Point-of-interest (POI) is defined as a physical entity (such as a business) in a geo location (point) which may be (of interest).

    We strive to provide the most accurate, complete and up to date point of interest datasets for all countries of the world. The Sweden POI Dataset is one of our worldwide POI datasets with over 98% coverage.

    This is our process flow:

    Our machine learning systems continuously crawl for new POI data
    Our geoparsing and geocoding calculates their geo locations
    Our categorization systems cleanup and standardize the datasets
    Our data pipeline API publishes the datasets on our data store
    

    POI Data is in a constant flux - especially so during times of drastic change such as the Covid-19 pandemic.

    On an average day, every minute worldwide over 200 businesses move, over 600 new businesses open their doors, and over 400 businesses cease to exist.

    In today's interconnected world, of the approximately 200 million POIs worldwide, over 94% have a public online presence. As a new POI comes into existence its information will appear very quickly in location based social networks (LBSNs), other social media, pictures, websites, blogs, press releases. Soon after that, our state-of-the-art POI Information retrieval system will pick it up.

    We offer our customers perpetual data licenses for any dataset representing this ever changing information, downloaded at any given point in time. This makes our company's licensing model unique in the current Data as a Service - DaaS Industry. Our customers don't have to delete our data after the expiration of a certain "Term", regardless of whether the data was purchased as a one time snapshot, or via a recurring payment plan on our data update pipeline.

    The main differentiators between us vs the competition are our flexible licensing terms and our data freshness.

    The core attribute coverage is as follows:

    | POI field | Data coverage (%) |
    |:------------------|:------------------|
    | poi_name | 100 |
    | brand | 9 |
    | poi_tel | 46 |
    | formatted_address | 100 |
    | main_category | 97 |
    | latitude | 100 |
    | longitude | 100 |
    | neighborhood | 5 |
    | source_url | 60 |
    | email | 12 |
    | opening_hours | 38 |

    The dataset may be viewed online at https://store.poidata.xyz/se and a data sample may be downloaded at https://store.poidata.xyz/datafiles/se_sample.csv
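
    A minimal sketch for sanity-checking that coverage table against the published sample (the sample URL is taken from the description above; the column names in the sample are assumptions):

    ```python
    # Hypothetical recomputation of per-field coverage from the public sample CSV.
    import pandas as pd

    sample = pd.read_csv("https://store.poidata.xyz/datafiles/se_sample.csv")
    coverage = sample.notna().mean().mul(100).round(1).sort_values(ascending=False)
    print(coverage)  # expect ~100 for fields such as poi_name, latitude, longitude
    ```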

  13. Monthly Samples of German Tweets

    • zenodo.org
    zip
    Updated Mar 8, 2023
    + more versions
    Cite
    Nane Kratzke; Nane Kratzke (2023). Monthly Samples of German Tweets [Dataset]. http://doi.org/10.5281/zenodo.3559456
    Explore at:
    zip
    Dataset updated
    Mar 8, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Nane Kratzke; Nane Kratzke
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains German tweets and Twitter accounts recorded from the public Twitter Streaming API using the following filters:

    • terms: 'a', 'e', 'i', 'o', 'u', and 'n'
    • language: 'de'

    This filter combination should record (almost) all German tweets (in German it is very unlikely that terms do not contain vowels or the frequently used character 'n').

    This dataset might be useful for the following use cases:

    • Natural language processing (focusing on Twitter-specific German; only a few German datasets exist)
    • Social Network Analysis (Twitter network)
    • Identifying behavioural patterns (retweeting, quoting, replying, hate speech, ...)
    • Sharing political (or other domain-specific) content
    • Bot detection
    • and more ...

    This dataset will be updated monthly. Each sample (starting in April 2019) will follow the following naming pattern:

    • german-tweet-sample--.zip (size: ~ 1GB)

    Each archive contains several bunches of gzipped JSON files; each bunch holds approximately 50k recorded tweets/accounts (size: ~6 MB).
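
    A minimal sketch for iterating over one monthly archive, assuming each inner .gz member holds newline-delimited tweet JSON objects (the archive name and the inner layout are assumptions, since the exact format is not spelled out above):

    ```python
    # Hypothetical reader: walk the zip, decompress each .gz bunch, parse one record per file.
    import gzip
    import json
    import zipfile

    with zipfile.ZipFile("german-tweet-sample.zip") as zf:  # hypothetical file name
        for member in zf.namelist():
            if not member.endswith(".gz"):
                continue
            with zf.open(member) as raw, gzip.open(raw, "rt", encoding="utf-8") as fh:
                for line in fh:
                    tweet = json.loads(line)
                    print(tweet.get("id"), tweet.get("text", "")[:50])  # field names are assumptions
                    break  # demo: only the first record per bunch
    ```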

  14. Jacksepticeye Tweets

    • kaggle.com
    zip
    Updated Dec 27, 2022
    Cite
    The Devastator (2022). Jacksepticeye Tweets [Dataset]. https://www.kaggle.com/datasets/thedevastator/engagement-reach-and-popularity-of-jacksepticeye
    Explore at:
    zip (1293508 bytes)
    Dataset updated
    Dec 27, 2022
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Engagement, Reach, and Popularity of Jacksepticeye Tweets

    An Insight into Social Media Interaction

    By Twitter [source]

    About this dataset

    This dataset provides an insight into the reach and impact of Jacksepticeye's tweets. With curated content covering everything from gaming to life reflections, these tweets offer a snapshot not only of his global popularity, but also his ability to engage with an audience and ignite conversation. From each tweet, you can learn data points like its content, the number of likes it received, which replies popped up in response, how many times it was retweeted or marked as a favorite, and the overall relevance of that particular tweet in terms of its contribution to conversations worldwide. This comprehensive dataset is a great opportunity to explore the power behind Jacksepticeye's social media presence!

    How to use the dataset

    This dataset is in csv format and contains information about different tweets such as their content and the response they received from audiences in terms of likes, retweets and other measures. The following columns are included:

    • Tweet ID: A unique identifier for each tweet
    • Tweet content: The text contained within a tweet
    • Likes: Number of times a user has interacted with a specific tweet by pressing the “like” button
    • Replies: Number of direct replies to the original tweet
    • Re-Tweets: Number of times users have shared/re-tweeted a specific tweet
    • Retweeted : Indicates whether or not it was retweeted by someone else
    • Relevance : A measure of how relevant this conversation was at that particular time

    This data can be used for an array of tasks such as sentiment analysis (measuring how people feel about certain topics) or network analysis (understanding who was most influential in spreading Jacksepticeye's message). You could also use this data to track changes in engagement metrics over time or to measure which topics generate greater responses from audiences.

    To begin using this dataset, first import it into your scripting language. You can then start exploring the insights it offers by asking questions such as "Which types of posts perform better?" or "What types of conversations does Jacksepticeye tend to have?" By focusing on one question at a time, you can look for correlations between variables and gain a better understanding of why certain types of post perform differently than others. With variable manipulation techniques like select/filter, you can group posts into ad hoc groups that answer your initial questions ("gaming", "travel", etc.). Once you narrow these fields of interest down, relevance indices become much easier to manage and interpret, since they now operate within meaningful contexts rather than as individual observations with associated figures (likes, etc.). Working from existing workbooks greatly increases efficiency when analysing datasets, so if one already exists (and updates don't occur too frequently), take advantage of it! A minimal grouping sketch follows.
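
    The sketch below illustrates that grouping workflow, tagging tweets by a keyword-defined topic and comparing average engagement (the file name is an assumption; the column names are taken loosely from the column list above and may differ in the actual CSV):

    ```python
    # Hypothetical topic grouping: compare mean likes/retweets for gaming-related vs other tweets.
    import pandas as pd

    df = pd.read_csv("jacksepticeye_tweets.csv")  # hypothetical file name
    is_gaming = df["Tweet content"].str.contains("game", case=False, na=False)
    df["topic"] = is_gaming.map({True: "gaming", False: "other"})
    print(df.groupby("topic")[["Likes", "Re-Tweets"]].mean())
    ```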

    Research Ideas

    • Identifying the types of content that performs best on the platform: By analyzing the engagement, reach, and popularity of tweets, marketers can determine which topics generate higher engagement and reach to inform their own strategies.

    • Assessing user interactions: Examining reply counts and retweet counts reveals how users interact with Jacksepticeye's posts, helping to inform a better understanding of user dynamics on Twitter.

    • Measuring influencer marketing ROI: Since this dataset contains the number of likes and retweets for each post, marketers can compare these values to assess the success of an influencer marketing campaign by determining whether it had a positive effect on followers' engagement with Jacksepticeye's content

    Acknowledgements

    If you use this dataset in your research, please credit the original authors (Twitter). Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

  15. Instagram users in Central & Western Europe 2019-2028

    • statista.com
    Updated Jun 18, 2025
    Cite
    Statista Research Department (2025). Instagram users in Central & Western Europe 2019-2028 [Dataset]. https://www.statista.com/topics/4106/social-media-usage-in-europe/
    Explore at:
    Dataset updated
    Jun 18, 2025
    Dataset provided by
    Statista (http://statista.com/)
    Authors
    Statista Research Department
    Description

    The number of Instagram users in Central & Western Europe was forecast to increase between 2024 and 2028 by a total of 4.5 million users (+3.52 percent). This overall increase does not happen continuously, notably not in 2028. The Instagram user base is estimated to amount to 132.21 million users in 2028. Notably, the number of Instagram users has been continuously increasing over the past years. User figures, shown here for the platform Instagram, have been estimated by taking into account company filings or press material, secondary research, app downloads and traffic data. They refer to the average monthly active users over the period and count multiple accounts held by one person only once. The data shown are an excerpt of Statista's Key Market Indicators (KMI). The KMI are a collection of primary and secondary indicators on the macro-economic, demographic and technological environment in up to 150 countries and regions worldwide. All indicators are sourced from international and national statistical offices, trade associations and the trade press, and they are processed to generate comparable datasets (see supplementary notes under details for more information). Find more key insights for the number of Instagram users in regions such as Eastern Europe and Northern Europe.

  16. Data for: Multi-level network dataset of ten Swiss wetlands governance cases...

    • opendata.eawag.ch
    • opendata-stage.eawag.ch
    Updated Jul 29, 2022
    Cite
    (2022). Data for: Multi-level network dataset of ten Swiss wetlands governance cases based on qualitative interviews and quantitative surveys - Package - ERIC [Dataset]. https://opendata.eawag.ch/dataset/multi-level-network-dataset-of-social-ecological-interdependencies
    Explore at:
    Dataset updated
    Jul 29, 2022
    Description

    The dataset for this paper originated from quantitative survey data and qualitative expert interviews with organizational actors relevant to the governance of ten Swiss wetlands, conducted from 2019 to 2021. Multi-level networks represent wetlands governance for each of the ten cases. Collaboration networks of actors form the first level of the multi-level networks. The collaboration network is connected to multiple other network levels that account for the social and ecological systems those actors are active in. 521 actors relevant to the management of the ten wetlands are included in the collaboration network; quantitative survey data exists for 71% of them. A unique feature of the collaboration network is that it differentiates between positive and negative forms of collaboration depending on actors' activity areas. Therefore, the data describes not only whether actors collaborate but also how and where they collaborate. Two additional two-mode networks (actor participation in forums and involvement in other regions outside the case area) were also elicited in the survey and connected to the collaboration network. The dataset also contains data on ecological system interdependencies in the form of conceptual maps derived from 34 expert interviews (2-4 experts per case).

  17. Analysis of social media and organizational learning

    • researchdata.up.ac.za
    pdf
    Updated Feb 4, 2023
    Cite
    Harry Moongela; Marie Hattingh (2023). Analysis of social media and organizational learning [Dataset]. http://doi.org/10.25403/UPresearchdata.21952859.v1
    Explore at:
    pdf
    Dataset updated
    Feb 4, 2023
    Dataset provided by
    University of Pretoria (http://www.up.ac.za/)
    Authors
    Harry Moongela; Marie Hattingh
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    These datasets consist of qualitative data collected through semi-structured in-depth interviews as well as a focus group at three different companies with seven industry experts. The data were collected to examine the use of social media to enhance organisational learning (OL), to address the gap that exists regarding the integration of OL and social media, and to address the lack of guidelines for organisations that would like to use social media to facilitate OL. The data were triangulated by comparing the results from the three companies.

  18. MetaFilter Community Data

    • kaggle.com
    zip
    Updated Dec 19, 2023
    Cite
    The Devastator (2023). MetaFilter Community Data [Dataset]. https://www.kaggle.com/datasets/thedevastator/metafilter-community-data/code
    Explore at:
    zip (329048128 bytes)
    Dataset updated
    Dec 19, 2023
    Authors
    The Devastator
    Description

    MetaFilter Community Data

    MetaFilter Community Interactions and Data

    By pmlandwehr [source]

    About this dataset

    The MetaFilter Infodump dataset is a comprehensive collection of community interactions, post and comment data, user information, and tag data from the popular online community, MetaFilter. With a rich history spanning 17 years, this dataset offers invaluable insights into the dynamics and evolution of this long-standing community.

    MetaFilter is an established link blog that has been in existence since July 14, 1999. What sets it apart is its active moderation by a dedicated staff who ensure the quality and authenticity of content by implementing a $5 membership fee to deter trolling and bad-faith behavior.

    The Infodump serves as a regular dump of records capturing various user interactions across MetaFilter and its affiliated sites. It provides an extensive collection of data that sheds light on who interacted with whom, the topics that sparked collective interest within the community, and how the community has transformed over time. To fully comprehend the intricacies of this dataset and understand its variables, we highly recommend referring to the canonical data dictionary available on MetaFilter's Wiki page dedicated to the Infodump.

    This dataset holds immense value for researchers interested in studying interaction patterns within online communities or analyzing trends in community interests over extended periods. By exploring seventeen years' worth of historical data from a single online community like MetaFilter, researchers can gain valuable insights into user engagement dynamics and uncover fascinating trends.

    Please note that while working with this dataset warrants attention to certain caveats (e.g., users' privacy), it presents unparalleled opportunities for investigating social dynamics within an online setting.

    To ensure accuracy and relevancy throughout ongoing research endeavors involving this dataset, updates will be made regularly using APIs provided by data.world alongside their ability to track URLs for convenience.

    With columns such as post_title, researchers can delve into the topics of posts across numerous threads on MetaFilter. Analyzing these post titles allows researchers to identify subjects commonly discussed within this vibrant online community.

    Embrace this MetaFilter Infodump dataset to explore the vast tapestry of interactions, discussions, and trends that have shaped one of the internet's pioneering communities.

    How to use the dataset

    How to Use the MetaFilter Infodump Dataset

    The MetaFilter Infodump dataset is a valuable resource for analyzing community interactions, post and comment data, user information, and tag data from the MetaFilter online community. This guide will provide an overview of the dataset and offer insights on how to effectively utilize it for your analysis.

    Dataset Overview

    The MetaFilter Infodump dataset contains records spanning 17 years of online community history. It includes information on interactions between users, such as comments and posts, as well as user profiles and tag data. The dataset is structured in a tabular format with various columns providing specific information about each record.

    Dataset Files

    • meta_commentlength.csv: This file contains the index, comment ID, and length of each comment in the MetaFilter Infodump dataset.
    • music_posttitles.csv: This file contains the index, post ID, and title of music-related posts on MetaFilter.
    • askme_commentlength.csv: This file contains the index, comment ID, and length of comments in the askme section of the MetaFilter community.

    Data Dictionary

    To fully understand the dataset structure and column meanings mentioned above:

    Columns

    File: meta_commentlength.csv

    | Column name | Description |
    |:------------|:----------------------------------------------------|
    | length      | The length of the comment in characters. (Numeric)  |

    File: music_posttitles.csv

    | Column name | Description |
    |:------------|:-------------------------------|
    | title       | The title of the post. (Text)  |

    File: askme_commentlength.csv

    | Column name | Description |
    |:------------|:----------------------------------------------------|
    | length      | The length of the comment in characters. (Numeric)  |
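
    A minimal sketch using the two comment-length files documented above (the paths assume the CSVs sit in the working directory; the "length" column comes from the data dictionary):

    ```python
    # Hypothetical comparison of comment lengths between MetaFilter and Ask MetaFilter.
    import pandas as pd

    meta = pd.read_csv("meta_commentlength.csv")
    askme = pd.read_csv("askme_commentlength.csv")
    print("MetaFilter mean comment length:", meta["length"].mean())
    print("Ask MetaFilter mean comment length:", askme["length"].mean())
    ```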

    Acknowledgements

    If you use this dataset in your research, please credit the original author, pmlandwehr.

  19. Tunisian Arabizi Dialect Data - Sentiment Analysis

    • kaggle.com
    zip
    Updated Dec 23, 2023
    Cite
    Alban NYANTUDRE (2023). Tunisian Arabizi Dialect Data - Sentiment Analysis [Dataset]. https://www.kaggle.com/datasets/waalbannyantudre/tunisian-arabizi-dialect-data-sentiment-analysis
    Explore at:
    zip (3089928 bytes)
    Dataset updated
    Dec 23, 2023
    Authors
    Alban NYANTUDRE
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    -**Motivation:** On various social media platforms, people tend to communicate and to write posts and comments informally, in their local dialects. In Africa, more than 1500 dialects and languages exist. While being so diverse and rich, the Arabic language, and particularly Arabic dialects, is still under-represented and not yet fully exploited. Arabizi is a term describing a system of writing Arabic using English characters; the term comes from the two words "arabi" (Arabic) and "Engliszi" (English). Arabizi represents Arabic sounds using Latin letters and numbers to replace the non-existent equivalents. In Tunisia in particular, this way of writing was introduced as "Tunizi". (Figure: Tunizi example comments and their Modern Standard Arabic (MSA) and English translations.)

    -**About this Dataset :** This is a large common-crawl-based Tunisian Arabizi dialectal dataset dedicated for Sentiment Analysis. The dataset consists of a total of 100k comments (about movies, politic, sport, etc.) annotated manually by Tunisian native speakers as Positive, Negative, and Neutral.

    -**Value of this Data :** The authors introduced this large Tunizi dataset built for the sentiment analysis task, in order to help Tunisian and other researchers interested in the Natural Language Processing (NLP) field. This dataset can be also used for other NLP subtasks such as dialect identification, named entities recognition, etc...

    Specifications table

    • Subject: Natural Language Processing - NLP
    • Type of data: Text
    • Region: North Africa - Tunisia
    • Data format: Annotated, Analysed, Filtered Data
    • Data Article: Introducing A large Tunisian Arabizi Dialectal Dataset for Sentiment Analysis
    • Data source location: https://data.mendeley.com/datasets/9sxpkmm8xn/1

    -**How the data were acquired:**
    According to the article authors, because of the lack of available Tunisian dialect data (books, Wikipedia, etc.), they used a Common Crawl-based dataset extracted from social media, collected from comments on various social networks. The chosen posts cover sports, politics, comedy, TV shows, TV series, arts and Tunisian music videos, so that the dataset is representative and spans different ages, backgrounds, writing styles, etc. The data does not include any confidential information, since it was collected from comments on public social media posts; however, negative comments may include offensive or insulting content. The dataset relates directly to people from different regions, of different ages and genders. A filter was applied to ensure that only Latin-based comments are included. The extracted data was preprocessed by removing links, emoji symbols and punctuation.
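
    A minimal sketch for loading the annotated comments and checking the label balance (the file name and the "label" column are assumptions; the Positive / Negative / Neutral classes come from the description above):

    ```python
    # Hypothetical class-distribution check on the sentiment labels.
    import pandas as pd

    df = pd.read_csv("tunisian_arabizi_sentiment.csv")  # hypothetical file name
    print(df["label"].value_counts(normalize=True).round(3))
    ```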

    Header & Thumbnail Image : Credits @VectorStock

  20. Replication Data for: The finance research network in Brazil: a small world

    • search.dataone.org
    • dataverse.harvard.edu
    • +1more
    Updated Nov 22, 2023
    Cite
    Mendes-Da-Silva, Wesley (2023). Replication Data for: The finance research network in Brazil: a small world [Dataset]. http://doi.org/10.7910/DVN/FTOYYL
    Explore at:
    Dataset updated
    Nov 22, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Mendes-Da-Silva, Wesley
    Description

    The study of the role of collaboration networks in the production of knowledge is important and has attracted the attention of a substantial number of researchers and policy makers around the world. This paper aims to analyze the structural properties of relationship networks among Finance researchers in Brazil. By applying Social Network Analysis to data from 532 articles produced by 806 researchers between 2003 and 2012, this article's results suggest that: (a) the Brazilian environment has structural features that indicate the existence of Small Worlds; (b) a small fraction (~3%) of researchers has regular production; (c) the higher the centrality of researchers in the network, the greater the number of articles published by them.
