69 datasets found
  1. Fedivertex

    • kaggle.com
    Updated May 8, 2025
    Cite
    Marc Damie (2025). Fedivertex [Dataset]. http://doi.org/10.34740/kaggle/ds/6877842
    Dataset updated
    May 8, 2025
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Marc Damie
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    ![Peertube "follow" graph](https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F25663426%2Fef0839f1c6342b2f89b87d08acfb4b74%2Fpeertube_graph(1).png?generation=1746770713374326&alt=media)

    Above is the Peertube "follow" graph. The colours correspond to the language of the server (purple: unknown, green: French, blue: English, black: German, orange: Italian, grey: others).

    Introduction

    Decentralized machine learning, where each client keeps its own data locally and uses its own computational resources to collaboratively train a model by exchanging peer-to-peer messages, is increasingly popular, as it enables better scalability and control over the data. A major challenge in this setting is that learning dynamics depend on the topology of the communication graph, which motivates the use of real graph datasets for benchmarking decentralized algorithms. Unfortunately, existing graph datasets are largely limited to for-profit social networks crawled at a fixed point in time and often collected at the user scale, where links are heavily influenced by the platform and its recommendation algorithms. The Fediverse, which includes several free and open-source decentralized social media platforms such as Mastodon, Misskey, and Lemmy, offers an interesting real-world alternative. We introduce Fedivertex, a new dataset covering seven social networks from the Fediverse, crawled on a weekly basis.

    We refer to our paper for a detailed presentation of the graphs: [SOON]

    Usage

    Python

    We implemented a simple Python API to interact easily with the dataset: https://pypi.org/project/fedivertex/

    pip3 install fedivertex
    

    This package automatically downloads the dataset and generates NetworkX graphs.

    from fedivertex import GraphLoader
    
    # Instantiate the loader before calling its methods
    loader = GraphLoader()
    
    loader.list_graph_types("mastodon")
    # List available graphs for a given software, here federation and active_user
    
    G = loader.get_graph(software="mastodon", graph_type="active_user", index=0, only_largest_component=True)
    # G contains the NetworkX graph of the giant component of the active users graph at the 1st date of collection
    

    We also provide a Kaggle notebook demonstrating simple operations using this library: https://www.kaggle.com/code/marcdamie/exploratory-graph-data-analysis-of-fedivertex

    Available graphs

    The dataset contains graphs crawled on a weekly basis on 7 social networks from the Fediverse. Each graph characterizes the interactions differently, depending on the information provided by the public API of these networks.

    We briefly present the graphs below (NB: the term "instance" refers to servers on the Fediverse):

    • [Bookwyrm/Friendica/Lemmy/Mastodon/Misskey/Pleroma] "federation" graphs: If two instances know each other they are connected in this graph. The federation graph then corresponds to the undirected communication graph between instances.
    • Peertube "follow" graphs: On Peertube, an instance X can follow an instance Y to let its users see all the videos posted on Y. This graph is a directed graph with edges of weight 1 for following.
    • Lemmy "federation with blocks" graphs: This graph completes the federation graph with negative edges when an instance X blocks instance Y. The graph is directed.
    • Lemmy "cross-instance" graphs: two instances are connected as soon as there exists a pair of users who published a message in the same thread, but possibly on a third instance. This is an undirected graph, less sparse than "intra-instance".
    • Lemmy "intra-instance" graphs: the instance X is linked to Y if a user of X has published a message on instance Y. This graph is directed and very sparse.
    • [Mastodon/Misskey/Pleroma] "active users" graphs: For each instance, we consider the set of the 10K most recently active users. Then, for each user of an instance X, we consider the list of the users they follow, and add 1 to the edge from X to Y, where Y is the instance of the followed user. The weight of the edge from X to Y thus encodes how much of the content seen on instance X is generated on instance Y. Note that this graph can therefore contain self-loops.

    These graphs provide diverse perspectives on the Fediverse, as they capture more or less subtle phenomena. For example, "federation" graphs are dense, while "intra-instance" graphs are sparse. We have performed a detailed exploratory data analysis in this notebook.
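    The "active users" construction described above can be illustrated with a short sketch. This is not the authors' crawler: the input layout (a mapping from an instance to a list of users, each user being a list of followed "name@instance" handles ordered by recency) and the function names are assumptions made for illustration only.

```python
from collections import Counter


def instance_of(handle):
    """Return the instance part of a 'name@instance' handle (assumed format)."""
    return handle.rsplit("@", 1)[1]


def active_user_edge_weights(following_by_instance, max_users=10_000):
    """Aggregate user-level follows into weighted instance-level edges.

    For each instance X, only the first `max_users` users are considered
    (assumed to be sorted from most to least recently active). Every followed
    account located on instance Y adds 1 to the weight of the edge (X, Y).
    """
    weights = Counter()
    for instance_x, users in following_by_instance.items():
        for followed_handles in users[:max_users]:
            for handle in followed_handles:
                weights[(instance_x, instance_of(handle))] += 1
    return weights


# Hypothetical toy input: two instances, three users in total.
sample = {
    "a.example": [["bob@b.example", "carol@a.example"], ["bob@b.example"]],
    "b.example": [["alice@a.example"]],
}
weights = active_user_edge_weights(sample)
for (src, dst), weight in sorted(weights.items()):
    print(src, "->", dst, weight)
```

    The edge from a.example to itself shows how self-loops arise: a user following an account hosted on their own instance contributes to the weight of (X, X).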

    Gephi

    Our CSV files are formatted so that they can be directly imported into Gephi for graph visualization. Find below an example Gephi visualization of the Misskey "active users" graph (without the misskey.io node). The colours correspond to the language of the server (purple:Unknown, red: Japanese, brown: Korean, blue: English, yellow: Chinese).

    ![Misskey "active users" graph](https://www.go...

  2. Social media users in the United States 2020-2029

    • statista.com
    • ai-chatbox.pro
    Updated Dec 12, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Statista (2024). Social media users in the United States 2020-2029 [Dataset]. https://www.statista.com/statistics/278409/number-of-social-network-users-in-the-united-states/
    Dataset updated
    Dec 12, 2024
    Dataset authored and provided by
    Statistahttp://statista.com/
    Area covered
    United States
    Description

    The number of social media users in the United States was forecast to increase continuously between 2024 and 2029 by a total of 26 million users (+8.55 percent). After the ninth consecutive year of growth, the social media user base is estimated to reach 330.07 million users, a new peak, in 2029. Notably, the number of social media users has increased continuously over the past years. The figures shown have been derived from survey data that has been processed to estimate missing demographics. The data shown are an excerpt of Statista's Key Market Indicators (KMI). The KMI are a collection of primary and secondary indicators on the macro-economic, demographic and technological environment in up to 150 countries and regions worldwide. All indicators are sourced from international and national statistical offices, trade associations and the trade press, and they are processed to generate comparable data sets (see supplementary notes under details for more information).
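    As a quick consistency check on the figures quoted above, the implied 2024 baseline and the growth rate can be recomputed. This sketch assumes the 26 million increase is a rounded value, so the result is only expected to match the stated percentage approximately.

```python
# Figures quoted in the description, in millions of users.
forecast_2029 = 330.07
increase_2024_to_2029 = 26.0  # rounded in the source text

# Implied 2024 baseline and relative growth over the period.
baseline_2024 = forecast_2029 - increase_2024_to_2029
growth_pct = 100 * increase_2024_to_2029 / baseline_2024

print(f"Implied 2024 baseline: {baseline_2024:.2f} million")
print(f"Growth 2024-2029: {growth_pct:.2f} percent")
```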

  3. Social Media Channels and Statistics at the National Archives

    • catalog.data.gov
    • data.amerigeoss.org
    • +1more
    Updated Nov 7, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    National Archives and Records Administration (2024). Social Media Channels and Statistics at the National Archives [Dataset]. https://catalog.data.gov/dataset/social-media-channels-and-statistics-at-the-national-archives
    Dataset updated
    Nov 7, 2024
    Dataset provided by
    National Archives and Records Administrationhttp://www.archives.gov/
    Description

    More than 100 social media channels and statistics for the National Archives and Records Administration.

  4. NetVotes iKnow Dataset

    • data.niaid.nih.gov
    • zenodo.org
    Updated Oct 1, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Labatut, Vincent (2024). NetVotes iKnow Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6816075
    Dataset updated
    Oct 1, 2024
    Dataset provided by
    Figueiredo, Rosa
    Arınık, Nejat
    Labatut, Vincent
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Description. This is the data used in the experiment of the following conference paper:

    N. Arınık, R. Figueiredo, and V. Labatut, “Signed Graph Analysis for the Interpretation of Voting Behavior,” in International Conference on Knowledge Technologies and Data-driven Business - International Workshop on Social Network Analysis and Digital Humanities, Graz, AT, 2017, vol. 2025. ⟨hal-01583133⟩

    Source code. The source code is accessible on GitHub: https://github.com/CompNet/NetVotes

    Citation. If you use the data or source code, please cite the above paper.

    @InProceedings{Arinik2017,
      author    = {Arınık, Nejat and Figueiredo, Rosa and Labatut, Vincent},
      title     = {Signed Graph Analysis for the Interpretation of Voting Behavior},
      booktitle = {International Conference on Knowledge Technologies and Data-driven Business - International Workshop on Social Network Analysis and Digital Humanities},
      year      = {2017},
      volume    = {2025},
      series    = {CEUR Workshop Proceedings},
      address   = {Graz, AT},
      url       = {http://ceur-ws.org/Vol-2025/paper_rssna_1.pdf},
    }

    Details.

    RAW INPUT FILES

    The 'itsyourparliament' folder contains all raw input files for further data processing (such as network extraction). The folder structure is as follows:

    • itsyourparliament/
        • domains: There are 28 domain files. Each file corresponds to a domain (such as Agriculture, Economy, etc.) and contains the corresponding vote identifiers and their "itsyourparliament.eu" links.
        • meps: There are 870 Member of Parliament (MEP) files. Each file contains the MEP information (such as name, country, address, etc.).
        • votes: There are 7513 vote files. Each file contains the votes expressed by MEPs.

    NETWORKS AND CORRESPONDING PARTITIONS

    This work studies the voting behavior of French and Italian MEPs on "Agriculture and Rural Development" (AGRI) and "Economic and Monetary Affairs" (ECON) for each separate year of the 7th EP term (2009-10, 2010-11, 2011-12, 2012-13, 2013-14). Note that the interpretation part (section 4) of the published paper is limited to only a few of these instances (2009-10 in ECON and 2012-13 in AGRI).

    The extracted networks are located in the "networks" folder and the corresponding partitions are in the "partitions" folder. Both folders have the same structure, which is as follows:

    COUNTRY-NAME
    |_DOMAIN-NAME
    |_2009-10
    |_2010-11
    |_2011-12
    |_2012-13
    |_2013-14

    NETWORKS

    The networks in this folder are used in the article. All those networks are the ones obtained after the filtering step (as explained in the article). The networks are in 'Graphml' format. These networks are enriched with some MEPs' properties (such as name, political party, etc.) associated with each node.

    ALL NETWORKS

    For those who are interested in other countries or domains, we make available all possible networks that we can extract from the raw data, with and without the filtering step.

    COUNTRY-NAME
    |_m3
    |_negtr=NA_postr=NA: This folder contains all filtered networks. Note that the filtering step is explained in Section 2.1.2 of the article.
    |_bygroup
    |_bycountry
    |_negtr=0_postr=0: This folder contains all original networks (i.e. no filtering step).
    |_bygroup
    |_bycountry

    PARTITIONS

    The partitions are obtained in this way: First, the Ex-CC (exact) method is run, and we denote by 'k' the number of clusters detected in its output. This 'k' value is the reference point for running the ILS-RCC (heuristic) method, which requires specifying the desired number of clusters in the output. Then, ILS-RCC is run with various values ('k', 'k+1', 'k+2'). All those results are integrated into the initial network graphml files and then converted into Gephi format, so that the results can be explored interactively.

    Note that we need to handle the absent MEPs in the clustering results, because those MEPs correspond to isolated nodes in the networks. Each isolated node is considered a single-node cluster in the Ex-CC results. We simply omit those nodes in order to find the 'k' value (number of detected clusters) before running ILS-RCC. Note also that ILS-RCC does not process isolated nodes such that an isolated node can be part of a cluster.
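    The way the reference value 'k' is derived can be sketched as follows. This is only an illustration of the counting rule described above, not the NetVotes code: the partition layout (a dictionary mapping an MEP identifier to a cluster label), the function names, and the sample data are hypothetical.

```python
def reference_k(partition, isolated_nodes):
    """Count the clusters of an exact partition after omitting isolated nodes.

    `partition` maps a node (MEP) to its cluster label. Isolated nodes form
    single-node clusters in the exact result, so they are dropped first.
    """
    return len({label for node, label in partition.items()
                if node not in isolated_nodes})


def heuristic_cluster_counts(k, extra=2):
    """Cluster counts to try with the heuristic method: k, k+1, ..., k+extra."""
    return [k + offset for offset in range(extra + 1)]


# Hypothetical exact partition: mep4 and mep5 were absent (isolated nodes).
exact_partition = {"mep1": 0, "mep2": 0, "mep3": 1, "mep4": 2, "mep5": 3}
isolated = {"mep4", "mep5"}

k = reference_k(exact_partition, isolated)
print(k, heuristic_cluster_counts(k))
```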

    COMPARISON RESULTS

    The 'material-stats' folder contains all the comparison results obtained for Ex-CC and ILS-CC. The csv files associated with plots are also provided. The folder structure is as follows:

    • material-stats/
        • execTimePerf: The plot shows the execution time of Ex-CC and ILS-CC based on randomly generated complete networks of different size.
        • graphStructureAnalysis: The plots show the weights and links statistics for all instances.
        • ILS-CC-vs-Ex-CC: The folder contains 4 different comparisons between Ex-CC and ILS-CC: imbalance difference, number of detected clusters, difference of the number of detected clusters, NMI (Normalized Mutual Information).

    Funding: Agorantic FR 3621, FMJH Program Gaspard Monge in optimization and operation research (Project 2015-2842H)

  5. Average daily time spent on social media worldwide 2012-2024

    • statista.com
    • ai-chatbox.pro
    Updated Apr 10, 2024
    Cite
    Statista (2024). Average daily time spent on social media worldwide 2012-2024 [Dataset]. https://www.statista.com/statistics/433871/daily-social-media-usage-worldwide/
    Dataset updated
    Apr 10, 2024
    Dataset authored and provided by
    Statistahttp://statista.com/
    Area covered
    Worldwide
    Description

    How much time do people spend on social media? As of 2024, the average daily social media usage of internet users worldwide amounted to 143 minutes per day, down from 151 minutes in the previous year. Currently, the country with the most time spent on social media per day is Brazil, with online users spending an average of three hours and 49 minutes on social media each day. In comparison, the daily time spent with social media in the U.S. was just two hours and 16 minutes.

    Global social media usage

    Currently, the global social network penetration rate is 62.3 percent. Northern Europe had an 81.7 percent social media penetration rate, topping the ranking of global social media usage by region. Eastern and Middle Africa closed the ranking with 10.1 and 9.6 percent usage reach, respectively. People access social media for a variety of reasons. Users like to find funny or entertaining content and enjoy sharing photos and videos with friends, but mainly use social media to stay in touch with current events and friends.

    Global impact of social media

    Social media has a wide-reaching and significant impact on not only online activities but also offline behavior and life in general. During a global online user survey in February 2019, a significant share of respondents stated that social media had increased their access to information, ease of communication, and freedom of expression. On the flip side, respondents also felt that social media had worsened their personal privacy, increased polarization in politics, and heightened everyday distractions.

  6. Geo-location Graphs

    • kaggle.com
    Updated Nov 11, 2021
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Subhajit Sahu (2021). Geo-location Graphs [Dataset]. https://www.kaggle.com/wolfram77/graphs-geo-location/code
    Dataset updated
    Nov 11, 2021
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Subhajit Sahu
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Gowalla is a location-based social networking website where users share their locations by checking-in. The friendship network is undirected and was collected using their public API, and consists of 196,591 nodes and 950,327 edges. We have collected a total of 6,442,890 check-ins of these users over the period of Feb. 2009 - Oct. 2010.

    Brightkite was once a location-based social networking service provider where users shared their locations by checking-in. The friendship network was collected using their public API, and consists of 58,228 nodes and 214,078 edges. The network is originally directed but we have constructed a network with undirected edges when there is a friendship in both ways. We have also collected a total of 4,491,143 checkins of these users over the period of Apr. 2008 - Oct. 2010.
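    To put the two networks side by side, the node and edge counts quoted above can be turned into average degrees. The sketch assumes both networks are treated as simple undirected graphs, in which each edge contributes to the degree of two nodes, so the average degree is 2E/N.

```python
def average_degree(num_nodes, num_edges):
    """Average degree of an undirected graph: every edge touches two nodes."""
    return 2 * num_edges / num_nodes


# Node and edge counts as stated in the dataset description.
gowalla_avg = average_degree(196_591, 950_327)
brightkite_avg = average_degree(58_228, 214_078)

print(f"Gowalla: {gowalla_avg:.2f} friends per user on average")
print(f"Brightkite: {brightkite_avg:.2f} friends per user on average")
```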

    Stanford Network Analysis Platform (SNAP) is a general-purpose, high-performance system for the analysis and manipulation of large networks. Graphs consist of nodes and directed/undirected/multiple edges between the graph nodes. Networks are graphs with data on nodes and/or edges of the network.

    The core SNAP library is written in C++ and optimized for maximum performance and compact graph representation. It easily scales to massive networks with hundreds of millions of nodes, and billions of edges. It efficiently manipulates large graphs, calculates structural properties, generates regular and random graphs, and supports attributes on nodes and edges. Besides scalability to large graphs, an additional strength of SNAP is that nodes, edges and attributes in a graph or a network can be changed dynamically during the computation.

    SNAP was originally developed by Jure Leskovec in the course of his PhD studies. The first release was made available in Nov, 2009. SNAP uses a general purpose STL (Standard Template Library)-like library GLib developed at Jozef Stefan Institute. SNAP and GLib are being actively developed and used in numerous academic and industrial projects.

    https://snap.stanford.edu/data/index.html

  7. Hong Kong Social Contact Dynamics

    • kaggle.com
    Updated Feb 5, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    The Devastator (2023). Hong Kong Social Contact Dynamics [Dataset]. https://www.kaggle.com/datasets/thedevastator/hong-kong-social-contact-dynamics
    Dataset updated
    Feb 5, 2023
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    Hong Kong
    Description

    Hong Kong Social Contact Dynamics

    Understanding Age, Gender and Network Dynamics

    About this dataset

    This dataset provides an in-depth look at the dynamics of social interaction in Hong Kong. It contains comprehensive information regarding individuals, households and interactions between individuals, such as their ages, frequency and duration of contact, and genders. This data can be used to evaluate various social and economic trends and behaviors, as well as dynamics observed at different levels. For example, this data set is well suited to recognizing population-level trends such as the age and gender diversification of contacts, or to investigating the structure of social networks and the implications of contact patterns for health and economic outcomes. Additionally, it offers insights into different groups of people, including their activities related to permanent residence, work or leisure, by enabling one to understand their interactions and contact dynamics within their respective populations. Ultimately, this dataset supports a comprehensive understanding of social contact dynamics, which is fundamental for grasping why these interactions are crucial in Hong Kong's society today.

    How to use the dataset

    This dataset provides detailed information about the social contact dynamics in Hong Kong. With this dataset, it is possible to gain a comprehensive understanding of the patterns of various forms of social contact - from permanent residence and work contacts to leisure contacts. This guide will provide an overview and guidelines on how to use this dataset for analysis.

    Exploring Trends and Dynamics:

    To begin exploring the trends and dynamics of social contact in Hong Kong, start by looking at demographic factors such as age, gender, ethnicity, and educational attainment associated with different types of contacts (permanent residence/work/leisure). Consider the frequency and duration of contacts within these segments to identify any potential differences between them. Additionally, look at how these factors interact with each other – observe which segments have higher levels of interaction with each other or if there are any differences between different population groups based on their demographic characteristics. This can be done through visualizations such as line graphs or bar charts which can illustrate trends across timeframes or population demographics more clearly than raw numbers would alone.

    Investigating Social Networks:

    The data collected through this dataset also allows for investigation into social networks: understanding who connects with whom, both in real-life interactions and through digital channels (if applicable). Focus on analyzing individual or family networks rather than larger groups in order to get a clearer picture without adding too much complexity to the analysis. Analyze commonalities among individuals within a network even after controlling for factors that could affect interaction, such as age or gender, using clustering techniques for this step if appropriate. Then compare networks between individuals/families overall using graph theory methods such as length distributions (the average number of relationships one has), degrees (the number of links connected to one individual or family unit), centrality measures (identifying individuals who serve an important role bridging two different parts of the network), etc. These methods will help provide insights into varying structures between large groups, rather than focusing only on small-scale personal connections among friends, colleagues or relatives, which may not always offer accurate portrayals due to their naturally limited scope.
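    Two of the measures mentioned above, node degrees and the distribution of shortest-path lengths, can be computed with the standard library alone. This is a minimal sketch: the contact list is invented for illustration and does not reflect the dataset's actual columns or identifiers.

```python
from collections import Counter, deque


def build_adjacency(contacts):
    """Build an undirected adjacency map from (person, person) contact pairs."""
    adjacency = {}
    for a, b in contacts:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def degrees(adjacency):
    """Number of distinct contacts per person."""
    return {node: len(neighbours) for node, neighbours in adjacency.items()}


def path_length_distribution(adjacency):
    """Count unordered reachable pairs by shortest-path length (BFS per node)."""
    counts = Counter()
    for source in adjacency:
        distance = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in distance:
                    distance[neighbour] = distance[node] + 1
                    queue.append(neighbour)
        for target, d in distance.items():
            if source < target:  # count each unordered pair once
                counts[d] += 1
    return counts


# Hypothetical contact pairs between four participants.
contacts = [("p1", "p2"), ("p2", "p3"), ("p3", "p4"), ("p2", "p4")]
adjacency = build_adjacency(contacts)
print(degrees(adjacency))
print(dict(path_length_distribution(adjacency)))
```

    Here p2 has the highest degree and lies on every shortest path from p1 to the rest of the network, which is the kind of bridging role that centrality measures are meant to surface.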

    Modeling Health Implications:

    Finally, consider modeling health implications stemming from these observed patterns, particularly implications that may not be captured by simpler measures like count per contact hour (which does not differentiate based on intensity). Take into account aspects like viral transmission risk by analyzing secondary effects generated from contact events captured in the data, such as physical proximity when multiple people meet up together over multiple days.

    Research Ideas

    • Analyzing the age, gender and contact dynamics of different areas within Hong Kong to understand the local population trends and behavior.
    • Investigating the structure of social networks to study how patterns of contact vary among socio economic backgro...
  8. Bluesky Social Dataset

    • paperswithcode.com
    Updated Apr 28, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    (2024). Bluesky Social Dataset [Dataset]. https://paperswithcode.com/dataset/bluesky-social-dataset
    Dataset updated
    Apr 28, 2024
    Description

    Bluesky Social Dataset

    Pollution of online social spaces caused by rampaging d/misinformation is a growing societal concern. However, recent decisions to reduce access to social media APIs are causing a shortage of publicly available, recent, social media data, thus hindering the advancement of computational social science as a whole. To address this pressing issue, we present a large, high-coverage dataset of social interactions and user-generated content from Bluesky Social.

    The dataset contains the complete post history of over 4M users (81% of all registered accounts), totaling 235M posts. We also make available social data covering follow, comment, repost, and quote interactions.

    Since Bluesky allows users to create and bookmark feed generators (i.e., content recommendation algorithms), we also release the full output of several popular algorithms available on the platform, along with their timestamped “like” interactions and time of bookmarking.

    This dataset allows unprecedented analysis of online behavior and human-machine engagement patterns. Notably, it provides ground-truth data for studying the effects of content exposure and self-selection, and performing content virality and diffusion analysis.

    Dataset

    Here is a description of the dataset files.

    • followers.csv.gz. This compressed file contains the anonymized follower edge list. Once decompressed, each row consists of two comma-separated integers u, v, representing a directed following relation (i.e., user u follows user v).
    • posts.tar.gz. This compressed folder contains data on the individual posts collected. Decompressing this file results in 100 files, each containing the full posts of up to 50,000 users. Each post is stored as a JSON-formatted line.
    • interactions.csv.gz. This compressed file contains the anonymized interactions edge list. Once decompressed, each row consists of six comma-separated integers, and represents a comment, repost, or quote interaction. These integers correspond to the following fields, in this order: user_id, replied_author, thread_root_author, reposted_author, quoted_author, and date.
    • graphs.tar.gz. This compressed folder contains edge list files for the graphs emerging from reposts, quotes, and replies. Each interaction is timestamped. The folder also contains timestamped higher-order interactions emerging from discussion threads, each containing all users participating in a thread.
    • feed_posts.tar.gz. This compressed folder contains posts that appear in 11 thematic feeds. Decompressing this folder results in 11 files containing posts from one feed each. Posts are stored as JSON-formatted lines. Fields correspond to those in posts.tar.gz, except for those related to sentiment analysis (sent_label, sent_score) and reposts (repost_from, reposted_author).
    • feed_bookmarks.csv. This file contains users who bookmarked any of the collected feeds. Each record contains three comma-separated values, namely the feed name, the user id, and the timestamp.
    • feed_post_likes.tar.gz. This compressed folder contains data on likes to posts appearing in the feeds, one file per feed. Each record in the files contains the following information, in this order: the id of the "liker", the id of the post's author, the id of the liked post, and the like timestamp.
    • scripts.tar.gz. A collection of Python scripts, including the ones originally used to crawl the data and to perform experiments. These scripts are detailed in a document released within the folder.
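    Reading the follower edge list described above could look like the following sketch. It relies only on the stated row format (two comma-separated integers u, v meaning that u follows v); the absence of a header row and the helper names are assumptions, and a small in-memory gzip payload stands in for the real followers.csv.gz file.

```python
import csv
import gzip
import io
from collections import Counter


def read_follow_edges(text_stream):
    """Parse rows 'u,v' into (u, v) integer pairs, where user u follows user v."""
    edges = []
    for row in csv.reader(text_stream):
        if len(row) != 2:
            continue  # skip blank or malformed rows
        follower, followed = (int(value) for value in row)
        edges.append((follower, followed))
    return edges


def follower_counts(edges):
    """In-degree per user: how many accounts follow each user."""
    return Counter(followed for _, followed in edges)


# Stand-in for followers.csv.gz, built in memory for illustration.
payload = gzip.compress(b"1,2\n3,2\n2,1\n")
with gzip.open(io.BytesIO(payload), mode="rt", newline="") as handle:
    edges = read_follow_edges(handle)

print(edges)
print(follower_counts(edges).most_common(1))
```

    For the real file, passing its path to gzip.open in place of the in-memory buffer should work the same way; streaming row by row avoids decompressing the whole archive to disk first.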

    Citation

    If used for research purposes, please cite the following paper describing the dataset details:

    Andrea Failla and Giulio Rossetti. "I'm in the Bluesky Tonight": Insights from a Year Worth of Social Data. (2024) arXiv:2404.18984

    Acknowledgments: This work is supported by:

    the European Union – Horizon 2020 Program under the scheme “INFRAIA-01-2018-2019 – Integrating Activities for Advanced Communities”, Grant Agreement n.871042, “SoBigData++: European Integrated Infrastructure for Social Mining and Big Data Analytics” (http://www.sobigdata.eu); SoBigData.it which receives funding from the European Union – NextGenerationEU – National Recovery and Resilience Plan (Piano Nazionale di Ripresa e Resilienza, PNRR) – Project: “SoBigData.it – Strengthening the Italian RI for Social Mining and Big Data Analytics” – Prot. IR0000013 – Avviso n. 3264 del 28/12/2021; EU NextGenerationEU programme under the funding schemes PNRR-PE-AI FAIR (Future Artificial Intelligence Research).

  9. Orkut Social Network and Communities (SNAP)

    • kaggle.com
    Updated Dec 16, 2021
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Subhajit Sahu (2021). Orkut Social Network and Communities (SNAP) [Dataset]. https://www.kaggle.com/wolfram77/graphs-snap-com-orkut/discussion
    Dataset updated
    Dec 16, 2021
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Subhajit Sahu
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Orkut social network and ground-truth communities

    https://snap.stanford.edu/data/com-Orkut.html

    Dataset information

    Orkut (http://www.orkut.com/) is a free online social network where users form friendships with each other. Orkut also allows users to form groups, which
    other members can then join. We consider such user-defined groups as
    ground-truth communities. We provide the Orkut friendship social network
    and ground-truth communities. This data is provided by Alan Mislove et al. (http://socialnetworks.mpi-sws.org/data-imc2007.html)

    We regard each connected component in a group as a separate ground-truth
    community. We remove the ground-truth communities which have fewer than 3
    nodes. We also provide the top 5,000 communities with highest quality
    which are described in our paper (http://arxiv.org/abs/1205.6233). As for
    the network, we provide the largest connected component.

    Dataset statistics
    Nodes 3,072,441
    Edges 117,185,083
    Nodes in largest WCC 3072441 (1.000)
    Edges in largest WCC 117185083 (1.000)
    Nodes in largest SCC 3072441 (1.000)
    Edges in largest SCC 117185083 (1.000)
    Average clustering coefficient 0.1666
    Number of triangles 627584181
    Fraction of closed triangles 0.01414
    Diameter (longest shortest path) 9
    90-percentile effective diameter 4.8

    Source (citation)
    J. Yang and J. Leskovec. Defining and Evaluating Network Communities based on Ground-truth. ICDM, 2012. http://arxiv.org/abs/1205.6233

    Files
    File Description
    com-orkut.ungraph.txt.gz Undirected Orkut network
    com-orkut.all.cmty.txt.gz Orkut communities
    com-orkut.top5000.cmty.txt.gz Orkut communities (Top 5,000)

    Notes on inclusion into the SuiteSparse Matrix Collection, July 2018:

    The graph in the SNAP data set is 1-based, with nodes numbered 1 to
    3,072,626.

    In the SuiteSparse Matrix Collection, Problem.A is the undirected
    Orkut network, a matrix of size n-by-n with n=3,072,441, which is
    the number of unique user id's appearing in any edge.

    Problem.aux.nodeid is a list of the node id's that appear in the SNAP data set. A(i,j)=1 if person nodeid(i) is friends with person nodeid(j). The
    node id's are the same as the SNAP data set (1-based).

    C = Problem.aux.Communities_all is a sparse matrix of size n by 15,301,901 which represents the communities in the com-orkut.all.cmty.txt file, one column per community. The kth line in that file defines the kth community, and is the
    column C(:,k), where C(i,k)=1 if person nodeid(i) is in the kth
    community. Row C(i,:) and row/column i of the A matrix thus refer to the
    same person, nodeid(i).
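
The relationship between SNAP node ids, matrix rows, and community columns can be sketched in a few lines. This is only an illustration under stated assumptions: it assumes nodeid is the sorted list of unique ids appearing in any edge, it uses 0-based Python indices rather than the 1-based indices of the collection, and the helper names and toy data are hypothetical.

```python
def build_index(edges):
    """Return (nodeid, row_of): the sorted unique ids and their row index."""
    nodeid = sorted({node for edge in edges for node in edge})
    row_of = {node: row for row, node in enumerate(nodeid)}
    return nodeid, row_of


def membership_columns(community_lines, row_of):
    """For the kth line of a community file, list the rows i with C(i,k)=1."""
    return [
        sorted(row_of[int(token)] for token in line.split())
        for line in community_lines
    ]


# Toy data (made up): ids are sparse, as in the SNAP file, so they are
# compacted to consecutive rows.
edges = [(1, 5), (5, 9), (9, 1), (12, 5)]
nodeid, row_of = build_index(edges)
columns = membership_columns(["1 5 9", "5 12"], row_of)
print(nodeid)
print(columns)
```

Row i of the adjacency matrix and row i of each community column then refer to the same person, nodeid[i].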

    Ctop = Problem.aux.Communities_to...

  10. A Dataset of Multilingual Facebook Comments on Moros and Armed Conflict in...

    • data.niaid.nih.gov
    • repository.uantwerpen.be
    Updated Jul 16, 2024
    Cruz, Frances Antoinette (2024). A Dataset of Multilingual Facebook Comments on Moros and Armed Conflict in the Southern Philippines [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10971589
    Dataset updated
    Jul 16, 2024
    Dataset authored and provided by
    Cruz, Frances Antoinette
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Area covered
    Philippines, Mindanao
    Description

    This dataset is a collection of 12,478 social media comments found on the official Facebook pages of ten Philippine newspapers: The Philippine Daily Inquirer, Manila Bulletin, The Philippine Star, The Manila Times, Sunstar Cebu, Sunstar Davao, Cebu Daily News, The Freeman, MindaNews, and The Mindanao Times, spanning the years 2015, 2017 and 2019. The comments contain terms related to the Moro identity and the Mamasapano Clash, the Marawi Siege and the establishment of BARMM in the southern Philippines, allowing researchers to study semantic fields with regard to Muslims and the relationship between the texts and the source newspaper, their region of origin, and political administration, among other variables. All comments in the dataset were downloaded through Facebook's Graph API via Facepager (Jünger & Keyling, 2019).

    One CSV file (MMB151719SOCMED_v2.csv) is provided, along with a codebook that contains descriptions of the variables and codes used in the CSV file, and a Readme document with a changelog.

    Each social media comment is annotated with the following metadata:

    object_id: identifier associated with the comment;

    message: the textual string of the comment;

    message_proc: the textual string of the comment after pre-processing;

    lang_label: categorical value for the language of the comment (Tagalog (Filipino), Cebuano, English, Taglish, Bislog, Bislish, Trilingual or Other);

    from_name: identifier of public pages (not profiles of individuals) leaving comments (NaN for profiles of individuals, 'NAME' for public pages besides the newspapers, otherwise, the page name of the newspaper);

    created_time: string generated by Facebook's Graph API for the date and time the comment was posted;

    month_year: categorical value in the form string+YY (e.g. Jun-15) of the month and year when the comment was posted;

    year: numerical value in the form YY;

    newspaper: categorical value for the newspaper Facebook page under which the comment was found;

    corpus: categorical value for comments from the main corpus or the side (control) corpus;

    administration: categorical value for political administration (pbsa = President Benigno Aquino III, prrd = President Rodrigo Roa Duterte);

    count: numerical value referring to the number of string sequences without spaces;
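
A minimal sketch of how these fields might be read with Python's standard library follows. The column names come from the list above, but the three sample rows are invented for illustration, and the exact label strings stored in the real CSV (for example in lang_label) are an assumption; the codebook is the authority on the actual codes.

```python
import csv
import io
from collections import Counter
from datetime import datetime

# Invented sample rows; the real file is MMB151719SOCMED_v2.csv.
sample = io.StringIO(
    "object_id,lang_label,month_year,administration\n"
    "c1,English,Jun-15,pbsa\n"
    "c2,Cebuano,May-17,prrd\n"
    "c3,English,Mar-19,prrd\n"
)

rows = list(csv.DictReader(sample))

# Count comments per language label.
lang_counts = Counter(row["lang_label"] for row in rows)

# month_year has the form "Jun-15": abbreviated month name, dash, two-digit year.
posted = [datetime.strptime(row["month_year"], "%b-%y") for row in rows]

print(lang_counts)
print([(d.year, d.month) for d in posted])
```

To work with the real file, the StringIO object would be replaced by an opened file handle; note that %b parsing depends on an English locale for the month abbreviations.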

    The dataset may only be used for non-commercial purposes and is licensed under the CC BY-NC-SA 4.0 DEED.

    V2 - 05/06/2024

    Corrections

    Corrections made to region to include Luzon, Visayas and Mindanao (as opposed to Mindanao, non-Mindanao);

    Corrections made to administration coding.

    This dataset is described by:

    Cruz, F. A. (2024). A Multilingual Collection of Facebook Comments on the Moro Identity and Armed Conflict in the Southern Philippines. Journal of Open Humanities Data, 10(1), 41. DOI: https://doi.org/10.5334/johd.219

    Bibliography

    Jünger, J., & Keyling, T. (2019). Facepager: An application for automated data retrieval on the web (4.5.3) [Computer software]. https://github.com/strohne/Facepager/

  11. Social Media Worldwide Usage Statistics

    • searchlogistics.com
    Updated Nov 12, 2024
    (2024). Social Media Worldwide Usage Statistics [Dataset]. https://www.searchlogistics.com/learn/statistics/social-media-addiction-statistics/
    Dataset updated
    Nov 12, 2024
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    56.8% of the world’s total population is active on social media.

  12. Twitter users in the United States 2019-2028

    • statista.com
    • ai-chatbox.pro
    Updated Jun 13, 2024
    Statista Research Department (2024). Twitter users in the United States 2019-2028 [Dataset]. https://www.statista.com/topics/3196/social-media-usage-in-the-united-states/
    Dataset updated
    Jun 13, 2024
    Dataset provided by
    Statista (http://statista.com/)
    Authors
    Statista Research Department
    Area covered
    United States
    Description

    The number of Twitter users in the United States was forecast to increase continuously between 2024 and 2028 by a total of 4.3 million users (+5.32 percent). After the ninth consecutive year of growth, the Twitter user base is estimated to reach 85.08 million users, a new peak, in 2028. Notably, the number of Twitter users has increased continuously over the past years. User figures, shown here for the platform Twitter, have been estimated by taking into account company filings or press material, secondary research, app downloads and traffic data. They refer to the average monthly active users over the period. The data shown are an excerpt of Statista's Key Market Indicators (KMI). The KMI are a collection of primary and secondary indicators on the macro-economic, demographic and technological environment in up to 150 countries and regions worldwide. All indicators are sourced from international and national statistical offices, trade associations and the trade press, and they are processed to generate comparable data sets (see supplementary notes under details for more information). Find more key insights for the number of Twitter users in countries like Canada and Mexico.

  13. Reddit users in the United States 2019-2028

    • statista.com
    • ai-chatbox.pro
    Updated Jun 13, 2024
    Statista Research Department (2024). Reddit users in the United States 2019-2028 [Dataset]. https://www.statista.com/topics/3196/social-media-usage-in-the-united-states/
    Dataset updated
    Jun 13, 2024
    Dataset provided by
    Statista (http://statista.com/)
    Authors
    Statista Research Department
    Area covered
    United States
    Description

    The number of Reddit users in the United States was forecast to increase continuously between 2024 and 2028 by a total of 10.3 million users (+5.21 percent). After the ninth consecutive year of growth, the Reddit user base is estimated to reach 208.12 million users, a new peak, in 2028. Notably, the number of Reddit users has increased continuously over the past years. User figures, shown here for the platform Reddit, have been estimated by taking into account company filings or press material, secondary research, app downloads and traffic data. They refer to the average monthly active users over the period and count multiple accounts held by one person only once. Reddit users encompass both users that are logged in and those that are not. The data shown are an excerpt of Statista's Key Market Indicators (KMI). The KMI are a collection of primary and secondary indicators on the macro-economic, demographic and technological environment in up to 150 countries and regions worldwide. All indicators are sourced from international and national statistical offices, trade associations and the trade press, and they are processed to generate comparable data sets (see supplementary notes under details for more information). Find more key insights for the number of Reddit users in countries like Mexico and Canada.

  14. Graph of Flickr Photo-Sharing Social Network Crawled in May 2006

    • purr.purdue.edu
    Updated Nov 1, 2022
    David Gleich (2022). Graph of Flickr Photo-Sharing Social Network Crawled in May 2006 [Dataset]. http://doi.org/10.4231/D39P2W550
    Dataset updated
    Nov 1, 2022
    Dataset provided by
    PURR
    Authors
    David Gleich
    License

    CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Crawl of the Flickr photo-sharing social network from May 2006, returning a graph with 820,878 nodes and 9,837,214 edges. The dataset is distributed as an SMAT file, along with a README file containing code to read the file in Python and MATLAB.

  15. PATTERN Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Apr 2, 2021
    Vijay Prakash Dwivedi; Chaitanya K. Joshi; Anh Tuan Luu; Thomas Laurent; Yoshua Bengio; Xavier Bresson (2021). PATTERN Dataset [Dataset]. https://paperswithcode.com/dataset/pattern
    Dataset updated
    Apr 2, 2021
    Authors
    Vijay Prakash Dwivedi; Chaitanya K. Joshi; Anh Tuan Luu; Thomas Laurent; Yoshua Bengio; Xavier Bresson
    Description

    PATTERN is a node classification task generated with Stochastic Block Models (SBMs), which are widely used to model communities in social networks by modulating the intra- and extra-community connections, thereby controlling the difficulty of the task. PATTERN tests the fundamental graph task of recognizing specific predetermined subgraphs.
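
To make the role of intra- and extra-community connection probabilities concrete, here is a minimal sketch of a generic Stochastic Block Model sampler. It is not the generator used to build PATTERN: the function name, block sizes, and probabilities are hypothetical, and the step that embeds a predetermined pattern subgraph is not shown.

```python
import random


def sample_sbm(block_sizes, p_in, p_out, seed=0):
    """Sample an undirected SBM graph.

    Nodes in the same block are linked with probability p_in, nodes in
    different blocks with probability p_out. Returns (labels, edges).
    """
    rng = random.Random(seed)
    labels = [block for block, size in enumerate(block_sizes) for _ in range(size)]
    n = len(labels)
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            p = p_in if labels[i] == labels[j] else p_out
            if rng.random() < p:
                edges.append((i, j))
    return labels, edges


labels, edges = sample_sbm([20, 20], p_in=0.5, p_out=0.05)
intra = sum(labels[u] == labels[v] for u, v in edges)
inter = len(edges) - intra
print(f"{intra} intra-community edges, {inter} extra-community edges")
```

Moving p_out closer to p_in blurs the community structure, which is one way such a model can make a classification task harder.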

  16. Official USA Cities Simplified Roads Network

    • dataverse.harvard.edu
    Updated Apr 6, 2017
    Fabien Pfaender (2017). Official USA Cities Simplified Roads Network [Dataset]. http://doi.org/10.7910/DVN/19UK7N
    Dataset updated
    Apr 6, 2017
    Dataset provided by
    Harvard Dataverse
    Authors
    Fabien Pfaender
    License

    CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    United States
    Description

    Complete dataset of the road networks of all 29,850 USA cities, as graphs in the shp format. The extracts follow the official 2016 USA city boundaries. Graphs are identified by their [city_code].shp. City codes are provided by the Tiger Census Dataset. Graphs have been created by:

    • extracting all openstreetmap.org (osm) maps for each USA city;
    • extracting the graph from the osm extract using the policosm python github library;
    • simplifying the graph by removing all degree two nodes to retain only a workable transportation network. Original road length is retained as an attribute.

    Nodes include latitude and longitude attributes from the WGS84 projection. Edges include length in meters (precision < 1m) and the tag:highway value from osm. See policosm on github for more information on the extraction algorithm.
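
The degree-two simplification step can be illustrated with a short sketch. This is not the policosm implementation: it is a generic contraction written under the assumptions that the graph is simple and undirected, that there are no self-loops, and that a degree-two node is left in place when removing it would create a parallel edge. The function name and the toy road segment are hypothetical.

```python
def simplify_degree_two(edges):
    """Remove degree-two nodes, merging their two edges and summing lengths.

    edges: list of (u, v, length) tuples for an undirected simple graph.
    Returns a sorted list of (u, v, length) tuples with u < v.
    """
    adj = {}
    for u, v, length in edges:
        adj.setdefault(u, {})[v] = length
        adj.setdefault(v, {})[u] = length

    changed = True
    while changed:
        changed = False
        for node in list(adj):
            if len(adj[node]) != 2:
                continue
            (a, len_a), (b, len_b) = adj[node].items()
            if b in adj[a]:
                continue  # contracting would create a parallel edge; keep node
            # Splice the node out and connect its two neighbours directly.
            del adj[a][node], adj[b][node], adj[node]
            adj[a][b] = adj[b][a] = len_a + len_b
            changed = True

    return sorted(
        (u, v, w) for u, nbrs in adj.items() for v, w in nbrs.items() if u < v
    )


# Toy road: A - x - y - B, where x and y are shape points, plus two side
# streets leaving the junction B. Lengths are in meters.
roads = [("A", "x", 10), ("x", "y", 20), ("y", "B", 30), ("B", "C", 5), ("B", "D", 7)]
simplified = simplify_degree_two(roads)
print(simplified)
```

The intermediate nodes x and y disappear, and the single remaining A-B edge carries the summed length of the three original segments.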

  17. Worldwide Social Media Addiction Facts

    • searchlogistics.com
    Updated Nov 12, 2024
    (2024). Worldwide Social Media Addiction Facts [Dataset]. https://www.searchlogistics.com/learn/statistics/social-media-addiction-statistics/
    Dataset updated
    Nov 12, 2024
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Over 210 million people worldwide suffer from social media addiction.

  18. Developer Community and Code Datasets

    • datarade.ai
    Oxylabs, Developer Community and Code Datasets [Dataset]. https://datarade.ai/data-products/developer-community-and-code-datasets-oxylabs
    Available download formats: .bin, .json, .xml, .csv, .xls, .sql, .txt
    Dataset authored and provided by
    Oxylabs
    Area covered
    Philippines, El Salvador, Tuvalu, Bahamas, Guyana, Saint Pierre and Miquelon, United Kingdom, South Sudan, Marshall Islands, Djibouti
    Description

    Unlock the power of ready-to-use data sourced from developer communities and repositories with Developer Community and Code Datasets.

    Data Sources:

    1. GitHub: Access comprehensive data about GitHub repositories, developer profiles, contributions, issues, social interactions, and more.

    2. StackShare: Receive information about companies, their technology stacks, reviews, tools, services, trends, and more.

    3. DockerHub: Dive into data from container images, repositories, developer profiles, contributions, usage statistics, and more.

    Developer Community and Code Datasets are a treasure trove of public data points gathered from tech communities and code repositories across the web.

    With our datasets, you'll receive:

    • Usernames;
    • Companies;
    • Locations;
    • Job Titles;
    • Follower Counts;
    • Contact Details;
    • Employability Statuses;
    • And More.

    Choose from various output formats, storage options, and delivery frequencies:

    • Get datasets in CSV, JSON, or other preferred formats.
    • Opt for data delivery via SFTP or directly to your cloud storage, such as AWS S3.
    • Receive datasets either once or as per your agreed-upon schedule.

    Why choose our Datasets?

    1. Fresh and accurate data: Access complete, clean, and structured data from scraping professionals, ensuring the highest quality.

    2. Time and resource savings: Let us handle data extraction and processing cost-effectively, freeing your resources for strategic tasks.

    3. Customized solutions: Share your unique data needs, and we'll tailor our data harvesting approach to fit your requirements perfectly.

    4. Legal compliance: Partner with a trusted leader in ethical data collection. Oxylabs is trusted by Fortune 500 companies and adheres to GDPR and CCPA standards.

    Pricing Options:

    Standard Datasets: choose from various ready-to-use datasets with standardized data schemas, priced from $1,000/month.

    Custom Datasets: Tailor datasets from any public web domain to your unique business needs. Contact our sales team for custom pricing.

    Experience a seamless journey with Oxylabs:

    • Understanding your data needs: We work closely to understand your business nature and daily operations, defining your unique data requirements.
    • Developing a customized solution: Our experts create a custom framework to extract public data using our in-house web scraping infrastructure.
    • Delivering data sample: We provide a sample for your feedback on data quality and the entire delivery process.
    • Continuous data delivery: We continuously collect public data and deliver custom datasets per the agreed frequency.

    Empower your data-driven decisions with Oxylabs Developer Community and Code Datasets!

  19. Breaking Bad : Network Analysis

    • kaggle.com
    Updated Jan 31, 2023
    Jishnu (2023). Breaking Bad : Network Analysis [Dataset]. https://www.kaggle.com/datasets/jishnukoliyadan/breaking-bad-network-analysis
    Dataset updated
    Jan 31, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Jishnu
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0), https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    This data was collected for a network analysis of the Breaking Bad television series.

    Inspiration

    DataCamp's A Network Analysis of Game of Thrones was the inspiration for the project. Since no relationship dataset was available for Breaking Bad, I decided to generate one from episode summaries for the graph network analysis.

    Dataset

    The data was collected using web scraping from the fandom page of the Breaking Bad series.

  20. A dataset of Covid-related misinformation videos and their spread on social...

    • explore.openaire.eu
    • data.niaid.nih.gov
    Updated Feb 23, 2021
    Aleksi Knuutila (2021). A dataset of Covid-related misinformation videos and their spread on social media [Dataset]. http://doi.org/10.5281/zenodo.4557827
    Dataset updated
    Feb 23, 2021
    Authors
    Aleksi Knuutila
    Description

    This dataset contains metadata about all Covid-related YouTube videos which circulated on public social media, but which YouTube eventually removed because they contained false information. It describes 8,122 videos that were shared between November 2019 and June 2020. The dataset contains unique identifiers for the videos and social media accounts that shared the videos, statistics on social media engagement and metadata such as video titles and view counts where they were recoverable. We publish the data alongside the code used to produce it on GitHub. The dataset has reuse potential for research studying narratives related to the coronavirus, the impact of social media on knowledge about health and the politics of social media platforms.

Marc Damie (2025). Fedivertex [Dataset]. http://doi.org/10.34740/kaggle/ds/6877842

Fedivertex

The Fediverse Graph Dataset

Dataset updated
May 8, 2025
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Marc Damie
License

Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically

Description

![Peertube "follow" graph](https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F25663426%2Fef0839f1c6342b2f89b87d08acfb4b74%2Fpeertube_graph(1).png?generation=1746770713374326&alt=media)

Above is the Peertube "follow" graph. The colours correspond to the language of the server (purple: unknown, green: French, blue: English, black: German, orange: Italian, grey: others).

Introduction

Decentralized machine learning, where each client keeps its own data locally and uses its own computational resources to collaboratively train a model by exchanging peer-to-peer messages, is increasingly popular, as it enables better scalability and control over the data. A major challenge in this setting is that learning dynamics depend on the topology of the communication graph, which motivates the use of real graph datasets for benchmarking decentralized algorithms. Unfortunately, existing graph datasets are largely limited to for-profit social networks crawled at a fixed point in time and often collected at the user scale, where links are heavily influenced by the platform and its recommendation algorithms. The Fediverse, which includes several free and open-source decentralized social media platforms such as Mastodon, Misskey, and Lemmy, offers an interesting real-world alternative. We introduce Fedivertex, a new dataset covering seven social networks from the Fediverse, crawled on a weekly basis.

We refer to our paper for a detailed presentation of the graphs: [SOON]

Usage

Python

We implemented a simple Python API to interact easily with the dataset: https://pypi.org/project/fedivertex/

pip3 install fedivertex

This package automatically downloads the dataset and generates NetworkX graphs.

from fedivertex import GraphLoader

loader = GraphLoader()

# List the available graph types for a given software (here: federation and active_user)
loader.list_graph_types("mastodon")

# G is the NetworkX graph of the giant component of the active users graph at the first collection date
G = loader.get_graph(software="mastodon", graph_type="active_user", index=0, only_largest_component=True)

We also provide a Kaggle notebook demonstrating simple operations using this library: https://www.kaggle.com/code/marcdamie/exploratory-graph-data-analysis-of-fedivertex

Available graphs

The dataset contains graphs crawled on a weekly basis on 7 social networks from the Fediverse. Each graph quantifies/characterizes the interaction differently depending on the information provided by the public API of these networks.

We briefly present the graphs below (NB: the term "instance" refers to servers on the Fediverse):

  • [Bookwyrm/Friendica/Lemmy/Mastodon/Misskey/Pleroma] "federation" graphs: If two instances know each other they are connected in this graph. The federation graph then corresponds to the undirected communication graph between instances.
  • Peertube "follow" graphs: On Peertube, an instance X can follow an instance Y to let its users see all the videos posted on Y. This graph is a directed graph with edges of weight 1 for following.
  • Lemmy "federation with blocks" graphs: This graph completes the federation graph with negative edges when an instance X blocks instance Y. The graph is directed.
  • Lemmy "cross-instance" graphs: two instances are connected as soon as there exists a pair of users who published a message in the same thread, but possibly on a third instance. This is an undirected graph, less sparse than "intra-instance".
  • Lemmy "intra-instance" graphs: the instance X is linked to Y if a user of X has published a message on instance Y. This graph is directed and very sparse.
  • [Mastodon/Misskey/Pleroma] "active users" graphs: For each instance, we consider the set of the 10K most recently active users. Then, for each user of an instance X, we consider the list of the users they follow, and add 1 to the edge from X to Y, where Y is the instance of the followed user. The weight of the edge from X to Y thus encodes how much of the content seen on instance X is generated on instance Y. Note that this graph thus contains self-loops.
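
The edge-weight rule of the "active users" graphs can be sketched as a simple counting loop. This is only an illustration of the description above, not the crawler used to build the dataset: the function name, the input format, and the toy users are hypothetical, and the restriction to the 10K most recently active users per instance is left out.

```python
from collections import Counter


def active_user_edges(follows, home_instance):
    """Aggregate user-level follows into weighted instance-level edges.

    follows: dict mapping a user to the list of users they follow.
    home_instance: dict mapping each user to the instance hosting them.
    Returns a Counter keyed by (source instance, target instance).
    """
    weights = Counter()
    for user, followed_users in follows.items():
        source = home_instance[user]
        for followed in followed_users:
            # Each follow adds 1 to the edge from the follower's instance
            # to the instance of the followed user (self-loops included).
            weights[(source, home_instance[followed])] += 1
    return weights


# Toy data (made up): two users on x.example, one on y.example.
home_instance = {"alice": "x.example", "bob": "x.example", "carol": "y.example"}
follows = {"alice": ["bob", "carol"], "bob": ["carol"], "carol": ["alice"]}
weights = active_user_edges(follows, home_instance)
print(dict(weights))
```

Here the self-loop on x.example comes from alice following bob on the same instance, which matches the remark that these graphs contain self-loops.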

These graphs provide diverse perspectives on the Fediverse as they capture more or less subtle phenomena. For example, "federation" graphs are dense, while "intra-instance" graphs are sparse. We have performed a detailed exploratory data analysis in this notebook.

Gephi

Our CSV files are formatted so that they can be directly imported into Gephi for graph visualization. Find below an example Gephi visualization of the Misskey "active users" graph (without the misskey.io node). The colours correspond to the language of the server (purple: unknown, red: Japanese, brown: Korean, blue: English, yellow: Chinese).

![Misskey "active users" graph](https://www.go...
