Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
![Peertube "follow" graph](https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F25663426%2Fef0839f1c6342b2f89b87d08acfb4b74%2Fpeertube_graph(1).png?generation=1746770713374326&alt=media)
Above is the Peertube "follow" graph. The colours correspond to the language of the server (purple: unknown, green: French, blue: English, black: German, orange: Italian, grey: others).
Decentralized machine learning---where each client keeps its own data locally and uses its own computational resources to collaboratively train a model by exchanging peer-to-peer messages---is increasingly popular, as it enables better scalability and control over the data. A major challenge in this setting is that learning dynamics depend on the topology of the communication graph, which motivates the use of real graph datasets for benchmarking decentralized algorithms. Unfortunately, existing graph datasets are largely limited to for-profit social networks crawled at a fixed point in time and often collected at the user scale, where links are heavily influenced by the platform and its recommendation algorithms. The Fediverse, which includes several free and open-source decentralized social media platforms such as Mastodon, Misskey, and Lemmy, offers an interesting real-world alternative. We introduce Fedivertex, a new dataset covering seven social networks from the Fediverse, crawled on a weekly basis.
We refer to our paper for a detailed presentation of the graphs: [SOON]
We implemented a simple Python API to interact easily with the dataset: https://pypi.org/project/fedivertex/
pip3 install fedivertex
This package automatically downloads the dataset and generates NetworkX graphs.
from fedivertex import GraphLoader

loader = GraphLoader()
loader.list_graph_types("mastodon")
# List the available graph types for a given software, here federation and active_user

G = loader.get_graph(software = "mastodon", graph_type = "active_user", index = 0, only_largest_component = True)
# G contains the NetworkX graph of the giant component of the active users graph at the first date of collection
We also provide a Kaggle notebook demonstrating simple operations using this library: https://www.kaggle.com/code/marcdamie/exploratory-graph-data-analysis-of-fedivertex
The dataset contains graphs crawled on a daily basis from 7 social networks of the Fediverse. Each graph quantifies or characterizes interactions differently, depending on the information provided by the public API of each network.
We briefly present the graphs below (NB: the term "instance" refers to a server on the Fediverse):
These graphs provide diverse perspectives on the Fediverse, as they capture more or less subtle phenomena. For example, "federation" graphs are dense, while "intra-instance" graphs are sparse. We have performed a detailed exploratory data analysis in this notebook.
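As a quick illustration of this density contrast, here is a minimal sketch comparing two Mastodon graph types with the loader and NetworkX. It assumes GraphLoader takes no constructor arguments (as in the snippet above); the graph_type names "federation" and "active_user" are the ones listed earlier.

from fedivertex import GraphLoader
import networkx as nx

# Minimal sketch: compare the density of two Mastodon graphs at the first collection date.
# Check loader.list_graph_types("mastodon") for the exact type names available.
loader = GraphLoader()
for graph_type in ["federation", "active_user"]:
    G = loader.get_graph(software = "mastodon", graph_type = graph_type, index = 0, only_largest_component = True)
    print(graph_type, G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges, density =", round(nx.density(G), 5))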
Our CSV files are formatted so that they can be directly imported into Gephi for graph visualization. Find below an example Gephi visualization of the Misskey "active users" graph (without the misskey.io node). The colours correspond to the language of the server (purple: unknown, red: Japanese, brown: Korean, blue: English, yellow: Chinese).
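If you load a graph through the Python package rather than the raw CSV files, one alternative way to get it into Gephi is to export it to GEXF with NetworkX. This is only a sketch of that workflow, not something shipped with the dataset; the "misskey"/"active_user" arguments are assumed to match the Misskey active-users graph mentioned above.

import networkx as nx
from fedivertex import GraphLoader

# Sketch: export a loaded graph to GEXF so it can be opened in Gephi (File > Open).
# Assumes GraphLoader() needs no arguments and that "active_user" is a valid Misskey graph type.
loader = GraphLoader()
G = loader.get_graph(software = "misskey", graph_type = "active_user", index = 0, only_largest_component = True)
nx.write_gexf(G, "misskey_active_user.gexf")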
After the ninth consecutive increasing year, the social media user base is estimated to reach 330.07 million users and therefore a new peak in 2029. Notably, the number of social media users was continuously increasing over the past years. The shown figures regarding social media users have been derived from survey data that has been processed to estimate missing demographics. The shown data are an excerpt of Statista's Key Market Indicators (KMI). The KMI are a collection of primary and secondary indicators on the macro-economic, demographic and technological environment in up to 150 countries and regions worldwide. All indicators are sourced from international and national statistical offices, trade associations and the trade press, and they are processed to generate comparable data sets (see supplementary notes under details for more information).
More than 100 social media channels and statistics for the National Archives and Records Administration.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description. This is the data used in the experiment of the following conference paper:
N. Arınık, R. Figueiredo, and V. Labatut, “Signed Graph Analysis for the Interpretation of Voting Behavior,” in International Conference on Knowledge Technologies and Data-driven Business - International Workshop on Social Network Analysis and Digital Humanities, Graz, AT, 2017, vol. 2025. ⟨hal-01583133⟩
Source code. The code source is accessible on GitHub: https://github.com/CompNet/NetVotes
Citation. If you use the data or source code, please cite the above paper.
@InProceedings{Arinik2017,
  author    = {Arınık, Nejat and Figueiredo, Rosa and Labatut, Vincent},
  title     = {Signed Graph Analysis for the Interpretation of Voting Behavior},
  booktitle = {International Conference on Knowledge Technologies and Data-driven Business - International Workshop on Social Network Analysis and Digital Humanities},
  year      = {2017},
  volume    = {2025},
  series    = {CEUR Workshop Proceedings},
  address   = {Graz, AT},
  url       = {http://ceur-ws.org/Vol-2025/paper_rssna_1.pdf},
}
Details.
----------------------
# COMPARISON RESULTS
The 'material-stats' folder contains all the comparison results obtained for Ex-CC and ILS-CC. The csv files associated with plots are also provided. The folder structure is as follows:
* material-stats/
** execTimePerf: The plot shows the execution time of Ex-CC and ILS-CC based on randomly generated complete networks of different size.
** graphStructureAnalysis: The plots show the weights and links statistics for all instances.
** ILS-CC-vs-Ex-CC: The folder contains 4 different comparisons between Ex-CC and ILS-CC: imbalance difference, number of detected clusters, difference of the number of detected clusters, NMI (Normalized Mutual Information).
----------------------
Funding: Agorantic FR 3621, FMJH Program Gaspard Monge in optimization and operations research (Project 2015-2842H)
How much time do people spend on social media? As of 2024, the average daily social media usage of internet users worldwide amounted to 143 minutes per day, down from 151 minutes in the previous year. Currently, the country with the most time spent on social media per day is Brazil, with online users spending an average of three hours and 49 minutes on social media each day. In comparison, the daily time spent with social media in the U.S. was just two hours and 16 minutes.
Global social media usage: Currently, the global social network penetration rate is 62.3 percent. Northern Europe had an 81.7 percent social media penetration rate, topping the ranking of global social media usage by region. Eastern and Middle Africa closed the ranking with 10.1 and 9.6 percent usage reach, respectively. People access social media for a variety of reasons. Users like to find funny or entertaining content and enjoy sharing photos and videos with friends, but mainly use social media to stay in touch with friends and current events.
Global impact of social media: Social media has a wide-reaching and significant impact on not only online activities but also offline behavior and life in general. During a global online user survey in February 2019, a significant share of respondents stated that social media had increased their access to information, ease of communication, and freedom of expression. On the flip side, respondents also felt that social media had worsened their personal privacy, increased polarization in politics and heightened everyday distractions.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Gowalla is a location-based social networking website where users share their locations by checking in. The friendship network is undirected and was collected using their public API; it consists of 196,591 nodes and 950,327 edges. We have collected a total of 6,442,890 check-ins of these users over the period of Feb. 2009 - Oct. 2010.
Brightkite was once a location-based social networking service provider where users shared their locations by checking in. The friendship network was collected using their public API and consists of 58,228 nodes and 214,078 edges. The network is originally directed, but we have constructed a network with undirected edges when there is a friendship in both directions. We have also collected a total of 4,491,143 check-ins of these users over the period of Apr. 2008 - Oct. 2010.
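The Brightkite description keeps an edge only when the friendship exists in both directions. A minimal NetworkX sketch of that reduction, assuming a SNAP-style whitespace-separated edge list (the file name here is a placeholder):

import networkx as nx

# Sketch: build the undirected "mutual friendship" graph from a directed edge list.
# Assumes one "u v" pair per line, SNAP-style; adjust the path to the actual file.
D = nx.read_edgelist("brightkite_edges.txt", create_using=nx.DiGraph, nodetype=int)
mutual = [(u, v) for u, v in D.edges() if D.has_edge(v, u)]
G = nx.Graph(mutual)  # undirected graph keeping only reciprocated friendships
print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges")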
Stanford Network Analysis Platform (SNAP) is a general-purpose, high-performance system for analysis and manipulation of large networks. Graphs consist of nodes and directed/undirected/multiple edges between the graph nodes. Networks are graphs with data on nodes and/or edges of the network.
The core SNAP library is written in C++ and optimized for maximum performance and compact graph representation. It easily scales to massive networks with hundreds of millions of nodes and billions of edges. It efficiently manipulates large graphs, calculates structural properties, generates regular and random graphs, and supports attributes on nodes and edges. Besides scalability to large graphs, an additional strength of SNAP is that nodes, edges and attributes in a graph or a network can be changed dynamically during the computation.
SNAP was originally developed by Jure Leskovec in the course of his PhD studies. The first release was made available in November 2009. SNAP uses a general-purpose STL (Standard Template Library)-like library, GLib, developed at the Jozef Stefan Institute. SNAP and GLib are being actively developed and used in numerous academic and industrial projects.
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
This dataset provides an in-depth look at the dynamics of social interaction, particularly in Hong Kong. It contains comprehensive information on individuals, households and interactions between individuals, such as their ages, genders, and the frequency and duration of contact. This data can be used to evaluate various social and economic trends, behaviors and dynamics observed at different levels. For example, the dataset is well suited to recognizing population-level trends such as the age and gender diversification of contacts, or to investigating the structure of social networks and the implications of contact patterns on health and economic outcomes. It also offers valuable insights into different groups of people, including their permanent-residence, work and leisure contacts, by enabling one to understand their interactions and contact dynamics within their respective populations. Ultimately, this dataset is key to attaining a comprehensive understanding of social contact dynamics, which is fundamental for grasping why these interactions are crucial in Hong Kong's society today.
This dataset provides detailed information about the social contact dynamics in Hong Kong. With this dataset, it is possible to gain a comprehensive understanding of the patterns of various forms of social contact - from permanent residence and work contacts to leisure contacts. This guide will provide an overview and guidelines on how to use this dataset for analysis.
Exploring Trends and Dynamics:
To begin exploring the trends and dynamics of social contact in Hong Kong, start by looking at demographic factors such as age, gender, ethnicity, and educational attainment associated with different types of contacts (permanent residence/work/leisure). Consider the frequency and duration of contacts within these segments to identify any potential differences between them. Additionally, look at how these factors interact with each other – observe which segments have higher levels of interaction with each other or if there are any differences between different population groups based on their demographic characteristics. This can be done through visualizations such as line graphs or bar charts which can illustrate trends across timeframes or population demographics more clearly than raw numbers would alone.
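As an illustration of that kind of segment-level summary, here is a small pandas sketch. The file name and column names (participant_age, contact_type, contact_duration_min) are hypothetical placeholders; the actual schema should be checked against the dataset's files.

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical sketch: average contact duration per age group and contact type.
# Column names below are placeholders; rename them to match the real CSV schema.
df = pd.read_csv("hk_contacts.csv")
df["age_group"] = pd.cut(df["participant_age"], bins=[0, 17, 34, 54, 120],
                         labels=["0-17", "18-34", "35-54", "55+"])
summary = (df.groupby(["age_group", "contact_type"], observed=True)["contact_duration_min"]
             .mean()
             .unstack())
summary.plot(kind="bar")
plt.ylabel("Mean contact duration (minutes)")
plt.tight_layout()
plt.show()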
Investigating Social Networks:
The data collected through this dataset also allows for investigation into social networks – understanding who connects with whom, both in real-life interactions and through digital channels (if applicable). Focus on analyzing individual or family networks rather than larger groups in order to get a clearer picture without adding too much complexity to the analysis. Analyze commonalities among individuals within a network, even after controlling for factors that could affect interaction such as age or gender – utilize clustering techniques for this step if appropriate – then focus on comparing networks between individuals/families overall using graph-theory methods such as length distributions (the average number of relationships one has), degrees (the number of links connected from one individual or family unit), centrality measures (identifying individuals who serve an important role bridging two different parts of the network), etc. These methods will help provide insights into varying structures between large groups rather than focusing only on small-scale personal connections among friends/colleagues/relatives, which may not always offer accurate portrayals due to their naturally limited scope.
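A minimal NetworkX sketch of those graph-theory measures, assuming you have already derived an edge list of who contacted whom (the file name and format here are hypothetical):

import networkx as nx

# Hypothetical sketch: basic structural measures on a contact network.
# Assumes a whitespace-separated edge list "person_a person_b", one contact pair per line.
G = nx.read_edgelist("hk_contact_edges.txt")
degrees = dict(G.degree())                      # number of links per individual
betweenness = nx.betweenness_centrality(G)      # who bridges different parts of the network
clustering = nx.average_clustering(G)           # how tightly knit local neighbourhoods are
top_bridges = sorted(betweenness, key=betweenness.get, reverse=True)[:5]
print("average degree:", sum(degrees.values()) / len(degrees))
print("average clustering:", round(clustering, 3))
print("top bridging nodes:", top_bridges)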
Modeling Health Implications:
Finally, consider modeling health implications stemming from these observed patterns – particularly implications that may not be captured by simpler measures like count per contact hour (which does not differentiate based on intensity). Take into account aspects like viral transmission risk by analyzing secondary effects generated from the contact events captured in the data, such as physical proximity when multiple people meet up together over multiple days.
- Analyzing the age, gender and contact dynamics of different areas within Hong Kong to understand the local population trends and behavior.
- Investigating the structure of social networks to study how patterns of contact vary among socio-economic backgrounds...
Bluesky Social Dataset
Pollution of online social spaces caused by rampaging d/misinformation is a growing societal concern. However, recent decisions to reduce access to social media APIs are causing a shortage of publicly available, recent, social media data, thus hindering the advancement of computational social science as a whole. To address this pressing issue, we present a large, high-coverage dataset of social interactions and user-generated content from Bluesky Social.
The dataset contains the complete post history of over 4M users (81% of all registered accounts), totaling 235M posts. We also make available social data covering follow, comment, repost, and quote interactions.
Since Bluesky allows users to create and bookmark feed generators (i.e., content recommendation algorithms), we also release the full output of several popular algorithms available on the platform, along with their timestamped “like” interactions and time of bookmarking.
This dataset allows unprecedented analysis of online behavior and human-machine engagement patterns. Notably, it provides ground-truth data for studying the effects of content exposure and self-selection, and performing content virality and diffusion analysis.
Dataset
Here is a description of the dataset files.
followers.csv.gz. This compressed file contains the anonymized follower edge list. Once decompressed, each row consists of two comma-separated integers u, v, representing a directed following relation (i.e., user u follows user v).
posts.tar.gz. This compressed folder contains data on the individual posts collected. Decompressing this file results in 100 files, each containing the full posts of up to 50,000 users. Each post is stored as a JSON-formatted line.
interactions.csv.gz. This compressed file contains the anonymized interactions edge list. Once decompressed, each row consists of six comma-separated integers and represents a comment, repost, or quote interaction. These integers correspond to the following fields, in this order: user_id, replied_author, thread_root_author, reposted_author, quoted_author, and date.
graphs.tar.gz. This compressed folder contains edge list files for the graphs emerging from reposts, quotes, and replies. Each interaction is timestamped. The folder also contains timestamped higher-order interactions emerging from discussion threads, each containing all users participating in a thread.
feed_posts.tar.gz. This compressed folder contains posts that appear in 11 thematic feeds. Decompressing this folder results in 11 files containing posts from one feed each. Posts are stored as JSON-formatted lines. Fields correspond to those in posts.tar.gz, except for those related to sentiment analysis (sent_label, sent_score) and reposts (repost_from, reposted_author).
feed_bookmarks.csv. This file contains users who bookmarked any of the collected feeds. Each record contains three comma-separated values, namely the feed name, the user id, and the timestamp.
feed_post_likes.tar.gz. This compressed folder contains data on likes to posts appearing in the feeds, one file per feed. Each record in the files contains the following information, in this order: the id of the "liker", the id of the post's author, the id of the liked post, and the like timestamp.
scripts.tar.gz. A collection of Python scripts, including the ones originally used to crawl the data and to perform experiments. These scripts are detailed in a document released within the folder.
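A minimal sketch of turning followers.csv.gz into a directed NetworkX graph. The "two comma-separated integers per row" layout comes from the description above; the absence of a header line is an assumption to verify against the data.

import gzip
import networkx as nx

# Sketch: load the anonymized follower edge list into a directed graph.
# Assumes plain "u,v" rows with no header; adjust if the file differs.
G = nx.DiGraph()
with gzip.open("followers.csv.gz", "rt") as f:
    for line in f:
        u, v = line.strip().split(",")
        G.add_edge(int(u), int(v))  # u follows v
print(G.number_of_nodes(), "users,", G.number_of_edges(), "follow edges")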
Citation If used for research purposes, please cite the following paper describing the dataset details:
Andrea Failla and Giulio Rossetti. "I'm in the Bluesky Tonight": Insights from a Year Worth of Social Data. (2024) arXiv:2404.18984
Acknowledgments: This work is supported by:
the European Union – Horizon 2020 Program under the scheme “INFRAIA-01-2018-2019 – Integrating Activities for Advanced Communities”, Grant Agreement n.871042, “SoBigData++: European Integrated Infrastructure for Social Mining and Big Data Analytics” (http://www.sobigdata.eu); SoBigData.it which receives funding from the European Union – NextGenerationEU – National Recovery and Resilience Plan (Piano Nazionale di Ripresa e Resilienza, PNRR) – Project: “SoBigData.it – Strengthening the Italian RI for Social Mining and Big Data Analytics” – Prot. IR0000013 – Avviso n. 3264 del 28/12/2021; EU NextGenerationEU programme under the funding schemes PNRR-PE-AI FAIR (Future Artificial Intelligence Research).
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
https://snap.stanford.edu/data/com-Orkut.html
Dataset information
Orkut (http://www.orkut.com/) is a free online social network where users form friendships with each other. Orkut also allows users to form groups which other members can then join. We consider such user-defined groups as ground-truth communities. We provide the Orkut friendship social network and ground-truth communities. This data is provided by Alan Mislove et al. (http://socialnetworks.mpi-sws.org/data-imc2007.html)
We regard each connected component in a group as a separate ground-truth community. We remove the ground-truth communities which have less than 3 nodes. We also provide the top 5,000 communities with highest quality, which are described in our paper (http://arxiv.org/abs/1205.6233). As for the network, we provide the largest connected component.
Dataset statistics
Nodes 3,072,441
Edges 117,185,083
Nodes in largest WCC 3,072,441 (1.000)
Edges in largest WCC 117,185,083 (1.000)
Nodes in largest SCC 3,072,441 (1.000)
Edges in largest SCC 117,185,083 (1.000)
Average clustering coefficient 0.1666
Number of triangles 627,584,181
Fraction of closed triangles 0.01414
Diameter (longest shortest path) 9
90-percentile effective diameter 4.8
Source (citation)
J. Yang and J. Leskovec. Defining and Evaluating Network Communities based on Ground-truth. ICDM, 2012. http://arxiv.org/abs/1205.6233
Files
File Description
com-orkut.ungraph.txt.gz Undirected Orkut network
com-orkut.all.cmty.txt.gz Orkut communities
com-orkut.top5000.cmty.txt.gz Orkut communities (Top 5,000)
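A small NetworkX/Python sketch of reading these files, assuming the usual SNAP layout of '#'-prefixed comment lines followed by tab-separated node pairs, and one whitespace-separated member list per line in the community files; note the full graph is large and needs substantial memory.

import gzip
import networkx as nx

# Sketch: load the undirected Orkut network (SNAP edge-list format with '#' comments).
G = nx.read_edgelist("com-orkut.ungraph.txt.gz", comments="#", nodetype=int)
print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges")

# Sketch: read the top-5,000 communities, one member list per line.
communities = []
with gzip.open("com-orkut.top5000.cmty.txt.gz", "rt") as f:
    for line in f:
        communities.append([int(x) for x in line.split()])
print(len(communities), "communities, largest has", max(len(c) for c in communities), "members")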
The graph in the SNAP data set is 1-based, with nodes numbered 1 to 3,072,626.
In the SuiteSparse Matrix Collection, Problem.A is the undirected Orkut network, a matrix of size n-by-n with n=3,072,441, which is the number of unique user id's appearing in any edge. Problem.aux.nodeid is a list of the node id's that appear in the SNAP data set. A(i,j)=1 if person nodeid(i) is friends with person nodeid(j). The node id's are the same as the SNAP data set (1-based).
C = Problem.aux.Communities_all is a sparse matrix of size n by 15,301,901 which represents the same number of communities as in the com-orkut.all.cmty.txt file. The kth line in that file defines the kth community, and is the column C(:,k), where C(i,k)=1 if person nodeid(i) is in the kth community. Row C(i,:) and row/column i of the A matrix thus refer to the same person, nodeid(i).
Ctop = Problem.aux.Communities_to...
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This dataset is a collection of 12,478 social media comments found on the official Facebook pages of ten Philippine newspapers, The Philippine Daily Inquirer, Manila Bulletin, The Philippine Star, The Manila Times, Sunstar Cebu, Sunstar Davao, Cebu Daily News, The Freeman, MindaNews, and The Mindanao Times, spanning the years 2015, 2017 and 2019. The comments contain terms related to the Moro identity and the Mamasapano Clash, the Marawi Siege and the establishment of BARMM in the southern Philippines, allowing researchers to study semantic fields with regard to Muslims and the relationship between the texts and the source newspaper, their region of origin, and political administration, among other variables. All comments in the dataset were downloaded through Facebook's Graph API via Facepager (Jünger & Keyling, 2019).
One CSV file (MMB151719SOCMED_v2.csv) is provided, along with a codebook that contains descriptions of the variables and codes used in the CSV file, and a Readme document with a changelog.
Each social media comment is annotated with the following metadata:
object_id: identifier associated with the comment;
message: the textual string of the comment;
message_proc: the textual string of the comment after pre-processing;
lang_label: categorical value for the language of the comment (Tagalog (Filipino), Cebuano, English, Taglish, Bislog, Bislish, Trilingual or Other);
from_name: identifier of public pages (not profiles of individuals) leaving comments (NaN for profiles of individuals, 'NAME' for public pages besides the newspapers, otherwise, the page name of the newspaper);
created_time: string generated by the Facebook Graph API for the date and time the comment was posted;
month_year: categorical value in the form string+YY (e.g. Jun-15) of the month and year when the comment was posted;
year: numerical value in the form YY;
newspaper: categorical value for the newspaper Facebook page under which the comment was found;
corpus: categorical value for comments from the main corpus or the side (control) corpus;
administration: categorical value for political administration (pbsa = President Benigno Aquino III, prrd = President Rodrigo Roa Duterte);
count: numerical value referring to the number of string sequences without spaces;
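A minimal pandas sketch of working with these fields; the column names come from the list above, while details such as the CSV encoding are assumptions to check against the codebook.

import pandas as pd

# Sketch: load the comments and count them per newspaper, year, and language label.
# Column names follow the codebook fields listed above.
df = pd.read_csv("MMB151719SOCMED_v2.csv")
counts = (df.groupby(["newspaper", "year", "lang_label"])
            .size()
            .rename("n_comments")
            .reset_index())
print(counts.sort_values("n_comments", ascending=False).head(10))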
The dataset may only be used for non-commercial purposes and is licensed under the CC BY-NC-SA 4.0 DEED.
V2 - 05/06/2024
Corrections
Corrections made to region to include Luzon, Visayas and Mindanao (as opposed to Mindanao, non-Mindanao);
Corrections made to administration coding.
This dataset is described by:
Cruz, F. A. (2024). A Multilingual Collection of Facebook Comments on the Moro Identity and Armed Conflict in the Southern Philippines. Journal of Open Humanities Data, 10(1), 41. DOI: https://doi.org/10.5334/johd.219
Bibliography
Jünger, J., & Keyling, T. (2019). Facepager: An application for automated data retrieval on the web (4.5.3) [Computer software]. https://github.com/strohne/Facepager/
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
56.8% of the world’s total population is active on social media.
The number of Twitter users in the United States was forecast to continuously increase between 2024 and 2028 by a total of 4.3 million users (+5.32 percent). After the ninth consecutive increasing year, the Twitter user base is estimated to reach 85.08 million users and therefore a new peak in 2028. Notably, the number of Twitter users was continuously increasing over the past years. User figures, shown here regarding the platform Twitter, have been estimated by taking into account company filings or press material, secondary research, app downloads and traffic data. They refer to the average monthly active users over the period. The shown data are an excerpt of Statista's Key Market Indicators (KMI). The KMI are a collection of primary and secondary indicators on the macro-economic, demographic and technological environment in up to 150 countries and regions worldwide. All indicators are sourced from international and national statistical offices, trade associations and the trade press, and they are processed to generate comparable data sets (see supplementary notes under details for more information). Find more key insights for the number of Twitter users in countries like Canada and Mexico.
The number of Reddit users in the United States was forecast to continuously increase between 2024 and 2028 by a total of 10.3 million users (+5.21 percent). After the ninth consecutive increasing year, the Reddit user base is estimated to reach 208.12 million users and therefore a new peak in 2028. Notably, the number of Reddit users was continuously increasing over the past years. User figures, shown here with regard to the platform Reddit, have been estimated by taking into account company filings or press material, secondary research, app downloads and traffic data. They refer to the average monthly active users over the period and count multiple accounts by persons only once. Reddit users encompass both users that are logged in and those that are not. The shown data are an excerpt of Statista's Key Market Indicators (KMI). The KMI are a collection of primary and secondary indicators on the macro-economic, demographic and technological environment in up to 150 countries and regions worldwide. All indicators are sourced from international and national statistical offices, trade associations and the trade press, and they are processed to generate comparable data sets (see supplementary notes under details for more information). Find more key insights for the number of Reddit users in countries like Mexico and Canada.
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Crawl of the Flickr photo-sharing social network from May 2006, yielding a graph with 820,878 nodes and 9,837,214 edges. The dataset is distributed as an SMAT file, along with a README file containing code to read the file in Python and MATLAB.
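For reference, a hedged sketch of reading such a file into a SciPy sparse matrix, assuming the common SMAT convention of a "rows cols nnz" header line followed by zero-based "row col value" triplets; the reader shipped in the dataset's README is the authoritative one.

import numpy as np
from scipy.sparse import coo_matrix

# Hedged sketch of an SMAT reader; assumes a "rows cols nnz" header then 0-based triplets.
def read_smat(path):
    with open(path) as f:
        rows, cols, nnz = (int(x) for x in f.readline().split())
        data = np.atleast_2d(np.loadtxt(f))
    i, j, v = data[:, 0].astype(int), data[:, 1].astype(int), data[:, 2]
    assert len(v) == nnz
    return coo_matrix((v, (i, j)), shape=(rows, cols))

A = read_smat("flickr.smat")
print(A.shape, A.nnz)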
PATTERN is a node classification task generated with Stochastic Block Models, which are widely used to model communities in social networks by modulating the intra- and extra-community connections, thereby controlling the difficulty of the task. PATTERN tests the fundamental graph task of recognizing specific predetermined subgraphs.
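A minimal NetworkX sketch of the kind of Stochastic Block Model generation described here; the block sizes and probabilities below are arbitrary illustrative values, not the PATTERN benchmark settings.

import networkx as nx

# Sketch: generate a two-community Stochastic Block Model graph.
# sizes and probs are illustrative only; dense intra-community, sparse inter-community links.
sizes = [50, 50]
probs = [[0.30, 0.02],
         [0.02, 0.30]]
G = nx.stochastic_block_model(sizes, probs, seed=42)
print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges")
print("block of node 0:", G.nodes[0]["block"])  # community label assigned by the generator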
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Complete dataset of the road networks of all 29,850 USA cities as graphs in the shp format. The extracts follow the 2016 official USA city boundaries. Graphs are identified by their [city_code].shp; city codes are provided by the TIGER Census dataset. Graphs have been created by extracting all openstreetmap.org (osm) maps for each USA city, extracting the graph from the osm extract using the policosm Python GitHub library, and simplifying the graph by removing all degree-two nodes to retain only a workable transportation network. The original road length is retained as an attribute. Nodes include latitude and longitude attributes from the WGS84 projection. Edges include the length in meters (precision < 1 m) and the tag:highway value from osm. See policosm on GitHub for more information on the extraction algorithms.
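A hedged NetworkX sketch of the degree-two simplification step described above (merging the two edges of a pass-through node and summing their lengths); this illustrates the idea only and is not the policosm implementation.

import networkx as nx

# Sketch: remove degree-2 nodes from a road graph, merging their two incident edges
# and summing the "length" attribute, as in the simplification described above.
def simplify_degree_two(G):
    G = G.copy()
    changed = True
    while changed:
        changed = False
        for node in [n for n in G.nodes if G.degree(n) == 2]:
            if node not in G or G.degree(node) != 2:
                continue  # the graph changed since the snapshot was taken
            nbrs = list(G.neighbors(node))
            if len(nbrs) != 2 or G.has_edge(*nbrs):
                continue  # skip cases that would create self-loops or parallel edges
            u, v = nbrs
            length = G[u][node].get("length", 0) + G[node][v].get("length", 0)
            G.add_edge(u, v, length=length)
            G.remove_node(node)
            changed = True
    return G

# Toy usage: a path a-b-c-d collapses to a single a-d edge of length 30.
H = nx.Graph()
H.add_edge("a", "b", length=10)
H.add_edge("b", "c", length=10)
H.add_edge("c", "d", length=10)
print(list(simplify_degree_two(H).edges(data=True)))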
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Over 210 million people worldwide suffer from social media addiction.
Unlock the power of ready-to-use data sourced from developer communities and repositories with Developer Community and Code Datasets.
Data Sources:
GitHub: Access comprehensive data about GitHub repositories, developer profiles, contributions, issues, social interactions, and more.
StackShare: Receive information about companies, their technology stacks, reviews, tools, services, trends, and more.
DockerHub: Dive into data from container images, repositories, developer profiles, contributions, usage statistics, and more.
Developer Community and Code Datasets are a treasure trove of public data points gathered from tech communities and code repositories across the web.
With our datasets, you'll receive:
Choose from various output formats, storage options, and delivery frequencies:
Why choose our Datasets?
Fresh and accurate data: Access complete, clean, and structured data from scraping professionals, ensuring the highest quality.
Time and resource savings: Let us handle data extraction and processing cost-effectively, freeing your resources for strategic tasks.
Customized solutions: Share your unique data needs, and we'll tailor our data harvesting approach to fit your requirements perfectly.
Legal compliance: Partner with a trusted leader in ethical data collection. Oxylabs is trusted by Fortune 500 companies and adheres to GDPR and CCPA standards.
Pricing Options:
Standard Datasets: choose from various ready-to-use datasets with standardized data schemas, priced from $1,000/month.
Custom Datasets: Tailor datasets from any public web domain to your unique business needs. Contact our sales team for custom pricing.
Experience a seamless journey with Oxylabs:
Empower your data-driven decisions with Oxylabs Developer Community and Code Datasets!
Attribution-ShareAlike 3.0 (CC BY-SA 3.0)https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
This data is collected for the Network Analysis of Breaking Bad television series.
DataCamp's A Network Analysis of Game of Thrones was the inspiration for the project. Since no relationship dataset for Breaking Bad is available, we decided to generate a relationship dataset from episode summaries for the graph network analysis.
The data was collected using web scraping from the Fandom page of the Breaking Bad series.
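A hypothetical sketch of how such a relationship network can be derived from episode summaries, by linking characters whose names co-occur in the same summary; the character list and summaries below are placeholders, not the dataset's actual extraction code.

import itertools
import networkx as nx

# Hypothetical sketch: build a co-occurrence network from episode summaries.
# Characters and summaries are placeholders standing in for the scraped data.
characters = ["Walter", "Jesse", "Skyler", "Hank"]
summaries = [
    "Walter and Jesse cook together while Skyler grows suspicious.",
    "Hank investigates while Walter covers his tracks.",
]

G = nx.Graph()
for text in summaries:
    present = [c for c in characters if c in text]
    for a, b in itertools.combinations(present, 2):
        w = G.get_edge_data(a, b, default={"weight": 0})["weight"]
        G.add_edge(a, b, weight=w + 1)  # weight = number of shared summaries

print(list(G.edges(data=True)))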
This dataset contains metadata about all Covid-related YouTube videos which circulated on public social media, but which YouTube eventually removed because they contained false information. It describes 8,122 videos that were shared between November 2019 and June 2020. The dataset contains unique identifiers for the videos and social media accounts that shared the videos, statistics on social media engagement and metadata such as video titles and view counts where they were recoverable. We publish the data alongside the code used to produce it on GitHub. The dataset has reuse potential for research studying narratives related to the coronavirus, the impact of social media on knowledge about health and the politics of social media platforms.