https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Youtube social network and ground-truth communities Dataset information Youtube is a video-sharing web site that includes a social network. In the Youtube social network, users form friendship each other and users can create groups which other users can join. We consider such user-defined groups as ground-truth communities. This data is provided by Alan Mislove et al.
We regard each connected component in a group as a separate ground-truth community. We remove the ground-truth communities which have less than 3 nodes. We also provide the top 5,000 communities with highest quality which are described in our paper. As for the network, we provide the largest connected component.
more info : https://snap.stanford.edu/data/com-Youtube.html
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
https://snap.stanford.edu/data/com-Youtube.html
Dataset information
Youtube (http://www.youtube.com/) is a video-sharing web site that includes
a social network. In the Youtube social network, users form friendship each
other and users can create groups which other users can join. We consider
such user-defined groups as ground-truth communities. This data is provided
by Alan Mislove et al.
(http://socialnetworks.mpi-sws.org/data-imc2007.html)
We regard each connected component in a group as a separate ground-truth
community. We remove the ground-truth communities which have less than 3
nodes. We also provide the top 5,000 communities with highest quality
which are described in our paper (http://arxiv.org/abs/1205.6233). As for
the network, we provide the largest connected component.
Network statistics
Nodes 1,134,890
Edges 2,987,624
Nodes in largest WCC 1134890 (1.000)
Edges in largest WCC 2987624 (1.000)
Nodes in largest SCC 1134890 (1.000)
Edges in largest SCC 2987624 (1.000)
Average clustering coefficient 0.0808
Number of triangles 3056386
Fraction of closed triangles 0.002081
Diameter (longest shortest path) 20
90-percentile effective diameter 6.5
Community statistics
Number of communities 8,385
Average community size 13.50
Average membership size 0.10
Source (citation)
J. Yang and J. Leskovec. Defining and Evaluating Network Communities based
on Ground-truth. ICDM, 2012. http://arxiv.org/abs/1205.6233
Files
File Description
com-youtube.ungraph.txt.gz Undirected Youtube network
com-youtube.all.cmty.txt.gz Youtube communities
com-youtube.top5000.cmty.txt.gz Youtube communities (Top 5,000)
The graph in the SNAP data set is 1-based, with nodes numbered 1 to
1,157,827.
In the SuiteSparse Matrix Collection, Problem.A is the undirected Youtube
network, a matrix of size n-by-n with n=1,134,890, which is the number of
unique user id's appearing in any edge.
Problem.aux.nodeid is a list of the node id's that appear in the SNAP data
set. A(i,j)=1 if person nodeid(i) is friends with person nodeid(j). The
node id's are the same as the SNAP data set (1-based).
C = Problem.aux.Communities_all is a sparse matrix of size n by 16,386
which represents the communities in the com-youtube.all.cmty.txt file.
The kth line in that file defines the kth community, and is the column
C(:,k), where C(i,k)=1 if person ...
Friendster is an on-line gaming network. Before re-launching as a game website, Friendster was a social networking site where users can form friendship edge each other. Friendster social network also allows users form a group which other members can then join. The Friendster dataset consist of ground-truth communities (based on user-defined groups) and the social network from induced subgraph of the nodes that either belong to at least one community or are connected to other nodes that belong to at least one community.
Task
Fake news has become one of the main threats of our society. Although fake news is not a new phenomenon, the exponential growth of social media has offered an easy platform for their fast propagation. A great amount of fake news, and rumors are propagated in online social networks with the aim, usually, to deceive users and formulate specific opinions. Users play a critical role in the creation and propagation of fake news online by consuming and sharing articles with inaccurate information either intentionally or unintentionally. To this end, in this task, we aim at identifying possible fake news spreaders on social media as a first step towards preventing fake news from being propagated among online users.
After having addressed several aspects of author profiling in social media from 2013 to 2019 (bot detection, age and gender, also together with personality, gender and language variety, and gender from a multimodality perspective), this year we aim at investigating if it is possbile to discriminate authors that have shared some fake news in the past from those that, to the best of our knowledge, have never done it.
As in previous years, we propose the task from a multilingual perspective:
NOTE: Although we recommend to participate in both languages (English and Spanish), it is possible to address the problem just for one language.
Data
Input
The uncompressed dataset consists in a folder per language (en, es). Each folder contains:
The format of the XML files is:
The format of the truth.txt file is as follows. The first column corresponds to the author id. The second column contains the truth label.
b2d5748083d6fdffec6c2d68d4d4442d:::0 2bed15d46872169dc7deaf8d2b43a56:::0 8234ac5cca1aed3f9029277b2cb851b:::1 5ccd228e21485568016b4ee82deb0d28:::0 60d068f9cafb656431e62a6542de2dc0:::1 ...
Output
Your software must take as input the absolute path to an unpacked dataset, and has to output for each document of the dataset a corresponding XML file that looks like this:
The naming of the output files is up to you. However, we recommend to use the author-id as filename and "xml" as extension.
IMPORTANT! Languages should not be mixed. A folder should be created for each language and place inside only the files with the prediction for this language.
Evaluation
The performance of your system will be ranked by accuracy. For each language, we will calculate individual accuracies in discriminating between the two classes. Finally, we will average the accuracy values per language to obtain the final ranking.
Submission
Once you finished tuning your approach on the validation set, your software will be tested on the test set. During the competition, the test set will not be released publicly. Instead, we ask you to submit your software for evaluation at our site as described below.
We ask you to prepare your software so that it can be executed via command line calls. The command shall take as input (i) an absolute path to the directory of the test corpus and (ii) an absolute path to an empty output directory:
mySoftware -i INPUT-DIRECTORY -o OUTPUT-DIRECTORY
Within OUTPUT-DIRECTORY
, we require two subfolders: en
and es
, one folder per language, respectively. As the provided output directory is guaranteed to be empty, your software needs to create those subfolders. Within each of these subfolders, you need to create one xml file per author. The xml file looks like this:
The naming of the output files is up to you. However, we recommend to use the author-id as filename and "xml" as extension.
Note: By submitting your software you retain full copyrights. You agree to grant us usage rights only for the purpose of the PAN competition. We agree not to share your software with a third party or use it for other purposes than the PAN competition.
Related Work
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
https://snap.stanford.edu/data/com-Orkut.html
Dataset information
Orkut (http://www.orkut.com/) is a free on-line social network where users
form friendship each other. Orkut also allows users form a group which
other members can then join. We consider such user-defined groups as
ground-truth communities. We provide the Orkut friendship social network
and ground-truth communities. This data is provided by Alan Mislove et al.
(http://socialnetworks.mpi-sws.org/data-imc2007.html)
We regard each connected component in a group as a separate ground-truth
community. We remove the ground-truth communities which have less than 3
nodes. We also provide the top 5,000 communities with highest quality
which are described in our paper (http://arxiv.org/abs/1205.6233). As for
the network, we provide the largest connected component.
Dataset statistics
Nodes 3,072,441
Edges 117,185,083
Nodes in largest WCC 3072441 (1.000)
Edges in largest WCC 117185083 (1.000)
Nodes in largest SCC 3072441 (1.000)
Edges in largest SCC 117185083 (1.000)
Average clustering coefficient 0.1666
Number of triangles 627584181
Fraction of closed triangles 0.01414
Diameter (longest shortest path) 9
90-percentile effective diameter 4.8
Source (citation)
J. Yang and J. Leskovec. Defining and Evaluating Network Communities based
on Ground-truth. ICDM, 2012. http://arxiv.org/abs/1205.6233
Files
File Description
com-orkut.ungraph.txt.gz Undirected Orkut network
com-orkut.all.cmty.txt.gz Orkut communities
com-orkut.top5000.cmty.txt.gz Orkut communities (Top 5,000)
The graph in the SNAP data set is 1-based, with nodes numbered 1 to
3,072,626.
In the SuiteSparse Matrix Collection, Problem.A is the undirected
Orkut network, a matrix of size n-by-n with n=3,072,441, which is
the number of unique user id's appearing in any edge.
Problem.aux.nodeid is a list of the node id's that appear in the SNAP data
set. A(i,j)=1 if person nodeid(i) is friends with person nodeid(j). The
node id's are the same as the SNAP data set (1-based).
C = Problem.aux.Communities_all is a sparse matrix of size n by 15,301,901
which represents the same number communities in the com-orkut.all.cmty.txt
file. The kth line in that file defines the kth community, and is the
column C(:,k), where where C(i,k)=1 if person nodeid(i) is in the kth
community. Row C(i,:) and row/column i of the A matrix thus refer to the
same person, nodeid(i).
Ctop = Problem.aux.Communities_to...
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Pollution of online social spaces caused by rampaging d/misinformation is a growing societal concern. However, recent decisions to reduce access to social media APIs are causing a shortage of publicly available, recent, social media data, thus hindering the advancement of computational social science as a whole. We present a large, high-coverage dataset of social interactions and user-generated content from Bluesky Social to address this pressing issue. The dataset contains the complete post history of over 4M users (81% of all registered accounts), totalling 235M posts. We also make available social data covering follow, comment, repost, and quote interactions. Since Bluesky allows users to create and like feed generators (i.e., content recommendation algorithms), we also release the full output of several popular algorithms available on the platform, along with their timestamped “like” interactions. This dataset allows novel analysis of online behavior and human-machine engagement patterns. Notably, it provides ground-truth data for studying the effects of content exposure and self-selection and performing content virality and diffusion analysis.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
LiveJournal is a free on-line blogging community where users declare friendship each other. LiveJournal also allows users form a group which other members can then join. We consider such user-defined groups as ground-truth communities. We provide the LiveJournal friendship social network and ground-truth communities.
We regard each connected component in a group as a separate ground-truth community. We remove the ground-truth communities which have less than 3 nodes. We also provide the top 5,000 communities with highest quality which are described in our paper. As for the network, we provide the largest connected component.
Friendster is an on-line gaming network. Before re-launching as a game website, Friendster was a social networking site where users can form friendship edge each other. Friendster social network also allows users form a group which other members can then join. We consider such user-defined groups as ground-truth communities. For the social network, we take the induced subgraph of the nodes that either belong to at least one community or are connected to other nodes that belong to at least one community. This data is provided by The Web Archive Project, where the full graph is available.
We regard each connected component in a group as a separate ground-truth community. We remove the ground-truth communities which have less than 3 nodes. We also provide the top 5,000 communities with highest quality which are described in our paper. As for the network, we provide the largest connected component.
Orkut is a free on-line social network where users form friendship each other. Orkut also allows users form a group which other members can then join. We consider such user-defined groups as ground-truth communities. We provide the Orkut friendship social network and ground-truth communities. This data is provided by Alan Mislove et al.
We regard each connected component in a group as a separate ground-truth community. We remove the ground-truth communities which have less than 3 nodes. We also provide the top 5,000 communities with highest quality which are described in our paper. As for the network, we provide the largest connected component.
Youtube is a video-sharing web site that includes a social network. In the Youtube social network, users form friendship each other and users can create groups which other users can join. We consider such user-defined groups as ground-truth communities. This data is provided by Alan Mislove et al.
We regard each connected component in a group as a separate ground-truth community. We remove the ground-truth communities which have less than 3 nodes. We also provide the top 5,000 communities with highest quality which are described in our paper. As for the network, we provide the largest connected component.
The DBLP computer science bibliography provides a comprehensive list of research papers in computer science. We construct a co-authorship network where two authors are connected if they publish at least one paper together. Publication venue, e.g, journal or conference, defines an individual ground-truth community; authors who published to a certain journal or conference form a community.
We regard each connected component in a group as a separate ground-truth community. We remove the ground-truth communities which have less than 3 nodes. We also provide the top 5,000 communities with highest quality which are described in our paper. As for the network, we provide the largest connected component.
Network was collected by crawling Amazon website. It is based on Customers Who Bought This Item Also Bought feature of the Amazon website. If a product i is frequently co-purchased with product j, the graph contains an undirected edge from i to j. Each product category provided by Amazon defines each ground-truth community.
We regard each connected component in a product category as a separate ground-truth community. We remove the ground-truth communities which have less than 3 nodes. We also provide the top 5,000 communities with highest quality which are described in our paper. As for the network, we provide the largest connected component.
The network was generated using email data from a large European research institution. We have anonymized information about all incoming and outgoing email between members of the research institution. Th...
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This study investigates the features of fake news networks and how they spread during the 2020 South Korean election. Using Actor-Network Theory (ANT), we assessed the network's central players and how they are connected. Results reveal the characteristics of the videoclips and channel networks responsible for the propagation of fake news. Analysis of the videoclip network reveals a high number of detected fake news videos and a high density of connections among users. Assessment of news videoclips on both actual and fake news networks reveals that the real news network is more concentrated. However, the scale of the network may play a role in these variations. Statistics for network centralization reveal that users are spread out over the network, pointing to its decentralized character. A closer look at the real and fake news networks inside videos and channels reveals similar trends. We find that the density of the real news videoclip network is higher than that of the fake news network, whereas the fake news channel networks are denser than their real news counterparts, which may indicate greater activity and interconnectedness in their transmission. We also found that fake news videoclips had more likes than real news videoclips, whereas real news videoclips had more dislikes than fake news videoclips. These findings strongly suggest that fake news videoclips are more accepted when people watch them on YouTube. In addition, we used semantic networks and automated content analysis to uncover common language patterns in fake news which helps us better understand the structure and dynamics of the networks involved in the dissemination of fake news. The findings reported here provide important insights on how fake news spread via social networks during the South Korean election of 2020. The results of this study have important implications for the campaign against fake news and ensuring factual coverage.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A Twitter follower-followee graph with 269,640 nodes and 6,818,501 edges from [Kwak], and we obtain the ground truth labels from [SybilSCAR]. Among them 178377 are benign and 91263 are Sybil. We divide 9000 Sybil and 17000 benign users (about 10%) from them as the training set and test on the overall social graph.
H. Kwak, C. Lee, H. Park, and S. Moon, “What is twitter, a social network or a news media?” in WWW, 2010 B. Wang, L. Zhang, and N. Z. Gong, “SybilSCAR: Sybil detection in online social networks via local rule based propagation,” in IEEE INFOCOM, 2017.
Not seeing a result you expected?
Learn how you can add new datasets to our index.
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Youtube social network and ground-truth communities Dataset information Youtube is a video-sharing web site that includes a social network. In the Youtube social network, users form friendship each other and users can create groups which other users can join. We consider such user-defined groups as ground-truth communities. This data is provided by Alan Mislove et al.
We regard each connected component in a group as a separate ground-truth community. We remove the ground-truth communities which have less than 3 nodes. We also provide the top 5,000 communities with highest quality which are described in our paper. As for the network, we provide the largest connected component.
more info : https://snap.stanford.edu/data/com-Youtube.html