16 datasets found
  1. Truth Social Dataset

    • zenodo.org
    zip
    Updated Jan 13, 2023
    Cite
    Patrick Gerard; Nicholas Botzer; Tim Weninger (2023). Truth Social Dataset [Dataset]. http://doi.org/10.5281/zenodo.7531625
    Available download formats: zip
    Dataset updated
    Jan 13, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Patrick Gerard; Nicholas Botzer; Tim Weninger
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A Truth Social data set containing a network of users, their associated posts, and additional information about each post. Collected from February 2022 through September 2022, this dataset contains 454,458 user entries and 845,060 Truth (Truth Social’s term for post) entries.

    Comprised of 12 different files, the entry count for each file is shown below.

    File                            Data Points
    users.tsv                       454,458
    follows.tsv                     4,002,115
    truths.tsv                      823,927
    quotes.tsv                      10,508
    replies.tsv                     506,276
    media.tsv                       184,884
    hashtags.tsv                    21,599
    external_urls.tsv               173,947
    truth_hashtag_edges.tsv         213,295
    truth_media_edges.tsv           257,500
    truth_external_url_edges.tsv    252,877
    truth_user_tag_edges.tsv        145,234

    A readme file is provided that describes the structure of the files, necessary terms, and necessary information about the data collection.
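
    The per-file entry counts listed above can be checked after download. The following is a minimal sketch, assuming each file is tab-separated with exactly one header row and one record per line; the listing does not confirm either point (the readme should), and the function names are hypothetical.

```python
import csv
from pathlib import Path

# Entry counts as listed in the file table of the dataset description.
EXPECTED_ROWS = {
    "users.tsv": 454_458,
    "follows.tsv": 4_002_115,
    "truths.tsv": 823_927,
    "quotes.tsv": 10_508,
    "replies.tsv": 506_276,
    "media.tsv": 184_884,
    "hashtags.tsv": 21_599,
    "external_urls.tsv": 173_947,
    "truth_hashtag_edges.tsv": 213_295,
    "truth_media_edges.tsv": 257_500,
    "truth_external_url_edges.tsv": 252_877,
    "truth_user_tag_edges.tsv": 145_234,
}


def count_data_rows(path, has_header=True):
    """Count records in a tab-separated file, optionally skipping one header row."""
    with open(path, newline="", encoding="utf-8") as handle:
        # QUOTE_NONE is an assumption: fields are taken literally, one record per line.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        total = sum(1 for _ in reader)
    return max(total - 1, 0) if has_header else total


def check_counts(data_dir, expected=EXPECTED_ROWS, has_header=True):
    """Return {filename: (found, expected)} for present files whose counts differ."""
    mismatches = {}
    for name, want in expected.items():
        path = Path(data_dir) / name
        if not path.exists():
            continue  # files that were not downloaded are skipped, not reported
        found = count_data_rows(path, has_header)
        if found != want:
            mismatches[name] = (found, want)
    return mismatches
```

    An empty result from check_counts means every file that was found matches the listed count under these assumptions.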

  2. TrumpsTruthSocialPosts

    • huggingface.co
    Updated Oct 24, 2025
    Cite
    notmooodoo9 (2025). TrumpsTruthSocialPosts [Dataset]. https://huggingface.co/datasets/notmooodoo9/TrumpsTruthSocialPosts
    Dataset updated
    Oct 24, 2025
    Authors
    notmooodoo9
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Truth Social Dataset (Limited Version)

    License: CC BY 4.0

      Overview
    

    This dataset contains posts and comments scraped from Truth Social, focusing on Donald Trump’s posts (“Truths” and “Retruths”). Due to the initial collection method, all media and URLs were excluded. Future versions will include complete post data, including images and links. It contains 31.8 million comments and over 18,000 posts, all by Trump, and logs over 1.5 million unique users who commented on… See the full description on the dataset page: https://huggingface.co/datasets/notmooodoo9/TrumpsTruthSocialPosts.

  3. Truth Social Price Prediction Data

    • coinbase.com
    Updated Nov 13, 2025
    Cite
    (2025). Truth Social Price Prediction Data [Dataset]. https://www.coinbase.com/price-prediction/base-truth-social-4a61
    Dataset updated
    Nov 13, 2025
    Variables measured
    Growth Rate, Predicted Price
    Measurement technique
    User-defined projections based on compound growth. This is not a formal financial forecast.
    Description

    This dataset contains the predicted prices of the asset Truth Social over the next 16 years. This data is calculated initially using a default 5 percent annual growth rate, and after page load, it features a sliding scale component where the user can then further adjust the growth rate to their own positive or negative projections. The maximum positive adjustable growth rate is 100 percent, and the minimum adjustable growth rate is -100 percent.
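
    The projection described above can be illustrated with a short sketch. It assumes the standard compound-growth formula, price after n years = current price × (1 + rate)^n; the page says only "compound growth", so the exact formula, the function name, and the example starting price are assumptions.

```python
def project_prices(current_price, annual_growth_rate=0.05, years=16):
    """Project a price forward year by year using compound growth.

    annual_growth_rate is a fraction (0.05 means 5 percent) and is limited to
    the adjustable range described on the page: -100 percent to +100 percent.
    """
    if not -1.0 <= annual_growth_rate <= 1.0:
        raise ValueError("growth rate must lie between -1.0 and 1.0")
    return [
        current_price * (1.0 + annual_growth_rate) ** year
        for year in range(1, years + 1)
    ]


# Hypothetical starting price of 100 units, default 5 percent growth.
projection = project_prices(100.0)
```

    With the default arguments the list holds 16 values, one per projected year.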

  4. YouTube Social Network with Communities (SNAP)

    • kaggle.com
    zip
    Updated Dec 16, 2021
    Cite
    Subhajit Sahu (2021). YouTube Social Network with Communities (SNAP) [Dataset]. https://www.kaggle.com/datasets/wolfram77/graphs-snap-com-youtube
    Available download formats: zip (13777811 bytes)
    Dataset updated
    Dec 16, 2021
    Authors
    Subhajit Sahu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    YouTube
    Description

    Youtube social network and ground-truth communities

    https://snap.stanford.edu/data/com-Youtube.html

    Dataset information

    Youtube (http://www.youtube.com/) is a video-sharing web site that includes a social network. In the Youtube social network, users form friendships with each other, and users can create groups which other users can join. We consider
    such user-defined groups as ground-truth communities. This data is provided by Alan Mislove et al.
    (http://socialnetworks.mpi-sws.org/data-imc2007.html)

    We regard each connected component in a group as a separate ground-truth
    community. We remove the ground-truth communities which have less than 3
    nodes. We also provide the top 5,000 communities with highest quality
    which are described in our paper (http://arxiv.org/abs/1205.6233). As for
    the network, we provide the largest connected component.
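
    The community construction described above (one community per connected component of a group, dropping components with fewer than 3 nodes) can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: connectivity is taken over friendship edges whose endpoints are both group members, and the function name is hypothetical.

```python
from collections import defaultdict, deque


def split_group_into_communities(group_members, edges, min_size=3):
    """Split one user-defined group into connected components of size >= min_size."""
    members = set(group_members)
    adjacency = defaultdict(set)
    for u, v in edges:
        # Only edges with both endpoints inside the group connect its members.
        if u in members and v in members:
            adjacency[u].add(v)
            adjacency[v].add(u)

    seen = set()
    communities = []
    for start in sorted(members):
        if start in seen:
            continue
        # Breadth-first search collects the component containing `start`.
        component = []
        queue = deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            component.append(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        if len(component) >= min_size:
            communities.append(sorted(component))
    return communities
```

    For a group {1, 2, 3, 4, 5, 6} with friendships 1-2, 2-3 and 4-5, only the component [1, 2, 3] survives the size filter.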

    Network statistics
    Nodes 1,134,890
    Edges 2,987,624
    Nodes in largest WCC 1134890 (1.000)
    Edges in largest WCC 2987624 (1.000)
    Nodes in largest SCC 1134890 (1.000)
    Edges in largest SCC 2987624 (1.000)
    Average clustering coefficient 0.0808
    Number of triangles 3056386
    Fraction of closed triangles 0.002081
    Diameter (longest shortest path) 20
    90-percentile effective diameter 6.5
    Community statistics
    Number of communities 8,385
    Average community size 13.50
    Average membership size 0.10
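
    Two of the network statistics above, the number of triangles and the average clustering coefficient, can be computed on a small graph as in the sketch below. It assumes the usual definitions (local clustering coefficient averaged over all nodes, with nodes of degree below 2 contributing 0); whether SNAP uses exactly these conventions is an assumption, and this brute-force version is not meant for graphs of this size.

```python
from itertools import combinations


def triangle_stats(edges):
    """Return (number of triangles, average local clustering coefficient)."""
    adjacency = {}
    for u, v in edges:
        if u == v:
            continue  # self-loops are ignored
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)

    closed_pairs_total = 0
    coefficients = []
    for node, neighbors in adjacency.items():
        degree = len(neighbors)
        # Count neighbor pairs that are themselves connected (triangles through node).
        links = sum(1 for a, b in combinations(neighbors, 2) if b in adjacency[a])
        closed_pairs_total += links
        if degree > 1:
            coefficients.append(2.0 * links / (degree * (degree - 1)))
        else:
            coefficients.append(0.0)

    triangles = closed_pairs_total // 3  # each triangle is seen once per corner
    average = sum(coefficients) / len(coefficients) if coefficients else 0.0
    return triangles, average
```

    On a triangle 1-2-3 with one extra edge 3-4, this gives one triangle and an average coefficient of (1 + 1 + 1/3 + 0) / 4.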

    Source (citation)
    J. Yang and J. Leskovec. Defining and Evaluating Network Communities based on Ground-truth. ICDM, 2012. http://arxiv.org/abs/1205.6233

    Files
    File Description
    com-youtube.ungraph.txt.gz Undirected Youtube network
    com-youtube.all.cmty.txt.gz Youtube communities
    com-youtube.top5000.cmty.txt.gz Youtube communities (Top 5,000)

    Notes on inclusion into the SuiteSparse Matrix Collection, July 2018:

    The graph in the SNAP data set is 1-based, with nodes numbered 1 to
    1,157,827.

    In the SuiteSparse Matrix Collection, Problem.A is the undirected Youtube
    network, a matrix of size n-by-n with n=1,134,890, which is the number of
    unique user id's appearing in any edge.

    Problem.aux.nodeid is a list of the node id's that appear in the SNAP data set. A(i,j)=1 if person nodeid(i) is friends with person nodeid(j). The
    node id's are the same as the SNAP data set (1-based).

    C = Problem.aux.Communities_all is a sparse matrix of size n by 16,386
    which represents the communities in the com-youtube.all.cmty.txt file.
    The kth line in that file defines the kth community, and is the column
    C(:,k), where C(i,k)=1 if person ...

  5. Data from: Youtube social network

    • kaggle.com
    zip
    Updated Sep 1, 2019
    Cite
    Lorenzo De Tomasi (2019). Youtube social network [Dataset]. https://www.kaggle.com/datasets/lodetomasi1995/youtube-social-network/code
    Available download formats: zip (10604317 bytes)
    Dataset updated
    Sep 1, 2019
    Authors
    Lorenzo De Tomasi
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    YouTube
    Description

    Youtube social network and ground-truth communities. Dataset information: Youtube is a video-sharing web site that includes a social network. In the Youtube social network, users form friendships with each other, and users can create groups which other users can join. We consider such user-defined groups as ground-truth communities. This data is provided by Alan Mislove et al.

    We regard each connected component in a group as a separate ground-truth community. We remove the ground-truth communities which have less than 3 nodes. We also provide the top 5,000 communities with highest quality which are described in our paper. As for the network, we provide the largest connected component.

    more info : https://snap.stanford.edu/data/com-Youtube.html

  6. USER IDENTITY LINKAGE DATASET

    • smu.edu.sg
    Updated Feb 14, 2018
    Cite
    Living Analytics Research Centre (2018). USER IDENTITY LINKAGE DATASET [Dataset]. https://www.smu.edu.sg/sites/default/files/archives/larc/larc.smu.edu.sg/user-identity-linkage-dataset.html
    Dataset updated
    Feb 14, 2018
    Dataset authored and provided by
    Living Analytics Research Centre
    Description

    This dataset is crawled from three popular on-line social networks (OSNs), namely, Twitter, Facebook and Foursquare. We collected this dataset as follows. We first gathered a set of Singapore-based Twitter users who declared Singapore as their location in their user profiles. From these users, we retrieved the subset who declared their Facebook or Foursquare accounts in their short bio description. In total, we collected 1,998 Twitter-Facebook user identity pairs (known as TW-FB ground truth matching pairs) and 3,602 Twitter-Foursquare user identity pairs (known as TW-FQ ground truth matching pairs). To simulate a real-world setting, where a user identity in the source OSN may not have its corresponding matching user identity in the target OSN, we expanded the datasets by adding Twitter, Facebook and Foursquare users who are connected to users in the TW-FB ground truth matching pairs and TW-FQ ground truth matching pairs sets. Note that isolated users who do not have links to other users are removed from the data sets. After collecting the datasets, we extracted the following user features using the OSNs' APIs.

    • Username: The username of the account.
    • Screen name: The natural name of the user account. It is usually formed using the first and last name of the user.
    • Profile Image: The thumbnail or image provided by the user to visually present herself.
    • Network: The relationship links between users.

  7. LiveJournal Social Network with Communities (SNAP)

    • kaggle.com
    zip
    Updated Dec 16, 2021
    Cite
    Subhajit Sahu (2021). LiveJournal Social Network with Communities (SNAP) [Dataset]. https://www.kaggle.com/datasets/wolfram77/graphs-snap-com-livejournal
    Available download formats: zip (162104147 bytes)
    Dataset updated
    Dec 16, 2021
    Authors
    Subhajit Sahu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    LiveJournal social network and ground-truth communities

    https://snap.stanford.edu/data/com-LiveJournal.html

    Dataset information

    LiveJournal (http://www.livejournal.com/) is a free on-line blogging
    community where users declare friendship with each other. LiveJournal also
    allows users to form a group which other members can then join. We consider
    such user-defined groups as ground-truth communities. We provide the
    LiveJournal friendship social network and ground-truth communities.

    We regard each connected component in a group as a separate ground-truth
    community. We remove the ground-truth communities which have less than 3
    nodes. We also provide the top 5,000 communities with highest quality
    which are described in our paper (http://arxiv.org/abs/1205.6233). As for
    the network, we provide the largest connected component.

    Dataset statistics
    Nodes 3,997,962
    Edges 34,681,189
    Nodes in largest WCC 3997962 (1.000)
    Edges in largest WCC 34681189 (1.000)
    Nodes in largest SCC 3997962 (1.000)
    Edges in largest SCC 34681189 (1.000)
    Average clustering coefficient 0.2843
    Number of triangles 177820130
    Fraction of closed triangles 0.04559
    Diameter (longest shortest path) 17
    90-percentile effective diameter 6.5

    Source (citation)
    J. Yang and J. Leskovec. Defining and Evaluating Network Communities based on Ground-truth. ICDM, 2012. http://arxiv.org/abs/1205.6233

    Files
    File Description
    com-lj.ungraph.txt.gz Undirected LiveJournal network
    com-lj.all.cmty.txt.gz LiveJournal communities
    com-lj.top5000.cmty.txt.gz LiveJournal communities (Top 5,000)

    Notes on inclusion into the SuiteSparse Matrix Collection, July 2018:

    The graph in the SNAP data set is 0-based, with nodes numbering 0 to
    4,036,537.

    In the SuiteSparse Matrix Collection, Problem.A is the undirected
    LiveJournal network, a matrix of size n-by-n with n=3,997,962, which is
    the number of unique user id's appearing in any edge.

    Problem.aux.nodeid is a list of the node id's that appear in the SNAP data set. A(i,j)=1 if person nodeid(i) is friends with person nodeid(j). The
    node id's are the same as the SNAP data set (0-based).

    C = Problem.aux.Communities_all is a sparse matrix of size n by 664,414
    which represents the communities in the com-lj.all.cmty.txt file. The kth line in that file defines the kth community, and is the column C(:,k),
    where C(i,k)=1 if person nodeid(i) is in the kth community. Row C(i,:)
    and row/column i of the A matrix thus refer to the same person, nodeid(i).

    Ctop = Problem.aux.Communities_top5000 is n-by-5000, with the same
    structure as the C array above, with the content of the
    com-lj.top5000.cmty.txt file.
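
    The relationship between the SNAP node ids, Problem.aux.nodeid, and the community matrix C described above can be sketched in a few lines. This is an illustration only: it uses 0-based row and column indices (the description uses MATLAB-style 1-based indexing), it assumes nodeid is the sorted list of ids appearing in any edge, and the function names are hypothetical.

```python
def build_index(edges):
    """Return (nodeid, index): the ids appearing in any edge, and id -> row number."""
    nodeid = sorted({node for edge in edges for node in edge})
    index = {node: row for row, node in enumerate(nodeid)}
    return nodeid, index


def community_entries(community_lines, index):
    """Return the set of (i, k) pairs for which C(i, k) = 1.

    The kth line of the community file lists the SNAP ids of the kth community;
    ids that never appear in an edge have no row and are skipped.
    """
    entries = set()
    for k, line in enumerate(community_lines):
        for token in line.split():
            node = int(token)
            if node in index:
                entries.add((index[node], k))
    return entries


# Toy example: three users with non-contiguous SNAP ids.
nodeid, index = build_index([(0, 5), (5, 9)])
entries = community_entries(["0 9", "5"], index)
```

    Here nodeid is [0, 5, 9], so SNAP id 9 maps to row 2, and the first community occupies rows 0 and 2 of column 0.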

  8. Collected feed statistics.

    • plos.figshare.com
    xls
    Updated Nov 5, 2024
    Cite
    Andrea Failla; Giulio Rossetti (2024). Collected feed statistics. [Dataset]. http://doi.org/10.1371/journal.pone.0310330.t003
    Available download formats: xls
    Dataset updated
    Nov 5, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Andrea Failla; Giulio Rossetti
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Pollution of online social spaces caused by rampaging d/misinformation is a growing societal concern. However, recent decisions to reduce access to social media APIs are causing a shortage of publicly available, recent, social media data, thus hindering the advancement of computational social science as a whole. We present a large, high-coverage dataset of social interactions and user-generated content from Bluesky Social to address this pressing issue. The dataset contains the complete post history of over 4M users (81% of all registered accounts), totalling 235M posts. We also make available social data covering follow, comment, repost, and quote interactions. Since Bluesky allows users to create and like feed generators (i.e., content recommendation algorithms), we also release the full output of several popular algorithms available on the platform, along with their timestamped “like” interactions. This dataset allows novel analysis of online behavior and human-machine engagement patterns. Notably, it provides ground-truth data for studying the effects of content exposure and self-selection and performing content virality and diffusion analysis.

  9. Truth Seeker Dataset 2023 (TruthSeeker2023)

    • kaggle.com
    zip
    Updated Oct 14, 2024
    Cite
    Pauline Peps (2024). Truth Seeker Dataset 2023 (TruthSeeker2023) [Dataset]. https://www.kaggle.com/datasets/paulinepeps/truth-seeker-dataset-2023-truthseeker2023
    Available download formats: zip (51867979 bytes)
    Dataset updated
    Oct 14, 2024
    Authors
    Pauline Peps
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    From the description: https://www.unb.ca/cic/datasets/truthseeker-2023.html

    CIC truth seeker dataset 2023 (TruthSeeker2023)

    This project aims to create the largest ground truth fake news analysis dataset for real and fake news content in relation to social media posts. The major contributions of the TruthSeeker dataset to the current fake news dataset landscape are listed below:

    One of the most extensive benchmark datasets with more than 180,000 labelled Tweets.

    Three-factor active learning verification method which involved utilising 456 unique, highly skilled Amazon Mechanical Turkers for labelling each Tweet. To understand patterns and characteristics of Twitter users, three auxiliary social media scores are also introduced: bot, credibility, and influence scores.

    Conducted comprehensive analyses and evaluations on the TruthSeeker dataset, including the establishment of deep learning-based detection models, clustering-based event detection, and exploration of the relationship between tweet labels and the characteristics of online creators/spreaders.

    The application of multiple BERT-based models to assess the accuracy of real/fake tweet detection.

    The data for the Truth Seeker and Basic ML dataset were generated through the crawling of tweets related to Real and Fake news from the Politifact Dataset. Taking these ground truth values and crawling for tweets related to these topics (by manually generating keywords associated with the news in question to input into the twitter API), we were able to extract over 186,000 (before final processing) tweets related to 700 real and 700 fake pieces of news.

    Taking this raw tweet data, we then used crowdsourcing in the form of Amazon Mechanical Turk to generate a majority answer to how closely the tweet agrees with the Real/Fake news source statement. After, a majority agreement algorithm is employed to designate a validity to the associated tweets in both a 3 and 5 category classification column.
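
    The description does not specify the majority agreement algorithm, so the following is only a generic sketch of majority voting over crowd-worker answers. The label strings, the strict-majority threshold, and the function name are all assumptions made for illustration.

```python
from collections import Counter


def majority_label(votes, min_share=0.5):
    """Return the label chosen by more than min_share of the votes, else None."""
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    # A strict majority is required; ties and split votes yield no label.
    return label if count / len(votes) > min_share else None


# Hypothetical worker answers for one tweet.
answer = majority_label(["agree", "agree", "disagree"])
```

    Returning None for split votes leaves room for a separate tie-handling rule, which the source does not describe.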

    This results in one of the largest ground truth datasets for fake news detection on Twitter ever created: the TruthSeeker dataset. We also generated a dataset of features from the tweet itself and the metadata of the user who posted it, giving users the option of using both deep learning models and classical machine learning techniques.

    Feature dataset: here

    From: The Largest Social Media Ground-Truth Dataset for Real/Fake Content: TruthSeeker By: Sajjad Dadkhah; Xichen Zhang; Alexander Gerald Weismann; Amir Firouzi; Ali A. Ghorbani DOI: 10.1109/TCSS.2023.3322303

  10. Data from: Census of Twitter Users: Scraping and Describing the National...

    • dataverse.harvard.edu
    • dataone.org
    Updated Aug 9, 2022
    Cite
    Lu Guan (2022). Census of Twitter Users: Scraping and Describing the National Network of South Korea [Dataset]. http://doi.org/10.7910/DVN/9GRCYU
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Aug 9, 2022
    Dataset provided by
    Harvard Dataverse
    Authors
    Lu Guan
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    South Korea
    Description

    Population-level national networks on social media are precious and essential for network science and behavioural science. This study proposes a data collection strategy for scraping online social networks at the population level, thereby serving as a “ground truth” for the validation of both ego-centric and socio-centric data collection approaches. We proposed a set of validation approaches to evaluate the validity of our approach. Finally, we re-examined classical network and communication propositions (e.g., 80/20 rule, six degrees of separation) on the national network. Our proposed strategy would greatly enrich the pool of population-level social network data and further the development of network analysis research in the digital media environment.

  11. Orkut Social Network and Communities (SNAP)

    • kaggle.com
    zip
    Updated Dec 16, 2021
    Cite
    Subhajit Sahu (2021). Orkut Social Network and Communities (SNAP) [Dataset]. https://www.kaggle.com/datasets/wolfram77/graphs-snap-com-orkut/discussion
    Available download formats: zip (925908495 bytes)
    Dataset updated
    Dec 16, 2021
    Authors
    Subhajit Sahu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Orkut social network and ground-truth communities

    https://snap.stanford.edu/data/com-Orkut.html

    Dataset information

    Orkut (http://www.orkut.com/) is a free on-line social network where users form friendships with each other. Orkut also allows users to form a group which
    other members can then join. We consider such user-defined groups as
    ground-truth communities. We provide the Orkut friendship social network
    and ground-truth communities. This data is provided by Alan Mislove et al. (http://socialnetworks.mpi-sws.org/data-imc2007.html)

    We regard each connected component in a group as a separate ground-truth
    community. We remove the ground-truth communities which have less than 3
    nodes. We also provide the top 5,000 communities with highest quality
    which are described in our paper (http://arxiv.org/abs/1205.6233). As for
    the network, we provide the largest connected component.

    Dataset statistics
    Nodes 3,072,441
    Edges 117,185,083
    Nodes in largest WCC 3072441 (1.000)
    Edges in largest WCC 117185083 (1.000)
    Nodes in largest SCC 3072441 (1.000)
    Edges in largest SCC 117185083 (1.000)
    Average clustering coefficient 0.1666
    Number of triangles 627584181
    Fraction of closed triangles 0.01414
    Diameter (longest shortest path) 9
    90-percentile effective diameter 4.8

    Source (citation)
    J. Yang and J. Leskovec. Defining and Evaluating Network Communities based on Ground-truth. ICDM, 2012. http://arxiv.org/abs/1205.6233

    Files
    File Description
    com-orkut.ungraph.txt.gz Undirected Orkut network
    com-orkut.all.cmty.txt.gz Orkut communities
    com-orkut.top5000.cmty.txt.gz Orkut communities (Top 5,000)

    Notes on inclusion into the SuiteSparse Matrix Collection, July 2018:

    The graph in the SNAP data set is 1-based, with nodes numbered 1 to
    3,072,626.

    In the SuiteSparse Matrix Collection, Problem.A is the undirected
    Orkut network, a matrix of size n-by-n with n=3,072,441, which is
    the number of unique user id's appearing in any edge.

    Problem.aux.nodeid is a list of the node id's that appear in the SNAP data set. A(i,j)=1 if person nodeid(i) is friends with person nodeid(j). The
    node id's are the same as the SNAP data set (1-based).

    C = Problem.aux.Communities_all is a sparse matrix of size n by 15,301,901, which represents the same number of communities in the com-orkut.all.cmty.txt file. The kth line in that file defines the kth community, and is the
    column C(:,k), where C(i,k)=1 if person nodeid(i) is in the kth
    community. Row C(i,:) and row/column i of the A matrix thus refer to the
    same person, nodeid(i).

    Ctop = Problem.aux.Communities_to...

  12. Profiling Fake News Spreaders on Twitter

    • nde-dev.biothings.io
    Updated Sep 22, 2020
    Cite
    FRANCISCO RANGEL (2020). Profiling Fake News Spreaders on Twitter [Dataset]. https://nde-dev.biothings.io/resources?id=zenodo_3692318
    Dataset updated
    Sep 22, 2020
    Dataset provided by
    FRANCISCO RANGEL
    BILAL GHANEM
    PAOLO ROSSO
    ANASTASIA GIACHANOU
    Description

    Task

    Fake news has become one of the main threats of our society. Although fake news is not a new phenomenon, the exponential growth of social media has offered an easy platform for their fast propagation. A great amount of fake news, and rumors are propagated in online social networks with the aim, usually, to deceive users and formulate specific opinions. Users play a critical role in the creation and propagation of fake news online by consuming and sharing articles with inaccurate information either intentionally or unintentionally. To this end, in this task, we aim at identifying possible fake news spreaders on social media as a first step towards preventing fake news from being propagated among online users.

    After having addressed several aspects of author profiling in social media from 2013 to 2019 (bot detection, age and gender, also together with personality, gender and language variety, and gender from a multimodality perspective), this year we aim at investigating whether it is possible to discriminate authors that have shared some fake news in the past from those that, to the best of our knowledge, have never done it.

    As in previous years, we propose the task from a multilingual perspective:

    English

    Spanish

    NOTE: Although we recommend participating in both languages (English and Spanish), it is possible to address the problem for just one language.

    Data

    Input

    The uncompressed dataset consists of one folder per language (en, es). Each folder contains:

    An XML file per author (Twitter user) with 100 tweets. The name of the XML file corresponds to the unique author id.

    A truth.txt file with the list of authors and the ground truth.

    The format of the XML files is:

        Tweet 1 textual contents
        Tweet 2 textual contents
        ...
    

    The format of the truth.txt file is as follows. The first column corresponds to the author id. The second column contains the truth label.

    b2d5748083d6fdffec6c2d68d4d4442d:::0
    2bed15d46872169dc7deaf8d2b43a56:::0
    8234ac5cca1aed3f9029277b2cb851b:::1
    5ccd228e21485568016b4ee82deb0d28:::0
    60d068f9cafb656431e62a6542de2dc0:::1
    ...
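
    The truth.txt format shown above (author id, the separator ":::", then the label) can be read with a few lines of Python. This is a minimal sketch; the function name is hypothetical and the labels are assumed to be integers, as in the sample.

```python
def parse_truth_lines(lines):
    """Map each author id to its integer truth label from 'id:::label' lines."""
    labels = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # tolerate blank lines
        author_id, separator, label = line.partition(":::")
        if not separator:
            raise ValueError(f"missing ':::' separator in line: {line!r}")
        labels[author_id] = int(label)
    return labels


sample = [
    "b2d5748083d6fdffec6c2d68d4d4442d:::0",
    "60d068f9cafb656431e62a6542de2dc0:::1",
]
truth = parse_truth_lines(sample)
```

    To read a real file, pass an open file handle, since iterating over it yields lines.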
    

    Output

    Your software must take as input the absolute path to an unpacked dataset, and has to output for each document of the dataset a corresponding XML file that looks like this:

    The naming of the output files is up to you. However, we recommend to use the author-id as filename and "xml" as extension.

    IMPORTANT! Languages should not be mixed. Create one folder per language and place in it only the prediction files for that language.

    Evaluation

    The performance of your system will be ranked by accuracy. For each language, we will calculate individual accuracies in discriminating between the two classes. Finally, we will average the accuracy values per language to obtain the final ranking.
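
    The ranking measure described above, per-language accuracy followed by an average over languages, can be sketched as follows. Treating an author with no prediction as an error is an assumption, as are the function names.

```python
def accuracy(truth, predictions):
    """Fraction of authors whose predicted label equals the truth label."""
    if not truth:
        raise ValueError("truth must contain at least one author")
    correct = sum(
        1 for author, label in truth.items() if predictions.get(author) == label
    )
    return correct / len(truth)


def final_score(truth_by_language, predictions_by_language):
    """Average the per-language accuracies (for example over 'en' and 'es')."""
    scores = [
        accuracy(truth, predictions_by_language.get(language, {}))
        for language, truth in truth_by_language.items()
    ]
    return sum(scores) / len(scores)


# Toy example with hypothetical author ids.
score = final_score(
    {"en": {"a": 0, "b": 1}, "es": {"c": 1}},
    {"en": {"a": 0, "b": 0}, "es": {"c": 1}},
)
```

    In the toy example the English accuracy is 0.5 and the Spanish accuracy is 1.0, so the final score is 0.75.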

    Submission

    Once you finished tuning your approach on the validation set, your software will be tested on the test set. During the competition, the test set will not be released publicly. Instead, we ask you to submit your software for evaluation at our site as described below.

    We ask you to prepare your software so that it can be executed via command line calls. The command shall take as input (i) an absolute path to the directory of the test corpus and (ii) an absolute path to an empty output directory:

    mySoftware -i INPUT-DIRECTORY -o OUTPUT-DIRECTORY

    Within OUTPUT-DIRECTORY, we require two subfolders: en and es, one folder per language, respectively. As the provided output directory is guaranteed to be empty, your software needs to create those subfolders. Within each of these subfolders, you need to create one xml file per author. The xml file looks like this:

    The naming of the output files is up to you. However, we recommend to use the author-id as filename and "xml" as extension.
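
    The command-line contract described above (an -i input directory, an -o output directory, and one subfolder per language created by the software) can be sketched with argparse. This sketch only prepares the folders and lists author ids; it writes no prediction files, because the XML output format is not reproduced in this listing. The function names are hypothetical.

```python
import argparse
from pathlib import Path

LANGUAGES = ("en", "es")


def parse_args(argv=None):
    """Parse the -i / -o options required by the task description."""
    parser = argparse.ArgumentParser(prog="mySoftware")
    parser.add_argument("-i", dest="input_dir", required=True, type=Path,
                        help="absolute path to the unpacked test corpus")
    parser.add_argument("-o", dest="output_dir", required=True, type=Path,
                        help="absolute path to an empty output directory")
    return parser.parse_args(argv)


def prepare_output(output_dir):
    """Create the en and es subfolders inside the output directory."""
    for language in LANGUAGES:
        (Path(output_dir) / language).mkdir(parents=True, exist_ok=True)


def list_author_ids(input_dir, language):
    """Author ids are the XML file names (without extension) in a language folder."""
    folder = Path(input_dir) / language
    if not folder.is_dir():
        return []
    return sorted(path.stem for path in folder.glob("*.xml"))


def main(argv=None):
    """Prepare the output folders and return the number of authors per language."""
    args = parse_args(argv)
    prepare_output(args.output_dir)
    return {
        language: len(list_author_ids(args.input_dir, language))
        for language in LANGUAGES
    }
```

    Calling main(["-i", "/path/to/corpus", "-o", "/path/to/output"]) mirrors the command line shown above; both paths here are placeholders.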

    Note: By submitting your software you retain full copyrights. You agree to grant us usage rights only for the purpose of the PAN competition. We agree not to share your software with a third party or use it for other purposes than the PAN competition.

    Related Work

    Bilal Ghanem, Paolo Rosso, Francisco Rangel. An Emotional Analysis of False Information in Social Media and News Articles. arXiv preprint arXiv:1908.09951 (2019). ACM Transactions on Internet Technology (TOIT). In Press.

    Anastasia Giachanou, Paolo Rosso, Fabio Crestani. Leveraging Emotional Signals for Credibility Detection. Proceedings of the 42nd International ACM Conference on Research and Development in Information Retrieval (SIGIR). pp 877–880. (2019)

    Andre Guess, Jonathan Nagler, and Joshua Tucker. Less than you think: Prevalence and predictors of fake news dissemination on Facebook. Science Advances vol. 5 (2019)

    Andrew Hall, Loren Terveen, Aaron Halfaker. Bot Detection in Wikidata Using Behavioral and Other Informal Cues. Proceedings of the ACM on Human-Computer Interaction. 2018 Nov 1;2(CSCW):64.

    Kashyap Popat, Subhabrata Mukherjee, Andrew Yates, Gerhard Weikum. DeClarE: Debunking Fake News and False Claims using Evidence-Aware Deep Learning. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP). pp 22-32. (2018)

    Francisco Rangel and Paolo Rosso. Overview of the 7th Author Profiling Task at PAN 2019: Bots and Gender Profiling in Twitter. In: L. Cappellato, N. Ferro, D. E. Losada and H. Müller (eds.) CLEF 2019 Labs and Workshops, Notebook Papers. CEUR Workshop Proceedings.CEUR-WS.org, vol. 2380

    Francisco Rangel, Paolo Rosso, Martin Potthast, Benno Stein. Overview of the 6th author profiling task at pan 2018: multimodal gender identification in Twitter. In: CLEF 2018 Labs and Workshops, Notebook Papers. CEUR Workshop Proceedings. CEUR-WS.org, vol. 2125.

    Francisco Rangel, Paolo Rosso, Martin Potthast, Benno Stein. Overview of the 5th Author Profiling Task at PAN 2017: Gender and Language Variety Identification in Twitter. In: Cappellato L., Ferro N., Goeuriot L, Mandl T. (Eds.) CLEF 2017 Labs and Workshops, Notebook Papers. CEUR Workshop Proceedings. CEUR-WS.org, vol. 1866.

    Francisco Rangel, Paolo Rosso, Ben Verhoeven, Walter Daelemans, Martin Pottast, Benno Stein. Overview of the 4th Author Profiling Task at PAN 2016: Cross-Genre Evaluations. In: Balog K., Capellato L., Ferro N., Macdonald C. (Eds.) CLEF 2016 Labs and Workshops, Notebook Papers. CEUR Workshop Proceedings. CEUR-WS.org, vol. 1609, pp. 750-784

    Francisco Rangel, Fabio Celli, Paolo Rosso, Martin Pottast, Benno Stein, Walter Daelemans. Overview of the 3rd Author Profiling Task at PAN 2015.In: Linda Cappelato and Nicola Ferro and Gareth Jones and Eric San Juan (Eds.): CLEF 2015 Labs and Workshops, Notebook Papers, 8-11 September, Toulouse, France. CEUR Workshop Proceedings. ISSN 1613-0073, http://ceur-ws.org/Vol-1391/,2015.

    Francisco Rangel, Paolo Rosso, Irina Chugur, Martin Potthast, Martin Trenkmann, Benno Stein, Ben Verhoeven, Walter Daelemans. Overview of the 2nd Author Profiling Task at PAN 2014. In: Cappellato L., Ferro N., Halvey M., Kraaij W. (Eds.) CLEF 2014 Labs and Workshops, Notebook Papers. CEUR-WS.org, vol. 1180, pp. 898-827.

    Francisco Rangel, Paolo Rosso, Moshe Koppel, Efstathios Stamatatos, Giacomo Inches. Overview of the Author Profiling Task at PAN 2013. In: Forner P., Navigli R., Tufis D. (Eds.) Notebook Papers of CLEF 2013 Labs and Workshops. CEUR-WS.org, vol. 1179.

    Francisco Rangel and Paolo Rosso. On the Implications of the General Data Protection Regulation on the Organisation of Evaluation Tasks. In: Language and Law / Linguagem e Direito, Vol. 5(2), pp. 80-102.

    Kai Shu, Suhang Wang, and Huan Liu. Understanding User Profiles on Social Media for Fake News Detection. Proceedings of the IEEE Conference on Multimedia Information Processing and Retrieval (MIPR), pp. 430-435 (2018)

    Kai Shu, Amy Sliva, Suhang Wang, Jiliang Tang, and Huan Liu. Fake News Detection on Social Media: A Data Mining Perspective. ACM SIGKDD Explorations Newsletter. (2017)

  13. From Physical Activity to Online Engagement - Datasets - data.bris

    • data.bris.ac.uk
    Updated Sep 17, 2025
    (2025). From Physical Activity to Online Engagement - Datasets - data.bris [Dataset]. https://data.bris.ac.uk/data/dataset/13vcslkc37coc2g8u7iqt7o6cm
    Explore at:
    Dataset updated
    Sep 17, 2025
    Description

    Previous studies have demonstrated the benefits of physical behaviour, such as physical activity and sleep, for mental health, and have shown an association between virtual behaviour, such as social media use and screen time, and mental health problems. Here, physical behaviour is defined as data relating to the person in the physical world, such as physical activity and sleep, and virtual behaviour is defined as behaviour involving the internet, such as social networks, general web browsing, and instant messaging. We believe that a person's physical or virtual behaviour taken individually may not be the best indicator of their mental health. Current datasets do not include data on both physical and virtual behaviours. Therefore, we seek to run a data collection study that collects both. Additionally, we investigate whether machine learning models that include both physical and virtual behaviour can better predict mental health. This study uses data collected via a custom-made app. The app runs in the background of a user's smartphone, passively collecting physical activity, sleep, location, and audio inferences. It also offers users an ecological momentary assessment (EMA) platform where they may log information about their feelings and other significant occurrences through the Warwick-Edinburgh Mental Wellbeing survey. This provides us with a ground truth against which to evaluate our models. We also collect social media data through Instagram and YouTube logs sent by the participants at the end of the study.

  14. Ground-truth Communities Graphs (flat)

    • kaggle.com
    zip
    Updated Jul 6, 2023
    + more versions
    Subhajit Sahu (2023). Ground-truth Communities Graphs (flat) [Dataset]. https://www.kaggle.com/datasets/wolfram77/graphs-communities--flat
    Explore at:
    Available download formats: zip (10776248507 bytes)
    Dataset updated
    Jul 6, 2023
    Authors
    Subhajit Sahu
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    com-LiveJournal: LiveJournal social network and ground-truth communities

    LiveJournal is a free on-line blogging community where users declare friendship with each other. LiveJournal also allows users to form a group which other members can then join. We consider such user-defined groups as ground-truth communities. We provide the LiveJournal friendship social network and ground-truth communities.

    We regard each connected component in a group as a separate ground-truth community. We remove the ground-truth communities which have fewer than 3 nodes. We also provide the 5,000 highest-quality communities, which are described in our paper. As for the network, we provide the largest connected component.
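    The filtering rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the dataset authors' code: the function name `split_group_into_communities`, its parameters, and the reading of "connected component in a group" as a component of the subgraph induced by the group's members are all assumptions made here.

```python
from collections import defaultdict


def split_group_into_communities(members, edges, min_size=3):
    """Split one user-defined group into connected components.

    Assumption: components are taken in the subgraph induced by the
    group's members, and components with fewer than `min_size` nodes
    are dropped (the description above uses 3).
    """
    members = set(members)

    # Keep only edges whose two endpoints both belong to the group.
    adjacency = defaultdict(set)
    for u, v in edges:
        if u in members and v in members:
            adjacency[u].add(v)
            adjacency[v].add(u)

    seen = set()
    communities = []
    for start in sorted(members):
        if start in seen:
            continue
        # Depth-first search collects one connected component.
        component = set()
        stack = [start]
        seen.add(start)
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        if len(component) >= min_size:
            communities.append(sorted(component))
    return communities
```

    With members 1 to 6 and edges (1, 2), (2, 3), (4, 5), the group splits into components {1, 2, 3}, {4, 5} and {6}; only the first has at least 3 nodes and is kept.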

    com-Friendster: Friendster social network and ground-truth communities

    Friendster is an on-line gaming network. Before re-launching as a game website, Friendster was a social networking site where users could form friendship edges with each other. The Friendster social network also allowed users to form a group which other members could then join. We consider such user-defined groups as ground-truth communities. For the social network, we take the induced subgraph of the nodes that either belong to at least one community or are connected to other nodes that belong to at least one community. This data is provided by The Web Archive Project, where the full graph is available.

    We regard each connected component in a group as a separate ground-truth community. We remove the ground-truth communities which have fewer than 3 nodes. We also provide the 5,000 highest-quality communities, which are described in our paper. As for the network, we provide the largest connected component.

    com-Orkut: Orkut social network and ground-truth communities

    Orkut is a free on-line social network where users form friendships with each other. Orkut also allows users to form a group which other members can then join. We consider such user-defined groups as ground-truth communities. We provide the Orkut friendship social network and ground-truth communities. This data is provided by Alan Mislove et al.

    We regard each connected component in a group as a separate ground-truth community. We remove the ground-truth communities which have fewer than 3 nodes. We also provide the 5,000 highest-quality communities, which are described in our paper. As for the network, we provide the largest connected component.

    com-Youtube: Youtube social network and ground-truth communities

    Youtube is a video-sharing web site that includes a social network. In the Youtube social network, users form friendships with each other and can create groups which other users can join. We consider such user-defined groups as ground-truth communities. This data is provided by Alan Mislove et al.

    We regard each connected component in a group as a separate ground-truth community. We remove the ground-truth communities which have fewer than 3 nodes. We also provide the 5,000 highest-quality communities, which are described in our paper. As for the network, we provide the largest connected component.

    com-DBLP: DBLP collaboration network and ground-truth communities

    The DBLP computer science bibliography provides a comprehensive list of research papers in computer science. We construct a co-authorship network where two authors are connected if they have published at least one paper together. Each publication venue, e.g., a journal or conference, defines an individual ground-truth community; authors who published in a certain journal or conference form a community.
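    The construction described above can be illustrated with a short Python sketch. The input layout (a list of `(venue, authors)` pairs) and the name `build_coauthorship_graph` are assumptions made for this example; they do not reflect the actual DBLP file format or the dataset authors' tooling.

```python
from collections import defaultdict
from itertools import combinations


def build_coauthorship_graph(papers):
    """Build co-authorship edges and venue communities.

    `papers` is assumed to be an iterable of (venue, authors) pairs,
    where `authors` is a list of author identifiers.
    """
    edges = set()
    communities = defaultdict(set)
    for venue, authors in papers:
        unique_authors = sorted(set(authors))
        # Two authors are connected if they share at least one paper;
        # sorting makes each undirected edge appear once as (a, b), a < b.
        for a, b in combinations(unique_authors, 2):
            edges.add((a, b))
        # Every author who published at a venue joins that venue's community.
        communities[venue].update(unique_authors)
    return edges, dict(communities)
```

    A single-author paper contributes no edge but still adds its author to the venue's community, which matches the description: edges come from joint papers, communities from venues.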

    We regard each connected component in a group as a separate ground-truth community. We remove the ground-truth communities which have fewer than 3 nodes. We also provide the 5,000 highest-quality communities, which are described in our paper. As for the network, we provide the largest connected component.

    com-Amazon: Amazon product co-purchasing network and ground-truth communities

    The network was collected by crawling the Amazon website. It is based on the "Customers Who Bought This Item Also Bought" feature of the Amazon website. If a product i is frequently co-purchased with product j, the graph contains an undirected edge between i and j. Each product category provided by Amazon defines a ground-truth community.

    We regard each connected component in a product category as a separate ground-truth community. We remove the ground-truth communities which have fewer than 3 nodes. We also provide the 5,000 highest-quality communities, which are described in our paper. As for the network, we provide the largest connected component.

    email-Eu-core: email-Eu-core network

    The network was generated using email data from a large European research institution. We have anonymized information about all incoming and outgoing email between members of the research institution. Th...

  15. Data sets used for hashtag analysis.

    • plos.figshare.com
    xlsx
    Updated Jan 30, 2025
    + more versions
    Alon Sela; Omer Neter; Václav Lohr; Petr Cihelka; Fan Wang; Moti Zwilling; John Phillip Sabou; Miloš Ulman (2025). Data sets used for hashtag analysis. [Dataset]. http://doi.org/10.1371/journal.pone.0309688.s001
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Jan 30, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Alon Sela; Omer Neter; Václav Lohr; Petr Cihelka; Fan Wang; Moti Zwilling; John Phillip Sabou; Miloš Ulman
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Social networks are a battlefield for political propaganda. Protected by the anonymity of the internet, political actors use computational propaganda to influence the masses. Their methods include the use of synchronized or individual bots, multiple accounts operated by one social media management tool, or different manipulations of search engines and social network algorithms, all aiming to promote their ideology. While computational propaganda influences modern society, it is hard to measure or detect. Furthermore, with the recent exponential growth in large language models (LLMs) and growing concerns about information overload, which makes the alternative truth spheres noisier than ever before, the complexity and magnitude of computational propaganda are also expected to increase, making its detection even harder. Propaganda in social networks is disguised as legitimate news sent from authentic users, smartly blending real users with fake accounts. We seek here to detect efforts to manipulate the spread of information in social networks through one of the fundamental macro-scale properties of rhetoric: repetitiveness. We use 16 data sets with a total size of 13 GB, 10 related to political topics and 6 related to non-political ones (large-scale disasters), each ranging from tens of thousands to a few million tweets. We compare them and identify statistical and network properties that distinguish between these two types of information cascades. These features are based on both the repetition distribution of hashtags and the mentions of users, as well as the network structure. Together, they enable us to distinguish (p-value = 0.0001) between the two different classes of information cascades. In addition to constructing a bipartite graph connecting words and tweets to each cascade, we develop a quantitative measure and show how it can be used to distinguish between political and non-political discussions. Our method is indifferent to the cascade's country of origin, language, or cultural background since it is based only on the statistical properties of repetitiveness and the structure of the bipartite network of word appearances in tweets.

  16. Individual characteristics of deliberate and accidental fake news...

    • plos.figshare.com
    • figshare.com
    xls
    Updated Apr 9, 2024
    Kerstin Unfried; Jan Priebe (2024). Individual characteristics of deliberate and accidental fake news distributors. [Dataset]. http://doi.org/10.1371/journal.pone.0301818.t002
    Explore at:
    Available download formats: xls
    Dataset updated
    Apr 9, 2024
    Dataset provided by
    PLOS, http://plos.org/
    Authors
    Kerstin Unfried; Jan Priebe
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Individual characteristics of deliberate and accidental fake news distributors.


Patrick Gerard; Nicholas Botzer; Tim Weninger (2023). Truth Social Dataset [Dataset]. http://doi.org/10.5281/zenodo.7531625

Truth Social Dataset

Explore at:
2 scholarly articles cite this dataset (View in Google Scholar)
Available download formats: zip
Dataset updated
Jan 13, 2023
Dataset provided by
Zenodo, http://zenodo.org/
Authors
Patrick Gerard; Nicholas Botzer; Tim Weninger
License

Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

A Truth Social data set containing a network of users, their associated posts, and additional information about each post. Collected from February 2022 through September 2022, this dataset contains 454,458 user entries and 845,060 Truth (Truth Social’s term for post) entries.

The dataset comprises 12 files; the entry count for each file is shown below.

File                            Data Points
users.tsv                       454,458
follows.tsv                     4,002,115
truths.tsv                      823,927
quotes.tsv                      10,508
replies.tsv                     506,276
media.tsv                       184,884
hashtags.tsv                    21,599
external_urls.tsv               173,947
truth_hashtag_edges.tsv         213,295
truth_media_edges.tsv           257,500
truth_external_url_edges.tsv    252,877
truth_user_tag_edges.tsv        145,234

A readme file is provided that describes the structure of the files, necessary terms, and necessary information about the data collection.
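As a quick sanity check after downloading, the per-file entry counts could be compared against the figures listed for each file. The sketch below is only an assumption-laden illustration: the helper name `count_tsv_entries` is hypothetical, and both the presence of a single header row and the use of default CSV quoting rules are assumptions that should be checked against the readme before relying on the result.

```python
import csv


def count_tsv_entries(path, has_header=True):
    """Count data rows in a tab-separated file.

    Assumptions (not confirmed by the dataset description): the file is
    UTF-8 encoded, uses default CSV quoting, and, when `has_header` is
    True, starts with exactly one header row.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter="\t")
        if has_header:
            next(reader, None)  # skip the assumed header row
        return sum(1 for _ in reader)
```

If the header assumption holds, `count_tsv_entries("users.tsv")` can be compared with the count listed for `users.tsv`; a mismatch would suggest the quoting or header assumption is wrong rather than that the data is incomplete.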
