Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A Truth Social dataset containing a network of users, their associated posts, and additional information about each post. Collected from February 2022 through September 2022, this dataset contains 454,458 user entries and 845,060 Truth (Truth Social’s term for a post) entries.
The dataset comprises 12 files; the entry count for each is shown below.
File | Data Points |
---|---|
users.tsv | 454,458 |
follows.tsv | 4,002,115 |
truths.tsv | 823,927 |
quotes.tsv | 10,508 |
replies.tsv | 506,276 |
media.tsv | 184,884 |
hashtags.tsv | 21,599 |
external_urls.tsv | 173,947 |
truth_hashtag_edges.tsv | 213,295 |
truth_media_edges.tsv | 257,500 |
truth_external_url_edges.tsv | 252,877 |
truth_user_tag_edges.tsv | 145,234 |
A readme file is provided that describes the structure of the files, necessary terms, and necessary information about the data collection.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
Youtube social network and ground-truth communities

Dataset information

Youtube is a video-sharing website that includes a social network. In the Youtube social network, users form friendships with each other, and users can create groups that other users can join. We consider such user-defined groups as ground-truth communities. This data is provided by Alan Mislove et al.

We regard each connected component in a group as a separate ground-truth community. We remove the ground-truth communities that have fewer than 3 nodes. We also provide the top 5,000 highest-quality communities, as described in our paper. For the network, we provide the largest connected component.
More info: https://snap.stanford.edu/data/com-Youtube.html
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
https://snap.stanford.edu/data/com-Youtube.html
Dataset information
Youtube (http://www.youtube.com/) is a video-sharing website that includes a social network. In the Youtube social network, users form friendships with each other, and users can create groups that other users can join. We consider such user-defined groups as ground-truth communities. This data is provided by Alan Mislove et al. (http://socialnetworks.mpi-sws.org/data-imc2007.html).

We regard each connected component in a group as a separate ground-truth community. We remove the ground-truth communities that have fewer than 3 nodes. We also provide the top 5,000 highest-quality communities, as described in our paper (http://arxiv.org/abs/1205.6233). For the network, we provide the largest connected component.
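The extraction rule above (each connected component of a group is one ground-truth community; drop components with fewer than 3 nodes) can be sketched in a few lines. The friendship graph and group below are toy data, not the actual SNAP files:

```python
from collections import deque

def group_communities(adj, group, min_size=3):
    """Split a user-defined group into connected components of the
    friendship graph restricted to the group; keep components with
    at least `min_size` nodes (the ground-truth communities)."""
    members = set(group)
    seen = set()
    communities = []
    for start in members:
        if start in seen:
            continue
        # BFS within the group only
        comp = {start}
        queue = deque([start])
        seen.add(start)
        while queue:
            u = queue.popleft()
            for v in adj.get(u, ()):
                if v in members and v not in seen:
                    seen.add(v)
                    comp.add(v)
                    queue.append(v)
        if len(comp) >= min_size:
            communities.append(comp)
    return communities

# Toy friendship graph: 1-2-3 connected, 4-5 connected, 6 isolated
adj = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4], 6: []}
# The group {1..6} splits into components {1,2,3}, {4,5}, {6};
# only {1,2,3} survives the min-size filter.
print(group_communities(adj, [1, 2, 3, 4, 5, 6]))  # [{1, 2, 3}]
```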
Network statistics

Statistic | Value |
---|---|
Nodes | 1,134,890 |
Edges | 2,987,624 |
Nodes in largest WCC | 1,134,890 (1.000) |
Edges in largest WCC | 2,987,624 (1.000) |
Nodes in largest SCC | 1,134,890 (1.000) |
Edges in largest SCC | 2,987,624 (1.000) |
Average clustering coefficient | 0.0808 |
Number of triangles | 3,056,386 |
Fraction of closed triangles | 0.002081 |
Diameter (longest shortest path) | 20 |
90-percentile effective diameter | 6.5 |
Community statistics

Statistic | Value |
---|---|
Number of communities | 8,385 |
Average community size | 13.50 |
Average membership size | 0.10 |
Source (citation)
J. Yang and J. Leskovec. Defining and Evaluating Network Communities based
on Ground-truth. ICDM, 2012. http://arxiv.org/abs/1205.6233
Files

File | Description |
---|---|
com-youtube.ungraph.txt.gz | Undirected Youtube network |
com-youtube.all.cmty.txt.gz | Youtube communities |
com-youtube.top5000.cmty.txt.gz | Youtube communities (Top 5,000) |
The graph in the SNAP data set is 1-based, with nodes numbered 1 to
1,157,827.
In the SuiteSparse Matrix Collection, Problem.A is the undirected Youtube
network, a matrix of size n-by-n with n=1,134,890, which is the number of
unique user id's appearing in any edge.
Problem.aux.nodeid is a list of the node id's that appear in the SNAP data
set. A(i,j)=1 if person nodeid(i) is friends with person nodeid(j). The
node id's are the same as the SNAP data set (1-based).
C = Problem.aux.Communities_all is a sparse matrix of size n by 16,386
which represents the communities in the com-youtube.all.cmty.txt file.
The kth line in that file defines the kth community, and is the column
C(:,k), where C(i,k)=1 if person nodeid(i) is in the kth community. Row
C(i,:) and row/column i of the A matrix thus refer to the same person,
nodeid(i).
Ctop = Problem.aux.Communities_top5000 is n-by-5000, with the same
structure as the C array above, with the content of the
com-youtube.top5000.cmty.txt.gz file.
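The relationship between `Problem.A`, `nodeid`, and the community matrices described above can be illustrated with a toy example. The plain-Python lists below are stand-ins for the sparse matrices; all sizes and values are illustrative, not taken from the collection:

```python
# Toy stand-ins for the SuiteSparse objects: A is the symmetric
# adjacency matrix (A[i][j] = 1 if person nodeid[i] is friends with
# person nodeid[j]); C maps community k to the set of row indices i
# with C(i,k) = 1, i.e. the members of the k-th community.
nodeid = [1, 2, 5, 7]            # 1-based SNAP ids appearing in edges
A = [[0, 1, 0, 1],
     [1, 0, 1, 0],
     [0, 1, 0, 0],
     [1, 0, 0, 0]]
C = {0: {0, 1}, 1: {0, 3}}       # community k -> member row indices

def friends_of(i):
    """Row i of A decoded back to SNAP node ids."""
    return [nodeid[j] for j, x in enumerate(A[i]) if x]

def communities_of(i):
    """Columns k of C with C(i,k) = 1."""
    return [k for k, members in C.items() if i in members]

# Row/column i of A and row i of C refer to the same person nodeid[i]:
print(friends_of(0))        # [2, 7]
print(communities_of(0))    # [0, 1]
```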
Task
Fake news has become one of the main threats to our society. Although fake news is not a new phenomenon, the exponential growth of social media has offered an easy platform for its fast propagation. A great amount of fake news and rumors is propagated in online social networks, usually with the aim of deceiving users and shaping specific opinions. Users play a critical role in the creation and propagation of fake news online by consuming and sharing articles with inaccurate information, either intentionally or unintentionally. To this end, in this task, we aim at identifying possible fake news spreaders on social media as a first step towards preventing fake news from being propagated among online users.
After having addressed several aspects of author profiling in social media from 2013 to 2019 (bot detection, age and gender, also together with personality, gender and language variety, and gender from a multimodality perspective), this year we aim at investigating whether it is possible to discriminate authors that have shared some fake news in the past from those that, to the best of our knowledge, have never done so.
As in previous years, we propose the task from a multilingual perspective:
NOTE: Although we recommend participating in both languages (English and Spanish), it is possible to address the problem for just one language.
Data
Input
The uncompressed dataset consists of a folder per language (en, es). Each folder contains:
The format of the XML files is:
The format of the truth.txt file is as follows. The first column corresponds to the author id. The second column contains the truth label.
b2d5748083d6fdffec6c2d68d4d4442d:::0
2bed15d46872169dc7deaf8d2b43a56:::0
8234ac5cca1aed3f9029277b2cb851b:::1
5ccd228e21485568016b4ee82deb0d28:::0
60d068f9cafb656431e62a6542de2dc0:::1
...
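A minimal sketch for parsing this `:::`-separated format; the sample entries are taken from the excerpt above, and the helper name is ours:

```python
def parse_truth(lines):
    """Map author id -> truth label (0 or 1) from truth.txt-style lines."""
    labels = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        author_id, label = line.split(":::")
        labels[author_id] = int(label)
    return labels

sample = [
    "b2d5748083d6fdffec6c2d68d4d4442d:::0",
    "60d068f9cafb656431e62a6542de2dc0:::1",
]
print(parse_truth(sample))
```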
Output
Your software must take as input the absolute path to an unpacked dataset, and has to output for each document of the dataset a corresponding XML file that looks like this:
The naming of the output files is up to you. However, we recommend using the author id as the filename and "xml" as the extension.
IMPORTANT! Languages should not be mixed. Create one folder per language and place inside it only the prediction files for that language.
Evaluation
The performance of your system will be ranked by accuracy. For each language, we will calculate individual accuracies in discriminating between the two classes. Finally, we will average the accuracy values per language to obtain the final ranking.
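The ranking procedure above (per-language accuracy, then the average across languages) can be sketched as follows, with made-up predictions and labels:

```python
def accuracy(pred, truth):
    """Fraction of authors whose predicted label matches the truth."""
    correct = sum(1 for author, y in truth.items() if pred.get(author) == y)
    return correct / len(truth)

def final_score(per_language):
    """Average of the per-language accuracies; `per_language` maps
    a language code to a (predictions, truth) pair of dicts."""
    accs = [accuracy(pred, truth) for pred, truth in per_language.values()]
    return sum(accs) / len(accs)

# Hypothetical predictions and ground truth:
truth_en = {"a1": 0, "a2": 1, "a3": 1, "a4": 0}
pred_en  = {"a1": 0, "a2": 1, "a3": 0, "a4": 0}   # 3/4 correct
truth_es = {"b1": 1, "b2": 0}
pred_es  = {"b1": 1, "b2": 0}                     # 2/2 correct
print(final_score({"en": (pred_en, truth_en), "es": (pred_es, truth_es)}))  # 0.875
```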
Submission
Once you have finished tuning your approach on the validation set, your software will be tested on the test set. During the competition, the test set will not be released publicly. Instead, we ask you to submit your software for evaluation at our site as described below.
We ask you to prepare your software so that it can be executed via command line calls. The command shall take as input (i) an absolute path to the directory of the test corpus and (ii) an absolute path to an empty output directory:
mySoftware -i INPUT-DIRECTORY -o OUTPUT-DIRECTORY
Within OUTPUT-DIRECTORY, we require two subfolders, en and es, one per language. As the provided output directory is guaranteed to be empty, your software needs to create those subfolders. Within each of these subfolders, you need to create one xml file per author. The xml file looks like this:
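The required folder layout can be produced with a sketch like the following. The XML body written here is only a placeholder comment, since the actual per-author XML format is specified elsewhere in the task description; the function name and label values are ours:

```python
import os
import tempfile

def write_predictions(output_dir, predictions):
    """Create one subfolder per language inside the (empty) output
    directory and one XML file per author, named <author-id>.xml.
    `predictions` maps language -> {author_id: label}."""
    for lang, authors in predictions.items():
        lang_dir = os.path.join(output_dir, lang)
        os.makedirs(lang_dir, exist_ok=True)
        for author_id, label in authors.items():
            path = os.path.join(lang_dir, f"{author_id}.xml")
            with open(path, "w", encoding="utf-8") as f:
                # Placeholder body: replace with the XML format
                # required by the task.
                f.write(f"<!-- prediction for {author_id}: {label} -->\n")

out = tempfile.mkdtemp()
write_predictions(out, {"en": {"a1": 0}, "es": {"b1": 1}})
print(sorted(os.listdir(out)))  # ['en', 'es']
```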
The naming of the output files is up to you. However, we recommend using the author id as the filename and "xml" as the extension.
Note: By submitting your software you retain full copyrights. You agree to grant us usage rights only for the purpose of the PAN competition. We agree not to share your software with a third party or use it for other purposes than the PAN competition.
Related Work
The data comes from the Harvard Dataverse and covers political trust and regime support in China, together with self-monitoring, which measures participants' desire for social desirability. The authors, Nicholson and Huang, obtained the data via a standard survey experiment that contains an embedded list experiment. The list experiment aspect is significant because list experiments are an "indirect way to gauge overreporting" (Nicholson and Huang).

The data can help in understanding Chinese politics, such as how support varies at different levels of government and how overreporting is affected by a person's social desirability. It can be used in government classes and coding classes, for instance when learning about ordered logit models and simple bar graphs; a regression should not be used.

The data could also be used to compare levels of trust across regime types. It would be interesting to compare the results of other authoritarian countries, such as Turkey and Vietnam, to the results of these datasets from China; data from these countries could in turn be compared to democracies. People in authoritarian countries underreport and might not always tell the truth, so there is a chance that authoritarian countries could show levels of reported trust similar to democratic countries; the list experiment design reduces some of this underreporting. Finally, the data can be used to examine whether certain demographic groups show more or less support for their government, for example by gender, age, or education level.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Pollution of online social spaces caused by rampant dis- and misinformation is a growing societal concern. However, recent decisions to reduce access to social media APIs are causing a shortage of publicly available, recent social media data, thus hindering the advancement of computational social science as a whole. We present a large, high-coverage dataset of social interactions and user-generated content from Bluesky Social to address this pressing issue. The dataset contains the complete post history of over 4M users (81% of all registered accounts), totalling 235M posts. We also make available social data covering follow, comment, repost, and quote interactions. Since Bluesky allows users to create and like feed generators (i.e., content recommendation algorithms), we also release the full output of several popular algorithms available on the platform, along with their timestamped “like” interactions. This dataset allows novel analysis of online behavior and human-machine engagement patterns. Notably, it provides ground-truth data for studying the effects of content exposure and self-selection, and for performing content virality and diffusion analysis.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
# Dataset of adaptive Children-Robot Interaction for Education based on Autonomous Multimodal Users’ Readings
## Background
This dataset is generated from multiple interactions between a Social Robot (NAO) and 5th grade students from a private school in São Paulo, Brazil.
In the interaction, the robot covered, with the participating students, the content the teachers were teaching at the time, about the waste management system in Brazil.
The measures here are the readings that the R-CASTLE system made for each answer the students gave to the questions the robot asked.
For more information about how these measures were collected, please refer to this thesis at: https://doi.org/10.11606/T.55.2020.tde-31082020-093935
Since the goal of R-CASTLE is to provide autonomous adaptation, we built a ground-truth dataset based on feedback from an education expert operating the robot in loco. This person teleoperated the robot, changing its behaviour (or not) according to observed measures of the participants: face gaze, displayed facial emotion, number of spoken words, correctness of the answer (based on pre-defined answers), and the time students took to answer. These measures form the first 5 columns of the CSV file. The evaluator could decide to increase (1), maintain (0), or decrease (-1) the difficulty level of the following questions depending on these observed measures. This is the human true label, stored in the 6th column.
## Description:
Each row of this file is a tuple: the robot's autonomous readings in the first 5 columns, the true label in the 6th column (True Value), and the output of the fuzzy classification in the 7th column (Final Crisp Value).
Deviations (integer): number of face deviations of the participant during the question answering identified by the system.
EmotionCount (integer): a balance between "good" and "bad" emotions (good - bad) identified by the system.
NumberWord (integer): number of words comprised in the sentence the participant gave.
SucRate/Ans/RWa (between 0 and 1, where 0 is completely wrong and 1 is completely right): the success rate of the participant’s answer to that question, based on the expected answer programmed by their teachers.
Time2ans (float): the time, in seconds, from when the robot finished asking the question until the end of the participant’s speech.
True Value (-1, 0, 1): Ground-truth value. Value of adaptation chosen by the human observing the interaction if the system needed to decrease, maintain, or increase the level of difficulty of asked questions.
Final Crisp Value (float): value of calculated fuzzy output based on the implementations in the paper: https://doi.org/10.1145/3395035.3425201
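A minimal sketch for reading one row in the 7-column layout described above; the sample row, file handling, and field names are hypothetical:

```python
import csv
import io

# Hypothetical row in the 7-column layout: Deviations, EmotionCount,
# NumberWord, SucRate, Time2ans, True Value, Final Crisp Value
sample = "2,-1,14,0.75,6.3,0,0.12\n"

def parse_rows(text):
    """Decode CSV text into typed records matching the column spec."""
    rows = []
    for rec in csv.reader(io.StringIO(text)):
        rows.append({
            "deviations": int(rec[0]),        # face deviations count
            "emotion_count": int(rec[1]),     # good - bad emotions
            "number_words": int(rec[2]),      # words in the answer
            "success_rate": float(rec[3]),    # in [0, 1]
            "time_to_answer_s": float(rec[4]),
            "true_value": int(rec[5]),        # -1, 0, or 1
            "final_crisp_value": float(rec[6]),
        })
    return rows

row = parse_rows(sample)[0]
print(row["true_value"], row["success_rate"])  # 0 0.75
```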
## Creators
Daniel Tozadore: dtozadore@gmail.com
Roseli Romero: rafrance@icmc.usp.br
## License:
[Creative Commons Licenses](https://creativecommons.org/share-your-work/cclicenses/)
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
As the COVID-19 virus quickly spreads around the world, misinformation related to COVID-19 is, unfortunately, also created and spreads like wildfire. Such misinformation has caused confusion among people, disruptions in society, and even deadly health consequences. Being able to understand, detect, and mitigate such COVID-19 misinformation therefore has not only deep intellectual value but also huge societal impact. This dataset was created to help researchers combat COVID-19 health misinformation.
The dataset is a diverse COVID-19 healthcare misinformation dataset, including fake news on websites and social platforms, along with users' social engagement with such news. It includes 4,251 news articles, 296,000 related user engagements, 926 social platform posts about COVID-19, and ground truth labels.
Version 0.1 (05/17/2020): initial version, corresponding to the arXiv paper "CoAID: COVID-19 Healthcare Misinformation Dataset"
Version 0.2 (08/03/2020): added data from May 1, 2020 through July 1, 2020
Version 0.3 (11/03/2020): added data from July 1, 2020 through September 1, 2020
Limeng Cui and Dongwon Lee, Pennsylvania State University.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A Twitter follower-followee graph with 269,640 nodes and 6,818,501 edges from [Kwak]; ground-truth labels are obtained from [SybilSCAR]. Among the nodes, 178,377 are benign and 91,263 are Sybil. We set aside 9,000 Sybil and 17,000 benign users (about 10%) as the training set and test on the overall social graph.
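The training split described above can be sketched as follows; the toy run scales the counts down, while the real split samples 9,000 Sybil and 17,000 benign users:

```python
import random

def train_test_split(labels, n_sybil=9_000, n_benign=17_000, seed=0):
    """Sample a fixed number of Sybil (label 1) and benign (label 0)
    nodes as the training set; the rest of the labeled nodes are
    evaluated on the full graph. Counts follow the description above."""
    rng = random.Random(seed)
    sybil = [n for n, y in labels.items() if y == 1]
    benign = [n for n, y in labels.items() if y == 0]
    train = set(rng.sample(sybil, n_sybil)) | set(rng.sample(benign, n_benign))
    return train

# Tiny illustrative run with scaled-down counts: 40 Sybil, 60 benign
labels = {i: (1 if i < 40 else 0) for i in range(100)}
train = train_test_split(labels, n_sybil=4, n_benign=6)
print(len(train))  # 10
```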
H. Kwak, C. Lee, H. Park, and S. Moon, “What is Twitter, a social network or a news media?” in WWW, 2010.
B. Wang, L. Zhang, and N. Z. Gong, “SybilSCAR: Sybil detection in online social networks via local rule based propagation,” in IEEE INFOCOM, 2017.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Social networks are a battlefield for political propaganda. Protected by the anonymity of the internet, political actors use computational propaganda to influence the masses. Their methods include the use of synchronized or individual bots, multiple accounts operated by one social media management tool, or various manipulations of search engines and social network algorithms, all aiming to promote their ideology. While computational propaganda influences modern society, it is hard to measure or detect. Furthermore, with the recent exponential growth of large language models (LLMs) and growing concerns about information overload, which make the alternative truth spheres noisier than ever before, the complexity and magnitude of computational propaganda are expected to increase, making detection even harder.

Propaganda in social networks is disguised as legitimate news sent from authentic users, smartly blending real users with fake accounts. We seek here to detect efforts to manipulate the spread of information in social networks through one of the fundamental macro-scale properties of rhetoric: repetitiveness. We use 16 datasets of a total size of 13 GB, 10 related to political topics and 6 related to non-political ones (large-scale disasters), each ranging from tens of thousands to a few million tweets. We compare them and identify statistical and network properties that distinguish between these two types of information cascades. These features are based on the repetition distribution of hashtags and user mentions, as well as on the network structure. Together, they enable us to distinguish (p-value = 0.0001) between the two classes of information cascades. In addition to constructing a bipartite graph connecting words and tweets for each cascade, we develop a quantitative measure and show how it can be used to distinguish between political and non-political discussions.

Our method is indifferent to the cascade’s country of origin, language, or cultural background, since it is based only on the statistical properties of repetitiveness and word appearance in the tweet bipartite network structures.
Task
Hate speech (HS) is commonly defined as any communication that disparages a person or a group on the basis of some characteristic such as race, colour, ethnicity, gender, sexual orientation, nationality, religion, or other characteristics. Given the huge amount of user-generated content on Twitter, the problem of detecting, and therefore possibly countering, the diffusion of HS is becoming fundamental, for instance for fighting against misogyny and xenophobia. To this end, in this task, we aim at identifying possible hate speech spreaders on Twitter as a first step towards preventing hate speech from being propagated among online users.
After having addressed several aspects of author profiling in social media from 2013 to 2020 (fake news spreaders, bot detection, age and gender, also together with personality, gender and language variety, and gender from a multimodality perspective), this year we aim at investigating if it is possible to discriminate authors that have shared some hate speech in the past from those that, to the best of our knowledge, have never done it.
As in previous years, we propose the task from a multilingual perspective:
NOTE: Although we recommend participating in both languages (English and Spanish), it is possible to address the problem just for one language.
Award
We are happy to announce that the best performing team at the 9th International Competition on Author Profiling will be awarded 300 Euro, sponsored by Symanto.
Data
Input
The uncompressed dataset consists of a folder per language (en, es). Each folder contains:
The format of the XML files is:
The format of the truth.txt file is as follows. The first column corresponds to the author id. The second column contains the truth label.
b2d5748083d6fdffec6c2d68d4d4442d:::0
2bed15d46872169dc7deaf8d2b43a56:::0
8234ac5cca1aed3f9029277b2cb851b:::1
5ccd228e21485568016b4ee82deb0d28:::0
60d068f9cafb656431e62a6542de2dc0:::1
...
Output
Your software must take as input the absolute path to an unpacked dataset, and has to output for each document of the dataset a corresponding XML file that looks like this:
The naming of the output files is up to you. However, we recommend using the author id as the filename and "xml" as the extension.
IMPORTANT! Languages should not be mixed. Create one folder per language and place inside it only the prediction files for that language.
Evaluation
The performance of your system will be ranked by accuracy. For each language, we will calculate individual accuracies in discriminating between the two classes. Finally, we will average the accuracy values per language to obtain the final ranking.
Related Work
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Residential School Locations Dataset [IRS_Locations.csv] contains the locations (latitude and longitude) of Residential Schools and student hostels operated by the federal government in Canada. All the residential schools and hostels listed in the Indian Residential School Settlement Agreement are included in this dataset, as well as several Industrial schools and residential schools that were not part of the IRSSA. This version of the dataset doesn’t include the five schools under the Newfoundland and Labrador Residential Schools Settlement Agreement.

The original school location data was created by the Truth and Reconciliation Commission and was provided to the researcher (Rosa Orlandini) by the National Centre for Truth and Reconciliation in April 2017. The dataset was created by Rosa Orlandini, and builds upon and enhances the previous work of the Truth and Reconciliation Commission, Morgan Hite (creator of the Atlas of Indian Residential Schools in Canada that was produced for the Tk'emlups First Nation and Justice for Day Scholar's Initiative), and Stephanie Pyne (project lead for the Residential Schools Interactive Map). Each individual school location in this dataset is attributed to RSIM, Morgan Hite, NCTR, or Rosa Orlandini.

Many schools/hostels had several locations throughout the history of the institution. If the school/hostel moved from its original location to another property, then the school is considered to have two unique locations in this dataset: the original location and the new location. For example, Lejac Indian Residential School had two locations while it was operating, Stuart Lake and Fraser Lake. If a new school building was constructed on the same property as the original school building, it isn't considered to be a new location, as is the case of Girouard Indian Residential School.

When the precise location is known, the coordinates of the main building are provided; when the precise location of the building isn’t known, an approximate location is provided. For each residential school institution location, the following information is provided: official names, alternative names, dates of operation, religious affiliation, latitude and longitude coordinates, community location, Indigenous community name, contributor (of the location coordinates), school/institution photo (when available), location point precision, type of school (hostel or residential school), and a list of references used to determine the location of the main buildings or sites.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The replication material consists of the following files:
1. hh_replication.do contains the code that was used to clean the data and create all tables and figures in the paper.
2. hh_replication.txt corresponds to hh_replication.do (converted to .txt for ease of access for Stata non-users).
3. hh_replicationdata.dta (or alternatively hh_replicationdata.csv) contains the dataset (Stata dataset; alternatively as a comma-separated version).
4. hh_log.txt is a log file that contains all numerical results reported in the article and the online appendix.
5. Additional data files necessary to recreate the maps reported in the Online Appendix, Figures B.1-B.2:
   - B.1: guadm2.dta, guacoord2.dta, and guatecapital.dta
   - B.2: nepadm3.dta, nepcoord3.dta, and nepalcapital.dta
Previous studies have demonstrated the advantages of physical behaviour, such as physical activity and sleep, for mental health, and have shown an association between virtual behaviour, such as social media use and screen time, and mental health problems. Here, physical behaviour is defined as data about the person in the physical world, such as physical activity and sleep, and virtual behaviour is defined as behaviour involving the internet, such as social networks, general web browsing, and instant messaging. We believe that a person's physical or virtual behaviour individually may not be the best indicator of their mental health, and current datasets do not include data on both physical and virtual behaviours. Therefore, we seek to run a data collection study that collects both physical and virtual behaviours. Additionally, we investigate whether machine learning models that include both physical and virtual behaviour can better predict mental health. This study is conducted using data collected via a custom-made app. The app runs in the background of a user's smartphone, passively collecting physical activity, sleep, location, and audio inferences. Additionally, it offers users an ecological momentary assessment (EMA) platform where they may log information about their feelings and other significant occurrences through the Warwick-Edinburgh Mental Wellbeing survey. This provides us with a ground truth to evaluate our models. We also collect social media data through Instagram and YouTube logs sent by the participants at the end of the study.
Social media bots pose as humans to influence users with commercial, political or ideological purposes. For example, bots could artificially inflate the popularity of a product by promoting it and/or writing positive ratings, as well as undermine the reputation of competitive products through negative valuations. The threat is even greater when the purpose is political or ideological (see Brexit referendum or US Presidential elections). Fearing the effect of this influence, the German political parties have rejected the use of bots in their electoral campaign for the general elections. Furthermore, bots are commonly related to fake news spreading. Therefore, to approach the identification of bots from an author profiling perspective is of high importance from the point of view of marketing, forensics and security.
After having addressed several aspects of author profiling in social media from 2013 to 2018 (age and gender, also together with personality, gender and language variety, and gender from a multimodality perspective), this year we aim at investigating whether the author of a Twitter feed is a bot or a human. Furthermore, in case of human, to profile the gender of the author.
The uncompressed dataset consists of a folder per language (en, es). Each folder contains:
An XML file per author (Twitter user) with 100 tweets. The name of the XML file corresponds to the unique author id.
A truth.txt file with the list of authors and the ground truth.
Friendster is an online gaming network. Before re-launching as a gaming website, Friendster was a social networking site where users could form friendship edges with each other. The Friendster social network also allowed users to form groups that other members could then join. The Friendster dataset consists of ground-truth communities (based on user-defined groups) and the social network given by the induced subgraph on the nodes that either belong to at least one community or are connected to nodes that belong to at least one community.
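The induced-subgraph construction described above (keep nodes that belong to at least one community or neighbour such a node, then induce the network on them) can be sketched with toy data:

```python
def induced_subgraph(edges, communities):
    """Keep nodes that belong to at least one community or are
    adjacent to such a node, then induce the subgraph on them."""
    members = set().union(*communities)
    keep = set(members)
    for u, v in edges:
        # Neighbours of community members are also kept
        if u in members or v in members:
            keep.update((u, v))
    return keep, [(u, v) for u, v in edges if u in keep and v in keep]

# Toy data: community {1, 2}; node 3 neighbours 2; 4 and 5 are dropped
edges = [(1, 2), (2, 3), (4, 5)]
nodes, sub = induced_subgraph(edges, [{1, 2}])
print(sorted(nodes), sub)  # [1, 2, 3] [(1, 2), (2, 3)]
```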