Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A Truth Social dataset containing a network of users, their associated posts, and additional information about each post. Collected from February 2022 through September 2022, this dataset contains 454,458 user entries and 845,060 Truth (Truth Social’s term for a post) entries.
The dataset comprises 12 files; the entry count for each is shown below.
File | Data Points |
---|---|
users.tsv | 454,458 |
follows.tsv | 4,002,115 |
truths.tsv | 823,927 |
quotes.tsv | 10,508 |
replies.tsv | 506,276 |
media.tsv | 184,884 |
hashtags.tsv | 21,599 |
external_urls.tsv | 173,947 |
truth_hashtag_edges.tsv | 213,295 |
truth_media_edges.tsv | 257,500 |
truth_external_url_edges.tsv | 252,877 |
truth_user_tag_edges.tsv | 145,234 |
A readme file is provided that describes the structure of the files, necessary terms, and necessary information about the data collection.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
Youtube social network and ground-truth communities

Dataset information

Youtube is a video-sharing website that includes a social network. In the Youtube social network, users form friendships with each other, and users can create groups that other users can join. We consider such user-defined groups as ground-truth communities. This data is provided by Alan Mislove et al.

We regard each connected component in a group as a separate ground-truth community. We remove the ground-truth communities that have fewer than 3 nodes. We also provide the top 5,000 highest-quality communities, as described in our paper. For the network, we provide the largest connected component.
More info: https://snap.stanford.edu/data/com-Youtube.html
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
https://snap.stanford.edu/data/com-Youtube.html
Dataset information
Youtube (http://www.youtube.com/) is a video-sharing website that includes a social network. In the Youtube social network, users form friendships with each other, and users can create groups that other users can join. We consider such user-defined groups as ground-truth communities. This data is provided by Alan Mislove et al. (http://socialnetworks.mpi-sws.org/data-imc2007.html).

We regard each connected component in a group as a separate ground-truth community. We remove the ground-truth communities that have fewer than 3 nodes. We also provide the top 5,000 highest-quality communities, as described in our paper (http://arxiv.org/abs/1205.6233). For the network, we provide the largest connected component.
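The extraction rule above (each connected component of a group is one ground-truth community; drop components with fewer than 3 nodes) can be sketched in a few lines. The friendship graph and group below are toy data, not the actual SNAP files:

```python
from collections import deque

def group_communities(adj, group, min_size=3):
    """Split a user-defined group into connected components of the
    friendship graph restricted to the group; keep components with
    at least `min_size` nodes (the ground-truth communities)."""
    members = set(group)
    seen = set()
    communities = []
    for start in members:
        if start in seen:
            continue
        # BFS within the group only
        comp = {start}
        queue = deque([start])
        seen.add(start)
        while queue:
            u = queue.popleft()
            for v in adj.get(u, ()):
                if v in members and v not in seen:
                    seen.add(v)
                    comp.add(v)
                    queue.append(v)
        if len(comp) >= min_size:
            communities.append(comp)
    return communities

# Toy friendship graph: 1-2-3 connected, 4-5 connected, 6 isolated
adj = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4], 6: []}
# The group {1..6} splits into components {1,2,3}, {4,5}, {6};
# only {1,2,3} survives the min-size filter.
print(group_communities(adj, [1, 2, 3, 4, 5, 6]))  # [{1, 2, 3}]
```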
Network statistics

Statistic | Value |
---|---|
Nodes | 1,134,890 |
Edges | 2,987,624 |
Nodes in largest WCC | 1,134,890 (1.000) |
Edges in largest WCC | 2,987,624 (1.000) |
Nodes in largest SCC | 1,134,890 (1.000) |
Edges in largest SCC | 2,987,624 (1.000) |
Average clustering coefficient | 0.0808 |
Number of triangles | 3,056,386 |
Fraction of closed triangles | 0.002081 |
Diameter (longest shortest path) | 20 |
90-percentile effective diameter | 6.5 |
Community statistics

Statistic | Value |
---|---|
Number of communities | 8,385 |
Average community size | 13.50 |
Average membership size | 0.10 |
Source (citation)
J. Yang and J. Leskovec. Defining and Evaluating Network Communities based
on Ground-truth. ICDM, 2012. http://arxiv.org/abs/1205.6233
Files

File | Description |
---|---|
com-youtube.ungraph.txt.gz | Undirected Youtube network |
com-youtube.all.cmty.txt.gz | Youtube communities |
com-youtube.top5000.cmty.txt.gz | Youtube communities (Top 5,000) |
The graph in the SNAP data set is 1-based, with nodes numbered 1 to
1,157,827.
In the SuiteSparse Matrix Collection, Problem.A is the undirected Youtube
network, a matrix of size n-by-n with n=1,134,890, which is the number of
unique user id's appearing in any edge.
Problem.aux.nodeid is a list of the node id's that appear in the SNAP data
set. A(i,j)=1 if person nodeid(i) is friends with person nodeid(j). The
node id's are the same as the SNAP data set (1-based).
C = Problem.aux.Communities_all is a sparse matrix of size n by 16,386
which represents the communities in the com-youtube.all.cmty.txt file.
The kth line in that file defines the kth community, and is the column
C(:,k), where C(i,k)=1 if person nodeid(i) is in the kth community. Row
C(i,:) and row/column i of the A matrix thus refer to the same person,
nodeid(i).
Ctop = Problem.aux.Communities_top5000 is n-by-5000, with the same
structure as the C array above, with the content of the
com-youtube.top5000.cmty.txt.gz file.
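The relationship between `Problem.A`, `nodeid`, and the community matrices described above can be illustrated with a toy example. The plain-Python lists below are stand-ins for the sparse matrices; all sizes and values are illustrative, not taken from the collection:

```python
# Toy stand-ins for the SuiteSparse objects: A is the symmetric
# adjacency matrix (A[i][j] = 1 if person nodeid[i] is friends with
# person nodeid[j]); C maps community k to the set of row indices i
# with C(i,k) = 1, i.e. the members of the k-th community.
nodeid = [1, 2, 5, 7]            # 1-based SNAP ids appearing in edges
A = [[0, 1, 0, 1],
     [1, 0, 1, 0],
     [0, 1, 0, 0],
     [1, 0, 0, 0]]
C = {0: {0, 1}, 1: {0, 3}}       # community k -> member row indices

def friends_of(i):
    """Row i of A decoded back to SNAP node ids."""
    return [nodeid[j] for j, x in enumerate(A[i]) if x]

def communities_of(i):
    """Columns k of C with C(i,k) = 1."""
    return [k for k, members in C.items() if i in members]

# Row/column i of A and row i of C refer to the same person nodeid[i]:
print(friends_of(0))        # [2, 7]
print(communities_of(0))    # [0, 1]
```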
Task
Fake news has become one of the main threats to our society. Although fake news is not a new phenomenon, the exponential growth of social media has offered an easy platform for its fast propagation. A great amount of fake news and rumors is propagated in online social networks, usually with the aim of deceiving users and shaping specific opinions. Users play a critical role in the creation and propagation of fake news online by consuming and sharing articles with inaccurate information, either intentionally or unintentionally. To this end, in this task, we aim at identifying possible fake news spreaders on social media as a first step towards preventing fake news from being propagated among online users.
After having addressed several aspects of author profiling in social media from 2013 to 2019 (bot detection, age and gender, also together with personality, gender and language variety, and gender from a multimodality perspective), this year we aim at investigating whether it is possible to discriminate authors that have shared some fake news in the past from those that, to the best of our knowledge, have never done so.
As in previous years, we propose the task from a multilingual perspective:
NOTE: Although we recommend participating in both languages (English and Spanish), it is possible to address the problem for just one language.
Data
Input
The uncompressed dataset consists of a folder per language (en, es). Each folder contains:
The format of the XML files is:
The format of the truth.txt file is as follows. The first column corresponds to the author id. The second column contains the truth label.
b2d5748083d6fdffec6c2d68d4d4442d:::0
2bed15d46872169dc7deaf8d2b43a56:::0
8234ac5cca1aed3f9029277b2cb851b:::1
5ccd228e21485568016b4ee82deb0d28:::0
60d068f9cafb656431e62a6542de2dc0:::1
...
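A minimal sketch for parsing this `:::`-separated format; the sample entries are taken from the excerpt above, and the helper name is ours:

```python
def parse_truth(lines):
    """Map author id -> truth label (0 or 1) from truth.txt-style lines."""
    labels = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        author_id, label = line.split(":::")
        labels[author_id] = int(label)
    return labels

sample = [
    "b2d5748083d6fdffec6c2d68d4d4442d:::0",
    "60d068f9cafb656431e62a6542de2dc0:::1",
]
print(parse_truth(sample))
```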
Output
Your software must take as input the absolute path to an unpacked dataset, and has to output for each document of the dataset a corresponding XML file that looks like this:
The naming of the output files is up to you. However, we recommend using the author id as the filename and "xml" as the extension.
IMPORTANT! Languages should not be mixed. Create one folder per language and place inside it only the prediction files for that language.
Evaluation
The performance of your system will be ranked by accuracy. For each language, we will calculate individual accuracies in discriminating between the two classes. Finally, we will average the accuracy values per language to obtain the final ranking.
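The ranking procedure above (per-language accuracy, then the average across languages) can be sketched as follows, with made-up predictions and labels:

```python
def accuracy(pred, truth):
    """Fraction of authors whose predicted label matches the truth."""
    correct = sum(1 for author, y in truth.items() if pred.get(author) == y)
    return correct / len(truth)

def final_score(per_language):
    """Average of the per-language accuracies; `per_language` maps
    a language code to a (predictions, truth) pair of dicts."""
    accs = [accuracy(pred, truth) for pred, truth in per_language.values()]
    return sum(accs) / len(accs)

# Hypothetical predictions and ground truth:
truth_en = {"a1": 0, "a2": 1, "a3": 1, "a4": 0}
pred_en  = {"a1": 0, "a2": 1, "a3": 0, "a4": 0}   # 3/4 correct
truth_es = {"b1": 1, "b2": 0}
pred_es  = {"b1": 1, "b2": 0}                     # 2/2 correct
print(final_score({"en": (pred_en, truth_en), "es": (pred_es, truth_es)}))  # 0.875
```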
Submission
Once you have finished tuning your approach on the validation set, your software will be tested on the test set. During the competition, the test set will not be released publicly. Instead, we ask you to submit your software for evaluation at our site as described below.
We ask you to prepare your software so that it can be executed via command line calls. The command shall take as input (i) an absolute path to the directory of the test corpus and (ii) an absolute path to an empty output directory:
mySoftware -i INPUT-DIRECTORY -o OUTPUT-DIRECTORY
Within OUTPUT-DIRECTORY, we require two subfolders, en and es, one per language. As the provided output directory is guaranteed to be empty, your software needs to create those subfolders. Within each of these subfolders, you need to create one xml file per author. The xml file looks like this:
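The required folder layout can be produced with a sketch like the following. The XML body written here is only a placeholder comment, since the actual per-author XML format is specified elsewhere in the task description; the function name and label values are ours:

```python
import os
import tempfile

def write_predictions(output_dir, predictions):
    """Create one subfolder per language inside the (empty) output
    directory and one XML file per author, named <author-id>.xml.
    `predictions` maps language -> {author_id: label}."""
    for lang, authors in predictions.items():
        lang_dir = os.path.join(output_dir, lang)
        os.makedirs(lang_dir, exist_ok=True)
        for author_id, label in authors.items():
            path = os.path.join(lang_dir, f"{author_id}.xml")
            with open(path, "w", encoding="utf-8") as f:
                # Placeholder body: replace with the XML format
                # required by the task.
                f.write(f"<!-- prediction for {author_id}: {label} -->\n")

out = tempfile.mkdtemp()
write_predictions(out, {"en": {"a1": 0}, "es": {"b1": 1}})
print(sorted(os.listdir(out)))  # ['en', 'es']
```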
The naming of the output files is up to you. However, we recommend using the author id as the filename and "xml" as the extension.
Note: By submitting your software you retain full copyrights. You agree to grant us usage rights only for the purpose of the PAN competition. We agree not to share your software with a third party or use it for other purposes than the PAN competition.
Related Work
The data comes from the Harvard Dataverse and covers political trust and regime support in China, together with self-monitoring, which measures participants' desire for social desirability. The authors, Nicholson and Huang, obtained the data via a standard survey experiment that contains an embedded list experiment. The list experiment aspect is significant because list experiments are an "indirect way to gauge overreporting" (Nicholson and Huang).

The data can help in understanding Chinese politics, such as how support varies at different levels of government and how overreporting is affected by a person's social desirability. It can be used in government classes and coding classes, for instance when learning about ordered logit models and simple bar graphs; a regression should not be used.

The data could also be used to compare levels of trust across regime types. It would be interesting to compare the results of other authoritarian countries, such as Turkey and Vietnam, to the results of these datasets from China; data from these countries could in turn be compared to democracies. People in authoritarian countries underreport and might not always tell the truth, so there is a chance that authoritarian countries could show levels of reported trust similar to democratic countries; the list experiment design reduces some of this underreporting. Finally, the data can be used to examine whether certain demographic groups show more or less support for their government, for example by gender, age, or education level.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Pollution of online social spaces caused by rampant dis- and misinformation is a growing societal concern. However, recent decisions to reduce access to social media APIs are causing a shortage of publicly available, recent social media data, thus hindering the advancement of computational social science as a whole. We present a large, high-coverage dataset of social interactions and user-generated content from Bluesky Social to address this pressing issue. The dataset contains the complete post history of over 4M users (81% of all registered accounts), totalling 235M posts. We also make available social data covering follow, comment, repost, and quote interactions. Since Bluesky allows users to create and like feed generators (i.e., content recommendation algorithms), we also release the full output of several popular algorithms available on the platform, along with their timestamped “like” interactions. This dataset allows novel analysis of online behavior and human-machine engagement patterns. Notably, it provides ground-truth data for studying the effects of content exposure and self-selection, and for performing content virality and diffusion analysis.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
# Dataset of adaptive Children-Robot Interaction for Education based on Autonomous Multimodal Users’ Readings
## Background
This dataset is generated from multiple interactions between a Social Robot (NAO) and 5th grade students from a private school in São Paulo, Brazil.
In the interaction, the robot covered, with the participating students, the content the teachers were teaching at the time, about the waste management system in Brazil.
The measures here are the readings that the R-CASTLE system made for each answer the students gave to the questions the robot asked.
For more information about how these measures were collected, please refer to this thesis at: https://doi.org/10.11606/T.55.2020.tde-31082020-093935
Since the goal of R-CASTLE is to provide autonomous adaptation, we built a ground-truth dataset based on feedback from an education expert operating the robot in loco. This person teleoperated the robot, changing its behaviour (or not) according to observed measures of the participants: face gaze, displayed facial emotion, number of spoken words, correctness of the answer (based on pre-defined answers), and the time students took to answer. These measures form the first 5 columns of the CSV file. The evaluator could decide to increase (1), maintain (0), or decrease (-1) the difficulty level of the following questions depending on these observed measures. This is the human true label, stored in the 6th column.
## Description:
Each row of this file is a tuple: the robot's autonomous readings in the first 5 columns, the true label in the 6th column (True Value), and the output of the fuzzy classification in the 7th column (Final Crisp Value).
Deviations (integer): number of face deviations of the participant during the question answering identified by the system.
EmotionCount (integer): a balance between "good" and "bad" emotions (good - bad) identified by the system.
NumberWord (integer): number of words comprised in the sentence the participant gave.
SucRate/Ans/RWa (between 0 and 1, where 0 is completely wrong and 1 is completely right): the success rate of the participant’s answer to that question, based on the expected answer programmed by their teachers.
Time2ans (float): the time, in seconds, from when the robot finished asking the question until the end of the participant’s speech.
True Value (-1, 0, 1): Ground-truth value. Value of adaptation chosen by the human observing the interaction if the system needed to decrease, maintain, or increase the level of difficulty of asked questions.
Final Crisp Value (float): value of calculated fuzzy output based on the implementations in the paper: https://doi.org/10.1145/3395035.3425201
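A minimal sketch for reading one row in the 7-column layout described above; the sample row, file handling, and field names are hypothetical:

```python
import csv
import io

# Hypothetical row in the 7-column layout: Deviations, EmotionCount,
# NumberWord, SucRate, Time2ans, True Value, Final Crisp Value
sample = "2,-1,14,0.75,6.3,0,0.12\n"

def parse_rows(text):
    """Decode CSV text into typed records matching the column spec."""
    rows = []
    for rec in csv.reader(io.StringIO(text)):
        rows.append({
            "deviations": int(rec[0]),        # face deviations count
            "emotion_count": int(rec[1]),     # good - bad emotions
            "number_words": int(rec[2]),      # words in the answer
            "success_rate": float(rec[3]),    # in [0, 1]
            "time_to_answer_s": float(rec[4]),
            "true_value": int(rec[5]),        # -1, 0, or 1
            "final_crisp_value": float(rec[6]),
        })
    return rows

row = parse_rows(sample)[0]
print(row["true_value"], row["success_rate"])  # 0 0.75
```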
## Creators
Daniel Tozadore: dtozadore@gmail.com
Roseli Romero: rafrance@icmc.usp.br
## License:
[Creative Commons Licenses](https://creativecommons.org/share-your-work/cclicenses/)
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
As the COVID-19 virus quickly spreads around the world, misinformation related to COVID-19 is, unfortunately, also created and spreads like wildfire. Such misinformation has caused confusion among people, disruptions in society, and even deadly health consequences. Being able to understand, detect, and mitigate such COVID-19 misinformation therefore has not only deep intellectual value but also huge societal impact. This dataset was created to help researchers combat COVID-19 health misinformation.
The dataset is a diverse COVID-19 healthcare misinformation dataset, including fake news on websites and social platforms, along with users' social engagement with such news. It includes 4,251 news articles, 296,000 related user engagements, 926 social platform posts about COVID-19, and ground truth labels.
Version 0.1 (05/17/2020): initial version, corresponding to the arXiv paper "CoAID: COVID-19 Healthcare Misinformation Dataset"
Version 0.2 (08/03/2020): added data from May 1, 2020 through July 1, 2020
Version 0.3 (11/03/2020): added data from July 1, 2020 through September 1, 2020
Limeng Cui and Dongwon Lee, Pennsylvania State University.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A Twitter follower-followee graph with 269,640 nodes and 6,818,501 edges from [Kwak]; ground-truth labels are obtained from [SybilSCAR]. Among the nodes, 178,377 are benign and 91,263 are Sybil. We set aside 9,000 Sybil and 17,000 benign users (about 10%) as the training set and test on the overall social graph.
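The training split described above can be sketched as follows; the toy run scales the counts down, while the real split samples 9,000 Sybil and 17,000 benign users:

```python
import random

def train_test_split(labels, n_sybil=9_000, n_benign=17_000, seed=0):
    """Sample a fixed number of Sybil (label 1) and benign (label 0)
    nodes as the training set; the rest of the labeled nodes are
    evaluated on the full graph. Counts follow the description above."""
    rng = random.Random(seed)
    sybil = [n for n, y in labels.items() if y == 1]
    benign = [n for n, y in labels.items() if y == 0]
    train = set(rng.sample(sybil, n_sybil)) | set(rng.sample(benign, n_benign))
    return train

# Tiny illustrative run with scaled-down counts: 40 Sybil, 60 benign
labels = {i: (1 if i < 40 else 0) for i in range(100)}
train = train_test_split(labels, n_sybil=4, n_benign=6)
print(len(train))  # 10
```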
H. Kwak, C. Lee, H. Park, and S. Moon, “What is Twitter, a social network or a news media?” in WWW, 2010.
B. Wang, L. Zhang, and N. Z. Gong, “SybilSCAR: Sybil detection in online social networks via local rule based propagation,” in IEEE INFOCOM, 2017.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Social networks are a battlefield for political propaganda. Protected by the anonymity of the internet, political actors use computational propaganda to influence the masses. Their methods include the use of synchronized or individual bots, multiple accounts operated by one social media management tool, or various manipulations of search engines and social network algorithms, all aiming to promote their ideology. While computational propaganda influences modern society, it is hard to measure or detect. Furthermore, with the recent exponential growth of large language models (LLMs) and growing concerns about information overload, which make the alternative truth spheres noisier than ever before, the complexity and magnitude of computational propaganda are expected to increase, making detection even harder.

Propaganda in social networks is disguised as legitimate news sent from authentic users, smartly blending real users with fake accounts. We seek here to detect efforts to manipulate the spread of information in social networks through one of the fundamental macro-scale properties of rhetoric: repetitiveness. We use 16 datasets of a total size of 13 GB, 10 related to political topics and 6 related to non-political ones (large-scale disasters), each ranging from tens of thousands to a few million tweets. We compare them and identify statistical and network properties that distinguish between these two types of information cascades. These features are based on the repetition distribution of hashtags and user mentions, as well as on the network structure. Together, they enable us to distinguish (p-value = 0.0001) between the two classes of information cascades. In addition to constructing a bipartite graph connecting words and tweets for each cascade, we develop a quantitative measure and show how it can be used to distinguish between political and non-political discussions.

Our method is indifferent to the cascade’s country of origin, language, or cultural background, since it is based only on the statistical properties of repetitiveness and word appearance in the tweet bipartite network structures.
Task
Hate speech (HS) is commonly defined as any communication that disparages a person or a group on the basis of some characteristic such as race, colour, ethnicity, gender, sexual orientation, nationality, religion, or other characteristics. Given the huge amount of user-generated content on Twitter, the problem of detecting, and therefore possibly countering, the diffusion of HS is becoming fundamental, for instance for fighting against misogyny and xenophobia. To this end, in this task, we aim at identifying possible hate speech spreaders on Twitter as a first step towards preventing hate speech from being propagated among online users.
After having addressed several aspects of author profiling in social media from 2013 to 2020 (fake news spreaders, bot detection, age and gender, also together with personality, gender and language variety, and gender from a multimodality perspective), this year we aim at investigating if it is possible to discriminate authors that have shared some hate speech in the past from those that, to the best of our knowledge, have never done it.
As in previous years, we propose the task from a multilingual perspective:
NOTE: Although we recommend participating in both languages (English and Spanish), it is possible to address the problem just for one language.
Award
We are happy to announce that the best performing team at the 9th International Competition on Author Profiling will be awarded 300 Euro, sponsored by Symanto.
Data
Input
The uncompressed dataset consists of a folder per language (en, es). Each folder contains:
The format of the XML files is:
The format of the truth.txt file is as follows. The first column corresponds to the author id. The second column contains the truth label.
b2d5748083d6fdffec6c2d68d4d4442d:::0
2bed15d46872169dc7deaf8d2b43a56:::0
8234ac5cca1aed3f9029277b2cb851b:::1
5ccd228e21485568016b4ee82deb0d28:::0
60d068f9cafb656431e62a6542de2dc0:::1
...
Output
Your software must take as input the absolute path to an unpacked dataset, and has to output for each document of the dataset a corresponding XML file that looks like this:
The naming of the output files is up to you. However, we recommend using the author id as the filename and "xml" as the extension.
IMPORTANT! Languages should not be mixed. Create one folder per language and place inside it only the prediction files for that language.
Evaluation
The performance of your system will be ranked by accuracy. For each language, we will calculate individual accuracies in discriminating between the two classes. Finally, we will average the accuracy values per language to obtain the final ranking.
Related Work
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Residential School Locations Dataset [IRS_Locations.csv] contains the locations (latitude and longitude) of Residential Schools and student hostels operated by the federal government in Canada. All the residential schools and hostels listed in the Indian Residential School Settlement Agreement are included in this dataset, as well as several Industrial schools and residential schools that were not part of the IRSSA. This version of the dataset doesn’t include the five schools under the Newfoundland and Labrador Residential Schools Settlement Agreement.

The original school location data was created by the Truth and Reconciliation Commission and was provided to the researcher (Rosa Orlandini) by the National Centre for Truth and Reconciliation in April 2017. The dataset was created by Rosa Orlandini, and builds upon and enhances the previous work of the Truth and Reconciliation Commission, Morgan Hite (creator of the Atlas of Indian Residential Schools in Canada that was produced for the Tk'emlups First Nation and Justice for Day Scholar's Initiative), and Stephanie Pyne (project lead for the Residential Schools Interactive Map). Each individual school location in this dataset is attributed to RSIM, Morgan Hite, NCTR, or Rosa Orlandini.

Many schools/hostels had several locations throughout the history of the institution. If the school/hostel moved from its original location to another property, then the school is considered to have two unique locations in this dataset: the original location and the new location. For example, Lejac Indian Residential School had two locations while it was operating, Stuart Lake and Fraser Lake. If a new school building was constructed on the same property as the original school building, it isn't considered to be a new location, as is the case of Girouard Indian Residential School.

When the precise location is known, the coordinates of the main building are provided; when the precise location of the building isn’t known, an approximate location is provided. For each residential school institution location, the following information is provided: official names, alternative names, dates of operation, religious affiliation, latitude and longitude coordinates, community location, Indigenous community name, contributor (of the location coordinates), school/institution photo (when available), location point precision, type of school (hostel or residential school), and a list of references used to determine the location of the main buildings or sites.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The replication material consists of the following files:
1. hh_replication.do contains the code that was used to clean the data and create all tables and figures in the paper.
2. hh_replication.txt corresponds to hh_replication.do (converted to .txt for ease of access for Stata non-users).
3. hh_replicationdata.dta (or alternatively hh_replicationdata.csv) contains the dataset (Stata dataset; alternatively as a comma-separated version).
4. hh_log.txt is a log file that contains all numerical results reported in the article and the online appendix.
5. Additional data files necessary to recreate the maps reported in the Online Appendix, Figures B.1-B.2:
   - B.1: guadm2.dta, guacoord2.dta, and guatecapital.dta
   - B.2: nepadm3.dta, nepcoord3.dta, and nepalcapital.dta
Previous studies have demonstrated the advantages of physical behaviour, such as physical activity and sleep, for mental health, and have shown an association between virtual behaviour, such as social media use and screen time, and mental health problems. Here, physical behaviour is defined as data about the person in the physical world, such as physical activity and sleep, and virtual behaviour is defined as behaviour involving the internet, such as social networks, general web browsing, and instant messaging. We believe that a person's physical or virtual behaviour individually may not be the best indicator of their mental health, and current datasets do not include data on both physical and virtual behaviours. Therefore, we seek to run a data collection study that collects both physical and virtual behaviours. Additionally, we investigate whether machine learning models that include both physical and virtual behaviour can better predict mental health. This study is conducted using data collected via a custom-made app. The app runs in the background of a user's smartphone, passively collecting physical activity, sleep, location, and audio inferences. Additionally, it offers users an ecological momentary assessment (EMA) platform where they may log information about their feelings and other significant occurrences through the Warwick-Edinburgh Mental Wellbeing survey. This provides us with a ground truth to evaluate our models. We also collect social media data through Instagram and YouTube logs sent by the participants at the end of the study.
Social media bots pose as humans to influence users with commercial, political or ideological purposes. For example, bots could artificially inflate the popularity of a product by promoting it and/or writing positive ratings, as well as undermine the reputation of competitive products through negative valuations. The threat is even greater when the purpose is political or ideological (see Brexit referendum or US Presidential elections). Fearing the effect of this influence, the German political parties have rejected the use of bots in their electoral campaign for the general elections. Furthermore, bots are commonly related to fake news spreading. Therefore, to approach the identification of bots from an author profiling perspective is of high importance from the point of view of marketing, forensics and security.
After having addressed several aspects of author profiling in social media from 2013 to 2018 (age and gender, also together with personality, gender and language variety, and gender from a multimodality perspective), this year we aim at investigating whether the author of a Twitter feed is a bot or a human. Furthermore, in case of human, to profile the gender of the author.
The uncompressed dataset consists of a folder per language (en, es). Each folder contains:
An XML file per author (Twitter user) with 100 tweets. The name of the XML file corresponds to the unique author id.
A truth.txt file with the list of authors and the ground truth.
Friendster is an online gaming network. Before re-launching as a gaming website, Friendster was a social networking site where users could form friendship edges with each other. The Friendster social network also allowed users to form groups that other members could then join. The Friendster dataset consists of ground-truth communities (based on user-defined groups) and the social network given by the induced subgraph on the nodes that either belong to at least one community or are connected to nodes that belong to at least one community.
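The induced-subgraph construction described above (keep nodes that belong to at least one community or neighbour such a node, then induce the network on them) can be sketched with toy data:

```python
def induced_subgraph(edges, communities):
    """Keep nodes that belong to at least one community or are
    adjacent to such a node, then induce the subgraph on them."""
    members = set().union(*communities)
    keep = set(members)
    for u, v in edges:
        # Neighbours of community members are also kept
        if u in members or v in members:
            keep.update((u, v))
    return keep, [(u, v) for u, v in edges if u in keep and v in keep]

# Toy data: community {1, 2}; node 3 neighbours 2; 4 and 5 are dropped
edges = [(1, 2), (2, 3), (4, 5)]
nodes, sub = induced_subgraph(edges, [{1, 2}])
print(sorted(nodes), sub)  # [1, 2, 3] [(1, 2), (2, 3)]
```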