Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A Truth Social data set containing a network of users, their associated posts, and additional information about each post. Collected from February 2022 through September 2022, this dataset contains 454,458 user entries and 845,060 Truth entries ("Truth" is Truth Social's term for a post).
The dataset comprises 12 files; the entry count for each file is shown below.
| File | Data Points |
|---|---|
| users.tsv | 454,458 |
| follows.tsv | 4,002,115 |
| truths.tsv | 823,927 |
| quotes.tsv | 10,508 |
| replies.tsv | 506,276 |
| media.tsv | 184,884 |
| hashtags.tsv | 21,599 |
| external_urls.tsv | 173,947 |
| truth_hashtag_edges.tsv | 213,295 |
| truth_media_edges.tsv | 257,500 |
| truth_external_url_edges.tsv | 252,877 |
| truth_user_tag_edges.tsv | 145,234 |
A readme file is provided that describes the structure of the files, defines key terms, and gives relevant details about the data collection.
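As a minimal sketch of how these files might be combined into a network, assuming tab-separated files with header rows and follower/followed edge columns (the actual schema is defined in the readme):

```python
import networkx as nx
import pandas as pd

# Column names ("follower", "followed") are assumptions; the readme
# shipped with the dataset defines the actual schema.
users = pd.read_csv("users.tsv", sep="\t")
follows = pd.read_csv("follows.tsv", sep="\t")

# Build the directed follower graph from the follows edge list.
G = nx.from_pandas_edgelist(
    follows, source="follower", target="followed", create_using=nx.DiGraph
)
print(G.number_of_nodes(), "users,", G.number_of_edges(), "follow edges")
```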
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This dataset contains posts and interactions from Donald J. Trump's Truth Social account, specifically during his 2024 U.S. Presidential election campaign. Each post entry provides detailed information, including the post content, number of replies, shares, likes, and metadata such as post date, media URLs (if available), and account details. The data offers a rich source for analyzing political messaging, engagement metrics, and audience reactions during the campaign period.
The posts are sourced directly from Trump's official Truth Social profile, capturing interactions that are publicly available.
The dataset may not include every post or interaction due to scraping limitations, and some interactions might lack context or additional details that could affect interpretability.
This dataset is intended for research and analysis purposes. Please ensure that any use of the data complies with Truth Social's terms of service and applicable copyright laws.
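For the engagement analysis the description mentions, a minimal sketch, assuming a CSV export with columns named date, likes, replies, and shares (the real field names may differ):

```python
import pandas as pd

# File and column names are assumptions; check the export's actual schema.
posts = pd.read_csv("trump_truth_social_posts.csv", parse_dates=["date"])

# Monthly engagement totals over the campaign period.
metrics = posts.set_index("date")[["likes", "replies", "shares"]]
print(metrics.resample("MS").sum().tail())
```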
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Truth Social Dataset (Limited Version)
License: CC BY 4.0
Overview
This dataset contains posts and comments scraped from Truth Social, focusing on Donald Trump's posts ("Truths" and "Retruths"). Due to the initial collection method, all media and URLs were excluded; future versions will include complete post data, including images and links. The dataset contains 31.8 million comments and over 18,000 posts, all by Trump, and logs over 1.5 million unique users who commented. See the full description on the dataset page: https://huggingface.co/datasets/notmooodoo9/TrumpsTruthSocialPosts.
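Since the dataset is hosted on the Hugging Face Hub, it can be pulled with the `datasets` library; the available splits and columns should be checked against the dataset card:

```python
from datasets import load_dataset

# The repo id comes from the dataset page URL above.
ds = load_dataset("notmooodoo9/TrumpsTruthSocialPosts")
print(ds)  # inspect splits and column names
```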
Public Domain (CC0 1.0): https://creativecommons.org/publicdomain/zero/1.0/
Youtube social network and ground-truth communities
Dataset information
Youtube is a video-sharing web site that includes a social network. In the Youtube social network, users form friendships with each other, and users can create groups that other users can join. We consider such user-defined groups as ground-truth communities. This data is provided by Alan Mislove et al.
We regard each connected component in a group as a separate ground-truth community. We remove the ground-truth communities that have fewer than 3 nodes. We also provide the 5,000 highest-quality communities, as described in our paper. As for the network, we provide the largest connected component.
More info: https://snap.stanford.edu/data/com-Youtube.html
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
https://snap.stanford.edu/data/com-Youtube.html
Dataset information
Youtube (http://www.youtube.com/) is a video-sharing web site that includes a social network. In the Youtube social network, users form friendships with each other, and users can create groups that other users can join. We consider such user-defined groups as ground-truth communities. This data is provided by Alan Mislove et al. (http://socialnetworks.mpi-sws.org/data-imc2007.html).
We regard each connected component in a group as a separate ground-truth community. We remove the ground-truth communities that have fewer than 3 nodes. We also provide the 5,000 highest-quality communities, as described in our paper (http://arxiv.org/abs/1205.6233). As for the network, we provide the largest connected component.
Network statistics
| Statistic | Value |
|---|---|
| Nodes | 1,134,890 |
| Edges | 2,987,624 |
| Nodes in largest WCC | 1,134,890 (1.000) |
| Edges in largest WCC | 2,987,624 (1.000) |
| Nodes in largest SCC | 1,134,890 (1.000) |
| Edges in largest SCC | 2,987,624 (1.000) |
| Average clustering coefficient | 0.0808 |
| Number of triangles | 3,056,386 |
| Fraction of closed triangles | 0.002081 |
| Diameter (longest shortest path) | 20 |
| 90-percentile effective diameter | 6.5 |
Community statistics
| Statistic | Value |
|---|---|
| Number of communities | 8,385 |
| Average community size | 13.50 |
| Average membership size | 0.10 |
Source (citation)
J. Yang and J. Leskovec. Defining and Evaluating Network Communities based on Ground-truth. ICDM, 2012. http://arxiv.org/abs/1205.6233
Files
| File | Description |
|---|---|
| com-youtube.ungraph.txt.gz | Undirected Youtube network |
| com-youtube.all.cmty.txt.gz | Youtube communities |
| com-youtube.top5000.cmty.txt.gz | Youtube communities (Top 5,000) |
The graph in the SNAP data set is 1-based, with nodes numbered 1 to 1,157,827.
In the SuiteSparse Matrix Collection, Problem.A is the undirected Youtube network, a matrix of size n-by-n with n=1,134,890, which is the number of unique user ids appearing in any edge.
Problem.aux.nodeid is a list of the node ids that appear in the SNAP data set. A(i,j)=1 if person nodeid(i) is friends with person nodeid(j). The node ids are the same as in the SNAP data set (1-based).
C = Problem.aux.Communities_all is a sparse matrix of size n by 16,386 which represents the communities in the com-youtube.all.cmty.txt file. The kth line in that file defines the kth community, and is the column C(:,k), where C(i,k)=1 if person ...
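A hedged sketch of reading the SuiteSparse form of this problem in Python, assuming the MATLAB-format (.mat) export of SNAP/com-Youtube has been downloaded from sparse.tamu.edu; the field names follow the Problem struct described above:

```python
from scipy.io import loadmat

# These two flags make nested struct fields accessible as attributes.
mat = loadmat("com-Youtube.mat", squeeze_me=True, struct_as_record=False)
problem = mat["Problem"]

A = problem.A                    # n-by-n friendship matrix, n = 1,134,890
nodeid = problem.aux.nodeid      # 1-based SNAP node ids
C = problem.aux.Communities_all  # n-by-16,386 community membership matrix

# SNAP ids of the members of community k: the nonzero rows of column k.
k = 0
members = nodeid[C[:, k].nonzero()[0]]
print(len(members), "members in community", k)
```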
This dataset was crawled from three popular online social networks (OSNs): Twitter, Facebook and Foursquare. We collected it as follows. We first gathered a set of Singapore-based Twitter users who declared Singapore as their location in their user profiles. From the Singapore-based Twitter users, we retrieved the subset who declared their Facebook or Foursquare accounts in their short bio descriptions. In total, we collected 1,998 Twitter-Facebook user identity pairs (known as TW-FB ground truth matching pairs) and 3,602 Twitter-Foursquare user identity pairs (known as TW-FQ ground truth matching pairs). To simulate a real-world setting, where a user identity in the source OSN may not have a corresponding matching user identity in the target OSN, we expanded the datasets by adding Twitter, Facebook and Foursquare users who are connected to users in the TW-FB and TW-FQ ground truth matching pair sets. Note that isolated users who do not have links to other users were removed from the data sets. After collecting the datasets, we extracted the following user features using the OSNs' APIs:
- Username: The username of the account.
- Screen name: The natural name of the user account, usually formed using the first and last name of the user.
- Profile image: The thumbnail or image provided by the user to visually present herself.
- Network: The relationship links between users.
Custom license: https://dataverse.harvard.edu/api/datasets/:persistentId/versions/3.1/customlicense?persistentId=doi:10.7910/DVN/28119
THIS IS NO LONGER SUPPORTED. In ICEWS, an Event of Interest (EOI) is a macro-level occurrence within a country or region that is supported by the existence of multiple underlying events. The Ground Truth Data Set lists, for the EOIs supported, whether or not the EOI occurred in any given country for any given month, historically speaking. We plan to update this data on a periodic basis. The five EOIs currently supported in this data set are:
1. Domestic Political Crisis (DPC): Significant opposition to the government, but not to the level of rebellion or insurgency (e.g., power struggles between two political factions involving disruptive strikes or violent clashes between supporters).
2. Insurgency: Organized opposition whose objective is to overthrow the central government.
3. International Crisis: Conflict or elevated tensions that could lead to conflict between two or more states, or between a state and an actor operating primarily from beyond the state's borders, that involves the deployment of substantial ground forces (1,000+) beyond its borders.
4. Rebellion: Organized, active, violent opposition with substantial arms, where the objective is to seek autonomy or independence from the central government.
5. Ethnic/Religious Violence: Violence between ethnic or religious groups that is not specifically directed against the government.
Additional information about the ICEWS program can be found at http://www.icews.com/. Follow our Twitter handle for data updates and other news: @icews
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
https://snap.stanford.edu/data/com-LiveJournal.html
Dataset information
LiveJournal (http://www.livejournal.com/) is a free on-line blogging community where users declare friendships with each other. LiveJournal also allows users to form groups that other members can then join. We consider such user-defined groups as ground-truth communities. We provide the LiveJournal friendship social network and ground-truth communities.
We regard each connected component in a group as a separate ground-truth community. We remove the ground-truth communities that have fewer than 3 nodes. We also provide the 5,000 highest-quality communities, as described in our paper (http://arxiv.org/abs/1205.6233). As for the network, we provide the largest connected component.
Dataset statistics
| Statistic | Value |
|---|---|
| Nodes | 3,997,962 |
| Edges | 34,681,189 |
| Nodes in largest WCC | 3,997,962 (1.000) |
| Edges in largest WCC | 34,681,189 (1.000) |
| Nodes in largest SCC | 3,997,962 (1.000) |
| Edges in largest SCC | 34,681,189 (1.000) |
| Average clustering coefficient | 0.2843 |
| Number of triangles | 177,820,130 |
| Fraction of closed triangles | 0.04559 |
| Diameter (longest shortest path) | 17 |
| 90-percentile effective diameter | 6.5 |
Source (citation)
J. Yang and J. Leskovec. Defining and Evaluating Network Communities based on Ground-truth. ICDM, 2012. http://arxiv.org/abs/1205.6233
Files
| File | Description |
|---|---|
| com-lj.ungraph.txt.gz | Undirected LiveJournal network |
| com-lj.all.cmty.txt.gz | LiveJournal communities |
| com-lj.top5000.cmty.txt.gz | LiveJournal communities (Top 5,000) |
The graph in the SNAP data set is 0-based, with nodes numbered 0 to 4,036,537.
In the SuiteSparse Matrix Collection, Problem.A is the undirected LiveJournal network, a matrix of size n-by-n with n=3,997,962, which is the number of unique user ids appearing in any edge.
Problem.aux.nodeid is a list of the node ids that appear in the SNAP data set. A(i,j)=1 if person nodeid(i) is friends with person nodeid(j). The node ids are the same as in the SNAP data set (0-based).
C = Problem.aux.Communities_all is a sparse matrix of size n by 664,414 which represents the communities in the com-lj.all.cmty.txt file. The kth line in that file defines the kth community, and is the column C(:,k), where C(i,k)=1 if person nodeid(i) is in the kth community. Row C(i,:) and row/column i of the A matrix thus refer to the same person, nodeid(i).
Ctop = Problem.aux.Communities_top5000 is n-by-5000, with the same structure as the C array above, with the content of the com-lj.top5000.cmty.txt file.
Public Domain (CC0 1.0): https://creativecommons.org/publicdomain/zero/1.0/
The latest hot topic in the news is fake news and many are wondering what data scientists can do to detect it and stymie its viral spread. This dataset is only a first step in understanding and tackling this problem. It contains text and metadata scraped from 244 websites tagged as "bullshit" by the BS Detector Chrome Extension by Daniel Sieradski.
Warning: I did not modify the list of news sources from the BS Detector so as not to introduce my (useless) layer of bias; I'm not an authority on fake news. There may be sources whose inclusion you disagree with. It's up to you to decide how to work with the data and how you might contribute to "improving it". The labels of "bs" and "junksci", etc. do not constitute capital "t" Truth. If there are other sources you would like to include, start a discussion. If there are sources you believe should not be included, start a discussion or write a kernel analyzing the data. Or take the data and do something else productive with it. Kaggle's choice to host this dataset is not meant to express any particular political affiliation or intent.
The dataset contains text and metadata from 244 websites and represents 12,999 posts in total from the past 30 days. The data was pulled using the webhose.io API; because it's coming from their crawler, not all websites identified by the BS Detector are present in this dataset. Each website was labeled according to the BS Detector as documented here. Data sources that were missing a label were simply assigned a label of "bs". There are (ostensibly) no genuine, reliable, or trustworthy news sources represented in this dataset (so far), so don't trust anything you read.
For inspiration, I've included some (presumably non-fake) recent stories covering fake news in the news. This is a sensitive, nuanced topic and if there are other resources you'd like to see included here, please leave a suggestion. From defining fake, biased, and misleading news in the first place to deciding how to take action (a blacklist is not a good answer), there's a lot of information to consider beyond what can be neatly arranged in a CSV file.
We Tracked Down A Fake-News Creator In The Suburbs. Here's What We Learned (NPR)
Does Facebook Generate Over Half of its Revenue from Fake News? (Forbes)
If you have suggestions for improvements or would like to contribute, please let me know. The most obvious extensions are to include data from "real" news sites and to address the bias in the current list. I'd be happy to include any contributions in future versions of the dataset.
Thanks to Anthony for pointing me to Daniel Sieradski's BS Detector. Thank you to Daniel Nouri for encouraging me to add a disclaimer to the dataset's page.
Task
Fake news has become one of the main threats to our society. Although fake news is not a new phenomenon, the exponential growth of social media has offered an easy platform for its fast propagation. A great amount of fake news and rumors are propagated in online social networks, usually with the aim of deceiving users and shaping specific opinions. Users play a critical role in the creation and propagation of fake news online by consuming and sharing articles with inaccurate information, either intentionally or unintentionally. To this end, in this task, we aim at identifying possible fake news spreaders on social media as a first step towards preventing fake news from being propagated among online users.
After having addressed several aspects of author profiling in social media from 2013 to 2019 (bot detection; age and gender, also together with personality; gender and language variety; and gender from a multimodality perspective), this year we aim at investigating whether it is possible to discriminate authors that have shared some fake news in the past from those that, to the best of our knowledge, have never done so.
As in previous years, we propose the task from a multilingual perspective:
English
Spanish
NOTE: Although we recommend participating in both languages (English and Spanish), it is possible to address the problem for just one language.
Data
Input
The uncompressed dataset consists of one folder per language (en, es). Each folder contains:
An XML file per author (Twitter user) with 100 tweets. The name of the XML file corresponds to the unique author id.
A truth.txt file with the list of authors and the ground truth.
The format of the XML files is:
```xml
<author lang="en">
    <documents>
        <document><![CDATA[Tweet 1 textual contents]]></document>
        <document><![CDATA[Tweet 2 textual contents]]></document>
        ...
    </documents>
</author>
```
The format of the truth.txt file is as follows. The first column corresponds to the author id. The second column contains the truth label.
b2d5748083d6fdffec6c2d68d4d4442d:::0
2bed15d46872169dc7deaf8d2b43a56:::0
8234ac5cca1aed3f9029277b2cb851b:::1
5ccd228e21485568016b4ee82deb0d28:::0
60d068f9cafb656431e62a6542de2dc0:::1
...
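A minimal sketch for parsing the ground-truth file, where ':::' separates the author id from the label:

```python
# Build a dict mapping author id -> label (0 or 1).
labels = {}
with open("truth.txt") as f:
    for line in f:
        author_id, label = line.strip().split(":::")
        labels[author_id] = int(label)
print(len(labels), "labelled authors")
```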
Output
Your software must take as input the absolute path to an unpacked dataset and must output, for each author in the dataset, a corresponding XML file (the format is shown under Submission below).
The naming of the output files is up to you. However, we recommend using the author id as the filename and "xml" as the extension.
IMPORTANT! Languages should not be mixed. Create a folder for each language and place in it only the prediction files for that language.
Evaluation
The performance of your system will be ranked by accuracy. For each language, we will calculate individual accuracies in discriminating between the two classes. Finally, we will average the accuracy values per language to obtain the final ranking.
Submission
Once you have finished tuning your approach on the validation set, your software will be tested on the test set. During the competition, the test set will not be released publicly. Instead, we ask you to submit your software for evaluation at our site as described below.
We ask you to prepare your software so that it can be executed via command line calls. The command shall take as input (i) an absolute path to the directory of the test corpus and (ii) an absolute path to an empty output directory:
mySoftware -i INPUT-DIRECTORY -o OUTPUT-DIRECTORY
Within OUTPUT-DIRECTORY, we require two subfolders, en and es, one per language. As the provided output directory is guaranteed to be empty, your software needs to create those subfolders. Within each of these subfolders, you need to create one XML file per author.
The naming of the output files is up to you. However, we recommend using the author id as the filename and "xml" as the extension.
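The inline example of the output XML did not survive extraction; the sketch below follows the common PAN author-profiling convention of a single author element carrying the id, language, and predicted label. The attribute names are an assumption here, so verify them against the task page:

```python
import os

def write_prediction(out_dir, lang, author_id, label):
    # Predictions go into the required per-language subfolder (en/ or es/).
    lang_dir = os.path.join(out_dir, lang)
    os.makedirs(lang_dir, exist_ok=True)
    xml = f'<author id="{author_id}" lang="{lang}" type="{label}" />\n'
    with open(os.path.join(lang_dir, f"{author_id}.xml"), "w") as f:
        f.write(xml)

write_prediction("OUTPUT-DIRECTORY", "en", "b2d5748083d6fdffec6c2d68d4d4442d", 0)
```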
Note: By submitting your software you retain full copyrights. You agree to grant us usage rights only for the purpose of the PAN competition. We agree not to share your software with a third party or use it for other purposes than the PAN competition.
Related Work
Bilal Ghanem, Paolo Rosso, Francisco Rangel. An Emotional Analysis of False Information in Social Media and News Articles. arXiv preprint arXiv:1908.09951 (2019). ACM Transactions on Internet Technology (TOIT). In Press.
Anastasia Giachanou, Paolo Rosso, Fabio Crestani. Leveraging Emotional Signals for Credibility Detection. Proceedings of the 42nd International ACM Conference on Research and Development in Information Retrieval (SIGIR). pp 877–880. (2019)
Andre Guess, Jonathan Nagler, and Joshua Tucker. Less than you think: Prevalence and predictors of fake news dissemination on Facebook. Science Advances vol. 5 (2019)
Andrew Hall, Loren Terveen, Aaron Halfaker. Bot Detection in Wikidata Using Behavioral and Other Informal Cues. Proceedings of the ACM on Human-Computer Interaction. 2018 Nov 1;2(CSCW):64.
Kashyap Popat, Subhabrata Mukherjee, Andrew Yates, Gerhard Weikum. DeClarE: Debunking Fake News and False Claims using Evidence-Aware Deep Learning. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP). pp 22-32. (2018)
Francisco Rangel and Paolo Rosso. Overview of the 7th Author Profiling Task at PAN 2019: Bots and Gender Profiling in Twitter. In: L. Cappellato, N. Ferro, D. E. Losada and H. Müller (eds.) CLEF 2019 Labs and Workshops, Notebook Papers. CEUR Workshop Proceedings. CEUR-WS.org, vol. 2380.
Francisco Rangel, Paolo Rosso, Martin Potthast, Benno Stein. Overview of the 6th Author Profiling Task at PAN 2018: Multimodal Gender Identification in Twitter. In: CLEF 2018 Labs and Workshops, Notebook Papers. CEUR Workshop Proceedings. CEUR-WS.org, vol. 2125.
Francisco Rangel, Paolo Rosso, Martin Potthast, Benno Stein. Overview of the 5th Author Profiling Task at PAN 2017: Gender and Language Variety Identification in Twitter. In: Cappellato L., Ferro N., Goeuriot L., Mandl T. (Eds.) CLEF 2017 Labs and Workshops, Notebook Papers. CEUR Workshop Proceedings. CEUR-WS.org, vol. 1866.
Francisco Rangel, Paolo Rosso, Ben Verhoeven, Walter Daelemans, Martin Potthast, Benno Stein. Overview of the 4th Author Profiling Task at PAN 2016: Cross-Genre Evaluations. In: Balog K., Capellato L., Ferro N., Macdonald C. (Eds.) CLEF 2016 Labs and Workshops, Notebook Papers. CEUR Workshop Proceedings. CEUR-WS.org, vol. 1609, pp. 750-784.
Francisco Rangel, Fabio Celli, Paolo Rosso, Martin Potthast, Benno Stein, Walter Daelemans. Overview of the 3rd Author Profiling Task at PAN 2015. In: Linda Cappellato, Nicola Ferro, Gareth Jones and Eric San Juan (Eds.) CLEF 2015 Labs and Workshops, Notebook Papers, 8-11 September, Toulouse, France. CEUR Workshop Proceedings. ISSN 1613-0073, http://ceur-ws.org/Vol-1391/, 2015.
Francisco Rangel, Paolo Rosso, Irina Chugur, Martin Potthast, Martin Trenkmann, Benno Stein, Ben Verhoeven, Walter Daelemans. Overview of the 2nd Author Profiling Task at PAN 2014. In: Cappellato L., Ferro N., Halvey M., Kraaij W. (Eds.) CLEF 2014 Labs and Workshops, Notebook Papers. CEUR-WS.org, vol. 1180, pp. 898-927.
Francisco Rangel, Paolo Rosso, Moshe Koppel, Efstathios Stamatatos, Giacomo Inches. Overview of the Author Profiling Task at PAN 2013. In: Forner P., Navigli R., Tufis D. (Eds.) Notebook Papers of CLEF 2013 LABs and Workshops. CEUR-WS.org, vol. 1179.
Francisco Rangel and Paolo Rosso. On the Implications of the General Data Protection Regulation on the Organisation of Evaluation Tasks. In: Language and Law / Linguagem e Direito, Vol. 5(2), pp. 80-102.
Kai Shu, Suhang Wang, and Huan Liu. Understanding user profiles on social media for fake news detection. Proceedings of the IEEE Conference on Multimedia Information Processing and Retrieval (MIPR), pp. 430--435 (2018)
Kai Shu, Amy Sliva, Suhang Wang, Jiliang Tang, and Huan Liu. Fake News Detection on Social Media: A Data Mining Perspective. ACM SIGKDD Explorations Newsletter. (2017)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Residential School Locations Dataset [IRS_Locations.csv] contains the locations (latitude and longitude) of Residential Schools and student hostels operated by the federal government in Canada. All the residential schools and hostels that are listed in the Indian Residential School Settlement Agreement are included in this dataset, as well as several Industrial schools and residential schools that were not part of the IRSSA. This version of the dataset doesn't include the five schools under the Newfoundland and Labrador Residential Schools Settlement Agreement.
The original school location data was created by the Truth and Reconciliation Commission, and was provided to the researcher (Rosa Orlandini) by the National Centre for Truth and Reconciliation in April 2017. The dataset was created by Rosa Orlandini, and builds upon and enhances the previous work of the Truth and Reconciliation Commission, Morgan Hite (creator of the Atlas of Indian Residential Schools in Canada that was produced for the Tk'emlups First Nation and Justice for Day Scholar's Initiative), and Stephanie Pyne (project lead for the Residential Schools Interactive Map). Each individual school location in this dataset is attributed either to RSIM, Morgan Hite, NCTR or Rosa Orlandini.
Many schools/hostels had several locations throughout the history of the institution. If the school/hostel moved from its original location to another property, then the school is considered to have two unique locations in this dataset: the original location and the new location. For example, Lejac Indian Residential School had two locations while it was operating, Stuart Lake and Fraser Lake. If a new school building was constructed on the same property as the original school building, it isn't considered to be a new location, as is the case of Girouard Indian Residential School. When the precise location is known, the coordinates of the main building are provided, and when the precise location of the building isn't known, an approximate location is provided.
For each residential school institution location, the following information is provided: official names, alternative name, dates of operation, religious affiliation, latitude and longitude coordinates, community location, Indigenous community name, contributor (of the location coordinates), school/institution photo (when available), location point precision, type of school (hostel or residential school), and list of references used to determine the location of the main buildings or sites.
The Residential Schools Locations Dataset in Geodatabase format (IRS_Locations.gbd) contains a feature layer "IRS_Locations" with the locations (latitude and longitude) of Residential Schools and student hostels operated by the federal government in Canada. All the residential schools and hostels that are listed in the Residential Schools Settlement Agreement are included in this dataset, as well as several Industrial schools and residential schools that were not part of the IRSSA. This version of the dataset doesn't include the five schools under the Newfoundland and Labrador Residential Schools Settlement Agreement.
The original school location data was created by the Truth and Reconciliation Commission, and was provided to the researcher (Rosa Orlandini) by the National Centre for Truth and Reconciliation in April 2017. The dataset was created by Rosa Orlandini, and builds upon and enhances the previous work of the Truth and Reconciliation Commission, Morgan Hite (creator of the Atlas of Indian Residential Schools in Canada that was produced for the Tk'emlups First Nation and Justice for Day Scholar's Initiative), and Stephanie Pyne (project lead for the Residential Schools Interactive Map). Each individual school location in this dataset is attributed either to RSIM, Morgan Hite, NCTR or Rosa Orlandini.
Many schools/hostels had several locations throughout the history of the institution. If the school/hostel moved from its original location to another property, then the school is considered to have two unique locations in this dataset: the original location and the new location. For example, Lejac Indian Residential School had two locations while it was operating, Stuart Lake and Fraser Lake. If a new school building was constructed on the same property as the original school building, it isn't considered to be a new location, as is the case of Girouard Indian Residential School. When the precise location is known, the coordinates of the main building are provided, and when the precise location of the building isn't known, an approximate location is provided.
For each residential school institution location, the following information is provided: official names, alternative name, dates of operation, religious affiliation, latitude and longitude coordinates, community location, Indigenous community name, contributor (of the location coordinates), school/institution photo (when available), location point precision, type of school (hostel or residential school), and list of references used to determine the location of the main buildings or sites.
Access Instructions: there are 47 files in this data package. Please download the entire data package by selecting all 47 files and clicking download. Two files will be downloaded: IRS_Locations.gbd.zip and IRS_LocFields.csv. Uncompress IRS_Locations.gbd.zip, then use QGIS, ArcGIS Pro, or ArcMap to open the feature layer IRS_Locations contained within the IRS_Locations.gbd data package. The feature layer is in the WGS 1984 coordinate system. Detailed file-level metadata is also included in this feature layer file. IRS_LocFields.csv provides the full description of the fields and codes used in this dataset.
Since 1970, scores of states have established truth commissions to document political violence. Despite their prevalence and potential consequence, the question of why commissions are adopted in some contexts, but not in others, is not well understood. Relatedly, little is known about why some commissions possess strong investigative powers while others do not. I argue that the answer to both questions lies with domestic and international civil society actors, who are connected by a global transitional justice (TJ) network and who share the burden of guiding commission adoption and design. I propose that commissions are more likely to be adopted where network members can leverage information and moral authority over governments. I also suggest that commissions are more likely to possess strong powers where international experts, who steward TJ best practices, advise governments. I evaluate these expectations by analyzing two datasets in the novel Varieties of Truth Commissions Project, interviews with representatives from international non-governmental organizations, interviews with Guatemalan non-governmental organization leaders, a focus group with Argentinian human rights advocates, and a focus group at the International Center for Transitional Justice. My results indicate that network members share the burden: domestic members are essential to commission adoption, while international members are important for strong commission design.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We provide text metadata, image frames, and thumbnails of YouTube videos classified as harmful or harmless by domain experts, GPT-4-Turbo, and crowdworkers. Harmful videos are categorized into one or more of six harm categories: Information harms (IH), Hate and Harassment harms (HH), Clickbait harms (CB), Addictive harms (ADD), Sexual harms (SXL), and Physical harms (PH).
This repository includes the text metadata and a link to external cloud storage for the image data.
| Folder | Subfolder | #Videos |
|---|---|---|
| Ground Truth | Harmful_full_agreement (classified as harmful by all three actors) | 5,109 |
| Ground Truth | Harmful_subset_agreement (classified as harmful by at least two of the three actors) | 14,019 |
| Domain Experts | Harmful | 15,115 |
| Domain Experts | Harmless | 3,303 |
| GPT-4-Turbo | Harmful | 10,495 |
| GPT-4-Turbo | Harmless | 7,818 |
| Crowdworkers (Amazon Mechanical Turk) | Harmful | 12,668 |
| Crowdworkers (Amazon Mechanical Turk) | Harmless | 4,390 |
| Unannotated large pool | - | 60,906 |
For details about the harm classification taxonomy and the performance comparison between crowdworkers, GPT-4-Turbo, and domain experts, please see https://arxiv.org/abs/2411.05854.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
https://snap.stanford.edu/data/com-Orkut.html
Dataset information
Orkut (http://www.orkut.com/) is a free on-line social network where users form friendships with each other. Orkut also allows users to form groups that other members can then join. We consider such user-defined groups as ground-truth communities. We provide the Orkut friendship social network and ground-truth communities. This data is provided by Alan Mislove et al. (http://socialnetworks.mpi-sws.org/data-imc2007.html).
We regard each connected component in a group as a separate ground-truth community. We remove the ground-truth communities that have fewer than 3 nodes. We also provide the 5,000 highest-quality communities, as described in our paper (http://arxiv.org/abs/1205.6233). As for the network, we provide the largest connected component.
Dataset statistics
| Statistic | Value |
|---|---|
| Nodes | 3,072,441 |
| Edges | 117,185,083 |
| Nodes in largest WCC | 3,072,441 (1.000) |
| Edges in largest WCC | 117,185,083 (1.000) |
| Nodes in largest SCC | 3,072,441 (1.000) |
| Edges in largest SCC | 117,185,083 (1.000) |
| Average clustering coefficient | 0.1666 |
| Number of triangles | 627,584,181 |
| Fraction of closed triangles | 0.01414 |
| Diameter (longest shortest path) | 9 |
| 90-percentile effective diameter | 4.8 |
Source (citation)
J. Yang and J. Leskovec. Defining and Evaluating Network Communities based on Ground-truth. ICDM, 2012. http://arxiv.org/abs/1205.6233
Files
| File | Description |
|---|---|
| com-orkut.ungraph.txt.gz | Undirected Orkut network |
| com-orkut.all.cmty.txt.gz | Orkut communities |
| com-orkut.top5000.cmty.txt.gz | Orkut communities (Top 5,000) |
The graph in the SNAP data set is 1-based, with nodes numbered 1 to 3,072,626.
In the SuiteSparse Matrix Collection, Problem.A is the undirected Orkut network, a matrix of size n-by-n with n=3,072,441, which is the number of unique user ids appearing in any edge.
Problem.aux.nodeid is a list of the node ids that appear in the SNAP data set. A(i,j)=1 if person nodeid(i) is friends with person nodeid(j). The node ids are the same as in the SNAP data set (1-based).
C = Problem.aux.Communities_all is a sparse matrix of size n by 15,301,901 which represents the communities in the com-orkut.all.cmty.txt file. The kth line in that file defines the kth community, and is the column C(:,k), where C(i,k)=1 if person nodeid(i) is in the kth community. Row C(i,:) and row/column i of the A matrix thus refer to the same person, nodeid(i).
Ctop = Problem.aux.Communities_to...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Social networks are a battlefield for political propaganda. Protected by the anonymity of the internet, political actors use computational propaganda to influence the masses. Their methods include the use of synchronized or individual bots, multiple accounts operated by one social media management tool, or different manipulations of search engines and social network algorithms, all aiming to promote their ideology. While computational propaganda influences modern society, it is hard to measure or detect. Furthermore, with the recent exponential growth of large language models (LLMs) and the growing concerns about information overload, which make the alternative truth spheres noisier than ever before, the complexity and magnitude of computational propaganda are expected to increase, making detection even harder.
Propaganda in social networks is disguised as legitimate news sent from authentic users; it smartly blends real users with fake accounts. We seek here to detect efforts to manipulate the spread of information in social networks through one of the fundamental macro-scale properties of rhetoric: repetitiveness. We use 16 data sets with a total size of 13 GB, 10 related to political topics and 6 related to non-political ones (large-scale disasters), each ranging from tens of thousands to a few million tweets. We compare them and identify statistical and network properties that distinguish between these two types of information cascades. These features are based on both the repetition distribution of hashtags and the mentions of users, as well as the network structure. Together, they enable us to distinguish (p-value = 0.0001) between the two classes of information cascades. In addition to constructing a bipartite graph connecting words and tweets for each cascade, we develop a quantitative measure and show how it can be used to distinguish between political and non-political discussions. Our method is indifferent to the cascade's country of origin, language, or cultural background, since it is based only on the statistical properties of repetitiveness and word appearance in the tweets' bipartite network structures.
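As a toy illustration (not the authors' method) of the word-tweet bipartite structure the abstract describes, with word degree as a crude repetitiveness proxy:

```python
import networkx as nx

# Two made-up tweets standing in for one information cascade.
tweets = {
    "t1": "fake ballots found again",
    "t2": "ballots found in the river",
}

B = nx.Graph()
for tweet_id, text in tweets.items():
    B.add_node(tweet_id, bipartite=0)   # tweet side
    for word in set(text.split()):
        B.add_node(word, bipartite=1)   # word side
        B.add_edge(tweet_id, word)

# Words repeated across many tweets are the repetitiveness signal.
words = [n for n, d in B.nodes(data=True) if d["bipartite"] == 1]
print(sorted(words, key=B.degree, reverse=True)[:3])
```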
Public Domain (CC0 1.0): https://creativecommons.org/publicdomain/zero/1.0/
From the description: https://www.unb.ca/cic/datasets/truthseeker-2023.html
This project aims to create the largest ground-truth fake news analysis dataset for real and fake news content in relation to social media posts. The major contributions of the TruthSeeker dataset to the current fake news dataset landscape are:
- One of the most extensive benchmark datasets, with more than 180,000 labelled tweets.
- A three-factor active-learning verification method involving 456 unique, highly skilled Amazon Mechanical Turk workers for labelling each tweet. To understand patterns and characteristics of Twitter users, three auxiliary social media scores are also introduced: bot, credibility, and influence scores.
- Comprehensive analyses and evaluations of the TruthSeeker dataset, including deep-learning-based detection models, clustering-based event detection, and exploration of the relationship between tweet labels and the characteristics of online creators/spreaders.
- The application of multiple BERT-based models to assess the accuracy of real/fake tweet detection.
The data for the TruthSeeker and Basic ML datasets were generated by crawling tweets related to real and fake news from the Politifact dataset. Taking these ground-truth values and crawling for tweets related to these topics (by manually generating keywords associated with the news in question to input into the Twitter API), we were able to extract over 186,000 tweets (before final processing) related to 700 real and 700 fake pieces of news.
Taking this raw tweet data, we then used crowdsourcing in the form of Amazon Mechanical Turk to generate a majority answer on how closely each tweet agrees with the real/fake news source statement. A majority-agreement algorithm is then employed to assign a validity label to the associated tweets, in both a 3-category and a 5-category classification column.
This results in one of the largest ground-truth datasets for fake news detection on Twitter ever created: the TruthSeeker dataset. We also generated a dataset of features from each tweet and from the metadata of the user who posted it, allowing users to apply both deep learning models and classical machine learning techniques.
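A minimal sketch of a majority-agreement step like the one described, in which each tweet receives the label most Turkers chose and ties are left unresolved (the actual algorithm in the paper may differ):

```python
from collections import Counter

def majority_label(answers):
    """answers: the labels individual Turkers gave one tweet."""
    counts = Counter(answers).most_common()
    label, count = counts[0]
    if len(counts) > 1 and counts[1][1] == count:
        return None  # tie: no majority answer
    return label

print(majority_label(["agree", "agree", "disagree"]))  # -> agree
```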
From: Sajjad Dadkhah, Xichen Zhang, Alexander Gerald Weismann, Amir Firouzi, and Ali A. Ghorbani. The Largest Social Media Ground-Truth Dataset for Real/Fake Content: TruthSeeker. DOI: 10.1109/TCSS.2023.3322303
LGPL 3.0: http://www.gnu.org/licenses/lgpl-3.0.html
A dataset containing 79k articles of misinformation, fake news and propaganda:
- 34,975 'true' articles --> MisinfoSuperset_TRUE.csv
- 43,642 articles of misinfo, fake news or propaganda --> MisinfoSuperset_FAKE.csv
The 'true' articles come from a variety of sources, such as Reuters, the New York Times, the Washington Post and more.
The 'fake' articles are sourced from:
1. American right-wing extremist websites (such as Redflag Newsdesk, Breitbart, Truth Broadcast Network).
2. A previously published public dataset described in: Ahmed H, Traore I, Saad S. (2017) "Detection of Online Fake News Using N-Gram Analysis and Machine Learning Techniques." In: Traore I., Woungang I., Awad A. (eds) Intelligent, Secure, and Dependable Systems in Distributed and Cloud Environments. ISDDC 2017. Lecture Notes in Computer Science, vol 10618. Springer, Cham (pp. 127-138).
3. Disinformation and propaganda cases collected by the EUvsDisinfo project, started in 2015, which identifies and fact-checks disinformation cases originating from pro-Kremlin media that are spread across the EU.
The articles have all information except the actual text removed and are split into a set with all the fake news/misinformation and one with all the true articles.
For those only interested in Russian propaganda (and not so much misinformation in general), I have added the Russian propaganda in a separate CSV called 'EXTRA_RussianPropagandaSubset.csv'.
--
Note: While this might immediately seem like a great classification task, I would suggest also considering clustering / topic modelling. Why clustering? Because with clustering we can build a model that matches a newly written article to a previously debunked lie or misinformation narrative, and thereby immediately debunk a new article (or at least link it to an actual fact-checked statement) without either using an algorithm as the argument or waiting for confirmation from a fact-checking organisation. A minimal sketch of this idea follows below.
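A minimal sketch of that clustering idea, using TF-IDF vectors and k-means so a newly written article can be matched to an existing narrative cluster; the text column name is an assumption:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

fake = pd.read_csv("MisinfoSuperset_FAKE.csv")
vec = TfidfVectorizer(max_features=20_000, stop_words="english")
X = vec.fit_transform(fake["text"])  # "text" column name is an assumption

# Cluster the debunked articles into candidate narratives.
km = KMeans(n_clusters=50, n_init=10, random_state=0).fit(X)

# Match a newly written article to its nearest narrative cluster.
new_article = ["Some newly written article text ..."]
print(km.predict(vec.transform(new_article)))
```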
An example disinformation project using this dataset can be found on https://stevenpeutz.com/disinformation/
Public Domain (CC0 1.0): https://creativecommons.org/publicdomain/zero/1.0/
This data is scraped from the Politifact website. It contains claims made by individuals and what the Politifact curators think about them. This data can be used to run various NLP algorithms to assess the integrity of the data and to determine the validity of a claim.
Images for associating the content: when you land on the Politifact website, you will see the page with the list of facts as shown below. I have also annotated the various column fields in the image for convenience.
![Landing page for fact check page of Politifact](https://i.imgur.com/9MH52Uf.jpg)
Now, when you click an article, you land on its main page, where the curator's annotation appears. You can see it as follows:
![Article and other info](https://i.imgur.com/c9Ht0fp.jpg)
The content of the data is scraped from the Politifact site and has various attributes, covered below:
- sources: String representing the person who is associated with the quote.
- sources_dates: Date on which the information was furnished by the source.
- sources_post_location: The location/medium at which the source furnished the information.
- sources_quote: The actual quote/information furnished by the source in question.
- curator_name: Person who curated the information from the source.
- curated_date: Date on which the curator analyzed and assessed the source's quote.
- fact: Fact score that is assigned to the source's quote.
- sources_url: URL of the curator's article about the source's quote.
- curators_article_title: Title of the article written by the curator to support/reject the source's claim.
- curator_complete_article: Complete blog written by the curator supporting/rejecting the source's claim.
- curator_tags: Tags given by the curator to the blog post.
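A quick sketch of loading the scrape with pandas; the file name is an assumption, and the column names follow the attribute list above:

```python
import pandas as pd

df = pd.read_csv("politifact.csv", parse_dates=["sources_dates", "curated_date"])

# Distribution of fact-check verdicts per source.
print(df.groupby("sources")["fact"].value_counts().head(10))
```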
The entire acknowledgment goes to Politifact.com for curating and validating such data and facts.