21 datasets found

Truth Social Dataset

zenodo.org

zip

Updated Jan 13, 2023

Cite

Patrick Gerard; Nicholas Botzer; Tim Weninger (2023). Truth Social Dataset [Dataset]. http://doi.org/10.5281/zenodo.7531625

Explore at:

Available download formats: zip

Unique identifier

https://doi.org/10.5281/zenodo.7531625

Dataset updated

Jan 13, 2023

Dataset provided by

Zenodo (http://zenodo.org/)

Authors

Patrick Gerard; Nicholas Botzer; Tim Weninger

License

Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

A Truth Social data set containing a network of users, their associated posts, and additional information about each post. Collected from February 2022 through September 2022, this dataset contains 454,458 user entries and 845,060 Truth (Truth Social’s term for post) entries.

The dataset comprises 12 files; the entry count for each file is shown below.

File	Data Points
users.tsv	454,458
follows.tsv	4,002,115
truths.tsv	823,927
quotes.tsv	10,508
replies.tsv	506,276
media.tsv	184,884
hashtags.tsv	21,599
external_urls.tsv	173,947
truth_hashtag_edges.tsv	213,295
truth_media_edges.tsv	257,500
truth_external_url_edges.tsv	252,877
truth_user_tag_edges.tsv	145,234

A readme file is provided that describes the structure of the files, necessary terms, and necessary information about the data collection.
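The per-file entry counts in the table above can be checked after download with a few lines of Python. This is a minimal sketch, not part of the dataset's tooling: it assumes each .tsv file starts with a single header row and that no field contains an embedded line break, neither of which is confirmed here (the readme is the authority on file structure). The function names are illustrative.

```python
import csv
from pathlib import Path


def count_tsv_rows(path, has_header=True):
    """Count data rows in a tab-separated file.

    Assumption: one physical line per record and, when has_header is True,
    exactly one header line at the top of the file.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        rows = sum(1 for _ in reader)
    return max(rows - 1, 0) if has_header else rows


def summarize_directory(directory, has_header=True):
    """Return {file name: data-row count} for every .tsv file in a directory."""
    return {
        path.name: count_tsv_rows(path, has_header)
        for path in sorted(Path(directory).glob("*.tsv"))
    }
```

Calling `summarize_directory` on the unpacked archive would give a dictionary that can be compared against the table, for example the `users.tsv` entry against 454,458.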

Data from: Youtube social network
kaggle.com
zip
Updated Sep 1, 2019
Cite
Lorenzo De Tomasi (2019). Youtube social network [Dataset]. https://www.kaggle.com/lodetomasi1995/youtube-social-network
Explore at:
Available download formats: zip (10604317 bytes)
Dataset updated
Sep 1, 2019
Authors
Lorenzo De Tomasi
License
https://creativecommons.org/publicdomain/zero/1.0/
Area covered
YouTube
Description
Youtube social network and ground-truth communities

Dataset information

Youtube is a video-sharing web site that includes a social network. In the Youtube social network, users form friendships with each other and can create groups that other users can join. We consider such user-defined groups as ground-truth communities. This data is provided by Alan Mislove et al.

We regard each connected component in a group as a separate ground-truth community. We remove the ground-truth communities that have fewer than 3 nodes. We also provide the top 5,000 communities with the highest quality, which are described in our paper. As for the network, we provide the largest connected component.

More info: https://snap.stanford.edu/data/com-Youtube.html
Friendster Dataset
paperswithcode.com
opendatalab.com
Updated Oct 28, 2020
Cite
Jaewon Yang; Jure Leskovec (2020). Friendster Dataset [Dataset]. https://paperswithcode.com/dataset/friendster
Explore at:
Dataset updated
Oct 28, 2020
Authors
Jaewon Yang; Jure Leskovec
Description
Friendster is an on-line gaming network. Before re-launching as a game website, Friendster was a social networking site where users could form friendship edges with each other. The Friendster social network also allowed users to form groups that other members could then join. The Friendster dataset consists of ground-truth communities (based on user-defined groups) and the social network, taken as the induced subgraph of the nodes that either belong to at least one community or are connected to other nodes that belong to at least one community.
YouTube Social Network with Communities (SNAP)
kaggle.com
Updated Dec 16, 2021
Cite
Subhajit Sahu (2021). YouTube Social Network with Communities (SNAP) [Dataset]. https://www.kaggle.com/datasets/wolfram77/graphs-snap-com-youtube/discussion?sort=undefined
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Dec 16, 2021
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Subhajit Sahu
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
YouTube
Description
Youtube social network and ground-truth communities

https://snap.stanford.edu/data/com-Youtube.html

Dataset information

Youtube (http://www.youtube.com/) is a video-sharing web site that includes a social network. In the Youtube social network, users form friendships with each other and can create groups that other users can join. We consider such user-defined groups as ground-truth communities. This data is provided by Alan Mislove et al. (http://socialnetworks.mpi-sws.org/data-imc2007.html)

We regard each connected component in a group as a separate ground-truth community. We remove the ground-truth communities that have fewer than 3 nodes. We also provide the top 5,000 communities with the highest quality, which are described in our paper (http://arxiv.org/abs/1205.6233). As for the network, we provide the largest connected component.
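The community construction described above (one ground-truth community per connected component of a group, with small components dropped) can be sketched as follows. This is an illustrative reading of the description, not the SNAP authors' code; the function names and the breadth-first search are choices made here, and connectivity is evaluated only over edges between members of the same group.

```python
from collections import defaultdict, deque


def build_adjacency(edges):
    """Build an undirected adjacency map from (u, v) pairs."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    return adjacency


def ground_truth_communities(adjacency, group, min_size=3):
    """Split one user-defined group into connected components.

    Only edges between two members of the group count (the induced
    subgraph). Components with fewer than min_size nodes are discarded.
    """
    members = set(group)
    seen = set()
    communities = []
    for start in sorted(members):
        if start in seen:
            continue
        seen.add(start)
        component = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency.get(node, ()):
                if neighbour in members and neighbour not in seen:
                    seen.add(neighbour)
                    component.add(neighbour)
                    queue.append(neighbour)
        if len(component) >= min_size:
            communities.append(component)
    return communities
```

For a group whose members fall into one component of three users and one of two, only the three-user component would be kept.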

Network statistics
Statistic	Value
Nodes	1,134,890
Edges	2,987,624
Nodes in largest WCC	1,134,890 (1.000)
Edges in largest WCC	2,987,624 (1.000)
Nodes in largest SCC	1,134,890 (1.000)
Edges in largest SCC	2,987,624 (1.000)
Average clustering coefficient	0.0808
Number of triangles	3,056,386
Fraction of closed triangles	0.002081
Diameter (longest shortest path)	20
90-percentile effective diameter	6.5

Community statistics
Statistic	Value
Number of communities	8,385
Average community size	13.50
Average membership size	0.10
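To make two of the statistics above concrete, the sketch below counts triangles and computes the average local clustering coefficient for a small undirected edge list. It uses the textbook definitions (a node's coefficient is the share of its neighbour pairs that are themselves connected, with nodes of degree below 2 counted as 0); whether SNAP computes its published figures in exactly this way is an assumption. The pairwise neighbour scan is only practical for small graphs.

```python
from itertools import combinations


def clustering_stats(edges):
    """Return node, edge and triangle counts plus average local clustering."""
    adjacency = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)

    triangle_corners = 0  # each triangle is seen once from each of its 3 nodes
    coefficient_sum = 0.0
    for node, neighbours in adjacency.items():
        degree = len(neighbours)
        linked_pairs = sum(
            1 for a, b in combinations(neighbours, 2) if b in adjacency[a]
        )
        triangle_corners += linked_pairs
        if degree >= 2:
            coefficient_sum += 2.0 * linked_pairs / (degree * (degree - 1))

    node_count = len(adjacency)
    return {
        "nodes": node_count,
        "edges": sum(len(n) for n in adjacency.values()) // 2,
        "triangles": triangle_corners // 3,
        "avg_clustering": coefficient_sum / node_count if node_count else 0.0,
    }
```

On a triangle with one extra pendant edge, this reports 4 nodes, 4 edges and 1 triangle.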

Source (citation)
J. Yang and J. Leskovec. Defining and Evaluating Network Communities based on Ground-truth. ICDM, 2012. http://arxiv.org/abs/1205.6233

Files
File Description
com-youtube.ungraph.txt.gz Undirected Youtube network
com-youtube.all.cmty.txt.gz Youtube communities
com-youtube.top5000.cmty.txt.gz Youtube communities (Top 5,000)
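The files listed above can be read with the Python standard library. The sketch below assumes the usual SNAP text layout, which is not spelled out in this description: the edge file holds one whitespace-separated node pair per line with `#` comment lines at the top, and each line of a community file lists the member node ids of one community. The function names are illustrative.

```python
import gzip


def read_edges(path):
    """Read an edge list from a gzip-compressed text file.

    Assumption: '#' lines are comments; other lines hold two integer ids.
    """
    edges = []
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            source, target = line.split()[:2]
            edges.append((int(source), int(target)))
    return edges


def read_communities(path):
    """Read communities: one line per community, ids separated by whitespace."""
    communities = []
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            members = [int(token) for token in line.split()]
            if members:
                communities.append(members)
    return communities


def basic_statistics(edges, communities):
    """Compute node, edge and community counts and the mean community size."""
    nodes = {node for edge in edges for node in edge}
    sizes = [len(members) for members in communities]
    return {
        "nodes": len(nodes),
        "edges": len(edges),
        "communities": len(communities),
        "avg_community_size": sum(sizes) / len(sizes) if sizes else 0.0,
    }
```

Running `basic_statistics` on the downloaded files gives figures that can be compared with the network and community statistics listed for this dataset.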

Notes on inclusion into the SuiteSparse Matrix Collection, July 2018:

The graph in the SNAP data set is 1-based, with nodes numbered 1 to
1,157,827.

In the SuiteSparse Matrix Collection, Problem.A is the undirected Youtube
network, a matrix of size n-by-n with n=1,134,890, which is the number of
unique user id's appearing in any edge.

Problem.aux.nodeid is a list of the node id's that appear in the SNAP data set. A(i,j)=1 if person nodeid(i) is friends with person nodeid(j). The
node id's are the same as the SNAP data set (1-based).

C = Problem.aux.Communities_all is a sparse matrix of size n by 16,386
which represents the communities in the com-youtube.all.cmty.txt file.
The kth line in that file defines the kth community, and is the column
C(:,k), where C(i,k)=1 if person ...
Orkut Social Network and Communities (SNAP)
kaggle.com
Updated Dec 16, 2021
Cite
Subhajit Sahu (2021). Orkut Social Network and Communities (SNAP) [Dataset]. https://www.kaggle.com/datasets/wolfram77/graphs-snap-com-orkut/data
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Dec 16, 2021
Dataset provided by
Kaggle
Authors
Subhajit Sahu
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Orkut social network and ground-truth communities

https://snap.stanford.edu/data/com-Orkut.html

Dataset information

Orkut (http://www.orkut.com/) is a free on-line social network where users form friendships with each other. Orkut also allows users to form a group which other members can then join. We consider such user-defined groups as ground-truth communities. We provide the Orkut friendship social network and ground-truth communities. This data is provided by Alan Mislove et al. (http://socialnetworks.mpi-sws.org/data-imc2007.html)

We regard each connected component in a group as a separate ground-truth community. We remove the ground-truth communities that have fewer than 3 nodes. We also provide the top 5,000 communities with the highest quality, which are described in our paper (http://arxiv.org/abs/1205.6233). As for the network, we provide the largest connected component.

Dataset statistics
Statistic	Value
Nodes	3,072,441
Edges	117,185,083
Nodes in largest WCC	3,072,441 (1.000)
Edges in largest WCC	117,185,083 (1.000)
Nodes in largest SCC	3,072,441 (1.000)
Edges in largest SCC	117,185,083 (1.000)
Average clustering coefficient	0.1666
Number of triangles	627,584,181
Fraction of closed triangles	0.01414
Diameter (longest shortest path)	9
90-percentile effective diameter	4.8

Source (citation)
J. Yang and J. Leskovec. Defining and Evaluating Network Communities based on Ground-truth. ICDM, 2012. http://arxiv.org/abs/1205.6233

Files
File Description
com-orkut.ungraph.txt.gz Undirected Orkut network
com-orkut.all.cmty.txt.gz Orkut communities
com-orkut.top5000.cmty.txt.gz Orkut communities (Top 5,000)

Notes on inclusion into the SuiteSparse Matrix Collection, July 2018:

The graph in the SNAP data set is 1-based, with nodes numbered 1 to
3,072,626.

In the SuiteSparse Matrix Collection, Problem.A is the undirected
Orkut network, a matrix of size n-by-n with n=3,072,441, which is
the number of unique user id's appearing in any edge.

Problem.aux.nodeid is a list of the node id's that appear in the SNAP data set. A(i,j)=1 if person nodeid(i) is friends with person nodeid(j). The
node id's are the same as the SNAP data set (1-based).

C = Problem.aux.Communities_all is a sparse matrix of size n by 15,301,901 which represents the same number of communities in the com-orkut.all.cmty.txt file. The kth line in that file defines the kth community, and is the column C(:,k), where C(i,k)=1 if person nodeid(i) is in the kth community. Row C(i,:) and row/column i of the A matrix thus refer to the same person, nodeid(i).
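The relationship between the SNAP node ids, the compact row/column index i, and the matrices A and C can be sketched in plain Python. This is an illustration of the indexing only, not SuiteSparse code: it assumes nodeid is sorted in ascending order (the notes do not state the ordering), keeps the 1-based positions used in the text, and stores the nonzero entries as sets of index pairs instead of a real sparse matrix.

```python
def compact_node_ids(edges):
    """Map the ids that appear in any edge to positions 1..n.

    Returns (nodeid, index_of), where nodeid[i - 1] is the original id at
    position i and index_of maps an original id back to its position.
    """
    nodeid = sorted({node for edge in edges for node in edge})
    index_of = {original: i for i, original in enumerate(nodeid, start=1)}
    return nodeid, index_of


def adjacency_entries(edges, index_of):
    """Nonzero (i, j) positions of the symmetric matrix A."""
    entries = set()
    for u, v in edges:
        i, j = index_of[u], index_of[v]
        entries.add((i, j))
        entries.add((j, i))
    return entries


def community_entries(communities, index_of):
    """Nonzero (i, k) positions of C: person nodeid(i) is in community k."""
    entries = set()
    for k, members in enumerate(communities, start=1):
        for member in members:
            if member in index_of:  # skip ids that appear in no edge
                entries.add((index_of[member], k))
    return entries
```

Because both functions share the same `index_of` mapping, row i of C and row/column i of A refer to the same person, as the notes describe.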

Ctop = Problem.aux.Communities_to...
ICEWS Events of Interest Ground Truth Data Set
dataverse.harvard.edu
search.datacite.org
pdf +2
Updated May 7, 2020
Cite
Harvard Dataverse (2020). ICEWS Events of Interest Ground Truth Data Set [Dataset]. http://doi.org/10.7910/DVN/28119
Explore at:
Available download formats: pdf (502638), txt (530), text/plain; charset=us-ascii (2282782)
Unique identifier
https://doi.org/10.7910/DVN/28119
Dataset updated
May 7, 2020
Dataset provided by
Harvard Dataverse
Time period covered
Jan 1, 2001 - Dec 31, 2013
Area covered
167 countries worldwide
Description
THIS IS NO LONGER SUPPORTED.

In ICEWS, an Event of Interest (EOI) is a macro-level occurrence within a country or region that is supported by the existence of multiple underlying events. The Ground Truth Data Set lists, for each supported EOI, whether or not the EOI occurred in any given country in any given month. We plan to update this data on a periodic basis. The five EOIs currently supported in this data set are:

1. Domestic Political Crisis (DPC): Significant opposition to the government, but not to the level of rebellion or insurgency (e.g., power struggles between two political factions involving disruptive strikes or violent clashes between supporters).
2. Insurgency: Organized opposition whose objective is to overthrow the central government.
3. International Crisis: Conflict or elevated tensions that could lead to conflict between two or more states OR between a state and an actor operating primarily from beyond the state's borders that involves the deployment of substantial ground forces (1,000+) beyond its borders.
4. Rebellion: Organized, active, violent opposition with substantial arms, where the objective is to seek autonomy or independence from the central government.
5. Ethnic/Religious Violence: Violence between ethnic or religious groups that is not specifically directed against the government.

Additional information about the ICEWS program can be found at http://www.icews.com/. Follow our Twitter handle for data updates and other news: @icews
Using social network information to discover truth of movie ranking
researchdata.ntu.edu.sg
tsv, txt
Updated Jun 10, 2018
Cite
DR-NTU (Data) (2018). Using social network information to discover truth of movie ranking [Dataset]. http://doi.org/10.21979/N9/L5TTRW
Explore at:
Available download formats: tsv (4143), tsv (26553), txt (1857)
Unique identifier
https://doi.org/10.21979/N9/L5TTRW
Dataset updated
Jun 10, 2018
Dataset provided by
DR-NTU (Data)
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
The real dataset consists of movie evaluations from IMDB, which provides a platform where individuals can evaluate movies on a scale of 1 to 10. If a user rates a movie and clicks the share button, a Twitter message is generated. We then extract the rating from the Twitter message. We treat the ratings on the IMDB website as the event truths, which are based on the aggregated evaluations from all users, whereas our observations come from only a subset of users who share their ratings on Twitter. Using the Twitter API, we collect information about the follower and following relationships between individuals who generate movie evaluation Twitter messages. To better show the influence of social network information on event truth discovery, we delete small subnetworks that consist of fewer than 5 agents. The final dataset we use consists of 2266 evaluations from 209 individuals on 245 movies (events), together with the social network between these 209 individuals. We regard the social network as undirected because both follower and following relationships indicate that the two users have similar tastes.
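The two network preprocessing steps mentioned above, treating follower/following links as undirected and deleting subnetworks with fewer than 5 agents, could look like the sketch below. It is an assumption-laden illustration, not the dataset authors' procedure: "subnetwork" is read here as a connected component, and a union-find structure is one of several ways to find components. The function and parameter names are hypothetical.

```python
def find_root(parent, node):
    """Find the component root of node, halving the path along the way."""
    while parent[node] != node:
        parent[node] = parent[parent[node]]
        node = parent[node]
    return node


def prune_small_subnetworks(follow_pairs, min_agents=5):
    """Symmetrize follow links and drop components with too few agents.

    follow_pairs: iterable of (follower, followed) pairs.
    Returns (kept_agents, undirected_edges), where each undirected edge is
    a frozenset of two agents.
    """
    follow_pairs = list(follow_pairs)
    parent = {}
    for a, b in follow_pairs:
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        root_a, root_b = find_root(parent, a), find_root(parent, b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two components

    sizes = {}
    for node in parent:
        root = find_root(parent, node)
        sizes[root] = sizes.get(root, 0) + 1

    kept = {n for n in parent if sizes[find_root(parent, n)] >= min_agents}
    edges = {
        frozenset((a, b))
        for a, b in follow_pairs
        if a != b and a in kept and b in kept
    }
    return kept, edges
```

A mutual follow (a follows b and b follows a) collapses into a single undirected edge because both directions map to the same frozenset.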
Twitter dataset
figshare.com
txt
Updated Dec 20, 2024
Cite
mehdi khalil (2024). Twitter dataset [Dataset]. http://doi.org/10.6084/m9.figshare.28069163.v1
Explore at:
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.28069163.v1
Dataset updated
Dec 20, 2024
Dataset provided by
figshare
Authors
mehdi khalil
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The Truth Seeker Dataset is designed to support research in the detection and classification of misinformation on social media platforms, particularly focusing on Twitter. This dataset is part of a broader initiative to enhance the understanding of how machine learning (ML) and natural language processing (NLP) can be leveraged to identify fake news and misleading content in real time.

Dataset Composition

The Truth Seeker Dataset comprises a substantial collection of social media posts that have been meticulously labeled as either real or fake. It was constructed using advanced ML algorithms and NLP techniques to analyze the language patterns in social media communications. The dataset includes:

Raw Social Media Posts: A diverse range of tweets that reflect various topics and sentiments.
Labeling: Each post is annotated with binary labels indicating its authenticity (real or fake).
Feature Sets: Two distinct subsets of the dataset have been created using different NLP vectorization methods, Word2Vec and TF-IDF. This allows researchers to explore how different feature representations impact model performance.

Research Applications

The primary aim of the Truth Seeker Dataset is to facilitate the development and validation of models that can accurately classify social media content. Key applications include:

Fake News Detection: Utilizing various ML algorithms, including Random Forest and AdaBoost, which have demonstrated high F1 scores in preliminary evaluations.
Model Comparison: Researchers can compare the effectiveness of different ML approaches on the same dataset, enabling a clearer understanding of which methods yield the best results in detecting misinformation.
Algorithm Development: The dataset serves as a benchmark for developing new algorithms aimed at improving accuracy in fake news detection.
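As an illustration of what a TF-IDF feature representation looks like, the sketch below weights each term by its relative frequency in a post times the logarithm of the inverse document frequency. This is one common textbook variant and an assumption here: the description does not say which TF-IDF formula, tokenizer, or library was used to build the dataset's feature sets.

```python
import math
from collections import Counter


def tfidf(documents):
    """Return one {term: weight} dictionary per document.

    weight = (term count / tokens in document) * log(N / document frequency)
    Tokenization is a plain lowercase whitespace split (an assumption).
    """
    tokenized = [document.lower().split() for document in documents]
    n_documents = len(tokenized)

    document_frequency = Counter()
    for tokens in tokenized:
        document_frequency.update(set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            term: (count / total) * math.log(n_documents / document_frequency[term])
            for term, count in counts.items()
        })
    return vectors
```

With this variant, a term that occurs in every post receives weight 0, so only terms that distinguish posts from one another contribute to the feature vector.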
CT-FAN-22 corpus: A Multilingual dataset for Fake News Detection
data.niaid.nih.gov
zenodo.org
Updated Jan 6, 2022
Cite
Shahi Gautam Kishore (2022). CT-FAN-22 corpus: A Multilingual dataset for Fake News Detection [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5775507
Explore at:
Dataset updated
Jan 6, 2022
Dataset provided by
Thomas Mandl
Shahi Gautam Kishore
Struß Julia Maria
Description
Data Access: The data in the research collection provided may only be used for research purposes. Portions of the data are copyrighted and have commercial value as data, so you must be careful to use it only for research purposes. Due to these restrictions, the collection is not open data. Please download the Data Sharing Agreement and send the signed form to fakenewstask@gmail.com.

Citation

Please cite our work as

@article{shahi2021overview,
  title={Overview of the CLEF-2021 CheckThat! lab task 3 on fake news detection},
  author={Shahi, Gautam Kishore and Stru{\ss}, Julia Maria and Mandl, Thomas},
  journal={Working Notes of CLEF},
  year={2021}
}

Problem Definition: Given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other (e.g., claims in dispute) and detect the topical domain of the article. This task will run in English and German.

Subtask 3: Multi-class fake news detection of news articles (English). Sub-task A frames fake news detection as a four-class classification problem. The training data will be released in batches and will comprise roughly 900 articles with their respective labels. Given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other. Our definitions for the categories are as follows:

False - The main claim made in an article is untrue.

Partially False - The main claim of an article is a mixture of true and false information. The article contains partially true and partially false information but cannot be considered 100% true. It includes all articles in categories like partially false, partially true, mostly true, miscaptioned, misleading etc., as defined by different fact-checking services.

True - This rating indicates that the primary elements of the main claim are demonstrably true.

Other- An article that cannot be categorised as true, false, or partially false due to lack of evidence about its claims. This category includes articles in dispute and unproven articles.

Input Data

The data will be provided in the format of Id, title, text, rating, the domain; the description of the columns is as follows:

Task 3

ID- Unique identifier of the news article

Title- Title of the news article

text- Text mentioned inside the news article

our rating - class of the news article as false, partially false, true, other

Output data format

Task 3

public_id- Unique identifier of the news article

predicted_rating- predicted class

Sample File

public_id, predicted_rating
1, false
2, true

Sample file

public_id, predicted_domain
1, health
2, crime
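A submission file in the two-column layout shown in the samples could be produced as follows. This is a hedged sketch, not an official tool: the helper name is hypothetical, the four rating labels are taken from the category definitions above, and whether the organizers expect a space after the comma or a particular label casing is not stated, so the output uses plain comma-separated, lowercase values.

```python
import csv

# Labels taken from the four categories defined in the task description.
RATINGS = {"false", "partially false", "true", "other"}


def write_submission(predictions, handle, column="predicted_rating", allowed=RATINGS):
    """Write (public_id, label) pairs as a two-column CSV file.

    Pass allowed=None to skip label validation, for example when writing
    a predicted_domain file whose label set is not listed here.
    """
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["public_id", column])
    for public_id, value in predictions:
        label = str(value).strip().lower()
        if allowed is not None and label not in allowed:
            raise ValueError(f"unexpected label {label!r} for article {public_id}")
        writer.writerow([public_id, label])
```

The function accepts any writable text handle, so it works with a file opened via `open(path, "w", newline="")` as well as with an in-memory buffer.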

Additional data for Training

To train their models, participants may use additional data in a similar format; some datasets are available on the web. We don't provide the ground truth for those datasets. For testing, we will not use any articles from other datasets. Some possible sources:

Fakenews Classification Datasets

Fake News Detection Challenge KDD 2020

FakeNewsNet

IMPORTANT!

We have used data from 2010 to 2021, and the fake news content covers a mix of topics such as elections and COVID-19.

Evaluation Metrics

This task is evaluated as a classification task. We will use the F1-macro measure for the ranking of teams. There is a limit of 5 runs (total and not per day), and only one person from a team is allowed to submit runs.
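For reference, the F1-macro measure is the unweighted mean of the per-class F1 scores. The sketch below follows that standard definition; it is not the organizers' scoring script, and the handling of a class with no true or predicted instances (scored as 0 here) is an assumption.

```python
def macro_f1(y_true, y_pred, labels=None):
    """Unweighted mean of per-class F1 scores.

    Per class: F1 = 2*TP / (2*TP + FP + FN), taken as 0.0 when the
    denominator is zero.
    """
    if labels is None:
        labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        true_pos = sum(1 for t, p in pairs if t == label and p == label)
        false_pos = sum(1 for t, p in pairs if t != label and p == label)
        false_neg = sum(1 for t, p in pairs if t == label and p != label)
        denominator = 2 * true_pos + false_pos + false_neg
        scores.append(2 * true_pos / denominator if denominator else 0.0)
    return sum(scores) / len(scores) if scores else 0.0
```

Because every class contributes equally, a rare class such as "other" affects the score as much as a frequent one, which is why macro averaging is often chosen for imbalanced label sets.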

Submission Link: Coming soon

Related Work

Shahi, G. K., Struß, J. M., & Mandl, T. (2021). Overview of the CLEF-2021 CheckThat! lab task 3 on fake news detection. Working Notes of CLEF.

Nakov, P., Da San Martino, G., Elsayed, T., Barrón-Cedeño, A., Míguez, R., Shaar, S., ... & Mandl, T. (2021, March). The CLEF-2021 CheckThat! lab on detecting check-worthy claims, previously fact-checked claims, and fake news. In European Conference on Information Retrieval (pp. 639-649). Springer, Cham.

Nakov, P., Da San Martino, G., Elsayed, T., Barrón-Cedeño, A., Míguez, R., Shaar, S., ... & Kartal, Y. S. (2021, September). Overview of the CLEF–2021 CheckThat! Lab on Detecting Check-Worthy Claims, Previously Fact-Checked Claims, and Fake News. In International Conference of the Cross-Language Evaluation Forum for European Languages (pp. 264-291). Springer, Cham.

Shahi GK. AMUSED: An Annotation Framework of Multi-modal Social Media Data. arXiv preprint arXiv:2010.00502. 2020 Oct 1. https://arxiv.org/pdf/2010.00502.pdf

G. K. Shahi and D. Nandini, “FakeCovid – a multilingual cross-domain fact check news dataset for covid-19,” in Workshop Proceedings of the 14th International AAAI Conference on Web and Social Media, 2020. http://workshop-proceedings.icwsm.org/abstract?id=2020_14

Shahi, G. K., Dirkson, A., & Majchrzak, T. A. (2021). An exploratory study of covid-19 misinformation on twitter. Online Social Networks and Media, 22, 100104. doi: 10.1016/j.osnem.2020.100104
Group SNAP Dataset
paperswithcode.com
Updated Jul 21, 2018
Cite
(2018). Group SNAP Dataset [Dataset]. https://paperswithcode.com/dataset/group-snap-snap-suitesparse-matrix-collection
Explore at:
Dataset updated
Jul 21, 2018
Description
Networks from SNAP (Stanford Network Analysis Platform) Network Data Sets, Jure Leskovec http://snap.stanford.edu/data/index.html email jure at cs.stanford.edu

Citation for the SNAP collection:

@misc{snapnets,
  author = {Jure Leskovec and Andrej Krevl},
  title = {{SNAP Datasets}: {Stanford} Large Network Dataset Collection},
  howpublished = {\url{http://snap.stanford.edu/data}},
  month = jun,
  year = 2014
}

The following matrices/graphs were added to the collection in June 2010 by Tim Davis (problem id and name):

2284	SNAP/soc-Epinions1	who-trusts-whom network of Epinions.com
2285	SNAP/soc-LiveJournal1	LiveJournal social network
2286	SNAP/soc-Slashdot0811	Slashdot social network, Nov 2008
2287	SNAP/soc-Slashdot0902	Slashdot social network, Feb 2009
2288	SNAP/wiki-Vote	Wikipedia who-votes-on-whom network
2289	SNAP/email-EuAll	Email network from a EU research institution
2290	SNAP/email-Enron	Email communication network from Enron
2291	SNAP/wiki-Talk	Wikipedia talk (communication) network
2292	SNAP/cit-HepPh	Arxiv High Energy Physics paper citation network
2293	SNAP/cit-HepTh	Arxiv High Energy Physics paper citation network
2294	SNAP/cit-Patents	Citation network among US Patents
2295	SNAP/ca-AstroPh	Collaboration network of Arxiv Astro Physics
2296	SNAP/ca-CondMat	Collaboration network of Arxiv Condensed Matter
2297	SNAP/ca-GrQc	Collaboration network of Arxiv General Relativity
2298	SNAP/ca-HepPh	Collaboration network of Arxiv High Energy Physics
2299	SNAP/ca-HepTh	Collaboration network of Arxiv High Energy Physics Theory
2300	SNAP/web-BerkStan	Web graph of Berkeley and Stanford
2301	SNAP/web-Google	Web graph from Google
2302	SNAP/web-NotreDame	Web graph of Notre Dame
2303	SNAP/web-Stanford	Web graph of Stanford.edu
2304	SNAP/amazon0302	Amazon product co-purchasing network from March 2 2003
2305	SNAP/amazon0312	Amazon product co-purchasing network from March 12 2003
2306	SNAP/amazon0505	Amazon product co-purchasing network from May 5 2003
2307	SNAP/amazon0601	Amazon product co-purchasing network from June 1 2003
2308	SNAP/p2p-Gnutella04	Gnutella peer to peer network from August 4 2002
2309	SNAP/p2p-Gnutella05	Gnutella peer to peer network from August 5 2002
2310	SNAP/p2p-Gnutella06	Gnutella peer to peer network from August 6 2002
2311	SNAP/p2p-Gnutella08	Gnutella peer to peer network from August 8 2002
2312	SNAP/p2p-Gnutella09	Gnutella peer to peer network from August 9 2002
2313	SNAP/p2p-Gnutella24	Gnutella peer to peer network from August 24 2002
2314	SNAP/p2p-Gnutella25	Gnutella peer to peer network from August 25 2002
2315	SNAP/p2p-Gnutella30	Gnutella peer to peer network from August 30 2002
2316	SNAP/p2p-Gnutella31	Gnutella peer to peer network from August 31 2002
2317	SNAP/roadNet-CA	Road network of California
2318	SNAP/roadNet-PA	Road network of Pennsylvania
2319	SNAP/roadNet-TX	Road network of Texas
2320	SNAP/as-735	733 daily instances (graphs) from November 8 1997 to January 2 2000
2321	SNAP/as-Skitter	Internet topology graph, from traceroutes run daily in 2005
2322	SNAP/as-caida	The CAIDA AS Relationships Datasets, from January 2004 to November 2007
2323	SNAP/Oregon-1	AS peering information inferred from Oregon route-views between March 31 and May 26 2001
2324	SNAP/Oregon-2	AS peering information inferred from Oregon route-views between March 31 and May 26 2001
2325	SNAP/soc-sign-epinions	Epinions signed social network
2326	SNAP/soc-sign-Slashdot081106	Slashdot Zoo signed social network from November 6 2008
2327	SNAP/soc-sign-Slashdot090216	Slashdot Zoo signed social network from February 16 2009
2328	SNAP/soc-sign-Slashdot090221	Slashdot Zoo signed social network from February 21 2009

Then the following problems were added in July 2018. All data and metadata from the SNAP data set was imported into the SuiteSparse Matrix Collection.

2777	SNAP/CollegeMsg	Messages on a Facebook-like platform at UC-Irvine
2778	SNAP/com-Amazon	Amazon product network
2779	SNAP/com-DBLP	DBLP collaboration network
2780	SNAP/com-Friendster	Friendster online social network
2781	SNAP/com-LiveJournal	LiveJournal online social network
2782	SNAP/com-Orkut	Orkut online social network
2783	SNAP/com-Youtube	Youtube online social network
2784	SNAP/email-Eu-core	E-mail network
2785	SNAP/email-Eu-core-temporal	E-mails between users at a research institution
2786	SNAP/higgs-twitter	twitter messages re: Higgs boson on 4th July 2012.
2787	SNAP/loc-Brightkite	Brightkite location based online social network
2788	SNAP/loc-Gowalla	Gowalla location based online social network
2789	SNAP/soc-Pokec	Pokec online social network
2790	SNAP/soc-sign-bitcoin-alpha	Bitcoin Alpha web of trust network
2791	SNAP/soc-sign-bitcoin-otc	Bitcoin OTC web of trust network
2792	SNAP/sx-askubuntu	Comments, questions, and answers on Ask Ubuntu
2793	SNAP/sx-mathoverflow	Comments, questions, and answers on Math Overflow
2794	SNAP/sx-stackoverflow	Comments, questions, and answers on Stack Overflow
2795	SNAP/sx-superuser	Comments, questions, and answers on Super User
2796	SNAP/twitter7	A collection of 476 million tweets collected between June-Dec 2009
2797	SNAP/wiki-RfA	Wikipedia Requests for Adminship (with text)
2798	SNAP/wiki-talk-temporal	Users editing talk pages on Wikipedia
2799	SNAP/wiki-topcats	Wikipedia hyperlinks (with communities)

The following 13 graphs/networks were in the SNAP data set in July 2018 but have not yet been imported into the SuiteSparse Matrix Collection. They may be added in the future:

amazon-meta ego-Facebook ego-Gplus ego-Twitter gemsec-Deezer gemsec-Facebook ksc-time-series memetracker9 web-flickr web-Reddit web-RedditPizzaRequests wiki-Elec wiki-meta wikispeedia

The 2010 description of the SNAP data set gave these categories:

Social networks: online social networks, edges represent interactions between people

Communication networks: email communication networks with edges representing communication

Citation networks: nodes represent papers, edges represent citations

Collaboration networks: nodes represent scientists, edges represent collaborations (co-authoring a paper)

Web graphs: nodes represent webpages and edges are hyperlinks

Blog and Memetracker graphs: nodes represent time stamped blog posts, edges are hyperlinks [revised below]

Amazon networks: nodes represent products and edges link commonly co-purchased products

Internet networks: nodes represent computers and edges communication

Road networks: nodes represent intersections and edges roads connecting the intersections

Autonomous systems: graphs of the internet

Signed networks: networks with positive and negative edges (friend/foe, trust/distrust)

By July 2018, the following categories had been added:

Networks with ground-truth communities: ground-truth network communities in social and information networks

Location-based online social networks: Social networks with geographic check-ins

Wikipedia networks, articles, and metadata: Talk, editing, voting, and article data from Wikipedia

Temporal networks: networks where edges have timestamps

Twitter and Memetracker: Memetracker phrases, links and 467 million Tweets

Online communities: Data from online communities such as Reddit and Flickr

Online reviews: Data from online review systems such as BeerAdvocate and Amazon

https://sparse.tamu.edu/SNAP
Residential Schools Locations Dataset (Geodatabase)
borealisdata.ca
search.dataone.org
Updated May 31, 2019
Cite
Rosa Orlandini (2019). Residential Schools Locations Dataset (Geodatabase) [Dataset]. http://doi.org/10.5683/SP2/JFQ1SZ
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Unique identifier
https://doi.org/10.5683/SP2/JFQ1SZ
Dataset updated
May 31, 2019
Dataset provided by
Borealis
Authors
Rosa Orlandini
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Jan 1, 1863 - Jun 30, 1998
Area covered
Canada
Description
The Residential Schools Locations Dataset in Geodatabase format (IRS_Locations.gbd) contains a feature layer "IRS_Locations" that contains the locations (latitude and longitude) of Residential Schools and student hostels operated by the federal government in Canada. All the residential schools and hostels that are listed in the Residential Schools Settlement Agreement are included in this dataset, as well as several Industrial schools and residential schools that were not part of the IRSSA. This version of the dataset doesn’t include the five schools under the Newfoundland and Labrador Residential Schools Settlement Agreement.

The original school location data was created by the Truth and Reconciliation Commission, and was provided to the researcher (Rosa Orlandini) by the National Centre for Truth and Reconciliation in April 2017. The dataset was created by Rosa Orlandini, and builds upon and enhances the previous work of the Truth and Reconciliation Commission, Morgan Hite (creator of the Atlas of Indian Residential Schools in Canada that was produced for the Tk'emlups First Nation and Justice for Day Scholar's Initiative), and Stephanie Pyne (project lead for the Residential Schools Interactive Map). Each individual school location in this dataset is attributed either to RSIM, Morgan Hite, NCTR or Rosa Orlandini.

Many schools/hostels had several locations throughout the history of the institution. If the school/hostel moved from its original location to another property, then the school is considered to have two unique locations in this dataset, the original location and the new location. For example, Lejac Indian Residential School had two locations while it was operating, Stuart Lake and Fraser Lake. If a new school building was constructed on the same property as the original school building, it isn't considered to be a new location, as is the case of Girouard Indian Residential School. When the precise location is known, the coordinates of the main building are provided, and when the precise location of the building isn’t known, an approximate location is provided.

For each residential school institution location, the following information is provided: official names, alternative name, dates of operation, religious affiliation, latitude and longitude coordinates, community location, Indigenous community name, contributor (of the location coordinates), school/institution photo (when available), location point precision, type of school (hostel or residential school) and list of references used to determine the location of the main buildings or sites.

Access Instructions: there are 47 files in this data package. Please download the entire data package by selecting all 47 files and clicking on download. Two files will be downloaded, IRS_Locations.gbd.zip and IRS_LocFields.csv. Uncompress the IRS_Locations.gbd.zip. Use QGIS, ArcGIS Pro, or ArcMap to open the feature layer IRS_Locations that is contained within the IRS_Locations.gbd data package. The feature layer is in the WGS 1984 coordinate system. There is also detailed file-level metadata included in this feature layer file. The IRS_locations.csv provides the full description of the fields and codes used in this dataset.
Communities Graphs
kaggle.com
Updated Nov 15, 2021
Cite
Subhajit Sahu (2021). Communities Graphs [Dataset]. https://www.kaggle.com/datasets/wolfram77/graphs-communities/code
Explore at:
Dataset updated
Nov 15, 2021
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Subhajit Sahu
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
com-LiveJournal: LiveJournal social network and ground-truth communities

LiveJournal is a free on-line blogging community where users declare friendship with each other. LiveJournal also allows users to form a group which other members can then join. We consider such user-defined groups as ground-truth communities. We provide the LiveJournal friendship social network and ground-truth communities.

We regard each connected component in a group as a separate ground-truth community. We remove the ground-truth communities which have fewer than 3 nodes. We also provide the top 5,000 communities with the highest quality, which are described in our paper. As for the network, we provide the largest connected component.
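The community extraction described above can be illustrated with a minimal sketch. It assumes the friendship network is available as a list of undirected `(u, v)` pairs and a group as a list of node ids; the function name `split_group_into_communities` and the toy data are hypothetical and are not part of the dataset.

```python
from collections import defaultdict, deque


def split_group_into_communities(group_members, edges, min_size=3):
    """Split one user-defined group into connected components.

    group_members: iterable of node ids that joined the group.
    edges: iterable of undirected (u, v) friendship pairs for the whole network.
    Returns sorted node lists, one per component with at least min_size nodes.
    """
    members = set(group_members)

    # Keep only edges whose two endpoints both belong to the group.
    adjacency = defaultdict(set)
    for u, v in edges:
        if u in members and v in members:
            adjacency[u].add(v)
            adjacency[v].add(u)

    seen = set()
    communities = []
    for start in sorted(members):
        if start in seen:
            continue
        # Breadth-first search collects one connected component.
        component = []
        queue = deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            component.append(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        # Components with fewer than min_size nodes are discarded.
        if len(component) >= min_size:
            communities.append(sorted(component))
    return communities


# Toy data (hypothetical): nodes 1-2-3 are linked, 4-5 are linked, 6 is isolated.
toy_edges = [(1, 2), (2, 3), (4, 5), (3, 7)]
toy_communities = split_group_into_communities([1, 2, 3, 4, 5, 6], toy_edges)
print(toy_communities)
```

In this toy case only the component of nodes 1, 2 and 3 survives the size filter; node 7 is ignored because it is not a member of the group.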

com-Friendster: Friendster social network and ground-truth communities

Friendster is an on-line gaming network. Before re-launching as a game website, Friendster was a social networking site where users could form friendship edges with each other. The Friendster social network also allowed users to form a group which other members could then join. We consider such user-defined groups as ground-truth communities. For the social network, we take the induced subgraph of the nodes that either belong to at least one community or are connected to other nodes that belong to at least one community. This data is provided by The Web Archive Project, where the full graph is available.

We regard each connected component in a group as a separate ground-truth community. We remove the ground-truth communities which have fewer than 3 nodes. We also provide the top 5,000 communities with the highest quality, which are described in our paper. As for the network, we provide the largest connected component.

com-Orkut: Orkut social network and ground-truth communities

Orkut is a free on-line social network where users form friendships with each other. Orkut also allows users to form a group which other members can then join. We consider such user-defined groups as ground-truth communities. We provide the Orkut friendship social network and ground-truth communities. This data is provided by Alan Mislove et al.

We regard each connected component in a group as a separate ground-truth community. We remove the ground-truth communities which have fewer than 3 nodes. We also provide the top 5,000 communities with the highest quality, which are described in our paper. As for the network, we provide the largest connected component.

com-Youtube: Youtube social network and ground-truth communities

Youtube is a video-sharing web site that includes a social network. In the Youtube social network, users form friendships with each other and can create groups which other users can join. We consider such user-defined groups as ground-truth communities. This data is provided by Alan Mislove et al.

We regard each connected component in a group as a separate ground-truth community. We remove the ground-truth communities which have fewer than 3 nodes. We also provide the top 5,000 communities with the highest quality, which are described in our paper. As for the network, we provide the largest connected component.

com-DBLP: DBLP collaboration network and ground-truth communities

The DBLP computer science bibliography provides a comprehensive list of research papers in computer science. We construct a co-authorship network where two authors are connected if they have published at least one paper together. A publication venue, e.g., a journal or conference, defines an individual ground-truth community; authors who published in a certain journal or conference form a community.
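As a rough illustration of this construction, the sketch below builds co-authorship edges and venue-based communities from a list of papers. The input layout (`(venue, authors)` tuples), the function name `build_coauthorship` and the toy records are assumptions made for the example; they do not reflect the actual DBLP file format.

```python
from itertools import combinations


def build_coauthorship(papers):
    """Build co-authorship edges and venue communities.

    papers: iterable of (venue, list_of_author_names) tuples.
    Returns (edges, venue_authors) where edges is a set of sorted author
    pairs and venue_authors maps each venue to the set of its authors.
    """
    edges = set()
    venue_authors = {}
    for venue, authors in papers:
        unique_authors = sorted(set(authors))
        # Two authors are connected if they share at least one paper.
        for a, b in combinations(unique_authors, 2):
            edges.add((a, b))
        # Everyone who published at a venue joins that venue's community.
        venue_authors.setdefault(venue, set()).update(unique_authors)
    return edges, venue_authors


# Toy records (hypothetical names and venues).
toy_papers = [
    ("VenueA", ["ann", "bob"]),
    ("VenueA", ["bob", "cy"]),
    ("VenueB", ["ann", "dee"]),
]
coauthor_edges, venue_communities = build_coauthorship(toy_papers)
print(sorted(coauthor_edges))
print({venue: sorted(names) for venue, names in venue_communities.items()})
```

Sorting the author names before pairing keeps each undirected edge in a single canonical orientation, so a pair that co-authors several papers is stored only once.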

We regard each connected component in a group as a separate ground-truth community. We remove the ground-truth communities which have fewer than 3 nodes. We also provide the top 5,000 communities with the highest quality, which are described in our paper. As for the network, we provide the largest connected component.

com-Amazon: Amazon product co-purchasing network and ground-truth communities

The network was collected by crawling the Amazon website. It is based on the "Customers Who Bought This Item Also Bought" feature of the Amazon website. If a product i is frequently co-purchased with product j, the graph contains an undirected edge between i and j. Each product category provided by Amazon defines a ground-truth community.

We regard each connected component in a product category as a separate ground-truth community. We remove the ground-truth communities which have fewer than 3 nodes. We also provide the top 5,000 communities with the highest quality, which are described in our paper. As for the network, we provide the largest connected component.

email-Eu-core: email-Eu-core network

The network was generated using email data from a large European research institution. We have anonymized information about all incoming and outgoing email between members of the research institution. Th...
Residential Schools Locations Dataset (Shapefile format)
dataone.org
borealisdata.ca
Updated Dec 28, 2023
Cite
Orlandini, Rosa (2023). Residential Schools Locations Dataset (Shapefile format) [Dataset]. http://doi.org/10.5683/SP2/FJG5TG
Explore at:
Unique identifier
https://doi.org/10.5683/SP2/FJG5TG
Dataset updated
Dec 28, 2023
Dataset provided by
Borealis
Authors
Orlandini, Rosa
Time period covered
Jan 1, 1863 - Jun 30, 1998
Description
The Residential Schools Locations Dataset in shapefile format contains the locations (latitude and longitude) of Residential Schools and student hostels operated by the federal government in Canada. All the residential schools and hostels that are listed in the Indian Residential School Settlement Agreement are included in this data set, as well as several Industrial schools and residential schools that were not part of the IRSSA. This version of the dataset doesn’t include the five schools under the Newfoundland and Labrador Residential Schools Settlement Agreement. The original school location data was created by the Truth and Reconciliation Commission, and was provided to the researcher (Rosa Orlandini) by the National Centre for Truth and Reconciliation in April 2017. The data set was created by Rosa Orlandini, and builds upon and enhances the previous work of the Truth and Reconciliation Commission, Morgan Hite (creator of the Atlas of Indian Residential Schools in Canada that was produced for the Tk'emlups First Nation and Justice for Day Scholar's Initiative), and Stephanie Pyne (project lead for the Residential Schools Interactive Map). Each individual school location in this dataset is attributed either to RSIM, Morgan Hite, NCTR or Rosa Orlandini. Many schools/hostels had several locations throughout the history of the institution. If the school/hostel moved from its original location to another property, then the school is considered to have two unique locations in this data set, the original location and the new location. For example, Lejac Indian Residential School had two locations while it was operating, Stuart Lake and Fraser Lake. If a new school building was constructed on the same property as the original school building, it isn't considered to be a new location, as is the case of Girouard Indian Residential School.
When the precise location is known, the coordinates of the main building are provided, and when the precise location of the building isn’t known, an approximate location is provided. For each residential school institution location, the following information is provided: official names, alternative name, dates of operation, religious affiliation, latitude and longitude coordinates, community location, Indigenous community name, contributor (of the location coordinates), school/institution photo (when available), location point precision, type of school (hostel or residential school) and list of references used to determine the location of the main buildings or sites. The geographic coordinate system for this dataset is WGS 1984. The data in shapefile format [IRS_locations.zip] can be viewed and mapped in Geographic Information System (GIS) software. Detailed metadata in XML format is available as part of the data in shapefile format. In addition, the field name descriptions (IRS_locfields.csv) and the detailed location descriptions (IRS_locdescription.csv) should be used alongside the data in shapefile format.
Residential School Locations Dataset (CSV Format)
borealisdata.ca
search.dataone.org
Updated Jun 5, 2019
Cite
Rosa Orlandini (2019). Residential School Locations Dataset (CSV Format) [Dataset]. http://doi.org/10.5683/SP2/RIYEMU
Explore at:
Unique identifier
https://doi.org/10.5683/SP2/RIYEMU
Dataset updated
Jun 5, 2019
Dataset provided by
Borealis
Authors
Rosa Orlandini
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Jan 1, 1863 - Jun 30, 1998
Area covered
Canada
Description
The Residential School Locations Dataset [IRS_Locations.csv] contains the locations (latitude and longitude) of Residential Schools and student hostels operated by the federal government in Canada. All the residential schools and hostels that are listed in the Indian Residential School Settlement Agreement are included in this dataset, as well as several Industrial schools and residential schools that were not part of the IRSSA. This version of the dataset doesn’t include the five schools under the Newfoundland and Labrador Residential Schools Settlement Agreement. The original school location data was created by the Truth and Reconciliation Commission, and was provided to the researcher (Rosa Orlandini) by the National Centre for Truth and Reconciliation in April 2017. The dataset was created by Rosa Orlandini, and builds upon and enhances the previous work of the Truth and Reconciliation Commission, Morgan Hite (creator of the Atlas of Indian Residential Schools in Canada that was produced for the Tk'emlups First Nation and Justice for Day Scholar's Initiative), and Stephanie Pyne (project lead for the Residential Schools Interactive Map). Each individual school location in this dataset is attributed either to RSIM, Morgan Hite, NCTR or Rosa Orlandini. Many schools/hostels had several locations throughout the history of the institution. If the school/hostel moved from its original location to another property, then the school is considered to have two unique locations in this dataset, the original location and the new location. For example, Lejac Indian Residential School had two locations while it was operating, Stuart Lake and Fraser Lake.
If a new school building was constructed on the same property as the original school building, it isn't considered to be a new location, as is the case of Girouard Indian Residential School. When the precise location is known, the coordinates of the main building are provided, and when the precise location of the building isn’t known, an approximate location is provided. For each residential school institution location, the following information is provided: official names, alternative name, dates of operation, religious affiliation, latitude and longitude coordinates, community location, Indigenous community name, contributor (of the location coordinates), school/institution photo (when available), location point precision, type of school (hostel or residential school) and list of references used to determine the location of the main buildings or sites.
Profiling Fake News Spreaders on Twitter
zenodo.org
Updated Sep 20, 2020
Cite
FRANCISCO RANGEL; PAOLO ROSSO; BILAL GHANEM; ANASTASIA GIACHANOU (2020). Profiling Fake News Spreaders on Twitter [Dataset]. http://doi.org/10.5281/zenodo.3692319
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.3692319
Dataset updated
Sep 20, 2020
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
FRANCISCO RANGEL; PAOLO ROSSO; BILAL GHANEM; ANASTASIA GIACHANOU
Description
Task

Fake news has become one of the main threats to our society. Although fake news is not a new phenomenon, the exponential growth of social media has offered an easy platform for its fast propagation. A great amount of fake news and rumors is propagated in online social networks, usually with the aim of deceiving users and shaping specific opinions. Users play a critical role in the creation and propagation of fake news online by consuming and sharing articles with inaccurate information, either intentionally or unintentionally. To this end, in this task, we aim at identifying possible fake news spreaders on social media as a first step towards preventing fake news from being propagated among online users.

After having addressed several aspects of author profiling in social media from 2013 to 2019 (bot detection, age and gender, also together with personality, gender and language variety, and gender from a multimodality perspective), this year we aim at investigating whether it is possible to discriminate between authors who have shared some fake news in the past and those who, to the best of our knowledge, have never done so.

As in previous years, we propose the task from a multilingual perspective:

English

Spanish

NOTE: Although we recommend participating in both languages (English and Spanish), it is possible to address the problem for just one language.

Data

Input

The uncompressed dataset consists of one folder per language (en, es). Each folder contains:

An XML file per author (Twitter user) with 100 tweets. The name of the XML file corresponds to the unique author id.

A truth.txt file with the list of authors and the ground truth.

The format of the XML files is:

The format of the truth.txt file is as follows. The first column corresponds to the author id. The second column contains the truth label.

b2d5748083d6fdffec6c2d68d4d4442d:::0
2bed15d46872169dc7deaf8d2b43a56:::0
8234ac5cca1aed3f9029277b2cb851b:::1
5ccd228e21485568016b4ee82deb0d28:::0
60d068f9cafb656431e62a6542de2dc0:::1
...
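A minimal sketch of reading this format follows. It assumes one `author_id:::label` record per line and integer labels, as in the sample above; the function name `parse_truth` is hypothetical.

```python
def parse_truth(lines):
    """Parse truth.txt records of the form '<author_id>:::<label>'.

    lines: iterable of strings, e.g. an open file object.
    Returns a dict mapping author id to an integer label.
    """
    labels = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # Skip blank lines.
        # Split on the last ':::' so the id is kept intact.
        author_id, _, label = line.rpartition(":::")
        labels[author_id] = int(label)
    return labels


# Two records copied from the sample shown above.
sample_lines = [
    "b2d5748083d6fdffec6c2d68d4d4442d:::0",
    "60d068f9cafb656431e62a6542de2dc0:::1",
]
truth_labels = parse_truth(sample_lines)
print(truth_labels)
```

With a real file, the same function can be called as `parse_truth(open(path, encoding="utf-8"))`, where `path` points to the `truth.txt` of one language folder.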

Output

Your software must take as input the absolute path to an unpacked dataset, and has to output for each document of the dataset a corresponding XML file that looks like this:

The naming of the output files is up to you. However, we recommend using the author-id as filename and "xml" as extension.

IMPORTANT! Languages should not be mixed. Create one folder per language and place in it only the prediction files for that language.

Evaluation

The performance of your system will be ranked by accuracy. For each language, we will calculate individual accuracies in discriminating between the two classes. Finally, we will average the accuracy values per language to obtain the final ranking.
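The ranking measure described above can be sketched as follows. The code assumes that ground truth and predictions are dictionaries mapping author ids to labels; the function names and the toy labels are hypothetical and are not taken from the task's official evaluation script.

```python
def accuracy(truth, predicted):
    """Fraction of authors whose predicted label equals the true label."""
    correct = sum(1 for author, label in truth.items()
                  if predicted.get(author) == label)
    return correct / len(truth)


def final_score(per_language):
    """Average the per-language accuracies.

    per_language: dict mapping a language code to a (truth, predicted) pair.
    Returns (accuracy_by_language, mean_accuracy).
    """
    by_language = {lang: accuracy(truth, predicted)
                   for lang, (truth, predicted) in per_language.items()}
    mean_accuracy = sum(by_language.values()) / len(by_language)
    return by_language, mean_accuracy


# Toy labels (hypothetical author ids).
toy_runs = {
    "en": ({"a": 0, "b": 1, "c": 1, "d": 0}, {"a": 0, "b": 1, "c": 0, "d": 0}),
    "es": ({"x": 1, "y": 0}, {"x": 1, "y": 0}),
}
accuracy_by_language, mean_accuracy = final_score(toy_runs)
print(accuracy_by_language, mean_accuracy)
```

An author missing from the predictions counts as an error in this sketch, which seems the safest reading when every author must receive an output file.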

Submission

Once you have finished tuning your approach on the validation set, your software will be tested on the test set. During the competition, the test set will not be released publicly. Instead, we ask you to submit your software for evaluation at our site as described below.

We ask you to prepare your software so that it can be executed via command line calls. The command shall take as input (i) an absolute path to the directory of the test corpus and (ii) an absolute path to an empty output directory:

mySoftware -i INPUT-DIRECTORY -o OUTPUT-DIRECTORY

Within OUTPUT-DIRECTORY, we require two subfolders: en and es, one folder per language, respectively. As the provided output directory is guaranteed to be empty, your software needs to create those subfolders. Within each of these subfolders, you need to create one xml file per author. The xml file looks like this:

The naming of the output files is up to you. However, we recommend using the author-id as filename and "xml" as extension.
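The command-line contract described above (`-i` for the input directory, `-o` for an empty output directory, and `en`/`es` subfolders created by the software) could be handled along these lines. This is only a sketch: the function names are hypothetical, and because the XML layout of the prediction files is not reproduced in this description, the code stops after preparing the folders and writes no predictions.

```python
import argparse
import tempfile
from pathlib import Path

LANGUAGES = ("en", "es")  # One output subfolder per language.


def parse_args(argv=None):
    """Parse the -i / -o options named in the task description."""
    parser = argparse.ArgumentParser(prog="mySoftware")
    parser.add_argument("-i", dest="input_dir", type=Path, required=True,
                        help="absolute path to the test corpus directory")
    parser.add_argument("-o", dest="output_dir", type=Path, required=True,
                        help="absolute path to an empty output directory")
    return parser.parse_args(argv)


def prepare_output(output_dir):
    """Create the per-language subfolders and return their paths."""
    created = []
    for lang in LANGUAGES:
        subfolder = output_dir / lang
        subfolder.mkdir(parents=True, exist_ok=True)
        created.append(subfolder)
    return created


def main(argv=None):
    args = parse_args(argv)
    # Prediction files would be written into these folders afterwards.
    return prepare_output(args.output_dir)


# Demonstration in a temporary directory, so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp:
    created = main(["-i", tmp, "-o", str(Path(tmp) / "out")])
    created_names = [path.name for path in created]
print(created_names)
```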

Note: By submitting your software you retain full copyrights. You agree to grant us usage rights only for the purpose of the PAN competition. We agree not to share your software with a third party or use it for other purposes than the PAN competition.

Related Work

Bilal Ghanem, Paolo Rosso, Francisco Rangel. An Emotional Analysis of False Information in Social Media and News Articles. arXiv preprint arXiv:1908.09951 (2019). ACM Transactions on Internet Technology (TOIT). In Press.

Anastasia Giachanou, Paolo Rosso, Fabio Crestani. Leveraging Emotional Signals for Credibility Detection. Proceedings of the 42nd International ACM Conference on Research and Development in Information Retrieval (SIGIR). pp 877–880. (2019)

Andre Guess, Jonathan Nagler, and Joshua Tucker. Less than you think: Prevalence and predictors of fake news dissemination on Facebook. Science Advances vol. 5 (2019)

Andrew Hall, Loren Terveen, Aaron Halfaker. Bot Detection in Wikidata Using Behavioral and Other Informal Cues. Proceedings of the ACM on Human-Computer Interaction. 2018 Nov 1;2(CSCW):64.

Kashyap Popat, Subhabrata Mukherjee, Andrew Yates, Gerhard Weikum. DeClarE: Debunking Fake News and False Claims using Evidence-Aware Deep Learning. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP). pp 22-32. (2018)

Francisco Rangel and Paolo Rosso. Overview of the 7th Author Profiling Task at PAN 2019: Bots and Gender Profiling in Twitter. In: L. Cappellato, N. Ferro, D. E. Losada and H. Müller (eds.) CLEF 2019 Labs and Workshops, Notebook Papers. CEUR Workshop Proceedings. CEUR-WS.org, vol. 2380

Francisco Rangel, Paolo Rosso, Martin Potthast, Benno Stein. Overview of the 6th author profiling task at pan 2018: multimodal gender identification in Twitter. In: CLEF 2018 Labs and Workshops, Notebook Papers. CEUR Workshop Proceedings. CEUR-WS.org, vol. 2125.

Francisco Rangel, Paolo Rosso, Martin Potthast, Benno Stein. Overview of the 5th Author Profiling Task at PAN 2017: Gender and Language Variety Identification in Twitter. In: Cappellato L., Ferro N., Goeuriot L, Mandl T. (Eds.) CLEF 2017 Labs and Workshops, Notebook Papers. CEUR Workshop Proceedings. CEUR-WS.org, vol. 1866.

Francisco Rangel, Paolo Rosso, Ben Verhoeven, Walter Daelemans, Martin Potthast, Benno Stein. Overview of the 4th Author Profiling Task at PAN 2016: Cross-Genre Evaluations. In: Balog K., Cappellato L., Ferro N., Macdonald C. (Eds.) CLEF 2016 Labs and Workshops, Notebook Papers. CEUR Workshop Proceedings. CEUR-WS.org, vol. 1609, pp. 750-784

Francisco Rangel, Fabio Celli, Paolo Rosso, Martin Potthast, Benno Stein, Walter Daelemans. Overview of the 3rd Author Profiling Task at PAN 2015. In: Linda Cappellato and Nicola Ferro and Gareth Jones and Eric San Juan (Eds.): CLEF 2015 Labs and Workshops, Notebook Papers, 8-11 September, Toulouse, France. CEUR Workshop Proceedings. ISSN 1613-0073, http://ceur-ws.org/Vol-1391/, 2015.

Francisco Rangel, Paolo Rosso, Irina Chugur, Martin Potthast, Martin Trenkmann, Benno Stein, Ben Verhoeven, Walter Daelemans. Overview of the 2nd Author Profiling Task at PAN 2014. In: Cappellato L., Ferro N., Halvey M., Kraaij W. (Eds.) CLEF 2014 Labs and Workshops, Notebook Papers. CEUR-WS.org, vol. 1180, pp. 898-827.

Francisco Rangel, Paolo Rosso, Moshe Koppel, Efstathios Stamatatos, Giacomo Inches. Overview of the Author Profiling Task at PAN 2013. In: Forner P., Navigli R., Tufis D. (Eds.) Notebook Papers of CLEF 2013 LABs and Workshops. CEUR-WS.org, vol. 1179

Francisco Rangel and Paolo Rosso. On the Implications of the General Data Protection Regulation on the Organisation of Evaluation Tasks. In: Language and Law / Linguagem e Direito, Vol. 5(2), pp. 80-102

Kai Shu, Suhang Wang, and Huan Liu. Understanding user profiles on social media for fake news detection. Proceedings of the IEEE Conference on Multimedia Information Processing and Retrieval (MIPR), pp. 430--435 (2018)

Kai Shu, Amy Sliva, Suhang Wang, Jiliang Tang, and Huan Liu. Fake News Detection on Social Media: A Data Mining Perspective. ACM SIGKDD Explorations Newsletter. (2017)
Data from: Truths and Tales: Understanding Online Fake News Networks in...
researchdata.canberra.edu.au
Updated Nov 24, 2023
Cite
Benedict Sheehy (2023). Truths and Tales: Understanding Online Fake News Networks in South Korea [Dataset]. http://doi.org/10.17632/3xb4n9n6t4.1
Explore at:
Unique identifier
https://doi.org/10.17632/3xb4n9n6t4.1
Dataset updated
Nov 24, 2023
Authors
Benedict Sheehy
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
South Korea
Description
This study investigates the features of fake news networks and how they spread during the 2020 South Korean election. Using Actor-Network Theory (ANT), we assessed the network's central players and how they are connected. Results reveal the characteristics of the videoclips and channel networks responsible for the propagation of fake news. Analysis of the videoclip network reveals a high number of detected fake news videos and a high density of connections among users. Assessment of news videoclips on both actual and fake news networks reveals that the real news network is more concentrated. However, the scale of the network may play a role in these variations. Statistics for network centralization reveal that users are spread out over the network, pointing to its decentralized character. A closer look at the real and fake news networks inside videos and channels reveals similar trends. We find that the density of the real news videoclip network is higher than that of the fake news network, whereas the fake news channel networks are denser than their real news counterparts, which may indicate greater activity and interconnectedness in their transmission. We also found that fake news videoclips had more likes than real news videoclips, whereas real news videoclips had more dislikes than fake news videoclips. These findings strongly suggest that fake news videoclips are more accepted when people watch them on YouTube. In addition, we used semantic networks and automated content analysis to uncover common language patterns in fake news which helps us better understand the structure and dynamics of the networks involved in the dissemination of fake news. The findings reported here provide important insights on how fake news spread via social networks during the South Korean election of 2020. The results of this study have important implications for the campaign against fake news and ensuring factual coverage.
Machine Translation Evaluation Dataset for Amharic
zenodo.org
data.niaid.nih.gov
tsv
Updated Mar 31, 2020
Cite
Asmelash Teka Hadgu; Adam Beaudoin; Abel Aregawi (2020). Machine Translation Evaluation Dataset for Amharic [Dataset]. http://doi.org/10.5281/zenodo.3734260
Explore at:
Available download formats: tsv
Unique identifier
https://doi.org/10.5281/zenodo.3734260
Dataset updated
Mar 31, 2020
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Asmelash Teka Hadgu; Adam Beaudoin; Abel Aregawi
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Machine Translation Evaluation Dataset for Amharic

The dataset contains source sentences in Amharic and English and their corresponding reference translations, which were collected using crowdsourcing. These ground-truth sentences come from different domains such as news headlines, social media, Wikipedia and everyday conversation.

Metadata of files in the dataset

amen.tsv
- Domain: news | wiki | twitter | convo
- Source Sentence: Amharic sentence
- Reference Translation: English translation
- Google Translate: output of Google Translate
- Yandex Translate: output of Yandex Translate

enam.tsv
- Domain: news | wiki | twitter | convo
- Source Sentence: English sentence
- Reference Translation: Amharic translation
- Google Translate: output of Google Translate
- Yandex Translate: output of Yandex Translate
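Given the fields listed above, a file such as amen.tsv could be read with the standard `csv` module. The sketch assumes that the file starts with a header row whose column names match the field names exactly (`Domain`, `Source Sentence`, and so on); that is an assumption, and the in-memory rows below are placeholders rather than real data.

```python
import csv
import io
from collections import Counter


def count_by_domain(tsv_file):
    """Count rows per domain in a tab-separated file with a header row."""
    # QUOTE_NONE treats quotation marks in sentences as ordinary characters.
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    return Counter(row["Domain"] for row in reader)


# Placeholder rows standing in for the real file contents.
sample_tsv = io.StringIO(
    "Domain\tSource Sentence\tReference Translation\t"
    "Google Translate\tYandex Translate\n"
    "news\tsource 1\treference 1\tgoogle 1\tyandex 1\n"
    "news\tsource 2\treference 2\tgoogle 2\tyandex 2\n"
    "convo\tsource 3\treference 3\tgoogle 3\tyandex 3\n"
)
domain_counts = count_by_domain(sample_tsv)
print(domain_counts)
```

For the real file, `open("amen.tsv", encoding="utf-8", newline="")` can replace the `io.StringIO` object.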

Amharic source and reference translations across domains:

News: These are news headlines from Ethiopian news websites.
Wikipedia: A random sample of sentences from the Amharic Wikipedia.
Twitter: Amharic Twitter posts on consumer products.
Conversational: Everyday conversational expressions from Amharic native speakers.

English source and reference translations across domains:

News: These are news headlines from Wikipedia current events portal.
Wikipedia: A random sample of sentences from the English Wikipedia.
Twitter: English Twitter posts on global events from Wikipedia current events portal.
Conversational: Everyday conversational expressions from English native speakers.

Evaluation of two systems that provide Amharic translation

The dataset also contains an evaluation of two commercial systems: [Google Translate](https://translate.google.com/) and [Yandex Translate](https://translate.yandex.com/). Both systems provide free APIs for which users can sign up to obtain access keys. The translations for Amharic to English were generated on 14th February 2020. The translations for English to Amharic were generated on 30th March 2020.

Data from: MetaHarm: Harmful YouTube Video Dataset Annotated by Domain...

zenodo.org
data.niaid.nih.gov

Updated Jun 12, 2025

Cite

Wonjeong Jo; Magdalena Wojcieszak (2025). MetaHarm: Harmful YouTube Video Dataset Annotated by Domain Experts, GPT-4-Turbo, and Crowdworkers [Dataset]. http://doi.org/10.5281/zenodo.14647452

Explore at:

Unique identifier

https://doi.org/10.5281/zenodo.14647452

Dataset updated

Jun 12, 2025

Dataset provided by

Zenodo (http://zenodo.org/)

Authors

Wonjeong Jo; Magdalena Wojcieszak

License

Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Area covered

YouTube

Description

We provide text metadata, image frames, and thumbnails of YouTube videos classified as harmful or harmless by domain experts, GPT-4-Turbo, and crowdworkers. Harmful videos are categorized into one or more of six harm categories: Information harms (IH), Hate and Harassment harms (HH), Clickbait harms (CB), Addictive harms (ADD), Sexual harms (SXL), and Physical harms (PH).

This repository includes the text metadata and a link to external cloud storage for the image data.

Text Metadata

Folder	Subfolder	#Videos
Ground Truth	Harmful_full_agreement (classified as harmful by all the three actors)	5,109
	Harmful_subset_agreement (classified as harmful by at least two of the three actors)	14,019
Domain Experts	Harmful	15,115
	Harmless	3,303
GPT-4-Turbo	Harmful	10,495
	Harmless	7,818
Crowdworkers (Workers from Amazon Mechanical Turk)	Harmful	12,668
	Harmless	4,390
Unannotated large pool	-	60,906

Note. The term "actor" refers to the annotating entities: domain experts, GPT-4-Turbo, and crowdworkers

Explanations about the indicators

1. Ground truth - harmful_full_agreement & harmful_subset_agreement

- links

- video_id

- channel

- description

- transcript

- date

- maj_harmcat: In the full_agreement version, this represents a harm category identified by all three actors. In the subset_agreement version, it represents a harm category assigned by at least two of the three actors.

- all_harmcat: This includes all harm categories classified by any of the actors without requiring agreement. It captures all classified categories.

2. Domain Experts, GPT-4-Turbo, Crowdworkers

- links

- video_id

- channel

- description

- transcript

- date

- harmcat

3. Unannotated large pool

- links

- video_id

- channel

- description

- transcript

- date
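The difference between the maj_harmcat and all_harmcat fields described for the ground-truth files can be sketched as below. The code assumes each actor's labels for a video are available as a set of category codes, and it reads subset agreement as "at least two of the three actors", which is the reading the video counts in the table suggest. The function name and the toy labels are hypothetical.

```python
from collections import Counter


def aggregate_harmcats(labels_by_actor, min_actors):
    """Combine per-actor harm categories for one video.

    labels_by_actor: dict mapping an actor name to a set of category codes.
    min_actors: number of actors that must agree on a category
                (3 for full agreement, 2 for subset agreement under the
                assumption stated above).
    Returns (maj_harmcat, all_harmcat) as sorted lists.
    """
    votes = Counter()
    for categories in labels_by_actor.values():
        votes.update(set(categories))  # One vote per actor per category.
    maj_harmcat = sorted(cat for cat, n in votes.items() if n >= min_actors)
    all_harmcat = sorted(votes)  # Every category named by any actor.
    return maj_harmcat, all_harmcat


# Toy labels for a single video (hypothetical).
toy_labels = {
    "domain_experts": {"IH", "HH"},
    "gpt4_turbo": {"IH", "HH"},
    "crowdworkers": {"IH", "CB"},
}
full_maj, all_cats = aggregate_harmcats(toy_labels, min_actors=3)
subset_maj, _ = aggregate_harmcats(toy_labels, min_actors=2)
print(full_maj, subset_maj, all_cats)
```

Keeping the agreement threshold as a parameter makes it easy to switch between the two ground-truth variants without duplicating the logic.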

Note. Some data from the external dataset does not include date information. In such cases, the date was marked as 1990-01-01.
We retrieved transcripts using the YouTubeTranscriptApi. If a video does not have any text data in the transcript section, it means the API failed to retrieve the transcript, possibly because the video does not contain any detectable language.

Some image frames are also available in the pickle file.

Image data

The image frames and thumbnails are available at this link: https://ucdavis.app.box.com/folder/302772803692?s=d23b20snl1slwkuh4pgvjs31m7r1xae2

1. Image frames (imageframes_1-20.zip): Image frames are organized into 20 zip folders due to the large size of the image frames. Each zip folder contains subfolders named after the unique video IDs of the annotated videos. Inside each subfolder, there are 15 sequentially numbered image frames (from 0 to 14) extracted from the corresponding video. The image frame folders do not distinguish between videos classified as harmful or non-harmful.

2. Thumbnails (Thumbnails.zip): The zip folder contains thumbnails from the individual videos used in classification. Each thumbnail is named using the unique video ID. This folder does not distinguish between videos classified as harmful or harmless

Related works (in preprint)

For details about the harm classification taxonomy and the performance comparison between crowdworkers, GPT-4-Turbo, and domain experts, please see https://arxiv.org/abs/2411.05854.

Data sets used for user analysis.
plos.figshare.com
xlsx
Updated Jan 30, 2025
Cite
Alon Sela; Omer Neter; Václav Lohr; Petr Cihelka; Fan Wang; Moti Zwilling; John Phillip Sabou; Miloš Ulman (2025). Data sets used for user analysis. [Dataset]. http://doi.org/10.1371/journal.pone.0309688.s002
Explore at:
Available download formats: xlsx
Unique identifier
https://doi.org/10.1371/journal.pone.0309688.s002
Dataset updated
Jan 30, 2025
Dataset provided by
PLOS ONE
Authors
Alon Sela; Omer Neter; Václav Lohr; Petr Cihelka; Fan Wang; Moti Zwilling; John Phillip Sabou; Miloš Ulman
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Social networks are a battlefield for political propaganda. Protected by the anonymity of the internet, political actors use computational propaganda to influence the masses. Their methods include the use of synchronized or individual bots, multiple accounts operated by one social media management tool, or different manipulations of search engines and social network algorithms, all aiming to promote their ideology. While computational propaganda influences modern society, it is hard to measure or detect. Furthermore, with the recent exponential growth of large language models (LLMs) and growing concerns about information overload, which makes the alternative truth spheres noisier than ever before, the complexity and magnitude of computational propaganda are also expected to increase, making its detection even harder. Propaganda in social networks is disguised as legitimate news sent from authentic users. It smartly blends real users with fake accounts. We seek here to detect efforts to manipulate the spread of information in social networks by means of one of the fundamental macro-scale properties of rhetoric: repetitiveness. We use 16 data sets with a total size of 13 GB, 10 related to political topics and 6 related to non-political ones (large-scale disasters), each ranging from tens of thousands to a few million tweets. We compare them and identify statistical and network properties that distinguish between these two types of information cascades. These features are based on the repetition distribution of hashtags and of user mentions, as well as on the network structure. Together, they enable us to distinguish (p-value = 0.0001) between the two classes of information cascades. In addition to constructing a bipartite graph connecting words and tweets for each cascade, we develop a quantitative measure and show how it can be used to distinguish between political and non-political discussions.
Our method is indifferent to the cascade’s country of origin, language, or cultural background, because it relies only on the statistical properties of repetitiveness and on the structure of the bipartite network of words appearing in tweets.
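As a rough illustration of the two ingredients named in the description above (a hashtag repetition distribution and a word-tweet bipartite graph), the following is a minimal sketch and not the authors' method. The function names, the regular expressions, and the representation of a cascade as a plain list of tweet texts are all assumptions made for the example.

```python
import re
from collections import Counter

# Assumed tokenisation rules; the dataset's own preprocessing is not described here.
HASHTAG = re.compile(r"#(\w+)")
WORD = re.compile(r"\w+")


def hashtag_repetition(tweets):
    """Count how often each hashtag (case-insensitive) occurs in a cascade."""
    counts = Counter()
    for text in tweets:
        counts.update(tag.lower() for tag in HASHTAG.findall(text))
    return counts


def repetition_distribution(counts):
    """Map a repetition count r to the number of hashtags used exactly r times."""
    return Counter(counts.values())


def word_tweet_edges(tweets):
    """Return the edge set of a bipartite graph linking words to tweet indices."""
    edges = set()
    for tweet_id, text in enumerate(tweets):
        for word in WORD.findall(text.lower()):
            edges.add((word, tweet_id))
    return edges
```

Comparing the shape of `repetition_distribution` across cascades is one simple way to look at repetitiveness; the quantitative measure mentioned in the description is not specified in this listing and is not reproduced here.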
Data from: On the Influence of Twitter Trolls during the 2016 US...
explore.openaire.eu
Updated Oct 1, 2019
+ more versions
Cite
Nikos Salamanos; Michael J. Jensen; Xinlei He; Yang Chen; Michael Sirivianos (2019). On the Influence of Twitter Trolls during the 2016 US Presidential Election [Dataset]. http://doi.org/10.5281/zenodo.3540801
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.3540801, https://identifiers.org/arxiv:1910.00531v2
Dataset updated
Oct 1, 2019
Authors
Nikos Salamanos; Michael J. Jensen; Xinlei He; Yang Chen; Michael Sirivianos
Area covered
United States
Description
It is a widely accepted fact that state-sponsored Twitter accounts operated during the 2016 US presidential election, spreading millions of tweets with misinformation and inflammatory political content. Whether these social media campaigns of the so-called "troll" accounts were able to manipulate public opinion is still in question. Here we aim to quantify the influence of troll accounts and the impact they had on Twitter by analyzing 152.5 million tweets from 9.9 million users, including 822 troll accounts. The data, collected during the US election campaign, contain original troll tweets before they were deleted by Twitter. From these data, we constructed a very large interaction graph: a directed graph of 9.3 million nodes and 169.9 million edges. Recently, Twitter released datasets on the misinformation campaigns of 8,275 state-sponsored accounts linked to Russia, Iran and Venezuela as part of the investigation into foreign interference in the 2016 US election. These data serve as ground-truth identifiers of troll users in our dataset. Using graph analysis techniques, we qualify the diffusion cascades of web and media context that have been shared by the troll accounts. We present strong evidence that authentic users were the source of the viral cascades. Although the trolls were participating in the viral cascades, they did not have a leading role in them, and only four troll accounts were truly influential. With this version, we are correcting an error in the Acknowledgments regarding the research funding that supports this work. The correct one is the European Union's Horizon 2020 Research and Innovation program under the Cybersecurity CONCORDIA project (Grant Agreement No. 830927).

Truth Social Dataset

Explore at:

2 scholarly articles cite this dataset (View in Google Scholar)
