40 datasets found

d
GraphXAI
search.dataone.org
dataverse.harvard.edu
Updated Nov 8, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Queen, Owen (2023). GraphXAI [Dataset]. http://doi.org/10.7910/DVN/KULOS8
Explore at:
Unique identifier
https://doi.org/10.7910/DVN/KULOS8
Dataset updated
Nov 8, 2023
Dataset provided by
Harvard Dataverse
Authors
Queen, Owen
Description
As post hoc explanations are increasingly used to understand the behavior of Graph Neural Networks (GNNs), it becomes crucial to evaluate the quality and reliability of GNN explanations. However, assessing the quality of GNN explanations is challenging as existing graph datasets have no or unreliable ground-truth explanations for a given task. Here, we introduce a synthetic graph data generator, ShapeGGen, which can generate a variety of benchmark datasets (e.g., varying graph sizes, degree distributions, homophilic vs. heterophilic graphs) accompanied by ground-truth explanations. Further, the flexibility to generate diverse synthetic datasets and corresponding ground-truth explanations allows us to mimic the data generated by various real-world applications. We include ShapeGGen and additional XAI-ready real-world graph datasets into an open-source graph explainability library, GraphXAI. In addition, GraphXAI provides a broader ecosystem of data loaders, data processing functions, synthetic and real-world graph datasets with ground-truth explanations, visualizers, GNN model implementations, and a set of evaluation metrics to benchmark the performance of any given GNN explainer.
f
Input data for exopGCN and different GNNs.
plos.figshare.com
zip
Updated Apr 30, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Qingju Jiao; Han Zhang; Jingwen Wu; Nan Wang; Guoying Liu; Yongge Liu (2024). Input data for exopGCN and different GNNs. [Dataset]. http://doi.org/10.1371/journal.pone.0301476.s002
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0301476.s002
Dataset updated
Apr 30, 2024
Dataset provided by
PLOS ONE
Authors
Qingju Jiao; Han Zhang; Jingwen Wu; Nan Wang; Guoying Liu; Yongge Liu
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Graph neural networks (GNNs), with their ability to incorporate node features into graph learning, have achieved impressive performance in many graph analysis tasks. However, current GNNs including the popular graph convolutional network (GCN) cannot obtain competitive results on the graphs without node features. In this work, we first introduce path-driven neighborhoods, and then define an extensional adjacency matrix as a convolutional operator. Second, we propose an approach named exopGCN which integrates the simple and effective convolutional operator into GCN to classify the nodes in the graphs without features. Experiments on six real-world graphs without node features indicate that exopGCN achieves better performance than other GNNs on node classification. Furthermore, by adding the simple convolutional operator into 13 GNNs, the accuracy of these methods are improved remarkably, which means that our research can offer a general skill to improve accuracy of GNNs. More importantly, we study the relationship between node classification by GCN without node features and community detection. Extensive experiments including six real-world graphs and nine synthetic graphs demonstrate that the positive relationship between them can provide a new direction on exploring the theories of GCNs.
Freebase Datasets for Robust Evaluation of Knowledge Graph Link Prediction...
zenodo.org
data.niaid.nih.gov
zip
Updated Nov 29, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Nasim Shirvani Mahdavi; Farahnaz Akrami; Mohammed Samiul Saeef; Xiao Shi; Chengkai Li; Nasim Shirvani Mahdavi; Farahnaz Akrami; Mohammed Samiul Saeef; Xiao Shi; Chengkai Li (2023). Freebase Datasets for Robust Evaluation of Knowledge Graph Link Prediction Models [Dataset]. http://doi.org/10.5281/zenodo.7909511
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.7909511
Dataset updated
Nov 29, 2023
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Nasim Shirvani Mahdavi; Farahnaz Akrami; Mohammed Samiul Saeef; Xiao Shi; Chengkai Li; Nasim Shirvani Mahdavi; Farahnaz Akrami; Mohammed Samiul Saeef; Xiao Shi; Chengkai Li
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Freebase is amongst the largest public cross-domain knowledge graphs. It possesses three main data modeling idiosyncrasies. It has a strong type system; its properties are purposefully represented in reverse pairs; and it uses mediator objects to represent multiary relationships. These design choices are important in modeling the real-world. But they also pose nontrivial challenges in research of embedding models for knowledge graph completion, especially when models are developed and evaluated agnostically of these idiosyncrasies. We make available several variants of the Freebase dataset by inclusion and exclusion of these data modeling idiosyncrasies. This is the first-ever publicly available full-scale Freebase dataset that has gone through proper preparation.

Dataset Details
The dataset consists of the four variants of Freebase dataset as well as related mapping/support files. For each variant, we made three kinds of files available:
Subject matter triples file
fb+/-CVT+/-REV One folder for each variant. In each folder there are 5 files: train.txt, valid.txt, test.txt, entity2id.txt, relation2id.txt Subject matter triples are the triples belong to subject matters domains—domains describing real-world facts.
Example of a row in train.txt, valid.txt, and test.txt:
2, 192, 0
Example of a row in entity2id.txt:
/g/112yfy2xr, 2
Example of a row in relation2id.txt:
/music/album/release_type, 192
Explaination
"/g/112yfy2xr" and "/m/02lx2r" are the MID of the subject entity and object entity, respectively. "/music/album/release_type" is the realtionship between the two entities. 2, 192, and 0 are the IDs assigned by the authors to the objects.
Type system file
freebase_endtypes: Each row maps an edge type to its required subject type and object type.
Example
92, 47178872, 90
Explanation
"92" and "90" are the type id of the subject and object which has the relationship id "47178872".
Metadata files
object_types: Each row maps the MID of a Freebase object to a type it belongs to.
Example
/g/11b41c22g, /type/object/type, /people/person
Explanation
The entity with MID "/g/11b41c22g" has a type "/people/person"
object_names: Each row maps the MID of a Freebase object to its textual label.
Example
/g/11b78qtr5m, /type/object/name, "Viroliano Tries Jazz"@en
Explanation
The entity with MID "/g/11b78qtr5m" has name "Viroliano Tries Jazz" in English.
object_ids: Each row maps the MID of a Freebase object to its user-friendly identifier.
Example
/m/05v3y9r, /type/object/id, "/music/live_album/concert"
Explanation
The entity with MID "/m/05v3y9r" can be interpreted by human as a music concert live album.
domains_id_label: Each row maps the MID of a Freebase domain to its label.
Example
/m/05v4pmy, geology, 77
Explanation
The object with MID "/m/05v4pmy" in Freebase is the domain "geology", and has id "77" in our dataset.
types_id_label: Each row maps the MID of a Freebase type to its label.
Example
/m/01xljxh, /government/political_party, 147
Explanation
The object with MID "/m/01xljxh" in Freebase is the type "/government/political_party", and has id "147" in our dataset.
entities_id_label: Each row maps the MID of a Freebase entity to its label.
Example
/g/11b78qtr5m, Viroliano Tries Jazz, 2234
Explanation
The entity with MID "/g/11b78qtr5m" in Freebase is "Viroliano Tries Jazz", and has id "2234" in our dataset.
properties_id_label: Each row maps the MID of a Freebase property to its label.
Example
/m/010h8tp2, /comedy/comedy_group/members, 47178867
Explanation
The object with MID "/m/010h8tp2" in Freebase is a property(relation/edge), it has label "/comedy/comedy_group/members" and has id "47178867" in our dataset.
uri_original2simplified and uri_simplified2original: The mapping between original URI and simplified URI and the mapping between simplified URI and original URI repectively.
Example
uri_original2simplified
"http://rdf.freebase.com/ns/type.property.unique": "/type/property/unique"
uri_simplified2original
"/type/property/unique": "http://rdf.freebase.com/ns/type.property.unique"
Explanation
The URI "http://rdf.freebase.com/ns/type.property.unique" in the original Freebase RDF dataset is simplified into "/type/property/unique" in our dataset.
The identifier "/type/property/unique" in our dataset has URI http://rdf.freebase.com/ns/type.property.unique in the original Freebase RDF dataset.
P
SCG Dataset
paperswithcode.com
Updated Nov 12, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
SCG Dataset [Dataset]. https://paperswithcode.com/dataset/scg
Explore at:
Dataset updated
Nov 12, 2024
Description
Abstract: Graph Neural Networks (GNNs) have recently gained traction in transportation, bioinformatics, language and image processing, but research on their application to supply chain management remains limited. Supply chains are inherently graph-like, making them ideal for GNN methodologies, which can optimize and solve complex problems. The barriers include a lack of proper conceptual foundations, familiarity with graph applications in SCM, and real-world benchmark datasets for GNN-based supply chain research. To address this, we discuss and connect supply chains with graph structures for effective GNN application, providing detailed formulations, examples, mathematical definitions, and task guidelines. Additionally, we present a multi-perspective real-world benchmark dataset from a leading FMCG company in Bangladesh, focusing on supply chain planning. We discuss various supply chain tasks using GNNs and benchmark several state-of-the-art models on homogeneous and heterogeneous graphs across six supply chain analytics tasks. Our analysis shows that GNN-based models consistently outperform statistical ML and other deep learning models by around 10-30% in regression, 10-30% in classification and detection tasks, and 15-40% in anomaly detection tasks on designated metrics. With this work, we lay the groundwork for solving supply chain problems using GNNs, supported by conceptual discussions, methodological insights, and a comprehensive dataset.
m
A Row Generation Algorithm for Finding Optimal Burning Sequences of Large...
data.mendeley.com
redu.unicamp.br
Updated Nov 11, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Felipe Pereira (2024). A Row Generation Algorithm for Finding Optimal Burning Sequences of Large Graphs - Complementary Data [Dataset]. http://doi.org/10.17632/c95hp3m4mz.2
Explore at:
Unique identifier
https://doi.org/10.17632/c95hp3m4mz.2
Dataset updated
Nov 11, 2024
Authors
Felipe Pereira
License
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0)https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Description
This dataset contains complementary data to the paper "A Row Generation Algorithm for Finding Optimal Burning Sequences of Large Graphs" [1], which proposes an exact algorithm for the Graph Burning Problem, an NP-hard optimization problem that models a form of contagion diffusion on social networks.

Concerning the computational experiments discussed in that paper, we make available:

Four sets of instances;

The optimal (or best known) solutions obtained;

The source code;

An Appendix with additional details about the results.

The "delta" input sets include graphs that are real-world networks [1,2], while the "grid" input set contains graphs that are square grids.

The directories "delta_10K_instances", "delta_100K_instances", "delta_4M_instances" and "grid_instances" contain files that describe the sets of instances. The first two lines of each file contain:

where

where and

The directories "delta_10K_solutions", "delta_100K_solutions", "delta_4M_solutions" and "grid_solutions" contain files that describe the optimal (or best known) solutions for the corresponding sets of instances.

The first line of each file contains:

where is the number of vertices in the burning sequence. Each of the next lines contains:

where

The directory "source_code" contains the implementations of the exact algorithm proposed in the paper [1], namely, PRYM.

Lastly, the file "appendix.pdf" presents additional details on the results reported in the paper.

This work was supported by grants from Santander Bank, Brazil, Brazilian National Council for Scientific and Technological Development (CNPq), Brazil, São Paulo Research Foundation (FAPESP), Brazil and Fund for Support to Teaching, Research and Outreach Activities (FAEPEX).

Caveat: the opinions, hypotheses and conclusions or recommendations expressed in this material are the sole responsibility of the authors and do not necessarily reflect the views of Santander, CNPq, FAPESP or FAEPEX.

References

[1] F. C. Pereira, P. J. de Rezende, T. Yunes and L. F. B. Morato. A Row Generation Algorithm for Finding Optimal Burning Sequences of Large Graphs. Submitted. 2024.

[2] Jure Leskovec and Andrej Krevl. SNAP Datasets: Stanford Large Network Dataset Collection. 2024. https://snap.stanford.edu/data

[3] Ryan A. Rossi and Nesreen K. Ahmed. The Network Data Repository with Interactive Graph Analytics and Visualization. In: AAAI, 2022. https://networkrepository.com
f
Data set introduction.
plos.figshare.com
xls
Updated May 23, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Guoyang Tang; Xueyi Zhao; Yanyun Fu; Xiaolin Ning (2024). Data set introduction. [Dataset]. http://doi.org/10.1371/journal.pone.0297989.t001
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0297989.t001
Dataset updated
May 23, 2024
Dataset provided by
PLOS ONE
Authors
Guoyang Tang; Xueyi Zhao; Yanyun Fu; Xiaolin Ning
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
In light of the exponential growth in information volume, the significance of graph data has intensified. Graph clustering plays a pivotal role in graph data processing by jointly modeling the graph structure and node attributes. Notably, the practical significance of multi-view graph clustering is heightened due to the presence of diverse relationships within real-world graph data. Nonetheless, prevailing graph clustering techniques, predominantly grounded in deep learning neural networks, face challenges in effectively handling multi-view graph data. These challenges include the incapability to concurrently explore the relationships between multiple view structures and node attributes, as well as difficulties in processing multi-view graph data with varying features. To tackle these issues, this research proposes a straightforward yet effective multi-view graph clustering approach known as SLMGC. This approach uses graph filtering to filter noise, reduces computational complexity by extracting samples based on node importance, enhances clustering representations through graph contrastive regularization, and achieves the final clustering outcomes using a self-training clustering algorithm. Notably, unlike neural network algorithms, this approach avoids the need for intricate parameter settings. Comprehensive experiments validate the supremacy of the SLMGC approach in multi-view graph clustering endeavors when contrasted with prevailing deep neural network techniques.
CompanyKG Dataset V2.0: A Large-Scale Heterogeneous Graph for Company...
zenodo.org
data.niaid.nih.gov
application/gzip, bin +1
Updated Jun 4, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Lele Cao; Lele Cao; Vilhelm von Ehrenheim; Vilhelm von Ehrenheim; Mark Granroth-Wilding; Mark Granroth-Wilding; Richard Anselmo Stahl; Richard Anselmo Stahl; Drew McCornack; Drew McCornack; Armin Catovic; Armin Catovic; Dhiana Deva Cavacanti Rocha; Dhiana Deva Cavacanti Rocha (2024). CompanyKG Dataset V2.0: A Large-Scale Heterogeneous Graph for Company Similarity Quantification [Dataset]. http://doi.org/10.5281/zenodo.11391315
Explore at:
application/gzip, bin, txtAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.11391315
Dataset updated
Jun 4, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Lele Cao; Lele Cao; Vilhelm von Ehrenheim; Vilhelm von Ehrenheim; Mark Granroth-Wilding; Mark Granroth-Wilding; Richard Anselmo Stahl; Richard Anselmo Stahl; Drew McCornack; Drew McCornack; Armin Catovic; Armin Catovic; Dhiana Deva Cavacanti Rocha; Dhiana Deva Cavacanti Rocha
Time period covered
May 29, 2024
Description
CompanyKG is a heterogeneous graph consisting of 1,169,931 nodes and 50,815,503 undirected edges, with each node representing a real-world company and each edge signifying a relationship between the connected pair of companies.

Edges: We model 15 different inter-company relations as undirected edges, each of which corresponds to a unique edge type. These edge types capture various forms of similarity between connected company pairs. Associated with each edge of a certain type, we calculate a real-numbered weight as an approximation of the similarity level of that type. It is important to note that the constructed edges do not represent an exhaustive list of all possible edges due to incomplete information. Consequently, this leads to a sparse and occasionally skewed distribution of edges for individual relation/edge types. Such characteristics pose additional challenges for downstream learning tasks. Please refer to our paper for a detailed definition of edge types and weight calculations.

Nodes: The graph includes all companies connected by edges defined previously. Each node represents a company and is associated with a descriptive text, such as "Klarna is a fintech company that provides support for direct and post-purchase payments ...". To comply with privacy and confidentiality requirements, we encoded the text into numerical embeddings using four different pre-trained text embedding models: mSBERT (multilingual Sentence BERT), ADA2, SimCSE (fine-tuned on the raw company descriptions) and PAUSE.

Evaluation Tasks. The primary goal of CompanyKG is to develop algorithms and models for quantifying the similarity between pairs of companies. In order to evaluate the effectiveness of these methods, we have carefully curated three evaluation tasks:

Similarity Prediction (SP). To assess the accuracy of pairwise company similarity, we constructed the SP evaluation set comprising 3,219 pairs of companies that are labeled either as positive (similar, denoted by "1") or negative (dissimilar, denoted by "0"). Of these pairs, 1,522 are positive and 1,697 are negative.

Competitor Retrieval (CR). Each sample contains one target company and one of its direct competitors. It contains 76 distinct target companies, each of which has 5.3 competitors annotated in average. For a given target company A with N direct competitors in this CR evaluation set, we expect a competent method to retrieve all N competitors when searching for similar companies to A.

Similarity Ranking (SR) is designed to assess the ability of any method to rank candidate companies (numbered 0 and 1) based on their similarity to a query company. Paid human annotators, with backgrounds in engineering, science, and investment, were tasked with determining which candidate company is more similar to the query company. It resulted in an evaluation set comprising 1,856 rigorously labeled ranking questions. We retained 20% (368 samples) of this set as a validation set for model development.

Edge Prediction (EP) evaluates a model's ability to predict future or missing relationships between companies, providing forward-looking insights for investment professionals. The EP dataset, derived (and sampled) from new edges collected between April 6, 2023, and May 25, 2024, includes 40,000 samples, with edges not present in the pre-existing CompanyKG (a snapshot up until April 5, 2023).

Background and Motivation

In the investment industry, it is often essential to identify similar companies for a variety of purposes, such as market/competitor mapping and Mergers & Acquisitions (M&A). Identifying comparable companies is a critical task, as it can inform investment decisions, help identify potential synergies, and reveal areas for growth and improvement. The accurate quantification of inter-company similarity, also referred to as company similarity quantification, is the cornerstone to successfully executing such tasks. However, company similarity quantification is often a challenging and time-consuming process, given the vast amount of data available on each company, and the complex and diversified relationships among them.

While there is no universally agreed definition of company similarity, researchers and practitioners in PE industry have adopted various criteria to measure similarity, typically reflecting the companies' operations and relationships. These criteria can embody one or more dimensions such as industry sectors, employee profiles, keywords/tags, customers' review, financial performance, co-appearance in news, and so on. Investment professionals usually begin with a limited number of companies of interest (a.k.a. seed companies) and require an algorithmic approach to expand their search to a larger list of companies for potential investment.

In recent years, transformer-based Language Models (LMs) have become the preferred method for encoding textual company descriptions into vector-space embeddings. Then companies that are similar to the seed companies can be searched in the embedding space using distance metrics like cosine similarity. The rapid advancements in Large LMs (LLMs), such as GPT-3/4 and LLaMA, have significantly enhanced the performance of general-purpose conversational models. These models, such as ChatGPT, can be employed to answer questions related to similar company discovery and quantification in a Q&A format.

However, graph is still the most natural choice for representing and learning diverse company relations due to its ability to model complex relationships between a large number of entities. By representing companies as nodes and their relationships as edges, we can form a Knowledge Graph (KG). Utilizing this KG allows us to efficiently capture and analyze the network structure of the business landscape. Moreover, KG-based approaches allow us to leverage powerful tools from network science, graph theory, and graph-based machine learning, such as Graph Neural Networks (GNNs), to extract insights and patterns to facilitate similar company analysis. While there are various company datasets (mostly commercial/proprietary and non-relational) and graph datasets available (mostly for single link/node/graph-level predictions), there is a scarcity of datasets and benchmarks that combine both to create a large-scale KG dataset expressing rich pairwise company relations.

Source Code and Tutorial:
https://github.com/llcresearch/CompanyKG2

Paper: to be published
Link-prediction on Biomedical Knowledge Graphs
zenodo.org
data.niaid.nih.gov
zip
Updated Jun 25, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Alberto Cattaneo; Daniel Justus; Stephen Bonner; Stephen Bonner; Thomas Martynec; Thomas Martynec; Alberto Cattaneo; Daniel Justus (2024). Link-prediction on Biomedical Knowledge Graphs [Dataset]. http://doi.org/10.5281/zenodo.12097377
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.12097377
Dataset updated
Jun 25, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Alberto Cattaneo; Daniel Justus; Stephen Bonner; Stephen Bonner; Thomas Martynec; Thomas Martynec; Alberto Cattaneo; Daniel Justus
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Time period covered
Jun 25, 2021
Description
Release of the experimental data from the paper Towards Linking Graph Topology to Model Performance for Biomedical Knowledge Graph Completion (accepted at Machine Learning for Life and Material Sciences workshop @ ICML2024).

Knowledge Graph Completion has been increasingly adopted as a useful method for several tasks in biomedical research, like drug repurposing or drug-target identification. To that end, a variety of datasets and Knowledge Graph Embedding models has been proposed over the years. However, little is known about the properties that render a dataset useful for a given task and, even though theoretical properties of Knowledge Graph Embedding models are well understood, their practical utility in this field remains controversial. We conduct a comprehensive investigation into the topological properties of publicly available biomedical Knowledge Graphs and establish links to the accuracy observed in real-world applications. By releasing all model predictions we invite the community to build upon our work and continue improving the understanding of these crucial applications.

Experiments were conducted on six datasets: five from the biomedical domain (Hetionet, PrimeKG, PharmKG, OpenBioLink2020 HQ, PharMeBINet) and one trivia KG (FB15k-237). All datasets were randomly split into training, validation and test set (80% / 10% / 10%; in the case of PharMeBINet, 99.3% / 0.35% / 0.35% to mitigate the increased inference cost on the larger dataset).

On each dataset, four different KGE models were compared: TransE, DistMult, RotatE, TripleRE. Hyperparameters were tuned on the validation split and we release results for tail predictions on the test split. In particular, each test query (h,r,?) is scored against all entities in the KG and we compute the rank of the score of the correct completion (h,r,t) , after masking out scores of other (h,r,t') triples contained in the graph.

Note: the ranks provided are computed as the average between the optimistic and pessimistic ranks of triple scores.

Inside experimental_data.zip, the following files are provided for each dataset:

{dataset}_preprocessing.ipynb: a Jupyter notebook for downloading and preprocessing the dataset. In particular, this generates the custom label->ID mapping for entities and relations, and the numerical tensor of (h_ID,r_ID,t_ID) triples for all edges in the graph, which can be used to compute graph topological metrics (e.g., using kg-topology-toolbox) and compare them with the edge prediction accuracy.

test_ranks.csv: csv table with columns ["h", "r", "t"] specifying the head, relation, tail IDs of the test triples, and columns ["DistMult", "TransE", "RotatE", "TripleRE"] with the rank of the ground-truth tail in the ordered list of predictions made by the four models;

entity_dict.csv: the list of entity labels, ordered by entity ID (as generated in the preprocessing notebook);

relation_dict.csv: the list of relation labels, ordered by relation ID (as generated in the preprocessing notebook).

The separate top_100_tail_predictions.zip archive contains, for each of the test queries in the corresponding test_ranks.csv table, the IDs of the top-100 tail predictions made by each of the four KGE models, ordered by decreasing likelihood. The predictions are released in a .npz archive of numpy arrays (one array of shape (n_test_triples, 100) for each of the KGE models).

All experiments (training and inference) have been run on Graphcore IPU hardware using the BESS-KGE distribution framework.
Data from: IntelliGraphs: Datasets for Benchmarking Knowledge Graph...
zenodo.org
data.niaid.nih.gov
zip
Updated Feb 1, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Thiviyan Thanapalasingam; Thiviyan Thanapalasingam; Emile van Krieken; Emile van Krieken; Peter Bloem; Peter Bloem; Paul Groth; Paul Groth (2025). IntelliGraphs: Datasets for Benchmarking Knowledge Graph Generation [Dataset]. http://doi.org/10.5281/zenodo.14787483
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.14787483
Dataset updated
Feb 1, 2025
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Thiviyan Thanapalasingam; Thiviyan Thanapalasingam; Emile van Krieken; Emile van Krieken; Peter Bloem; Peter Bloem; Paul Groth; Paul Groth
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
IntelliGraphs is a collection of datasets for benchmarking Knowledge Graph Generation models. It consists of three synthetic datasets (syn-paths, syn-tipr, syn-types) and two real-world datasets (wd-movies, wd-articles). There is also a Python package available that loads these datasets and verifies new graphs using semantics that was pre-defined for each dataset. It can also be used as a testbed for developing new generative models.
u
Topology Bench: Systematic Graph Based Benchmarking for Optical Networks
rdr.ucl.ac.uk
zenodo.org
bin
Updated Oct 14, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Topology Bench: Systematic Graph Based Benchmarking for Optical Networks [Dataset]. https://rdr.ucl.ac.uk/articles/dataset/Topology_Bench_Systematic_Graph_Based_Benchmarking_for_Optical_Networks/27212457
Explore at:
binAvailable download formats
Unique identifier
https://doi.org/10.5522/04/27212457.v2
Dataset updated
Oct 14, 2024
Dataset provided by
University College London
Authors
Robin Matzner; Ahuja, Akanksha; Rasoul Sadeghi Yamchi; Michael Doherty; Alejandra Beghelli Zapata; Seb J. Savory; Polina Bayvel
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
TopologyBench is a systematic graph theoretical approach to benchmarking optical network topologies. Network datasets are combined with their corresponding graph theoretical analysis to provide a systematic methodology for selecting diverse sets of optical networks for benchmarking. This topology benchmark is comprised of a network dataset and a systematic graph theoretic analysis. The dataset provides (a) 105 real optical networks and (b) synthetic topologies, generated by the SNR-BA model, divided into (i) Syn-small of 900 synthetic networks and (ii) Syn-large of 270,000 synthetic networks. The systematic graph theoretical analysis identifies and analyses structural, spatial and spectral properties of both the real world and synthetic networks. The graph theoretical correlation analysis reveal network design strategies leading to sparse yet efficient networks. An outlier analysis identifies networks that deviate from standard network designs. The analysis also identifies the limitations of real data in terms of network diversity and provides a justification for using synthetic data to complement the real dataset. We conclude the paper by providing a systematic methodology to cluster networks based on unsupervised machine learning and to select a diverse set of topologies for benchmarking. TopologyBench is a novel, high-quality and unified benchmark designed to facilitate research collaborations in long-haul fibre infrastructure by providing a systematic graph theoretical approach to benchmarking optical networks.
f
Results of experiments on planned area fix-size.
plos.figshare.com
xls
Updated Jun 21, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Hina Ali; Khalid Saleem (2024). Results of experiments on planned area fix-size. [Dataset]. http://doi.org/10.1371/journal.pone.0304422.t004
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0304422.t004
Dataset updated
Jun 21, 2024
Dataset provided by
PLOS ONE
Authors
Hina Ali; Khalid Saleem
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This study delves into the critical need for generating real-world compatible data to support the application of deep reinforcement learning (DRL) in vehicle routing. Despite the advancements in DRL algorithms, their practical implementation in vehicle routing is hindered by the scarcity of appropriate real-world datasets. Existing methodologies often rely on simplistic distance metrics, failing to accurately capture the complexities inherent in real-world routing scenarios. To address this challenge, we present a novel approach for generating real-world compatible data tailored explicitly for DRL-based vehicle routing models. Our methodology centers on the development of a spatial data extraction and curation tool adept at extracting geocoded locations from diverse urban environments, encompassing both planned and unplanned areas. Leveraging advanced techniques, the tool refines location data, accounting for unique characteristics of urban environments. Furthermore, it integrates specialized distance metrics and location demands to construct vehicle routing graphs that represent real-world conditions. Through comprehensive experimentation on varied real-world testbeds, our approach showcases its efficacy in producing datasets closely aligned with the requirements of DRL-based vehicle routing models. It’s worth mentioning that this dataset is structured as a graph containing location, distance, and demand information, with each graph stored independently to facilitate efficient access and manipulation. The findings underscore the adaptability and reliability of our methodology in tackling the intricacies of real-world routing challenges. This research marks a significant stride towards enabling the practical application of DRL techniques in addressing real-world vehicle routing problems.
R
The least cost directed perfect awareness problem: benchmark instances and...
redu.unicamp.br
Updated Nov 11, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Repositório de Dados de Pesquisa da Unicamp (2024). The least cost directed perfect awareness problem: benchmark instances and solutions [Dataset]. http://doi.org/10.25824/redu/1KQLD8
Explore at:
Unique identifier
https://doi.org/10.25824/redu/1KQLD8
Dataset updated
Nov 11, 2024
Dataset provided by
Repositório de Dados de Pesquisa da Unicamp
Dataset funded by
São Paulo Research Foundation
National Council for Scientific and Technological Developmenthttp://www.cnpq.br/
Description
This dataset contains complementary data to the paper "The Least Cost Directed Perfect Awareness Problem: Complexity, Algorithms and Computations" [1]. Here, we make available two sets of instances of the combinatorial optimization problem studied in that paper, which deals with the spread of information on social networks. We also provide the best known solutions and bounds obtained through computational experiments for each instance. The first input set includes 300 synthetic instances composed of graphs that resemble real-world social networks. These graphs were produced with a generator proposed in [2]. The second set consists of 14 instances built from graphs obtained by crawling Twitter [3]. The directories "synthetic_instances" and "twitter_instances" contain files that describe both sets of instances, all of which follow the format: the first two lines correspond to: {n} {m} where {n} and {m} are the number of vertices and edges in the graph. Each of the next {n} lines contains: {v} {c} {t} where {v} identifies a vertex while {c} and {t} are the cost and threshold associated to that vertex. Each of the {m} remaining lines contains: {u} {v} {w} where {u} and {v} identify an ordered pair of vertices that determines a directed edge with weight {w}. The directories "solutions_for_synthetic_instances" and "solutions_for_twitter_instances" contain files that describe the best known solutions for both sets of instances, all of which follow the format: the first line corresponds to: {s} where {s} is the number of vertices in the solution. Each of the next {s} lines contains: {v} where {v} identifies a seed vertex. The last line contains: {c} where {c} is the solution cost. Lastly, two files, namely, "bounds_for_synthetic_instances.csv" and "bounds_for_twitter_instances.csv", enumerate the values of the best known lower and upper bounds for both sets of instances. This work was supported by grants from Santander Bank, Brazil, Brazilian National Council for Scientific and Technological Development (CNPq), Brazil, São Paulo Research Foundation (FAPESP), Brazil. Caveat: the opinions, hypotheses and conclusions or recommendations expressed in this material are the responsibility of the authors and do not necessarily reflect the views of Santander, CNPq, or FAPESP. References [1] F. C. Pereira, P. J. de Rezende. The Least Cost Directed Perfect Awareness Problem: Complexity, Algorithms and Computations. Submitted. 2023. [2] B. Bollobás, C. Borgs, J. Chayes, and O. Riordan. Directed scale-free graphs. In Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’03, pages 132–139, 2003. [3] C. Schweimer, C. Gfrerer, F. Lugstein, D. Pape, J. A. Velimsky, R. Elsässer, and B. C. Geiger. Generating simple directed social network graphs for information spreading. In Proceedings of the ACM Web Conference 2022, WWW ’22, pages 1475–1485, 2022.
n
Event Graph of BPI Challenge 2014
4tu.edu.hpc.n-helix.com
data.4tu.nl
zip
Updated Apr 22, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Dirk Fahland; Stefan Esser (2021). Event Graph of BPI Challenge 2014 [Dataset]. http://doi.org/10.4121/14169494.v1
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.4121/14169494.v1
Dataset updated
Apr 22, 2021
Dataset provided by
4TU.ResearchData
Authors
Dirk Fahland; Stefan Esser
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Business process event data modeled as labeled property graphs

Data Format
-----------

The dataset comprises one labeled property graph in two different file formats.

#1) Neo4j .dump format

A neo4j (https://neo4j.com) database dump that contains the entire graph and can be imported into a fresh neo4j database instance using the following command, see also the neo4j documentation: https://neo4j.com/docs/

/bin/neo4j-admin.(bat|sh) load --database=graph.db --from=

The .dump was created with Neo4j v3.5.

#2) .graphml format

A .zip file containing a .graphml file of the entire graph

Data Schema
-----------

The graph is a labeled property graph over business process event data. Each graph uses the following concepts

:Event nodes - each event node describes a discrete event, i.e., an atomic observation described by attribute "Activity" that occurred at the given "timestamp"

:Entity nodes - each entity node describes an entity (e.g., an object or a user), it has an EntityType and an identifier (attribute "ID")

:Log nodes - describes a collection of events that were recorded together, most graphs only contain one log node

:Class nodes - each class node describes a type of observation that has been recorded, e.g., the different types of activities that can be observed, :Class nodes group events into sets of identical observations

:CORR relationships - from :Event to :Entity nodes, describes whether an event is correlated to a specific entity; an event can be correlated to multiple entities

:DF relationships - "directly-followed by" between two :Event nodes describes which event is directly-followed by which other event; both events in a :DF relationship must be correlated to the same entity node. All :DF relationships form a directed acyclic graph.

:HAS relationship - from a :Log to an :Event node, describes which events had been recorded in which event log

:OBSERVES relationship - from an :Event to a :Class node, describes to which event class an event belongs, i.e., which activity was observed in the graph

:REL relationship - placeholder for any structural relationship between two :Entity nodes

The concepts a further defined in Stefan Esser, Dirk Fahland: Multi-Dimensional Event Data in Graph Databases. CoRR abs/2005.14552 (2020) https://arxiv.org/abs/2005.14552

Data Contents
-------------

neo4j-bpic14-2021-02-17 (.dump|.graphml.zip)

An integrated graph describing the raw event data of the entire BPI Challenge 2014 dataset.
van Dongen, B.F. (Boudewijn) (2014): BPI Challenge 2014. 4TU.ResearchData. Collection. https://doi.org/10.4121/uuid:c3e5d162-0cfd-4bb0-bd82-af5268819c35

BPI Challenge 2014: Similar to other ICT companies, Rabobank Group ICT has to implement an increasing number of software releases, while the time to market is decreasing. Rabobank Group ICT has implemented the ITIL-processes and therefore uses the Change-proces for implementing these so called planned changes. Rabobank Group ICT is looking for fact-based insight into sub questions, concerning the impact of changes in the past, to predict the workload at the Service Desk and/or IT Operations after future changes. The challenge is to design a (draft) predictive model, which can be used to implement in a BI environment. The purpose of this predictive model will be to support Business Change Management in implementing software releases with less impact on the Service Desk and/or IT Operations. We have prepared several case-files with anonymous information from Rabobank Netherlands Group ICT for this challenge. The files contain record details from an ITIL Service Management tool called HP Service Manager.

The original data had the information as extracts in CSV with the Interaction-, Incident- or Change-number as case ID. Next to these case-files, we provide you with an Activity-log, related to the Incident-cases. There is also a document detailing the data in the CSV file and providing background to the Service Management tool. All this information is integrated in the labeled property graph in this dataset.

The data contains the following entities and their events

- ServiceComponent - an IT hardware or software component in a financial institute
- ConfigurationItem - an part of a ServiceComponent that can be configured, changed, or modified
- Incident - a problem or issue that occurred at a configuration item of a service component
- Interaction - a logical grouping of activities performed for investigating an incident and identifying a solution for the incident
- Change - a logical grouping of activities performed to change or modify one or more configuration items
- Case_R - a user or worker involved in any of the steps
- KM - an entry in the knowledge database used to resolve incidents

Data Size
---------

BPIC14, nodes: 919838, relationships: 6682386
Data from: Graph drawing using tabu search coupled with path relinking
data.niaid.nih.gov
datadryad.org
zip
Updated Apr 30, 2019
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Fadi K. Dib; Peter Rodgers (2019). Graph drawing using tabu search coupled with path relinking [Dataset]. http://doi.org/10.5061/dryad.k082rv8
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5061/dryad.k082rv8
Dataset updated
Apr 30, 2019
Dataset provided by
University of Kent
Authors
Fadi K. Dib; Peter Rodgers
License
https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html
Description
Graph drawing, or the automatic layout of graphs, is a challenging problem. There are several search based methods for graph drawing which are based on optimizing an objective function which is formed from a weighted sum of multiple criteria. In this paper, we propose a new neighbourhood search method which uses a tabu search coupled with path relinking to optimize such objective functions for general graph layouts with undirected straight lines. To our knowledge, before our work, neither of these methods have been previously used in general multi-criteria graph drawing. Tabu search uses a memory list to speed up searching by avoiding previously tested solutions, while the path relinking method generates new solutions by exploring paths that connect high quality solutions. We use path relinking periodically within the tabu search procedure to speed up the identification of good solutions. We have evaluated our new method against the commonly used neighbourhood search optimization techniques: hill climbing and simulated annealing. Our evaluation examines the quality of the graph layout (objective function's value) and the speed of layout in terms of the number of evaluated solutions required to draw a graph. We also examine the relative scalability of each method. Our experimental results were applied to both random graphs and a real-world dataset. We show that our method outperforms both hill climbing and simulated annealing by producing a better layout in a lower number of evaluated solutions. In addition, we demonstrate that our method has greater scalability as it can layout larger graphs than the state-of-the-art neighbourhood search methods. Finally, we show that similar results can be produced in a real world setting by testing our method against a standard public graph dataset.
m
Data from: Minimizing the Cost of Leveraging Influencers in Social Networks:...
data.mendeley.com
Updated Nov 11, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Felipe Pereira (2024). Minimizing the Cost of Leveraging Influencers in Social Networks: IP and CP Approaches - Complementary Data [Dataset]. http://doi.org/10.17632/tkk5pdswty.3
Explore at:
Unique identifier
https://doi.org/10.17632/tkk5pdswty.3
Dataset updated
Nov 11, 2024
Authors
Felipe Pereira
License
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0)https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Description
This dataset contains complementary data to the paper "Minimizing the Cost of Leveraging Influencers in Social Networks: IP and CP Approaches" [1], which studies integer/constraint programming formulations for the Least Cost Directed Perfect Awareness Problem, an NP-hard optimization problem that arises in the context of influence marketing. Regarding the computational experiments conducted in the paper, we make available:

Two sets of instances;

The best known attained solutions;

The source code;

An appendix with additional details about the results.

The first input set includes 300 synthetic instances composed of graphs that resemble real-world social networks [2]. The second set consists of 14 instances built from online interactions crawled from X (formerly known as Twitter) [3].

The directories "synthetic_instances" and "x_instances" contain files that describe both sets of instances. The first two lines of each file contain:

where

where

where and

The directories "solutions_for_synthetic_instances" and "solutions_for_x_instances" contain files that describe the best known solutions for both sets of instances. The first line of each file contains:

where is the number of vertices in the solution. Each of the next lines contains:

where

where

The directory "source_code" contains the implementations of the mathematical models studied in the paper.

Lastly, the file "appendix.pdf" presents details of the results reported in the paper [1].

This work was supported by grants from Santander Bank, Brazil, Brazilian National Council for Scientific and Technological Development (CNPq), Brazil, and São Paulo Research Foundation (FAPESP), Brazil.

Caveat: the opinions, hypotheses and conclusions or recommendations expressed in this material are the responsibility of the authors and do not necessarily reflect the views of Santander, CNPq or FAPESP.

References

[1] F. C. Pereira, P. J. de Rezende and T. Yunes. Minimizing the Cost of Leveraging Influencers in Social Networks: IP and CP Approaches. Submitted. 2023.

[2] F. C. Pereira, P. J. de Rezende. The Least Cost Directed Perfect Awareness Problem: complexity, algorithms and computations. Online Social Networks and Media, 37-38, 2023.

[3] C. Schweimer, C. Gfrerer, F. Lugstein, D. Pape, J. A. Velimsky, R. Elsässer, and B. C. Geiger. Generating simple directed social network graphs for information spreading. In Proceedings of the ACM Web Conference 2022, WWW ’22, pages 1475–1485, 2022.
d
Replication Data for: Sequential Monte Carlo for Sampling Balanced and...
search.dataone.org
dataverse.harvard.edu
Updated Nov 8, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
McCartan, Cory; Imai, Kosuke (2023). Replication Data for: Sequential Monte Carlo for Sampling Balanced and Compact Redistricting Plans [Dataset]. https://search.dataone.org/view/sha256%3A4530346effc876e9c9a2eae42dc54d709e48f5c9f444ee7e9fcc4078f0a9e195
Explore at:
Dataset updated
Nov 8, 2023
Dataset provided by
Harvard Dataverse
Authors
McCartan, Cory; Imai, Kosuke
Description
Random sampling of graph partitions under constraints has become a popular tool for evaluating legislative redistricting plans. Analysts detect partisan gerrymandering by comparing a proposed redistricting plan with an ensemble of sampled alternative plans. For successful application, sampling methods must scale to maps with a moderate or large number of districts, incorporate realistic legal constraints, and accurately and efficiently sample from a selected target distribution. Unfortunately, most existing methods struggle in at least one of these areas. We present a new Sequential Monte Carlo (SMC) algorithm that generates a sample of redistricting plans converging to a realistic target distribution. Because it draws many plans in parallel, the SMC algorithm can efficiently explore the relevant space of redistricting plans better than the existing Markov chain Monte Carlo (MCMC) algorithms that generate plans sequentially. Our algorithm can simultaneously incorporate several constraints commonly imposed in real-world redistricting problems, including equal population, compactness, and preservation of administrative boundaries. We validate the accuracy of the proposed algorithm by using a small map where all redistricting plans can be enumerated. We then apply the SMC algorithm to evaluate the partisan implications of several maps submitted by relevant parties in a recent high-profile redistricting case in the state of Pennsylvania. We find that the proposed algorithm converges faster and with fewer samples than a comparable MCMC algorithm. Open-source software is available for implementing the proposed methodology.
O
Data from: Identity Access Management dataset
opendatalab.com
zip
Updated Mar 17, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Singapore Management University (2023). Identity Access Management dataset [Dataset]. https://opendatalab.com/OpenDataLab/Identity_Access_Management_etc
Explore at:
zipAvailable download formats
Dataset updated
Mar 17, 2023
Dataset provided by
University of Texas at Austin
Singapore Management University
Symmetry Systems
Description
We release 280 synthetic IAM graphs generated using IAM graphs of commercial companies. Specifically, we vary the number of nodes, but keep graph density as is, i.e. in the range of 0.259 ± 0.198 (avg ± std). To generate a synthetic graph, we first sample the number of users and datastores from uniform distributions over the following intervals [10, 150] and [50, 300] respectively that cover variations of those parameters across real graphs. After fixing node counts we sample with replacement the actual nodes from a real world graph, which is chosen at random. Then we add Gaussian N(0, 0.01) noise to node embeddings and renormalize them. To match the graph density with the density of the underlying baseline we sample edges from a multinomial distribution, where each component is proportional to the cosine distance between a user and a datastore embeddings. Also we enforce the invariant that dynamic edges are always a subset of all permission edges. A synthetic graph generated in such a way is an ”upsampled” version of an underlying real world graph.
Data from: Knowledge graphs for seismic data and metadata
data.niaid.nih.gov
datadryad.org
zip
Updated Sep 19, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
William Davis; Cassandra Hunt (2023). Knowledge graphs for seismic data and metadata [Dataset]. http://doi.org/10.6078/D1P430
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.6078/D1P430
Dataset updated
Sep 19, 2023
Dataset provided by
Relational AI
University of California, San Diego
Authors
William Davis; Cassandra Hunt
License
https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html
Description
The increasing scale and diversity of seismic data, and the growing role of big data in seismology, has raised interest in methods to make data exploration more accessible. This paper presents the use of knowledge graphs (KGs) for representing seismic data and metadata to improve data exploration and analysis, focusing on usability, flexibility, and extensibility. Using constraints derived from domain knowledge in seismology, we define semantic models of seismic station and event information used to construct the KGs. Our approach utilizes the capability of KGs to integrate data across many sources and diverse schema formats. We use schema-diverse, real-world seismic data to construct KGs with millions of nodes, and illustrate potential applications with three big-data examples. Our findings demonstrate the potential of KGs to enhance the efficiency and efficacy of seismological workflows in research and beyond, indicating a promising interdisciplinary future for this technology. Methods The data here consists of, and was collected from:

Station metadata, in StationXML format, acquired from IRIS DMC using the fdsnws-station webservice (https://service.iris.edu/fdsnws/station/1/). Earthquake event data, in NDK format, acquired from the Global Centroid-Moment Tensor (GCMT) catalog webservice (https://www.globalcmt.org) [1,2]. Earthquake event data, in CSV format, acquired from the USGS earthquake catalog webservice (https://doi.org/10.5066/F7MS3QZH) [3].

The format of the data is described in the README. In addition, a complete description of the StationXML, NDK, and USGS file formats can be found at https://www.fdsn.org/xml/station/, https://www.ldeo.columbia.edu/~gcmt/projects/CMT/catalog/allorder.ndk_explained, and https://earthquake.usgs.gov/data/comcat/#event-terms, respectively. Also provided are conversions from NDK and StationXML file formats into JSON format. References: [1] Dziewonski, A. M., Chou, T. A., & Woodhouse, J. H. (1981). Determination of earthquake source parameters from waveform data for studies of global and regional seismicity. Journal of Geophysical Research: Solid Earth, 86(B4), 2825-2852. [2] Ekström, G., Nettles, M., & Dziewoński, A. M. (2012). The global CMT project 2004–2010: Centroid-moment tensors for 13,017 earthquakes. Physics of the Earth and Planetary Interiors, 200, 1-9. [3] U.S. Geological Survey, Earthquake Hazards Program, 2017, Advanced National Seismic System (ANSS) Comprehensive Catalog of Earthquake Events and Products: Various, https://doi.org/10.5066/F7MS3QZH.
C
Event Graph of BPI Challenge 2019
data.4tu.nl
zip
Updated Apr 22, 2021
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Dirk Fahland (2021). Event Graph of BPI Challenge 2019 [Dataset]. http://doi.org/10.4121/14169614.v1
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.4121/14169614.v1
Dataset updated
Apr 22, 2021
Dataset provided by
4TU.ResearchData
Authors
Dirk Fahland
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Business process event data modeled as labeled property graphs

Data Format
-----------

The dataset comprises one labeled property graph in two different file formats.

#1) Neo4j .dump format

A neo4j (https://neo4j.com) database dump that contains the entire graph and can be imported into a fresh neo4j database instance using the following command, see also the neo4j documentation: https://neo4j.com/docs/

/bin/neo4j-admin.(bat|sh) load --database=graph.db --from=

The .dump was created with Neo4j v3.5.

#2) .graphml format

A .zip file containing a .graphml file of the entire graph

Data Schema
-----------

The graph is a labeled property graph over business process event data. Each graph uses the following concepts

:Event nodes - each event node describes a discrete event, i.e., an atomic observation described by attribute "Activity" that occurred at the given "timestamp"

:Entity nodes - each entity node describes an entity (e.g., an object or a user), it has an EntityType and an identifier (attribute "ID")

:Log nodes - describes a collection of events that were recorded together, most graphs only contain one log node

:Class nodes - each class node describes a type of observation that has been recorded, e.g., the different types of activities that can be observed, :Class nodes group events into sets of identical observations

:CORR relationships - from :Event to :Entity nodes, describes whether an event is correlated to a specific entity; an event can be correlated to multiple entities

:DF relationships - "directly-followed by" between two :Event nodes describes which event is directly-followed by which other event; both events in a :DF relationship must be correlated to the same entity node. All :DF relationships form a directed acyclic graph.

:HAS relationship - from a :Log to an :Event node, describes which events had been recorded in which event log

:OBSERVES relationship - from an :Event to a :Class node, describes to which event class an event belongs, i.e., which activity was observed in the graph

:REL relationship - placeholder for any structural relationship between two :Entity nodes

The concepts a further defined in Stefan Esser, Dirk Fahland: Multi-Dimensional Event Data in Graph Databases. CoRR abs/2005.14552 (2020) https://arxiv.org/abs/2005.14552

Data Contents
-------------

neo4j-bpic19-2021-02-17 (.dump|.graphml.zip)

An integrated graph describing the raw event data of the entire BPI Challenge 2019 dataset.
van Dongen, B.F. (Boudewijn) (2019): BPI Challenge 2019. 4TU.ResearchData. Collection. https://doi.org/10.4121/uuid:d06aff4b-79f0-45e6-8ec8-e19730c248f1

This data originated from a large multinational company operating from The Netherlands in the area of coatings and paints and we ask participants to investigate the purchase order handling process for some of its 60 subsidiaries. In particular, the process owner has compliance questions. In the data, each purchase order (or purchase document) contains one or more line items. For each line item, there are roughly four types of flows in the data: (1) 3-way matching, invoice after goods receipt: For these items, the value of the goods receipt message should be matched against the value of an invoice receipt message and the value put during creation of the item (indicated by both the GR-based flag and the Goods Receipt flags set to true). (2) 3-way matching, invoice before goods receipt: Purchase Items that do require a goods receipt message, while they do not require GR-based invoicing (indicated by the GR-based IV flag set to false and the Goods Receipt flags set to true). For such purchase items, invoices can be entered before the goods are receipt, but they are blocked until goods are received. This unblocking can be done by a user, or by a batch process at regular intervals. Invoices should only be cleared if goods are received and the value matches with the invoice and the value at creation of the item. (3) 2-way matching (no goods receipt needed): For these items, the value of the invoice should match the value at creation (in full or partially until PO value is consumed), but there is no separate goods receipt message required (indicated by both the GR-based flag and the Goods Receipt flags set to false). (4)Consignment: For these items, there are no invoices on PO level as this is handled fully in a separate process. Here we see GR indicator is set to true but the GR IV flag is set to false and also we know by item type (consignment) that we do not expect an invoice against this item. Unfortunately, the complexity of the data goes further than just this division in four categories. For each purchase item, there can be many goods receipt messages and corresponding invoices which are subsequently paid. Consider for example the process of paying rent. There is a Purchase Document with one item for paying rent, but a total of 12 goods receipt messages with (cleared) invoices with a value equal to 1/12 of the total amount. For logistical services, there may even be hundreds of goods receipt messages for one line item. Overall, for each line item, the amounts of the line item, the goods receipt messages (if applicable) and the invoices have to match for the process to be compliant. Of course, the log is anonymized, but some semantics are left in the data, for example: The resources are split between batch users and normal users indicated by their name. The batch users are automated processes executed by different systems. The normal users refer to human actors in the process. The monetary values of each event are anonymized from the original data using a linear translation respecting 0, i.e. addition of multiple invoices for a single item should still lead to the original item worth (although there may be small rounding errors for numerical reasons). Company, vendor, system and document names and IDs are anonymized in a consistent way throughout the log. The company has the key, so any result can be translated by them to business insights about real customers and real purchase documents.

The case ID is a combination of the purchase document and the purchase item. There is a total of 76,349 purchase documents containing in total 251,734 items, i.e. there are 251,734 cases. In these cases, there are 1,595,923 events relating to 42 activities performed by 627 users (607 human users and 20 batch users). Sometimes the user field is empty, or NONE, which indicates no user was recorded in the source system. For each purchase item (or case) the following attributes are recorded: concept:name: A combination of the purchase document id and the item id, Purchasing Document: The purchasing document ID, Item: The item ID, Item Type: The type of the item, GR-Based Inv. Verif.: Flag indicating if GR-based invoicing is required (see above), Goods Receipt: Flag indicating if 3-way matching is required (see above), Source: The source system of this item, Doc. Category name: The name of the category of the purchasing document, Company: The subsidiary of the company from where the purchase originated, Spend classification text: A text explaining the class of purchase item, Spend area text: A text explaining the area for the purchase item, Sub spend area text: Another text explaining the area for the purchase item, Vendor: The vendor to which the purchase document was sent, Name: The name of the vendor, Document Type: The document type, Item Category: The category as explained above (3-way with GR-based invoicing, 3-way without, 2-way, consignment).

The data contains the following entities and their events

- PO - Purchase Order documents handled at a large multinational company operating from The Netherlands
- POItem - an item in a Purchase Order document describing a specific item to be purchased
- Resource - the user or worker handling the document or a specific item
- Vendor - the external organization from which an item is to be purchased

Data Size
---------

BPIC19, nodes: 1926651, relationships: 15082099
f
Spearman’s coefficient on us freeway network.
plos.figshare.com
xls
Updated Dec 21, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Debasish Jana; Sven Malama; Sriram Narasimhan; Ertugrul Taciroglu (2023). Spearman’s coefficient on us freeway network. [Dataset]. http://doi.org/10.1371/journal.pone.0296045.t005
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0296045.t005
Dataset updated
Dec 21, 2023
Dataset provided by
PLOS ONE
Authors
Debasish Jana; Sven Malama; Sriram Narasimhan; Ertugrul Taciroglu
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Transportation networks play a crucial role in society by enabling the smooth movement of people and goods during regular times and acting as arteries for evacuations during catastrophes and natural disasters. Identifying the critical road segments in a large and complex network is essential for planners and emergency managers to enhance the network’s efficiency, robustness, and resilience to such stressors. We propose a novel approach to rapidly identify critical and vital network components (road segments in a transportation network) for resilience improvement or post-disaster recovery. We pose the transportation network as a graph with roads as edges and intersections as nodes and deploy a Graph Neural Network (GNN) trained on a broad range of network parameter changes and disruption events to rank the importance of road segments. The trained GNN model can rapidly estimate the criticality rank of individual road segments in the modified network resulting from an interruption. We address two main limitations in the existing literature that can arise in capital planning or during emergencies: ranking a complete network after changes to components and addressing situations in post-disaster recovery sequencing where some critical segments cannot be recovered. Importantly, our approach overcomes the computational overhead associated with the repeated calculation of network performance metrics, which can limit its use in large networks. To highlight scenarios where our method can prove beneficial, we present examples of synthetic graphs and two real-world transportation networks. Through these examples, we show how our method can support planners and emergency managers in undertaking rapid decisions for planning infrastructure hardening measures in large networks or during emergencies, which otherwise would require repeated ranking calculations for the entire network.

Facebook

Twitter

Click to copy link

Link copied

Cite

Queen, Owen (2023). GraphXAI [Dataset]. http://doi.org/10.7910/DVN/KULOS8

GraphXAI

Explore at:

25 scholarly articles cite this dataset (View in Google Scholar)

Unique identifier

https://doi.org/10.7910/DVN/KULOS8

Dataset updated

Nov 8, 2023

Dataset provided by

Harvard Dataverse

Authors

Queen, Owen

Description

As post hoc explanations are increasingly used to understand the behavior of Graph Neural Networks (GNNs), it becomes crucial to evaluate the quality and reliability of GNN explanations. However, assessing the quality of GNN explanations is challenging as existing graph datasets have no or unreliable ground-truth explanations for a given task. Here, we introduce a synthetic graph data generator, ShapeGGen, which can generate a variety of benchmark datasets (e.g., varying graph sizes, degree distributions, homophilic vs. heterophilic graphs) accompanied by ground-truth explanations. Further, the flexibility to generate diverse synthetic datasets and corresponding ground-truth explanations allows us to mimic the data generated by various real-world applications. We include ShapeGGen and additional XAI-ready real-world graph datasets into an open-source graph explainability library, GraphXAI. In addition, GraphXAI provides a broader ecosystem of data loaders, data processing functions, synthetic and real-world graph datasets with ground-truth explanations, visualizers, GNN model implementations, and a set of evaluation metrics to benchmark the performance of any given GNN explainer.

Clear search

Close search

Google apps

Main menu

GraphXAI

Input data for exopGCN and different GNNs.

Freebase Datasets for Robust Evaluation of Knowledge Graph Link Prediction...

SCG Dataset

A Row Generation Algorithm for Finding Optimal Burning Sequences of Large...

Data set introduction.

CompanyKG Dataset V2.0: A Large-Scale Heterogeneous Graph for Company...

Link-prediction on Biomedical Knowledge Graphs

Data from: IntelliGraphs: Datasets for Benchmarking Knowledge Graph...

Topology Bench: Systematic Graph Based Benchmarking for Optical Networks

Results of experiments on planned area fix-size.

The least cost directed perfect awareness problem: benchmark instances and...

Event Graph of BPI Challenge 2014

Data from: Graph drawing using tabu search coupled with path relinking

Data from: Minimizing the Cost of Leveraging Influencers in Social Networks:...

Replication Data for: Sequential Monte Carlo for Sampling Balanced and...

Data from: Identity Access Management dataset

Data from: Knowledge graphs for seismic data and metadata

Event Graph of BPI Challenge 2019

Spearman’s coefficient on us freeway network.

GraphXAI