Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Comparison of number of nodes in the graph (N), normalized characteristic path length (λ), normalized clustering coefficient (γ), and small-world measure (σ) from our study with previously published results on small-world characterization of functional brain network constructed using EEG, MEG, and fMRI data.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
mRNA-seq assays on mouse tissues were downloaded from the ENCODE project and consolidated into matrices of expression
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Spectral clustering views the similarity matrix as a weighted graph, and partitions the data by minimizing a graph-cut loss. Since it minimizes the across-cluster similarity, there is no need to model the distribution within each cluster. As a result, one reduces the chance of model misspecification, which is often a risk in mixture model-based clustering. Nevertheless, compared to the latter, spectral clustering has no direct ways of quantifying the clustering uncertainty (such as the assignment probability), or allowing easy model extensions for complicated data applications. To fill this gap, we propose the Bayesian forest model as a generative graphical model for spectral clustering. This is motivated by our discovery that the posterior connecting matrix in a forest model has almost the same leading eigenvectors, as the ones used by normalized spectral clustering. To induce a distribution for the forest, we develop a “forest process” as a graph extension to the urn process, while we carefully characterize the differences in the partition probability. We derive a simple Markov chain Monte Carlo algorithm for posterior estimation, and demonstrate superior performance compared to existing algorithms. We illustrate several model-based extensions useful for data applications, including high-dimensional and multi-view clustering for images. Supplementary materials for this article are available online.
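For readers unfamiliar with the baseline the abstract compares against, normalized spectral clustering can be sketched in a few lines. This is a generic illustration (Gaussian similarity on two well-separated 1-D clusters, two-way split via the sign of the second eigenvector of the normalized Laplacian), not code from the paper:

```python
import numpy as np

def spectral_bipartition(W):
    """Split a weighted graph into two clusters using the normalized Laplacian
    L_sym = I - D^{-1/2} W D^{-1/2}. The sign pattern of the eigenvector for the
    second-smallest eigenvalue gives the relaxed normalized-cut solution."""
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    L_sym = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    vals, vecs = np.linalg.eigh(L_sym)  # eigenvalues in ascending order
    fiedler = vecs[:, 1]                # second-smallest eigenvector
    return (fiedler > 0).astype(int)

# Two well-separated groups of points on a line
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0.0, 0.1, 10), rng.normal(5.0, 0.1, 10)])
W = np.exp(-(X[:, None] - X[None, :]) ** 2)  # Gaussian similarity graph
labels = spectral_bipartition(W)
```

The cross-cluster similarities here are effectively zero, so the graph splits into two near-disconnected components and the eigenvector sign recovers them.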
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Task scheduler performance survey
This dataset contains the results of a task graph scheduler performance survey.
The results are stored in the following files, which correspond to simulations performed on the elementary, irw, and pegasus task graph datasets published at https://doi.org/10.5281/zenodo.2630384:
elementary-result.zip
irw-result.zip
pegasus-result.zip
Each file contains a compressed pandas DataFrame in CSV format and can be read with the following Python code:
import pandas as pd
frame = pd.read_csv("elementary-result.zip")
Each row in the frame corresponds to a single instance of a task graph that was simulated with a specific configuration (network model, scheduler etc.). The list below summarizes the meaning of the individual columns.
graph_name - name of the benchmarked task graph
graph_set - name of the task graph dataset from which the graph originates
graph_id - unique ID of the graph
cluster_name - type of cluster used in this instance; the format is <workers>x<cores>, e.g. 32x16 means 32 workers, each with 16 cores
bandwidth - network bandwidth [MiB/s]
netmodel - network model (simple or maxmin)
scheduler_name - name of the scheduler
imode - information mode
min_sched_interval - minimal scheduling delay [s]
sched_time - duration of each scheduler invocation [s]
time - simulated makespan of the task graph execution [s]
execution_time - real duration of all scheduler invocations [s]
total_transfer - amount of data transferred amongst workers [MiB]
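As an illustration of how the normalized score charts below are derived, the normalization (each makespan divided by the best makespan achieved for the same configuration) can be reproduced with pandas. The values and the scheduler names ("blevel", "random") in this toy frame are made up for the example:

```python
import pandas as pd

# Toy frame with a subset of the columns described above (illustrative values)
frame = pd.DataFrame({
    "graph_name": ["g1", "g1", "g1", "g1"],
    "cluster_name": ["32x16"] * 4,
    "bandwidth": [1024, 1024, 8192, 8192],
    "netmodel": ["simple"] * 4,
    "scheduler_name": ["blevel", "random", "blevel", "random"],
    "time": [10.0, 14.0, 9.0, 18.0],
})

# Normalize each makespan by the best (shortest) makespan for the same
# configuration, mirroring the "-schedulers-score" charts
config_cols = ["graph_name", "cluster_name", "bandwidth", "netmodel"]
frame["score"] = frame["time"] / frame.groupby(config_cols)["time"].transform("min")

# Average normalized score per scheduler (1.0 means always best)
best = frame.groupby("scheduler_name")["score"].mean()
```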
The file charts.zip contains charts obtained by processing the datasets. The X axis always shows the bandwidth [MiB/s]. The following files are included:
[DATASET]-schedulers-time - Absolute makespan produced by schedulers [seconds]
[DATASET]-schedulers-score - The same as above but normalized with respect to the best schedule (shortest makespan) for the given configuration.
[DATASET]-schedulers-transfer - Sums of transfers between all workers for a given configuration [MiB]
[DATASET]-[CLUSTER]-netmodel-time - Comparison of netmodels, absolute times [seconds]
[DATASET]-[CLUSTER]-netmodel-score - Comparison of netmodels, normalized to the average of model "simple"
[DATASET]-[CLUSTER]-netmodel-transfer - Comparison of netmodels, sum of transferred data between all workers [MiB]
[DATASET]-[CLUSTER]-schedtime-time - Comparison of minimal scheduling delays (MSD), absolute times [seconds]
[DATASET]-[CLUSTER]-schedtime-score - Comparison of minimal scheduling delays (MSD), normalized to the average of the "MSD=0.0" case
[DATASET]-[CLUSTER]-imode-time - Comparison of information modes (imodes), absolute times [seconds]
[DATASET]-[CLUSTER]-imode-score - Comparison of information modes (imodes), normalized to the average of the "exact" imode
Reproducing the results
First, install the estee package:
$ git clone https://github.com/It4innovations/estee
$ cd estee
$ pip install .
Then use benchmarks/generate.py to generate task graphs from three categories (elementary, irw and pegasus):
$ cd benchmarks
$ python generate.py elementary.zip elementary
$ python generate.py irw.zip irw
$ python generate.py pegasus.zip pegasus
or use our task graph dataset that is provided at https://doi.org/10.5281/zenodo.2630384.
The benchmark configuration is read from benchmark.json. Then you can run the benchmark using this command:
$ python pbs.py compute benchmark.json
The benchmark script can be interrupted at any time (for example with Ctrl+C). When interrupted, it stores the results computed so far in the result file and resumes the computation when launched again.
To generate the plots, run:
$ python view.py --all
The resulting plots will appear in a folder called outputs.
Open Data Commons Attribution License (ODC-By) v1.0 https://www.opendatacommons.org/licenses/by/1.0/
License information was derived automatically
We provide an academic graph based on a snapshot of the Microsoft Academic Graph from 26.05.2021. The Microsoft Academic Graph (MAG) is a large-scale dataset containing information about scientific publication records, their citation relations, as well as authors, affiliations, journals, conferences and fields of study. We acknowledge the Microsoft Academic Graph using the URI https://aka.ms/msracad. For more information regarding schema and the entities present in the original dataset please refer to: MAG schema.
MAG for Heterogeneous Graph Learning We use a recent version of MAG from May 2021 and extract all relevant entities to build a graph that can be directly used for heterogeneous graph learning (node classification, link prediction, etc.). The graph contains all English papers, published after 1900, that have been cited at least 5 times per year since the time of publishing. For fairness, we set a constant citation bound of 100 for papers published before 2000. We further include two smaller subgraphs, one containing computer science papers and one containing medicine papers.
Nodes and features We define the following nodes:
paper with mag_id, graph_id, normalized title, year of publication, citations, and a 128-dimensional title embedding built using word2vec. No. of papers: 5,091,690 (all), 1,014,769 (medicine), 367,576 (computer science);
author with mag_id, graph_id, normalized name, citations. No. of authors: 6,363,201 (all), 1,797,980 (medicine), 557,078 (computer science);
field with mag_id, graph_id, level, citations, where level denotes the hierarchical level of the field (0 is the highest level, e.g. computer science). No. of fields: 199,457 (all), 83,970 (medicine), 45,454 (computer science);
affiliation with mag_id, graph_id, citations. No. of affiliations: 19,421 (all), 12,103 (medicine), 10,139 (computer science);
venue with mag_id, graph_id, citations, and type denoting whether it is a conference or a journal. No. of venues: 24,608 (all), 8,514 (medicine), 9,893 (computer science).
Edges We define the following edges:
author is_affiliated_with affiliation. No. of author-affiliation edges: 8,292,253 (all), 2,265,728 (medicine), 665,931 (computer science);
author is_first/last/other paper. No. of author-paper edges: 24,907,473 (all), 5,081,752 (medicine), 1,269,485 (computer science);
paper has_citation_to paper. No. of paper-paper edges: 142,684,074 (all), 16,808,837 (medicine), 4,152,804 (computer science);
paper conference/journal_published_at venue. No. of paper-venue edges: 5,091,690 (all), 1,014,769 (medicine), 367,576 (computer science);
paper has_field_L0/L1/L2/L3/L4 field. No. of paper-field edges: 47,531,366 (all), 9,403,708 (medicine), 3,341,395 (computer science);
field is_in field. No. of field-field edges: 339,036 (all), 138,304 (medicine), 83,245 (computer science).
We further include a reverse edge for each edge type defined above that is denoted with the prefix rev_ and can be removed based on the downstream task.
Data structure The nodes and their respective features are provided as separate .tsv files where each feature represents a column. The edges are provided as a pickled Python dictionary with the schema:
{target_type: {source_type: {edge_type: {target_id: {source_id: {time}}}}}}
We provide three compressed ZIP archives, one for each subgraph (all, medicine, computer science); the archive for the complete graph is split into 500 MB chunks. Each archive contains the separate node features and the edge dictionary.
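A minimal sketch of consuming the edge dictionary, assuming the innermost value is the edge timestamp. The IDs below are made up, and the is_first edge type is taken from the edge list above purely for illustration:

```python
import pickle

# Tiny edge dictionary following the schema above (illustrative IDs)
edges = {
    "paper": {                    # target_type
        "author": {               # source_type
            "is_first": {         # edge_type
                101: {7: 2015},             # target_id -> {source_id: time}
                102: {7: 2018, 9: 2018},
            }
        }
    }
}

def count_edges(edge_dict):
    """Count (source, target) pairs across all edge types in the nested schema."""
    return sum(
        len(src_to_time)
        for by_source_type in edge_dict.values()
        for by_edge_type in by_source_type.values()
        for by_target_id in by_edge_type.values()
        for src_to_time in by_target_id.values()
    )

# Round-trip through pickle, as the dataset ships the dictionary pickled
blob = pickle.dumps(edges)
n = count_edges(pickle.loads(blob))
```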
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Network propagation leads to topology bias when the normalized Laplacian of the graph is used, whereas the degree row-normalized adjacency matrix does not lead to bias on the hub nodes. The Laplacian of the graph cannot be used for RWR because the iterative process is not guaranteed to converge for all α’s. Yes: presence of topology bias, No: absence of topology bias for the respective combination of propagation algorithm and graph normalization approach. The symbol “-” indicates that convergence is not guaranteed for all values of the smoothing parameter.
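A minimal sketch of random walk with restart (RWR) using the degree row-normalized adjacency matrix, the combination that, per the table, converges for all values of the smoothing parameter. The star graph and alpha = 0.3 are illustrative choices, not part of the dataset:

```python
import numpy as np

def rwr(A, seed_idx, alpha=0.3, tol=1e-10, max_iter=1000):
    """Random walk with restart with the row-normalized adjacency P = D^-1 A.
    Iterates p <- (1 - alpha) * P.T @ p + alpha * p0 until convergence."""
    P = A / A.sum(axis=1, keepdims=True)   # degree row-normalization
    p0 = np.zeros(len(A))
    p0[seed_idx] = 1.0                     # restart distribution
    p = p0.copy()
    for _ in range(max_iter):
        p_next = (1 - alpha) * P.T @ p + alpha * p0
        if np.abs(p_next - p).sum() < tol:
            return p_next
        p = p_next
    return p

# Small star graph: node 0 is a hub connected to nodes 1-4; seed at node 1
A = np.zeros((5, 5))
A[0, 1:] = A[1:, 0] = 1.0
p = rwr(A, seed_idx=1)
```

The result is a probability vector over nodes; symmetric non-seed leaves receive identical mass.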
Open Data Commons Attribution License (ODC-By) v1.0 https://www.opendatacommons.org/licenses/by/1.0/
License information was derived automatically
Named Entity Recognition (NER) and Entity Normalization (EN) are fundamental tasks in information extraction, particularly in the biomedical and clinical domains. NER identifies textual mentions of entities, while EN maps these mentions to unique identifiers within a structured vocabulary. However, the biomedical domain presents unique challenges for NER, including the diverse and inconsistent lexical representations of biomedical concepts, such as non-standard terminology, abbreviations, complex phrases, and frequent misspellings in clinical texts. Additionally, rare entities are often underrepresented in training datasets and may lack detailed descriptions or synonyms in knowledge graphs, limiting the quality of training data for Disease Entity Recognition (DER) and Disease Entity Normalization (DEN). To address this, we present the Synthetic Mention Corpora for Disease Entity Recognition and Normalization, a dataset comprising 128,000 synthetic disease mentions generated using a fine-tuned LLaMa-2-13B-Chat model. These mentions are derived from the Unified Medical Language System (UMLS) disorder group. This corpus aims to enhance the development of more robust systems for disease entity identification and linking in biomedical and clinical text, addressing current limitations in training data availability.
https://fred.stlouisfed.org/legal/#copyright-citation-required
Graph and download economic data for Composite Leading Indicators: Reference Series (GDP) Normalized for Canada (CANLORSGPNOSTSAM) from Feb 1961 to Nov 2023 about leading indicator, Canada, and GDP.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset comprises a profession-clinical knowledge graph derived from the co-occurrence of normalised concepts identified in two distinct corpora: the Mesinesp2 corpus, a manually annotated corpus in which domain experts have labelled a set of scientific literature, clinical trials, and patent abstracts; and a corpus of clinical case reports. The application of different NER systems to each corpus has enabled the extraction of clinical mentions related to diseases, drugs, locations, procedures, species, species-human, and symptoms.
The repository contains a .zip file for each corpus, each containing a .tsv file with the co-occurrences between the detected professions and clinical entities. The file has the following column order:
Notes
This resource has been funded by the Spanish National Proyectos I+D+i 2020 AI4ProfHealth project PID2020-119266RA-I00 (PID2020-119266RA-I0/AEI/10.13039/501100011033).
Contact
If you have any questions or suggestions, please contact us at:
- Miguel Rodríguez Ortega (
Additional resources and corpora
If you are interested, you might want to check out these corpora and resources:
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The scale and complexity of relational data in critical domains like chemistry, neuroscience, and social media have ignited interest in graph neural networks as performant, expressive, flexible frameworks for solving problems specified over graphs. However, this performance comes at the cost of interpretability: the behavior of a typical neural network is, at best, a mystery. Graph grammars, in contrast, provide a symbolic, discrete, rule-based formalism for describing transformations between graphs. While profoundly interpretable, they are mired in the inductive biases and restrictive assumptions common to many traditional approaches to graph modeling.
This dissertation tries to diminish the discrepancy between graph neural networks and graph grammars. The first contribution introduces Dynamic Vertex Replacement Grammars as a way of modeling temporal graph datasets with graph grammars. The second contribution proposes an analytically-invertible normalizing flow network that learns prototypical probability distributions as intrinsic explanations for its behavior. The third contribution shows how the attention mechanism in graph neural networks induces grammars that can act as generative post hoc explainers.
This supports the thesis that discrete rules and continuous distributions are jointly critical to the future of machine learning.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Deriving Bayesian inference for exponential random graph models (ERGMs) is a challenging “doubly intractable” problem as the normalizing constants of the likelihood and posterior density are both intractable. Markov chain Monte Carlo (MCMC) methods which yield Bayesian inference for ERGMs, such as the exchange algorithm, are asymptotically exact but computationally intensive, as a network has to be drawn from the likelihood at every step using, for instance, a “tie no tie” sampler. In this article, we develop a variety of variational methods for Gaussian approximation of the posterior density and model selection. These include nonconjugate variational message passing based on an adjusted pseudolikelihood and stochastic variational inference. To overcome the computational hurdle of drawing a network from the likelihood at each iteration, we propose stochastic gradient ascent with biased but consistent gradient estimates computed using adaptive self-normalized importance sampling. These methods provide attractive fast alternatives to MCMC for posterior approximation. We illustrate the variational methods using real networks and compare their accuracy with results obtained via MCMC and Laplace approximation. Supplementary materials for this article are available online.
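The self-normalized importance sampling used above for the gradient estimates can be sketched generically. The Gaussian target and proposal below are a textbook illustration of the estimator, not the ERGM setting; the key property is that unnormalized log-weights suffice, since normalizing constants cancel:

```python
import numpy as np

def snis_estimate(f, x, log_w):
    """Self-normalized importance sampling: E_p[f(X)] ~= sum_i w_i f(x_i) / sum_i w_i,
    where log_w holds unnormalized log importance weights log p(x) - log q(x)."""
    w = np.exp(log_w - log_w.max())  # subtract max for numerical stability
    w /= w.sum()                     # self-normalization step
    return float(np.sum(w * f(x)))

# Estimate E_p[X] for target p = N(1, 1) using proposals from q = N(0, 1)
rng = np.random.default_rng(42)
x = rng.normal(0.0, 1.0, size=200_000)
log_w = -0.5 * (x - 1.0) ** 2 + 0.5 * x ** 2  # log p - log q, up to a constant
est = snis_estimate(lambda v: v, x, log_w)    # should be close to 1
```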
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Leading Indicators OECD: Leading indicators: CLI: Normalised for the United States was 99.85411 Index in January of 2024, according to the United States Federal Reserve. Historically, Leading Indicators OECD: Leading indicators: CLI: Normalised for the United States reached a record high of 103.41090 in December of 1972 and a record low of 93.48370 in April of 2020. Trading Economics provides the current actual value, a historical data chart and related indicators for Leading Indicators OECD: Leading indicators: CLI: Normalised for the United States - last updated from the United States Federal Reserve in June 2025.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Leading Indicators OECD: Component series: Interest rate spread: Normalised for the United States was 99.16215 Index in December of 2023, according to the United States Federal Reserve. Historically, Leading Indicators OECD: Component series: Interest rate spread: Normalised for the United States reached a record high of 102.59428 in February of 1976 and a record low of 95.16112 in July of 1974. Trading Economics provides the current actual value, a historical data chart and related indicators for Leading Indicators OECD: Component series: Interest rate spread: Normalised for the United States - last updated from the United States Federal Reserve in July 2025.
The table summarizes the global values of connectivity strength S, shortest path length L, normalized path length, and normalized clustering coefficient C (normalized to 100 random graphs). FA = fractional anisotropy. SD = standard deviation.
https://fred.stlouisfed.org/legal/#copyright-citation-required
Graph and download economic data for Composite Leading Indicators: Composite Leading Indicator (CLI) Normalized for United States (USALOLITONOSTSAM) from Jan 1955 to Jan 2024 about leading indicator and USA.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Leading Indicators OECD: Component series: Orders: Normalised for the United States was 100.20803 Index in November of 2023, according to the United States Federal Reserve. Historically, Leading Indicators OECD: Component series: Orders: Normalised for the United States reached a record high of 103.56959 in February of 2008 and a record low of 93.03215 in April of 2020. Trading Economics provides the current actual value, a historical data chart and related indicators for Leading Indicators OECD: Component series: Orders: Normalised for the United States - last updated from the United States Federal Reserve in July 2025.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Connectome, traits and behavior data for APOE234 mice.
columns: winding numbers, total distance, normalized NE time, normalized NE distance, normalized NW time, normalized NW distance, normalized SE time, normalized SE distance, normalized SW time, normalized SW distance, island latency to first entry, island entries, normalized thigmotaxis time, and normalized thigmotaxis distance
Attribution-ShareAlike 3.0 (CC BY-SA 3.0) https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Wikipedia articles use Wikidata to list the links to the same article in other language versions. Therefore, each Wikipedia language edition stores the Wikidata Q-id for each article.
This dataset constitutes a Wikipedia link graph in which all article identifiers are normalized to Wikidata Q-ids. It contains the normalized links from all Wikipedia language versions. Detailed link count statistics are attached. Note that articles with neither incoming nor outgoing links are not part of this graph.
The format is as follows:
Q-id of linking page (outgoing)
Q-id of linked page (incoming)
wiki edition (with dump date) in which the link was observed
This dataset was used to compute Wikidata PageRank. More information can be found on the danker repository, where the source code of the link extraction as well as the PageRank computation is hosted.
Example entries:
$ bzcat 2022-11-10.allwiki.links.bz2 | head
1 1001051 zhwiki-20221101
1 1001 azbwiki-20221101
1 10022 nds_nlwiki-20221101
1 1005917 ptwiki-20221101
1 10090 guwiki-20221101
1 10090 tawiki-20221101
1 101038 glwiki-20221101
1 101072 idwiki-20221101
1 101072 lvwiki-20221101
1 101072 ndswiki-20221101
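A minimal parsing sketch, assuming (as the entries above suggest) that each line holds the linking page's Q-id, the linked page's Q-id, and the contributing wiki edition. A real dump would be streamed with bz2.open instead of the inline sample used here:

```python
from collections import defaultdict

# A few lines copied from the example entries above
sample = """\
1 1001051 zhwiki-20221101
1 1001 azbwiki-20221101
1 10022 nds_nlwiki-20221101
"""

# Map each source Q-id to its (target Q-id, wiki edition) link records
links = defaultdict(list)
for line in sample.splitlines():
    source, target, edition = line.split()
    links[int(source)].append((int(target), edition))
```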
https://fred.stlouisfed.org/legal/#copyright-citation-required
Graph and download economic data for Composite Leading Indicators: Composite Leading Indicator (CLI) Normalized for Korea (KORLOLITONOSTSAM) from Jan 1990 to Jan 2024 about leading indicator and Korea.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Hypergraphs have gained increasing attention in the machine learning community lately due to their superiority over graphs in capturing super-dyadic interactions among entities. In this work, we propose a novel approach for the partitioning of k-uniform hypergraphs. Most of the existing methods work by reducing the hypergraph to a graph followed by applying standard graph partitioning algorithms. The reduction step restricts the algorithms to capturing only some weighted pairwise interactions and hence loses essential information about the original hypergraph. We overcome this issue by utilizing tensor-based representation of hypergraphs, which enables us to capture actual super-dyadic interactions. We extend the notion of minimum ratio-cut and normalized-cut from graphs to hypergraphs and show that the relaxed optimization problem can be solved using eigenvalue decomposition of the Laplacian tensor. This novel formulation also enables us to remove a hyperedge completely by using the “hyperedge score” metric proposed by us, unlike the existing reduction approaches. We propose a hypergraph partitioning algorithm inspired from spectral graph theory and also derive a tighter upper bound on the minimum positive eigenvalue of even-order hypergraph Laplacian tensor in terms of its conductance, which is utilized in the partitioning algorithm to approximate the normalized cut. The efficacy of the proposed method is demonstrated numerically on synthetic hypergraphs generated by stochastic block model. We also show improvement for the min-cut solution on 2-uniform hypergraphs (graphs) over the standard spectral partitioning algorithm.