License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Comparison of the number of nodes in the graph (N), normalized characteristic path length (λ), normalized clustering coefficient (γ), and small-world measure (σ) from our study with previously published results on small-world characterization of functional brain networks constructed using EEG, MEG, and fMRI data.
n = number of nodes, m = number of edges.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Spectral clustering views the similarity matrix as a weighted graph and partitions the data by minimizing a graph-cut loss. Since it minimizes the across-cluster similarity, there is no need to model the distribution within each cluster, which reduces the chance of model misspecification, a common risk in mixture-model-based clustering. Nevertheless, compared to the latter, spectral clustering has no direct way of quantifying the clustering uncertainty (such as the assignment probability), nor does it allow easy model extensions for complicated data applications. To fill this gap, we propose the Bayesian forest model as a generative graphical model for spectral clustering. This is motivated by our discovery that the posterior connecting matrix in a forest model has almost the same leading eigenvectors as those used by normalized spectral clustering. To induce a distribution for the forest, we develop a "forest process" as a graph extension to the urn process, and we carefully characterize the differences in the partition probability. We derive a simple Markov chain Monte Carlo algorithm for posterior estimation and demonstrate superior performance compared to existing algorithms. We illustrate several model-based extensions useful for data applications, including high-dimensional and multi-view clustering for images. Supplementary materials for this article are available online.
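For reference, a minimal sketch of the normalized spectral clustering baseline the abstract compares against (assuming numpy and scikit-learn; this is not the proposed Bayesian forest model):

import numpy as np
from sklearn.cluster import KMeans

def normalized_spectral_clustering(S, k):
    # S: symmetric nonnegative similarity matrix (n x n) with positive row sums
    d = S.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    L_sym = np.eye(len(S)) - d_inv_sqrt[:, None] * S * d_inv_sqrt[None, :]
    vals, vecs = np.linalg.eigh(L_sym)                 # eigenvalues in ascending order
    U = vecs[:, :k]                                    # k leading eigenvectors
    U = U / np.linalg.norm(U, axis=1, keepdims=True)   # row-normalize the embedding
    return KMeans(n_clusters=k, n_init=10).fit_predict(U)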
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
The table summarizes the global values of connectivity strength S, shortest path length L, normalized path length, and normalized clustering coefficient C (normalized against 100 random graphs). FA = fractional anisotropy; SD = standard deviation.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Normalized ConceptNet 5 (SQLite, Filtered)
This dataset contains a normalized, filtered, and optimized version of the ConceptNet 5.5 knowledge graph, ready for high-performance querying in a single SQLite file. It is derived from the cstr/conceptnet-de-indexed dataset, which was a 23.6 GB un-normalized SQLite file containing 28.3 million nodes and 34 million edges. This version has been processed to be significantly smaller, faster, and data-correct.
Key Features… See the full description on the dataset page: https://huggingface.co/datasets/cstr/conceptnet-normalized-multi.
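A minimal querying sketch (the database file name, table, and column names below are hypothetical; consult the dataset page for the actual schema):

import sqlite3

# hypothetical file and schema; the dataset page defines the real names
con = sqlite3.connect("conceptnet-normalized.db")
rows = con.execute(
    "SELECT start_node, relation, end_node FROM edges WHERE start_node = ? LIMIT 10",
    ("/c/en/dog",),
)
for start, relation, end in rows:
    print(start, relation, end)
con.close()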
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
mRNA-seq assays on mouse tissues were downloaded from the ENCODE project and consolidated into expression matrices.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Task scheduler performance survey
This dataset contains the results of a task graph scheduler performance survey.
The results are stored in the following files, which correspond to simulations performed on
the elementary, irw and pegasus task graph datasets published at https://doi.org/10.5281/zenodo.2630384.
elementary-result.zip
irw-result.zip
pegasus-result.zip
The files contain pandas dataframes stored as compressed CSV; each one can be read with the following Python code:

import pandas as pd

# pandas transparently decompresses a ZIP archive that contains a single CSV file
frame = pd.read_csv("elementary-result.zip")
Each row in the frame corresponds to a single instance of a task graph that was simulated with a specific configuration (network model, scheduler etc.). The list below summarizes the meaning of the individual columns.
graph_name - name of the benchmarked task graph
graph_set - name of the task graph dataset from which the graph originates
graph_id - unique ID of the graph
cluster_name - type of cluster used in this instance; the format is <workers>x<cores>, e.g. 32x16 means 32 workers, each with 16 cores
bandwidth - network bandwidth [MiB/s]
netmodel - network model (simple or maxmin)
scheduler_name - name of the scheduler
imode - information mode
min_sched_interval - minimal scheduling delay [s]
sched_time - duration of each scheduler invocation [s]
time - simulated makespan of the task graph execution [s]
execution_time - real duration of all scheduler invocations [s]
total_transfer - amount of data transferred amongst workers [MiB]
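For example, a quick aggregation over the columns above (a sketch; it assumes the elementary results file is in the working directory):

import pandas as pd

frame = pd.read_csv("elementary-result.zip")
# mean simulated makespan per scheduler and bandwidth
summary = frame.groupby(["scheduler_name", "bandwidth"])["time"].mean()
print(summary.sort_values().head())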
The file charts.zip contains charts obtained by processing the datasets.
The X axis always shows the bandwidth [MiB/s].
The archive contains the following files:
[DATASET]-schedulers-time - Absolute makespan produced by schedulers [seconds]
[DATASET]-schedulers-score - The same as above but normalized with respect to the best schedule (shortest makespan) for the given configuration.
[DATASET]-schedulers-transfer - Sums of transfers between all workers for a given configuration [MiB]
[DATASET]-[CLUSTER]-netmodel-time - Comparison of netmodels, absolute times [seconds]
[DATASET]-[CLUSTER]-netmodel-score - Comparison of netmodels, normalized to the average of model "simple"
[DATASET]-[CLUSTER]-netmodel-transfer - Comparison of netmodels, sum of transferred data between all workers [MiB]
[DATASET]-[CLUSTER]-schedtime-time - Comparison of minimal scheduling delays (MSD), absolute times [seconds]
[DATASET]-[CLUSTER]-schedtime-score - Comparison of minimal scheduling delays (MSD), normalized to the average of the "MSD=0.0" case
[DATASET]-[CLUSTER]-imode-time - Comparison of information modes (imodes), absolute times [seconds]
[DATASET]-[CLUSTER]-imode-score - Comparison of information modes (imodes), normalized to the average of the "exact" imode
Reproducing the results
First install the Estee simulator:
$ git clone https://github.com/It4innovations/estee
$ cd estee
$ pip install .
You can either use benchmarks/generate.py to generate task graphs from the three categories (elementary, irw and pegasus):
$ cd benchmarks
$ python generate.py elementary.zip elementary
$ python generate.py irw.zip irw
$ python generate.py pegasus.zip pegasus
or use our task graph dataset that is provided at https://doi.org/10.5281/zenodo.2630384.
Describe the benchmark configuration in benchmark.json. Then you can run the benchmark using this command:
$ python pbs.py compute benchmark.json
The benchmark script can be interrupted at any time (for example using Ctrl+C). When interrupted, it stores the computed results in the result file and resumes the computation when launched again.
To generate the plots from the results, run:
$ python view.py --all
The resulting plots will appear in a folder called outputs.
License: Open Data Commons Attribution License (ODC-By) v1.0, https://www.opendatacommons.org/licenses/by/1.0/
We provide an academic graph based on a snapshot of the Microsoft Academic Graph from 26.05.2021. The Microsoft Academic Graph (MAG) is a large-scale dataset containing information about scientific publication records, their citation relations, as well as authors, affiliations, journals, conferences and fields of study. We acknowledge the Microsoft Academic Graph using the URI https://aka.ms/msracad. For more information regarding schema and the entities present in the original dataset please refer to: MAG schema.
MAG for Heterogeneous Graph Learning
We use a recent version of MAG from May 2021 and extract all relevant entities to build a graph that can be directly used for heterogeneous graph learning (node classification, link prediction, etc.). The graph contains all English papers published after 1900 that have been cited at least 5 times per year since the time of publishing. For fairness, we set a constant citation bound of 100 for papers published before 2000. We further include two smaller subgraphs, one containing computer science papers and one containing medicine papers.
Nodes and features
We define the following nodes:
paper with mag_id, graph_id, normalized title, year of publication, citations, and a 128-dimensional title embedding built using word2vec. No. of papers: 5,091,690 (all), 1,014,769 (medicine), 367,576 (computer science);
author with mag_id, graph_id, normalized name, citations. No. of authors: 6,363,201 (all), 1,797,980 (medicine), 557,078 (computer science);
field with mag_id, graph_id, level, citations, where level denotes the hierarchical level of the field and 0 is the highest level (e.g. computer science). No. of fields: 199,457 (all), 83,970 (medicine), 45,454 (computer science);
affiliation with mag_id, graph_id, citations. No. of affiliations: 19,421 (all), 12,103 (medicine), 10,139 (computer science);
venue with mag_id, graph_id, citations, and type denoting whether it is a conference or a journal. No. of venues: 24,608 (all), 8,514 (medicine), 9,893 (computer science).
Edges
We define the following edges:
author is_affiliated_with affiliation. No. of author-affiliation edges: 8,292,253 (all), 2,265,728 (medicine), 665,931 (computer science);
author is_first/last/other paper. No. of author-paper edges: 24,907,473 (all), 5,081,752 (medicine), 1,269,485 (computer science);
paper has_citation_to paper. No. of paper-paper citation edges: 142,684,074 (all), 16,808,837 (medicine), 4,152,804 (computer science);
paper conference/journal_published_at venue. No. of paper-venue edges: 5,091,690 (all), 1,014,769 (medicine), 367,576 (computer science);
paper has_field_L0/L1/L2/L3/L4 field. No. of paper-field edges: 47,531,366 (all), 9,403,708 (medicine), 3,341,395 (computer science);
field is_in field. No. of field-field edges: 339,036 (all), 138,304 (medicine), 83,245 (computer science).
We further include a reverse edge for each edge type defined above that is denoted with the prefix rev_ and can be removed based on the downstream task.
Data structure
The nodes and their respective features are provided as separate .tsv files where each feature represents a column. The edges are provided as a pickled Python dictionary with the schema:
{target_type: {source_type: {edge_type: {target_id: {source_id: time}}}}}
We provide three compressed ZIP archives, one for each subgraph (all, medicine, computer science); the archive for the complete graph is split into 500 MB chunks. Each archive contains the separate node features and the edge dictionary.
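A minimal loading sketch under the schema above (the file names are hypothetical; the extracted archives define the actual ones):

import pickle
import pandas as pd

# hypothetical file names; see the extracted archive contents
papers = pd.read_csv("paper.tsv", sep="\t")
with open("edges.pkl", "rb") as f:
    edges = pickle.load(f)

# schema: {target_type: {source_type: {edge_type: {target_id: {source_id: time}}}}}
citations = edges["paper"]["paper"]["has_citation_to"]
print(len(citations), "papers appear as targets of citation edges")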
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
The use of RNA-sequencing has garnered much attention in recent years for characterizing and understanding various biological systems. However, it remains a major challenge to gain insights from a large number of RNA-seq experiments collectively, due to the normalization problem. Normalization has been challenging due to an inherent circularity: RNA-seq data must be normalized before any pattern of differential (or non-differential) expression can be ascertained, while prior knowledge of non-differential transcripts is crucial to the normalization process. Some methods have successfully overcome this problem by assuming that most transcripts are not differentially expressed. However, as RNA-seq profiles become more abundant and heterogeneous, this assumption fails to hold, leading to erroneous normalization. We present a normalization procedure that relies on neither this assumption nor prior knowledge about the reference transcripts. This algorithm is based on a graph constructed from intrinsic correlations among RNA-seq transcripts and seeks to identify a set of densely connected vertices as references. Applied to our synthesized validation data, the algorithm recovered the reference transcripts with high precision, resulting in high-quality normalization. On a realistic data set from the ENCODE project, it gave good results and finished in a reasonable time. These preliminary results imply that we may be able to break the long-persisting circularity problem in RNA-seq normalization.
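An illustrative sketch of the general idea (not the authors' exact algorithm): build a correlation graph over transcripts, treat the most densely connected transcripts as references, and scale each sample against them.

import numpy as np

def graph_reference_normalization(counts, corr_threshold=0.9, n_refs=100):
    # counts: transcripts x samples matrix of raw expression values
    log_expr = np.log1p(counts)
    corr = np.corrcoef(log_expr)                 # transcript-transcript correlations
    adj = (corr > corr_threshold).astype(int)
    np.fill_diagonal(adj, 0)
    # "densely connected vertices" approximated here by highest degree
    refs = np.argsort(adj.sum(axis=1))[-n_refs:]
    size_factors = np.median(log_expr[refs], axis=0)   # one factor per sample
    return log_expr - size_factors               # normalized log-expression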
License: https://fred.stlouisfed.org/legal/#copyright-citation-required
Graph and download economic data for Composite Leading Indicators: Reference Series (GDP) Normalized for Spain (ESPLORSGPNOSTSAM) from Feb 1960 to Nov 2023 about leading indicator, Spain, and GDP.
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
The dissection of the mode and tempo of phenotypic evolution is integral to our understanding of global biodiversity. Our ability to infer patterns of phenotypes across phylogenetic clades is essential to how we infer the macroevolutionary processes governing those patterns. Many methods are already available for fitting models of phenotypic evolution to data. However, there is currently no comprehensive non-parametric framework for characterising and comparing patterns of phenotypic evolution. Here we build on a recently introduced approach for using the phylogenetic spectral density profile to compare and characterize patterns of phylogenetic diversification, in order to provide a framework for non-parametric analysis of phylogenetic trait data. We show how to construct the spectral density profile of trait data on a phylogenetic tree from the normalized graph Laplacian. We demonstrate on simulated data the utility of the spectral density profile to successfully cluster phylogenetic trait data into meaningful groups and to characterise the phenotypic patterning within those groups. We furthermore demonstrate how the spectral density profile is a powerful tool for visualising phenotypic space across traits and for assessing whether distinct trait evolution models are distinguishable on a given empirical phylogeny. We illustrate the approach in two empirical datasets: a comprehensive dataset of traits involved in song, plumage and resource-use in tanagers, and a high-dimensional dataset of endocranial landmarks in New World monkeys. Considering the proliferation of morphometric and molecular data collected across the tree of life, we expect this approach will benefit big data analyses requiring a comprehensive and intuitive framework.
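A minimal sketch of a spectral density profile computed from a normalized graph Laplacian (generic; how trait data are mapped onto the graph is the paper's contribution and is not reproduced here):

import numpy as np

def spectral_density_profile(adjacency, bandwidth=0.05):
    # normalized graph Laplacian: L = I - D^{-1/2} A D^{-1/2}
    deg = adjacency.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, 1e-12))
    lap = np.eye(len(deg)) - d_inv_sqrt[:, None] * adjacency * d_inv_sqrt[None, :]
    eigvals = np.linalg.eigvalsh(lap)            # spectrum lies in [0, 2]
    grid = np.linspace(0.0, 2.0, 200)
    # Gaussian smoothing of the eigenvalue spectrum into a density profile
    dens = np.exp(-0.5 * ((grid[:, None] - eigvals[None, :]) / bandwidth) ** 2).sum(axis=1)
    return grid, dens / np.trapz(dens, grid)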
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
ConceptNet 5 (Un-normalized SQLite, 23.6 GB)
This repository contains the complete, un-normalized ConceptNet 5.5 knowledge graph in SQLite format. Unlike the filtered version, this dataset includes all languages from the original ConceptNet release. The database conceptnet-de-indexed.db is a 23.6 GB un-normalized SQLite file containing the full knowledge graph with all 28.3 million nodes and 34 million edges across all languages.
When to Use This Dataset
Use this… See the full description on the dataset page: https://huggingface.co/datasets/cstr/conceptnet-de-indexed.
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
This dataset provides a five-level, fine-grained, and structurally normalized knowledge-graph representation of a qualitative methods text corpus (Research with Qualitative Data), treated purely as text data rather than as a bibliographic object. Each record corresponds to a node at one of five hierarchical levels—macro-section (level 1), meso-section (level 2), paragraph (level 3), sentence (level 4), and keyword/media snippet (level 5)—with explicit parent–child links (e.g., sentence → paragraph, paragraph → meso-section), forming a closed, acyclic tree structure. For all machine-readable content in the source PDF, the dataset decomposes the corpus into independent nodes while preserving page locators and section titles, so that any fragment of text can be traced back to its exact position in the original file. Keyword nodes are automatically extracted from sentences to enhance search, thematic mapping, and downstream modeling without altering or compressing the underlying text. For tables and images, the dataset stores captions, surrounding textual context, and row-level data_points where applicable, enabling full reconstruction of tabular and visual information at the text level. Under the assumption that "all machine-readable text in the PDF is the reference universe," the collection achieves a practically lossless representation of the qualitative methods corpus and has been independently checked for level completeness, parent–child consistency, and content integrity, supporting its designation as a five-level, lossless text-based knowledge-graph dataset suitable for advanced qualitative methodology research, knowledge-graph engineering, and large-language-model retrieval and reasoning experiments.
License: https://www.shibatadb.com/license/data/proprietary/v1.0/license.txt
Network of 42 papers and 63 citation links related to "Importance of input data normalization for the application of neural networks to complex industrial problems".
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Contains the cost matrix training datasets. Note that "standardized" refers to data that has been standardized, while "normalized" refers to data that has been normalized. If the filename contains "noc", the data was normalized or standardized using only the normal operating data samples as reference, while "all" means the entire dataset was used as reference.
License: https://fred.stlouisfed.org/legal/#copyright-citation-required
Graph and download economic data for Composite Leading Indicators: Reference Series (GDP) Normalized for India (INDLORSGPNOSTSAM) from May 1996 to Aug 2023 about leading indicator, India, and GDP.
License: https://fred.stlouisfed.org/legal/#copyright-citation-required
Graph and download economic data for Composite Leading Indicators: Reference Series (GDP) Normalized for Japan (JPNLORSGPNOSTSAM) from Feb 1960 to Aug 2023 about leading indicator, Japan, and GDP.
Graph theory is useful for estimating time-dependent model parameters via weighted least-squares using interferometric synthetic aperture radar (InSAR) data. Plotting acquisition dates (epochs) as vertices and pair-wise interferometric combinations as edges defines an incidence graph. The edge-vertex incidence matrix and the normalized edge Laplacian matrix are factors in the covariance matrix for the pair-wise data. Using empirical measures of residual scatter in the pair-wise observations, we estimate the variance at each epoch by inverting the covariance of the pair-wise data. We evaluate the rank deficiency of the corresponding least-squares problem via the edge-vertex incidence matrix. We implement our method in a MATLAB software package called GraphTreeTA available on GitHub (https://github.com/feigl/gipht). We apply temporal adjustment to the data set described in Lu et al. (2005) at Okmok volcano, Alaska, which erupted most recently in 1997 and 2008. The data set contains 44 differential volumetric changes and uncertainties estimated from interferograms between 1997 and 2004. Estimates show that approximately half of the magma volume lost during the 1997 eruption was recovered by the summer of 2003. Between June 2002 and September 2003, the estimated rate of volumetric increase is (6.2 +/- 0.6) x 10^6 m^3/yr. Our preferred model provides a reasonable fit that is compatible with viscoelastic relaxation in the five years following the 1997 eruption. Although we demonstrate the approach using volumetric rates of change, our formulation in terms of incidence graphs applies to any quantity derived from pair-wise differences, such as wrapped phase or wrapped residuals.
Date of final oral examination: 05/19/2016. This thesis is approved by the following members of the Final Oral Committee: Kurt L. Feigl, Professor, Geoscience; Michael Cardiff, Assistant Professor, Geoscience; Clifford H. Thurber, Vilas Distinguished Professor, Geoscience.
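A generic sketch of temporal adjustment from pair-wise differences (the epochs, pairs, and data values below are hypothetical; GraphTreeTA itself is MATLAB, and this is a plain Python illustration of the incidence-matrix formulation):

import numpy as np

# hypothetical epochs and interferometric pairs for illustration
n_epochs = 4
pairs = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)]
data = np.array([1.2, 3.0, 1.9, 2.5, 0.7])     # pair-wise differences (hypothetical)

# edge-vertex incidence matrix: -1 at the earlier epoch, +1 at the later one
B = np.zeros((len(pairs), n_epochs))
for k, (i, j) in enumerate(pairs):
    B[k, i], B[k, j] = -1.0, 1.0

# the incidence matrix is rank-deficient by one (epoch values are only
# determined up to a constant), so fix the first epoch at zero and solve
m, *_ = np.linalg.lstsq(B[:, 1:], data, rcond=None)
epoch_values = np.concatenate([[0.0], m])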
License: https://www.shibatadb.com/license/data/proprietary/v1.0/license.txt
Network of 29 papers and 63 citation links related to "Histogram-Based Image Retrieval Keyed by Normalized HSY Histograms and Its Experiments on a Pilot Dataset".
License: https://www.shibatadb.com/license/data/proprietary/v1.0/license.txt
Network of 19 papers and 31 citation links related to "Normalized Acquisition System of the Facial Diagnosis in Traditional Chinese Medicine".