Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Note: none of the data sets published here contain actual data, they are for testing purposes only.
This data repository contains graph datasets, where each graph is represented by two CSV files: one for node information and another for edge details. To link the files to the same graph, their names include a common identifier based on the number of nodes. For example:
dataset_30_nodes_interactions.csv: contains 30 rows (nodes).
dataset_30_edges_interactions.csv: contains 47 rows (edges).
dataset_30 refers to the same graph.
Each node file contains the following columns:
Name of the Column | Type | Description |
UniProt ID | string | protein identification |
label | string | protein label (type of node) |
properties | string | a dictionary containing properties related to the protein. |
Each edge file contains the following columns:
Name of the Column | Type | Description |
Relationship ID | string | relationship identification |
Source ID | string | identification of the source protein in the relationship |
Target ID | string | identification of the target protein in the relationship |
label | string | relationship label (type of relationship) |
properties | string | a dictionary containing properties related to the relationship. |
Graph | Number of Nodes | Number of Edges | Sparse graph |
dataset_30* | 30 | 47 | Y |
dataset_60* | 60 | 181 | Y |
dataset_120* | 120 | 689 | Y |
dataset_240* | 240 | 2819 | Y |
dataset_300* | 300 | 4658 | Y |
dataset_600* | 600 | 18004 | Y |
dataset_1200* | 1200 | 71785 | Y |
dataset_2400* | 2400 | 288600 | Y |
dataset_3000* | 3000 | 449727 | Y |
dataset_6000* | 6000 | 1799413 | Y |
dataset_12000* | 12000 | 7199863 | Y |
dataset_24000* | 24000 | 28792361 | Y |
dataset_30000* | 30000 | 44991744 | Y |
This repository includes two (2) additional tiny graph datasets to experiment with before dealing with larger datasets.
Each node file contains the following columns:
Name of the Column | Type | Description |
ID | string | node identification |
label | string | node label (type of node) |
properties | string | a dictionary containing properties related to the node. |
Each edge file contains the following columns:
Name of the Column | Type | Description |
ID | string | relationship identification |
source | string | identification of the source node in the relationship |
target | string | identification of the target node in the relationship |
label | string | relationship label (type of relationship) |
properties | string | a dictionary containing properties related to the relationship. |
Graph | Number of Nodes | Number of Edges | Sparse graph |
dataset_dummy* | 3 | 6 | N |
dataset_dummy2* | 3 | 6 | N |
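As a minimal loading sketch (assuming the CSV headers match the column names in the tables above), one of the protein graphs can be read into Python with pandas and networkx:

import pandas as pd
import networkx as nx

# file names follow the dataset_<n>_* pattern described above
nodes = pd.read_csv("dataset_30_nodes_interactions.csv")
edges = pd.read_csv("dataset_30_edges_interactions.csv")

g = nx.DiGraph()
for _, row in nodes.iterrows():
    g.add_node(row["UniProt ID"], label=row["label"], properties=row["properties"])
for _, row in edges.iterrows():
    g.add_edge(row["Source ID"], row["Target ID"],
               label=row["label"], properties=row["properties"])

print(g.number_of_nodes(), g.number_of_edges())  # expected: 30, 47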
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
The awesome datasets graph is a Neo4j graph database which catalogs and classifies datasets and data sources as scraped from the Awesome Public Datasets GitHub list.
We started with a simple list of links on the Awesome Public Datasets page. We now have a semantic graph database with ten labels, five relationship types, nine property keys, and more than 400 nodes, all within a 1 MB database footprint. All database operations are query driven using the powerful and flexible Cypher Graph Query Language.
The download includes CSV files which were created as an interim step after scraping and wrangling the source. The download also includes a working Neo4j Graph Database. Login: neo4j | Password: demo.
Data scraped from Awesome Public Datasets page. Prepared for the book Data Science Solutions.
While we have done basic data wrangling and preparation, how can this graph prove useful for your data science workflow? Can we record the decisions taken across the workflow stages of a data science project, and how the data catalog (data sources, datasets, tools) informs those decisions in pursuit of data science solution strategies?
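For illustration, a hedged sketch of querying the bundled Neo4j database from Python with the official neo4j driver, using the credentials above; the bolt URL assumes a locally running instance, and the query simply counts nodes per label:

from neo4j import GraphDatabase

# credentials from the description (Login: neo4j | Password: demo)
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "demo"))

with driver.session() as session:
    result = session.run(
        "MATCH (n) RETURN labels(n) AS labels, count(*) AS n ORDER BY n DESC"
    )
    for record in result:
        print(record["labels"], record["n"])

driver.close()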
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Categorical scatterplots with R for biologists: a step-by-step guide
Benjamin Petre1, Aurore Coince2, Sophien Kamoun1
1 The Sainsbury Laboratory, Norwich, UK; 2 Earlham Institute, Norwich, UK
Weissgerber and colleagues (2015) recently stated that ‘as scientists, we urgently need to change our practices for presenting continuous data in small sample size studies’. They called for more scatterplot and boxplot representations in scientific papers, which ‘allow readers to critically evaluate continuous data’ (Weissgerber et al., 2015). In the Kamoun Lab at The Sainsbury Laboratory, we recently implemented a protocol to generate categorical scatterplots (Petre et al., 2016; Dagdas et al., 2016). Here we describe the three steps of this protocol: 1) formatting of the data set in a .csv file, 2) execution of the R script to generate the graph, and 3) export of the graph as a .pdf file.
Protocol
• Step 1: format the data set as a .csv file. Store the data in a three-column Excel file as shown in the PowerPoint slide. The first column ‘Replicate’ indicates the biological replicates. In the example, the month and year during which the replicate was performed are indicated. The second column ‘Condition’ indicates the conditions of the experiment (in the example, a wild type and two mutants called A and B). The third column ‘Value’ contains the continuous values. Save the Excel file as a .csv file (File -> Save as -> in ‘File Format’, select .csv). This .csv file is the input file to import into R.
• Step 2: execute the R script (see Notes 1 and 2). Copy the script shown in the PowerPoint slide and paste it into the R console. Execute the script. In the dialog box, select the input .csv file from step 1. The categorical scatterplot will appear in a separate window. Dots represent the values for each sample; colors indicate replicates. Boxplots are superimposed; black dots indicate outliers.
• Step 3: save the graph as a .pdf file. Resize the window as needed and save the graph as a .pdf file (File -> Save as). See the PowerPoint slide for an example.
Notes
• Note 1: install the ggplot2 package. The R script requires the package ‘ggplot2’ to be installed. To install it, go to Packages & Data -> Package Installer, enter ‘ggplot2’ in the Package Search field and click on ‘Get List’. Select ‘ggplot2’ in the Package column and click on ‘Install Selected’. Install all dependencies as well.
• Note 2: use a log scale for the y-axis. To use a log scale for the y-axis of the graph, use the command line below in place of command line #7 in the script.
graph + geom_boxplot(outlier.colour='black', colour='black') + geom_jitter(aes(col=Replicate)) + scale_y_log10() + theme_bw()
References
Dagdas YF, Belhaj K, Maqbool A, Chaparro-Garcia A, Pandey P, Petre B, et al. (2016) An effector of the Irish potato famine pathogen antagonizes a host autophagy cargo receptor. eLife 5:e10856.
Petre B, Saunders DGO, Sklenar J, Lorrain C, Krasileva KV, Win J, et al. (2016) Heterologous Expression Screens in Nicotiana benthamiana Identify a Candidate Effector of the Wheat Yellow Rust Pathogen that Associates with Processing Bodies. PLoS ONE 11(2):e0149035.
Weissgerber TL, Milic NM, Winham SJ, Garovic VD (2015) Beyond Bar and Line Graphs: Time for a New Data Presentation Paradigm. PLoS Biol 13(4):e1002128.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains a collection of data about 454 value chains from 23 rural European areas of 16 countries. This data is obtained through a semi-automatic workflow that transforms raw textual data from an unstructured MS Excel sheet into semantic knowledge graphs. In particular, the repository contains:
An MS Excel sheet containing details of the different value chains, provided by the MOuntain Valorisation through INterconnectedness and Green growth (MOVING) European project;
454 CSV files containing events, titles, entities and coordinates of narratives of each value chain, obtained by pre-processing the MS Excel sheet;
454 Web Ontology Language (OWL) files. This collection of files is the result of the semi-automatic workflow, and is organized as a semantic knowledge graph of narratives, where each narrative is a sub-graph explaining one of the 454 value chains and its territory aspects. The knowledge graph is based on the Narrative Ontology, an ontology developed by the Institute of Information Science and Technologies (ISTI-CNR) as an extension of CIDOC CRM, FRBRoo, and OWL Time;
Two CSV files that compile all the available information extracted from the 454 OWL files;
GeoPackage files with the geographic coordinates related to the narratives;
HTML files that show the different SPARQL and GeoSPARQL queries;
HTML files that show the story maps about the 454 value chains;
An image showing how the various components of the dataset interact with each other.
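As a rough sketch of how one of the OWL narrative files might be inspected locally with the Python library rdflib (the file name and the RDF/XML serialization are assumptions, not confirmed by the repository):

import rdflib

g = rdflib.Graph()
g.parse("value_chain_001.owl", format="xml")  # hypothetical file name

# count how often each class is instantiated in this narrative sub-graph
query = """
    SELECT ?class (COUNT(?s) AS ?n)
    WHERE { ?s a ?class }
    GROUP BY ?class
    ORDER BY DESC(?n)
"""
for cls, n in g.query(query):
    print(cls, n)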
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Graph datasets in CSV format, used in the article Learning Functional Causal Models with Generative Neural Networks.
1) Each file *_numdata.csv contains the data of around 20 variables connected in a graph without hidden variables. G2, G3, G4 and G5 refer to graphs with a maximum of 2, 3, 4 and 5 parents per node. Each file *_target.csv contains the ground truth of the graph as cause -> effect pairs. Files beginning with "Big" are larger graphs with 100 variables.
2) Each file *_confounders_numdata.csv contains the data of around 20 variables connected in a graph with 3 hidden variables. Each file *_confounders_skeleton.csv contains the skeleton of the graph (including spurious links due to common hidden causes). Each file *_confounders_target.csv contains the ground truth of the graph with the direct visible cause -> effect links. The task is to recover the direct visible cause -> effect links while removing the spurious links from the skeleton.
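A minimal loading sketch with pandas, assuming the G2 prefix for illustration (the exact column layout of the target file is an assumption):

import pandas as pd

data = pd.read_csv("G2_numdata.csv")    # observational samples, one column per variable
target = pd.read_csv("G2_target.csv")   # ground-truth cause -> effect pairs

print(data.shape)     # (n_samples, ~20 variables)
print(target.head())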
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains the results of the interlinking process between selected CSV datasets harvested by the European Data Portal and the DBpedia knowledge graph.
We aim to answer the following questions:
What are the most popular column types? This will provide insight into what the datasets hold and how they can be joined. It will also indicate what specific linking schemes could be applied in future work.
What datasets have columns of the same type? This will suggest datasets that may be similar or related.
What entities appear in most datasets (co-referent entities)? This will suggest entities for which more data is published.
What datasets share a particular entity? This will suggest datasets that may be joined, or are related through that particular entity.
Results are provided as augmented tables, which contain the columns of the original CSV, plus a metadata file in JSON-LD format. The metadata files can be loaded into an RDF store and queried.
Refer to the accompanying report of activities for more details on the methodology and how to query the dataset.
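Since the metadata files are JSON-LD, they can be parsed into any RDF store; a minimal sketch with the Python library rdflib (rdflib 6+ parses JSON-LD natively; the file name is a placeholder):

import rdflib

g = rdflib.Graph()
g.parse("augmented_table_metadata.jsonld", format="json-ld")  # hypothetical name

# print all triples to inspect the column-type annotations
for s, p, o in g:
    print(s, p, o)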
biogas/biogas_0/supplydata197.csv (used in step 2, where supply data are specified). This dataset is associated with the following publication: Hu, Y., W. Zhang, P. Tominac, M. Shen, D. Göreke, E. Martín-Hernández, M. Martín, G.J. Ruiz-Mercado, and V.M. Zavala. ADAM: A web platform for graph-based modeling and optimization of supply chains. COMPUTERS AND CHEMICAL ENGINEERING. Elsevier Science Ltd, New York, NY, USA, 165: 107911, (2022).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This database collates 3552 development indicators from different studies with data by country and year, including single-year and multiple-year time series. The data is presented as charts; the underlying data can be downloaded from the linked project pages/references for each set, and the data for each presented graph is available as a CSV file as well as a visual download of the graph (both available via the download link under each chart).
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Time-Series Matrix (TSMx): A visualization tool for plotting multiscale temporal trends. TSMx is an R script that was developed to facilitate multi-temporal-scale visualizations of time-series data. The script requires only a two-column CSV of years and values to plot the slope of the linear regression line for all possible year combinations from the supplied temporal range. The outputs include a time-series matrix showing slope direction based on the linear regression, slope values plotted with colors indicating magnitude, and results of a Mann-Kendall test.
The start year is indicated on the y-axis and the end year is indicated on the x-axis. In the example below, the cell in the top-right corner is the direction of the slope for the temporal range 2001–2019. The red line corresponds with the temporal range 2010–2019 and an arrow is drawn from the cell that represents that range. One cell is highlighted with a black border to demonstrate how to read the chart: that cell represents the slope for the temporal range 2004–2014.
This publication entry also includes an Excel template that produces the same visualizations without a need to interact with any code, though minor modifications will need to be made to accommodate year ranges other than what is provided.
TSMx for R was developed by Georgios Boumis; TSMx was originally conceptualized and created by Brad G. Peter in Microsoft Excel. Please refer to the associated publication: Peter, B.G., Messina, J.P., Breeze, V., Fung, C.Y., Kapoor, A. and Fan, P., 2024. Perspectives on modifiable spatiotemporal unit problems in remote sensing of agriculture: evaluating rice production in Vietnam and tools for analysis. Frontiers in Remote Sensing, 5, p.1042624. https://www.frontiersin.org/journals/remote-sensing/articles/10.3389/frsen.2024.1042624
TSMx sample chart from the supplied Excel template. Data represent the productivity of rice agriculture in Vietnam as measured via EVI (enhanced vegetation index) from the NASA MODIS data product (MOD13Q1.V006).
TSMx R script:
# import packages
library(dplyr)
library(readr)
library(ggplot2)
library(tibble)
library(tidyr)
library(forcats)
library(Kendall)
options(warn = -1) # disable warnings
# read data (.csv file with "Year" and "Value" columns)
data <- read_csv("EVI.csv")
# prepare row/column names for output matrices
years <- data %>% pull("Year")
r.names <- years[-length(years)]
c.names <- years[-1]
years <- years[-length(years)]
# initialize output matrices
sign.matrix <- matrix(data = NA, nrow = length(years), ncol = length(years))
pval.matrix <- matrix(data = NA, nrow = length(years), ncol = length(years))
slope.matrix <- matrix(data = NA, nrow = length(years), ncol = length(years))
# function to return remaining years given a start year
getRemain <- function(start.year) {
  years <- data %>% pull("Year")
  start.ind <- which(data[["Year"]] == start.year) + 1
  remain <- years[start.ind:length(years)]
  return(remain)
}
# function to subset data for a start/end year combination
splitData <- function(end.year, start.year) {
  keep <- which(data[['Year']] >= start.year & data[['Year']] <= end.year)
  batch <- data[keep,]
  return(batch)
}
# function to fit linear regression and return slope direction
fitReg <- function(batch) {
  trend <- lm(Value ~ Year, data = batch)
  slope <- coefficients(trend)[[2]]
  return(sign(slope))
}
# function to fit linear regression and return slope magnitude
fitRegv2 <- function(batch) {
  trend <- lm(Value ~ Year, data = batch)
  slope <- coefficients(trend)[[2]]
  return(slope)
}
# function to implement Mann-Kendall (MK) trend test and return significance
# the test is implemented only for n >= 8
getMann <- function(batch) {
  if (nrow(batch) >= 8) {
    mk <- MannKendall(batch[['Value']])
    pval <- mk[['sl']]
  } else {
    pval <- NA
  }
  return(pval)
}
# function to return slope direction for all combinations given a start year
getSign <- function(start.year) {
  remaining <- getRemain(start.year)
  combs <- lapply(remaining, splitData, start.year = start.year)
  signs <- lapply(combs, fitReg)
  return(signs)
}
# function to return MK significance for all combinations given a start year
getPval <- function(start.year) {
  remaining <- getRemain(start.year)
  combs <- lapply(remaining, splitData, start.year = start.year)
  pvals <- lapply(combs, getMann)
  return(pvals)
}
# function to return slope magnitude for all combinations given a start year
getMagn <- function(start.year) {
  remaining <- getRemain(start.year)
  combs <- lapply(remaining, splitData, start.year = start.year)
  magns <- lapply(combs, fitRegv2)
  return(magns)
}
# retrieve slope direction, MK significance, and slope magnitude
signs <- lapply(years, getSign)
pvals <- lapply(years, getPval)
magns <- lapply(years, getMagn)
# fill-in output matrices
dimension <- nrow(sign.matrix)
for (i in 1:dimension) {
  sign.matrix[i, i:dimension] <- unlist(signs[i])
  pval.matrix[i, i:dimension] <- unlist(pvals[i])
  slope.matrix[i, i:dimension] <- unlist(magns[i])
}
sign.matrix <-...
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The dataset contains the instance files for the paper X. Klimentova, A. Viana, J. P. Pedroso, N. Santos. Fairness models for multi-agent kidney exchange programmes. To appear in Omega: The International Journal of Management Science (2020). The same dataset was also used in Monteiro, T., Klimentova, X., Pedroso, J.P., Viana, A. A comparison of matching algorithms for kidney exchange programs addressing waiting time. Cent Eur J Oper Res (2020). https://doi.org/10.1007/s10100-020-00680-y
Each instance mimics the pools of kidney exchange programmes of several agents (e.g. countries) over time. Incompatible donor-recipient pairs appear and leave along the time horizon. Each pair belongs to the pool of one of the agents. The virtual compatibility among pairs is represented on a directed graph G = (V,A), called the compatibility graph, where the set of vertices V corresponds to the set of incompatible pairs and non-directed donors. An arc from a vertex i to a vertex j indicates compatibility of the donor in i with the patient in j. Positive crossmatch testing is also incorporated by saving the arcs that would fail in case they are chosen in a cycle in one of the matching runs. The generator creates random graphs based on probabilities of blood type and of donor-patient tissue compatibility; the arrival of pairs and non-directed donors is generated based on given arrival rates.
An instance of the dataset represents the pools of 4 agents, simulated for a period of 6 years. There are 100 instances compressed in 4 zip archives, each containing 25 instances. Each instance is described by 3 files, where the index s is the seed used for the random function when generating the instance.
a) characterisations_s.csv -- CSV file that contains information on each pair in the merged pool in the following columns:
0: Pair ID
1: Donor ID
2: Donor blood type
3: Donor age
4: Patient ID
5: Patient blood type
6: Patient PRA
7: Patient cross-match probability
8: Patient age
9: Pair arrival day
10: Pair departure day
11: Pair probability of failure
12: Pair from pool (e.g. country to which the pair belongs)
In the case of a non-directed donor, the information about the patient is filled with -1.
b) arcs_s.csv -- CSV file that contains the compatibility graph described above. The first line contains the values n (number of vertices in the graph) and m (number of arcs in the graph). In the following m lines, the existing arcs (i,j) are presented as: i j w_ij, where i and j are IDs of pairs and w_ij is the weight of the arc, which is always equal to 1.0 for all instances in this dataset.
c) fail_arcs_s.csv -- the list of arcs that would fail due to a positive crossmatch test in case they appear in a chosen cycle or chain in any matching run. The format is the same as for arcs_s.csv: the first line contains n (the number of vertices in the graph) and m_fail (the number of failed arcs), listed in the following m_fail lines in the same way as in arcs_s.csv.
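A small Python sketch of reading an arcs file into an edge list, assuming the whitespace-separated layout described above (the seed in the file name is illustrative):

# parse arcs_1.csv: first line "n m", then m lines "i j w_ij"
with open("arcs_1.csv") as f:
    n, m = map(int, f.readline().split())
    arcs = []
    for _ in range(m):
        i, j, w = f.readline().split()
        arcs.append((int(i), int(j), float(w)))

print(f"{n} vertices, {len(arcs)} arcs")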
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Wikipedia temporal graph.
The dataset is based on two Wikipedia SQL dumps: (1) English language articles and (2) user visit counts per page per hour (aka pagecounts). The original datasets are publicly available on the Wikimedia website.
Static graph structure is extracted from English-language Wikipedia articles. Redirects are removed. Before building the Wikipedia graph we introduce thresholds on the minimum number of visits per hour and the maximum in-degree. We remove the pages that do not reach 500 visits per hour at least once during the specified period. We also remove the nodes (pages) with in-degree higher than 8 000 to build a more meaningful initial graph. After cleaning, the graph contains 116 016 nodes (out of 4 856 639 total pages) and 6 573 475 edges. The graph can be imported in two ways: (1) using edges.csv and vertices.csv or (2) using the enwiki-20150403-graph.gt file, which can be opened with the open-source Python library Graph-Tool.
Time-series data contains users' visit counts from 02:00, 23 September 2014 until 23:00, 30 April 2015. The total number of hours is 5278. The data is stored in two formats: CSV and H5. The CSV file contains data in the format [page_id :: count_views :: layer], where layer represents an hour. In the H5 file, each layer corresponds to an hour as well.
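A hedged sketch of import option (2) with Graph-Tool, plus a read of the time-series CSV (the CSV file name and its comma separator are assumptions; adjust sep if the file uses another delimiter):

import graph_tool as gt
import pandas as pd

g = gt.load_graph("enwiki-20150403-graph.gt")
print(g.num_vertices(), g.num_edges())  # expected: 116016 and 6573475

# hypothetical file name; rows follow [page_id :: count_views :: layer]
ts = pd.read_csv("pagecounts.csv", names=["page_id", "count_views", "layer"])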
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
These are six real-world temporal graph datasets with text attributes, varying in size and originating from different domains. Specifically, they are collected from culinary recipe feedback, movie reviews, book reading records, beer rating data, and online shopping interactions. In these temporal graphs, users and items are represented as nodes, while the interactions between them (in the form of user reviews or comments) serve as edges. Each edge is associated with both a timestamp and raw textual content. Additionally, each item node is accompanied by a descriptive text attribute. The files under the $dataset directory are as follows:
1. raw_node.npy and raw_edges.csv store the raw text attributes of nodes and edges, respectively.
2. ml_$dataset.csv records the temporal edges of the dataset, where each row in the format (u, i, ts) represents a user u interacting with an item i at timestamp ts.
3. $dataset_unique_labels.json contains the complete set of human-readable labels for the dataset.
4. Both $dataset_labels_text.json and $dataset_labels.json correspond to the labels associated with each edge in ml_$dataset.csv, where the former provides the textual form of the item labels that users are interested in, and the latter provides their corresponding numeric labels.
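A short loading sketch under the file layout above, with $dataset replaced by a hypothetical directory name:

import json
import numpy as np
import pandas as pd

name = "recipes"  # hypothetical $dataset value

node_text = np.load(f"{name}/raw_node.npy", allow_pickle=True)  # raw node text
edge_text = pd.read_csv(f"{name}/raw_edges.csv")                # raw edge text
edges = pd.read_csv(f"{name}/ml_{name}.csv")                    # (u, i, ts) rows

with open(f"{name}/{name}_unique_labels.json") as f:
    labels = json.load(f)
print(len(edges), "temporal edges,", len(labels), "unique labels")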
Attribution 3.0 (CC BY 3.0) https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
This dataset is part of the bachelor thesis "Evaluating SQuAD-based Question Answering for the Open Research Knowledge Graph Completion". It was created for fine-tuning BERT-based models pre-trained on the SQuAD dataset. The dataset was created using a semi-automatic approach on the ORKG data. The dataset.csv file contains the entire data (all properties) in tabular form and is unsplit. The JSON files contain only the fields necessary for training and evaluation, with additional fields (index of start and end of the answers in the abstracts). The data in the JSON files is split into training and evaluation data. We create 4 variants of the training and evaluation sets, one for each of the question labels ("no label", "how", "what", "which"). For detailed information on each of the fields in the dataset, refer to section 4.2 (Corpus) of the thesis document, which can be found at https://www.repo.uni-hannover.de/handle/123456789/12958. The script used to generate the dataset can be found in the public repositories https://github.com/as18cia/thesis_work and https://gitlab.com/TIBHannover/orkg/nlp/experiments/orkg-fine-tuning-squad-based-models
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This CSV dataset (numbered 1–8) demonstrates the construction processes of the regression models using machine learning methods, which are used to plot Figs. 2–7. The CSV file 1.LSM_R^2 (plotting Fig. 2) shows the data of the relationship between estimated values and actual values when the least-squares method was used for model construction. In the CSV file 2.PCR_R^2 (plotting Fig. 3), the number of principal components was varied from 1 to 5 during the construction of a model using principal component regression. The data in the CSV file 3.SVR_R^2 (plotting Fig. 4) is the result of the construction using support vector regression. The hyperparameters were decided by the comprehensive combination from the listed candidates, exploring the hyperparameters with maximum R^2 values. When a deep neural network was applied to the construction of a regression model, N_Neur., N_H.L. and N_L.T. were varied. The CSV file 4.DNN_HL (plotting Fig. 5a) shows the changes in the relationship between estimated values and actual values at each N_H.L.. Similarly, changes in the relationships between estimated values and actual values when N_Neur. or N_L.T. were varied are given in the CSV files 5.DNN_Neur (plotting Fig. 5b) and 6.DNN_LT (plotting Fig. 5c). The data in the CSV file 7.DNN_R^2 (plotting Fig. 6) is the result using the optimal N_Neur., N_H.L. and N_L.T.. In the CSV file 8.R^2 (plotting Fig. 7), the validity of each machine learning method was compared by showing the optimal results for each method.
Experimental conditions:
Supply volume of the raw material: 25–125 mL
Addition rate of TiO2: 5.0–15.0 wt%
Operation time: 1–15 min
Rotation speed: 2,200–5,700 min^-1
Temperature: 295–319 K
Nomenclature:
N_Neur.: the number of neurons
N_H.L.: the number of hidden layers
N_L.T.: the number of learning times
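As a hedged illustration of the exhaustive hyperparameter search described for the support vector regression (the candidate grids and the placeholder data below are assumptions, not the study's actual values), scikit-learn's GridSearchCV can select the combination with the maximum R^2:

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# placeholder data standing in for the five process variables and the target
rng = np.random.default_rng(0)
X = rng.random((50, 5))   # e.g. supply volume, TiO2 rate, time, speed, temperature
y = rng.random(50)

# comprehensive combination of candidate hyperparameters, scored by R^2
grid = GridSearchCV(
    SVR(),
    param_grid={"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1], "epsilon": [0.01, 0.1]},
    scoring="r2",
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)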
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Release of the experimental data from the paper Towards Linking Graph Topology to Model Performance for Biomedical Knowledge Graph Completion (accepted at the Machine Learning for Life and Material Sciences workshop @ ICML2024).
For the tail prediction task, each test query (h,r,?) is scored against all entities in the KG and we compute the rank of the score of the correct completion (h,r,t), after masking out scores of other (h,r,t') triples contained in the graph.
In experimental_data.zip, the following files are provided for each dataset:
{dataset}_preprocessing.ipynb: a Jupyter notebook for downloading and preprocessing the dataset. In particular, this generates the custom label->ID mapping for entities and relations, and the numerical tensor of (h_ID, r_ID, t_ID) triples for all edges in the graph, which can be used to compute graph topological metrics (e.g., using kg-topology-toolbox) and compare them with the edge prediction accuracy.
test_ranks.csv: CSV table with columns ["h", "r", "t"] specifying the head, relation, tail IDs of the test triples, and columns ["DistMult", "TransE", "RotatE", "TripleRE"] with the rank of the ground-truth tail in the ordered list of predictions made by the four models.
entity_dict.csv: the list of entity labels, ordered by entity ID (as generated in the preprocessing notebook).
relation_dict.csv: the list of relation labels, ordered by relation ID (as generated in the preprocessing notebook).
The separate top_100_tail_predictions.zip archive contains, for each of the test queries in the corresponding test_ranks.csv table, the IDs of the top-100 tail predictions made by each of the four KGE models, ordered by decreasing likelihood. The predictions are released in a .npz archive of numpy arrays (one array of shape (n_test_triples, 100) for each of the KGE models).
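A small sketch (assuming the column layout just described) for turning the released ranks into standard KGE metrics:

import pandas as pd

ranks = pd.read_csv("test_ranks.csv")
for model in ["DistMult", "TransE", "RotatE", "TripleRE"]:
    mrr = (1.0 / ranks[model]).mean()       # mean reciprocal rank
    hits10 = (ranks[model] <= 10).mean()    # Hits@10
    print(f"{model}: MRR={mrr:.3f}, Hits@10={hits10:.3f}")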
All experiments (training and inference) have been run on Graphcore IPU hardware using the BESS-KGE distribution framework.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
RELEASE V2.1.0 KNOWLEDGE GRAPH: ORIGINAL DATA SOURCES
Release: v2.1.0
The goal of this build was to create a knowledge graph that represented human disease mechanisms and included the central dogma. The data sources utilized in this release include many of the sources used in the initial release, as well as some new data made available by the Comparative Toxicogenomics Database and experimental data from the Human Protein Atlas.
Data sources are listed by type (Ontology and Data not represented in an ontology [Database Sources]). Additional details are provided for each data source below. Please see documentation on the primary release (https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources) for additional details on each data source as well as citation information.
Data Access:
ONTOLOGIES
Cell Ontology
Cell Line Ontology
Chemical Entities of Biological Interest (ChEBI) Ontology
Gene Ontology
Human Phenotype Ontology
Mondo Disease Ontology
Pathway Ontology
Protein Ontology
Relations Ontology
Sequence Ontology
Uber-Anatomy Ontology
Vaccine Ontology
Cell Ontology (CL)
Homepage: GitHub Citation:
Bard J, Rhee SY, Ashburner M. An ontology for cell types. Genome Biology. 2005;6(2):R21
Usage: Utilized to connect transcripts and proteins to cells. Additionally, the edges between this ontology and its dependencies are utilized:
ChEBI
GO
PATO
PRO
RO
UBERON
Cell Line Ontology (CLO)
Homepage: http://www.clo-ontology.org/ Citation:
Sarntivijai S, Lin Y, Xiang Z, Meehan TF, Diehl AD, Vempati UD, Schürer SC, Pang C, Malone J, Parkinson H, Liu Y. CLO: the cell line ontology. Journal of Biomedical Semantics. 2014;5(1):37
Usage: Utilized this ontology to map cell lines to transcripts and proteins. Additionally, the edges between this ontology and its dependencies are utilized:
CL
DOID
NCBITaxon
UBERON
Chemical Entities of Biological Interest (ChEBI)
Homepage: https://www.ebi.ac.uk/chebi/ Citation:
Hastings J, Owen G, Dekker A, Ennis M, Kale N, Muthukrishnan V, Turner S, Swainston N, Mendes P, Steinbeck C. ChEBI in 2016: Improved services and an expanding collection of metabolites. Nucleic Acids Research. 2015;44(D1):D1214-9
Usage: Utilized to connect chemicals to complexes, diseases, genes, GO biological processes, GO cellular components, GO molecular functions, pathways, phenotypes, reactions, and transcripts.
Gene Ontology (GO)
Homepage: http://geneontology.org/ Citations:
Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA. Gene ontology: tool for the unification of biology. Nature Genetics. 2000;25(1):25
The Gene Ontology Consortium. The Gene Ontology Resource: 20 years and still GOing strong. Nucleic Acids Research. 2018;47(D1):D330-8
Usage: Utilized to connect biological processes, cellular components, and molecular functions to chemicals, pathways, and proteins. Additionally, the edges between this ontology and its dependencies are utilized:
CL
NCBITaxon
RO
UBERON
Other Gene Ontology Data Used: goa_human.gaf.gz
Human Phenotype Ontology (HPO)
Homepage: https://hpo.jax.org/ Citation:
Köhler S, Carmody L, Vasilevsky N, Jacobsen JO, Danis D, Gourdine JP, Gargano M, Harris NL, Matentzoglu N, McMurry JA, Osumi-Sutherland D. Expansion of the Human Phenotype Ontology (HPO) knowledge base and resources. Nucleic Acids Research. 2018;47(D1):D1018-27
Usage: Utilized to connect phenotypes to chemicals, diseases, genes, and variants. Additionally, the edges between this ontology and its dependencies are utilized:
CL
ChEBI
GO
UBERON
Files
Other Human Phenotype Ontology Data Used: phenotype.hpoa
Mondo Disease Ontology (Mondo)
Homepage: https://mondo.monarchinitiative.org/ Citation:
Mungall CJ, McMurry JA, Köhler S, Balhoff JP, Borromeo C, Brush M, Carbon S, Conlin T, Dunn N, Engelstad M, Foster E. The Monarch Initiative: an integrative data and analytic platform connecting phenotypes to genotypes across species. Nucleic Acids Research. 2017;45(D1):D712-22
Usage: Utilized to connect diseases to chemicals, phenotypes, genes, and variants. Additionally, the edges between this ontology and its dependencies are utilized:
CL
NCBITaxon
GO
HPO
UBERON
Pathway Ontology (PW)
Homepage: rgd.mcw.edu Citation:
Petri V, Jayaraman P, Tutaj M, Hayman GT, Smith JR, De Pons J, Laulederkind SJ, Lowry TF, Nigam R, Wang SJ, Shimoyama M. The pathway ontology–updates and applications. Journal of Biomedical Semantics. 2014;5(1):7.
Usage: Utilized to connect pathways to GO biological processes, GO cellular components, GO molecular functions, Reactome pathways. Several steps are taken in order to connect Pathway Ontology identifiers to Reactome pathways and GO biological processes. To connect Pathway Ontology identifiers to Reactome pathways, we use ComPath Pathway Database Mappings developed by Daniel Domingo-Fernández (PMID:30564458).
Files
Downloaded Mapping Data
curated_mappings.txt
kegg_reactome.csv
Generated Mapping Data
REACTOME_PW_GO_MAPPINGS.txt
Protein Ontology (PRO)
Homepage: https://proconsortium.org/ Citation:
Natale DA, Arighi CN, Barker WC, Blake JA, Bult CJ, Caudy M, Drabkin HJ, D’Eustachio P, Evsikov AV, Huang H, Nchoutmboube J. The Protein Ontology: a structured representation of protein forms and complexes. Nucleic Acids Research. 2010;39(suppl_1):D539-45
Usage: Utilized to connect proteins to chemicals, genes, anatomy, catalysts, cell lines, cofactors, complexes, GO biological processes, GO cellular components, GO molecular functions, pathways, proteins, reactions, and transcripts. Additionally, the edges between this ontology and its dependencies are utilized:
ChEBI
DOID
GO
Notes: A partial, human-only version of this ontology was used. Details on how this version of the ontology was generated can be found under the Protein Ontology section of the Data_Preparation.ipynb Jupyter Notebook.
Files
Generated Human Version Protein Ontology (PRO)
human_pro.owl (closed with the HermiT reasoner)
Other PRO Data Used: promapping.txt
Generated Mapping Data
Merged Gene, RNA, Protein Map: Merged_gene_rna_protein_identifiers.pkl
Ensembl Transcript-PRO Identifier Mapping: ENSEMBL_TRANSCRIPT_PROTEIN_ONTOLOGY_MAP.txt
Entrez Gene-PRO Identifier Mapping: ENTREZ_GENE_PRO_ONTOLOGY_MAP.txt
UniProt Accession-PRO Identifier Mapping: UNIPROT_ACCESSION_PRO_ONTOLOGY_MAP.txt
STRING-PRO Identifier Mapping: STRING_PRO_ONTOLOGY_MAP.txt
Relations Ontology (RO)
Homepage: GitHub Citation:
Smith B, Ceusters W, Klagges B, Köhler J, Kumar A, Lomax J, Mungall C, Neuhaus F, Rector AL, Rosse C. Relations in biomedical ontologies. Genome Biology. 2005;6(5):R46.
Usage: Utilized to connect all data sources in the knowledge graph. Additionally, the ontology is queried prior to building the knowledge graph to identify all relations, their inverse properties, and their labels.
Files
Generated RO Data
INVERSE_RELATIONS.txt
RELATIONS_LABELS.txt
Sequence Ontology (SO)
Homepage: GitHub Citation:
Eilbeck K, Lewis SE, Mungall CJ, Yandell M, Stein L, Durbin R, Ashburner M. The Sequence Ontology: a tool for the unification of genome annotations. Genome Biology. 2005;6(5):R44
Usage: Utilized to connect transcripts and other genomic material like genes and variants.
Files
Generated Mapping Data
genomic_sequence_ontology_mappings.xlsx
SO_GENE_TRANSCRIPT_VARIANT_TYPE_MAPPING.txt
Uber-Anatomy Ontology (Uberon)
Homepage: GitHub Citation:
Mungall CJ, Torniai C, Gkoutos GV, Lewis SE, Haendel MA. Uberon, an integrative multi-species anatomy ontology. Genome Biology. 2012;13(1):R5
Usage: Utilized to connect tissues, fluids, and cells to proteins and transcripts. Additionally, the edges between this ontology and its dependencies are utilized:
ChEBI
CL
GO
PRO
Vaccine Ontology (VO)
Homepage: http://www.violinet.org/vaccineontology/ Citations:
He Y, Racz R, Sayers S, Lin Y, Todd T, Hur J, Li X, Patel M, Zhao B, Chung M, Ostrow J. Updates on the web-based VIOLIN vaccine database and analysis system. Nucleic Acids Research. 2013;42(D1):D1124-32
Xiang Z, Todd T, Ku KP, Kovacic BL, Larson CB, Chen F, Hodges AP, Tian Y, Olenzek EA, Zhao B, Colby LA. VIOLIN: vaccine investigation and online information network. Nucleic Acids Research. 2007;36(suppl_1):D923-8
Usage: Utilized the edges between this ontology and its dependencies:
ChEBI
DOID
GO
PRO
UBERON
DATABASE SOURCES
BioPortal
ClinVar
Comparative Toxicogenomics Database
DisGeNET
Ensembl
GeneMANIA
Genotype-Tissue Expression Project
Human Genome Organisation Gene Nomenclature Committee
Human Protein Atlas
National Center for Biotechnology Information Gene
Reactome Pathway Database
Search Tool for Recurring Instances of Neighbouring Genes Database
Universal Protein Resource Knowledgebase
BioPortal
Homepage: BioPortal Citation:
BioPortal. Lexical OWL Ontology Matcher (LOOM)
Ghazvinian A, Noy NF, Musen MA. Creating mappings for ontologies in biomedicine: simple methods work. In AMIA Annual Symposium Proceedings 2009 (Vol. 2009, p. 198). American Medical Informatics Association
Usage: BioPortal was utilized to obtain mappings between MeSH identifiers and ChEBI identifiers for chemicals-diseases, chemicals-genes, chemical-GO biological processes, chemicals-GO cellular components, chemicals-GO molecular functions, chemicals-phenotypes, chemicals-proteins, and chemicals-transcripts. Additional information on how this data was processed can be obtained
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This file contains the output of the PALMS run with male active initial population.
To reproduce the physical activity trajectory graph, please follow the steps below:
1. Run PALMS (DOI: 10.15161/oar.it/23467) with input parameters M-Active (DOI: 10.15161/oar.it/23477).
2. To run the simulation use the PALMS OAR Reproducibility container (DOI: 10.15161/oar.it/23494).
3. The run generates five CSV files. For this graph, take the 'SimYear' (column A) and 'Avg PA status' (column H) records from the "AnnualPSA" output file.
4. The data above will reproduce the M-Active graph.
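A minimal pandas sketch of step 3; the output file name is abbreviated in this description, so the name below is a placeholder (the same extraction applies to the F-Inactive run described further down):

import pandas as pd

# placeholder name for the truncated "AnnualPSA..." output file
df = pd.read_csv("AnnualPSA.csv")
trajectory = df[["SimYear", "Avg PA status"]]  # columns A and H
print(trajectory.head())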
The Caselaw Access Project makes 40 million pages of U.S. caselaw freely available online from the collections of Harvard Law School Library.
The CAP citation graph shows the connections between cases in the Caselaw Access Project dataset. You can use the citation graph to answer questions like "what is the most influential case?" and "what jurisdictions cite most often to this jurisdiction?".
Learn More: https://case.law/download/citation_graph/
Access Limits: https://case.law/api/#limits
This dataset includes citations and metadata for the CAP citation graph in CSV format.
The Caselaw Access Project is by the Library Innovation Lab at Harvard Law School Library.
People are using CAP data to create research, applications, and more. We're sharing examples in our gallery.
Cite Grid is the first visualization we've created based on data from our citation graph.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This file contains the output of the PALMS run with female inactive initial population.
To reproduce the physical activity trajectory graph, please follow the steps below:
1. Run PALMS (DOI: 10.15161/oar.it/23467) with input parameters F-Inactive (DOI: 10.15161/oar.it/23471).
2. To run the simulation use the PALMS OAR Reproducibility container (DOI: 10.15161/oar.it/23494).
3. The run generates five CSV files. For this graph, take the 'SimYear' (column A) and 'Avg PA status' (column H) records from the "AnnualPSA" output file.
4. The data above will reproduce the F-Inactive graph.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description. The NetVote dataset contains the outputs of the NetVote program when applied to voting data coming from VoteWatch (http://www.votewatch.eu/).
These results were used in the following conference papers:
Source code. The NetVote source code is available on GitHub: https://github.com/CompNet/NetVotes.
Citation. If you use our dataset or tool, please cite article [1] above.
@InProceedings{Mendonca2015,
author = {Mendonça, Israel and Figueiredo, Rosa and Labatut, Vincent and Michelon, Philippe},
title = {Relevance of Negative Links in Graph Partitioning: A Case Study Using Votes From the {E}uropean {P}arliament},
booktitle = {2\textsuperscript{nd} European Network Intelligence Conference ({ENIC})},
year = {2015},
pages = {122-129},
address = {Karlskrona, SE},
publisher = {IEEE Publishing},
doi = {10.1109/ENIC.2015.25},
}
-------------------------
Details. This archive contains the following folders:
-------------------------
License. These data are shared under a Creative Commons 0 license.
Contact. Vincent Labatut <vincent.labatut@univ-avignon.fr> & Rosa Figueiredo <rosa.figueiredo@univ-avignon.fr>