100+ datasets found
  1. Sample Graph Datasets in CSV Format

    • zenodo.org
    csv
    Updated Dec 9, 2024
    + more versions
    Cite
    Edwin Carreño; Edwin Carreño (2024). Sample Graph Datasets in CSV Format [Dataset]. http://doi.org/10.5281/zenodo.14335015
    Explore at:
    csv (available download formats)
    Dataset updated
    Dec 9, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Edwin Carreño; Edwin Carreño
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Sample Graph Datasets in CSV Format

    Note: none of the datasets published here contains actual data; they are for testing purposes only.

    Description

    This data repository contains graph datasets, where each graph is represented by two CSV files: one for node information and another for edge details. To link the files to the same graph, their names include a common identifier based on the number of nodes. For example:

    • dataset_30_nodes_interactions.csv: contains 30 rows (nodes).
    • dataset_30_edges_interactions.csv: contains 47 rows (edges).
    • the common identifier dataset_30 refers to the same graph.

    CSV nodes

    Each dataset contains the following columns:

    Name of the Column   Type     Description
    UniProt ID           string   protein identification
    label                string   protein label (type of node)
    properties           string   a dictionary containing properties related to the protein

    CSV edges

    Each dataset contains the following columns:

    Name of the Column   Type     Description
    Relationship ID      string   relationship identification
    Source ID            string   identification of the source protein in the relationship
    Target ID            string   identification of the target protein in the relationship
    label                string   relationship label (type of relationship)
    properties           string   a dictionary containing properties related to the relationship
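    As a minimal sketch of working with the node/edge layout described above, the two CSVs for one graph can be joined into an adjacency list with only the Python standard library (the toy rows below are illustrative stand-ins, not actual dataset contents):

```python
import csv
import io
from collections import defaultdict

# In-memory stand-ins for dataset_30_nodes_interactions.csv and
# dataset_30_edges_interactions.csv (toy rows; the real files have 30/47 rows).
nodes_csv = io.StringIO(
    "UniProt ID,label,properties\n"
    "P12345,Protein,{}\n"
    "Q67890,Protein,{}\n"
)
edges_csv = io.StringIO(
    "Relationship ID,Source ID,Target ID,label,properties\n"
    "r1,P12345,Q67890,interacts_with,{}\n"
)

# Node table: UniProt ID -> label
nodes = {row["UniProt ID"]: row["label"] for row in csv.DictReader(nodes_csv)}

# Edge table: Source ID -> list of Target IDs
adjacency = defaultdict(list)
for row in csv.DictReader(edges_csv):
    adjacency[row["Source ID"]].append(row["Target ID"])

print(nodes)            # {'P12345': 'Protein', 'Q67890': 'Protein'}
print(dict(adjacency))  # {'P12345': ['Q67890']}
```

    For the real files, replace the `io.StringIO` objects with `open(...)` on the paired `dataset_30_*` CSVs.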

    Metadata

    Graph            Number of Nodes   Number of Edges   Sparse graph
    dataset_30*      30                47                Y
    dataset_60*      60                181               Y
    dataset_120*     120               689               Y
    dataset_240*     240               2819              Y
    dataset_300*     300               4658              Y
    dataset_600*     600               18004             Y
    dataset_1200*    1200              71785             Y
    dataset_2400*    2400              288600            Y
    dataset_3000*    3000              449727            Y
    dataset_6000*    6000              1799413           Y
    dataset_12000*   12000             7199863           Y
    dataset_24000*   24000             28792361          Y
    dataset_30000*   30000             44991744          Y
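    The "Sparse graph" flag can be sanity-checked from the node and edge counts alone: a directed graph on n nodes has at most n(n-1) edges, so density is m / (n(n-1)). A quick illustrative sketch (the helper below is not part of the dataset):

```python
def density(n_nodes: int, n_edges: int) -> float:
    """Directed-graph density: edges present / edges possible."""
    return n_edges / (n_nodes * (n_nodes - 1))

# (nodes, edges) pairs from the metadata: density stays around 0.05, i.e. sparse
for n, m in [(30, 47), (300, 4658), (30000, 44991744)]:
    print(f"n={n}: density={density(n, m):.4f}")
```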

    This repository includes two (2) additional tiny graph datasets for experimenting before dealing with larger datasets.

    CSV nodes (tiny graphs)

    Each dataset contains the following columns:

    Name of the Column   Type     Description
    ID                   string   node identification
    label                string   node label (type of node)
    properties           string   a dictionary containing properties related to the node

    CSV edges (tiny graphs)

    Each dataset contains the following columns:

    Name of the Column   Type     Description
    ID                   string   relationship identification
    source               string   identification of the source node in the relationship
    target               string   identification of the target node in the relationship
    label                string   relationship label (type of relationship)
    properties           string   a dictionary containing properties related to the relationship

    Metadata (tiny graphs)

    Graph             Number of Nodes   Number of Edges   Sparse graph
    dataset_dummy*    3                 6                 N
    dataset_dummy2*   3                 6                 N
  2. Awesome Public Datasets as Neo4j Graph

    • kaggle.com
    Updated Dec 20, 2016
    Cite
    Manav Sehgal (2016). Awesome Public Datasets as Neo4j Graph [Dataset]. https://www.kaggle.com/datasets/startupsci/awesome-datasets-graph/suggestions?status=pending
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 20, 2016
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Manav Sehgal
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    The awesome datasets graph is a Neo4j graph database which catalogs and classifies datasets and data sources as scraped from the Awesome Public Datasets GitHub list.

    Content

    We started with a simple list of links on the Awesome Public Datasets page. We now have a semantic graph database with 10 labels, 5 relationship types, 9 property keys, and more than 400 nodes, all within a 1 MB database footprint. All database operations are query-driven using the powerful and flexible Cypher graph query language.

    The download includes CSV files which were created as an interim step after scraping and wrangling the source. The download also includes a working Neo4j Graph Database. Login: neo4j | Password: demo.

    Acknowledgements

    Data scraped from Awesome Public Datasets page. Prepared for the book Data Science Solutions.

    Inspiration

    While we have done basic data wrangling and preparation, how can this graph prove useful for your data science workflow? Can we record the data science project decisions taken across workflow stages, and how do the data catalog's use cases (datasources, datasets, tools) help in these decisions to achieve data science solutions strategies?

  3. Petre_Slide_CategoricalScatterplotFigShare.pptx

    • figshare.com
    pptx
    Updated Sep 19, 2016
    Cite
    Benj Petre; Aurore Coince; Sophien Kamoun (2016). Petre_Slide_CategoricalScatterplotFigShare.pptx [Dataset]. http://doi.org/10.6084/m9.figshare.3840102.v1
    Explore at:
    pptx (available download formats)
    Dataset updated
    Sep 19, 2016
    Dataset provided by
    figshare
    Authors
    Benj Petre; Aurore Coince; Sophien Kamoun
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Categorical scatterplots with R for biologists: a step-by-step guide

    Benjamin Petre1, Aurore Coince2, Sophien Kamoun1

    1 The Sainsbury Laboratory, Norwich, UK; 2 Earlham Institute, Norwich, UK

    Weissgerber and colleagues (2015) recently stated that ‘as scientists, we urgently need to change our practices for presenting continuous data in small sample size studies’. They called for more scatterplot and boxplot representations in scientific papers, which ‘allow readers to critically evaluate continuous data’ (Weissgerber et al., 2015). In the Kamoun Lab at The Sainsbury Laboratory, we recently implemented a protocol to generate categorical scatterplots (Petre et al., 2016; Dagdas et al., 2016). Here we describe the three steps of this protocol: 1) formatting of the data set in a .csv file, 2) execution of the R script to generate the graph, and 3) export of the graph as a .pdf file.

    Protocol

    • Step 1: format the data set as a .csv file. Store the data in a three-column Excel file as shown in the PowerPoint slide. The first column ‘Replicate’ indicates the biological replicates; in the example, the month and year during which the replicate was performed are indicated. The second column ‘Condition’ indicates the conditions of the experiment (in the example, a wild type and two mutants called A and B). The third column ‘Value’ contains continuous values. Save the Excel file as a .csv file (File -> Save as -> in ‘File Format’, select .csv). This .csv file is the input file to import in R.

    • Step 2: execute the R script (see Notes 1 and 2). Copy the script shown in the PowerPoint slide and paste it into the R console. Execute the script. In the dialog box, select the input .csv file from step 1. The categorical scatterplot will appear in a separate window. Dots represent the values for each sample; colors indicate replicates. Boxplots are superimposed; black dots indicate outliers.

    • Step 3: save the graph as a .pdf file. Shape the window at your convenience and save the graph as a .pdf file (File -> Save as). See the PowerPoint slide for an example.

    Notes

    • Note 1: install the ggplot2 package. The R script requires the package ‘ggplot2’ to be installed. To install it, Packages & Data -> Package Installer -> enter ‘ggplot2’ in the Package Search space and click on ‘Get List’. Select ‘ggplot2’ in the Package column and click on ‘Install Selected’. Install all dependencies as well.

    • Note 2: use a log scale for the y-axis. To use a log scale for the y-axis of the graph, use the command line below in place of command line #7 in the script.

    # 7 Display the graph in a separate window. Dot colors indicate replicates
    graph + geom_boxplot(outlier.colour='black', colour='black') + geom_jitter(aes(col=Replicate)) + scale_y_log10() + theme_bw()

    References

    Dagdas YF, Belhaj K, Maqbool A, Chaparro-Garcia A, Pandey P, Petre B, et al. (2016) An effector of the Irish potato famine pathogen antagonizes a host autophagy cargo receptor. eLife 5:e10856.

    Petre B, Saunders DGO, Sklenar J, Lorrain C, Krasileva KV, Win J, et al. (2016) Heterologous Expression Screens in Nicotiana benthamiana Identify a Candidate Effector of the Wheat Yellow Rust Pathogen that Associates with Processing Bodies. PLoS ONE 11(2):e0149035

    Weissgerber TL, Milic NM, Winham SJ, Garovic VD (2015) Beyond Bar and Line Graphs: Time for a New Data Presentation Paradigm. PLoS Biol 13(4):e1002128

    https://cran.r-project.org/

    http://ggplot2.org/

  4. European Mountain Territory and Value Chains: Knowledge Graphs, CSV, HTML,...

    • figshare.com
    txt
    Updated Jul 29, 2024
    Cite
    aimhdhgroup (2024). European Mountain Territory and Value Chains: Knowledge Graphs, CSV, HTML, and Excel Data [Dataset]. http://doi.org/10.6084/m9.figshare.25243009.v8
    Explore at:
    txt (available download formats)
    Dataset updated
    Jul 29, 2024
    Dataset provided by
    figshare
    Authors
    aimhdhgroup
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository contains a collection of data about 454 value chains from 23 rural European areas of 16 countries. This data is obtained through a semi-automatic workflow that transforms raw textual data from an unstructured MS Excel sheet into semantic knowledge graphs. In particular, the repository contains:

    • an MS Excel sheet containing value chain details provided by the MOuntain Valorisation through INterconnectedness and Green growth (MOVING) European project;
    • 454 CSV files containing events, titles, entities and coordinates of narratives of each value chain, obtained by pre-processing the MS Excel sheet;
    • 454 Web Ontology Language (OWL) files. This collection of files is the result of the semi-automatic workflow, and is organized as a semantic knowledge graph of narratives, where each narrative is a sub-graph explaining one among the 454 value chains and its territory aspects. The knowledge graph is based on the Narrative Ontology, an ontology developed by the Institute of Information Science and Technologies (ISTI-CNR) as an extension of CIDOC CRM, FRBRoo, and OWL Time;
    • two CSV files that compile all the available information extracted from the 454 OWL files;
    • GeoPackage files with the geographic coordinates related to the narratives;
    • HTML files that show the different SPARQL and GeoSPARQL queries;
    • HTML files that show the story maps about the 454 value chains;
    • an image showing how the various components of the dataset interact with each other.

  5. Graph inference datasets. Replication Data for: "Learning Functional Causal...

    • dataverse.harvard.edu
    tsv, txt
    Updated Aug 25, 2017
    Cite
    Harvard Dataverse (2017). Graph inference datasets. Replication Data for: "Learning Functional Causal Models with Generative Neural Networks" [Dataset]. http://doi.org/10.7910/DVN/UZMB69
    Explore at:
    tsv, txt (available download formats)
    Dataset updated
    Aug 25, 2017
    Dataset provided by
    Harvard Dataverse
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Graph datasets in CSV format, used in the article "Learning Functional Causal Models with Generative Neural Networks".

    1) Each file *_numdata.csv contains the data of around 20 variables connected in a graph without hidden variables. G2, G3, G4 and G5 refer to graphs with at most 2, 3, 4 and 5 parents per node. Each file *_target.csv contains the ground truth of the graph as cause -> effect pairs. Files beginning with "Big" are larger graphs with 100 variables.

    2) Each file *_confounders_numdata.csv contains the data of around 20 variables connected in a graph with 3 hidden variables. Each file *_confounders_skeleton.csv contains the skeleton of the graph (including spurious links due to a common hidden cause). Each file *_confounders_target.csv contains the ground truth of the graph with the direct visible cause -> effect pairs. The task is to recover the direct visible cause -> effect links while removing the spurious links of the skeleton.
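    Following the naming scheme above, data files can be paired with their ground-truth files by stripping the suffixes. A sketch (the concrete filenames below are hypothetical; only the *_numdata.csv / *_target.csv pattern comes from the description):

```python
# Pair *_numdata.csv files with their *_target.csv ground truth by shared prefix.
filenames = [
    "G2_numdata.csv", "G2_target.csv",
    "Big_G5_numdata.csv", "Big_G5_target.csv",
]

pairs = {}
for name in filenames:
    for suffix in ("_numdata.csv", "_target.csv"):
        if name.endswith(suffix):
            pairs.setdefault(name[: -len(suffix)], {})[suffix] = name

print(pairs["G2"])  # {'_numdata.csv': 'G2_numdata.csv', '_target.csv': 'G2_target.csv'}
```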

  6. Semantic links between selected CSV datasets harvested by the European Data...

    • zenodo.org
    • eprints.soton.ac.uk
    zip
    Updated Jul 19, 2024
    Cite
    Luis-Daniel Ibanez; Luis-Daniel Ibanez (2024). Semantic links between selected CSV datasets harvested by the European Data Portal and the DBpedia knowledge graph [Dataset]. http://doi.org/10.5281/zenodo.3837721
    Explore at:
    zip (available download formats)
    Dataset updated
    Jul 19, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Luis-Daniel Ibanez; Luis-Daniel Ibanez
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains the results of the interlinking process between selected CSV datasets harvested by the European Data Portal and the DBpedia knowledge graph.

    We aim at answering the following questions:
    • What are the most popular column types? This will provide insight into what the datasets hold and how they can be joined, and into what specific linking schemes could be applied in future work.
    • What datasets have columns of the same type? This will suggest datasets that may be similar or related.
    • What entities appear in most datasets (co-referent entities)? This will suggest entities for which more data is published.
    • What datasets share a particular entity? This will suggest datasets that may be joined, or are related through that particular entity.

    Results are provided as augmented tables that contain the columns of the original CSV plus a metadata file in JSON-LD format. The metadata files can be loaded into an RDF store and queried.

    Refer to the accompanying report of activities for more details on the methodology and how to query the dataset.


  7. Datasets for manuscript: ADAM: A Web Platform for Graph-Based Modeling and...

    • gimi9.com
    Updated Nov 2, 2023
    + more versions
    Cite
    (2023). Datasets for manuscript: ADAM: A Web Platform for Graph-Based Modeling and Optimization of Supply Chains | gimi9.com [Dataset]. https://gimi9.com/dataset/data-gov_datasets-for-manuscript-adam-a-web-platform-for-graph-based-modeling-and-optimization-of-s/
    Explore at:
    Dataset updated
    Nov 2, 2023
    Description

    ADAM-Data-Repository: this repository contains all the data needed to run the case studies for the ADAM manuscript.

    Biogas production: the directory "biogas" contains all data for the biogas production case studies (Figs 13 and 14). Specifically, "biogas/biogas_x" contains the data files for the scenario where "x" is the corresponding Renewable Energy Certificates (RECs) value.

    Plastic waste recycling: the directory "plastic_waste" contains all data for the plastic waste recycling case studies (Figs 15 and 16). Different scenarios share the same supply, technology site, and technology candidate data, as specified by the CSV files under "plastic_waste". Each scenario has a different demand data file, contained in "plastic_waste/Elec_price" and "plastic_waste/PET_price".

    How to run the case studies: create a new model in ADAM and upload the appropriate CSV file at each step (e.g. upload biogas/biogas_0/supplydata197.csv in step 2, where supply data are specified).

    This dataset is associated with the following publication: Hu, Y., W. Zhang, P. Tominac, M. Shen, D. Göreke, E. Martín-Hernández, M. Martín, G.J. Ruiz-Mercado, and V.M. Zavala. ADAM: A web platform for graph-based modeling and optimization of supply chains. Computers and Chemical Engineering, 165: 107911 (2022).

  8. Our World In Data - Dataset - waterdata

    • wbwaterdata.org
    Updated Jul 12, 2020
    Cite
    (2020). Our World In Data - Dataset - waterdata [Dataset]. https://wbwaterdata.org/dataset/our-world-in-data
    Explore at:
    Dataset updated
    Jul 12, 2020
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This database collates 3552 development indicators from different studies, with data by country and year, including single-year and multiple-year time series. The data is presented as charts; it can be downloaded from the linked project pages/references for each set, and the data for each graph is available as a CSV file as well as a visual download of the graph (both via the download link under each chart).

  9. Time-Series Matrix (TSMx): A visualization tool for plotting multiscale...

    • dataverse.harvard.edu
    Updated Jul 8, 2024
    Cite
    Georgios Boumis; Brad Peter (2024). Time-Series Matrix (TSMx): A visualization tool for plotting multiscale temporal trends [Dataset]. http://doi.org/10.7910/DVN/ZZDYM9
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 8, 2024
    Dataset provided by
    Harvard Dataverse
    Authors
    Georgios Boumis; Brad Peter
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Time-Series Matrix (TSMx): A visualization tool for plotting multiscale temporal trends

    TSMx is an R script developed to facilitate multi-temporal-scale visualizations of time-series data. The script requires only a two-column CSV of years and values to plot the slope of the linear regression line for all possible year combinations from the supplied temporal range. The outputs include a time-series matrix showing slope direction based on the linear regression, slope values plotted with colors indicating magnitude, and results of a Mann-Kendall test. The start year is indicated on the y-axis and the end year is indicated on the x-axis. In the example below, the cell in the top-right corner is the direction of the slope for the temporal range 2001–2019. The red line corresponds with the temporal range 2010–2019 and an arrow is drawn from the cell that represents that range. One cell is highlighted with a black border to demonstrate how to read the chart: that cell represents the slope for the temporal range 2004–2014. This publication entry also includes an Excel template that produces the same visualizations without a need to interact with any code, though minor modifications will need to be made to accommodate year ranges other than what is provided. TSMx for R was developed by Georgios Boumis; TSMx was originally conceptualized and created by Brad G. Peter in Microsoft Excel. Please refer to the associated publication: Peter, B.G., Messina, J.P., Breeze, V., Fung, C.Y., Kapoor, A. and Fan, P., 2024. Perspectives on modifiable spatiotemporal unit problems in remote sensing of agriculture: evaluating rice production in Vietnam and tools for analysis. Frontiers in Remote Sensing, 5, p.1042624. https://www.frontiersin.org/journals/remote-sensing/articles/10.3389/frsen.2024.1042624

    TSMx sample chart from the supplied Excel template. Data represent the productivity of rice agriculture in Vietnam as measured via EVI (enhanced vegetation index) from the NASA MODIS data product (MOD13Q1.V006).

    TSMx R script:

    # import packages
    library(dplyr)
    library(readr)
    library(ggplot2)
    library(tibble)
    library(tidyr)
    library(forcats)
    library(Kendall)
    options(warn = -1) # disable warnings

    # read data (.csv file with "Year" and "Value" columns)
    data <- read_csv("EVI.csv")

    # prepare row/column names for output matrices
    years <- data %>% pull("Year")
    r.names <- years[-length(years)]
    c.names <- years[-1]
    years <- years[-length(years)]

    # initialize output matrices
    sign.matrix <- matrix(data = NA, nrow = length(years), ncol = length(years))
    pval.matrix <- matrix(data = NA, nrow = length(years), ncol = length(years))
    slope.matrix <- matrix(data = NA, nrow = length(years), ncol = length(years))

    # function to return remaining years given a start year
    getRemain <- function(start.year) {
      years <- data %>% pull("Year")
      start.ind <- which(data[["Year"]] == start.year) + 1
      remain <- years[start.ind:length(years)]
      return(remain)
    }

    # function to subset data for a start/end year combination
    splitData <- function(end.year, start.year) {
      keep <- which(data[['Year']] >= start.year & data[['Year']] <= end.year)
      batch <- data[keep,]
      return(batch)
    }

    # function to fit linear regression and return slope direction
    fitReg <- function(batch) {
      trend <- lm(Value ~ Year, data = batch)
      slope <- coefficients(trend)[[2]]
      return(sign(slope))
    }

    # function to fit linear regression and return slope magnitude
    fitRegv2 <- function(batch) {
      trend <- lm(Value ~ Year, data = batch)
      slope <- coefficients(trend)[[2]]
      return(slope)
    }

    # function to implement Mann-Kendall (MK) trend test and return significance
    # the test is implemented only for n>=8
    getMann <- function(batch) {
      if (nrow(batch) >= 8) {
        mk <- MannKendall(batch[['Value']])
        pval <- mk[['sl']]
      } else {
        pval <- NA
      }
      return(pval)
    }

    # function to return slope direction for all combinations given a start year
    getSign <- function(start.year) {
      remaining <- getRemain(start.year)
      combs <- lapply(remaining, splitData, start.year = start.year)
      signs <- lapply(combs, fitReg)
      return(signs)
    }

    # function to return MK significance for all combinations given a start year
    getPval <- function(start.year) {
      remaining <- getRemain(start.year)
      combs <- lapply(remaining, splitData, start.year = start.year)
      pvals <- lapply(combs, getMann)
      return(pvals)
    }

    # function to return slope magnitude for all combinations given a start year
    getMagn <- function(start.year) {
      remaining <- getRemain(start.year)
      combs <- lapply(remaining, splitData, start.year = start.year)
      magns <- lapply(combs, fitRegv2)
      return(magns)
    }

    # retrieve slope direction, MK significance, and slope magnitude
    signs <- lapply(years, getSign)
    pvals <- lapply(years, getPval)
    magns <- lapply(years, getMagn)

    # fill-in output matrices
    dimension <- nrow(sign.matrix)
    for (i in 1:dimension) {
      sign.matrix[i, i:dimension] <- unlist(signs[i])
      pval.matrix[i, i:dimension] <- unlist(pvals[i])
      slope.matrix[i, i:dimension] <- unlist(magns[i])
    }
    sign.matrix <-...
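    The core idea of the script above (a regression slope for every start/end year combination, stored in an upper-triangular matrix) can be sketched language-independently. The Python below is an illustrative re-implementation under that idea, not part of the published tool, and the sample years/values are invented:

```python
# Ordinary least-squares slope for one (start year, end year) range.
def ols_slope(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

years = [2001, 2002, 2003, 2004]
values = [1.0, 2.0, 2.5, 4.0]

# slope.matrix analogue: keys are (start year, end year), upper triangle only.
slope = {}
for i, start in enumerate(years[:-1]):
    for j in range(i + 1, len(years)):
        end = years[j]
        slope[(start, end)] = ols_slope(years[i : j + 1], values[i : j + 1])

print(slope[(2001, 2004)])  # slope over the full range
```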

  10. Multi-agent Kidney Exchange Program: dataset for simulation along time...

    • rdm.inesctec.pt
    Updated Jun 16, 2020
    Cite
    (2020). Multi-agent Kidney Exchange Program: dataset for simulation along time horizon - Dataset - CKAN [Dataset]. https://rdm.inesctec.pt/dataset/ii-2020-002
    Explore at:
    Dataset updated
    Jun 16, 2020
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    The dataset contains the instance files for the paper X. Klimentova, A. Viana, J. P. Pedroso, N. Santos. Fairness models for multi-agent kidney exchange programmes. To appear in Omega: The International Journal of Management Science (2020). The same dataset was also used in Monteiro, T., Klimentova, X., Pedroso, J.P., Viana, A. A comparison of matching algorithms for kidney exchange programs addressing waiting time. Cent Eur J Oper Res (2020). https://doi.org/10.1007/s10100-020-00680-y

    Each instance mimics pools of kidney exchange programmes of several agents (e.g. countries) over time. Incompatible donor-recipient pairs appear and leave along the time horizon. Each of the pairs belongs to the pool of one of the agents. The virtual compatibility among pairs is represented on a directed graph G = (V,A), called the compatibility graph, where the set of vertices V corresponds to the set of incompatible pairs and non-directed donors. An arc from a vertex i to a vertex j indicates compatibility of the donor in i with the patient in j. Positive real crossmatch testing is also incorporated by saving the arcs that would fail in case they are chosen in a cycle in one of the matching runs. The generator creates random graphs based on probabilities of blood type and of donor-patient tissue compatibility; the arrival of pairs and non-directed donors is generated based on given arrival rates. An instance of the dataset represents the pools of 4 agents, simulated for a period of 6 years. There are 100 instances compressed in 4 zip archives, each containing 25 instances. Each instance is described by 3 files, where the index s is the seed used for the random function when generating the instance:

    a) characterisations_s.csv -- a CSV file that contains information on each pair in the merged pool in the following columns:
    0: Pair ID
    1: Donor ID
    2: Donor blood type
    3: Donor age
    4: Patient ID
    5: Patient blood type
    6: Patient PRA
    7: Patient cross-match probability
    8: Patient age
    9: Pair arrival day
    10: Pair departure day
    11: Pair probability of failure
    12: Pair from pool (e.g. country to which the pair belongs)
    In the case of a non-directed donor, the information about the patient is filled with -1.

    b) arcs_s.csv -- a CSV file that contains the compatibility graph of the problem described above. The first line contains the values n (number of vertices in the graph) and m (number of arcs in the graph). In the following m lines, the existing arcs (i,j) are presented as: i j w_ij, where i and j are IDs of pairs and w_ij is the weight of the arc, which is always equal to 1.0 for all the instances in this dataset.

    c) fail_arcs_s.csv -- the list of arcs that would fail due to a positive crossmatch test in case they appear in a chosen cycle or chain in any matching run. The format is the same as that of arcs_s.csv: the first line contains n (number of vertices in the graph) and m_fail (number of failed arcs), listed in the following m_fail lines in the same way as in arcs_s.csv.
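    The arcs file layout described above (a header line with n and m, followed by m lines of "i j w_ij") can be read with a few lines of Python. The parser below is a sketch under the stated format, not shipped with the dataset, and the sample content is invented:

```python
import io

def read_arcs(stream):
    """Parse an arcs file: first line 'n m', then m lines 'i j w_ij'."""
    n, m = map(int, stream.readline().split())
    arcs = []
    for _ in range(m):
        i, j, w = stream.readline().split()
        arcs.append((int(i), int(j), float(w)))
    return n, arcs

# Toy stand-in for arcs_s.csv: 3 vertices, 2 compatibility arcs of weight 1.0.
sample = io.StringIO("3 2\n1 2 1.0\n2 3 1.0\n")
n, arcs = read_arcs(sample)
print(n, arcs)  # 3 [(1, 2, 1.0), (2, 3, 1.0)]
```

    Because fail_arcs_s.csv uses the same layout, the same reader covers both files.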

  11. Wikipedia time-series graph

    • zenodo.org
    • data.niaid.nih.gov
    bin, csv
    Updated Apr 24, 2025
    Cite
    Benzi Kirell; Miz Volodymyr; Ricaud Benjamin; Vandergheynst Pierre; Benzi Kirell; Miz Volodymyr; Ricaud Benjamin; Vandergheynst Pierre (2025). Wikipedia time-series graph [Dataset]. http://doi.org/10.5281/zenodo.886484
    Explore at:
    bin, csv (available download formats)
    Dataset updated
    Apr 24, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Benzi Kirell; Miz Volodymyr; Ricaud Benjamin; Vandergheynst Pierre; Benzi Kirell; Miz Volodymyr; Ricaud Benjamin; Vandergheynst Pierre
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Wikipedia temporal graph.

    The dataset is based on two Wikipedia SQL dumps: (1) English language articles and (2) user visit counts per page per hour (aka pagecounts). The original datasets are publicly available on the Wikimedia website.

    Static graph structure is extracted from English-language Wikipedia articles. Redirects are removed. Before building the Wikipedia graph, we introduce thresholds on the minimum number of visits per hour and the maximum in-degree: we remove pages that never reach 500 visits per hour during the specified period, and we remove nodes (pages) with in-degree higher than 8 000 to build a more meaningful initial graph. After cleaning, the graph contains 116 016 nodes (out of 4 856 639 total pages) and 6 573 475 edges. The graph can be imported in two ways: (1) using edges.csv and vertices.csv, or (2) using the enwiki-20150403-graph.gt file, which can be opened with the open-source Python library graph-tool.

    Time-series data contains users' visit counts from 02:00, 23 September 2014 until 23:00, 30 April 2015. The total number of hours is 5278. The data is stored in two formats: CSV and H5. CSV file contains data in the following format [page_id :: count_views :: layer], where layer represents an hour. In H5 file, each layer corresponds to an hour as well.
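    Under the CSV layout stated above ([page_id :: count_views :: layer], one row per page per hour), hourly counts can be grouped per page. A sketch with toy rows; the comma delimiter and header line are assumptions that should be checked against the actual file:

```python
import csv
import io
from collections import defaultdict

# Toy stand-in for the time-series CSV: page_id, count_views, layer (hour index).
rows = io.StringIO(
    "page_id,count_views,layer\n"
    "42,512,0\n"
    "42,730,1\n"
    "7,501,0\n"
)

views = defaultdict(dict)  # page_id -> {hour: count}
for row in csv.DictReader(rows):
    views[int(row["page_id"])][int(row["layer"])] = int(row["count_views"])

# e.g. pages that at some hour reach the 500-visits threshold used to build the graph
busy = {page for page, hours in views.items() if max(hours.values()) >= 500}
print(sorted(busy))  # [7, 42]
```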

  12. Text-attributed Temporal Graph Benchmark Datasets

    • dataverse.harvard.edu
    Updated May 19, 2025
    Cite
    Longfei Ma (2025). Text-attributed Temporal Graph Benchmark Datasets [Dataset]. http://doi.org/10.7910/DVN/ZK7NGU
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    May 19, 2025
    Dataset provided by
    Harvard Dataverse
    Authors
    Longfei Ma
    License

    CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    These are six real-world temporal graph datasets with text attributes, varying in size and originating from different domains: culinary recipe feedback, movie reviews, book reading records, beer ratings, and online shopping interactions. In these temporal graphs, users and items are represented as nodes, while the interactions between them, in the form of user reviews or comments, serve as edges. Each edge is associated with both a timestamp and raw textual content, and each item node is accompanied by a descriptive text attribute. The files under the $dataset directory are as follows:

    • raw_node.npy and raw_edges.csv store the raw text attributes of nodes and edges, respectively.
    • ml_$dataset.csv records the temporal edges of the dataset, where each row in the format (u, i, ts) represents a user u interacting with an item i at timestamp ts.
    • $dataset_unique_labels.json contains the complete set of human-readable labels for the dataset.
    • $dataset_labels_text.json and $dataset_labels.json hold the labels associated with each edge in ml_$dataset.csv: the former provides the textual form of the item labels that users are interested in, and the latter their corresponding numeric labels.
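    The (u, i, ts) rows of ml_$dataset.csv can be read with a few lines of Python. This sketch assumes a header row naming the columns 'u', 'i', and 'ts', which should be verified against the downloaded files:

    ```python
    import csv

    def load_temporal_edges(path):
        """Read (u, i, ts) interaction rows and return them in time order.

        Assumes a header row with columns named 'u', 'i', 'ts'; adjust
        the field names if the actual files use different headers.
        """
        with open(path, newline="", encoding="utf-8") as f:
            edges = [(int(r["u"]), int(r["i"]), float(r["ts"]))
                     for r in csv.DictReader(f)]
        edges.sort(key=lambda e: e[2])  # temporal models consume edges chronologically
        return edges
    ```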

  13. Evaluating SQuAD-based Question Answering for the Open Research Knowledge Graph Completion

    • service.tib.eu
    Updated Aug 4, 2023
    + more versions
    Cite
    (2023). Evaluating SQuAD-based Question Answering for the Open Research Knowledge Graph Completion - Vdataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/luh-evaluating-squad-based-question-answering-for-the-open-research-knowledge-graph-completion
    Explore at:
    Dataset updated
    Aug 4, 2023
    License

    Attribution 3.0 (CC BY 3.0)https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Description

    This dataset is part of the bachelor thesis "Evaluating SQuAD-based Question Answering for the Open Research Knowledge Graph Completion". It was created for fine-tuning BERT-based models pre-trained on the SQuAD dataset, using a semi-automatic approach on the ORKG data. The dataset.csv file contains the entire data (all properties) in tabular form and is unsplit. The JSON files contain only the fields necessary for training and evaluation, with additional fields (the start and end indices of the answers in the abstracts). The data in the JSON files is split into training and evaluation sets. We create 4 variants of the training and evaluation sets, one for each of the question labels ("no label", "how", "what", "which"). For detailed information on each of the fields in the dataset, refer to section 4.2 (Corpus) of the thesis document, which can be found at https://www.repo.uni-hannover.de/handle/123456789/12958. The script used to generate the dataset can be found in the public repositories https://github.com/as18cia/thesis_work and https://gitlab.com/TIBHannover/orkg/nlp/experiments/orkg-fine-tuning-squad-based-models

  14. Data from: Data on the Construction Processes of Regression Models

    • jstagedata.jst.go.jp
    jpeg
    Updated Jul 27, 2023
    Cite
    Taichi Kimura; Riko Iwamoto; Mikio Yoshida; Tatsuya Takahashi; Shuji Sasabe; Yoshiyuki Shirakawa (2023). Data on the Construction Processes of Regression Models [Dataset]. http://doi.org/10.50931/data.kona.22180318.v2
    Explore at:
    jpegAvailable download formats
    Dataset updated
    Jul 27, 2023
    Dataset provided by
    Hosokawa Powder Technology Foundation
    Authors
    Taichi Kimura; Riko Iwamoto; Mikio Yoshida; Tatsuya Takahashi; Shuji Sasabe; Yoshiyuki Shirakawa
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This CSV dataset (numbered 1–8) records the construction processes of regression models built with machine-learning methods, and is used to plot Figs. 2–7:

    • 1.LSM_R^2 (Fig. 2): the relationship between estimated and actual values when the least-squares method was used for model construction.
    • 2.PCR_R^2 (Fig. 3): the number of principal components was varied from 1 to 5 during construction of a model using principal component regression.
    • 3.SVR_R^2 (Fig. 4): results of construction using support vector regression. The hyperparameters were chosen by exhaustively searching the listed candidate combinations for the maximum R² value.
    • 4.DNN_HL (Fig. 5a): when a deep neural network was applied, changes in the relationship between estimated and actual values as NH.L. was varied.
    • 5.DNN_Neur (Fig. 5b) and 6.DNN_LT (Fig. 5c): the corresponding changes as NNeur. or NL.T. was varied.
    • 7.DNN_R^2 (Fig. 6): results using the optimal NNeur., NH.L. and NL.T..
    • 8.R^2 (Fig. 7): comparison of the validity of each machine-learning method, showing the optimal result for each.

    Experimental conditions
    • Supply volume of the raw material: 25–125 mL
    • Addition rate of TiO2: 5.0–15.0 wt%
    • Operation time: 1–15 min
    • Rotation speed: 2,200–5,700 min⁻¹
    • Temperature: 295–319 K

    Nomenclature
    • NNeur.: the number of neurons
    • NH.L.: the number of hidden layers
    • NL.T.: the number of learning times
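    For reference, the R² values in these files quantify the agreement between estimated and actual values; a minimal implementation of the coefficient of determination:

    ```python
    def r_squared(actual, estimated):
        """Coefficient of determination: R^2 = 1 - SS_res / SS_tot."""
        mean = sum(actual) / len(actual)
        ss_res = sum((a - e) ** 2 for a, e in zip(actual, estimated))
        ss_tot = sum((a - mean) ** 2 for a in actual)
        return 1.0 - ss_res / ss_tot
    ```

    A perfect fit yields R² = 1.0; values closer to 0 indicate the model explains little of the variance in the actual data.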

  15. Link-prediction on Biomedical Knowledge Graphs

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Jun 25, 2024
    Cite
    Alberto Cattaneo; Daniel Justus; Stephen Bonner; Stephen Bonner; Thomas Martynec; Thomas Martynec; Alberto Cattaneo; Daniel Justus (2024). Link-prediction on Biomedical Knowledge Graphs [Dataset]. http://doi.org/10.5281/zenodo.12097377
    Explore at:
    zipAvailable download formats
    Dataset updated
    Jun 25, 2024
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Alberto Cattaneo; Daniel Justus; Stephen Bonner; Stephen Bonner; Thomas Martynec; Thomas Martynec; Alberto Cattaneo; Daniel Justus
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Time period covered
    Jun 25, 2021
    Description

    Release of the experimental data from the paper Towards Linking Graph Topology to Model Performance for Biomedical Knowledge Graph Completion (accepted at Machine Learning for Life and Material Sciences workshop @ ICML2024).

    Knowledge Graph Completion has been increasingly adopted as a useful method for several tasks in biomedical research, such as drug repurposing or drug-target identification. To that end, a variety of datasets and Knowledge Graph Embedding models have been proposed over the years. However, little is known about the properties that render a dataset useful for a given task and, even though the theoretical properties of Knowledge Graph Embedding models are well understood, their practical utility in this field remains controversial. We conduct a comprehensive investigation into the topological properties of publicly available biomedical Knowledge Graphs and establish links to the accuracy observed in real-world applications. By releasing all model predictions, we invite the community to build upon our work and continue improving the understanding of these crucial applications.
    Experiments were conducted on six datasets: five from the biomedical domain (Hetionet, PrimeKG, PharmKG, OpenBioLink2020 HQ, PharMeBINet) and one trivia KG (FB15k-237). All datasets were randomly split into training, validation and test sets (80% / 10% / 10%; in the case of PharMeBINet, 99.3% / 0.35% / 0.35% to mitigate the increased inference cost on the larger dataset).
    On each dataset, four different KGE models were compared: TransE, DistMult, RotatE, and TripleRE. Hyperparameters were tuned on the validation split, and we release results for tail predictions on the test split. In particular, each test query (h,r,?) is scored against all entities in the KG and we compute the rank of the score of the correct completion (h,r,t), after masking out the scores of other (h,r,t') triples contained in the graph.
    Note: the ranks provided are computed as the average between the optimistic and pessimistic ranks of triple scores.
    Inside experimental_data.zip, the following files are provided for each dataset:
    • {dataset}_preprocessing.ipynb: a Jupyter notebook for downloading and preprocessing the dataset. In particular, this generates the custom label->ID mapping for entities and relations, and the numerical tensor of (h_ID,r_ID,t_ID) triples for all edges in the graph, which can be used to compute graph topological metrics (e.g., using kg-topology-toolbox) and compare them with the edge prediction accuracy.
    • test_ranks.csv: csv table with columns ["h", "r", "t"] specifying the head, relation, tail IDs of the test triples, and columns ["DistMult", "TransE", "RotatE", "TripleRE"] with the rank of the ground-truth tail in the ordered list of predictions made by the four models;
    • entity_dict.csv: the list of entity labels, ordered by entity ID (as generated in the preprocessing notebook);
    • relation_dict.csv: the list of relation labels, ordered by relation ID (as generated in the preprocessing notebook).

    The separate top_100_tail_predictions.zip archive contains, for each of the test queries in the corresponding test_ranks.csv table, the IDs of the top-100 tail predictions made by each of the four KGE models, ordered by decreasing likelihood. The predictions are released in a .npz archive of numpy arrays (one array of shape (n_test_triples, 100) for each of the KGE models).
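    Given the column layout of test_ranks.csv described above, standard ranking metrics such as MRR and Hits@10 can be computed per model; a sketch:

    ```python
    import csv

    MODELS = ["DistMult", "TransE", "RotatE", "TripleRE"]

    def rank_metrics(path):
        """Compute MRR and Hits@10 for each model from a test_ranks.csv table.

        Uses the model columns named in the dataset description; the ranks
        are already averaged between optimistic and pessimistic tie-breaking.
        """
        ranks = {m: [] for m in MODELS}
        with open(path, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                for m in MODELS:
                    ranks[m].append(float(row[m]))
        return {m: {"MRR": sum(1.0 / r for r in rs) / len(rs),
                    "Hits@10": sum(r <= 10 for r in rs) / len(rs)}
                for m, rs in ranks.items()}
    ```

    These per-model summaries can then be compared against the graph topological metrics computed from the preprocessed triples.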

    All experiments (training and inference) have been run on Graphcore IPU hardware using the BESS-KGE distribution framework.

  16. PheKnowLator Human Disease Knowledge Graphs - Build Data (Original)

    • data.niaid.nih.gov
    • zenodo.org
    Updated Aug 29, 2022
    + more versions
    Cite
    Callahan, Tiffany J (2022). PheKnowLator Human Disease Knowledge Graphs - Build Data (Original) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7026639
    Explore at:
    Dataset updated
    Aug 29, 2022
    Dataset authored and provided by
    Callahan, Tiffany J
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    RELEASE V2.1.0 KNOWLEDGE GRAPH: ORIGINAL DATA SOURCES

    Release: v2.1.0

    The goal of this build was to create a knowledge graph that represented human disease mechanisms and included the central dogma. The data sources utilized in this release include many of the sources used in the initial release, as well as some new data made available by the Comparative Toxicogenomics Database and experimental data from the Human Protein Atlas.

    Data sources are listed by type (Ontology and Data not represented in an ontology [Database Sources]). Additional details are provided for each data source below. Please see documentation on the primary release (https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources) for additional details on each data source as well as citation information.

    Data Access:

    https://console.cloud.google.com/storage/browser/pheknowlator/archived_builds/release_v2.1.0/build_01MAY2021

    ONTOLOGIES

    Cell Ontology

    Cell Line Ontology

    Chemical Entities of Biological Interest (ChEBI) Ontology

    Gene Ontology

    Human Phenotype Ontology

    Mondo Disease Ontology

    Pathway Ontology

    Protein Ontology

    Relations Ontology

    Sequence Ontology

    Uber-Anatomy Ontology

    Vaccine Ontology

    Cell Ontology (CL)

    Homepage: GitHub Citation:

    Bard J, Rhee SY, Ashburner M. An ontology for cell types. Genome Biology. 2005;6(2):R21

    Usage: Utilized to connect transcripts and proteins to cells. Additionally, the edges between this ontology and its dependencies are utilized:

    ChEBI

    GO

    PATO

    PRO

    RO

    UBERON

    Cell Line Ontology (CLO)

    Homepage: http://www.clo-ontology.org/ Citation:

    Sarntivijai S, Lin Y, Xiang Z, Meehan TF, Diehl AD, Vempati UD, Schürer SC, Pang C, Malone J, Parkinson H, Liu Y. CLO: the cell line ontology. Journal of Biomedical Semantics. 2014;5(1):37

    Usage: Utilized this ontology to map cell lines to transcripts and proteins. Additionally, the edges between this ontology and its dependencies are utilized:

    CL

    DOID

    NCBITaxon

    UBERON

    Chemical Entities of Biological Interest (ChEBI)

    Homepage: https://www.ebi.ac.uk/chebi/ Citation:

    Hastings J, Owen G, Dekker A, Ennis M, Kale N, Muthukrishnan V, Turner S, Swainston N, Mendes P, Steinbeck C. ChEBI in 2016: Improved services and an expanding collection of metabolites. Nucleic Acids Research. 2015;44(D1):D1214-9

    Usage: Utilized to connect chemicals to complexes, diseases, genes, GO biological processes, GO cellular components, GO molecular functions, pathways, phenotypes, reactions, and transcripts.

    Gene Ontology (GO)

    Homepage: http://geneontology.org/ Citations:

    Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA. Gene ontology: tool for the unification of biology. Nature Genetics. 2000;25(1):25

    The Gene Ontology Consortium. The Gene Ontology Resource: 20 years and still GOing strong. Nucleic Acids Research. 2018;47(D1):D330-8

    Usage: Utilized to connect biological processes, cellular components, and molecular functions to chemicals, pathways, and proteins. Additionally, the edges between this ontology and its dependencies are utilized:

    CL

    NCBITaxon

    RO

    UBERON

    Other Gene Ontology Data Used: goa_human.gaf.gz

    Human Phenotype Ontology (HPO)

    Homepage: https://hpo.jax.org/ Citation:

    Köhler S, Carmody L, Vasilevsky N, Jacobsen JO, Danis D, Gourdine JP, Gargano M, Harris NL, Matentzoglu N, McMurry JA, Osumi-Sutherland D. Expansion of the Human Phenotype Ontology (HPO) knowledge base and resources. Nucleic Acids Research. 2018;47(D1):D1018-27

    Usage: Utilized to connect phenotypes to chemicals, diseases, genes, and variants. Additionally, the edges between this ontology and its dependencies are utilized:

    CL

    ChEBI

    GO

    UBERON

    Files

    Other Human Phenotype Ontology Data Used: phenotype.hpoa

    Mondo Disease Ontology (Mondo)

    Homepage: https://mondo.monarchinitiative.org/ Citation:

    Mungall CJ, McMurry JA, Köhler S, Balhoff JP, Borromeo C, Brush M, Carbon S, Conlin T, Dunn N, Engelstad M, Foster E. The Monarch Initiative: an integrative data and analytic platform connecting phenotypes to genotypes across species. Nucleic Acids Research. 2017;45(D1):D712-22

    Usage: Utilized to connect diseases to chemicals, phenotypes, genes, and variants. Additionally, the edges between this ontology and its dependencies are utilized:

    CL

    NCBITaxon

    GO

    HPO

    UBERON

    Pathway Ontology (PW)

    Homepage: rgd.mcw.edu Citation:

    Petri V, Jayaraman P, Tutaj M, Hayman GT, Smith JR, De Pons J, Laulederkind SJ, Lowry TF, Nigam R, Wang SJ, Shimoyama M. The pathway ontology–updates and applications. Journal of Biomedical Semantics. 2014;5(1):7.

    Usage: Utilized to connect pathways to GO biological processes, GO cellular components, GO molecular functions, and Reactome pathways. Several steps are taken to connect Pathway Ontology identifiers to Reactome pathways and GO biological processes. To connect Pathway Ontology identifiers to Reactome pathways, we use the ComPath Pathway Database Mappings developed by Daniel Domingo-Fernández (PMID: 30564458).

    Files

    Downloaded Mapping Data

    curated_mappings.txt

    kegg_reactome.csv

    Generated Mapping Data

    REACTOME_PW_GO_MAPPINGS.txt

    Protein Ontology (PRO)

    Homepage: https://proconsortium.org/ Citation:

    Natale DA, Arighi CN, Barker WC, Blake JA, Bult CJ, Caudy M, Drabkin HJ, D’Eustachio P, Evsikov AV, Huang H, Nchoutmboube J. The Protein Ontology: a structured representation of protein forms and complexes. Nucleic Acids Research. 2010;39(suppl_1):D539-45

    Usage: Utilized to connect proteins to chemicals, genes, anatomy, catalysts, cell lines, cofactors, complexes, GO biological processes, GO cellular components, GO molecular functions, pathways, proteins, reactions, and transcripts. Additionally, the edges between this ontology and its dependencies are utilized:

    ChEBI

    DOID

    GO

    Notes: A partial, human-only version of this ontology was used. Details on how this version of the ontology was generated can be found under the Protein Ontology section of the Data_Preparation.ipynb Jupyter Notebook.

    Files

    Generated Human Version Protein Ontology (PRO)

    human_pro.owl (closed with hermit reasoner)

    Other PRO Data Used: promapping.txt

    Generated Mapping Data

    Merged Gene, RNA, Protein Map: Merged_gene_rna_protein_identifiers.pkl

    Ensembl Transcript-PRO Identifier Mapping: ENSEMBL_TRANSCRIPT_PROTEIN_ONTOLOGY_MAP.txt

    Entrez Gene-PRO Identifier Mapping: ENTREZ_GENE_PRO_ONTOLOGY_MAP.txt

    UniProt Accession-PRO Identifier Mapping: UNIPROT_ACCESSION_PRO_ONTOLOGY_MAP.txt

    STRING-PRO Identifier Mapping: STRING_PRO_ONTOLOGY_MAP.txt

    Relations Ontology (RO)

    Homepage: GitHub Citation:

    Smith B, Ceusters W, Klagges B, Köhler J, Kumar A, Lomax J, Mungall C, Neuhaus F, Rector AL, Rosse C. Relations in biomedical ontologies. Genome Biology. 2005;6(5):R46.

    Usage: Utilized to connect all data sources in the knowledge graph. Additionally, the ontology is queried prior to building the knowledge graph to identify all relations, their inverse properties, and their labels.

    Files

    Generated RO Data

    INVERSE_RELATIONS.txt

    RELATIONS_LABELS.txt

    Sequence Ontology (SO)

    Homepage: GitHub Citation:

    Eilbeck K, Lewis SE, Mungall CJ, Yandell M, Stein L, Durbin R, Ashburner M. The Sequence Ontology: a tool for the unification of genome annotations. Genome Biology. 2005;6(5):R44

    Usage: Utilized to connect transcripts and other genomic material like genes and variants.

    Files

    Generated Mapping Data

    genomic_sequence_ontology_mappings.xlsx

    SO_GENE_TRANSCRIPT_VARIANT_TYPE_MAPPING.txt

    Uber-Anatomy Ontology (Uberon)

    Homepage: GitHub Citation:

    Mungall CJ, Torniai C, Gkoutos GV, Lewis SE, Haendel MA. Uberon, an integrative multi-species anatomy ontology. Genome Biology. 2012;13(1):R5

    Usage: Utilized to connect tissues, fluids, and cells to proteins and transcripts. Additionally, the edges between this ontology and its dependencies are utilized:

    ChEBI

    CL

    GO

    PRO

    Vaccine Ontology (VO)

    Homepage: http://www.violinet.org/vaccineontology/ Citations:

    He Y, Racz R, Sayers S, Lin Y, Todd T, Hur J, Li X, Patel M, Zhao B, Chung M, Ostrow J. Updates on the web-based VIOLIN vaccine database and analysis system. Nucleic Acids Research. 2013;42(D1):D1124-32

    Xiang Z, Todd T, Ku KP, Kovacic BL, Larson CB, Chen F, Hodges AP, Tian Y, Olenzek EA, Zhao B, Colby LA. VIOLIN: vaccine investigation and online information network. Nucleic Acids Research. 2007;36(suppl_1):D923-8

    Usage: Utilized the edges between this ontology and its dependencies:

    ChEBI

    DOID

    GO

    PRO

    UBERON

    DATABASE SOURCES

    BioPortal

    ClinVar

    Comparative Toxicogenomics Database

    DisGeNET

    Ensembl

    GeneMANIA

    Genotype-Tissue Expression Project

    Human Genome Organisation Gene Nomenclature Committee

    Human Protein Atlas

    National Center for Biotechnology Information Gene

    Reactome Pathway Database

    Search Tool for Recurring Instances of Neighbouring Genes Database

    Universal Protein Resource Knowledgebase

    BioPortal

    Homepage: BioPortal Citation:

    BioPortal. Lexical OWL Ontology Matcher (LOOM)

    Ghazvinian A, Noy NF, Musen MA. Creating mappings for ontologies in biomedicine: simple methods work. In AMIA Annual Symposium Proceedings 2009 (Vol. 2009, p. 198). American Medical Informatics Association

    Usage: BioPortal was utilized to obtain mappings between MeSH identifiers and ChEBI identifiers for chemicals-diseases, chemicals-genes, chemical-GO biological processes, chemicals-GO cellular components, chemicals-GO molecular functions, chemicals-phenotypes, chemicals-proteins, and chemicals-transcripts. Additional information on how this data was processed can be obtained

  17. Output - M-Active

    • openaccessrepository.it
    bin, png
    Updated Apr 18, 2025
    + more versions
    Cite
    Anagnostou Anastasia; Anagnostou Anastasia (2025). Output - M-Active [Dataset]. http://doi.org/10.15161/oar.it/m019j-24769
    Explore at:
    bin, pngAvailable download formats
    Dataset updated
    Apr 18, 2025
    Dataset provided by
    oar
    Authors
    Anagnostou Anastasia; Anagnostou Anastasia
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This file contains the output of the PALMS run with male active initial population.

    To reproduce the physical activity trajectory graph, please follow the steps below:

    1. Run PALMS (DOI: 10.15161/oar.it/23467) with input parameters M-Active (DOI: 10.15161/oar.it/23477).

    2. To run the simulation use the PALMS OAR Reproducibility container (DOI: 10.15161/oar.it/23494).

    3. The run generates five CSV files. For this graph, get the 'SimYear' (column A) and 'Avg PA status' (column H) records in the "AnnualPSA

    4. The data above will reproduce the M-Active graph.
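    The extraction in steps 3–4 can be sketched in Python, addressing columns by spreadsheet position (A → index 0, H → index 7); the single header row is an assumption about the file layout:

    ```python
    import csv

    def extract_trajectory(path):
        """Collect (SimYear, Avg PA status) pairs from the annual output CSV.

        Reads by position -- column A is index 0, column H is index 7 --
        and assumes a single header row, which should be checked against
        the actual PALMS output file.
        """
        points = []
        with open(path, newline="", encoding="utf-8") as f:
            reader = csv.reader(f)
            next(reader)  # skip header row
            for row in reader:
                points.append((int(row[0]), float(row[7])))
        return points
    ```

    The returned pairs can be plotted directly to reproduce the trajectory graph.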

  18. Citation Graph

    • kaggle.com
    Updated Jun 30, 2020
    Cite
    Caselaw Access Project (2020). Citation Graph [Dataset]. https://www.kaggle.com/harvardlil/citation-graph/kernels
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Jun 30, 2020
    Dataset provided by
    Kaggle
    Authors
    Caselaw Access Project
    Description

    Context

    The Caselaw Access Project makes 40 million pages of U.S. caselaw freely available online from the collections of Harvard Law School Library.

    The CAP citation graph shows the connections between cases in the Caselaw Access Project dataset. You can use the citation graph to answer questions like "what is the most influential case?" and "what jurisdictions cite most often to this jurisdiction?".

    Learn More: https://case.law/download/citation_graph/

    Access Limits: https://case.law/api/#limits

    Content

    This dataset includes citations and metadata for the CAP citation graph in CSV format.
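    A question like "what is the most influential case?" can be approximated by in-degree in the citation graph. The column names 'citing' and 'cited' below are assumptions and must be checked against the actual CSV headers:

    ```python
    import csv
    from collections import Counter

    def most_cited(path, top_n=5):
        """Rank cases by incoming-citation count (a crude influence proxy).

        The column names 'citing' and 'cited' are assumptions; check the
        header of the downloaded citations CSV and adjust accordingly.
        """
        in_degree = Counter()
        with open(path, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                in_degree[row["cited"]] += 1
        return in_degree.most_common(top_n)
    ```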

    Acknowledgements

    The Caselaw Access Project is by the Library Innovation Lab at Harvard Law School Library.

    Inspiration

    People are using CAP data to create research, applications, and more. We're sharing examples in our gallery.

    Cite Grid is the first visualization we've created based on data from our citation graph.

    Have something to share? We're excited to hear about it.

  19. Output - F-Inactive

    • openaccessrepository.it
    bin, png
    Updated Apr 18, 2025
    + more versions
    Cite
    Anagnostou Anastasia; Anagnostou Anastasia (2025). Output - F-Inactive [Dataset]. http://doi.org/10.15161/oar.it/r8wfp-vvt57
    Explore at:
    bin, pngAvailable download formats
    Dataset updated
    Apr 18, 2025
    Dataset provided by
    oar
    Authors
    Anagnostou Anastasia; Anagnostou Anastasia
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This file contains the output of the PALMS run with female inactive initial population.

    To reproduce the physical activity trajectory graph, please follow the steps below:

    1. Run PALMS (DOI: 10.15161/oar.it/23467) with input parameters F-Inactive (DOI: 10.15161/oar.it/23471).

    2. To run the simulation use the PALMS OAR Reproducibility container (DOI: 10.15161/oar.it/23494).

    3. The run generates five CSV files. For this graph, get the 'SimYear' (column A) and 'Avg PA status' (column H) records in the "AnnualPSA

    4. The data above will reproduce the F-Inactive graph.

  20. NetVotes ENIC Dataset

    • zenodo.org
    • explore.openaire.eu
    txt, zip
    Updated Oct 1, 2024
    Cite
    Israel Mendonça; Vincent Labatut; Vincent Labatut; Rosa Figueiredo; Rosa Figueiredo; Israel Mendonça (2024). NetVotes ENIC Dataset [Dataset]. http://doi.org/10.5281/zenodo.6815510
    Explore at:
    zip, txtAvailable download formats
    Dataset updated
    Oct 1, 2024
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Israel Mendonça; Vincent Labatut; Vincent Labatut; Rosa Figueiredo; Rosa Figueiredo; Israel Mendonça
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Description. The NetVote dataset contains the outputs of the NetVote program when applied to voting data coming from VoteWatch (http://www.votewatch.eu/).

    These results were used in the following conference papers:

    1. I. Mendonça, R. Figueiredo, V. Labatut, and P. Michelon, “Relevance of Negative Links in Graph Partitioning: A Case Study Using Votes From the European Parliament,” in 2nd European Network Intelligence Conference, 2015, pp. 122–129. ⟨hal-01176090⟩ DOI: 10.1109/ENIC.2015.25
    2. I. Mendonça, R. Figueiredo, V. Labatut, and P. Michelon, “Informative Value of Negative Links for Graph Partitioning, with an application to European Parliament Votes,” in 6ème Conférence sur les modèles et l'analyse de réseaux : approches mathématiques et informatiques, 2015, 12 p. ⟨hal-02055158⟩

    Source code. The NetVote source code is available on GitHub: https://github.com/CompNet/NetVotes.

    Citation. If you use our dataset or tool, please cite article [1] above.


    @InProceedings{Mendonca2015,
      author    = {Mendonça, Israel and Figueiredo, Rosa and Labatut, Vincent and Michelon, Philippe},
      title     = {Relevance of Negative Links in Graph Partitioning: A Case Study Using Votes From the {E}uropean {P}arliament},
      booktitle = {2\textsuperscript{nd} European Network Intelligence Conference ({ENIC})},
      year      = {2015},
      pages     = {122-129},
      address   = {Karlskrona, SE},
      publisher = {IEEE Publishing},
      doi       = {10.1109/ENIC.2015.25},
    }

    -------------------------

    Details. This archive contains the following folders:

    • `votewatch_data`: the raw data extracted from the VoteWatch website.
      • `VoteWatch Europe European Parliament, Council of the EU.csv`: list of the documents voted during the considered term, with some details such as the date and topic.
      • `votes_by_document`: this folder contains a collection of CSV files, each one describing the outcome of the vote session relatively to one specific document.
      • `intermediate_files`: this folder contains several CSV files:
        • `allvotes.csv`: concatenation of all vote outcomes for all documents and all MEPs. It can be considered a compact representation of the data contained in the folder `votes_by_document`.
        • `loyalty.csv`: same structure as `allvotes.csv`, but for loyalty (i.e. whether or not the MEP voted like the majority of the MEPs in their political group).
        • `MPs.csv`: list of the MEPs having voted at least once in the considered term, with their details.
        • `policies.csv`: list of the topics considered during the term.
        • `qtd_docs.csv`: list of the topics with the corresponding number of documents.
    • `parallel_ils_results`: contains the raw results of the ILS tool. This is an external algorithm able to estimate the optimal partition of the network nodes in terms of structural balance. It was applied to all the networks extracted by our scripts (from the VoteWatch data), and the produced files were placed here for postprocessing. Each subfolder corresponds to one of the topic-year pair.
    • `output_files`: contains the file produced by our scripts.
      • `agreement`: histograms representing the distributions of agreement and rebellion indices. Each subfolder corresponds to a specific topic.
      • `community_algorithms_csv`: Performances obtained by the partitioning algorithms (for both community detection and correlation clustering). Each subfolder corresponds to a specific topic.
      • `xxxx_cluster_information.csv`: table containing several variants of the imbalance measure, for the considered algorithms.
      • `community_algorithms_results`: Comparison of the partitions detected by the various algorithms considered, and distribution of the cluster/community sizes. Each subfolder corresponds to a specific topic.
      • `xxxx_cluster_comparison.csv`: table comparing the partitions detected by the community detection algorithms, in terms of Rand index and other measures.
      • `xxxx_ils_cluster_comparison.csv`: like `xxxx_cluster_comparison.csv`, except we compare the partition of community detection algorithms with that of the ILS.
      • `xxxx_yyyy_distribution.pdf`: histogram of the community (or cluster) sizes detected by algorithm `yyyy`.
      • `graphs`: the networks extracted from the vote data. Each subfolder corresponds to a specific topic.
      • `xxxx_complete_graph.graphml`: network at the Graphml format, with all the information: nodes, edges, nodal attributes (including communities), weights, etc.
      • `xxxx_edges_Gephi.csv`: only the links, with their weights (i.e. vote similarity).
      • `xxxx_graph.g`: network at the g format (for ILS).
      • `xxxx_net_measures.csv`: table containing some stats on the network (number of links, etc.).
      • `xxxx_nodes_Gephi.csv`: list of nodes (i.e. MEPs), with details.
      • `plots`: synthesis plots from the paper.

    -------------------------

    License. These data are shared under a Creative Commons 0 license.

    Contact. Vincent Labatut <vincent.labatut@univ-avignon.fr> & Rosa Figueiredo <rosa.figueiredo@univ-avignon.fr>


Sample Graph Datasets in CSV Format

Note: none of the data sets published here contain actual data, they are for testing purposes only.

Description

This data repository contains graph datasets, where each graph is represented by two CSV files: one for node information and another for edge details. To link the two files of the same graph, their names share a common identifier based on the number of nodes. For example:

  • dataset_30_nodes_interactions.csv: contains 30 rows (nodes).
  • dataset_30_edges_interactions.csv: contains 47 rows (edges).
  • the common identifier dataset_30 indicates that both files describe the same graph.

CSV nodes

Each dataset contains the following columns:

Name of the Column | Type | Description
UniProt ID | string | protein identification
label | string | protein label (type of node)
properties | string | a dictionary containing properties related to the protein

CSV edges

Each dataset contains the following columns:

Name of the Column | Type | Description
Relationship ID | string | relationship identification
Source ID | string | identification of the source protein in the relationship
Target ID | string | identification of the target protein in the relationship
label | string | relationship label (type of relationship)
properties | string | a dictionary containing properties related to the relationship
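As a sketch of how these files can be consumed, the snippet below parses inline CSV text that follows the documented columns and builds a simple adjacency list; the concrete protein identifiers and property values are invented for the example:

```python
import csv
import io

# Inline stand-ins for dataset_30_nodes_interactions.csv and
# dataset_30_edges_interactions.csv; the rows are invented.
nodes_csv = """UniProt ID,label,properties
P12345,Protein,"{'organism': 'human'}"
Q67890,Protein,"{'organism': 'mouse'}"
"""
edges_csv = """Relationship ID,Source ID,Target ID,label,properties
R1,P12345,Q67890,INTERACTS_WITH,"{'score': 0.9}"
"""

# Index nodes by their UniProt ID.
nodes = {row["UniProt ID"]: row
         for row in csv.DictReader(io.StringIO(nodes_csv))}

# Build an adjacency list keyed by the source protein.
adjacency = {}
for row in csv.DictReader(io.StringIO(edges_csv)):
    adjacency.setdefault(row["Source ID"], []).append(row["Target ID"])

print(len(nodes), adjacency)
```

For the real datasets, the `io.StringIO(...)` objects would be replaced with `open("dataset_30_nodes_interactions.csv")` and the matching edges file.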

Metadata

Graph | Number of Nodes | Number of Edges | Sparse graph
dataset_30* | 30 | 47 | Y
dataset_60* | 60 | 181 | Y
dataset_120* | 120 | 689 | Y
dataset_240* | 240 | 2819 | Y
dataset_300* | 300 | 4658 | Y
dataset_600* | 600 | 18004 | Y
dataset_1200* | 1200 | 71785 | Y
dataset_2400* | 2400 | 288600 | Y
dataset_3000* | 3000 | 449727 | Y
dataset_6000* | 6000 | 1799413 | Y
dataset_12000* | 12000 | 7199863 | Y
dataset_24000* | 24000 | 28792361 | Y
dataset_30000* | 30000 | 44991744 | Y
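The node and edge counts above can be related to graph density. Assuming undirected simple graphs (not stated explicitly in the table), density is E / (N(N-1)/2); a minimal check for the smallest and largest datasets:

```python
# Density of an undirected simple graph: E / (N * (N - 1) / 2).
# That these datasets are undirected simple graphs is an assumption;
# the counts below come from the metadata table.
def density(n_nodes: int, n_edges: int) -> float:
    return n_edges / (n_nodes * (n_nodes - 1) / 2)

print(round(density(30, 47), 3))           # dataset_30  → 0.108
print(round(density(30000, 44991744), 3))  # dataset_30000
```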

This repository includes two additional tiny graph datasets to experiment with before dealing with the larger ones.

CSV nodes (tiny graphs)

Each dataset contains the following columns:

Name of the Column | Type | Description
ID | string | node identification
label | string | node label (type of node)
properties | string | a dictionary containing properties related to the node

CSV edges (tiny graphs)

Each dataset contains the following columns:

Name of the Column | Type | Description
ID | string | relationship identification
source | string | identification of the source node in the relationship
target | string | identification of the target node in the relationship
label | string | relationship label (type of relationship)
properties | string | a dictionary containing properties related to the relationship
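The `properties` column stores a dictionary serialized as a string. The exact serialization is not specified here, so a robust reader might try a Python-literal parse first and fall back to JSON; a hedged sketch:

```python
import ast
import json

# Parse a `properties` cell whose serialization format is unknown:
# try a Python dict literal first, then JSON (which covers values such
# as true/false/null that are not valid Python literals).
def parse_properties(raw: str) -> dict:
    try:
        return ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return json.loads(raw)

print(parse_properties("{'weight': 1.5}"))   # Python-literal style
print(parse_properties('{"flag": true}'))    # JSON style
```

If the datasets turn out to use a single, documented serialization, the corresponding branch alone suffices.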

Metadata (tiny graphs)

Graph | Number of Nodes | Number of Edges | Sparse graph
dataset_dummy* | 3 | 6 | N
dataset_dummy2* | 3 | 6 | N