100+ datasets found
  1. Sample Graph Datasets in CSV Format

    • zenodo.org
    csv
    Updated Dec 9, 2024
    Cite
    Edwin Carreño; Edwin Carreño (2024). Sample Graph Datasets in CSV Format [Dataset]. http://doi.org/10.5281/zenodo.14335015
    Explore at:
    csvAvailable download formats
    Dataset updated
    Dec 9, 2024
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Edwin Carreño; Edwin Carreño
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Sample Graph Datasets in CSV Format

    Note: none of the datasets published here contains actual data; they are for testing purposes only.

    Description

    This data repository contains graph datasets, where each graph is represented by two CSV files: one for node information and another for edge details. To link the files to the same graph, their names include a common identifier based on the number of nodes. For example:

    • dataset_30_nodes_interactions.csv: contains 30 rows (nodes).
    • dataset_30_edges_interactions.csv: contains 47 rows (edges).
    • the common identifier dataset_30 indicates that both files describe the same graph.

    CSV nodes

    Each dataset contains the following columns:

    Name of the Column   Type     Description
    UniProt ID           string   protein identification
    label                string   protein label (type of node)
    properties           string   a dictionary containing properties related to the protein.

    CSV edges

    Each dataset contains the following columns:

    Name of the Column   Type     Description
    Relationship ID      string   relationship identification
    Source ID            string   identification of the source protein in the relationship
    Target ID            string   identification of the target protein in the relationship
    label                string   relationship label (type of relationship)
    properties           string   a dictionary containing properties related to the relationship.
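
    A minimal loading sketch in Python (assuming pandas and networkx, and that the CSV headers match the column names above; treating the edges as directed is an assumption):

        import pandas as pd
        import networkx as nx

        # Load the paired files for the 30-node graph
        nodes = pd.read_csv("dataset_30_nodes_interactions.csv")
        edges = pd.read_csv("dataset_30_edges_interactions.csv")

        G = nx.DiGraph()
        for _, n in nodes.iterrows():
            G.add_node(n["UniProt ID"], label=n["label"], properties=n["properties"])
        for _, e in edges.iterrows():
            G.add_edge(e["Source ID"], e["Target ID"],
                       id=e["Relationship ID"], label=e["label"])

        print(G.number_of_nodes(), G.number_of_edges())  # expected: 30 and 47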

    Metadata

    Graph            Number of Nodes   Number of Edges   Sparse graph
    dataset_30*      30                47                Y
    dataset_60*      60                181               Y
    dataset_120*     120               689               Y
    dataset_240*     240               2819              Y
    dataset_300*     300               4658              Y
    dataset_600*     600               18004             Y
    dataset_1200*    1200              71785             Y
    dataset_2400*    2400              288600            Y
    dataset_3000*    3000              449727            Y
    dataset_6000*    6000              1799413           Y
    dataset_12000*   12000             7199863           Y
    dataset_24000*   24000             28792361          Y
    dataset_30000*   30000             44991744          Y

    This repository includes two (2) additional tiny graph datasets for experimenting with before dealing with the larger datasets.

    CSV nodes (tiny graphs)

    Each dataset contains the following columns:

    Name of the Column   Type     Description
    ID                   string   node identification
    label                string   node label (type of node)
    properties           string   a dictionary containing properties related to the node.

    CSV edges (tiny graphs)

    Each dataset contains the following columns:

    Name of the Column   Type     Description
    ID                   string   relationship identification
    source               string   identification of the source node in the relationship
    target               string   identification of the target node in the relationship
    label                string   relationship label (type of relationship)
    properties           string   a dictionary containing properties related to the relationship.

    Metadata (tiny graphs)

    Graph             Number of Nodes   Number of Edges   Sparse graph
    dataset_dummy*    3                 6                 N
    dataset_dummy2*   3                 6                 N
  2. GNN-LLM Fusion: Node & Edge CSV Dataset

    • kaggle.com
    zip
    Updated Jun 14, 2025
    Cite
    Daksh Bhatnagar (2025). GNN-LLM Fusion: Node & Edge CSV Dataset [Dataset]. https://www.kaggle.com/datasets/dakshbhatnagar08/gnn-llm-fusion-node-and-edge-csv-dataset
    Explore at:
    zip(658 bytes)Available download formats
    Dataset updated
    Jun 14, 2025
    Authors
    Daksh Bhatnagar
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    GNN + LLM Hybrid Baseline Dataset

    This dataset demonstrates how to fuse Large Language Model (LLM) generated embeddings with Graph Neural Networks (GNNs) for learning on tabular graph data.

    Files Included

    • sample_nodes.csv – Node features including ID, category, and description text
    • sample_edges.csv – Edge list (source, target, weight)
    • sample_augmented_nodes.csv – Node features + LLM-generated embeddings (simulated)
    • GNN_LLM_Hybrid_Baseline.ipynb – Main baseline model using PyTorch Geometric
    • CSV_Processing_1.ipynb – Basic loading and EDA of nodes/edges
    • CSV_Processing_2.ipynb – Preview of LLM-augmented node features
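
    A minimal sketch of turning the files above into a PyTorch Geometric Data object (column names follow the description; that source/target hold integer node indices is an assumption):

        import pandas as pd
        import torch
        from torch_geometric.data import Data

        nodes = pd.read_csv("sample_nodes.csv")
        edges = pd.read_csv("sample_edges.csv")

        edge_index = torch.tensor(edges[["source", "target"]].values.T, dtype=torch.long)
        edge_weight = torch.tensor(edges["weight"].values, dtype=torch.float)
        x = torch.eye(len(nodes))  # placeholder features; swap in the LLM embeddings

        data = Data(x=x, edge_index=edge_index, edge_weight=edge_weight)
        print(data)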

    Use Cases

    • Learn how to simulate and inject LLM embeddings into graph data
    • Experiment with hybrid GNN models for tabular reasoning
    • Use for educational purposes, benchmarking GNNs, or feature augmentation

    Technologies Used

    • PyTorch Geometric
    • Scikit-learn
    • Numpy & Pandas

    What’s Next?

    This is a synthetic dataset. For real-world use:

    • Replace the "LLM embeddings" with outputs from OpenAI / Mistral / HuggingFace models
    • Extend the node descriptions with actual context or domain-specific text
    • Scale to real-world graphs or use with competition tabular datasets

  3. Pubmed Knowledge Graph Dataset

    • kaggle.com
    zip
    Updated Jan 7, 2022
    Cite
    Krishna Kumar S (2022). Pubmed Knowledge Graph Dataset [Dataset]. https://www.kaggle.com/datasets/krishnakumarkk/pubmed-knowledge-graph-dataset
    Explore at:
    zip(10883016548 bytes)Available download formats
    Dataset updated
    Jan 7, 2022
    Authors
    Krishna Kumar S
    License

    https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    PubMed Knowledge Graph Datasets http://er.tacc.utexas.edu/datasets/ped

    Content

    Dataset name: PKG2020S4 (1781-Dec. 2020), Version 4. This version updates the previous PKG release with the PubMed 2021 baseline files and the PubMed daily update files (up to Jan. 4th, 2021), and adds extracted bio-entities, author disambiguation results, extended author information, Scimago journal information, and WOS citations, which contain the reference relations between each PMID and the PMIDs it references, extracted from WOS.

    Database features: 1-PKG2020S4 (1781-Dec. 2020) Features.pdf (https://web.corral.tacc.utexas.edu/dive_datasets/PKG2020S4/PKG2020S4_MySQL/1-PKG2020S4%20(1781-Dec.%202020)%20Features.pdf)

    Database description: 2-PKG2020S4 (1781-Dec. 2020) Database Description.pdf (https://web.corral.tacc.utexas.edu/dive_datasets/PKG2020S4/PKG2020S4_MySQL/2-PKG2020S4%20(1781-Dec.%202020)%20Database%20Description.pdf)

    Acknowledgements

    http://er.tacc.utexas.edu/datasets/ped

    Inspiration

    http://er.tacc.utexas.edu/datasets/ped

  4. Awesome Public Datasets as Neo4j Graph

    • kaggle.com
    zip
    Updated Dec 20, 2016
    Cite
    Manav Sehgal (2016). Awesome Public Datasets as Neo4j Graph [Dataset]. https://www.kaggle.com/startupsci/awesome-datasets-graph
    Explore at:
    zip(1322695 bytes)Available download formats
    Dataset updated
    Dec 20, 2016
    Authors
    Manav Sehgal
    License

    https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    The awesome datasets graph is a Neo4j graph database which catalogs and classifies datasets and data sources as scraped from the Awesome Public Datasets GitHub list.

    Content

    We started with a simple list of links on the Awesome Public Datasets page. We now have a semantic graph database with ten labels, five relationship types, nine property keys, and more than 400 nodes, all within a 1 MB database footprint. All database operations are query driven using the powerful and flexible Cypher Graph Query Language.

    The download includes CSV files which were created as an interim step after scraping and wrangling the source. The download also includes a working Neo4j Graph Database. Login: neo4j | Password: demo.
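
    Once the bundled database is running locally, queries can also be issued from Python; a sketch assuming the official neo4j driver and a default bolt endpoint, with the credentials stated above:

        from neo4j import GraphDatabase

        driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "demo"))
        with driver.session() as session:
            # Count all nodes in the catalog (more than 400 expected)
            record = session.run("MATCH (n) RETURN count(n) AS n").single()
            print(record["n"])
        driver.close()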

    Acknowledgements

    Data scraped from Awesome Public Datasets page. Prepared for the book Data Science Solutions.

    Inspiration

    While we have done basic data wrangling and preparation, how can this graph prove useful for your data science workflow? Can we record the decisions taken across data science workflow stages, and capture how the data catalog (data sources, datasets, tools) supports those decisions in achieving data science solution strategies?

  5. A Fine-grained Knowledge Graph of Traditional Chinese Medicine Ancient Books

    • scidb.cn
    Updated Dec 12, 2024
    Cite
    Xuyang Meng (2024). A Fine-grained Knowledge Graph of Traditional Chinese Medicine Ancient Books [Dataset]. http://doi.org/10.57760/sciencedb.j00133.00432
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Dec 12, 2024
    Dataset provided by
    Science Data Bank
    Authors
    Xuyang Meng
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    This dataset is in CSV format, consisting of two files, nodes.csv and relations.csv, both encoded in UTF-8. The two files can be imported into graph database tools such as Neo4j to form a visual knowledge graph for further research.

    nodes.csv contains 26771 traditional Chinese medicine related entities extracted from five traditional Chinese medicine ancient books: 伤寒论, 伤寒类方, 伤寒悬解, 伤寒论浅注, and 伤寒九十论. Each row represents an entity: the first column is the entity ID, the second column the entity name, and the third column the entity type.

    relations.csv contains 8272 triplets of relationships between entities in nodes.csv. Each row represents one relationship: the first column is the head entity ID, the second column the tail entity ID, and the third column the relationship type.
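
    Before importing into Neo4j, the two files can be inspected in Python; a minimal sketch, assuming the files are plain three-column CSVs without header rows, as described:

        import pandas as pd
        import networkx as nx

        nodes = pd.read_csv("nodes.csv", header=None, names=["id", "name", "type"])
        rels = pd.read_csv("relations.csv", header=None, names=["head", "tail", "rel"])

        G = nx.MultiDiGraph()  # MultiDiGraph allows parallel relations between entities
        for _, n in nodes.iterrows():
            G.add_node(n["id"], name=n["name"], type=n["type"])
        for _, r in rels.iterrows():
            G.add_edge(r["head"], r["tail"], rel=r["rel"])

        print(G.number_of_nodes(), G.number_of_edges())  # expected: 26771 and 8272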

  6. Wikipedia time-series graph

    • zenodo.org
    • data.niaid.nih.gov
    bin, csv
    Updated Apr 24, 2025
    Cite
    Benzi Kirell; Miz Volodymyr; Ricaud Benjamin; Vandergheynst Pierre; Benzi Kirell; Miz Volodymyr; Ricaud Benjamin; Vandergheynst Pierre (2025). Wikipedia time-series graph [Dataset]. http://doi.org/10.5281/zenodo.886484
    Explore at:
    bin, csvAvailable download formats
    Dataset updated
    Apr 24, 2025
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Benzi Kirell; Miz Volodymyr; Ricaud Benjamin; Vandergheynst Pierre; Benzi Kirell; Miz Volodymyr; Ricaud Benjamin; Vandergheynst Pierre
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Wikipedia temporal graph.

    The dataset is based on two Wikipedia SQL dumps: (1) English language articles and (2) user visit counts per page per hour (aka pagecounts). The original datasets are publicly available on the Wikimedia website.

    The static graph structure is extracted from English-language Wikipedia articles. Redirects are removed. Before building the Wikipedia graph we introduce thresholds on the minimum number of visits per hour and the maximum in-degree. We remove the pages that have fewer than 500 visits per hour at least once during the specified period, and we also remove nodes (pages) with in-degree higher than 8 000, to build a more meaningful initial graph. After cleaning, the graph contains 116 016 nodes (out of 4 856 639 pages in total) and 6 573 475 edges. The graph can be imported in two ways: (1) using edges.csv and vertices.csv, or (2) using the enwiki-20150403-graph.gt file, which can be opened with the open-source Python library Graph-Tool.

    The time-series data contains users' visit counts from 02:00, 23 September 2014 until 23:00, 30 April 2015, for a total of 5278 hours. The data is stored in two formats: CSV and H5. The CSV file contains rows in the format [page_id :: count_views :: layer], where layer represents an hour. In the H5 file, each layer corresponds to an hour as well.
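
    A minimal loading sketch for option (2), assuming the Graph-Tool package is installed:

        import graph_tool.all as gt

        # Load the prebuilt graph (116 016 nodes and 6 573 475 edges expected)
        g = gt.load_graph("enwiki-20150403-graph.gt")
        print(g.num_vertices(), g.num_edges())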

  7. Petre_Slide_CategoricalScatterplotFigShare.pptx

    • figshare.com
    pptx
    Updated Sep 19, 2016
    Cite
    Benj Petre; Aurore Coince; Sophien Kamoun (2016). Petre_Slide_CategoricalScatterplotFigShare.pptx [Dataset]. http://doi.org/10.6084/m9.figshare.3840102.v1
    Explore at:
    pptxAvailable download formats
    Dataset updated
    Sep 19, 2016
    Dataset provided by
    Figsharehttp://figshare.com/
    figshare
    Authors
    Benj Petre; Aurore Coince; Sophien Kamoun
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Categorical scatterplots with R for biologists: a step-by-step guide

    Benjamin Petre1, Aurore Coince2, Sophien Kamoun1

    1 The Sainsbury Laboratory, Norwich, UK; 2 Earlham Institute, Norwich, UK

    Weissgerber and colleagues (2015) recently stated that ‘as scientists, we urgently need to change our practices for presenting continuous data in small sample size studies’. They called for more scatterplot and boxplot representations in scientific papers, which ‘allow readers to critically evaluate continuous data’ (Weissgerber et al., 2015). In the Kamoun Lab at The Sainsbury Laboratory, we recently implemented a protocol to generate categorical scatterplots (Petre et al., 2016; Dagdas et al., 2016). Here we describe the three steps of this protocol: 1) formatting of the data set in a .csv file, 2) execution of the R script to generate the graph, and 3) export of the graph as a .pdf file.

    Protocol

    • Step 1: format the data set as a .csv file. Store the data in a three-column Excel file as shown in the Powerpoint slide. The first column ‘Replicate’ indicates the biological replicates. In the example, the month and year during which the replicate was performed is indicated. The second column ‘Condition’ indicates the conditions of the experiment (in the example, a wild type and two mutants called A and B). The third column ‘Value’ contains continuous values. Save the Excel file as a .csv file (File -> Save as -> in ‘File Format’, select .csv). This .csv file is the input file to import into R.

    • Step 2: execute the R script (see Notes 1 and 2). Copy the script shown in Powerpoint slide and paste it in the R console. Execute the script. In the dialog box, select the input .csv file from step 1. The categorical scatterplot will appear in a separate window. Dots represent the values for each sample; colors indicate replicates. Boxplots are superimposed; black dots indicate outliers.

    • Step 3: save the graph as a .pdf file. Shape the window at your convenience and save the graph as a .pdf file (File -> Save as). See Powerpoint slide for an example.

    Notes

    • Note 1: install the ggplot2 package. The R script requires the package ‘ggplot2’ to be installed. To install it, Packages & Data -> Package Installer -> enter ‘ggplot2’ in the Package Search space and click on ‘Get List’. Select ‘ggplot2’ in the Package column and click on ‘Install Selected’. Install all dependencies as well.

    • Note 2: use a log scale for the y-axis. To use a log scale for the y-axis of the graph, use the command line below in place of command line #7 in the script.

    # 7 Display the graph in a separate window. Dot colors indicate replicates
    graph + geom_boxplot(outlier.colour='black', colour='black') + geom_jitter(aes(col=Replicate)) + scale_y_log10() + theme_bw()

    References

    Dagdas YF, Belhaj K, Maqbool A, Chaparro-Garcia A, Pandey P, Petre B, et al. (2016) An effector of the Irish potato famine pathogen antagonizes a host autophagy cargo receptor. eLife 5:e10856.

    Petre B, Saunders DGO, Sklenar J, Lorrain C, Krasileva KV, Win J, et al. (2016) Heterologous Expression Screens in Nicotiana benthamiana Identify a Candidate Effector of the Wheat Yellow Rust Pathogen that Associates with Processing Bodies. PLoS ONE 11(2):e0149035

    Weissgerber TL, Milic NM, Winham SJ, Garovic VD (2015) Beyond Bar and Line Graphs: Time for a New Data Presentation Paradigm. PLoS Biol 13(4):e1002128

    https://cran.r-project.org/

    http://ggplot2.org/

  8. ORBITAAL: cOmpRehensive BItcoin daTaset for temporAl grAph anaLysis

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip, bin +1
    Updated Nov 27, 2024
    Cite
    Célestin Coquidé; Célestin Coquidé; Remy Cazabet; Remy Cazabet (2024). ORBITAAL: cOmpRehensive BItcoin daTaset for temporAl grAph anaLysis [Dataset]. http://doi.org/10.5281/zenodo.12581515
    Explore at:
    csv, application/gzip, binAvailable download formats
    Dataset updated
    Nov 27, 2024
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Célestin Coquidé; Célestin Coquidé; Remy Cazabet; Remy Cazabet
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Construction

    This dataset captures the temporal network of Bitcoin (BTC) flows exchanged between entities at the finest time resolution (UNIX timestamps). Its construction is based on the blockchain covering the period from January 3rd, 2009 to January 25th, 2021. The blockchain extraction was made using the bitcoin-etl (https://github.com/blockchain-etl/bitcoin-etl) Python package. The entity-entity network is built by aggregating Bitcoin addresses using the common-input heuristic [1] as well as popular Bitcoin users' addresses provided by https://www.walletexplorer.com/

    [1] M. Harrigan and C. Fretter, "The Unreasonable Effectiveness of Address Clustering," 2016 Intl IEEE Conferences on Ubiquitous Intelligence & Computing, Advanced and Trusted Computing, Scalable Computing and Communications, Cloud and Big Data Computing, Internet of People, and Smart World Congress (UIC/ATC/ScalCom/CBDCom/IoP/SmartWorld), Toulouse, France, 2016, pp. 368-373, doi: 10.1109/UIC-ATC-ScalCom-CBDCom-IoP-SmartWorld.2016.0071.

    Dataset Description

    Bitcoin Activity Temporal Coverage: From 03 January 2009 to 25 January 2021

    Overview:

    This dataset provides a comprehensive representation of Bitcoin exchanges between entities over a significant temporal span, from the inception of Bitcoin to recent years. It encompasses various temporal resolutions and representations to facilitate Bitcoin transaction network analysis in the context of temporal graphs.

    All dates were derived from block UNIX timestamps, in the GMT timezone.

    Contents:

    The dataset is distributed across the following compressed archives:

    All data are stored in the Apache Parquet file format, a columnar storage format optimized for analytical queries; the files can be read with, e.g., the pyspark Python package.

    1. orbitaal-stream_graph.tar.gz:

      • The root directory is STREAM_GRAPH/
      • Contains a stream graph representation of Bitcoin exchanges at the finest temporal scale, corresponding to the validation time of each block (averaging approximately 10 minutes).
      • The stream graph is divided into 13 files, one for each year
      • File format is parquet
      • Name format is orbitaal-stream_graph-date-[YYYY]-file-id-[ID].snappy.parquet, where [YYYY] stands for the corresponding year and [ID] is an integer from 1 to N (the number of files), such that sorting by increasing [ID] also sorts by increasing year
      • These files are in the subdirectory STREAM_GRAPH/EDGES/
    2. orbitaal-snapshot-all.tar.gz:

      • The root directory is SNAPSHOT/
      • Contains the snapshot network representing all transactions aggregated over the whole dataset period (from Jan. 2009 to Jan. 2021).
      • Files format is parquet
      • Name format is orbitaal-snapshot-all.snappy.parquet.
      • These files are in the subdirectory SNAPSHOT/EDGES/ALL/
    3. orbitaal-snapshot-year.tar.gz:

      • The root directory is SNAPSHOT/
      • Contains the yearly-resolution snapshot networks
      • File format is parquet
      • Name format is orbitaal-snapshot-date-[YYYY]-file-id-[ID].snappy.parquet, where [YYYY] stands for the corresponding year and [ID] is an integer from 1 to N (the number of files), such that sorting by increasing [ID] also sorts by increasing year
      • These files are in the subdirectory SNAPSHOT/EDGES/year/
    4. orbitaal-snapshot-month.tar.gz:

      • The root directory is SNAPSHOT/
      • Contains the monthly-resolution snapshot networks
      • File format is parquet
      • Name format is orbitaal-snapshot-date-[YYYY]-[MM]-file-id-[ID].snappy.parquet, where [YYYY] and [MM] stand for the corresponding year and month, and [ID] is an integer from 1 to N (the number of files), such that sorting by increasing [ID] also sorts by increasing year and month
      • These files are in the subdirectory SNAPSHOT/EDGES/month/
    5. orbitaal-snapshot-day.tar.gz:

      • The root directory is SNAPSHOT/
      • Contains the daily-resolution snapshot networks
      • File format is parquet
      • Name format is orbitaal-snapshot-date-[YYYY]-[MM]-[DD]-file-id-[ID].snappy.parquet, where [YYYY], [MM], and [DD] stand for the corresponding year, month, and day, and [ID] is an integer from 1 to N (the number of files), such that sorting by increasing [ID] also sorts by increasing year, month, and day
      • These files are in the subdirectory SNAPSHOT/EDGES/day/
    6. orbitaal-snapshot-hour.tar.gz:

      • The root directory is SNAPSHOT/
      • Contains the hourly-resolution snapshot networks
      • File format is parquet
      • Name format is orbitaal-snapshot-date-[YYYY]-[MM]-[DD]-[hh]-file-id-[ID].snappy.parquet, where [YYYY], [MM], [DD], and [hh] stand for the corresponding year, month, day, and hour, and [ID] is an integer from 1 to N (the number of files), such that sorting by increasing [ID] also sorts by increasing year, month, day, and hour
      • These files are in the subdirectory SNAPSHOT/EDGES/hour/
    7. orbitaal-nodetable.tar.gz:

      • The root directory is NODE_TABLE/
      • Contains two files in parquet format: the first gives information on the nodes present in the stream graphs and snapshots, such as period of activity and associated global Bitcoin balance; the other contains the list of all associated Bitcoin addresses.
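
    A quick access sketch using the pyspark package mentioned above (paths follow the directory layout described; pandas with pyarrow would work equally well):

        from pyspark.sql import SparkSession

        spark = SparkSession.builder.appName("orbitaal").getOrCreate()

        # Read all yearly snapshot files at once from their subdirectory
        snapshots = spark.read.parquet("SNAPSHOT/EDGES/year/")
        snapshots.show(5)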

    Small samples in CSV format

    1. orbitaal-stream_graph-2016_07_08.csv and orbitaal-stream_graph-2016_07_09.csv

      • These two CSV files are stream graph representations covering the Bitcoin halving of July 2016.
    2. orbitaal-snapshot-2016_07_08.csv and orbitaal-snapshot-2016_07_09.csv

      • These two CSV files are daily snapshot representations covering the Bitcoin halving of July 2016.

  9. datasheet8_Analysis of Stock Price Motion Asymmetry via Visibility-Graph Algorithm.csv

    • datasetcatalog.nlm.nih.gov
    • frontiersin.figshare.com
    Updated Nov 27, 2020
    Cite
    Liu, Ruiyun; Chen, Yu (2020). datasheet8_Analysis of Stock Price Motion Asymmetry via Visibility-Graph Algorithm.csv [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000525575
    Explore at:
    Dataset updated
    Nov 27, 2020
    Authors
    Liu, Ruiyun; Chen, Yu
    Description

    This paper is the first to differentiate between concave and convex price motion trajectories by applying visibility-graph and invisibility-graph algorithms to the analyses of stock indices. Concave and convex indicators for price increase and decrease motions are introduced to characterize accelerated and decelerated stock index increases and decreases. Upon comparing the distributions of these indicators, it is found that asymmetry exists in price motion trajectories and that the degree of asymmetry, which is characterized by the Kullback-Leibler divergence between the distributions of rise and fall indicators, fluctuates after a change in time scope. Moreover, asymmetry in price motion speeds is demonstrated by comparing conditional expected rise and fall returns on the node degrees of visibility and invisibility graphs.

  10. Semantic links between selected CSV datasets harvested by the European Data Portal and the DBpedia knowledge graph

    • eprints.soton.ac.uk
    • explore.openaire.eu
    Updated May 21, 2020
    Cite
    Ibanez, Luis-Daniel (2020). Semantic links between selected CSV datasets harvested by the European Data Portal and the DBpedia knowledge graph [Dataset]. http://doi.org/10.5281/zenodo.3837721
    Explore at:
    Dataset updated
    May 21, 2020
    Dataset provided by
    Zenodo
    Authors
    Ibanez, Luis-Daniel
    Description

    This dataset contains the results of the interlinking process between selected CSV datasets harvested by the European Data Portal and the DBpedia knowledge graph. We aim at answering the following questions:

    • What are the most popular column types? This provides insight into what the datasets hold and how they can be joined, and suggests specific linking schemes that could be applied in future work.
    • What datasets have columns of the same type? This suggests datasets that may be similar or related.
    • What entities appear in most datasets (co-referent entities)? This suggests entities for which more data is published.
    • What datasets share a particular entity? This suggests datasets that may be joined, or that are related through that particular entity.

    Results are provided as augmented tables that contain the columns of the original CSV, plus a metadata file in JSON-LD format. The metadata files can be loaded into an RDF store and queried. Refer to the accompanying report of activities for more details on the methodology and how to query the dataset.
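
    For instance, one of the JSON-LD metadata files can be loaded into an RDF graph with rdflib (a sketch; "metadata.jsonld" is a hypothetical file name):

        from rdflib import Graph

        g = Graph()
        g.parse("metadata.jsonld", format="json-ld")
        print(len(g))  # number of triples parsed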

  11. Further education and skills - Charts

    • explore-education-statistics.service.gov.uk
    Updated Nov 26, 2020
    Cite
    Department for Education (2020). Further education and skills - Charts [Dataset]. https://explore-education-statistics.service.gov.uk/data-catalogue/data-set/a0936643-495a-40a2-8c28-6db363e37fa0
    Explore at:
    Dataset updated
    Nov 26, 2020
    Dataset authored and provided by
    Department for Educationhttp://www.gov.uk/dfe
    License

    Open Government Licence 3.0http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
    License information was derived automatically

    Description

    In progress.

    Explore Education Statistics data set "Charts" from "Further education and skills".

  12. Data from: Evaluating SQuAD-based Question Answering for the Open Research Knowledge Graph Completion

    • data.uni-hannover.de
    csv, json
    Updated Dec 5, 2022
    Cite
    TIB (2022). Evaluating SQuAD-based Question Answering for the Open Research Knowledge Graph Completion [Dataset]. https://data.uni-hannover.de/dataset/evaluating-squad-based-question-answering-for-the-open-research-knowledge-graph-completion
    Explore at:
    csv, jsonAvailable download formats
    Dataset updated
    Dec 5, 2022
    Dataset authored and provided by
    TIB
    License

    Attribution 3.0 (CC BY 3.0)https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Description

    This dataset is part of the bachelor thesis "Evaluating SQuAD-based Question Answering for the Open Research Knowledge Graph Completion". It was created for fine-tuning BERT-based models pre-trained on the SQuAD dataset, using a semi-automatic approach on the ORKG data.

    The dataset.csv file contains the entire data (all properties) in tabular form and is unsplit. The JSON files contain only the fields necessary for training and evaluation, with additional fields (the start and end indices of the answers in the abstracts). The data in the JSON files is split into training and evaluation sets. We created four variants of the training and evaluation sets, one for each of the question labels ("no label", "how", "what", "which").

    For detailed information on each of the fields in the dataset, refer to section 4.2 (Corpus) of the Thesis document that can be found in https://www.repo.uni-hannover.de/handle/123456789/12958.

    The script used to generate the dataset can be found in the public repository https://github.com/as18cia/thesis_work and https://gitlab.com/TIBHannover/orkg/nlp/experiments/orkg-fine-tuning-squad-based-models

  13. NetVotes ENIC Dataset

    • zenodo.org
    • nde-dev.biothings.io
    txt, zip
    Updated Oct 1, 2024
    Cite
    Israel Mendonça; Vincent Labatut; Vincent Labatut; Rosa Figueiredo; Rosa Figueiredo; Israel Mendonça (2024). NetVotes ENIC Dataset [Dataset]. http://doi.org/10.5281/zenodo.6815510
    Explore at:
    zip, txtAvailable download formats
    Dataset updated
    Oct 1, 2024
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Israel Mendonça; Vincent Labatut; Vincent Labatut; Rosa Figueiredo; Rosa Figueiredo; Israel Mendonça
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Description. The NetVote dataset contains the outputs of the NetVote program when applied to voting data coming from VoteWatch (http://www.votewatch.eu/).

    These results were used in the following conference papers:

    1. I. Mendonça, R. Figueiredo, V. Labatut, and P. Michelon, “Relevance of Negative Links in Graph Partitioning: A Case Study Using Votes From the European Parliament,” in 2nd European Network Intelligence Conference, 2015, pp. 122–129. ⟨hal-01176090⟩ DOI: 10.1109/ENIC.2015.25
    2. I. Mendonça, R. Figueiredo, V. Labatut, and P. Michelon, “Informative Value of Negative Links for Graph Partitioning, with an application to European Parliament Votes,” in 6ème Conférence sur les modèles et l'analyse de réseaux : approches mathématiques et informatiques, 2015, p. 12p. ⟨hal-02055158⟩

    Source code. The NetVote source code is available on GitHub: https://github.com/CompNet/NetVotes.

    Citation. If you use our dataset or tool, please cite article [1] above.


    @InProceedings{Mendonca2015,
      author    = {Mendonça, Israel and Figueiredo, Rosa and Labatut, Vincent and Michelon, Philippe},
      title     = {Relevance of Negative Links in Graph Partitioning: A Case Study Using Votes From the {E}uropean {P}arliament},
      booktitle = {2\textsuperscript{nd} European Network Intelligence Conference ({ENIC})},
      year      = {2015},
      pages     = {122-129},
      address   = {Karlskrona, SE},
      publisher = {IEEE Publishing},
      doi       = {10.1109/ENIC.2015.25},
    }

    -------------------------

    Details. This archive contains the following folders:

    • `votewatch_data`: the raw data extracted from the VoteWatch website.
      • `VoteWatch Europe European Parliament, Council of the EU.csv`: list of the documents voted during the considered term, with some details such as the date and topic.
      • `votes_by_document`: this folder contains a collection of CSV files, each one describing the outcome of the vote session relatively to one specific document.
      • `intermediate_files`: this folder contains several CSV files:
        • `allvotes.csv`: concatenation of all vote outcomes for all documents and all MEPS. Can be considered as a compact representation of the data contained in the folder `votes_by_document`.
        • `loyalty.csv`: same as allvotes.csv, but for loyalty (i.e. whether or not the MEP voted like the majority of the MEPs in their political group).
        • `MPs.csv`: list of the MEPs having voted at least once in the considered term, with their details.
        • `policies.csv`: list of the topics considered during the term.
        • `qtd_docs.csv`: list of the topics with the corresponding number of documents.
    • `parallel_ils_results`: contains the raw results of the ILS tool. This is an external algorithm able to estimate the optimal partition of the network nodes in terms of structural balance. It was applied to all the networks extracted by our scripts (from the VoteWatch data), and the produced files were placed here for postprocessing. Each subfolder corresponds to one of the topic-year pairs.
    • `output_files`: contains the files produced by our scripts.
      • `agreement`: histograms representing the distributions of agreement and rebellion indices. Each subfolder corresponds to a specific topic.
      • `community_algorithms_csv`: performances obtained by the partitioning algorithms (for both community detection and correlation clustering). Each subfolder corresponds to a specific topic.
        • `xxxx_cluster_information.csv`: table containing several variants of the imbalance measure, for the considered algorithms.
      • `community_algorithms_results`: comparison of the partitions detected by the various algorithms considered, and distribution of the cluster/community sizes. Each subfolder corresponds to a specific topic.
        • `xxxx_cluster_comparison.csv`: table comparing the partitions detected by the community detection algorithms, in terms of Rand index and other measures.
        • `xxxx_ils_cluster_comparison.csv`: like `xxxx_cluster_comparison.csv`, except we compare the partitions of the community detection algorithms with that of the ILS.
        • `xxxx_yyyy_distribution.pdf`: histogram of the community (or cluster) sizes detected by algorithm `yyyy`.
      • `graphs`: the networks extracted from the vote data. Each subfolder corresponds to a specific topic.
        • `xxxx_complete_graph.graphml`: network in GraphML format, with all the information: nodes, edges, nodal attributes (including communities), weights, etc.
        • `xxxx_edges_Gephi.csv`: only the links, with their weights (i.e. vote similarity).
        • `xxxx_graph.g`: network in the g format (for ILS).
        • `xxxx_net_measures.csv`: table containing some stats on the network (number of links, etc.).
        • `xxxx_nodes_Gephi.csv`: list of nodes (i.e. MEPs), with details.
      • `plots`: synthesis plots from the paper.
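
    The GraphML networks can be opened directly in Python, e.g. with networkx (a sketch; `xxxx` is the topic placeholder used above):

        import networkx as nx

        g = nx.read_graphml("xxxx_complete_graph.graphml")
        print(g.number_of_nodes(), g.number_of_edges())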

    -------------------------

    License. These data are shared under a Creative Commons 0 license.

    Contact. Vincent Labatut <vincent.labatut@univ-avignon.fr> & Rosa Figueiredo <rosa.figueiredo@univ-avignon.fr>

  14. Datasets for manuscript: ADAM: A Web Platform for Graph-Based Modeling and Optimization of Supply Chains

    • catalog.data.gov
    Updated Sep 18, 2022
    Cite
    U.S. EPA Office of Research and Development (ORD) (2022). Datasets for manuscript: ADAM: A Web Platform for Graph-Based Modeling and Optimization of Supply Chains [Dataset]. https://catalog.data.gov/dataset/datasets-for-manuscript-adam-a-web-platform-for-graph-based-modeling-and-optimization-of-s
    Explore at:
    Dataset updated
    Sep 18, 2022
    Dataset provided by
    United States Environmental Protection Agencyhttp://www.epa.gov/
    Description

    ADAM-Data-Repository

    This repository contains all the data needed to run the case studies for the ADAM manuscript.

    Biogas production

    The directory "biogas" contains all data for the biogas production case studies (Figs 13 and 14). Specifically, "biogas/biogas_x" contains the data files for the scenario where "x" is the corresponding Renewable Energy Certificates (RECs) value.

    Plastic waste recycling

    The directory "plastic_waste" contains all data for the plastic waste recycling case studies (Figs 15 and 16). Different scenarios share the same supply, technology site, and technology candidate data, as specified by the "csv" files under "plastic_waste". Each scenario has a different demand data file, which is contained in "plastic_waste/Elec_price" and "plastic_waste/PET_price".

    How to run the case studies

    In order to run the case studies, one can create a new model in ADAM and upload the appropriate CSV file at each step (e.g. upload biogas/biogas_0/supplydata197.csv in step 2, where supply data are specified). This dataset is associated with the following publication: Hu, Y., W. Zhang, P. Tominac, M. Shen, D. Göreke, E. Martín-Hernández, M. Martín, G.J. Ruiz-Mercado, and V.M. Zavala. ADAM: A web platform for graph-based modeling and optimization of supply chains. COMPUTERS AND CHEMICAL ENGINEERING. Elsevier Science Ltd, New York, NY, USA, 165: 107911, (2022).

  15. graphe final 30 nov Links.csv

    • figshare.com
    txt
    Updated Jan 31, 2016
    Cite
    Jean-christophe Plantin (2016). graphe final 30 nov Links.csv [Dataset]. http://doi.org/10.6084/m9.figshare.2069058.v1
    Explore at:
    txtAvailable download formats
    Dataset updated
    Jan 31, 2016
    Dataset provided by
    Figsharehttp://figshare.com/
    figshare
    Authors
    Jean-christophe Plantin
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Graph of radiation monitoring websites, created with Gephi.

  16. Data from: Correction of preferred-orientation induced distortion in cryo-electron microscopy maps

    • search.dataone.org
    • data.niaid.nih.gov
    Updated Aug 1, 2025
    Cite
    Weili Cao; Dongjie Zhu; Xinzheng Zhang (2025). Correction of preferred-orientation induced distortion in cryo-electron microscopy maps [Dataset]. http://doi.org/10.5061/dryad.73n5tb354
    Explore at:
    Dataset updated
    Aug 1, 2025
    Dataset provided by
    Dryad Digital Repository
    Authors
    Weili Cao; Dongjie Zhu; Xinzheng Zhang
    Description

    Reconstruction maps of cryo-electron microscopy (cryo-EM) exhibit distortion when the cryo-EM dataset is incomplete, usually caused by unevenly distributed orientations. Prior efforts had attempted to address this preferred-orientation problem using tilt-collection strategies, or modifications to grids or to air-water interfaces. However, these approaches often require time-consuming experiments and the effect was always protein dependent. Here, we developed a procedure comprising removal of mis-aligned particles and an iterative reconstruction method based on the signal-to-noise ratio of Fourier components to correct such distortion by recovering missing data using a purely computational algorithm. This procedure, called Signal-to-Noise Ratio Iterative Reconstruction Method (SIRM), was applied on incomplete datasets of various proteins to fix distortion in cryo-EM maps and to reach a more isotropic resolution. In addition, SIRM provides a better reference map for further reconstruction refinements, ...

    SIRM: Open Source Data

    We have submitted the original chart files (.csv) and density maps (.mrc) related to the images in the article "Correction of preferred-orientation induced distortion in cryo-electron microscopy maps"

    Descriptions

    SIRM_Fig1_csv.rar

    • Fig1A.csv: the CSV file corresponding to the histogram of Euler angle deviations in particle sets with different degrees of missing wedge under conventional refinement.
    • Fig1B.csv: the CSV file corresponding to the histogram of Euler angle deviations in particle sets with different degrees of missing wedge after the cross-validation process.

    SIRM_Fig3_MRC_map.rar

    • groundTruth.mrc: The ground-truth density map of Apo-ferritin without missing cone, corresponding to Fig.3A.
    • loss70_sameParticleNum.mrc: The MC-35 density map of Apo-ferritin, corresponding to Fig.3D.
    • loss70_sameParticleNum_SIRM.mrc: The MC-35 density map of Apo-ferritin after SIRM processing, corresponding to Fig.3G.
    • *loss80_samePa...
  17. Is Call Graph Pruning Really Effective? An Empirical Re-evaluation

    • zenodo.org
    bin, zip
    Updated Sep 26, 2025
    Cite
    Anonymous; Anonymous (2025). Is Call Graph Pruning Really Effective? An Empirical Re-evaluation [Dataset]. http://doi.org/10.5281/zenodo.17204367
    Explore at:
    zip, binAvailable download formats
    Dataset updated
    Sep 26, 2025
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Anonymous; Anonymous
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jul 17, 2025
    Description

    This artifact contains the dataset, results, and source code associated with the paper. It is divided into three archives:

    datasets.zip

    This archive includes the datasets produced by this study.

    Directory contents:

    • NJR1
      • Automatic
        • list of programs as well as false and true labels.
      • Manual
        • manual.csv file (final labels used in the paper)
        • recorded_evidence.csv (contains call chain and allocation trace evidence)
    • Xcorpus
      • Automatic
        • list of programs as well as false and true labels.
      • Manual
        • list of programs
          • manual.csv file (final labels used in the paper)
          • recorded_evidence.csv (contains call chain and allocation trace evidence)

    experimental_data.zip

    This archive includes all the generated data in this study.

    Directory contents:

    • generated_cgs/ – Automatically generated static call graphs and their associated labels.

    • feature_vectors/ – Structured and token-based features extracted using pre-trained CodeBERT and CodeT5 models.

    • ML_results/ – Contains all output files, including final results and plots used in the paper.

    source_code.zip

    This archive includes all scripts used to generate the dataset and conduct experiments.

    Directory contents:

    • static_cg_generation/ – Scripts for running WALA, DOOP, and OPAL with multiple configurations to generate static call graphs. Each tool’s settings can be found under its config/ subdirectory.

    • dataset_generation/ – Scripts for dataset construction:

      • manual_sampling/ – Stratified sampling of call graph edges.

      • semantic_features/ – Extraction of raw and fine-tuned semantic features.

      • structured_features/ – Generation of structured graph features.

    • approach/ – Machine learning experiments and evaluation pipelines described in the paper.

    • paper/ – Scripts used to generate plots and visualizations presented in the paper.

    Each directory includes a README file explaining its structure and usage.

    Configurations.xlsx

    This file contains the configurations we used for each tool in this study.

    File contents:

    • WALA_full_configuration : all the selected configurations for WALA.

    • Doop_full_configuration : all the selected configurations for Doop.

    • Opal_full_configuration : all the selected configurations for Opal.

    • WALA_partial_order : list of pairs of configurations we used to generate false labels using partial orders for WALA.
    • Doop_partial_order : list of pairs of configurations we used to generate false labels using partial orders for Doop.

    This artifact enables full reproducibility of the dataset creation, feature extraction, and experimental results discussed in the paper.

  18. Link-prediction on Biomedical Knowledge Graphs

    • nde-dev.biothings.io
    • data.niaid.nih.gov
    Updated Jun 25, 2024
    Cite
    Cattaneo, Alberto (2024). Link-prediction on Biomedical Knowledge Graphs [Dataset]. https://nde-dev.biothings.io/resources?id=zenodo_12097376
    Explore at:
    Dataset updated
    Jun 25, 2024
    Dataset provided by
    Martynec, Thomas
    Bonner, Stephen
    Justus, Daniel
    Cattaneo, Alberto
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Release of the experimental data from the paper Towards Linking Graph Topology to Model Performance for Biomedical Knowledge Graph Completion (accepted at Machine Learning for Life and Material Sciences workshop @ ICML2024).

    Knowledge Graph Completion has been increasingly adopted as a useful method for several tasks in biomedical research, like drug repurposing or drug-target identification. To that end, a variety of datasets and Knowledge Graph Embedding models have been proposed over the years. However, little is known about the properties that render a dataset useful for a given task and, even though the theoretical properties of Knowledge Graph Embedding models are well understood, their practical utility in this field remains controversial. We conduct a comprehensive investigation into the topological properties of publicly available biomedical Knowledge Graphs and establish links to the accuracy observed in real-world applications. By releasing all model predictions we invite the community to build upon our work and continue improving the understanding of these crucial applications.

    Experiments were conducted on six datasets: five from the biomedical domain (Hetionet, PrimeKG, PharmKG, OpenBioLink2020 HQ, PharMeBINet) and one trivia KG (FB15k-237). All datasets were randomly split into training, validation and test set (80% / 10% / 10%; in the case of PharMeBINet, 99.3% / 0.35% / 0.35% to mitigate the increased inference cost on the larger dataset).

    On each dataset, four different KGE models were compared: TransE, DistMult, RotatE, TripleRE. Hyperparameters were tuned on the validation split and we release results for tail predictions on the test split. In particular, each test query (h,r,?) is scored against all entities in the KG and we compute the rank of the score of the correct completion (h,r,t), after masking out the scores of other (h,r,t') triples contained in the graph.

    Note: the ranks provided are computed as the average between the optimistic and pessimistic ranks of triple scores.

    Inside experimental_data.zip, the following files are provided for each dataset:

    {dataset}_preprocessing.ipynb: a Jupyter notebook for downloading and preprocessing the dataset. In particular, this generates the custom label->ID mapping for entities and relations, and the numerical tensor of (h_ID,r_ID,t_ID) triples for all edges in the graph, which can be used to compute graph topological metrics (e.g., using kg-topology-toolbox) and compare them with the edge prediction accuracy.

    test_ranks.csv: csv table with columns ["h", "r", "t"] specifying the head, relation, tail IDs of the test triples, and columns ["DistMult", "TransE", "RotatE", "TripleRE"] with the rank of the ground-truth tail in the ordered list of predictions made by the four models;

    entity_dict.csv: the list of entity labels, ordered by entity ID (as generated in the preprocessing notebook);

    relation_dict.csv: the list of relation labels, ordered by relation ID (as generated in the preprocessing notebook).
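
    For example, the released ranks can be turned into the usual MRR and Hits@k metrics (a sketch assuming pandas and the columns listed above):

        import pandas as pd

        ranks = pd.read_csv("test_ranks.csv")
        for model in ["DistMult", "TransE", "RotatE", "TripleRE"]:
            mrr = (1.0 / ranks[model]).mean()
            hits10 = (ranks[model] <= 10).mean()
            print(f"{model}: MRR={mrr:.3f}, Hits@10={hits10:.3f}")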

    The separate top_100_tail_predictions.zip archive contains, for each of the test queries in the corresponding test_ranks.csv table, the IDs of the top-100 tail predictions made by each of the four KGE models, ordered by decreasing likelihood. The predictions are released in a .npz archive of numpy arrays (one array of shape (n_test_triples, 100) for each of the KGE models).

    All experiments (training and inference) have been run on Graphcore IPU hardware using the BESS-KGE distribution framework.

  19. Further education and skills - Underlying Charts Data

    • explore-education-statistics.service.gov.uk
    Updated Nov 28, 2024
    Cite
    Department for Education (2024). Further education and skills - Underlying Charts Data [Dataset]. https://explore-education-statistics.service.gov.uk/data-catalogue/data-set/c0579bf7-96fd-4771-9034-e8642b529114
    Explore at:
    Dataset updated
    Nov 28, 2024
    Dataset authored and provided by
    Department for Educationhttp://www.gov.uk/dfe
    License

    Open Government Licence 3.0http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
    License information was derived automatically

    Description

    Historical time series of headline adult (19+) further education and skills learner participation, containing breakdowns by provision type and in some cases level. Also includes some all-age apprenticeship participation figures.

    Academic years: 2005/06 to 2023/24 (full academic years)
    Indicators: Participation
    Filters: Provision type, Age group, Level

  20. All trait data - Datasets - OpenData.eol.org

    • opendata.eol.org
    Updated Jun 5, 2019
    Cite
    eol.org (2019). All trait data - Datasets - OpenData.eol.org [Dataset]. https://opendata.eol.org/dataset/all-trait-data-large
    Explore at:
    Dataset updated
    Jun 5, 2019
    Dataset provided by
    Encyclopedia of Lifehttp://eol.org/
    Description

    This zip archive records all of the trait records in EOL's graph database. It contains five .csv files: pages.csv, listing taxa and their names; traits.csv, with trait records; metadata.csv, with auxiliary records referred to by trait records; inferred.csv (see below); and terms.csv, listing all of the relationship URIs in the database. For a description of the schema, see https://github.com/EOL/eol_website/blob/master/doc/trait-schema.md

    inferred.csv lists additional taxa to which a trait record applies by taxonomic inference, in addition to the ancestral taxon to which it is attached. For instance, the record describing locomotion=flight for Aves is also inferred to apply to most of the descendants of Aves, except for any flightless subclades that are excluded from the inference pattern. All the trait records referred to in the 2nd column of the inferred file have full records available in the traits file.

    THIS RESOURCE IS UPDATED MONTHLY. It is not archived regularly. Please save your download if you want to be able to refer to it at a later date.
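
    A minimal sketch for joining trait records with taxon names (assuming pandas; the shared page_id column is an assumption based on the schema document linked above):

        import pandas as pd

        pages = pd.read_csv("pages.csv")
        traits = pd.read_csv("traits.csv")

        # Attach taxon names to trait records
        merged = traits.merge(pages, on="page_id", how="left")
        print(merged.head())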
