Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Note: none of the data sets published here contain actual data; they are for testing purposes only.
This data repository contains graph datasets, where each graph is represented by two CSV files: one for node information and another for edge details. To link the files to the same graph, their names include a common identifier based on the number of nodes. For example:
dataset_30_nodes_interactions.csv: contains 30 rows (nodes).
dataset_30_edges_interactions.csv: contains 47 rows (edges).
dataset_30 refers to the same graph.
Each node file contains the following columns:
| Name of the Column | Type | Description |
| --- | --- | --- |
| UniProt ID | string | protein identification |
| label | string | protein label (type of node) |
| properties | string | a dictionary containing properties related to the protein |
Each edge file contains the following columns:
| Name of the Column | Type | Description |
| --- | --- | --- |
| Relationship ID | string | relationship identification |
| Source ID | string | identification of the source protein in the relationship |
| Target ID | string | identification of the target protein in the relationship |
| label | string | relationship label (type of relationship) |
| properties | string | a dictionary containing properties related to the relationship |
| Graph | Number of Nodes | Number of Edges | Sparse graph |
| --- | --- | --- | --- |
| dataset_30* | 30 | 47 | Y |
| dataset_60* | 60 | 181 | Y |
| dataset_120* | 120 | 689 | Y |
| dataset_240* | 240 | 2819 | Y |
| dataset_300* | 300 | 4658 | Y |
| dataset_600* | 600 | 18004 | Y |
| dataset_1200* | 1200 | 71785 | Y |
| dataset_2400* | 2400 | 288600 | Y |
| dataset_3000* | 3000 | 449727 | Y |
| dataset_6000* | 6000 | 1799413 | Y |
| dataset_12000* | 12000 | 7199863 | Y |
| dataset_24000* | 24000 | 28792361 | Y |
| dataset_30000* | 30000 | 44991744 | Y |
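A pair of these CSV files can be loaded directly into a graph object. Below is a minimal sketch using pandas and networkx, assuming the file headers match the column names in the schema tables above (a directed graph is assumed, since edges distinguish source and target proteins).

```python
import pandas as pd
import networkx as nx

# Load the node and edge files of the 30-node example graph
nodes = pd.read_csv("dataset_30_nodes_interactions.csv")
edges = pd.read_csv("dataset_30_edges_interactions.csv")

# Build a directed graph keyed by UniProt ID; attribute names follow the schemas above
G = nx.DiGraph()
for _, row in nodes.iterrows():
    G.add_node(row["UniProt ID"], label=row["label"], properties=row["properties"])
for _, row in edges.iterrows():
    G.add_edge(row["Source ID"], row["Target ID"],
               id=row["Relationship ID"], label=row["label"],
               properties=row["properties"])

print(G.number_of_nodes(), G.number_of_edges())  # expected: 30 47
```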
This repository includes two additional tiny graph datasets for experimenting before dealing with the larger datasets.
Each node file contains the following columns:
| Name of the Column | Type | Description |
| --- | --- | --- |
| ID | string | node identification |
| label | string | node label (type of node) |
| properties | string | a dictionary containing properties related to the node |
Each edge file contains the following columns:
| Name of the Column | Type | Description |
| --- | --- | --- |
| ID | string | relationship identification |
| source | string | identification of the source node in the relationship |
| target | string | identification of the target node in the relationship |
| label | string | relationship label (type of relationship) |
| properties | string | a dictionary containing properties related to the relationship |
| Graph | Number of Nodes | Number of Edges | Sparse graph |
| --- | --- | --- | --- |
| dataset_dummy* | 3 | 6 | N |
| dataset_dummy2* | 3 | 6 | N |
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
This dataset demonstrates how to fuse Large Language Model (LLM) generated embeddings with Graph Neural Networks (GNNs) for learning on tabular graph data.
sample_nodes.csv – Node features including ID, category, and description text
sample_edges.csv – Edge list (source, target, weight)
sample_augmented_nodes.csv – Node features + LLM-generated embeddings (simulated)
GNN_LLM_Hybrid_Baseline.ipynb – Main baseline model using PyTorch Geometric
CSV_Processing_1.ipynb – Basic loading and EDA of nodes/edges
CSV_Processing_2.ipynb – Preview of LLM-augmented node features
This is a synthetic dataset. For real-world use:
- Replace the "LLM embeddings" with outputs from OpenAI / Mistral / HuggingFace models
- Extend the node descriptions with actual context or domain-specific text
- Scale to real-world graphs or use with competition tabular datasets
A minimal graph-construction sketch is given below.
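As a starting point, the node and edge tables can be assembled into a PyTorch Geometric Data object. The sketch below is illustrative: the column names ID, source, target, and weight follow the file descriptions above, while the emb_ prefix for the embedding columns is an assumption to adjust to the actual header.

```python
import pandas as pd
import torch
from torch_geometric.data import Data

nodes = pd.read_csv("sample_augmented_nodes.csv")
edges = pd.read_csv("sample_edges.csv")

# Embedding columns are assumed to share an "emb" prefix; adjust to the real header
emb_cols = [c for c in nodes.columns if c.startswith("emb")]
x = torch.tensor(nodes[emb_cols].values, dtype=torch.float)

# Map raw node IDs to the contiguous indices expected by edge_index
idx = {nid: i for i, nid in enumerate(nodes["ID"])}
edge_index = torch.tensor(
    [[idx[s] for s in edges["source"]], [idx[t] for t in edges["target"]]],
    dtype=torch.long,
)
edge_weight = torch.tensor(edges["weight"].values, dtype=torch.float)

data = Data(x=x, edge_index=edge_index, edge_weight=edge_weight)
print(data)
```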
CC0 1.0 https://creativecommons.org/publicdomain/zero/1.0/
PubMed Knowledge Graph Datasets http://er.tacc.utexas.edu/datasets/ped
Dataset Name: PKG2020S4 (1781-Dec. 2020), Version 4. This PKG version updates the previous release with the PubMed 2021 baseline files and PubMed daily update files (up to Jan. 4th, 2021), and adds extracted bio-entities, author disambiguation results, extended author information, Scimago journal information, and WOS citations, which contain reference relations between PMIDs and their referenced PMIDs as extracted from Web of Science.
Database Features: 1-PKG2020S4 (1781-Dec. 2020) Features.pdf (https://web.corral.tacc.utexas.edu/dive_datasets/PKG2020S4/PKG2020S4_MySQL/1-PKG2020S4%20(1781-Dec.%202020)%20Features.pdf)
Database Description: 2-PKG2020S4 (1781-Dec. 2020) Database Description.pdf (https://web.corral.tacc.utexas.edu/dive_datasets/PKG2020S4/PKG2020S4_MySQL/2-PKG2020S4%20(1781-Dec.%202020)%20Database%20Description.pdf)
CC0 1.0 https://creativecommons.org/publicdomain/zero/1.0/
The awesome datasets graph is a Neo4j graph database which catalogs and classifies datasets and data sources as scraped from the Awesome Public Datasets GitHub list.
We started with a simple list of links on the Awesome Public Datasets page. We now have a semantic graph database with ten labels, five relationship types, nine property keys, and more than 400 nodes, all within a 1 MB database footprint. All database operations are query driven using the powerful and flexible Cypher graph query language.
The download includes CSV files which were created as an interim step after scraping and wrangling the source. The download also includes a working Neo4j Graph Database. Login: neo4j | Password: demo.
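Since the bundled database is queried with Cypher, a quick way to explore it from Python is the official neo4j driver. This is a minimal sketch assuming a default local installation exposing the bolt protocol; it uses the documented neo4j/demo credentials.

```python
from neo4j import GraphDatabase

# Connect with the documented credentials; the bolt URL assumes a local default setup
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "demo"))

with driver.session() as session:
    # Count nodes per label, a quick sanity check against the ~400 nodes described above
    result = session.run(
        "MATCH (n) RETURN labels(n) AS label, count(*) AS nodes ORDER BY nodes DESC"
    )
    for record in result:
        print(record["label"], record["nodes"])

driver.close()
```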
Data scraped from Awesome Public Datasets page. Prepared for the book Data Science Solutions.
While we have done basic data wrangling and preparation, how can this graph prove useful for your data science workflow? Can we record the decisions taken across the stages of a data science project, and trace how the cataloged data sources, datasets, and tools inform those decisions in reaching data science solution strategies?
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This dataset is in CSV format and consists of two UTF-8 encoded files: nodes.csv and relationships.csv. In use, the two files can be imported into graph database tools such as Neo4j to form a visual knowledge graph for further research.
nodes.csv contains 26771 traditional Chinese medicine entities extracted from five ancient TCM texts: 伤寒论, 伤寒类方, 伤寒悬解, 伤寒论浅注, and 伤寒九十论. Each row represents an entity; the first column is the entity ID, the second column is the entity name, and the third column is the entity type.
relationships.csv contains 8272 relationship triplets between entities in nodes.csv. Each row represents one relationship; the first column is the head entity ID, the second column is the tail entity ID, and the third column is the relationship type.
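Before importing into Neo4j, the two files can be sanity-checked in Python. A minimal sketch, assuming the columns appear in the order described above and that the files have no header row (drop the names argument if they do):

```python
import pandas as pd
import networkx as nx

# Column names are assumptions based on the column order described above
nodes = pd.read_csv("nodes.csv", encoding="utf-8",
                    names=["entity_id", "entity_name", "entity_type"])
rels = pd.read_csv("relationships.csv", encoding="utf-8",
                   names=["head_id", "tail_id", "relation_type"])

G = nx.DiGraph()
for _, r in nodes.iterrows():
    G.add_node(r["entity_id"], name=r["entity_name"], type=r["entity_type"])
for _, r in rels.iterrows():
    G.add_edge(r["head_id"], r["tail_id"], relation=r["relation_type"])

print(G)  # expected: 26771 nodes, 8272 edges
```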
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Wikipedia temporal graph.
The dataset is based on two Wikipedia SQL dumps: (1) English language articles and (2) user visit counts per page per hour (aka pagecounts). The original datasets are publicly available on the Wikimedia website.
The static graph structure is extracted from English-language Wikipedia articles. Redirects are removed. Before building the Wikipedia graph we introduce thresholds on the minimum number of visits per hour and the maximum in-degree. We remove the pages that have fewer than 500 visits per hour at least once during the specified period. In addition, we remove the nodes (pages) with in-degree higher than 8 000 to build a more meaningful initial graph. After cleaning, the graph contains 116 016 nodes (out of 4 856 639 total pages) and 6 573 475 edges. The graph can be imported in two ways: (1) using edges.csv and vertices.csv, or (2) using the enwiki-20150403-graph.gt file, which can be opened with the open-source Python library graph-tool.
The time-series data contains users' visit counts from 02:00, 23 September 2014 until 23:00, 30 April 2015. The total number of hours is 5278. The data is stored in two formats: CSV and H5. The CSV file contains data in the format [page_id :: count_views :: layer], where layer represents an hour. In the H5 file, each layer corresponds to an hour as well.
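Import option (2) is a one-liner with graph-tool. The sketch below also shows reading the CSV time series; the file name and the "::" separator are assumptions based on the format description above.

```python
import pandas as pd
import graph_tool.all as gt

# Open the prebuilt graph file (import option 2 above)
g = gt.load_graph("enwiki-20150403-graph.gt")
print(g.num_vertices(), g.num_edges())  # expected: 116016 and 6573475

# Hourly visit counts; file name and separator are assumptions
counts = pd.read_csv("pagecounts.csv", sep="::", engine="python",
                     names=["page_id", "count_views", "layer"])
print(counts.head())
```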
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Categorical scatterplots with R for biologists: a step-by-step guide
Benjamin Petre1, Aurore Coince2, Sophien Kamoun1
1 The Sainsbury Laboratory, Norwich, UK; 2 Earlham Institute, Norwich, UK
Weissgerber and colleagues (2015) recently stated that ‘as scientists, we urgently need to change our practices for presenting continuous data in small sample size studies’. They called for more scatterplot and boxplot representations in scientific papers, which ‘allow readers to critically evaluate continuous data’ (Weissgerber et al., 2015). In the Kamoun Lab at The Sainsbury Laboratory, we recently implemented a protocol to generate categorical scatterplots (Petre et al., 2016; Dagdas et al., 2016). Here we describe the three steps of this protocol: 1) formatting of the data set in a .csv file, 2) execution of the R script to generate the graph, and 3) export of the graph as a .pdf file.
Protocol
• Step 1: format the data set as a .csv file. Store the data in a three-column Excel file as shown in the PowerPoint slide. The first column ‘Replicate’ indicates the biological replicates. In the example, the month and year during which the replicate was performed is indicated. The second column ‘Condition’ indicates the conditions of the experiment (in the example, a wild type and two mutants called A and B). The third column ‘Value’ contains the continuous values. Save the Excel file as a .csv file (File -> Save as -> in ‘File Format’, select .csv). This .csv file is the input file to import into R.
• Step 2: execute the R script (see Notes 1 and 2). Copy the script shown in the PowerPoint slide and paste it into the R console. Execute the script. In the dialog box, select the input .csv file from step 1. The categorical scatterplot will appear in a separate window. Dots represent the values for each sample; colors indicate replicates. Boxplots are superimposed; black dots indicate outliers.
• Step 3: save the graph as a .pdf file. Shape the window at your convenience and save the graph as a .pdf file (File -> Save as). See the PowerPoint slide for an example.
Notes
• Note 1: install the ggplot2 package. The R script requires the package ‘ggplot2’ to be installed. To install it, Packages & Data -> Package Installer -> enter ‘ggplot2’ in the Package Search space and click on ‘Get List’. Select ‘ggplot2’ in the Package column and click on ‘Install Selected’. Install all dependencies as well.
• Note 2: use a log scale for the y-axis. To use a log scale for the y-axis of the graph, use the command line below in place of command line #7 in the script.
graph + geom_boxplot(outlier.colour='black', colour='black') + geom_jitter(aes(col=Replicate)) + scale_y_log10() + theme_bw()
References
Dagdas YF, Belhaj K, Maqbool A, Chaparro-Garcia A, Pandey P, Petre B, et al. (2016) An effector of the Irish potato famine pathogen antagonizes a host autophagy cargo receptor. eLife 5:e10856.
Petre B, Saunders DGO, Sklenar J, Lorrain C, Krasileva KV, Win J, et al. (2016) Heterologous Expression Screens in Nicotiana benthamiana Identify a Candidate Effector of the Wheat Yellow Rust Pathogen that Associates with Processing Bodies. PLoS ONE 11(2):e0149035.
Weissgerber TL, Milic NM, Winham SJ, Garovic VD (2015) Beyond Bar and Line Graphs: Time for a New Data Presentation Paradigm. PLoS Biol 13(4):e1002128.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset captures the temporal network of Bitcoin (BTC) flows exchanged between entities at the finest time resolution, in UNIX timestamps. Its construction is based on the blockchain, covering the period from January 3rd, 2009 to January 25th, 2021. The blockchain extraction was made using the bitcoin-etl (https://github.com/blockchain-etl/bitcoin-etl) Python package. The entity-entity network is built by aggregating Bitcoin addresses using the common-input heuristic [1], as well as popular Bitcoin users' addresses provided by https://www.walletexplorer.com/
[1] M. Harrigan and C. Fretter, "The Unreasonable Effectiveness of Address Clustering," 2016 Intl IEEE Conferences on Ubiquitous Intelligence & Computing, Advanced and Trusted Computing, Scalable Computing and Communications, Cloud and Big Data Computing, Internet of People, and Smart World Congress (UIC/ATC/ScalCom/CBDCom/IoP/SmartWorld), Toulouse, France, 2016, pp. 368-373, doi: 10.1109/UIC-ATC-ScalCom-CBDCom-IoP-SmartWorld.2016.0071.
Bitcoin Activity Temporal Coverage: From 03 January 2009 to 25 January 2021
This dataset provides a comprehensive representation of Bitcoin exchanges between entities over a significant temporal span, spanning from the inception of Bitcoin to recent years. It encompasses various temporal resolutions and representations to facilitate Bitcoin transaction network analysis in the context of temporal graphs.
All dates were derived from block UNIX timestamps, using the GMT timezone.
The dataset is distributed across the following compressed archives:
All data are stored in the Apache Parquet file format, a columnar storage format optimized for analytical queries. It can be read with the pyspark Python package, as sketched after the archive list below.
orbitaal-stream_graph.tar.gz:
orbitaal-snapshot-all.tar.gz:
orbitaal-snapshot-year.tar.gz:
orbitaal-snapshot-month.tar.gz:
orbitaal-snapshot-day.tar.gz:
orbitaal-snapshot-hour.tar.gz:
orbitaal-nodetable.tar.gz:
Small samples in CSV format
orbitaal-stream_graph-2016_07_08.csv and orbitaal-stream_graph-2016_07_09.csv
orbitaal-snapshot-2016_07_08.csv and orbitaal-snapshot-2016_07_09.csv
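As noted above, the Parquet files can be read with pyspark once an archive is extracted. A minimal sketch; the directory path is illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("orbitaal").getOrCreate()

# Point this at the directory extracted from, e.g., orbitaal-snapshot-day.tar.gz
snapshots = spark.read.parquet("orbitaal-snapshot-day/")
snapshots.printSchema()
snapshots.show(5)
```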
This paper is the first to differentiate between concave and convex price motion trajectories by applying visibility-graph and invisibility-graph algorithms to the analysis of stock indices. Concave and convex indicators for price increase and decrease motions are introduced to characterize accelerated and decelerated stock index increases and decreases. Upon comparing the distributions of these indicators, it is found that asymmetry exists in price motion trajectories and that the degree of asymmetry, characterized by the Kullback-Leibler divergence between the distributions of rise and fall indicators, fluctuates as the time scope changes. Moreover, asymmetry in price motion speeds is demonstrated by comparing conditional expected rise and fall returns on the node degrees of visibility and invisibility graphs.
This dataset contains the results of the interlinking process between selected CSV datasets harvested by the European Data Portal and the DBpedia knowledge graph. We aim to answer the following questions:
- What are the most popular column types? This provides insight into what the datasets hold and how they can be joined, and suggests specific linking schemes that could be applied in future work.
- What datasets have columns of the same type? This suggests datasets that may be similar or related.
- What entities appear in most datasets (co-referent entities)? This suggests entities for which more data is published.
- What datasets share a particular entity? This suggests datasets that may be joined, or are related through that particular entity.
Results are provided as augmented tables that contain the columns of the original CSV, plus a metadata file in JSON-LD format. The metadata files can be loaded into an RDF store and queried; a minimal loading sketch follows. Refer to the accompanying report of activities for more details on the methodology and how to query the dataset.
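For example, a JSON-LD metadata file can be parsed into an in-memory RDF graph with rdflib (version 6+ bundles JSON-LD support); the file name here is illustrative:

```python
from rdflib import Graph

g = Graph()
g.parse("metadata.jsonld", format="json-ld")

# Inspect the triples describing the augmented table and its columns
for subj, pred, obj in g:
    print(subj, pred, obj)
```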
Open Government Licence 3.0 http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
License information was derived automatically
In progress
- Explore Education Statistics data set: Charts from Further education and skills
Attribution 3.0 (CC BY 3.0) https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
This dataset is part of the bachelor thesis "Evaluating SQuAD-based Question Answering for the Open Research Knowledge Graph Completion". It was created for fine-tuning BERT-based models pre-trained on the SQuAD dataset, using a semi-automatic approach on the ORKG data.
The dataset.csv file contains the entire data (all properties) in tabular form and is unsplit. The JSON files contain only the fields necessary for training and evaluation, plus additional fields (the start and end indices of the answers in the abstracts). The data in the JSON files is split into training and evaluation sets. We created four variants of the training and evaluation sets, one for each of the question labels ("no label", "how", "what", "which").
For detailed information on each of the fields in the dataset, refer to section 4.2 (Corpus) of the Thesis document that can be found in https://www.repo.uni-hannover.de/handle/123456789/12958.
The script used to generate the dataset can be found in the public repository https://github.com/as18cia/thesis_work and https://gitlab.com/TIBHannover/orkg/nlp/experiments/orkg-fine-tuning-squad-based-models
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description. The NetVote dataset contains the outputs of the NetVote program when applied to voting data coming from VoteWatch (http://www.votewatch.eu/).
These results were used in the following conference papers:
Source code. The NetVote source code is available on GitHub: https://github.com/CompNet/NetVotes.
Citation. If you use our dataset or tool, please cite article [1] above.
@InProceedings{Mendonca2015,
  author    = {Mendonça, Israel and Figueiredo, Rosa and Labatut, Vincent and Michelon, Philippe},
  title     = {Relevance of Negative Links in Graph Partitioning: A Case Study Using Votes From the {E}uropean {P}arliament},
  booktitle = {2\textsuperscript{nd} European Network Intelligence Conference ({ENIC})},
  year      = {2015},
  pages     = {122-129},
  address   = {Karlskrona, SE},
  publisher = {IEEE Publishing},
  doi       = {10.1109/ENIC.2015.25},
}
-------------------------
Details. This archive contains the following folders:
-------------------------
License. These data are shared under a Creative Commons 0 license.
Contact. Vincent Labatut <vincent.labatut@univ-avignon.fr> & Rosa Figueiredo <rosa.figueiredo@univ-avignon.fr>
ADAM-Data-Repository
This repository contains all the data needed to run the case studies for the ADAM manuscript.
Biogas production: the directory "biogas" contains all data for the biogas production case studies (Figs 13 and 14). Specifically, "biogas/biogas_x" contains the data files for the scenario where "x" is the corresponding Renewable Energy Certificates (RECs) value.
Plastic waste recycling: the directory "plastic_waste" contains all data for the plastic waste recycling case studies (Figs 15 and 16). Different scenarios share the same supply, technology site, and technology candidate data, as specified by the CSV files under "plastic_waste". Each scenario has a different demand data file, contained in "plastic_waste/Elec_price" and "plastic_waste/PET_price".
How to run the case studies: create a new model in ADAM and upload the appropriate CSV file at each step (e.g., upload biogas/biogas_0/supplydata197.csv in step 2, where supply data are specified).
This dataset is associated with the following publication: Hu, Y., W. Zhang, P. Tominac, M. Shen, D. Göreke, E. Martín-Hernández, M. Martín, G.J. Ruiz-Mercado, and V.M. Zavala. ADAM: A web platform for graph-based modeling and optimization of supply chains. Computers and Chemical Engineering, 165: 107911 (2022).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Graph of a radiation monitoring website, realized with Gephi.
Reconstruction maps of cryo-electron microscopy (cryo-EM) exhibit distortion when the cryo-EM dataset is incomplete, usually caused by unevenly distributed orientations. Prior efforts attempted to address this preferred orientation problem using tilt-collection strategies or modifications to grids or to air-water interfaces. However, these approaches often require time-consuming experiments, and the effect was always protein dependent. Here, we developed a procedure, comprising the removal of mis-aligned particles and an iterative reconstruction method based on the signal-to-noise ratio of Fourier components, to correct such distortion by recovering missing data with a purely computational algorithm. This procedure, called the Signal-to-Noise Ratio Iterative Reconstruction Method (SIRM), was applied to incomplete datasets of various proteins to fix distortion in cryo-EM maps and reach a more isotropic resolution. In addition, SIRM provides a better reference map for further reconstruction refinements [...]
SIRM: Open Source Data
We have submitted the original chart files (.csv) and density maps (.mrc) related to the images in the article "Correction of preferred-orientation induced distortion in cryo-electron microscopy maps".
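Both file types open readily in Python. A minimal sketch using the mrcfile package for the density maps and pandas for the chart files; both file names are illustrative:

```python
import mrcfile
import pandas as pd

# Read a released density map; .data is a 3D numpy array of voxel values
with mrcfile.open("example_map.mrc", mode="r") as mrc:
    print(mrc.data.shape, mrc.voxel_size)

# Chart files are plain CSVs
chart = pd.read_csv("example_chart.csv")
print(chart.head())
```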
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This artifact contains the dataset, results, and source code associated with the paper. It is divided into three archives:
This archive includes the datasets produced by this study.
This archive includes all the generated data in this study.
generated_cgs/ – Automatically generated static call graphs and their associated labels.
feature_vectors/ – Structured and token-based features extracted using pre-trained CodeBERT and CodeT5 models.
ML_results/ – Contains all output files, including final results and plots used in the paper.
This archive includes all scripts used to generate the dataset and conduct experiments.
static_cg_generation/ – Scripts for running WALA, DOOP, and OPAL with multiple configurations to generate static call graphs. Each tool’s settings can be found under its config/ subdirectory.
dataset_generation/ – Scripts for dataset construction:
manual_sampling/ – Stratified sampling of call graph edges.
semantic_features/ – Extraction of raw and fine-tuned semantic features.
structured_features/ – Generation of structured graph features.
approach/ – Machine learning experiments and evaluation pipelines described in the paper.
paper/ – Scripts used to generate plots and visualizations presented in the paper.
Each directory includes a README file explaining its structure and usage.
This file contains the configurations we used for each tool in this study.
WALA_full_configuration: all the selected configurations for WALA.
Doop_full_configuration: all the selected configurations for Doop.
This artifact enables full reproducibility of the dataset creation, feature extraction, and experimental results discussed in the paper.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Release of the experimental data from the paper Towards Linking Graph Topology to Model Performance for Biomedical Knowledge Graph Completion (accepted at Machine Learning for Life and Material Sciences workshop @ ICML2024).
Knowledge Graph Completion has been increasingly adopted as a useful method for several tasks in biomedical research, like drug repurposing or drug-target identification. To that end, a variety of datasets and Knowledge Graph Embedding models have been proposed over the years. However, little is known about the properties that render a dataset useful for a given task and, even though the theoretical properties of Knowledge Graph Embedding models are well understood, their practical utility in this field remains controversial. We conduct a comprehensive investigation into the topological properties of publicly available biomedical Knowledge Graphs and establish links to the accuracy observed in real-world applications. By releasing all model predictions we invite the community to build upon our work and continue improving the understanding of these crucial applications.
Experiments were conducted on six datasets: five from the biomedical domain (Hetionet, PrimeKG, PharmKG, OpenBioLink2020 HQ, PharMeBINet) and one trivia KG (FB15k-237). All datasets were randomly split into training, validation and test set (80% / 10% / 10%; in the case of PharMeBINet, 99.3% / 0.35% / 0.35% to mitigate the increased inference cost on the larger dataset).
On each dataset, four different KGE models were compared: TransE, DistMult, RotatE, TripleRE. Hyperparameters were tuned on the validation split, and we release results for tail predictions on the test split. In particular, each test query (h,r,?) is scored against all entities in the KG, and we compute the rank of the score of the correct completion (h,r,t), after masking out scores of other (h,r,t') triples contained in the graph.
Note: the ranks provided are computed as the average between the optimistic and pessimistic ranks of triple scores.
Inside experimental_data.zip, the following files are provided for each dataset:
{dataset}_preprocessing.ipynb: a Jupyter notebook for downloading and preprocessing the dataset. In particular, this generates the custom label->ID mapping for entities and relations, and the numerical tensor of (h_ID,r_ID,t_ID) triples for all edges in the graph, which can be used to compute graph topological metrics (e.g., using kg-topology-toolbox) and compare them with the edge prediction accuracy.
test_ranks.csv: CSV table with columns ["h", "r", "t"] specifying the head, relation, and tail IDs of the test triples, and columns ["DistMult", "TransE", "RotatE", "TripleRE"] with the rank of the ground-truth tail in the ordered list of predictions made by the four models;
entity_dict.csv: the list of entity labels, ordered by entity ID (as generated in the preprocessing notebook);
relation_dict.csv: the list of relation labels, ordered by relation ID (as generated in the preprocessing notebook).
The separate top_100_tail_predictions.zip archive contains, for each of the test queries in the corresponding test_ranks.csv table, the IDs of the top-100 tail predictions made by each of the four KGE models, ordered by decreasing likelihood. The predictions are released in a .npz archive of numpy arrays (one array of shape (n_test_triples, 100) for each of the KGE models).
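Given the released ranks, standard link prediction metrics follow directly. A minimal sketch computing MRR and Hits@10 per model from test_ranks.csv, and opening the top-100 predictions archive (paths assume the archives have been extracted):

```python
import numpy as np
import pandas as pd

ranks = pd.read_csv("test_ranks.csv")

# Mean Reciprocal Rank and Hits@10 for each of the four KGE models
for model in ["DistMult", "TransE", "RotatE", "TripleRE"]:
    r = ranks[model].to_numpy()
    print(f"{model}: MRR={np.mean(1.0 / r):.4f}  Hits@10={np.mean(r <= 10):.4f}")

# One (n_test_triples, 100) array per model, ordered by decreasing likelihood
preds = np.load("top_100_tail_predictions.npz")
print(preds.files)
```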
All experiments (training and inference) have been run on Graphcore IPU hardware using the BESS-KGE distribution framework.
Open Government Licence 3.0 http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
License information was derived automatically
Historical time series of headline adult (19+) further education and skills learner participation, containing breakdowns by provision type and, in some cases, level. Also includes some all-age apprenticeship participation figures.
Academic years: 2005/06 to 2023/24 full academic years
Indicators: Participation
Filter: Provision type, Age group, Level
This zip archive records all of the trait records in EOL's graph database. It contains five .csv files: pages.csv, listing taxa and their names; traits.csv, with trait records; metadata.csv, with auxiliary records referred to by trait records; inferred.csv (see below); and terms.csv, listing all of the relationship URIs in the database. For a description of the schema, see https://github.com/EOL/eol_website/blob/master/doc/trait-schema.md
inferred.csv lists additional taxa to which a trait record applies by taxonomic inference, in addition to the ancestral taxon to which it is attached. For instance, the record describing locomotion=flight for Aves is also inferred to apply to most of the descendants of Aves, except for any flightless subclades that are excluded from the inference pattern. All the trait records referred to in the 2nd column of the inferred file have full records available in the traits file.
THIS RESOURCE IS UPDATED MONTHLY. It is not archived regularly. Please save your download if you want to be able to refer to it at a later date.
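The tables load directly with pandas; column names follow the schema documented at the trait-schema.md link above, so treat any specific column access as an assumption to verify there:

```python
import pandas as pd

pages = pd.read_csv("pages.csv")        # taxa and their names
traits = pd.read_csv("traits.csv")      # trait records
inferred = pd.read_csv("inferred.csv")  # taxa gaining traits by inference

# Every trait record referenced by inferred.csv has a full record in traits.csv
print(pages.shape, traits.shape, inferred.shape)
```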