https://choosealicense.com/licenses/cc/
Dataset Information
#Nodes | #Edges | #Node features |
---|---|---|
169,343 | 1,166,243 | 128 |
Pre-processed as per the official codebase of https://arxiv.org/abs/2210.02016
Citations
@article{ju2023multi, title={Multi-task Self-supervised Graph Neural Networks Enable Stronger Task Generalization}, author={Ju, Mingxuan and Zhao, Tong and Wen, Qianlong and Yu, Wenhao and Shah, Neil and Ye, Yanfang and Zhang, Chuxu}, booktitle={International Conference on Learning… See the full description on the dataset page: https://huggingface.co/datasets/SauravMaheshkar/pareto-ogbn-arxiv.
https://creativecommons.org/publicdomain/zero/1.0/
Webpage: https://ogb.stanford.edu/docs/nodeprop/#ogbn-proteins
import os.path as osp

import datatable as dt  # provides dt.fread, used to read the split CSV files
import pandas as pd
import torch
import torch_geometric.transforms as T
from ogb.io.read_graph_raw import read_nodesplitidx_split_hetero
from ogb.nodeproppred import PygNodePropPredDataset


class PygOgbnProteins(PygNodePropPredDataset):
    """Reads ogbn-proteins from the Kaggle input directory."""

    def __init__(self, meta_csv=None):
        root, name, transform = '/kaggle/input', 'ogbn-proteins', T.ToSparseTensor()
        if meta_csv is None:
            meta_csv = osp.join(root, name, 'ogbn-master.csv')
        master = pd.read_csv(meta_csv, index_col=0)
        meta_dict = master[name]
        meta_dict['dir_path'] = osp.join(root, name)
        super().__init__(name=name, root=root, transform=transform, meta_dict=meta_dict)

    def get_idx_split(self, split_type=None):
        if split_type is None:
            split_type = self.meta_info['split']
        path = osp.join(self.root, 'split', split_type)

        # Prefer a pre-saved split dictionary if one exists.
        if osp.isfile(osp.join(path, 'split_dict.pt')):
            return torch.load(osp.join(path, 'split_dict.pt'))

        if self.is_hetero:
            train_idx_dict, valid_idx_dict, test_idx_dict = read_nodesplitidx_split_hetero(path)
            for nodetype in train_idx_dict.keys():
                train_idx_dict[nodetype] = torch.from_numpy(train_idx_dict[nodetype]).to(torch.long)
                valid_idx_dict[nodetype] = torch.from_numpy(valid_idx_dict[nodetype]).to(torch.long)
                test_idx_dict[nodetype] = torch.from_numpy(test_idx_dict[nodetype]).to(torch.long)
            return {'train': train_idx_dict, 'valid': valid_idx_dict, 'test': test_idx_dict}
        else:
            # Each CSV holds one node index per row, without a header.
            train_idx = dt.fread(osp.join(path, 'train.csv'), header=None).to_numpy().T[0]
            train_idx = torch.from_numpy(train_idx).to(torch.long)
            valid_idx = dt.fread(osp.join(path, 'valid.csv'), header=None).to_numpy().T[0]
            valid_idx = torch.from_numpy(valid_idx).to(torch.long)
            test_idx = dt.fread(osp.join(path, 'test.csv'), header=None).to_numpy().T[0]
            test_idx = torch.from_numpy(test_idx).to(torch.long)
            return {'train': train_idx, 'valid': valid_idx, 'test': test_idx}


dataset = PygOgbnProteins()
split_idx = dataset.get_idx_split()
train_idx, valid_idx, test_idx = split_idx['train'], split_idx['valid'], split_idx['test']
graph = dataset[0]  # PyG Graph object
Graph: The ogbn-proteins dataset is an undirected, weighted, and typed (according to species) graph. Nodes represent proteins, and edges indicate different types of biologically meaningful associations between proteins, e.g., physical interactions, co-expression or homology [1,2]. All edges come with 8-dimensional features, where each dimension represents the strength of a single association type and takes values between 0 and 1 (the larger the value is, the stronger the association is). The proteins come from 8 species.
Prediction task: The task is to predict the presence of protein functions in a multi-label binary classification setup, where there are 112 kinds of labels to predict in total. The performance is measured by the average of ROC-AUC scores across the 112 tasks.
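As a rough illustration of the metric described above, the following sketch computes a per-task ROC-AUC with the rank-sum formula and averages it over tasks. It is not the OGB Evaluator: the function names `roc_auc` and `mean_roc_auc` and the toy labels and scores are made up for this example, and the sketch assumes every task has at least one positive and one negative label.

```python
from typing import Sequence


def roc_auc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """ROC-AUC for one binary task via the rank-sum (Mann-Whitney U) formula."""
    n_pos = sum(1 for y in y_true if y == 1)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("ROC-AUC is undefined when only one class is present")

    # Sort indices by score and give tied scores their average 1-based rank.
    order = sorted(range(len(y_score)), key=lambda i: y_score[i])
    ranks = [0.0] * len(y_score)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and y_score[order[j + 1]] == y_score[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    pos_rank_sum = sum(r for r, y in zip(ranks, y_true) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def mean_roc_auc(y_true, y_score) -> float:
    """Average the per-task ROC-AUC; rows are nodes, columns are tasks."""
    num_tasks = len(y_true[0])
    per_task = []
    for t in range(num_tasks):
        col_true = [row[t] for row in y_true]
        col_score = [row[t] for row in y_score]
        per_task.append(roc_auc(col_true, col_score))
    return sum(per_task) / num_tasks


# Toy data (hypothetical): 4 nodes, 2 tasks instead of the dataset's 112.
toy_true = [[1, 0], [0, 1], [1, 1], [0, 0]]
toy_score = [[0.9, 0.2], [0.1, 0.8], [0.8, 0.3], [0.3, 0.4]]
toy_mean = mean_roc_auc(toy_true, toy_score)
```

On the toy data, the first task is ranked perfectly (AUC 1.0) and the second gets three of four positive/negative pairs right (AUC 0.75), so the average is 0.875.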
Dataset splitting: The authors split the protein nodes into training/validation/test sets according to the species which the proteins come from. This enables the evaluation of the generalization performance of the model across different species.
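A species-based split of this kind could be sketched as follows. This is only an illustration under assumptions: the helper `split_by_species`, the species IDs, and the choice of which species are held out are hypothetical and do not reproduce the split shipped with the dataset.

```python
from collections import defaultdict


def split_by_species(node_species, valid_species, test_species):
    """Assign each node index to train/valid/test based on its species ID."""
    split = defaultdict(list)
    for node_idx, species in enumerate(node_species):
        if species in test_species:
            split["test"].append(node_idx)
        elif species in valid_species:
            split["valid"].append(node_idx)
        else:
            split["train"].append(node_idx)
    # Always return all three keys, even if a split ends up empty.
    return {key: split[key] for key in ("train", "valid", "test")}


# Hypothetical species IDs for six nodes (not the real dataset's IDs).
node_species = [0, 0, 1, 2, 1, 3]
toy_split = split_by_species(node_species, valid_species={2}, test_species={3})
```

Because whole species are held out, no validation or test protein shares a species with any training protein, which is what makes the evaluation a test of cross-species generalization.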
Note: For undirected graphs, the loaded graph will have double the number of edges, because each edge is automatically added in both directions.
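The doubling can be illustrated with a minimal sketch that stores every undirected edge as two directed entries. The helper `to_bidirectional` and the toy edge list are hypothetical, and the sketch assumes the input has no self-loops and lists each undirected edge once.

```python
def to_bidirectional(edges):
    """Return the edge list with a reversed copy of every (u, v) pair appended."""
    return list(edges) + [(v, u) for u, v in edges]


# Three undirected edges on a toy triangle graph.
undirected_edges = [(0, 1), (1, 2), (0, 2)]
loaded_edges = to_bidirectional(undirected_edges)
num_loaded = len(loaded_edges)
```

Here the three undirected edges become six directed entries. By the same arithmetic, an undirected edge count of 39,561,252 corresponds to 79,122,504 directed entries in the loaded graph.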
Package | #Nodes | #Edges | Split Type | Task Type | Metric |
---|---|---|---|---|---|
ogb>=1.1.1 | 132,534 | 39,561,252 | Species | Multi-label binary classification | ROC-AUC |
Website: https://ogb.stanford.edu
The Open Graph Benchmark (OGB) [3] is a collection of realistic, large-scale, and diverse benchmark datasets for machine learning on graphs. OGB datasets are automatically downloaded, processed, and split using the OGB Data Loader. The model performance can be evaluated using the OGB Evaluator in a unified manner.
[1] Damian Szklarczyk, Annika L Gable, David Lyon, Alexander Junge, Stefan Wyder, Jaime Huerta-Cepas, Milan Simonovic, Nadezhda T Doncheva, John H Morris, Peer Bork, et al. STRING v11: protein–protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets. Nucleic Acids Research, 47(D1):D607–D613, 2019.
[2] Gene Ontology Consortium. The gene ontology resource: 20 years and still going strong. Nucleic Acids Research, 47(D1):D330–D338, 2018.
[3] Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open graph benchmark: Datasets for machine learning on graphs. Advances in Neural Information Processing Systems, pp. 22118–22133, 2020.
I am NOT the author of this dataset. It was downloaded from its official website. I assume no responsibility or liability for the content of this dataset. For any questions, problems, or issues, please contact the original authors through their website or their GitHub repo.
Graph classification and node classification datasets
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The distillations are done from trained teachers with different numbers of GCN layers: 3, 7, 14, 28, and 56. Note that the proposed method Student_MustaD provides the best performance among the student models.