The Open Graph Benchmark (OGB) is a collection of realistic, large-scale, and diverse benchmark datasets for machine learning on graphs. OGB datasets are automatically downloaded, processed, and split using the OGB Data Loader. The model performance can be evaluated using the OGB Evaluator in a unified manner. OGB is a community-driven initiative in active development.
'ogbg-molpcba' is a molecular dataset sampled from PubChem BioAssay. It is a graph prediction dataset from the Open Graph Benchmark (OGB).
This dataset is experimental, and the API is subject to change in future releases.
The dataset description below is adapted from the OGB paper:
All the molecules are pre-processed using RDKit ([1]).
The exact description of all features is available at https://github.com/snap-stanford/ogb/blob/master/ogb/utils/features.py.
The task is to predict 128 different biological activities (inactive/active). See [2] and [3] for more description about these targets. Not all targets apply to each molecule: missing targets are indicated by NaNs.
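Because missing targets are stored as NaN, a loss or metric should skip those entries rather than treat them as inactive. The following is a minimal sketch in plain Python, assuming labels arrive as nested lists of floats; the name `masked_bce` is illustrative and is not part of OGB or TensorFlow Datasets.

```python
import math

def masked_bce(probs, labels):
    """Mean binary cross-entropy over labelled entries only.

    probs  -- predicted probabilities in (0, 1), one row per molecule
    labels -- 0.0 / 1.0 targets, with float('nan') where a target is missing
    """
    total, count = 0.0, 0
    for prob_row, label_row in zip(probs, labels):
        for p, y in zip(prob_row, label_row):
            if math.isnan(y):  # missing target: contributes nothing
                continue
            total += -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
            count += 1
    return total / count if count else 0.0

nan = float("nan")
# Two molecules, two targets each; one target per molecule is missing.
loss = masked_bce([[0.9, 0.2], [0.5, 0.5]], [[1.0, nan], [nan, 0.0]])
```

Only the two labelled entries contribute to `loss`; a tensor-based training loop would apply the same idea with a boolean mask.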
[1]: Greg Landrum, et al. 'RDKit: Open-source cheminformatics'. URL: https://github.com/rdkit/rdkit
[2]: Bharath Ramsundar, Steven Kearnes, Patrick Riley, Dale Webster, David Konerding and Vijay Pande. 'Massively Multitask Networks for Drug Discovery'. URL: https://arxiv.org/pdf/1502.02072.pdf
[3]: Zhenqin Wu, Bharath Ramsundar, Evan N Feinberg, Joseph Gomes, Caleb Geniesse, Aneesh S. Pappu, Karl Leswing, and Vijay Pande. MoleculeNet: a benchmark for molecular machine learning. Chemical Science, 9(2):513-530, 2018.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('ogbg_molpcba', split='train')
for ex in ds.take(4):
    print(ex)
See the guide for more information on tensorflow_datasets.
Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/ogbg_molpcba-0.1.3.png
Graph representation learning typically aims to learn an informative embedding for each graph node based on the graph topology (link) information.
Webpage: https://ogb.stanford.edu/docs/graphprop/#ogbg-mol
import os.path as osp

import pandas as pd
import torch
from ogb.graphproppred import PygGraphPropPredDataset
from torch_geometric.loader import DataLoader  # torch_geometric.data in PyG < 2.0


class PygOgbgMol(PygGraphPropPredDataset):
    """Read an ogbg-mol* dataset from a local directory using its master CSV."""

    def __init__(self, name, transform=None, pre_transform=None, meta_csv=None):
        root = '../input'
        if meta_csv is None:
            meta_csv = osp.join(root, name, 'ogbg-master.csv')
        master = pd.read_csv(meta_csv, index_col=0)
        meta_dict = master[name]
        meta_dict['dir_path'] = osp.join(root, name)
        super().__init__(name=name, root=root, transform=transform,
                         pre_transform=pre_transform, meta_dict=meta_dict)

    def get_idx_split(self, split_type=None):
        if split_type is None:
            split_type = self.meta_info['split']
        path = osp.join(self.root, 'split', split_type)
        # Short-cut if split_dict.pt exists
        split_file = osp.join(path, 'split_dict.pt')
        if osp.isfile(split_file):
            return torch.load(split_file)
        # Otherwise read one node index per row from the CSV files
        train_idx = pd.read_csv(osp.join(path, 'train.csv'), header=None).values.T[0]
        valid_idx = pd.read_csv(osp.join(path, 'valid.csv'), header=None).values.T[0]
        test_idx = pd.read_csv(osp.join(path, 'test.csv'), header=None).values.T[0]
        return {
            'train': torch.tensor(train_idx, dtype=torch.long),
            'valid': torch.tensor(valid_idx, dtype=torch.long),
            'test': torch.tensor(test_idx, dtype=torch.long),
        }


dataset = PygOgbgMol('ogbg-molhiv')

batch_size = 32
split_idx = dataset.get_idx_split()
train_loader = DataLoader(dataset[split_idx['train']], batch_size=batch_size, shuffle=True)
valid_loader = DataLoader(dataset[split_idx['valid']], batch_size=batch_size, shuffle=False)
test_loader = DataLoader(dataset[split_idx['test']], batch_size=batch_size, shuffle=False)
Graph: The ogbg-molhiv and ogbg-molpcba datasets are two molecular property prediction datasets of different sizes: ogbg-molhiv (small) and ogbg-molpcba (medium). They are adopted from MoleculeNet [1] and are among the largest of the MoleculeNet datasets. All the molecules are pre-processed using RDKit [2]. Each graph represents a molecule, where nodes are atoms and edges are chemical bonds. Input node features are 9-dimensional, containing atomic number and chirality, as well as additional atom features such as formal charge and whether the atom is in a ring. The full description of the features is provided in the OGB code (https://github.com/snap-stanford/ogb/blob/master/ogb/utils/features.py). The script that converts a SMILES string [3] to the above graph object is provided in the OGB repository; note that it requires RDKit to be installed. The script can be used to pre-process external molecule datasets so that they share the same input feature space as the OGB molecule datasets. This is particularly useful for pre-training graph models, which has the potential to significantly increase generalization performance on the (downstream) OGB datasets [4].
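To make the "atoms as nodes, bonds as edges" layout concrete, here is a small hand-built sketch. It does not use RDKit or the OGB conversion script: the molecule (ethanol), the single atomic-number feature, and the helper `to_edge_index` are all illustrative assumptions, whereas the real datasets carry the 9-dimensional atom features described above. Storing each bond in both directions is assumed here as the usual way to represent undirected bonds in a PyG-style edge list.

```python
# Hypothetical hand-written molecule: ethanol (SMILES "CCO"), heavy atoms only.
atomic_numbers = [6, 6, 8]   # one entry per node (atom): C, C, O
bonds = [(0, 1), (1, 2)]     # one entry per chemical bond

def to_edge_index(bonds):
    """Return [sources, targets] with each bond stored in both directions."""
    src, dst = [], []
    for i, j in bonds:
        src += [i, j]
        dst += [j, i]
    return [src, dst]

edge_index = to_edge_index(bonds)
num_nodes = len(atomic_numbers)
num_directed_edges = len(edge_index[0])  # twice the number of bonds
```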
Besides the two main datasets, the dataset authors additionally provide 10 smaller datasets from MoleculeNet. They are ogbg-moltox21, ogbg-molbace, ogbg-molbbbp, ogbg-molclintox, ogbg-molmuv, ogbg-molsider, and ogbg-moltoxcast for (multi-task) binary classification, and ogbg-molesol, ogbg-molfreesolv, and ogbg-mollipo for regression. Evaluators are also provided for these datasets. These datasets can be used to stress-test molecule-specific methods or transfer learning [4].
To encode these raw input features, the dataset authors provide simple modules called AtomEncoder and BondEncoder. They can be used as follows to embed raw atom and bond features and obtain atom_emb and edge_emb.
from ogb.graphproppred.mol_encoder import AtomEncoder, BondEncoder
atom_encoder = AtomEncoder(emb_dim = 100)
bond_encoder = BondEncoder(emb_dim = 100)
atom_emb = atom_encoder(x) # x is the input atom feature
edge_emb = bond_encoder(edge_attr) # edge_attr is the input edge feature
Prediction task: The task is to predict the target molecular properties as accurately as possible, where the molecular properties are cast as binary labels, e.g., whether a molecule inhibits HIV virus replication or not. Note that some datasets (e.g., ogbg-molpcba) can have multiple tasks, and can contain NaN values indicating that the corresponding label is not assigned to the molecule. For the evaluation metric, the dataset authors closely follow [2]. Specifically, for ogbg-molhiv, the dataset authors use ROC-AUC for...
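For readers unfamiliar with ROC-AUC, the metric named above for ogbg-molhiv, the sketch below computes it as the fraction of (positive, negative) pairs that the model ranks correctly. It is a quadratic-time illustration in plain Python, not the OGB Evaluator; the function name `roc_auc` and the choice to skip NaN labels are assumptions made for this example.

```python
def roc_auc(scores, labels):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Labels other than 0 or 1
    (for example NaN for a missing label) are ignored.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))

# Five molecules; the last one has no label and is skipped.
auc = roc_auc([0.9, 0.8, 0.3, 0.1, 0.5], [1, 0, 1, 0, float("nan")])
```

Here three of the four positive/negative pairs are ordered correctly, so `auc` is 0.75.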
zkchen/OGB dataset hosted on Hugging Face and contributed by the HF Datasets community
Open Data Commons Attribution License (ODC-By) v1.0: https://www.opendatacommons.org/licenses/by/1.0/
License information was derived automatically
Webpage: https://ogb.stanford.edu/docs/nodeprop/#ogbn-mag
Warning: Currently not usable.
import torch_geometric
from ogb.nodeproppred import PygNodePropPredDataset
dataset = PygNodePropPredDataset('ogbn-mag', root = '/kaggle/input')
split_idx = dataset.get_idx_split()
train_idx, valid_idx, test_idx = split_idx['train'], split_idx['valid'], split_idx['test']
graph = dataset[0] # PyG Graph object
Graph: The ogbn-mag dataset is a heterogeneous network composed of a subset of the Microsoft Academic Graph (MAG) [1]. It contains four types of entities—papers (736,389 nodes), authors (1,134,649 nodes), institutions (8,740 nodes), and fields of study (59,965 nodes)—as well as four types of directed relations connecting two types of entities—an author is “affiliated with” an institution, an author “writes” a paper, a paper “cites” a paper, and a paper “has a topic of” a field of study. Similar to ogbn-arxiv, each paper is associated with a 128-dimensional word2vec feature vector, and all the other types of entities are not associated with input node features.
Prediction task: Given the heterogeneous ogbn-mag data, the task is to predict the venue (conference or journal) of each paper, given its content, references, authors, and authors’ affiliations. This is of practical interest as some manuscripts’ venue information is unknown or missing in MAG, due to the noisy nature of Web data. In total, there are 349 different venues in ogbn-mag, making the task a 349-class classification problem.
Dataset splitting: The authors of this dataset follow the same time-based strategy as ogbn-arxiv and ogbn-papers100M to split the paper nodes in the heterogeneous graph: models are trained to predict the venue labels of all papers published before 2018, validated on papers published in 2018, and tested on papers published since 2019.
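The time-based rule can be sketched in a few lines. This is only an illustration of the year thresholds described above: the function `time_split` and the toy list of publication years are hypothetical, and the official split indices should be taken from `dataset.get_idx_split()`.

```python
def time_split(years):
    """Map each paper index to train/valid/test by publication year."""
    split = {"train": [], "valid": [], "test": []}
    for idx, year in enumerate(years):
        if year < 2018:
            split["train"].append(idx)
        elif year == 2018:
            split["valid"].append(idx)
        else:  # 2019 and later
            split["test"].append(idx)
    return split

# Toy publication years for five papers (made-up values).
split_idx = time_split([2015, 2018, 2019, 2017, 2020])
```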
| Package | #Nodes | #Edges | Split Type | Task Type | Metric |
|---|---|---|---|---|---|
| ogb>=1.2.1 | 1,939,743 | 21,111,007 | Time | Multi-class classification | Accuracy |
Website: https://ogb.stanford.edu
The Open Graph Benchmark (OGB) [2] is a collection of realistic, large-scale, and diverse benchmark datasets for machine learning on graphs. OGB datasets are automatically downloaded, processed, and split using the OGB Data Loader. The model performance can be evaluated using the OGB Evaluator in a unified manner.
[1] Kuansan Wang, Zhihong Shen, Chiyuan Huang, Chieh-Han Wu, Yuxiao Dong, and Anshul Kanakia. Microsoft academic graph: When experts are not enough. Quantitative Science Studies, 1(1):396–413, 2020.
[2] Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open graph benchmark: Datasets for machine learning on graphs. Advances in Neural Information Processing Systems, pp. 22118–22133, 2020.
I am NOT the author of this dataset. It was downloaded from its official website. I assume no responsibility or liability for the content in this dataset. Any questions, problems or issues, please contact the original authors at their website or their GitHub repo.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset was created by Sajan Gohil
Released under MIT
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset was created by NGUYENGN1410
Released under MIT
View Ets Ogb Commerce General Imp Exp import export trade data, including shipment records, HS codes, top buyers, suppliers, trade values, and global market insights.
Comprehensive YouTube channel statistics for Real OGB Recent, featuring 899,000 subscribers and 110,425,829 total views. This dataset includes detailed performance metrics such as subscriber growth, video views, engagement rates, and estimated revenue. The channel operates in the Entertainment category. Track 191 videos with daily and monthly performance data, including view counts, subscriber changes, and earnings estimates. Analyze growth trends, engagement patterns, and compare performance against similar channels in the same category.
Webpage: https://ogb.stanford.edu/docs/graphprop/#ogbg-code
from torch_geometric.loader import DataLoader  # torch_geometric.data in PyG < 2.0
from ogb.graphproppred import PygGraphPropPredDataset
dataset = PygGraphPropPredDataset(name = 'ogbg-code', root = '/kaggle/input')
batch_size = 32
split_idx = dataset.get_idx_split()
train_loader = DataLoader(dataset[split_idx['train']], batch_size = batch_size, shuffle = True)
valid_loader = DataLoader(dataset[split_idx['valid']], batch_size = batch_size, shuffle = False)
test_loader = DataLoader(dataset[split_idx['test']], batch_size = batch_size, shuffle = False)
Graph: The ogbg-code dataset is a collection of Abstract Syntax Trees (ASTs) obtained from approximately 450 thousand Python method definitions. Methods are extracted from a total of 13,587 different repositories across the most popular projects on GitHub. The collection of Python methods originates from GitHub CodeSearchNet, a collection of datasets and benchmarks for machine-learning-based code retrieval. In ogbg-code, the dataset authors contribute an additional feature extraction step, which includes AST edges, AST nodes, and the tokenized method name. Altogether, ogbg-code allows you to capture source code with its underlying graph structure, beyond its token sequence representation.
Prediction task: The task is to predict the sub-tokens forming the method name, given the Python method body represented by its AST and node features. This task is often referred to as “code summarization”, because the model is trained to find a succinct and precise description (i.e., the method name chosen by the developer) for a complete logical unit (i.e., the method body). Code summarization is a representative task in the field of machine learning for code, not only for its straightforward adoption in developer tools, but also because it is a proxy measure for assessing how well a model captures code semantics [1]. Following [2,3], the dataset authors use an F1 score to evaluate predicted sub-tokens against ground-truth sub-tokens.
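As an illustration of scoring predicted sub-tokens against the ground truth, the sketch below computes a per-method F1 from precision and recall. Comparing the sub-tokens as unordered sets is an assumption made for this example, and `subtoken_f1` is a hypothetical helper; the official numbers should come from the OGB Evaluator.

```python
def subtoken_f1(predicted, reference):
    """F1 between predicted and ground-truth sub-tokens, compared as sets."""
    pred, ref = set(predicted), set(reference)
    true_pos = len(pred & ref)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred)  # share of predictions that are correct
    recall = true_pos / len(ref)      # share of reference tokens recovered
    return 2 * precision * recall / (precision + recall)

# Predicted name "get_user_name" versus true name "get_name".
score = subtoken_f1(["get", "user", "name"], ["get", "name"])
```

In this toy case precision is 2/3 and recall is 1, so the F1 score is 0.8.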
Dataset splitting: The dataset authors adopt a project split [4], where the ASTs for the train set are obtained from GitHub projects that do not appear in the validation and test sets. This split respects the practical scenario of training a model on a large collection of source code (obtained, for instance, from the popular GitHub projects), and then using it to predict method names on a separate code base. The project split stress-tests the model’s ability to capture code’s semantics, and avoids a model that trivially memorizes the idiosyncrasies of training projects (such as the naming conventions and the coding style of a specific developer) to achieve a high test score.
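A project split can be sketched as a grouping step: every method inherits the split of the project it came from, so no project contributes to more than one split. The helper `project_split` and the project names below are hypothetical and only illustrate the idea; they do not reproduce the official assignment.

```python
def project_split(method_projects, valid_projects, test_projects):
    """Assign each method to a split according to the project it came from."""
    split = {"train": [], "valid": [], "test": []}
    for idx, project in enumerate(method_projects):
        if project in test_projects:
            split["test"].append(idx)
        elif project in valid_projects:
            split["valid"].append(idx)
        else:
            split["train"].append(idx)
    return split

# Source project of five methods (made-up names).
projects = ["numpy", "flask", "numpy", "django", "flask"]
split_idx = project_split(projects, valid_projects={"flask"}, test_projects={"django"})
```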
| Package | #Graphs | #Nodes per Graph | #Edges per Graph | Split Type | Task Type | Metric |
|---|---|---|---|---|---|---|
| ogb>=1.2.0 | 452,741 | 125.2 | 124.2 | Project | Sub-token prediction | F1 score |
Website: https://ogb.stanford.edu
The Open Graph Benchmark (OGB) [5] is a collection of realistic, large-scale, and diverse benchmark datasets for machine learning on graphs. OGB datasets are automatically downloaded, processed, and split using the OGB Data Loader. The model performance can be evaluated using the OGB Evaluator in a unified manner.
[1] Miltiadis Allamanis, Earl T Barr, Premkumar Devanbu, and Charles Sutton. A survey of machine learning for big code and naturalness. ACM Computing Surveys, 51(4):1–37, 2018.
[2] Uri Alon, Shaked Brody, Omer Levy, and Eran Yahav. code2seq: Generating sequences from structured representations of code. arXiv preprint arXiv:1808.01400, 2018.
[3] Uri Alon, Meital Zilberstein, Omer Levy, and Eran Yahav. code2vec: Learning distributed representations of code. Proceedings of the ACM on Programming Languages, 3(POPL):1–29, 2019.
[4] Miltiadis Allamanis. The adverse effects of code duplication in machine learning models of code. Proceedings of the 2019 ACM SIGPLAN International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software, pp. 143–153, 2019.
[5] Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open graph benchmark: Datasets for machine learning on graphs. Advances in Neural Information Processing Systems, pp. 22118–22133, 2020.
I am NOT the author of this dataset. It was downloaded from its official website. I assume no responsibility or liability for the content in this dataset. Any questions, problems or issues, please contact the original authors at their website or their GitHub repo.
OGB Large-Scale Challenge (OGB-LSC) is a collection of three real-world datasets for advancing the state-of-the-art in large-scale graph ML. OGB-LSC provides graph datasets that are orders of magnitude larger than existing ones and covers three core graph learning tasks -- link prediction, graph regression, and node classification. OGB-LSC consists of three datasets: MAG240M-LSC, WikiKG90M-LSC, and PCQM4M-LSC. Each dataset offers an independent task. MAG240M-LSC is a heterogeneous academic graph, and the task is to predict the subject areas of papers situated in the heterogeneous graph (node classification). WikiKG90M-LSC is a knowledge graph, and the task is to impute missing triplets (link prediction). PCQM4M-LSC is a quantum chemistry dataset, and the task is to predict an important molecular property, the HOMO-LUMO gap, of a given molecule (graph regression).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Datasets and splits of the manuscript "Chemprop: Machine Learning Package for Chemical Property Prediction." Train, validation and test splits are located within each folder, as well as additional data necessary for some of the benchmarks. To train Chemprop models, refer to our code repository to obtain ready-to-use scripts to train machine learning models for each of the systems.
Available benchmarking systems:
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Calcium time series from OGB-labeled V1 neurons in awake or anesthetized mice. Data published in: Pieter M. Goltstein, Jorrit S. Montijn, Cyriel M.A. Pennartz. (2015). Effects of isoflurane anesthesia on ensemble patterns of Ca2+ activity in mouse V1: Reduced direction selectivity independent of increased correlations in cellular activity. PLOS ONE.
Webpage: https://ogb.stanford.edu/docs/nodeprop/#ogbn-products
import os.path as osp

import datatable as dt
import pandas as pd
import torch
# Needed only for the heterogeneous branch of get_idx_split
from ogb.io.read_graph_raw import read_nodesplitidx_split_hetero
from ogb.nodeproppred import PygNodePropPredDataset


class PygOgbnProducts(PygNodePropPredDataset):
    """Read ogbn-products from a local directory using its master CSV."""

    def __init__(self, meta_csv=None):
        root, name, transform = '/kaggle/input', 'ogbn-products', None
        if meta_csv is None:
            meta_csv = osp.join(root, name, 'ogbn-master.csv')
        master = pd.read_csv(meta_csv, index_col=0)
        meta_dict = master[name]
        meta_dict['dir_path'] = osp.join(root, name)
        super().__init__(name=name, root=root, transform=transform, meta_dict=meta_dict)

    def get_idx_split(self, split_type=None):
        if split_type is None:
            split_type = self.meta_info['split']
        path = osp.join(self.root, 'split', split_type)
        # Short-cut if split_dict.pt exists
        split_file = osp.join(path, 'split_dict.pt')
        if osp.isfile(split_file):
            return torch.load(split_file)
        if self.is_hetero:
            train_idx_dict, valid_idx_dict, test_idx_dict = read_nodesplitidx_split_hetero(path)
            for nodetype in train_idx_dict.keys():
                train_idx_dict[nodetype] = torch.from_numpy(train_idx_dict[nodetype]).to(torch.long)
                valid_idx_dict[nodetype] = torch.from_numpy(valid_idx_dict[nodetype]).to(torch.long)
                test_idx_dict[nodetype] = torch.from_numpy(test_idx_dict[nodetype]).to(torch.long)
            return {'train': train_idx_dict, 'valid': valid_idx_dict, 'test': test_idx_dict}
        else:
            # Read one node index per row from the CSV files
            train_idx = dt.fread(osp.join(path, 'train.csv'), header=None).to_numpy().T[0]
            train_idx = torch.from_numpy(train_idx).to(torch.long)
            valid_idx = dt.fread(osp.join(path, 'valid.csv'), header=None).to_numpy().T[0]
            valid_idx = torch.from_numpy(valid_idx).to(torch.long)
            test_idx = dt.fread(osp.join(path, 'test.csv'), header=None).to_numpy().T[0]
            test_idx = torch.from_numpy(test_idx).to(torch.long)
            return {'train': train_idx, 'valid': valid_idx, 'test': test_idx}


dataset = PygOgbnProducts()
split_idx = dataset.get_idx_split()
train_idx, valid_idx, test_idx = split_idx['train'], split_idx['valid'], split_idx['test']
graph = dataset[0]  # PyG Graph object
Graph: The ogbn-products dataset is an undirected and unweighted graph, representing an Amazon product co-purchasing network [1]. Nodes represent products sold in Amazon, and edges between two products indicate that the products are purchased together. The authors follow [2] to process node features and target categories. Specifically, node features are generated by extracting bag-of-words features from the product descriptions followed by a Principal Component Analysis to reduce the dimension to 100.
Prediction task: The task is to predict the category of a product in a multi-class classification setup, where the 47 top-level categories are used for target labels.
Dataset splitting: The authors consider a more challenging and realistic dataset splitting that differs from the one used in [2]. Instead of randomly assigning 90% of the nodes for training and 10% of the nodes for testing (without a validation set), the authors use the sales ranking (popularity) to split nodes into training/validation/test sets. Specifically, they sort the products according to their sales ranking and use the top 8% for training, the next 2% for validation, and the rest for testing. This splitting procedure closely matches the real-world application where labels are first assigned to important nodes in the network and ML models are subsequently used to make predictions on less important ones.
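The 8% / 2% / 90% sales-rank split can be sketched as a sort followed by slicing. The helper `sales_rank_split`, the convention that a lower rank value means a more popular product, and the toy ranks are assumptions for illustration; the official indices ship with the dataset.

```python
def sales_rank_split(sales_rank, train_frac=0.08, valid_frac=0.02):
    """Split node ids by popularity; a lower rank value means a better seller."""
    order = sorted(range(len(sales_rank)), key=lambda i: sales_rank[i])
    n_train = int(len(order) * train_frac)
    n_valid = int(len(order) * valid_frac)
    return {
        "train": order[:n_train],
        "valid": order[n_train:n_train + n_valid],
        "test": order[n_train + n_valid:],
    }

# 100 toy products: node 99 has rank 1 (most popular), node 0 has rank 100.
ranks = list(range(100, 0, -1))
split_idx = sales_rank_split(ranks)
```

With 100 products, the 8 most popular go to training, the next 2 to validation, and the remaining 90 to testing.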
Note 1: A very small number of self-connecting edges are repeated; you may remove them if necessary.
Note 2: For undirected graphs, the loaded graphs will have double the number of edges because the bidirectional edges are added automatically.
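Both notes can be illustrated on a plain edge list. The sketch below drops repeated edges (including repeated self-loops) and stores every remaining edge in both directions, which is why an undirected graph loads with roughly twice as many edges. `clean_and_symmetrize` is a hypothetical helper, not an OGB or PyG function.

```python
def clean_and_symmetrize(edges):
    """Drop repeated edges, then store every remaining edge in both directions."""
    seen = set()
    directed = []
    for u, v in edges:
        for a, b in ((u, v), (v, u)):
            if (a, b) not in seen:  # also collapses repeated self-loops
                seen.add((a, b))
                directed.append((a, b))
    return directed

# Two ordinary edges plus a self-loop on node 2 that appears twice.
raw = [(0, 1), (1, 2), (2, 2), (2, 2)]
directed = clean_and_symmetrize(raw)
```

Each ordinary edge yields two directed entries, while the duplicated self-loop is kept once.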
| Package | #Nodes | #Edges | Split Type | Task Type | Metric |
|---|---|---|---|---|---|
| ogb>=1.1.1 | 2,449,029 | 61,859,140 | Sales rank | Multi-class classification | Accuracy |
Website: https://ogb.stanford.edu
The Open Graph Benchmark (OGB) [3] is a collection of realistic, large-scale, and diverse benchmark datasets for machine learning on graphs. OGB datasets are automatically downloaded, processed, and split using the OGB Data Loader. The model performance can be evaluated using the OGB Evaluator in a unified manner.
[1] http://manikvarma.org/downloads/XC/XMLRepository.html [2] Wei-Lin Chiang, ...
View O G B Company Limited import export trade data, including shipment records, HS codes, top buyers, suppliers, trade values, and global market insights.
ogbrandt/pjf_llama_instruction_prep dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Following the format of the Open Graph Benchmark (OGB), we design four prediction tasks of relations (mag-write, mag-cite) and higher-order patterns (tags-math, DBLP-coauthor) and construct the corresponding datasets over heterogeneous graphs and hypergraphs [1]. The original ogbn-mag dataset only contains features for 'paper'-type nodes. We add the node embedding provided by [2] as raw features for other node types in MAG(P-A)/(P-P). For these four tasks, the model is evaluated on one positive query paired with a certain number of randomly sampled negative queries (1:1000 by default, except for tags-math, which uses 1:100).
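Evaluating one positive query against sampled negatives amounts to ranking the positive among them. The description above does not name the metric, so the mean reciprocal rank used below is an assumed choice for illustration, as are the helper names and the toy scores. Ties are resolved in favour of the positive.

```python
def positive_rank(pos_score, neg_scores):
    """1-based rank of the positive query among its sampled negatives."""
    return 1 + sum(1 for s in neg_scores if s > pos_score)

def mean_reciprocal_rank(queries):
    """queries -- iterable of (pos_score, neg_scores) pairs."""
    ranks = [positive_rank(p, negs) for p, negs in queries]
    return sum(1.0 / r for r in ranks) / len(ranks)

# Two toy queries; real evaluation would pair each positive with 1000 negatives.
mrr = mean_reciprocal_rank([(0.9, [0.1, 0.5, 0.95]), (0.7, [0.2, 0.3])])
```

The first positive is outscored by one negative (rank 2) and the second by none (rank 1), giving a mean reciprocal rank of 0.75.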
View Ogb And Partners Limited import export trade data, including shipment records, HS codes, top buyers, suppliers, trade values, and global market insights.