OGB Large-Scale Challenge (OGB-LSC) is a collection of three real-world datasets for advancing the state-of-the-art in large-scale graph ML. OGB-LSC provides graph datasets that are orders of magnitude larger than existing ones and covers three core graph learning tasks -- link prediction, graph regression, and node classification.
OGB-LSC consists of three datasets: MAG240M-LSC, WikiKG90M-LSC, and PCQM4M-LSC. Each dataset offers an independent task.
MAG240M-LSC is a heterogeneous academic graph, and the task is to predict the subject areas of papers situated in the heterogeneous graph (node classification). WikiKG90M-LSC is a knowledge graph, and the task is to impute missing triplets (link prediction). PCQM4M-LSC is a quantum chemistry dataset, and the task is to predict an important molecular property, the HOMO-LUMO gap, of a given molecule (graph regression).
The Open Graph Benchmark (OGB) is a collection of realistic, large-scale, and diverse benchmark datasets for machine learning on graphs. OGB datasets are automatically downloaded, processed, and split using the OGB Data Loader. The model performance can be evaluated using the OGB Evaluator in a unified manner. OGB is a community-driven initiative in active development.
'ogbg-molpcba' is a molecular dataset sampled from PubChem BioAssay. It is a graph prediction dataset from the Open Graph Benchmark (OGB).
This dataset is experimental, and the API is subject to change in future releases.
The below description of the dataset is adapted from the OGB paper:
All the molecules are pre-processed using RDKit ([1]).
The exact description of all features is available at https://github.com/snap-stanford/ogb/blob/master/ogb/utils/features.py.
The task is to predict 128 different biological activities (inactive/active). See [2] and [3] for more details on these targets. Not all targets apply to each molecule: missing targets are indicated by NaNs.
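Because missing targets are encoded as NaNs, a training loss normally has to skip them. The following is a minimal sketch of one way to do that in plain Python. The function name `masked_bce` and the toy logits and targets are assumptions for illustration; they are not part of OGB or tensorflow_datasets.

```python
import math


def masked_bce(logits, targets):
    """Mean binary cross-entropy over the labeled (non-NaN) targets only."""
    total, count = 0.0, 0
    for z, y in zip(logits, targets):
        if math.isnan(y):  # missing target: contributes nothing to the loss
            continue
        # Numerically stable form of -[y*log(p) + (1-y)*log(1-p)], p = sigmoid(z)
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
        count += 1
    return total / count if count else 0.0


# One molecule with four hypothetical targets; two of them are unlabeled.
logits = [0.0, 2.0, -1.0, 3.0]
targets = [1.0, float("nan"), 0.0, float("nan")]
loss = masked_bce(logits, targets)  # averaged over the two labeled targets
```

Only the first and third targets contribute here, so the logits for the unlabeled targets have no effect on the loss.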
[1]: Greg Landrum, et al. 'RDKit: Open-source cheminformatics'. URL: https://github.com/rdkit/rdkit
[2]: Bharath Ramsundar, Steven Kearnes, Patrick Riley, Dale Webster, David Konerding and Vijay Pande. 'Massively Multitask Networks for Drug Discovery'. URL: https://arxiv.org/pdf/1502.02072.pdf
[3]: Zhenqin Wu, Bharath Ramsundar, Evan N Feinberg, Joseph Gomes, Caleb Geniesse, Aneesh S. Pappu, Karl Leswing, and Vijay Pande. MoleculeNet: a benchmark for molecular machine learning. Chemical Science, 9(2):513-530, 2018.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('ogbg_molpcba', split='train')
for ex in ds.take(4):
    print(ex)
See the guide for more information on tensorflow_datasets.
zkchen/OGB dataset hosted on Hugging Face and contributed by the HF Datasets community
License: CC0 1.0 Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
Webpage: https://ogb.stanford.edu/docs/nodeprop/#ogbn-proteins
import os.path as osp

import datatable as dt
import pandas as pd
import torch
import torch_geometric.transforms as T
from ogb.io.read_graph_raw import read_nodesplitidx_split_hetero
from ogb.nodeproppred import PygNodePropPredDataset


class PygOgbnProteins(PygNodePropPredDataset):
    """ogbn-proteins read from the local Kaggle input directory."""

    def __init__(self, meta_csv=None):
        root, name, transform = '/kaggle/input', 'ogbn-proteins', T.ToSparseTensor()
        if meta_csv is None:
            meta_csv = osp.join(root, name, 'ogbn-master.csv')
        master = pd.read_csv(meta_csv, index_col=0)
        meta_dict = master[name]
        meta_dict['dir_path'] = osp.join(root, name)
        super().__init__(name=name, root=root, transform=transform, meta_dict=meta_dict)

    def get_idx_split(self, split_type=None):
        if split_type is None:
            split_type = self.meta_info['split']
        path = osp.join(self.root, 'split', split_type)

        # Prefer a pre-saved split dictionary if one exists.
        split_file = osp.join(path, 'split_dict.pt')
        if osp.isfile(split_file):
            return torch.load(split_file)

        if self.is_hetero:
            train_idx_dict, valid_idx_dict, test_idx_dict = read_nodesplitidx_split_hetero(path)
            for nodetype in train_idx_dict.keys():
                train_idx_dict[nodetype] = torch.from_numpy(train_idx_dict[nodetype]).to(torch.long)
                valid_idx_dict[nodetype] = torch.from_numpy(valid_idx_dict[nodetype]).to(torch.long)
                test_idx_dict[nodetype] = torch.from_numpy(test_idx_dict[nodetype]).to(torch.long)
            return {'train': train_idx_dict, 'valid': valid_idx_dict, 'test': test_idx_dict}
        else:
            # Each CSV holds one node index per row and has no header.
            train_idx = dt.fread(osp.join(path, 'train.csv'), header=None).to_numpy().T[0]
            train_idx = torch.from_numpy(train_idx).to(torch.long)
            valid_idx = dt.fread(osp.join(path, 'valid.csv'), header=None).to_numpy().T[0]
            valid_idx = torch.from_numpy(valid_idx).to(torch.long)
            test_idx = dt.fread(osp.join(path, 'test.csv'), header=None).to_numpy().T[0]
            test_idx = torch.from_numpy(test_idx).to(torch.long)
            return {'train': train_idx, 'valid': valid_idx, 'test': test_idx}


dataset = PygOgbnProteins()
split_idx = dataset.get_idx_split()
train_idx, valid_idx, test_idx = split_idx['train'], split_idx['valid'], split_idx['test']
graph = dataset[0]  # PyG Graph object
Graph: The ogbn-proteins dataset is an undirected, weighted, and typed (according to species) graph. Nodes represent proteins, and edges indicate different types of biologically meaningful associations between proteins, e.g., physical interactions, co-expression or homology [1,2]. All edges come with 8-dimensional features, where each dimension represents the strength of a single association type and takes values between 0 and 1 (the larger the value is, the stronger the association is). The proteins come from 8 species.
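Since the description above only mentions features on edges, one possible way to obtain a per-node input is to average the features of the edges incident to each node. The sketch below illustrates that idea on a toy graph. The function name `mean_incident_edge_features` is hypothetical, the toy features are 2-dimensional rather than 8-dimensional, and this is an assumed baseline, not necessarily the procedure used by the dataset authors.

```python
def mean_incident_edge_features(num_nodes, edges, edge_feats):
    """Average the feature vectors of all edges touching each node."""
    dim = len(edge_feats[0]) if edge_feats else 0
    sums = [[0.0] * dim for _ in range(num_nodes)]
    degree = [0] * num_nodes
    for (u, v), feat in zip(edges, edge_feats):
        for node in (u, v):  # undirected: the edge touches both endpoints
            degree[node] += 1
            for k in range(dim):
                sums[node][k] += feat[k]
    # Isolated nodes keep an all-zero vector.
    return [
        [s / degree[i] if degree[i] else 0.0 for s in sums[i]]
        for i in range(num_nodes)
    ]


# Toy path graph 0 - 1 - 2 with association strengths in [0, 1].
edges = [(0, 1), (1, 2)]
edge_feats = [[0.2, 0.8], [0.6, 0.0]]
node_feats = mean_incident_edge_features(3, edges, edge_feats)
```

Node 1 touches both edges, so its vector is the mean of the two edge feature vectors, while nodes 0 and 2 simply inherit the features of their single edge.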
Prediction task: The task is to predict the presence of protein functions in a multi-label binary classification setup, where there are 112 kinds of labels to predict in total. The performance is measured by the average of ROC-AUC scores across the 112 tasks.
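The metric described above can be illustrated with a small, dependency-free sketch: compute a ROC-AUC per label column, then average. The helper names `roc_auc` and `mean_roc_auc` and the toy data are assumptions for illustration, and skipping columns that contain only one class is a choice made in this sketch. For reported results, the OGB Evaluator should be used instead.

```python
def roc_auc(y_true, y_score):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        return None  # undefined when only one class is present
    # A tie between a positive and a negative score counts as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_roc_auc(y_true, y_score):
    """Average per-task ROC-AUC; rows are nodes, columns are tasks."""
    aucs = []
    for t in range(len(y_true[0])):
        auc = roc_auc([row[t] for row in y_true], [row[t] for row in y_score])
        if auc is not None:
            aucs.append(auc)
    if not aucs:
        raise ValueError("no task has both positive and negative labels")
    return sum(aucs) / len(aucs)


# Four nodes and two tasks (the real dataset has 112 tasks).
y_true = [[1, 0], [0, 1], [1, 1], [0, 0]]
y_score = [[0.9, 0.2], [0.1, 0.4], [0.8, 0.7], [0.3, 0.6]]
score = mean_roc_auc(y_true, y_score)
```

In this toy case the first task is ranked perfectly (AUC 1.0) and the second gets three of four pairs right (AUC 0.75), so the average is 0.875.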
Dataset splitting: The authors split the protein nodes into training/validation/test sets according to the species which the proteins come from. This enables the evaluation of the generalization performance of the model across different species.
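A species-based split of this kind can be sketched as follows: every node goes to the set that its species is assigned to, so no species appears in more than one set. The function name `split_by_species`, the species labels, and the assignment of species to sets are all hypothetical. The actual split indices ship with the dataset and should be read from `get_idx_split()`.

```python
def split_by_species(node_species, valid_species, test_species):
    """Assign each node index to train/valid/test based on its species."""
    split = {"train": [], "valid": [], "test": []}
    for node, species in enumerate(node_species):
        if species in test_species:
            split["test"].append(node)
        elif species in valid_species:
            split["valid"].append(node)
        else:
            split["train"].append(node)
    return split


# Six nodes from four made-up species labels.
node_species = ["A", "A", "B", "C", "B", "D"]
split = split_by_species(node_species, valid_species={"C"}, test_species={"D"})
```

Because whole species are held out, validation and test performance reflect how well a model transfers to proteins from species it has not seen during training.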
Note: For undirected graphs, the loaded graphs will have the doubled number of edges because the bidirectional edges will be added automatically.
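The doubling mentioned in the note comes from storing each undirected edge as two directed edges. The sketch below shows the idea on a toy edge list; the function name `to_bidirectional` is an assumption for illustration and not an OGB function.

```python
def to_bidirectional(edges):
    """Store each undirected edge (u, v) as two directed edges (u, v) and (v, u)."""
    forward = [(u, v) for u, v in edges]
    backward = [(v, u) for u, v in edges]
    return forward + backward


# A triangle: 3 undirected edges become 6 directed edges.
undirected = [(0, 1), (1, 2), (0, 2)]
directed = to_bidirectional(undirected)
```

By the same logic, if a reported edge count refers to undirected edges, a count of 39,561,252 would correspond to 79,122,504 directed entries in the loaded graph.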
| Package | #Nodes | #Edges | Split Type | Task Type | Metric |
|---|---|---|---|---|---|
| ogb>=1.1.1 | 132,534 | 39,561,252 | Species | Multi-label binary classification | ROC-AUC |
Website: https://ogb.stanford.edu
The Open Graph Benchmark (OGB) [3] is a collection of realistic, large-scale, and diverse benchmark datasets for machine learning on graphs. OGB datasets are automatically downloaded, processed, and split using the OGB Data Loader. The model performance can be evaluated using the OGB Evaluator in a unified manner.
[1] Damian Szklarczyk, Annika L Gable, David Lyon, Alexander Junge, Stefan Wyder, Jaime Huerta-Cepas, Milan Simonovic, Nadezhda T Doncheva, John H Morris, Peer Bork, et al. STRING v11: protein–protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets. Nucleic Acids Research, 47(D1):D607–D613, 2019.
[2] Gene Ontology Consortium. The gene ontology resource: 20 years and still going strong. Nucleic Acids Research, 47(D1):D330–D338, 2018.
[3] Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open graph benchmark: Datasets for machine learning on graphs. Advances in Neural Information Processing Systems, pp. 22118–22133, 2020.
I am NOT the author of this dataset. It was downloaded from its official website. I assume no responsibility or liability for the content of this dataset. For any questions, problems, or issues, please contact the original authors via their website or their GitHub repo.
OGB-LSC provides three large-scale, realistic benchmark datasets covering the core graph ML tasks of node classification, link prediction, and graph regression.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Datasets and splits of the manuscript "Chemprop: Machine Learning Package for Chemical Property Prediction." Train, validation and test splits are located within each folder, as well as additional data necessary for some of the benchmarks. To train Chemprop models, refer to our code repository to obtain ready-to-use scripts to train machine learning models for each of the systems. Available benchmarking systems:
hiv: HIV replication inhibition from MoleculeNet and OGB with scaffold splits
pcba_random: Biological activities from MoleculeNet with random splits (with missing targets filled in with zeros as provided by MoleculeNet)
pcba_random_nans: Biological activities from MoleculeNet with random splits and data format to match OGB (with missing targets not filled in with zeros)
pcba_scaffold: Biological activities from OGB with scaffold splits
qm9_multitask: DFT-calculated properties from MoleculeNet and OGB, trained as a multi-task model
qm9_u0: DFT-calculated properties from MoleculeNet and OGB, trained as a single-task model on the target U0 only
qm9_gap: DFT-calculated properties from MoleculeNet and OGB, trained as a single-task model on the target gap only
sampl: Water-octanol partition coefficients, used to predict molecules from the SAMPL6, 7 and 9 challenges
atom_bond_137k: Quantum-mechanical atom and bond descriptors
bde: Bond dissociation enthalpies, trained as a single-task model
bde_charges: Bond dissociation enthalpies, trained as a multi-task model together with atomic partial charges
charges_eps_4: Partial charges at a dielectric constant of 4 (in protein)
charges_eps_78: Partial charges at a dielectric constant of 78 (in water)
barriers_e2: Reaction barrier heights of E2 reactions
barriers_sn2: Reaction barrier heights of SN2 reactions
barriers_cycloadd: Reaction barrier heights of cycloaddition reactions
barriers_rdb7: Reaction barrier heights in the RDB7 dataset
barriers_rgd1: Reaction barrier heights in the RGD1-CNHO dataset
multi_molecule: UV/Vis peak absorption wavelengths in different solvents
ir: IR spectra
pcqm4mv2: HOMO-LUMO gaps of the PCQM4Mv2 dataset
uncertainty_ensemble: Uncertainty estimation with an ensemble on the QM9 gap dataset
uncertainty_evidential: Uncertainty estimation with evidential learning on the QM9 gap dataset
uncertainty_mve: Uncertainty estimation with mean-variance estimation on the QM9 gap dataset
timing: Timing benchmark using subsets of QM9 gap
Version: This version of the dataset (Version 2) is compatible with all versions of Chemprop (supporting the respective functionality). Version 1 of this dataset is compatible with all versions except Chemprop v.1.6.1, which cannot process the charges_eps_4 and charges_eps_78 datasets (all other benchmarks work as expected). We therefore recommend always using Version 2 of the dataset (with reformatted charges_eps_4 and charges_eps_78 datasets), since it is compatible with all versions of Chemprop. For use with any other ML software, you can use any version.
Eximpedia export-import trade data lets you search trade data and active exporters, importers, buyers, suppliers, and manufacturer-exporters from over 209 countries.
https://whoisdatacenter.com/terms-of-use/
Investigate historical ownership changes and registration details by initiating a reverse Whois lookup for the name Vintage Ogb.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Used TheBloke/OpenHermes-2-Mistral-7B-GPTQ to convert chunks into QA pairs used for finetuning
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Calcium time series from OGB-labeled V1 neurons in awake or anesthetized mice. Data published in: Pieter M. Goltstein, Jorrit S. Montijn, Cyriel M.A. Pennartz. (2015). Effects of isoflurane anesthesia on ensemble patterns of Ca2+ activity in mouse V1: Reduced direction selectivity independent of increased correlations in cellular activity. PLOS ONE.
https://whoisdatacenter.com/terms-of-use/
Explore the historical Whois records related to xn--biberciimento-ogb.com (Domain). Get insights into ownership history and changes over time.
ogbrandt/pjf_llama_instruction_prep dataset hosted on Hugging Face and contributed by the HF Datasets community
Subscribers can find out export and import data of 23 countries by HS code or product’s name. This demo is helpful for market analysis.
https://whoisdatacenter.com/terms-of-use/
Explore the historical Whois records related to xn--digitalebrn-ogb.com (Domain). Get insights into ownership history and changes over time.
ogbrandt/nous-pjf dataset hosted on Hugging Face and contributed by the HF Datasets community
Specifications of Dataset Download in Geom3D
We provide both the raw and processed data at this HuggingFace link.
PCQM4Mv2
    mkdir -p pcqm4mv2/raw
    cd pcqm4mv2/raw
    wget http://ogb-data.stanford.edu/data/lsc/pcqm4m-v2-train.sdf.tar.gz
    tar -xf pcqm4m-v2-train.sdf.tar.gz

    wget http://ogb-data.stanford.edu/data/lsc/pcqm4m-v2.zip
    unzip pcqm4m-v2.zip
    mv pcqm4m-v2/raw/data.csv.gz .
    rm pcqm4m-v2.zip
    rm -rf pcqm4m-v2
GEOM
wget… See the full description on the dataset page: https://huggingface.co/datasets/chao1224/Geom3D_data.
ogbrandt/gpt4_preference_rlaif dataset hosted on Hugging Face and contributed by the HF Datasets community