Open Catalyst 2020 is a dataset for catalysis in chemical engineering. Focusing on molecules that are important in renewable energy applications, the OC20 data set comprises over 1.3 million relaxations of molecular adsorptions onto surfaces, the largest data set of electrocatalyst structures to date.
nimashoghi/oc20-s2ef dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Cite this dataset
Chanussot, L., Das, A., Goyal, S., Lavril, T., Shuaibi, M., Riviere, M., Tran, K., Heras-Domingo, J., Ho, C., Hu, W., Palizhati, A., Sriram, A., Wood, B., Yoon, J., Parikh, D., Zitnick, C. L., and Ulissi, Z. OC20 S2EF train 200K. ColabFit, 2024. https://doi.org/10.60732/6ccdeb1d
View on the ColabFit Exchange
https://materials.colabfit.org/id/DS_zdy2xz6y88nl_0
Dataset Name
OC20 S2EF train 200K
Description
OC20_S2EF_train_200K is… See the full description on the dataset page: https://huggingface.co/datasets/colabfit/OC20_S2EF_train_200K.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository includes code to prepare the datasets, train the GemNet model, build and search the faiss index, and visualize the search results, all in the notebook faiss-gemnet-qm9-mp.ipynb. The notebook reproduces the examples in our manuscript for the QM9 and Materials Project datasets. We did not include the OC20 data here because of its large size (> 50 GB); the code to process the OC20 dataset is almost the same as the code included in the notebook for the QM9 dataset.
We include the intermediate data (GemNet checkpoints, lmdb, faiss index and the search results) for QM9 and the Materials Project in the directory example-data. The GemNet checkpoint for the OC20 dataset is also in this directory. The training and evaluation of the Gaussian process regression model, using the molecules retrieved for the query benzene, are demonstrated in the ben-gp-data directory, in which the notebook qm9-gp-gemnet-morgan-random-nrg.ipynb can be run on Colab.
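The search step described above amounts to a nearest-neighbor lookup over learned embeddings. The following is a minimal sketch of that idea only: it uses a brute-force squared-L2 scan in plain Python as a stand-in for a faiss index, and random vectors as a stand-in for GemNet embeddings. The function names, the embedding dimension and the choice of distance are assumptions for illustration, not taken from the repository.

```python
import heapq
import random


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def search(embeddings, query, k):
    """Return the k (distance, row_id) pairs closest to `query`, nearest first.

    Brute-force scan over every stored vector; an approximate index would
    replace this loop for large collections.
    """
    scored = ((squared_l2(vec, query), i) for i, vec in enumerate(embeddings))
    return heapq.nsmallest(k, scored)


# Hypothetical data: 200 random 16-dimensional vectors standing in for
# model embeddings of molecules or structures.
random.seed(0)
embeddings = [[random.gauss(0.0, 1.0) for _ in range(16)] for _ in range(200)]

# Query with a stored vector: it should be its own nearest neighbor.
hits = search(embeddings, embeddings[42], k=5)
```

Each entry of `hits` pairs a distance with the row id of a stored vector, so the ids can be mapped back to the original structures for visualization or for fitting a downstream model.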
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We built three Global Minimum Adsorption Energy (GMAE) benchmark datasets, named OCD-GMAE, Alloy-GMAE and FG-GMAE, from the OC20-Dense, Catalysis Hub, and 'functional groups' (FG) datasets through strict data cleaning. Each data point represents a unique combination of catalyst surface and adsorbate. These new benchmark datasets can benefit future ML studies on GMAE prediction.
In addition, a similar data cleaning procedure was employed on the OC20 dataset to create a new dataset named OC20-LMAE, which comprises surface/adsorbate pairings along with their local minimum adsorption energies (LMAE). The OC20-LMAE dataset contains 363,937 data points and serves as an effective resource for model pretraining.
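The distinction between one data point per surface/adsorbate pair and many relaxed configurations per pair can be illustrated with a small sketch. It assumes, for illustration only, that the global minimum for a pair is the lowest adsorption energy among its recorded configurations. The record layout, the labels and the energy values are made up, and the authors' actual cleaning procedure is not described here.

```python
def global_minimum_energies(records):
    """Collapse (surface, adsorbate, energy) records to one energy per pair.

    Keeps the lowest energy seen for each unique (surface, adsorbate) key.
    """
    best = {}
    for surface, adsorbate, energy in records:
        key = (surface, adsorbate)
        if key not in best or energy < best[key]:
            best[key] = energy
    return best


# Toy records (hypothetical labels and energies in eV): several local
# minima per pair, one of which is the lowest.
records = [
    ("surface_A", "adsorbate_X", -0.25),
    ("surface_A", "adsorbate_X", -0.75),
    ("surface_A", "adsorbate_Y", 0.5),
    ("surface_B", "adsorbate_X", -1.0),
    ("surface_B", "adsorbate_X", -0.5),
]

gmae = global_minimum_energies(records)
```

In this toy input, five local-minimum records reduce to three pairs, each carrying a single energy.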
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset accompanying the release of the Open MatSciML Toolkit, open-source software for developing graph neural networks on the OpenCatalyst project using the Deep Graph Library (DGL).
For more details about the Open MatSci ML Toolkit, check the associated open-source repository and paper.
The compressed files total ~8 GB; uncompressed, the data occupies ~80 GB.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset
OC22-IS2RE-Validation-in-domain
Description
In-domain validation configurations for the initial structure to relaxed total energy (IS2RE) task of OC22. Open Catalyst 2022 (OC22) is a database of training trajectories for predicting catalytic reactions on oxide surfaces, meant to complement OC20, which did not contain oxide surfaces. Additional details are stored in dataset columns prefixed with "dataset_".
Dataset authors
Richard Tran, Janice Lan… See the full description on the dataset page: https://huggingface.co/datasets/colabfit/OC22-IS2RE-Validation-in-domain.