Foldclass databases for protein structural domains in CATH and TED
rdr.ucl.ac.uk
bin
Updated Jan 16, 2025
Cite
Shaun Kandathil; Andy Lau; Daniel Buchan; David Jones (2025). Foldclass databases for protein structural domains in CATH and TED [Dataset]. http://doi.org/10.5522/04/26348605.v2
Available download formats: bin
Unique identifier
https://doi.org/10.5522/04/26348605.v2
Dataset updated
Jan 16, 2025
Dataset provided by
University College London
Authors
Shaun Kandathil; Andy Lau; Daniel Buchan; David Jones
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This repository contains databases of protein domains for use with Foldclass and Merizo-search. We provide databases for all 365 million domains in TED, as well as all classified domains in CATH 4.3.

Foldclass and Merizo-search use two formats for databases. The default format uses a PyTorch tensor and a pickled list of Python tuples to store the data. This format is used for the CATH database, which is small enough to fit in memory. For larger-than-memory datasets, such as TED, we use a binary format that is searched using the Faiss library.

The CATH database requires approximately 1.4 GB of disk space, whereas the TED database requires about 885 GB. Please ensure you have enough free storage space before downloading. For best search performance with the TED database, store it on the fastest storage hardware available to you.

IMPORTANT: We recommend going into each folder and downloading the files individually. If you attempt to download a whole folder in one go, you will receive a zip file that must then be decompressed. This is particularly an issue for the TED database, because you will need roughly twice the storage space compared with downloading the individual files. Our GitHub repository (see Related Materials below) contains a convenience script to download each database; we recommend using that.
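The storage figures quoted in the description can be checked against the free space on a target drive before starting a download. The following is a minimal sketch using only the Python standard library. The function names, the dictionary keys, and the reading of "GB" as 10^9 bytes are assumptions made for illustration; none of this is part of Foldclass, Merizo-search, or the authors' download script.

```python
import shutil

# Approximate sizes from the dataset description, in bytes.
# Assumption: "GB" is read as 10**9 bytes; the description does not say.
REQUIRED_BYTES = {
    "cath": 1_400_000_000,    # about 1.4 GB
    "ted": 885_000_000_000,   # about 885 GB
}


def required_bytes(database: str, as_zip: bool = False) -> int:
    """Approximate bytes needed to hold the named database.

    If the folder is fetched as a single zip file, the description says
    roughly twice the space is needed (archive plus extracted files).
    """
    factor = 2 if as_zip else 1
    return REQUIRED_BYTES[database] * factor


def has_enough_space(path: str, database: str, as_zip: bool = False) -> bool:
    """Return True if the filesystem containing `path` has enough free space."""
    free = shutil.disk_usage(path).free
    return free >= required_bytes(database, as_zip)


if __name__ == "__main__":
    for name in REQUIRED_BYTES:
        status = "ok" if has_enough_space(".", name) else "insufficient space"
        print(f"{name}: {status}")
```

Passing `as_zip=True` doubles the requirement, which reflects the description's warning about whole-folder downloads; the factor of two is only the rough estimate given there.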