    Foldclass databases for protein structural domains in CATH and TED

    rdr.ucl.ac.uk · bin · Updated Jan 16, 2025
    Cite
    Shaun Kandathil; Andy Lau; Daniel Buchan; David Jones (2025). Foldclass databases for protein structural domains in CATH and TED [Dataset]. http://doi.org/10.5522/04/26348605.v2
    Available download formats: bin
    Dataset updated
    Jan 16, 2025
    Dataset provided by
    University College London
    Authors
    Shaun Kandathil; Andy Lau; Daniel Buchan; David Jones
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository contains databases of protein domains for use with Foldclass and Merizo-search. We provide databases for all 365 million domains in TED, as well as all classified domains in CATH 4.3.

    Foldclass and Merizo-search use two database formats. The default format stores the data as a PyTorch tensor and a pickled list of Python tuples. This format is used for the CATH database, which is small enough to fit in memory. For larger-than-memory datasets, such as TED, we use a binary format that is searched using the Faiss library.

    The CATH database requires approximately 1.4 GB of disk space, whereas the TED database requires about 885 GB. Please ensure you have enough free storage space before downloading. For best search performance with the TED database, store it on the fastest storage hardware available to you.

    IMPORTANT: We recommend going into each folder and downloading the files individually; if you attempt to download a whole folder in one go, you will receive a zip file that must then be decompressed. This is particularly an issue when downloading the TED database, as you will need roughly twice the storage space compared with downloading the individual files. Our GitHub repository (see Related Materials below) contains a convenience script to download each database; we recommend using that.
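The storage guidance above can be checked before starting a download. The following is a minimal sketch, not part of the authors' tooling: the function names `required_gb` and `has_enough_space` are hypothetical, the sizes are the approximate figures quoted in the description, and the factor of two for zipped folder downloads is taken from the "roughly twice" estimate above.

```python
import shutil

# Approximate sizes quoted in the dataset description (decimal GB assumed).
CATH_DB_GB = 1.4
TED_DB_GB = 885.0


def required_gb(db_size_gb: float, as_zip: bool) -> float:
    """Estimate the space needed for a download.

    Assumption: a zipped folder download needs roughly twice the space,
    because the archive and its decompressed contents coexist on disk.
    """
    return db_size_gb * (2.0 if as_zip else 1.0)


def has_enough_space(path: str, db_size_gb: float, as_zip: bool = False) -> bool:
    """Return True if the filesystem holding `path` has enough free space."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb(db_size_gb, as_zip)


if __name__ == "__main__":
    # Check the current directory; replace "." with the intended target path.
    for name, size in [("CATH", CATH_DB_GB), ("TED", TED_DB_GB)]:
        for as_zip in (False, True):
            mode = "zipped folder" if as_zip else "individual files"
            ok = has_enough_space(".", size, as_zip)
            print(f"{name} ({mode}): need ~{required_gb(size, as_zip):.1f} GB, enough space: {ok}")
```

This only checks capacity; it does not download anything. For the actual download, the convenience script in the authors' GitHub repository remains the recommended route.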


