CATH Domain Classification List (latest release) - protein structural domains classified into CATH hierarchy.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
CATH is a classification of protein structures downloaded from the Protein Data Bank. We group protein domains into superfamilies when there is sufficient evidence they have diverged from a common ancestor. The files contained in this dataset correspond to the version 4.2 release of the CATH classification.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Encyclopedia of Domains (TED) is a joint effort by CATH (Orengo group) and the Jones group at University College London to identify and classify protein domains in AlphaFold2 models from AlphaFold Database version 4, covering over 188 million unique sequences and 324 million domain assignments.
In this data release, we will be making available to the community a table of domain boundaries and additional metadata on quality (pLDDT, globularity, number of secondary structures), taxonomy and putative CATH SuperFamily or Fold assignments for all 324 million domains in TED100.
For all chains in the TED-redundant dataset, the attached file contains boundary predictions, the consensus level and information on the TED100 representative.
Additionally, an archive with chain-level consensus domain assignments is available for 21 model organisms and 25 global health proteomes:
For both TED100 and TEDredundant we provide the domain boundary predictions output by each of the three methods employed in the project (Chainsaw, Merizo, UniDoc).
We are making available PDB files for 7,427 novel folds identified during the TED classification process, together with an annotation table sorted by novelty.
Please use the gunzip command to extract files with a '.gz' extension.
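As an alternative to extracting the files first, a '.gz' file can also be read directly from a script. The following is a minimal sketch using Python's standard gzip module; the function name and the assumption that the file is UTF-8 text are illustrative and not part of the data release.

```python
import gzip


def read_first_lines(path, n=5):
    """Return the first n lines of a gzip-compressed text file without extracting it."""
    lines = []
    # "rt" opens the compressed stream in text mode (assumed UTF-8 here).
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            lines.append(line.rstrip("\n"))
            if len(lines) >= n:
                break
    return lines
```

Reading line by line avoids loading a whole table into memory, which matters for files covering hundreds of millions of domains.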
CATH annotations have been assigned using the FoldSeek algorithm applied in various modes and the FoldClass algorithm, both of which are used to report significant structural similarity to a known CATH domain.
Note: The TED protocol differs from that of our standard CATH Assignment protocol for superfamily assignment, which also involves HMM-based protocols and manual curation for remote matches.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Author: Rodrigo A. Moreira (C) 2023 https://orcid.org/0000-0002-7605-8722 LICENSE: CC BY-NC-ND 4.0 (https://creativecommons.org/licenses/by-nc-nd/4.0/)
rbLEC - Local Euler Characteristics - from the CATH database
A. rbLEC NETWORK
[I] The network for each PDB[1] structure is defined by taking the PDB atoms N, CA and C of each residue as the nodes of a graph G.
[II] An edge of G is set if the distance between two atoms in [I] is greater than 2.0 Angstroms.
[III] The graph G is stored in the files with the extension ".network_backboneRE_heavy_gt2"
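Steps [I] and [II] can be sketched in a few lines of Python. This is only an illustration of the rule as stated above, not the code of pdb2network.lua: the function name, the input layout (a list of (resid, atom_name, (x, y, z)) tuples) and the returned index pairs are assumptions, and the on-disk format of the ".network_backboneRE_heavy_gt2" files is not reproduced.

```python
import math

BACKBONE = ("N", "CA", "C")  # atoms used as nodes, step [I]


def build_network(atoms, cutoff=2.0):
    """Build the graph G from a list of (resid, atom_name, (x, y, z)) tuples.

    Returns (nodes, edges): nodes are the backbone atoms in input order,
    edges are index pairs (i, j) with i < j into the node list.
    """
    nodes = [atom for atom in atoms if atom[1] in BACKBONE]
    edges = []
    for i in range(len(nodes)):
        for j in range(i + 1, len(nodes)):
            # Step [II]: an edge is set when the distance exceeds the cutoff.
            if math.dist(nodes[i][2], nodes[j][2]) > cutoff:
                edges.append((i, j))
    return nodes, edges
```

The pairwise loop is quadratic in the number of backbone atoms, which is acceptable for single domains but would need a spatial index for much larger structures.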
Equation (1) [6,7] \begin{equation} \chi = \sum_{k=1}^{N} \kappa_k = \sum_{k=1}^{N} \underbrace{ \left(1 + \sum_{l=1}^{\infty} (-1)^{l} \frac{v_{l-1}}{l+1} \right)_{k}}_{\kappa_k} \end{equation}
Equation (2) \begin{equation} LEC = \sum_{m \in R} \kappa_m = \kappa_{N} + \kappa_{CA} + \kappa_{C} \end{equation}
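Equations (1) and (2) can be illustrated with a short Python sketch. It assumes, following Knill [7], that v_{l-1} counts the complete subgraphs with l vertices among the neighbours of vertex k; the adjacency-dictionary representation and the function names are hypothetical and are not taken from lec.py.

```python
from fractions import Fraction
from itertools import combinations


def knill_curvature(adj, x):
    """Curvature kappa of vertex x, Equation (1); adj maps vertex -> set of neighbours."""
    neighbours = sorted(adj[x])
    kappa = Fraction(1)  # leading 1 in Equation (1)
    for l in range(1, len(neighbours) + 1):
        # v_{l-1}: number of l-vertex cliques among the neighbours of x (assumed definition).
        v = sum(
            1
            for subset in combinations(neighbours, l)
            if all(b in adj[a] for a, b in combinations(subset, 2))
        )
        if v == 0:
            break  # no l-cliques implies no larger cliques, so the sum terminates
        kappa += Fraction((-1) ** l * v, l + 1)
    return kappa


def residue_lec(kappas, residue_atoms):
    """LEC of one residue, Equation (2): sum of kappa over its N, CA and C nodes."""
    return sum(kappas[atom] for atom in residue_atoms)
```

For a triangle graph each vertex has kappa = 1 - 2/2 + 1/3 = 1/3, so the kappas sum to an Euler characteristic of 1; for a 4-cycle each vertex has kappa = 0. The clique enumeration is exponential in the vertex degree, so this form is only suitable for small examples.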
B. FILENAME EXTENSIONS
B.1 Basic files
".fixed" - PDB file obtained by applying pdbfixer[2] to structures from the CATH database.
".dssp" - Output of the DSSP[3] software.
".stride" - Output of the STRIDE[4] software.
B.2 Data files
".network_backboneRE_heavy_gt2" - Generated by D.2 below. Describes the network graph defined in A. above.
".knill_curvature" - Generated by D.1 below. Contains the filtration of kappas for each vertex of the network.
".residues_curvature" - Generated by D.1 below. Contains the filtration of LEC, Equation (2) above, for each residue, namely the sum of the 3 kappas from the respective ".knill_curvature" file that correspond to the PDB atoms N, CA and C described in A. above.
".label" - Generated by D.3 below. Extra file for easier assessment of structures. It carries the same LEC information as the respective ".residues_curvature" file, merged with the ".dssp" and ".stride" classes as well as the residue name and residue ID for each molecule. Format of columns: cutoff resname resid DSSP_class STRIDE_class LEC
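A ".label" line can be read into named fields along the following lines. This is a sketch under assumptions not stated above: the six columns are taken to be whitespace-separated, cutoff and LEC numeric, and resid an integer; the class and function names are hypothetical.

```python
from typing import NamedTuple


class LabelRow(NamedTuple):
    cutoff: float
    resname: str
    resid: int
    dssp_class: str
    stride_class: str
    lec: float


def parse_label_line(line):
    """Parse one line with columns: cutoff resname resid DSSP_class STRIDE_class LEC."""
    # Assumes exactly six whitespace-separated fields; raises ValueError otherwise.
    cutoff, resname, resid, dssp, stride, lec = line.split()
    return LabelRow(float(cutoff), resname, int(resid), dssp, stride, float(lec))
```

If the actual files leave a secondary-structure class blank for some residues, a fixed-width or delimiter-aware parser would be needed instead of a plain split.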
C. FOLDERS
CATH_FIXED (approximately 13GB after uncompressing cath_fixed.tar.xz)
contains the fixed PDBs and LECs from the CATH[5] database
D. SOFTWARE
D.1 lec.py: computes the kappas in Equation (1) above.
Example usage:
$ python3 lec.py CATH_FIXED/2x0qA02/2x0qA02
It creates files with the extensions ".kappas" and ".relec", which reproduce, respectively, the files with the extensions ".knill_curvature" and ".residues_curvature".
D.2 pdb2network.lua: creates the rbLEC network file (number of nodes and edge list) from a PDB file, to be used as input by lec.py.
Example usage:
$ lua pdb2rbLEC.lua CATH_FIXED/2x0qA02/2x0qA02.fixed
Output reproduces the file CATH_FIXED/2x0qA02/2x0qA02.pdb.network_backboneRE_heavy_gt2
D.3 label.lua: creates files with extension '*.label' from the files '*.pdb.stride', '*.pdb.dssp' and '*.pdb.network_backboneRE_heavy_gt2.residues_curvature'.
Example usage:
$ lua label.lua CATH_FIXED/2x0qA02/2x0qA02.pdb
Output reproduces the file CATH_FIXED/2x0qA02/2x0qA02.pdb.network_backboneRE_heavy_gt2.residues_curvature.label
REFERENCES
[1] Berman, H., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T., Weissig, H., Shindyalov, I., & Bourne, P. (2000). The protein data bank. Nucleic acids research, 28, 235–42.
[2] Eastman, P., Swails, J., Chodera, J., McGibbon, R., Zhao, Y., Beauchamp, K., Wang, L.P., Simmonett, A., Harrigan, M., Stern, C., & others (2017). OpenMM 7: Rapid development of high performance algorithms for molecular dynamics. PLoS computational biology, 13(7), e1005659.
[3] Kabsch, W., & Sander, C. (1983). Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers: Original Research on Biomolecules, 22(12), 2577–2637.
[4] Frishman, D., & Argos, P. (1995). Knowledge-based protein secondary structure assignment. Proteins: Structure, Function, and Bioinformatics, 23(4), 566–579.
[5] Knudsen, M., & Wiuf, C. (2010). The CATH database. Human genomics, 4(3), 1–6.
[6] Levitt, N. (1992). The Euler characteristic is the unique locally determined numerical homotopy invariant of finite complexes. Discrete & computational geometry, 7, 59–67.
[7] Knill, O. (2011). A graph theoretical Gauss-Bonnet-Chern theorem. arXiv preprint arXiv:1111.5395.