4 datasets found

Protein Structural Domain Classification
cathdb.info
ec.i4cologne.com
Updated Sep 30, 2024
Cite
(2024). Protein Structural Domain Classification [Dataset]. http://identifiers.org/MIR:00100005
Unique identifier
https://identifiers.org/MIR:00100005
Dataset updated
Sep 30, 2024
Description
CATH Domain Classification List (latest release) - protein structural domains classified into CATH hierarchy.
CATH protein domain classification (version 4.2)
rdr.ucl.ac.uk
txt
Updated May 30, 2023
Cite
Ian Sillitoe; Natalie Dawson; Christine Orengo (2023). CATH protein domain classification (version 4.2) [Dataset]. http://doi.org/10.5522/04/7937330.v1
Available download formats: txt
Unique identifier
https://doi.org/10.5522/04/7937330.v1
Dataset updated
May 30, 2023
Dataset provided by
University College London
Authors
Ian Sillitoe; Natalie Dawson; Christine Orengo
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
CATH is a classification of protein structures downloaded from the Protein Data Bank. We group protein domains into superfamilies when there is sufficient evidence they have diverged from a common ancestor. The files contained in this dataset correspond to the version 4.2 release of the CATH classification.
The Encyclopedia of Domains (TED) structural domains assignments for...
zenodo.org
application/gzip, bz2 +1
Updated Oct 31, 2024
Cite
Andy Lau; Nicola Bordin; Shaun Kandathil; Ian Sillitoe; Vaishali Waman; Jude Wells; Christine Orengo; David T Jones (2024). The Encyclopedia of Domains (TED) structural domains assignments for AlphaFold Database v4 [Dataset]. http://doi.org/10.5281/zenodo.13369203
Available download formats: application/gzip, bz2, zip
Unique identifier
https://doi.org/10.5281/zenodo.13369203
Dataset updated
Oct 31, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Andy Lau; Nicola Bordin; Shaun Kandathil; Ian Sillitoe; Vaishali Waman; Jude Wells; Christine Orengo; David T Jones
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Dataset description:

The Encyclopedia of Domains (TED) is a joint effort by CATH (Orengo group) and the Jones group at University College London to identify and classify protein domains in AlphaFold2 models from AlphaFold Database version 4, covering over 188 million unique sequences and 324 million domain assignments.

In this data release, we make available to the community a table of domain boundaries and additional metadata on quality (pLDDT, globularity, number of secondary structures), taxonomy and putative CATH superfamily or fold assignments for all 324 million domains in TED100.

For all chains in the TED-redundant dataset, the attached file contains boundary predictions, consensus level and information on the TED100 representative.

Additionally, an archive with chain-level consensus domain assignments is available for 21 model organisms and 25 global health proteomes:

For both TED100 and TED-redundant we provide the domain boundary predictions output by each of the three methods employed in the project (Chainsaw, Merizo, UniDoc).

We are also making available PDB files for 7,427 novel folds identified during the TED classification process, with an annotation table sorted by novelty.

Please use the gunzip command to extract files with a '.gz' extension.
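
As an alternative to extracting with gunzip, a '.gz' table can be read as a stream, which avoids writing the decompressed file to disk. The following is a minimal sketch using only the Python standard library; the function name `iter_tsv_gz` and the tiny stand-in file are illustrative and not part of the dataset.

```python
import csv
import gzip
import os
import tempfile


def iter_tsv_gz(path, limit=None):
    """Yield rows (lists of strings) from a gzip-compressed TSV file.

    Rows are decompressed lazily, so the whole file is never held in memory.
    `limit` optionally stops after that many rows.
    """
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        for index, row in enumerate(reader):
            if limit is not None and index >= limit:
                break
            yield row


# Demo on a tiny stand-in file; the identifiers below are hypothetical.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo.tsv.gz")
with gzip.open(demo_path, "wt", newline="") as handle:
    handle.write("id_1\t10-52\nid_2\t5-80\nid_3\t1-40\n")

first_two = list(iter_tsv_gz(demo_path, limit=2))
print(first_two)
```

For the real files, pass the path of the downloaded '.tsv.gz' archive instead of the demo path.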

CATH annotations have been assigned using the Foldseek algorithm applied in various modes and the Foldclass algorithm, both of which are used to report significant structural similarity to a known CATH domain.
Note: The TED protocol differs from our standard CATH assignment protocol for superfamily assignment, which also involves HMM-based protocols and manual curation for remote matches.

This dataset contains:

ted_214m_per_chain_segmentation.tsv
The file contains all 214M protein chains in TED with consensus domain boundaries and proteome information in the following columns.
1. AFDB_model_ID: chain identifier from AFDB in the format AF-

ted_365m_domain_boundaries_consensus_level.tsv.gz
The file contains all domain assignments in TED100 and TED-redundant (365M) in the format:
1. TED_ID: TED domain identifier in the format AF-

ted_100_324m.domain_summary.cath.globularity.taxid.tsv and novel_folds_set.domain_summary.tsv are header-less with the following columns separated by tabs (.tsv).

ted_324m_seq_clustering.cathlabels.tsv
The file contains the results of the domain sequences clustering with MMseqs2.
Columns:
1. Cluster_representative
2. Cluster_member
3. CATH code assignment if available, e.g. 3.40.50.300 for a domain with a homologous match or 3.20.20 for a domain matching at the fold level in the CATH classification
4. CATH assignment type - either Foldseek-T, Foldseek-H or Foldclass
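
The four columns above can be read with a plain tab-separated parser, and the depth of the CATH code tells a superfamily-level match (four parts) from a fold-level match (three parts). This is a minimal sketch under assumptions: the domain identifiers are hypothetical placeholders, the pairing of assignment type with level in the sample rows is illustrative only, and "-" as the marker for a missing assignment is a guess, since the description does not say how unassigned domains are encoded.

```python
import csv
import io
from collections import Counter


def cath_level(code):
    """Classify a CATH code string by its depth.

    Four numeric parts (C.A.T.H) -> "superfamily"; three (C.A.T) -> "fold";
    anything else (empty, "-", malformed) -> "unassigned".
    """
    parts = code.split(".")
    if all(part.isdigit() for part in parts):
        if len(parts) == 4:
            return "superfamily"
        if len(parts) == 3:
            return "fold"
    return "unassigned"


# Hypothetical rows: representative, member, CATH code, assignment type.
sample = (
    "dom_A\tdom_A\t3.40.50.300\tFoldseek-H\n"
    "dom_A\tdom_B\t3.40.50.300\tFoldseek-H\n"
    "dom_C\tdom_C\t3.20.20\tFoldseek-T\n"
    "dom_D\tdom_D\t-\t-\n"
)

rows = list(csv.reader(io.StringIO(sample), delimiter="\t"))
level_counts = Counter(cath_level(row[2]) for row in rows)
print(level_counts)
```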

novel_folds_set.domain_summary.tsv is sorted by novelty.
1. ted_id - TED domain identifier in the format AF-

Domain assignments for TED-redundant using single-chain and multi-chain consensus are in ted_redundant_39m.multichain.consensus_domain_summary.taxid.tsv and ted_redundant_39m.singlechain.consensus_domain_summary.taxid.tsv
The files contain a header with the following fields. Each column is tab-separated (.tsv).
1. TED_redundant_id - TED chain identifier in the format AF-

novel_folds_set_models.tar.gz contains PDB files of all novel folds identified in TED100.

All per-tool domain boundary predictions are in the same format with the following columns.
1. TED_chainID - TED chain identifier in the format AF-

Domain boundary predictions share the same format: domains are separated by ',', the segments of a domain are separated by '_', and the boundaries of a segment (start, stop) are separated by '-'.

For example, the domain prediction by Merizo for AF-A0A000-F1-model_v4:
AF-A0A000-F1-model_v4 e8872c7a0261b9e88e6ff47eb34e4162 394 2 10-52_289-394,53-288 0.90077

Merizo predicts one discontinuous domain and one continuous domain:
Domain 1 (discontinuous): 10-52_289-394
segment 1: 10-52
segment 2: 289-394
Domain 2 (continuous): 53-288
segment 1: 53-288
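
The boundary notation in the Merizo example can be unpacked with two nested splits. The sketch below is illustrative rather than part of the TED tooling: the function names are hypothetical, and taking the fifth whitespace-separated field of the example row as the boundary string is an assumption based on that single row.

```python
def parse_boundaries(boundaries):
    """Parse a string such as '10-52_289-394,53-288'.

    Returns a list of domains; each domain is a list of (start, stop) tuples,
    one per segment. ',' separates domains, '_' separates segments and '-'
    separates the start and stop of a segment.
    """
    domains = []
    for domain_text in boundaries.split(","):
        segments = []
        for segment_text in domain_text.split("_"):
            start, stop = segment_text.split("-")
            segments.append((int(start), int(stop)))
        domains.append(segments)
    return domains


def domain_length(segments):
    """Number of residues covered by a domain (boundaries are inclusive)."""
    return sum(stop - start + 1 for start, stop in segments)


# Example row from the description; field position 4 is assumed to hold the boundaries.
row = "AF-A0A000-F1-model_v4 e8872c7a0261b9e88e6ff47eb34e4162 394 2 10-52_289-394,53-288 0.90077"
fields = row.split()
domains = parse_boundaries(fields[4])

for number, segments in enumerate(domains, start=1):
    kind = "discontinuous" if len(segments) > 1 else "continuous"
    print(number, kind, segments, domain_length(segments))
```

A domain with more than one segment is discontinuous, which matches the reading of the example given above.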

ted-tools-main.zip - copy of the https://github.com/psipred/ted-tools repository, containing tools and software used to generate TED.

cath-alphaflow-main.zip - copy of CATH-AlphaFlow, used to generate globularity scores for TED domains.

ted-web-master.zip - copy of TED-web, containing code to generate the web interface of TED (https://ted.cathdb.info)

gofocus_data.tar.bz2 - GOFocus model weights
rbLEC - restricted backbone Local Euler Characteristic - from CATH database
data.niaid.nih.gov
Updated Sep 27, 2023
Cite
Rodrigo A. Moreira (2023). rbLEC - restricted backbone Local Euler Characteristic - from CATH database [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8382583
Dataset updated
Sep 27, 2023
Dataset provided by
Basque Center for Applied Mathematics
Authors
Rodrigo A. Moreira
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Author: Rodrigo A. Moreira (C) 2023 https://orcid.org/0000-0002-7605-8722 LICENSE: CC BY-NC-ND 4.0 (https://creativecommons.org/licenses/by-nc-nd/4.0/)

rbLEC - Local Euler Characteristic - from CATH database

A. rbLEC NETWORK

[I] The network for each PDB[1] structure is defined by taking the PDB atoms N, CA and C of each residue as the nodes of a graph G. [II] An edge of G is set if the distance between two atoms in [I] is greater than 2.0 Angstroms. [III] The graph G is stored in the files with the extension ".network_backboneRE_heavy_gt2".

Equation (1) [6,7]
\begin{equation}
\chi = \sum_{k=1}^{N} \kappa_k = \sum_{k=1}^{N} \underbrace{\left(1 + \sum_{l=1}^{\infty} (-1)^{l} \frac{v_{l-1}}{l+1}\right)_{k}}_{\kappa_k}
\end{equation}

Equation (2)
\begin{equation}
LEC = \sum_{m \in R} \kappa_m = \kappa_{N} + \kappa_{CA} + \kappa_{C}
\end{equation}
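
Equations (1) and (2) can be checked on a toy graph. The sketch below is not the lec.py shipped with the dataset. It assumes, following Knill [7], that v_{l-1} counts the complete subgraphs on l vertices inside the neighbourhood of vertex k, which is the same as saying that every clique of size s containing k contributes (-1)^(s-1)/s to kappa_k. Function names, node labels and the toy edges are hypothetical.

```python
from fractions import Fraction


def build_adjacency(nodes, edges):
    """Return an undirected adjacency map {node: set(neighbours)}."""
    adjacency = {node: set() for node in nodes}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency


def cliques_containing(vertex, adjacency):
    """Yield every clique (frozenset of nodes) that contains `vertex`, once each."""
    def extend(clique, candidates):
        yield clique
        for index, node in enumerate(candidates):
            # Keep only later candidates adjacent to `node`, so the result stays a clique.
            rest = [other for other in candidates[index + 1:] if other in adjacency[node]]
            yield from extend(clique | {node}, rest)

    yield from extend(frozenset([vertex]), sorted(adjacency[vertex]))


def knill_curvature(adjacency):
    """kappa_k of Equation (1): sum of (-1)^(s-1)/s over cliques of size s containing k."""
    return {
        vertex: sum(
            (Fraction((-1) ** (len(c) - 1), len(c)) for c in cliques_containing(vertex, adjacency)),
            Fraction(0),
        )
        for vertex in adjacency
    }


def residue_lec(kappa):
    """LEC of Equation (2): sum kappa over the N, CA and C nodes of each residue."""
    lec = {}
    for (resid, _atom), value in kappa.items():
        lec[resid] = lec.get(resid, Fraction(0)) + value
    return lec


# Toy graph: nodes are (residue id, atom name). Residue 1 forms a triangle, residue 2 a path.
nodes = [(r, a) for r in (1, 2) for a in ("N", "CA", "C")]
edges = [
    ((1, "N"), (1, "CA")), ((1, "CA"), (1, "C")), ((1, "N"), (1, "C")),
    ((2, "N"), (2, "CA")), ((2, "CA"), (2, "C")),
]
kappa = knill_curvature(build_adjacency(nodes, edges))
lec = residue_lec(kappa)
chi = sum(kappa.values())
print(kappa, lec, chi)
```

Exact fractions are used so that the sum of the kappas reproduces the integer Euler characteristic without rounding. Clique enumeration grows quickly on dense graphs, so this form is only meant for small examples.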

B. FILENAME EXTENSIONS

B.1 Basic files

".fixed" - PDB file produced by running pdbfixer[2] on structures from the CATH database.

".dssp" - Output of the DSSP[3] software.

".stride" - Output of the STRIDE[4] software.

B.2 Data files

".network_backboneRE_heavy_gt2" - Generated by D.2 below. Describes the network graph, as described in A. above.

".knill_curvature" - Generated by D.1 below. Contains the filtration of kappas for each vertex of the network.

".residues_curvature" - Generated by D.1 below. Contains the filtration of LEC, Equation (2) above, for each residue, namely the sum of the 3 kappas from the respective ".knill_curvature" file, corresponding to the PDB atoms N, CA and C, as described in A. above.

".label" - Generated by D.3 below. Extra file for easier assessment of structures. It holds the same LEC information as the respective ".residues_curvature" file, merged with the classes from ".dssp" and ".stride" as well as the residue name and residue ID for each molecule. Format of columns: cutoff resname resid DSSP_class STRIDE_class LEC

C. FOLDERS

CATH_FIXED (approximately 13 GB after decompressing cath_fixed.tar.xz) contains the fixed PDBs and LECs from the CATH[5] database.
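
The ".label" files in this folder list six columns per row (cutoff, resname, resid, DSSP_class, STRIDE_class, LEC). A minimal reading sketch follows. It assumes the columns are whitespace-separated and that cutoff and LEC are numeric, neither of which the description states; the sample rows are hypothetical, and rows where a class field is blank would need different handling.

```python
from typing import Iterable, Iterator, NamedTuple


class LabelRow(NamedTuple):
    cutoff: float
    resname: str
    resid: str        # kept as text in case residue IDs carry insertion codes
    dssp_class: str
    stride_class: str
    lec: float


def parse_label_lines(lines: Iterable[str]) -> Iterator[LabelRow]:
    """Yield one LabelRow per well-formed line; skip headers and malformed rows."""
    for line in lines:
        fields = line.split()
        if len(fields) != 6:
            continue
        try:
            yield LabelRow(float(fields[0]), fields[1], fields[2],
                           fields[3], fields[4], float(fields[5]))
        except ValueError:
            continue  # non-numeric cutoff or LEC, e.g. a header line


# Hypothetical rows in the documented column order.
sample_lines = [
    "cutoff resname resid DSSP_class STRIDE_class LEC",
    "4.0 ALA 12 H H 0.5",
    "4.0 GLY 13 T T -1.25",
]
label_rows = list(parse_label_lines(sample_lines))
print(label_rows)
```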

D. SOFTWARE

D.1 lec.py: computes the kappas in Equation (1) above. Example usage: $ python3 lec.py CATH_FIXED/2x0qA02/2x0qA02 It creates files with the extensions ".kappas" and ".relec", which reproduce, respectively, the files with the extensions ".knill_curvature" and ".residues_curvature".

D.2 pdb2network.lua: creates the rbLEC network file (number of nodes and edge list) from a PDB file, to be used as input by lec.py. Example usage: $ lua pdb2rbLEC.lua CATH_FIXED/2x0qA02/2x0qA02.fixed The output reproduces the file CATH_FIXED/2x0qA02/2x0qA02.pdb.network_backboneRE_heavy_gt2

D.3 label.lua: creates files with the extension '*.label' from the files '*.pdb.stride', '*.pdb.dssp' and '*.pdb.network_backboneRE_heavy_gt2.residues_curvature'. Example usage: $ lua label.lua CATH_FIXED/2x0qA02/2x0qA02.pdb The output reproduces the file CATH_FIXED/2x0qA02/2x0qA02.pdb.network_backboneRE_heavy_gt2.residues_curvature.label

REFERENCES
[1] Berman, H., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T., Weissig, H., Shindyalov, I., & Bourne, P. (2000). The protein data bank. Nucleic acids research, 28, 235–42.
[2] Eastman, P., Swails, J., Chodera, J., McGibbon, R., Zhao, Y., Beauchamp, K., Wang, L.P., Simmonett, A., Harrigan, M., Stern, C., & others (2017). OpenMM 7: Rapid development of high performance algorithms for molecular dynamics. PLoS computational biology, 13(7), e1005659.
[3] Kabsch, W., & Sander, C. (1983). Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers: Original Research on Biomolecules, 22(12), 2577–2637.
[4] Frishman, D., & Argos, P. (1995). Knowledge-based protein secondary structure assignment. Proteins: Structure, Function, and Bioinformatics, 23(4), 566–579.
[5] Knudsen, M., & Wiuf, C. (2010). The CATH database. Human genomics, 4(3), 1–6.
[6] Levitt, N. (1992). The Euler characteristic is the unique locally determined numerical homotopy invariant of finite complexes. Discrete & computational geometry, 7, 59–67.
[7] Knill, O. (2011). A graph theoretical Gauss-Bonnet-Chern theorem. arXiv preprint arXiv:1111.5395.