4 datasets found
  1. Protein Structural Domain Classification

    • cathdb.info
    • ec.i4cologne.com
    • +3 more
    Updated Sep 30, 2024
    (2024). Protein Structural Domain Classification [Dataset]. http://identifiers.org/MIR:00100005
    Description

    CATH Domain Classification List (latest release) - protein structural domains classified into CATH hierarchy.

  2. CATH protein domain classification (version 4.2)

    • rdr.ucl.ac.uk
    txt
    Updated May 30, 2023
    Ian Sillitoe; Natalie Dawson; Christine Orengo (2023). CATH protein domain classification (version 4.2) [Dataset]. http://doi.org/10.5522/04/7937330.v1
    Available download formats: txt
    Dataset provided by
    University College London
    Authors
    Ian Sillitoe; Natalie Dawson; Christine Orengo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    CATH is a classification of protein structures downloaded from the Protein Data Bank. We group protein domains into superfamilies when there is sufficient evidence they have diverged from a common ancestor. The files contained in this dataset correspond to the version 4.2 release of the CATH classification.

  3. The Encyclopedia of Domains (TED) structural domains assignments for...

    • zenodo.org
    application/gzip, bz2 +1
    Updated Oct 31, 2024
    + more versions
    Andy Lau; Nicola Bordin; Shaun Kandathil; Ian Sillitoe; Vaishali Waman; Jude Wells; Christine Orengo; David T Jones (2024). The Encyclopedia of Domains (TED) structural domains assignments for AlphaFold Database v4 [Dataset]. http://doi.org/10.5281/zenodo.13369203
    Available download formats: application/gzip, bz2, zip
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Andy Lau; Nicola Bordin; Shaun Kandathil; Ian Sillitoe; Vaishali Waman; Jude Wells; Christine Orengo; David T Jones
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset description:

    The Encyclopedia of Domains (TED) is a joint effort by CATH (Orengo group) and the Jones group at University College London to identify and classify protein domains in AlphaFold2 models from AlphaFold Database version 4, covering over 188 million unique sequences and 324 million domain assignments.

    In this data release, we make available to the community a table of domain boundaries and additional metadata on quality (pLDDT, globularity, number of secondary structures), taxonomy and putative CATH SuperFamily or Fold assignments for all 324 million domains in TED100.

    For all chains in the TED-redundant dataset, the attached file contains boundary predictions, consensus level and information on the TED100 representative.

    Additionally, an archive with chain-level consensus domain assignments is available for 21 model organisms and 25 global health proteomes.

    For both TED100 and TED-redundant we provide the domain boundary predictions output by each of the three methods employed in the project (Chainsaw, Merizo, UniDoc).

    We are also making available PDB files for 7,427 novel folds identified during the TED classification process, with an annotation table sorted by novelty.

    Please use the gunzip command to extract files with a '.gz' extension.

    CATH annotations have been assigned using the FoldSeek algorithm applied in various modes and the FoldClass algorithm, both of which are used to report significant structural similarity to a known CATH domain.
    Note: The TED protocol differs from that of our standard CATH Assignment protocol for superfamily assignment, which also involves HMM-based protocols and manual curation for remote matches.


    This dataset contains:

    • ted_214m_per_chain_segmentation.tsv
      The file contains all 214M protein chains in TED with consensus domain boundaries and proteome information in the following columns.
      1. AFDB_model_ID: chain identifier from AFDB in the format AF-
    • ted_365m_domain_boundaries_consensus_level.tsv.gz
      The file contains all domain assignments in TED100 and TED-redundant (365M) in the format:
      1. TED_ID: TED domain identifier in the format AF-
    • ted_100_324m.domain_summary.cath.globularity.taxid.tsv and novel_folds_set.domain_summary.tsv are header-less with the following columns separated by tabs (.tsv).
    • ted_324m_seq_clustering.cathlabels.tsv
      The file contains the results of the domain sequences clustering with MMseqs2.
      Columns:
      1. Cluster_representative
      2. Cluster_member
      3. CATH code assignment if available, e.g. 3.40.50.300 for a domain with a homologous match or 3.20.20 for a domain matching at the fold level in the CATH classification
      4. CATH assignment type - either Foldseek-T, Foldseek-H or Foldclass
    • novel_folds_set.domain_summary.tsv is sorted by novelty.
      1. ted_id - TED domain identifier in the format AF-
    • Domain assignments for TED redundant using single-chain and multi-chain consensus in ted_redundant_39m.multichain.consensus_domain_summary.taxid.tsv and ted_redundant_39m.singlechain.consensus_domain_summary.taxid.tsv
      The files contain a header with the following fields. Each column is tab-separated (.tsv).
      1. TED_redundant_id - TED chain identifier in the format AF-
    • novel_folds_set_models.tar.gz contains PDB files of all novel folds identified in TED100.
    • All per-tool domain boundary predictions are in the same format with the following columns.
      1. TED_chainID - TED chain identifier in the format AF-
    • Domain boundary predictions share the same format: domains are separated by ',', segments within a domain by '_', and segment boundaries (start, stop) by '-'.

      For example, the domain prediction by Merizo for AF-A0A000-F1-model_v4:
      AF-A0A000-F1-model_v4 e8872c7a0261b9e88e6ff47eb34e4162 394 2 10-52_289-394,53-288 0.90077

      Merizo predicts one discontinuous domain and one continuous domain:
      Domain 1 (discontinuous): 10-52_289-394
      segment 1: 10-52
      segment 2: 289-394
      Domain 2 (continuous): 53-288
      segment 1: 53-288
    • ted-tools-main.zip - copy of the https://github.com/psipred/ted-tools repository, containing tools and software used to generate TED.
    • cath-alphaflow-main.zip - copy of CATH-AlphaFlow, used to generate globularity scores for TED domains.
    • ted-web-master.zip - copy of TED-web, containing code to generate the web interface of TED (https://ted.cathdb.info)
    • gofocus_data.tar.bz2 - GOFocus model weights
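The boundary-string format described in the list above can be parsed with a few string splits. The following is a minimal sketch, not code from the TED tools: the function name `parse_boundaries` is hypothetical, the column position of the boundary string (fifth whitespace-separated field) is read off the single Merizo example line, and the use of ',' between domains is likewise inferred from that example.

```python
def parse_boundaries(boundaries):
    """Split a TED-style boundary string into domains and segments.

    Assumed layout (inferred from the Merizo example line):
    domains separated by ',', segments by '_', start/stop by '-'.
    Returns a list of domains, each a list of (start, stop) tuples.
    """
    domains = []
    for domain in boundaries.split(","):
        segments = []
        for segment in domain.split("_"):
            start, stop = segment.split("-")
            segments.append((int(start), int(stop)))
        domains.append(segments)
    return domains


# Example line quoted in the dataset description.
line = ("AF-A0A000-F1-model_v4 e8872c7a0261b9e88e6ff47eb34e4162 "
        "394 2 10-52_289-394,53-288 0.90077")
fields = line.split()
domains = parse_boundaries(fields[4])  # assumption: 5th field holds boundaries

for index, segments in enumerate(domains, start=1):
    kind = "discontinuous" if len(segments) > 1 else "continuous"
    n_residues = sum(stop - start + 1 for start, stop in segments)
    print(f"Domain {index} ({kind}): {segments}, {n_residues} residues")
```

Representing each domain as a list of segments keeps discontinuous domains intact, so per-domain residue counts can be computed without losing which segments belong together.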
  4. rbLEC - restricted backbone Local Euler Characteristic - from CATH database

    • data.niaid.nih.gov
    Updated Sep 27, 2023
    Rodrigo A. Moreira (2023). rbLEC - restricted backbone Local Euler Characteristic - from CATH database [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8382583
    Dataset provided by
    Basque Center for Applied Mathematics
    Authors
    Rodrigo A. Moreira
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Author: Rodrigo A. Moreira (C) 2023 https://orcid.org/0000-0002-7605-8722 LICENSE: CC BY-NC-ND 4.0 (https://creativecommons.org/licenses/by-nc-nd/4.0/)

    rbLEC - Local Euler Characteristics - from CATH database

    A. rbLEC NETWORK

    [I] The network for each PDB [1] structure is defined by taking the PDB atoms N, CA and C of each residue as the nodes of a graph G.
    [II] An edge of G is set if the distance between two atoms in [I] is greater than 2.0 Angstroms.
    [III] The graph G is defined in the files with extension ".network_backboneRE_heavy_gt2".
    

    Equation (1) [6,7] \begin{equation} \chi = \sum_{k=1}^{N} \kappa_k = \sum_{k=1}^{N} \underbrace{ \left(1 + \sum_{l=1}^{\infty} (-1)^{l} \frac{v_{l-1}}{l+1} \right)_{k}}_{\kappa_k} \end{equation}

    Equation (2) \begin{equation} LEC = \sum_{m \in R} \kappa_m = \kappa_{N} + \kappa_{CA} + \kappa_{C} \end{equation}
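Equations (1) and (2) can be illustrated on a toy graph. The sketch below is not the dataset's lec.py. It assumes, following Knill [7], that v_{l-1} for a vertex k counts the complete subgraphs on l vertices inside the neighbourhood of k (so v_0 is the degree and v_1 the number of edges among the neighbours); the description above does not define v explicitly. The function names and the 0-based node numbering are hypothetical, and brute-force clique counting is only practical for very small graphs.

```python
from fractions import Fraction
from itertools import combinations


def knill_curvatures(n_nodes, edges):
    """Return kappa_k of Equation (1) for every vertex of an undirected graph.

    Assumption: v_{l-1} is the number of l-vertex cliques among the
    neighbours of vertex k (Knill's definition).
    """
    adj = {v: set() for v in range(n_nodes)}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)

    kappas = []
    for k in range(n_nodes):
        neighbours = sorted(adj[k])
        kappa = Fraction(1)
        for l in range(1, len(neighbours) + 1):
            # Count groups of l neighbours that are pairwise connected.
            v = sum(
                1
                for group in combinations(neighbours, l)
                if all(q in adj[p] for p, q in combinations(group, 2))
            )
            if v == 0:
                break  # no l-cliques means no larger cliques either
            kappa += Fraction((-1) ** l * v, l + 1)
        kappas.append(kappa)
    return kappas


def residue_lec(kappas, atom_nodes):
    """Equation (2): sum the kappas of one residue's N, CA and C nodes."""
    return sum(kappas[m] for m in atom_nodes)


# Toy graphs: a triangle and a three-node path.
triangle = knill_curvatures(3, [(0, 1), (1, 2), (0, 2)])
path = knill_curvatures(3, [(0, 1), (1, 2)])
print(triangle, sum(triangle))
print(path, sum(path))
```

In both toy cases the curvatures sum to 1, the Euler characteristic of a filled triangle and of a path, which is the consistency that Equation (1) expresses. Exact fractions are used so the sums can be compared without floating-point tolerance.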

    B. FILENAME EXTENSIONS

    B.1 Basic files

    ".fixed" PDB file after use of pdbfixer[2] in structures from CATH database.

    ".dssp" Output of DSSP[3] software

    ".stride" Output of STRIDE[4] software

    B.2 Data files

    ".network_backboneRE_heavy_gt2" - Generated by D.2 below. Describes the network graph defined in A above.

    ".knill_curvature" - Generated by D.1 below. Contains the filtration of kappas for each vertex of the network.

    ".residues_curvature" - Generated by D.1 below. Contains the filtration of LEC, Equation (2) above, for each residue, namely the sum of the 3 kappas from the respective ".knill_curvature" file corresponding to the PDB atoms N, CA and C, as described in A above.

    ".label" - Generated by D.3 below. Extra file for easier assessment of structures. It carries the same LEC information as the corresponding ".residues_curvature" file, merged with the ".dssp" and ".stride" classes as well as the residue name and residue ID for each molecule. Format of columns: cutoff resname resid DSSP_class STRIDE_class LEC
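The column layout listed for ".label" files can be read into named fields as sketched below. This is an illustration only: it assumes the six columns are whitespace-separated, that cutoff and LEC are numeric, and that resid is a plain integer, none of which is stated in the description. The sample line is made up for the example and is not taken from the dataset.

```python
from typing import NamedTuple


class LabelRow(NamedTuple):
    """One row of a '.label' file, in the documented column order."""
    cutoff: float
    resname: str
    resid: int
    dssp_class: str
    stride_class: str
    lec: float


def parse_label_line(line):
    """Parse one line; assumes six whitespace-separated columns."""
    cutoff, resname, resid, dssp, stride, lec = line.split()
    return LabelRow(float(cutoff), resname, int(resid), dssp, stride, float(lec))


# Hypothetical sample line, for illustration only (not real data).
row = parse_label_line("2.0 ALA 12 H H 0.5")
print(row.resname, row.resid, row.lec)
```

If the real files leave a secondary-structure class blank, a whitespace split would miscount columns, so the actual delimiter should be checked against a file before relying on this.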

    C. FOLDERS

    CATH_FIXED (after uncompressing cath_fixed.tar.xz, approximately 13 GB)
      Contains the fixed PDBs and LECs from the CATH [5] database.
    

    D. SOFTWARE

     D.1 lec.py: computes the kappas in Equation (1) above.
      Example usage:
        $ python3 lec.py CATH_FIXED/2x0qA02/2x0qA02
      It creates files with extensions ".kappas" and ".relec", which reproduce, respectively, the files with extensions ".knill_curvature" and ".residues_curvature".

     D.2 pdb2network.lua: creates rbLEC network file (number of nodes and edges list) from PDB to be used as input by lec.py.
      Example usage:
        $ lua pdb2rbLEC.lua CATH_FIXED/2x0qA02/2x0qA02.fixed
      Output reproduces the file CATH_FIXED/2x0qA02/2x0qA02.pdb.network_backboneRE_heavy_gt2
    
    
     D.3 label.lua: creates files with extension '*.label' from the files '*.pdb.stride', '*.pdb.dssp' and '*.pdb.network_backboneRE_heavy_gt2.residues_curvature'.
      Example usage:
         $ lua label.lua CATH_FIXED/2x0qA02/2x0qA02.pdb
      Output reproduces the file CATH_FIXED/2x0qA02/2x0qA02.pdb.network_backboneRE_heavy_gt2.residues_curvature.label
    

    REFERENCES
    [1] Berman, H., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T., Weissig, H., Shindyalov, I., & Bourne, P. (2000). The protein data bank. Nucleic acids research, 28, 235–42.
    [2] Eastman, P., Swails, J., Chodera, J., McGibbon, R., Zhao, Y., Beauchamp, K., Wang, L.P., Simmonett, A., Harrigan, M., Stern, C., & others (2017). OpenMM 7: Rapid development of high performance algorithms for molecular dynamics. PLoS computational biology, 13(7), e1005659.
    [3] Kabsch, W., & Sander, C. (1983). Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers: Original Research on Biomolecules, 22(12), 2577–2637.
    [4] Frishman, D., & Argos, P. (1995). Knowledge-based protein secondary structure assignment. Proteins: Structure, Function, and Bioinformatics, 23(4), 566–579.
    [5] Knudsen, M., & Wiuf, C. (2010). The CATH database. Human genomics, 4(3), 1–6.
    [6] Levitt, N. (1992). The Euler characteristic is the unique locally determined numerical homotopy invariant of finite complexes. Discrete & computational geometry, 7, 59–67.
    [7] Knill, O. (2011). A graph theoretical Gauss-Bonnet-Chern theorem. arXiv preprint arXiv:1111.5395.
