3 datasets found
  1. The Encyclopedia of Domains (TED) structural domains assignments for...

    • zenodo.org
    application/gzip, bz2 +1
    Updated Oct 31, 2024
    Cite
    Andy Lau; Nicola Bordin; Shaun Kandathil; Ian Sillitoe; Vaishali Waman; Jude Wells; Christine Orengo; David T Jones (2024). The Encyclopedia of Domains (TED) structural domains assignments for AlphaFold Database v4 [Dataset]. http://doi.org/10.5281/zenodo.13908086
    Available download formats: application/gzip, zip, bz2
    Dataset updated
    Oct 31, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Andy Lau; Nicola Bordin; Shaun Kandathil; Ian Sillitoe; Vaishali Waman; Jude Wells; Christine Orengo; David T Jones
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Oct 31, 2024
    Description

    Dataset description:

    The Encyclopedia of Domains (TED) is a joint effort by CATH (Orengo group) and the Jones group at University College London to identify and classify protein domains in AlphaFold2 models from AlphaFold Database version 4, covering over 188 million unique sequences and 365 million domain assignments.

    In this data release, we make available to the community a table of domain boundaries and additional metadata on quality (pLDDT, globularity, number of secondary structures), taxonomy, and putative CATH SuperFamily or Fold assignments, for all 365 million domains (~324 million domains in TED100 and ~40 million domains in TED-redundant).

    For all chains in the chain-level TED-redundant files, the file contains boundary predictions, consensus level and information on the TED100 representative.

    For both TED100 and TED-redundant we provide the domain boundary predictions produced by each of the three methods employed in the project (Chainsaw, Merizo, UniDoc).

    We are making available 7,427 PDB files for potentially novel folds identified during the TED classification process, with an annotation table sorted by novelty, as well as representatives of 6,433 highly symmetrical folds.

    Please use the gunzip command to extract files with a '.gz' extension and "tar -xzvf file.tar.gz" to open .tar.gz files.
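
For very large tables it can be convenient to read the compressed files directly rather than extracting them first. The following is a minimal Python sketch of that idea using only the standard library; the function names are hypothetical, and it assumes the files are UTF-8, tab-separated text as the descriptions suggest.

```python
import gzip
import io
import tarfile
from typing import Iterator, List


def iter_gz_tsv(path: str) -> Iterator[List[str]]:
    """Yield tab-split rows from a gzip-compressed TSV, one line at a time."""
    with gzip.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            yield line.rstrip("\n").split("\t")


def iter_tar_gz_tsv(path: str) -> Iterator[List[str]]:
    """Yield tab-split rows from every regular file inside a .tar.gz archive."""
    with tarfile.open(path, mode="r:gz") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories and links
            raw = archive.extractfile(member)
            if raw is None:
                continue
            with io.TextIOWrapper(raw, encoding="utf-8") as text:
                for line in text:
                    yield line.rstrip("\n").split("\t")
```

Both generators hold only one line in memory at a time, which matters for tables with hundreds of millions of rows; the shell commands above remain the simpler choice when disk space is not a concern.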

    CATH annotations have been assigned using the Foldseek algorithm applied in various modes, and the Foldclass algorithm, both of which are used to report significant structural similarity to a known CATH domain.


    Note: The TED protocol differs from that of the standard CATH Assignment protocol for superfamily assignment, which also involves HMM-based protocols and manual curation for classification into superfamilies.

    Changelog Version 5:

    • Add: ted_365m.domain_summary.cath.globularity.taxid.tsv.tar.gz - This table, in the same format as the previous ted_100_324m.domain_summary.cath.globularity.taxid.tsv.tar.gz, contains per-domain annotations for the whole of TED, including domain quality metrics such as secondary structure element counts, globularity scores and average pLDDT, as well as taxonomic assignments.
    • Add: high_symmetry_folds_set.domain_summary.tsv.gz - subset of ted_365m.domain_summary.cath.globularity.taxid.tsv containing information on 6,433 high symmetry folds in TED. The entries are sorted in descending order by Z-score obtained from SymD.
    • Add: high_symmetry_folds_set_models.tar.gz - TED domain models in PDB format for 6,433 high symmetry folds in TED.
    • Add: ISP_data.tar.gz - Raw data for Interacting SuperFamily Pairs calculations used in the manuscript. A more detailed description of the ISP data is available below as well as within the tar.gz file.
    • Add: ted_redundant_40m_domain_id.list.gz - list of TED_domain_ID in TED redundant
    • Add: ted_100_324m_domain_id.list.gz - list of TED_domain_ID in TED100
    • Fix/Replace: A domain-level summary of TED, now consolidated into ted_365m.domain_summary.cath.globularity.taxid.tsv, is consistent with the protocol used in the manuscript. As Foldclass and Foldseek T-level hits provide all 4 CATH digits, we removed the H portion of the CATH code from each prediction at the T-level.
      Previously, the following columns
      14. cath_label - CATH superfamily code if predicted, either a C.A.T.H. homologous superfamily or C.A.T. fold assignment, e.g. 3.40.50.300
      15. cath_assignment_level - H for homologous superfamily assignment, T for fold-level assignment.
      16. cath_assignment_method - Method used to assign a CATH label, either Foldseek or Foldclass
      sometimes showed an additional label with a T-level prediction by Foldclass in the case of T-level assignments obtained by Foldseek, e.g.
      3.40.30,3.40.30 T foldseek,foldclass
      This has now been corrected to reflect the TED protocol, with Foldclass T-level assignments applied only to domains where a T-level assignment could not be made using Foldseek, e.g.
      domain-x 3.40.30 T foldseek
      domain-y 3.20.20 T foldclass

      Thus, in the current version of the data, the CATH assignment columns can only contain one of:
      H-level assignment by Foldseek (e.g. 3.40.50.300 H foldseek)
      T-level assignment by Foldseek (e.g. 3.40.30 T foldseek)
      T-level assignment by Foldclass (e.g. 3.40.30 T foldclass)
      or no assignment (- - -)
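
The four permitted states of columns 14 to 16 can be checked mechanically. The sketch below is one possible validator, not part of the TED tooling; the function name is hypothetical, and it assumes the method strings are lowercase and the unassigned placeholder is a single '-' in each column, as in the examples above.

```python
from typing import Optional, Tuple

# (cath_assignment_level, cath_assignment_method) -> number of dot-separated
# parts expected in cath_label: C.A.T.H has four, C.A.T has three.
ALLOWED_COMBINATIONS = {
    ("H", "foldseek"): 4,
    ("T", "foldseek"): 3,
    ("T", "foldclass"): 3,
}


def check_cath_assignment(
    label: str, level: str, method: str
) -> Optional[Tuple[str, str, str]]:
    """Return the triple if valid, None if unassigned, else raise ValueError."""
    if (label, level, method) == ("-", "-", "-"):
        return None  # domain without a CATH assignment
    expected_parts = ALLOWED_COMBINATIONS.get((level, method))
    if expected_parts is None:
        raise ValueError(f"unexpected level/method pair: {level!r}, {method!r}")
    parts = label.split(".")
    if len(parts) != expected_parts or not all(p.isdigit() for p in parts):
        raise ValueError(f"label {label!r} does not match level {level!r}")
    return label, level, method
```

A row in the older doubled form (3.40.30,3.40.30 T foldseek,foldclass) raises ValueError here, which gives a quick way to tell whether a local copy of the table predates the fix.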


    This dataset contains:

    • ted_214m_per_chain_segmentation.tsv
      The file contains all 214M protein chains in TED with consensus domain boundaries and proteome information in the following columns.
      1. AFDB_model_ID: chain identifier from AFDB in the format AF-
    • ted_365m_domain_boundaries_consensus_level.tsv.gz
      The file contains all domain assignments in TED100 and TED-redundant (365M) in the format:
      1. TED_ID: TED domain identifier in the format AF-
    • ted_100_324m_domain_id.list.gz - list of ~324 million domain identifiers in TED100, one per line in the format AF-
    • ted_redundant_40m_domain_id.list.gz - list of ~40 million domain identifiers in TED redundant, one per line in the format AF-
    • ted_365m.domain_summary.cath.globularity.taxid.tsv, novel_folds_set.domain_summary.tsv and high_symmetry_folds_set.domain_summary.tsv are header-less with the following columns separated by tabs (.tsv). novel_folds_set.domain_summary.tsv is sorted by novelty
      Note: The TED protocol differs from that of the standard CATH Assignment protocol for superfamily assignment, which also involves HMM-based protocols and manual curation for classification into superfamilies.

      1. ted_id - TED domain identifier in the format AF-
    • ted_324m_seq_clustering.cathlabels.tsv.gz
      The file contains the results of the domain sequences clustering with MMseqs2.
      Columns:
      1. Cluster_representative
      2. Cluster_member
      3. CATH code assignment if available, e.g. 3.40.50.300 for a domain with a homologous match or 3.20.20 for a domain matching at the fold level in the CATH classification
      4. CATH assignment type - either Foldseek-T, Foldseek-H or Foldclass

    • Domain assignments for TED redundant using single-chain and multi-chain consensus in ted_redundant_39m.multichain.consensus_domain_summary.taxid.tsv.gz and ted_redundant_39m.singlechain.consensus_domain_summary.taxid.tsv.gz
    • The file ted_redundant_39m.multichain.consensus_domain_summary.taxid.tsv.gz contains a header with the

  2. The Encyclopedia of Domains (TED) structural domains assignments for...

    • zenodo.org
    application/gzip, bz2 +1
    Updated Oct 31, 2024
    Cite
    Andy Lau; Nicola Bordin; Shaun Kandathil; Ian Sillitoe; Vaishali Waman; Jude Wells; Christine Orengo; David T Jones (2024). The Encyclopedia of Domains (TED) structural domains assignments for AlphaFold Database v4 [Dataset]. http://doi.org/10.5281/zenodo.13369203
    Available download formats: application/gzip, bz2, zip
    Dataset updated
    Oct 31, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Andy Lau; Nicola Bordin; Shaun Kandathil; Ian Sillitoe; Vaishali Waman; Jude Wells; Christine Orengo; David T Jones
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset description:

    The Encyclopedia of Domains (TED) is a joint effort by CATH (Orengo group) and the Jones group at University College London to identify and classify protein domains in AlphaFold2 models from AlphaFold Database version 4, covering over 188 million unique sequences and 324 million domain assignments.

    In this data release, we make available to the community a table of domain boundaries and additional metadata on quality (pLDDT, globularity, number of secondary structures), taxonomy and putative CATH SuperFamily or Fold assignments for all 324 million domains in TED100.

    For all chains in the TED-redundant dataset, the attached file contains boundary predictions, consensus level and information on the TED100 representative.

    Additionally, an archive with chain-level consensus domain assignments is available for 21 model organisms and 25 global health proteomes.

    For both TED100 and TED-redundant we provide the domain boundary predictions produced by each of the three methods employed in the project (Chainsaw, Merizo, UniDoc).

    We are making available PDB files for 7,427 novel folds identified during the TED classification process, with an annotation table sorted by novelty.

    Please use the gunzip command to extract files with a '.gz' extension.

    CATH annotations have been assigned using the Foldseek algorithm applied in various modes and the Foldclass algorithm, both of which are used to report significant structural similarity to a known CATH domain.
    Note: The TED protocol differs from that of our standard CATH Assignment protocol for superfamily assignment, which also involves HMM-based protocols and manual curation for remote matches.


    This dataset contains:

    • ted_214m_per_chain_segmentation.tsv
      The file contains all 214M protein chains in TED with consensus domain boundaries and proteome information in the following columns.
      1. AFDB_model_ID: chain identifier from AFDB in the format AF-
    • ted_365m_domain_boundaries_consensus_level.tsv.gz
      The file contains all domain assignments in TED100 and TED-redundant (365M) in the format:
      1. TED_ID: TED domain identifier in the format AF-
    • ted_100_324m.domain_summary.cath.globularity.taxid.tsv and novel_folds_set.domain_summary.tsv are header-less with the following columns separated by tabs (.tsv).
    • ted_324m_seq_clustering.cathlabels.tsv
      The file contains the results of the domain sequences clustering with MMseqs2.
      Columns:
      1. Cluster_representative
      2. Cluster_member
      3. CATH code assignment if available, e.g. 3.40.50.300 for a domain with a homologous match or 3.20.20 for a domain matching at the fold level in the CATH classification
      4. CATH assignment type - either Foldseek-T, Foldseek-H or Foldclass
    • novel_folds_set.domain_summary.tsv is sorted by novelty.
      1. ted_id - TED domain identifier in the format AF-
    • Domain assignments for TED redundant using single-chain and multi-chain consensus in ted_redundant_39m.multichain.consensus_domain_summary.taxid.tsv and ted_redundant_39m.singlechain.consensus_domain_summary.taxid.tsv
      The files contain a header with the following fields. Each column is tab-separated (.tsv).
      1. TED_redundant_id - TED chain identifier in the format AF-
    • novel_folds_set_models.tar.gz contains PDB files of all novel folds identified in TED100.
    • All per-tool domain boundary predictions are in the same format with the following columns.
      1. TED_chainID - TED chain identifier in the format AF-
    • Domain boundary predictions share the same format, with domains separated by ',', segments within a domain separated by '_', and segment boundaries (start, stop) separated by '-'

      e.g. domain prediction by Merizo for AF-A0A000-F1-model_v4
      AF-A0A000-F1-model_v4 e8872c7a0261b9e88e6ff47eb34e4162 394 2 10-52_289-394,53-288 0.90077

      Merizo predicts one discontinuous domain and one continuous domain:
      Domain 1 (discontinuous): 10-52_289-394
      segment 1: 10-52
      segment 2: 289-394
      Domain 2 (continuous): 53-288
      segment 1: 53-288
    • ted-tools-main.zip - copy of the https://github.com/psipred/ted-tools repository, containing tools and software used to generate TED.
    • cath-alphaflow-main.zip - copy of CATH-AlphaFlow, used to generate globularity scores for TED domains.
    • ted-web-master.zip - copy of TED-web, containing code to generate the web interface of TED (https://ted.cathdb.info)
    • gofocus_data.tar.bz2 - GOFocus model weights
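
The domain boundary string described above (for example 10-52_289-394,53-288) can be split into domains and segments with a few lines of Python. This is an illustrative sketch rather than code from ted-tools; the function names are hypothetical, and the length calculation assumes that start and stop are both inclusive residue positions, which the description does not state explicitly. How a chain with no predicted domains is encoded is not shown in the description, so that case is not handled.

```python
from typing import List, Tuple

Segment = Tuple[int, int]  # (start, stop) residue positions


def parse_boundaries(boundaries: str) -> List[List[Segment]]:
    """Split 'a-b_c-d,e-f' into a list of domains, each a list of segments."""
    domains: List[List[Segment]] = []
    for domain in boundaries.split(","):      # ',' separates domains
        segments: List[Segment] = []
        for segment in domain.split("_"):     # '_' separates segments
            start, stop = segment.split("-")  # '-' separates start and stop
            segments.append((int(start), int(stop)))
        domains.append(segments)
    return domains


def domain_length(segments: List[Segment]) -> int:
    """Residue count of one domain, assuming inclusive start and stop."""
    return sum(stop - start + 1 for start, stop in segments)


domains = parse_boundaries("10-52_289-394,53-288")
print(domains)                              # -> [[(10, 52), (289, 394)], [(53, 288)]]
print([len(d) > 1 for d in domains])        # discontinuous? -> [True, False]
print([domain_length(d) for d in domains])  # -> [149, 236]
```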
  3. The Encyclopedia of Domains (TED) structural domains assignments for...

    • zenodo.org
    Updated Mar 27, 2024
    Cite
    Andy Lau; Nicola Bordin; Shaun Kandathil; Ian Sillitoe; Vaishali Waman; Jude Wells; Christine Orengo; David T Jones (2024). The Encyclopedia of Domains (TED) structural domains assignments for AlphaFold Database v4 [Dataset]. http://doi.org/10.5281/zenodo.10848710
    Dataset updated
    Mar 27, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Andy Lau; Nicola Bordin; Shaun Kandathil; Ian Sillitoe; Vaishali Waman; Jude Wells; Christine Orengo; David T Jones
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset description:

    The Encyclopedia of Domains (TED) is a joint effort by CATH (Orengo group) and the Jones group at University College London to identify and classify protein domains in AlphaFold2 models from AlphaFold Database version 4, covering over 188 million unique sequences and 324 million domain assignments.

    In this data release, we make available to the community a table of domain boundaries and additional metadata on quality (pLDDT, globularity, number of secondary structures), taxonomy and putative CATH SuperFamily or Fold assignments for all 324 million domains in TED100.

    For all chains in the TED-redundant dataset, the attached file contains boundary predictions, consensus level and information on the TED100 representative.

    Additionally, an archive with chain-level consensus domain assignments is available for 21 model organisms and 25 global health proteomes:

    Organism TaxonID

    arabidopsis_thaliana 3702
    caenorhabditis_elegans 6239
    candida_albicans 237561
    danio_rerio 7955
    dictyostelium_discoideum 44689
    drosophila_melanogaster 7227
    escherichia_coli 83333
    glycine_max 3847
    homo_sapiens 9606
    methanocaldococcus_jannaschii 243232
    mus_musculus 10090
    oryza_sativa 39947
    rattus_norvegicus 10116
    saccharomyces_cerevisiae 559292
    schizosaccharomyces_pombe 284812
    zea_mays 4577
    ajellomyces_capsulatus 447093
    brugia_malayi 6279
    campylobacter_jejuni 192222
    cladophialophora_carrionii 86049
    dracunculus_medinensis 318479
    fonsecaea_pedrosoi 1442368
    haemophilus_influenzae 71421
    helicobacter_pylori 85962
    klebsiella_pneumoniae 1125630
    leishmania_infantum 5671
    madurella_mycetomatis 100816
    mycobacterium_leprae 272631
    mycobacterium_tuberculosis 83332
    mycobacterium_ulcerans 1299332
    neisseria_gonorrhoeae 242231
    nocardia_brasiliensis 1133849
    onchocerca_volvulus 6282
    paracoccidioides_lutzii 502779
    plasmodium_falciparum 36329
    pseudomonas_aeruginosa 208964
    salmonella_typhimurium 99287
    schistosoma_mansoni 6183
    shigella_dysenteriae 300267
    sporothrix_schenckii 1391915
    staphylococcus_aureus 93061
    streptococcus_pneumoniae 171101
    strongyloides_stercoralis 6248
    trypanosoma_brucei 185431
    trypanosoma_cruzi 353153
    wuchereria_bancrofti 6293


    For both TED100 and TED-redundant we provide the domain boundary predictions produced by each of the three methods employed in the project (Chainsaw, Merizo, UniDoc).

    We are making available PDB files for 7,427 novel folds identified during the TED classification process, with an annotation table sorted by novelty.


    This dataset contains:

    • ted_100_324m.domain_summary.cath.globularity.taxid.tsv and novel_folds_set.domain_summary.tsv are header-less with the following columns separated by tabs (.tsv). novel_folds_set.domain_summary.tsv is sorted by novelty.
      1. ted_id
      2. md5_domain
      3. consensus_level
      4. chopping
      5. nres_domain
      6. num_segments
      7. plddt
      8. num_helix_strand_turn
      9. num_helix
      10. num_strand
      11. num_helix_strand
      12. num_turn
      13. proteome_id
      14. cath_label
      15. cath_assignment_level
      16. cath_assignment_method
      17. packing_density
      18. norm_rg
      19. tax_common_name
      20. tax_scientific_name
      21. tax_lineage
    • Domain assignments for TED redundant in ted_redundant_39m.consensus_domain_summary.taxid.tsv
      The file contains a header with the following fields. Each column is tab-separated (.tsv).
      1. TED_redundant_id
      2. md5
      3. nres
      4. n_high
      5. n_med
      6. high_consensus
      7. med_consensus
      8. ndom_consensus
      9. n_targets
      10. proteome_id
      11. TED_redundant_species
      12. TED100_chain_rep
      13. TED100_chain_rep_species
    • novel_folds_set_models.tar.gz contains PDB files of all novel folds identified in TED100.
    • All per-tool domain boundary predictions are in the same format with the following columns.
      1. TED_chainID
      2. TED_chain_md5
      3. TED_chain_length
      4. ndoms
      5. Domain boundaries
      6. Prediction probability
    • Domain boundary predictions share the same format, with domains separated by ',', segments within a domain separated by '_', and segment boundaries (start, stop) separated by '-'

      e.g. domain prediction by Merizo for AF-A0A000-F1-model_v4
      AF-A0A000-F1-model_v4 e8872c7a0261b9e88e6ff47eb34e4162 394 2 10-52_289-394,53-288 0.90077

      Merizo predicts one discontinuous domain and one continuous domain:
      Domain 1 (discontinuous): 10-52_289-394
      segment 1: 10-52
      segment 2: 289-394
      Domain 2 (continuous): 53-288
      segment 1: 53-288
    • model_organisms_and_global_health_proteomes.tar.gz - domain assignments for 21 model organisms and 25 global health proteomes
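
Because the domain summary tables are header-less, column names have to be supplied by the reader. The sketch below attaches the 21 names listed above to each row using Python's standard csv module; the function name is hypothetical, and it assumes every row has exactly 21 tab-separated fields with no quoting.

```python
import csv
from typing import Dict, Iterator, TextIO

# Column names in the order given in the dataset description.
DOMAIN_SUMMARY_COLUMNS = [
    "ted_id", "md5_domain", "consensus_level", "chopping", "nres_domain",
    "num_segments", "plddt", "num_helix_strand_turn", "num_helix",
    "num_strand", "num_helix_strand", "num_turn", "proteome_id",
    "cath_label", "cath_assignment_level", "cath_assignment_method",
    "packing_density", "norm_rg", "tax_common_name", "tax_scientific_name",
    "tax_lineage",
]


def iter_domain_summary(handle: TextIO) -> Iterator[Dict[str, str]]:
    """Yield one dict per row of a header-less domain summary TSV."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for line_number, row in enumerate(reader, start=1):
        if len(row) != len(DOMAIN_SUMMARY_COLUMNS):
            raise ValueError(
                f"line {line_number}: expected {len(DOMAIN_SUMMARY_COLUMNS)} "
                f"fields, found {len(row)}"
            )
        yield dict(zip(DOMAIN_SUMMARY_COLUMNS, row))
```

All values are kept as strings so that placeholders for missing annotations pass through unchanged; numeric columns such as plddt can be converted by the caller once their exact format has been confirmed against the real file.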
