33 datasets found
  1. The Encyclopedia of Domains (TED) structural domains assignments for...

    • zenodo.org
    Updated Mar 27, 2024
    Andy Lau; Nicola Bordin; Shaun Kandathil; Ian Sillitoe; Vaishali Waman; Jude Wells; Christine Orengo; David T Jones (2024). The Encyclopedia of Domains (TED) structural domains assignments for AlphaFold Database v4 [Dataset]. http://doi.org/10.5281/zenodo.10848710
    Dataset updated: Mar 27, 2024
    Dataset provided by: Zenodo (http://zenodo.org/)
    Authors: Andy Lau; Nicola Bordin; Shaun Kandathil; Ian Sillitoe; Vaishali Waman; Jude Wells; Christine Orengo; David T Jones
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset description:

    The Encyclopedia of Domains (TED) is a joint effort by CATH (Orengo group) and the Jones group at University College London to identify and classify protein domains in AlphaFold2 models from AlphaFold Database version 4, covering over 188 million unique sequences and 324 million domain assignments.

    This data release provides a table of domain boundaries and additional metadata on quality (pLDDT, globularity, number of secondary structures), taxonomy and putative CATH SuperFamily or Fold assignments for all 324 million domains in TED100.

    For all chains in the TED-redundant dataset, the attached file contains boundary predictions, consensus level and information on the TED100 representative.

    Additionally, an archive with chain-level consensus domain assignments is available for 21 model organisms and 25 global health proteomes:

    Organism TaxonID

    arabidopsis_thaliana 3702
    caenorhabditis_elegans 6239
    candida_albicans 237561
    danio_rerio 7955
    dictyostelium_discoideum 44689
    drosophila_melanogaster 7227
    escherichia_coli 83333
    glycine_max 3847
    homo_sapiens 9606
    methanocaldococcus_jannaschii 243232
    mus_musculus 10090
    oryza_sativa 39947
    rattus_norvegicus 10116
    saccharomyces_cerevisiae 559292
    schizosaccharomyces_pombe 284812
    zea_mays 4577
    ajellomyces_capsulatus 447093
    brugia_malayi 6279
    campylobacter_jejuni 192222
    cladophialophora_carrionii 86049
    dracunculus_medinensis 318479
    fonsecaea_pedrosoi 1442368
    haemophilus_influenzae 71421
    helicobacter_pylori 85962
    klebsiella_pneumoniae 1125630
    leishmania_infantum 5671
    madurella_mycetomatis 100816
    mycobacterium_leprae 272631
    mycobacterium_tuberculosis 83332
    mycobacterium_ulcerans 1299332
    neisseria_gonorrhoeae 242231
    nocardia_brasiliensis 1133849
    onchocerca_volvulus 6282
    paracoccidioides_lutzii 502779
    plasmodium_falciparum 36329
    pseudomonas_aeruginosa 208964
    salmonella_typhimurium 99287
    schistosoma_mansoni 6183
    shigella_dysenteriae 300267
    sporothrix_schenckii 1391915
    staphylococcus_aureus 93061
    streptococcus_pneumoniae 171101
    strongyloides_stercoralis 6248
    trypanosoma_brucei 185431
    trypanosoma_cruzi 353153
    wuchereria_bancrofti 6293


    For both TED100 and TED-redundant, we provide the domain boundary predictions output by each of the three methods employed in the project (Chainsaw, Merizo, UniDoc).

    We are making available 7,427 PDB files of novel folds identified during the TED classification process, together with an annotation table sorted by novelty.


    This dataset contains:

    • ted_100_324m.domain_summary.cath.globularity.taxid.tsv and novel_folds_set.domain_summary.tsv are header-less, tab-separated (.tsv) files; novel_folds_set.domain_summary.tsv is sorted by novelty. Both have the following columns:
      1. ted_id
      2. md5_domain
      3. consensus_level
      4. chopping
      5. nres_domain
      6. num_segments
      7. plddt
      8. num_helix_strand_turn
      9. num_helix
      10. num_strand
      11. num_helix_strand
      12. num_turn
      13. proteome_id
      14. cath_label
      15. cath_assignment_level
      16. cath_assignment_method
      17. packing_density
      18. norm_rg
      19. tax_common_name
      20. tax_scientific_name
      21. tax_lineage
    • Domain assignments for TED redundant in ted_redundant_39m.consensus_domain_summary.taxid.tsv
      The file contains a header with the following fields. Each column is tab separated (.tsv).
      1. TED_redundant_id
      2. md5
      3. nres
      4. n_high
      5. n_med
      6. high_consensus
      7. med_consensus
      8. ndom_consensus
      9. n_targets
      10. proteome_id
      11. TED_redundant_species
      12. TED100_chain_rep
      13. TED100_chain_rep_species
    • novel_folds_set_models.tar.gz contains PDB files of all novel folds identified in TED100.
    • All per-tool domain boundaries predictions are in the same format with the following columns.
      1. TED_chainID
      2. TED_chain_md5
      3. TED_chain_length
      4. ndoms
      5. Domain boundaries
      6. Prediction probability
    • Domain boundary predictions share the same format: domains are separated by ',', the segments of a domain by '_', and the segment boundaries (start, stop) by '-'.

      For example, the domain prediction by Merizo for AF-A0A000-F1-model_v4:
      AF-A0A000-F1-model_v4 e8872c7a0261b9e88e6ff47eb34e4162 394 2 10-52_289-394,53-288 0.90077

      Merizo predicts one discontinuous domain and one continuous domain:
      Domain 1 (discontinuous): 10-52_289-394
      segment 1: 10-52
      segment 2: 289-394
      Domain 2 (continuous): 53-288
      segment 1: 53-288
    • model_organisms_and_global_health_proteomes.tar.gz - domain assignments for 21 model organisms and 25 global health proteomes
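
The boundary format described above can be read with a few lines of Python. This is a minimal sketch, not part of the TED release: the function names `parse_chopping` and `parse_prediction_line` are hypothetical, the use of ',' between domains is inferred from the Merizo example line, and any special values the real files may use for chains without domains are not handled.

```python
from typing import Dict, List, Tuple

Segment = Tuple[int, int]  # (start, stop) residue numbers of one segment


def parse_chopping(chopping: str) -> List[List[Segment]]:
    """Split a boundaries string into domains, each a list of segments.

    Assumed separators: ',' between domains (inferred from the example),
    '_' between segments of one domain, '-' between start and stop.
    """
    domains = []
    for domain in chopping.split(","):
        segments = []
        for segment in domain.split("_"):
            start, stop = segment.split("-")
            segments.append((int(start), int(stop)))
        domains.append(segments)
    return domains


def parse_prediction_line(line: str) -> Dict[str, object]:
    """Parse one per-tool prediction line with the six listed columns."""
    # split() accepts tabs or spaces, so it also works on the example
    # line as it is printed in the description.
    chain_id, md5, length, ndoms, boundaries, prob = line.split()
    return {
        "TED_chainID": chain_id,
        "TED_chain_md5": md5,
        "TED_chain_length": int(length),
        "ndoms": int(ndoms),
        "domains": parse_chopping(boundaries),
        "probability": float(prob),
    }


if __name__ == "__main__":
    example = (
        "AF-A0A000-F1-model_v4 e8872c7a0261b9e88e6ff47eb34e4162 "
        "394 2 10-52_289-394,53-288 0.90077"
    )
    record = parse_prediction_line(example)
    for number, segments in enumerate(record["domains"], start=1):
        kind = "discontinuous" if len(segments) > 1 else "continuous"
        print(f"Domain {number} ({kind}): {segments}")
```

A domain with more than one segment is discontinuous, which matches the reading of the Merizo example given in the description.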
  2. Foldclass databases for protein structural domains in CATH and TED - Dataset...

    • b2find.eudat.eu
    Updated Feb 1, 2025
    (2025). Foldclass databases for protein structural domains in CATH and TED - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/89e96e17-6989-53e9-9768-45066418d923
    Dataset updated: Feb 1, 2025
    Description

    This repository contains databases of protein domains for use with Foldclass and Merizo-search. We provide databases for all 365 million domains in TED, as well as all classified domains in CATH 4.3.

    Foldclass and Merizo-search use two formats for databases. The default format uses a PyTorch tensor and a pickled list of Python tuples to store the data. This format is used for the CATH database, which is small enough to fit in memory. For larger-than-memory datasets, such as TED, we use a binary format that is searched using the Faiss library.

    The CATH database requires approximately 1.4 GB of disk space, whereas the TED database requires about 885 GB. Please ensure you have enough free storage space before downloading. For best search performance with the TED database, store it on the fastest storage hardware available to you.

    IMPORTANT: We recommend going into each folder and downloading the files individually; if you download a whole folder in one go, you will get a zip file that then needs to be decompressed. This is particularly an issue for the TED database, because you will need roughly twice the storage space compared with downloading the individual files. Our GitHub repository (see Related Materials below) contains a convenience script to download each database; we recommend using that.
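
The storage advice above can be checked before starting a download. The following is a rough sketch using only the Python standard library; the helper names are hypothetical, the sizes are the approximate figures quoted in the description, and treating 1 GB as 10**9 bytes is an assumption. It is not the convenience script from the authors' GitHub repository.

```python
import shutil

# Approximate sizes quoted in the description, in gigabytes.
CATH_DB_GB = 1.4
TED_DB_GB = 885


def required_gb(db_size_gb: float, as_zip: bool = False) -> float:
    """Estimate the space a download needs.

    Downloading a whole folder as a zip needs roughly twice the space,
    because the archive and its decompressed contents coexist on disk.
    """
    return db_size_gb * 2 if as_zip else db_size_gb


def has_enough_space(path: str, db_size_gb: float, as_zip: bool = False) -> bool:
    """Compare free space on the filesystem holding `path` with the estimate."""
    free_gb = shutil.disk_usage(path).free / 1e9  # assumes 1 GB = 10**9 bytes
    return free_gb >= required_gb(db_size_gb, as_zip)


if __name__ == "__main__":
    print("CATH database fits:", has_enough_space(".", CATH_DB_GB))
    print("TED, individual files:", has_enough_space(".", TED_DB_GB))
    print("TED, zipped folder:", has_enough_space(".", TED_DB_GB, as_zip=True))
```

Pass the directory where the database will be stored instead of `"."` so that the check runs against the right filesystem.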

  3. Protein Structural Domain Classification

    • cathdb.info
    • ec.i4cologne.com
    • +3more
    Updated Sep 30, 2024
    (2024). Protein Structural Domain Classification [Dataset]. http://identifiers.org/MIR:00100005
    Dataset updated: Sep 30, 2024
    Description

    CATH Domain Classification List (latest release) - protein structural domains classified into CATH hierarchy.

  4. TED-to-CATH HHM database

    • figshare.com
    Updated May 9, 2025
    Claudia Alvarez-Carreño (2025). TED-to-CATH HHM database [Dataset]. http://doi.org/10.6084/m9.figshare.28531754.v1
    Dataset updated: May 9, 2025
    Dataset provided by: figshare
    Authors: Claudia Alvarez-Carreño
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This database is a version of the TED database mapped to CATH domains and filtered for a maximum pairwise sequence identity of 30%.

  5. Alpha-2-macroglobulin, TED domain

    • ebi.ac.uk
    Updated Apr 4, 2019
    (2019). Alpha-2-macroglobulin, TED domain [Dataset]. https://www.ebi.ac.uk/interpro/entry/IPR041813
    Dataset updated: Apr 4, 2019
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This entry represents the TED (thiol ester-containing domain) domain found in alpha2-macroglobulin (alpha(2)-M) and related proteins. This domain has a short, highly conserved region of proteinase-binding alpha-macro-globulins containing the cysteine and a glutamine of a thiol-ester bond that is cleaved at the moment of proteinase binding, and mediates the covalent binding of the alpha-macro-globulin to the proteinase. The GCGEQ motif is highly conserved. Proteins containing this domain also include pregnancy zone protein (PZP). Alpha(2)-M and PZP are broadly specific proteinase inhibitors. Alpha(2)-M is a major carrier protein in serum. The structural thioester of alpha(2)-M is involved in the immobilization and entrapment of proteases. PZP is a trace protein in the plasma of non-pregnant females and males, and is elevated in pregnancy. Alpha(2)-M and PZP bind to placental protein-14 and may modulate its activity in T-cell growth and cytokine production, contributing to fetal survival. It has been suggested that thioester bond cleavage promotes the binding of PZP and alpha(2)-M to the CD91 receptor, clearing them from circulation.

  6. A-macroglobulin TED domain

    • ebi.ac.uk
    Updated Jun 13, 2021
    (2021). A-macroglobulin TED domain [Dataset]. https://www.ebi.ac.uk/interpro/entry/pfam/PF07678
    Dataset updated: Jun 13, 2021
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This entry corresponds to the TED domain of complement components such as C3, C4 and C5. This domain contains a short, highly conserved region of proteinase-binding alpha-macro-globulins containing the cysteine and a glutamine of a thiol-ester bond that is cleaved at the moment of proteinase binding, and mediates the covalent binding of the alpha-macro-globulin to the proteinase. The GCGEQ motif is highly conserved.

  7. Dataset of book subjects that contain Doonreagan : a private domain of Ted...

    • workwithdata.com
    Updated Nov 7, 2024
    Work With Data (2024). Dataset of book subjects that contain Doonreagan : a private domain of Ted Hughes [Dataset]. https://www.workwithdata.com/datasets/book-subjects?f=1&fcol0=j0-book&fop0=%3D&fval0=Doonreagan+%3A+a+private+domain+of+Ted+Hughes&j=1&j0=books
    Dataset updated: Nov 7, 2024
    Dataset authored and provided by: Work With Data
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is about book subjects. It has 1 row and is filtered where the book is Doonreagan : a private domain of Ted Hughes. It features 10 columns including number of authors, number of books, earliest publication date, and latest publication date.

  8. ted.in - Historical whois Lookup

    • whoisdatacenter.com
    csv
    AllHeart Web Inc, ted.in - Historical whois Lookup [Dataset]. https://whoisdatacenter.com/domain/ted.in/
    Available download formats: csv
    Dataset authored and provided by: AllHeart Web Inc
    License: https://whoisdatacenter.com/terms-of-use/
    Time period covered: Mar 15, 1985 - Aug 24, 2025
    Description

    Explore the historical Whois records related to ted.in (Domain). Get insights into ownership history and changes over time.

  9. ted-med.com - Historical whois Lookup

    • whoisdatacenter.com
    csv
    AllHeart Web Inc, ted-med.com - Historical whois Lookup [Dataset]. https://whoisdatacenter.com/domain/ted-med.com/
    Available download formats: csv
    Dataset authored and provided by: AllHeart Web Inc
    License: https://whoisdatacenter.com/terms-of-use/
    Time period covered: Mar 15, 1985 - Sep 14, 2025
    Description

    Explore the historical Whois records related to ted-med.com (Domain). Get insights into ownership history and changes over time.

  10. internet-ted.com - Historical whois Lookup

    • whoisdatacenter.com
    csv
    AllHeart Web Inc, internet-ted.com - Historical whois Lookup [Dataset]. https://whoisdatacenter.com/domain/internet-ted.com/
    Available download formats: csv
    Dataset authored and provided by: AllHeart Web Inc
    License: https://whoisdatacenter.com/terms-of-use/
    Time period covered: Mar 15, 1985 - Sep 13, 2025
    Description

    Explore the historical Whois records related to internet-ted.com (Domain). Get insights into ownership history and changes over time.

  11. ted.pub - Historical whois Lookup

    • whoisdatacenter.com
    csv
    AllHeart Web Inc, ted.pub - Historical whois Lookup [Dataset]. https://whoisdatacenter.com/domain/ted.pub/
    Available download formats: csv
    Dataset authored and provided by: AllHeart Web Inc
    License: https://whoisdatacenter.com/terms-of-use/
    Time period covered: Mar 15, 1985 - Aug 25, 2025
    Description

    Explore the historical Whois records related to ted.pub (Domain). Get insights into ownership history and changes over time.

  12. TEP1, second CUB domain

    • ebi.ac.uk
    Updated Aug 15, 2025
    (2025). TEP1, second CUB domain [Dataset]. https://www.ebi.ac.uk/interpro/entry/pfam/PF21412
    Dataset updated: Aug 15, 2025
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Thioester-containing protein 1 (TEP1) is a central component in the innate immune response of Anopheles gambiae to Plasmodium infection. TEP1 has a series of eight macroglobulin (MG) domains and a beta-sheet CUB domain that flanks the alpha-helical TED domain. The CUB domain can be divided into two sections, one on the N-terminal side and the second on the C-terminal side of the TED domain. This entry represents the second part of the CUB domain [[cite:PUB00043241],[cite:PUB00064207]].

  13. ted-construction.com - Historical whois Lookup

    • whoisdatacenter.com
    csv
    AllHeart Web Inc, ted-construction.com - Historical whois Lookup [Dataset]. https://whoisdatacenter.com/domain/ted-construction.com/
    Available download formats: csv
    Dataset authored and provided by: AllHeart Web Inc
    License: https://whoisdatacenter.com/terms-of-use/
    Time period covered: Mar 15, 1985 - Aug 28, 2025
    Description

    Explore the historical Whois records related to ted-construction.com (Domain). Get insights into ownership history and changes over time.

  14. ted-lapidus.net - Historical whois Lookup

    • whoisdatacenter.com
    csv
    Updated Aug 1, 2024
    AllHeart Web Inc (2024). ted-lapidus.net - Historical whois Lookup [Dataset]. https://whoisdatacenter.com/domain/ted-lapidus.net/
    Available download formats: csv
    Dataset updated: Aug 1, 2024
    Dataset authored and provided by: AllHeart Web Inc
    License: https://whoisdatacenter.com/terms-of-use/
    Time period covered: Mar 15, 1985 - Aug 25, 2025
    Description

    Explore the historical Whois records related to ted-lapidus.net (Domain). Get insights into ownership history and changes over time.

  15. Thioester domain

    • ebi.ac.uk
    Updated Dec 24, 2024
    (2024). Thioester domain [Dataset]. https://www.ebi.ac.uk/interpro/entry/pfam/PF20610
    Dataset updated: Dec 24, 2024
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This domain is related to the TED domain [pfam:PF08341].

  16. MENYO-20k: A Multi-domain English - Yorùbá Corpus for Machine Translation

    • live.european-language-grid.eu
    tsv
    Updated Oct 8, 2021
    (2021). MENYO-20k: A Multi-domain English - Yorùbá Corpus for Machine Translation [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/7687
    Available download formats: tsv
    Dataset updated: Oct 8, 2021
    License: Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    MENYO-20k is a multi-domain parallel dataset with texts obtained from news articles, TED talks, movie transcripts, radio transcripts, science and technology texts, and other short articles curated from the web and professional translators. The dataset has 20,100 parallel sentences split into 10,070 training sentences, 3,397 development sentences, and 6,633 test sentences (3,419 multi-domain, 1,714 news domain, and 1,500 TED talks speech transcript domain).

    The dataset is open, but for non-commercial use only, because some of the data sources, such as TED talks and JW news, require permission for commercial use.

    Acknowledgement: This project was supported by the AI4D language dataset fellowship through K4All and Zindi Africa.

  17. my-ted.com - Historical whois Lookup

    • whoisdatacenter.com
    csv
    AllHeart Web Inc, my-ted.com - Historical whois Lookup [Dataset]. https://whoisdatacenter.com/domain/my-ted.com/
    Available download formats: csv
    Dataset authored and provided by: AllHeart Web Inc
    License: https://whoisdatacenter.com/terms-of-use/
    Time period covered: Mar 15, 1985 - Sep 2, 2025
    Description

    Explore the historical Whois records related to my-ted.com (Domain). Get insights into ownership history and changes over time.

  18. CreativeWork

    • pfocr.wikipathways.org
    Updated Nov 8, 2022
    WikiPathways (2022). CreativeWork [Dataset]. https://pfocr.wikipathways.org/figures/PMC4451739_fimmu-06-00262-g009.html
    Dataset updated: Nov 8, 2022
    Dataset authored and provided by: WikiPathways
    License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Regulation of the alternative pathway. (A) FH as a master regulator of C3b in the fluid phase and on the cell surface. FH binds to C3b in fluid phase preventing novel convertase formation. FH may bind to C3b and GAGs on the cell surface and the architecture of the complex depends on the level of activation of the cell and the density of deposited C3 fragments. Resting cells have only a few C3b molecules that are deposited and FH binds to them with the regulatory domains CCP1-4. CCP7 and CCP20 interact with GAG on the membrane. Alternately, CCP19 may bind to the TED domain of C3b allowing CCP20 to interact with GAGs. If the cell is activated and C3b and C3d (or two C3b molecules) are deposited in close proximity, FH may bind to two of these molecules, allowing GAG binding by CCP20. (B) On resting cells, C3b will immediately be inactivated to iC3b by the action of FI and the assistance of cofactors (FH, MCP, CR1). iC3b cannot bind FB and form C3 convertases. Only the cofactor CR1 allows FI to execute a second cleavage generating C3c (released in the fluid phase) and C3dg, which remains bound to the cell. C3dg is rapidly transformed to C3d by tissue proteases. (C) If the host cell is activated, the complement control will not be sufficient to prevent any complement deposition and C3 convertases could be formed. To avoid cell damage, these convertases need to be dissociated. Multiple complement regulators such as DAF, CR1, and FH decay the C3bBb complex formed on host cells. Remaining C3b will be inactivated by FI, using FH, MCP, or CR1.

  19. Domain of unknown function (DUF5979)

    • ebi.ac.uk
    Updated Oct 26, 2023
    (2023). Domain of unknown function (DUF5979) [Dataset]. https://www.ebi.ac.uk/interpro/entry/pfam/PF19407
    Dataset updated: Oct 26, 2023
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This domain is found as tandem repeats in bacterial cell surface proteins, associated with TED adhesive domains. This domain contains a conserved lysine, glutamic acid and asparagine diagnostic of an intradomain isopeptide bond.

  20. A2MG, CUB domain

    • ebi.ac.uk
    Updated May 1, 2025
    (2025). A2MG, CUB domain [Dataset]. https://www.ebi.ac.uk/interpro/entry/InterPro/IPR049122/taxonomy/uniprot/
    Dataset updated: May 1, 2025
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This domain is found in a group of proteins predominantly found in beta and gammaproteobacteria, such as alpha-2-macroglobulin (A2MG) from Salmonella typhimurium, a protein that protects the bacterial cell from host peptidases. A2MG is composed of 13 domains, 12 of them folding as beta sandwiches. This entry represents the CUB (complement C1r/C1s, Uegf, Bmp1) domain, which is connected to the TED domain ([interpro:IPR011626]).
