92 datasets found
  1. AlphaFold Protein Structure Database

    • console.cloud.google.com
    Updated Sep 3, 2024
    Cite
    https://console.cloud.google.com/marketplace/browse?filter=partner:BigQuery%20Public%20Data&hl=es&inv=1&invt=Ab0Kww (2024). AlphaFold Protein Structure Database [Dataset]. https://console.cloud.google.com/marketplace/product/bigquery-public-data/deepmind-alphafold?hl=es
    Explore at:
    Dataset updated
    Sep 3, 2024
    Dataset provided by
    Google (http://google.com/)
    BigQuery (https://cloud.google.com/bigquery)
    License
    Description

    The AlphaFold Protein Structure Database is a collection of protein structure predictions made using the machine learning model AlphaFold. AlphaFold was developed by DeepMind, and this database was created in partnership with EMBL-EBI. For information on how to interpret, download and query the data, as well as on which proteins are included or excluded, and the change log, please see our main dataset guide and FAQs. To interactively view individual entries or to download proteomes / Swiss-Prot, please visit https://alphafold.ebi.ac.uk/.

    The current release aims to cover most of the over 200M sequences in UniProt (a commonly used reference set of annotated proteins). The files provided for each entry include the structure plus two model confidence metrics (pLDDT and PAE). The files can be found in the Google Cloud Storage bucket gs://public-datasets-deepmind-alphafold-v4 with metadata in the BigQuery table bigquery-public-data.deepmind_alphafold.metadata.

    If you use this data, please cite:
    Jumper, J et al. Highly accurate protein structure prediction with AlphaFold. Nature (2021)
    Varadi, M et al. AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Research (2021)

    This public dataset is hosted in Google Cloud Storage and is available free to use. Use this quick start guide to quickly learn how to access public datasets on Google Cloud Storage.
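
    As a rough sketch of how the two access points named in this description relate, the Python below only builds strings: a gs:// URI for one entry and a SQL preview query for the metadata table. The bucket and table names are taken from the description; the per-entry file name pattern (AF-<accession>-F1-model_v4.cif) and the helper names model_uri and metadata_preview_sql are assumptions for illustration, and nothing here contacts Google Cloud.

```python
# Bucket and table names as given in the dataset description.
BUCKET = "gs://public-datasets-deepmind-alphafold-v4"
METADATA_TABLE = "bigquery-public-data.deepmind_alphafold.metadata"


def model_uri(accession: str) -> str:
    """Return the assumed gs:// URI of the mmCIF model for one UniProt accession."""
    # Assumption: entries are stored as AF-<accession>-F1-model_v4.cif.
    return f"{BUCKET}/AF-{accession}-F1-model_v4.cif"


def metadata_preview_sql(limit: int = 10) -> str:
    """Return a SQL string that previews the first rows of the metadata table."""
    if limit < 1:
        raise ValueError("limit must be positive")
    return f"SELECT * FROM `{METADATA_TABLE}` LIMIT {limit}"


if __name__ == "__main__":
    print(model_uri("A0A000"))
    print(metadata_preview_sql(5))
```

    The strings can then be passed to whichever storage or BigQuery client is in use; checking the file name pattern against the dataset guide first is advisable.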

  2. The Encyclopedia of Domains (TED) structural domains assignments for...

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip, bz2 +1
    Updated Oct 31, 2024
    + more versions
    Cite
    Andy Lau; Nicola Bordin; Shaun Kandathil; Ian Sillitoe; Vaishali Waman; Jude Wells; Christine Orengo; David T Jones (2024). The Encyclopedia of Domains (TED) structural domains assignments for AlphaFold Database v4 [Dataset]. http://doi.org/10.5281/zenodo.13908086
    Explore at:
    Available download formats: application/gzip, zip, bz2
    Dataset updated
    Oct 31, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Andy Lau; Nicola Bordin; Shaun Kandathil; Ian Sillitoe; Vaishali Waman; Jude Wells; Christine Orengo; David T Jones
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Oct 31, 2024
    Description

    Dataset description:

    The Encyclopedia of Domains (TED) is a joint effort by CATH (Orengo group) and the Jones group at University College London to identify and classify protein domains in AlphaFold2 models from AlphaFold Database version 4, covering over 188 million unique sequences and 365 million domain assignments.

    In this data release, we will be making available to the community a table of domain boundaries and additional metadata on quality (pLDDT, globularity, number of secondary structures), taxonomy, and putative CATH SuperFamily or Fold assignments, for all 365 million domains (~324 million domains in TED100 and ~40 million domains in TED-redundant).

    For all chains in the chain-level TED-redundant files, the file contains boundary predictions, consensus level and information on the TED100 representative.

    For both TED100 and TED-redundant we provide the domain boundary predictions produced by each of the three methods employed in the project (Chainsaw, Merizo, UniDoc).

    We are making available 7,427 PDB files for potentially novel folds identified during the TED classification process, with an annotation table sorted by novelty, as well as 6,433 representatives of highly symmetrical folds.

    Please use the gunzip command to extract files with a '.gz' extension and "tar -xzvf file.tar.gz" to open .tar.gz files.

    CATH annotations have been assigned using the Foldseek algorithm applied in various modes, and the Foldclass algorithm, both of which are used to report significant structural similarity to a known CATH domain.


    Note: The TED protocol differs from that of the standard CATH Assignment protocol for superfamily assignment, which also involves HMM-based protocols and manual curation for classification into superfamilies.

    Changelog Version 5:

    • Add: ted_365m.domain_summary.cath.globularity.taxid.tsv.tar.gz - This table, in the same format as the previous ted_100_324m.domain_summary.cath.globularity.taxid.tsv.tar.gz, contains per-domain annotations for the whole of TED, including metadata on domain quality metrics such as secondary structure element counts, globularity scores, average pLDDT and taxonomic assignments.
    • Add: high_symmetry_folds_set.domain_summary.tsv.gz - subset of ted_365m.domain_summary.cath.globularity.taxid.tsv containing information on 6,433 high symmetry folds in TED. The entries are sorted in descending order by Z-score obtained from SymD.
    • Add: high_symmetry_folds_set_models.tar.gz - TED domain models in PDB format for 6,433 high symmetry folds in TED.
    • Add: ISP_data.tar.gz - Raw data for Interacting SuperFamily Pairs calculations used in the manuscript. A more detailed description of the ISP data is available below as well as within the tar.gz file.
    • Add: ted_redundant_40m_domain_id.list.gz - list of TED_domain_ID in TED redundant
    • Add: ted_100_324m_domain_id.list.gz - list of TED_domain_ID in TED100
    • Fix/Replace: A domain-level summary of TED, now consolidated into ted_365m.domain_summary.cath.globularity.taxid.tsv, is consistent with the protocol used in the manuscript. As Foldclass and Foldseek T-level hits provide all 4 CATH digits, we removed the H portion of the CATH code from each prediction at the T-level.
      Previously, the following columns
      14. cath_label - CATH superfamily code if predicted, either a C.A.T.H. homologous superfamily or C.A.T. fold assignment, e.g. 3.40.50.300
      15. cath_assignment_level - H for homologous superfamily assignment, T for fold level assignment.
      16. cath_assignment_method - Method used to assign a CATH label, either Foldseek or Foldclass
      sometimes showed an additional label with a T-level prediction by Foldclass in the case of T-level assignments obtained by Foldseek, e.g.
      3.40.30,3.40.30 T foldseek,foldclass
      This has now been corrected to reflect the TED protocol, with Foldclass T-level assignments applied only to domains where a T-level assignment could not be applied using Foldseek, e.g.
      domain-x 3.40.30 T foldseek
      domain-y 3.20.20 T foldclass

      Thus, in the current version of the data, the CATH assignment label can only be one of:
      H-level assignment by Foldseek (e.g. 3.40.50.300 H foldseek)
      T-level assignment by Foldseek (e.g. 3.40.30 T foldseek)
      T-level assignment by Foldclass (e.g. 3.40.30 T foldclass)
      or no assignment (- - -)
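
    The allowed combinations listed in this changelog entry can be checked mechanically. The following is a minimal sketch, assuming the three columns (cath_label, cath_assignment_level, cath_assignment_method) have already been read as strings; the function name check_cath_assignment is hypothetical and not part of the TED release.

```python
import re

# (level, method) pairs that the changelog lists as possible.
ALLOWED = {("H", "foldseek"), ("T", "foldseek"), ("T", "foldclass")}


def check_cath_assignment(label: str, level: str, method: str) -> bool:
    """Return True if the triple is one of the combinations described above."""
    if (label, level, method) == ("-", "-", "-"):
        return True  # domain without a CATH assignment
    if (level, method) not in ALLOWED:
        return False
    # H-level labels carry four CATH numbers (C.A.T.H), T-level labels three (C.A.T).
    expected_parts = 4 if level == "H" else 3
    pattern = r"\d+(\.\d+){%d}" % (expected_parts - 1)
    return re.fullmatch(pattern, label) is not None
```

    A row in the older, uncorrected style such as ("3.40.30,3.40.30", "T", "foldseek,foldclass") is rejected by this check, which makes it a quick way to tell which version of the table a file came from.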


    This dataset contains:

    • ted_214m_per_chain_segmentation.tsv
      The file contains all 214M protein chains in TED with consensus domain boundaries and proteome information in the following columns.
      1. AFDB_model_ID: chain identifier from AFDB in the format AF-
    • ted_365m_domain_boundaries_consensus_level.tsv.gz
      The file contains all domain assignments in TED100 and TED-redundant (365M) in the format:
      1. TED_ID: TED domain identifier in the format AF-
    • ted_100_324m_domain_id.list.gz - list of ~324 million domain identifiers in TED100, one per line in the format AF-
    • ted_redundant_40m_domain_id.list.gz - list of ~40 million domain identifiers in TED redundant, one per line in the format AF-
    • ted_365m.domain_summary.cath.globularity.taxid.tsv, novel_folds_set.domain_summary.tsv and high_symmetry_folds_set.domain_summary.tsv are header-less with the following columns separated by tabs (.tsv). novel_folds_set.domain_summary.tsv is sorted by novelty

      1. ted_id - TED domain identifier in the format AF-
    • ted_324m_seq_clustering.cathlabels.tsv.gz
      The file contains the results of the domain sequences clustering with MMseqs2.
      Columns:
      1. Cluster_representative
      2. Cluster_member
      3. CATH code assignment if available, e.g. 3.40.50.300 for a domain with a homologous match or 3.20.20 for a domain matching at the fold level in the CATH classification
      4. CATH assignment type - either Foldseek-T, Foldseek-H or Foldclass

    • Domain assignments for TED redundant using single-chain and multi-chain consensus in ted_redundant_39m.multichain.consensus_domain_summary.taxid.tsv.gz and ted_redundant_39m.singlechain.consensus_domain_summary.taxid.tsv.gz
    • The file ted_redundant_39m.multichain.consensus_domain_summary.taxid.tsv.gz contains a header with the

  3. AlphaFold Protein Structure Database

    • scicrunch.org
    Updated Nov 19, 2021
    Cite
    (2021). AlphaFold Protein Structure Database [Dataset]. http://identifiers.org/RRID:SCR_023662
    Explore at:
    Dataset updated
    Nov 19, 2021
    Description

    Database of protein structure predictions by AlphaFold that are freely and openly available to the global scientific community. Included are nearly all catalogued proteins known to science. Provides programmatic access to and interactive visualization of predicted atomic coordinates, per-residue and pairwise model confidence estimates, and predicted aligned errors.

  4. Reciprocal Best Structure Hits (RBSH)

    • repository.cam.ac.uk
    bin
    Updated Sep 22, 2022
    Cite
    Monzon, Vivian; Paysan-Lafosse, Typhaine; Wood, Valerie; Bateman, Alex (2022). Reciprocal Best Structure Hits (RBSH) [Dataset]. http://doi.org/10.17863/CAM.87873
    Explore at:
    Available download formats: bin (171535 bytes), bin (155431 bytes), bin (79489 bytes), bin (84547 bytes), bin (39107 bytes)
    Dataset updated
    Sep 22, 2022
    Dataset provided by
    University of Cambridge
    Apollo
    Authors
    Monzon, Vivian; Paysan-Lafosse, Typhaine; Wood, Valerie; Bateman, Alex
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In this work, we use AlphaFold structure models to find the closest homologous proteins between Homo sapiens and D. melanogaster, C. elegans, S. cerevisiae and S. pombe, as well as between S. cerevisiae and S. pombe. We use the structure aligner Foldseek to run an all-against-all search and look for the best-scoring hit in both directions to detect the Reciprocal Best Structure Hits (RBSH). We compare the results to protein pairs detected by their sequence similarity as Reciprocal Best Hits (RBH) and verify the results using the PANTHER family classification files.

    Note: This dataset is an updated version of the dataset at https://doi.org/10.17863/CAM.85487.
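
    The reciprocal-best-hit idea described here can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' pipeline: hit scores are assumed to be already parsed into nested dictionaries (query -> target -> score), higher scores are assumed to be better, and the names best_hits and reciprocal_best_hits are hypothetical.

```python
from typing import Dict, Set, Tuple

Scores = Dict[str, Dict[str, float]]  # query -> {target: score}


def best_hits(scores: Scores) -> Dict[str, str]:
    """Map each query to its highest-scoring target (ties go to the smallest name)."""
    best = {}
    for query, targets in scores.items():
        if targets:
            # max() keeps the first maximum, so sorting makes tie-breaking deterministic.
            best[query] = max(sorted(targets), key=lambda t: targets[t])
    return best


def reciprocal_best_hits(a_vs_b: Scores, b_vs_a: Scores) -> Set[Tuple[str, str]]:
    """Return pairs (a, b) where b is a's best hit and a is b's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return {(a, b) for a, b in best_ab.items() if best_ba.get(b) == a}


if __name__ == "__main__":
    # Toy scores for two proteins per species (made-up values).
    a_vs_b = {"h1": {"y1": 900.0, "y2": 300.0}, "h2": {"y1": 500.0, "y2": 450.0}}
    b_vs_a = {"y1": {"h1": 880.0, "h2": 510.0}, "y2": {"h1": 310.0, "h2": 440.0}}
    print(reciprocal_best_hits(a_vs_b, b_vs_a))
```

    In the toy data, h2's best hit is y1 but y1's best hit is h1, so only the pair (h1, y1) is reciprocal.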

  5. The Encyclopedia of Domains (TED) structural domains assignments for...

    • zenodo.org
    Updated Mar 27, 2024
    + more versions
    Cite
    Andy Lau; Nicola Bordin; Shaun Kandathil; Ian Sillitoe; Vaishali Waman; Jude Wells; Christine Orengo; David T Jones (2024). The Encyclopedia of Domains (TED) structural domains assignments for AlphaFold Database v4 [Dataset]. http://doi.org/10.5281/zenodo.10848710
    Explore at:
    Dataset updated
    Mar 27, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Andy Lau; Nicola Bordin; Shaun Kandathil; Ian Sillitoe; Vaishali Waman; Jude Wells; Christine Orengo; David T Jones
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset description:

    The Encyclopedia of Domains (TED) is a joint effort by CATH (Orengo group) and the Jones group at University College London to identify and classify protein domains in AlphaFold2 models from AlphaFold Database version 4, covering over 188 million unique sequences and 324 million domain assignments.

    In this data release, we will be making available to the community a table of domain boundaries and additional metadata on quality (pLDDT, globularity, number of secondary structures), taxonomy and putative CATH SuperFamily or Fold assignments for all 324 million domains in TED100.

    For all chains in the TED-redundant dataset, the attached file contains boundary predictions, consensus level and information on the TED100 representative.

    Additionally, an archive with chain-level consensus domain assignments is available for 21 model organisms and 25 global health proteomes:

    Organism TaxonID

    arabidopsis_thaliana 3702
    caenorhabditis_elegans 6239
    candida_albicans 237561
    danio_rerio 7955
    dictyostelium_discoideum 44689
    drosophila_melanogaster 7227
    escherichia_coli 83333
    glycine_max 3847
    homo_sapiens 9606
    methanocaldococcus_jannaschii 243232
    mus_musculus 10090
    oryza_sativa 39947
    rattus_norvegicus 10116
    saccharomyces_cerevisiae 559292
    schizosaccharomyces_pombe 284812
    zea_mays 4577
    ajellomyces_capsulatus 447093
    brugia_malayi 6279
    campylobacter_jejuni 192222
    cladophialophora_carrionii 86049
    dracunculus_medinensis 318479
    fonsecaea_pedrosoi 1442368
    haemophilus_influenzae 71421
    helicobacter_pylori 85962
    klebsiella_pneumoniae 1125630
    leishmania_infantum 5671
    madurella_mycetomatis 100816
    mycobacterium_leprae 272631
    mycobacterium_tuberculosis 83332
    mycobacterium_ulcerans 1299332
    neisseria_gonorrhoeae 242231
    nocardia_brasiliensis 1133849
    onchocerca_volvulus 6282
    paracoccidioides_lutzii 502779
    plasmodium_falciparum 36329
    pseudomonas_aeruginosa 208964
    salmonella_typhimurium 99287
    schistosoma_mansoni 6183
    shigella_dysenteriae 300267
    sporothrix_schenckii 1391915
    staphylococcus_aureus 93061
    streptococcus_pneumoniae 171101
    strongyloides_stercoralis 6248
    trypanosoma_brucei 185431
    trypanosoma_cruzi 353153
    wuchereria_bancrofti 6293


    For both TED100 and TED-redundant we provide the domain boundary predictions produced by each of the three methods employed in the project (Chainsaw, Merizo, UniDoc).

    We are making available PDB files for 7,427 novel folds identified during the TED classification process, with an annotation table sorted by novelty.


    This dataset contains:

    • ted_100_324m.domain_summary.cath.globularity.taxid.tsv and novel_folds_set.domain_summary.tsv are header-less with the following columns separated by tabs (.tsv).
    • novel_folds_set.domain_summary.tsv is sorted by novelty.
      1. ted_id
      2. md5_domain
      3. consensus_level
      4. chopping
      5. nres_domain
      6. num_segments
      7. plddt
      8. num_helix_strand_turn
      9. num_helix
      10. num_strand
      11. num_helix_strand
      12. num_turn
      13. proteome_id
      14. cath_label
      15. cath_assignment_level
      16. cath_assignment_method
      17. packing_density
      18. norm_rg
      19. tax_common_name
      20. tax_scientific_name
      21. tax_lineage
    • Domain assignments for TED redundant in ted_redundant_39m.consensus_domain_summary.taxid.tsv
      The file contains a header with the following fields. Each column is tab separated (.tsv).
      1. TED_redundant_id
      2. md5
      3. nres
      4. n_high
      5. n_med
      6. high_consensus
      7. med_consensus
      8. ndom_consensus
      9. n_targets
      10. proteome_id
      11. TED_redundant_species
      12. TED100_chain_rep
      13. TED100_chain_rep_species
    • novel_folds_set_models.tar.gz contains PDB files of all novel folds identified in TED100.
    • All per-tool domain boundaries predictions are in the same format with the following columns.
      1. TED_chainID
      2. TED_chain_md5
      3. TED_chain_length
      4. ndoms
      5. Domain boundaries
      6. Prediction probability
    • Domain boundary predictions share the same format, with each segment separated by '_' and segment boundaries (start, stop) separated by '-'

      e.g. the domain prediction by Merizo for AF-A0A000-F1-model_v4:
      AF-A0A000-F1-model_v4 e8872c7a0261b9e88e6ff47eb34e4162 394 2 10-52_289-394,53-288 0.90077

      Merizo predicts one discontinuous domain and one continuous domain:
      Domain 1 (discontinuous): 10-52_289-394
      segment 1: 10-52
      segment 2: 289-394
      Domain 2 (continuous): 53-288
      segment 1: 53-288
    • model_organisms_and_global_health_proteomes.tar.gz - domain assignments for 21 model organisms and 25 global health proteomes
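
    The per-tool prediction format listed above can be parsed as follows. This is a minimal sketch: the function names are hypothetical, the ',' separator between domains is inferred from the Merizo example rather than stated in the text, and chains without any predicted domain are not handled because their representation is not described here.

```python
from typing import Dict, List, Tuple

Segment = Tuple[int, int]  # (start, stop), both inclusive


def parse_domains(boundaries: str) -> List[List[Segment]]:
    """Split a boundaries string into domains, each a list of (start, stop) segments."""
    domains = []
    for domain in boundaries.split(","):      # ',' between domains (inferred)
        segments = []
        for segment in domain.split("_"):     # '_' between segments of one domain
            start, stop = segment.split("-")  # '-' between start and stop
            segments.append((int(start), int(stop)))
        domains.append(segments)
    return domains


def domain_length(segments: List[Segment]) -> int:
    """Number of residues covered by one domain, counting both ends of each segment."""
    return sum(stop - start + 1 for start, stop in segments)


def parse_prediction_line(line: str) -> Dict[str, object]:
    """Parse one whitespace-separated line with the six columns listed above."""
    chain_id, md5, length, ndoms, boundaries, probability = line.split()
    return {
        "TED_chainID": chain_id,
        "TED_chain_md5": md5,
        "TED_chain_length": int(length),
        "ndoms": int(ndoms),
        "domains": parse_domains(boundaries),
        "probability": float(probability),
    }


if __name__ == "__main__":
    example = ("AF-A0A000-F1-model_v4 e8872c7a0261b9e88e6ff47eb34e4162 "
               "394 2 10-52_289-394,53-288 0.90077")
    record = parse_prediction_line(example)
    for number, segments in enumerate(record["domains"], start=1):
        print(number, segments, domain_length(segments))
```

    Comparing len(record["domains"]) with record["ndoms"] gives a cheap consistency check when reading the full files.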

  6. Prediction and Visualization of Human Transmembrane Proteins using AlphaFold...

    • data.niaid.nih.gov
    Updated Jul 16, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Grekova, Anastasia (2024). Prediction and Visualization of Human Transmembrane Proteins using AlphaFold and Protein Language Models [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6816082
    Explore at:
    Dataset updated
    Jul 16, 2024
    Dataset provided by
    Heinzinger, Michael
    Marquet, Céline
    Grekova, Anastasia
    Rost, Burkhard
    Houri, Leen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Description: TMvis ("TMvis496.tar.gz") is a dataset containing 496 3D structures of predicted human transmembrane proteins (TMPs) and their predicted membrane embedding. The method TMbed [1], based on the protein language model ProtT5 [2], predicted 4,967 TMPs for the human proteome (20,375 proteins, UniProt [3] version April 2022; excluding TITIN_HUMAN due to length). For these proteins, we obtained AlphaFold [4] structures from AlphaFoldDB [5] with an average per-residue confidence score (pLDDT) of more than 90%. This resulted in the 496 proteins of TMvis, as can be found in "TMvis496.fasta". The membrane embedding was predicted using the methods ANVIL [6], PPM3 [7], and per-residue TMbed predictions. As the three methods are based on different approaches, we decided to publish results for all. The figure “TMvis_project_overview.png” provides a graphical overview of each step described above.

    TMvis Folder Structure: TMvis is separated into “alpha”, containing predicted alpha-helical TMPs, and “beta”, containing predicted beta-barrel TMPs. Within these folders, each protein is assigned one folder, identifiable by the respective unique UniProt ID. Each protein folder consists of:
    - “UniprotID.fasta” with UniProt ID, sequence, TMbed per-residue prediction
    - “AF-UniprotID-F1-model_v2.pdb” with the AlphaFold structure
    - “AF-UniprotID-F1-model_v2.cif” with the AlphaFold structure
    - “AF-UniprotID-F1-model_v2_ANVIL.pdb” with predicted ANVIL membrane embedding
    - “AF-UniprotID-F1-model_v2_ppm.pdb” with predicted PPM3 membrane embedding

    TMvis
    |
    ├── alpha
    │ │
    │ ├── A0A087X1C5
    │ │ ├── A0A087X1C5.fasta
    │ │ ├── AF-A0A087X1C5-F1-model_v2.pdb
    │ │ ├── AF-A0A087X1C5-F1-model_v2.cif
    │ │ ├── AF-A0A087X1C5-F1-model_v2_ANVIL.pdb
    │ │ └── AF-A0A087X1C5-F1-model_v2_ppm.PDB
    │ └── ...
    └── beta
    └── P45880
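
    The folder layout above translates directly into path construction. The sketch below is illustrative only: the helper name expected_files and the root folder argument are assumptions, and it follows the lower-case ".pdb" spelling used in the file list, whereas the tree shows ".PDB" for the PPM3 file, so the case should be checked against the extracted archive.

```python
from pathlib import PurePosixPath
from typing import List


def expected_files(root: str, topology: str, uniprot_id: str) -> List[PurePosixPath]:
    """List the five files described for one protein folder in TMvis."""
    if topology not in ("alpha", "beta"):
        raise ValueError("topology must be 'alpha' or 'beta'")
    folder = PurePosixPath(root) / topology / uniprot_id
    prefix = f"AF-{uniprot_id}-F1-model_v2"
    names = [
        f"{uniprot_id}.fasta",   # sequence and TMbed per-residue prediction
        f"{prefix}.pdb",         # AlphaFold structure (PDB format)
        f"{prefix}.cif",         # AlphaFold structure (mmCIF format)
        f"{prefix}_ANVIL.pdb",   # ANVIL membrane embedding
        f"{prefix}_ppm.pdb",     # PPM3 membrane embedding (case assumed, see note)
    ]
    return [folder / name for name in names]


if __name__ == "__main__":
    for path in expected_files("TMvis", "alpha", "A0A087X1C5"):
        print(path)
```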

    TMvis visualization: The 3D visualization of every protein in the dataset TMvis can be easily accessed using the Jupyter Notebook “TMvis.ipynb”. It contains detailed descriptions of the different membrane prediction tools ANVIL, PPM3, and TMbed as well as the respective code. Additionally, it allows visualizing the per-residue confidence scores (pLDDT) of AlphaFold.

    ——————————————————————————————————————————————————————————————————————————

    References:

    [1] TMbed - Bernhofer, Michael, and Burkhard Rost. 2022. “TMbed – Transmembrane Proteins Predicted through Language Model Embeddings.” bioRxiv.

    [2] ProtT5 - A. Elnaggar et al., "ProtTrans: Towards Cracking the Language of Life's Code Through Self-Supervised Deep Learning and High Performance Computing," in IEEE Transactions on Pattern Analysis and Machine Intelligence, doi: 10.1109/TPAMI.2021.3095381.

    [3] UniProt - UniProt Consortium (2021). UniProt: the universal protein knowledgebase in 2021. Nucleic acids research, 49(D1), D480–D489.

    [4] AlphaFold - Jumper, John, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, et al. 2021. “Highly Accurate Protein Structure Prediction with AlphaFold.” Nature 596 (7873): 583–89.

    [5] AlphaFold DB - Varadi, Mihaly, Stephen Anyango, Mandar Deshpande, Sreenath Nair, Cindy Natassia, Galabina Yordanova, David Yuan, et al. 2022. “AlphaFold Protein Structure Database: Massively Expanding the Structural Coverage of Protein-Sequence Space with High-Accuracy Models.” Nucleic Acids Research 50 (D1): D439–44.

    [6] ANVIL - Postic, Guillaume, Yassine Ghouzam, Vincent Guiraud, and Jean-Christophe Gelly. 2016. “Membrane Positioning for High- and Low-Resolution Protein Structures through a Binary Classification Approach.” Protein Engineering, Design & Selection: PEDS 29 (3): 87–91.

    [7] PPM3 - Lomize, Mikhail A., Irina D. Pogozheva, Hyeon Joo, Henry I. Mosberg, and Andrei L. Lomize. 2012. “OPM Database and PPM Web Server: Resources for Positioning of Proteins in Membranes.” Nucleic Acids Research 40 (Database issue): D370–76.

    ——————————————————————————————————————————————————————————————————————————

    License:

    This work is licensed under a Creative Commons Attribution 4.0 International License (CC-BY 4.0).

  7. AlphaFold Unmasked data sets

    • demo.researchdata.se
    • figshare.scilifelab.se
    • +1 more
    Updated Jan 27, 2025
    Cite
    Claudio Mirabello; Björn Wallner; Björn Nystedt; Marta Carroni (2025). AlphaFold Unmasked data sets [Dataset]. http://doi.org/10.17044/SCILIFELAB.24198669
    Explore at:
    Dataset updated
    Jan 27, 2025
    Dataset provided by
    Linköping University
    Authors
    Claudio Mirabello; Björn Wallner; Björn Nystedt; Marta Carroni
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Here are deposited all of the predictions generated for the test cases presented in "AlphaFold Unmasked: integration of experiments and predictions with a smarter template mechanism" (doi: https://doi.org/10.1101/2023.09.20.558579) along with the log files necessary to reproduce the experiments.

    Each tar.gz file includes one or more AlphaFold experiments, where multiple predictions have been generated either with AlphaFold-Multimer (standard pipeline, v2.2 and/or v2.3 parameters) or with AF_unmasked. An experiment is made of a set of 3D structure predictions (.pdb files) along with the ancillary data generated by AlphaFold (pickle files) and the corresponding inputs (Multiple Sequence Alignments, sequences). Scripts to reproduce the results are included along with the log files generated during the experiments.

    H1111, H1142, T1109 and T1110 are multimeric prediction targets from CASP15 (https://predictioncenter.org/casp15/), chosen because most or all predictors failed to correctly predict these complexes in the 2022 edition of CASP.

    Rubisco, NF1 and ClpB are examples of large and/or challenging targets where Cryo-EM data is available to be integrated in the prediction pipeline.

    The PDB benchmark is made of a set of protein heterodimeric structures deposited in the PDB before January 2022, i.e. before AlphaFold v2.3 was trained and released. These heterodimers have been redundancy reduced by structural similarity (MMalign score threshold: 0.4) to increase their diversity

  8. Reciprocal Best Structure Hits

    • repository.cam.ac.uk
    bin
    Updated Jun 14, 2022
    + more versions
    Cite
    Monzon, Vivian; Paysan-Lafosse, Typhaine; Wood, Valerie; Bateman, Alex (2022). Reciprocal Best Structure Hits [Dataset]. http://doi.org/10.17863/CAM.85487
    Explore at:
    Available download formats: bin (181809 bytes), bin (39365 bytes), bin (86555 bytes), bin (200706 bytes), bin (90557 bytes)
    Dataset updated
    Jun 14, 2022
    Dataset provided by
    University of Cambridge
    Apollo
    Authors
    Monzon, Vivian; Paysan-Lafosse, Typhaine; Wood, Valerie; Bateman, Alex
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In this work, we use AlphaFold structure models to find the closest homologous proteins between Homo sapiens and D. melanogaster, C. elegans, S. cerevisiae and S. pombe, as well as between S. cerevisiae and S. pombe. We use the structure aligner Foldseek to run an all-against-all search and look for the best-scoring hit in both directions to detect the Reciprocal Best Structure Hits (RBSH). We compare the results to protein pairs detected by their sequence similarity as Reciprocal Best Hits (RBH) and verify the results using the PANTHER family classification files.

    Note: This dataset is an earlier version of a more up-to-date dataset at https://doi.org/10.17863/CAM.87873

  9. Data from: MsmRho AlphaFold predictions

    • ourarchive.otago.ac.nz
    Updated Jun 24, 2024
    Cite
    Sofia Magalhaes Moreira (2024). MsmRho AlphaFold predictions [Dataset]. https://ourarchive.otago.ac.nz/esploro/outputs/dataset/MsmRho-AlphaFold-predictions/9926653800801891
    Explore at:
    Dataset updated
    Jun 24, 2024
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Sofia Magalhaes Moreira
    Time period covered
    Jun 24, 2024
    Description

    This folder contains the files in cif format generated by AlphaFold 3 to build Figure 4.1 of the thesis of Sofia Magalhães Moreira - https://hdl.handle.net/10523/43234. The data is embargoed in Figshare until 24 June 2026.

  10. AlphaFold 2 generated models of SurA homologues

    • zenodo.org
    bin
    Updated Aug 1, 2024
    Cite
    Bob Schiffrin (2024). AlphaFold 2 generated models of SurA homologues [Dataset]. http://doi.org/10.5281/zenodo.13150730
    Explore at:
    Available download formats: bin
    Dataset updated
    Aug 1, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Bob Schiffrin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Models of SurA homologues which are present in the InterPro family IPR015391 but are not in the EBI AlphaFold database (2024).

  11. Data from: Application of AlphaFold on metamorphic proteins - dataset

    • data.niaid.nih.gov
    Updated Jul 11, 2024
    Cite
    Artur Góra (2024). Application of AlphaFold on metamorphic proteins - dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8093512
    Explore at:
    Dataset updated
    Jul 11, 2024
    Dataset provided by
    Weronika Bagrowska
    Karolina Mitusińska
    Artur Góra
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset comprises a set of five structures of metamorphic proteins used for the study.

  12. Supplementary Data for "Using AlphaFold and Experimental Structures for the...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Oct 24, 2023
    Cite
    Lihan, Muyun (2023). Supplementary Data for "Using AlphaFold and Experimental Structures for the Prediction of the Structure and Binding Affinities of GPCR Complexes via Induced Fit Docking and Free Energy Perturbation" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10037622
    Explore at:
    Dataset updated
    Oct 24, 2023
    Dataset provided by
    Coskun, Dilek
    Miller, Edward
    Lihan, Muyun
    Rodrigues, Joao
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    Supplementary data for publication "Using AlphaFold and Experimental Structures for the Prediction of the Structure and Binding Affinities of GPCR Complexes via Induced Fit Docking and Free Energy Perturbation". Includes:

    - All input structures used in the retrospective benchmark dataset as well as the (at most) 5 best scoring output models.
    - Input structures and output models for IFD-MD predictions of SSTR2, SSTR4, and SSTR5 complexes.
    - Output FEP+ maps (in fmp format) for SSTR2, SSTR4, and SSTR5 best models (representative runs shown in publication).

  13. AlphaFold

    • ebi.ac.uk
    Updated Mar 27, 2019
    Cite
    (2019). AlphaFold [Dataset]. https://www.ebi.ac.uk/ebisearch/search.ebi?db=allebi&t=SPCH
    Explore at:
    Dataset updated
    Mar 27, 2019
    Description

    AlphaFold DB provides open access to over 200 million protein structure predictions to accelerate scientific research.

  14. Predictomes

    • rrid.site
    Updated Jun 15, 2025
    Cite
    (2025). Predictomes [Dataset]. http://identifiers.org/RRID:SCR_026691
    Explore at:
    Dataset updated
    Jun 15, 2025
    Description

    Interactive database of protein-protein interactions modeled by AlphaFold-Multimer. Classifier-curated database of AlphaFold-modeled protein-protein interactions.

  15. DPAM Domain Classification of Human Proteins against ECOD Reference

    • zenodo.org
    • data.niaid.nih.gov
    bin
    Updated Dec 2, 2022
    Cite
    Richard Schaeffer; Jing Zhang; Lisa Kinch; Jimin Pei; Qian Cong; Nick Grishin (2022). DPAM Domain Classification of Human Proteins against ECOD Reference [Dataset]. http://doi.org/10.5281/zenodo.6998803
    Available download formats: bin
    Dataset updated
    Dec 2, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Richard Schaeffer; Jing Zhang; Lisa Kinch; Jimin Pei; Qian Cong; Nick Grishin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Domain definitions of AlphaFold classifications of the human proteome (v1) from the AlphaFold Database. Also included are classifications of Danio rerio, Mus musculus, Pan paniscus, Drosophila melanogaster, and Caenorhabditis elegans, used for comparative analysis against human. See the README file for descriptions of file formats.

  16. Data from: Structure-guided isoform identification for the human...

    • figshare.com
    bin
    Updated Jun 1, 2023
    Cite
    Markus Sommer; Sooyoung Cha; Ales Varabyou; Natalia Rincon; Sukhwan Park; Ilia Minkin; Mihaela Pertea; Martin Steinegger; Steven L. Salzberg (2023). Structure-guided isoform identification for the human transcriptome [Dataset]. http://doi.org/10.6084/m9.figshare.21802476.v1
    Available download formats: bin
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    figshare
    Authors
    Markus Sommer; Sooyoung Cha; Ales Varabyou; Natalia Rincon; Sukhwan Park; Ilia Minkin; Mihaela Pertea; Martin Steinegger; Steven L. Salzberg
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Protein structure prediction files for the CHESS human protein structure database version 1.2. AlphaFold2/ColabFold predictions of the GTEx assembled human proteome.

  17. AlphaFold structures reported in "AlphaFold2 Can Predict Single-Mutation...

    • data.niaid.nih.gov
    Updated Oct 21, 2023
    Cite
    McBride, John M (2023). AlphaFold structures reported in "AlphaFold2 Can Predict Single-Mutation Effects" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10013252
    Dataset updated
    Oct 21, 2023
    Dataset authored and provided by
    McBride, John M
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This contains AlphaFold predictions for X proteins that are found in the Protein Data Bank (PDB), that were used to evaluate AlphaFold's predictions of mutation effects. This includes one set of structures predicted by AlphaFold2.0, using default settings, and one structure for each of 5 models. This also includes structures predicted by the ColabFold version of AlphaFold (6 recycles, 5 models, no template, amber minimization, 4 repeats). There are also additional predicted structures that are found in the PDB that were not analyzed in the paper. There are AlphaFold predictions for three proteins (BFP / RFP, GFP, and PafA), covering either all (BFP/RFP, PafA) or a subset (GFP) of the sequences in three datasets of phenotype measurements from high-throughput experiments. Results are separated into tar files based on whether the DeepMind (AF2.0) or ColabFold implementation was used. Folders under "ColabFold/PDB" are labelled according to a sequence ID, since multiple PDB structures can exist for a single sequence. These sequence IDs can be mapped back to PDB IDs using the information in "seq_id_pdb_id.json". All PDB files have been compressed using Foldcomp (https://github.com/steineggerlab/foldcomp). Foldcomp is required to decompress the ".fcz" files in order to recover the ".pdb" files.
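
Recovering the ".pdb" files from the ".fcz" files might look like the sketch below. It assumes the third-party `foldcomp` Python package is installed and that its `decompress` function takes the raw bytes of a ".fcz" file and returns a `(name, pdb_text)` pair, as described in the Foldcomp README; the helper names here are hypothetical and not part of the dataset.

```python
from __future__ import annotations

from pathlib import Path


def pdb_path_for(fcz_path: Path) -> Path:
    """Return the output path: same location, '.fcz' swapped for '.pdb'."""
    return fcz_path.with_suffix(".pdb")


def decompress_fcz(fcz_path: Path) -> Path:
    """Decompress one Foldcomp file and write the recovered PDB next to it."""
    # Imported lazily so the path helper works without foldcomp installed.
    import foldcomp  # third-party package; assumed to be installed

    # Assumption: decompress() returns a (name, pdb_text) pair.
    _name, pdb_text = foldcomp.decompress(fcz_path.read_bytes())
    out_path = pdb_path_for(fcz_path)
    out_path.write_text(pdb_text)
    return out_path


def decompress_tree(root: Path) -> list[Path]:
    """Decompress every '.fcz' file found under root, recursively."""
    return [decompress_fcz(p) for p in sorted(root.rglob("*.fcz"))]
```

After extracting one of the tar files, a call such as `decompress_tree(Path("ColabFold/PDB"))` would write one ".pdb" file beside each ".fcz" file, assuming that folder layout matches the description above.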

  18. Data from: Integrating AlphaFold pLDDT Scores into CABS-flex for Enhanced...

    • zenodo.org
    application/gzip, csv +1
    Updated Oct 24, 2024
    Cite
    Karol Wróblewski; Sebastian Kmiecik (2024). Integrating AlphaFold pLDDT Scores into CABS-flex for Enhanced Protein Flexibility Simulations [Dataset]. http://doi.org/10.5281/zenodo.13984926
    Available download formats: txt, application/gzip, csv
    Dataset updated
    Oct 24, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Karol Wróblewski; Sebastian Kmiecik
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
    Data created during research on: Integrating AlphaFold pLDDT Scores into CABS-flex for Enhanced Protein Flexibility Simulations.
    Training_set_protein_chains.txt and Whole_set_protein_chains.txt list all PDB IDs and chains used.
    Description_of_runs.csv lists every run tested. Each run number corresponds to a CSV file in Results_run.tar.gz.
    Every CSV file has the following columns:
    • PDB - PDB ID and chain
    • Total_residues
    • %_C - Percent of secondary structure assigned as coil by DSSP
    • %_H - Percent of secondary structure assigned as helix by DSSP
    • %_E - Percent of secondary structure assigned as sheet by DSSP
    • %_T - Percent of secondary structure assigned as turn by DSSP
    • pLDDT_mean - Average pLDDT score across all residues
    • pLDDT_std - Standard deviation of pLDDT scores across all residues
    • Unique_restraints - Number of unique restraints created by CABS-flex
    • RMSF_CABS_R1_corr - RMSF correlation between CABS-flex and first MD simulation from ATLAS
    • RMSF_CABS_R2_corr - RMSF correlation between CABS-flex and second MD simulation from ATLAS
    • RMSF_CABS_R3_corr - RMSF correlation between CABS-flex and third MD simulation from ATLAS
    • Highest_RMSF_corr - Highest RMSF correlation out of three
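
As a rough illustration of how these columns fit together, the sketch below reads one results file and checks that Highest_RMSF_corr equals the maximum of the three per-simulation correlations. It assumes the files are comma-delimited with a header row using the column names listed above; the sample rows are made up and include only a subset of the columns.

```python
import csv
import io

CORR_COLUMNS = ("RMSF_CABS_R1_corr", "RMSF_CABS_R2_corr", "RMSF_CABS_R3_corr")

# Made-up rows for illustration only; real files come from Results_run.tar.gz.
SAMPLE_CSV = """PDB,pLDDT_mean,RMSF_CABS_R1_corr,RMSF_CABS_R2_corr,RMSF_CABS_R3_corr,Highest_RMSF_corr
xxxx_A,90.0,0.70,0.80,0.75,0.80
yyyy_B,60.0,0.40,0.30,0.50,0.50
"""


def summarize(handle):
    """Return the mean Highest_RMSF_corr and the PDB entries whose stored
    highest value differs from the maximum of the three correlations."""
    highest, mismatches = [], []
    for row in csv.DictReader(handle):
        best = max(float(row[col]) for col in CORR_COLUMNS)
        stored = float(row["Highest_RMSF_corr"])
        highest.append(stored)
        if abs(best - stored) > 1e-9:
            mismatches.append(row["PDB"])
    return sum(highest) / len(highest), mismatches


mean_highest, mismatches = summarize(io.StringIO(SAMPLE_CSV))
```

For a real run, the `io.StringIO` object would be replaced by an open file handle for a CSV extracted from Results_run.tar.gz.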
  19. UltraScan Solution Modeler (US-SOMO) hydrodynamic parameter, structural...

    • data.niaid.nih.gov
    • search.dataone.org
    • +1 more
    zip
    Updated Jan 13, 2023
    Cite
    Emre Brookes; Mattia Rocco (2023). UltraScan Solution Modeler (US-SOMO) hydrodynamic parameter, structural small angle scattering and SESCA circular dichroism (CD) calculations on AlphaFold predicted structures [Dataset]. http://doi.org/10.5061/dryad.jq2bvq89s
    Available download formats: zip
    Dataset updated
    Jan 13, 2023
    Dataset provided by
    University of Montana
    Ospedale Policlinico San Martino
    Authors
    Emre Brookes; Mattia Rocco
    License

    CC0 1.0: https://spdx.org/licenses/CC0-1.0.html

    Description

    Recent spectacular advances by AI programs in 3D structure predictions from protein sequences have revolutionized the field in terms of accuracy and speed. The resulting "folding frenzy" has already produced predicted protein structure databases for the entire human and other organisms' proteomes. However, rapidly ascertaining a predicted structure's reliability based on measured properties in solution should be considered. Shape-sensitive hydrodynamic parameters such as the diffusion and sedimentation coefficients (D0t(20,w),s0(20,w)) and the intrinsic viscosity ([η]) can provide a rapid assessment of the overall structure likeliness, and SAXS would yield the structure-related pair-wise distance distribution function p(r) vs. r. Using the extensively validated UltraScan SOlution MOdeler (US‑SOMO) suite, a database was implemented calculating from AlphaFold structures the corresponding D0t(20,w), s0(20,w), [η], p(r) vs. r, and other parameters. Circular dichroism spectra were computed using the SESCA program. Some of AlphaFold's drawbacks were mitigated, such as generating whenever possible a protein's mature form. Others, like the AlphaFold direct applicability to single-chain structures only, the absence of prosthetic groups, or flexibility issues, are discussed. Overall, this implementation of the US‑SOMO‑AF database should already aid in rapidly evaluating the consistency in solution of a relevant portion of AlphaFold predicted protein structures. 
    Methods

    Production of this dataset required three major steps: collect the AlphaFold entries and additional metadata; prepare the structures for hydrodynamic, structural and CD calculations; and compute the hydrodynamic, structural and CD properties. Briefly, each entry in the entire AlphaFold database was first compared with the corresponding entry in the UniProt database to find the (putative) initiator methionine, signal peptide and transit peptide regions, which were subsequently removed from the AlphaFold PDB files. Additional variants were created when propeptides were found. Potential disulfides were identified (subsequently allowing a better evaluation of the partial specific volume and of M) and written as SSBOND records in the cured PDBs, together with HELIX and SHEET information identified using the DSSP implementation in UCSF Chimera (Pettersen et al., 2004, Journal of Computational Chemistry, 25(13), pp. 1605-1612). Batch-mode US-SOMO was then used to calculate the mass M, the translational diffusion coefficient D0t(20,w), the sedimentation coefficient s0(20,w), the derived Stokes' (or hydrodynamic) radius Rs, the intrinsic viscosity [η], the radius of gyration Rg, and the maximum extensions along the principal X, Y and Z axes of the molecule, and to generate anhydrous small-angle X-ray scattering pair-wise distance distribution functions p(r) vs. r (normalized by the M of the structure). SESCA was subsequently used to generate 170-270 nm circular dichroism (CD) spectra from each cured structure.
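
The Stokes radius is described above as derived from the translational diffusion coefficient. The standard relation between the two is the Stokes-Einstein equation, Rs = kB·T / (6π·η·D). The sketch below is not US-SOMO's implementation; it only illustrates that conversion, assuming SI units and standard conditions (water at 20 °C, viscosity of about 1.002 mPa·s). The function name and the example diffusion coefficient are hypothetical.

```python
import math

BOLTZMANN = 1.380649e-23   # J/K (exact SI value)
T_20C = 293.15             # K, i.e. 20 °C
ETA_WATER_20C = 1.002e-3   # Pa*s, approximate viscosity of water at 20 °C


def stokes_radius(diffusion_m2_per_s, temperature=T_20C, viscosity=ETA_WATER_20C):
    """Stokes-Einstein relation: Rs = kB*T / (6*pi*eta*D), result in metres."""
    if diffusion_m2_per_s <= 0:
        raise ValueError("diffusion coefficient must be positive")
    return BOLTZMANN * temperature / (6.0 * math.pi * viscosity * diffusion_m2_per_s)


# Illustrative input: D = 1e-10 m^2/s (equivalently 1e-6 cm^2/s).
rs_nm = stokes_radius(1e-10) * 1e9  # convert metres to nanometres
```

With this illustrative input the radius comes out at roughly 2.1 nm; because Rs is inversely proportional to D, a slower-diffusing (larger or more extended) structure yields a proportionally larger radius.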

  20. Metalloprotein AlphaFold set with enzyme/non-enzyme labeled sites

    • data.niaid.nih.gov
    Updated Mar 20, 2023
    Cite
    Feehan, Ryan (2023). Metalloprotein AlphaFold set with enzyme/non-enzyme labeled sites [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7689819
    Dataset updated
    Mar 20, 2023
    Dataset authored and provided by
    Feehan, Ryan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The AlphaFold set contains computationally generated structures for metalloproteins that were used to test MAHOMES II's enzyme/non-enzyme predictive performance (Feehan et al. 2023).

    README.md - Detailed description of AlphaFold set generation.

    AF-...-model_v2.pdb - Files with the 3D atomic coordinates of a metalloprotein.

    MAHOMES-II_AlphaFold_set_site_data.csv - Contains the data used during the generation of the AlphaFold set for the final sites. Columns are:
    • Entry - The UniProt accession number of the protein with the bound metal site.
    • struc_id - The structure's AlphaFold DB name (February 2022) and the name of the file in this directory with the added metal site.
    • metal_resName - The two-letter PDB residue abbreviation for the site's metal.
    • metal_seqID - The residue index number for the added metal ion.
    • Enzyme - The enzyme (True) or non-enzyme (False) label.
    • Entry name - UniProt entry name.
    • Protein names - The UniProt-provided metalloprotein name(s).
    • Number of homologs with solved structures (PDB) - Number of protein sequences in the PDB (May 21, 2020) with an E-value < 1.
    • Number of homologs in MAHOMES II dataset and T-metal-sites10 - Number of protein sequences used to train and evaluate MAHOMES II with an E-value < 1 (0 for all entries).
    • Metal binding note - UniProt metal binding note that includes information covering the metal’s identity and catalytic flag.
    • Metal coordinating residue seqIDs - The sequence indices for the metal-coordinating residues included in UniProt’s metal binding section.
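
A minimal sketch of tallying the enzyme and non-enzyme labels from this file follows. It assumes a comma-delimited file with a header row and an Enzyme column holding the text "True" or "False"; the sample rows, including the accession placeholders, are made up and show only a few of the columns.

```python
import csv
import io
from collections import Counter

# Made-up rows for illustration; the accessions are placeholders,
# not real UniProt entries.
SAMPLE_CSV = """Entry,metal_resName,metal_seqID,Enzyme
ACC0001,ZN,301,True
ACC0002,FE,150,False
ACC0003,MG,412,True
"""


def label_counts(handle):
    """Count sites labeled enzyme vs. non-enzyme in the site data CSV."""
    counts = Counter()
    for row in csv.DictReader(handle):
        is_enzyme = row["Enzyme"].strip().lower() == "true"
        counts["enzyme" if is_enzyme else "non-enzyme"] += 1
    return counts


counts = label_counts(io.StringIO(SAMPLE_CSV))
```

For the real data, the `io.StringIO` object would be replaced by an open handle on MAHOMES-II_AlphaFold_set_site_data.csv.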
