Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description: TMvis ("TMvis496.tar.gz") is a dataset containing 496 3D structures of predicted human transmembrane proteins (TMPs) and their predicted membrane embedding. The method TMbed [1], which is based on the protein language model ProtT5 [2], predicted 4,967 TMPs for the human proteome (20,375 proteins, UniProt [3] version April 2022; excluding TITIN_HUMAN due to length). For these proteins, we obtained AlphaFold [4] structures from AlphaFoldDB [5] and retained those with an average per-residue confidence score (pLDDT) of more than 90%. This resulted in the 496 proteins of TMvis, which are listed in "TMvis496.fasta". The membrane embedding was predicted using the methods ANVIL [6] and PPM3 [7] as well as the per-residue TMbed predictions. As the three methods are based on different approaches, we decided to publish the results of all three. The figure “TMvis_project_overview.png” provides a graphical overview of each step described above.
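The pLDDT filter described above can be sketched as follows. AlphaFold DB PDB files store the per-residue pLDDT in the B-factor column, so averaging that column over the Cα atoms gives an average per-residue confidence. The function names and the choice of Cα atoms are illustrative assumptions, not the authors' pipeline.

```python
def mean_plddt(pdb_text: str) -> float:
    """Average per-residue pLDDT of an AlphaFold model given as PDB text.

    Assumption: one C-alpha atom per residue, with pLDDT stored in the
    B-factor field (columns 61-66 of each ATOM record).
    """
    scores = []
    for line in pdb_text.splitlines():
        # Atom name occupies columns 13-16 of the fixed-width PDB format.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores.append(float(line[60:66]))
    if not scores:
        raise ValueError("no C-alpha atoms found")
    return sum(scores) / len(scores)


def passes_confidence_filter(pdb_text: str, threshold: float = 90.0) -> bool:
    """True if the model's average pLDDT is strictly above the threshold."""
    return mean_plddt(pdb_text) > threshold
```

A model would be kept when `passes_confidence_filter(open(path).read())` returns True, where `path` points to one AlphaFold PDB file.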
TMvis Folder Structure: TMvis is separated into “alpha” containing predicted alpha-helical TMPs, and “beta” containing predicted beta-barrel TMPs. Within these folders, each protein is assigned one folder, identifiable by the respective unique UniProt ID. Each protein folder consists of:
- “UniprotID.fasta” with UniProt ID, sequence, TMbed per-residue prediction
- “AF-UniprotID-F1-model_v2.pdb” with the AlphaFold structure
- “AF-UniprotID-F1-model_v2.cif” with the AlphaFold structure
- “AF-UniprotID-F1-model_v2_ANVIL.pdb” with predicted ANVIL membrane embedding
- “AF-UniprotID-F1-model_v2_ppm.pdb” with predicted PPM3 membrane embedding
TMvis
│
├── alpha
│   │
│   ├── A0A087X1C5
│   │   ├── A0A087X1C5.fasta
│   │   ├── AF-A0A087X1C5-F1-model_v2.pdb
│   │   ├── AF-A0A087X1C5-F1-model_v2.cif
│   │   ├── AF-A0A087X1C5-F1-model_v2_ANVIL.pdb
│   │   └── AF-A0A087X1C5-F1-model_v2_ppm.PDB
│   └── ...
└── beta
    └── P45880
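The folder layout above can be navigated with a small helper such as the sketch below. The function names are hypothetical. The three-line layout assumed for the per-protein FASTA file (header, sequence, TMbed prediction, each on one line) is an assumption based on the file description, and the PPM3 file name follows the file list, whereas the example tree shows an upper-case ".PDB" extension, so the case may need adjusting.

```python
from pathlib import Path


def protein_files(root, tm_class, uniprot_id):
    """Return the expected file paths for one TMvis protein folder."""
    if tm_class not in ("alpha", "beta"):
        raise ValueError("tm_class must be 'alpha' or 'beta'")
    folder = Path(root) / tm_class / uniprot_id
    stem = f"AF-{uniprot_id}-F1-model_v2"
    return {
        "fasta": folder / f"{uniprot_id}.fasta",
        "pdb": folder / f"{stem}.pdb",
        "cif": folder / f"{stem}.cif",
        "anvil": folder / f"{stem}_ANVIL.pdb",
        # Assumption: lower-case extension, as in the file list.
        "ppm": folder / f"{stem}_ppm.pdb",
    }


def read_tmbed_fasta(text):
    """Parse header, sequence and per-residue prediction from FASTA text.

    Assumption: exactly three non-empty lines in that order.
    """
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if len(lines) != 3:
        raise ValueError("expected header, sequence and prediction lines")
    header, sequence, prediction = lines
    if len(sequence) != len(prediction):
        raise ValueError("sequence and prediction differ in length")
    return header.lstrip(">"), sequence, prediction
```

For example, `protein_files("TMvis", "alpha", "A0A087X1C5")["anvil"]` points to the ANVIL embedding of the first protein in the tree.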
TMvis visualization: The 3D visualization of every protein in the TMvis dataset can be accessed using the Jupyter Notebook “TMvis.ipynb”. It contains detailed descriptions of the different membrane prediction tools ANVIL, PPM3, and TMbed as well as the respective code. Additionally, it can visualize the per-residue confidence scores (pLDDT) of AlphaFold.
——————————————————————————————————————————————————————————————————————————
References:
[1] TMbed - Bernhofer, Michael, and Burkhard Rost. 2022. “TMbed – Transmembrane Proteins Predicted through Language Model Embeddings.” bioRxiv.
[2] ProtT5 - A. Elnaggar et al., "ProtTrans: Towards Cracking the Language of Life's Code Through Self-Supervised Deep Learning and High Performance Computing," in IEEE Transactions on Pattern Analysis and Machine Intelligence, doi: 10.1109/TPAMI.2021.3095381.
[3] UniProt - UniProt Consortium (2021). UniProt: the universal protein knowledgebase in 2021. Nucleic Acids Research, 49(D1), D480–D489.
[4] AlphaFold - Jumper, John, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, et al. 2021. “Highly Accurate Protein Structure Prediction with AlphaFold.” Nature 596 (7873): 583–89.
[5] AlphaFold DB - Varadi, Mihaly, Stephen Anyango, Mandar Deshpande, Sreenath Nair, Cindy Natassia, Galabina Yordanova, David Yuan, et al. 2022. “AlphaFold Protein Structure Database: Massively Expanding the Structural Coverage of Protein-Sequence Space with High-Accuracy Models.” Nucleic Acids Research 50 (D1): D439–44.
[6] ANVIL - Postic, Guillaume, Yassine Ghouzam, Vincent Guiraud, and Jean-Christophe Gelly. 2016. “Membrane Positioning for High- and Low-Resolution Protein Structures through a Binary Classification Approach.” Protein Engineering, Design & Selection: PEDS 29 (3): 87–91.
[7] PPM3 - Lomize, Mikhail A., Irina D. Pogozheva, Hyeon Joo, Henry I. Mosberg, and Andrei L. Lomize. 2012. “OPM Database and PPM Web Server: Resources for Positioning of Proteins in Membranes.” Nucleic Acids Research 40 (Database issue): D370–76.
——————————————————————————————————————————————————————————————————————————
License:
This work is licensed under a Creative Commons Attribution 4.0 International License (CC-BY 4.0).
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Encyclopedia of Domains (TED) is a joint effort by CATH (Orengo group) and the Jones group at University College London to identify and classify protein domains in AlphaFold2 models from AlphaFold Database version 4, covering over 188 million unique sequences and 324 million domain assignments.
In this data release, we will be making available to the community a table of domain boundaries and additional metadata on quality (pLDDT, globularity, number of secondary structures), taxonomy and putative CATH SuperFamily or Fold assignments for all 324 million domains in TED100.
For all chains in the TED-redundant dataset, the attached file contains boundary predictions, the consensus level and information on the TED100 representative.
Additionally, an archive with chain-level consensus domain assignments is available for 21 model organisms and 25 global health proteomes.
For both TED100 and TEDredundant, we provide the domain boundary predictions output by each of the three methods employed in the project (Chainsaw, Merizo, UniDoc).
We are also making available PDB files for 7,427 novel folds identified during the TED classification process, together with an annotation table sorted by novelty.
Please use the gunzip command to extract files with a '.gz' extension.
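Where the gunzip command is not available, the same extraction can be done with the Python standard library. This is a minimal sketch; the function name is hypothetical, and unlike the default behaviour of gunzip it leaves the compressed file in place.

```python
import gzip
import shutil
from pathlib import Path


def gunzip(path):
    """Decompress 'name.ext.gz' to 'name.ext' next to it and return the new path."""
    src = Path(path)
    if src.suffix != ".gz":
        raise ValueError(f"not a .gz file: {src}")
    dst = src.with_suffix("")  # drops only the trailing '.gz'
    # Stream in chunks so that large tables do not have to fit in memory.
    with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dst
```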
CATH annotations have been assigned using the FoldSeek algorithm applied in various modes and the FoldClass algorithm, both of which are used to report significant structural similarity to a known CATH domain.
Note: The TED protocol differs from that of our standard CATH Assignment protocol for superfamily assignment, which also involves HMM-based protocols and manual curation for remote matches.
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0)https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
100 proteins were selected from the set of human proteins covered by the AlphaFold Protein Structure Database. For each protein, 20 random seed values in the range from 1 to 999999999 were generated. Based on the canonical peptide sequence of the protein, JSON input files for the AlphaFold3 server were created and then manually submitted. The output was manually downloaded and processed with Matlab scripts that are contained in the folders for the individual proteins. This processing requires MMMx, which is available at https://github.com/gjeschke/MMMx. The main folder contains two additional Matlab scripts for analysis of the whole set of ensembles.
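The seed generation and input file creation could look roughly like the sketch below. It is not the authors' script: the function name and job naming scheme are hypothetical, and the field names ("name", "modelSeeds", "sequences", "proteinChain") are an assumption about the AlphaFold Server JSON job format that should be checked against the server documentation before use.

```python
import json
import random


def make_af3_server_jobs(name, sequence, n_seeds=20, rng=None):
    """Build one single-chain job per random seed for a protein sequence."""
    rng = rng or random.Random()
    # Seeds are drawn from the range stated in the description (inclusive).
    seeds = [rng.randint(1, 999_999_999) for _ in range(n_seeds)]
    return [
        {
            "name": f"{name}_seed{seed}",  # hypothetical naming scheme
            "modelSeeds": [str(seed)],     # assumed field name and string type
            "sequences": [
                {"proteinChain": {"sequence": sequence, "count": 1}}
            ],
        }
        for seed in seeds
    ]


def write_jobs(jobs, path):
    """Write the job list to a JSON file for manual upload."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(jobs, handle, indent=2)
```

Passing a seeded `random.Random` instance as `rng` makes the seed list reproducible.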
This dataset contains the structural models for the primary transcripts of the Rhodopseudomonas palustris proteome. For each protein, the five models inferred from AlphaFold 2 are provided. The model with the highest pTM score for each protein was energy minimized; this minimized structure as well as its AlphaFold pickle output file are also provided. This set of structures represents an alternate source of models for the R. palustris proteome to those available in the AlphaFold Protein Structure Database.
Recent spectacular advances by AI programs in 3D structure predictions from protein sequences have revolutionized the field in terms of accuracy and speed. The resulting "folding frenzy" has already produced predicted protein structure databases for the entire human and other organisms' proteomes. However, rapidly ascertaining a predicted structure's reliability based on measured properties in solution should be considered. Shape-sensitive hydrodynamic parameters such as the diffusion and sedimentation coefficients (D0t(20,w),s0(20,w)) and the intrinsic viscosity ([η]) can provide a rapid assessment of the overall structure likeliness, and SAXS would yield the structure-related pair-wise distance distribution function p(r) vs. r. Using the extensively validated UltraScan SOlution MOdeler (US‑SOMO) suite, a database was implemented calculating from AlphaFold structures the corresponding D0t(20,w), s0(20,w), [η], p(r) vs. r, and other parameters. Circular dichroism spectra were computed u...
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Chagas disease is a devastating neglected disease caused by the parasite Trypanosoma cruzi, which affects millions of people worldwide. The two anti-parasitic drugs available, nifurtimox and benznidazole, have good efficacy against the acute stage of the infection. However, this stage is short, usually asymptomatic, and often goes undiagnosed. Treatment is mostly accessed during the chronic stage, when life-threatening cardiac and/or digestive symptoms manifest. By then, the efficacy of both drugs is diminished, and their long administration regimens frequently involve adverse effects that compromise treatment compliance. Therefore, the discovery of safer and more effective drugs is an urgent need. Despite its advantages over the recently used phenotypic screening, target-based identification of new anti-parasitic molecules has been hampered by incomplete annotation and a lack of structures for the parasite protein space. Presently, the AlphaFold Protein Structure Database is home to 19,036 protein models from T. cruzi, which could hold the key not only to describing new therapeutic approaches, but also to shedding light on the molecular mechanisms of action of known compounds. In this proof-of-concept study, we screened the AlphaFold T. cruzi set of predicted protein models to find prospective targets for a pre-selected list of compounds with known anti-trypanosomal activity using docking-based inverse virtual screening. The best receptors (targets) for the most promising ligands were analyzed in detail to address molecular interactions and the potential drugs’ mode of action. The results provide insight into the mechanisms of action of the compounds and their targets, and pave the way for new strategies for finding novel compounds or optimizing existing ones.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The AlphaFold set contains computationally generated structures for metalloproteins that were used to test MAHOMES II's enzyme/non-enzyme predictive performance (Feehan et al. 2023).
- README.md: Overview, detailed description of AlphaFold set generation, and description of content in this directory.
- AF-...-model_v2.pdb: Files with the 3D atomic coordinates of a metalloprotein.
- addMetalIon.py: Code used to place metal ion sites in apo metalloprotein structures.
- MAHOMES-II_AlphaFold_set_site_data.csv: Contains the data used during the generation of the AlphaFold set for the final sites. Columns are:
  - Entry: The UniProt accession number of the protein with the bound metal site.
  - struc_id: The structure's AlphaFold DB name (February 2022) and the name of the file in this directory with the added metal site.
  - metal_resName: The two-letter PDB residue abbreviation for the site's metal.
  - metal_seqID: The residue index number for the added metal ion.
  - Enzyme: The enzyme (True) or non-enzyme (False) label.
  - Entry name: UniProt entry name.
  - Protein names: The metalloprotein name(s) provided by UniProt.
  - Number of homologs with solved structures (PDB): Number of protein sequences in the PDB (May 21, 2020) with an E-value < 1.
  - Number of homologs in MAHOMES II dataset and T-metal-sites10: Number of protein sequences used to train and evaluate MAHOMES II with an E-value < 1 (0 for all entries).
  - Metal binding note: UniProt metal binding note that includes information covering the metal’s identity and catalytic flag.
  - Metal coordinating residue seqIDs: The sequence indices for the metal coordinating residues included in UniProt’s metal binding section.
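The site table can be read with the standard csv module, as in the sketch below. It assumes that the CSV header uses exactly the column names listed above and that the Enzyme column holds the text "True" or "False"; the function names and output keys are hypothetical.

```python
import csv


def load_sites(csv_file):
    """Read metal-site rows from an open CSV file object."""
    sites = []
    for row in csv.DictReader(csv_file):
        sites.append(
            {
                "entry": row["Entry"],
                "struc_id": row["struc_id"],
                "metal": row["metal_resName"],
                "metal_seq_id": int(row["metal_seqID"]),
                # Assumption: the label is stored as the text True/False.
                "is_enzyme": row["Enzyme"].strip().lower() == "true",
            }
        )
    return sites


def split_by_label(sites):
    """Separate enzyme sites from non-enzyme sites."""
    enzymes = [s for s in sites if s["is_enzyme"]]
    non_enzymes = [s for s in sites if not s["is_enzyme"]]
    return enzymes, non_enzymes
```

Typical use would be `with open("MAHOMES-II_AlphaFold_set_site_data.csv", newline="") as fh: sites = load_sites(fh)`.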
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Comprehensive database of Discoba protein sequences, gathered for the purpose of improving protein structure predictions of Discoba species (including Trypanosoma and Leishmania) by AlphaFold and RoseTTAFold. Originally gathered for use with: https://github.com/zephyris/discoba_alphafold
https://spdx.org/licenses/CC0-1.0.html
For the semi-automated identification of potential cleavage specificity-modulating SNVs, we developed a program called CAIRA (Catalytic Associated Irregular Residue Analyser). Given the UniProt ID of a protease, CAIRA takes the predicted AlphaFold structure, generates a sphere with a user-definable radius (e.g. 20 Å) around the active site and filters the COSMIC database for cancer-associated SNVs within this radius, which might have cleavage specificity-modulating effects. CAIRA can be used with all types of proteases (metallo, serine, aspartyl and cysteine proteases). To further assess their potential impact, SNV-affected amino acids are labelled in the protein structure of a downloadable PDB file.
Methods: CAIRA was programmed in Python. Protein structures were obtained from AlphaFold. Cleavage specificity logos were generated as described above using cleavage site specificity data from MEROPS, the Peptidase Database. Cancer-associated SNV data were taken from the COSMIC database (cancer.sanger.ac.uk). Catalytically important amino acid residue(s) within the active site were extracted from the UniProt annotation and are used as the centre of the sphere. For proteases with more than one catalytically important amino acid residue within the active site (aspartyl, serine and cysteine proteases), CAIRA uses the midpoint between the relevant residues as the centre of the sphere. In the beta version, the ‘use binding cavity (beta)’ button can additionally be used to switch to a mode in which the binding cavity is imitated and SNVs within it are reported. For this purpose, cylinders of adjustable size are placed around the chemical bonds of the backbone of the protease's propeptide, and amino acids of the protease that lie with at least one atom within these cylinders are selected and checked for annotated SNVs. The logo of CAIRA was designed with the help of Adobe Firefly.
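The sphere-based filter can be illustrated with the following sketch. It is not CAIRA's code: the function names are hypothetical, each residue is reduced to a single representative coordinate (for example its Cα atom), and the "middle" between catalytic residues is interpreted as the centroid of their coordinates.

```python
import math


def sphere_centre(catalytic_coords):
    """Centroid of the catalytic residue coordinates (x, y, z tuples)."""
    n = len(catalytic_coords)
    return tuple(sum(c[i] for c in catalytic_coords) / n for i in range(3))


def residues_within_radius(residue_coords, centre, radius=20.0):
    """Residue numbers whose coordinate lies within `radius` Å of the centre."""
    return sorted(
        res for res, xyz in residue_coords.items()
        if math.dist(xyz, centre) <= radius
    )


def snvs_near_active_site(snv_positions, residue_coords,
                          catalytic_residues, radius=20.0):
    """Keep only SNV positions that fall inside the sphere."""
    centre = sphere_centre([residue_coords[r] for r in catalytic_residues])
    nearby = set(residues_within_radius(residue_coords, centre, radius))
    return sorted(p for p in snv_positions if p in nearby)
```

Here `residue_coords` maps residue numbers to coordinates taken from the AlphaFold model, and `snv_positions` would come from a COSMIC export for the same protein.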
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Tables
Table S1. List of fungal genomes analyzed in this work, associated references and properties.
Table S2. List of all secreted proteins less than 300 amino-acids from the 20 fungal genomes. The table includes Signalp4.0 output, mature sequence, Espritz % disorder, pfam domains, AlphaFold top prediction pLDDT and the associated pdb file in Dataset S1.
Table S3. Top hits to the PDB database for all OCE structures. 'network_node_name' corresponds to the protein identifier in the OCE structure similarity network provided in Dataset S3. 'Hidef_raw_community' corresponds to groups of structural OCE analogs identified by HiDEF community detection performed on the network provided in Dataset S3.
Table S4. List of the 62 major OCE folds with associated statistics. Columns I to AB provide the number of occurrences per species. Note that the actual number of members per species might be underestimated due to the stringent pipeline used for OCE identification (excluding proteins larger than 300 amino acids or containing PFAMs for instance).
Table S5. Relative surface exposure, conformational flexibility and conservation data mapped on residues of members of the Alt-A1 and BoNT families. RMSD, root mean square deviation for all aligned atoms; Conservation, percentage conservation in multiple structure alignment.
Table S6. Assignment of NCBI accessions to MMseqs clusters and assignment of MMseqs clusters to HMM matching-based super-clusters.
Table S7. Co-mutation occurrences and associated p-values in two OCE clades from the Alt-A1 and KP6 families.
Table S8. Amino acid properties inferred from mutation scans and frustration analyses in Alt-A1 cluster yellow1 and KP6 cluster 43. 'Number of aa variants' corresponds to the number of different amino acids found at each position (deletion counts as 1). 'Alanine scan ∆Z' and 'Deletion scan ∆Z' correspond to the difference between the Z-score for the native protein against itself and the Z-score for the native protein against the mutant at each position (either alanine replacement or 5-aa deletion). 'Destabilization factor' is the average of columns E and F. 'Stabilization factor' corresponds to the difference between the expected structural variation due to the destabilization factor and the observed structural variation in multiple mutants. 'netEffect' is the difference between columns G and H. 'Max co-mutation %' is the highest frequency of co-mutation observed with other residues in natural variants, with 'Min co-mutation p-value (Bonferroni corrected)' the associated p-value.
Table S9. List of natural variants and in silico mutants from the Alt-A1 cluster 25 analyzed in this work, including protein sequence and structure comparison scores (comparison with the reconstructed clade ancestor n0).
Table S10. List of natural variants and in silico mutants from the KP6 cluster 43 analyzed in this work, including protein sequence and structure comparison scores (comparison with the reconstructed clade ancestor n0).
Table S11. Summary statistics for the phylogenetic trees of 15 OCE clades analyzed for structure and frustration evolution.
Table S12. Mapping of structural and frustration data onto phylogenetic trees for 15 OCE clades. The corresponding trees and protein structures are provided in Dataset S7.
Datasets
Dataset S1. AlphaFold rank1 models for 3 927 OCEs (.pdb format).
Dataset S2. Pairwise structure comparison for 3 911 OCE. DALI matrix output containing pairwise Z-scores.
Dataset S3. Network file including 2 561 OCEs with 3 or more vertices of Z-score weight 5.2 or more, in .sif and .xgmml formats.
Dataset S4. Videos illustrating the mapping of relative surface exposure and structural variability in Alt-A1 and BoNT groups, amino-acid conservation, co-selected mutation patches and residue net stabilization effects on Alt-A1 clade 25 ancestor and KP6 cluster 43 ancestor. Color scales are as in Figures 2 and 3 respectively (.mp4 format).
Dataset S5. Phylogenetic trees (.nwk), ancestral (.fasta) and modern variant (.faa) sequences, and AlphaFold best protein models (.pdb) for members of KP6 cluster 43 and Alt-A1 cluster 25. The archive includes 140 Alt-A1 protein structures and 128 KP6 protein structures.
Dataset S6. Best predicted structures for 917 natural variants and mutants of AA1_cl25 and 801 natural variants and mutants of KP6_cl43 (.pdb format).
Dataset S7. Phylogenetic trees (.nwk) and AlphaFold best protein models (.pdb) for 15 OCE clades. The file includes 2 598 protein structures distributed across the clades AA1_s (139), AA1_t (135), AA1_y1 (140), AA1_y2 (90), AA1_y3 (128), BoNT_s (291), CIP_s (167), CIP_t (231), crystallin (233), GNK2 (189), KP6_cl3 (203), KP6_cl26 (111), KP6_cl43 (123), KP6_cl96 (231), KP6_cl242 (187).
Text and Figures
Text S1. Contains supplementary methods, results and figures S1 to S13.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Driven by the development and upscaling of fast genome sequencing and assembly pipelines, the number of protein-coding sequences deposited in public protein sequence databases is increasing exponentially. Recently, the dramatic success of deep learning-based approaches applied to protein structure prediction has done the same for protein structures. We are now entering a new era in protein sequence and structure annotation, with hundreds of millions of predicted protein structures made available through the AlphaFold database. These models cover most of the catalogued natural proteins, including those difficult to annotate for function or putative biological role based on standard, homology-based approaches. In this work, we quantified how much of this "dark matter" of the natural protein universe was structurally illuminated by AlphaFold2 at a high predicted accuracy and modelled this diversity as an interactive sequence similarity network that can be navigated at https://uniprot3d.org/atlas/AFDB90v4. The dataset deposited here corresponds to the metadata generated, which forms the basis of the constructed similarity network and its interpretation. These files were either generated or processed using the code available at https://github.com/ProteinUniverseAtlas/AFDB90v4.
This repository further contains the detailed, individual sequence similarity networks (in CLANS format) generated for the 3 example protein (super)families described in the text.
The full content of this repository includes:
(1) AFDBv4_pLDDT_diggestion_UniRef50_2023-02-01.csv: table listing all UniRef50 clusters in UniProt, including information on structural representatives from AFDB. Each column provides different annotations, including functional brightness, median and best pLDDT, brightness and structural representatives, etc.
(2) AFDBv4_DUF_dark_diggestion_UniRef50_2023-02-06.csv: table listing all UniRef50 clusters in UniProt and whether they include proteins mapped to known domains of unknown function (DUF).
(3) AFDBv4_90.fasta: FASTA file with the sequences of all UniRef50 clusters selected and used for the all-against-all mmseqs searches that form the basis of the network.
(4) AFDB90v4_data.csv: the subset of file (1) that corresponds to the AFDB90v4 dataset, including columns such as functional brightness, median and best pLDDT, brightness and structural representatives, etc.
(5) AFDB90v4_data_with_graph_labels.csv: table listing each individual UniRef50 cluster included in the AFDB90v4 dataset, together with its mapping to communities and connected components.
(6) AFDB90v4_cc_data.csv: table of UniRef50 clusters in connected components, including their annotations and the columns in file (5).
(7) AFDB90v4_cc_data_uniprot_community_taxonomy_map.csv: mapping of each uniprotAC entry to its corresponding component, community and taxonomy.
(8) AFDB90v4_subgraphs_summary.csv: table summarising the properties of individual connected components, including the average brightness, the number of members, the number of unique protein sequences, the median length, and the number of communities.
(9) communities_summary.csv: table summarising the properties of individual communities, including average brightness, the number of members, the number of unique protein sequences, the median length, the most common superkingdom represented, the average structure outlier score, etc.
(10) communities_edge_list-coordinates.csv: the coordinates of each community in the graphical representation. Singleton communities or singleton UniRef50 clusters are not included.
(11) communities_edge_list_no_duplicates.csv: list of edges making up the graph.
(12) node_class.json: map between each UniRef50 cluster and its corresponding component and community.
(13) subgraphs.tar.gz: tar file containing gml files for each individual connected component.
(14) AFDB90v4_outlier_scores.tsv: table containing the outlier scores for each community representative.
(15) AFDB90v4_dark_galaxies_summary.csv: table containing the summary of all dark connected components, including average brightness, median length, representatives, number of communities, etc.
(16) AFDB90v4_uniprot_naming_assessment_counts.csv: table listing the per-component semantic diversity scores, as well as the major source of the titles of the proteins included and their count.
(17) uniprot_naming_assessment.tar.gz: tar file containing the per-community assessment of predicted protein names in UniProt as of February 2023.
(18) CLANS_files.tar.gz: stores the 3 sequence similarity networks, in CLANS format, constructed for the analysis of the sequence diversity and sequence similarities of the proteins in components 27, 159 and 3314. These CLANS files form the basis of panel A in figures 3 and 4 and extended data figure 5.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Protonation states of ionizable protein residues modulate many essential biological processes. For correct modeling and understanding of these processes, it is crucial to accurately determine their pKa values. Here, we present four tree-based machine learning models for protein pKa prediction. The four models, Random Forest, Extra Trees, eXtreme Gradient Boosting (XGBoost), and Light Gradient Boosting Machine (LightGBM), were trained on three experimental PDB and pKa datasets, two of which included a notable portion of internal residues. We observed similar performance among the four machine learning algorithms. The best model trained on the largest dataset performs 37% better than the widely used empirical pKa prediction tool PROPKA and 15% better than the published result from the pKa prediction method DelPhiPKa. The overall root-mean-square error (RMSE) for this model is 0.69, with surface and buried RMSE values being 0.56 and 0.78, respectively, considering six residue types (Asp, Glu, His, Lys, Cys, and Tyr), and 0.63 when considering Asp, Glu, His, and Lys only. We provide pKa predictions for proteins in the human proteome from the AlphaFold Protein Structure Database and observed that 1% of Asp/Glu/Lys residues have highly shifted pKa values close to the physiological pH.
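For reference, the RMSE metric quoted above can be computed as in the sketch below. Reading "X% better" as the relative reduction in RMSE compared with a baseline is an assumption, since the description does not define it; the function names are hypothetical.

```python
import math


def rmse(predicted, experimental):
    """Root-mean-square error between predicted and experimental pKa values."""
    if len(predicted) != len(experimental) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - e) ** 2 for p, e in zip(predicted, experimental)]
    return math.sqrt(sum(squared) / len(squared))


def relative_improvement(rmse_model, rmse_baseline):
    """Fractional RMSE reduction relative to a baseline (assumed definition)."""
    return (rmse_baseline - rmse_model) / rmse_baseline
```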
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
To effectively assess the capabilities of various methods in flexible protein-protein docking, it is essential for a protein-protein docking dataset to encompass not only the structures of the heterodimer but also that of unbound monomers. Existing datasets such as DB5.5 and AB-Benchmark, while useful, are relatively limited in scale. In contrast, the Database of Interacting Protein Structures (DIPS) contains up to 42,826 binary protein complex structures but lacks the unbound state structures of the monomers. This limitation restricts its applicability to evaluations of rigid docking models rather than flexible ones. Consequently, the impact of large-scale docking datasets on methods for flexible protein-protein docking has not been thoroughly explored. To address this gap, we introduce the Flexible Protein-Protein Docking Benchmark (FD1.0), which, to our knowledge, is currently the largest dataset dedicated to flexible protein-protein docking. By providing a large and well-characterized dataset, FD1.0 aims to foster innovation in the development of flexible docking algorithms. It allows researchers to rigorously test and refine their methods, facilitating more accurate predictions of protein interactions, which are essential for understanding biological functions and designing therapeutic interventions.
In our analysis of the DIPS dataset, we identified several critical issues: (1) Multiple three-dimensional structures correspond to a single protein sequence, introducing substantial noise and affecting fair comparisons among baselines, especially for models reliant on 3D structural data. (2) The DIPS training set, primarily consisting of homo-multimers, fails to capture the diversity of interface types fully. Moreover, protein-protein docking predictions are most valuable for elucidating mechanisms of protein-protein interactions (PPIs), which predominantly involve heterodimers. Homomers, often synthesized directly rather than through docking, do not accurately represent typical PPI scenarios. (3) A significant number of docking cases in DIPS involve the interaction of one polymeric protein with another, further complicating the dataset.
As a cornerstone for the flexible docking dataset, it is imperative to acquire the structures of protein monomers in their unbound state. Specifically, this can be achieved through protein structure prediction methods, such as AlphaFold2, and the aggregation of structural data from sources including electron microscopy. Additionally, acknowledging the deficiencies of the DIPS dataset, several guidelines were established in the construction process of the FD1.0 dataset: (1) Each protein monomer is associated with a unique three-dimensional structure, reducing dataset noise. (2) We ensured that the similarity score (as determined by MMSeq) between docking monomers does not exceed 0.6, thereby filtering out homodimeric pairs from the dataset. (3) Unlike DIPS, in which a certain proportion of cases actually involve the docking of two protein multimers, FD1.0 contains only monomer docking cases. Current methods for predicting multimeric structures, such as AlphaFold Multimer, still do not achieve satisfactory results (AlphaFold3's license prohibits its use for docking purposes), whereas current methods for predicting monomeric structures have reached a high level of accuracy. Therefore, we filtered out multimer cases, ensuring that each docking instance involves only protein monomers and guaranteeing the quality of the dataset. By adhering to these standardized construction criteria and through the collection, cleaning, and organization of data from various sources, including the Protein Data Bank and existing datasets, we compiled 3721 entries. Following the DIPS division ratio, these entries were divided into training, validation, and test sets of 3546, 98, and 77, respectively.
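Guideline (2) and the final split can be sketched as follows. The similarity scores are assumed to be precomputed (the sketch does not run any sequence search), the record layout and function names are hypothetical, and the random shuffle is an assumption because the description gives only the split sizes.

```python
import random


def filter_heterodimers(pairs, max_similarity=0.6):
    """Keep pairs whose monomer-monomer similarity does not exceed the cutoff.

    Each pair is a dict with a precomputed 'similarity' value in [0, 1].
    """
    return [p for p in pairs if p["similarity"] <= max_similarity]


def split_dataset(entries, n_val=98, n_test=77, seed=0):
    """Shuffle entries and split them into train, validation and test sets."""
    if len(entries) < n_val + n_test:
        raise ValueError("not enough entries for the requested split")
    shuffled = list(entries)
    random.Random(seed).shuffle(shuffled)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test
```

With 3721 entries and the default sizes, the split yields 3546 training, 98 validation and 77 test cases, matching the numbers stated above.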
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains supplementary replication data for the paper titled "DIPS-Plus: The Enhanced Database of Interacting Protein Structures for Interface Prediction". In particular, it contains a new version of our `final_raw_dips.tar.gz` protein pair representations which now contain (1) residue-level annotations for intrinsic disorder regions (IDRs) as well as (2) a copy of each protein pair representation in the HDF5 file format for programming language-agnostic read capabilities. In addition, this record also contains (3) raw MSAs (in HDF5 file format) generated for each protein pair using Jackhmmer and AlphaFold's small version of the Big Fantastic Database (BFD). Lastly, this record contains (4) PDB metadata derived for each DIPS-Plus complex using Graphein's PDBManager API as well as (5) structure-based (i.e., FoldSeek-based) training and validation splits of the dataset's complexes in the form of respective text files containing the file paths of complexes assigned to each split.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Domain definitions of AlphaFold classifications of the human proteome (v1) from the AlphaFold Database. Also included are classifications of Danio rerio, Mus musculus, Pan paniscus, Drosophila melanogaster, Caenorhabditis elegans used for comparative analysis to human. See README file for descriptions of file formats.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description: TMvis ("TMvis496.tar.gz") is a dataset containing 496 3D structures of predicted human transmembrane proteins (TMPs) and their predicted membrane embedding. The method TMbed [1], based on the protein language model ProtT5 [2], predicted 4,967 TMPs for the human proteome (20,375 proteins, UniProt [3] version April 2022; excluding TITIN_HUMAN due to length). For these proteins, we obtained AlphaFold [4] structures from AlphaFoldDB [5] and kept those with an average per-residue confidence score (pLDDT) above 90. This resulted in the 496 proteins of TMvis, which are listed in "TMvis496.fasta". The membrane embedding was predicted using the methods ANVIL [6] and PPM3 [7], and from per-residue TMbed predictions. As the three methods are based on different approaches, we decided to publish results for all of them. The figure “TMvis_project_overview.png” provides a graphical overview of each step described above.
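The pLDDT filter can be illustrated with a short Python sketch. AlphaFold PDB files store the per-residue pLDDT in the B-factor field (columns 61 to 66) of each ATOM record, so averaging that field over the C-alpha atoms gives one value per residue. The function names are hypothetical and this is not necessarily how the TMvis authors computed the average.

```python
from __future__ import annotations

from typing import Iterable

PLDDT_CUTOFF = 90.0  # threshold taken from the description above


def mean_plddt(pdb_lines: Iterable[str]) -> float:
    """Average the B-factor field (pLDDT in AlphaFold files) over C-alpha atoms."""
    scores = []
    for line in pdb_lines:
        # Atom name sits in columns 13-16, B-factor in columns 61-66 (1-based).
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores.append(float(line[60:66]))
    if not scores:
        raise ValueError("no C-alpha ATOM records found")
    return sum(scores) / len(scores)


def passes_cutoff(pdb_lines: Iterable[str], cutoff: float = PLDDT_CUTOFF) -> bool:
    """Return True if the mean pLDDT is strictly above the cutoff."""
    return mean_plddt(pdb_lines) > cutoff
```

To apply it to a downloaded model, open the PDB file and pass the file handle, for example `with open(path) as handle: passes_cutoff(handle)`.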
TMvis Folder Structure: TMvis is separated into “alpha” containing predicted alpha-helical TMPs, and “beta” containing predicted beta-barrel TMPs. Within these folders, each protein is assigned one folder, identifiable by the respective unique UniProt ID. Each protein folder consists of:
- “UniprotID.fasta” with UniProt ID, sequence, TMbed per-residue prediction
- “AF-UniprotID-F1-model_v2.pdb” with the AlphaFold structure
- “AF-UniprotID-F1-model_v2.cif” with the AlphaFold structure
- “AF-UniprotID-F1-model_v2_ANVIL.pdb” with predicted ANVIL membrane embedding
- “AF-UniprotID-F1-model_v2_ppm.pdb” with predicted PPM3 membrane embedding
TMvis
│
├── alpha
│   │
│   ├── A0A087X1C5
│   │   ├── A0A087X1C5.fasta
│   │   ├── AF-A0A087X1C5-F1-model_v2.pdb
│   │   ├── AF-A0A087X1C5-F1-model_v2.cif
│   │   ├── AF-A0A087X1C5-F1-model_v2_ANVIL.pdb
│   │   └── AF-A0A087X1C5-F1-model_v2_ppm.PDB
│   └── ...
└── beta
    └── P45880
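Given this layout, the files for one protein can be located from its UniProt ID alone. The following Python sketch builds the expected paths from the naming scheme listed above. The function names are hypothetical, and because the listing shows the PPM3 file with both a `.pdb` and a `.PDB` extension, the lookup matches file names case-insensitively as a precaution.

```python
from __future__ import annotations

from pathlib import Path
from typing import Optional

# File name templates taken from the folder description above.
TEMPLATES = {
    "fasta": "{uid}.fasta",
    "pdb": "AF-{uid}-F1-model_v2.pdb",
    "cif": "AF-{uid}-F1-model_v2.cif",
    "anvil": "AF-{uid}-F1-model_v2_ANVIL.pdb",
    "ppm": "AF-{uid}-F1-model_v2_ppm.pdb",
}


def protein_files(root: str, tmp_class: str, uniprot_id: str) -> dict[str, Path]:
    """Build the expected file paths for one protein folder."""
    if tmp_class not in ("alpha", "beta"):
        raise ValueError("tmp_class must be 'alpha' or 'beta'")
    folder = Path(root) / tmp_class / uniprot_id
    return {key: folder / tpl.format(uid=uniprot_id) for key, tpl in TEMPLATES.items()}


def resolve(path: Path) -> Optional[Path]:
    """Find the file on disk, ignoring upper/lower case in the file name."""
    if not path.parent.is_dir():
        return None
    for candidate in path.parent.iterdir():
        if candidate.name.lower() == path.name.lower():
            return candidate
    return None


files = protein_files("TMvis", "alpha", "A0A087X1C5")
print(files["anvil"])
```

Calling `resolve(files["ppm"])` after unpacking the archive returns the actual file regardless of the extension's case, or `None` if the file is missing.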
TMvis visualization: The 3D visualization of every protein in the TMvis dataset can be accessed using the Jupyter Notebook “TMvis.ipynb”. It contains detailed descriptions of the different membrane prediction tools ANVIL, PPM3, and TMbed, as well as the respective code. Additionally, it can visualize the per-residue confidence scores (pLDDT) of AlphaFold.
——————————————————————————————————————————————————————————————————————————
References:
[1] TMbed - Bernhofer, Michael, and Burkhard Rost. 2022. “TMbed – Transmembrane Proteins Predicted through Language Model Embeddings.” bioRxiv.
[2] ProtT5 - Elnaggar, A., et al. “ProtTrans: Towards Cracking the Language of Life's Code Through Self-Supervised Deep Learning and High Performance Computing.” IEEE Transactions on Pattern Analysis and Machine Intelligence. doi: 10.1109/TPAMI.2021.3095381.
[3] UniProt - UniProt Consortium. 2021. “UniProt: The Universal Protein Knowledgebase in 2021.” Nucleic Acids Research 49 (D1): D480–D489.
[4] AlphaFold - Jumper, John, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, et al. 2021. “Highly Accurate Protein Structure Prediction with AlphaFold.” Nature 596 (7873): 583–89.
[5] AlphaFold DB - Varadi, Mihaly, Stephen Anyango, Mandar Deshpande, Sreenath Nair, Cindy Natassia, Galabina Yordanova, David Yuan, et al. 2022. “AlphaFold Protein Structure Database: Massively Expanding the Structural Coverage of Protein-Sequence Space with High-Accuracy Models.” Nucleic Acids Research 50 (D1): D439–44.
[6] ANVIL - Postic, Guillaume, Yassine Ghouzam, Vincent Guiraud, and Jean-Christophe Gelly. 2016. “Membrane Positioning for High- and Low-Resolution Protein Structures through a Binary Classification Approach.” Protein Engineering, Design & Selection: PEDS 29 (3): 87–91.
[7] PPM3 - Lomize, Mikhail A., Irina D. Pogozheva, Hyeon Joo, Henry I. Mosberg, and Andrei L. Lomize. 2012. “OPM Database and PPM Web Server: Resources for Positioning of Proteins in Membranes.” Nucleic Acids Research 40 (Database issue): D370–76.
——————————————————————————————————————————————————————————————————————————
License:
This work is licensed under a Creative Commons Attribution 4.0 International License (CC-BY 4.0).