17 datasets found
  1. Prediction and Visualization of Human Transmembrane Proteins using AlphaFold...

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip, bin +2
    Updated Jul 16, 2024
    Cite
    Céline Marquet; Anastasia Grekova; Leen Houri; Michael Heinzinger; Burkhard Rost (2024). Prediction and Visualization of Human Transmembrane Proteins using AlphaFold and Protein Language Models [Dataset]. http://doi.org/10.5281/zenodo.6816083
    Explore at:
    Available download formats: png, bin, txt, application/gzip
    Dataset updated
    Jul 16, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Céline Marquet; Anastasia Grekova; Leen Houri; Michael Heinzinger; Burkhard Rost
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Description: TMvis ("TMvis496.tar.gz") is a dataset containing 496 3D structures of predicted human transmembrane proteins (TMPs) and their predicted membrane embedding. The method TMbed [1], based on the protein language model ProtT5 [2], predicted 4,967 TMPs for the human proteome (20,375 proteins, UniProt [3] version April 2022; excluding TITIN_HUMAN due to length). For these proteins, we obtained AlphaFold [4] structures from AlphaFoldDB [5] and kept those with an average per-residue confidence score (pLDDT) of more than 90. This resulted in the 496 proteins of TMvis, which are listed in "TMvis496.fasta". The membrane embedding was predicted using the methods ANVIL [6] and PPM3 [7], as well as per-residue TMbed predictions. As the three methods are based on different approaches, we decided to publish results for all of them. The figure “TMvis_project_overview.png” provides a graphical overview of each step described above.

    TMvis Folder Structure: TMvis is separated into “alpha” containing predicted alpha-helical TMPs, and “beta” containing predicted beta-barrel TMPs. Within these folders, each protein is assigned one folder, identifiable by the respective unique UniProt ID. Each protein folder consists of:
    - “UniprotID.fasta” with UniProt ID, sequence, TMbed per-residue prediction
    - “AF-UniprotID-F1-model_v2.pdb” with the AlphaFold structure
    - “AF-UniprotID-F1-model_v2.cif” with the AlphaFold structure
    - “AF-UniprotID-F1-model_v2_ANVIL.pdb” with predicted ANVIL membrane embedding
    - “AF-UniprotID-F1-model_v2_ppm.pdb” with predicted PPM3 membrane embedding

    TMvis
    ├── alpha
    │   ├── A0A087X1C5
    │   │   ├── A0A087X1C5.fasta
    │   │   ├── AF-A0A087X1C5-F1-model_v2.pdb
    │   │   ├── AF-A0A087X1C5-F1-model_v2.cif
    │   │   ├── AF-A0A087X1C5-F1-model_v2_ANVIL.pdb
    │   │   └── AF-A0A087X1C5-F1-model_v2_ppm.PDB
    │   └── ...
    └── beta
        └── P45880
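
The folder layout above can be turned into a small path helper. This is a minimal sketch, assuming the file-name patterns listed in the folder description; the function name `tmvis_files` and the dictionary keys are hypothetical and not part of the dataset. The tree shows the PPM3 file with an upper-case ".PDB" extension, so the actual case on disk may differ from the pattern used here.

```python
from pathlib import PurePosixPath


def tmvis_files(uniprot_id: str, tmp_class: str, root: str = "TMvis") -> dict:
    """Build the expected per-protein file paths for one TMvis entry.

    tmp_class is "alpha" (alpha-helical TMPs) or "beta" (beta-barrel TMPs).
    Paths are only constructed, not checked against the file system.
    """
    if tmp_class not in ("alpha", "beta"):
        raise ValueError("tmp_class must be 'alpha' or 'beta'")
    folder = PurePosixPath(root) / tmp_class / uniprot_id
    stem = f"AF-{uniprot_id}-F1-model_v2"
    return {
        "fasta": folder / f"{uniprot_id}.fasta",      # sequence + TMbed prediction
        "alphafold_pdb": folder / f"{stem}.pdb",      # AlphaFold structure (PDB)
        "alphafold_cif": folder / f"{stem}.cif",      # AlphaFold structure (mmCIF)
        "anvil_pdb": folder / f"{stem}_ANVIL.pdb",    # ANVIL membrane embedding
        "ppm_pdb": folder / f"{stem}_ppm.pdb",        # PPM3 membrane embedding
    }


paths = tmvis_files("A0A087X1C5", "alpha")
print(paths["anvil_pdb"])
```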

    TMvis visualization: The 3D visualization of every protein in the dataset TMvis can be accessed using the Jupyter Notebook “TMvis.ipynb”. It contains detailed descriptions of the different membrane prediction tools ANVIL, PPM3, and TMbed as well as the respective code. Additionally, it allows users to visualize the per-residue confidence scores (pLDDT) of AlphaFold.

    ——————————————————————————————————————————————————————————————————————————

    References:

    [1] TMbed - Bernhofer, Michael, and Burkhard Rost. 2022. “TMbed – Transmembrane Proteins Predicted through Language Model Embeddings.” bioRxiv.

    [2] ProtT5 - A. Elnaggar et al., "ProtTrans: Towards Cracking the Language of Lifes Code Through Self-Supervised Deep Learning and High Performance Computing," in IEEE Transactions on Pattern Analysis and Machine Intelligence, doi: 10.1109/TPAMI.2021.3095381.

    [3] UniProt - UniProt Consortium (2021). UniProt: the universal protein knowledgebase in 2021. Nucleic acids research, 49(D1), D480–D489.

    [4] AlphaFold - Jumper, John, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, et al. 2021. “Highly Accurate Protein Structure Prediction with AlphaFold.” Nature 596 (7873): 583–89.

    [5] Alphafold DB - Varadi, Mihaly, Stephen Anyango, Mandar Deshpande, Sreenath Nair, Cindy Natassia, Galabina Yordanova, David Yuan, et al. 2022. “AlphaFold Protein Structure Database: Massively Expanding the Structural Coverage of Protein-Sequence Space with High-Accuracy Models.” Nucleic Acids Research 50 (D1): D439–44.

    [6] ANVIL - Postic, Guillaume, Yassine Ghouzam, Vincent Guiraud, and Jean-Christophe Gelly. 2016. “Membrane Positioning for High- and Low-Resolution Protein Structures through a Binary Classification Approach.” Protein Engineering, Design & Selection: PEDS 29 (3): 87–91.

    [7] PPM3 - Lomize, Mikhail A., Irina D. Pogozheva, Hyeon Joo, Henry I. Mosberg, and Andrei L. Lomize. 2012. “OPM Database and PPM Web Server: Resources for Positioning of Proteins in Membranes.” Nucleic Acids Research 40 (Database issue): D370–76.

    ——————————————————————————————————————————————————————————————————————————

    License:

    This work is licensed under a Creative Commons Attribution 4.0 International License (CC-BY 4.0).

  2. The Encyclopedia of Domains (TED) structural domains assignments for...

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip, bz2 +1
    Updated Oct 31, 2024
    + more versions
    Cite
    Andy Lau; Nicola Bordin; Shaun Kandathil; Ian Sillitoe; Vaishali Waman; Jude Wells; Christine Orengo; David T Jones (2024). The Encyclopedia of Domains (TED) structural domains assignments for AlphaFold Database v4 [Dataset]. http://doi.org/10.5281/zenodo.13369203
    Explore at:
    Available download formats: application/gzip, bz2, zip
    Dataset updated
    Oct 31, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Andy Lau; Nicola Bordin; Shaun Kandathil; Ian Sillitoe; Vaishali Waman; Jude Wells; Christine Orengo; David T Jones
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset description:

    The Encyclopedia of Domains (TED) is a joint effort by CATH (Orengo group) and the Jones group at University College London to identify and classify protein domains in AlphaFold2 models from AlphaFold Database version 4, covering over 188 million unique sequences and 324 million domain assignments.

    In this data release, we will be making available to the community a table of domain boundaries and additional metadata on quality (pLDDT, globularity, number of secondary structures), taxonomy and putative CATH SuperFamily or Fold assignments for all 324 million domains in TED100.

    For all chains in the TED-redundant dataset, the attached file contains boundary predictions, consensus level and information on the TED100 representative.

    Additionally, an archive with chain-level consensus domain assignments is available for 21 model organisms and 25 global health proteomes.

    For both TED100 and TED-redundant we provide the domain boundary predictions output by each of the three methods employed in the project (Chainsaw, Merizo, UniDoc).

    We are making available PDB files for 7,427 novel folds, identified during the TED classification process, with an annotation table sorted by novelty.

    Please use the gunzip command to extract files with a '.gz' extension.

    CATH annotations have been assigned using the FoldSeek algorithm applied in various modes and the FoldClass algorithm, both of which are used to report significant structural similarity to a known CATH domain.
    Note: The TED protocol differs from that of our standard CATH Assignment protocol for superfamily assignment, which also involves HMM-based protocols and manual curation for remote matches.


    This dataset contains:

    • ted_214m_per_chain_segmentation.tsv
      The file contains all 214M protein chains in TED with consensus domain boundaries and proteome information in the following columns.
      1. AFDB_model_ID: chain identifier from AFDB in the format AF-
    • ted_365m_domain_boundaries_consensus_level.tsv.gz
      The file contains all domain assignments in TED100 and TED-redundant (365M) in the format:
      1. TED_ID: TED domain identifier in the format AF-
    • ted_100_324m.domain_summary.cath.globularity.taxid.tsv and novel_folds_set.domain_summary.tsv are header-less with the following columns separated by tabs (.tsv).
    • ted_324m_seq_clustering.cathlabels.tsv
      The file contains the results of the domain sequences clustering with MMseqs2.
      Columns:
      1. Cluster_representative
      2. Cluster_member
      3. CATH code assignment if available, e.g. 3.40.50.300 for a domain with a homologous match or 3.20.20 for a domain matching at the fold level in the CATH classification
      4. CATH assignment type - either Foldseek-T, Foldseek-H or Foldclass
    • novel_folds_set.domain_summary.tsv is sorted by novelty.
      1. ted_id - TED domain identifier in the format AF-
    • Domain assignments for TED redundant using single-chain and multi-chain consensus in ted_redundant_39m.multichain.consensus_domain_summary.taxid.tsv and ted_redundant_39m.singlechain.consensus_domain_summary.taxid.tsv
      The files contain a header with the following fields. Each column is tab-separated (.tsv).
      1. TED_redundant_id - TED chain identifier in the format AF-
    • novel_folds_set_models.tar.gz contains PDB files of all novel folds identified in TED100.
    • All per-tool domain boundary predictions are in the same format with the following columns.
      1. TED_chainID - TED chain identifier in the format AF-
    • Domain boundary predictions share the same format: domains are separated by ',', the segments of a domain by '_', and segment boundaries (start, stop) by '-'.

      E.g. domain prediction by Merizo for AF-A0A000-F1-model_v4:
      AF-A0A000-F1-model_v4 e8872c7a0261b9e88e6ff47eb34e4162 394 2 10-52_289-394,53-288 0.90077

      Merizo predicts one discontinuous domain and one continuous domain:
      Domain 1 (discontinuous): 10-52_289-394
      segment 1: 10-52
      segment 2: 289-394
      Domain 2 (continuous): 53-288
      segment 1: 53-288
    • ted-tools-main.zip - copy of the https://github.com/psipred/ted-tools repository, containing tools and software used to generate TED.
    • cath-alphaflow-main.zip - copy of CATH-AlphaFlow, used to generate globularity scores for TED domains.
    • ted-web-master.zip - copy of TED-web, containing code to generate the web interface of TED (https://ted.cathdb.info)
    • gofocus_data.tar.bz2 - GOFocus model weights
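
The domain boundary notation described in the file list above (segments joined by '_', start and stop joined by '-') can be parsed with a few string splits. The sketch below is illustrative only: the function names are hypothetical, and the assumption that ',' separates domains is read off the Merizo example row rather than stated as a rule in the description.

```python
def parse_boundaries(boundaries: str) -> list[list[tuple[int, int]]]:
    """Split a boundary string into domains, each a list of (start, stop) segments."""
    domains = []
    for domain in boundaries.split(","):        # assumed: ',' separates domains
        segments = []
        for segment in domain.split("_"):       # '_' separates segments of one domain
            start, stop = segment.split("-")    # '-' separates start and stop
            segments.append((int(start), int(stop)))
        domains.append(segments)
    return domains


def is_discontinuous(segments: list[tuple[int, int]]) -> bool:
    """A domain built from more than one segment is discontinuous."""
    return len(segments) > 1


# Boundary field from the Merizo example row for AF-A0A000-F1-model_v4.
for number, segments in enumerate(parse_boundaries("10-52_289-394,53-288"), start=1):
    kind = "discontinuous" if is_discontinuous(segments) else "continuous"
    print(f"Domain {number} ({kind}): {segments}")
```

Keeping each domain as a list of integer pairs makes it straightforward to compute domain lengths or to slice the corresponding residues out of a structure file.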
  3. RefSeq virus protein structure prediction database

    • uvaauas.figshare.com
    zip
    Updated Mar 19, 2025
    Cite
    W.E.W. Schravesande; Adriaan Verhage; M.V. Cligge; Raoul Frijters; H.A. van den Burg (2025). RefSeq virus protein structure prediction database [Dataset]. http://doi.org/10.21942/uva.28417079.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Mar 19, 2025
    Dataset provided by
    University of Amsterdam / Amsterdam University of Applied Sciences
    Authors
    W.E.W. Schravesande; Adriaan Verhage; M.V. Cligge; Raoul Frijters; H.A. van den Burg
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    Custom Virus database

    A custom foldseek target database was created, including all protein sequences derived from plant-infecting viruses currently found in the NCBI RefSeq database. In total, 8,191 protein sequences were extracted and used as template for protein structure predictions. Colabfold v1.5.2 (using localcolabfold), which is based upon AlphaFold v2.3.1 (40), was used for protein model prediction.

    Setting: --random-seed 101 --num-seeds 3 --use-dropout --num-models 1 --num-recycle 8 --recycle-early-stop-tolerance 0.5

    No templates were used during the protein model prediction. The uniref30_2302 and colabfold_envdb_202108 databases were used to generate the multiple sequence alignments (https://colabfold.mmseqs.com/). The predicted structures were filtered based on the pLDDT value, resulting in a set of 7545 protein structures with a pLDDT ≥ 50.

    Files
    - modelling_stats.txt: tab-separated file containing the modelling statistics for each structure prediction
    - pdb_files/all: folder containing all pdb files resulting from the structure prediction
    - pdb_files/pLDDT50: folder containing all pdb files resulting from the structure prediction having a pLDDT score of 50 or higher
    - VIRAL_PROTEIN_PLANT_REFSEQ.fasta: fasta file containing all protein sequences extracted from plant-infecting viral genomes uploaded in the NCBI RefSeq database
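
The pLDDT filter described above can be approximated directly from the PDB files, because AlphaFold-style models store the per-residue pLDDT in the B-factor column. The following is a minimal sketch, not the authors' pipeline: it assumes the threshold applies to the mean pLDDT over Cα atoms, and the function names are hypothetical.

```python
def mean_plddt(pdb_text: str) -> float:
    """Average the B-factor column (pLDDT in AlphaFold-style models) over CA atoms."""
    scores = [
        float(line[60:66])                  # PDB columns 61-66: temperature factor
        for line in pdb_text.splitlines()
        if line.startswith("ATOM") and line[12:16].strip() == "CA"
    ]
    if not scores:
        raise ValueError("no CA atoms found")
    return sum(scores) / len(scores)


def filter_by_plddt(models: dict[str, str], threshold: float = 50.0) -> list[str]:
    """Return the names of models whose mean pLDDT is at or above the threshold.

    models maps a file name to the text content of that PDB file.
    """
    return sorted(
        name for name, text in models.items() if mean_plddt(text) >= threshold
    )
```

In practice the dictionary would be filled by reading the files under pdb_files/all; the result could then be compared against the contents of pdb_files/pLDDT50.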

  4. AlphaFold3 ensembles of 100 randomly selected human proteins with 100...

    • zenodo.org
    zip
    Updated Jan 7, 2025
    Cite
    Gunnar Jeschke (2025). AlphaFold3 ensembles of 100 randomly selected human proteins with 100 conformers per ensemble [Dataset]. http://doi.org/10.5281/zenodo.14609656
    Explore at:
    Available download formats: zip
    Dataset updated
    Jan 7, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Gunnar Jeschke
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jan 7, 2025
    Description

    100 proteins were selected from the set of human proteins covered by the AlphaFold Protein Structure Database. For each protein, 20 random seed values in the range from 1 to 999999999 were generated. Based on the canonical peptide sequence of the protein, json input files for the AlphaFold3 server were created and then manually submitted. The output was manually downloaded and processed with Matlab scripts that are contained in the folders for the individual proteins. This processing requires MMMx, which is available at https://github.com/gjeschke/MMMx. The main folder contains two additional Matlab scripts for analysis of the whole set of ensembles.
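
The seed generation step described above could look like the following sketch. It is written in Python for illustration (the dataset itself ships Matlab scripts), it assumes the 20 seeds per protein are distinct, and the job dictionary uses placeholder field names rather than the documented AlphaFold Server input schema.

```python
import json
import random


def make_seeds(n=20, low=1, high=999_999_999, rng=None):
    """Draw n distinct random seeds from the inclusive range [low, high]."""
    rng = rng or random.Random()
    return rng.sample(range(low, high + 1), n)


def make_job(name, sequence, seed):
    """Describe one submission; the keys are hypothetical placeholders."""
    return {"name": f"{name}_seed{seed}", "seed": seed, "sequence": sequence}


# Hypothetical protein name and sequence, used only to show the output shape.
seeds = make_seeds(rng=random.Random(0))
jobs = [make_job("example_protein", "MKTAYIAKQR", seed) for seed in seeds]
print(json.dumps(jobs[0], indent=2))
```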

  5. Data from: Structural Models of the Rhodopseudomonas palustris Proteome

    • osti.gov
    Updated Jul 12, 2023
    Cite
    Coletti, Mark; Davidson, Russell B; Gao, Mu; Sedova, Ada (2023). Structural Models of the Rhodopseudomonas palustris Proteome [Dataset]. https://www.osti.gov/dataexplorer/biblio/dataset/1988165-structural-models-rhodopseudomonas-palustris-proteome
    Explore at:
    Dataset updated
    Jul 12, 2023
    Dataset provided by
    United States Department of Energy (http://energy.gov/)
    Department of Energy Biological and Environmental Research Program
    Office of Science (http://www.er.doe.gov/)
    Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States). Oak Ridge Leadership Computing Facility (OLCF)
    Authors
    Coletti, Mark; Davidson, Russell B; Gao, Mu; Sedova, Ada
    Description

    This dataset contains the structural models for the primary transcripts of the Rhodopseudomonas palustris proteome. For each protein, the five models inferred from AlphaFold 2 are provided. The highest pTM-scoring model for each protein was energy minimized; this minimized structure as well as its AlphaFold pickle output file are also provided. This set of structures represents an alternate source of models for the R. palustris proteome to those available in the AlphaFold Protein Structure Database.
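
Selecting the highest pTM-scoring model out of the five, as described above, reduces to taking a maximum over the per-model scores. This is a minimal sketch; the function name, the model names and the score values are hypothetical, and how the scores are read from the AlphaFold output is left out.

```python
def best_model_by_ptm(ptm_scores: dict[str, float]) -> str:
    """Return the name of the model with the highest pTM score."""
    if not ptm_scores:
        raise ValueError("no models given")
    return max(ptm_scores, key=ptm_scores.get)


# Made-up scores for the five models of one protein.
scores = {
    "model_1": 0.71,
    "model_2": 0.84,
    "model_3": 0.79,
    "model_4": 0.66,
    "model_5": 0.80,
}
print(best_model_by_ptm(scores))
```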

  6. UltraScan Solution Modeler (US-SOMO) hydrodynamic parameter, structural...

    • search.dataone.org
    • data.niaid.nih.gov
    • +1more
    Updated Nov 29, 2023
    + more versions
    Cite
    Emre Brookes; Mattia Rocco (2023). UltraScan Solution Modeler (US-SOMO) hydrodynamic parameter, structural small angle scattering and SESCA circular dichroism (CD) calculations on AlphaFold predicted structures [Dataset]. http://doi.org/10.5061/dryad.jq2bvq89s
    Explore at:
    Dataset updated
    Nov 29, 2023
    Dataset provided by
    Dryad Digital Repository
    Authors
    Emre Brookes; Mattia Rocco
    Time period covered
    Jan 1, 2021
    Description

    Recent spectacular advances by AI programs in 3D structure predictions from protein sequences have revolutionized the field in terms of accuracy and speed. The resulting "folding frenzy" has already produced predicted protein structure databases for the entire human and other organisms' proteomes. However, rapidly ascertaining a predicted structure's reliability based on measured properties in solution should be considered. Shape-sensitive hydrodynamic parameters such as the diffusion and sedimentation coefficients (D0t(20,w),s0(20,w)) and the intrinsic viscosity ([η]) can provide a rapid assessment of the overall structure likeliness, and SAXS would yield the structure-related pair-wise distance distribution function p(r) vs. r. Using the extensively validated UltraScan SOlution MOdeler (US‑SOMO) suite, a database was implemented calculating from AlphaFold structures the corresponding D0t(20,w), s0(20,w), [η], p(r) vs. r, and other parameters. Circular dichroism spectra were computed u...

  7. Table_5_The Use of AlphaFold for In Silico Exploration of Drug Targets in...

    • frontiersin.figshare.com
    xlsx
    Updated Jun 5, 2023
    + more versions
    Cite
    Albert Ros-Lucas; Nieves Martinez-Peinado; Jaume Bastida; Joaquim Gascón; Julio Alonso-Padilla (2023). Table_5_The Use of AlphaFold for In Silico Exploration of Drug Targets in the Parasite Trypanosoma cruzi.xlsx [Dataset]. http://doi.org/10.3389/fcimb.2022.944748.s005
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Jun 5, 2023
    Dataset provided by
    Frontiers
    Authors
    Albert Ros-Lucas; Nieves Martinez-Peinado; Jaume Bastida; Joaquim Gascón; Julio Alonso-Padilla
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Chagas disease is a devastating neglected disease caused by the parasite Trypanosoma cruzi, which affects millions of people worldwide. The two anti-parasitic drugs available, nifurtimox and benznidazole, have a good efficacy against the acute stage of the infection. But this is short, usually asymptomatic and often goes undiagnosed. Access to treatment is mostly achieved during the chronic stage, when the cardiac and/or digestive life-threatening symptoms manifest. Then, the efficacy of both drugs is diminished, and their long administration regimens involve frequently associated adverse effects that compromise treatment compliance. Therefore, the discovery of safer and more effective drugs is an urgent need. Despite its advantages over lately used phenotypic screening, target-based identification of new anti-parasitic molecules has been hampered by incomplete annotation and lack of structures of the parasite protein space. Presently, the AlphaFold Protein Structure Database is home to 19,036 protein models from T. cruzi, which could hold the key to not only describe new therapeutic approaches, but also shed light on molecular mechanisms of action for known compounds. In this proof-of-concept study, we screened the AlphaFold T. cruzi set of predicted protein models to find prospective targets for a pre-selected list of compounds with known anti-trypanosomal activity using docking-based inverse virtual screening. The best receptors (targets) for the most promising ligands were analyzed in detail to address molecular interactions and potential drugs’ mode of action. The results provide insight into the mechanisms of action of the compounds and their targets, and pave the way for new strategies to finding novel compounds or optimize already existing ones.

  8. Metalloprotein AlphaFold set with enzyme/non-enzyme labeled sites

    • data.niaid.nih.gov
    Updated Mar 20, 2023
    Cite
    Feehan, Ryan (2023). Metalloprotein AlphaFold set with enzyme/non-enzyme labeled sites [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7689819
    Explore at:
    Dataset updated
    Mar 20, 2023
    Dataset authored and provided by
    Feehan, Ryan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The AlphaFold set contains computationally generated structures for metalloproteins that were used to test MAHOMES II's enzyme/non-enzyme predictive performance (Feehan et al. 2023).
    - README.md - Overview, detailed description of AlphaFold set generation, and description of content in this directory.
    - AF-...-model_v2.pdb - Files with the 3D atomic coordinates of a metalloprotein.
    - addMetalIon.py - Code used to place metal ion sites in apo metalloprotein structures.
    - MAHOMES-II_AlphaFold_set_site_data.csv - Contains the data used during the generation of the AlphaFold set for the final sites. Columns are:
      - Entry: The UniProt accession number of the protein with the bound metal site.
      - struc_id: The structure's AlphaFold DB name (February 2022) and the name of the file in this directory with added metal site.
      - metal_resName: The two-letter PDB residue abbreviation for the site's metal.
      - metal_seqID: The residue index number for the added metal ion.
      - Enzyme: The enzyme (True) or non-enzyme (False) label.
      - Entry name: UniProt entry name.
      - Protein names: The UniProt provided metalloprotein name(s).
      - Number of homologs with solved structures (PDB): Number of protein sequences in the PDB (May 21, 2020) with an E-value < 1.
      - Number of homologs in MAHOMES II dataset and T-metal-sites10: Number of protein sequences used to train and evaluate MAHOMES II with an E-value < 1 (0 for all entries).
      - Metal binding note: UniProt metal binding note that includes information covering the metal’s identity and catalytic flag.
      - Metal coordinating residue seqIDs: The sequence indices for the metal coordinating residues included in UniProt’s metal binding section.

  9. Table_3_The Use of AlphaFold for In Silico Exploration of Drug Targets in...

    • figshare.com
    • frontiersin.figshare.com
    xlsx
    Updated Jun 1, 2023
    Cite
    Albert Ros-Lucas; Nieves Martinez-Peinado; Jaume Bastida; Joaquim Gascón; Julio Alonso-Padilla (2023). Table_3_The Use of AlphaFold for In Silico Exploration of Drug Targets in the Parasite Trypanosoma cruzi.xlsx [Dataset]. http://doi.org/10.3389/fcimb.2022.944748.s003
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Frontiers
    Authors
    Albert Ros-Lucas; Nieves Martinez-Peinado; Jaume Bastida; Joaquim Gascón; Julio Alonso-Padilla
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Chagas disease is a devastating neglected disease caused by the parasite Trypanosoma cruzi, which affects millions of people worldwide. The two anti-parasitic drugs available, nifurtimox and benznidazole, have a good efficacy against the acute stage of the infection. But this is short, usually asymptomatic and often goes undiagnosed. Access to treatment is mostly achieved during the chronic stage, when the cardiac and/or digestive life-threatening symptoms manifest. Then, the efficacy of both drugs is diminished, and their long administration regimens involve frequently associated adverse effects that compromise treatment compliance. Therefore, the discovery of safer and more effective drugs is an urgent need. Despite its advantages over lately used phenotypic screening, target-based identification of new anti-parasitic molecules has been hampered by incomplete annotation and lack of structures of the parasite protein space. Presently, the AlphaFold Protein Structure Database is home to 19,036 protein models from T. cruzi, which could hold the key to not only describe new therapeutic approaches, but also shed light on molecular mechanisms of action for known compounds. In this proof-of-concept study, we screened the AlphaFold T. cruzi set of predicted protein models to find prospective targets for a pre-selected list of compounds with known anti-trypanosomal activity using docking-based inverse virtual screening. The best receptors (targets) for the most promising ligands were analyzed in detail to address molecular interactions and potential drugs’ mode of action. The results provide insight into the mechanisms of action of the compounds and their targets, and pave the way for new strategies to finding novel compounds or optimize already existing ones.

  10. Discoba protein sequences for protein structure predictions

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip, csv +1
    Updated Nov 13, 2021
    Cite
    Richard John Wheeler (2021). Discoba protein sequences for protein structure predictions [Dataset]. http://doi.org/10.5281/zenodo.5682928
    Explore at:
    Available download formats: csv, txt, application/gzip
    Dataset updated
    Nov 13, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Richard John Wheeler
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Comprehensive database of Discoba protein sequences, gathered for the purpose of improving protein structure predictions of Discoba species (including Trypanosoma and Leishmania) by AlphaFold and RoseTTAFold. Originally gathered for use with: https://github.com/zephyris/discoba_alphafold

  11. CAIRA: Catalytic associated irregular residue analyser

    • data.niaid.nih.gov
    • datadryad.org
    zip
    Updated Oct 29, 2024
    Cite
    Christoph Becker-Pauly; Nele David; Tomas Koudelka; Corentin Joos; Franka Scharfenberg; Malina Rüffer; Fred Armbrust; Dimitris Georgiadis; Fabrice Beau; Lea Stahmer; Sascha Rahn; Andreas Tholey; Claus Pietrzik; Kira Bickenbach (2024). CAIRA: Catalytic associated irregular residue analyser [Dataset]. http://doi.org/10.5061/dryad.c59zw3rj5
    Explore at:
    Available download formats: zip
    Dataset updated
    Oct 29, 2024
    Dataset provided by
    National and Kapodistrian University of Athens
    Johannes Gutenberg University Mainz
    Kiel University
    CEA Paris-Saclay
    Authors
    Christoph Becker-Pauly; Nele David; Tomas Koudelka; Corentin Joos; Franka Scharfenberg; Malina Rüffer; Fred Armbrust; Dimitris Georgiadis; Fabrice Beau; Lea Stahmer; Sascha Rahn; Andreas Tholey; Claus Pietrzik; Kira Bickenbach
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    For the semi-automated identification of potential cleavage specificity-modulating SNVs we developed a program called CAIRA (Catalytic Associated Irregular Residue Analyser). Given the UniProt ID of a protease, CAIRA takes the predicted AlphaFold structure, generates a sphere of user-definable radius (e.g. 20 Å) around the active site and filters the COSMIC database for cancer-associated SNVs within this radius, which might have cleavage specificity-modulating effects. CAIRA can be used with all types of proteases (metallo, serine, aspartyl and cysteine proteases). To further assess their potential impact, SNV-affected amino acids are labelled in the protein structure of a downloadable pdb file.

    Methods

    CAIRA was programmed using Python. Protein structures were obtained from AlphaFold. Cleavage specificity logos were generated as described above using cleavage site specificity data from MEROPS, the Peptidase Database. Cancer-associated SNV data was taken from the COSMIC database (cancer.sanger.ac.uk). Catalytically important amino acid residue(s) within the active site were extracted from the UniProt annotation and are used as the centre of the sphere. For proteases with more than one catalytically important amino acid residue within the active site (aspartyl, serine and cysteine proteases), CAIRA uses the midpoint between the relevant residues as the centre of the sphere. In the beta version, the ‘use binding cavity (beta)’ button can additionally be used to switch to a mode in which the binding cavity is imitated and the SNVs within it are reported. For this purpose, cylinders of adjustable size are placed around the chemical bonds of the backbone of the protease's propeptide, and amino acids of the protease that lie with at least one atom within these cylinders are selected and checked for annotated SNVs. The logo of CAIRA was designed with the help of Adobe Firefly.
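
The sphere-based selection described above can be sketched in a few lines. This is not CAIRA's code: it assumes one representative coordinate per residue (for example the Cα atom), takes the sphere centre as the mean of the catalytic residue coordinates, and uses hypothetical function names and made-up coordinates.

```python
import math


def sphere_centre(catalytic_coords):
    """Mean position of the catalytic residues (a single residue is its own centre)."""
    n = len(catalytic_coords)
    return tuple(sum(coord[i] for coord in catalytic_coords) / n for i in range(3))


def residues_within(residue_coords, centre, radius=20.0):
    """Residue numbers whose representative coordinate lies within radius (in Å)."""
    return sorted(
        number
        for number, xyz in residue_coords.items()
        if math.dist(xyz, centre) <= radius
    )


def snvs_in_sphere(snv_positions, residue_coords, catalytic_coords, radius=20.0):
    """Keep only SNV positions that fall inside the sphere around the active site."""
    centre = sphere_centre(catalytic_coords)
    inside = set(residues_within(residue_coords, centre, radius))
    return sorted(inside & set(snv_positions))


# Made-up coordinates: two catalytic residues and three candidate residues.
catalytic = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
residues = {1: (5.0, 0.0, 0.0), 2: (25.0, 0.0, 0.0), 3: (26.0, 0.0, 0.0)}
print(snvs_in_sphere({2, 3}, residues, catalytic))
```

Separating the centre calculation from the distance filter keeps the single-residue and multi-residue protease cases on the same code path.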

  12. Supplementary material for "Surface frustration re-patterning underlies the...

    • zenodo.org
    • data.niaid.nih.gov
    bin, zip
    Updated Jan 8, 2023
    Cite
    Mark Derbyshire; Sylvain Raffaele (2023). Supplementary material for "Surface frustration re-patterning underlies the structural landscape and evolvability of fungal orphan candidate effectors" [Dataset]. http://doi.org/10.5281/zenodo.7506581
    Explore at:
    Available download formats: zip, bin
    Dataset updated
    Jan 8, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Mark Derbyshire; Sylvain Raffaele
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Tables

    Table S1. List of fungal genomes analyzed in this work, associated references and properties.

    Table S2. List of all secreted proteins less than 300 amino-acids from the 20 fungal genomes. The table includes Signalp4.0 output, mature sequence, Espritz % disorder, pfam domains, AlphaFold top prediction pLDDT and the associated pdb file in Dataset S1.

    Table S3. Top hits to the pdb database for all OCE structures. 'network_node_name' corresponds to the protein identifier in the OCE structure similarity network provided in Dataset S3. 'Hidef_raw_community' corresponds to groups of structural OCE analogs identified by HiDEF community detection performed on the network provided in Dataset S3.

    Table S4. Table S4. List of the 62 major OCE folds with associated statistics. Columns I to AB provide the number of occurrences per species. Note that the actual number of members per species might be underestimated due to the stringent pipeline used for OCE identification (excluding proteins larger than 300 amino acids or containing PFAMs for instance).

    Table S5. Relative surface exposure, conformational flexibility and conservation data mapped on residues of members of the Alt-A1 and BoNT families. RMSD, root mean square deviation for all aligned atoms; Conservation, percentage conservation in multiple structure alignment.

    Table S6. Assignment of NCBI accessions to MMseqs clusters and assignment of MMseqs clusters to HMM matching-based super-clusters.

    Table S7. Co-mutation occurrences and associated p-values in two OCE clades from the Alt-A1 and KP6 families.

    Table S8. Amino acid properties inferred from mutation scans and frustration analyses in Alt-A1 cluster yellow1 and KP6 cluster 43. 'Number of aa variants' corresponds to the number of different amino acids found at each position (a deletion counts as 1). 'Alanine scan ∆Z' and 'Deletion scan ∆Z' correspond to the difference between the Z-score for the native protein against itself and the Z-score for the native protein against the mutant at each position (either alanine replacement or 5-aa deletion). 'Destabilization factor' is the average of columns E and F. 'Stabilization factor' corresponds to the difference between the expected structural variation due to the destabilization factor and the observed structural variation in multiple mutants. 'netEffect' is the difference between columns G and H. 'Max co-mutation %' is the highest frequency of co-mutation observed with other residues in natural variants, with 'Min co-mutation p-value (Bonferroni corrected)' the associated p-value.

    Table S9. List of natural variants and in silico mutants from the Alt-A1 cluster 25 analyzed in this work, including protein sequence and structure comparison scores (comparison with the reconstructed clade ancestor n0).

    Table S10. List of natural variants and in silico mutants from the KP6 cluster 43 analyzed in this work, including protein sequence and structure comparison scores (comparison with the reconstructed clade ancestor n0).

    Table S11. Summary statistics for the phylogenetic trees of 15 OCE clades analyzed for structure and frustration evolution.

    Table S12. Mapping of structural and frustration data onto phylogenetic trees for 15 OCE clades. The corresponding trees and protein structures are provided in Dataset S7.

    Datasets

    Dataset S1. AlphaFold rank1 models for 3 927 OCEs (.pdb format).

    Dataset S2. Pairwise structure comparison for 3 911 OCEs. DALI matrix output containing pairwise Z-scores.

    Dataset S3. Network file including 2 561 OCEs with 3 or more vertices of Z-score weight 5.2 or more, in .sif and .xgmml formats.

    Dataset S4. Videos illustrating the mapping of relative surface exposure and structural variability in Alt-A1 and BoNT groups, amino-acid conservation, co-selected mutation patches and residue net stabilization effects on Alt-A1 clade 25 ancestor and KP6 cluster 43 ancestor. Color scales are as in Figures 2 and 3 respectively (.mp4 format).

    Dataset S5. Phylogenetic trees (.nwk), ancestral (.fasta) and modern variant (.faa) sequences, and AlphaFold best protein models (.pdb) for members of KP6 cluster 43 and Alt-A1 cluster 25. The archive includes 140 Alt-A1 protein structures and 128 KP6 protein structures.

    Dataset S6. Best predicted structures for 917 natural variants and mutants of AA1_cl25 and 801 natural variants and mutants of KP6_cl43 (.pdb format).

    Dataset S7. Phylogenetic trees (.nwk) and AlphaFold best protein models (.pdb) for 15 OCE clades. The file includes 2 598 protein structures distributed across clades AA1_s (139), AA1_t (135), AA1_y1 (140), AA1_y2 (90), AA1_y3 (128), BoNT_s (291), CIP_s (167), CIP_t (231), crystallin (233), GNK2 (189), KP6_cl3 (203), KP6_cl26 (111), KP6_cl43 (123), KP6_cl96 (231), KP6_cl242 (187).

    Text and Figures

    Text S1. Contains supplementary methods, results and figures S1 to S13.

  13. Metadata supporting the AFDB90v4 annotated sequence similarity network

    • zenodo.org
    application/gzip, bin +3
    Updated Jul 7, 2023
    Cite
    Janani Durairaj; Andrew M. Waterhouse; Toomas Mets; Tetiana Brodiazhenko; Minhal Abdullah; Gabriel Studer; Gerardo Tauriello; Mehmet Akdel; Antonina Andreeva; Alex Bateman; Tanel Tenson; Vasili Hauryliuk; Torsten Schwede; Joana Pereira (2023). Metadata supporting the AFDB90v4 annotated sequence similarity network [Dataset]. http://doi.org/10.5281/zenodo.8121336
    Available download formats
    csv, application/gzip, tsv, json, bin
    Dataset updated
    Jul 7, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Janani Durairaj; Andrew M. Waterhouse; Toomas Mets; Tetiana Brodiazhenko; Minhal Abdullah; Gabriel Studer; Gerardo Tauriello; Mehmet Akdel; Antonina Andreeva; Alex Bateman; Tanel Tenson; Vasili Hauryliuk; Torsten Schwede; Joana Pereira
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Driven by the development and upscaling of fast genome sequencing and assembly pipelines, the number of protein-coding sequences deposited in public protein sequence databases is increasing exponentially. Recently, the dramatic success of deep learning-based approaches applied to protein structure prediction has done the same for protein structures. We are now entering a new era in protein sequence and structure annotation, with hundreds of millions of predicted protein structures made available through the AlphaFold database. These models cover most of the catalogued natural proteins, including those difficult to annotate for function or putative biological role based on standard, homology-based approaches. In this work, we quantified how much of such "dark matter" of the natural protein universe was structurally illuminated by AlphaFold2 at a high predicted accuracy and modelled this diversity as an interactive sequence similarity network that can be navigated at https://uniprot3d.org/atlas/AFDB90v4. The dataset deposited here contains the metadata generated, which form the basis of the constructed similarity network and its interpretation. These files were either generated or processed using the code available at https://github.com/ProteinUniverseAtlas/AFDB90v4.

    This repository further contains the detailed, individual sequence similarity networks (in CLANS format) generated for the 3 example protein (super)families described in the text.

    The full content of this repository includes:

    1. AFDBv4_pLDDT_diggestion_UniRef50_2023-02-01.csv: table listing all uniref50 clusters in UniProt, including information on structural representatives from AFDB. Each column provides different annotations, including functional brightness, median and best pLDDT, brightness and structural representatives, etc.

    2. AFDBv4_DUF_dark_diggestion_UniRef50_2023-02-06.csv: table listing all uniref50 clusters in UniProt and whether they include proteins mapped to known domains of unknown function (DUF).

    3. AFDBv4_90.fasta: FASTA file with the sequences of all selected UniRef50 clusters, used for the all-against-all mmseqs searches that form the basis of the network.

    4. AFDB90v4_data.csv: the subset of file (1) that corresponds to the AFDB90v4 dataset, including columns such as functional brightness, median and best pLDDT, brightness and structural representatives, etc.

    5. AFDB90v4_data_with_graph_labels.csv: table listing each individual uniref50 cluster included in the AFDB90v4 dataset, together with their mapping to communities, and connected components.

    6. AFDB90v4_cc_data.csv: table of uniref50 clusters in connected components, including their annotations, and the columns in file (5).

    7. AFDB90v4_cc_data_uniprot_community_taxonomy_map.csv: mapping of each uniprotAC entry to their corresponding component, community and taxonomy.

    8. AFDB90v4_subgraphs_summary.csv: table summarising the properties of individual connected components, including the average brightness, the number of members, the number of unique protein sequences, the median length, and the number of communities.

    9. communities_summary.csv: table summarising the properties of individual communities, including average brightness, the number of members, the number of unique protein sequences, the median length, the most common superkingdom represented, the average structure outlier score, etc.

    10. communities_edge_list-coordinates.csv: the coordinates of each community in the graphical representation. Singleton communities or singleton UniRef50 clusters are not included.

    11. communities_edge_list_no_duplicates.csv: list of edges making the graph.

    12. node_class.json: map between each Uniref50 cluster and its corresponding component and community.

    13. subgraphs.tar.gz: tar file containing gml files for each individual connected component.

    14. AFDB90v4_outlier_scores.tsv: table containing the outlier scores for each community representative.

    15. AFDB90v4_dark_galaxies_summary.csv: table containing the summary of all dark connected components, including average brightness, median length, representatives, number of communities, etc.

    16. AFDB90v4_uniprot_naming_assessment_counts.csv: table listing the per-component semantic diversity scores, as well as the major source of the titles of the proteins included and their count.

    17. uniprot_naming_assessment.tar.gz: tar file containing the per community assessment of predicted protein names in UniProt as of February 2023.

    18. CLANS_files.tar.gz: stores the 3 sequence similarity networks, in CLANS format, constructed for the analysis of the sequence diversity and sequence similarities of the proteins in components 27, 159 and 3314. These CLANS files form the basis of panel A in figures 3 and 4 and extended data figure 5.

  14. Data from: Protein pKa Prediction by Tree-Based Machine Learning

    • figshare.com
    • acs.figshare.com
    txt
    Updated Jun 7, 2023
    Cite
    Ada Y. Chen; Juyong Lee; Ana Damjanovic; Bernard R. Brooks (2023). Protein pKa Prediction by Tree-Based Machine Learning [Dataset]. http://doi.org/10.1021/acs.jctc.1c01257.s007
    Available download formats
    txt
    Dataset updated
    Jun 7, 2023
    Dataset provided by
    ACS Publications
    Authors
    Ada Y. Chen; Juyong Lee; Ana Damjanovic; Bernard R. Brooks
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Protonation states of ionizable protein residues modulate many essential biological processes. For correct modeling and understanding of these processes, it is crucial to accurately determine their pKa values. Here, we present four tree-based machine learning models for protein pKa prediction. The four models, Random Forest, Extra Trees, eXtreme Gradient Boosting (XGBoost), and Light Gradient Boosting Machine (LightGBM), were trained on three experimental PDB and pKa datasets, two of which included a notable portion of internal residues. We observed similar performance among the four machine learning algorithms. The best model trained on the largest dataset performs 37% better than the widely used empirical pKa prediction tool PROPKA and 15% better than the published result from the pKa prediction method DelPhiPKa. The overall root-mean-square error (RMSE) for this model is 0.69, with surface and buried RMSE values being 0.56 and 0.78, respectively, considering six residue types (Asp, Glu, His, Lys, Cys, and Tyr), and 0.63 when considering Asp, Glu, His, and Lys only. We provide pKa predictions for proteins in the human proteome from the AlphaFold Protein Structure Database and observe that 1% of Asp/Glu/Lys residues have highly shifted pKa values close to the physiological pH.

  15. Flexible Protein-Protein Docking Benchmark(FD1.0)

    • zenodo.org
    zip
    Updated Oct 31, 2024
    Cite
    Ming Qin (2024). Flexible Protein-Protein Docking Benchmark(FD1.0) [Dataset]. http://doi.org/10.5281/zenodo.14004828
    Available download formats
    zip
    Dataset updated
    Oct 31, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Ming Qin
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    To effectively assess the capabilities of various methods in flexible protein-protein docking, it is essential for a protein-protein docking dataset to encompass not only the structures of the heterodimer but also that of unbound monomers. Existing datasets such as DB5.5 and AB-Benchmark, while useful, are relatively limited in scale. In contrast, the Database of Interacting Protein Structures (DIPS) contains up to 42,826 binary protein complex structures but lacks the unbound state structures of the monomers. This limitation restricts its applicability to evaluations of rigid docking models rather than flexible ones. Consequently, the impact of large-scale docking datasets on methods for flexible protein-protein docking has not been thoroughly explored. To address this gap, we introduce the Flexible Protein-Protein Docking Benchmark (FD1.0), which, to our knowledge, is currently the largest dataset dedicated to flexible protein-protein docking. By providing a large and well-characterized dataset, FD1.0 aims to foster innovation in the development of flexible docking algorithms. It allows researchers to rigorously test and refine their methods, facilitating more accurate predictions of protein interactions, which are essential for understanding biological functions and designing therapeutic interventions.

    In our analysis of the DIPS dataset, we identified several critical issues: (1) Multiple three-dimensional structures correspond to a single protein sequence, introducing substantial noise and affecting fair comparisons among baselines, especially for models reliant on 3D structural data. (2) The DIPS training set, primarily consisting of homo-multimers, fails to capture the diversity of interface types fully. Moreover, protein-protein docking predictions are most valuable for elucidating mechanisms of protein-protein interactions (PPIs), which predominantly involve heterodimers. Homomers, often synthesized directly rather than through docking, do not accurately represent typical PPI scenarios. (3) A significant number of docking cases in DIPS involve the interaction of one polymeric protein with another, further complicating the dataset.

    As a cornerstone for the flexible docking dataset, it is imperative to acquire the structures of protein monomers in their unbound state. Specifically, this can be achieved through protein structure prediction methods, such as AlphaFold2, and the aggregation of structural data from sources including electron microscopy. Additionally, acknowledging the deficiencies of the DIPS dataset, several guidelines were established in the construction process of the FD1.0 dataset: (1) Each protein monomer is associated with a unique three-dimensional structure, reducing dataset noise. (2) We ensured that the similarity score (as determined by MMSeq) between docking monomers does not exceed 0.6, thereby filtering out homodimeric pairs from the dataset. (3) Unlike DIPS, in which a certain proportion of cases actually involve docking of two protein multimers, FD1.0 contains only monomer-monomer cases. Current methods for predicting multimeric structures, such as AlphaFold Multimer, still do not achieve satisfactory results (AlphaFold3's license prohibits its use for docking purposes), whereas current methods for predicting monomeric structures have reached a high level of accuracy. Therefore, we filtered out multimer cases, ensuring that each docking instance involves only protein monomers and guaranteeing the quality of the dataset. By adhering to these standardized construction criteria and through the collection, cleaning, and organization of data from various sources, including the Protein Data Bank and existing datasets, we compiled 3721 entries. Following the DIPS division ratio, these entries were divided into training, validation, and test sets of 3546, 98, and 77, respectively.
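    As a rough illustration of guidelines (2) and (3), the sketch below filters candidate docking pairs by a similarity threshold of 0.6 and by requiring both partners to be monomers, then checks that the reported split sizes add up to the reported total. The `Candidate` record, its field names and the toy entries are hypothetical; they are not the FD1.0 authors' data model or pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record for one candidate docking pair."""
    pair_id: str
    similarity: float  # sequence similarity between the two partners
    chains_a: int      # number of chains in partner A
    chains_b: int      # number of chains in partner B


def keep(candidate, max_similarity=0.6):
    """Keep pairs whose partners are dissimilar enough and both monomeric."""
    return (
        candidate.similarity <= max_similarity
        and candidate.chains_a == 1
        and candidate.chains_b == 1
    )


# Made-up candidates for illustration only.
candidates = [
    Candidate("pair_hetero", 0.12, 1, 1),        # kept
    Candidate("pair_homodimer", 0.98, 1, 1),     # dropped: too similar
    Candidate("pair_multimer", 0.20, 2, 1),      # dropped: multimer partner
    Candidate("pair_borderline", 0.60, 1, 1),    # kept: does not exceed 0.6
]
kept = [c.pair_id for c in candidates if keep(c)]
print(kept)

# Split sizes as reported in the description.
split_sizes = {"train": 3546, "validation": 98, "test": 77}
total = sum(split_sizes.values())
fractions = {name: size / total for name, size in split_sizes.items()}
print(total, fractions)
```

    The threshold is applied inclusively here because the text says the score "does not exceed 0.6"; how the original pipeline treats the boundary is not stated.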

  16. DIPS-Plus: The Enhanced Database of Interacting Protein Structures for...

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip
    Updated Jul 13, 2023
    Cite
    Alex Morehead; Chen Chen; Ada Sedova; Jianlin Cheng (2023). DIPS-Plus: The Enhanced Database of Interacting Protein Structures for Interface Prediction (Supplementary Data) [Dataset]. http://doi.org/10.5281/zenodo.8140981
    Available download formats
    application/gzip
    Dataset updated
    Jul 13, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Alex Morehead; Chen Chen; Ada Sedova; Jianlin Cheng
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains supplementary replication data for the paper titled "DIPS-Plus: The Enhanced Database of Interacting Protein Structures for Interface Prediction". In particular, it contains a new version of our `final_raw_dips.tar.gz` protein pair representations which now contain (1) residue-level annotations for intrinsic disorder regions (IDRs) as well as (2) a copy of each protein pair representation in the HDF5 file format for programming language-agnostic read capabilities. In addition, this record also contains (3) raw MSAs (in HDF5 file format) generated for each protein pair using Jackhmmer and AlphaFold's small version of the Big Fantastic Database (BFD). Lastly, this record contains (4) PDB metadata derived for each DIPS-Plus complex using Graphein's PDBManager API as well as (5) structure-based (i.e., FoldSeek-based) training and validation splits of the dataset's complexes in the form of respective text files containing the file paths of complexes assigned to each split.

  17. DPAM Domain Classification of Human Proteins against ECOD Reference

    • zenodo.org
    • data.niaid.nih.gov
    bin
    Updated Dec 2, 2022
    Cite
    Richard Schaeffer; Jing Zhang; Lisa Kinch; Jimin Pei; Qian Cong; Nick Grishin (2022). DPAM Domain Classification of Human Proteins against ECOD Reference [Dataset]. http://doi.org/10.5281/zenodo.6998803
    Available download formats
    bin
    Dataset updated
    Dec 2, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Richard Schaeffer; Jing Zhang; Lisa Kinch; Jimin Pei; Qian Cong; Nick Grishin
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Domain definitions of AlphaFold classifications of the human proteome (v1) from the AlphaFold Database. Also included are classifications of Danio rerio, Mus musculus, Pan paniscus, Drosophila melanogaster, Caenorhabditis elegans used for comparative analysis to human. See README file for descriptions of file formats.


Cite
Céline Marquet; Anastasia Grekova; Leen Houri; Michael Heinzinger; Burkhard Rost (2024). Prediction and Visualization of Human Transmembrane Proteins using AlphaFold and Protein Language Models [Dataset]. http://doi.org/10.5281/zenodo.6816083

Prediction and Visualization of Human Transmembrane Proteins using AlphaFold and Protein Language Models

Available download formats
png, bin, txt, application/gzip
Dataset updated
Jul 16, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Céline Marquet; Anastasia Grekova; Leen Houri; Michael Heinzinger; Burkhard Rost
License

Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Description: TMvis ("TMvis496.tar.gz") is a dataset containing 496 3D-structures of predicted human transmembrane proteins (TMP) and their predicted membrane embedding. The method TMbed [1], based on the protein language model ProtT5 [2], predicted 4,967 TMPs for the human proteome (20,375 proteins, UniProt [3] version April 2022; excluding TITIN_HUMAN due to length). For these proteins, we obtained AlphaFold [4] structures from AlphaFoldDB [5] with an average per-residue confidence score (pLDDT) of more than 90%. This resulted in the 496 proteins of TMvis, as can be found in "TMvis496.fasta". The membrane embedding was predicted using the methods ANVIL [6], PPM3 [7], and per-residue TMbed predictions. As the three methods are based on different approaches, we decided to publish results for all. The figure “TMvis_project_overview.png” provides a graphical overview for each step described above.
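The pLDDT filter mentioned above can be sketched as follows. AlphaFold DB PDB files store the per-residue pLDDT in the B-factor column, so an average can be computed from the C-alpha ATOM records. This is a minimal illustration, not the authors' selection script: the function names, the helper that fabricates toy ATOM lines, and the toy scores are assumptions.

```python
def mean_plddt(pdb_text):
    """Average the B-factor column (pLDDT in AlphaFold models) over C-alpha atoms."""
    scores = []
    for line in pdb_text.splitlines():
        # PDB fixed columns: atom name in 13-16, B-factor in 61-66 (1-based).
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores.append(float(line[60:66]))
    if not scores:
        raise ValueError("no C-alpha ATOM records found")
    return sum(scores) / len(scores)


def make_ca_line(serial, resseq, plddt):
    """Build a toy C-alpha ATOM record in fixed-column PDB layout (illustration only)."""
    return (
        "ATOM  {:5d}  CA  ALA A{:4d}    {:8.3f}{:8.3f}{:8.3f}{:6.2f}{:6.2f}"
        "           C"
    ).format(serial, resseq, 0.0, 0.0, 0.0, 1.00, plddt)


# Three made-up residues with pLDDT 95.0, 92.5 and 88.0.
toy_pdb = "\n".join(
    make_ca_line(i + 1, i + 1, score) for i, score in enumerate([95.0, 92.5, 88.0])
)
average = mean_plddt(toy_pdb)
passes_filter = average > 90.0   # threshold taken from the description
print(round(average, 2), passes_filter)
```

Whether the original selection averaged over residues exactly this way is not stated in the description; averaging over C-alpha atoms gives one value per residue.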

TMvis Folder Structure: TMvis is separated into “alpha” containing predicted alpha-helical TMPs, and “beta” containing predicted beta-barrel TMPs. Within these folders, each protein is assigned one folder, identifiable by the respective unique UniProt ID. Each protein folder consists of:
- “UniprotID.fasta” with UniProt ID, sequence, TMbed per-residue prediction
- “AF-UniprotID-F1-model_v2.pdb” with the AlphaFold structure
- “AF-UniprotID-F1-model_v2.cif” with the AlphaFold structure
- “AF-UniprotID-F1-model_v2_ANVIL.pdb” with predicted ANVIL membrane embedding
- “AF-UniprotID-F1-model_v2_ppm.pdb” with predicted PPM3 membrane embedding

TMvis
│
├── alpha
│   │
│   ├── A0A087X1C5
│   │   ├── A0A087X1C5.fasta
│   │   ├── AF-A0A087X1C5-F1-model_v2.pdb
│   │   ├── AF-A0A087X1C5-F1-model_v2.cif
│   │   ├── AF-A0A087X1C5-F1-model_v2_ANVIL.pdb
│   │   └── AF-A0A087X1C5-F1-model_v2_ppm.pdb
│   └── ...
└── beta
    └── P45880
TMvis visualization: The 3D-visualization of every protein in the dataset TMvis can be easily accessed using the Jupyter Notebook “TMvis.ipynb”. It contains detailed descriptions of the different membrane prediction tools ANVIL, PPM3, and TMbed as well as the respective code. Additionally, it allows visualizing the per-residue confidence scores (pLDDT) of AlphaFold.

——————————————————————————————————————————————————————————————————————————

References:

[1] TMbed - Bernhofer, Michael, and Burkhard Rost. 2022. “TMbed – Transmembrane Proteins Predicted through Language Model Embeddings.” bioRxiv.

[2] ProtT5 - A. Elnaggar et al., "ProtTrans: Towards Cracking the Language of Life's Code Through Self-Supervised Deep Learning and High Performance Computing," in IEEE Transactions on Pattern Analysis and Machine Intelligence, doi: 10.1109/TPAMI.2021.3095381.

[3] UniProt - UniProt Consortium (2021). UniProt: the universal protein knowledgebase in 2021. Nucleic acids research, 49(D1), D480–D489.

[4] AlphaFold - Jumper, John, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, et al. 2021. “Highly Accurate Protein Structure Prediction with AlphaFold.” Nature 596 (7873): 583–89.

[5] AlphaFold DB - Varadi, Mihaly, Stephen Anyango, Mandar Deshpande, Sreenath Nair, Cindy Natassia, Galabina Yordanova, David Yuan, et al. 2022. “AlphaFold Protein Structure Database: Massively Expanding the Structural Coverage of Protein-Sequence Space with High-Accuracy Models.” Nucleic Acids Research 50 (D1): D439–44.

[6] ANVIL - Postic, Guillaume, Yassine Ghouzam, Vincent Guiraud, and Jean-Christophe Gelly. 2016. “Membrane Positioning for High- and Low-Resolution Protein Structures through a Binary Classification Approach.” Protein Engineering, Design & Selection: PEDS 29 (3): 87–91.

[7] PPM3 - Lomize, Mikhail A., Irina D. Pogozheva, Hyeon Joo, Henry I. Mosberg, and Andrei L. Lomize. 2012. “OPM Database and PPM Web Server: Resources for Positioning of Proteins in Membranes.” Nucleic Acids Research 40 (Database issue): D370–76.

——————————————————————————————————————————————————————————————————————————

License:

This work is licensed under a Creative Commons Attribution 4.0 International License (CC-BY 4.0).
