100+ datasets found

AlphaFold Protein Structure Database
console.cloud.google.com
Updated Mar 13, 2023
Cite
https://console.cloud.google.com/marketplace/browse?filter=partner:BigQuery%20Public%20Data&hl=de (2023). AlphaFold Protein Structure Database [Dataset]. https://console.cloud.google.com/marketplace/product/bigquery-public-data/deepmind-alphafold?hl=de
Explore at:
Dataset updated
Mar 13, 2023
Dataset provided by
BigQuery: https://cloud.google.com/bigquery
Google: http://google.com/
License
Description
The AlphaFold Protein Structure Database is a collection of protein structure predictions made using the machine learning model AlphaFold. AlphaFold was developed by DeepMind, and this database was created in partnership with EMBL-EBI. For information on how to interpret, download, and query the data, on which proteins are included or excluded, and on the change log, see the main dataset guide and FAQs. To view individual entries interactively or to download proteomes or Swiss-Prot, visit https://alphafold.ebi.ac.uk/. The current release aims to cover most of the more than 200M sequences in UniProt (a commonly used reference set of annotated proteins). The files provided for each entry include the structure plus two model confidence metrics (pLDDT and PAE). The files are in the Google Cloud Storage bucket gs://public-datasets-deepmind-alphafold-v4, with metadata in the BigQuery table bigquery-public-data.deepmind_alphafold.metadata.

If you use this data, please cite:
Jumper, J et al. Highly accurate protein structure prediction with AlphaFold. Nature (2021)
Varadi, M et al. AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Research (2021)

This public dataset is hosted in Google Cloud Storage and is free to use. See the quick start guide to learn how to access public datasets on Google Cloud Storage.
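The bucket and table named in the description can be turned into a query string and an object URI. The following is a minimal sketch, not an official client: the helper names are hypothetical, and the per-entry file name pattern `AF-<accession>-F<n>-model_v4.cif` is an assumption that should be checked against the dataset guide before use.

```python
BUCKET = "gs://public-datasets-deepmind-alphafold-v4"
METADATA_TABLE = "bigquery-public-data.deepmind_alphafold.metadata"


def metadata_preview_sql(limit: int = 5) -> str:
    """Return a SQL string that previews a few rows of the metadata table."""
    if limit < 1:
        raise ValueError("limit must be a positive integer")
    # SELECT * avoids guessing column names, which are not listed here.
    return f"SELECT * FROM `{METADATA_TABLE}` LIMIT {limit}"


def structure_uri(accession: str, fragment: int = 1) -> str:
    """Build the bucket URI for one entry's mmCIF file.

    ASSUMPTION: the file name pattern is illustrative, not confirmed
    by the description above.
    """
    return f"{BUCKET}/AF-{accession}-F{fragment}-model_v4.cif"


print(metadata_preview_sql())
print(structure_uri("P45880"))
```

The SQL string can be pasted into the BigQuery console, and the URI can be passed to any Cloud Storage tool once the naming pattern has been verified.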
AlphaFold Protein Structure Database
rrid.site
scicrunch.org
Updated Jul 26, 2025
Cite
(2025). AlphaFold Protein Structure Database [Dataset]. http://identifiers.org/RRID:SCR_023662
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_023662
Dataset updated
Jul 26, 2025
Description
Database of protein structure predictions by AlphaFold that are freely and openly available to the global scientific community. It includes nearly all catalogued proteins known to science and provides programmatic access to, and interactive visualization of, predicted atomic coordinates, per-residue and pairwise model confidence estimates, and predicted aligned errors.
AF2_Beta_Strand_Database
huggingface.co
Updated Aug 9, 2024
Cite
Haowen Zhao (2024). AF2_Beta_Strand_Database [Dataset]. https://huggingface.co/datasets/hz3519/AF2_Beta_Strand_Database
Explore at:
Dataset updated
Aug 9, 2024
Authors
Haowen Zhao
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
AlphaFold 2 Beta Strand Database

Dataset Summary and Creation

The AlphaFold 2 (AF2) Beta Strand Database is a database of high-confidence beta strand pairs as predicted by AlphaFold 2, a revolutionary protein structure prediction system. All 214 million protein structures from the AlphaFold Protein Structure Database (AlphaFold DB) were analyzed, and well-aligned pairs of amino acid sequences that exhibited beta-strand conformations were collected using specific… See the full description on the dataset page: https://huggingface.co/datasets/hz3519/AF2_Beta_Strand_Database.
Prediction and Visualization of Human Transmembrane Proteins using AlphaFold...
data.niaid.nih.gov
Updated Jul 16, 2024
Cite
Rost, Burkhard (2024). Prediction and Visualization of Human Transmembrane Proteins using AlphaFold and Protein Language Models [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6816082
Explore at:
Dataset updated
Jul 16, 2024
Dataset provided by
Rost, Burkhard
Marquet, Céline
Grekova, Anastasia
Houri, Leen
Heinzinger, Michael
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Description: TMvis ("TMvis496.tar.gz") is a dataset containing 496 3D structures of predicted human transmembrane proteins (TMPs) and their predicted membrane embedding. The method TMbed [1], based on the protein language model ProtT5 [2], predicted 4,967 TMPs for the human proteome (20,375 proteins, UniProt [3] version April 2022; excluding TITIN_HUMAN due to length). For these proteins, we obtained AlphaFold [4] structures from AlphaFoldDB [5] and kept those with an average per-residue confidence score (pLDDT) of more than 90%. This resulted in the 496 proteins of TMvis, listed in "TMvis496.fasta". The membrane embedding was predicted using the methods ANVIL [6], PPM3 [7], and per-residue TMbed predictions. As the three methods are based on different approaches, we decided to publish results for all. The figure “TMvis_project_overview.png” provides a graphical overview of each step described above.

TMvis Folder Structure: TMvis is separated into “alpha”, containing predicted alpha-helical TMPs, and “beta”, containing predicted beta-barrel TMPs. Within these folders, each protein is assigned one folder, identifiable by its unique UniProt ID. Each protein folder consists of:
- “UniprotID.fasta” with UniProt ID, sequence, TMbed per-residue prediction
- “AF-UniprotID-F1-model_v2.pdb” with the AlphaFold structure
- “AF-UniprotID-F1-model_v2.cif” with the AlphaFold structure
- “AF-UniprotID-F1-model_v2_ANVIL.pdb” with predicted ANVIL membrane embedding
- “AF-UniprotID-F1-model_v2_ppm.pdb” with predicted PPM3 membrane embedding

TMvis
│
├── alpha
│   │
│   ├── A0A087X1C5
│   │   ├── A0A087X1C5.fasta
│   │   ├── AF-A0A087X1C5-F1-model_v2.pdb
│   │   ├── AF-A0A087X1C5-F1-model_v2.cif
│   │   ├── AF-A0A087X1C5-F1-model_v2_ANVIL.pdb
│   │   └── AF-A0A087X1C5-F1-model_v2_ppm.PDB
│   └── ...
└── beta
    └── P45880
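The folder layout described above can be sketched as a small path builder. This is only an illustration: the function name `tmvis_files` and the default root `TMvis` are hypothetical, and the file names follow the five patterns listed in the description.

```python
from pathlib import PurePosixPath


def tmvis_files(uniprot_id: str, tmp_class: str, root: str = "TMvis") -> list:
    """List the five expected files for one protein folder in TMvis."""
    if tmp_class not in ("alpha", "beta"):
        raise ValueError("tmp_class must be 'alpha' or 'beta'")
    folder = PurePosixPath(root) / tmp_class / uniprot_id
    stem = f"AF-{uniprot_id}-F1-model_v2"
    names = [
        f"{uniprot_id}.fasta",  # sequence plus TMbed per-residue prediction
        f"{stem}.pdb",          # AlphaFold structure (PDB)
        f"{stem}.cif",          # AlphaFold structure (mmCIF)
        f"{stem}_ANVIL.pdb",    # ANVIL membrane embedding
        # The directory tree shows this file with an upper-case .PDB
        # extension; adjust the suffix if the archive really uses it.
        f"{stem}_ppm.pdb",      # PPM3 membrane embedding
    ]
    return [str(folder / name) for name in names]


for path in tmvis_files("A0A087X1C5", "alpha"):
    print(path)
```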

TMvis visualization: The 3D visualization of every protein in the TMvis dataset can be accessed using the Jupyter Notebook “TMvis.ipynb”. It contains detailed descriptions of the different membrane prediction tools ANVIL, PPM3, and TMbed, as well as the respective code. It can also visualize the per-residue confidence scores (pLDDT) of AlphaFold.
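The average pLDDT used as a selection criterion for TMvis can be recomputed from a structure file, because AlphaFold PDB files store the per-residue pLDDT in the B-factor column. The sketch below is not taken from “TMvis.ipynb”; the function name is hypothetical, and it assumes standard fixed-width PDB records with one C-alpha atom per residue.

```python
def mean_plddt(pdb_text: str) -> float:
    """Average the B-factor column (pLDDT) over C-alpha ATOM records."""
    scores = []
    for line in pdb_text.splitlines():
        # PDB is fixed-width: atom name in columns 13-16,
        # temperature factor (here pLDDT) in columns 61-66.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores.append(float(line[60:66]))
    if not scores:
        raise ValueError("no C-alpha ATOM records found")
    return sum(scores) / len(scores)
```

A protein would pass the filter described above when `mean_plddt(text) > 90` for its AlphaFold PDB file.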

——————————————————————————————————————————————————————————————————————————

References:

[1] TMbed - Bernhofer, Michael, and Burkhard Rost. 2022. “TMbed – Transmembrane Proteins Predicted through Language Model Embeddings.” bioRxiv.

[2] ProtT5 - Elnaggar, A., et al. “ProtTrans: Towards Cracking the Language of Life’s Code Through Self-Supervised Deep Learning and High Performance Computing.” IEEE Transactions on Pattern Analysis and Machine Intelligence. doi: 10.1109/TPAMI.2021.3095381.

[3] UniProt - UniProt Consortium. 2021. “UniProt: The Universal Protein Knowledgebase in 2021.” Nucleic Acids Research 49 (D1): D480–D489.

[4] AlphaFold - Jumper, John, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, et al. 2021. “Highly Accurate Protein Structure Prediction with AlphaFold.” Nature 596 (7873): 583–89.

[5] AlphaFold DB - Varadi, Mihaly, Stephen Anyango, Mandar Deshpande, Sreenath Nair, Cindy Natassia, Galabina Yordanova, David Yuan, et al. 2022. “AlphaFold Protein Structure Database: Massively Expanding the Structural Coverage of Protein-Sequence Space with High-Accuracy Models.” Nucleic Acids Research 50 (D1): D439–44.

[6] ANVIL - Postic, Guillaume, Yassine Ghouzam, Vincent Guiraud, and Jean-Christophe Gelly. 2016. “Membrane Positioning for High- and Low-Resolution Protein Structures through a Binary Classification Approach.” Protein Engineering, Design & Selection: PEDS 29 (3): 87–91.

[7] PPM3 - Lomize, Mikhail A., Irina D. Pogozheva, Hyeon Joo, Henry I. Mosberg, and Andrei L. Lomize. 2012. “OPM Database and PPM Web Server: Resources for Positioning of Proteins in Membranes.” Nucleic Acids Research 40 (Database issue): D370–76.

——————————————————————————————————————————————————————————————————————————

License:

This work is licensed under a Creative Commons Attribution 4.0 International License (CC-BY 4.0).
The Encyclopedia of Domains (TED) structural domains assignments for...
zenodo.org
application/gzip, bz2 +1
Updated Oct 31, 2024
+ more versions
Cite
Andy Lau; Nicola Bordin; Shaun Kandathil; Ian Sillitoe; Vaishali Waman; Jude Wells; Christine Orengo; David T Jones (2024). The Encyclopedia of Domains (TED) structural domains assignments for AlphaFold Database v4 [Dataset]. http://doi.org/10.5281/zenodo.13908086
Explore at:
Available download formats: application/gzip, zip, bz2
Unique identifier
https://doi.org/10.5281/zenodo.13908086
Dataset updated
Oct 31, 2024
Dataset provided by
Zenodo: http://zenodo.org/
Authors
Andy Lau; Nicola Bordin; Shaun Kandathil; Ian Sillitoe; Vaishali Waman; Jude Wells; Christine Orengo; David T Jones
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Oct 31, 2024
Description
Dataset description:

The Encyclopedia of Domains (TED) is a joint effort by CATH (Orengo group) and the Jones group at University College London to identify and classify protein domains in AlphaFold2 models from AlphaFold Database version 4, covering over 188 million unique sequences and 365 million domain assignments.

In this data release, we will be making available to the community a table of domain boundaries and additional metadata on quality (pLDDT, globularity, number of secondary structures), taxonomy, and putative CATH SuperFamily or Fold assignments, for all 365 million domains (~324 million domains in TED100 and ~40 million domains in TED-redundant).

For all chains in the chain-level TED-redundant files, the file contains boundary predictions, consensus level and information on the TED100 representative.

For both TED100 and TED-redundant we provide domain boundary predictions outputted by each of the three methods employed in the project (Chainsaw, Merizo, UniDoc).

We are making available 7,427 PDB files for potentially novel folds identified during the TED classification process, with an annotation table sorted by novelty, as well as 6,433 highly symmetrical folds representatives.

Please use the gunzip command to extract files with a '.gz' extension and "tar -xzvf file.tar.gz" to open .tar.gz files.

CATH annotations have been assigned using the Foldseek algorithm applied in various modes, and the Foldclass algorithm, both of which are used to report significant structural similarity to a known CATH domain.

Note: The TED protocol differs from that of the standard CATH Assignment protocol for superfamily assignment, which also involves HMM-based protocols and manual curation for classification into superfamilies.

Changelog Version 5:

Add: ted_365m.domain_summary.cath.globularity.taxid.tsv.tar.gz - This table, in the same format as the previous ted_100_324m.domain_summary.cath.globularity.taxid.tsv.tar.gz, contains per-domain annotations for the whole of TED, including metadata on domain quality metrics such as secondary structure elements counts, globularity scores, average pLDDT and taxonomical assignments.

Add: high_symmetry_folds_set.domain_summary.tsv.gz - subset of ted_365m.domain_summary.cath.globularity.taxid.tsv containing information on 6,433 high symmetry folds in TED. The entries are sorted in descending order by Z-score obtained from SymD.

Add: high_symmetry_folds_set_models.tar.gz - TED domain models in PDB format for 6,433 high symmetry folds in TED.

Add: ISP_data.tar.gz - Raw data for Interacting SuperFamily Pairs calculations used in the manuscript. A more detailed description of the ISP data is available below as well as within the tar.gz file.

Add: ted_redundant_40m_domain_id.list.gz - list of TED_domain_ID in TED redundant

Add: ted_100_324m_domain_id.list.gz - list of TED_domain_ID in TED100

Fix/Replace: A domain-level summary of TED, now consolidated into ted_365m.domain_summary.cath.globularity.taxid.tsv, is consistent with the protocol used in the manuscript. As Foldclass and Foldseek T-level hits provide all 4 CATH digits, we removed the H portion of the CATH code from each prediction at the T-level.
Previously, the following columns
14. cath_label - CATH code if predicted, either a C.A.T.H homologous superfamily or a C.A.T fold assignment, e.g. 3.40.50.300
15. cath_assignment_level - H for homologous superfamily assignment, T for fold level assignment.
16. cath_assignment_method - Method used to assign a CATH label, either Foldseek or Foldclass
sometimes showed an additional label with a T-level prediction by Foldclass in the case of T-level assignments obtained by Foldseek, e.g.
3.40.30,3.40.30 T foldseek,foldclass
This has now been corrected to reflect the TED protocol, with Foldclass T-level assignments applied only to domains where a T-level assignment could not be applied using Foldseek, e.g.
domain-x 3.40.30 T foldseek
domain-y 3.20.20 T foldclass

Thus, in the current version of the data, a CATH assignment label can only be one of the following:
H-level assignment by Foldseek (e.g. 3.40.50.300 H foldseek)
T-level assignment by Foldseek (e.g. 3.40.30 T foldseek)
T-level assignment by Foldclass (e.g. 3.40.30 T foldclass)
no assignment (- - -)
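The four allowed label forms, and the removal of the H portion from T-level codes, can be expressed as a small validator. This is a sketch under stated assumptions: the function name `normalize_cath` is hypothetical, and the method names are taken in lower case as they appear in the examples above.

```python
def normalize_cath(label: str, level: str, method: str) -> tuple:
    """Validate one (cath_label, level, method) triple.

    T-level codes are trimmed to three digits (C.A.T), mirroring the
    changelog note about dropping the H portion at the T-level.
    """
    if (label, level, method) == ("-", "-", "-"):
        return ("-", "-", "-")  # no assignment
    allowed = {("H", "foldseek"), ("T", "foldseek"), ("T", "foldclass")}
    if (level, method) not in allowed:
        raise ValueError(f"unexpected assignment: {level} {method}")
    digits = label.split(".")
    if level == "T":
        digits = digits[:3]  # keep C.A.T, drop H if present
    expected = 4 if level == "H" else 3
    if len(digits) != expected:
        raise ValueError(f"label {label!r} does not match level {level}")
    return (".".join(digits), level, method)


print(normalize_cath("3.40.30.10", "T", "foldclass"))
```

Rejecting the `H foldclass` combination reflects the statement that H-level assignments come from Foldseek only.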

This dataset contains:

ted_214m_per_chain_segmentation.tsv
The file contains all 214M protein chains in TED with consensus domain boundaries and proteome information in the following columns.
1. AFDB_model_ID: chain identifier from AFDB in the format AF-

ted_365m_domain_boundaries_consensus_level.tsv.gz
The file contains all domain assignments in TED100 and TED-redundant (365M) in the format:
1. TED_ID: TED domain identifier in the format AF-

ted_100_324m_domain_id.list.gz - list of ~324 million domain identifiers in TED100, one per line in the format AF-

ted_redundant_40m_domain_id.list.gz - list of ~40 million domain identifiers in TED redundant, one per line in the format AF-

ted_365m.domain_summary.cath.globularity.taxid.tsv, novel_folds_set.domain_summary.tsv and high_symmetry_folds_set.domain_summary.tsv are header-less with the following columns separated by tabs (.tsv). novel_folds_set.domain_summary.tsv is sorted by novelty

1. ted_id - TED domain identifier in the format AF-

ted_324m_seq_clustering.cathlabels.tsv.gz
The file contains the results of the domain sequences clustering with MMseqs2.
Columns:
1. Cluster_representative
2. Cluster_member
3. CATH code assignment if available, e.g. 3.40.50.300 for a domain with a homologous match or 3.20.20 for a domain matching at the fold level in the CATH classification
4. CATH assignment type - either Foldseek-T, Foldseek-H or Foldclass
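The four columns listed above can be read with a streaming parser, which matters for a file of this size. The sketch assumes the file is tab-separated with no header row (an assumption; check the first line before relying on it); `ClusterRow` and `iter_cluster_rows` are hypothetical names, and the identifiers in the demo are placeholders, not real TED domain IDs.

```python
import csv
import io
from typing import Iterator, NamedTuple, TextIO


class ClusterRow(NamedTuple):
    representative: str
    member: str
    cath_code: str
    assignment_type: str


def iter_cluster_rows(handle: TextIO) -> Iterator[ClusterRow]:
    """Yield one ClusterRow per line of a 4-column TSV stream."""
    for fields in csv.reader(handle, delimiter="\t"):
        if len(fields) != 4:
            raise ValueError(f"expected 4 columns, got {len(fields)}")
        yield ClusterRow(*fields)


# Placeholder identifiers, for illustration only.
demo = io.StringIO("rep_1\tmember_1\t3.40.50.300\tFoldseek-H\n")
for row in iter_cluster_rows(demo):
    print(row.member, row.cath_code, row.assignment_type)
```

For the real file, the same function can be fed a handle from `gzip.open(path, "rt")` so that rows are decompressed and processed one at a time.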

Domain assignments for TED redundant using single-chain and multi-chain consensus in ted_redundant_39m.multichain.consensus_domain_summary.taxid.tsv.gz and ted_redundant_39m.singlechain.consensus_domain_summary.taxid.tsv.gz

The file ted_redundant_39m.multichain.consensus_domain_summary.taxid.tsv.gz contains a header with the
Reciprocal Best Structure Hits (RBSH)
repository.cam.ac.uk
bin
Updated Sep 22, 2022
+ more versions
Cite
Monzon, Vivian; Paysan-Lafosse, Typhaine; Wood, Valerie; Bateman, Alex (2022). Reciprocal Best Structure Hits (RBSH) [Dataset]. http://doi.org/10.17863/CAM.87873
Explore at:
Available download formats: bin (171535 bytes), bin (155431 bytes), bin (79489 bytes), bin (84547 bytes), bin (39107 bytes)
Unique identifier
https://doi.org/10.17863/CAM.87873
Dataset updated
Sep 22, 2022
Dataset provided by
Apollo
University of Cambridge
Authors
Monzon, Vivian; Paysan-Lafosse, Typhaine; Wood, Valerie; Bateman, Alex
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
In this work, we use AlphaFold structure models to find the closest homologous proteins between Homo sapiens and D. melanogaster, C. elegans, S. cerevisiae and S. pombe, as well as between S. cerevisiae and S. pombe. We run the structure aligner Foldseek all against all and search for the best-scoring hit in both directions to detect the Reciprocal Best Structure Hits (RBSH). We compare the results to protein pairs detected by their sequence similarity as Reciprocal Best Hits (RBH) and verify the results using the PANTHER family classification files. Note: This dataset is an updated version of the dataset at https://doi.org/10.17863/CAM.85487.
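The reciprocal-best-hit idea described above can be sketched in a few lines. This is a generic illustration, not the authors' pipeline: it assumes the aligner output has already been reduced to a score per (query, target) pair where higher is better, and the identifiers and scores in the demo are made up.

```python
def best_hits(scores: dict) -> dict:
    """Map each query to its top-scoring target.

    `scores` maps (query, target) pairs to a similarity score.
    On ties, the pair encountered first is kept.
    """
    best = {}
    for (query, target), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {query: target for query, (target, _) in best.items()}


def reciprocal_best_hits(a_vs_b: dict, b_vs_a: dict) -> set:
    """Return pairs (a, b) where a's best hit is b and b's best hit is a."""
    forward = best_hits(a_vs_b)
    backward = best_hits(b_vs_a)
    return {(a, b) for a, b in forward.items() if backward.get(b) == a}


# Made-up identifiers and scores, for illustration only.
human_vs_yeast = {("h1", "y1"): 900.0, ("h1", "y2"): 300.0, ("h2", "y1"): 800.0}
yeast_vs_human = {("y1", "h1"): 880.0, ("y1", "h2"): 790.0, ("y2", "h2"): 120.0}
print(reciprocal_best_hits(human_vs_yeast, yeast_vs_human))
```

In the demo only `h1` and `y1` pick each other, so `h2` is left without a reciprocal partner even though it has a strong one-directional hit.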
AlphaFold Unmasked data sets
demo.researchdata.se
figshare.scilifelab.se
+1more
Updated Jan 27, 2025
Cite
Claudio Mirabello; Björn Wallner; Björn Nystedt; Marta Carroni (2025). AlphaFold Unmasked data sets [Dataset]. http://doi.org/10.17044/SCILIFELAB.24198669
Explore at:
Unique identifier
https://doi.org/10.17044/SCILIFELAB.24198669
Dataset updated
Jan 27, 2025
Dataset provided by
Linköping University
Authors
Claudio Mirabello; Björn Wallner; Björn Nystedt; Marta Carroni
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Here are deposited all of the predictions generated for the test cases presented in "AlphaFold Unmasked: integration of experiments and predictions with a smarter template mechanism" (doi: https://doi.org/10.1101/2023.09.20.558579) along with the log files necessary to reproduce the experiments.

Each tar.gz file includes one or more AlphaFold experiments, where multiple predictions have been generated either with AlphaFold-Multimer (standard pipeline, v2.2 and/or v2.3 parameters) or with AF_unmasked. An experiment is made of a set of 3D structure predictions (.pdb files) along with the ancillary data generated by AlphaFold (pickle files) and the corresponding inputs (Multiple Sequence Alignments, sequences). Scripts to reproduce the results are included along with the log files generated during the experiments.

H1111, H1142, T1109 and T1110 are multimeric prediction targets from CASP15 (https://predictioncenter.org/casp15/), chosen because most or all predictors failed to correctly predict these complexes in the 2022 edition of CASP.

Rubisco, NF1 and ClpB are examples of large and/or challenging targets where Cryo-EM data is available to be integrated in the prediction pipeline.

The PDB benchmark is made of a set of protein heterodimeric structures deposited in the PDB before January 2022, i.e. before AlphaFold v2.3 was trained and released. These heterodimers have been redundancy reduced by structural similarity (MMalign score threshold: 0.4) to increase their diversity
The comparison of the AlphaFold and SwissModel Repository databases
zenodo.org
data.niaid.nih.gov
bin, zip
Updated Mar 9, 2023
Cite
Arthur Zalevsky (2023). The comparison of the AlphaFold and SwissModel Repository databases [Dataset]. http://doi.org/10.5281/zenodo.7709897
Explore at:
Available download formats: zip, bin
Unique identifier
https://doi.org/10.5281/zenodo.7709897
Dataset updated
Mar 9, 2023
Dataset provided by
Zenodo: http://zenodo.org/
Authors
Arthur Zalevsky
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset supplements the code at https://github.com/aozalevsky/alphafold2_vs_swissmodel for the comparison of the AlphaFold2 database (https://alphafold.ebi.ac.uk) with the SwissModel Repository (https://swissmodel.expasy.org/repository). Results of the analysis were published as part of the AlphaFold community review https://www.nature.com/articles/s41594-022-00849-w
AF-M predictions accompanying the manuscript: Predictomes: A...
data.sbgrid.org
Updated Feb 4, 2025
Cite
Schmid, Ernst; Walter, Johannes (2025). AF-M predictions accompanying the manuscript: Predictomes: A classifier-curated database of AlphaFold-modeled protein-protein interactions [Dataset]. http://doi.org/10.15785/SBGRID/1155
Explore at:
Unique identifier
https://doi.org/10.15785/SBGRID/1155
Dataset updated
Feb 4, 2025
Dataset provided by
SBGrid Data Bank
Authors
Schmid, Ernst; Walter, Johannes
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
The set of all AlphaFold-Multimer (AF-M) v2.3 pairwise structure predictions accompanying the publication "Predictomes: A classifier-curated database of AlphaFold-modeled protein-protein interactions". This dataset includes prediction pairs used for training random forest classifiers including SPOC, pairs used for 30 ranking experiments, all pairs that belong to the genome maintenance matrix on predictomes.org, and three proteome-wide in silico interaction screens conducted with human DONSON, human STK19, and human USP37. All pairs were generated with ColabFold v1.5.2. All predictions used AF-M version 3 weights (models 1, 2, and 4) with 3 recycles, templates enabled, 1 ensemble, no dropout, and no AMBER relaxation. The multiple sequence alignments (MSAs) (unpaired + paired) supplied to AF-M were generated by the MMseqs2 server using default settings. Sequences were generally capped at 3,600 amino acids in total to avoid memory exhaustion on GPUs.
Reciprocal Best Structure Hits
repository.cam.ac.uk
bin
Updated Jun 14, 2022
+ more versions
Cite
Monzon, Vivian; Paysan-Lafosse, Typhaine; Wood, Valerie; Bateman, Alex (2022). Reciprocal Best Structure Hits [Dataset]. http://doi.org/10.17863/CAM.85487
Explore at:
Available download formats: bin (181809 bytes), bin (39365 bytes), bin (86555 bytes), bin (200706 bytes), bin (90557 bytes)
Unique identifier
https://doi.org/10.17863/CAM.85487
Dataset updated
Jun 14, 2022
Dataset provided by
Apollo
University of Cambridge
Authors
Monzon, Vivian; Paysan-Lafosse, Typhaine; Wood, Valerie; Bateman, Alex
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
In this work, we use AlphaFold structure models to find the closest homologous proteins between Homo sapiens and D. melanogaster, C. elegans, S. cerevisiae and S. pombe, as well as between S. cerevisiae and S. pombe. We run the structure aligner Foldseek all against all and search for the best-scoring hit in both directions to detect the Reciprocal Best Structure Hits (RBSH). We compare the results to protein pairs detected by their sequence similarity as Reciprocal Best Hits (RBH) and verify the results using the PANTHER family classification files. Note: This dataset is an earlier version of a more up-to-date dataset at https://doi.org/10.17863/CAM.87873.
afdb_clusters v1.0: AlphaFold-derived structure-based dataset for...
zenodo.org
application/gzip
Updated Jul 18, 2025
Cite
Andrzej Zielezinski; Adam Gudyś; Sebastian Deorowicz (2025). afdb_clusters v1.0: AlphaFold-derived structure-based dataset for benchmarking MSA tools [Dataset]. http://doi.org/10.5281/zenodo.16082639
Explore at:
Available download formats: application/gzip
Unique identifier
https://doi.org/10.5281/zenodo.16082639
Dataset updated
Jul 18, 2025
Dataset provided by
Zenodo: http://zenodo.org/
Authors
Andrzej Zielezinski; Adam Gudyś; Sebastian Deorowicz
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset contains 1,166 protein families derived from AlphaFold Database Clusters. The families vary in size, ranging from approximately 1,000 to 680,000 sequences.

For each family, the dataset provides:

protein sequences (FASTA format)

download URLs for AlphaFold-predicted PDB structures corresponding to each protein sequence

These paired sequences and structures enable structure-based benchmarking of multiple sequence alignment (MSA) tools using the Local Distance Difference Test (LDDT) score, computed with the FoldMason tool.

Directory structure

The dataset contains two main directories:

fasta/ – protein sequences for each cluster [FASTA format]

pdb_urls/ – text files containing download URLs for AlphaFold PDB structures for each sequence in the cluster [TXT format]

A metadata file (metadata.tsv) is also included, providing detailed information for each cluster.

Metadata

A metadata file (metadata.tsv) provides:

cluster_id – Cluster identifier

seqs_count – total number of sequences in the cluster

min_seq_length – minimum sequence length within the cluster

mean_seq_length – average sequence length within the cluster

max_seq_length – maximum sequence length within the cluster
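The metadata columns listed above lend themselves to a simple filter, for example to pick only the largest families for a benchmark. The sketch assumes that metadata.tsv is tab-separated and starts with a header row using exactly these column names (an assumption not confirmed by the description); `large_clusters` is a hypothetical helper.

```python
import csv
from typing import Iterable, Iterator


def large_clusters(lines: Iterable[str], min_seqs: int) -> Iterator[dict]:
    """Yield clusters with at least `min_seqs` sequences.

    `lines` is any iterable of text lines, such as an open file.
    """
    reader = csv.DictReader(lines, delimiter="\t")
    for row in reader:
        count = int(row["seqs_count"])
        if count >= min_seqs:
            yield {
                "cluster_id": row["cluster_id"],
                "seqs_count": count,
                "mean_seq_length": float(row["mean_seq_length"]),
            }
```

With the archive unpacked, `large_clusters(open("metadata.tsv", newline=""), 100_000)` would iterate over families of at least 100,000 sequences.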
DataSheet1_Automated identification of chalcogen bonds in AlphaFold protein...
frontiersin.figshare.com
pdf
Updated Jul 6, 2023
Cite
Oliviero Carugo; Kristina Djinović-Carugo (2023). DataSheet1_Automated identification of chalcogen bonds in AlphaFold protein structure database files: is it possible?.PDF [Dataset]. http://doi.org/10.3389/fmolb.2023.1155629.s001
Explore at:
Available download formats: pdf
Unique identifier
https://doi.org/10.3389/fmolb.2023.1155629.s001
Dataset updated
Jul 6, 2023
Dataset provided by
Frontiers
Authors
Oliviero Carugo; Kristina Djinović-Carugo
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Protein structure prediction and structural biology have entered a new era with an artificial intelligence-based approach encoded in the AlphaFold2 and the analogous RoseTTAfold methods. More than 200 million structures have been predicted by AlphaFold2 from their primary sequences and the models as well as the approach itself have naturally been examined from different points of view by experimentalists and bioinformaticians. Here, we assessed the degree to which these computational models can provide information on subtle structural details with potential implications for diverse applications in protein engineering and chemical biology and focused the attention on chalcogen bonds formed by disulphide bridges. We found that only 43% of the chalcogen bonds observed in the experimental structures are present in the computational models, suggesting that the accuracy of the computational models is, in the majority of the cases, insufficient to allow the detection of chalcogen bonds, according to the usual stereochemical criteria. High-resolution experimentally derived structures are therefore still necessary when the structure must be investigated in depth based on fine structural aspects.
Data from: Application of AlphaFold on metamorphic proteins - dataset
data.niaid.nih.gov
Updated Jul 11, 2024
Cite
Karolina Mitusińska (2024). Application of AlphaFold on metamorphic proteins - dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8093512
Explore at:
Dataset updated
Jul 11, 2024
Dataset provided by
Weronika Bagrowska
Artur Góra
Karolina Mitusińska
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The dataset comprises a set of five structures of metamorphic proteins used for the study.
UltraScan Solution Modeler (US-SOMO) hydrodynamic parameter, structural...
dataone.org
search.dataone.org
+2more
Updated Jul 17, 2025
Cite
Emre Brookes; Mattia Rocco (2025). UltraScan Solution Modeler (US-SOMO) hydrodynamic parameter, structural small angle scattering and SESCA circular dichroism (CD) calculations on AlphaFold predicted structures [Dataset]. http://doi.org/10.5061/dryad.jq2bvq89s
Explore at:
Unique identifier
https://doi.org/10.5061/dryad.jq2bvq89s
Dataset updated
Jul 17, 2025
Dataset provided by
Dryad Digital Repository
Authors
Emre Brookes; Mattia Rocco
Time period covered
Jan 1, 2021
Description
Recent spectacular advances by AI programs in 3D structure predictions from protein sequences have revolutionized the field in terms of accuracy and speed. The resulting "folding frenzy" has already produced predicted protein structure databases for the entire human and other organisms' proteomes. However, rapidly ascertaining a predicted structure's reliability based on measured properties in solution should be considered. Shape-sensitive hydrodynamic parameters such as the diffusion and sedimentation coefficients (D0t(20,w), s0(20,w)) and the intrinsic viscosity ([η]) can provide a rapid assessment of the overall structure likeliness, and SAXS would yield the structure-related pair-wise distance distribution function p(r) vs. r. Using the extensively validated UltraScan SOlution MOdeler (US-SOMO) suite, a database was implemented calculating from AlphaFold structures the corresponding D0t(20,w), s0(20,w), [η], p(r) vs. r, and other parameters. Circular dichroism spectra were computed u...

Production of this dataset required three major steps: collect the AlphaFold entries and additional metadata; prepare the structures for hydrodynamic, structural and CD calculations; and compute the hydrodynamic, structural and CD properties. Briefly, each entry in the entire AlphaFold database was first compared with the corresponding entry in the UniProt database to find the (putative) initiator methionine, signal peptide and transit peptide regions, which were subsequently removed from the AlphaFold PDB files. Additional variants were created when propeptides were found. Potential disulfides were identified (subsequently allowing a better evaluation of the partial specific volume and of M) and written as SSBOND records in the cured PDBs, together with HELIX and SHEET information identified using the DSSP implementation in UCSF Chimera (Pettersen et al, 2004. Journal of computational chemistry, 25(13), pp.1605-1612).
Batch-mode US-SOMO was then used to calculate the mass M, The translat...

This is a tar archive of all datasets for each AlphaFold entry. This includes a csv file containing all hydrodynamic parameters, a pdb file containing the cured pdb structure, an mmCIF file containing the cured pdb structure, a data file containing the circular dichroism spectrum, and a p(r) vs r dat file. Use "tar xf somoaf_all_data.tar" to extract the primary archive. This will result in 1,002,038 individual .txz files, each representing one UniProt accession code and containing 5 files. When propeptides are identified and removed, the extracted file name will contain a -pp#, where # is a list of the propeptides removed. For example, to extract the data from an individual txz file, use "tar Jxf xxxx.txz", where xxxx is replaced by the appropriate name containing the accession code. Further details are in the provided README.md file.
AlphaFold
ebi.ac.uk
Cite
AlphaFold [Dataset]. https://www.ebi.ac.uk/ebisearch/metadata.ebi?db=alphafold
Description
AlphaFold DB provides open access to over 200 million protein structure predictions to accelerate scientific research.
Data from: af3cli: Streamlining AlphaFold3 Input Preparation
acs.figshare.com
zip
Updated Apr 9, 2025
Cite
Philipp Döpner; Stefan Kemnitz; Mark Doerr; Lukas Schulig (2025). af3cli: Streamlining AlphaFold3 Input Preparation [Dataset]. http://doi.org/10.1021/acs.jcim.5c00276.s001
Available download formats: zip
Unique identifier
https://doi.org/10.1021/acs.jcim.5c00276.s001
Dataset updated
Apr 9, 2025
Dataset provided by
ACS Publications
Authors
Philipp Döpner; Stefan Kemnitz; Mark Doerr; Lukas Schulig
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
With the release of AlphaFold3, modeling capabilities have expanded beyond protein structure prediction to embrace the inherent complexity of biomolecular systems, including nucleic acids, ions, small molecules, and their interactions. The increased complexity of these assemblies is reflected in the input file generation process, presenting a significant hurdle for researchers without advanced computational expertise. While AlphaFold Server comes with a user-friendly graphical user interface, it supports only a subset of the features of AlphaFold3. To address this, we present af3cli, an open-source tool designed to facilitate the generation of AlphaFold3 input files, specifically tailored to the standalone version of AlphaFold3 and its unrestricted functionality. Featuring a user-friendly command-line interface and an accompanying Python library, af3cli simplifies the input generation process while maintaining flexibility and customization, which makes af3cli especially useful for fast (automated) generation of a large number of input files since it enables direct incorporation of FASTA files, keeps track of IDs, and validates the JSON file. Through practical examples, we demonstrate its capabilities for constructing input data for diverse biological structures, ranging from simple proteins to complex systems, and demonstrate its seamless integration into both manual and automated workflows.
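To make the input-generation task concrete, here is a minimal sketch of turning FASTA text into an AlphaFold3-style JSON job using only the Python standard library. It does not use af3cli or reproduce its API; the function names are hypothetical, and the field names ("name", "modelSeeds", "sequences", "dialect", "version") reflect one reading of the standalone AlphaFold3 input format that should be checked against the official documentation before use.

```python
import json
import string


def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def build_af3_input(job_name, fasta_text, seeds=(1,)):
    """Build a dict for a protein-only job (hypothetical helper).

    Chain IDs A, B, C, ... are assigned in FASTA order; the field names
    are assumptions about the AlphaFold3 input schema.
    """
    records = parse_fasta(fasta_text)
    if not records:
        raise ValueError("no FASTA records found")
    if len(records) > len(string.ascii_uppercase):
        raise ValueError("this sketch supports at most 26 chains")
    sequences = [
        {"protein": {"id": chain_id, "sequence": seq}}
        for chain_id, (_, seq) in zip(string.ascii_uppercase, records)
    ]
    return {
        "name": job_name,
        "modelSeeds": list(seeds),
        "sequences": sequences,
        "dialect": "alphafold3",
        "version": 1,
    }


def to_json(job):
    """Serialize the job and confirm that it round-trips as valid JSON."""
    text = json.dumps(job, indent=2)
    json.loads(text)
    return text
```

A tool such as af3cli additionally covers nucleic acids, ions, small molecules and validation of the resulting file, which this sketch leaves out.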
RefSeq virus protein structure prediction database
uvaauas.figshare.com
zip
Updated Mar 19, 2025
Cite
W.E.W. Schravesande; Adriaan Verhage; M.V. Cligge; Raoul Frijters; H.A. van den Burg (2025). RefSeq virus protein structure prediction database [Dataset]. http://doi.org/10.21942/uva.28417079.v1
Available download formats: zip
Unique identifier
https://doi.org/10.21942/uva.28417079.v1
Dataset updated
Mar 19, 2025
Dataset provided by
University of Amsterdam / Amsterdam University of Applied Sciences
Authors
W.E.W. Schravesande; Adriaan Verhage; M.V. Cligge; Raoul Frijters; H.A. van den Burg
License
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0) https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Description
Custom virus database. A custom Foldseek target database was created, including all protein sequences derived from plant-infecting viruses currently found in the NCBI RefSeq database. In total, 8,191 protein sequences were extracted and used as template for protein structure predictions. ColabFold v1.5.2 (using localcolabfold), which is based upon AlphaFold v2.3.1 (40), was used for protein model prediction. Settings: --random-seed 101 --num-seeds 3 --use-dropout --num-models 1 --num-recycle 8 --recycle-early-stop-tolerance 0.5. No templates were used during the protein model prediction. The uniref30_2302 and colabfold_envdb_202108 databases were used to generate the multiple sequence alignments (https://colabfold.mmseqs.com/). The predicted structures were filtered based on the pLDDT value, resulting in a set of 7,545 protein structures with a pLDDT ≥ 50.
Files:
modelling_stats.txt: tab-separated file containing the modelling statistics for each structure prediction
pdb_files/all: folder containing all PDB files resulting from the structure prediction
pdb_files/pLDDT50: folder containing all PDB files resulting from the structure prediction with a pLDDT score of 50 or higher
VIRAL_PROTEIN_PLANT_REFSEQ.fasta: FASTA file containing all protein sequences extracted from plant-infecting viral genomes uploaded to the NCBI RefSeq database
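The pLDDT ≥ 50 filter described above can be sketched as follows. This assumes the per-structure value is the mean per-residue pLDDT, read from the B-factor column of the Cα atoms, which is where AlphaFold-style PDB output stores it. The description does not say how the authors computed the per-structure value, so the averaging choice and the function names are assumptions.

```python
def mean_plddt(pdb_text):
    """Average the B-factor column (PDB columns 61-66) over CA atoms.

    Assumes an AlphaFold/ColabFold-style PDB in which the B-factor
    field holds the per-residue pLDDT.
    """
    scores = []
    for line in pdb_text.splitlines():
        # Atom name occupies columns 13-16 of a fixed-width ATOM record.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores.append(float(line[60:66]))
    if not scores:
        raise ValueError("no CA atoms found")
    return sum(scores) / len(scores)


def passes_filter(pdb_text, threshold=50.0):
    """Return True when the mean pLDDT is at or above the threshold."""
    return mean_plddt(pdb_text) >= threshold
```

Applied to each file in pdb_files/all, a filter of this kind would be expected to yield a subset comparable to pdb_files/pLDDT50, provided the authors used the same averaging.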
Supplementary Data for "Using AlphaFold and Experimental Structures for the...
data.niaid.nih.gov
zenodo.org
Updated Oct 24, 2023
Cite
Coskun, Dilek (2023). Supplementary Data for "Using AlphaFold and Experimental Structures for the Prediction of the Structure and Binding Affinities of GPCR Complexes via Induced Fit Docking and Free Energy Perturbation" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10037622
Dataset updated
Oct 24, 2023
Dataset provided by
Coskun, Dilek
Rodrigues, Joao
Lihan, Muyun
Miller, Edward
License
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0) https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Description
Supplementary data for publication "Using AlphaFold and Experimental Structures for the Prediction of the Structure and Binding Affinities of GPCR Complexes via Induced Fit Docking and Free Energy Perturbation". Includes:

All input structures used in the retrospective benchmark dataset, as well as the (at most) 5 best-scoring output models.
Input structures and output models for IFD-MD predictions of SSTR2, SSTR4, and SSTR5 complexes.
Output FEP+ maps (in fmp format) for the SSTR2, SSTR4, and SSTR5 best models (representative runs shown in the publication).
Data from: MsmRho AlphaFold predictions
ourarchive.otago.ac.nz
figshare.com
Updated Jun 24, 2024
+ more versions
Cite
Sofia Magalhaes Moreira (2024). MsmRho AlphaFold predictions [Dataset]. https://ourarchive.otago.ac.nz/esploro/outputs/dataset/MsmRho-AlphaFold-predictions/9926653800801891
Dataset updated
Jun 24, 2024
Dataset provided by
Figshare http://figshare.com/
Authors
Sofia Magalhaes Moreira
Time period covered
Jun 24, 2024
Description
This folder contains the files in CIF format generated by AlphaFold 3 and used to build Figure 4.1 of the thesis of Sofia Magalhães Moreira (https://hdl.handle.net/10523/43234). The data are embargoed in Figshare until 24 June 2026.
Predictomes
scicrunch.org
rrid.site
Updated Aug 18, 2025
Cite
(2025). Predictomes [Dataset]. http://identifiers.org/RRID:SCR_026691
Unique identifier
https://identifiers.org/RRID:SCR_026691
Dataset updated
Aug 18, 2025
Description
Interactive database of protein-protein interactions modeled by AlphaFold-Multimer. Classifier-curated database of AlphaFold-modeled protein-protein interactions.

