The Universal Protein Resource (UniProt) is a comprehensive resource for protein sequence and annotation data. The UniProt databases are the UniProt Knowledgebase (UniProtKB), the UniProt Reference Clusters (UniRef), and the UniProt Archive (UniParc).
Dataset Description
Dataset Summary
This dataset is a mirror of the Uniprot/SwissProt database. It contains the names and sequences of >500K proteins. This dataset was parsed from the FASTA file at https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz. Supported Tasks and Leaderboards: None Languages: English
Dataset Structure
Data Instances
Data Fields: id, description, sequence Data⦠See the full description on the dataset page: https://huggingface.co/datasets/damlab/uniprot.
Collection of data of protein sequence and functional information. Resource for protein sequence and annotation data. Consortium for preservation of the UniProt databases: UniProt Knowledgebase (UniProtKB), UniProt Reference Clusters (UniRef), and UniProt Archive (UniParc), UniProt Proteomes. Collaboration between European Bioinformatics Institute (EMBL-EBI), SIB Swiss Institute of Bioinformatics and Protein Information Resource. Swiss-Prot is a curated subset of UniProtKB.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The current database was downloaded on 27.09.2019 and has the data fields (columns) as described below:# 1 Entry# 2 Entry name# 3 Status# 4 Protein names# 5 Gene names# 6 Organism# 7 Length# 8 Cross-reference (KO)# 9 Taxonomic lineage (PHYLUM)# 10 Taxonomic lineage (SPECIES) # This field carries current and old* taxonomic classifications.# 11 Taxonomic lineage (GENUS)# 12 Taxonomic lineage (KINGDOM)# 13 Taxonomic lineage (SUPERKINGDOM)# 14 Cross-reference (OrthoDB)# 15 Cross-reference (eggNOG)*Details about the classification used in UNIPROT can be found at the link: https://www.uniprot.org/help/taxonomy
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description: TMvis ("TMvis496.tar.gz") is a dataset containing 496 3D-structures of predicted human transmembrane proteins (TMP) and their predicted membrane embedding. The method TMbed [1], based on the protein language model ProtT5 [2] predicted 4.967 TMP for the human proteome (20,375 proteins, UniProt [3] version April 2022; excluding TITIN_HUMAN due to length). For these proteins, we obtained AlphaFold [4] structures from AlphaFoldDB [5] with an average per-residue confidence score (pLDDT) of more than 90%. This resulted in the 496 proteins of TMvis, as can be found in "TMvis496.fasta". The membrane embedding was predicted using the methods ANVIL [6], PPM3 [7], and per-residue TMbed predictions. As the three methods are based on different approaches, we decided to publish results for all. The figure βTMvis_project_overview.pngβ provides a graphical overview for each step described above.
TMvis Folder Structure: TMvis is separated into βalphaβ containing predicted alpha-helical TMPs, and βbetaβ containing predicted beta-barrel TMPs. Within these folders, each protein is assigned one folder, identifiable by the respective unique UniProt ID. Each protein folder consists of: - βUniprotID.fastaβ with UniProt ID, sequence, TMbed per-residue prediction - βAF-UniprotID-F1-model_v2.pdbβ with the AlphaFold structure - βAF-UniprotID-F1-model_v2.cifβ with the AlphaFold structure - βAF-UniprotID-F1-model_v2_ANVIL.pdbβ with predicted ANVIL membrane embedding - βAF-UniprotID-F1-model_v2_ppm.pdbβ predicted PPM3 membrane embedding
TMvis
|
βββ alpha
β β
β βββ A0A087X1C5
β β βββ A0A087X1C5.fasta
β β βββ AF-A0A087X1C5-F1-model_v2.pdb
β β βββ AF-A0A087X1C5-F1-model_v2.cif
β β βββ AF-A0A087X1C5-F1-model_v2_ANVIL.pdb
β β βββ AF-A0A087X1C5-F1-model_v2_ppm.PDB
β βββ ...
βββ beta
βββ P45880
TMvis visualization: The 3D-visualization of every protein in the dataset TMvis can be easily accessed using the Jupyter Notebook βTMvis.ipynbβ. It contains detailed descriptions the different membrane prediction tools ANVIL, PPM3, and TMbed as well as the respective code. Additionally, it allows to visualize the per-residue confidence scores (pLDDT) of AlphaFold.
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
References:
[1] TMbed - TMbed Bernhofer, Michael, and Burkhard Rost. 2022. βTMbed β Transmembrane Proteins Predicted through Language Model Embeddings.β bioRxiv.
[2] ProtT5 - A. Elnaggar et al., "ProtTrans: Towards Cracking the Language of Lifes Code Through Self-Supervised Deep Learning and High Performance Computing," in IEEE Transactions on Pattern Analysis and Machine Intelligence, doi: 10.1109/TPAMI.2021.3095381.
[3] UniProt - UniProt Consortium (2021). UniProt: the universal protein knowledgebase in 2021. Nucleic acids research, 49(D1), D480βD489.
[4] AlphaFold - AlphaFold Jumper, John, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, et al. 2021. βHighly Accurate Protein Structure Prediction with AlphaFold.β Nature 596 (7873): 583β89.
[5] Alphafold DB - Varadi, Mihaly, Stephen Anyango, Mandar Deshpande, Sreenath Nair, Cindy Natassia, Galabina Yordanova, David Yuan, et al. 2022. βAlphaFold Protein Structure Database: Massively Expanding the Structural Coverage of Protein-Sequence Space with High-Accuracy Models.β Nucleic Acids Research 50 (D1): D439β44.
[6] ANVIL - ANVIL Postic, Guillaume, Yassine Ghouzam, Vincent Guiraud, and Jean-Christophe Gelly. 2016. βMembrane Positioning for High- and Low-Resolution Protein Structures through a Binary Classification Approach.β Protein Engineering, Design & Selection: PEDS 29 (3): 87β91.
[7] PPM3 - PPM3 Lomize, Mikhail A., Irina D. Pogozheva, Hyeon Joo, Henry I. Mosberg, and Andrei L. Lomize. 2012. βOPM Database and PPM Web Server: Resources for Positioning of Proteins in Membranes.β Nucleic Acids Research 40 (Database issue): D370β76.
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
License:
This work is licensed under a Creative Commons Attribution 4.0 International License (CC-BY 4.0).
Data set of manually annotated chordata-specific proteins as well as those that are widely conserved. The program keeps existing human entries up-to-date and broadens the manual annotation to other vertebrate species, especially model organisms, including great apes, cow, mouse, rat, chicken, zebrafish, as well as Xenopus laevis and Xenopus tropicalis. A draft of the complete human proteome is available in UniProtKB/Swiss-Prot and one of the current priorities of the Chordata protein annotation program is to improve the quality of human sequences provided. To this aim, they are updating sequences which show discrepancies with those predicted from the genome sequence. Dubious isoforms, sequences based on experimental artifacts and protein products derived from erroneous gene model predictions are also revisited. This work is in part done in collaboration with the Hinxton Sequence Forum (HSF), which allows active exchange between UniProt, HAVANA, Ensembl and HGNC groups, as well as with RefSeq database. UniProt is a member of the Consensus CDS project and thye are in the process of reviewing their records to support convergence towards a standard set of protein annotation. They also continuously update human entries with functional annotation, including novel structural, post-translational modification, interaction and enzymatic activity data. In order to identify candidates for re-annotation, they use, among others, information extraction tools such as the STRING database. In addition, they regularly add new sequence variants and maintain disease information. Indeed, this annotation program includes the Variation Annotation Program, the goal of which is to annotate all known human genetic diseases and disease-linked protein variants, as well as neutral polymorphisms.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
NCBIfam is a collection of protein families, featuring curated multiple sequence alignments, hidden Markov models (HMMs) and annotation, which provides a tool for identifying functionally related proteins based on sequence homology. NCBIfam is maintained at the National Center for Biotechnology Information (Bethesda, MD). NCBIfam includes models from TIGRFAMs, another database of protein families developed at The Institute for Genomic Research, then at the J. Craig Venter Institute (Rockville, MD, US).
Database that contains publicly available protein sequences with stable and unique identifiers (UPI) which are never removed, changed or reassigned. UniParc tracks sequence changes in the source databases and archives the history of all changes. Information other than protein sequence must be retrieved from the UniParc source databases using the database cross-references.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
PROSITE is a database of protein families and domains. It consists of biologically significant sites, patterns and profiles that help to reliably identify to which known protein family a new sequence belongs. PROSITE is based at the Swiss Institute of Bioinformatics (SIB), Geneva, Switzerland.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Functional annotation from protein sequence database of Uniprot, Myristica fatua
Databases which provide clustered sets of sequences from UniProt Knowledgebase and selected UniParc records, in order to obtain complete coverage of sequence space at several resolutions while hiding redundant sequences from view. The UniRef100 database combines identical sequences and sub-fragments with 11 or more residues (from any organism) into a single UniRef entry. The sequence of a representative protein, the accession numbers of all the merged entries, and links to the corresponding UniProtKB and UniParc records are all displayed in the entry. UniRef90 and UniRef50 are built by clustering UniRef100 sequences with 11 or more residues such that each cluster is composed of sequences that have at least 90% (UniRef90) or 50% (UniRef50) sequence identity to the longest sequence (UniRef seed sequence). All the sequences in each cluster are ranked to facilitate the selection of a representative sequence for the cluster.
The Peptide Sequence Database contains putative peptide sequences from human, mouse, rat, and zebrafish. Compressed to eliminate redundancy, these are about 40 fold smaller than a brute force enumeration. Current and old releases are available for download. Each species'' peptide sequence database comprises peptide sequence data from releveant species specific UniGene and IPI clusters, plus all sequences from their consituent EST, mRNA and protein sequence databases, namely RefSeq proteins and mRNAs, UniProt''s SwissProt and TrEMBL, GenBank mRNA, ESTs, and high-throughput cDNAs, HInv-DB, VEGA, EMBL, IPI protein sequences, plus the enumeration of all combinations of UniProt sequence variants, Met loss PTM, and signal peptide cleavages. The README file contains some information about the non amino-acid symbols O (digest site corresponding to a protein N- or C-terminus) and J (no digest sequence join) used in these peptide sequence databases and information about how to configure various search engines to use them. Some search engines handle (very) long sequences badly and in some cases must be patched to use these peptide sequence databases. All search engines supported by the PepArML meta-search engine can (or can be patched to) successfully search these peptide sequence databases.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The CATH-Gene3D database describes protein families and domain architectures in complete genomes. Protein families are formed using a Markov clustering algorithm, followed by multi-linkage clustering according to sequence identity. Mapping of predicted structure and sequence domains is undertaken using hidden Markov models libraries representing CATH and Pfam domains. CATH-Gene3D is based at University College, London, UK.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Protein sequence and structure data
This data set contains data from Uniprot (in the files called protein_sequence, protein_synonyms, protein_names, organism_synonyms) and PDB (in the files called PDB and PDB_chain) as used by the Aquaria web resource at the time of download (2022-02-08).
The PSSH2 data set
PSSH2 is a database of protein sequence-to-structure homologies based on HHblits, an alignment method employing iterative comparisons of hidden Markov models (HMMs). To ensure the highest possible final alignment quality for matches in Aquaria using HHblits, we first calculate HMM profiles for each unique PDB sequence (PDB_full) and also for each unique Swiss-Prot sequence. We generated PSSH2 using HHblits to find similarities between HMMs from PDB and HMMs from UniProt sequences.
Calculating PSSH2
The Swissprot and PDB data was downloaded in November 2021. Generating PSSH2: We used UniRef30_2021_03 (originally called UniRef30_2021_06) from HH-suite, a database of non-redundant UniProt sequence clusters in which the highest pairwise sequence identity between clusters was 30%. The HHblits code and the code for running the calculations was retrieved from git (https://github.com/soedinglab/hh-suite.git and https://github.com/aschafu/PSSH2.git respectively) at the respective time of calculation in the timeframe until December 2021.
PDB based sequence-to-structure alignments
In addition to the PSSH2 data, new PDB structures were retrieved based on the primary accession of the proteins, by querying for all chains in all PDB entries with exact matches using the sequence cross references records given in PDB. Sequence-to-structure alignments were then created, again based on information provided in each PDB entry. These are contained in the PDBchain data.
This data covers sequences and PDB structures in the timeframe until February 2022.
Evaluating PSSH2
The resulting alignment data was analysed using CATH domain assignments downloaded from /cath/releases/all-releases/v4_2_0/cath-classification-data/ to define correct hits and false hits:
The set of query sequences is defined by the CATH non-redundant S40_overlap_60 dataset (ftp://orengoftp.biochem.ucl.ac.uk/cath/releases/all-releases/v4_2_0/non-redundant-data-sets/)
The set of all expected hits are all pdb structures containing a domain with the same CATH code if contained in the set of processed sequences (-> all) or only if also contained in the set of non redundant sequences (-> nr40).
The set of true positives is defined by sharing the same CATH code up to the level of homology ("CATH") or up to the level of topology ("CAT").
The data was evaluated with respect to false discovery rate (FDR) and recall (true positive rate TPR) by cumulatively considering all hits with an E-value below the threshold ("C") or in bins with an E-value between the threshold and one tenth of the threshold ("B"). This evaluation was carried out for the data obtained in November 2021 (202111) as well as previous data from October 2020 (202010), February 2020 (202002) and September 2017 (201709). The results are collected in PSSH CATH validation.csv.
Known errors
Due to processing error, the profile of pdb structure 5fia A / B (sequence md5 052667679fc644184f40063c7602c9e1) is incomplete in the pdb_full hhblits database which led to further errors in generating sequence based alignments for sequences for 1vtm P (sequence md5 c844aff103449363cb8489c78c58ebf1) and 434t A / B (sequence md5 d67aa1c3a36492c719cb48b5e7ecc624).
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Predicted isoelectric point for all UniProtKB/TrEMBL proteins (April 2016) done using 18 different algorithms. Over 63 millions of protein sequences. Compressed using 7zip **Primary reference: Kozlowski, LP (2016) Proteome-pI: proteome isoelectric point database. Nucleic Acids Research doi: 10.1093/nar/gkw978 **www: http://isoelectricpointdb.org
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Number of human protein variations collected from the UniProt/Swiss-Prot database.
The GOA (Gene Ontology Annotation) project provides high-quality Gene Ontology (GO) annotations to proteins in the UniProt Knowledgebase (UniProtKB) and International Protein Index (IPI). This involves electronic annotation and the integration of high-quality manual GO annotation from all GO Consortium model organism groups and specialist groups.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Proteins from multiple databases were used in algal genome annotation. We first downloaded the protein data of green algae (Chlorophyta) and red algae (Rhodophyta) on UniProt for 10 higher-quality assemblies gene predictions. We refer to these protein sequences for RefSeq genomes gene prediction as seed_algae_mix, whose role is to complete gene prediction of higher quality genomes quickly and accurately. These predicted results obtained in the RefSeq genomes will be used as the query sequences to query the NR (RefSeq non-redundant proteins) database, Then the searched target sequence is extracted as the NR_extract part. Subsequently, we retrieved the protein sequences on Uniprot of all lineages of algae, please note that we did not select only the protein sequences of the 17 lineages to be annotated, but all algae from the 21 lineages searched on NCBI taxonomy, a total of 2,432,633 (1.17GB) algae protein sequences named total_algae_mix were for future predictions. The plants part of OrthoDB V10.1 and BUSCO proteins of 7 lineages were also merged into pr_total_mix.
Central repository for collection of functional information on proteins, with accurate and consistent annotation. In addition to capturing core data mandatory for each UniProtKB entry (mainly, the amino acid sequence, protein name or description, taxonomic data and citation information), as much annotation information as possible is added. This includes widely accepted biological ontologies, classifications and cross-references, and experimental and computational data. The UniProt Knowledgebase consists of two sections, UniProtKB/Swiss-Prot and UniProtKB/TrEMBL. UniProtKB/Swiss-Prot (reviewed) is a high quality manually annotated and non-redundant protein sequence database which brings together experimental results, computed features, and scientific conclusions. UniProtKB/TrEMBL (unreviewed) contains protein sequences associated with computationally generated annotation and large-scale functional characterization that await full manual annotation. Users may browse by taxonomy, keyword, gene ontology, enzyme class or pathway.
The Universal Protein Resource (UniProt) is a comprehensive resource for protein sequence and annotation data. The UniProt databases are the UniProt Knowledgebase (UniProtKB), the UniProt Reference Clusters (UniRef), and the UniProt Archive (UniParc).