UniProtKB/Swiss-Prot is the expertly curated component of UniProtKB (produced by the UniProt consortium). It contains hundreds of thousands of protein descriptions, including function, domain structure, subcellular location, post-translational modifications and functionally characterized variants.
The Universal Protein Resource (UniProt, http://www.uniprot.org) consortium is an initiative of the SIB Swiss Institute of Bioinformatics (SIB), the European Bioinformatics Institute (EBI) and the Protein Information Resource (PIR) to provide the scientific community with a central resource for protein sequences and functional information. The UniProt consortium maintains the UniProt KnowledgeBase (UniProtKB), updated every 4 weeks, and several supplementary databases including the UniProt Reference Clusters (UniRef) and the UniProt Archive (UniParc).
The Swiss-Prot section of the UniProt Knowledgebase (UniProtKB/Swiss-Prot) contains publicly available, expertly and manually annotated protein sequences obtained from a broad spectrum of organisms. Plant protein entries are produced within the framework of the Plant Proteome Annotation Program (PPAP), with an emphasis on characterized proteins of Arabidopsis thaliana and Oryza sativa. High-level annotations provided by UniProtKB/Swiss-Prot are widely used by automatic pipelines to predict annotations for newly available proteins.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The UniProt Knowledgebase (UniProtKB) is a comprehensive resource for protein sequence and functional information with extensive cross-references to more than 120 external databases. Besides amino acid sequence and a description, it also provides taxonomic data and citation information.
Collection of protein sequence and functional information. Resource for protein sequence and annotation data. Consortium for preservation of the UniProt databases: UniProt Knowledgebase (UniProtKB), UniProt Reference Clusters (UniRef), UniProt Archive (UniParc), and UniProt Proteomes. Collaboration between the European Bioinformatics Institute (EMBL-EBI), the SIB Swiss Institute of Bioinformatics and the Protein Information Resource. Swiss-Prot is a curated subset of UniProtKB.
The UniProt reviewed proteins database, uploaded with all columns for easier use in Kaggle notebooks. Every column has a description; if you have any questions, check UniProt Help, where every column has a full explanation.
For UniProt Species Proteomes check this dataset.
License: Creative Commons Attribution 4.0 International (CC BY 4.0) License
UniProtKB is the central hub for the collection of functional information on proteins, with accurate, consistent and rich annotation. In addition to capturing the core data mandatory for each UniProtKB entry (mainly, the amino acid sequence, protein name or description, taxonomic data and citation information), as much annotation information as possible is added. This includes widely accepted biological ontologies, classifications and cross-references, and clear indications of the quality of annotation in the form of evidence attribution of experimental and computational data.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The current database was downloaded on 27.09.2019 and has the data fields (columns) described below:
# 1 Entry
# 2 Entry name
# 3 Status
# 4 Protein names
# 5 Gene names
# 6 Organism
# 7 Length
# 8 Cross-reference (KO)
# 9 Taxonomic lineage (PHYLUM)
# 10 Taxonomic lineage (SPECIES) # This field carries current and old* taxonomic classifications.
# 11 Taxonomic lineage (GENUS)
# 12 Taxonomic lineage (KINGDOM)
# 13 Taxonomic lineage (SUPERKINGDOM)
# 14 Cross-reference (OrthoDB)
# 15 Cross-reference (eggNOG)
*Details about the classification used in UniProt can be found at: https://www.uniprot.org/help/taxonomy
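A minimal sketch of how a table with these fifteen columns might be read and summarised. The file format is not stated in the description, so a tab-separated file with a header row is an assumption here, and the helper names (`count_by_column`, `make_row`) and the sample rows are hypothetical placeholders rather than real UniProt data.

```python
import csv
import io
from collections import Counter

# Column names as listed in the description, in order.
COLUMNS = [
    "Entry", "Entry name", "Status", "Protein names", "Gene names",
    "Organism", "Length", "Cross-reference (KO)",
    "Taxonomic lineage (PHYLUM)", "Taxonomic lineage (SPECIES)",
    "Taxonomic lineage (GENUS)", "Taxonomic lineage (KINGDOM)",
    "Taxonomic lineage (SUPERKINGDOM)", "Cross-reference (OrthoDB)",
    "Cross-reference (eggNOG)",
]


def count_by_column(handle, column):
    """Count rows per distinct value of `column` in a tab-separated table."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = set(COLUMNS) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    return Counter(row[column] for row in reader)


def make_row(entry, phylum):
    """Build one placeholder row; all other fields are left empty."""
    row = {name: "" for name in COLUMNS}
    row["Entry"] = entry
    row["Taxonomic lineage (PHYLUM)"] = phylum
    return "\t".join(row[name] for name in COLUMNS)


# Hypothetical sample table (made-up accessions and phyla, illustration only).
sample = "\n".join([
    "\t".join(COLUMNS),
    make_row("X00001", "PhylumA"),
    make_row("X00002", "PhylumA"),
    make_row("X00003", "PhylumB"),
])

counts = count_by_column(io.StringIO(sample), "Taxonomic lineage (PHYLUM)")
print(counts.most_common())
```

With a real download, the `io.StringIO(sample)` argument would be replaced by an open file handle, provided the file really is tab-separated with these header names.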
The Universal Protein Resource (UniProt) is a comprehensive resource for protein sequence and annotation data. The UniProt databases are the UniProt Knowledgebase (UniProtKB), the UniProt Reference Clusters (UniRef), and the UniProt Archive (UniParc).
Central repository for the collection of functional information on proteins, with accurate and consistent annotation. In addition to capturing the core data mandatory for each UniProtKB entry (mainly, the amino acid sequence, protein name or description, taxonomic data and citation information), as much annotation information as possible is added. This includes widely accepted biological ontologies, classifications and cross-references, and experimental and computational data. The UniProt Knowledgebase consists of two sections, UniProtKB/Swiss-Prot and UniProtKB/TrEMBL. UniProtKB/Swiss-Prot (reviewed) is a high-quality, manually annotated and non-redundant protein sequence database that brings together experimental results, computed features, and scientific conclusions. UniProtKB/TrEMBL (unreviewed) contains protein sequences associated with computationally generated annotation and large-scale functional characterization that await full manual annotation. Users may browse by taxonomy, keyword, gene ontology, enzyme class or pathway.
The UniProtKB/Swiss-Prot database contains protein sequence information.
Curated component of UniProtKB (produced by the UniProt consortium). It contains hundreds of thousands of protein descriptions, including function, domain structure, subcellular location, post-translational modifications and functionally characterized variants.
The UniProt Knowledgebase (UniProtKB) is a comprehensive resource for protein sequence and functional information with extensive cross-references to more than 120 external databases. This collection is a subset of UniProtKB, and provides a means to reference isoform information.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset of the type protein from the database UniProtKB - version 2021_04
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Predicted isoelectric points for all UniProtKB/TrEMBL proteins (April 2016), computed using 18 different algorithms. Over 63 million protein sequences. Compressed using 7zip.
Primary reference: Kozlowski, LP (2016) Proteome-pI: proteome isoelectric point database. Nucleic Acids Research. doi: 10.1093/nar/gkw978
Website: http://isoelectricpointdb.org
The cross-references section of UniProtKB entries displays explicit and implicit links to databases such as nucleotide sequence databases, model organism databases and genomics and proteomics resources.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The description that follows is from the official UniProt embeddings page, which is also the original host of this dataset.
Protein embeddings are a way to encode functional and structural properties of a protein, mostly from its sequence only, in a machine-friendly format (vector representation). Generating such embeddings is computationally expensive, but once computed they can be leveraged for different tasks, such as sequence similarity search, sequence clustering, and sequence classification.
UniProt provides raw embeddings (per-protein, mean-pooled, computed with the ProtT5 model) for UniProtKB/Swiss-Prot.
Note: Protein sequences longer than 12k residues are excluded because of GPU memory limitations (this concerns only a handful of proteins).
Sample code
The embeddings.h5 files store the embeddings as key-value pairs. The key is the protein accession number and the value is the embedding vector. The following code snippet shows how to read and iterate over an embeddings file in Python.
import numpy as np
import h5py

# Open the HDF5 file read-only; keys are accessions, values are vectors.
with h5py.File("path/to/embeddings.h5", "r") as file:
    print(f"number of entries: {len(file.items())}")
    # Each value is an HDF5 dataset; convert it to a NumPy array to compute on it.
    for sequence_id, embedding in file.items():
        print(
            f" id: {sequence_id}, "
            f" embeddings shape: {embedding.shape}, "
            f" embeddings mean: {np.array(embedding).mean()}"
        )
Sample output (SARS-CoV-2 embeddings from release 2022_04), per-protein file:
number of entries: 17
 id: A0A663DJA2, embeddings shape: (1024,), embeddings mean: 0.0006136894226074219
 id: P0DTC1, embeddings shape: (1024,), embeddings mean: 0.0011968612670898438
 id: P0DTC2, embeddings shape: (1024,), embeddings mean: 0.001041412353515625
SOURCE:
https://www.uniprot.org/help/embeddings
https://www.uniprot.org/help/downloads#embeddings
Reviewed (Swiss-Prot) - per-protein: https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/embeddings/uniprot_sprot/per-protein.h5
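The description mentions sequence similarity search as one use of these embeddings. The sketch below shows one common way such a search could be done, by ranking entries by cosine similarity to a query vector. Cosine similarity is an assumption here (the description does not name a metric), and the function names, the ids and the 3-dimensional toy vectors are hypothetical. Plain Python lists are used so that the example runs without the HDF5 file.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_id, embeddings):
    """Return (id, similarity) pairs for all other entries, best match first."""
    query = embeddings[query_id]
    scores = [
        (other_id, cosine_similarity(query, vector))
        for other_id, vector in embeddings.items()
        if other_id != query_id
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy vectors with made-up ids; the per-protein embeddings in the sample
# output have shape (1024,), so real vectors would be much longer.
toy = {
    "protA": [1.0, 0.0, 0.0],
    "protB": [0.9, 0.1, 0.0],
    "protC": [0.0, 1.0, 0.0],
}
ranking = rank_by_similarity("protA", toy)
print(ranking[0][0])
```

With the real file, the dictionary could be filled from `file.items()` by converting each value with `np.array(embedding)`, as the reading snippet does when it computes the mean.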
NEWT is the taxonomy database maintained by the UniProt group. It integrates taxonomy data compiled in the NCBI database and data specific to the UniProt Knowledgebase. Browse by hierarchy, List all, or Complete proteomes. Organisms are classified in a hierarchical tree structure. Our taxonomy database contains every node (taxon) of the tree. UniProtKB taxonomy data is manually curated: next to manually verified organism names, we provide a selection of external links, organism strains and viral host information. Species with protein sequences stored in the UniProt Knowledgebase are named according to UniProt nomenclature. We endeavour to maintain a list of manually curated species names for which protein sequence data is available. In particular, we have adopted a systematic convention for naming viral and bacterial strains and isolates. Links to external sites are chosen by the UniProt taxonomy team and show pictures and various scientific data of interest (taxonomy, biology, physiology, ...).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overall, 25 descriptors (features) are calculated for 3797 unique proteins. The legend for each descriptor is given in the associated header file.
Columns 1-5 provide protein identifiers:
- ORF
- SGD Gene Name
- UniprotKB
- Matching PDB structure?
- PDB code of closest structure
Columns 6-8 correspond to protein expression:
- Integrated abundance in ppm
- log10 abundance
- bins of abundance (5 bins)
Columns 9-16 contain evolutionary rates averaged over:
- Full sequence
- Disordered residues
- Not Disordered residues
- Domain residues
- Not Domain residues
- Residues with PDB coordinates
- Surface residues (>25% relative ASA)
- Buried residues (
The UniProtKB Sequence/Annotation Version Archive (UniSave) is a repository of UniProtKB/Swiss-Prot and UniProtKB/TrEMBL entry versions. Entries can be retrieved by entering a primary accession number or an entry name and pressing the Go! button. The result of the query is a list of entry versions with the UniProtKB database name, entry status, primary accession number, entry name, entry version, sequence version, release number and the release date, ordered by the release date, the latest version first. The entry version status can be 'incorporated', 'active', 'changed', 'replaced' or 'deleted'. An incorporated entry version is the first entry version added into UniProtKB, an active entry version is part of the latest public release, a changed entry version has been superseded by a newer entry version, a replaced entry has become secondary to another entry, and a deleted entry has been removed from UniProtKB without becoming secondary to any other entry. For replaced entry versions, the status 'Replaced' can be clicked to return all entries that have the given entry as a secondary entry. If a date is provided as part of the query, then only the version of the entry that was current at that date is displayed. Entries can be viewed by clicking 'View' in the query results table. The '>' links can be used to access the earlier and later entry versions. The 'Back to List' link returns the user to the query results table. Selecting 'UniProtKB' or 'Fasta' and pressing 'Save' downloads the entry in flat file or FASTA format. Comparison between entry versions is straightforward: selecting two entries and clicking the 'Compare Selected' button will show the differences between the two entries. Whenever comparisons are made, a Smith-Waterman sequence alignment is computed using SSEARCH and displayed at the bottom of the entry. The actual alignment is displayed only when the sequences are not identical.
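The date-based query described here (returning only the version that was current on a given date) can be illustrated with a small lookup. This is a sketch of the idea only, not UniSave's implementation: the function name, the tuple layout and the release dates are hypothetical, and deleted or replaced entries are not modelled.

```python
from bisect import bisect_right
from datetime import date


def version_current_at(versions, when):
    """Return the entry version that was current on `when`, or None.

    `versions` is a list of (release_date, entry_version) tuples.
    """
    ordered = sorted(versions)  # oldest release first
    release_dates = [released for released, _ in ordered]
    # Index of the first release strictly after `when`.
    index = bisect_right(release_dates, when)
    if index == 0:
        return None  # the entry did not exist yet on that date
    return ordered[index - 1][1]


# Hypothetical release history for a single entry (dates are made up).
history = [
    (date(2020, 1, 15), 1),
    (date(2020, 6, 10), 2),
    (date(2021, 2, 3), 3),
]
print(version_current_at(history, date(2020, 12, 31)))
```

The list is sorted oldest first only to make the binary search simple; the query results described here are listed with the latest version first.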
The subcellular locations in which a protein is found are described in UniProtKB entries with a controlled vocabulary, which also includes membrane topology and orientation terms. You may search in subcellular locations or list them all along with their definitions (490). By default, searching the subcellular locations will look for matches in both name and definition.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
UniProt RDF
Dataset Description
Comprehensive protein knowledgebase with functional annotations
Original Source: ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/rdf/uniprotkb_reviewed_eukaryota_opisthokonta_metazoa_33208_0.rdf.xz
Dataset Summary
This dataset contains RDF triples from UniProt RDF converted to HuggingFace dataset format for easy use in machine learning pipelines.
Format: Originally RDF, converted to HuggingFace Dataset
Size: 0.392 GB…
See the full description on the dataset page: https://huggingface.co/datasets/CleverThis/uniprot.