The Universal Protein Resource (UniProt) is a comprehensive resource for protein sequence and annotation data. The UniProt databases are the UniProt Knowledgebase (UniProtKB), the UniProt Reference Clusters (UniRef), and the UniProt Archive (UniParc). The UniProt consortium and host institutions EMBL-EBI, SIB Swiss Institute of Bioinformatics and PIR are committed to the long-term preservation of the UniProt databases.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The current database was downloaded on 27.09.2019 and has the data fields (columns) as described below:
# 1 Entry
# 2 Entry name
# 3 Status
# 4 Protein names
# 5 Gene names
# 6 Organism
# 7 Length
# 8 Cross-reference (KO)
# 9 Taxonomic lineage (PHYLUM)
# 10 Taxonomic lineage (SPECIES) # This field carries current and old* taxonomic classifications.
# 11 Taxonomic lineage (GENUS)
# 12 Taxonomic lineage (KINGDOM)
# 13 Taxonomic lineage (SUPERKINGDOM)
# 14 Cross-reference (OrthoDB)
# 15 Cross-reference (eggNOG)
*Details about the classification used in UNIPROT can be found at the link: https://www.uniprot.org/help/taxonomy
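As an illustration of how the fifteen columns listed above might be consumed, here is a minimal Python sketch. It assumes the table is tab-separated and starts with a header row whose names match the list exactly; the description does not state the file format, so both points are assumptions. The sample row is made up for demonstration.

```python
import csv
import io

# Column names in the order given by the description.
COLUMNS = [
    "Entry", "Entry name", "Status", "Protein names", "Gene names",
    "Organism", "Length", "Cross-reference (KO)",
    "Taxonomic lineage (PHYLUM)", "Taxonomic lineage (SPECIES)",
    "Taxonomic lineage (GENUS)", "Taxonomic lineage (KINGDOM)",
    "Taxonomic lineage (SUPERKINGDOM)", "Cross-reference (OrthoDB)",
    "Cross-reference (eggNOG)",
]


def read_rows(handle):
    """Yield one dict per data row; assumes a tab-separated file with a header."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    missing = [name for name in COLUMNS if name not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing expected columns: {missing}")
    for row in reader:
        row["Length"] = int(row["Length"])  # sequence length as an integer
        yield row


# Made-up example content (not real UniProt data).
sample_values = [
    "X00000", "EXAMPLE_SPECIES", "reviewed", "Example protein", "exaA",
    "Example organism", "123", "K00000;", "Examplephylum",
    "Example organism", "Examplegenus", "", "Bacteria", "", "",
]
sample = "\t".join(COLUMNS) + "\n" + "\t".join(sample_values) + "\n"
rows = list(read_rows(io.StringIO(sample)))
print(rows[0]["Entry"], rows[0]["Length"])
```

For a real file, the `io.StringIO(sample)` argument would be replaced by an open text file handle.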
A collection of protein sequence and functional information, and a resource for protein sequence and annotation data. The consortium preserves the UniProt databases: UniProt Knowledgebase (UniProtKB), UniProt Reference Clusters (UniRef), UniProt Archive (UniParc), and UniProt Proteomes. It is a collaboration between the European Bioinformatics Institute (EMBL-EBI), the SIB Swiss Institute of Bioinformatics and the Protein Information Resource. Swiss-Prot is a curated subset of UniProtKB.
Dataset Description
Dataset Summary
This dataset is a mirror of the UniProt/Swiss-Prot database. It contains the names and sequences of >500K proteins. This dataset was parsed from the FASTA file at https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz.
Supported Tasks and Leaderboards
None
Languages
English
Dataset Structure
Data Instances
Data Fields: id, description, sequence Data… See the full description on the dataset page: https://huggingface.co/datasets/damlab/uniprot.
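To make the three data fields concrete, the following Python sketch parses FASTA text into records with `id`, `description` and `sequence`. How the dataset authors actually split the header is not stated here; this sketch assumes the `id` is the first whitespace-delimited token of the header line and the `description` is the remainder. The example entries are made up.

```python
import io


def _record(header, chunks):
    # Assumption: id is the first token of the header, description is the rest.
    ident, _, description = header.partition(" ")
    return {"id": ident, "description": description, "sequence": "".join(chunks)}


def parse_fasta(handle):
    """Yield one dict per FASTA entry from an iterable of text lines."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield _record(header, chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines may wrap over several lines
    if header is not None:
        yield _record(header, chunks)


# Made-up FASTA content in the general UniProt header style.
example = """>sp|X00000|EXAMPLE_SPECIES Hypothetical example protein OS=Example species
MKTAYIAKQR
QISFVKSHFS
>sp|X00001|OTHER_SPECIES Another made-up entry
MSEQ
"""
records = list(parse_fasta(io.StringIO(example)))
print(records[0]["id"], len(records[0]["sequence"]))
```

For the compressed file named in the summary, the handle could be obtained with `gzip.open(path, "rt")` from the standard library, where `path` is a local copy of the download.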
Central repository for collection of functional information on proteins, with accurate and consistent annotation. In addition to capturing core data mandatory for each UniProtKB entry (mainly, the amino acid sequence, protein name or description, taxonomic data and citation information), as much annotation information as possible is added. This includes widely accepted biological ontologies, classifications and cross-references, and experimental and computational data. The UniProt Knowledgebase consists of two sections, UniProtKB/Swiss-Prot and UniProtKB/TrEMBL. UniProtKB/Swiss-Prot (reviewed) is a high quality manually annotated and non-redundant protein sequence database which brings together experimental results, computed features, and scientific conclusions. UniProtKB/TrEMBL (unreviewed) contains protein sequences associated with computationally generated annotation and large-scale functional characterization that await full manual annotation. Users may browse by taxonomy, keyword, gene ontology, enzyme class or pathway.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The UniProt Knowledgebase (UniProtKB) is a comprehensive resource for protein sequence and functional information with extensive cross-references to more than 120 external databases. Besides amino acid sequence and a description, it also provides taxonomic data and citation information.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overall, 25 descriptors (features) are calculated for 3797 unique proteins. The legend for each descriptor is given in the associated header file.
Columns 1-5 provide protein identifiers:
- ORF
- SGD Gene Name
- UniprotKB
- Matching PDB structure?
- PDB code of closest structure
Columns 6-8 correspond to protein expression:
- Integrated abundance in ppm
- log10 abundance
- bins of abundance (5 bins)
Columns 9-16 contain evolutionary rates averaged over:
- Full sequence
- Disordered residues
- Not Disordered residues
- Domain residues
- Not Domain residues
- Residues with PDB coordinates
- Surface residues (>25% relative ASA)
- Buried residues (
Curated component of UniProtKB (produced by the UniProt consortium). It contains hundreds of thousands of protein descriptions, including function, domain structure, subcellular location, post-translational modifications and functionally characterized variants.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset of the type protein from the database UniProtKB - version 2025_03
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Taxonomic groups and the identified number of sequences deposited to the UniProtKB database that contain domain architectures similar to (and including) the integrin α (β-propeller) superfamily.
UniProt Knowledge Base of protein sequences. The UniProt Knowledgebase is the central hub for the collection of functional information on proteins.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Predicted isoelectric point for all UniProtKB/TrEMBL proteins (April 2016), computed using 18 different algorithms. Over 63 million protein sequences. Compressed using 7zip. Primary reference: Kozlowski, LP (2016) Proteome-pI: proteome isoelectric point database. Nucleic Acids Research doi: 10.1093/nar/gkw978. Website: http://isoelectricpointdb.org
NEWT is the taxonomy database maintained by the UniProt group. It integrates taxonomy data compiled in the NCBI database and data specific to the UniProt Knowledgebase. Browse by hierarchy, List all, or Complete proteomes. Organisms are classified in a hierarchical tree structure. Our taxonomy database contains every node (taxon) of the tree. UniProtKB taxonomy data is manually curated: next to manually verified organism names, we provide a selection of external links, organism strains and viral host information. Species with protein sequences stored in the UniProt Knowledgebase are named according to UniProt nomenclature. We endeavour to maintain a list of manually curated species names for which protein sequence data is available. In particular, we have adopted a systematic convention for naming viral and bacterial strains and isolates. Links to external sites are chosen by the UniProt taxonomy team and show pictures and various scientific data of interest (taxonomy, biology, physiology,...).
The UniProtKB Sequence/Annotation Version Archive (UniSave) is a repository of UniProtKB/Swiss-Prot and UniProtKB/TrEMBL entry versions. Entries can be retrieved by entering a primary accession number or an entry name and pressing the Go! button. The result of the query is a list of entry versions with the UniProtKB database name, entry status, primary accession number, entry name, entry version, sequence version, release number and the release date, ordered by the release date, the latest version first. The entry version status can be 'incorporated', 'active', 'changed', 'replaced' or 'deleted'. An incorporated entry version is the first entry version added into UniProtKB, an active entry version is part of the latest public release, a changed entry version has been superseded by a newer entry version, a replaced entry has become secondary to another entry, and a deleted entry has been removed from the UniProtKB without becoming secondary to any other entry. For replaced entry versions, the status 'Replaced' can be clicked to return all entries, which have the given entry as a secondary entry. If a date is provided as part of the query then only the version of the entry that was current at that date is displayed. Entries can be viewed by clicking 'View' in the query results table. The '>' links can be used to access the earlier and later entry versions. The 'Back to List' link returns the user to the query results table. Selecting 'UniProtKB' or 'Fasta' and pressing 'Save' downloads the entry in flat file or fasta format. Comparison between entry versions is straightforward: selecting two entries and clicking the 'Compare Selected' button will show the differences between the two entries. Whenever comparisons are made a Smith-Waterman sequence alignment is computed using SSEARCH, and displayed at the bottom of the entry. The actual alignment is displayed only when the sequences are not identical.
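For readers unfamiliar with Smith-Waterman local alignment, the Python sketch below computes only the best local alignment score between two sequence versions. It is a simplified illustration, not what SSEARCH does: it uses a fixed match/mismatch score and a linear gap penalty (all values chosen arbitrarily here), whereas real tools typically use substitution matrices and affine gap penalties. The identity check mirrors the rule that an alignment is only of interest when the sequences differ.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local alignment score of strings a and b.

    Simplified scoring (assumed values): fixed match/mismatch, linear gaps.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def compare_versions(old_seq, new_seq):
    """Return None for identical sequences, else the local alignment score."""
    if old_seq == new_seq:
        return None
    return smith_waterman_score(old_seq, new_seq)


# Made-up sequence versions for demonstration.
identical = compare_versions("MKTAYIAK", "MKTAYIAK")
changed = compare_versions("MKTAYIAK", "MKTAYLAK")
print(identical, changed)
```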
The UniProt Knowledgebase (UniProtKB) is a comprehensive resource for protein sequence and functional information with extensive cross-references to more than 120 external databases. This collection is a subset of UniProtKB, and provides a means to reference isoform information.
An annotation program which aims to provide high-quality Gene Ontology (GO) annotations to proteins in the UniProt Knowledgebase (UniProtKB) and International Protein Index (IPI). It is a central dataset for other major multi-species databases, such as Ensembl and NCBI. Because of the multi-species nature of the UniProtKB, UniProtKB-GOA assists in the curation of 200,000 species. This involves electronic annotation and the integration of high-quality manual GO annotation from all GO Consortium model organism groups and specialist groups. Gene Association Files can be accessed from the Downloads section of the website.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This PostgreSQL database contains structured information extracted from the UniProt API (retrieved in April 2025). It includes:
126,582 proteins
123,518 sequences
494,072 embeddings generated with ProtT5, ProSTT5, ESM2, Ankh
623,134 GO term annotations (with evidence codes: EXP, IDA, IPI, IMP, IGI, IEP, TAS, IC)
Associated biological metadata
The data was extracted using the protein-information-system repository and is used within the FANTASIA pipeline for automated functional annotation of protein sequences.
The following UniProt API filter was applied to retrieve annotations:
https://www.uniprot.org/uniprotkb?query=%28+go_exp%3A*+OR+go_ida%3A*+OR+go_ipi%3A*+OR+go_imp%3A*+OR+go_igi%3A*+OR+go_iep%3A*+OR+go_tas%3A*+OR+go_ic%3A*%29
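The percent-encoded filter above is easier to read once decoded. The following Python sketch, using only the standard library, decodes the `query` parameter and recovers the evidence codes it selects. The `go_` prefix stripping is just a convenience for display and assumes every term has the form `go_<code>:*`, as in this URL.

```python
from urllib.parse import parse_qs, urlsplit

url = (
    "https://www.uniprot.org/uniprotkb?query=%28+go_exp%3A*+OR+go_ida%3A*"
    "+OR+go_ipi%3A*+OR+go_imp%3A*+OR+go_igi%3A*+OR+go_iep%3A*"
    "+OR+go_tas%3A*+OR+go_ic%3A*%29"
)

# parse_qs turns '+' into spaces and resolves %XX escapes.
query = parse_qs(urlsplit(url).query)["query"][0]

# Drop the surrounding parentheses and split the OR-joined terms.
terms = [term.strip() for term in query.strip("() ").split(" OR ")]

# Each term looks like 'go_exp:*'; keep the part between 'go_' and ':'.
codes = [term.split(":")[0][len("go_"):].upper() for term in terms]

print(query)
print(codes)
```

The recovered codes correspond to the evidence codes listed for the GO term annotations.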
To ensure embedding reproducibility, all sequences were processed with batch size = 1, avoiding discrepancies caused by padding artifacts common in PLMs like ProtT5.
To initialize the database, either of the following methods can be used:
Option 1: Using pg_restore
pg_restore -U usuario -h localhost -p 5432 -d BioData ./BioData_backup_2025_hq.tar
Option 2: Using the FANTASIA CLI
fantasia initialize --embeddings_url https://zenodo.org/records/15704357/files/PIS_2025_ankh_exp.tar?download=1
Notes:
Data set of manually annotated chordata-specific proteins as well as those that are widely conserved. The program keeps existing human entries up-to-date and broadens the manual annotation to other vertebrate species, especially model organisms, including great apes, cow, mouse, rat, chicken, zebrafish, as well as Xenopus laevis and Xenopus tropicalis. A draft of the complete human proteome is available in UniProtKB/Swiss-Prot and one of the current priorities of the Chordata protein annotation program is to improve the quality of human sequences provided. To this aim, they are updating sequences which show discrepancies with those predicted from the genome sequence. Dubious isoforms, sequences based on experimental artifacts and protein products derived from erroneous gene model predictions are also revisited. This work is in part done in collaboration with the Hinxton Sequence Forum (HSF), which allows active exchange between UniProt, HAVANA, Ensembl and HGNC groups, as well as with the RefSeq database. UniProt is a member of the Consensus CDS project and they are in the process of reviewing their records to support convergence towards a standard set of protein annotation. They also continuously update human entries with functional annotation, including novel structural, post-translational modification, interaction and enzymatic activity data. In order to identify candidates for re-annotation, they use, among others, information extraction tools such as the STRING database. In addition, they regularly add new sequence variants and maintain disease information. Indeed, this annotation program includes the Variation Annotation Program, the goal of which is to annotate all known human genetic diseases and disease-linked protein variants, as well as neutral polymorphisms.