100+ datasets found

UniProtKB/Swiss-Prot Protein Embeddings
kaggle.com
zip
Updated Apr 23, 2023
Cite
Dan Ofer (2023). UniProtKB/Swiss-Prot Protein Embeddings [Dataset]. https://www.kaggle.com/datasets/danofer/uniprotkbswiss-prot-protein-embeddings/data
Available download formats: zip (2087271680 bytes)
Dataset updated
Apr 23, 2023
Authors
Dan Ofer
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
The description below is taken from the official UniProt embeddings page, which is also the original host of this dataset.

Protein embeddings are a way to encode functional and structural properties of a protein, mostly from its sequence only, in a machine-friendly format (vector representation). Generating such embeddings is computationally expensive, but once computed they can be leveraged for different tasks, such as sequence similarity search, sequence clustering, and sequence classification.

UniProt provides raw embeddings (per-protein, mean-pooled, computed with the ProtT5 model) for UniProtKB/Swiss-Prot.

Note: Protein sequences longer than 12k residues are excluded because of GPU memory limitations (this concerns only a handful of proteins).

Sample code

The embeddings.h5 files store the embeddings as key-value pairs. The key is the protein accession number and the value is the embedding vector. The following code snippet shows how to read and iterate over an embeddings file in Python.

    import numpy as np
    import h5py

    # Open the HDF5 file read-only; each key is a protein accession.
    with h5py.File("path/to/embeddings.h5", "r") as file:
        print(f"number of entries: {len(file.items())}")
        for sequence_id, embedding in file.items():
            print(
                f" id: {sequence_id}, "
                f" embeddings shape: {embedding.shape}, "
                f" embeddings mean: {np.array(embedding).mean()}"
            )

Sample output (SARS-CoV-2 embeddings from release 2022_04) per-protein file:

    number of entries: 17
     id: A0A663DJA2, embeddings shape: (1024,), embeddings mean: 0.0006136894226074219
     id: P0DTC1, embeddings shape: (1024,), embeddings mean: 0.0011968612670898438
     id: P0DTC2, embeddings shape: (1024,), embeddings mean: 0.001041412353515625
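
The description mentions sequence similarity search as one use of these embeddings. A minimal sketch of that idea, assuming the per-protein vectors have already been loaded into a plain dictionary keyed by accession: it ranks proteins by cosine similarity to a query. The function names and the three-dimensional toy vectors are hypothetical and only illustrate the mechanics; the real vectors have 1024 dimensions, and cosine similarity is one common choice of measure, not one prescribed by UniProt.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


def rank_by_similarity(
    query_id: str, embeddings: Dict[str, Sequence[float]]
) -> List[Tuple[str, float]]:
    """Return (accession, similarity) pairs, most similar to the query first."""
    query = embeddings[query_id]
    scored = [
        (acc, cosine_similarity(query, vec))
        for acc, vec in embeddings.items()
        if acc != query_id
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy data with made-up identifiers and values (not real embeddings).
toy = {
    "PROT_A": [1.0, 0.0, 0.0],
    "PROT_B": [0.9, 0.1, 0.0],
    "PROT_C": [0.0, 1.0, 0.0],
}
ranking = rank_by_similarity("PROT_A", toy)
```

For a collection the size of Swiss-Prot, a vectorized or indexed approach would be preferable to this pairwise loop.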

SOURCE:
https://www.uniprot.org/help/embeddings
https://www.uniprot.org/help/downloads#embeddings
Reviewed (Swiss-Prot) - per-protein: https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/embeddings/uniprot_sprot/per-protein.h5
UniprotKB/SwissProt
resodate.org
service.tib.eu
Updated Dec 16, 2024
Cite
Boutet; Lieberherr; Tognolli; Schneider; Bansal; Bridge; Poux; Bougueleret; Xenarios (2024). UniprotKB/SwissProt [Dataset]. https://resodate.org/resources/aHR0cHM6Ly9zZXJ2aWNlLnRpYi5ldS9sZG1zZXJ2aWNlL2RhdGFzZXQvdW5pcHJvdGtiLXN3aXNzcHJvdA==
Dataset updated
Dec 16, 2024
Dataset provided by
Leibniz Data Manager
Authors
Boutet; Lieberherr; Tognolli; Schneider; Bansal; Bridge; Poux; Bougueleret; Xenarios
Description
The UniprotKB/SwissProt database contains protein sequence information.
Peptide Sequence Database
dknet.org
scicrunch.org
Updated Jan 29, 2022
Cite
(2022). Peptide Sequence Database [Dataset]. http://identifiers.org/RRID:SCR_005764
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_005764
Dataset updated
Jan 29, 2022
Description
The Peptide Sequence Database contains putative peptide sequences from human, mouse, rat, and zebrafish. Compressed to eliminate redundancy, these are about 40-fold smaller than a brute-force enumeration. Current and old releases are available for download. Each species' peptide sequence database comprises peptide sequence data from relevant species-specific UniGene and IPI clusters, plus all sequences from their constituent EST, mRNA and protein sequence databases, namely RefSeq proteins and mRNAs, UniProt's SwissProt and TrEMBL, GenBank mRNA, ESTs, and high-throughput cDNAs, HInv-DB, VEGA, EMBL, IPI protein sequences, plus the enumeration of all combinations of UniProt sequence variants, Met loss PTM, and signal peptide cleavages. The README file contains some information about the non-amino-acid symbols O (digest site corresponding to a protein N- or C-terminus) and J (no digest sequence join) used in these peptide sequence databases and information about how to configure various search engines to use them. Some search engines handle (very) long sequences badly and in some cases must be patched to use these peptide sequence databases. All search engines supported by the PepArML meta-search engine can (or can be patched to) successfully search these peptide sequence databases.
UniProt
scicrunch.org
dknet.org
Updated Jan 29, 2022
Cite
(2022). UniProt [Dataset]. http://identifiers.org/RRID:SCR_002380
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_002380
Dataset updated
Jan 29, 2022
Description
Collection of data of protein sequence and functional information. Resource for protein sequence and annotation data. Consortium for preservation of the UniProt databases: UniProt Knowledgebase (UniProtKB), UniProt Reference Clusters (UniRef), and UniProt Archive (UniParc), UniProt Proteomes. Collaboration between European Bioinformatics Institute (EMBL-EBI), SIB Swiss Institute of Bioinformatics and Protein Information Resource. Swiss-Prot is a curated subset of UniProtKB.
ExPASy Biochemical Pathways
neuinfo.org
scicrunch.org
Updated Jan 29, 2022
Cite
(2022). ExPASy Biochemical Pathways [Dataset]. http://identifiers.org/RRID:SCR_007944
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_007944
Dataset updated
Jan 29, 2022
Description
The ExPASy (Expert Protein Analysis System) proteomics server of the Swiss Institute of Bioinformatics (SIB) is dedicated to the analysis of protein sequences and structures as well as 2-D PAGE. It is a curated protein sequence database which strives to provide a high level of annotation, a minimal level of redundancy and high level of integration with other databases. Recent developments of the database include format and content enhancements, cross-references to additional databases, new documentation files and improvements to TrEMBL, a computer-annotated supplement to SWISS-PROT.
UniProtKB
rrid.site
dknet.org
Updated Nov 12, 2025
Cite
(2025). UniProtKB [Dataset]. http://identifiers.org/RRID:SCR_004426/resolver?q=&i=rrid
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_004426 https://identifiers.org/RRID:SCR_004426/resolver?q=&i=rrid
Dataset updated
Nov 12, 2025
Description
Central repository for collection of functional information on proteins, with accurate and consistent annotation. In addition to capturing core data mandatory for each UniProtKB entry (mainly, the amino acid sequence, protein name or description, taxonomic data and citation information), as much annotation information as possible is added. This includes widely accepted biological ontologies, classifications and cross-references, and experimental and computational data. The UniProt Knowledgebase consists of two sections, UniProtKB/Swiss-Prot and UniProtKB/TrEMBL. UniProtKB/Swiss-Prot (reviewed) is a high quality manually annotated and non-redundant protein sequence database which brings together experimental results, computed features, and scientific conclusions. UniProtKB/TrEMBL (unreviewed) contains protein sequences associated with computationally generated annotation and large-scale functional characterization that await full manual annotation. Users may browse by taxonomy, keyword, gene ontology, enzyme class or pathway.
PSSH2 - database of protein sequence-to-structure homologies (including...
data.niaid.nih.gov
zenodo.org
Updated Feb 11, 2022
Cite
Andrea Schafferhans; Sean O'Donoghue; Neblina Sikta; Sandeep Kaur (2022). PSSH2 - database of protein sequence-to-structure homologies (including Sars-CoV-2 structures) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4279163
Dataset updated
Feb 11, 2022
Dataset provided by
Garvan Institute of Medical Research
HSWT
Authors
Andrea Schafferhans; Sean O'Donoghue; Neblina Sikta; Sandeep Kaur
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Protein sequence and structure data

This data set contains data from Uniprot (in the files called protein_sequence, protein_synonyms, protein_names, organism_synonyms) and PDB (in the files called PDB and PDB_chain) as used by the Aquaria web resource at the time of download (2022-02-08).

The PSSH2 data set

PSSH2 is a database of protein sequence-to-structure homologies based on HHblits, an alignment method employing iterative comparisons of hidden Markov models (HMMs). To ensure the highest possible final alignment quality for matches in Aquaria using HHblits, we first calculate HMM profiles for each unique PDB sequence (PDB_full) and also for each unique Swiss-Prot sequence. We generated PSSH2 using HHblits to find similarities between HMMs from PDB and HMMs from UniProt sequences.

Calculating PSSH2

The Swissprot and PDB data was downloaded in November 2021. Generating PSSH2: We used UniRef30_2021_03 (originally called UniRef30_2021_06) from HH-suite, a database of non-redundant UniProt sequence clusters in which the highest pairwise sequence identity between clusters was 30%. The HHblits code and the code for running the calculations was retrieved from git (https://github.com/soedinglab/hh-suite.git and https://github.com/aschafu/PSSH2.git respectively) at the respective time of calculation in the timeframe until December 2021.

PDB based sequence-to-structure alignments

In addition to the PSSH2 data, new PDB structures were retrieved based on the primary accession of the proteins, by querying for all chains in all PDB entries with exact matches using the sequence cross references records given in PDB. Sequence-to-structure alignments were then created, again based on information provided in each PDB entry. These are contained in the PDBchain data.

This data covers sequences and PDB structures in the timeframe until February 2022.

Evaluating PSSH2

The resulting alignment data was analysed using CATH domain assignments downloaded from /cath/releases/all-releases/v4_2_0/cath-classification-data/ to define correct hits and false hits:

The set of query sequences is defined by the CATH non-redundant S40_overlap_60 dataset (ftp://orengoftp.biochem.ucl.ac.uk/cath/releases/all-releases/v4_2_0/non-redundant-data-sets/)

The set of all expected hits consists of all PDB structures containing a domain with the same CATH code, either if contained in the set of processed sequences (-> all) or only if also contained in the set of non-redundant sequences (-> nr40).

The set of true positives is defined by sharing the same CATH code up to the level of homology ("CATH") or up to the level of topology ("CAT").

The data was evaluated with respect to false discovery rate (FDR) and recall (true positive rate TPR) by cumulatively considering all hits with an E-value below the threshold ("C") or in bins with an E-value between the threshold and one tenth of the threshold ("B"). This evaluation was carried out for the data obtained in November 2021 (202111) as well as previous data from October 2020 (202010), February 2020 (202002) and September 2017 (201709). The results are collected in PSSH CATH validation.csv.
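
The two evaluation modes described above can be sketched as follows. This is only an illustration, not the authors' evaluation code: the function names, the hit representation, and the treatment of the bin boundaries (lower bound inclusive, upper bound exclusive) are assumptions. Mode "C" cumulatively selects all hits with an E-value below the threshold, and mode "B" selects hits with an E-value between one tenth of the threshold and the threshold.

```python
from typing import Iterable, List, Optional, Tuple

Hit = Tuple[float, bool]  # (E-value, whether the hit counts as a true positive)


def select_hits(hits: Iterable[Hit], threshold: float, mode: str) -> List[Hit]:
    """Select hits cumulatively ("C") or within one E-value decade ("B")."""
    if mode == "C":
        return [h for h in hits if h[0] < threshold]
    if mode == "B":
        return [h for h in hits if threshold / 10 <= h[0] < threshold]
    raise ValueError("mode must be 'C' or 'B'")


def fdr_and_recall(
    hits: Iterable[Hit], threshold: float, n_expected: int, mode: str = "C"
) -> Tuple[Optional[float], float]:
    """Return (FDR, recall); FDR is None when no hit is selected."""
    selected = select_hits(hits, threshold, mode)
    true_pos = sum(1 for _, is_true in selected if is_true)
    false_pos = len(selected) - true_pos
    fdr = false_pos / len(selected) if selected else None
    recall = true_pos / n_expected  # true positive rate (TPR)
    return fdr, recall


# Made-up hits for one query that has four expected hits in total.
example_hits = [(1e-30, True), (1e-12, True), (5e-4, True), (2e-3, False), (0.5, False)]
cumulative = fdr_and_recall(example_hits, 1e-2, n_expected=4, mode="C")
binned = fdr_and_recall(example_hits, 1e-2, n_expected=4, mode="B")
```

In the cumulative mode four of the example hits fall below the threshold, three of them true; in the binned mode only the single false hit at 2e-3 is counted.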

Known errors

Due to a processing error, the profile of PDB structure 5fia A / B (sequence md5 052667679fc644184f40063c7602c9e1) is incomplete in the pdb_full hhblits database, which led to further errors in generating sequence-based alignments for 1vtm P (sequence md5 c844aff103449363cb8489c78c58ebf1) and 434t A / B (sequence md5 d67aa1c3a36492c719cb48b5e7ecc624).
PSSH2 - database of protein sequence-to-structure homologies - Sars-CoV-2...
zenodo.org
application/gzip, csv
Updated Feb 10, 2022
Cite
Andrea Schafferhans; Sean O'Donoghue (2022). PSSH2 - database of protein sequence-to-structure homologies - Sars-CoV-2 subset [Dataset]. http://doi.org/10.5281/zenodo.4916895
Available download formats: application/gzip, csv
Unique identifier
https://doi.org/10.5281/zenodo.4916895
Dataset updated
Feb 10, 2022
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Andrea Schafferhans; Sean O'Donoghue
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The PSSH2 data set

PSSH2 is a database of protein sequence-to-structure homologies based on HHblits, an alignment method employing iterative comparisons of hidden Markov models (HMMs). To ensure the highest possible final alignment quality for matches in Aquaria using HHblits, we first calculate HMM profiles for each unique PDB sequence (PDB_full) and also for each unique Swiss-Prot sequence. We generated PSSH2 using HHblits to find similarities between HMMs from PDB and HMMs from UniProt sequences.

This dataset contains a subset of the usual PSSH2 database, including only the proteins relevant to visualise Sars-CoV-2 structures.
It contains Swissprot and PDB data used for generating PSSH2 along with the PSSH2 data itself. This consists of the sequence-to-structure alignments used in Aquaria (aquaria.ws) and also for the Covid19 resource of Aquaria (http://aquaria.ws/covid).

Calculating PSSH2

The bulk of the Swissprot and PDB data was downloaded in October 2020, but incremental updates, especially those related to Covid19, were added until April 2021.
Generating PSSH2: We used Uniclust30 from HH-suite, a database of non-redundant UniProt sequence clusters in which the highest pairwise sequence identity between clusters was 30% (http://gwdu111.gwdg.de/~compbiol/uniclust/2020_03/UniRef30_2020_03_hhsuite.tar.gz). The HHblits code and the code for running the calculations was retrieved from git (https://github.com/soedinglab/hh-suite.git and https://github.com/aschafu/PSSH2.git respectively) at the respective time of calculation in the timeframe until April 2021.
mESC shotgun and positional proteomics based on deep proteome sequence...
data.niaid.nih.gov
ebi.ac.uk
xml
Updated Feb 25, 2013
Cite
Gerben Menschaert (2013). mESC shotgun and positional proteomics based on deep proteome sequence database (derived from RIBOseq data) [Dataset]. https://data.niaid.nih.gov/resources?id=pxd000124
Available download formats: xml
Dataset updated
Feb 25, 2013
Dataset provided by
Faculty of Bioscience Engineering
Authors
Gerben Menschaert
Variables measured
Proteomics
Description
Shotgun and positional proteomics study of a mouse embryonic stem cell line. We devised a proteogenomic approach constructing a custom protein sequence search space, built from both SwissProt and RIBO-seq derived translation products, applicable for LC-MSMS spectrum identification. To record the impact of using the constructed deep proteome database we performed two alternative MS-based proteomic strategies: (I) a regular shotgun proteomic and (II) an N-terminal COFRADIC approach. The obtained fragmentation spectra were searched against the custom database (combination of UniProtKB-SwissProt and RIBO-seq derived translation sequences) using three different search engines: OMSSA (version 2.1.9), X!Tandem (TORNADO, version 2010.01.01.04) and Mascot (version 2.3). The first two were run from the SearchGUI graphical user interface (version 1.10.4). A combination of X!Tandem and Mascot was used for the N-terminal COFRADIC analysis, and a combination of all three search engines for the shotgun proteome analysis. Note that OMSSA cannot cope with the protease setting semi-ArgC/P needed to analyze N-terminal COFRADIC data. For the shotgun proteome data, trypsin was set as the cleavage enzyme, allowing for one missed cleavage; singly to triply charged precursors (Mascot) or singly to quadruply charged precursors (X!Tandem/OMSSA) were taken into account; and the precursor and fragment mass tolerances were set to 10 ppm and 0.5 Da, respectively. Methionine oxidation to methionine-sulfoxide, pyroglutamate formation of N-terminal glutamine and acetylation (protein N-terminus) were set as variable modifications. For the N-terminal COFRADIC analysis the protease setting semi-ArgC/P (Arg-C specificity with arginine-proline cleavage allowed) was used. No missed cleavages were allowed and the precursor and fragment mass tolerances were also set to 10 ppm and 0.5 Da, respectively.
Carbamidomethylation of cysteine, methionine oxidation to methionine-sulfoxide and 13C3D2-acetylation of lysines were set as fixed modifications. Peptide N-terminal acetylation or 13C3D2-acetylation and pyroglutamate formation of N-terminal glutamine were set as variable modifications, and the instrument setting was put on ESI-TRAP. Protein and peptide identification, in addition to data interpretation, was done using the PeptideShaker algorithm (http://code.google.com/p/peptide-shaker, version 0.18.3), setting the false discovery rate to 1% at all levels (protein, peptide, and peptide to spectrum matching). The aforementioned tools and algorithms (SearchGUI, X!Tandem, OMSSA, and PeptideShaker) are freely available as open source.
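
The two tolerance settings quoted above work differently: a ppm tolerance is relative to the theoretical value, while a Da tolerance is absolute. A minimal sketch of both checks, with hypothetical function names and made-up m/z values; this is not how any of the named search engines implements matching internally.

```python
def within_ppm(observed: float, theoretical: float, tol_ppm: float = 10.0) -> bool:
    """True if the relative error, in parts per million, is within tol_ppm."""
    error_ppm = (observed - theoretical) / theoretical * 1e6
    return abs(error_ppm) <= tol_ppm


def within_da(observed: float, theoretical: float, tol_da: float = 0.5) -> bool:
    """True if the absolute error, in Dalton, is within tol_da."""
    return abs(observed - theoretical) <= tol_da


# Precursor check at 10 ppm: at m/z 500 the window is only +/- 0.005.
precursor_ok = within_ppm(500.004, 500.0)   # about 8 ppm off
precursor_bad = within_ppm(500.006, 500.0)  # about 12 ppm off

# Fragment check at 0.5 Da: the window is the same width at any m/z.
fragment_ok = within_da(200.4, 200.0)
fragment_bad = within_da(200.6, 200.0)
```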
NCBI Protein Database
rrid.site
neuinfo.org
Updated Jan 29, 2022
Cite
(2001). NCBI Protein Database [Dataset]. http://identifiers.org/RRID:SCR_003257
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_003257
Dataset updated
Jan 29, 2022
Description
Databases of protein sequences and 3D structures of proteins. Collection of sequences from several sources, including translations from annotated coding regions in GenBank, RefSeq and TPA, as well as records from SwissProt, PIR, PRF, and PDB.
Hilsa protein database
figshare.com
txt
Updated Nov 18, 2022
Cite
Molbio Lab BMB (2022). Hilsa protein database [Dataset]. http://doi.org/10.6084/m9.figshare.21579144.v1
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.21579144.v1
Dataset updated
Nov 18, 2022
Dataset provided by
Figshare (http://figshare.com/)
figshare
Authors
Molbio Lab BMB
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
These protein sequences were predicted from Hilsa transcriptome data using the TransDecoder tool (version 5.5.0). They were then annotated using a homology-based similarity search against the latest Swiss-Prot database.
Data from: PROSITE
prosite.expasy.org
identifiers.org
Updated Oct 15, 2025
Cite
(2025). PROSITE [Dataset]. https://prosite.expasy.org/
Dataset updated
Oct 15, 2025
Description
PROSITE consists of documentation entries describing protein domains, families and functional sites as well as associated patterns and profiles to identify them. PROSITE is complemented by ProRule, a collection of rules based on profiles and patterns, which increases the discriminatory power of profiles and patterns by providing additional information about functionally and/or structurally critical amino acids.
The GenBank Non-Redundant Protein Sequence Database (NRDB)
fungidb.org
piroplasmadb.org
Updated Aug 16, 2019
Cite
(2019). The GenBank Non-Redundant Protein Sequence Database (NRDB) [Dataset]. https://fungidb.org/fungidb/app/record/dataset/DS_a7163a9f0d
Dataset updated
Aug 16, 2019
Description
The GenBank non-redundant protein sequence database (NRDB) is a component of the NCBI BLAST databases and contains entries from GenPept, Swissprot, PIR, PRF, PDB and NCBI RefSeq.
CPBI_seqdb_demo sample QFO sequence library
data.niaid.nih.gov
Updated Jan 21, 2020
Cite
William R. Pearson (2020). CPBI_seqdb_demo sample QFO sequence library [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_377027
Dataset updated
Jan 21, 2020
Dataset provided by
U. of Virginia
Authors
William R. Pearson
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
A medium-sized (approx. 1 million entry) protein sequence database constructed from the NCBI 'nr' (Jan, 2017) database by selecting Uniprot (SwissProt), RefSeq, and PDB entries for 66 species (taxon_id's) from the Quest for Orthologs organism set. These files are designed to be used in conjunction with scripts and SQL files to construct the seqdb_demo database, as described in Current Protocols in Bioinformatics Unit 3.9 (revised Spring 2017). The files are:

qfo_demo.gz - a fasta-format sequence library with the current NR Defline format (gzip compressed)

qfo_prot.accession2taxonid.gz, qfo_pdb.accession2taxid.gz - tables that map accessions to taxon_id's and gi-numbers, similar to those available in the NCBI pub/taxonomy/accession2taxid/prot.accession2taxid and pdb.accession2taxid files (gzip compressed).
Alternative Splicing Database
neuinfo.org
dknet.org
Updated Feb 1, 2001
Cite
(2001). Alternative Splicing Database [Dataset]. http://identifiers.org/RRID:SCR_007555
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_007555
Dataset updated
Feb 1, 2001
Description
It has been established with the intention of assembling in a central, publicly accessible site information about alternatively spliced genes, their products and expression patterns. Version 2.1 of ASDB consists of two divisions, ASDB(proteins), which contains amino acid sequences, and ASDB(nucleotides) with genomic sequences. SWISS-PROT uses two formats for description of alternative splicing. Thus the protein sequences were selected from SWISS-PROT using full text search for both the words alternative splicing (usually in the CC lines) and varsplic (in the FT lines). In order to group proteins that could arise by alternative splicing of the same gene, we developed a clustering procedure. Two proteins were linked if they had a common fragment of at least 20 amino acids, and clusters were initially defined as maximal connected groups of linked proteins. It turned out that some clusters were chimeric, in the sense that they contained members of multi-gene families, but not alternatively spliced variants of one gene. Therefore the multiple alignments were subject to additional analysis aimed at detection of chimeric clusters. Each cluster is represented by a multiple alignment of its members constructed using CLUSTALW. The distribution of cluster size, representation of species and other relevant statistics of ASDB(proteins) can be accessed through the links below. This processing covers the cases when alternatively spliced variants are described in separate SWISS-PROT entries. The other kinds of ASDB records, originating from the SWISS-PROT entries with the varsplic field in the feature table, usually describe the proteins that are not part of any cluster. In these cases, the information on the variable fragments of the several proteins which result from the alternative splicing of a single gene is contained in the entry itself.
ASDB(proteins) entries are marked with different symbols to allow for easy differentiation among the three types: those proteins which are part of the ASDB clusters and the corresponding multialignments, those which have the information on different variants in the associated SWISS-PROT entries, and those for which the information on the variants is not available at the present time. ASDB contains internal links between entries and/or clusters, as well as external links to Medline, GenBank and SWISS-PROT entries. The ASDB(nucleotides) division was generated by collecting all GenBank entries containing the words alternative splicing and further selection of those entries that contain complete gene sequences (all CDS fields are complete, i.e. they do not have continuation signs). Sponsors: This work was supported by the Director, Office of Energy Research, Office of Biological and Environmental Research, of the US Department of Energy under Contract No. DE-ACO3-76SF00098. Additional support came from grants from the Russian Fund of Basic Research (99-04-48347), the Russian State Scientific Program Human Genome (65/99), and the Merck Genome Research Institute (244).
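
The initial clustering step described above (link two proteins when they share a fragment of at least 20 amino acids, then take connected groups) can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the ASDB code: the function name and toy sequences are hypothetical, "common fragment" is read as an identical contiguous substring, and the later detection of chimeric clusters is not included. It relies on the fact that two sequences share a substring of length 20 or more exactly when they share at least one 20-mer.

```python
from typing import Dict, List


def cluster_by_shared_fragment(seqs: Dict[str, str], k: int = 20) -> List[List[str]]:
    """Group sequences into connected components linked by a shared k-mer."""
    parent = {name: name for name in seqs}

    def find(x: str) -> str:
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a: str, b: str) -> None:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    first_owner: Dict[str, str] = {}  # k-mer -> first sequence containing it
    for name, seq in seqs.items():
        for i in range(len(seq) - k + 1):
            owner = first_owner.setdefault(seq[i:i + k], name)
            if owner != name:
                union(owner, name)  # the two sequences share this k-mer

    clusters: Dict[str, List[str]] = {}
    for name in seqs:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())


# Made-up sequences: iso1 and iso2 share one 20-mer, iso2 and iso3 another.
frag_a = "ACDEFGHIKLMNPQRSTVWY"
frag_b = "YWVTSRQPNMLKIHGFEDCA"
toy_seqs = {
    "iso1": "MM" + frag_a,
    "iso2": frag_a + frag_b,
    "iso3": frag_b + "AAAA",
    "unrelated": "G" * 30,
}
toy_clusters = cluster_by_shared_fragment(toy_seqs)
```

In the toy data iso1 and iso3 share no fragment directly but end up in the same cluster through iso2, which is what defining clusters as connected groups implies.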
RESID
rrid.site
scicrunch.org
Updated Oct 26, 2025
Cite
(2025). RESID [Dataset]. http://identifiers.org/RRID:SCR_003505
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_003505
Dataset updated
Oct 26, 2025
Description
A comprehensive collection of annotations and structures for protein modifications including amino-terminal, carboxyl-terminal and peptide chain cross-link post-translational modifications. It provides: systematic and alternate names, atomic formulas and masses, enzyme activities generating the modifications, keywords, literature citations, Gene Ontology cross-references, Protein Information Resource (PIR) and SWISS-PROT protein sequence database feature table annotations, structure diagrams and molecular models. Each RESID Database entry presents a chemically unique modification and shows how that modification is currently annotated in the protein sequence databases, Swiss-Prot and the Protein Information Resource (PIR). The RESID Database provides a table of corresponding equivalent feature annotations that is used in the UniProt project, an international effort to combine the resources of the Swiss-Prot, TrEMBL and PIR. As an annotation tool, the RESID Database is used in standardizing and enhancing modification descriptions in the feature tables of Swiss-Prot entries.
Sars-CoV-2 structures -- sequence-to-alignments derived from PDB and from...
data.niaid.nih.gov
Updated Jun 15, 2021
Cite
Andrea Schafferhans; Sean O'Donoghue; Neblina Sikta; Sandeep Kaur (2021). Sars-CoV-2 structures -- sequence-to-alignments derived from PDB and from PSSH2, plus dark regions [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4934860
Dataset updated
Jun 15, 2021
Dataset provided by
Garvan Institute of Medical Research
HSWT
Authors
Andrea Schafferhans; Sean O'Donoghue; Neblina Sikta; Sandeep Kaur
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Aquaria Coverage map

In "SARS-CoV-2 structural coverage map reveals viral protein interactions, hijacking, and mimicry" we introduce a novel concept to visually organize a complex dataset of a large number of models: a one-stop visualization summarizing what is known, and not known, about the 3D structure of the viral proteome. This tailored visualization, called the SARS-CoV-2 structural coverage map, helps researchers find structural models related to specific research questions and can be viewed in the Aquaria-COVID resource. Aquaria_COVID_Coverage_Map.csv summarises the most basic information of this map, specifying dark and non-dark regions, as well as the number of residues predicted to be disordered in these regions, as predicted by Meta-Disorder (Schlessinger et al, 2006).

The PSSH2 data set

The sequence-to-structure alignments were generated using a modified version of the Aquaria sequence-to-structure processing pipeline (O’Donoghue et al, 2015), making up a subset of the PSSH2 database. PSSH2 is a database of protein sequence-to-structure homologies based on HHblits, an alignment method employing iterative comparisons of hidden Markov models (HMMs). To ensure the highest possible final alignment quality for matches in Aquaria using HHblits, we first calculate HMM profiles for each unique PDB sequence (PDB_full) and also for each unique Swiss-Prot sequence. We generated PSSH2 using HHblits to find similarities between HMMs from PDB and HMMs from UniProt sequences. seq_to_struc_alignemnts_PSSH2.csv.gz contains a subset of the usual PSSH2 database, including only the proteins relevant to visualise Sars-CoV-2 structures. Protein sequences and PDB structures are identified by the MD5 hashes of their respective sequences. PDB_chain_identifier_mappings.csv and swissprot_identifier_mappings.csv detail which entries in Swissprot and PDB chain are referred to by the MD5 hashes in the PSSH2 data set.
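
Since entries are identified by the MD5 hash of their sequence, looking up a protein in the mapping files requires computing that hash. A minimal sketch using Python's standard library; the function name is hypothetical, and the normalization applied before hashing (removing whitespace, upper-casing) is an assumption, because the description does not say exactly which string representation PSSH2 hashes.

```python
import hashlib


def sequence_md5(sequence: str) -> str:
    """Hex MD5 digest of an amino-acid sequence (normalization is assumed)."""
    normalized = "".join(sequence.split()).upper()
    return hashlib.md5(normalized.encode("ascii")).hexdigest()


# Made-up fragment; whitespace and letter case do not change the result here.
digest = sequence_md5("mkv lta\n")
same_digest = sequence_md5("MKVLTA")
```

The result is a 32-character hexadecimal string that could then be compared against the hash column of the identifier mapping files.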

Calculating PSSH2

The bulk of the Swissprot and PDB data was downloaded in October 2020, but incremental updates, especially those related to Covid19, were added until April 2021. Generating PSSH2: We used Uniclust30 from HH-suite, a database of non-redundant UniProt sequence clusters in which the highest pairwise sequence identity between clusters was 30% (http://gwdu111.gwdg.de/~compbiol/uniclust/2020_03/UniRef30_2020_03_hhsuite.tar.gz). The HHblits code and the code for running the calculations were retrieved from git (https://github.com/soedinglab/hh-suite.git and https://github.com/aschafu/PSSH2.git respectively) at the respective time of calculation in the timeframe until April 2021.

PDB based sequence-to-structure alignments

In addition to the PSSH2 data, new PDB structures were retrieved based on the primary accession of the proteins, by querying for all chains in all PDB entries with exact matches using the sequence cross references records given in PDB. Sequence-to-structure alignments were then created, again based on information provided in each PDB entry. These alignments are summarised in PDB_chain_alignments_pssh2Format.csv.
Data from: SYSTERS
scicrunch.org
rrid.site
Updated Oct 26, 2025
Cite
(2025). SYSTERS [Dataset]. http://identifiers.org/RRID:SCR_007955
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_007955
Dataset updated
Oct 26, 2025
Description
SYSTERS is a database of protein sequences grouped into homologous families and superfamilies. The SYSTERS project aims to provide a meaningful partitioning of the whole protein sequence space by a fully automatic procedure. A refined two-step algorithm assigns each protein to a family and a superfamily. The sequence data underlying SYSTERS release 4 now comprise several protein sequence databases derived from completely sequenced genomes (ENSEMBL, TAIR, SGD and GeneDB), in addition to the comprehensive Swiss-Prot/TrEMBL databases. To augment the automatically derived results, information from external databases like Pfam and Gene Ontology are added to the web server. Furthermore, users can retrieve pre-processed analyses of families like multiple alignments and phylogenetic trees. New query options comprise a batch retrieval tool for functional inference about families based on automatic keyword extraction from sequence annotations. A new access point, PhyloMatrix, allows the retrieval of phylogenetic profiles of SYSTERS families across organisms with completely sequenced genomes. Gene, Human, Vertebrate, Genome, Human ORFs
GTOP - Genomes To Protein structures
rrid.site
scicrunch.org
Updated Nov 30, 2025
Cite
(2025). GTOP - Genomes To Protein structures [Dataset]. http://identifiers.org/RRID:SCR_007698
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_007698
Dataset updated
Nov 30, 2025
Description
GTOP is a database consisting of data analyses of proteins identified by various genome projects. This database mainly uses sequence homology analyses and features extensive utilization of information on three-dimensional structures. GTOP is built by the Laboratory of Gene-Product Informatics at the National Institute of Genetics. This research is supported by the Japan Science and Technology Corporation and Grants-in-Aid for Scientific Research (Genomes in category C) from the Ministry of Education, Science, Sports and Culture of Japan. We use the following methods:
Prediction of 3D structure: sequence homology search of PDB, using REVERSE PSI-BLAST.
Functional predictions (family classifications): sequence homology search of Swiss-Prot, a well-annotated sequence database, with the use of BLAST.
Other analytical methods: we are also carrying out motif analysis (PROSITE), family classification (Pfam), prediction of transmembrane helix domains (SOSUI), prediction of coiled-coil regions (Multicoil), and repetitive sequence analysis (RepAlign).
Histone Database
neuinfo.org
dknet.org
Updated Jan 29, 2022
Cite
(2022). Histone Database [Dataset]. http://identifiers.org/RRID:SCR_007711
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_007711
Dataset updated
Jan 29, 2022
Description
Histone Database is a database of histones and their corresponding sequences. Sequence- and text-based searches were performed on NCBI's redundant and non-redundant (nr) peptide sequence databases. These databases are derived from GenBank, EMBL, and DDBJ translated DNA coding regions, plus protein sequences from the PDB (Protein Data Bank), SWISS-PROT, the PIR (Protein Information Resource), and the PRF (Protein Research Foundation). Users can search by keyword, sequence fragment, category, organism, and redundancy of the set.

Cite

Dan Ofer (2023). UniProtKB/Swiss-Prot Protein Embeddings [Dataset]. https://www.kaggle.com/datasets/danofer/uniprotkbswiss-prot-protein-embeddings/data

UniProtKB/Swiss-Prot Protein Embeddings

Swiss-Prot Protein sequence Embeddings

Explore at:

zip (2087271680 bytes)
Available download formats

Dataset updated

Apr 23, 2023

Authors

Dan Ofer

License

Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically

Description

The description below is taken from the official UniProt embeddings page, which is the original host of this dataset.

Protein embeddings are a way to encode functional and structural properties of a protein, mostly from its sequence only, in a machine-friendly format (vector representation). Generating such embeddings is computationally expensive, but once computed they can be leveraged for different tasks, such as sequence similarity search, sequence clustering, and sequence classification.
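As a rough illustration of the similarity-search use case mentioned above, the following sketch ranks vectors by cosine similarity. It is not taken from UniProt: the function name `cosine_top_k` is hypothetical, and the vectors are random stand-ins for real embeddings.

```python
import numpy as np


def cosine_top_k(query, matrix, k=3):
    """Return indices and scores of the k rows of `matrix` most similar to `query`."""
    # Normalise to unit length so that a dot product equals cosine similarity.
    q = query / np.linalg.norm(query)
    m = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    scores = m @ q
    # argsort is ascending, so reverse it and keep the first k positions.
    order = np.argsort(scores)[::-1][:k]
    return order, scores[order]


# Synthetic stand-ins for real embeddings: 5 "proteins", 1024 dimensions each.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(5, 1024))
# The query is a slightly perturbed copy of row 2, so row 2 should rank first.
query = embeddings[2] + 0.01 * rng.normal(size=1024)

top_idx, top_scores = cosine_top_k(query, embeddings, k=2)
```

Cosine similarity is only one plausible choice of metric here; the source does not say which measure downstream tools use.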

UniProt provides raw per-protein embeddings (mean-pooled, computed with the ProtT5 model) for UniProtKB/Swiss-Prot.
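To make "mean pooled, per-protein" concrete, here is a minimal sketch of the pooling step alone. It assumes the model yields one 1024-dimensional vector per residue; the array below is random data rather than ProtT5 output, and any details of UniProt's actual pipeline beyond the averaging are not covered.

```python
import numpy as np

# Hypothetical per-residue embeddings for one protein of 120 residues;
# random numbers stand in for real model output.
rng = np.random.default_rng(0)
per_residue = rng.normal(size=(120, 1024)).astype(np.float32)

# Mean pooling: average over the residue axis to get one fixed-size vector,
# regardless of how long the protein is.
per_protein = per_residue.mean(axis=0)
```

The result has shape (1024,), matching the per-protein vectors described for this dataset.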

Note: Protein sequences longer than 12k residues are excluded because of GPU memory limitations (this concerns only a handful of proteins).

Sample code

The embeddings.h5 files store the embeddings as key-value pairs. The key is the protein accession number and the value is the embedding vector. The following code snippet shows how to read and iterate over an embeddings file in Python.

import numpy as np
import h5py

# Open the HDF5 file read-only; each key is an accession, each value an embedding dataset.
with h5py.File("path/to/embeddings.h5", "r") as file:
    print(f"number of entries: {len(file)}")
    for sequence_id, embedding in file.items():
        # Convert the dataset to a NumPy array before taking its mean.
        print(
            f" id: {sequence_id}, "
            f" embeddings shape: {embedding.shape}, "
            f" embeddings mean: {np.array(embedding).mean()}"
        )

Sample output (SARS-CoV-2 embeddings from release 2022_04) per-protein file:

number of entries: 17
 id: A0A663DJA2, embeddings shape: (1024,), embeddings mean: 0.0006136894226074219
 id: P0DTC1, embeddings shape: (1024,), embeddings mean: 0.0011968612670898438
 id: P0DTC2, embeddings shape: (1024,), embeddings mean: 0.001041412353515625
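For downstream work it is often handier to hold the vectors in a single matrix than to iterate over the file. The sketch below shows one way this might be done; `stack_embeddings` and `load_h5` are hypothetical helper names, and the demo values are made up (only the accessions are borrowed from the sample output).

```python
import numpy as np


def stack_embeddings(store):
    """Turn a mapping of accession -> vector into (ids, matrix)."""
    ids = list(store.keys())
    # np.asarray reads each value into memory; it accepts h5py datasets
    # as well as plain arrays.
    matrix = np.stack([np.asarray(store[i]) for i in ids])
    return ids, matrix


def load_h5(path):
    """Load every embedding from an HDF5 file of accession -> vector pairs."""
    import h5py  # imported lazily so stack_embeddings works without h5py
    with h5py.File(path, "r") as f:
        return stack_embeddings(f)


# Small in-memory stand-in for a real file.
demo = {"P0DTC1": np.zeros(1024), "P0DTC2": np.ones(1024)}
ids, matrix = stack_embeddings(demo)
```

With a real file, `ids, matrix = load_h5("path/to/embeddings.h5")` would return the accessions and a matrix with one row per protein; note that this reads the whole file into memory at once.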

SOURCE:
https://www.uniprot.org/help/embeddings
https://www.uniprot.org/help/downloads#embeddings
Reviewed (Swiss-Prot) - per-protein: https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/embeddings/uniprot_sprot/per-protein.h5

UniProtKB/Swiss-Prot Protein Embeddings

UniprotKB/SwissProt

Peptide Sequence Database

UniProt

ExPASy Biochemical Pathways

UniProtKB

PSSH2 - database of protein sequence-to-structure homologies (including...

PSSH2 - database of protein sequence-to-structure homologies - Sars-CoV-2...

mESC shotgun and positional proteomics based on deep proteome sequence...

NCBI Protein Database

Hilsa protein database

Data from: PROSITE

The GenBank Non-Redundant Protein Sequence Database (NRDB)

CPBI_seqdb_demo sample QFO sequence library

Alternative Splicing Database

RESID

Sars-CoV-2 structures -- sequence-to-alignments derived from PDB and from...

Data from: SYSTERS

GTOP - Genomes To Protein structures

Histone Database
