CATH Domain Classification List (latest release) - protein structural domains classified into CATH hierarchy.
PROSITE consists of documentation entries describing protein domains, families and functional sites, as well as associated patterns and profiles to identify them. PROSITE is complemented by ProRule, a collection of rules based on profiles and patterns, which increases the discriminatory power of profiles and patterns by providing additional information about functionally and/or structurally critical amino acids.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These protein sequences were predicted from Hilsa transcriptome data using the TransDecoder tool (version 5.5.0). They were then annotated by homology-based similarity search against the latest Swiss-Prot database.
https://creativecommons.org/publicdomain/zero/1.0/
This synthetic dataset was created to explore and develop machine learning models in bioinformatics. It contains 20,000 synthetic proteins, each with an amino acid sequence, calculated physicochemical properties, and a functional classification.
The dataset includes the following columns:
ID_Protein: a unique identifier for each protein
Sequence: a string of amino acids
Molecular_Weight: molecular weight calculated from the sequence
Isoelectric_Point: estimated isoelectric point based on the sequence composition
Hydrophobicity: average hydrophobicity calculated from the sequence
Total_Charge: sum of the charges of the amino acids in the sequence
Polar_Proportion: percentage of polar amino acids in the sequence
Nonpolar_Proportion: percentage of nonpolar amino acids in the sequence
Sequence_Length: total number of amino acids in the sequence
Class: the functional class of the protein, one of five categories: Enzyme, Transport, Structural, Receptor, Other
While this is a simulated dataset, it was inspired by patterns observed in real protein resources:
UniProt: a comprehensive database of protein sequences and annotations
Kyte-Doolittle Scale: calculations of hydrophobicity
Biopython: a tool for analyzing biological sequences
This dataset is ideal for training classification models for proteins, exploratory analysis of physicochemical properties of proteins, and building machine learning pipelines in bioinformatics.
The dataset was created in three steps:
Sequence generation: amino acid chains were randomly generated with lengths between 50 and 300 residues
Property calculation: properties were calculated using the Biopython library
Class assignment: classes were randomly assigned for classification purposes
However, the sequences and properties do not represent real proteins but follow patterns observed in natural proteins, and the functional classes are simulated and do not correspond to actual biological characteristics.
The dataset is divided into two subsets: Training, which includes 16,000 samples (proteinas_train.csv), and Testing, which includes 4,000 samples (proteinas_test.csv). This dataset was inspired by real bioinformatics challenges and designed to help researchers and developers explore machine learning applications in protein analysis.
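The generation and hydrophobicity steps described above can be sketched roughly as follows. This is a minimal, dependency-free illustration, not the code used to build the dataset (the description says Biopython was used). The function names `random_protein` and `mean_hydrophobicity` are hypothetical, and uniform sampling of residues is an assumption.

```python
import random

# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
AMINO_ACIDS = "".join(sorted(KYTE_DOOLITTLE))


def random_protein(rng, min_len=50, max_len=300):
    """Draw a random sequence; uniform residue sampling is an assumption."""
    length = rng.randint(min_len, max_len)
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))


def mean_hydrophobicity(sequence):
    """Average Kyte-Doolittle hydropathy over all residues of the sequence."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    return sum(KYTE_DOOLITTLE[aa] for aa in sequence) / len(sequence)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
seq = random_protein(rng)
print(len(seq), round(mean_hydrophobicity(seq), 3))
```

The other property columns (molecular weight, isoelectric point, charge) would need their own tables or a library such as Biopython and are left out here.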
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
To enable the identification of mutated peptide sequences in complex biological samples, this work developed a cancer protein database with mutation information collected from several public resources, such as COSMIC, IARC P53, OMIM and UniProtKB. In-house Perl scripts were used to search and process the data, and to translate each gene-level mutation into a mutated peptide sequence. The cancer mutation database comprises a total of 872,125 peptide entries from 25,642 protein IDs. A description line for each entry provides the parent protein ID and name, the cDNA- and protein-level mutation site and type, the originating database, and the cancer tissue type and corresponding hits. The database is FASTA formatted to enable data retrieval by commonly used tandem MS search engines.
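As a rough illustration of turning a protein-level substitution into a mutated peptide, the sketch below applies a change written as, for example, `A4G` and returns the surrounding residues. It is not the authors' Perl code. The function name `mutated_peptide`, the mutation notation, and the fixed flanking window are all assumptions made for this example.

```python
import re


def mutated_peptide(protein, mutation, flank=10):
    """Apply a substitution such as 'A4G' (1-based position) and return
    the mutated residue with up to `flank` residues on each side."""
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", mutation)
    if match is None:
        raise ValueError(f"unsupported mutation format: {mutation!r}")
    ref, pos, alt = match.group(1), int(match.group(2)), match.group(3)
    index = pos - 1
    if pos < 1 or index >= len(protein):
        raise ValueError("mutation position is outside the protein")
    if protein[index] != ref:
        raise ValueError("reference residue does not match the protein")
    mutated = protein[:index] + alt + protein[index + 1:]
    start = max(0, index - flank)
    return mutated[start:index + flank + 1]


# Hypothetical toy sequence, not taken from the database.
print(mutated_peptide("MKTAYIAKQR", "A4G", flank=2))
```

Insertions, deletions and frameshifts would need separate handling and are not covered by this sketch.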
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Protein Structure Initiative - TargetTrack protein target registration database (795 MB, gzipped tarball)
The Protein Structure Initiative was a high-throughput structural genomics effort from 2000-2015 focused on developing technologies to enable greater coverage of protein structure space. Over its 15-year tenure, over 100 investigators at 35 centers (see ContributingCenters.xls) declared over 350,000 protein sequences (targets) that they would study using state-of-the-art protein production and structure determination methods. Many of these targets were selected through bioinformatics-based methods to serve as representatives for sequence and structure clusters.
From 2003-2010, these selected sequences and some basic identifying metadata were kept in a database called TargetDB, created at the Research Collaboratory for Structural Bioinformatics at Rutgers University. In 2008, a second database named PepcDB was created to track detailed experimental trial history and the standard protocols used by the PSI centers. These two databases became the principal structural genomics target databases, and were rolled into the PSI Structural Biology Knowledgebase in 2008.
As part of the third phase of the PSI, TargetDB and PepcDB were merged into a single resource, TargetTrack, to facilitate one-stop access to the data as well as expanding the schema to include new required data items. Participating centers deposited the latest status on their active targets and the protocols that were used (along with any deviations) on a weekly or quarterly basis. TargetTrack provided a variety of pre-computed data downloads on a weekly basis as well.
In July 2017, the Structural Biology Knowledgebase ceased operations. The files provided in this tarball represent the final datafiles generated by TargetTrack (timestamp June 30, 2017). Please read the README included in this dataset for descriptions of each file.
The entire TargetTrack datafile in XML format can be found in /TargetTrack XML files/tt.xml.gz
Key documentation can be found in the /Documentation folder.
TargetTrack schema: targetTrack-v1.4.1.pdf
Spreadsheet with TargetTrack enumerations for relevant fields: targetTrackEnumeratedDataItems-v1.4.1-1.xls
Image depicting the XML data schema: targetTrack-v1.4.1.jpg
These files are 868 MB in total size, uncompressed.
To open the tarball, use the command 'tar -zxvf TargetTrack-1Jul2017.tar.gz'
-- created by the PSI Structural Biology Knowledgebase, July 5, 2017
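Because the full TargetTrack XML file is large, it may be easier to stream it than to load it whole. The sketch below counts how often each element tag occurs in a gzipped XML file. It makes no assumption about the TargetTrack schema itself; the function name `count_tags` is hypothetical, and the element names should be checked against targetTrack-v1.4.1.pdf.

```python
import gzip
import xml.etree.ElementTree as ET
from collections import Counter


def count_tags(path):
    """Stream a gzipped XML file and count occurrences of each element tag."""
    counts = Counter()
    with gzip.open(path, "rb") as handle:
        # "end" events fire once an element has been read completely.
        for _event, elem in ET.iterparse(handle, events=("end",)):
            counts[elem.tag] += 1
            # Drop the element's text and children to limit memory use.
            elem.clear()
    return counts
```

A call such as `count_tags("tt.xml.gz")` (path adjusted to wherever the tarball was extracted) would give a first overview of which record types the file contains.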
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Encyclopedia of Domains (TED) is a joint effort by CATH (Orengo group) and the Jones group at University College London to identify and classify protein domains in AlphaFold2 models from AlphaFold Database version 4, covering over 188 million unique sequences and 324 million domain assignments.
In this data release, we will be making available to the community a table of domain boundaries and additional metadata on quality (pLDDT, globularity, number of secondary structures), taxonomy and putative CATH SuperFamily or Fold assignments for all 324 million domains in TED100.
For all chains in the TED-redundant dataset, the attached file contains boundary predictions, consensus level and information on the TED100 representative.
Additionally, an archive with chain-level consensus domain assignments is available for 21 model organisms and 25 global health proteomes:
For both TED100 and TEDredundant we provide the domain boundary predictions output by each of the three methods employed in the project (Chainsaw, Merizo, UniDoc).
We are making available PDB files for 7,427 novel folds identified during the TED classification process, together with an annotation table sorted by novelty.
Please use the gunzip command to extract files with a '.gz' extension.
CATH annotations have been assigned using the FoldSeek algorithm applied in various modes and the FoldClass algorithm, both of which are used to report significant structural similarity to a known CATH domain.
Note: The TED protocol differs from that of our standard CATH Assignment protocol for superfamily assignment, which also involves HMM-based protocols and manual curation for remote matches.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This work presents the ribosomal protein database of Proteus vulgaris ATCC 49132. Original data for the work came from the annotated proteome data of the bacterium downloaded from UniProt. Using an in-house MATLAB ribosomal protein database analysis software, the original proteome data file was parsed to extract protein name and amino acid sequence of all ribosomal proteins in the species. The database also includes calculated variables such as number of residues, molecular weight, and nucleotide sequence. Overall, the presented database could serve as a ribosomal protein mass fingerprint for use in microbial identification, or it could be used in fundamental studies seeking to uncover new insights into ribosomal protein biology.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A medium-sized (approximately 1 million entries) protein sequence database constructed from the NCBI 'nr' database (January 2017) by selecting UniProt (Swiss-Prot), RefSeq, and PDB entries for 66 species (taxon_id's) from the Quest for Orthologs organism set. These files are designed to be used in conjunction with scripts and SQL files to construct the seqdb_demo database, as described in Current Protocols in Bioinformatics Unit 3.9 (revised Spring 2017). The files are:
qfo_demo.gz - a FASTA-format sequence library with the current NR Defline format (gzip compressed)
qfo_prot.accession2taxonid.gz, qfo_pdb.accession2taxid.gz - tables that map accessions to taxon_id's and gi-numbers, similar to those available in the NCBI pub/taxonomy/accession2taxid/prot.accession2taxid and pdb.accession2taxid files (gzip compressed).
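The accession-to-taxon tables could be loaded along these lines. The sketch assumes the files follow the NCBI accession2taxid layout: tab-separated, with a header row naming the columns `accession`, `accession.version`, `taxid` and `gi`. That layout is an assumption about these particular files and should be checked against their first lines. The function name `load_accession2taxid` is hypothetical.

```python
import csv
import gzip


def load_accession2taxid(path):
    """Read a gzipped accession2taxid table into {accession.version: taxid}."""
    mapping = {}
    with gzip.open(path, "rt", newline="") as handle:
        # Assumed header: accession, accession.version, taxid, gi
        reader = csv.DictReader(handle, delimiter="\t")
        for row in reader:
            mapping[row["accession.version"]] = int(row["taxid"])
    return mapping
```

For a table of about a million rows an in-memory dictionary is workable; the protocol itself loads these tables into an SQL database instead.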
A protein database which connects multiple disparate bioinformatics tools and systems: text mining, data mining, analysis and visualization tools, and databases and ontologies.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the updated Diamond2GO reference database built on 12th August 2025.
It is a DIAMOND-formatted protein database (`.dmnd`) consisting of over 27 million sequences derived from the NCBI `nr` dataset, filtered to include only those with Gene Ontology (GO) annotations and reduced in redundancy using MMseqs2 (95% similarity). This version improves sensitivity and annotation coverage compared to both the original 2023 release used in the published D2GO manuscript and the earlier 2025 release.
This database is intended for use with the Diamond2GO tool, which enables rapid GO-term annotation and enrichment analysis for high-throughput sequencing datasets.
For reproducibility of results published using the earlier version (699,409 sequences), please refer to the v1.0.0 release: https://github.com/rhysf/Diamond2GO/releases/tag/6a035ce
Public global Protein Data Bank archive of macromolecular structural data, overseen by organizations that act as deposition, data processing and distribution centers for PDB data. Members are RCSB PDB (USA), PDBe (Europe), PDBj (Japan) and BMRB (USA). This site provides information about services provided by individual member organizations and about projects undertaken by wwPDB. Data are available via the websites of its member organizations.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This work presents the ribosomal protein database of Providencia rettgeri strain Dmel1. Original data for the work came from the annotated proteome data of the bacterium downloaded from UniProt. Using an in-house MATLAB ribosomal protein database analysis software, the original proteome data file was parsed to extract protein name and amino acid sequence of all ribosomal proteins in the species. The database also includes calculated variables such as number of residues, molecular weight, and nucleotide sequence. Overall, the presented database could serve as a ribosomal protein mass fingerprint for use in microbial identification, or it could be used in fundamental studies seeking to uncover new insights into ribosomal protein biology.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
LukProt is the EukProt database with additional species added, mostly from undersampled animal taxa and some holozoan taxa. The database is composed of sequences translated from annotated genomes, transcriptomes or ESTs. The main purposes of the database are to consolidate sequences from undersampled animal taxa and provide usable search tools. The publication associated with LukProt can be found here: https://doi.org/10.1093/gbe/evae231.
The current version of the database (v1.5.1) is based on EukProt v3. The home of all public versions of LukProt is this page (Zenodo).
Proteomes that are novel in LukProt are denoted as LPXXXXX and those coming from AniProtDB are called APXXXXX. The sequence IDs from EukProt are conserved in LukProt. This means that each sequence is assigned an ID in the following format:
(A/E/L)PXXXXX_Species_epithet_(strain)_PYYYYYY
where XXXXX is a number from 00001 to 99999 and YYYYYY is a number from 000001 to 999999. Each sequence is assigned a unique number YYYYYY, and each taxon a unique number XXXXX. All the IDs are compatible with the BLAST v5 "-parse_seqids" option and the database can be readily deployed, for example on a server running SequenceServer. Within each of the source FASTA files, the source sequence identifier was kept after a blank space, so that it can still be retrieved if needed.
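The ID convention above can be checked and split with a regular expression. This is a minimal sketch; the function name `parse_sequence_id` and the example ID are made up for illustration and do not come from the database.

```python
import re

# (A/E/L)PXXXXX_Species_epithet_(strain)_PYYYYYY
ID_PATTERN = re.compile(
    r"^(?P<taxon>(?P<prefix>[AEL])P\d{5})_(?P<name>.+)_P(?P<number>\d{6})$"
)
SOURCES = {"A": "AniProtDB", "E": "EukProt", "L": "LukProt"}


def parse_sequence_id(seq_id):
    """Split an ID into taxon ID, source database, name and sequence number."""
    match = ID_PATTERN.match(seq_id)
    if match is None:
        raise ValueError(f"not a LukProt-style ID: {seq_id!r}")
    return {
        "taxon_id": match.group("taxon"),
        "source": SOURCES[match.group("prefix")],
        "name": match.group("name"),
        "sequence_number": int(match.group("number")),
    }


# Hypothetical ID used only to demonstrate the format.
print(parse_sequence_id("LP00042_Genus_species_strainX_P000007"))
```

The name part is kept as one string because the strain component is optional and species names can themselves contain underscores.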
A publicly available BLAST server providing LukProt search is available at: https://lukprot.hirszfeld.pl/.
Comparison of EukProt v2/v3, LukProt v1.4.1 and LukProt v1.5.1 in their main areas of difference:
Taxogroup                   | EukProt v2 | EukProt v3 | LukProt v1.4.1 | LukProt v1.5.1
Holozoa (excluding Metazoa) | 31         | 40         | 39             | 43
Ctenophora                  | 2          | 2          | 35             | 38
Porifera                    | 4          | 5          | 30             | 47
Placozoa                    | 2          | 2          | 3              | 6
Cnidaria                    | 3          | 5          | 65             | 88
Bilateria                   | 51         | 51         | 94             | 142
Included with the database are:
ready to use main database files:
LukProt_v1.5.1_single_species_FASTA.7z – a FASTA file with the sequences - 7-zipped, uncompressed size: 17.6 GB
to concatenate all into one file, run this in the parent directory: for file in $(find . -type f -name "*.fasta"); do awk 'FNR==1{print ""}1' $file >> LukProt_v1.5.1.fa; done. This will create a single FASTA file with all the sequences in the parent directory. awk is used to print an empty line before the contents of each file because cat would sometimes merge the last sequence of one file with the first header of the next.
LukProt_v1.5.1_full_BLAST_db.7z – a preformatted, full BLAST database (NCBI BLAST database format version: v5, masked with segmasker), uncompressed size: 28.3 GB
LukProt_v1.5.1_taxogroup_BLAST_db.7z – a collection of BLAST databases where each taxogroup is one BLAST database and is placed within the eukaryotic tree of life directory structure, uncompressed size: 26.3 GB
LukProt_v1.5.1_single_species_BLAST_db.7z – a collection of BLAST databases where each proteome is one BLAST database and is placed within the eukaryotic tree of life directory structure, uncompressed size: 26.4 GB
auxiliary database files:
LukProt_v1.5.1.cdhit70.7z – the full database clustered at 70% identity using CD-HIT with the following command: cd-hit -g 1 -d 0 -T 20 -M 90000 -c 0.7 -uL 0.2 -uS 0.9 -s 0.2, uncompressed sizes: fasta file - 11 GB, clstr file - 2.5 GB
LukProt_IDs_mapped.txt.gz – a text file mapping the LukProt IDs to the AniProtDB IDs and EukProt IDs that are different
BUSCO_tables.ods – a spreadsheet with full result tables generated by BUSCO analysis
OMAmer_output.zip – a folder with full results of OMAmer analyses (includes per-sequence taxonomy classification)
OMArk_output.zip – a folder with the results of all OMArk analyses
metadata:
README.md – a README file describing the metadata
LukProt_metadata_sheet.ods – main metadata file. A spreadsheet with information about each proteome (in an open .ods format, most compatible with LibreOffice)
LukProt_metadata_other.zip – an archive with other metadata files, documented in the README. Contents include:
the LukProt taxonomy in various formats
supporting scripts for data manipulation and visualization
a recoloring script (modified by LFS, originally by Dr. Celine Petitjean). The script is in the public domain and is reuploaded here only for convenience.
other files - see README
changelog.md – database changelog
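For readers who prefer not to use the shell loop given for the single-species FASTA archive, the same concatenation could be sketched in Python as below. The function name `concatenate_fasta` is hypothetical. Unlike the awk one-liner, this version only adds a newline when a file does not already end with one, so the output contains no empty lines.

```python
from pathlib import Path


def concatenate_fasta(root, output, chunk_size=1 << 20):
    """Merge every *.fasta file under `root` into `output`.

    Returns the number of files merged."""
    root, output = Path(root), Path(output)
    merged = 0
    with output.open("wb") as out:
        for path in sorted(root.rglob("*.fasta")):
            if path.resolve() == output.resolve():
                continue  # never read the file that is being written
            last = b"\n"
            with path.open("rb") as src:
                # Copy in chunks so large files are not held in memory.
                while chunk := src.read(chunk_size):
                    out.write(chunk)
                    last = chunk[-1:]
            if last != b"\n":
                # Keep the next file's header from fusing with this sequence.
                out.write(b"\n")
            merged += 1
    return merged
```

Run from the parent directory, `concatenate_fasta(".", "LukProt_v1.5.1.fa")` would mirror the shell command.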
Words of caution:
The database has been synchronized to EukProt v3 in version v1.5.1. This means that identifiers were modified in comparison to LukProt v1.4.1. The convention is not expected to change any more in future updates.
Many proteomes, especially those transcriptome-based, may contain contamination from different species. In addition, the translation algorithms often introduce errors (e.g. the transcript may not represent a full length protein). For this reason, to get accurate sequences from each organism, users are directed to source data and to the included OMAmer, OMArk and BUSCO data for details.
The taxonomy is different to UniEuk/EukMap, but UniEuk data were integrated where possible.
A few NCBI taxids are missing and will be added in due course.
Proteomes from NCBI and UniProt will be updated to current versions.
A number of proteomes present in some metadata are unpublished and were held back.
While the database contains metadata that present a particular phylogeny of animals, holozoans and other eukaryotes, no particular claims or hypotheses are made by the author(s). However, in the future efforts will be made to name clades officially, once they are more firmly established.
Please report any problems or suggestions to Lukasz Sobala: lukasz.sobala (at) hirszfeld.pl.
Acknowledgements:
Andrew E. Allen Lab for creating the original PhyloDB.
Daniel Richter et al. for creating EukProt and keeping it updated.
Members of the Multicellgenome Lab, especially Michelle Leger (for donating her database), for the bioinformatics support and for doing great science.
All the authors of the original data.
National Science Centre of Poland for funding of the project 2020/36/C/NZ8/00081, "The role of glycosylation in the emergence of animal multicellularity", which enabled the creation of this database.
SUPFAM is a database that consists of clusters of potentially related homologous protein domain families, with and without three-dimensional structural information, forming superfamilies. The present release (Release 3.0) of SUPFAM uses homologous families in Pfam (Version 23.0) and SCOP (Release 1.69), which are examples of sequence-alignment and structure classification databases, respectively. The two steps involved in setting up the SUPFAM database are:
* Relating Pfam and SCOP families using a new profile-profile alignment algorithm, AlignHUSH. This identifies many Pfam families that could be related to a family or superfamily with known structural information.
* An all-against-all match among Pfam families of as yet unknown structure, resulting in the identification of related Pfam families that form new potential superfamilies.
The SUPFAM database can be used in either Browse mode or Search mode. In Browse mode you can browse through the Superfamilies, Pfam families or SCOP families. In each of these modes you will be presented with a full list which can be easily browsed. In Search mode, you can search for Pfam families, SCOP families or Superfamilies based on keywords or SCOP/Pfam identifiers of families and superfamilies.
THIS RESOURCE IS NO LONGER IN SERVICE. Documented on September 16, 2025.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
CATH is a classification of protein structures downloaded from the Protein Data Bank. We group protein domains into superfamilies when there is sufficient evidence they have diverged from a common ancestor. The files contained in this dataset correspond to the version 4.2 release of the CATH classification.
Collection of protein sequence data and functional information. Resource for protein sequence and annotation data. Consortium for preservation of the UniProt databases: UniProt Knowledgebase (UniProtKB), UniProt Reference Clusters (UniRef), UniProt Archive (UniParc) and UniProt Proteomes. Collaboration between European Bioinformatics Institute (EMBL-EBI), SIB Swiss Institute of Bioinformatics and Protein Information Resource. Swiss-Prot is a curated subset of UniProtKB.
http://rightsstatements.org/vocab/InC/1.0/
Extracting motifs from a protein family or subfamily remains an active research topic in bioinformatics. Motif extraction not only helps in understanding protein function and in predicting the class to which an unknown protein sequence belongs, but also aids the study of protein-protein interactions. In this paper, we present a novel algorithm to extract motifs of a subfamily, based on feature selection and position connection. Position connection is applied to generate motifs, and a hybrid method with a vote decision-making mechanism is used to construct the classifier of the ligase subfamilies. In testing on the database, a predictive accuracy of more than 95.87% is achieved, which demonstrates that the method is practical. In addition, the method shows that motifs play an important role in classifying proteins and in studying the characteristics of the subfamilies or families in a protein database. PRIB 2008 proceedings found at: http://dx.doi.org/10.1007/978-3-540-88436-1
Contributors: Monash University. Faculty of Information Technology. Gippsland School of Information Technology ; Chetty, Madhu ; Ahmad, Shandar ; Ngom, Alioune ; Teng, Shyh Wei ; Third IAPR International Conference on Pattern Recognition in Bioinformatics (PRIB) (3rd : 2008 : Melbourne, Australia) ; Coverage: Rights: Copyright by Third IAPR International Conference on Pattern Recognition in Bioinformatics. All rights reserved.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
📖 Context & Inspiration
This dataset was created to provide a comprehensive, easily accessible archive of Protein Data Bank (PDB) files—vital for researchers and developers working in structural biology, molecular modeling, and bioinformatics. The files were gathered from a public directory-style server, where PDB structures are organized into subfolders based on naming conventions.
The goal was to simplify access to this valuable data by restructuring it in a more machine-learning-friendly format, eliminating the need for repetitive scraping or manual downloads. Inspiration came from real-world bottlenecks in data preprocessing for protein folding prediction models, drug design simulations, and structure-based machine learning pipelines.
🌐 Source
All files were sourced from a publicly available FTP-style archive mirroring the PDB repository. The original structure was preserved to ensure traceability and compatibility with existing workflows.
⚠️ Note: This dataset is intended for research and educational use. Ensure compliance with any licensing or usage terms defined by the PDB archive.
https://spdx.org/licenses/CC0-1.0.html
Libraries of structural prototypes that abstract protein local structures are known as structural alphabets and have proven to be very useful in various aspects of protein structure analysis and prediction. One such library, Protein Blocks, is composed of 16 standard structural prototypes, each 5 residues long. Analyzing a protein in this way involves describing its structure as a string of Protein Blocks. Predicting the local structure of a protein in terms of Protein Blocks is the general objective of this work. A new approach, PB-kPRED, is proposed towards this aim. It involves (i) organizing the structural knowledge in the form of a database of pentapeptide fragments extracted from all protein structures in the PDB and (ii) applying a knowledge-based algorithm that does not rely on any secondary structure predictions and/or sequence alignment profiles to scan this database and predict the most probable backbone conformations for the protein local structures. When structural information from homologues is available, however, PB-kPRED uses it in preference. The predictions were evaluated rigorously on 15,544 query proteins representing a non-redundant subset of the PDB filtered at a 30% sequence identity cut-off. We have shown that the kPRED method was able to achieve mean accuracies ranging from 40.8% to 66.3% depending on the availability of homologues. The impact of the different strategies for scanning the database on the prediction was evaluated and is discussed. Our results highlight the usefulness of the method in the context of proteins without any known structural homologues. A scoring function that gives a good estimate of the accuracy of prediction was further developed. This score estimates the accuracy of the algorithm very well (R² of 0.82). An online version of the tool is provided freely for non-commercial usage at http://www.bo-protscience.fr/kpred/.
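To make the idea of a pentapeptide lookup concrete, the toy sketch below slides a 5-residue window over a sequence and, for each window, returns the Protein Block letter seen most often for that pentapeptide in a small lookup table. It is not the PB-kPRED algorithm: the function names, the majority-vote rule, the "?" placeholder for unseen fragments, and the example data are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def pentapeptides(sequence):
    """All overlapping 5-residue windows of a sequence, in order."""
    return [sequence[i:i + 5] for i in range(len(sequence) - 4)]


def build_database(examples):
    """Count observed Protein Block letters per pentapeptide.

    `examples` is an iterable of (pentapeptide, block_letter) pairs."""
    database = defaultdict(Counter)
    for fragment, block in examples:
        database[fragment][block] += 1
    return database


def predict_blocks(sequence, database):
    """Most frequent block per window; '?' when the fragment is unseen."""
    predicted = []
    for fragment in pentapeptides(sequence):
        counts = database.get(fragment)
        predicted.append(counts.most_common(1)[0][0] if counts else "?")
    return "".join(predicted)


# Hypothetical toy observations, not taken from the PDB.
toy = build_database([("ACDEF", "m"), ("ACDEF", "m"), ("ACDEF", "d"), ("CDEFG", "a")])
print(predict_blocks("ACDEFGH", toy))
```

A sequence of length n yields n - 4 windows, so the predicted string is four characters shorter than the input.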