100+ datasets found

Protein Structural Domain Classification
cathdb.info
ec.i4cologne.com
+3 more
Updated Sep 30, 2024
Cite
(2024). Protein Structural Domain Classification [Dataset]. http://identifiers.org/MIR:00100005
Explore at:
Unique identifier
https://identifiers.org/MIR:00100005
Dataset updated
Sep 30, 2024
Description
CATH Domain Classification List (latest release) - protein structural domains classified into CATH hierarchy.
Data from: PROSITE
prosite.expasy.org
identifiers.org
+7 more
Updated Oct 15, 2025
Cite
(2025). PROSITE [Dataset]. https://prosite.expasy.org/
Explore at:
Dataset updated
Oct 15, 2025
Description
PROSITE consists of documentation entries describing protein domains, families and functional sites as well as associated patterns and profiles to identify them. PROSITE is complemented by ProRule, a collection of rules based on profiles and patterns, which increases the discriminatory power of profiles and patterns by providing additional information about functionally and/or structurally critical amino acids.
Hilsa protein database
figshare.com
txt
Updated Nov 18, 2022
Cite
Molbio Lab BMB (2022). Hilsa protein database [Dataset]. http://doi.org/10.6084/m9.figshare.21579144.v1
Explore at:
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.21579144.v1
Dataset updated
Nov 18, 2022
Dataset provided by
Figshare (http://figshare.com/)
Authors
Molbio Lab BMB
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
These protein sequences were predicted from Hilsa transcriptome data using the TransDecoder tool (version 5.5.0). They were then annotated using a homology-based similarity search against the latest Swiss-Prot database.
Bioinformatics Simulated
kaggle.com
zip
Updated Jan 7, 2025
+ more versions
Cite
willian oliveira (2025). Bioinformatics Simulated [Dataset]. https://www.kaggle.com/willianoliveiragibin/bioinformatics-simulated
Explore at:
Available download formats: zip (2644480 bytes)
Dataset updated
Jan 7, 2025
Authors
willian oliveira
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
This synthetic dataset was created to explore and develop machine learning models in bioinformatics. It contains 20,000 synthetic proteins, each with an amino acid sequence, calculated physicochemical properties, and a functional classification. The dataset includes the following columns: ID_Protein, a unique identifier for each protein; Sequence, a string of amino acids; Molecular_Weight, molecular weight calculated from the sequence; Isoelectric_Point, estimated isoelectric point based on the sequence composition; Hydrophobicity, average hydrophobicity calculated from the sequence; Total_Charge, sum of the charges of the amino acids in the sequence; Polar_Proportion, percentage of polar amino acids in the sequence; Nonpolar_Proportion, percentage of nonpolar amino acids in the sequence; Sequence_Length, total number of amino acids in the sequence; and Class, the functional class of the protein, one of five categories: Enzyme, Transport, Structural, Receptor, Other. While this is a simulated dataset, it was inspired by patterns observed in real protein datasets such as UniProt, a comprehensive database of protein sequences and annotations; the Kyte-Doolittle Scale, calculations of hydrophobicity; and Biopython, a tool for analyzing biological sequences. This dataset is ideal for training classification models for proteins, exploratory analysis of physicochemical properties of proteins, and building machine learning pipelines in bioinformatics. The dataset was created through sequence generation, where amino acid chains were randomly generated with lengths between 50 and 300 residues, property calculation using the Biopython library, and class assignment with classes randomly assigned for classification purposes. However, the sequences and properties do not represent real proteins but follow patterns observed in natural proteins, and the functional classes are simulated and do not correspond to actual biological characteristics. 
The dataset is divided into two subsets: Training, which includes 16,000 samples (proteinas_train.csv), and Testing, which includes 4,000 samples (proteinas_test.csv). This dataset was inspired by real bioinformatics challenges and designed to help researchers and developers explore machine learning applications in protein analysis.
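
The Hydrophobicity column is described as an average hydrophobicity calculated from the sequence, and the Kyte-Doolittle scale is named as an inspiration. The sketch below shows one way such a value could be computed. It assumes the column is the plain mean of Kyte-Doolittle values over all residues; the description does not state the exact method used, and the function name mean_hydrophobicity is hypothetical.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def mean_hydrophobicity(sequence: str) -> float:
    """Mean Kyte-Doolittle value over the sequence (assumed definition)."""
    residues = sequence.upper()
    if not residues:
        raise ValueError("empty sequence")
    # A KeyError here signals a non-standard residue letter.
    return sum(KYTE_DOOLITTLE[r] for r in residues) / len(residues)


if __name__ == "__main__":
    # Hypothetical sequence, not taken from the dataset.
    print(round(mean_hydrophobicity("MKTAYIAKQR"), 3))
```

Comparing the result against the Hydrophobicity column for a few rows of proteinas_train.csv would show whether this assumption matches how the dataset was actually built.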
XMAn-A Homo sapiens Mutated Cancer Peptides Database
figshare.com
txt
Updated May 30, 2023
Cite
Iulia M. Lazar; Xu Yang (2023). XMAn-A Homo sapiens Mutated Cancer Peptides Database [Dataset]. http://doi.org/10.6084/m9.figshare.2825557.v4
Explore at:
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.2825557.v4
Dataset updated
May 30, 2023
Dataset provided by
Figshare (http://figshare.com/)
Authors
Iulia M. Lazar; Xu Yang
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
To enable the identification of mutated peptide sequences in complex biological samples, in this work, a cancer protein database with mutation information collected from several public resources such as COSMIC, IARC P53, OMIM and UniProtKB, was developed. In-house developed Perl-scripts were used to search and process the data, and to translate each gene-level mutation into a mutated peptide sequence. The cancer mutation database comprises a total of 872,125 peptide entries from 25,642 protein IDs. A description line for each entry provides the parent protein ID and name, the cDNA- and protein-level mutation site and type, the originating database, and the cancer tissue type and corresponding hits. The database is FASTA formatted to enable data retrieval by commonly used tandem MS search engines.
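
Because the database is FASTA formatted, each entry can be read as a description line followed by its peptide sequence. The reader below is a generic, minimal sketch. It makes no assumption about how the fields inside the description line are laid out, since that layout is not specified here, and the name read_fasta is hypothetical.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (description, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)


if __name__ == "__main__":
    # Hypothetical entries, for illustration only.
    example = [">entry1 some description", "MKT", "AYI", ">entry2", "GG"]
    for description, peptide in read_fasta(example):
        print(description, len(peptide))
```

An open file handle can be passed directly, for example read_fasta(open(path)), because the function only iterates over lines.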
Protein Structure Initiative - TargetTrack 2000-2017 - all data files
zenodo.org
data.niaid.nih.gov
application/gzip
Updated Jan 24, 2020
Cite
Helen M. Berman, Margaret J. Gabanyi, Andrei Kouranov, David I. Micallef, John Westbrook; Protein Structure Initiative network of investigators (2020). Protein Structure Initiative - TargetTrack 2000-2017 - all data files [Dataset]. http://doi.org/10.5281/zenodo.821654
Explore at:
Available download formats: application/gzip
Unique identifier
https://doi.org/10.5281/zenodo.821654
Dataset updated
Jan 24, 2020
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Helen M. Berman, Margaret J. Gabanyi, Andrei Kouranov, David I. Micallef, John Westbrook; Protein Structure Initiative network of investigators
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
Protein Structure Initiative - TargetTrack protein target registration database (795 MB, gzipped tarball)

The Protein Structure Initiative was a high-throughput structural genomics effort from 2000-2015 focused on developing technologies to enable greater coverage of protein structure space. Over its 15-year tenure, over 100 investigators at 35 centers (see ContributingCenters.xls) declared over 350,000 protein sequences (targets) that they would study using state-of-the-art protein production and structure determination methods. Many of these targets were selected through bioinformatics-based methods to serve as representatives for sequence and structure clusters.

From 2003-2010, these selected sequences and some basic identifying metadata were kept in a database called TargetDB, created at the Research Collaboratory for Structural Bioinformatics at Rutgers University. In 2008, a second database named PepcDB was created to track detailed experimental trial history and the standard protocols used by the PSI centers. These two databases became the principal structural genomics target databases, and were rolled into the PSI Structural Biology Knowledgebase in 2008.

As part of the third phase of the PSI, TargetDB and PepcDB were merged into a single resource, TargetTrack, to facilitate one-stop access to the data as well as expanding the schema to include new required data items. Participating centers deposited the latest status on their active targets and the protocols that were used (along with any deviations) on a weekly or quarterly basis. TargetTrack provided a variety of pre-computed data downloads on a weekly basis as well.

In July 2017, the Structural Biology Knowledgebase ceased operations. The files provided in this tarball represent the final datafiles generated by TargetTrack (timestamp June 30, 2017). Please read the README included in this dataset for descriptions of each file.

The entire TargetTrack datafile in XML format can be found in /TargetTrack XML files/tt.xml.gz

Key documentation can be found in the /Documentation folder.
TargetTrack schema: targetTrack-v1.4.1.pdf
Spreadsheet with TargetTrack enumerations for relevant fields: targetTrackEnumeratedDataItems-v1.4.1-1.xls
Image depicting the XML data schema: targetTrack-v1.4.1.jpg

These files are 868 MB in total size, uncompressed.
To open the tarball, use the command 'tar -zxvf TargetTrack-1Jul2017.tar.gz'

-- created by the PSI Structural Biology Knowledgebase, July 5, 2017
The Encyclopedia of Domains (TED) structural domains assignments for...
zenodo.org
application/gzip, bz2 +1
Updated Oct 31, 2024
+ more versions
Cite
Andy Lau; Nicola Bordin; Shaun Kandathil; Ian Sillitoe; Vaishali Waman; Jude Wells; Christine Orengo; David T Jones (2024). The Encyclopedia of Domains (TED) structural domains assignments for AlphaFold Database v4 [Dataset]. http://doi.org/10.5281/zenodo.13369203
Explore at:
Available download formats: application/gzip, bz2, zip
Unique identifier
https://doi.org/10.5281/zenodo.13369203
Dataset updated
Oct 31, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Andy Lau; Nicola Bordin; Shaun Kandathil; Ian Sillitoe; Vaishali Waman; Jude Wells; Christine Orengo; David T Jones
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Dataset description:

The Encyclopedia of Domains (TED) is a joint effort by CATH (Orengo group) and the Jones group at University College London to identify and classify protein domains in AlphaFold2 models from AlphaFold Database version 4, covering over 188 million unique sequences and 324 million domain assignments.

In this data release, we will be making available to the community a table of domain boundaries and additional metadata on quality (pLDDT, globularity, number of secondary structures), taxonomy and putative CATH SuperFamily or Fold assignments for all 324 million domains in TED100.

For all chains in the TED-redundant dataset, the attached file contains boundary predictions, consensus level and information on the TED100 representative.

Additionally, an archive with chain-level consensus domain assignments is available for 21 model organisms and 25 global health proteomes:

For both TED100 and TED-redundant, we provide the domain boundary predictions output by each of the three methods employed in the project (Chainsaw, Merizo, UniDoc).

We are making available PDB files for 7,427 novel folds, identified during the TED classification process, with an annotation table sorted by novelty.

Please use the gunzip command to extract files with a '.gz' extension.

CATH annotations have been assigned using the FoldSeek algorithm applied in various modes and the FoldClass algorithm, both of which are used to report significant structural similarity to a known CATH domain.
Note: The TED protocol differs from that of our standard CATH Assignment protocol for superfamily assignment, which also involves HMM-based protocols and manual curation for remote matches.

This dataset contains:

ted_214m_per_chain_segmentation.tsv
The file contains all 214M protein chains in TED with consensus domain boundaries and proteome information in the following columns.
1. AFDB_model_ID: chain identifier from AFDB in the format AF-

ted_365m_domain_boundaries_consensus_level.tsv.gz
The file contains all domain assignments in TED100 and TED-redundant (365M) in the format:
1. TED_ID: TED domain identifier in the format AF-

ted_100_324m.domain_summary.cath.globularity.taxid.tsv and novel_folds_set.domain_summary.tsv are header-less with the following columns separated by tabs (.tsv).

ted_324m_seq_clustering.cathlabels.tsv
The file contains the results of the domain sequences clustering with MMseqs2.
Columns:
1. Cluster_representative
2. Cluster_member
3. CATH code assignment if available, e.g. 3.40.50.300 for a domain with a homologous match or 3.20.20 for a domain matching at the fold level in the CATH classification
4. CATH assignment type - either Foldseek-T, Foldseek-H or Foldclass
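
The CATH code in column 3 has either four dot-separated numbers (a homologous superfamily match) or three (a fold-level match). The sketch below maps a code to the deepest CATH level it specifies, using the standard CATH hierarchy of Class, Architecture, Topology (fold) and Homologous superfamily. The function name cath_level is hypothetical, and the sketch does not check that a code actually exists in CATH.

```python
# Names of the four CATH levels, in order of depth.
CATH_LEVELS = ("Class", "Architecture", "Topology", "Homologous superfamily")


def cath_level(code: str) -> str:
    """Return the deepest CATH level named by a dot-separated code."""
    parts = code.strip().split(".")
    if not 1 <= len(parts) <= 4 or not all(p.isdigit() for p in parts):
        raise ValueError(f"not a CATH code: {code!r}")
    return CATH_LEVELS[len(parts) - 1]


if __name__ == "__main__":
    for example in ("3.40.50.300", "3.20.20"):
        print(example, "->", cath_level(example))
```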

novel_folds_set.domain_summary.tsv is sorted by novelty.
1. ted_id - TED domain identifier in the format AF-

Domain assignments for TED redundant using single-chain and multi-chain consensus in ted_redundant_39m.multichain.consensus_domain_summary.taxid.tsv and ted_redundant_39m.singlechain.consensus_domain_summary.taxid.tsv
The files contain a header with the following fields. Each column is tab-separated (.tsv).
1. TED_redundant_id - TED chain identifier in the format AF-

and ted_redundant_39m.singlechain.consensus_domain_summary.taxid.tsv
The file contains a header with the following fields. Each column is tab-separated (.tsv).
1. TED_redundant_id - TED chain identifier in the format AF-

novel_folds_set_models.tar.gz contains PDB files of all novel folds identified in TED100.

All per-tool domain boundaries predictions are in the same format with the following columns.
1. TED_chainID - TED chain identifier in the format AF-

Domain boundary predictions share the same format, with each segment separated by '_' and segment boundaries (start, stop) separated by '-'.

For example, the domain prediction by Merizo for AF-A0A000-F1-model_v4:
AF-A0A000-F1-model_v4 e8872c7a0261b9e88e6ff47eb34e4162 394 2 10-52_289-394,53-288 0.90077

Merizo predicts one discontinuous domain and one continuous domain:
Domain 1 (discontinuous): 10-52_289-394
segment 1: 10-52
segment 2: 289-394
Domain 2 (continuous): 53-288
segment 1: 53-288
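
The boundary string from the Merizo example can be split into domains and segments with a few lines of code. This is a minimal sketch, not part of the TED tooling. It assumes that domains are separated by ',' (inferred from the example row rather than stated explicitly), and the name parse_domains is hypothetical.

```python
from typing import List, Tuple

Segment = Tuple[int, int]  # (start, stop) residue numbers


def parse_domains(boundaries: str) -> List[List[Segment]]:
    """Split a boundary string into domains, each a list of segments."""
    domains = []
    for domain in boundaries.split(","):      # assumed domain separator
        segments = []
        for segment in domain.split("_"):     # segments within one domain
            start, stop = segment.split("-")  # start and stop of a segment
            segments.append((int(start), int(stop)))
        domains.append(segments)
    return domains


if __name__ == "__main__":
    for number, domain in enumerate(parse_domains("10-52_289-394,53-288"), 1):
        kind = "discontinuous" if len(domain) > 1 else "continuous"
        print(f"Domain {number} ({kind}): {domain}")
```

A domain with more than one segment is discontinuous, which matches the reading of the example row.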

ted-tools-main.zip - copy of the https://github.com/psipred/ted-tools repository, containing tools and software used to generate TED.

cath-alphaflow-main.zip - copy of CATH-AlphaFlow, used to generate globularity scores for TED domains.

ted-web-master.zip - copy of TED-web, containing code to generate the web interface of TED (https://ted.cathdb.info)

gofocus_data.tar.bz2 - GOFocus model weights
Ribosomal protein database of Proteus vulgaris ATCC 49132
figshare.com
xlsx
Updated Feb 22, 2021
Cite
Wenfa Ng (2021). Ribosomal protein database of Proteus vulgaris ATCC 49132 [Dataset]. http://doi.org/10.6084/m9.figshare.14071577.v1
Explore at:
Available download formats: xlsx
Unique identifier
https://doi.org/10.6084/m9.figshare.14071577.v1
Dataset updated
Feb 22, 2021
Dataset provided by
Figshare (http://figshare.com/)
Authors
Wenfa Ng
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This work presents the ribosomal protein database of Proteus vulgaris ATCC 49132. Original data for the work came from the annotated proteome data of the bacterium downloaded from UniProt. Using an in-house MATLAB ribosomal protein database analysis software, the original proteome data file was parsed to extract protein name and amino acid sequence of all ribosomal proteins in the species. The database also includes calculated variables such as number of residues, molecular weight, and nucleotide sequence. Overall, the presented database could serve as a ribosomal protein mass fingerprint for use in microbial identification, or it could be used in fundamental studies seeking to uncover new insights into ribosomal protein biology.
CPBI_seqdb_demo sample QFO sequence library
data.niaid.nih.gov
Updated Jan 21, 2020
Cite
William R. Pearson (2020). CPBI_seqdb_demo sample QFO sequence library [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_377027
Explore at:
Dataset updated
Jan 21, 2020
Dataset provided by
U. of Virginia
Authors
William R. Pearson
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
A medium-sized (approx. 1 million entry) protein sequence database constructed from the NCBI 'nr' (Jan, 2017) database by selecting UniProt (Swiss-Prot), RefSeq, and PDB entries for 66 species (taxon_id's) from the Quest for Orthologs organism set. These files are designed to be used in conjunction with scripts and SQL files to construct the seqdb_demo database, as described in Current Protocols in Bioinformatics Unit 3.9, revised Spring 2017. The files are:

qfo_demo.gz - a FASTA-format sequence library with the current NR Defline format (gzip compressed)

qfo_prot.accession2taxonid.gz, qfo_pdb.accession2taxid.gz - tables that map accessions to taxon_id's and gi-numbers, similar to those available in the NCBI pub/taxonomy/accession2taxid/prot.accession2taxid and pdb.accession2taxid files (gzip compressed).
iPTMnet
scicrunch.org
rrid.site
Cite
iPTMnet [Dataset]. http://identifiers.org/RRID:SCR_014416
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_014416
Description
A protein database which connects multiple disparate bioinformatics tools and systems, including text mining, data mining, analysis and visualization tools, as well as databases and ontologies.
Diamond2GO Database – Version 2025-08-12 (nr_clean_d2go)
zenodo.org
bin
Updated Aug 13, 2025
Cite
Rhys Farrer (2025). Diamond2GO Database – Version 2025-08-12 (nr_clean_d2go) [Dataset]. http://doi.org/10.5281/zenodo.16818512
Explore at:
Available download formats: bin
Unique identifier
https://doi.org/10.5281/zenodo.16818512
Dataset updated
Aug 13, 2025
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Rhys Farrer
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Jul 28, 2025
Description
This is the updated Diamond2GO reference database built on 12th August 2025.

It is a DIAMOND-formatted protein database (`.dmnd`) consisting of over 27 million sequences derived from the NCBI `nr` dataset, filtered to include only those with Gene Ontology (GO) annotations and reduced in redundancy using MMseqs2 (95% similarity). This version improves sensitivity and annotation coverage compared to both the original 2023 release used in the published D2GO manuscript and the earlier 2025 release.

This database is intended for use with the Diamond2GO tool, which enables rapid GO-term annotation and enrichment analysis for high-throughput sequencing datasets.

For reproducibility of results published using the earlier version (699,409 sequences), please refer to the v1.0.0 release: https://github.com/rhysf/Diamond2GO/releases/tag/6a035ce
Worldwide Protein Data Bank (wwPDB)
rrid.site
scicrunch.org
+2 more
Updated Jan 29, 2022
Cite
(2022). Worldwide Protein Data Bank (wwPDB) [Dataset]. http://identifiers.org/RRID:SCR_006555/resolver
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_006555 https://identifiers.org/RRID:SCR_006555/resolver
Dataset updated
Jan 29, 2022
Description
Public global Protein Data Bank archive of macromolecular structural data, overseen by organizations that act as deposition, data processing and distribution centers for PDB data. Members are RCSB PDB (USA), PDBe (Europe), PDBj (Japan) and BMRB (USA). This site provides information about services provided by individual member organizations and about projects undertaken by wwPDB. Data are available via the websites of its member organizations.
Ribosomal protein database of Providencia rettgeri strain Dmel1
figshare.com
xlsx
Updated Feb 24, 2021
Cite
Wenfa Ng (2021). Ribosomal protein database of Providencia rettgeri strain Dmel1 [Dataset]. http://doi.org/10.6084/m9.figshare.14099357.v1
Explore at:
Available download formats: xlsx
Unique identifier
https://doi.org/10.6084/m9.figshare.14099357.v1
Dataset updated
Feb 24, 2021
Dataset provided by
Figshare (http://figshare.com/)
Authors
Wenfa Ng
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This work presents the ribosomal protein database of Providencia rettgeri strain Dmel1. Original data for the work came from the annotated proteome data of the bacterium downloaded from UniProt. Using an in-house MATLAB ribosomal protein database analysis software, the original proteome data file was parsed to extract protein name and amino acid sequence of all ribosomal proteins in the species. The database also includes calculated variables such as number of residues, molecular weight, and nucleotide sequence. Overall, the presented database could serve as a ribosomal protein mass fingerprint for use in microbial identification, or it could be used in fundamental studies seeking to uncover new insights into ribosomal protein biology.
LukProt - an animal evolution-centric eukaryotic protein database
data.niaid.nih.gov
Updated Feb 7, 2025
Cite
Sobala, Łukasz F. (2025). LukProt - an animal evolution-centric eukaryotic protein database [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7089120
Explore at:
Dataset updated
Feb 7, 2025
Dataset provided by
Hirszfeld Institute of Immunology and Experimental Therapy, PAS
Authors
Sobala, Łukasz F.
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
LukProt is the EukProt database with additional species added, mostly the undersampled animal and some holozoan taxa. The database is composed of sequences translated from annotated genomes, transcriptomes or ESTs. The main purposes of the database are to consolidate sequences from undersampled animal taxa and provide usable search tools. The publication associated with LukProt can be found here: https://doi.org/10.1093/gbe/evae231.

The current version of the database (v1.5.1) is based on EukProt v3. The home of all public versions of LukProt is this page (Zenodo).

Proteomes that are novel in LukProt are denoted as LPXXXXX and those coming from AniProtDB are called APXXXXX. The sequence IDs from EukProt are conserved in LukProt. This means that each sequence is assigned an ID in the following format:

(A/E/L)PXXXXX_Species_epithet_(strain)_PYYYYYY

where XXXXX is a number from 00001 to 99999 and YYYYYY is a number from 000001 to 999999. Each sequence is assigned a unique number YYYYYY, and each taxon a number XXXXX. All the IDs are compatible with the BLAST v5 "-parse_seqids" option, and the database can be readily deployed, for example on a server running SequenceServer. Within each of the source FASTA files, the source sequence identifier was kept after a blank space, so that it can still be retrieved if needed.
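
The ID format can be checked and split with a regular expression. The sketch below assumes the optional strain is simply part of the underscore-separated name, so it captures everything between the taxon number and the final _PYYYYYY as one field. The names ID_PATTERN, LukProtId and parse_id are hypothetical, and the example identifier is invented for illustration.

```python
import re
from typing import NamedTuple

# (A/E/L)P + five digits, then the species name, then _P + six digits.
ID_PATTERN = re.compile(
    r"^(?P<source>[AEL]P)(?P<taxon>\d{5})_(?P<name>.+)_P(?P<seq>\d{6})$"
)


class LukProtId(NamedTuple):
    source: str    # "AP", "EP" or "LP"
    taxon: int     # XXXXX
    name: str      # species, epithet and optional strain
    sequence: int  # YYYYYY


def parse_id(identifier: str) -> LukProtId:
    """Split an identifier of the documented form into its parts."""
    match = ID_PATTERN.match(identifier)
    if match is None:
        raise ValueError(f"unexpected ID format: {identifier!r}")
    return LukProtId(
        match["source"], int(match["taxon"]), match["name"], int(match["seq"])
    )


if __name__ == "__main__":
    # Invented identifier, not taken from the database.
    print(parse_id("LP00012_Genus_species_P000345"))
```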

A publicly available BLAST server providing LukProt search is available at: https://lukprot.hirszfeld.pl/.

Comparison of EukProt v2/v3, LukProt v1.4.1 and LukProt v1.5.1 in their main areas of difference:

Taxogroup                     EukProt v2   EukProt v3   LukProt v1.4.1   LukProt v1.5.1
Holozoa (excluding Metazoa)   31           40           39               43
Ctenophora                    2            2            35               38
Porifera                      4            5            30               47
Placozoa                      2            2            3                6
Cnidaria                      3            5            65               88
Bilateria                     51           51           94               142

Included with the database are:

ready to use main database files:

LukProt_v1.5.1_single_species_FASTA.7z – a FASTA file with the sequences - 7-zipped, uncompressed size: 17.6 GB

to concatenate all into one file, run this in the parent directory: for file in $(find . -type f -name "*.fasta"); do awk 'FNR==1{print ""}1' $file >> LukProt_v1.5.1.fa; done. This will create a single FASTA file with all the sequences in the parent directory. awk is used to insert a new line between files because cat would sometimes merge the last sequence of one file with the header of the first sequence of the next.
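
For readers who prefer Python to the shell loop, the same concatenation can be sketched as below. This is an illustrative alternative, not part of the LukProt distribution, and the function name concatenate_fasta is hypothetical. Unlike the awk version, it adds a newline only when a file does not already end with one, so no blank lines appear in the output.

```python
from pathlib import Path


def concatenate_fasta(parent: Path, output: Path) -> int:
    """Append every *.fasta file under parent to output; return the file count."""
    count = 0
    # Give the output a different extension (e.g. .fa) so it is not
    # picked up by the *.fasta search itself.
    with output.open("wb") as out:
        for path in sorted(parent.rglob("*.fasta")):
            last = b"\n"
            with path.open("rb") as src:
                # Copy in 1 MiB chunks to keep memory use low.
                for chunk in iter(lambda: src.read(1 << 20), b""):
                    out.write(chunk)
                    last = chunk[-1:]
            if last != b"\n":
                # Keep the next header from fusing with this file's last line.
                out.write(b"\n")
            count += 1
    return count
```

A typical call from the parent directory would be concatenate_fasta(Path("."), Path("LukProt_v1.5.1.fa")).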

LukProt_v1.5.1_full_BLAST_db.7z – a preformatted, full BLAST database (NCBI BLAST database format version: v5, masked with segmasker), uncompressed size: 28.3 GB

LukProt_v1.5.1_taxogroup_BLAST_db.7z – a collection of BLAST databases where each proteome is one taxogroup and is placed within the eukaryotic tree of life directory structure, uncompressed size: 26.3 GB

LukProt_v1.5.1_single_species_BLAST_db.7z – a collection of BLAST databases where each proteome is one BLAST database and is placed within the eukaryotic tree of life directory structure, uncompressed size: 26.4 GB

auxiliary database files:

LukProt_v1.5.1.cdhit70.7z – the full database clustered at 70% identity using CD-HIT with the following command: cd-hit -g 1 -d 0 -T 20 -M 90000 -c 0.7 -uL 0.2 -uS 0.9 -s 0.2, uncompressed sizes: fasta file - 11 GB, clstr file - 2.5 GB

LukProt_IDs_mapped.txt.gz – a text file mapping the LukProt IDs to the AniProtDB IDs and EukProt IDs that are different

BUSCO_tables.ods – a spreadsheet with full result tables generated by BUSCO analysis

OMAmer_output.zip – a folder with full results of OMAmer analyses (includes per-sequence taxonomy classification)

OMArk_output.zip – a folder with the results of all OMArk analyses

metadata:

README.md – a README file describing the metadata

LukProt_metadata_sheet.ods – main metadata file. A spreadsheet with information about each proteome (in an open .ods format, most compatible with LibreOffice)

LukProt_metadata_other.zip – an archive with other metadata files, documented in the README. Contents include:

the LukProt taxonomy in various formats

supporting scripts for data manipulation and visualization

a recoloring script (modified by LFS, originally by Dr. Celine Petitjean). The script is in public domain and reuploaded here only for convenience.

other files - see README

changelog.md – database changelog

Words of caution:

The database has been synchronized to EukProt v3 in version v1.5.1. This means that identifiers were modified in comparison to LukProt v1.4.1. The convention is not expected to change any more in future updates.

Many proteomes, especially those transcriptome-based, may contain contamination from different species. In addition, the translation algorithms often introduce errors (e.g. the transcript may not represent a full length protein). For this reason, to get accurate sequences from each organism, users are directed to source data and to the included OMAmer, OMArk and BUSCO data for details.

The taxonomy is different to UniEuk/EukMap, but UniEuk data were integrated where possible.

A few NCBI taxids are missing and will be added in due course.

Proteomes from NCBI and UniProt will be updated to current versions.

A number of proteomes present in some metadata are unpublished and were held back.

While the database contains metadata that present a particular phylogeny of animals, holozoans and other eukaryotes, no particular claims or hypotheses are made by the author(s). However, in the future efforts will be made to name clades officially, once they are more firmly established.

Please report any problems or suggestions to Lukasz Sobala: lukasz.sobala (at) hirszfeld.pl.

Acknowledgements:

Andrew E. Allen Lab for creating the original PhyloDB.

Daniel Richter et al. for creating EukProt and keeping it updated.

Members of the Multicellgenome Lab, especially Michelle Leger (for donating her database), for the bioinformatics support and for doing great science.

All the authors of the original data.

National Science Centre of Poland for funding of the project 2020/36/C/NZ8/00081, "The role of glycosylation in the emergence of animal multicellularity", which enabled the creation of this database.
SUPFAM
dknet.org
neuinfo.org
+2 more
Updated Jan 29, 2022
+ more versions
Cite
(2022). SUPFAM [Dataset]. http://identifiers.org/RRID:SCR_005304
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_005304
Dataset updated
Jan 29, 2022
Description
SUPFAM is a database that consists of clusters of potentially related homologous protein domain families, with and without three-dimensional structural information, forming superfamilies. The present release (Release 3.0) of SUPFAM uses homologous families in Pfam (Version 23.0) and SCOP (Release 1.69), which are examples of sequence-alignment and structure classification databases respectively. The two steps involved in setting up the SUPFAM database are:
* Relating Pfam and SCOP families using a new profile-profile alignment algorithm, AlignHUSH. This identifies many Pfam families which could be related to a family or superfamily with known structural information.
* An all-against-all match among Pfam families of as yet unknown structure, which identifies related Pfam families forming new potential superfamilies.
The SUPFAM database can be used in either Browse mode or Search mode. In Browse mode you can browse through the Superfamilies, Pfam families or SCOP families. In each of these modes you will be presented with a full list which can be easily browsed. In Search mode, you can search for Pfam families, SCOP families or Superfamilies based on keywords or SCOP/Pfam identifiers of families and superfamilies. THIS RESOURCE IS NO LONGER IN SERVICE. Documented on September 16, 2025.
CATH protein domain classification (version 4.2)
rdr.ucl.ac.uk
txt
Updated May 30, 2023
Cite
Ian Sillitoe; Natalie Dawson; Christine Orengo (2023). CATH protein domain classification (version 4.2) [Dataset]. http://doi.org/10.5522/04/7937330.v1
Explore at:
Available download formats: txt
Unique identifier
https://doi.org/10.5522/04/7937330.v1
Dataset updated
May 30, 2023
Dataset provided by
University College London
Authors
Ian Sillitoe; Natalie Dawson; Christine Orengo
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
CATH is a classification of protein structures downloaded from the Protein Data Bank. We group protein domains into superfamilies when there is sufficient evidence they have diverged from a common ancestor. The files contained in this dataset correspond to the version 4.2 release of the CATH classification.
UniProt
dknet.org
neuinfo.org
+2 more
Updated Jan 29, 2022
Cite
(2022). UniProt [Dataset]. http://identifiers.org/RRID:SCR_002380
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_002380
Dataset updated
Jan 29, 2022
Description
Collection of data of protein sequence and functional information. Resource for protein sequence and annotation data. Consortium for preservation of the UniProt databases: UniProt Knowledgebase (UniProtKB), UniProt Reference Clusters (UniRef), and UniProt Archive (UniParc), UniProt Proteomes. Collaboration between European Bioinformatics Institute (EMBL-EBI), SIB Swiss Institute of Bioinformatics and Protein Information Resource. Swiss-Prot is a curated subset of UniProtKB.
Data from: A novel protein motif finding algorithm for classification of the...
bridges.monash.edu
researchdata.edu.au
pdf
Updated Nov 21, 2017
Cite
Sun, Deng-Kuan; Zhang, Tong-Liang; Ding, Yong-Sheng (2017). A novel protein motif finding algorithm for classification of the ligase subfamilies [Dataset]. http://doi.org/10.4225/03/5a1371c69c0e3
Explore at:
Available download formats: pdf
Unique identifier
https://doi.org/10.4225/03/5a1371c69c0e3
Dataset updated
Nov 21, 2017
Dataset provided by
Monash University
Authors
Sun, Deng-Kuan; Zhang, Tong-Liang; Ding, Yong-Sheng
License
http://rightsstatements.org/vocab/InC/1.0/
Description
Extracting motifs from a family or subfamily remains an active research topic in bioinformatics. It not only helps to understand protein function and to predict the class to which an unknown protein sequence belongs, but also helps to study protein-protein interactions. In this paper, we present a novel algorithm to extract the motifs of a subfamily, based on feature selection and position connection. Position connection is applied to generate motifs; this hybrid method uses a vote decision-making mechanism to construct the classifier for the ligase subfamilies. In tests on the database, a predictive accuracy of more than 95.87% is achieved. The result demonstrates that this novel method is practical. In addition, the method shows that motifs play an important role in classifying proteins and in studying the characteristics of the subfamilies or families in a protein database. PRIB 2008 proceedings found at: http://dx.doi.org/10.1007/978-3-540-88436-1

Contributors: Monash University. Faculty of Information Technology. Gippsland School of Information Technology ; Chetty, Madhu ; Ahmad, Shandar ; Ngom, Alioune ; Teng, Shyh Wei ; Third IAPR International Conference on Pattern Recognition in Bioinformatics (PRIB) (3rd : 2008 : Melbourne, Australia) ; Coverage: Rights: Copyright by Third IAPR International Conference on Pattern Recognition in Bioinformatics. All rights reserved.
Protein Data Bank
kaggle.com
zip
Updated May 3, 2025
Cite
Ahmet Can GÜNAY (2025). Protein Data Bank [Dataset]. https://www.kaggle.com/datasets/ahmetcangunay1/protein-data-bank
Explore at:
Available download formats: zip (5079269900 bytes)
Dataset updated
May 3, 2025
Authors
Ahmet Can GÜNAY
License
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Description
📖 Context & Inspiration

This dataset was created to provide a comprehensive, easily accessible archive of Protein Data Bank (PDB) files—vital for researchers and developers working in structural biology, molecular modeling, and bioinformatics. The files were gathered from a public directory-style server, where PDB structures are organized into subfolders based on naming conventions.

The goal was to simplify access to this valuable data by restructuring it in a more machine-learning-friendly format, eliminating the need for repetitive scraping or manual downloads. Inspiration came from real-world bottlenecks in data preprocessing for protein folding prediction models, drug design simulations, and structure-based machine learning pipelines.

🌐 Source

All files were sourced from a publicly available FTP-style archive mirroring the PDB repository. The original structure was preserved to ensure traceability and compatibility with existing workflows.

⚠️ Note: This dataset is intended for research and educational use. Ensure compliance with any licensing or usage terms defined by the PDB archive.
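
The description does not spell out which subfolder convention the files follow. As a sketch only, the function below assumes the common wwPDB "divided" layout, in which an entry is stored in a subfolder named after the middle two characters of its four-character ID and the file is named `pdb<id>.ent.gz`. The `root` default and the function name are hypothetical; check the actual folder names in the download before relying on this.

```python
from pathlib import PurePosixPath


def divided_path(pdb_id: str, root: str = "pdb") -> PurePosixPath:
    """Build the relative path of an entry under the assumed 'divided' layout."""
    pid = pdb_id.strip().lower()
    if len(pid) != 4 or not pid.isalnum():
        raise ValueError(f"expected a four-character PDB ID, got {pdb_id!r}")
    # Subfolder = characters 2 and 3 of the ID (zero-based slice 1:3).
    return PurePosixPath(root) / pid[1:3] / f"pdb{pid}.ent.gz"


path = divided_path("4HHB")
```

With this convention, `divided_path("4HHB")` resolves to `pdb/hh/pdb4hhb.ent.gz`. If the dataset uses a different naming scheme, only the slice and the file-name template need to change.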
Data from: Knowledge-based prediction of protein backbone conformation using...
data.niaid.nih.gov
zenodo.org
+1more
zip
Updated Oct 23, 2018
Cite
Iyanar Vetrivel; Swapnil Mahajan; Manoj Tyagi; Lionel Hoffmann; Yves-Henri Sanejouand; Narayanaswamy Srinivasan; Alexandre de Brevern; Frédéric Cadet; Bernard Offmann (2018). Knowledge-based prediction of protein backbone conformation using a structural alphabet [Dataset]. http://doi.org/10.5061/dryad.3f5q5
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.3f5q5
Dataset updated
Oct 23, 2018
Dataset provided by
University of Reunion Island
Nantes Université
Authors
Iyanar Vetrivel; Swapnil Mahajan; Manoj Tyagi; Lionel Hoffmann; Yves-Henri Sanejouand; Narayanaswamy Srinivasan; Alexandre de Brevern; Frédéric Cadet; Bernard Offmann
License
https://spdx.org/licenses/CC0-1.0.html
Description
Libraries of structural prototypes that abstract protein local structures are known as structural alphabets and have proven to be very useful in various aspects of protein structure analysis and prediction. One such library, Protein Blocks, is composed of 16 standard structural prototypes, each 5 residues long. Analyzing a protein in this way involves representing its structure as a string of Protein Blocks. Predicting the local structure of a protein in terms of Protein Blocks is the general objective of this work. A new approach, PB-kPRED, is proposed towards this aim. It involves (i) organizing the structural knowledge in the form of a database of pentapeptide fragments extracted from all protein structures in the PDB and (ii) applying a knowledge-based algorithm, which does not rely on any secondary structure predictions and/or sequence alignment profiles, to scan this database and predict the most probable backbone conformations for the protein local structures. However, PB-kPRED preferentially uses structural information from homologues, if available. The predictions were evaluated rigorously on 15,544 query proteins representing a non-redundant subset of the PDB filtered at a 30% sequence identity cut-off. We have shown that the kPRED method was able to achieve mean accuracies ranging from 40.8% to 66.3%, depending on the availability of homologues. The impact of the different strategies for scanning the database on the prediction was evaluated and is discussed. Our results highlight the usefulness of the method in the context of proteins without any known structural homologues. A scoring function that gives a good estimate of the accuracy of prediction was further developed. This score estimates the accuracy of the algorithm very well (R² of 0.82). An online version of the tool is provided freely for non-commercial usage at http://www.bo-protscience.fr/kpred/.
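
To make the idea of a pentapeptide fragment database concrete, here is a toy sketch. It is not PB-kPRED: it assumes a simple "most frequent block for an exact pentapeptide match" rule, one block letter per 5-residue window, and a placeholder letter for fragments that are not in the database. All function names, the `unknown` placeholder and the example sequences are hypothetical.

```python
from collections import Counter, defaultdict


def pentapeptides(seq):
    """Return all overlapping 5-residue windows of a sequence."""
    return [seq[i:i + 5] for i in range(len(seq) - 4)]


def build_db(pairs):
    """Count which block letter was observed for each pentapeptide.

    pairs: iterable of (sequence, blocks), where blocks holds one letter
    per window, i.e. len(blocks) == len(sequence) - 4 (an assumption).
    """
    db = defaultdict(Counter)
    for seq, blocks in pairs:
        for frag, block in zip(pentapeptides(seq), blocks):
            db[frag][block] += 1
    return db


def predict(seq, db, unknown="z"):
    """Predict the most frequently observed block for each window."""
    out = []
    for frag in pentapeptides(seq):
        counts = db.get(frag)
        out.append(counts.most_common(1)[0][0] if counts else unknown)
    return "".join(out)


def accuracy(predicted, observed):
    """Fraction of positions where the predicted block equals the observed one."""
    if len(predicted) != len(observed):
        raise ValueError("block strings must have equal length")
    if not observed:
        return 0.0
    return sum(p == o for p, o in zip(predicted, observed)) / len(observed)


# Made-up training data, for illustration only.
db = build_db([("ACDEFG", "ab"), ("ACDEFK", "ac"), ("ACDEFG", "ab")])
pred = predict("ACDEFW", db)
acc = accuracy(pred, "ab")
```

The per-position accuracy computed here is the same kind of measure as the mean accuracies quoted above, although the published evaluation protocol is more involved than this exact-match lookup.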