63 datasets found
  1. Protein Structural Domain Classification

    • cathdb.info
    • ec.i4cologne.com
    • +3 more
    Updated Sep 30, 2024
    Cite
    (2024). Protein Structural Domain Classification [Dataset]. http://identifiers.org/MIR:00100005
    Explore at:
    Dataset updated
    Sep 30, 2024
    Description

    CATH Domain Classification List (latest release) - protein structural domains classified into CATH hierarchy.

  2. Benchmarks against protein structural databases.

    • plos.figshare.com
    xls
    Updated Jun 2, 2023
    Cite
    Robert K. Bradley; Adam Roberts; Michael Smoot; Sudeep Juvekar; Jaeyoung Do; Colin Dewey; Ian Holmes; Lior Pachter (2023). Benchmarks against protein structural databases. [Dataset]. http://doi.org/10.1371/journal.pcbi.1000392.t001
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Robert K. Bradley; Adam Roberts; Michael Smoot; Sudeep Juvekar; Jaeyoung Do; Colin Dewey; Ian Holmes; Lior Pachter
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Comparisons of the accuracies (Acc), sensitivities (Sn) and positive predictive values (PPV) of FSA and other alignment methods on the BAliBASE 3 [24] and SABmark 1.65 [25] databases. Probalign has the highest accuracy on the commonly-used BAliBASE 3 dataset and FSA in default mode has superior accuracy on the BAliBASE 3+fp and SABmark 1.65 datasets (note that only FSA and AMAP explicitly attempt to maximize the expected accuracy). FSA has higher positive predictive values than any other program on all datasets and can additionally achieve high sensitivity when run in maximum-sensitivity mode. The BAliBASE 3+fp dataset, which mirrors BAliBASE 3 but includes a single non-homologous sequence in each alignment, was designed to test the robustness of alignment programs to incomplete homology. Traditional alignment programs, designed to maximize sensitivity, suffer greatly-increased mis-alignment when even a single non-homologous sequence is introduced; in contrast, FSA is robust to the non-homologous sequence and has an unchanged positive predictive value. Remarkably, FSA was the only tested program with a mis-alignment rate of

  3. Data from: Structural Database of Allergenic Proteins

    • bioregistry.io
    Updated Jan 29, 2023
    Cite
    (2023). Structural Database of Allergenic Proteins [Dataset]. https://bioregistry.io/sdap
    Explore at:
    Dataset updated
    Jan 29, 2023
    Description

    SDAP is a Web server that integrates a database of allergenic proteins with various bioinformatics tools for performing structural studies related to allergens and characterization of their epitopes.

  4. Protein Structure Initiative - TargetTrack 2000-2017 - all data files

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip
    Updated Jan 24, 2020
    Cite
    Helen M. Berman, Margaret J. Gabanyi, Andrei Kouranov, David I. Micallef, John Westbrook; Protein Structure Initiative network of investigators (2020). Protein Structure Initiative - TargetTrack 2000-2017 - all data files [Dataset]. http://doi.org/10.5281/zenodo.821654
    Explore at:
    Available download formats: application/gzip
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Helen M. Berman, Margaret J. Gabanyi, Andrei Kouranov, David I. Micallef, John Westbrook; Protein Structure Initiative network of investigators
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Protein Structure Initiative - TargetTrack protein target registration database (795 MB, gzipped tarball)

    The Protein Structure Initiative was a high-throughput structural genomics effort from 2000-2015 focused on developing technologies to enable greater coverage of protein structure space. Over its 15-year tenure, over 100 investigators at 35 centers (see ContributingCenters.xls) declared over 350,000 protein sequences (targets) that they would study using state-of-the-art protein production and structure determination methods. Many of these targets were selected through bioinformatics-based methods to serve as representatives for sequence and structure clusters.

    From 2003-2010, these selected sequences and some basic identifying metadata were kept in a database called TargetDB, created at the Research Collaboratory for Structural Bioinformatics at Rutgers University. In 2008, a second database named PepcDB was created to track detailed experimental trial history and the standard protocols used by the PSI centers. These two databases became the principal structural genomics target databases, and were rolled into the PSI Structural Biology Knowledgebase in 2008.

    As part of the third phase of the PSI, TargetDB and PepcDB were merged into a single resource, TargetTrack, to facilitate one-stop access to the data as well as expanding the schema to include new required data items. Participating centers deposited the latest status on their active targets and the protocols that were used (along with any deviations) on a weekly or quarterly basis. TargetTrack provided a variety of pre-computed data downloads on a weekly basis as well.

    In July 2017, the Structural Biology Knowledgebase ceased operations. The files provided in this tarball represent the final datafiles generated by TargetTrack (timestamp June 30, 2017). Please read the README included in this dataset for descriptions of each file.

    The entire TargetTrack datafile in XML format can be found in /TargetTrack XML files/tt.xml.gz

    Key documentation can be found in the /Documentation folder.
    TargetTrack schema: targetTrack-v1.4.1.pdf
    Spreadsheet with TargetTrack enumerations for relevant fields: targetTrackEnumeratedDataItems-v1.4.1-1.xls
    Image depicting the XML data schema: targetTrack-v1.4.1.jpg

    These files are 868 MB in total size, uncompressed.
    To open the tarball, use the command 'tar -zxvf TargetTrack-1Jul2017.tar.gz'

    -- created by the PSI Structural Biology Knowledgebase, July 5, 2017
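
For readers who would rather inspect the archive from a script than from the shell, the following is a minimal Python sketch using the standard-library `tarfile` module. The function name `list_members` and its `limit` parameter are illustrative choices and are not part of the dataset's documentation; the tarball file name is the one quoted in the `tar` command above.

```python
import tarfile


def list_members(tarball_path, limit=10):
    """Return the names of up to `limit` members of a gzipped tarball.

    Iterating over the archive reads member headers one at a time,
    so nothing is extracted to disk.
    """
    names = []
    with tarfile.open(tarball_path, mode="r:gz") as archive:
        for member in archive:
            names.append(member.name)
            if len(names) >= limit:
                break
    return names


if __name__ == "__main__":
    # Assumes the tarball has been downloaded into the working directory.
    for name in list_members("TargetTrack-1Jul2017.tar.gz"):
        print(name)
```

Listing before extracting is a cautious default here, since the description states the files total 868 MB uncompressed.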

  5. Data from: Knowledge-based prediction of protein backbone conformation using...

    • data.niaid.nih.gov
    • zenodo.org
    • +1 more
    zip
    Updated Oct 23, 2018
    Cite
    Iyanar Vetrivel; Swapnil Mahajan; Manoj Tyagi; Lionel Hoffmann; Yves-Henri Sanejouand; Narayanaswamy Srinivasan; Alexandre de Brevern; Frédéric Cadet; Bernard Offmann (2018). Knowledge-based prediction of protein backbone conformation using a structural alphabet [Dataset]. http://doi.org/10.5061/dryad.3f5q5
    Explore at:
    Available download formats: zip
    Dataset updated
    Oct 23, 2018
    Dataset provided by
    Nantes Université
    University of Reunion Island
    Authors
    Iyanar Vetrivel; Swapnil Mahajan; Manoj Tyagi; Lionel Hoffmann; Yves-Henri Sanejouand; Narayanaswamy Srinivasan; Alexandre de Brevern; Frédéric Cadet; Bernard Offmann
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Libraries of structural prototypes that abstract protein local structures are known as structural alphabets and have proven to be very useful in various aspects of protein structure analysis and prediction. One such library, Protein Blocks, is composed of 16 standard 5-residue-long structural prototypes. This form of analysis involves representing a protein's structure as a string of Protein Blocks. Predicting the local structure of a protein in terms of Protein Blocks is the general objective of this work. A new approach, PB-kPRED, is proposed towards this aim. It involves (i) organizing the structural knowledge in the form of a database of pentapeptide fragments extracted from all protein structures in the PDB and (ii) applying a knowledge-based algorithm that does not rely on any secondary structure predictions and/or sequence alignment profiles to scan this database and predict the most probable backbone conformations for the protein local structures. PB-kPRED does, however, preferentially use structural information from homologues when it is available. The predictions were evaluated rigorously on 15,544 query proteins representing a non-redundant subset of the PDB filtered at a 30% sequence identity cut-off. We have shown that the kPRED method was able to achieve mean accuracies ranging from 40.8% to 66.3% depending on the availability of homologues. The impact of the different strategies for scanning the database on the prediction was evaluated and is discussed. Our results highlight the usefulness of the method in the context of proteins without any known structural homologues. A scoring function that gives a good estimate of the accuracy of prediction was further developed. This score estimates the accuracy of the algorithm very well (R2 of 0.82). An online version of the tool is provided freely for non-commercial usage at http://www.bo-protscience.fr/kpred/.
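
To make the idea of a pentapeptide fragment concrete, here is a minimal Python sketch that enumerates every overlapping 5-residue window of an amino acid sequence. It only illustrates the fragment width described above; it is not the PB-kPRED algorithm, and the function name and the toy sequence are invented for illustration.

```python
def pentapeptide_windows(sequence, width=5):
    """Return (start_index, fragment) for every overlapping window.

    Sequences shorter than `width` yield an empty list.
    """
    last_start = len(sequence) - width
    return [(i, sequence[i:i + width]) for i in range(last_start + 1)]


if __name__ == "__main__":
    # Toy sequence, not taken from the dataset.
    for start, fragment in pentapeptide_windows("ACDEFGH"):
        print(start, fragment)
```

A sequence of length n produces n - 4 windows at the default width, one per possible central residue.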

  6. 3D-Genomics Database

    • dknet.org
    • scicrunch.org
    • +2 more
    Updated Jan 29, 2022
    Cite
    (2022). 3D-Genomics Database [Dataset]. http://identifiers.org/RRID:SCR_007430
    Explore at:
    Dataset updated
    Jan 29, 2022
    Description

    THIS RESOURCE IS NO LONGER IN SERVICE, documented August 29, 2016. Database containing structural annotations for the proteomes of just under 100 organisms. Using data derived from public databases of translated genomic sequences, representatives from the major branches of Life are included: Prokaryota, Eukaryota and Archaea. The annotations stored in the database may be accessed in a number of ways. The help page provides information on how to access the database. 3D-GENOMICS is now part of a larger project, called e-Protein. The project brings together similar databases at three sites: Imperial College London, University College London and the European Bioinformatics Institute. e-Protein's mission statement is "To provide a fully automated distributed pipeline for large-scale structural and functional annotation of all major proteomes via the use of cutting-edge computer GRID technologies." The following databases are incorporated: NRprot, SCOP, ASTRAL, PFAM, Prosite, taxonomy, COG. The following eukaryotic genomes are incorporated: Anopheles gambiae, protein sequences from the mosquito genome; Arabidopsis thaliana, protein sequences from the Arabidopsis genome; Caenorhabditis briggsae, protein sequences from the C. briggsae genome; Caenorhabditis elegans, protein sequences from the worm genome; Ciona intestinalis, protein sequences from the sea squirt genome; Danio rerio, protein sequences from the zebrafish genome; Drosophila melanogaster, protein sequences from the fruitfly genome; Encephalitozoon cuniculi, protein sequences from the E. cuniculi genome; Fugu rubripes, protein sequences from the pufferfish genome; Guillardia theta, protein sequences from the G. theta genome; Homo sapiens, protein sequences from the human genome; Mus musculus, protein sequences from the mouse genome; Neurospora crassa, protein sequences from the N. crassa genome; Oryza sativa, protein sequences from the rice genome; Plasmodium falciparum, protein sequences from the P. falciparum genome; Rattus norvegicus, protein sequences from the rat genome; Saccharomyces cerevisiae, protein sequences from the yeast genome; Schizosaccharomyces pombe, protein sequences from the yeast genome

  7. Structural Protein Sequences

    • kaggle.com
    zip
    Updated Feb 3, 2018
    Cite
    SHAHIR (2018). Structural Protein Sequences [Dataset]. https://www.kaggle.com/shahir/protein-data-set
    Explore at:
    Available download formats: zip (28782775 bytes)
    Dataset updated
    Feb 3, 2018
    Authors
    SHAHIR
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    Context

    This is a protein data set retrieved from Research Collaboratory for Structural Bioinformatics (RCSB) Protein Data Bank (PDB).

    The PDB archive is a repository of atomic coordinates and other information describing proteins and other important biological macromolecules. Structural biologists use methods such as X-ray crystallography, NMR spectroscopy, and cryo-electron microscopy to determine the location of each atom relative to each other in the molecule. They then deposit this information, which is then annotated and publicly released into the archive by the wwPDB.

    The constantly-growing PDB is a reflection of the research that is happening in laboratories across the world. This can make it both exciting and challenging to use the database in research and education. Structures are available for many of the proteins and nucleic acids involved in the central processes of life, so you can go to the PDB archive to find structures for ribosomes, oncogenes, drug targets, and even whole viruses. However, it can be a challenge to find the information that you need, since the PDB archives so many different structures. You will often find multiple structures for a given molecule, or partial structures, or structures that have been modified or inactivated from their native form.

    Content

    There are two data files. Both are arranged on "structureId" of the protein:

    • pdb_data_no_dups.csv contains protein metadata, which includes details on protein classification, extraction methods, etc.

    • data_seq.csv contains >400,000 protein structure sequences.
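
Because both files share the "structureId" key, they can be combined with a simple lookup join. The following is a minimal Python sketch using only the standard library. Apart from `structureId`, the column names (`classification`, `chainId`, `sequence`) and all sample values are assumptions made up for illustration and should be checked against the actual CSV headers.

```python
import csv
import io

# Toy stand-ins for pdb_data_no_dups.csv and data_seq.csv.
# Every column except structureId, and every value, is hypothetical.
META_CSV = "structureId,classification\n101M,OXYGEN TRANSPORT\n102D,DNA\n"
SEQ_CSV = (
    "structureId,chainId,sequence\n"
    "101M,A,MVLSEGEWQLV\n"
    "102D,A,CGCAAATTTGCG\n"
    "102D,B,CGCAAATTTGCG\n"
)


def join_on_structure_id(meta_file, seq_file):
    """Attach the metadata row to each sequence row with the same structureId."""
    meta = {row["structureId"]: row for row in csv.DictReader(meta_file)}
    joined = []
    for row in csv.DictReader(seq_file):
        info = meta.get(row["structureId"])
        if info is not None:  # skip sequences without metadata
            joined.append({**info, **row})
    return joined


rows = join_on_structure_id(io.StringIO(META_CSV), io.StringIO(SEQ_CSV))

if __name__ == "__main__":
    for row in rows:
        print(row["structureId"], row["chainId"], row["classification"])
```

The sketch keeps one output row per sequence row, on the assumption that a structure can have several chains while the metadata file holds one row per structure.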

    Acknowledgements

    Original data set downloaded from http://www.rcsb.org/pdb/

    Inspiration

    The Protein Data Bank has helped the life science community study different diseases and come up with new drugs and solutions that support human survival.

  8. The Encyclopedia of Domains (TED) structural domains assignments for...

    • zenodo.org
    application/gzip, bz2 +1
    Updated Oct 31, 2024
    + more versions
    Cite
    Andy Lau; Nicola Bordin; Shaun Kandathil; Ian Sillitoe; Vaishali Waman; Jude Wells; Christine Orengo; David T Jones (2024). The Encyclopedia of Domains (TED) structural domains assignments for AlphaFold Database v4 [Dataset]. http://doi.org/10.5281/zenodo.13369203
    Explore at:
    Available download formats: application/gzip, bz2, zip
    Dataset updated
    Oct 31, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Andy Lau; Nicola Bordin; Shaun Kandathil; Ian Sillitoe; Vaishali Waman; Jude Wells; Christine Orengo; David T Jones
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset description:

    The Encyclopedia of Domains (TED) is a joint effort by CATH (Orengo group) and the Jones group at University College London to identify and classify protein domains in AlphaFold2 models from AlphaFold Database version 4, covering over 188 million unique sequences and 324 million domain assignments.

    In this data release, we will be making available to the community a table of domain boundaries and additional metadata on quality (pLDDT, globularity, number of secondary structures), taxonomy and putative CATH SuperFamily or Fold assignments for all 324 million domains in TED100.

    For all chains in the TED-redundant dataset, the attached file contains boundaries predictions, consensus level and information on the TED100 representative.

    Additionally, an archive with chain-level consensus domain assignments is available for 21 model organisms and 25 global health proteomes:

    For both TED100 and TEDredundant we provide domain boundaries predictions outputted by each of the three methods employed in the project (Chainsaw, Merizo, UniDoc).

    We are making available 7,427 novel folds PDB files, identified during the TED classification process with an annotation table sorted by novelty.

    Please use the gunzip command to extract files with a '.gz' extension.

    CATH annotations have been assigned using the FoldSeek algorithm applied in various modes and the FoldClass algorithm, both of which are used to report significant structural similarity to a known CATH domain.
    Note: The TED protocol differs from that of our standard CATH Assignment protocol for superfamily assignment, which also involves HMM-based protocols and manual curation for remote matches.


    This dataset contains:

    • ted_214m_per_chain_segmentation.tsv
      The file contains all 214M protein chains in TED with consensus domain boundaries and proteome information in the following columns.
      1. AFDB_model_ID: chain identifier from AFDB in the format AF-
    • ted_365m_domain_boundaries_consensus_level.tsv.gz
      The file contains all domain assignments in TED100 and TED-redundant (365M) in the format:
      1. TED_ID: TED domain identifier in the format AF-
    • ted_100_324m.domain_summary.cath.globularity.taxid.tsv and novel_folds_set.domain_summary.tsv are header-less with the following columns separated by tabs (.tsv).
    • ted_324m_seq_clustering.cathlabels.tsv
      The file contains the results of the domain sequences clustering with MMseqs2.
      Columns:
      1. Cluster_representative
      2. Cluster_member
      3. CATH code assignment if available i.e. 3.40.50.300 for a domain with a homologous match or 3.20.20 for a domain matching at the fold level in the CATH classification
      4. CATH assignment type - either Foldseek-T, Foldseek-H or Foldclass
    • novel_folds_set.domain_summary.tsv is sorted by novelty.
      1. ted_id - TED domain identifier in the format AF-
    • Domain assignments for TED redundant using single-chain and multi-chain consensus in ted_redundant_39m.multichain.consensus_domain_summary.taxid.tsv and ted_redundant_39m.singlechain.consensus_domain_summary.taxid.tsv
      The files contain a header with the following fields. Each column is tab-separated (.tsv).
      1. TED_redundant_id - TED chain identifier in the format AF-
    • and ted_redundant_39m.singlechain.consensus_domain_summary.taxid.tsv
      The file contains a header with the following fields. Each column is tab-separated (.tsv).
      1. TED_redundant_id - TED chain identifier in the format AF-
    • novel_folds_set_models.tar.gz contains PDB files of all novel folds identified in TED100.
    • All per-tool domain boundaries predictions are in the same format with the following columns.
      1. TED_chainID - TED chain identifier in the format AF-
    • Domain boundaries predictions share the same format, with each segment separated by '_' and segment boundaries (start,stop) separated by '-'

      e.g. domain prediction by Merizo for AF-A0A000-F1-model_v4:
      AF-A0A000-F1-model_v4 e8872c7a0261b9e88e6ff47eb34e4162 394 2 10-52_289-394,53-288 0.90077

      Merizo predicts one discontinuous domain and one continuous domain:
      Domain 1 (discontinuous): 10-52_289-394
      segment 1: 10-52
      segment 2: 289-394
      Domain 2 (continuous): 53-288
      segment 1: 53-288
    • ted-tools-main.zip - copy of the https://github.com/psipred/ted-tools repository, containing tools and software used to generate TED.
    • cath-alphaflow-main.zip - copy of CATH-AlphaFlow, used to generate globularity scores for TED domains.
    • ted-web-master.zip - copy of TED-web, containing code to generate the web interface of TED (https://ted.cathdb.info)
    • gofocus_data.tar.bz2 - GOFocus model weights
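
The domain boundary notation described above ('_' between segments, '-' between start and stop) can be parsed with a few string splits. The following is a minimal Python sketch, not code from the ted-tools repository. It assumes that domains are separated by ',' (inferred from the Merizo example line) and that the boundary string is the fifth whitespace-separated field and the domain count the fourth; the column names are truncated in this listing, so those positions are a guess based on the example.

```python
def parse_chopping(chopping):
    """Parse a string such as '10-52_289-394,53-288'.

    Returns a list of domains, each a list of (start, stop) segments.
    """
    domains = []
    for domain in chopping.split(","):      # assumed domain separator
        segments = []
        for segment in domain.split("_"):   # segments within one domain
            start, stop = segment.split("-")
            segments.append((int(start), int(stop)))
        domains.append(segments)
    return domains


# Example line quoted in the dataset description.
line = (
    "AF-A0A000-F1-model_v4 e8872c7a0261b9e88e6ff47eb34e4162 "
    "394 2 10-52_289-394,53-288 0.90077"
)
fields = line.split()
domains = parse_chopping(fields[4])

if __name__ == "__main__":
    for number, segments in enumerate(domains, start=1):
        kind = "discontinuous" if len(segments) > 1 else "continuous"
        print(f"Domain {number} ({kind}): {segments}")
```

A domain with more than one segment is reported as discontinuous, which matches the way the example above labels 10-52_289-394.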
  9. Single mutation protein structure pairs extracted from the PDB with...

    • fdr.uni-hamburg.de
    gz
    Updated Sep 30, 2023
    Cite
    Sieg Jochen; Rarey Matthias (2023). Single mutation protein structure pairs extracted from the PDB with MicroMiner [Dataset]. http://doi.org/10.25592/uhhfdm.13411
    Explore at:
    Available download formats: gz
    Dataset updated
    Sep 30, 2023
    Dataset provided by
    Universität Hamburg, ZBH - Center for Bioinformatics, Bundesstraße 43, 20146 Hamburg, Germany
    Authors
    Sieg Jochen; Rarey Matthias
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This page provides the single mutation data extracted with MicroMiner from the PDB. The data contains amino acid pairs in protein structures from the PDB, exemplifying single mutations’ local structural changes for single chains and pairs for protein–protein interfaces. Mutations to non-standard residues are also provided.
    See the MicroMiner publication for details:

    Sieg, J.; Rarey, M. Searching similar local 3D micro-environments in protein structure databases with MicroMiner, 2023 (accepted in Briefings in Bioinformatics)

    Data content:

    • pdb_all_monomer.tsv
      • all single mutations in monomer/single chains
      • 255853767 pairs/lines
      • 15GB
    • filtered_single_mutations_pdb_monomer.tsv
      • redundancy and similarity filtered pdb_all_monomer.tsv
      • 4868765 pairs/lines
      • 324MB
    • single_mutations_pdb_monomer_non_standard_aa.tsv
      • only single mutations containing non-standard residues in monomer/single chains
      • 350969 pairs/lines
      • 21MB
    • pdb_all_ppi.tsv
      • all single mutations at PPIs
      • 45752145 pairs/lines
      • 2.7GB
    • filtered_single_mutations_pdb_ppi.tsv
      • redundancy and similarity filtered pdb_all_ppi.tsv
      • 799130 pairs/lines
      • 54MB
    • single_mutations_pdb_ppi_non_standard_aa.tsv
      • only single mutations containing non-standard residues at PPIs
      • 114671 pairs/lines
      • 6.9MB

    A row in the TSV files describes the residue position of the single mutation in the wild-type (query) and mutant (hit). Multiple local structural and sequential similarity measures are provided, computed from the residue 3D micro-environments. The column fullSeqId contains the global sequence similarity. The first two rows of a TSV file look like this:

    queryName  queryChain  queryAA  queryPos  hitName  hitChain  hitAA  hitPos  siteIdentity  siteBackBoneRMSD  siteAllAtomRMSD  nofSiteResidues  alignmentLDDT  fullSeqId
    10GS  A  CYS  47  2J9H  A  ALA  48  0.938  0.223  0.431  16.0  0.996  0.976  0.976

    queryName: query PDB-ID

    queryChain: query chain ID

    queryAA: query amino acid type (three letter code)

    queryPos: query sequence position of the amino acid residue

    hitName: hit PDB-ID

    hitChain: hit chain ID

    hitAA: hit amino acid type (three letter code)

    hitPos: hit sequence position of the amino acid residue

    siteIdentity: sequence identity of the aligned micro-environments

    siteBackBoneRMSD: Calpha-RMSD of the aligned micro-environments

    siteAllAtomRMSD: all-atom-RMSD of the aligned micro-environments

    nofSiteResidues: number of residues in the micro-environments

    alignmentLDDT: mean LDDT score of all residues in the aligned micro-environments

    fullSeqId: global sequence identity of the query chain and hit chain (as specified by the chain IDs)
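
The column definitions above map directly onto a dictionary-per-row reader. The following is a minimal Python sketch that reads such a file with the standard `csv` module and keeps pairs whose micro-environments are structurally close. The tab delimiter is assumed from the .tsv extension, the 1.0 RMSD threshold is an arbitrary illustration, and the embedded sample row uses one value per named column (the example row printed above carries one extra trailing value whose meaning is not stated here).

```python
import csv
import io

HEADER = [
    "queryName", "queryChain", "queryAA", "queryPos",
    "hitName", "hitChain", "hitAA", "hitPos",
    "siteIdentity", "siteBackBoneRMSD", "siteAllAtomRMSD",
    "nofSiteResidues", "alignmentLDDT", "fullSeqId",
]
ROW = [
    "10GS", "A", "CYS", "47", "2J9H", "A", "ALA", "48",
    "0.938", "0.223", "0.431", "16.0", "0.996", "0.976",
]
SAMPLE = "\t".join(HEADER) + "\n" + "\t".join(ROW) + "\n"


def describe_mutations(tsv_file, max_all_atom_rmsd=1.0):
    """Return readable labels for pairs at or below the RMSD threshold."""
    labels = []
    for row in csv.DictReader(tsv_file, delimiter="\t"):
        if float(row["siteAllAtomRMSD"]) <= max_all_atom_rmsd:
            query = f'{row["queryName"]}:{row["queryChain"]} {row["queryAA"]}{row["queryPos"]}'
            hit = f'{row["hitName"]}:{row["hitChain"]} {row["hitAA"]}{row["hitPos"]}'
            labels.append(f"{query} -> {hit}")
    return labels


if __name__ == "__main__":
    print(describe_mutations(io.StringIO(SAMPLE)))
```

Reading row by row rather than loading the whole file is a deliberate choice, given that the largest file listed above is 15GB.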

    This work was supported by the German Federal Ministry of Education and Research as part of de.NBI [grant number 031L0105] and protP.S.I. [grant number 031B0405B].

  10. Data from: PROSITE

    • prosite.expasy.org
    • identifiers.org
    • +7 more
    Updated Oct 15, 2025
    Cite
    (2025). PROSITE [Dataset]. https://prosite.expasy.org/
    Explore at:
    Dataset updated
    Oct 15, 2025
    Description

    PROSITE consists of documentation entries describing protein domains, families and functional sites as well as associated patterns and profiles to identify them. PROSITE is complemented by ProRule, a collection of rules based on profiles and patterns, which increases the discriminatory power of profiles and patterns by providing additional information about functionally and/or structurally critical amino acids.

  11. Large-scale mapping of bioactive peptides in structural and sequence space

    • plos.figshare.com
    • datasetcatalog.nlm.nih.gov
    tiff
    Updated Jun 5, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Agustina E. Nardo; M. Cristina Añón; Gustavo Parisi (2023). Large-scale mapping of bioactive peptides in structural and sequence space [Dataset]. http://doi.org/10.1371/journal.pone.0191063
    Explore at:
    Available download formats: tiff
    Dataset updated
    Jun 5, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Agustina E. Nardo; M. Cristina Añón; Gustavo Parisi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The health-enhancing potential of bioactive peptides (BPs) has driven interest in food proteins as well as in the development of predictive methods. Research in this area has been especially active in using them as components of functional foods. Apparently, BPs do not have a given biological function in the proteins that contain them, and they do not evolve under independent evolutionary constraints. In this work we performed a large-scale mapping of BPs in sequence and structural space. Using well-curated BPs deposited in the BIOPEP database, we searched for exact matches in non-redundant sequence databases. Proteins containing BPs were used in fold-recognition methods to predict the corresponding folds, and BP occurrences were mapped. We found that the fold distribution of BP occurrences possibly reflects the relative abundance of sequences in databases. However, we also found that proteins with 5 or more BPs in their sequences correspond to well-populated protein folds, called superfolds. We also found that in well-populated superfamilies, BPs tend to adopt similar locations in the protein fold, suggesting the existence of hotspots. We think that our results could contribute to the development of new bioinformatics pipelines to improve BP detection.

  12. Structural Bioinformatics Software Market Research Report 2033

    • growthmarketreports.com
    csv, pdf, pptx
    Updated Sep 1, 2025
    Cite
    Growth Market Reports (2025). Structural Bioinformatics Software Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/structural-bioinformatics-software-market
    Explore at:
    Available download formats: pptx, csv, pdf
    Dataset updated
    Sep 1, 2025
    Dataset authored and provided by
    Growth Market Reports
    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Structural Bioinformatics Software Market Outlook



    As per our latest research, the global Structural Bioinformatics Software market size reached USD 1.48 billion in 2024, demonstrating robust demand across biopharmaceutical research, drug discovery, and academic sectors. The market is experiencing a healthy compound annual growth rate (CAGR) of 10.2% and is forecasted to attain a value of USD 3.58 billion by 2033. This growth can be attributed to the rapid advancements in computational biology, the increasing adoption of artificial intelligence and machine learning in protein structure prediction, and the surge in drug development activities globally.
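
The relationship between the three quoted figures can be checked with the standard compound-growth formula, end = start × (1 + rate)^years. The following is a minimal Python sketch of that arithmetic. It assumes compounding over the nine years from 2024 to 2033, which the report text does not state explicitly, and the function names are illustrative.

```python
def project_value(start_value, annual_rate, years):
    """Value after compounding `annual_rate` for `years` periods."""
    return start_value * (1.0 + annual_rate) ** years


def implied_cagr(start_value, end_value, years):
    """Annual growth rate that turns start_value into end_value."""
    return (end_value / start_value) ** (1.0 / years) - 1.0


YEARS = 2033 - 2024  # assumed compounding window

# Figures quoted in the report text, in USD billions.
projected_2033 = project_value(1.48, 0.102, YEARS)
implied_rate = implied_cagr(1.48, 3.58, YEARS)

if __name__ == "__main__":
    print(f"Projected 2033 value at 10.2%: {projected_2033:.2f}")
    print(f"CAGR implied by 1.48 -> 3.58: {implied_rate:.1%}")
```

Running both directions lets a reader see how closely the stated growth rate and the stated 2033 value agree under the assumed window.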




    One of the primary growth drivers for the Structural Bioinformatics Software market is the intensifying focus on precision medicine and personalized therapeutics. With the global pharmaceutical industry placing increasing emphasis on developing targeted therapies, there is a critical need for advanced software tools that can model, predict, and analyze complex biomolecular structures. These tools are pivotal for understanding protein-ligand interactions, predicting the effects of mutations, and identifying novel druggable targets. The integration of high-throughput sequencing data with structural bioinformatics platforms has further accelerated the pace of discovery, enabling researchers to move from raw data to actionable insights with unprecedented speed and accuracy.




    Another significant factor propelling the market is the evolution of computational power and cloud-based infrastructure. The exponential increase in available biological data, coupled with the complexity of protein folding and molecular dynamics simulations, demands scalable and high-performance computing resources. Cloud-based structural bioinformatics solutions have democratized access to sophisticated algorithms and databases, making them available to a broader range of users, including smaller biotech firms and academic labs. This shift has not only reduced the barriers to entry but also fostered greater collaboration and innovation in the field, as researchers can now share data, workflows, and results seamlessly across geographies.




    The market is also benefiting from heightened collaboration between academia, research organizations, and industry players. Public-private partnerships, government funding initiatives, and global consortia are fueling the development of next-generation structural bioinformatics platforms. These collaborations are focused on addressing critical challenges such as protein structure prediction, functional annotation, and molecular modeling. The emergence of open-source software and community-driven databases has further enriched the ecosystem, providing researchers with access to a wealth of curated data and cutting-edge analytical tools. As the field continues to evolve, the synergy between computational advancements and experimental validation is expected to drive the adoption of structural bioinformatics software across diverse end-user segments.



    Structure-Based Drug Design is an integral component of the drug discovery process, leveraging the detailed knowledge of the three-dimensional structure of biological targets to design more effective therapeutic agents. This approach utilizes advanced computational tools to model the interactions between drug candidates and their targets, allowing researchers to optimize binding affinity and selectivity. By focusing on the structural aspects of drug-target interactions, Structure-Based Drug Design enhances the precision and efficiency of the drug development pipeline, ultimately leading to the creation of more targeted and effective treatments. The integration of this methodology with structural bioinformatics software is revolutionizing the way researchers approach complex biological challenges, offering new avenues for innovation and discovery.




    From a regional perspective, North America remains the dominant market for structural bioinformatics software, accounting for the largest share in 2024, followed closely by Europe and the Asia Pacific region. The robust presence of leading pharmaceutical and biotechnology companies, coupled with significant investments in research and development, has established North America as a global innovation hub. Meanwhi

  13. Protein-centric rate of sequence evolution according to Rate4Site on...

    • figshare.com
    txt
    Updated Feb 9, 2021
    emmanuel levy; Benjamin Dubreuil (2021). Protein-centric rate of sequence evolution according to Rate4Site on orthogroups of 14 fungal species [Dataset]. http://doi.org/10.6084/m9.figshare.13735537.v2
    Explore at:
    Available download formats: txt
    Dataset updated
    Feb 9, 2021
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    emmanuel levy; Benjamin Dubreuil
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Overall, 25 descriptors (features) are calculated for 3797 unique proteins. The legend for each descriptor is given in the associated header file.

    Columns 1-5 provide protein identifiers:
    • ORF
    • SGD Gene Name
    • UniprotKB
    • Matching PDB structure?
    • PDB code of closest structure

    Columns 6-8 correspond to protein expression:
    • Integrated abundance in ppm
    • log10 abundance
    • Bins of abundance (5 bins)

    Columns 9-16 contain evolutionary rates averaged over:
    • Full sequence
    • Disordered residues
    • Not Disordered residues
    • Domain residues
    • Not Domain residues
    • Residues with PDB coordinates
    • Surface residues (>25% relative ASA)
    • Buried residues (

  14. Research Collaboratory for Structural Bioinformatics Protein Data Bank (RCSB...

    • dknet.org
    • rrid.site
    Updated Sep 2, 2024
    (2024). Research Collaboratory for Structural Bioinformatics Protein Data Bank (RCSB PDB) [Dataset]. http://identifiers.org/RRID:SCR_012820
    Explore at:
    Dataset updated
    Sep 2, 2024
    Description

    Collection of structural data of biological macromolecules. Database of information about 3D structures of large biological molecules, including proteins and nucleic acids. Users can perform queries on data and analyze and visualize results.

  15. SUPFAM

    • dknet.org
    • neuinfo.org
    • +2more
    Updated Jan 29, 2022
    + more versions
    (2022). SUPFAM [Dataset]. http://identifiers.org/RRID:SCR_005304
    Explore at:
    Dataset updated
    Jan 29, 2022
    Description

    SUPFAM is a database of clusters of potentially related homologous protein domain families, with and without three-dimensional structural information, forming superfamilies. The present release (Release 3.0) of SUPFAM uses homologous families in Pfam (Version 23.0) and SCOP (Release 1.69), which are examples of sequence-alignment and structure classification databases, respectively. Setting up the SUPFAM database involves two steps:
    • Relating Pfam and SCOP families using a new profile-profile alignment algorithm, AlignHUSH. This identifies many Pfam families that could be related to a family or superfamily with known structural information.
    • An all-against-all match among Pfam families of as yet unknown structure, which identifies related Pfam families forming new potential superfamilies.

    The SUPFAM database can be used in either Browse mode or Search mode. In Browse mode you can browse through the superfamilies, Pfam families or SCOP families; each is presented as a full list that can be easily browsed. In Search mode, you can search for Pfam families, SCOP families or superfamilies based on keywords or the SCOP/Pfam identifiers of families and superfamilies.

    THIS RESOURCE IS NO LONGER IN SERVICE. Documented on September 16, 2025.

  16. Protein Structure Modeling Service Market Report | Global Forecast From 2025...

    • dataintelo.com
    csv, pdf, pptx
    Updated Sep 12, 2024
    Dataintelo (2024). Protein Structure Modeling Service Market Report | Global Forecast From 2025 To 2033 [Dataset]. https://dataintelo.com/report/global-protein-structure-modeling-service-market
    Explore at:
    Available download formats: pdf, pptx, csv
    Dataset updated
    Sep 12, 2024
    Dataset authored and provided by
    Dataintelo
    License

    https://dataintelo.com/privacy-and-policy

    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Protein Structure Modeling Service Market Outlook



    The global protein structure modeling service market size was valued at approximately USD 1.5 billion in 2023 and is projected to reach around USD 3.2 billion by 2032, growing at a CAGR of 8.2% during the forecast period. This remarkable growth is fueled by the increasing demand for drug discovery and development, advancements in bioinformatics tools, and the growing adoption of protein engineering techniques across various end-user industries.



    One of the primary growth factors driving the protein structure modeling service market is the escalating importance of protein structure analysis in drug discovery. As pharmaceutical and biotechnology companies continue to innovate, there is a pressing need to understand the structural aspects of proteins to design effective therapeutics. The ability to model protein structures accurately accelerates the drug development process, reduces costs, and enhances the success rate of new drug candidates. The integration of advanced computational tools and algorithms further boosts market expansion by providing more accurate and reliable protein models.



    Another significant growth driver is the rise of personalized medicine and targeted therapies. As the medical field moves towards more individualized treatment plans, understanding the unique protein structures of patients becomes critical. Protein structure modeling services provide the necessary insights to develop targeted drugs that are tailored to specific protein configurations, thereby enhancing treatment efficacy and minimizing side effects. This personalized approach to medicine is expected to spur substantial demand for protein structure modeling services in the coming years.



    The increasing collaboration between academic research institutions and commercial entities is also contributing to the market's growth. As academic and research institutes focus on fundamental protein research, they often partner with pharmaceutical companies to translate their findings into practical applications. These collaborations facilitate the sharing of resources, knowledge, and technological advancements, thereby driving the demand for protein structure modeling services. Additionally, funding from government bodies and private organizations for protein research further propels market development.



    Regionally, North America holds a dominant position in the protein structure modeling service market, largely due to the presence of major pharmaceutical companies, advanced healthcare infrastructure, and significant R&D investments. However, the Asia Pacific region is expected to witness the fastest growth during the forecast period, attributed to the burgeoning biotechnology sector, increasing healthcare expenditures, and growing focus on drug discovery and development. The European market also shows substantial potential, driven by robust research activities and favorable government initiatives supporting biotechnology advancements.



    Service Type Analysis



    Within the protein structure modeling service market, the service type segment is categorized into homology modeling, threading, ab initio, and hybrid methods. Homology modeling holds the largest share in this segment due to its widespread use and reliability. Homology modeling, also known as comparative modeling, involves predicting an unknown protein structure based on its similarity to known structures. This method is highly effective when there is a significant sequence similarity, making it a preferred choice for many researchers. Advancements in algorithms and computational power have further enhanced the accuracy and speed of homology modeling, contributing to its dominance.
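The sequence similarity that homology modeling depends on is often summarized as percent identity between a target and a template. The sketch below is a minimal illustration, not a method taken from the report: it assumes the two sequences are already aligned to equal length, uses "-" as the gap character, and skips gapped columns (other conventions for the denominator exist). The function name and both sequences are hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # gapped columns are not counted (one of several conventions)
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Hypothetical pre-aligned target and template fragments.
identity = percent_identity("MKT-AYIAKQR", "MKTLAYLAK-R")
print(f"{identity:.1f}% identity")
```

A higher value suggests a closer template; what counts as "significant" similarity depends on the protein family and alignment length, so no threshold is assumed here.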



    Threading, also known as fold recognition, is another important service type in the market. This method is used when homology modeling is not feasible due to low sequence similarity. Threading involves aligning the target sequence with a database of known structures to identify the best matching fold. Although more complex and computationally intensive, threading provides valuable insights when homology modeling falls short. The increasing application of threading in challenging protein targets underpins its growing market share.



    The ab initio method represents a smaller but rapidly evolving segment. Unlike homology modeling and threading, ab initio modeling predicts protein structures from scratch, without relying on known templates. This approach is particularly useful for novel proteins with no sequence homology to existing structures. While computationally demandi

  17. CAPS Database

    • rrid.site
    • scicrunch.org
    • +2more
    Updated Jan 29, 2022
    + more versions
    (2022). CAPS Database [Dataset]. http://identifiers.org/RRID:SCR_006862
    Explore at:
    Dataset updated
    Jan 29, 2022
    Description

    CAPS-DB is a structural classification of helix cappings, or caps, compiled from protein structures. Caps extracted from protein structures have been structurally classified based on geometry and conformation and organized in a tree-like hierarchical classification in which the different levels correspond to different properties of the caps. CAPS-DB is fully browsable and searchable and is regularly updated. The regions of the polypeptide chain immediately preceding or following a helix are known as Nt- and Ct-cappings, respectively. Cappings play a central role in stabilizing helices because the first and last turns lack intrahelical hydrogen bonds. Sequence patterns of amino acid type preferences have been derived for cappings, but the structural motifs associated with them are still unclassified. CAPS-DB is a database of clusters of structural patterns of different capping types. The clustering algorithm is based on the geometry and spatial conformation of these regions. CAPS-DB is a relational database that allows the user to search, browse, inspect and retrieve structural data associated with cappings. Its contents might be of interest to a wide range of scientists covering areas such as protein design and engineering, structural biology and bioinformatics.

    CapsDB v4.0:
    • PDB structures: 4591
    • Number of clusters: 859
    • Number of caps: 31452

  18. List of bioinformatics tools and databases used for sequence based function...

    • plos.figshare.com
    xls
    Updated Jun 1, 2023
    Mohd Shahbaaz; Md. Imtaiyaz Hassan; Faizan Ahmad (2023). List of bioinformatics tools and databases used for sequence based function annotation. [Dataset]. http://doi.org/10.1371/journal.pone.0084263.t001
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Mohd Shahbaaz; Md. Imtaiyaz Hassan; Faizan Ahmad
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    List of bioinformatics tools and databases used for sequence based function annotation.

  19. Bioinformatics Protein Dataset - Simulated

    • kaggle.com
    zip
    Updated Dec 27, 2024
    + more versions
    Rafael Gallo (2024). Bioinformatics Protein Dataset - Simulated [Dataset]. https://www.kaggle.com/datasets/gallo33henrique/bioinformatics-protein-dataset-simulated
    Explore at:
    Available download formats: zip (12928905 bytes)
    Dataset updated
    Dec 27, 2024
    Authors
    Rafael Gallo
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Subtitle

    "Synthetic protein dataset with sequences, physical properties, and functional classification for machine learning tasks."

    Description

    Introduction

    This synthetic dataset was created to explore and develop machine learning models in bioinformatics. It contains 20,000 synthetic proteins, each with an amino acid sequence, calculated physicochemical properties, and a functional classification.

    Columns Included

    • ID_Protein: Unique identifier for each protein.
    • Sequence: String of amino acids.
    • Molecular_Weight: Molecular weight calculated from the sequence.
    • Isoelectric_Point: Estimated isoelectric point based on the sequence composition.
    • Hydrophobicity: Average hydrophobicity calculated from the sequence.
    • Total_Charge: Sum of the charges of the amino acids in the sequence.
    • Polar_Proportion: Percentage of polar amino acids in the sequence.
    • Nonpolar_Proportion: Percentage of nonpolar amino acids in the sequence.
    • Sequence_Length: Total number of amino acids in the sequence.
    • Class: The functional class of the protein, one of five categories: Enzyme, Transport, Structural, Receptor, Other.

    Inspiration and Sources

    While this is a simulated dataset, it was inspired by patterns observed in real protein datasets, such as:
    • UniProt: A comprehensive database of protein sequences and annotations.
    • Kyte-Doolittle Scale: Calculations of hydrophobicity.
    • Biopython: A tool for analyzing biological sequences.

    Proposed Uses

    This dataset is ideal for:
    • Training classification models for proteins.
    • Exploratory analysis of physicochemical properties of proteins.
    • Building machine learning pipelines in bioinformatics.

    How This Dataset Was Created

    1. Sequence Generation: Amino acid chains were randomly generated with lengths between 50 and 300 residues.
    2. Property Calculation: Physicochemical properties were calculated using the Biopython library.
    3. Class Assignment: Classes were randomly assigned for classification purposes.
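The three steps above can be sketched in a few lines of Python. This is an illustrative reconstruction under stated assumptions, not the author's script: it uses only the standard library rather than Biopython, computes hydrophobicity as the mean Kyte-Doolittle value, and covers only some of the listed columns. The polar residue grouping, the ID format, and the function name are assumptions made for this example.

```python
import random

# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
AMINO_ACIDS = "".join(KYTE_DOOLITTLE)
CLASSES = ["Enzyme", "Transport", "Structural", "Receptor", "Other"]
# Assumption: a simple polar grouping chosen for illustration only.
POLAR = set("RNDQEHKSTY")


def make_protein(index, rng):
    """Build one synthetic record: random sequence, properties, random class."""
    # Step 1: random chain of 50 to 300 residues.
    length = rng.randint(50, 300)
    sequence = "".join(rng.choice(AMINO_ACIDS) for _ in range(length))
    # Step 2: properties derived from the sequence.
    hydrophobicity = sum(KYTE_DOOLITTLE[aa] for aa in sequence) / length
    polar = 100.0 * sum(aa in POLAR for aa in sequence) / length
    return {
        "ID_Protein": f"PROT_{index:05d}",  # hypothetical ID format
        "Sequence": sequence,
        "Hydrophobicity": round(hydrophobicity, 3),
        "Polar_Proportion": round(polar, 2),
        "Nonpolar_Proportion": round(100.0 - polar, 2),
        "Sequence_Length": length,
        # Step 3: class assigned at random, independent of the sequence.
        "Class": rng.choice(CLASSES),
    }


rng = random.Random(0)  # fixed seed so the sketch is reproducible
proteins = [make_protein(i, rng) for i in range(1, 6)]
for p in proteins:
    print(p["ID_Protein"], p["Sequence_Length"], p["Hydrophobicity"], p["Class"])
```

Because the class is drawn independently of the sequence, a classifier trained on data generated this way should not be expected to beat chance, which matches the limitations the dataset description states.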

    Limitations

    • The sequences and properties do not represent real proteins but follow patterns observed in natural proteins.
    • The functional classes are simulated and do not correspond to actual biological characteristics.

    Data Split

    The dataset is divided into two subsets:
    • Training: 16,000 samples (proteinas_train.csv).
    • Testing: 4,000 samples (proteinas_test.csv).
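A 16,000 / 4,000 division corresponds to an 80/20 split of the 20,000 records. The description does not say how the split was produced, so the following is only a sketch of one common approach, a seeded random shuffle; the function name, seed, and placeholder rows are assumptions.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and cut off the first test_fraction as test data."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Placeholder rows standing in for the 20,000 protein records.
rows = list(range(20000))
train, test = train_test_split(rows)
print(len(train), len(test))
```

Fixing the seed keeps the partition reproducible across runs, which matters when results on the test subset are compared between models.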

    Acknowledgment

    This dataset was inspired by real bioinformatics challenges and designed to help researchers and developers explore machine learning applications in protein analysis.

  20. Protein Crystallography Services Market Research Report 2033

    • growthmarketreports.com
    csv, pdf, pptx
    Updated Sep 1, 2025
    + more versions
    Growth Market Reports (2025). Protein Crystallography Services Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/protein-crystallography-services-market
    Explore at:
    Available download formats: pdf, csv, pptx
    Dataset updated
    Sep 1, 2025
    Dataset authored and provided by
    Growth Market Reports
    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Protein Crystallography Services Market Outlook



    According to our latest research, the global protein crystallography services market size reached USD 1.21 billion in 2024, reflecting robust demand across multiple end-user segments. The market is anticipated to grow at a CAGR of 8.4% from 2025 to 2033, propelled by technological advancements and the expanding applications of protein crystallography in drug discovery and structural biology. By 2033, the market is forecasted to attain a value of USD 2.51 billion. This growth trajectory is primarily driven by increasing investments in pharmaceutical R&D, the rising prevalence of chronic diseases necessitating novel therapeutics, and the integration of automation and artificial intelligence in structural biology workflows.
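The relationship between the quoted base value, growth rate, and forecast can be checked with the standard compound annual growth rate formula. The sketch below assumes 2024 is the base year and that growth compounds over the nine years to 2033; that reading of the forecast window is an assumption, as are the function names.

```python
def project_value(start_value, cagr, years):
    """Value after compounding at a constant annual rate."""
    return start_value * (1 + cagr) ** years


def implied_cagr(start_value, end_value, years):
    """Constant annual rate that takes start_value to end_value."""
    return (end_value / start_value) ** (1 / years) - 1


years = 2033 - 2024  # assumed number of compounding periods
projected_2033 = project_value(1.21, 0.084, years)
rate = implied_cagr(1.21, 2.51, years)
print(f"Projected 2033 value: USD {projected_2033:.2f} billion")
print(f"Implied CAGR: {rate:.2%}")
```

Under these assumptions the projection comes out at roughly USD 2.5 billion and the implied rate at roughly 8.4%, in line with the figures quoted above.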




    A key growth factor for the protein crystallography services market is the surging demand for structure-based drug design in the pharmaceutical and biotechnology sectors. Drug discovery processes have become increasingly reliant on high-resolution protein structures to identify, validate, and optimize drug targets. Protein crystallography, especially X-ray crystallography, remains the gold standard for elucidating atomic-level details of biomolecules, enabling the rational design of more effective and selective therapeutics. The growing pipeline of biologics and small-molecule drugs, coupled with the need to shorten drug development timelines, has led to a significant uptick in outsourcing crystallography services to specialized providers. These providers offer advanced instrumentation, experienced personnel, and comprehensive data analysis, allowing pharmaceutical companies to focus their resources on core competencies while accelerating their R&D initiatives.




    Another major driver is the rapid evolution of crystallography and complementary structure-determination technologies, including the adoption of cryo-electron microscopy (cryo-EM), neutron crystallography, and state-of-the-art synchrotron facilities. These advancements have expanded the range of proteins and complexes amenable to structural analysis, including membrane proteins and large macromolecular assemblies that were previously challenging to crystallize. The integration of automation, robotics, and artificial intelligence into sample preparation, data collection, and structure determination has dramatically increased throughput and accuracy, reducing costs and turnaround times. Furthermore, collaborations between academic institutions, research organizations, and industry players have fostered innovation in crystallization techniques, data processing algorithms, and structural databases, further fueling market growth.




    The increasing prevalence of chronic and infectious diseases, such as cancer, diabetes, and emerging viral infections, has underscored the need for novel therapeutic targets and vaccines. Protein crystallography services play a pivotal role in the structural characterization of pathogenic proteins, antigen-antibody complexes, and enzyme-inhibitor interactions, facilitating the rational design of next-generation drugs and vaccines. Government initiatives to promote biomedical research, coupled with rising investments from venture capital and pharmaceutical giants, are creating a conducive environment for market expansion. Additionally, the emergence of personalized medicine and precision therapeutics is driving the demand for structural insights into patient-specific protein variants, further boosting the uptake of crystallography services globally.



    The role of Structural Bioinformatics Software is becoming increasingly pivotal in the field of protein crystallography. These software tools facilitate the modeling and simulation of protein structures, enabling researchers to predict molecular interactions and optimize crystallization conditions. By integrating structural bioinformatics with experimental data, scientists can enhance the accuracy of protein models and streamline the drug discovery process. The synergy between computational and experimental approaches is driving innovation in structural biology, allowing for more efficient identification of drug targets and the development of novel therapeutics. As the demand for high-resolution protein structures grows, the adoption of advanced bioinformatics software is expected to rise, further propelling the market forward.




    Regionally, North America con
