100+ datasets found
  1. HOMSTRAD - Homologous Structure Alignment Database

    • neuinfo.org
    • rrid.site
    Updated Aug 5, 2024
    Cite
    (2024). HOMSTRAD - Homologous Structure Alignment Database [Dataset]. http://identifiers.org/RRID:SCR_006544
    Description

    A curated database of structure-based alignments for homologous protein families. All known protein structures are clustered into homologous families (i.e., families sharing common ancestry), and the sequences of representative members of each family are aligned on the basis of their 3D structures using the programs MNYFIT, STAMP and COMPARER. These structure-based alignments are annotated with JOY and examined individually.

  2. Data from: S4: Structure-based Sequence Alignments of SCOP Superfamilies

    • neuinfo.org
    • scicrunch.org
    Updated Feb 10, 2024
    Cite
    (2024). S4: Structure-based Sequence Alignments of SCOP Superfamilies [Dataset]. http://identifiers.org/RRID:SCR_007911
    Description

    THIS RESOURCE IS NO LONGER IN SERVICE, documented on July 17, 2013. The S4 database contains sequence alignments of domains in SCOP superfamilies. The aligned domains are selected using ASTRAL so that no two domains in an alignment have more than 40 percent identity; moreover, the alignments include all domains that ASTRAL identifies as having less than 40 percent sequence identity. The alignments are generated using information from pairwise structural alignments of all domains in a given superfamily. These structural alignments generate residue equivalences and distances between residues, as well as an overall similarity of the two domains being compared (RMSD). This information is used to score the individual equivalences between residues. The scores are then integrated using a multiple sequence alignment program to generate the finished alignment. The database allows alignments to be retrieved in Clustal format, or viewed in a web browser, with either structural or sequence features annotated. In addition, the statistics of structural diversity for each superfamily can be seen. The pairwise structural alignments were performed using the SAP program. The output of SAP was converted to a T-Coffee library so that the multiple sequence alignment program T-Coffee could be used to compute the sequence alignments.
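The 40 percent identity cutoff described above can be illustrated with a minimal Python sketch. It is not ASTRAL's selection procedure: the identity definition (matches over columns where both sequences have a residue), the greedy filtering order, and the function names `percent_identity` and `filter_by_identity` are all assumptions made for illustration.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent identity over columns where both aligned sequences have a residue.

    Assumption: gaps are written as '-' and gapped columns are ignored.
    Other identity definitions (e.g. dividing by the shorter length) exist.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for x, y in pairs if x == y)
    return 100.0 * matches / len(pairs)


def filter_by_identity(aligned: dict[str, str], cutoff: float = 40.0) -> list[str]:
    """Greedily keep names whose identity to every kept sequence is <= cutoff."""
    kept: list[str] = []
    for name, seq in aligned.items():
        if all(percent_identity(seq, aligned[k]) <= cutoff for k in kept):
            kept.append(name)
    return kept


if __name__ == "__main__":
    # Hypothetical toy alignment, not taken from S4.
    toy = {"a": "ACDEFGHIKL", "b": "ACDEFGHIKV", "c": "MNPQRSTVWY"}
    print(filter_by_identity(toy))
```

A greedy filter depends on input order, so a real non-redundant set would usually be built with a more careful clustering step.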

  3. SABmark

    • neuinfo.org
    • scicrunch.org
    Updated Aug 1, 2004
    Cite
    (2004). SABmark [Dataset]. http://identifiers.org/RRID:SCR_011817
    Description

    Downloadable data set that is designed to assess the performance of both multiple and pairwise (protein) sequence alignment algorithms and is extremely easy to use. Currently, the database contains 2 sets, each consisting of a number of subsets with related sequences. Its main features are:

    • Covers the entire known fold space (SCOP classification), with subsets provided by the ASTRAL compendium
    • All structures have high quality, with 100% resolved residues
    • Structure alignments have been derived carefully, using both SOFI and CE, and Relaxed Transitive Alignment
    • At most 25 sequences in each subset to avoid overrepresentation of large folds
    • Automated running, archiving and scoring of programs through a few Perl scripts

    The Twilight Zone set is divided into sequence groups that each represent a SCOP fold. All sequences within a group share a pairwise Blast e-value of at least 1, for a theoretical database size of 100 million residues. Sequence similarity is thus very low, between 0-25% identity, and a (traceable) common evolutionary origin cannot be established between most pairs even though their structures are (distantly) similar. This set therefore represents the worst case scenario for sequence alignment, which unfortunately is also the most frequent one, as most related sequences share less than 25% identity.

    The Superfamilies set consists of groups that each represent a SCOP superfamily, and therefore contain sequences with a (putative) common evolutionary origin. However, they share at most 50% identity, which is still challenging for any sequence alignment algorithm.

    Frequently, alignments are performed to establish whether or not sequences are related. To benchmark this, a second version of both the Twilight Zone and the Superfamilies set is provided, in which a number of false positives, i.e. sequences not related to the original set, are added to each alignment problem.

    Database specifications:

    • Current version: 1.65 (concurrent with PDB, SCOP and ASTRAL)
    • Twilight Zone set (with false positives): 209 groups, 1740 (3280) sequences, 10667 (44056) related pairs
    • Superfamilies set (with false positives): 425 groups, 3280 (6526) sequences, 19092 (79095) related pairs

  4. Benchmark Database for Phonetic Alignments

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    Updated Feb 21, 2022
    Cite
    List, Johann-Mattis; Prokić, Jelena (2022). Benchmark Database for Phonetic Alignments [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_11880
    Dataset provided by
    Philipps-Universität Marburg
    Authors
    List, Johann-Mattis; Prokić, Jelena
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In the last two decades, alignment analyses have become an important technique in quantitative historical linguistics and dialectology. Phonetic alignment plays a crucial role in the identification of regular sound correspondences and deeper genealogical relations between and within languages and language families. Surprisingly, to date there are no easily accessible benchmark data sets for phonetic alignment analyses. Here we present a publicly available database of manually edited phonetic alignments which can serve as a platform for testing and improving the performance of automatic alignment algorithms. The database consists of a great variety of alignments drawn from a large number of different sources. The data is arranged in such a way that typical problems encountered in phonetic alignment analyses (metathesis, diversity of phonetic sequences) are represented and can be directly tested.

  5. Multiple Sequence Alignment (MSA) Viewer

    • catalog.data.gov
    • healthdata.gov
    Updated Jun 19, 2025
    Cite
    National Library of Medicine (2025). Multiple Sequence Alignment (MSA) Viewer [Dataset]. https://catalog.data.gov/dataset/multiple-sequence-alignment-msa-viewer
    Dataset provided by
    National Library of Medicine
    Description

    An interactive Web application that enables users to visualize multiple alignments created by database search results or other software applications. The MSA Viewer allows users to upload an alignment and set a master sequence and to explore the data using features such as zooming and changing of coloration.

  6. Constraint-Based Multiple Alignment Tool (COBALT)

    • catalog.data.gov
    • data.virginia.gov
    Updated Jun 19, 2025
    Cite
    National Library of Medicine (2025). Constraint-Based Multiple Alignment Tool (COBALT) [Dataset]. https://catalog.data.gov/dataset/cobalt
    Dataset provided by
    National Library of Medicine
    Description

    Constraint-Based Multiple Alignment Tool (COBALT) is a protein multiple sequence alignment tool that finds a collection of pairwise constraints derived from the conserved domain database, the protein motif database, and sequence similarity, using RPS-BLAST, BLASTP, and PHI-BLAST.

  7. Data from: The Adaptive Evolution Database (TAED)

    • data.virginia.gov
    • healthdata.gov
    html
    Updated Sep 6, 2025
    Cite
    National Institutes of Health (2025). The Adaptive Evolution Database (TAED) [Dataset]. https://data.virginia.gov/dataset/the-adaptive-evolution-database-taed
    Available download formats: html
    Dataset provided by
    National Institutes of Health
    Description

    Background: The Master Catalog is a collection of evolutionary families, including multiple sequence alignments, phylogenetic trees and reconstructed ancestral sequences, for all protein-sequence modules encoded by genes in GenBank. It can therefore support large-scale genomic surveys, one of which, The Adaptive Evolution Database (TAED), is presented here. In TAED, potential examples of positive adaptation are identified by high values for the normalized ratio of nonsynonymous to synonymous nucleotide substitution rates (KA/KS values) on branches of an evolutionary tree between nodes representing reconstructed ancestral sequences.

    Results: Evolutionary trees and reconstructed ancestral sequences were extracted from the Master Catalog for every subtree containing proteins from the Chordata only or the Embryophyta only. Branches with high KA/KS values were identified. These represent candidate episodes in the history of the protein family when the protein may have undergone positive selection, where the mutant form conferred more fitness than the ancestral form. Such episodes are frequently associated with change in function. An unexpectedly large number of families (between 10% and 20% of those families examined) were found to have at least one branch with high KA/KS values above arbitrarily chosen cut-offs (1 and 0.6). Most of these survived a robustness test and were collected into TAED.

    Conclusions: TAED is a raw resource for bioinformaticists interested in data mining and for experimental evolutionists seeking candidate examples of adaptive evolution for further experimental study. It can be expanded to include other evolutionary information (for example changes in gene regulation or splicing) placed in a phylogenetic perspective.
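The branch-flagging step described in the Results can be sketched in a few lines of Python. This is only an illustration of thresholding a KA/KS ratio at the two cut-offs mentioned (1 and 0.6). The `Branch` record, the function name `flag_branches`, and the toy numbers are hypothetical, and the robustness test mentioned in the description is not modelled.

```python
from dataclasses import dataclass


@dataclass
class Branch:
    """Hypothetical record for one tree branch with precomputed rates."""
    name: str
    ka: float  # nonsynonymous substitution rate
    ks: float  # synonymous substitution rate


def flag_branches(branches: list[Branch], cutoff: float = 1.0) -> list[str]:
    """Return names of branches whose KA/KS ratio exceeds the cutoff."""
    flagged = []
    for branch in branches:
        if branch.ks <= 0:
            # The ratio is undefined without synonymous changes; skip the branch.
            continue
        if branch.ka / branch.ks > cutoff:
            flagged.append(branch.name)
    return flagged


if __name__ == "__main__":
    toy = [
        Branch("b1", 0.12, 0.10),
        Branch("b2", 0.07, 0.10),
        Branch("b3", 0.02, 0.10),
        Branch("b4", 0.05, 0.0),
    ]
    print(flag_branches(toy, cutoff=1.0))
    print(flag_branches(toy, cutoff=0.6))
```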
    
  8. SW#db: GPU-Accelerated Exact Sequence Similarity Database Search

    • plos.figshare.com
    docx
    Updated May 31, 2023
    Cite
    Matija Korpar; Martin Šošić; Dino Blažeka; Mile Šikić (2023). SW#db: GPU-Accelerated Exact Sequence Similarity Database Search [Dataset]. http://doi.org/10.1371/journal.pone.0145857
    Available download formats: docx
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Matija Korpar; Martin Šošić; Dino Blažeka; Mile Šikić
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In recent years we have witnessed a growth in sequencing yield and in the number of samples sequenced and, as a result, the growth of publicly maintained sequence databases. This increase in data has placed high demands on protein similarity search algorithms, which face two opposing goals: keeping running times acceptable while maintaining a high enough level of sensitivity. The most time-consuming step of similarity search is the local alignment between query and database sequences. This step is usually performed using exact local alignment algorithms such as Smith-Waterman. Because of the algorithm's quadratic time complexity, aligning a query to the whole database is usually too slow. Therefore, most protein similarity search methods apply heuristics before the exact local alignment to reduce the number of candidate sequences in the database. However, there is still a need for the alignment of a query sequence to a reduced database. In this paper we present the SW#db tool and a library for fast exact similarity search. Although its running times, as a standalone tool, are comparable to the running times of BLAST, it is primarily intended to be used for the exact local alignment phase in which the database of sequences has already been reduced. It uses both GPU and CPU parallelization and was 4–5 times faster than SSEARCH, 6–25 times faster than CUDASW++ and more than 20 times faster than SSW at the time of writing, using multiple queries on Swiss-prot and Uniref90 databases.
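To make the quadratic cost of exact local alignment concrete, here is a minimal Smith-Waterman scoring sketch in Python. It is not the SW#db implementation: it runs on the CPU only and assumes a simple match/mismatch score with a linear gap penalty, whereas protein search tools typically use substitution matrices and affine gaps. The function name and default scores are illustrative.

```python
def smith_waterman_score(query: str, target: str,
                         match: int = 2, mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score between query and target.

    Fills an (len(query)+1) x (len(target)+1) table one row at a time,
    so time is O(m*n) and memory is O(n).
    """
    prev = [0] * (len(target) + 1)
    best = 0
    for i in range(1, len(query) + 1):
        curr = [0] * (len(target) + 1)
        for j in range(1, len(target) + 1):
            diag = prev[j - 1] + (match if query[i - 1] == target[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def rank_database(query: str, database: dict[str, str]) -> list[tuple[str, int]]:
    """Score the query against every database entry, best score first."""
    scores = [(name, smith_waterman_score(query, seq)) for name, seq in database.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical toy database.
    print(rank_database("ACGT", {"s1": "TTACGTTT", "s2": "GGGG"}))
```

Because every query-target pair fills a full table, the cost grows with the product of the sequence lengths, which is why the description stresses reducing the database before this step.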

  9. Sequence Tag Alignment and Consensus Knowledgebase Database

    • scicrunch.org
    Cite
    Sequence Tag Alignment and Consensus Knowledgebase Database [Dataset]. http://identifiers.org/RRID:SCR_002156
    Description

    THIS RESOURCE IS NO LONGER IN SERVICE, documented August 23, 2016. STACKdb is a knowledgebase generated by processing EST and mRNA sequences obtained from GenBank through a pipeline consisting of masking, clustering, alignment and variation analysis steps. The STACK project aims to generate a comprehensive representation of the sequence of each of the expressed genes in the human genome by extensive processing of gene fragments to make accurate alignments, highlight diversity and provide a carefully joined set of consensus sequences for each gene. The STACK project comprises the STACKdb human gene index, a database of virtual human transcripts, as well as stackPACK, the tools used to create the database. STACKdb is organized into 15 tissue-based categories and one disease category. STACK is a tool for detection and visualization of expressed transcript variation in the context of developmental and pathological states. The data system organizes and reconstructs human transcripts from available public data in the context of expression state. The expression state of a transcript can include developmental state, pathological association, site of expression and isoform of expressed transcript. STACK consensus transcripts are reconstructed from clusters that capture and reflect the growing evidence of transcript diversity. The comprehensive capture of transcript variants is achieved by the use of a novel clustering approach that is tolerant of sub-sequence diversity and does not rely on pairwise alignment. This is in contrast with other gene indexing projects. STACK is generated at least four times a year and represents the exhaustive processing of all publicly available human EST data extracted from GenBank. This processed information can be explored through 15 tissue-specific categories, a disease-related category and a whole-body index.

  10. Data from: Bayesian Top-Down Protein Sequence Alignment with Inferred...

    • datasetcatalog.nlm.nih.gov
    • figshare.com
    Updated May 20, 2016
    Cite
    Neuwald, Andrew F.; Altschul, Stephen F. (2016). Bayesian Top-Down Protein Sequence Alignment with Inferred Position-Specific Gap Penalties [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001574111
    Authors
    Neuwald, Andrew F.; Altschul, Stephen F.
    Description

    We describe a Bayesian Markov chain Monte Carlo (MCMC) sampler for protein multiple sequence alignment (MSA) that, as implemented in the program GISMO and applied to large numbers of diverse sequences, is more accurate than the popular MSA programs MUSCLE, MAFFT, Clustal-Ω and Kalign. Features of GISMO central to its performance are: (i) It employs a “top-down” strategy with a favorable asymptotic time complexity that first identifies regions generally shared by all the input sequences, and then realigns closely related subgroups in tandem. (ii) It infers position-specific gap penalties that favor insertions or deletions (indels) within each sequence at alignment positions in which indels are invoked in other sequences. This favors the placement of insertions between conserved blocks, which can be understood as making up the proteins’ structural core. (iii) It uses a Bayesian statistical measure of alignment quality based on the minimum description length principle and on Dirichlet mixture priors. Consequently, GISMO aligns sequence regions only when statistically justified. This is unlike methods based on the ad hoc, but widely used, sum-of-the-pairs scoring system, which will align random sequences. (iv) It defines a system for exploring alignment space that provides natural avenues for further experimentation through the development of new sampling strategies for more efficiently escaping from suboptimal traps. GISMO’s superior performance is illustrated using 408 protein sets containing, on average, 235 sequences. These sets correspond to NCBI Conserved Domain Database alignments, which have been manually curated in the light of available crystal structures, and thus provide a means to assess alignment accuracy. GISMO fills a different niche than other MSA programs, namely identifying and aligning a conserved domain present within a large, diverse set of full length sequences. The GISMO program is available at http://gismo.igs.umaryland.edu/.
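For readers unfamiliar with the sum-of-the-pairs scoring system that the description contrasts with GISMO's Bayesian measure, the following Python sketch computes it for a small alignment. The match, mismatch and gap values are arbitrary assumptions (real programs use substitution matrices), and the function name is illustrative. It does not implement any part of GISMO.

```python
from itertools import combinations


def sum_of_pairs(alignment: list[str], match: int = 1,
                 mismatch: int = -1, gap: int = -2) -> int:
    """Sum-of-pairs score: add up the score of every pair of rows in every column."""
    if len({len(row) for row in alignment}) > 1:
        raise ValueError("all rows of an alignment must have the same length")
    total = 0
    for column in zip(*alignment):
        for x, y in combinations(column, 2):
            if x == "-" and y == "-":
                continue  # gap against gap carries no information
            if x == "-" or y == "-":
                total += gap
            elif x == y:
                total += match
            else:
                total += mismatch
    return total


if __name__ == "__main__":
    print(sum_of_pairs(["AC-", "ACG", "A-G"]))
```

The score is defined for any set of rows, including unrelated sequences, so a program that simply maximizes it will always return some alignment. That is the behaviour the description contrasts with aligning regions only when statistically justified.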

  11. Data from: Kabat Database of Sequences of Proteins of Immunological Interest...

    • neuinfo.org
    • dknet.org
    Updated Jun 27, 2024
    Cite
    (2024). Kabat Database of Sequences of Proteins of Immunological Interest [Dataset]. http://identifiers.org/RRID:SCR_006465
    Description

    The Kabat Database determines the combining site of antibodies based on the available amino acid sequences. The precise delineation of complementarity determining regions (CDR) of both light and heavy chains provides the first example of how properly aligned sequences can be used to derive structural and functional information of biological macromolecules. The Kabat database now includes nucleotide sequences, sequences of T cell receptors for antigens (TCR), major histocompatibility complex (MHC) class I and II molecules, and other proteins of immunological interest. The Kabat Database searching and analysis tools package is an ASP.NET web-based portal containing lookup tools, sequence matching tools, alignment tools, length distribution tools, positional correlation tools and much more. The searching and analysis tools are custom made for the aligned data sets contained in both the SQL Server and ASCII text flat file formats. The searching and analysis tools may be run on a single PC workstation or in a distributed environment. The analysis tools are written in ASP.NET and C# and are available in Visual Studio .NET 2003/2005/2008 formats. The Kabat Database was initially started in 1970 to determine the combining site of antibodies based on the available amino acid sequences at that time. Bence Jones proteins, mostly from human, were aligned, using the now-known Kabat numbering system, and a quantitative measure, variability, was calculated for every position. Three peaks, at positions 24-34, 50-56 and 89-97, were identified and proposed to form the complementarity determining regions (CDR) of light chains. Subsequently, antibody heavy chain amino acid sequences were also aligned using a different numbering system, since the locations of their CDRs (31-35B, 50-65 and 95-102) are different from those of the light chains. 
    CDRL1 starts right after the first invariant Cys 23 of light chains, while CDRH1 is eight amino acid residues away from the first invariant Cys 22 of heavy chains. During the past 30 years, the Kabat database has grown to include nucleotide sequences, sequences of T cell receptors for antigens (TCR), major histocompatibility complex (MHC) class I and II molecules and other proteins of immunological interest. It has been used extensively by immunologists to derive useful structural and functional information from the primary sequences of these proteins.

  12. Data for Ultra-Large Alignments Using Phylogeny-Aware Profiles

    • databank.illinois.edu
    Updated Feb 29, 2024
    Cite
    Nam-phuong Nguyen; Siavash Mirarab; Keerthana Kumar; Tandy Warnow (2024). Data for Ultra-Large Alignments Using Phylogeny-Aware Profiles [Dataset]. http://doi.org/10.13012/B2IDB-3174395_V1
    Authors
    Nam-phuong Nguyen; Siavash Mirarab; Keerthana Kumar; Tandy Warnow
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Dataset funded by
    Howard Hughes Medical Institute International Predoctoral Fellowship
    Subgrant from University of Alberta
    Donation from Musea Ventures
    Health Research Formula Fund from the Pennsylvania Department of Health
    U.S. National Science Foundation (NSF)
    iPlant Collaborative (https://cyverse.org/)
    U.S. National Institutes of Health (NIH)
    Texas Advanced Computing Center
    Description

    This dataset contains the data for PASTA and UPP. PASTA data was used in the following articles: Mirarab, Siavash, Nam Nguyen, Sheng Guo, Li-San Wang, Junhyong Kim, and Tandy Warnow. “PASTA: Ultra-Large Multiple Sequence Alignment for Nucleotide and Amino-Acid Sequences.” Journal of Computational Biology 22, no. 5 (2015): 377–86. doi:10.1089/cmb.2014.0156. Mirarab, Siavash, Nam Nguyen, and Tandy Warnow. “PASTA: Ultra-Large Multiple Sequence Alignment.” Edited by Roded Sharan. Research in Computational Molecular Biology, 2014, 177–91. UPP data was used in: Nguyen, Nam-phuong D., Siavash Mirarab, Keerthana Kumar, and Tandy Warnow. “Ultra-Large Alignments Using Phylogeny-Aware Profiles.” Genome Biology 16, no. 1 (December 16, 2015): 124. doi:10.1186/s13059-015-0688-z.

  13. BAliBASE

    • scicrunch.org
    Updated Dec 4, 2023
    Cite
    (2023). BAliBASE [Dataset]. http://identifiers.org/RRID:SCR_001940
    Description

    A collection of high-quality multiple sequence alignments for objective, comparative studies of alignment algorithms. The alignments are constructed based on 3D structure superposition and manually refined to ensure alignment of important functional residues. A number of subsets are defined covering many of the most important problems encountered when aligning real sets of proteins. It is specifically designed to serve as an evaluation resource to address all the problems encountered when aligning complete sequences. The first release provided sets of reference alignments dealing with the problems of high variability, unequal repartition and large N/C-terminal extensions and internal insertions. Version 2.0 of the database incorporates three new reference sets of alignments containing structural repeats, trans-membrane sequences and circular permutations to evaluate the accuracy of detection/prediction and alignment of these complex sequences. Within the resource, users can look at a list of all the alignments, download the whole database by ftp, get the C program to compare a test alignment with the BAliBASE reference (the source code for the program is freely available), or look at the results of a comparison study of several multiple alignment programs, using BAliBASE reference sets.

  14. Alternative Splicing Annotation Project II Database

    • dknet.org
    • scicrunch.org
    Updated Jan 29, 2022
    Cite
    (2022). Alternative Splicing Annotation Project II Database [Dataset]. http://identifiers.org/RRID:SCR_000322
    Description

    THIS RESOURCE IS NO LONGER IN SERVICE, documented on 8/12/13. An expanded version of the Alternative Splicing Annotation Project (ASAP) database with a new interface and integration of comparative features using UCSC BLASTZ multiple alignments. It supports 9 vertebrate species, 4 insects, and nematodes, and provides extensive alternative splicing analysis along with splice variants. For human alternative splicing data, newly added EST libraries were classified and included in the previous tissue and cancer classification, and the lists of tissue-specific and cancer (vs. normal) specific alternatively spliced genes were re-calculated and updated. The authors created novel orthologous exon and intron databases, with their splice variants, based on multiple alignments among several species. These orthologous exon and intron databases can give more comprehensive homologous gene information than protein-similarity-based methods. Furthermore, splice junction and exon identity among species can be valuable resources to elucidate species-specific genes. The ASAP II database can be easily integrated with pygr (unpublished, the Python Graph Database Framework for Bioinformatics) and its powerful features such as graph query and multi-genome alignment query. ASAP II can be searched by several different criteria such as gene symbol, gene name and ID (UniGene, GenBank etc.). The web interface provides 7 different kinds of views: (I) user query, UniGene annotation, orthologous genes and genome browsers; (II) genome alignment; (III) exons and orthologous exons; (IV) introns and orthologous introns; (V) alternative splicing; (VI) isoform and protein sequences; (VII) tissue and cancer vs. normal specificity. ASAP II shows genome alignments of isoforms, exons, and introns in a UCSC-like genome browser. All alternative splicing relationships with supporting evidence information, types of alternative splicing patterns, and inclusion rate for skipped exons are listed in separate tables.
    Users can also search human data for tissue- and cancer-specific splice forms at the bottom of the gene summary page. The p-values for tissue specificity are reported as log-odds (LOD) scores, and results with LOD >= 3 and at least 3 EST sequences are highlighted.

  15. Conserved Domain Database (CDD)

    • datadiscovery.nlm.nih.gov
    • data.virginia.gov
    csv, xlsx, xml
    Updated Mar 2, 2021
    Cite
    (2021). Conserved Domain Database (CDD) [Dataset]. https://datadiscovery.nlm.nih.gov/Molecular-biology-Genetics/Conserved-Domain-Database-CDD-/jz6b-hbzb
    Available download formats: xml, xlsx, csv
    Description

    Conserved Domain Database (CDD) is a protein annotation resource that consists of a collection of well-annotated multiple sequence alignment models for ancient domains and full-length proteins.

  16. A Global Lexical Database (GLED) with cognate annotation and phonological...

    • data-staging.niaid.nih.gov
    Updated Nov 28, 2022
    Cite
    Tresoldi, Tiago (2022). A Global Lexical Database (GLED) with cognate annotation and phonological alignments [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_5911131
    Authors
    Tresoldi, Tiago
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This work presents a lexical database encompassing most natural languages, with cognate annotation and phonological alignment, along with per-family and global phylogenetic resources. The lexical data is organized in a single and easy-to-use tabular file, and all resources are built following best practices and state-of-the-art algorithms for historical linguistics. It was developed to provide a source for prototyping studies, developing new methods, as well as bootstrapping analyses, and to allow for the community to engage in research in computational historical linguistics. The data is expected to be updated regularly, with additions and improvements. All resources are freely available for download for all interested researchers.

  17. PSSH2 - database of protein sequence-to-structure homologies (including...

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip, csv
    Updated Feb 11, 2022
    Cite
    Andrea Schafferhans; Sean O'Donoghue; Neblina Sikta; Sandeep Kaur (2022). PSSH2 - database of protein sequence-to-structure homologies (including Sars-CoV-2 structures) [Dataset]. http://doi.org/10.5281/zenodo.6021806
    Available download formats: application/gzip, csv
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Andrea Schafferhans; Sean O'Donoghue; Neblina Sikta; Sandeep Kaur
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Protein sequence and structure data

    This data set contains data from Uniprot (in the files called protein_sequence, protein_synonyms, protein_names, organism_synonyms) and PDB (in the files called PDB and PDB_chain) as used by the Aquaria web resource at the time of download (2022-02-08).

    The PSSH2 data set

    PSSH2 is a database of protein sequence-to-structure homologies based on HHblits, an alignment method employing iterative comparisons of hidden Markov models (HMMs). To ensure the highest possible final alignment quality for matches in Aquaria using HHblits, we first calculate HMM profiles for each unique PDB sequence (PDB_full) and also for each unique Swiss-Prot sequence. We generated PSSH2 using HHblits to find similarities between HMMs from PDB and HMMs from UniProt sequences.

    Calculating PSSH2

    The Swissprot and PDB data was downloaded in November 2021.
    Generating PSSH2: We used UniRef30_2021_03 (originally called UniRef30_2021_06) from HH-suite, a database of non-redundant UniProt sequence clusters in which the highest pairwise sequence identity between clusters was 30%. The HHblits code and the code for running the calculations was retrieved from git (https://github.com/soedinglab/hh-suite.git and https://github.com/aschafu/PSSH2.git respectively) at the respective time of calculation in the timeframe until December 2021.

    PDB-based sequence-to-structure alignments

    In addition to the PSSH2 data, new PDB structures were retrieved based on the primary accession of the proteins, by querying for all chains in all PDB entries with exact matches using the sequence cross-reference records given in PDB. Sequence-to-structure alignments were then created, again based on information provided in each PDB entry. These are contained in the PDBchain data.

    This data covers sequences and PDB structures in the timeframe until February 2022.

    Evaluating PSSH2

    The resulting alignment data was analysed using CATH domain assignments downloaded from /cath/releases/all-releases/v4_2_0/cath-classification-data/ to define correct hits and false hits:

    • The set of query sequences is defined by the CATH non-redundant S40_overlap_60 dataset (ftp://orengoftp.biochem.ucl.ac.uk/cath/releases/all-releases/v4_2_0/non-redundant-data-sets/)
    • The set of expected hits comprises all PDB structures containing a domain with the same CATH code, either if contained in the set of processed sequences (-> all) or only if also contained in the set of non-redundant sequences (-> nr40).
    • The set of true positives is defined by sharing the same CATH code up to the level of homology ("CATH") or up to the level of topology ("CAT").
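The true-positive rule above can be sketched in a few lines of Python. This is only an illustration, assuming CATH codes are given as dot-separated strings such as "3.40.50.300"; the function names `cath_prefix` and `is_true_positive` are hypothetical and are not taken from the PSSH2 code.

```python
def cath_prefix(code: str, levels: int) -> tuple:
    """Return the first `levels` fields of a dot-separated CATH code."""
    parts = code.strip().split(".")
    if len(parts) < levels:
        raise ValueError(f"CATH code {code!r} has fewer than {levels} levels")
    return tuple(parts[:levels])


def is_true_positive(query_code: str, hit_code: str, level: str = "CATH") -> bool:
    """Compare two CATH codes down to homology ("CATH") or topology ("CAT")."""
    # "CAT" compares class, architecture, topology; "CATH" adds the
    # homologous superfamily level.
    depth = {"CAT": 3, "CATH": 4}[level]
    return cath_prefix(query_code, depth) == cath_prefix(hit_code, depth)
```

With these definitions, a hit that shares only the topology with the query counts as correct under "CAT" but as a false hit under "CATH".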

    The data was evaluated with respect to false discovery rate (FDR) and recall (true positive rate TPR) by cumulatively considering all hits with an E-value below the threshold ("C") or in bins with an E-value between the threshold and one tenth of the threshold ("B"). This evaluation was carried out for the data obtained in November 2021 (202111) as well as previous data from October 2020 (202010), February 2020 (202002) and September 2017 (201709). The results are collected in PSSH CATH validation.csv.
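As a rough sketch of the cumulative ("C") and binned ("B") evaluation described above, the following Python computes both rates for one threshold. The source does not give the exact formulas, so several points are assumptions: FDR is taken as FP / (TP + FP) among the selected hits, recall as TP divided by the number of expected hits, and the bin is treated as the half-open interval from one tenth of the threshold up to the threshold. The `Hit` class and `fdr_and_recall` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    evalue: float   # E-value reported for the alignment
    is_true: bool   # whether the hit shares the required CATH code


def fdr_and_recall(hits, n_expected, threshold, mode="C"):
    """Return (FDR, recall) for cumulative ("C") or binned ("B") selection."""
    if mode == "C":
        # Cumulative: every hit with an E-value below the threshold.
        selected = [h for h in hits if h.evalue < threshold]
    elif mode == "B":
        # Binned: E-value between threshold / 10 (inclusive, assumed)
        # and the threshold (exclusive).
        selected = [h for h in hits if threshold / 10 <= h.evalue < threshold]
    else:
        raise ValueError(f"unknown mode {mode!r}")
    tp = sum(h.is_true for h in selected)
    fp = len(selected) - tp
    fdr = fp / len(selected) if selected else 0.0
    recall = tp / n_expected if n_expected else 0.0
    return fdr, recall
```

Calling the function once per threshold and mode would give one row per combination, which is presumably the kind of layout collected in the validation file, although its actual columns are not described here.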

    Known errors

    Due to a processing error, the profile of PDB structure 5fia A / B (sequence md5 052667679fc644184f40063c7602c9e1) is incomplete in the pdb_full HHblits database, which led to further errors in generating sequence-based alignments for the sequences of 1vtm P (sequence md5 c844aff103449363cb8489c78c58ebf1) and 434t A / B (sequence md5 d67aa1c3a36492c719cb48b5e7ecc624).
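Users who want to exclude the affected entries could filter on the three md5 values listed above. The sketch below assumes alignment rows have already been loaded as dictionaries; the column names `query_md5` and `hit_md5` are hypothetical, since the layout of the data files is not described here.

```python
# md5 checksums of the sequences named in the known-errors note.
AFFECTED_MD5 = frozenset({
    "052667679fc644184f40063c7602c9e1",  # 5fia A / B
    "c844aff103449363cb8489c78c58ebf1",  # 1vtm P
    "d67aa1c3a36492c719cb48b5e7ecc624",  # 434t A / B
})


def drop_affected(rows, md5_keys=("query_md5", "hit_md5")):
    """Keep only rows whose md5 columns mention none of the affected sequences.

    The default column names are assumptions; pass the real ones via md5_keys.
    """
    return [
        row for row in rows
        if not any(row.get(key) in AFFECTED_MD5 for key in md5_keys)
    ]
```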

  18. s

    Data from: DMAPS - A Database of Multiple Alignments for Protein Structures

    • scicrunch.org
    • neuinfo.org
    • +1more
    Updated Oct 17, 2019
    Cite
    (2019). DMAPS - A Database of Multiple Alignments for Protein Structures [Dataset]. http://identifiers.org/RRID:SCR_007140
    Dataset updated
    Oct 17, 2019
    Description

    THIS RESOURCE IS NO LONGER IN SERVICE, documented September 6, 2016. The DMAPS database contains pre-computed multiple structure alignments for protein chains in the Protein Data Bank (PDB). Automated structure alignments have been generated for classified protein families using the CE-MC algorithm. Alignments have been built only for those families with at least three members. Currently, multiple structure alignments are available for 3050 SCOP-, 3087 CATH-, 664 ENZYME- and 1707 CE-based families. Users can retrieve multiple alignments for a given PDB chain classified by one of these criteria.

  19. d

    Vector Alignment Search Tool (VAST)

    • catalog.data.gov
    • datadiscovery.nlm.nih.gov
    • +4more
    Updated Jun 19, 2025
    + more versions
    Cite
    National Library of Medicine (2025). Vector Alignment Search Tool (VAST) [Dataset]. https://catalog.data.gov/dataset/vector-alignment-search-tool-vast
    Dataset updated
    Jun 19, 2025
    Dataset provided by
    National Library of Medicine
    Description

    A computer algorithm that identifies similar protein 3-dimensional structures. Structure neighbors for every structure in MMDB are pre-computed and accessible via links on the MMDB Structure Summary pages.

  20. d

    Data from: Fully automated sequence alignment methods are comparable to, and...

    • researchdiscovery.drexel.edu
    • datadryad.org
    Updated Jan 3, 2019
    Cite
    Therese A. Catanach; Andrew D. Sweet; Nam-Phuong D. Nguyen; Rhiannon M. Peery; Andrew H. Debevec; Andrea K. Thomer; Amanda C. Owings; Bret M. Boyd; Aron D. Katz; Felipe N. Soto-Adames; Julie M. Allen (2019). Data from: Fully automated sequence alignment methods are comparable to, and much faster than, traditional methods in large data sets: an example with hepatitis B virus [Dataset]. https://researchdiscovery.drexel.edu/esploro/outputs/dataset/Data-from-Fully-automated-sequence-alignment/991022048367404721
    Dataset updated
    Jan 3, 2019
    Dataset provided by
    Dryad
    Authors
    Therese A. Catanach; Andrew D. Sweet; Nam-Phuong D. Nguyen; Rhiannon M. Peery; Andrew H. Debevec; Andrea K. Thomer; Amanda C. Owings; Bret M. Boyd; Aron D. Katz; Felipe N. Soto-Adames; Julie M. Allen
    Time period covered
    Jan 30, 2019
    Description

    Aligning sequences for phylogenetic analysis (multiple sequence alignment; MSA) is an important, but increasingly computationally expensive step with the recent surge in DNA sequence data. Much of this sequence data is publicly available, but can be extremely fragmentary (i.e., a combination of full genomes and genomic fragments), which can compound the computational issues related to MSA. Traditionally, alignments are produced with automated algorithms and then checked and/or corrected “by eye” prior to phylogenetic inference. However, this manual curation is inefficient at the data scales required of modern phylogenetics and results in alignments that are not reproducible. Recently, methods have been developed for fully automating alignments of large data sets, but it is unclear if these methods produce alignments that result in compatible phylogenies when compared to more traditional alignment approaches that combined automated and manual methods. Here we use approximately 33,000 publicly available sequences from the hepatitis B virus (HBV), a globally distributed and rapidly evolving virus, to compare different alignment approaches. Using one data set comprised exclusively of whole genomes and a second that also included sequence fragments, we compared three MSA methods: (1) a purely automated approach using traditional software, (2) an automated approach including by eye manual editing, and (3) more recent fully automated approaches. To understand how these methods affect phylogenetic results, we compared resulting tree topologies based on these different alignment methods using multiple metrics. We further determined if the monophyly of existing HBV genotypes was supported in phylogenies estimated from each alignment type and under different statistical support thresholds. Traditional and fully automated alignments produced similar HBV phylogenies. 
Although there was variability between branch support thresholds, allowing lower support thresholds tended to result in more differences among trees. Therefore, differences between the trees could be best explained by phylogenetic uncertainty unrelated to the MSA method used. Nevertheless, automated alignment approaches did not require human intervention and were therefore considerably less time-intensive than traditional approaches. Because of this, we conclude that fully automated algorithms for MSA are fully compatible with older methods even in extremely difficult to align data sets. Additionally, we found that most HBV diagnostic genotypes did not correspond to evolutionarily-sound groups, regardless of alignment type and support threshold. This suggests there may be errors in genotype classification in the database or that HBV genotypes may need a revision.

HOMSTRAD - Homologous Structure Alignment Database

RRID:SCR_006544, nif-0000-02977, OMICS_00976, HOMSTRAD - Homologous Structure Alignment Database (RRID:SCR_006544), HOMSTRAD, HOMologous STRucture Alignment Database

28 scholarly articles cite this dataset (View in Google Scholar)
Dataset updated
Aug 5, 2024