100+ datasets found

NCBI Structure
dknet.org
neuinfo.org
Cite
NCBI Structure [Dataset]. http://identifiers.org/RRID:SCR_004218
Unique identifier
https://identifiers.org/RRID:SCR_004218
Description
Database of three-dimensional structures of macromolecules that allows the user to retrieve structures for specific molecule types as well as structures for genes and proteins of interest. Structure comprises three main databases: the Molecular Modeling Database; Conserved Domains and Protein Classification; and the BioSystems Database. Structure also links to the PubChem databases to connect biological activity data to the macromolecular structures. Users can locate structural templates for proteins and interactively view structures and sequence data to closely examine sequence-structure relationships.
* Macromolecular structures: The three-dimensional structures of biomolecules provide a wealth of information on their biological function and evolutionary relationships. The Molecular Modeling Database (MMDB), as part of the Entrez system, facilitates access to structure data by connecting them with associated literature, protein and nucleic acid sequences, chemicals, biomolecular interactions, and more. It is possible, for example, to find 3D structures for homologs of a protein of interest by following the Related Structure link in an Entrez Protein sequence record.
* Conserved domains and protein classification: Conserved domains are functional units within a protein that act as building blocks in molecular evolution and recombine in various arrangements to make proteins with different functions. The Conserved Domain Database (CDD) brings together several collections of multiple sequence alignments representing conserved domains, in addition to NCBI-curated domains that use 3D-structure information explicitly to define domain boundaries and provide insights into sequence/structure/function relationships.
* Small molecules and their biological activity: The PubChem project provides information on the biological activities of small molecules and is a component of NIH's Molecular Libraries Roadmap Initiative. PubChem includes three databases: PCSubstance, PCBioAssay, and PCCompound. The PubChem data are linked to other data types in the Entrez system, making it possible, for example, to retrieve information about a compound and then link to its biological activity data, retrieve 3D protein structures bound to the compound and interactively view their active sites, and find biosystems that include the compound as a component.
* Biological Systems: A biosystem, or biological system, is a group of molecules that interact directly or indirectly, where the grouping is relevant to the characterization of living matter. The NCBI BioSystems Database provides centralized access to biological pathways from several source databases and connects the biosystem records with associated literature, molecular, and chemical data throughout the Entrez system. BioSystem records list and categorize components, such as the genes, proteins, and small molecules involved in a biological system. The companion FLink tool, in turn, allows you to input a list of proteins, genes, or small molecules and retrieve a ranked list of biosystems.
Tiered Human Integrated Sequence Search Databases for Shotgun Proteomics
acs.figshare.com
xlsx
Updated Jun 9, 2023
Cite
Eric W. Deutsch; Zhi Sun; David S. Campbell; Pierre-Alain Binz; Terry Farrah; David Shteynberg; Luis Mendoza; Gilbert S. Omenn; Robert L. Moritz (2023). Tiered Human Integrated Sequence Search Databases for Shotgun Proteomics [Dataset]. http://doi.org/10.1021/acs.jproteome.6b00445.s002
Available download formats: xlsx
Unique identifier
https://doi.org/10.1021/acs.jproteome.6b00445.s002
Dataset updated
Jun 9, 2023
Dataset provided by
ACS Publications
Authors
Eric W. Deutsch; Zhi Sun; David S. Campbell; Pierre-Alain Binz; Terry Farrah; David Shteynberg; Luis Mendoza; Gilbert S. Omenn; Robert L. Moritz
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
The results of analysis of shotgun proteomics mass spectrometry data can be greatly affected by the selection of the reference protein sequence database against which the spectra are matched. For many species there are multiple sources from which somewhat different sequence sets can be obtained. This can lead to confusion about which database is best in which circumstances, a problem especially acute in human sample analysis. All sequence databases are genome-based, with sequences for the predicted genes and their protein translation products compiled. Our goal is to create a set of primary sequence databases that comprise the union of sequences from many of the different available sources and make the result easily available to the community. We have compiled a set of four sequence databases of varying sizes, from a small database consisting of only the ∼20,000 primary isoforms plus contaminants to a very large database that includes almost all nonredundant protein sequences from several sources. This set of tiered, increasingly complete human protein sequence databases suitable for mass spectrometry proteomics sequence database searching is called the Tiered Human Integrated Search Proteome set. In order to evaluate the utility of these databases, we have analyzed two different data sets, one from the HeLa cell line and the other from normal human liver tissue, with each of the four tiers of database complexity. The result is that approximately 0.8%, 1.1%, and 1.5% additional peptides can be identified for Tiers 2, 3, and 4, respectively, as compared with the Tier 1 database, at substantially increasing computational cost. This increase in computational cost may be worth bearing if the identification of sequence variants or the discovery of sequences that are not present in the reviewed knowledge base entries is an important goal of the study. We find that it is useful to search a data set against a simpler database, and then check the uniqueness of the discovered peptides against a more complex database. We have set up an automated system that downloads all the source databases on the first of each month and automatically generates a new set of search databases and makes them available for download at http://www.peptideatlas.org/thisp/.
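The two-step workflow described above (search against a simpler database, then check peptide uniqueness against a more complex one) can be sketched roughly as follows. This is an illustrative sketch only, not the authors' pipeline: the function names, the toy sequences, and the use of plain substring matching (which ignores details such as isoleucine/leucine ambiguity) are all assumptions made for the example.

```python
def proteins_containing(peptide, database):
    """Return the sorted IDs of database entries whose sequence contains the peptide."""
    return sorted(pid for pid, seq in database.items() if peptide in seq)


def check_uniqueness(peptides, larger_db):
    """Map each peptide to every entry of the larger database that contains it."""
    return {p: proteins_containing(p, larger_db) for p in peptides}


# Hypothetical toy databases: a small "Tier 1"-like set and a larger superset.
tier1 = {"P1": "MKTAYIAKQR", "P2": "MSSHEGGKKR"}
tier4 = dict(tier1, P1_variant="MKTAYIVKQR", P3="GGMKTAYIAK")

# Peptides assumed to have been identified in a search against tier1.
identified = ["TAYIAK", "SHEGG"]

report = check_uniqueness(identified, tier4)
# Peptides that still map to exactly one entry in the larger database.
unique = [p for p, hits in report.items() if len(hits) == 1]
```

In this toy case a peptide that looked unique in the small database may map to several entries of the larger one, which is the situation the uniqueness check is meant to reveal.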
Transcriptome Shotgun Assembly (TSA) Sequence Database and Submissions
catalog.data.gov
data.virginia.gov
Updated Jun 19, 2025
Cite
National Library of Medicine (2025). Transcriptome Shotgun Assembly (TSA) Sequence Database and Submissions [Dataset]. https://catalog.data.gov/dataset/transcriptome-shotgun-assembly-tsa-sequence-database-and-submissions-822d5
Dataset updated
Jun 19, 2025
Dataset provided by
National Library of Medicine
Description
TSA is an archive of computationally assembled transcript sequences from primary data such as ESTs and Next Generation Sequencing Technologies. The overlapping sequence reads from a complete transcriptome are assembled into transcripts by computational methods instead of by traditional cloning and sequencing of cloned cDNAs. The primary sequence data used in the assemblies must have been experimentally determined by the same submitter. TSA sequence records differ from GenBank records because there are no physical counterparts to the assemblies.
CATH-Gene3D
ebi.ac.uk
Updated Oct 21, 2020
Cite
(2020). CATH-Gene3D [Dataset]. https://www.ebi.ac.uk/interpro/
Dataset updated
Oct 21, 2020
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The CATH-Gene3D database describes protein families and domain architectures in complete genomes. Protein families are formed using a Markov clustering algorithm, followed by multi-linkage clustering according to sequence identity. Mapping of predicted structure and sequence domains is undertaken using hidden Markov model libraries representing CATH and Pfam domains. CATH-Gene3D is based at University College London, UK.
NCBIFAM
ebi.ac.uk
Updated Dec 16, 2024
Cite
(2024). NCBIFAM [Dataset]. https://www.ebi.ac.uk/interpro/
Dataset updated
Dec 16, 2024
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
NCBIfam is a collection of protein families, featuring curated multiple sequence alignments, hidden Markov models (HMMs) and annotation, which provides a tool for identifying functionally related proteins based on sequence homology. NCBIfam is maintained at the National Center for Biotechnology Information (Bethesda, MD). NCBIfam includes models from TIGRFAMs, another database of protein families developed at The Institute for Genomic Research, then at the J. Craig Venter Institute (Rockville, MD, US).
Gabi Primary Database
rrid.site
Updated Jul 14, 2025
Cite
The citation is currently not available for this dataset.
Unique identifier
https://identifiers.org/RRID:SCR_002755 https://identifiers.org/RRID:SCR_002755/resolver?q=*&i=rrid
Dataset updated
Jul 14, 2025
Description
Database that collects, integrates and links all relevant primary information from the GABI plant genome research projects and makes it accessible via the internet. Its purpose is to support plant genome research in Germany, to yield information about commercially important plant genomes, and to establish a scientific network within plant genomic research. GreenCards is the main interface for text-based retrieval of sequence, SNP, mapping data etc. Sharing and interchange of data among collaborating research groups, industry and the patent and licensing agency are facilitated.
* GreenCards: Text-based search for sequence, mapping, SNP data etc.
* Maps: Visualization of genetic or physical maps.
* BLAST: Secure BLAST search against different public databases or non-public sequence data stored in GabiPD.
* Proteomics: View interactive 2D-gels and view or download information for identified protein spots. Registered users can submit data via secure file upload.
Content of the Bioinformatics for Dentistry, with its respective primary...
plos.figshare.com
xls
Updated Jun 6, 2024
Cite
Ava K. Chow; Rachel Low; Jerald Yuan; Karen K. Yee; Jaskaranjit Kaur Dhaliwal; Shanice Govia; Nazlee Sharmin (2024). Content of the Bioinformatics for Dentistry, with its respective primary sources. [Dataset]. http://doi.org/10.1371/journal.pone.0303628.t002
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0303628.t002
Dataset updated
Jun 6, 2024
Dataset provided by
PLOS (http://plos.org/)
Authors
Ava K. Chow; Rachel Low; Jerald Yuan; Karen K. Yee; Jaskaranjit Kaur Dhaliwal; Shanice Govia; Nazlee Sharmin
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Content of the Bioinformatics for Dentistry, with its respective primary sources.
List of genes and related information involved in tooth development.
plos.figshare.com
figshare.com
xlsx
Updated Jun 6, 2024
Cite
Ava K. Chow; Rachel Low; Jerald Yuan; Karen K. Yee; Jaskaranjit Kaur Dhaliwal; Shanice Govia; Nazlee Sharmin (2024). List of genes and related information involved in tooth development. [Dataset]. http://doi.org/10.1371/journal.pone.0303628.s002
Available download formats: xlsx
Unique identifier
https://doi.org/10.1371/journal.pone.0303628.s002
Dataset updated
Jun 6, 2024
Dataset provided by
PLOS (http://plos.org/)
Authors
Ava K. Chow; Rachel Low; Jerald Yuan; Karen K. Yee; Jaskaranjit Kaur Dhaliwal; Shanice Govia; Nazlee Sharmin
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Link: https://figshare.com/articles/dataset/S2_Table/25546426. (XLSX)
PROSITE profiles
ebi.ac.uk
Updated Feb 5, 2025
Cite
(2025). PROSITE profiles [Dataset]. https://www.ebi.ac.uk/interpro/
Dataset updated
Feb 5, 2025
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
PROSITE is a database of protein families and domains. It consists of biologically significant sites, patterns and profiles that help to reliably identify to which known protein family a new sequence belongs. PROSITE is based at the Swiss Institute of Bioinformatics (SIB), Geneva, Switzerland.
Data from: Kabat Database of Sequences of Proteins of Immunological Interest...
neuinfo.org
dknet.org
Updated Jun 27, 2024
Cite
(2024). Kabat Database of Sequences of Proteins of Immunological Interest [Dataset]. http://identifiers.org/RRID:SCR_006465
Unique identifier
https://identifiers.org/RRID:SCR_006465
Dataset updated
Jun 27, 2024
Description
The Kabat Database determines the combining site of antibodies based on the available amino acid sequences. The precise delineation of complementarity determining regions (CDR) of both light and heavy chains provides the first example of how properly aligned sequences can be used to derive structural and functional information of biological macromolecules. The Kabat database now includes nucleotide sequences, sequences of T cell receptors for antigens (TCR), major histocompatibility complex (MHC) class I and II molecules, and other proteins of immunological interest. The Kabat Database searching and analysis tools package is an ASP.NET web-based portal containing lookup tools, sequence matching tools, alignment tools, length distribution tools, positional correlation tools and much more. The searching and analysis tools are custom made for the aligned data sets contained in both the SQL Server and ASCII text flat file formats. The searching and analysis tools may be run on a single PC workstation or in a distributed environment. The analysis tools are written in ASP.NET and C# and are available in Visual Studio .NET 2003/2005/2008 formats. The Kabat Database was initially started in 1970 to determine the combining site of antibodies based on the available amino acid sequences at that time. Bence Jones proteins, mostly from human, were aligned, using the now-known Kabat numbering system, and a quantitative measure, variability, was calculated for every position. Three peaks, at positions 24-34, 50-56 and 89-97, were identified and proposed to form the complementarity determining regions (CDR) of light chains. Subsequently, antibody heavy chain amino acid sequences were also aligned using a different numbering system, since the locations of their CDRs (31-35B, 50-65 and 95-102) are different from those of the light chains. 
CDRL1 starts right after the first invariant Cys 23 of light chains, while CDRH1 is eight amino acid residues away from the first invariant Cys 22 of heavy chains. During the past 30 years, the Kabat database has grown to include nucleotide sequences, sequences of T cell receptors for antigens (TCR), major histocompatibility complex (MHC) class I and II molecules and other proteins of immunological interest. It has been used extensively by immunologists to derive useful structural and functional information from the primary sequences of these proteins.
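The CDR boundaries quoted above can be turned into a small lookup, sketched here under stated assumptions. The ranges come from the description (light chain 24-34, 50-56, 89-97; heavy chain 31-35B, 50-65, 95-102); the function names, the position-label format ("35B" meaning position 35 with insertion letter B), and the rule that an end position without a letter excludes lettered insertions at that number are assumptions of this example, not part of the Kabat tools.

```python
import re

# CDR boundaries in Kabat numbering, as quoted in the description above.
CDR_RANGES = {
    "L": {"CDRL1": ("24", "34"), "CDRL2": ("50", "56"), "CDRL3": ("89", "97")},
    "H": {"CDRH1": ("31", "35B"), "CDRH2": ("50", "65"), "CDRH3": ("95", "102")},
}


def parse_position(label):
    """Split a label such as '35B' into (35, 'B'); plain numbers get an empty letter."""
    match = re.fullmatch(r"(\d+)([A-Z]?)", label)
    if match is None:
        raise ValueError(f"not a valid position label: {label!r}")
    return int(match.group(1)), match.group(2)


def cdr_of(chain, label):
    """Return the CDR name containing this position, or None for framework positions.

    chain is 'L' (light) or 'H' (heavy). Positions are compared as
    (number, insertion letter) tuples, so 35 < 35A < 35B < 36.
    """
    pos = parse_position(label)
    for name, (start, end) in CDR_RANGES[chain].items():
        if parse_position(start) <= pos <= parse_position(end):
            return name
    return None
```

For instance, `cdr_of("L", "23")` returns `None`, matching the statement that CDRL1 starts right after the invariant Cys 23, while `cdr_of("H", "35B")` falls in CDRH1.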
Data from: SBASE
scicrunch.org
Updated Jun 17, 2025
Cite
(2025). SBASE [Dataset]. http://identifiers.org/RRID:SCR_007914
Unique identifier
https://identifiers.org/RRID:SCR_007914
Dataset updated
Jun 17, 2025
Description
SBASE is a database of protein domain sequences collected from the literature, from protein sequence databases and from genomic databases. The protein domains are defined by their sequence boundaries given by the publishing authors or in one of the primary sequence databases (Swiss-Prot, PIR, TREMBL etc.). Domain groups are included if they have well-defined sequence boundaries, and if they can be distinguished from other sequences using a similarity search technique. The SBASE database uses a set-theoretical approach for representing similarities, which in practical terms is extremely simple. Sequences are considered similar if they are members of a similarity group in which all or most sequences are similar to each other and less similar to other members of the database. Sequences that have an above-threshold BLAST similarity score to at least one member of the group are called the neighbourhood of the group. In the accompanying sketch of such a neighbourhood, the similarities within the group (self-similarities) and those pointing to non-member neighbours (non-self similarities) are shown in different colours.
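The neighbourhood definition above can be illustrated with a short sketch. It assumes that pairwise BLAST scores are already available as a dictionary keyed by sequence pairs and that "above threshold" means strictly greater than the threshold; the function name, sequence IDs, scores, and threshold are hypothetical and are not taken from SBASE itself.

```python
def neighbourhood(group, scores, threshold):
    """Return all sequences scoring above the threshold against at least one group member.

    group     -- set of sequence IDs forming the similarity group
    scores    -- dict mapping (id_a, id_b) pairs to a similarity score,
                 treated as symmetric
    threshold -- minimum score (exclusive) for two sequences to count as similar
    """
    hood = set()
    for (a, b), score in scores.items():
        if score <= threshold:
            continue
        if a in group:
            hood.add(b)
        if b in group:
            hood.add(a)
    return hood


# Hypothetical scores: d1-d3 form a domain group, x1-x3 are other sequences.
group = {"d1", "d2", "d3"}
scores = {
    ("d1", "d2"): 250, ("d2", "d3"): 240, ("d1", "d3"): 230,  # self-similarities
    ("d3", "x1"): 120,   # non-self similarity above the threshold
    ("x1", "x2"): 300,   # similarity between non-members, irrelevant to the group
    ("d1", "x3"): 40,    # below the threshold
}

hood = neighbourhood(group, scores, threshold=100)
non_member_neighbours = hood - group
```

Here the neighbourhood contains the three group members plus `x1`, and `non_member_neighbours` isolates the sequences reached only through non-self similarities.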
Data from: A New Phase of Networking: The Molecular Composition and...
acs.figshare.com
xlsx
Updated Jun 4, 2023
Cite
Sean R. Millar; Jie Qi Huang; Karl J. Schreiber; Yi-Cheng Tsai; Jiyun Won; Jianping Zhang; Alan M. Moses; Ji-Young Youn (2023). A New Phase of Networking: The Molecular Composition and Regulatory Dynamics of Mammalian Stress Granules [Dataset]. http://doi.org/10.1021/acs.chemrev.2c00608.s002
Available download formats: xlsx
Unique identifier
https://doi.org/10.1021/acs.chemrev.2c00608.s002
Dataset updated
Jun 4, 2023
Dataset provided by
ACS Publications
Authors
Sean R. Millar; Jie Qi Huang; Karl J. Schreiber; Yi-Cheng Tsai; Jiyun Won; Jianping Zhang; Alan M. Moses; Ji-Young Youn
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
Stress granules (SGs) are cytosolic biomolecular condensates that form in response to cellular stress. Weak, multivalent interactions between their protein and RNA constituents drive their rapid, dynamic assembly through phase separation coupled to percolation. Though a consensus model of SG function has yet to be determined, their perceived implication in cytoprotective processes (e.g., antiviral responses and inhibition of apoptosis) and possible role in the pathogenesis of various neurodegenerative diseases (e.g., amyotrophic lateral sclerosis and frontotemporal dementia) have drawn great interest. Consequently, new studies using numerous cell biological, genetic, and proteomic methods have been performed to unravel the mechanisms underlying SG formation, organization, and function and, with them, a more clearly defined SG proteome. Here, we provide a consensus SG proteome through literature curation and an update of the user-friendly database RNAgranuleDB to version 2.0 (http://rnagranuledb.lunenfeld.ca/). With this updated SG proteome, we use next-generation phase separation prediction tools to assess the predisposition of SG proteins for phase separation and aggregation. Next, we analyze the primary sequence features of intrinsically disordered regions (IDRs) within SG-resident proteins. Finally, we review the protein- and RNA-level determinants, including post-translational modifications (PTMs), that regulate SG composition and assembly/disassembly dynamics.
Protein Data Bank Site
dknet.org
scicrunch.org
Updated Apr 9, 2025
Cite
(2025). Protein Data Bank Site [Dataset]. http://identifiers.org/RRID:SCR_008227
Unique identifier
https://identifiers.org/RRID:SCR_008227
Dataset updated
Apr 9, 2025
Description
THIS RESOURCE IS NO LONGER IN SERVICE, documented on August 18, 2014. A database of structural and functional information on various protein sites (post-translational modification, catalytic active, organic and inorganic ligand binding, protein-protein, protein-DNA and protein-RNA interactions) in the Protein Data Bank (PDB). It was developed as a daughter database accumulating data on the functional and structural characteristics of functional sites stored in PDB, as well as their spatial surroundings. It consists of functional sites extracted from PDB using the SITE records and of an additional set containing the protein interaction sites inferred from the contact residues in heterocomplexes. PDBSite was set up by automated processing of the PDB. The PDBSite database can be queried through the functional description and the structural characteristics of the site and its environment. PDBSite is integrated with the PDBSiteScan tool, which allows structural comparison of a protein against the functional sites. PDBSite enables the recognition of functional sites in protein tertiary structures, providing annotation of function through structure. The Protein Data Bank (PDB) contains data on spatial protein structures and their biologically active sites (i.e., ligand binding regions, enzyme catalytic centers, regions subjected to biochemical modifications, etc.). However, none of the well-known systems for searching PDB lets the user make queries related to the active sites. The PDBSite database, which stores the data on biologically active sites contained in the PDB, was developed for this purpose. PDBSite accumulates amino acid content, structural features calculated from spatial protein structures, and physicochemical properties of sites and their spatial surroundings. The data on biologically active protein sites are of extreme importance for solving many problems in molecular biology, biotechnology, and medicine. The high specificity of biological activity in proteins is produced by the unique structure of active sites, which are often organized in a very complicated pattern. In particular, biologically active sites in proteins are often composed of amino acid residues that are remote in the primary structure but form compact clusters with strictly ordered conformation in the spatial structure. The specific structure and conformational parameters of these sites are determined by the structure of their spatial amino acid surroundings. For example, the spatial amino acid surroundings of enzyme catalytic centres determine the relief of the hollows in the substrate binding regions, whereas the residues of antigen determinants of proteins determine their structure by organizing prominent parts at the protein surface. For many natural and mutant proteins, relationships were found between protein activity and the physicochemical properties of the amino acid residues composing the local surroundings of a functional site. The spatial surroundings of biologically active sites can be detected only if data on tertiary protein structures are available. Sponsor: This site is funded by GeneNetWorks.
Third Party Annotation (TPA) Database
catalog.data.gov
datadiscovery.nlm.nih.gov
Updated Jun 19, 2025
Cite
National Library of Medicine (2025). Third Party Annotation (TPA) Database [Dataset]. https://catalog.data.gov/dataset/third-party-annotation-tpa-database
Dataset updated
Jun 19, 2025
Dataset provided by
National Library of Medicine
Description
A database that contains sequences built from the existing primary sequence data in GenBank. The sequences and corresponding annotations are experimentally supported and have been published in a peer-reviewed scientific journal.
Gabi Primary Database
dknet.org
scicrunch.org
Updated Mar 7, 2025
Cite
(2025). Gabi Primary Database [Dataset]. http://identifiers.org/RRID:SCR_002755
Unique identifier
https://identifiers.org/RRID:SCR_002755
Dataset updated
Mar 7, 2025
Description
Database that collects, integrates and links all relevant primary information from the GABI plant genome research projects and makes it accessible via the internet. Its purpose is to support plant genome research in Germany, to yield information about commercially important plant genomes, and to establish a scientific network within plant genomic research. GreenCards is the main interface for text-based retrieval of sequence, SNP, mapping data etc. Sharing and interchange of data among collaborating research groups, industry and the patent and licensing agency are facilitated.
* GreenCards: Text-based search for sequence, mapping, SNP data etc.
* Maps: Visualization of genetic or physical maps.
* BLAST: Secure BLAST search against different public databases or non-public sequence data stored in GabiPD.
* Proteomics: View interactive 2D-gels and view or download information for identified protein spots. Registered users can submit data via secure file upload.
Sequence Set Browser
catalog.data.gov
datadiscovery.nlm.nih.gov
Updated Jun 19, 2025
Cite
National Library of Medicine (2025). Sequence Set Browser [Dataset]. https://catalog.data.gov/dataset/sequence-set-browser
Dataset updated
Jun 19, 2025
Dataset provided by
National Library of Medicine
Description
This site is for browsing WGS (Whole Genome Shotgun) genomes, TSA (Transcriptome Shotgun Assemblies) and TLS (Targeted Locus Study) sets. WGS sequences are incomplete genomes that have been sequenced by a whole genome shotgun strategy. TSA sequences are transcript sequences that have been computationally assembled from primary RNA sequence data. TLS sequences are large-scale marker gene sequencing studies. Please consult WGS Submission or TSA Submission pages for more details. https://www.ncbi.nlm.nih.gov/genbank/wgs https://www.ncbi.nlm.nih.gov/genbank/tsa
PSSH2 - database of protein sequence-to-structure homologies (including...
data.niaid.nih.gov
zenodo.org
Updated Feb 11, 2022
Cite
Sandeep Kaur (2022). PSSH2 - database of protein sequence-to-structure homologies (including Sars-CoV-2 structures) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4279163
Dataset updated
Feb 11, 2022
Dataset provided by
Andrea Schafferhans
Sandeep Kaur
Neblina Sikta
Sean O'Donoghue
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Protein sequence and structure data

This data set contains data from Uniprot (in the files called protein_sequence, protein_synonyms, protein_names, organism_synonyms) and PDB (in the files called PDB and PDB_chain) as used by the Aquaria web resource at the time of download (2022-02-08).

The PSSH2 data set

PSSH2 is a database of protein sequence-to-structure homologies based on HHblits, an alignment method employing iterative comparisons of hidden Markov models (HMMs). To ensure the highest possible final alignment quality for matches in Aquaria using HHblits, we first calculate HMM profiles for each unique PDB sequence (PDB_full) and also for each unique Swiss-Prot sequence. We generated PSSH2 using HHblits to find similarities between HMMs from PDB and HMMs from UniProt sequences.

Calculating PSSH2

The Swissprot and PDB data was downloaded in November 2021. Generating PSSH2: We used UniRef30_2021_03 (originally called UniRef30_2021_06) from HH-suite, a database of non-redundant UniProt sequence clusters in which the highest pairwise sequence identity between clusters was 30%. The HHblits code and the code for running the calculations was retrieved from git (https://github.com/soedinglab/hh-suite.git and https://github.com/aschafu/PSSH2.git respectively) at the respective time of calculation in the timeframe until December 2021.

PDB based sequence-to-structure alignments

In addition to the PSSH2 data, new PDB structures were retrieved based on the primary accession of the proteins, by querying for all chains in all PDB entries with exact matches using the sequence cross references records given in PDB. Sequence-to-structure alignments were then created, again based on information provided in each PDB entry. These are contained in the PDBchain data.

This data covers sequences and PDB structures in the timeframe until February 2022.

Evaluating PSSH2

The resulting alignment data was analysed using CATH domain assignments downloaded from /cath/releases/all-releases/v4_2_0/cath-classification-data/ to define correct hits and false hits:

The set of query sequences is defined by the CATH non-redundant S40_overlap_60 dataset (ftp://orengoftp.biochem.ucl.ac.uk/cath/releases/all-releases/v4_2_0/non-redundant-data-sets/)

The set of all expected hits are all pdb structures containing a domain with the same CATH code if contained in the set of processed sequences (-> all) or only if also contained in the set of non redundant sequences (-> nr40).

The set of true positives is defined by sharing the same CATH code up to the level of homology ("CATH") or up to the level of topology ("CAT").

The data was evaluated with respect to false discovery rate (FDR) and recall (true positive rate TPR) by cumulatively considering all hits with an E-value below the threshold ("C") or in bins with an E-value between the threshold and one tenth of the threshold ("B"). This evaluation was carried out for the data obtained in November 2021 (202111) as well as previous data from October 2020 (202010), February 2020 (202002) and September 2017 (201709). The results are collected in PSSH CATH validation.csv.
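The evaluation described above can be sketched as follows. This is a minimal illustration, not the PSSH2 evaluation code: it assumes each hit is an (E-value, is_true_positive) pair, that the number of expected hits is known, and that recall within a bin is computed against the same expected total as in the cumulative case. The function name, the hit list, and the numbers are hypothetical.

```python
def evaluate(hits, n_expected, threshold, mode="C"):
    """Return (FDR, TPR) for hits selected by E-value.

    hits       -- list of (e_value, is_true) pairs
    n_expected -- total number of expected (correct) hits
    threshold  -- E-value threshold
    mode       -- "C": cumulative, all hits with E-value below the threshold;
                  "B": binned, hits with E-value from threshold/10 up to the threshold
    """
    lower = threshold / 10 if mode == "B" else 0.0
    selected = [is_true for e_value, is_true in hits if lower <= e_value < threshold]
    true_positives = sum(selected)
    false_positives = len(selected) - true_positives
    fdr = false_positives / len(selected) if selected else 0.0
    tpr = true_positives / n_expected
    return fdr, tpr


# Hypothetical hits: (E-value, whether the hit shares the query's CATH code).
hits = [
    (1e-30, True), (1e-12, True), (5e-4, True),
    (2e-3, False), (5e-3, True), (0.5, False),
]

cumulative = evaluate(hits, n_expected=5, threshold=1e-2, mode="C")
binned = evaluate(hits, n_expected=5, threshold=1e-2, mode="B")
```

With these toy numbers the cumulative selection holds five hits, one of them false, while the bin between 1e-3 and 1e-2 holds two hits, one of them false, so the binned FDR is higher than the cumulative one.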

Known errors

Due to a processing error, the profile of PDB structure 5fia A / B (sequence md5 052667679fc644184f40063c7602c9e1) is incomplete in the pdb_full HHblits database, which led to further errors in generating sequence-based alignments for 1vtm P (sequence md5 c844aff103449363cb8489c78c58ebf1) and 434t A / B (sequence md5 d67aa1c3a36492c719cb48b5e7ecc624).
CDD
ebi.ac.uk
Updated Apr 18, 2024
Cite
(2024). CDD [Dataset]. https://www.ebi.ac.uk/interpro/
Dataset updated
Apr 18, 2024
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
CDD is a protein annotation resource that consists of a collection of annotated multiple sequence alignment models for ancient domains and full-length proteins. These are available as position-specific score matrices (PSSMs) for fast identification of conserved domains in protein sequences via RPS-BLAST. CDD content includes NCBI-curated domain models, which use 3D-structure information to explicitly define domain boundaries and provide insights into sequence/structure/function relationships, as well as domain models imported from a number of external source databases.
Data from: HIV Sequence Database
dknet.org
rrid.site
Updated Jan 29, 2022
Cite
(2022). HIV Sequence Database [Dataset]. http://identifiers.org/RRID:SCR_002906/resolver?q=&i=rrid
Unique identifier
https://identifiers.org/RRID:SCR_002906 https://identifiers.org/RRID:SCR_002906/resolver?q=&i=rrid
Dataset updated
Jan 29, 2022
Description
THIS RESOURCE IS NO LONGER IN SERVICE. Documented on January 4, 2023. HIV Sequence Database is a database of annotated HIV sequences, plus a variety of tools and information for researchers studying HIV and SIV. The main aim of this website is to provide easy access to our sequence database, alignments, and the tools and interfaces we have produced. The HIV Sequence Database focuses on five primary goals:
* Collecting HIV and SIV sequence data (all sequences since 1987)
* Curating and annotating this data, and making it available to the scientific community
* Computer analysis of HIV and related sequences
* Production of software for the analysis of (sequence) data
* Publishing the data and analyses on this site and in a yearly printed publication, the HIV Sequence Compendium, which is available free of charge.
SFLD
ebi.ac.uk
Updated Sep 7, 2018
Cite
(2018). SFLD [Dataset]. https://www.ebi.ac.uk/interpro/
Explore at:
Dataset updated
Sep 7, 2018
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
SFLD (Structure-Function Linkage Database) is a hierarchical classification of enzymes that relates specific sequence-structure features to specific chemical capabilities.

NCBI Structure

RRID:SCR_004218, nlx_23947, NCBI Structure (RRID:SCR_004218), NCBI Structure

Explore at:

287 scholarly articles cite this dataset (View in Google Scholar)

Unique identifier

https://identifiers.org/RRID:SCR_004218

Description

Database of three-dimensional structures of macromolecules that allows the user to retrieve structures for specific molecule types as well as structures for genes and proteins of interest. Three main databases comprise Structure: the Molecular Modeling Database; Conserved Domains and Protein Classification; and the BioSystems Database. Structure also links to the PubChem databases to connect biological activity data to the macromolecular structures. Users can locate structural templates for proteins and interactively view structures and sequence data to closely examine sequence-structure relationships.

* Macromolecular structures: The three-dimensional structures of biomolecules provide a wealth of information on their biological function and evolutionary relationships. The Molecular Modeling Database (MMDB), as part of the Entrez system, facilitates access to structure data by connecting them with associated literature, protein and nucleic acid sequences, chemicals, biomolecular interactions, and more. It is possible, for example, to find 3D structures for homologs of a protein of interest by following the Related Structure link in an Entrez Protein sequence record.

* Conserved domains and protein classification: Conserved domains are functional units within a protein that act as building blocks in molecular evolution and recombine in various arrangements to make proteins with different functions. The Conserved Domain Database (CDD) brings together several collections of multiple sequence alignments representing conserved domains, in addition to NCBI-curated domains that use 3D-structure information explicitly to define domain boundaries and provide insights into sequence/structure/function relationships.

* Small molecules and their biological activity: The PubChem project provides information on the biological activities of small molecules and is a component of NIH's Molecular Libraries Roadmap Initiative. PubChem includes three databases: PCSubstance, PCBioAssay, and PCCompound. The PubChem data are linked to other data types in the Entrez system, making it possible, for example, to retrieve information about a compound and then link to its biological activity data, retrieve 3D protein structures bound to the compound and interactively view their active sites, and find biosystems that include the compound as a component.

* Biological Systems: A biosystem, or biological system, is a group of molecules that interact directly or indirectly, where the grouping is relevant to the characterization of living matter. The NCBI BioSystems Database provides centralized access to biological pathways from several source databases and connects the biosystem records with associated literature, molecular, and chemical data throughout the Entrez system. BioSystem records list and categorize components, such as the genes, proteins, and small molecules involved in a biological system. The companion FLink tool, in turn, allows you to input a list of proteins, genes, or small molecules and retrieve a ranked list of biosystems.

NCBI Structure

Tiered Human Integrated Sequence Search Databases for Shotgun Proteomics

Transcriptome Shotgun Assembly (TSA) Sequence Database and Submissions

CATH-Gene3D

NCBIFAM

Gabi Primary Database

Content of the Bioinformatics for Dentistry, with its respective primary...

List of genes and related information involved in tooth development.

PROSITE profiles

Data from: Kabat Database of Sequences of Proteins of Immunological Interest...

Data from: SBASE

Data from: A New Phase of Networking: The Molecular Composition and...

Protein Data Bank Site

Third Party Annotation (TPA) Database

Gabi Primary Database

Sequence Set Browser

PSSH2 - database of protein sequence-to-structure homologies (including...

CDD

Data from: HIV Sequence Database

SFLD
