90 datasets found

Gene Ontology Annotation Database
bioregistry.io
Updated Apr 24, 2021
Cite
(2021). Gene Ontology Annotation Database [Dataset]. https://bioregistry.io/goa
Dataset updated
Apr 24, 2021
Description
The GOA (Gene Ontology Annotation) project provides high-quality Gene Ontology (GO) annotations to proteins in the UniProt Knowledgebase (UniProtKB) and International Protein Index (IPI). This involves electronic annotation and the integration of high-quality manual GO annotation from all GO Consortium model organism groups and specialist groups.
UniProt Chordata protein annotation program
rrid.site
scicrunch.org
+2 more
Updated Jun 16, 2025
Cite
(2025). UniProt Chordata protein annotation program [Dataset]. http://identifiers.org/RRID:SCR_007071
Unique identifier
https://identifiers.org/RRID:SCR_007071 https://identifiers.org/RRID:SCR_007071/resolver?q=*&i=rrid
Dataset updated
Jun 16, 2025
Description
Data set of manually annotated chordata-specific proteins as well as those that are widely conserved. The program keeps existing human entries up to date and broadens the manual annotation to other vertebrate species, especially model organisms, including great apes, cow, mouse, rat, chicken, zebrafish, as well as Xenopus laevis and Xenopus tropicalis. A draft of the complete human proteome is available in UniProtKB/Swiss-Prot and one of the current priorities of the Chordata protein annotation program is to improve the quality of human sequences provided. To this aim, they are updating sequences which show discrepancies with those predicted from the genome sequence. Dubious isoforms, sequences based on experimental artifacts and protein products derived from erroneous gene model predictions are also revisited. This work is in part done in collaboration with the Hinxton Sequence Forum (HSF), which allows active exchange between UniProt, HAVANA, Ensembl and HGNC groups, as well as with the RefSeq database. UniProt is a member of the Consensus CDS project and they are in the process of reviewing their records to support convergence towards a standard set of protein annotations. They also continuously update human entries with functional annotation, including novel structural, post-translational modification, interaction and enzymatic activity data. In order to identify candidates for re-annotation, they use, among others, information extraction tools such as the STRING database. In addition, they regularly add new sequence variants and maintain disease information. Indeed, this annotation program includes the Variation Annotation Program, the goal of which is to annotate all known human genetic diseases and disease-linked protein variants, as well as neutral polymorphisms.
PROSITE profiles
ebi.ac.uk
Updated Feb 5, 2025
Cite
(2025). PROSITE profiles [Dataset]. https://www.ebi.ac.uk/interpro/
Dataset updated
Feb 5, 2025
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
PROSITE is a database of protein families and domains. It consists of biologically significant sites, patterns and profiles that help to reliably identify to which known protein family a new sequence belongs. PROSITE is based at the Swiss Institute of Bioinformatics (SIB), Geneva, Switzerland.
SFLD
ebi.ac.uk
Updated Sep 7, 2018
Cite
(2018). SFLD [Dataset]. https://www.ebi.ac.uk/interpro/
Dataset updated
Sep 7, 2018
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
SFLD (Structure-Function Linkage Database) is a hierarchical classification of enzymes that relates specific sequence-structure features to specific chemical capabilities.
Protein database for SAGApipeline
data.niaid.nih.gov
Updated Jun 3, 2022
Cite
Li Rui (2022). Protein database for SAGApipeline [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6591506
Dataset updated
Jun 3, 2022
Dataset authored and provided by
Li Rui
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Proteins from multiple databases were used in algal genome annotation. We first downloaded the protein data of green algae (Chlorophyta) and red algae (Rhodophyta) from UniProt for gene prediction on 10 higher-quality assemblies. We refer to these protein sequences, used for gene prediction on the RefSeq genomes, as seed_algae_mix; their role is to complete gene prediction for the higher-quality genomes quickly and accurately. The predictions obtained for the RefSeq genomes are then used as query sequences against the NR (RefSeq non-redundant proteins) database, and the matched target sequences are extracted as the NR_extract part. Subsequently, we retrieved from UniProt the protein sequences of all algal lineages. Note that we did not select only the protein sequences of the 17 lineages to be annotated, but those of all algae from the 21 lineages found in the NCBI Taxonomy; this set of 2,432,633 algal protein sequences (1.17 GB), named total_algae_mix, is intended for future predictions. The plant subset of OrthoDB v10.1 and the BUSCO proteins of 7 lineages were also merged into pr_total_mix.
Data from: Initial set of gene ontology (GO) terms for the D v virgifera...
catalog.data.gov
agdatacommons.nal.usda.gov
Updated Jun 5, 2025
Cite
Agricultural Research Service (2025). Initial set of gene ontology (GO) terms for the D v virgifera GCF_003013835.1 RefSeq protein models [Dataset]. https://catalog.data.gov/dataset/initial-set-of-gene-ontology-go-terms-for-the-d-v-virgifera-gcf-003013835-1-refseq-protein-b2e0e
Dataset updated
Jun 5, 2025
Dataset provided by
Agricultural Research Service
Description
Gene ontologies generated using GOanna with a standard pipeline (https://agbase-docs.readthedocs.io/en/latest/goanna/using_goanna_cmd.html; default settings) with queries against the invertebrate subsection of the UniProt database. Alignments are provided in HTML format. The initial set of gene ontology (GO) terms in the sliminput.txt files generated by GOanna was used as input for GOSlimViewer to parse and summarize molecular function (F), biological process (P) and cellular component (C) at level 2. Annotations were also converted to a gene annotation format (.gaf) file using GOanna2ga.

Resources in this dataset:

Resource Title: Alignment of Diabrotica virgifera virgifera RefSeq GCF_003013835.1 protein models against invertebrate proteins in the UniProt database.
File Name: Dvir_2.0GOAnna.align.sn060d1588083909.html
Resource Description: Gapped BLAST and PSI-BLAST results for Diabrotica virgifera virgifera RefSeq GCF_003013835.1 protein models against 22,595 invertebrate protein sequences in the UniProt database at AgBase (invertebrates_exponly.fa)
Resource Software Recommended: GOanna, url: https://agbase.arizona.edu/cgi-bin/tools/GOanna.cgi

Resource Title: Summary of all putative GO annotations received for Diabrotica virgifera virgifera RefSeq GCF_003013835.1 protein models.
File Name: Dvir_2.0_GOAnna_GOs.sn060d1588083909.txt
Resource Description: Putative GO annotations received for Diabrotica virgifera virgifera RefSeq GCF_003013835.1 protein models. Contains all putative hits with proteins in a curated UniProt invertebrate database, invertebrates_exponly.fa, maintained at AgBase (https://agbase.arizona.edu/cgi-bin/team.pl)
Resource Software Recommended: GOanna, url: https://agbase.arizona.edu/cgi-bin/tools/GOanna.cgi

Resource Title: Annotations used for Diabrotica virgifera virgifera RefSeq GCF_003013835.1 protein models.
File Name: Dvir_2.0_GOAnna_ProtAnnotations_sn060d1588083909.txt
Resource Description: Top annotation received for each Diabrotica virgifera virgifera RefSeq GCF_003013835.1 protein model. Contains only the "top" hit with proteins in a curated UniProt invertebrate database, invertebrates_exponly.fa, maintained at AgBase (https://agbase.arizona.edu/cgi-bin/team.pl). Carried through for analyses using GOslim.
Resource Software Recommended: GOanna, url: https://agbase.arizona.edu/cgi-bin/tools/GOanna.cgi

Resource Title: Putative GO annotations received for Diabrotica virgifera virgifera RefSeq GCF_003013835.1 protein models reformatted for GOslim input.
File Name: Dvir_2.0_GOAnna_sliminput_sn060d1588083909.txt
Resource Description: Putative GO terms received for Diabrotica virgifera virgifera RefSeq GCF_003013835.1 protein models.

Resource Title: Summary of GO biological process (BP) terms received for Diabrotica virgifera virgifera RefSeq GCF_003013835.1 protein models.
File Name: Dvir_2.0GOslimout.bp.w63fq51588102929.bp_.txt
Resource Software Recommended: GOslimViewer, url: https://agbase.arizona.edu/cgi-bin/tools/goslimviewer_select.pl

Resource Title: Summary of GO cellular component (CC) terms received for Diabrotica virgifera virgifera RefSeq GCF_003013835.1 protein models.
File Name: Dvir_2.0GOslimoutcc.w63fq51588102929.cc.txt
Resource Software Recommended: GOslimViewer, url: https://agbase.arizona.edu/cgi-bin/tools/goslimviewer_select.pl

Resource Title: Summary of GO molecular function (MF) terms received for Diabrotica virgifera virgifera RefSeq GCF_003013835.1 protein models.
File Name: Dvir_2.0_GOslimoutmf.w63fq51588102929.txt
Resource Software Recommended: GOslimViewer, url: https://agbase.arizona.edu/cgi-bin/tools/goslimviewer_select.pl

Resource Title: Gene annotation file (.gaf) output generated for GO terms assigned to Diabrotica virgifera virgifera RefSeq GCF_003013835.1 protein models.
File Name: GOanna2GA_Reformat_kky33f1592832838.xls
Resource Software Recommended: GOanna2ga, url: https://agbase.arizona.edu/cgi-bin/tools/GOanna2ga.cgi
CATH-Gene3D
ebi.ac.uk
Updated Oct 21, 2020
Cite
(2020). CATH-Gene3D [Dataset]. https://www.ebi.ac.uk/interpro/
Dataset updated
Oct 21, 2020
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The CATH-Gene3D database describes protein families and domain architectures in complete genomes. Protein families are formed using a Markov clustering algorithm, followed by multi-linkage clustering according to sequence identity. Mapping of predicted structure and sequence domains is undertaken using hidden Markov model libraries representing CATH and Pfam domains. CATH-Gene3D is based at University College London, UK.
List of UniProt feature "motif" (n = 68) in disordered regions (IDRs) that...
plos.figshare.com
xlsx
Updated Jun 1, 2023
Cite
Shehab S. Ahmed; Zaara T. Rifat; Ruchi Lohia; Arthur J. Campbell; A. Keith Dunker; M. Sohel Rahman; Sumaiya Iqbal (2023). List of UniProt feature "motif" (n = 68) in disordered regions (IDRs) that were found in the Eukaryotic Linear Motif (ELM) resource. [Dataset]. http://doi.org/10.1371/journal.pcbi.1009911.s015
Available download formats: xlsx
Unique identifier
https://doi.org/10.1371/journal.pcbi.1009911.s015
Dataset updated
Jun 1, 2023
Dataset provided by
PLOS Computational Biology
Authors
Shehab S. Ahmed; Zaara T. Rifat; Ruchi Lohia; Arthur J. Campbell; A. Keith Dunker; M. Sohel Rahman; Sumaiya Iqbal
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Out of 561 intrinsically disordered proteins (IDPs) studied in this work, 143 proteins had at least one short linear motif (total count = 237) according to the UniProt database (referred to as UniProt feature: “motif”). 68 out of these 237 UniProt-annotated motifs are recorded in the ELM resource, where they are grouped into different “ELM types” based on their function. For each of these 68 motifs, the table lists the UniProt feature description, UniProt identifier, gene name, ELM accession, identifier, type, and the start/end position of the motif as recorded in ELM. Additionally, we report the Gene Ontology terms for each motif as available in the ELM resource. The possible ELM types are: LIG—ligand sites, DOC—docking sites, TRG—subcellular targeting sites, DEG—degradation sites, and MOD—PTM sites. The proportion of motifs in different ELM types and GO terms are shown in S6 Fig. (XLSX)
neXtProt entries
nextprot.cn
nextprot.org
fasta, peff, ttl, txt +1
Updated Sep 11, 2023
Cite
neXtProt (2023). neXtProt entries [Dataset]. http://identifiers.org/MI:0000
Available download formats: ttl, txt, fasta, xml, peff
Unique identifier
https://identifiers.org/MI:0000
Dataset updated
Sep 11, 2023
Dataset authored and provided by
neXtProt
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The collection of neXtProt entries for human proteins
Homologous Invertebrate Genes Database
neuinfo.org
Updated Oct 1, 2024
Cite
(2024). Homologous Invertebrate Genes Database [Dataset]. http://identifiers.org/RRID:SCR_007716
Unique identifier
https://identifiers.org/RRID:SCR_007716
Dataset updated
Oct 1, 2024
Description
A database of homologous invertebrate genes, structured under the ACNUC sequence database management system. It allows one to select sets of homologous genes among invertebrate species, and to visualize multiple alignments and phylogenetic trees. The database itself contains all invertebrate protein sequences from UniProt (SWISS-PROT+TrEMBL), with some data corrected, clarified or completed (notably to address the problem of redundancy and orthology/paralogy) and with some annotation modifications. It also contains all the corresponding nucleotide sequences in EMBL. Homologous proteins are classified into families, and multiple alignments and phylogenetic trees are computed for each family. Sequences and related information have been structured in an ACNUC database. Thus, HOINVGEN is particularly useful for comparative sequence analysis, phylogeny and molecular evolution studies. More generally, HOINVGEN gives an overall view of what is known about a particular gene family.
Data from: Analysis of rod-cone dystrophy genes reveals unique mutational...
data.niaid.nih.gov
search.dataone.org
+1 more
zip
Updated Jan 11, 2023
Cite
Said El Shamieh; Lama Jaffal; Mariam Ibrahim (2023). Analysis of rod-cone dystrophy genes reveals unique mutational patterns [Dataset]. http://doi.org/10.5061/dryad.59zw3r2b7
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.59zw3r2b7
Dataset updated
Jan 11, 2023
Dataset provided by
Beirut Arab University
Lebanese International University
Lebanese University
Authors
Said El Shamieh; Lama Jaffal; Mariam Ibrahim
License
https://spdx.org/licenses/CC0-1.0.html
Description
Background: Rod-cone dystrophy (RCD) is the most common inherited retinal disease that is characterised by the progressive degeneration of retinal photoreceptors. The classification of RCD genes is based exclusively on the prevalence of gene mutations and does not consider the implication of the same gene in different phenotypes. Therefore, we first investigated the occurrence of mutations in autosomal recessive RCD (arRCD) and non-arRCD conditions. We then identified arRCD-enriched mutational patterns in specific genes and coding exons.

Methods and results: The mutation patterns differed according to arRCD (p=0.001). Specifically, when compared with missense mutations, insertions/deletions (OR=1.2, p=0.007), nonsense (OR=1.2, p=0.014) and splice-site mutations (OR=1.6, p=0.038) increased the odds of arRCD by 20%–60% versus non-arRCD conditions. The gene-based analysis identified that EYS, IMPG2, RP1L1 and USH2A mutations were enriched in arRCD (p<0.05). The exon-based analysis revealed specific mutation patterns in exons of CRB1, RP1L1 and exons 12, 60 and 62 coding for Lamin EGF and FTIII domains of USH2A.

Conclusion: The current analysis showed that many arRCD genes have unique mutational patterns.

Methods

Data extraction, inclusion, and exclusion criteria

The Retinal Information Network Database
The Retinal Information Network (Retnet) is a database that provides tables of genes and loci causing inherited retinal diseases (IRDs) (https://sph.uth.edu/retnet/). Thus, it was used to search for the arRCD genes. In total, sixty-three genes were found (ABCA4, AGBL5, AHR, ARHGEF18, ARL6, ARL2BP, BBS1, BBS2, BEST1, C2orf71, C8orf37, CERKL, CLCC1, CLRN1, CNGA1, CNGB1, CRB1, CYP4V2, DHDDS, DHX38, EMC1, EYS, FAM161A, GPR125, HGSNAT, IDH3B, IFT140, IFT172, IMPG2, KIAA1549, KIZ, LRAT, MAK, MERTK, MVK, NEK2, NEUROD1, NR2E3, NRL, PDE6A, PDE6B, PDE6G, POMGNT1, PRCD, PROM1, RBP3, REEP6, RGR, RHO, RLBP1, RP1, RP1L1, RPE65, SAG, SAMD11, SLC7A14, SPATA7, TRNT1, TTC8, TULP1, USH2A, ZNF408, ZNF513) (last accessed on June 10, 2021).

HGMD database
Our sample population is individuals affected with rod-cone dystrophy. The genotype of these individuals was determined through various genotyping and sequencing techniques (genotyping arrays) such as next-generation sequencing (targeted and whole exome) and Sanger sequencing. Genetic variations in the sixty-three arRCD genes were downloaded in .txt format, with information including c.DNA position, protein position, class, associated phenotype, and corresponding reference (N=7,382). Mutations associated with 'retinal dystrophy', 'retinal degeneration' or 'retinal disease' were not included in the analysis since these terms are broad and do not allow a correct diagnosis. This led to a total of 6,627 mutations (http://www.hgmd.cf.ac.uk/ac/index.php, accessed: September 10, 2021). We removed the Rhodopsin mutations that were reported to have a dominant effect. Furthermore, we removed the 'duplicate' mutations; these are different DNA mutations that lead to the same amino acid exchange in a gene. This filtering kept 5,868 mutations. For every mutation, we added a type (missense, nonsense, insertion/deletion (InDel), or splice site) based on the HGMD annotation.

LOVD database
Genetic variations in arRCD genes were also downloaded from the LOVD database (N=1,104; https://www.lovd.nl/, accessed: September 20, 2021).

UniProt and Gene databases
All amino acid (a.a) domains were retrieved from the UniProt database (https://www.uniprot.org/). The longest mRNA isoform was selected from the NCBI Gene database (http://www.ncbi.nlm.nih.gov/gene).

Statistical analyses
The analyses were conducted using SPSS software version 20 (SPSS, Inc., IL, USA). All studied variables were expressed as frequencies. The plots were generated using Origin software (OriginPro, Version 8, OriginLab Corporation, Northampton, MA, USA).

Mutations distribution stratified according to autosomal recessive rod-cone dystrophy
The number of unique mutations in the HGMD database was compared according to arRCD phenotype (arRCD vs. non-arRCD). Data from the LOVD database were used in the analysis of the total mutations only. Individual mutations, and not their frequencies, were used in the distribution analysis; thus, even if more than one affected individual carried a mutation, it was counted once. However, the genetic heterogeneity at the mutational level was considered: mutations causing more than one phenotype were counted more than once, according to their number of associations. The chi-square (χ2) goodness of fit test was used to compare the number of mutations (globally, per gene, and per exon) according to arRCD.
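As a rough illustration of the chi-square (χ2) goodness-of-fit comparison mentioned in the description above, the following Python sketch computes the statistic for a two-category split. The counts, the equal expected proportions, and the function names are hypothetical placeholders, not values or code from the dataset; the authors report running their analyses in SPSS.

```python
import math


def chi_square_gof(observed, expected_props):
    """Return (statistic, degrees of freedom) for a chi-square goodness-of-fit test.

    observed: list of observed counts per category.
    expected_props: expected proportion per category (must sum to 1).
    """
    if len(observed) != len(expected_props):
        raise ValueError("observed and expected_props must have the same length")
    if abs(sum(expected_props) - 1.0) > 1e-9:
        raise ValueError("expected proportions must sum to 1")
    total = sum(observed)
    expected = [total * p for p in expected_props]
    # Sum of (observed - expected)^2 / expected over all categories.
    stat = math.fsum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return stat, len(observed) - 1


def p_value_one_dof(stat):
    """Upper-tail p-value of a chi-square statistic, valid for 1 degree of freedom only."""
    return math.erfc(math.sqrt(stat / 2.0))


# Hypothetical counts of mutations in two groups (illustrative only).
counts = [60, 40]
stat, dof = chi_square_gof(counts, [0.5, 0.5])
print(f"chi2 = {stat:.3f}, dof = {dof}, p = {p_value_one_dof(stat):.4f}")
```

For more than two categories the statistic is computed the same way, but the p-value needs the general chi-square distribution (for example from a statistics library) because the closed form used here only holds for one degree of freedom.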
Australian Nucleotide (DNA/RNA) and Protein sequences from Australian...
researchdata.edu.au
Updated Jul 23, 2012
Cite
QFAB Bioinformatics (2012). Australian Nucleotide (DNA/RNA) and Protein sequences from Australian organisms in the species Ficus virens [Dataset]. https://researchdata.edu.au/australian-nucleotide-dnarna-ficus-virens/79819
Dataset updated
Jul 23, 2012
Dataset provided by
QFAB
Authors
QFAB Bioinformatics
Area covered
Australia
Description
This data collection contains all currently published nucleotide (DNA/RNA) and protein sequences from Australian Ficus virens, commonly known as Albayi. Other information about this group:

The nucleotide (DNA/RNA) and protein sequences have been sourced through the European Nucleotide Archive (ENA) and Universal Protein Resource (UniProt), databases that contain comprehensive sets of nucleotide (DNA/RNA) and protein sequences from all organisms that have been published by the International Research Community.

The identification of species in Ficus virens as Australian dwelling organisms has been achieved by accessing the Australian Plant Census (APC) or Australian Faunal Directory (AFD) through the Atlas of Living Australia.
Onto-Translate
dknet.org
Updated May 27, 2025
Cite
(2025). Onto-Translate [Dataset]. http://identifiers.org/RRID:SCR_005725
Unique identifier
https://identifiers.org/RRID:SCR_005725 https://identifiers.org/RRID:SCR_005725/resolver/mentions?q=&i=rrid
Dataset updated
May 27, 2025
Description
In the annotation world, the same piece of information can be stored and viewed differently across different databases. For instance, more than one Affymetrix probe ID can refer to the same GenBank sequence (accession number) and more than one nucleotide sequence from GenBank can be grouped in a single UniGene cluster. The result of Onto-Express depends on whether the input list contains Affymetrix probe IDs, GenBank accession numbers or UniGene cluster IDs. The user has to be aware of the relations between the different forms of the data in order to interpret the results correctly. Even if the user is aware of the relationships and knows how to convert them, most existing tools allow conversion of only individual genes. Onto-Translate is a tool that allows the user to perform such translations easily. For example, it can translate Affymetrix probe IDs, GO terms, etc. into other identifiers such as GenBank accession numbers and UniProt IDs. User account required. Platform: Online tool
Data from: BioCyc Database Collection
datadiscoverystudio.org
resource url
Cite
BioCyc Database Collection [Dataset]. http://datadiscoverystudio.org/geoportal/rest/metadata/item/e6cd1552d15a41c59e1afa174251cecc/html
Available download formats: resource url
Description
Link Function: information
SMART
ebi.ac.uk
Updated Feb 14, 2020
Cite
(2020). SMART [Dataset]. https://www.ebi.ac.uk/interpro/
Dataset updated
Feb 14, 2020
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
SMART (a Simple Modular Architecture Research Tool) allows the identification and annotation of genetically mobile domains and the analysis of domain architectures. SMART is based at EMBL, Heidelberg, Germany.
ProSAS
dknet.org
neuinfo.org
+2 more
Updated Jul 19, 2025
Cite
(2025). ProSAS [Dataset]. http://identifiers.org/RRID:SCR_007876
Unique identifier
https://identifiers.org/RRID:SCR_007876
Dataset updated
Jul 19, 2025
Description
This database provides a unified resource to analyze the effects of alternative splicing events on the structure of the resulting protein isoforms. ProSAS comprehensively annotates protein structures for several Ensembl genomes, and alternative transcripts can be analyzed at the protein structure and protein function level using the intuitive user interface of the database. Users can search based on Ensembl gene or Ensembl transcript IDs, gene descriptions, UniProt gene names, genes matching patterns, Swiss-Prot/UniProt identifiers or Affymetrix probeset IDs.
Characterization of Disease-Associated Mutations in Human Transmembrane...
figshare.com
xlsx
Updated May 31, 2023
Cite
János Molnár; Gergely Szakács; Gábor E. Tusnády (2023). Characterization of Disease-Associated Mutations in Human Transmembrane Proteins [Dataset]. http://doi.org/10.1371/journal.pone.0151760
Available download formats: xlsx
Unique identifier
https://doi.org/10.1371/journal.pone.0151760
Dataset updated
May 31, 2023
Dataset provided by
PLOS ONE
Authors
János Molnár; Gergely Szakács; Gábor E. Tusnády
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Transmembrane protein coding genes are commonly associated with human diseases. We characterized disease causing mutations and natural polymorphisms in transmembrane proteins by mapping missense genetic variations from the UniProt database on the transmembrane protein topology listed in the Human Transmembrane Proteome database. We found characteristic differences in the spectrum of amino acid changes within transmembrane regions: in the case of disease associated mutations the non-polar to non-polar and non-polar to charged amino acid changes are equally frequent. In contrast, in the case of natural polymorphisms non-polar to charged amino acid changes are rare while non-polar to non-polar changes are common. The majority of disease associated mutations result in glycine to arginine and leucine to proline substitutions. Mutations to positively charged amino acids are more common in the center of the lipid bilayer, where they cause more severe structural and functional anomalies. Our analysis contributes to the better understanding of the effect of disease associated mutations in transmembrane proteins, which can help prioritize genetic variations in personal genomic investigations.
The Pathogen-Host Interactions Database, version 4.17
zenodo.org
data.niaid.nih.gov
bin, csv, html, json
Updated Sep 16, 2024
Cite
Martin Urban; Alayne Cuzick; James Seager; Kim Hammond-Kosack (2024). The Pathogen-Host Interactions Database, version 4.17 [Dataset]. http://doi.org/10.5281/zenodo.13485488
Available download formats: json, html, bin, csv
Unique identifier
https://doi.org/10.5281/zenodo.13485488
Dataset updated
Sep 16, 2024
Dataset provided by
Zenodo http://zenodo.org/
Authors
Martin Urban; Alayne Cuzick; James Seager; Kim Hammond-Kosack
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
PHI-base is an online database (available at phi-base.org) that catalogues experimentally verified pathogenicity, virulence and effector genes from fungal, oomycete and bacterial pathogens, which infect animal, plant, fungal and insect hosts. PHI-base is a valuable resource in the discovery of genes in medically and agronomically important pathogens, which may be potential targets for chemical intervention.

Each entry in PHI-base is curated by domain experts and is supported by strong experimental evidence (for example, gene disruption and gene complementation experiments), as well as literature references in which the original experiments are described. Each gene in PHI-base is presented with its nucleotide sequence and deduced amino acid sequence (available in a FASTA file), as well as a detailed description of the predicted protein's function during the host infection process. To facilitate data interoperability, we have annotated genes using ontologies, controlled vocabularies, and links to external sources (including UniProt, Gene Ontology, Enzyme Commission, NCBI Taxonomy, EMBL, PubMed and FRAC).

This PHI-base dataset is a Frictionless Data Package that contains an export of the PHI-base database in CSV format (comma-separated values), plus a FASTA file with sequences for each gene in the database. This version of the dataset, version 4.17, contains 5,521 publications, covering 22,408 pathogen–host interactions and 9,973 pathogen genes across 296 pathogen species and 249 host species.

Erratum

Please note that the funding information included in the readme file for this dataset (specifically README.md and README.html) is incorrect. The correct funding sources are Growing Health [BB/X010953/1; BBS/E/RH/230003A] and Delivering Sustainable Wheat [BB/X011003/1; BBS/E/RH/230001B], both ultimately funded by the Biotechnology and Biological Sciences Research Council (BBSRC). The metadata for this dataset has been amended to use the correct funding sources (updated 16 September 2024).
HAMAP
ebi.ac.uk
Updated Feb 5, 2025
Cite
(2025). HAMAP [Dataset]. https://www.ebi.ac.uk/interpro/
Dataset updated
Feb 5, 2025
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
HAMAP stands for High-quality Automated and Manual Annotation of Proteins. HAMAP profiles are manually created by expert curators. They identify proteins that are part of well-conserved protein families or subfamilies. HAMAP is based at the SIB Swiss Institute of Bioinformatics, Geneva, Switzerland.
SUPERFAMILY
ebi.ac.uk
Updated Nov 8, 2010
Cite
(2010). SUPERFAMILY [Dataset]. https://www.ebi.ac.uk/interpro/
Dataset updated
Nov 8, 2010
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
SUPERFAMILY is a library of profile hidden Markov models that represent all proteins of known structure. The library is based on the SCOP classification of proteins: each model corresponds to a SCOP domain and aims to represent the entire SCOP superfamily that the domain belongs to. SUPERFAMILY is based at the University of Bristol, UK.

Gene Ontology Annotation Database

430 scholarly articles cite this dataset (View in Google Scholar)
