The GOA (Gene Ontology Annotation) project provides high-quality Gene Ontology (GO) annotations to proteins in the UniProt Knowledgebase (UniProtKB) and International Protein Index (IPI). This involves electronic annotation and the integration of high-quality manual GO annotation from all GO Consortium model organism groups and specialist groups.
A data set of manually annotated Chordata-specific proteins, as well as those that are widely conserved. The program keeps existing human entries up to date and broadens manual annotation to other vertebrate species, especially model organisms, including great apes, cow, mouse, rat, chicken and zebrafish, as well as Xenopus laevis and Xenopus tropicalis. A draft of the complete human proteome is available in UniProtKB/Swiss-Prot, and one of the current priorities of the Chordata protein annotation program is to improve the quality of the human sequences provided. To this end, they are updating sequences that show discrepancies with those predicted from the genome sequence. Dubious isoforms, sequences based on experimental artifacts and protein products derived from erroneous gene model predictions are also revisited. This work is done in part in collaboration with the Hinxton Sequence Forum (HSF), which allows active exchange between the UniProt, HAVANA, Ensembl and HGNC groups, as well as with the RefSeq database. UniProt is a member of the Consensus CDS project, and they are in the process of reviewing their records to support convergence towards a standard set of protein annotation. They also continuously update human entries with functional annotation, including novel structural, post-translational modification, interaction and enzymatic activity data. To identify candidates for re-annotation, they use, among others, information extraction tools such as the STRING database. In addition, they regularly add new sequence variants and maintain disease information. This annotation program includes the Variation Annotation Program, the goal of which is to annotate all known human genetic diseases and disease-linked protein variants, as well as neutral polymorphisms.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
PROSITE is a database of protein families and domains. It consists of biologically significant sites, patterns and profiles that help to reliably identify to which known protein family a new sequence belongs. PROSITE is based at the Swiss Institute of Bioinformatics (SIB), Geneva, Switzerland.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
SFLD (Structure-Function Linkage Database) is a hierarchical classification of enzymes that relates specific sequence-structure features to specific chemical capabilities.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Proteins from multiple databases were used in algal genome annotation. We first downloaded the protein data of green algae (Chlorophyta) and red algae (Rhodophyta) from UniProt for gene prediction on 10 higher-quality assemblies. We refer to these protein sequences, used for gene prediction on the RefSeq genomes, as seed_algae_mix; their role is to complete gene prediction of the higher-quality genomes quickly and accurately. The predictions obtained for the RefSeq genomes are then used as query sequences against the NR (RefSeq non-redundant proteins) database, and the matched target sequences are extracted as the NR_extract part. Subsequently, we retrieved from UniProt the protein sequences of all algal lineages. Note that we did not select only the protein sequences of the 17 lineages to be annotated, but those of all algae from the 21 lineages found in the NCBI Taxonomy: a total of 2,432,633 (1.17 GB) algal protein sequences, named total_algae_mix, kept for future predictions. The plants part of OrthoDB v10.1 and the BUSCO proteins of 7 lineages were also merged into pr_total_mix.
Gene ontologies were generated using GOanna with a standard pipeline (https://agbase-docs.readthedocs.io/en/latest/goanna/using_goanna_cmd.html ; default settings), with queries against the invertebrate subsection of the UniProt database. Alignments are provided in HTML format. The initial set of gene ontology (GO) terms in the sliminput.txt files generated by GOanna was used as input for GOSlimViewer to parse and summarize molecular function (F), biological process (P) and cellular component (C) at level 2. Annotations were also converted to a gene annotation format (.gaf) file using GOanna2ga.

Resources in this dataset:

Resource Title: Alignment of Diabrotica virgifera virgifera RefSeq GCF_003013835.1 protein models against invertebrate proteins in the UniProt database.
File Name: Dvir_2.0GOAnna.align.sn060d1588083909.html
Resource Description: Gapped BLAST and PSI-BLAST results for Diabrotica virgifera virgifera RefSeq GCF_003013835.1 protein models against 22,595 invertebrate protein sequences in the UniProt database at AgBase (invertebrates_exponly.fa).
Resource Software Recommended: GOanna, url: https://agbase.arizona.edu/cgi-bin/tools/GOanna.cgi

Resource Title: Summary of all putative GO annotations received for Diabrotica virgifera virgifera RefSeq GCF_003013835.1 protein models.
File Name: Dvir_2.0_GOAnna_GOs.sn060d1588083909.txt
Resource Description: Putative GO annotations received for Diabrotica virgifera virgifera RefSeq GCF_003013835.1 protein models. Contains all putative hits with proteins in a curated UniProt invertebrate database, invertebrates_exponly.fa, maintained at AgBase (https://agbase.arizona.edu/cgi-bin/team.pl).
Resource Software Recommended: GOanna, url: https://agbase.arizona.edu/cgi-bin/tools/GOanna.cgi

Resource Title: Annotations used for Diabrotica virgifera virgifera RefSeq GCF_003013835.1 protein models.
File Name: Dvir_2.0_GOAnna_ProtAnnotations_sn060d1588083909.txt
Resource Description: Top annotation received for each Diabrotica virgifera virgifera RefSeq GCF_003013835.1 protein model. Contains only the "top" hit with proteins in a curated UniProt invertebrate database, invertebrates_exponly.fa, maintained at AgBase (https://agbase.arizona.edu/cgi-bin/team.pl). Carried through for analyses using GOslim.
Resource Software Recommended: GOanna, url: https://agbase.arizona.edu/cgi-bin/tools/GOanna.cgi

Resource Title: Putative GO annotations received for Diabrotica virgifera virgifera RefSeq GCF_003013835.1 protein models reformatted for GOslim input.
File Name: Dvir_2.0_GOAnna_sliminput_sn060d1588083909.txt
Resource Description: Putative GO terms received for Diabrotica virgifera virgifera RefSeq GCF_003013835.1 protein models.

Resource Title: Summary of GO biological process (BP) terms received for Diabrotica virgifera virgifera RefSeq GCF_003013835.1 protein models.
File Name: Dvir_2.0GOslimout.bp.w63fq51588102929.bp_.txt
Resource Software Recommended: GOslimViewer, url: https://agbase.arizona.edu/cgi-bin/tools/goslimviewer_select.pl

Resource Title: Summary of GO cellular component (CC) terms received for Diabrotica virgifera virgifera RefSeq GCF_003013835.1 protein models.
File Name: Dvir_2.0GOslimoutcc.w63fq51588102929.cc.txt
Resource Software Recommended: GOslimViewer, url: https://agbase.arizona.edu/cgi-bin/tools/goslimviewer_select.pl

Resource Title: Summary of GO molecular function (MF) terms received for Diabrotica virgifera virgifera RefSeq GCF_003013835.1 protein models.
File Name: Dvir_2.0_GOslimoutmf.w63fq51588102929.txt
Resource Software Recommended: GOslimViewer, url: https://agbase.arizona.edu/cgi-bin/tools/goslimviewer_select.pl

Resource Title: Gene annotation file (.gaf) output generated for GO terms assigned to Diabrotica virgifera virgifera RefSeq GCF_003013835.1 protein models.
File Name: GOanna2GA_Reformat_kky33f1592832838.xls
Resource Software Recommended: GOanna2ga, url: https://agbase.arizona.edu/cgi-bin/tools/GOanna2ga.cgi
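The F/P/C summary described for this dataset can be illustrated with a minimal sketch that tallies annotations per GO aspect from GAF-formatted text. It assumes the standard tab-separated GAF 2.x layout (GO ID in column 5, aspect in column 9); the sample rows, the `make_row` helper and the `protein_00x` identifiers are hypothetical and are not taken from the dataset files, and this is not the GOSlimViewer procedure itself.

```python
from collections import Counter

# Aspect codes used in GAF column 9.
ASPECT_NAMES = {
    "F": "molecular function",
    "P": "biological process",
    "C": "cellular component",
}


def make_row(object_id, go_id, aspect):
    """Build a hypothetical 17-column GAF 2.x row (illustration only)."""
    fields = [""] * 17
    fields[0] = "ExampleDB"   # column 1: DB (made-up value)
    fields[1] = object_id     # column 2: DB Object ID
    fields[4] = go_id         # column 5: GO ID
    fields[6] = "ISS"         # column 7: evidence code
    fields[8] = aspect        # column 9: aspect (F, P or C)
    return "\t".join(fields)


def count_aspects(gaf_lines):
    """Count annotations per aspect, skipping comments and short rows."""
    counts = Counter()
    for line in gaf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("!"):
            continue  # '!' marks header/comment lines in GAF
        fields = line.split("\t")
        if len(fields) < 9:
            continue  # not a complete annotation row
        counts[fields[8]] += 1
    return counts


SAMPLE_GAF = [
    "!gaf-version: 2.2",
    make_row("protein_001", "GO:0003677", "F"),
    make_row("protein_001", "GO:0005634", "C"),
    make_row("protein_002", "GO:0006355", "P"),
    make_row("protein_003", "GO:0003677", "F"),
]

aspect_counts = count_aspects(SAMPLE_GAF)
for code, name in ASPECT_NAMES.items():
    print(f"{name}: {aspect_counts[code]}")
```

Reading a real file would only require passing an open file handle instead of `SAMPLE_GAF`, provided the file is plain tab-separated text rather than a spreadsheet export.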
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The CATH-Gene3D database describes protein families and domain architectures in complete genomes. Protein families are formed using a Markov clustering algorithm, followed by multi-linkage clustering according to sequence identity. Mapping of predicted structure and sequence domains is undertaken using hidden Markov model libraries representing CATH and Pfam domains. CATH-Gene3D is based at University College London, UK.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Out of 561 intrinsically disordered proteins (IDPs) studied in this work, 143 proteins had at least one short linear motif (total count = 237) according to the UniProt database (referred to as UniProt feature: “motif”). 68 out of these 237 UniProt-annotated motifs are recorded in the ELM resource, where they are grouped into different “ELM types” based on their function. For each of these 68 motifs, the table lists the UniProt feature description, UniProt identifier, gene name, ELM accession, identifier, type, and the start/end position of the motif as recorded in ELM. Additionally, we report the Gene Ontology terms for each motif as available in the ELM resource. The possible ELM types are: LIG—ligand sites, DOC—docking sites, TRG—subcellular targeting sites, DEG—degradation sites, and MOD—PTM sites. The proportion of motifs in different ELM types and GO terms are shown in S6 Fig. (XLSX)
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The collection of neXtProt entries for human proteins
A database of homologous invertebrate genes, structured under the ACNUC sequence database management system. It allows one to select sets of homologous genes among invertebrate species, and to visualize multiple alignments and phylogenetic trees. The database itself contains all invertebrate protein sequences from UniProt (SWISS-PROT+TrEMBL), with some data corrected, clarified or completed (notably to address the problem of redundancy and orthology/paralogy) and with some annotation modifications. It also contains all the corresponding nucleotide sequences in EMBL. Homologous proteins are classified into families, and multiple alignments and phylogenetic trees are computed for each family. Sequences and related information have been structured in an ACNUC database. Thus, HOINVGEN is particularly useful for comparative sequence analysis, phylogeny and molecular evolution studies. More generally, HOINVGEN gives an overall view of what is known about a particular gene family.
https://spdx.org/licenses/CC0-1.0.html
Background: Rod-cone dystrophy (RCD) is the most common inherited retinal disease and is characterised by the progressive degeneration of retinal photoreceptors. The classification of RCD genes is based exclusively on the prevalence of gene mutations and does not consider the implication of the same gene in different phenotypes. Therefore, we first investigated the occurrence of mutations in autosomal recessive RCD (arRCD) and non-arRCD conditions. We then identified arRCD-enriched mutational patterns in specific genes and coding exons.
Methods and results: The mutation patterns differed according to arRCD (p=0.001). Specifically, compared with missense mutations, insertions/deletions (OR=1.2, p=0.007), nonsense (OR=1.2, p=0.014) and splice-site mutations (OR=1.6, p=0.038) increased the odds of arRCD by 20%–60% versus non-arRCD conditions. The gene-based analysis identified that EYS, IMPG2, RP1L1 and USH2A mutations were enriched in arRCD (p<0.05). The exon-based analysis revealed specific mutation patterns in exons of CRB1, RP1L1 and exons 12, 60 and 62 coding for Lamin EGF and FTIII domains of USH2A.
Conclusion: The current analysis showed that many arRCD genes have unique mutational patterns.
Methods

Data extraction, inclusion, and exclusion criteria

The Retinal Information Network Database
The Retinal Information Network (Retnet) is a database that provides tables of genes and loci causing IRDs (https://sph.uth.edu/retnet/). It was therefore used to search for the arRCD genes. In total, sixty-three genes were found (ABCA4, AGBL5, AHR, ARHGEF18, ARL6, ARL2BP, BBS1, BBS2, BEST1, C2orf71, C8orf37, CERKL, CLCC1, CLRN1, CNGA1, CNGB1, CRB1, CYP4V2, DHDDS, DHX38, EMC1, EYS, FAM161A, GPR125, HGSNAT, IDH3B, IFT140, IFT172, IMPG2, KIAA1549, KIZ, LRAT, MAK, MERTK, MVK, NEK2, NEUROD1, NR2E3, NRL, PDE6A, PDE6B, PDE6G, POMGNT1, PRCD, PROM1, RBP3, REEP6, RGR, RHO, RLBP1, RP1, RP1L1, RPE65, SAG, SAMD11, SLC7A14, SPATA7, TRNT1, TTC8, TULP1, USH2A, ZNF408, ZNF513) (last accessed on June 10, 2021).

HGMD database
Our sample population consists of individuals affected with rod-cone dystrophy. The genotype of these individuals was determined through various genotyping and sequencing techniques, such as genotyping arrays, next-generation sequencing (targeted and whole exome) and Sanger sequencing. Genetic variations in the sixty-three arRCD genes were downloaded in .txt format, with information including cDNA position, protein position, class, associated phenotype, and corresponding reference (N=7,382). Mutations associated with 'retinal dystrophy', 'retinal degeneration' or 'retinal disease' were not included in the analysis, since these terms are broad and do not allow a correct diagnosis. This led to a total of 6,627 mutations (http://www.hgmd.cf.ac.uk/ac/index.php, accessed: September 10, 2021). We removed the Rhodopsin mutations that were reported to have a dominant effect. Furthermore, we removed the 'duplicate' mutations; these are different DNA mutations that lead to the same amino acid exchange in a gene. This filtering kept 5,868 mutations. For every mutation, we added a type (missense, nonsense, insertion/deletion (InDel), or splice site) based on the HGMD annotation.

LOVD database
Genetic variations in arRCD genes were also downloaded from the LOVD database (N=1,104; https://www.lovd.nl/, accessed: September 20, 2021).

UniProt and Gene databases
All amino acid (a.a) domains were retrieved from the UniProt database (https://www.uniprot.org/). The longest mRNA isoform was selected from the NCBI Gene database (http://www.ncbi.nlm.nih.gov/gene).

Statistical analyses
The analyses were conducted using SPSS software version 20 (SPSS, Inc., IL, USA). All studied variables were expressed as frequencies. The plots were generated using Origin software (OriginPro, Version 8, OriginLab Corporation, Northampton, MA, USA).

Mutations distribution stratified according to autosomal recessive rod-cone dystrophy
The number of unique mutations in the HGMD database was compared according to arRCD phenotype (arRCD vs. non-arRCD). Data from the LOVD database were used in the analysis of the total mutations only. Individual mutations, not their frequencies, were used in the distribution analysis; thus, even if more than one affected individual carried a mutation, it was counted once. However, the genetic heterogeneity at the mutational level was considered: mutations causing more than one phenotype were counted more than once, according to their number of associations. The chi-square (χ2) goodness of fit test was used to compare the number of mutations (globally, per gene, and per exon) according to arRCD.
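The goodness-of-fit comparison named above can be sketched as follows. This is a minimal illustration and not the SPSS procedure used in the study: the counts are hypothetical, the default of equal expected frequencies is an assumption (the text does not state the expected distribution), and the closed-form p-value shown applies only to one degree of freedom, that is, a two-category comparison such as arRCD versus non-arRCD.

```python
import math


def chi_square_gof(observed, expected=None):
    """Return (statistic, degrees of freedom) for a goodness-of-fit test.

    If `expected` is omitted, equal expected counts are assumed.
    """
    total = sum(observed)
    if expected is None:
        expected = [total / len(observed)] * len(observed)
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return statistic, len(observed) - 1


def p_value_df1(statistic):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom.

    With df = 1 the statistic is a squared standard normal variable,
    so P(X > x) = erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(statistic / 2.0))


# Hypothetical counts of unique mutations in one gene:
# [associated with arRCD, associated with non-arRCD conditions]
observed_counts = [60, 40]

stat, dof = chi_square_gof(observed_counts)
p_value = p_value_df1(stat)
print(f"chi-square = {stat:.2f}, df = {dof}, p = {p_value:.4f}")
```

For comparisons with more than two categories, a statistics library would be needed to obtain the p-value from the chi-square distribution with the corresponding degrees of freedom.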
This data collection contains all currently published nucleotide (DNA/RNA) and protein sequences from Australian Ficus virens, commonly known as Albayi. Other information about this group:
The nucleotide (DNA/RNA) and protein sequences have been sourced through the European Nucleotide Archive (ENA) and Universal Protein Resource (UniProt), databases that contain comprehensive sets of nucleotide (DNA/RNA) and protein sequences from all organisms that have been published by the international research community.
The identification of species in Ficus virens as Australian dwelling organisms has been achieved by accessing the Australian Plant Census (APC) or Australian Faunal Directory (AFD) through the Atlas of Living Australia.
In the annotation world, the same piece of information can be stored and viewed differently across different databases. For instance, more than one Affymetrix probe ID can refer to the same GenBank sequence (accession number), and more than one nucleotide sequence from GenBank can be grouped in a single UniGene cluster. The result of Onto-Express depends on whether the input list contains Affymetrix probe IDs, GenBank accession numbers or UniGene cluster IDs. The user has to be aware of the relations between the different forms of the data in order to interpret the results correctly. Even if the user is aware of the relationships and knows how to convert between them, most existing tools allow conversions of individual genes only. Onto-Translate is a tool that allows the user to perform such translations easily, converting identifiers such as Affymetrix probe IDs or GO terms into other identifiers such as GenBank accession numbers and UniProt IDs. User account required. Platform: Online tool
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
SMART (a Simple Modular Architecture Research Tool) allows the identification and annotation of genetically mobile domains and the analysis of domain architectures. SMART is based at EMBL, Heidelberg, Germany.
This database provides a unified resource to analyze the effects of alternative splicing events on the structure of the resulting protein isoforms. ProSAS comprehensively annotates protein structures for several Ensembl genomes and alternative transcripts can be analyzed on the protein structure and protein function level using the intuitive user interface of the database. Users can search based on Ensembl gene or Ensembl transcript ids, Gene descriptions, Uniprot gene names, Genes matching patterns, Swissprot/Uniprot identifiers or Affymetrix probeset ids.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Transmembrane protein coding genes are commonly associated with human diseases. We characterized disease causing mutations and natural polymorphisms in transmembrane proteins by mapping missense genetic variations from the UniProt database on the transmembrane protein topology listed in the Human Transmembrane Proteome database. We found characteristic differences in the spectrum of amino acid changes within transmembrane regions: in the case of disease associated mutations the non-polar to non-polar and non-polar to charged amino acid changes are equally frequent. In contrast, in the case of natural polymorphisms non-polar to charged amino acid changes are rare while non-polar to non-polar changes are common. The majority of disease associated mutations result in glycine to arginine and leucine to proline substitutions. Mutations to positively charged amino acids are more common in the center of the lipid bilayer, where they cause more severe structural and functional anomalies. Our analysis contributes to the better understanding of the effect of disease associated mutations in transmembrane proteins, which can help prioritize genetic variations in personal genomic investigations.
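The kind of tally behind statements such as "non-polar to charged" can be sketched as below. The grouping of amino acids into non-polar, polar and charged classes is one common textbook convention and is an assumption here, since the description does not give the exact scheme used; the variant list is hypothetical and does not come from UniProt or the Human Transmembrane Proteome database.

```python
from collections import Counter

# Assumed physicochemical grouping (one-letter amino acid codes).
AMINO_ACID_CLASS = {}
for residue in "GAVLIMPFW":
    AMINO_ACID_CLASS[residue] = "non-polar"
for residue in "STCYNQ":
    AMINO_ACID_CLASS[residue] = "polar"
for residue in "KRHDE":
    AMINO_ACID_CLASS[residue] = "charged"


def change_class(reference, variant):
    """Describe a missense change by the classes of both residues."""
    return f"{AMINO_ACID_CLASS[reference]} to {AMINO_ACID_CLASS[variant]}"


def tally_changes(substitutions):
    """Count substitutions per class transition."""
    return Counter(change_class(ref, var) for ref, var in substitutions)


# Hypothetical (reference, variant) residue pairs within transmembrane regions.
example_substitutions = [
    ("G", "R"),  # glycine to arginine
    ("L", "P"),  # leucine to proline
    ("A", "V"),
    ("S", "T"),
]

change_counts = tally_changes(example_substitutions)
for transition, count in change_counts.most_common():
    print(f"{transition}: {count}")
```

Running the same tally separately on disease-associated variants and on polymorphisms would give the two spectra that the description compares.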
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
PHI-base is an online database (available at phi-base.org) that catalogues experimentally verified pathogenicity, virulence and effector genes from fungal, oomycete and bacterial pathogens, which infect animal, plant, fungal and insect hosts. PHI-base is a valuable resource in the discovery of genes in medically and agronomically important pathogens, which may be potential targets for chemical intervention.
Each entry in PHI-base is curated by domain experts and is supported by strong experimental evidence (for example, gene disruption and gene complementation experiments), as well as literature references in which the original experiments are described. Each gene in PHI-base is presented with its nucleotide sequence and deduced amino acid sequence (available in a FASTA file), as well as a detailed description of the predicted protein's function during the host infection process. To facilitate data interoperability, we have annotated genes using ontologies, controlled vocabularies, and links to external sources (including UniProt, Gene Ontology, Enzyme Commission, NCBI Taxonomy, EMBL, PubMed and FRAC).
This PHI-base dataset is a Frictionless Data Package that contains an export of the PHI-base database in CSV format (comma-separated values), plus a FASTA file with sequences for each gene in the database. This version of the dataset, version 4.17, contains 5,521 publications, covering 22,408 pathogen–host interactions and 9,973 pathogen genes across 296 pathogen species and 249 host species.
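A minimal sketch of how a CSV export and a FASTA file of this kind could be read together using only the Python standard library. The column names (`gene_id`, `pathogen_species`), the identifiers and the sequences below are hypothetical placeholders and do not reflect the actual PHI-base schema; in-memory text stands in for the real files.

```python
import csv
import io


def parse_fasta(handle):
    """Return a dict mapping each record ID (first word of the header) to its sequence."""
    records = {}
    current_id = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if current_id is not None:
                records[current_id] = "".join(chunks)
            parts = line[1:].split()
            current_id = parts[0] if parts else ""
            chunks = []
        elif current_id is not None:
            chunks.append(line)  # sequence lines may be wrapped
    if current_id is not None:
        records[current_id] = "".join(chunks)
    return records


# Hypothetical stand-ins for the CSV and FASTA files in the package.
EXAMPLE_CSV = """gene_id,pathogen_species
gene_001,Example species A
gene_002,Example species B
"""

EXAMPLE_FASTA = """>gene_001 hypothetical protein
MKTAY
IAKQR
>gene_002 hypothetical protein
MSLNE
"""

sequences = parse_fasta(io.StringIO(EXAMPLE_FASTA))
rows = list(csv.DictReader(io.StringIO(EXAMPLE_CSV)))

# Attach each sequence to its CSV row by the shared (assumed) identifier.
for row in rows:
    row["sequence"] = sequences.get(row["gene_id"], "")
    print(row["gene_id"], row["pathogen_species"], len(row["sequence"]))
```

For the real package, the file names and column names should be taken from the package descriptor rather than assumed.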
Please note that the funding information included in the readme file for this dataset (specifically README.md and README.html) is incorrect. The correct funding sources are Growing Health [BB/X010953/1; BBS/E/RH/230003A] and Delivering Sustainable Wheat [BB/X011003/1; BBS/E/RH/230001B], both ultimately funded by the Biotechnology and Biological Sciences Research Council (BBSRC). The metadata for this dataset has been amended to use the correct funding sources (updated 16 September 2024).
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
HAMAP stands for High-quality Automated and Manual Annotation of Proteins. HAMAP profiles are manually created by expert curators. They identify proteins that are part of well-conserved protein families or subfamilies. HAMAP is based at the SIB Swiss Institute of Bioinformatics, Geneva, Switzerland.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
SUPERFAMILY is a library of profile hidden Markov models that represent all proteins of known structure. The library is based on the SCOP classification of proteins: each model corresponds to a SCOP domain and aims to represent the entire SCOP superfamily that the domain belongs to. SUPERFAMILY is based at the University of Bristol, UK.