The Characterized Protein Database (CharProtDB) is designed and being developed as a resource of expertly curated, experimentally characterized proteins described in the published literature. Each protein record in CharProtDB can store several data types: functional annotation (several instances of protein names and gene symbols), taxonomic classification, literature links, specific Gene Ontology (GO) terms and GO evidence codes, EC (Enzyme Commission) and TC (Transport Classification) numbers, and protein sequence. Additionally, each protein record is cross-linked to all public accessions in major protein databases as "synonymous accessions". Each of these data types can be linked to as many literature references as needed. Every CharProtDB entry requires a minimum set of data types: protein name, GO terms, and supporting reference(s) associated with GO evidence codes. Annotating with the GO system is important for several reasons: the GO system captures defined concepts (the GO terms) with unique IDs, which can be attached to specific genes, and the three controlled vocabularies of the GO capture much more annotation information than is traditionally captured in protein common names, including, for example, not just the function of the protein but also its location. GO evidence codes implemented in CharProtDB correspond directly to the GO Consortium definitions of experimental codes. CharProtDB tools link characterization data from multiple input streams through synonymous accessions or direct sequence identity. CharProtDB can represent multiple characterizations of the same protein, with proper attribution and links to database sources. Users can search the database with a variety of terms, including protein name, gene symbol, EC number, organism name, accessions, or any text.
Following the search, a display page lists all the proteins that match the search term. Click on the protein name to view more detailed annotated information for each protein. Additionally, each protein record can be annotated.
Data set of manually annotated Chordata-specific proteins as well as those that are widely conserved. The program keeps existing human entries up to date and broadens manual annotation to other vertebrate species, especially model organisms, including great apes, cow, mouse, rat, chicken, zebrafish, as well as Xenopus laevis and Xenopus tropicalis. A draft of the complete human proteome is available in UniProtKB/Swiss-Prot, and one of the current priorities of the Chordata protein annotation program is to improve the quality of the human sequences provided. To this end, they are updating sequences that show discrepancies with those predicted from the genome sequence. Dubious isoforms, sequences based on experimental artifacts, and protein products derived from erroneous gene model predictions are also revisited. This work is done in part in collaboration with the Hinxton Sequence Forum (HSF), which allows active exchange between the UniProt, HAVANA, Ensembl and HGNC groups, as well as with the RefSeq database. UniProt is a member of the Consensus CDS project, and they are in the process of reviewing their records to support convergence towards a standard set of protein annotation. They also continuously update human entries with functional annotation, including novel structural, post-translational modification, interaction and enzymatic activity data. To identify candidates for re-annotation, they use, among others, information extraction tools such as the STRING database. In addition, they regularly add new sequence variants and maintain disease information. Indeed, this annotation program includes the Variation Annotation Program, the goal of which is to annotate all known human genetic diseases and disease-linked protein variants, as well as neutral polymorphisms.
The Universal Protein Resource (UniProt) is a comprehensive resource for protein sequence and annotation data. The UniProt databases are the UniProt Knowledgebase (UniProtKB), the UniProt Reference Clusters (UniRef), and the UniProt Archive (UniParc).
The GOA (Gene Ontology Annotation) project provides high-quality Gene Ontology (GO) annotations to proteins in the UniProt Knowledgebase (UniProtKB) and International Protein Index (IPI). This involves electronic annotation and the integration of high-quality manual GO annotation from all GO Consortium model organism groups and specialist groups.
THIS RESOURCE IS NO LONGER IN SERVICE, documented on 8/12/13. An expanded version of the Alternative Splicing Annotation Project (ASAP) database with a new interface and integration of comparative features using UCSC BLASTZ multiple alignments. It supports 9 vertebrate species, 4 insects, and nematodes, and provides extensive alternative splicing analysis of genes and their splice variants. For human alternative splicing data, newly added EST libraries were classified and included in the previous tissue and cancer classification, and lists of tissue- and cancer (vs. normal)-specific alternatively spliced genes were recalculated and updated. They have created novel orthologous exon and intron databases, with their splice variants, based on multiple alignments among several species. These orthologous exon and intron databases can give more comprehensive homologous gene information than protein-similarity-based methods. Furthermore, splice junction and exon identity among species can be valuable resources for elucidating species-specific genes. The ASAP II database can be easily integrated with pygr (unpublished, the Python Graph Database Framework for Bioinformatics) and its powerful features such as graph query and multi-genome alignment query. ASAP II can be searched by several criteria, such as gene symbol, gene name and ID (UniGene, GenBank, etc.). The web interface provides 7 kinds of views: (I) user query, UniGene annotation, orthologous genes and genome browsers; (II) genome alignment; (III) exons and orthologous exons; (IV) introns and orthologous introns; (V) alternative splicing; (VI) isoform and protein sequences; (VII) tissue and cancer vs. normal specificity. ASAP II shows genome alignments of isoforms, exons, and introns in a UCSC-like genome browser. All alternative splicing relationships with supporting evidence, types of alternative splicing patterns, and inclusion rates for skipped exons are listed in separate tables.
Users can also search human data for tissue- and cancer-specific splice forms at the bottom of the gene summary page. The p-values for tissue specificity are reported as log-odds (LOD) scores, and results with LOD >= 3 supported by at least 3 EST sequences are highlighted.
PROSITE consists of documentation entries describing protein domains, families and functional sites as well as associated patterns and profiles to identify them. PROSITE is complemented by ProRule, a collection of rules based on profiles and patterns, which increases the discriminatory power of profiles and patterns by providing additional information about functionally and/or structurally critical amino acids.
Highly reproducible interaction data in the Yeast Interacting Proteins Database with an "IST hit" (to be described in the table below) of 3 or more. Annotation (gene name and description) was updated from the SGD (Saccharomyces Genome Database; http://www.yeastgenome.org/, August 15, 2009). The data set contains 841 entries. The data are given as a CSV-format text file.
Assigning accurate property labels to proteins, like functional terms and catalytic activity, is challenging, especially for proteins without homologs and “tail labels” with few known examples. Unlike previous methods that mainly focused on protein sequence features, we use a pretrained large natural language model to understand the semantic meaning of protein labels. Specifically, we introduce FAPM, a contrastive multi-modal model that links natural language with protein sequence language. This model combines a pretrained protein sequence model with a pretrained large language model to generate labels, such as Gene Ontology (GO) functional terms and catalytic activity predictions, in natural language. Our results show that FAPM excels in understanding protein properties, outperforming models based solely on protein sequences or structures. It achieves state-of-the-art performance on public benchmarks and in-house experimentally annotated phage proteins, which often have few known homol...

# FAPM: Functional annotation of proteins using multi-modal models beyond structural modeling
https://doi.org/10.5061/dryad.m905qfv9p
The online demo is at: https://huggingface.co/spaces/wenkai/FAPM_demo
The dataset includes:
Gene Ontology (GO) information. GO is a system for describing the functions of proteins.
-The basic version of the GO (file name: go1.4-basic.obo). Source: https://geneontology.org/docs/download-ontology/
-The mapping between GO numbers and GO descriptions (file name: go_descriptions1.4.txt)
-GO terms (file names: bp_terms.pkl; mf_terms.pkl; cc_terms.pkl)
Manually annotated data derived from the UniProt database. These datasets are used to fine-tune the model.
-File names:
train_exp_prompt_bp.csv; train_exp_prompt_mf.csv; train_exp_prompt_cc.cs...
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Proteins from multiple databases were used in algal genome annotation. We first downloaded the UniProt protein data of green algae (Chlorophyta) and red algae (Rhodophyta) for gene prediction on 10 higher-quality assemblies. We refer to these protein sequences, used for gene prediction on the RefSeq genomes, as seed_algae_mix; their role is to complete gene prediction of the higher-quality genomes quickly and accurately. The predictions obtained for the RefSeq genomes will be used as query sequences against the NR (RefSeq non-redundant proteins) database, and the matched target sequences are then extracted as the NR_extract part. Subsequently, we retrieved the UniProt protein sequences of all algal lineages. Note that we did not select only the protein sequences of the 17 lineages to be annotated, but those of all algae from the 21 lineages found in the NCBI Taxonomy: a total of 2,432,633 (1.17 GB) algal protein sequences, named total_algae_mix, for future predictions. The plants part of OrthoDB v10.1 and BUSCO proteins of 7 lineages were also merged into pr_total_mix.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Prediction Coverage on 18,736 Proteins from 100 Pfam Families
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This is the list of all functional annotations for the 797 nsLTP sequences used in our work "Comprehensive classification of the plant non-specific lipid transfer protein superfamily towards its Sequence – Structure – Function analysis". It may be useful for any study of sequence-structure-function relationships (this is the largest dataset used in a phylogeny study to date). The information is provided as a csv and an xlsx file organized as follows:
Column A: the database ID used for our study.
Column B: the nsLTP type (if available and as stated by the initial authors) together with the 5-letter organism code.
Column C: LTP name.
Column D: UniProt page, if it exists (URL).
Column E: protein sequence (without propeptide).
Columns G to L: the annotations, given as the ontology (e.g. GO for Gene Ontology), the ontology term with a link to its description, the level of evidence (see guide: http://geneontology.org/docs/guide-go-evidence-codes/), additional information, and a literature reference (author + article DOI).
If a second annotation is available, the same kind of information is added in new columns on the right.
NB: Sequences number 526, 600 and 693 were removed during the study (redundancy).
Functional annotation of yellow horn unigenes in public protein databases.
Subcellular localization of proteins from low-throughput or high-throughput protein localization assays
Version 3 (22 November, 2021)

See https://doi.org/10.24072/pcjournal.173 for a detailed description of the database. See http://evocellbio.com/eukprot/ for a BLAST database, interactive plots of BUSCO scores and ‘The Comparative Set’ (TCS): a selected subset of EukProt for comparative genomics investigations. Protein sequence FASTA files of the TCS are available at https://doi.org/10.6084/m9.figshare.21586065. See https://github.com/beaplab/EukProt for utility scripts, annotations, and all the files necessary to build the tree in Figures 1 and 3 (from the DOI above). Scroll to the end of this page for changes since version 2. Are we missing anything? Please let us know!

EukProt is a database of published and publicly available predicted protein sets selected to represent the breadth of eukaryotic diversity, currently including 993 species from all major supergroups as well as orphan taxa. The goal of the database is to provide a single, convenient resource for gene-based research across the spectrum of eukaryotic life, such as phylogenomics and gene family evolution. Each species is placed within the UniEuk taxonomic framework in order to facilitate downstream analyses, and each data set is associated with a unique, persistent identifier to facilitate comparison and replication among analyses. The database is regularly updated, and all versions will be permanently stored and made available via FigShare. The current version has a number of updates, notably ‘The Comparative Set’ (TCS), a reduced taxonomic set with high estimated completeness while maintaining a substantial phylogenetic breadth, which comprises 196 predicted proteomes. A BLAST web server and graphical displays of data set completeness are available at http://evocellbio.com/eukprot/.
We invite the community to provide suggestions for new data sets and new annotation features to be included in subsequent versions, with the goal of building a collaborative resource that will promote research to understand eukaryotic diversity and diversification.

This release contains 5 files:

EukProt_proteins.v03.2021_11_22.tgz: 993 protein data sets, for species with either a genome (375) or single-cell genome (56), a transcriptome (498), a single-cell transcriptome (47), or an EST assembly (17).

EukProt_genome_annotations.v03.2021_11_22.tgz: gene annotations, in GFF format, as produced by EukMetaSanity (https://github.com/cjneely10/EukMetaSanity) for 40 genomes lacking publicly available protein annotations. The proteins predicted from these annotations are included in the proteins file.

EukProt_included_data_sets.v03.2021_11_22.txt and EukProt_not_included_data_sets.v03.2021_11_22.txt: tables of information on data sets either included (993 data sets) or not included (163) in the database. Tab-delimited; multiple entries in the same cell are comma-delimited; missing data is represented with the “N/A” value. With the following columns:
- EukProt_ID: the unique identifier associated with the data set. This will not change among versions. If a new data set becomes available for the species, it will be assigned a new unique identifier.
- Name_to_Use: the name of the species for protein/genome annotation/assembled transcriptome files.
- Strain: the strain(s) of the species sequenced.
- Previous_Names: any previous names that this species was known by.
- Replaces_EukProt_ID/Replaced_by_EukProt_ID: if the data set changes with respect to an earlier version, the EukProt ID of the data set that it replaces (in the included table) or that it is replaced by (in the not_included table).
- Genus_UniEuk, Epithet_UniEuk, Supergroup_UniEuk, Taxogroup1_UniEuk, Taxogroup2_UniEuk: taxonomic identifiers at different levels of the UniEuk taxonomy (Berney et al. 2017, DOI: 10.1111/jeu.12414, based on Adl et al. 2019, DOI: 10.1111/jeu.12691).
- Taxonomy_UniEuk: the full lineage of the species in the UniEuk taxonomy (semicolon-delimited).
- Merged_Strains: whether multiple strains of the same species were merged to create the data set.
- Data_Source_URL: the URL(s) from which the data were downloaded.
- Data_Source_Name: the name of the data set (as assigned by the data source).
- Paper_DOI: the DOI(s) of the paper(s) that published the data set.
- Actions_Prior_to_Use: the action(s) that were taken to process the publicly available files in order to produce the data set in this database. Actions taken (see our manuscript for more details):
  - ‘assemble mRNA’: Trinity v. 2.8.4, http://trinityrnaseq.github.io/
  - ‘CD-HIT’: v. 4.6, http://weizhongli-lab.org/cd-hit/
  - ‘extractfeat’, ‘seqret’, ‘transeq’, ‘trimseq’: from EMBOSS package v. 6.6.0.0, http://emboss.sourceforge.net/
  - ‘translate mRNA’: Transdecoder v. 5.3.0, http://transdecoder.github.io/
  - ‘gffread’: v.0.12.3 https://github.com/gpertea/gffread
  - ‘predict genes’: EukMetaSanity https://github.com/cjneely10/EukMetaSanity (cloned on 21 September, 2021)
  All parameter values were default, unless otherwise specified.
- Data_Source_Type: the type of the source data (possible types: EST, transcriptome, single-cell transcriptome, genome, single-cell genome).
- Notes: additional information on the data set (including why it is replaced by/is replacing another data set, or why it was not included).
- Columns_Modified_Since_Previous_Version: column(s) in this file modified for the data set since the previous release. Not listed: modifications to the Notes column or to new columns added in this version.
- Alternative_Strain_Names: non-exhaustive list of alternative names for the sequenced strain for this data set.
- 18S_Sequence_GenBank_ID: GenBank identifier for the strain sequenced in the data set. When multiple strains were sequenced, identifiers are separated with a comma, in the same order as the Strain column. Ranges of identifiers for the same strain are separated by a hyphen. ‘N/A’ indicates either that there is no GenBank sequence for the strain or that all available sequences are not full-length (< 1,500 bp).
- 18S_Sequence: 18S for the strain derived from publicly available sequences associated with the data set, in the case where a GenBank sequence is not available.
- 18S_Sequence_Source: the source for the sequence in the 18S_Sequence column, if any.
- 18S_Sequence_Other_Strain_GenBank_ID: GenBank identifier for 18S sequence(s) from other strains of the same species as the data set.
- 18S_Sequence_Other_Strain_Name: strain name(s) for the sequences in the 18S_Sequence_Other_Strain_GenBank_ID column.
- 18S_and_Taxonomy_Notes: additional information on the values in the 18S_Sequence columns.

Changes since version 2:
- There are 324 new data sets included. 57 of these replace data sets from version 2.
- 40 newly published data sets were added to the list that are not included in the database (annotated in the Notes column with the reasons they were not included).
- Instead of unannotated genomes (for published genomes lacking protein predictions), we now include predicted proteins and gene annotations (in GFF3 format).
- All sequences within each file are now assigned a standardized, unique identifier based on the data set’s EukProt_ID and on the type of data (protein or transcriptome). Illegal characters are removed from sequences.
- In the UniEuk_Taxonomy field, single quotes are now used instead of double quotes, to be consistent with other UniEuk databases (EukMap, EukRibo).
- Changes to metadata of individual data sets (in the included and not_included tables) with respect to the previous version are now listed in the Columns_Modified_Since_Previous_Version column.
- The Taxogroup_UniEuk column has been split into the Taxogroup1_UniEuk and Taxogroup2_UniEuk columns. This resulted in the Supergroup_UniEuk column changing for Opisthokonta.
- In addition, the following new columns have been added (see our manuscript for details): Alternative_Strain_Names, 18S_Sequence_GenBank_ID, 18S_Sequence, 18S_Sequence_Source, 18S_Sequence_Other_Strain_GenBank_ID, 18S_Sequence_Other_Strain_Name, 18S_and_Taxonomy_Notes.

EukProt_assembled_transcriptomes.v03.2021_11_22.tgz: assembled transcriptome contigs, for 126 species with publicly available mRNA sequence reads but no publicly available assembly. The proteins predicted from these assemblies are included in the proteins file.

Sequence names in the proteins and transcriptomes files have standardized, unique identifiers with the following format:
>[EukProt ID]_[Name_to_Use]_[Type abbreviation][Counter] [Previous header contents]
Type abbreviations are P (protein) and T (transcriptome).
All characters not in the following list are removed from nucleic acid sequences: ACGTNUKSYMWRBDHV
All characters not in the following list are removed from protein sequences: ABCDEFGHIKLMNPQRSTUVWYZX*
Lists of legal characters are from: https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=BlastHelp
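The header format and legal-character lists described above can be illustrated with a minimal Python sketch. It is not the EukProt authors' code: the function names, the zero-padded counter width, and the example values are hypothetical, and the character lists are applied literally (so lowercase letters are dropped, since the description does not say how case is handled).

```python
# Legal character lists quoted in the EukProt description.
NUCLEOTIDE_CHARS = "ACGTNUKSYMWRBDHV"
PROTEIN_CHARS = "ABCDEFGHIKLMNPQRSTUVWYZX*"


def strip_illegal(sequence: str, legal: str) -> str:
    """Return `sequence` with every character not in `legal` removed."""
    allowed = set(legal)
    return "".join(ch for ch in sequence if ch in allowed)


def make_header(eukprot_id: str, name_to_use: str, type_abbrev: str,
                counter: int, previous_header: str, width: int = 6) -> str:
    """Build a FASTA header of the form
    >[EukProt ID]_[Name_to_Use]_[Type abbreviation][Counter] [Previous header contents]

    `width` (zero-padding of the counter) is an assumption made for this
    sketch; the description does not state how the counter is formatted.
    """
    if type_abbrev not in ("P", "T"):
        raise ValueError("type abbreviation must be 'P' (protein) or 'T' (transcriptome)")
    return f">{eukprot_id}_{name_to_use}_{type_abbrev}{counter:0{width}d} {previous_header}"


if __name__ == "__main__":
    # Hypothetical identifier and header contents, for illustration only.
    print(make_header("EP00001", "Genus_species", "P", 1, "original header text"))
    print(strip_illegal("MKV-L?X*", PROTEIN_CHARS))
```

Keeping the character lists as module-level constants makes it easy to compare them against the BLAST help page if the lists change.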
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background: Protein annotation is a major goal in molecular biology, yet experimentally determined knowledge is typically limited to a few model organisms. In non-model species, the sequence-based prediction of gene orthology can be used to infer protein identity, however this approach loses predictive power at longer evolutionary distances. Here we propose a workflow for protein annotation using structural similarity, exploiting the fact that similar protein structures often reflect homology and are more conserved than protein sequences.
Results: We propose a workflow of openly available tools for the functional annotation of proteins via structural similarity (MorF: MorphologFinder) and use it to annotate the complete proteome of a sponge. Sponges are highly relevant for inferring the early history of animals, yet their proteomes remain sparsely annotated. MorF accurately predicts the functions of proteins with known homology in >90% of cases, and annotates an additional 50% of the proteome beyond standard sequence-based methods. We uncover new functions for sponge cell types, including extensive FGF, TGF and Ephrin signalling in sponge epithelia, and redox metabolism and control in myopeptidocytes. Notably, we also annotate genes specific to the enigmatic sponge mesocytes, proposing they function to digest cell walls.
Conclusions: Our work demonstrates that structural similarity is a powerful approach that complements and extends sequence similarity searches to identify homologous proteins over long evolutionary distances. We anticipate this to be a powerful approach that boosts discovery in numerous -omics datasets, especially for non-model organisms.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This database contains protein sequences of aquatic microbial eukaryotes, or protists. Its purpose is to provide a database of reasonable quality that serves as a resource for both taxonomic and functional interpretation of metagenomic and metatranscriptomic studies of protists. The sequences come mainly from the Marine Microbial Eukaryotes Transcriptome Sequencing Project (MMETSP), supplemented with various genomes and transcriptomes of organisms that were not part of MMETSP.
To use this database, one has to understand the main function of the three files here.
(1) The protein sequences are stored in the .faa file. You can build an alignment/search database out of that and search your meta-omics sequences against it. Each sequence in the FASTA file has an ID which always consists of two parts like this: "MMETSP0004_1234567". The text before the first underscore is the source ID of that sequence.
(2) Taxonomy information for each source ID is stored in "EukZoo_taxonomy_table_v_0.2.tsv". One can use this information in conjunction with database search results to assign taxonomy to sequences.
(3) The KEGG annotation of each sequence is stored in "EukZoo_KEGG_annotation_v_0.2.tsv". One can use this information in conjunction with database search results to assign KEGG functional annotation (KO ID) to sequences.
I also provide scripts to assign taxonomy and KEGG annotation from database search results. You can also find the scripts and explanations on how to use them on the EukZoo GitHub page. You will find details on how the database was created and curated on there as well.
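Independently of those scripts, the ID convention described in (1) and the table lookup described in (2) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the column names `Source_ID` and `Taxonomy` are hypothetical (the actual header of the .tsv file is not given here), and the demo table is placeholder data, not content from EukZoo.

```python
import csv
import io


def source_id(sequence_id: str) -> str:
    """Return the text before the first underscore of a sequence ID."""
    return sequence_id.split("_", 1)[0]


def load_table(handle, key_column: str) -> dict:
    """Read a tab-separated table with a header row into a dict keyed by `key_column`."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row[key_column]: row for row in reader}


if __name__ == "__main__":
    # Placeholder table standing in for the taxonomy .tsv file;
    # the column names and the lineage value are made up for this sketch.
    demo = io.StringIO("Source_ID\tTaxonomy\nMMETSP0004\tplaceholder lineage\n")
    taxonomy = load_table(demo, "Source_ID")

    best_hit = "MMETSP0004_1234567"  # ID format quoted in the description
    print(source_id(best_hit), "->", taxonomy[source_id(best_hit)]["Taxonomy"])
```

The same `load_table` helper could be keyed on the full sequence ID for the KEGG annotation file, again assuming that file has a header row.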
Please contact me at zhenfeng.liu1@gmail.com if you have any questions or requests. Thank you for your interest in EukZoo.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset presents the Bradysia coprophila genome annotations Bcop_v1.0. It will be used as a starting point to manually improve annotations.
The annotations were generated using Maker2. Highly detailed bioinformatic methods information can be found in the supplemental material of our preprint titled, "Single-molecule sequencing of long DNA molecules allows high contiguity de novo genome assembly for the fungus fly, Sciara coprophila" (doi: https://doi.org/10.1101/2020.02.24.963009 ). See the Table of Contents therein. A far briefer description is below. Note that Sciara coprophila is synonymous with Bradysia coprophila, and was used in the title of our publication for historical reasons.
Repeat library used for masking: species-specific repeat libraries were built using RepeatModeler. A more comprehensive repeat library was created by adding previously-known repeat sequences from Bradysia coprophila and all Arthropod repeats in the RepeatMasker Combined Database: Dfam_Consensus-20181026, RepBase-20181026. The comprehensive repeat library was used with RepeatMasker as part of the Maker2 pipeline.
Automated gene finding: To predict protein-coding genes, Maker2 was used to take advantage of 3 sources of evidence: RNA-seq expression evidence, homology, and gene prediction. RNA-seq data from both male and female embryos, larvae, pupae, and adults were combined to create transcriptome assemblies using Trinity (de novo) and HiSat2 followed by StringTie (genome-guided). The transcriptome assemblies were used as EST evidence in Maker2. Transcript and protein sequences from related species were used for homology evidence. Three gene predictors were used: Augustus, SNAP, GeneMark-ES. See the supplemental materials in our preprint for more information on iterative Maker2 rounds, training each gene predictor, RNA-seq methods, and transcriptome assembly generation. The Maker2 gene annotations of the final round were evaluated using annotation edit distances, BUSCO, RSEM-Eval, and TransRate.
Functional information: InterProScan was used to identify Pfam domains and GO terms from predicted protein sequences, and BLASTp was used to find best matches to curated proteins in the UniProtKB/Swiss-Prot database.
Resources in this dataset:
Resource Title: Bradysia coprophila genome annotations Bcop_v1.0.
File Name: bradysia_coprophila.bcop_v1.0.tar.gz
Resource Description:
Primary file:
- Bradysia_coprophila.Bcop_v1.0_gene_set.gff
  - Contains automated annotations from Maker2 (described in https://doi.org/10.1101/2020.02.24.963009).
  - This is the main file in this tar archive.
  - The reference genome fasta is available from GenBank: https://www.ncbi.nlm.nih.gov/assembly/GCA_014529535.1.
  - The Seqid in Column 1 of this gff3 file corresponds to the 'Sequence name' in the GenBank assembly report: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/014/529/535/GCA_014529535.1_BU_Bcop_v1/GCA_014529535.1_BU_Bcop_v1_assembly_report.txt
Supplementary files:
- Bradysia_coprophila.Bcop_v1.0_evidence.rnd3.gff - Contains aligned evidence Maker2 used.
- Bradysia_coprophila.Bcop_v1.0_masked_genome.rnd3.gff - Contains coordinates for masked regions of the genome as seen by Maker2.
- Bradysia_coprophila.Bcop_v1.0_proteins_with_putative_function.fasta - Contains predicted protein sequences
- Bradysia_coprophila.Bcop_v1.0_transcripts_with_putative_function.fasta - Contains predicted transcript sequences
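Because the Seqid column of the gff3 file uses the assembly report's sequence names rather than GenBank accessions, a small translation step may be needed before combining the annotations with the GenBank fasta. The following Python sketch shows one way to do that. It assumes the usual NCBI assembly report layout (tab-separated, `#` comment lines, sequence name in the first column and GenBank accession in the fifth); the function names and the demo rows are hypothetical, so check the column positions against the actual report before use.

```python
def load_name_to_accession(report_lines, name_col=0, accession_col=4):
    """Map sequence names to GenBank accessions from assembly-report lines.

    Column positions are assumptions based on the usual NCBI report layout.
    """
    mapping = {}
    for line in report_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header/comment and blank lines
        fields = line.rstrip("\n").split("\t")
        mapping[fields[name_col]] = fields[accession_col]
    return mapping


def rename_seqids(gff_lines, mapping):
    """Yield GFF lines with column 1 replaced by the mapped accession.

    Directive/comment lines (starting with '#') are passed through unchanged,
    as are seqids that are missing from the mapping.
    """
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            yield line
            continue
        fields = line.split("\t")
        fields[0] = mapping.get(fields[0], fields[0])
        yield "\t".join(fields)


if __name__ == "__main__":
    # Made-up rows for illustration; real names and accessions come from the files.
    report = ["# Sequence-Name\tSequence-Role\n",
              "contig_1\tunplaced-scaffold\tna\tna\tACC0001.1\t=\tna\tPrimary Assembly\t1000\tna\n"]
    gff = ["##gff-version 3\n",
           "contig_1\tmaker\tgene\t1\t100\t.\t+\t.\tID=g1\n"]
    for out_line in rename_seqids(gff, load_name_to_accession(report)):
        print(out_line, end="")
```

Note that this sketch leaves `##sequence-region` directives untouched, so those would still carry the original sequence names.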
Database of annotations of functional units in proteins including multiple sequence alignment models for ancient domains and full-length proteins. This collection of models includes 3D structures that display the sequence/structure/function relationships in proteins. It also includes alignments of the domains to known three-dimensional protein structures in the MMDB database. The source databases are Pfam, Smart, and COG. Users can identify amino acids in protein sequences with the resources available as well as view single sequences embedded within multiple sequence alignments.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Understanding protein sequence-function relationships is essential for advancing protein biology and engineering. However, fewer than 1% of known protein sequences have human-verified functions, and scientists continually update the set of possible functions. While deep learning methods have demonstrated promise for protein function prediction, current models are limited to predicting only those functions on which they were trained. Here, we introduce ProtNote, a multimodal deep learning model that leverages free-form text to enable both supervised and zero-shot protein function prediction. ProtNote not only maintains near state-of-the-art performance for annotations in its train set, but also generalizes to unseen and novel functions in zero-shot test settings. We envision that ProtNote will enhance protein function discovery by enabling scientists to use free text inputs, without restriction to predefined labels – a necessary capability for navigating the dynamic landscape of protein biology.
https://spdx.org/licenses/CC0-1.0.html
Novel phenotypes are increasingly recognized to have evolved by co-option of conserved genes into new developmental contexts, yet the process by which co-opted genes modify existing developmental programs remains obscure. Here we provide insight into this process by characterizing the role of co-opted doublesex in butterfly wing color pattern development. dsx is the master regulator of insect sex differentiation but was co-opted to control the switch between discrete non-mimetic and mimetic patterns in Papilio alphenor and its relatives. We found dynamic spatial and temporal expression pattern differences between mimetic and non-mimetic butterflies throughout wing development. A mimetic color pattern program is switched on by a pulse of dsx expression in early pupal development that causes acute and long-term differential gene expression, particularly in Wnt and Hedgehog signaling pathways. RNAi suggested opposing, novel roles for these pathways in mimetic pattern development. Importantly, Dsx co-option caused Engrailed, a key transcription factor target of Hedgehog signaling, to gain a novel expression domain early in pupal wing development that is propagated through mid-pupal development to specify novel mimetic patterns despite becoming decoupled from Dsx expression itself. Altogether, our findings provide multiple views into how co-opted genes can both cause and elicit changes to conserved networks and pathways to result in development of novel, adaptive phenotypes.
This dataset contains the genome assembly and annotation described in the associated manuscript. Sequencing data, including all RNA-seq data, are available from NCBI under BioProject PRJNA882073.
Methods: Please see our publication for further details. The published Papilio polytes reference genome (RefSeq GCF_000836215.1) was generated using P. polytes polytes from Japan, while the RNA-seq data generated in this study are from P. alphenor from the Philippines.
These two groups diverged ~1.7 million years ago and have ~5.1% nucleotide divergence, resulting in low mapping rates of our alphenor RNA-seq data to the published reference. The alphenor genome assembly in this dataset is based on the P. polytes polytes assembly and a recent chromosome-level P. bianor assembly. We assembled a draft alphenor genome using PE100 sequencing from 29 individuals (BioProject PRJNA234541, excluding SRR1118138) and Platanus v2.0.2. Before assembly, we trimmed raw reads using TrimGalore!, then removed reads containing over-represented sequences using FastQC and an available Python script. We then assembled all processed reads using the default Platanus 2 pipeline. Next, we assigned raw alphenor scaffolds to the RefSeq polytes assembly using RagTag v1.0.1, then assigned these alphenor pseudoscaffolds, together with all other alphenor scaffolds that hit insect sequences in the NCBI nr database, to P. bianor chromosomes, again using RagTag v1.0.1.

Finally, we assembled alphenor doublesex alleles separately and substituted them into the final chromosome-level assembly. These additional steps were necessary because the individuals used for the full Platanus assembly were a mix of homo- and heterozygotes at the dsx locus. While the majority of the genome is homogeneous among these samples, the dsx alleles are very divergent except for the dsx coding region. We assembled all non-mimetic female samples from BioProject PRJNA234541 using Platanus 2 and assigned scaffolds to the polytes assembly using RagTag. We then extracted the polytes region corresponding to the dsx inversion, defined by Nishikawa et al. (2015) as H_locus_nonmimetic_H_scaffold:1931762-2054949, and used it to replace the corresponding region in the chromosome-level alphenor assembly.
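The region substitution described above can be illustrated with a minimal Python sketch. It assumes the region string uses 1-based, inclusive coordinates (the source does not state the convention), and the helper names `parse_region` and `substitute_region` as well as the toy sequences are hypothetical, not taken from the authors' pipeline.

```python
import re
from typing import Tuple


def parse_region(region: str) -> Tuple[str, int, int]:
    """Split a 'name:start-end' string into (name, start, end)."""
    match = re.fullmatch(r"(.+):(\d+)-(\d+)", region)
    if match is None:
        raise ValueError(f"not a region string: {region!r}")
    name, start, end = match.group(1), int(match.group(2)), int(match.group(3))
    if start < 1 or end < start:
        raise ValueError(f"invalid coordinates in {region!r}")
    return name, start, end


def substitute_region(target: str, start: int, end: int, replacement: str) -> str:
    """Replace target[start..end] (assumed 1-based, inclusive) with replacement."""
    if not 1 <= start <= end <= len(target):
        raise ValueError("region lies outside the target sequence")
    return target[:start - 1] + replacement + target[end:]


# Region string quoted in the text; its length assumes inclusive coordinates.
name, start, end = parse_region("H_locus_nonmimetic_H_scaffold:1931762-2054949")
span = end - start + 1

# Toy example with invented sequences: swap positions 5-8 for a new allele.
toy = substitute_region("AAAACCCCGGGG", 5, 8, "TT")
```

Under that coordinate assumption the quoted region spans 123,188 bp. The replacement need not have the same length as the region it replaces, so downstream coordinates shift after substitution.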
Similarly, we assembled the mimetic dsx allele by assembling all mimetic female samples from BioProject PRJNA234541 using Platanus 2, assigning those scaffolds to the RefSeq assembly, and extracting the H_locus_mimetic_hetero scaffold. We added this scaffold as chrH in the final alphenor assembly.

We assembled the alphenor mitochondrial genome with NOVOplasty v4.2, using sequencing data from SRR1108726 and the RefSeq mtDNA assembly for polytes (NC_024742.1) as the seed sequence. This resulted in a single circularized sequence of 15,247 bp.

We annotated the alphenor genome using EvidenceModeler (EVM) 1.1.1. We first assembled a high-quality transcript database using PASA, our SE50 data, and PE100 and SE50 data from Nallu et al. (2018). After adapter trimming, we performed de novo and genome-guided assembly using Trinity v2.10.0 and genome-guided assembly using StringTie v1.3.3. RNA-seq data were also mapped to alphenor chromosomes using STAR 2.6.1d, and the resulting alignments were used to generate genome-guided assemblies with Trinity and StringTie 1.3.1. We combined de novo and genome-guided assemblies using PASA 2.4.1. Evidence for protein-coding regions came from mapping the UniProt/Swiss-Prot (2020_06) database and all Papilionoidea proteins available in NCBI’s GenBank nr protein database (downloaded June 2020) using exonerate. We identified high-quality multi-exon protein-coding PASA transcripts using TransDecoder (transdecoder.github.io), then used these models to train and run GeneMark-ET 4 and GlimmerHMM 3.0.4. We also predicted gene models using Augustus 3.3.2, the supplied heliconius_melpomene1 parameter set, and hints derived from the RNA-seq and protein mapping above. Augustus predictions with >90% of their length covered by hints were considered high-quality models. Transcript, protein, and ab initio data were integrated using EVM. Raw EVM models were then updated twice using PASA to add UTRs and identify alternative transcripts.
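The ">90% of their length covered by hints" criterion can be sketched in Python as follows. This is an illustration only: it assumes that "length" means the summed exon length of a prediction and that intervals are 1-based and inclusive, neither of which is stated in the source. The function names are hypothetical.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[int, int]  # (start, end), assumed 1-based and inclusive


def merge_intervals(intervals: Iterable[Interval]) -> List[Interval]:
    """Merge overlapping or directly adjacent intervals."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def covered_fraction(exons: Iterable[Interval], hints: Iterable[Interval]) -> float:
    """Fraction of exon bases that overlap at least one hint interval."""
    exon_list = merge_intervals(exons)
    hint_list = merge_intervals(hints)
    total = sum(end - start + 1 for start, end in exon_list)
    covered = 0
    # Both lists are disjoint after merging, so no base is counted twice.
    for e_start, e_end in exon_list:
        for h_start, h_end in hint_list:
            lo, hi = max(e_start, h_start), min(e_end, h_end)
            if lo <= hi:
                covered += hi - lo + 1
    return covered / total if total else 0.0


def is_high_quality(exons: Iterable[Interval], hints: Iterable[Interval],
                    threshold: float = 0.90) -> bool:
    """Apply the strict '>90%' cut-off quoted in the text."""
    return covered_fraction(exons, hints) > threshold


exons = [(1, 100), (201, 300)]
well_supported = is_high_quality(exons, [(1, 100), (201, 285)])  # 185 of 200 bases
borderline = is_high_quality(exons, [(1, 100), (201, 280)])      # exactly 180 of 200
```

Because the comparison is strict, a prediction with exactly 90% coverage is not kept in this sketch; whether the original filter treated that boundary the same way is not specified.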
Gene models derived from transposable element proteins were identified using BLASTp and removed from the annotation set. The final annotation comprises 17,342 genes encoding 26,991 protein-coding transcripts. According to BUSCO v3 with OrthoDB v9, it contains 95% and is missing 3% of Endopterygota single-copy orthologs. We functionally annotated protein models using eggNOG’s emapper-2.0.1b utility and the v2.0 eggNOG database. Transcript and protein sequences were extracted from this final annotation file using Cufflinks' gffread utility.
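The transposable-element filtering step could look roughly like the Python sketch below. It assumes BLASTp results in the standard 12-column tabular layout (query ID in column 1, e-value in column 11) and an e-value cut-off of 1e-10; the source specifies neither the output format nor any threshold, so both are illustrative assumptions, as are the function names and the toy gene IDs.

```python
from typing import Iterable, List, Set


def te_hit_queries(blast_lines: Iterable[str], max_evalue: float = 1e-10) -> Set[str]:
    """Collect query IDs with at least one hit at or below the e-value cut-off.

    Assumes tab-separated BLAST output with the standard 12 columns.
    """
    hits: Set[str] = set()
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        if len(fields) < 12:
            raise ValueError(f"expected 12 tab-separated columns: {line!r}")
        query, evalue = fields[0], float(fields[10])
        if evalue <= max_evalue:
            hits.add(query)
    return hits


def drop_te_models(gene_ids: Iterable[str], te_ids: Set[str]) -> List[str]:
    """Return gene IDs that had no qualifying hit to a TE protein."""
    return [gene for gene in gene_ids if gene not in te_ids]


# Invented example: geneA has a strong TE hit, geneB only a weak one.
example_hits = [
    "geneA\tTE_protein_1\t85.0\t200\t30\t0\t1\t200\t1\t200\t1e-50\t300",
    "geneB\tTE_protein_2\t30.0\t50\t35\t0\t1\t50\t1\t50\t0.5\t20",
]
te_ids = te_hit_queries(example_hits)
kept = drop_te_models(["geneA", "geneB", "geneC"], te_ids)
```

In this toy run only geneA is removed. A real filter would likely also consider alignment length or coverage, which is omitted here for brevity.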