U.S. Government Works: https://www.usa.gov/government-works
License information was derived automatically
CottonGen offers BLAST with genome, transcriptome, peptide and marker sequence databases from Gossypium species. Searches can be run with nucleotide or peptide query sequences, and the functionality is similar to that of NCBI BLAST. Enter or upload FASTA sequence(s) to query and select a BLAST database.
BLAST Programs:
blastn: Search a nucleotide database using a nucleotide query.
blastx: Search a protein database using a translated nucleotide query.
tblastn: Search a translated nucleotide database using a protein query.
blastp: Search a protein database using a protein query.
Resources in this dataset: Resource Title: Website Pointer for CottonGen BLAST Search. File Name: Web Page, url: https://www.cottongen.org/blast
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
blastn searches were performed against the nucleotide (nt) sequence database. At any time instance, the Past database size is the size of the database at the previous time instance, the Present database size is its size at the current time instance, and Delta is the incremental growth of the database between the two. NCBI BLAST must search the entire Present database, whereas iBLAST only needs to search Delta.
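The relationship between the three sizes can be illustrated with a minimal sketch. It assumes the standard Karlin-Altschul behaviour in which, for a fixed alignment score, the E-value grows in proportion to the size of the database searched, and it ignores effective-length corrections. The function names are illustrative and are not taken from iBLAST.

```python
def delta_size(past_size: int, present_size: int) -> int:
    """Incremental growth of the database between two time instances."""
    if present_size < past_size:
        raise ValueError("the Present database cannot be smaller than the Past one")
    return present_size - past_size


def rescale_evalue(evalue: float, searched_size: int, present_size: int) -> float:
    """Rescale an E-value obtained on a smaller database to the Present size.

    Assumption: E is proportional to database size for a fixed score,
    so the correction is a simple ratio of sizes.
    """
    if searched_size <= 0:
        raise ValueError("searched_size must be positive")
    return evalue * present_size / searched_size


# Made-up sizes (in residues): only Delta needs to be searched again.
past, present = 900, 1000
delta = delta_size(past, present)
hit_on_delta = rescale_evalue(1e-6, searched_size=delta, present_size=present)
```

Hits found earlier on the Past database would be rescaled the same way, with `searched_size=past`, before being merged with the hits from Delta.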
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Custom BLAST database for Hybridisation Chain Reaction (HCR) probe designer.
Database includes: cDNA + intron + ncRNA sequences.
Please see: https://github.com/jefflee1103/HCRv3_probe_design
Each category may be used separately or together to restrict a BLAST analysis by TaxID. Phage common names and NCBI accession numbers are included in the custom canonical phages database, which is also pre-compiled as a BLAST database. Bolded rows contain the well-studied representatives of each morphotype. (XLSX)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This record provides a pre-built BLAST nucleotide database archive for the ESKAPEE bacterial pathogens used by rMAP-2.0. The archive contains a complete makeblastdb output directory (index and auxiliary files) so users can run blastn queries without rebuilding the database locally.
Contents
eskapee_db.tar.gz — compressed tar archive containing the BLAST database files (e.g., .nsq, .nin, .nhr, and related files) under eskapee_db/
eskapee_db.tar.gz.sha256 — SHA-256 checksum for integrity verification
How the database was built
The database was generated from a curated FASTA file of ESKAPEE bacterial sequences using NCBI BLAST+:
makeblastdb \
-in eskapee_db.fasta \
-dbtype nucl \
-parse_seqids \
-max_file_sz 3000000000 \
-out eskapee_db/eskapee_db
Download and unpack
# download from Zenodo and verify checksum
sha256sum -c eskapee_db.tar.gz.sha256
# unpack
tar -xzvf eskapee_db.tar.gz
Example usage
After unpacking, the database prefix is:
eskapee_db/eskapee_db
Example blastn query:
blastn -query query.fasta -db eskapee_db/eskapee_db -outfmt 6 -max_target_seqs 10 -evalue 1e-10 > blast_hits.tsv
Intended use
This database archive is intended to support reproducible, rapid local BLAST-based screening within the rMAP-2.0 workflow and related microbial genomics analyses, especially in settings where rebuilding large databases is time-consuming or bandwidth-limited.
Versioning
This Zenodo record corresponds to version of the ESKAPEE BLAST database used in rMAP-2.0. Updated databases will be released as new versions on Zenodo.
Project repository
The rMAP-2.0 code and documentation are available on GitHub: [GitHub repo link / rMAP-2.0] (add as a “Related identifier” in Zenodo).
Checksum
SHA-256 checksum is provided in eskapee_db.tar.gz.sha256 and should be used to validate file integrity after download.
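Where `sha256sum` is not available, the same integrity check can be sketched in Python. This is a minimal sketch that assumes the `.sha256` file uses the usual `sha256sum` layout (hex digest first, then the file name); the function names are illustrative.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(archive, checksum_file):
    """Compare the archive digest with the first field of the checksum file.

    Assumption: the checksum file starts with the hex digest, as written
    by `sha256sum`.
    """
    expected = Path(checksum_file).read_text().split()[0].lower()
    return sha256_of(archive) == expected


# Usage (paths as named in this record):
#   verify_archive("eskapee_db.tar.gz", "eskapee_db.tar.gz.sha256")
```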
U.S. Government Works: https://www.usa.gov/government-works
License information was derived automatically
The CottonGen CottonCyc Pathways Database, part of CottonGen, supports searching and browsing the following CottonCyc databases:
Cyc pathways for JGI v2.0 G. raimondii D5 genome assembly
This Cyc database was constructed using PathwayTools version 20.0 using the gene models from the JGI v2.0 D5 genome assembly of Gossypium raimondii. There has been no manual curation of this Cyc database. Pathway predictions were made using PathwayTools and in-silico v2.1 annotations as provided by JGI.
Cyc pathways for CGP-BGI v1.0 G. hirsutum AD1 genome assembly
This Cyc database was constructed using PathwayTools version 20.0 using the gene models from the CGP-BGI v1.0 AD1 genome assembly of Gossypium hirsutum. There has been no manual curation of this Cyc database. Pathway predictions were made using PathwayTools and in-silico v1.0 annotations as provided by CGP-BGI.
Search parameters include genes, proteins, RNAs, compounds, reactions, pathways, growth media, and BLAST search.
Resources in this dataset: Resource Title: Website Pointer to CottonGen CottonCyc Pathways Database. File Name: Web Page, url: http://ptools.cottongen.org/
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Source code and data files for PaperBLAST, SitesBLAST, and Curated BLAST for Genomes. The database is as of April 2022. The code is as of June 2022.
Exploding the code tarball will create a PaperBLAST/ directory. The code is in the cgi/, bin/, and lib/ subdirectories.
To install the data, create a data subdirectory, and explode the PaperBLAST_Apr2022 tarball into that directory. This includes the SQLite database (litsearch.db), and two BLAST databases (uniq.faa for PaperBLAST and hassites.faa for SitesBLAST).
For up to date code and databases, see https://github.com/morgannprice/PaperBLAST
The goals of the Antibiotic Resistance Genes Database (ARDB) are to provide a centralized compendium of information on antibiotic resistance, to facilitate the consistent annotation of resistance information in newly sequenced organisms, and also to facilitate the identification and characterization of new genes. ARDB contains six types of database groups:
- Resistance Type: This database contains information such as resistance profile, mechanism, requirement and epidemiology for each type.
- Resistance Gene: This database contains information such as resistance profile, resistance type, requirement, protein and DNA sequence for each gene. This database only includes NON-REDUNDANT, NON-VECTOR, COMPLETE genes.
- Antibiotic: This database contains information such as producer, action mechanism and resistance type for each antibiotic.
- Resistance Gene(NonRD): This database contains the same information as Resistance Gene. It does NOT include NON-REDUNDANT, NON-VECTOR genes, but includes INCOMPLETE genes.
- Resistance Gene(ALL): This database contains the same information as Resistance Gene. It includes all REDUNDANT, VECTOR AND INCOMPLETE genes.
- Resistance Species: This database contains the resistance profile and corresponding resistance genes for each species.
Furthermore, ARDB also contains three types of BLAST database:
- Resistance Genes Complete: Contains only NON-REDUNDANT, NON-VECTOR, COMPLETE gene sequences.
- Resistance Genes Non-redundant: Contains NON-REDUNDANT, NON-VECTOR, COMPLETE, INCOMPLETE gene sequences.
- Resistance Genes All: Contains all REDUNDANT, VECTOR, COMPLETE, INCOMPLETE gene sequences.
Lastly, ARDB provides four types of analytical tools:
- Normal BLAST: This function allows a user to input a DNA or protein sequence and find similar DNA (Nucleotide BLAST) or protein (Protein BLAST) sequences using blastn, blastp, blastx, tblastn or tblastx.
- RPS BLAST: A web RPSBLAST (RPS BLAST) interface is provided to align a query sequence against the Position Specific Scoring Matrix (PSSM) for each type. Normally, this will give the same annotation information as the regular BLAST mentioned above.
- Multiple Sequences BLAST (Genome Annotation): This function allows a user to annotate multiple (fewer than 5000) query sequences in FASTA format.
- Mutation Resistance Identification: This function allows a user to identify mutations that will cause potential antibiotic resistance, for 12 genes (16S rRNA, 23S rRNA, gyrA, gyrB, parC, parE, rpoB, katG, pncA, embB, folP, dfr).
Sponsors: ARDB is funded by Uniformed Services University of the Health Sciences, administered by the Henry Jackson Foundation.
U.S. Government Works: https://www.usa.gov/government-works
License information was derived automatically
CottonGen (https://www.cottongen.org) is a curated and integrated web-based relational database providing access to publicly available genomic, genetic and breeding data to enable basic, translational and applied research in cotton. Built using the open-source Tripal database infrastructure, CottonGen supersedes CottonDB and the Cotton Marker Database, which includes sequences, genetic and physical maps, genotypic and phenotypic markers and polymorphisms, quantitative trait loci (QTLs), pathogens, germplasm collections and trait evaluations, pedigrees, and relevant bibliographic citations, with enhanced tools for easier data sharing, mining, visualization, and data retrieval of cotton research data. CottonGen contains annotated whole genome sequences, unigenes from expressed sequence tags (ESTs), markers, trait loci, genetic maps, genes, taxonomy, germplasm, publications and communication resources for the cotton community. Annotated whole genome sequences of Gossypium raimondii are available with aligned genetic markers and transcripts. These whole genome data can be accessed through genome pages, search tools and GBrowse, a popular genome browser. Most of the published cotton genetic maps can be viewed and compared using CMap, a comparative map viewer, and are searchable via map search tools. Search tools also exist for markers, quantitative trait loci (QTLs), germplasm, publications and trait evaluation data. CottonGen also provides online analysis tools such as NCBI BLAST and Batch BLAST. This project is funded/supported by Cotton Incorporated, the USDA-ARS Crop Germplasm Research Unit at College Station, TX, the Southern Association of Agricultural Experiment Station Directors, Bayer CropScience, Corteva/Agriscience, Dow/Phytogen, Monsanto, Washington State University, and NRSP10. Resources in this dataset:Resource Title: Website Pointer for CottonGen. 
File Name: Web Page, url: https://www.cottongen.org/ Genomic, Genetic and Breeding Resources for Cotton Research Discovery and Crop Improvement, organized by:
Species (Gossypium arboreum, barbadense, herbaceum, hirsutum, raimondii, others), Data (Contributors, Download, Submission, Community Projects, Archives, Cotton Trait Ontology, Nomenclatures, and links to Variety Testing Data and NCBI SRA Datasets), Search options (Colleague, Genes and Transcripts, Genotype, Germplasm, Map, Markers, Publications, QTLs, Sequences, Trait Evaluation, MegaSearch), Tools (BIMS, BLAST+, CottonCyc, JBrowse, Map Viewer, Primer3, Sequence Retrieval, Synteny Viewer), International Cotton Genome Initiative (ICGI), and Help sources (User manual, FAQs).
Also provides Quick Start links for Major Species and Tools.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
LukProt is the EukProt database with additional species added, mostly from undersampled animal and some holozoan taxa. The database is composed of sequences translated from annotated genomes, transcriptomes or ESTs. The main purposes of the database are to consolidate sequences from undersampled animal taxa and to provide usable search tools. The publication associated with LukProt can be found here: https://doi.org/10.1093/gbe/evae231.
The current version of the database (v1.5.1) is based on EukProt v3. The home of all public versions of LukProt is this page (Zenodo).
Proteomes that are novel in LukProt are denoted as LPXXXXX and those coming from AniProtDB are called APXXXXX. The sequence IDs from EukProt are conserved in LukProt. This means that each sequence is assigned an ID in the following format:
(A/E/L)PXXXXX_Species_epithet_(strain)_PYYYYYY
where XXXXX is a number from 00001 to 99999 and YYYYYY is a number from 000001 to 999999. Each sequence is assigned a unique number YYYYYY, and each taxon a unique number XXXXX. All the IDs are compatible with the BLAST v5 "-parse_seqids" option, so the database can be readily deployed, for example on a server running SequenceServer. Within each of the source FASTA files, the source sequence identifier was kept after a blank space, so that it can still be retrieved if needed.
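The identifier convention above can be checked with a short regular expression. This is a minimal sketch: the example header is invented for illustration, and the species, epithet and optional strain are kept as one field because the format alone does not say where one ends and the next begins.

```python
import re

# (A/E/L)PXXXXX_Species_epithet_(strain)_PYYYYYY
LUKPROT_ID = re.compile(
    r"^(?P<proteome>[AEL]P\d{5})_(?P<name>.+)_P(?P<sequence>\d{6})$"
)

# Prefix meanings as described in the text above.
SOURCES = {"A": "AniProtDB", "E": "EukProt", "L": "LukProt"}


def parse_header(header):
    """Split a FASTA header into the LukProt ID parts and the source identifier."""
    text = header.lstrip(">").strip()
    lukprot_id, _, source_id = text.partition(" ")
    match = LUKPROT_ID.match(lukprot_id)
    if match is None:
        raise ValueError(f"not a LukProt-style identifier: {lukprot_id!r}")
    return {
        "proteome": match["proteome"],
        "origin": SOURCES[match["proteome"][0]],
        "name": match["name"],
        "sequence_number": int(match["sequence"]),
        "source_id": source_id.strip() or None,
    }


# Hypothetical header, made up for this example.
example = parse_header(">LP00001_Genus_species_P000123 original_id_1")
```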
A publicly available BLAST server providing LukProt search is available at: https://lukprot.hirszfeld.pl/.
Comparison of EukProt v2/v3, LukProt v1.4.1 and LukProt v1.5.1 in their main areas of difference:
Taxogroup                      EukProt v2   EukProt v3   LukProt v1.4.1   LukProt v1.5.1
Holozoa (excluding Metazoa)    31           40           39               43
Ctenophora                     2            2            35               38
Porifera                       4            5            30               47
Placozoa                       2            2            3                6
Cnidaria                       3            5            65               88
Bilateria                      51           51           94               142
Included with the database are:
ready to use main database files:
LukProt_v1.5.1_single_species_FASTA.7z – a FASTA file with the sequences - 7-zipped, uncompressed size: 17.6 GB
to concatenate all into one file, run this in the parent directory: for file in $(find . -type f -name "*.fasta"); do awk 'FNR==1{print ""}1' "$file" >> LukProt_v1.5.1.fa; done. This will create a single FASTA file with all the sequences in the parent directory. awk is used to print an empty line before the first line of every file, because cat would sometimes merge the last sequence of one file with the header of the first sequence of the next.
LukProt_v1.5.1_full_BLAST_db.7z – a preformatted, full BLAST database (NCBI BLAST database format version: v5, masked with segmasker), uncompressed size: 28.3 GB
LukProt_v1.5.1_taxogroup_BLAST_db.7z – a collection of BLAST databases where each taxogroup is one BLAST database and is placed within the eukaryotic tree of life directory structure, uncompressed size: 26.3 GB
LukProt_v1.5.1_single_species_BLAST_db.7z – a collection of BLAST databases where each proteome is one BLAST database and is placed within the eukaryotic tree of life directory structure, uncompressed size: 26.4 GB
auxiliary database files:
LukProt_v1.5.1.cdhit70.7z – the full database clustered at 70% identity using CD-HIT with the following command: cd-hit -g 1 -d 0 -T 20 -M 90000 -c 0.7 -uL 0.2 -uS 0.9 -s 0.2, uncompressed sizes: fasta file - 11 GB, clstr file - 2.5 GB
LukProt_IDs_mapped.txt.gz – a text file mapping the LukProt IDs to the AniProtDB IDs and EukProt IDs that are different
BUSCO_tables.ods – a spreadsheet with full result tables generated by BUSCO analysis
OMAmer_output.zip – a folder with full results of OMAmer analyses (includes per-sequence taxonomy classification)
OMArk_output.zip – a folder with the results of all OMArk analyses
metadata:
README.md – a README file describing the metadata
LukProt_metadata_sheet.ods – main metadata file. A spreadsheet with information about each proteome (in an open .ods format, most compatible with LibreOffice)
LukProt_metadata_other.zip – an archive with other metadata files, documented in the README. Contents include:
the LukProt taxonomy in various formats
supporting scripts for data manipulation and visualization
a recoloring script (modified by LFS, originally by Dr. Celine Petitjean). The script is in public domain and reuploaded here only for convenience.
other files - see README
changelog.md – database changelog
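The concatenation step described for the single-species FASTA archive can also be sketched in Python. This version is an alternative to the awk loop rather than a copy of it: instead of printing an empty line before every file, it appends a newline only when a file does not already end with one. The function name is illustrative.

```python
from pathlib import Path


def concatenate_fasta(root, output, chunk_size=1 << 20):
    """Concatenate every *.fasta file under `root` into `output`.

    A newline is appended after any file that lacks a trailing one, so the
    last sequence of one file never merges with the next file's header.
    Returns the number of files written.
    """
    output = Path(output)
    written = 0
    with open(output, "wb") as out:
        for path in sorted(Path(root).rglob("*.fasta")):
            if path.resolve() == output.resolve():
                continue  # never read the file being written
            last_byte = b""
            with open(path, "rb") as src:
                for chunk in iter(lambda: src.read(chunk_size), b""):
                    out.write(chunk)
                    last_byte = chunk[-1:]
            if last_byte and last_byte != b"\n":
                out.write(b"\n")
            written += 1
    return written


# Usage, run from the parent directory of the unpacked archive:
#   concatenate_fasta(".", "LukProt_v1.5.1.fa")
```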
Words of caution:
The database has been synchronized to EukProt v3 in version v1.5.1. This means that identifiers were modified in comparison to LukProt v1.4.1. The convention is not expected to change any more in future updates.
Many proteomes, especially those transcriptome-based, may contain contamination from different species. In addition, the translation algorithms often introduce errors (e.g. the transcript may not represent a full length protein). For this reason, to get accurate sequences from each organism, users are directed to source data and to the included OMAmer, OMArk and BUSCO data for details.
The taxonomy is different to UniEuk/EukMap, but UniEuk data were integrated where possible.
A few NCBI taxids are missing and will be added in due course.
Proteomes from NCBI and UniProt will be updated to current versions.
A number of proteomes listed in some of the metadata are unpublished and were held back.
While the database contains metadata that present a particular phylogeny of animals, holozoans and other eukaryotes, no particular claims or hypotheses are made by the author(s). However, in the future efforts will be made to name clades officially, once they are more firmly established.
Please report any problems or suggestions to Lukasz Sobala: lukasz.sobala (at) hirszfeld.pl.
Acknowledgements:
Andrew E. Allen Lab for creating the original PhyloDB.
Daniel Richter et al. for creating EukProt and keeping it updated.
Members of the Multicellgenome Lab, especially Michelle Leger (for donating her database), for the bioinformatics support and for doing great science.
All the authors of the original data.
National Science Centre of Poland for funding of the project 2020/36/C/NZ8/00081, "The role of glycosylation in the emergence of animal multicellularity", which enabled the creation of this database.
THIS RESOURCE IS NO LONGER IN SERVICE, documented August 22, 2016. A database of information on bacterial phages. It contains multiple phage genomes, which users can search with BLAST and MegaBLAST, and also hosts a Phage Forum in which users can discuss phage data. Interactive browsing of completed phage genomes is available using the program. The browser allows users to scan the genome for particular features and to download sequence information plus analyses of those features. Views of the genome are generated showing named genes, BLAST similarities to other phages, predicted tRNAs and other sequence features.
U.S. Government Works: https://www.usa.gov/government-works
License information was derived automatically
BIMS, the Breeding Information Management System, is a secure and comprehensive online breeding management system developed for the generic Tripal Database Platform which allows breeders to store, manage, archive and analyze their private breeding program. Breeders can load data in the templates provided, as well as output from the Field Book App, an Android app for collecting phenotype data. In addition to the private BIMS used by breeders, users without accounts can also view the publicly available breeding data. The fully developed version will allow users to:
Fully integrate their data with publicly available genomic, genetic and breeding data in the community database.
Utilize their integrated pedigree, phenotype and genotype data in performing genomic analysis and making breeding decisions.
Use open-source new genomics tool and breeding decision tools with seamless access to HPC.
Resources in this dataset: Resource Title: Website Pointer for CottonGen BIMS (Breeding Information Management System). File Name: Web Page, url: https://www.cottongen.org/bims
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Version 3 (22 November, 2021)
See https://doi.org/10.24072/pcjournal.173 for a detailed description of the database. See http://evocellbio.com/eukprot/ for a BLAST database, interactive plots of BUSCO scores and ‘The Comparative Set’ (TCS): A selected subset of EukProt for comparative genomics investigations. Protein sequence FASTA files of the TCS are available at https://doi.org/10.6084/m9.figshare.21586065. See https://github.com/beaplab/EukProt for utility scripts, annotations, and all the files necessary to build the tree in Figures 1 and 3 (from the DOI above).
Scroll to the end of this page for changes since version 2.
Are we missing anything? Please let us know!
EukProt is a database of published and publicly available predicted protein sets selected to represent the breadth of eukaryotic diversity, currently including 993 species from all major supergroups as well as orphan taxa. The goal of the database is to provide a single, convenient resource for gene-based research across the spectrum of eukaryotic life, such as phylogenomics and gene family evolution. Each species is placed within the UniEuk taxonomic framework in order to facilitate downstream analyses, and each data set is associated with a unique, persistent identifier to facilitate comparison and replication among analyses. The database is regularly updated, and all versions will be permanently stored and made available via FigShare. The current version has a number of updates, notably ‘The Comparative Set’ (TCS), a reduced taxonomic set with high estimated completeness while maintaining a substantial phylogenetic breadth, which comprises 196 predicted proteomes. A BLAST web server and graphical displays of data set completeness are available at http://evocellbio.com/eukprot/. We invite the community to provide suggestions for new data sets and new annotation features to be included in subsequent versions, with the goal of building a collaborative resource that will promote research to understand eukaryotic diversity and diversification.
This release contains 5 files:
EukProt_proteins.v03.2021_11_22.tgz: 993 protein data sets, for species with either a genome (375) or single-cell genome (56), a transcriptome (498), a single-cell transcriptome (47), or an EST assembly (17).
EukProt_genome_annotations.v03.2021_11_22.tgz: gene annotations, in GFF format, as produced by EukMetaSanity (https://github.com/cjneely10/EukMetaSanity) for 40 genomes lacking publicly available protein annotations. The proteins predicted from these annotations are included in the proteins file.
EukProt_included_data_sets.v03.2021_11_22.txt and EukProt_not_included_data_sets.v03.2021_11_22.txt: tables of information on data sets either included (993 data sets) or not included (163) in the database. Tab-delimited; multiple entries in the same cell are comma-delimited; missing data is represented with the “N/A” value. With the following columns:
EukProt_ID: the unique identifier associated with the data set. This will not change among versions. If a new data set becomes available for the species, it will be assigned a new unique identifier.
Name_to_Use: the name of the species for protein/genome annotation/assembled transcriptome files.
Strain: the strain(s) of the species sequenced.
Previous_Names: any previous names that this species was known by.
Replaces_EukProt_ID/Replaced_by_EukProt_ID: if the data set changes with respect to an earlier version, the EukProt ID of the data set that it replaces (in the included table) or that it is replaced by (in the not_included table).
Genus_UniEuk, Epithet_UniEuk, Supergroup_UniEuk, Taxogroup1_UniEuk, Taxogroup2_UniEuk: taxonomic identifiers at different levels of the UniEuk taxonomy (Berney et al. 2017, DOI: 10.1111/jeu.12414, based on Adl et al. 2019, DOI: 10.1111/jeu.12691).
Taxonomy_UniEuk: the full lineage of the species in the UniEuk taxonomy (semicolon-delimited).
Merged_Strains: whether multiple strains of the same species were merged to create the data set.
Data_Source_URL: the URL(s) from which the data were downloaded.
Data_Source_Name: the name of the data set (as assigned by the data source).
Paper_DOI: the DOI(s) of the paper(s) that published the data set.
Actions_Prior_to_Use: the action(s) that were taken to process the publicly available files in order to produce the data set in this database. Actions taken (see our manuscript for more details):
‘assemble mRNA’: Trinity v. 2.8.4, http://trinityrnaseq.github.io/
‘CD-HIT’: v. 4.6, http://weizhongli-lab.org/cd-hit/
‘extractfeat’, ‘seqret’, ‘transeq’, ‘trimseq’: from EMBOSS package v. 6.6.0.0, http://emboss.sourceforge.net/
‘translate mRNA’: Transdecoder v. 5.3.0, http://transdecoder.github.io/
‘gffread’: v. 0.12.3, https://github.com/gpertea/gffread
‘predict genes’: EukMetaSanity, https://github.com/cjneely10/EukMetaSanity (cloned on 21 September, 2021)
All parameter values were default, unless otherwise specified.
Data_Source_Type: the type of the source data (possible types: EST, transcriptome, single-cell transcriptome, genome, single-cell genome).
Notes: additional information on the data set (including why it is replaced by/is replacing another data set, or why it was not included).
Columns_Modified_Since_Previous_Version: column(s) in this file modified for the data set since the previous release. Not listed: modifications to the Notes column or to new columns added in this version.
Alternative_Strain_Names: non-exhaustive list of alternative names for the sequenced strain for this data set.
18S_Sequence_GenBank_ID: GenBank identifier for the strain sequenced in the data set. When multiple strains were sequenced, identifiers are separated with a comma, in the same order as the Strain column. Ranges of identifiers for the same strain are separated by a hyphen. ‘N/A’ indicates either that there is no GenBank sequence for the strain or that none of the available sequences is full-length (< 1,500 bp).
18S_Sequence: 18S for the strain derived from publicly available sequences associated with the data set, in the case where a GenBank sequence is not available.
18S_Sequence_Source: the source for the sequence in the 18S_Sequence column, if any.
18S_Sequence_Other_Strain_GenBank_ID: GenBank identifier for 18S sequence(s) from other strains of the same species as the data set.
18S_Sequence_Other_Strain_Name: strain name(s) for the sequences in the 18S_Sequence_Other_Strain_GenBank_ID column.
18S_and_Taxonomy_Notes: additional information on the values in the 18S_Sequence columns.
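The table conventions described above (tab-delimited, comma-delimited multiple entries, “N/A” for missing data) can be read with a short sketch such as the following. The choice of which columns to split on commas is left to the caller, because free-text columns such as Notes may contain ordinary commas; the sample row is invented for illustration.

```python
import csv
import io

MISSING = "N/A"


def read_data_set_table(handle, multi_value_columns=()):
    """Read a tab-delimited table into a list of dictionaries.

    "N/A" cells become None. Cells in `multi_value_columns` are split on
    commas into lists; all other cells are kept as plain strings.
    """
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for record in reader:
        row = {}
        for column, cell in record.items():
            if cell is None or cell == MISSING:
                row[column] = None
            elif column in multi_value_columns:
                row[column] = [item.strip() for item in cell.split(",")]
            else:
                row[column] = cell
        rows.append(row)
    return rows


# Invented sample with a subset of the columns, for illustration only.
sample = io.StringIO(
    "EukProt_ID\tName_to_Use\tPaper_DOI\tNotes\n"
    "EP00001\tGenus_species\t10.1000/a, 10.1000/b\tN/A\n"
)
rows = read_data_set_table(sample, multi_value_columns={"Paper_DOI"})
```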
Changes since version 2
There are 324 new data sets included. 57 of these replace data sets from version 2.
40 newly published data sets were added to the list that are not included in the database (annotated in the Notes column with the reasons they were not included).
Instead of unannotated genomes (for published genomes lacking protein predictions), we now include predicted proteins and gene annotations (in GFF3 format).
All sequences within each file are now assigned a standardized, unique identifier based on the data set’s EukProt_ID and on the type of data (protein or transcriptome). Illegal characters are removed from sequences.
In the UniEuk_Taxonomy field, single quotes are now used instead of double quotes, to be consistent with other UniEuk databases (EukMap, EukRibo).
Changes to metadata of individual data sets (in the included and not_included tables) with respect to the previous version are now listed in the Columns_Modified_Since_Previous_Version column.
The Taxogroup_UniEuk column has been split into the Taxogroup1_UniEuk and Taxogroup2_UniEuk columns. This resulted in the Supergroup_UniEuk column changing for Opisthokonta.
In addition, the following new columns have been added (see our manuscript for details): Alternative_Strain_Names, 18S_Sequence_GenBank_ID, 18S_Sequence, 18S_Sequence_Source, 18S_Sequence_Other_Strain_GenBank_ID, 18S_Sequence_Other_Strain_Name, 18S_and_Taxonomy_Notes.
EukProt_assembled_transcriptomes.v03.2021_11_22.tgz: assembled transcriptome contigs, for 126 species with publicly available mRNA sequence reads but no publicly available assembly. The proteins predicted from these assemblies are included in the proteins file.
Sequence names in the proteins and transcriptomes files have standardized, unique identifiers with the following format:
[EukProt ID]_[Name_to_Use]_[Type abbreviation][Counter] [Previous header contents]
Type abbreviations are P (protein) and T (transcriptome).
All characters not in the following list are removed from nucleic acid sequences: ACGTNUKSYMWRBDHV
All characters not in the following list are removed from protein sequences: ABCDEFGHIKLMNPQRSTUVWYZX*
Lists of legal characters are from: https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=BlastHelp
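The character filtering described above amounts to a simple membership test. The sketch below uses the two lists exactly as given; treating lowercase letters as legal when their uppercase form is in the list is an assumption of this example, not something the release notes state.

```python
NUCLEOTIDE_CHARS = frozenset("ACGTNUKSYMWRBDHV")
PROTEIN_CHARS = frozenset("ABCDEFGHIKLMNPQRSTUVWYZX*")


def strip_illegal(sequence, legal):
    """Drop every character whose uppercase form is not in `legal`.

    Assumption: case is ignored when testing membership, and the original
    case of the kept characters is preserved.
    """
    return "".join(ch for ch in sequence if ch.upper() in legal)


clean_dna = strip_illegal("ACGT-N.X", NUCLEOTIDE_CHARS)
clean_protein = strip_illegal("MKV?J*O", PROTEIN_CHARS)
```

Here the gap and dot characters and the letter X are dropped from the nucleotide string, and `?`, `J` and `O` are dropped from the protein string, since none of them appears in the corresponding list.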
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Source code and data files for PaperBLAST and Curated BLAST for Genomes, as of December 7, 2018. Exploding the tarball will create a PaperBLAST/ directory. The SQLite3 database, FASTA file, and BLAST database are in the data/ subdirectory. The code is in the cgi/, bin/, and lib/ subdirectories. The main CGI scripts are cgi/litSearch.cgi for PaperBLAST and cgi/genomeSearch.cgi for Curated BLAST. For Curated BLAST to fetch genomes from JGI's IMG, it needs a valid username and password, which should be placed on separate lines in the private/.JGI.info file. If you plan to run a public web server, make sure that the contents of private/ are not made accessible.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
ActDES – a Curated Actinobacterial Database for Evolutionary Studies
Extracted Sequences for BLAST database
a. BLAST nucleotide database (.fasta file)
b. BLAST protein database (.fasta file)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
rCRUX generated reference database using NCBI nt blast database downloaded in December 2022.
Primer Name: Fungal_ITS gITS7/ITS4
Gene: FITS
Length of Target: 150-330
get_seeds_local() minimum length: 105
get_seeds_local() maximum length: 500
blast_seeds() minimum length: 65
blast_seeds() maximum length: 461
max_to_blast: 100
Forward Sequence (5'-3'): GTGARTCATCGARTCTTTG
Reverse Sequence (5'-3'): TCCTCCGCTTATTGATATGC
Reference:
White, T. J., Bruns, T., Lee, S., & Taylor, J. (1990). Amplification and direct sequencing of fungal ribosomal RNA genes for phylogenetics. PCR Protocols: A Guide to Methods and Applications, 18(1), 315–322.
Ihrmark, K., Bödeker, I., Cruz-Martinez, K., Friberg, H., Kubartova, A., Schenck, J., Strid, Y., Stenlid, J., Brandström-Durling, M., & Clemmensen, K. E. (2012). New primers to amplify the fungal ITS2 region–evaluation by 454-sequencing of artificial and natural communities. FEMS Microbiology Ecology, 82(3), 666–677. https://doi.org/10.1111/j.1574-6941.2012.01437.x
We used the default rCRUX parameters. For get_blast_seeds(): percent coverage of 70, percent identity of 70, evalue 3e+7, and max number of blast alignments = '100000000'. For blast_seeds(): coverage of 70, percent identity of 70, evalue 3e+7, rank of genus, and max number of blast alignments = '10000000'.
Database of proteins found in the nucleoli of Arabidopsis, identified through proteomic analysis. The Arabidopsis Nucleolar Protein database (AtNoPDB) provides information on the plant proteins in comparison to human and yeast proteins, and images of cellular localizations for over a third of the proteins. A proteomic analysis was carried out of nucleoli purified from Arabidopsis cell cultures and to date 217 proteins have been identified. Many proteins were known nucleolar proteins or proteins involved in ribosome biogenesis. Some proteins, such as spliceosomal and snRNP proteins, and translation factors, were unexpected. In addition, proteins of unknown function which were either plant-specific or conserved between human and plant, and proteins with differential localizations were identified.
The columns represent, in order: the sequence identifier (ID); the sequence identifier from the NCBI database (NCBI_ID); the pairwise sequence identity of the match (pwid); the length of the match (length); the number of mismatches (n_mismatches); the number of gap openings (n_gap); the start of the alignment in the query (q_start); the end of the alignment in the query (q_end); the start of the alignment on the NCBI matched sequence (t_start); the end of the alignment on the NCBI matched sequence (t_end); the BLAST e-value (evalue); and the BLAST bit score (score). (CSV)
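A minimal sketch of loading a file with this column layout is shown below. It assumes the CSV has no header row and that the columns appear exactly in the order listed; neither is confirmed by the description. The sample row is invented purely for illustration and is not taken from the dataset.

```python
import csv
import io

# Column names in the order given in the description.
COLUMNS = [
    "ID", "NCBI_ID", "pwid", "length", "n_mismatches", "n_gap",
    "q_start", "q_end", "t_start", "t_end", "evalue", "score",
]
INT_FIELDS = ("length", "n_mismatches", "n_gap", "q_start", "q_end", "t_start", "t_end")
FLOAT_FIELDS = ("pwid", "evalue", "score")


def read_hits(handle):
    """Parse BLAST hit rows into dicts with numeric fields converted."""
    hits = []
    for row in csv.DictReader(handle, fieldnames=COLUMNS):
        for key in INT_FIELDS:
            row[key] = int(row[key])
        for key in FLOAT_FIELDS:
            row[key] = float(row[key])
        hits.append(row)
    return hits


# Hypothetical example row (made-up identifiers and values).
sample = io.StringIO("seq1,XX000001.1,98.5,200,3,0,1,200,51,250,1e-95,350\n")
hits = read_hits(sample)
```

If the real file does carry a header row, drop the `fieldnames` argument so that `csv.DictReader` takes the names from the first line instead.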
A database of protein subcellular localization containing proteins from the primary protein databases SWISS-PROT and PIR. The collected subcellular localization annotations are classified and categorized by cross-references to taxonomies and the Gene Ontology database. Annotations were taken from primary protein databases, model organism genome projects and literature texts, and were then analyzed to extract the subcellular localization features of the proteins. The proteins are also classified into different categories. Based on sequence alignment, nonredundant subsets of the database have been built, which may provide useful information for subcellular localization prediction. The database now contains >60 000 protein sequences, including 30 000 protein sequences in the nonredundant data sets. Online download, a SOAP server, BLAST tools and prediction services are also available.
U.S. Government Works https://www.usa.gov/government-works
License information was derived automatically
CottonGen offers BLAST with genome, transcriptome, peptide and marker sequence databases from Gossypium species. This can be done using nucleotide sequences or peptide sequences. BLAST functionality is similar to that on NCBI. Enter or upload FASTA sequence(s) to query and select BLAST database.
BLAST Programs:
blastn: Search a nucleotide database using a nucleotide query.
blastx: Search protein database using a translated nucleotide query.
tblastn: Search translated nucleotide database using a protein query.
blastp: Search protein database using a protein query.
Resources in this dataset:
Resource Title: Website Pointer for CottonGen BLAST Search. File Name: Web Page, url: https://www.cottongen.org/blast
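The four programs differ only in the type of the query and the type of the database, which can be captured in a small lookup. The sketch below is an illustration of that mapping and not part of CottonGen; the function name `choose_program` and the type labels are assumptions made for this example.

```python
# (query type, database type) -> BLAST program, per the descriptions above.
PROGRAMS = {
    ("nucleotide", "nucleotide"): "blastn",
    ("nucleotide", "protein"): "blastx",    # query is translated
    ("protein", "nucleotide"): "tblastn",   # database is translated
    ("protein", "protein"): "blastp",
}


def choose_program(query_type, database_type):
    """Return the BLAST program matching a query/database type pair."""
    try:
        return PROGRAMS[(query_type, database_type)]
    except KeyError:
        raise ValueError(
            f"unsupported combination: query={query_type!r}, database={database_type!r}"
        ) from None


print(choose_program("nucleotide", "protein"))
print(choose_program("protein", "nucleotide"))
```

For example, a nucleotide query against a peptide database calls for blastx, and a protein query against a genome database calls for tblastn.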