100+ datasets found

Data from: CottonGen BLAST
agdatacommons.nal.usda.gov
catalog.data.gov
bin
Updated Feb 13, 2024
Cite
Taein Lee; Sook Jung; Ksenija Gasic; Todd Campbell; Jing Yu; Jodi Humann; Heidi Hough; Dorrie Main (2024). CottonGen BLAST [Dataset]. https://agdatacommons.nal.usda.gov/articles/dataset/CottonGen_BLAST/24853260
Available download formats: bin
Dataset updated
Feb 13, 2024
Dataset provided by
MainLab, Washington State University
Authors
Taein Lee; Sook Jung; Ksenija Gasic; Todd Campbell; Jing Yu; Jodi Humann; Heidi Hough; Dorrie Main
License
U.S. Government Works (https://www.usa.gov/government-works)
License information was derived automatically
Description
CottonGen offers BLAST with genome, transcriptome, peptide and marker sequence databases from Gossypium species. Searches can be run with nucleotide or peptide query sequences. BLAST functionality is similar to that on NCBI. Enter or upload FASTA sequence(s) to query and select a BLAST database.

BLAST Programs:

blastn: Search a nucleotide database using a nucleotide query.

blastx: Search a protein database using a translated nucleotide query.

tblastn: Search a translated nucleotide database using a protein query.

blastp: Search a protein database using a protein query.

Resources in this dataset:

Resource Title: Website Pointer for CottonGen BLAST Search. File Name: Web Page, url: https://www.cottongen.org/blast
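The four program descriptions above amount to a lookup on the query type and the database type. A minimal sketch of that lookup follows; the function name and the "nucleotide"/"protein" labels are illustrative assumptions, not part of CottonGen.

```python
# Lookup table built from the four program descriptions in this entry.
# Keys are (query type, database type); the labels are illustrative only.
BLAST_PROGRAMS = {
    ("nucleotide", "nucleotide"): "blastn",
    ("nucleotide", "protein"): "blastx",   # the nucleotide query is translated
    ("protein", "nucleotide"): "tblastn",  # the nucleotide database is translated
    ("protein", "protein"): "blastp",
}


def choose_blast_program(query_type: str, db_type: str) -> str:
    """Return the BLAST program for a query/database pairing."""
    key = (query_type.lower(), db_type.lower())
    if key not in BLAST_PROGRAMS:
        raise ValueError(f"unsupported query/database combination: {key}")
    return BLAST_PROGRAMS[key]


print(choose_blast_program("protein", "nucleotide"))  # tblastn
```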
Blast database sequences
datasetcatalog.nlm.nih.gov
Updated Jul 18, 2020
Cite
Zhang, Chi (2020). Blast database sequences [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000547354
Dataset updated
Jul 18, 2020
Authors
Zhang, Chi
Description
Sequences used as Blast database.
Case study I: Fidelity of iBLAST in three consecutive time periods.
plos.figshare.com
xls
Updated Jun 10, 2023
Cite
Sajal Dash; Sarthok Rasique Rahman; Heather M. Hines; Wu-chun Feng (2023). Case study I: Fidelity of iBLAST in three consecutive time periods. [Dataset]. http://doi.org/10.1371/journal.pone.0249410.t002
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0249410.t002
Dataset updated
Jun 10, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Sajal Dash; Sarthok Rasique Rahman; Heather M. Hines; Wu-chun Feng
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
blastn search was performed on nucleotide sequence databases (nt). At any time instance, the Past database size is the size of the database from the previous time instance. The Present database size is the database size at the present time instance. Delta is the incremental database growth from the previous time instance to the current time instance. NCBI BLAST must be performed on the entire Present database size, while iBLAST only needs to be performed on Delta.
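The relationship between the Past, Present and Delta sizes can be written out directly. The sketch below only restates that arithmetic; the function name and the example sizes are made up for illustration and are not values from the case study.

```python
from typing import Tuple


def incremental_search_fraction(past_size: int, present_size: int) -> Tuple[int, float]:
    """Return (delta, delta / present) for two consecutive database sizes.

    delta is the growth from the previous time instance to the current one,
    i.e. the part that an incremental search still has to cover.
    """
    if past_size < 0 or present_size <= 0 or past_size > present_size:
        raise ValueError("expected 0 <= past_size <= present_size and present_size > 0")
    delta = present_size - past_size
    return delta, delta / present_size


# Hypothetical sizes (arbitrary units), not taken from the dataset.
delta, fraction = incremental_search_fraction(past_size=80, present_size=100)
print(delta, fraction)  # 20 0.2
```

With these hypothetical sizes, a full search covers 100 units while an incremental one covers only the 20 new units.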
HCR Probe Designer: Custom BLAST database
zenodo.org
application/gzip
Updated Jun 13, 2025
Cite
Jeffrey Y Lee; Ilan Davis (2025). HCR Probe Designer: Custom BLAST database [Dataset]. http://doi.org/10.5281/zenodo.15658381
Available download formats: application/gzip
Unique identifier
https://doi.org/10.5281/zenodo.15658381
Dataset updated
Jun 13, 2025
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Jeffrey Y Lee; Ilan Davis
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Custom BLAST database for Hybridisation Chain Reaction (HCR) probe designer.

Database includes: cDNA + intron + ncRNA sequences.

Please see: https://github.com/jefflee1103/HCRv3_probe_design
List of common names and identifiers used in BLAST.
datasetcatalog.nlm.nih.gov
Updated Nov 2, 2020
Cite
Maughmer, Cory; Young, Ry; Hu, James C.; Rasche, Helena; Ramsey, Jolene; Gill, Jason J.; Mijalis, Eleni; Liu, Mei; Criscione, Anthony (2020). List of common names and identifiers used in BLAST. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000452489
Dataset updated
Nov 2, 2020
Authors
Maughmer, Cory; Young, Ry; Hu, James C.; Rasche, Helena; Ramsey, Jolene; Gill, Jason J.; Mijalis, Eleni; Liu, Mei; Criscione, Anthony
Description
Each category may be used separately or together to restrict a BLAST analysis by TaxID. Phage common names and NCBI accession numbers are included in the custom canonical phages database, which is also pre-compiled as a BLAST database. Bolded rows contain the well-studied representatives of each morphotype. (XLSX)
ESKAPEE BLAST nucleotide database (rMAP-2.0)
zenodo.org
bin
Updated Dec 20, 2025
Cite
Gerald Mboowa (2025). ESKAPEE BLAST nucleotide database (rMAP-2.0) [Dataset]. http://doi.org/10.5281/zenodo.18001238
Available download formats: bin
Unique identifier
https://doi.org/10.5281/zenodo.18001238
Dataset updated
Dec 20, 2025
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Gerald Mboowa
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Time period covered
Dec 20, 2025
Description
This record provides a pre-built BLAST nucleotide database archive for the ESKAPEE bacterial pathogens used by rMAP-2.0. The archive contains a complete makeblastdb output directory (index and auxiliary files) so users can run blastn queries without rebuilding the database locally.

Contents

eskapee_db.tar.gz — compressed tar archive containing the BLAST database files (e.g., .nsq, .nin, .nhr, and related files) under eskapee_db/

eskapee_db.tar.gz.sha256 — SHA-256 checksum for integrity verification

How the database was built
The database was generated from a curated FASTA file of ESKAPEE bacterial sequences using NCBI BLAST+:

makeblastdb \
  -in eskapee_db.fasta \
  -dbtype nucl \
  -parse_seqids \
  -max_file_sz 3000000000 \
  -out eskapee_db/eskapee_db

Download and unpack

# download from Zenodo and verify checksum
sha256sum -c eskapee_db.tar.gz.sha256

# unpack
tar -xzvf eskapee_db.tar.gz

Example usage
After unpacking, the database prefix is:

eskapee_db/eskapee_db

Example blastn query:

blastn -query query.fasta -db eskapee_db/eskapee_db -outfmt 6 -max_target_seqs 10 -evalue 1e-10 > blast_hits.tsv
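The -outfmt 6 option used above writes tab-separated text with no header row. A sketch of reading that file in Python follows, assuming the twelve default BLAST+ tabular columns (no custom column list was passed to -outfmt). The function name and the sample hit line are made up for illustration.

```python
import csv
import io
from typing import Dict, Iterable, Iterator, Union

# Default column order of BLAST+ tabular output (-outfmt 6).
OUTFMT6_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]
INT_COLUMNS = ("length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send")
FLOAT_COLUMNS = ("pident", "evalue", "bitscore")


def parse_outfmt6(lines: Iterable[str]) -> Iterator[Dict[str, Union[str, int, float]]]:
    """Yield one dict per hit from tab-separated BLAST output lines."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        if len(row) < len(OUTFMT6_COLUMNS):
            raise ValueError(f"expected 12 columns, got {len(row)}: {row}")
        record: Dict[str, Union[str, int, float]] = dict(zip(OUTFMT6_COLUMNS, row))
        for name in INT_COLUMNS:
            record[name] = int(record[name])
        for name in FLOAT_COLUMNS:
            record[name] = float(record[name])
        yield record


# Made-up example line; real input would come from open("blast_hits.tsv").
sample = "query1\tcontig_7\t98.50\t200\t3\t0\t1\t200\t1001\t1200\t1e-95\t350\n"
for hit in parse_outfmt6(io.StringIO(sample)):
    print(hit["qseqid"], hit["sseqid"], hit["pident"], hit["evalue"])
```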

Intended use
This database archive is intended to support reproducible, rapid local BLAST-based screening within the rMAP-2.0 workflow and related microbial genomics analyses, especially in settings where rebuilding large databases is time-consuming or bandwidth-limited.

Versioning
This Zenodo record corresponds to the version of the ESKAPEE BLAST database used in rMAP-2.0. Updated databases will be released as new versions on Zenodo.

Project repository
The rMAP-2.0 code and documentation are available on GitHub: [GitHub repo link / rMAP-2.0] (add as a “Related identifier” in Zenodo).

Checksum
SHA-256 checksum is provided in eskapee_db.tar.gz.sha256 and should be used to validate file integrity after download.
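Where sha256sum is not available, the same integrity check can be done with the Python standard library. This is a minimal sketch, assuming the checksum file uses the usual sha256sum layout (hex digest first, then the file name); the function names are illustrative.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksum(archive: str, checksum_file: str) -> bool:
    """Compare the archive's digest with the first field of the checksum file."""
    # Assumed layout: "<hex digest>  <file name>", as written by sha256sum.
    expected = Path(checksum_file).read_text().split()[0].lower()
    return sha256_of_file(archive) == expected
```

For this record the call would be verify_checksum("eskapee_db.tar.gz", "eskapee_db.tar.gz.sha256"), run in the directory holding both downloaded files.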
Data from: CottonGen CottonCyc Pathways Database
agdatacommons.nal.usda.gov
catalog.data.gov
bin
Updated Dec 18, 2023
Cite
Taein Lee; Sook Jung; Ksenija Gasic; Todd Campbell; Jing Yu; Jodi Humann; Heidi Hough; Dorrie Main (2023). CottonGen CottonCyc Pathways Database [Dataset]. https://agdatacommons.nal.usda.gov/articles/dataset/CottonGen_CottonCyc_Pathways_Database/24853212
Available download formats: bin
Dataset updated
Dec 18, 2023
Dataset provided by
MainLab, Washington State University
Authors
Taein Lee; Sook Jung; Ksenija Gasic; Todd Campbell; Jing Yu; Jodi Humann; Heidi Hough; Dorrie Main
License
U.S. Government Works (https://www.usa.gov/government-works)
License information was derived automatically
Description
The CottonGen CottonCyc Pathways Database, part of CottonGen, supports searching and browsing the following CottonCyc databases:

Cyc pathways for JGI v2.0 G. raimondii D5 genome assembly

This Cyc database was constructed using PathwayTools version 20.0 using the gene models from the JGI v2.0 D5 genome assembly of Gossypium raimondii. There has been no manual curation of this Cyc database. Pathway predictions were made using PathwayTools and in-silico v2.1 annotations as provided by JGI.

Cyc pathways for CGP-BGI v1.0 G. hirsutum AD1 genome assembly

This Cyc database was constructed using PathwayTools version 20.0 using the gene models from the CGP-BGI v1.0 AD1 genome assembly of Gossypium hirsutum. There has been no manual curation of this Cyc database. Pathway predictions were made using PathwayTools and in-silico v1.0 annotations as provided by CGP-BGI.

Search parameters include genes, proteins, RNAs, compounds, reactions, pathways, growth media, and BLAST search.

Resources in this dataset:

Resource Title: Website Pointer to CottonGen CottonCyc Pathways Database. File Name: Web Page, url: http://ptools.cottongen.org/
PaperBLAST and SitesBLAST database from April 2022
figshare.com
application/x-gzip
Updated Jun 8, 2022
Cite
Morgan Price (2022). PaperBLAST and SitesBLAST database from April 2022 [Dataset]. http://doi.org/10.6084/m9.figshare.20022590.v1
Available download formats: application/x-gzip
Unique identifier
https://doi.org/10.6084/m9.figshare.20022590.v1
Dataset updated
Jun 8, 2022
Dataset provided by
Figshare (http://figshare.com/)
Authors
Morgan Price
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Source code and data files for PaperBLAST, SitesBLAST, and Curated BLAST for Genomes. The database is as of April 2022. The code is as of June 2022.

Exploding the code tarball will create a PaperBLAST/ directory. The code is in the cgi/, bin/, and lib/ subdirectories.

To install the data, create a data subdirectory, and explode the PaperBLAST_Apr2022 tarball into that directory. This includes the SQLite database (litsearch.db), and two BLAST databases (uniq.faa for PaperBLAST and hassites.faa for SitesBLAST).

For up to date code and databases, see https://github.com/morgannprice/PaperBLAST
Antibiotic Resistance Genes Database
neuinfo.org
rrid.site
Updated Jan 29, 2022
Cite
(2022). Antibiotic Resistance Genes Database [Dataset]. http://identifiers.org/RRID:SCR_007040
Unique identifier
https://identifiers.org/RRID:SCR_007040
Dataset updated
Jan 29, 2022
Description
The goals of the Antibiotic Resistance Genes Database (ARDB) are to provide a centralized compendium of information on antibiotic resistance, to facilitate the consistent annotation of resistance information in newly sequenced organisms, and to facilitate the identification and characterization of new genes.

ARDB contains six types of database groups:

- Resistance Type: This database contains information such as resistance profile, mechanism, requirement, and epidemiology for each type.
- Resistance Gene: This database contains information such as resistance profile, resistance type, requirement, and protein and DNA sequence for each gene. This database only includes NON-REDUNDANT, NON-VECTOR, COMPLETE genes.
- Antibiotic: This database contains information such as producer, action mechanism, and resistance type, for each gene.
- Resistance Gene(NonRD): This database contains the same information as Resistance Gene. It does NOT include NON-REDUNDANT, NON-VECTOR genes, but includes INCOMPLETE genes.
- Resistance Gene(ALL): This database contains the same information as Resistance Gene. It includes all REDUNDANT, VECTOR AND INCOMPLETE genes.
- Resistance Species: This database contains the resistance profile and corresponding resistance genes for each species.

ARDB also contains three types of BLAST database:

- Resistance Genes Complete: Contains only NON-REDUNDANT, NON-VECTOR, COMPLETE gene sequences.
- Resistance Genes Non-redundant: Contains NON-REDUNDANT, NON-VECTOR, COMPLETE, INCOMPLETE gene sequences.
- Resistance Genes All: Contains all REDUNDANT, VECTOR, COMPLETE, INCOMPLETE gene sequences.

Lastly, ARDB provides four types of analytical tools:

- Normal BLAST: This function allows a user to input a DNA or protein sequence and find similar DNA (Nucleotide BLAST) or protein (Protein BLAST) sequences using blastn, blastp, blastx, tblastn, tblastx.
- RPS BLAST: A web RPSBLAST (RPS BLAST) interface is provided to align a query sequence against the Position Specific Scoring Matrix (PSSM) for each type. Normally, this will give the same annotation information as the regular BLAST mentioned above.
- Multiple Sequences BLAST (Genome Annotation): This function allows a user to annotate multiple (fewer than 5000) query sequences in FASTA format.
- Mutation Resistance Identification: This function allows a user to identify mutations that will cause potential antibiotic resistance, for 12 genes (16S rRNA, 23S rRNA, gyrA, gyrB, parC, parE, rpoB, katG, pncA, embB, folP, dfr).

Sponsors: ARDB is funded by Uniformed Services University of the Health Sciences, administered by the Henry Jackson Foundation.
Data from: CottonGen: Cotton Database Resources
agdatacommons.nal.usda.gov
datasetcatalog.nlm.nih.gov
bin
Updated Nov 21, 2025
Cite
Jing Yu; Sook Jung; Chun-Huai Cheng; Stephen P. Ficklin; Taein Lee; Ping Zheng; Don Jones; Richard G. Percy; Dorrie Main (2025). CottonGen: Cotton Database Resources [Dataset]. https://agdatacommons.nal.usda.gov/articles/dataset/CottonGen_Cotton_Database_Resources/24853203
Available download formats: bin
Dataset updated
Nov 21, 2025
Dataset provided by
MainLab, Washington State University
Authors
Jing Yu; Sook Jung; Chun-Huai Cheng; Stephen P. Ficklin; Taein Lee; Ping Zheng; Don Jones; Richard G. Percy; Dorrie Main
License
U.S. Government Works (https://www.usa.gov/government-works)
License information was derived automatically
Description
CottonGen (https://www.cottongen.org) is a curated and integrated web-based relational database providing access to publicly available genomic, genetic and breeding data to enable basic, translational and applied research in cotton. Built using the open-source Tripal database infrastructure, CottonGen supersedes CottonDB and the Cotton Marker Database, which includes sequences, genetic and physical maps, genotypic and phenotypic markers and polymorphisms, quantitative trait loci (QTLs), pathogens, germplasm collections and trait evaluations, pedigrees, and relevant bibliographic citations, with enhanced tools for easier data sharing, mining, visualization, and data retrieval of cotton research data.

CottonGen contains annotated whole genome sequences, unigenes from expressed sequence tags (ESTs), markers, trait loci, genetic maps, genes, taxonomy, germplasm, publications and communication resources for the cotton community. Annotated whole genome sequences of Gossypium raimondii are available with aligned genetic markers and transcripts. These whole genome data can be accessed through genome pages, search tools and GBrowse, a popular genome browser. Most of the published cotton genetic maps can be viewed and compared using CMap, a comparative map viewer, and are searchable via map search tools. Search tools also exist for markers, quantitative trait loci (QTLs), germplasm, publications and trait evaluation data. CottonGen also provides online analysis tools such as NCBI BLAST and Batch BLAST.

This project is funded/supported by Cotton Incorporated, the USDA-ARS Crop Germplasm Research Unit at College Station, TX, the Southern Association of Agricultural Experiment Station Directors, Bayer CropScience, Corteva/Agriscience, Dow/Phytogen, Monsanto, Washington State University, and NRSP10.

Resources in this dataset:

Resource Title: Website Pointer for CottonGen. File Name: Web Page, url: https://www.cottongen.org/

Genomic, Genetic and Breeding Resources for Cotton Research Discovery and Crop Improvement, organized by:

Species (Gossypium arboreum, barbadense, herbaceum, hirsutum, raimondii, others), Data (Contributors, Download, Submission, Community Projects, Archives, Cotton Trait Ontology, Nomenclatures, and links to Variety Testing Data and NCBISRA Datasets), Search options (Colleague, Genes and Transcripts, Genotype, Germplasm, Map, Markers, Publications, QTLs, Sequences, Trait Evaluation, MegaSearch), Tools (BIMS, BLAST+, CottonCyc, JBrowse, Map Viewer, Primer3, Sequence Retrieval, Synteny Viewer), International Cotton Genome Initiative (ICGI), and Help sources (User manual, FAQs).

Also provides Quick Start links for Major Species and Tools.
LukProt - an animal evolution-centric eukaryotic protein database
data-staging.niaid.nih.gov
data.niaid.nih.gov
Updated Feb 7, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sobala, Łukasz F. (2025). LukProt - an animal evolution-centric eukaryotic protein database [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_7089120
Dataset updated
Feb 7, 2025
Dataset provided by
Hirszfeld Institute of Immunology and Experimental Therapy, PAS
Authors
Sobala, Łukasz F.
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
LukProt is the EukProt database with additional species added, mostly the undersampled animal and some holozoan taxa. The database is composed of sequences translated from annotated genomes, transcriptomes or ESTs. The main purposes of the database are to consolidate sequences from undersampled animal taxa and provide usable search tools. The publication associated with LukProt can be found here: https://doi.org/10.1093/gbe/evae231.

The current version of the database (v1.5.1) is based on EukProt v3. The home of all public versions of LukProt is this page (Zenodo).

Proteomes that are novel in LukProt are denoted as LPXXXXX and those coming from AniProtDB are called APXXXXX. The sequence IDs from EukProt are conserved in LukProt. This means that each sequence is assigned an ID in the following format:

(A/E/L)PXXXXX_Species_epithet_(strain)_PYYYYYY

where XXXXX is a number from 00001 to 99999 and YYYYYY is a number from 000001 to 999999. Each sequence is assigned a unique number YYYYYY, and each taxon a number XXXXX. All the IDs are compatible with the BLAST v5 "-parse_seqids" option, and the database can be readily deployed, for example on a server running SequenceServer. Within each of the source FASTA files, the source sequence identifier was kept after a blank space, so that it can still be retrieved if needed.
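The ID format above can be split into its parts with a regular expression. The sketch below is an illustration and is not shipped with LukProt: the function and field names are assumptions, the example ID is made up, and the species, epithet and optional strain are kept together as one field because the format alone does not say where the strain begins.

```python
import re
from typing import NamedTuple

# Source prefixes as described in the text: AP = AniProtDB, EP = EukProt,
# LP = proteomes that are novel in LukProt.
SOURCES = {"A": "AniProtDB", "E": "EukProt", "L": "LukProt"}

ID_PATTERN = re.compile(
    r"(?P<source>[AEL])P(?P<taxon>\d{5})_(?P<name>.+)_P(?P<sequence>\d{6})"
)


class LukProtId(NamedTuple):
    source: str           # database the proteome comes from
    taxon_id: str         # e.g. "LP00001"
    name: str             # species, epithet and optional strain, unsplit
    sequence_number: str  # the six-digit YYYYYY part


def parse_lukprot_id(identifier: str) -> LukProtId:
    """Split an ID of the form (A/E/L)PXXXXX_Species_epithet_(strain)_PYYYYYY."""
    match = ID_PATTERN.fullmatch(identifier)
    if match is None:
        raise ValueError(f"not a LukProt-style identifier: {identifier!r}")
    return LukProtId(
        source=SOURCES[match["source"]],
        taxon_id=f"{match['source']}P{match['taxon']}",
        name=match["name"],
        sequence_number=match["sequence"],
    )


# Hypothetical identifier, for illustration only.
print(parse_lukprot_id("LP00001_Genus_epithet_strainX_P000042"))
```

In a FASTA header the original source identifier follows the LukProt ID after a space, so the header should be split on whitespace before the first token is passed to this function.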

A publicly available BLAST server providing LukProt search is available at: https://lukprot.hirszfeld.pl/.

Comparison of EukProt v2/v3, LukProt v1.4.1 and LukProt v1.5.1 in their main areas of difference:

Taxogroup                     EukProt v2   EukProt v3   LukProt v1.4.1   LukProt v1.5.1
Holozoa (excluding Metazoa)   31           40           39               43
Ctenophora                    2            2            35               38
Porifera                      4            5            30               47
Placozoa                      2            2            3                6
Cnidaria                      3            5            65               88
Bilateria                     51           51           94               142

Included with the database are:

ready to use main database files:

LukProt_v1.5.1_single_species_FASTA.7z – a FASTA file with the sequences - 7-zipped, uncompressed size: 17.6 GB

to concatenate all into one file, run this in the parent directory:

for file in $(find . -type f -name "*.fasta"); do awk 'FNR==1{print ""}1' "$file" >> LukProt_v1.5.1.fa; done

This will create a single FASTA file with all the sequences in the parent directory. awk prints an empty line ahead of each file's first line, because cat would sometimes merge the last sequence of one file with the first header of the next.

LukProt_v1.5.1_full_BLAST_db.7z – a preformatted, full BLAST database (NCBI BLAST database format version: v5, masked with segmasker), uncompressed size: 28.3 GB

LukProt_v1.5.1_taxogroup_BLAST_db.7z – a collection of BLAST databases where each taxogroup is one BLAST database and is placed within the eukaryotic tree of life directory structure, uncompressed size: 26.3 GB

LukProt_v1.5.1_single_species_BLAST_db.7z – a collection of BLAST databases where each proteome is one BLAST database and is placed within the eukaryotic tree of life directory structure, uncompressed size: 26.4 GB

auxiliary database files:

LukProt_v1.5.1.cdhit70.7z – the full database clustered at 70% identity using CD-HIT with the following command: cd-hit -g 1 -d 0 -T 20 -M 90000 -c 0.7 -uL 0.2 -uS 0.9 -s 0.2, uncompressed sizes: fasta file - 11 GB, clstr file - 2.5 GB

LukProt_IDs_mapped.txt.gz – a text file mapping the LukProt IDs to the AniProtDB IDs and EukProt IDs that are different

BUSCO_tables.ods – a spreadsheet with full result tables generated by BUSCO analysis

OMAmer_output.zip – a folder with full results of OMAmer analyses (includes per-sequence taxonomy classification)

OMArk_output.zip – a folder with the results of all OMArk analyses

metadata:

README.md – a README file describing the metadata

LukProt_metadata_sheet.ods – main metadata file. A spreadsheet with information about each proteome (in an open .ods format, most compatible with LibreOffice)

LukProt_metadata_other.zip – an archive with other metadata files, documented in the README. Contents include:

the LukProt taxonomy in various formats

supporting scripts for data manipulation and visualization

a recoloring script (modified by LFS, originally by Dr. Celine Petitjean). The script is in public domain and reuploaded here only for convenience.

other files - see README

changelog.md – database changelog
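The concatenation step described for the single-species FASTA archive in the file list above can also be done in Python. The sketch below uses a slightly different technique from the awk one-liner: instead of printing an empty line ahead of every file, it appends a newline only when a file does not already end with one. The function name and paths are illustrative assumptions.

```python
from pathlib import Path


def concatenate_fasta(root: str, output: str, pattern: str = "*.fasta") -> int:
    """Concatenate every file under root that matches pattern into output.

    A newline is appended after a file only if that file does not already
    end with one, so the last sequence of one file can never run into the
    first header of the next. Returns the number of files written.
    """
    output_path = Path(output).resolve()
    # Collect inputs before creating the output so it cannot include itself.
    inputs = [p for p in sorted(Path(root).rglob(pattern)) if p.resolve() != output_path]
    with open(output_path, "wb") as out:
        for path in inputs:
            last_byte = b"\n"
            with open(path, "rb") as src:
                while chunk := src.read(1 << 20):  # stream in 1 MiB chunks
                    out.write(chunk)
                    last_byte = chunk[-1:]
            if last_byte != b"\n":
                out.write(b"\n")
    return len(inputs)
```

A call such as concatenate_fasta(".", "LukProt_v1.5.1.fa") from the parent directory mirrors the shell loop. Files are streamed in chunks because the unpacked archive is reported as 17.6 GB.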

Words of caution:

The database has been synchronized to EukProt v3 in version v1.5.1. This means that identifiers were modified in comparison to LukProt v1.4.1. The convention is not expected to change any more in future updates.

Many proteomes, especially those transcriptome-based, may contain contamination from different species. In addition, the translation algorithms often introduce errors (e.g. the transcript may not represent a full length protein). For this reason, to get accurate sequences from each organism, users are directed to source data and to the included OMAmer, OMArk and BUSCO data for details.

The taxonomy is different to UniEuk/EukMap, but UniEuk data were integrated where possible.

A few NCBI taxids are missing and will be added in due course.

Proteomes from NCBI and UniProt will be updated to current versions.

A number of proteomes present in some metadata are unpublished and were held back.

While the database contains metadata that present a particular phylogeny of animals, holozoans and other eukaryotes, no particular claims or hypotheses are made by the author(s). However, in the future efforts will be made to name clades officially, once they are more firmly established.

Please report any problems or suggestions to Lukasz Sobala: lukasz.sobala (at) hirszfeld.pl.

Acknowledgements:

Andrew E. Allen Lab for creating the original PhyloDB.

Daniel Richter et al. for creating EukProt and keeping it updated.

Members of the Multicellgenome Lab, especially Michelle Leger (for donating her database), for the bioinformatics support and for doing great science.

All the authors of the original data.

National Science Centre of Poland for funding of the project 2020/36/C/NZ8/00081, "The role of glycosylation in the emergence of animal multicellularity", which enabled the creation of this database.
T4-like genome database
neuinfo.org
rrid.site
Updated Jan 29, 2022
Cite
(2022). T4-like genome database [Dataset]. http://identifiers.org/RRID:SCR_005367
Unique identifier
https://identifiers.org/RRID:SCR_005367
Dataset updated
Jan 29, 2022
Description
THIS RESOURCE IS NO LONGER IN SERVICE, documented August 22, 2016. A database of information on bacterial phages. It contains multiple phage genomes, which users can search with BLAST and MegaBLAST, and also hosts a Phage Forum in which users can discuss phage data. Interactive browsing of completed phage genomes is available using the program. The browser allows users to scan the genome for particular features and to download sequence information plus analyses of those features. Views of the genome are generated showing named genes, BLAST similarities to other phages, predicted tRNAs, and other sequence features.
Data from: CottonGen Breeding Information Management System (BIMS)
agdatacommons.nal.usda.gov
catalog.data.gov
bin
Updated Feb 13, 2024
Cite
Taein Lee; Sook Jung; Ksenija Gasic; Todd Campbell; Jing Yu; Jodi Humann; Heidi Hough; Dorrie Main (2024). CottonGen Breeding Information Management System (BIMS) [Dataset]. https://agdatacommons.nal.usda.gov/articles/dataset/CottonGen_Breeding_Information_Management_System_BIMS_/24853209
Available download formats: bin
Dataset updated
Feb 13, 2024
Dataset provided by
MainLab, Washington State University
Authors
Taein Lee; Sook Jung; Ksenija Gasic; Todd Campbell; Jing Yu; Jodi Humann; Heidi Hough; Dorrie Main
License
U.S. Government Works (https://www.usa.gov/government-works)
License information was derived automatically
Description
BIMS, the Breeding Information Management System, is a secure and comprehensive online breeding management system developed for the generic Tripal Database Platform which allows breeders to store, manage, archive and analyze their private breeding program. Breeders can load data in the templates provided, as well as output from the Field Book App, an Android app for collecting phenotype data. In addition to the private breeders' BIMS, users without accounts can also view the publicly available breeding data. The fully developed version will allow users to:

Fully integrate their data with publicly available genomic, genetic and breeding data in the community database.

Utilize their integrated pedigree, phenotype and genotype data in performing genomic analysis and making breeding decisions.

Use new open-source genomics tools and breeding decision tools with seamless access to HPC.

Resources in this dataset:

Resource Title: Website Pointer for CottonGen BIMS (Breeding Information Management System). File Name: Web Page, url: https://www.cottongen.org/bims
Data from: EukProt: a database of genome-scale predicted proteins across the...
figshare.com
datasetcatalog.nlm.nih.gov
bin
Updated May 31, 2023
Cite
Daniel Richter; Cédric Berney; Jürgen Strassert; Yu-Ping Poh; Emily K. Herman; Sergio A. Muñoz-Gómez; Jeremy G. Wideman; Fabien Burki; Colomban de Vargas (2023). EukProt: a database of genome-scale predicted proteins across the diversity of eukaryotes [Dataset]. http://doi.org/10.6084/m9.figshare.12417881.v3
Available download formats: bin
Unique identifier
https://doi.org/10.6084/m9.figshare.12417881.v3
Dataset updated
May 31, 2023
Dataset provided by
Figshare (http://figshare.com/)
Authors
Daniel Richter; Cédric Berney; Jürgen Strassert; Yu-Ping Poh; Emily K. Herman; Sergio A. Muñoz-Gómez; Jeremy G. Wideman; Fabien Burki; Colomban de Vargas
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Version 3 (22 November, 2021)

See https://doi.org/10.24072/pcjournal.173 for a detailed description of the database. See http://evocellbio.com/eukprot/ for a BLAST database, interactive plots of BUSCO scores and ‘The Comparative Set’ (TCS): A selected subset of EukProt for comparative genomics investigations. Protein sequence FASTA files of the TCS are available at https://doi.org/10.6084/m9.figshare.21586065. See https://github.com/beaplab/EukProt for utility scripts, annotations, and all the files necessary to build the tree in Figures 1 and 3 (from the DOI above).

Scroll to the end of this page for changes since version 2.

Are we missing anything? Please let us know!

EukProt is a database of published and publicly available predicted protein sets selected to represent the breadth of eukaryotic diversity, currently including 993 species from all major supergroups as well as orphan taxa. The goal of the database is to provide a single, convenient resource for gene-based research across the spectrum of eukaryotic life, such as phylogenomics and gene family evolution. Each species is placed within the UniEuk taxonomic framework in order to facilitate downstream analyses, and each data set is associated with a unique, persistent identifier to facilitate comparison and replication among analyses. The database is regularly updated, and all versions will be permanently stored and made available via FigShare. The current version has a number of updates, notably ‘The Comparative Set’ (TCS), a reduced taxonomic set with high estimated completeness while maintaining a substantial phylogenetic breadth, which comprises 196 predicted proteomes. A BLAST web server and graphical displays of data set completeness are available at http://evocellbio.com/eukprot/. We invite the community to provide suggestions for new data sets and new annotation features to be included in subsequent versions, with the goal of building a collaborative resource that will promote research to understand eukaryotic diversity and diversification.

This release contains 5 files:

EukProt_proteins.v03.2021_11_22.tgz: 993 protein data sets, for species with either a genome (375) or single-cell genome (56), a transcriptome (498), a single-cell transcriptome (47), or an EST assembly (17).

EukProt_genome_annotations.v03.2021_11_22.tgz: gene annotations, in GFF format, as produced by EukMetaSanity (https://github.com/cjneely10/EukMetaSanity) for 40 genomes lacking publicly available protein annotations. The proteins predicted from these annotations are included in the proteins file.

EukProt_included_data_sets.v03.2021_11_22.txt and EukProt_not_included_data_sets.v03.2021_11_22.txt: tables of information on data sets either included (993 data sets) or not included (163) in the database. Tab-delimited; multiple entries in the same cell are comma-delimited; missing data is represented with the “N/A” value. With the following columns:

EukProt_ID: the unique identifier associated with the data set. This will not change among versions. If a new data set becomes available for the species, it will be assigned a new unique identifier.

Name_to_Use: the name of the species for protein/genome annotation/assembled transcriptome files.

Strain: the strain(s) of the species sequenced.

Previous_Names: any previous names that this species was known by.

Replaces_EukProt_ID/Replaced_by_EukProt_ID: if the data set changes with respect to an earlier version, the EukProt ID of the data set that it replaces (in the included table) or that it is replaced by (in the not_included table).

Genus_UniEuk, Epithet_UniEuk, Supergroup_UniEuk, Taxogroup1_UniEuk, Taxogroup2_UniEuk: taxonomic identifiers at different levels of the UniEuk taxonomy (Berney et al. 2017, DOI: 10.1111/jeu.12414, based on Adl et al. 2019, DOI: 10.1111/jeu.12691).

Taxonomy_UniEuk: the full lineage of the species in the UniEuk taxonomy (semicolon-delimited).

Merged_Strains: whether multiple strains of the same species were merged to create the data set.

Data_Source_URL: the URL(s) from which the data were downloaded.

Data_Source_Name: the name of the data set (as assigned by the data source).

Paper_DOI: the DOI(s) of the paper(s) that published the data set.

Actions_Prior_to_Use: the action(s) that were taken to process the publicly available files in order to produce the data set in this database. Actions taken (see our manuscript for more details):

‘assemble mRNA’: Trinity v. 2.8.4, http://trinityrnaseq.github.io/

‘CD-HIT’: v. 4.6, http://weizhongli-lab.org/cd-hit/

‘extractfeat’, ‘seqret’, ‘transeq’, ‘trimseq’: from EMBOSS package v. 6.6.0.0, http://emboss.sourceforge.net/

‘translate mRNA’: Transdecoder v. 5.3.0, http://transdecoder.github.io/

‘gffread’: v. 0.12.3, https://github.com/gpertea/gffread

‘predict genes’: EukMetaSanity, https://github.com/cjneely10/EukMetaSanity (cloned on 21 September, 2021)

All parameter values were default, unless otherwise specified.

Data_Source_Type: the type of the source data (possible types: EST, transcriptome, single-cell transcriptome, genome, single-cell genome).

Notes: additional information on the data set (including why it is replaced by/is replacing another data set, or why it was not included).

Columns_Modified_Since_Previous_Version: column(s) in this file modified for the data set since the previous release. Not listed: modifications to the Notes column or to new columns added in this version.

Alternative_Strain_Names: non-exhaustive list of alternative names for the sequenced strain for this data set.

18S_Sequence_GenBank_ID: GenBank identifier for the strain sequenced in the data set. When multiple strains were sequenced, identifiers are separated with a comma, in the same order as the Strain column. Ranges of identifiers for the same strain are separated by a hyphen. ‘N/A’ indicates either that there is no GenBank sequence for the strain or that none of the available sequences is full-length (< 1,500 bp).

18S_Sequence: 18S for the strain derived from publicly available sequences associated with the data set, in the case where a GenBank sequence is not available.

18S_Sequence_Source: the source for the sequence in the 18S_Sequence column, if any.

18S_Sequence_Other_Strain_GenBank_ID: GenBank identifier for 18S sequence(s) from other strains of the same species as the data set.

18S_Sequence_Other_Strain_Name: strain name(s) for the sequences in the 18S_Sequence_Other_Strain_GenBank_ID column.

18S_and_Taxonomy_Notes: additional information on the values in the 18S_Sequence columns.
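
The table layout described above (tab-delimited, comma-delimited multi-value cells, “N/A” for missing data, semicolon-delimited Taxonomy_UniEuk) can be read with a few lines of Python. This is a minimal sketch, not code from the EukProt authors: the choice of which columns to split on commas (COMMA_COLUMNS) is an assumption made for illustration, and the example record is made up.

```python
import csv
import io

MISSING = "N/A"

# Assumption: columns treated as comma-delimited lists in this sketch.
# The source only says that multiple entries in a cell are comma-delimited.
COMMA_COLUMNS = {"Strain", "Data_Source_URL", "Paper_DOI"}


def parse_value(column, value):
    """Convert one raw cell into None, a list, or a plain string."""
    if value is None or value.strip() == MISSING:
        return None
    value = value.strip()
    if column == "Taxonomy_UniEuk":
        # The full lineage is semicolon-delimited.
        return [part.strip() for part in value.split(";")]
    if column in COMMA_COLUMNS:
        return [part.strip() for part in value.split(",")]
    return value


def read_metadata(handle):
    """Read a tab-delimited metadata table into a list of dicts."""
    # QUOTE_NONE keeps quote characters in taxonomy names as literal text.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [
        {column: parse_value(column, value) for column, value in row.items()}
        for row in reader
    ]


# Made-up record for illustration only (not a real EukProt entry).
example = (
    "EukProt_ID\tName_to_Use\tStrain\tPaper_DOI\tTaxonomy_UniEuk\n"
    "EP99991\tGenus_species\tA1, B2\tN/A\tEukaryota;Examplea;Genus\n"
)
rows = read_metadata(io.StringIO(example))
print(rows[0]["Strain"])
```

To read a real table, pass an open file handle for one of the .txt files instead of the in-memory example.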

Changes since version 2

There are 324 new data sets included. 57 of these replace data sets from version 2.

40 newly published data sets were added to the list that are not included in the database (annotated in the Notes column with the reasons they were not included).

Instead of unannotated genomes (for published genomes lacking protein predictions), we now include predicted proteins and gene annotations (in GFF3 format).

All sequences within each file are now assigned a standardized, unique identifier based on the data set’s EukProt_ID and on the type of data (protein or transcriptome). Illegal characters are removed from sequences.

In the UniEuk_Taxonomy field, single quotes are now used instead of double quotes, to be consistent with other UniEuk databases (EukMap, EukRibo).

Changes to metadata of individual data sets (in the included and not_included tables) with respect to the previous version are now listed in the Columns_Modified_Since_Previous_Version column.

The Taxogroup_UniEuk column has been split into the Taxogroup1_UniEuk and Taxogroup2_UniEuk columns. This resulted in the Supergroup_UniEuk column changing for Opisthokonta.

In addition, the following new columns have been added (see our manuscript for details): Alternative_Strain_Names, 18S_Sequence_GenBank_ID, 18S_Sequence, 18S_Sequence_Source, 18S_Sequence_Other_Strain_GenBank_ID, 18S_Sequence_Other_Strain_Name, 18S_and_Taxonomy_Notes.

EukProt_assembled_transcriptomes.v03.2021_11_22.tgz: assembled transcriptome contigs, for 126 species with publicly available mRNA sequence reads but no publicly available assembly. The proteins predicted from these assemblies are included in the proteins file.

Sequence names in the proteins and transcriptomes files have standardized, unique identifiers with the following format:

[EukProt ID]_[Name_to_Use]_[Type abbreviation][Counter] [Previous header contents]

Type abbreviations are P (protein) and T (transcriptome).

All characters not in the following list are removed from nucleic acid sequences: ACGTNUKSYMWRBDHV

All characters not in the following list are removed from protein sequences: ABCDEFGHIKLMNPQRSTUVWYZX*

Lists of legal characters are from: https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=BlastHelp
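
The header format and the legal-character lists above can be illustrated with a short Python sketch. It is an illustration under stated assumptions, not the EukProt pipeline: the regular expression assumes the EukProt ID itself contains no underscore and that the identifier ends at the first whitespace, the filter is applied case-sensitively exactly as the lists are written (the source does not say how lowercase letters are handled), and the example header is made up.

```python
import re

# Legal characters, copied from the lists above.
NUCLEOTIDE_LEGAL = set("ACGTNUKSYMWRBDHV")
PROTEIN_LEGAL = set("ABCDEFGHIKLMNPQRSTUVWYZX*")

# [EukProt ID]_[Name_to_Use]_[Type abbreviation][Counter] [Previous header contents]
# Assumption: the ID has no underscore; Name_to_Use may contain underscores.
HEADER_RE = re.compile(r"^([^_\s]+)_(\S+)_([PT])(\d+)(?:\s+(.*))?$")


def parse_header(header):
    """Split a standardized sequence header into its components."""
    match = HEADER_RE.match(header.lstrip(">"))
    if match is None:
        raise ValueError(f"header does not follow the expected format: {header!r}")
    eukprot_id, name, type_abbrev, counter, previous = match.groups()
    return {
        "eukprot_id": eukprot_id,
        "name_to_use": name,
        "type": "protein" if type_abbrev == "P" else "transcriptome",
        "counter": counter,
        "previous_header": previous or "",
    }


def strip_illegal(sequence, legal):
    """Drop every character that is not in the given legal set."""
    return "".join(ch for ch in sequence if ch in legal)


# Made-up header and sequences for illustration only.
parsed = parse_header(">EP99991_Genus_species_P000012 old header text")
clean_protein = strip_illegal("MKT-L.x*", PROTEIN_LEGAL)
clean_nucleotide = strip_illegal("ACGT-N?", NUCLEOTIDE_LEGAL)
print(parsed["name_to_use"], clean_protein, clean_nucleotide)
```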
PaperBLAST December 2018 release
figshare.com
application/gzip
Updated Dec 8, 2018
Morgan Price (2018). PaperBLAST December 2018 release [Dataset]. http://doi.org/10.6084/m9.figshare.7439216.v1
Available download formats: application/gzip
Unique identifier
https://doi.org/10.6084/m9.figshare.7439216.v1
Dataset updated
Dec 8, 2018
Dataset provided by
Figshare (http://figshare.com/)
Authors
Morgan Price
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Source code and data files for PaperBLAST and Curated BLAST for Genomes, as of December 7, 2018. Extracting the tarball will create a PaperBLAST/ directory. The SQLite3 database, FASTA file, and BLAST database are in the data/ subdirectory. The code is in the cgi/, bin/, and lib/ subdirectories. The main CGI scripts are cgi/litSearch.cgi for PaperBLAST and cgi/genomeSearch.cgi for Curated BLAST. For Curated BLAST to fetch genomes from JGI's IMG, it needs a valid username and password, which should be placed on separate lines in the private/.JGI.info file. If you plan to run a public web server, make sure that the contents of private/ are not made accessible.
ActDES – a Curated Actinobacterial Database for Evolutionary Studies -...
figshare.com
txt
Updated Apr 21, 2020
Jana Schniete; Nelly Sélem; Anna Birke; Pablo Cruz-Morales; Iain S. Hunter; Francisco Barona-Gomez; Paul A Hoskisson (2020). ActDES – a Curated Actinobacterial Database for Evolutionary Studies - Extracted sequences for BLAST Database [Dataset]. http://doi.org/10.6084/m9.figshare.12167880.v1
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.12167880.v1
Dataset updated
Apr 21, 2020
Dataset provided by
Figshare (http://figshare.com/)
Authors
Jana Schniete; Nelly Sélem; Anna Birke; Pablo Cruz-Morales; Iain S. Hunter; Francisco Barona-Gomez; Paul A Hoskisson
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
ActDES – a Curated Actinobacterial Database for Evolutionary Studies

Extracted sequences for BLAST database:

a. BLAST nucleotide database (.fasta file)

b. BLAST protein database (.fasta file)
rCRUX Generated Fungal ITS Reference Database
nde-dev.biothings.io
data.niaid.nih.gov
Updated Oct 5, 2023
Emily Curd (2023). rCRUX Generated Fungal ITS Reference Database [Dataset]. https://nde-dev.biothings.io/resources?id=zenodo_7909648
Dataset updated
Oct 5, 2023
Dataset provided by
Ramon Gallego
Luna Gal
Emily Curd
Shaun Nielsen
Zachary Gold
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
rCRUX generated reference database using NCBI nt blast database downloaded in December 2022.

Primer Name: Fungal_ITS gITS7/ITS4
Gene: FITS
Length of Target: 150-330
get_seeds_local() minimum length: 105
get_seeds_local() maximum length: 500
blast_seeds() minimum length: 65
blast_seeds() maximum length: 461
max_to_blast: 100
Forward Sequence (5'-3'): GTGARTCATCGARTCTTTG
Reverse Sequence (5'-3'): TCCTCCGCTTATTGATATGC

References:

White, T. J., Bruns, T., Lee, S., & Taylor, J. (1990). Amplification and direct sequencing of fungal ribosomal RNA genes for phylogenetics. PCR Protocols: A Guide to Methods and Applications, 18(1), 315–322.

Ihrmark, K., Bödeker, I., Cruz-Martinez, K., Friberg, H., Kubartova, A., Schenck, J., Strid, Y., Stenlid, J., Brandström-Durling, M., & Clemmensen, K. E. (2012). New primers to amplify the fungal ITS2 region–evaluation by 454-sequencing of artificial and natural communities. FEMS Microbiology Ecology, 82(3), 666–677. https://doi.org/10.1111/j.1574-6941.2012.01437.x

We chose default rCRUX parameters for get_blast_seeds() of percent coverage of 70, percent identity of 70, evalue 3e+7, and max number of blast alignments = '100000000' and for blast_seeds() of coverage of 70, percent identity of 70, evalue 3e+7, rank of genus, and max number of blast alignments = '10000000'.
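
The forward primer listed above contains the degenerate IUPAC base R (A or G). As a rough illustration of what that means for matching, the Python sketch below expands the primer into its concrete sequences and reverse-complements the reverse primer. This is not how rCRUX itself works (rCRUX is an R package and its internals are not described here); the helper names expand_primer and reverse_complement are hypothetical.

```python
from itertools import product

# Standard IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

# Complement table that also covers the degenerate codes.
COMPLEMENT = str.maketrans("ACGTRYKMSWBVDHN", "TGCAYRMKSWVBHDN")


def expand_primer(primer):
    """List every concrete sequence a degenerate primer can represent."""
    options = [IUPAC[base] for base in primer]
    return ["".join(combo) for combo in product(*options)]


def reverse_complement(sequence):
    """Return the reverse complement, keeping IUPAC codes consistent."""
    return sequence.translate(COMPLEMENT)[::-1]


forward = "GTGARTCATCGARTCTTTG"   # forward primer from the entry above
reverse = "TCCTCCGCTTATTGATATGC"  # reverse primer from the entry above

forward_variants = expand_primer(forward)
reverse_rc = reverse_complement(reverse)
print(len(forward_variants), reverse_rc)
```

With two R positions, the forward primer stands for four concrete sequences; the reverse complement of the reverse primer is the form that would appear on the same strand as the forward primer.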
Arabidopsis Nucleolar Protein Database
neuinfo.org
scicrunch.org
+2 more
Updated Jan 29, 2022
(2022). Arabidopsis Nucleolar Protein Database [Dataset]. http://identifiers.org/RRID:SCR_001793
Unique identifier
https://identifiers.org/RRID:SCR_001793
Dataset updated
Jan 29, 2022
Description
Database of proteins found in the nucleoli of Arabidopsis, identified through proteomic analysis. The Arabidopsis Nucleolar Protein database (AtNoPDB) provides information on the plant proteins in comparison to human and yeast proteins, and images of cellular localizations for over a third of the proteins. A proteomic analysis was carried out of nucleoli purified from Arabidopsis cell cultures and to date 217 proteins have been identified. Many proteins were known nucleolar proteins or proteins involved in ribosome biogenesis. Some proteins, such as spliceosomal and snRNP proteins, and translation factors, were unexpected. In addition, proteins of unknown function which were either plant-specific or conserved between human and plant, and proteins with differential localizations were identified.
This table indicates all the BLAST results from comparing the top 100 most...
datasetcatalog.nlm.nih.gov
plos.figshare.com
Updated Feb 25, 2021
Ruiz-Mesía, Lastenia; Prugnolle, Franck; Zakeri, Sedigheh; Papenfuss, Anthony T.; Chan, Yao-ban; Rask, Thomas S.; Tonkin-Hill, Gerry; Duffy, Michael F.; Ruybal-Pesántez, Shazia; Day, Karen P.; Branch, OraLee H.; Rougeron, Virginie; Pumpaibool, Tepanata; Tiedje, Kathryn E.; Harnyuttanakorn, Pongchai (2021). This table indicates all the BLAST results from comparing the top 100 most conserved DBLα sequences against the NCBI database. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000897797
Dataset updated
Feb 25, 2021
Authors
Ruiz-Mesía, Lastenia; Prugnolle, Franck; Zakeri, Sedigheh; Papenfuss, Anthony T.; Chan, Yao-ban; Rask, Thomas S.; Tonkin-Hill, Gerry; Duffy, Michael F.; Ruybal-Pesántez, Shazia; Day, Karen P.; Branch, OraLee H.; Rougeron, Virginie; Pumpaibool, Tepanata; Tiedje, Kathryn E.; Harnyuttanakorn, Pongchai
Description
The columns represent in order: the sequence identifier (ID), the sequence identifier from the NCBI database (NCBI_ID); the pairwise sequence identity of the match (pwid); the length of the match (length); the number of mismatches (n_mismatches); the number of gap openings (n_gap); the start of the alignment in the query (q_start); the end of the alignment in the query (q_end); the start of the alignment on the NCBI matched sequence (t_start); the end of the alignment on the NCBI matched sequence (t_end); the BLAST e-value (evalue); and the BLAST bit score (score). (CSV)
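
The column order described above maps directly onto a typed record. The following Python sketch reads such a CSV and filters hits by e-value. It is a sketch under assumptions: the column names are the abbreviations given in the description, the presence of a header row is an assumption controlled by the has_header parameter, and the example rows are made up.

```python
import csv
import io

# Column order as given in the description above.
COLUMNS = [
    "ID", "NCBI_ID", "pwid", "length", "n_mismatches", "n_gap",
    "q_start", "q_end", "t_start", "t_end", "evalue", "score",
]
INT_COLUMNS = ["length", "n_mismatches", "n_gap",
               "q_start", "q_end", "t_start", "t_end"]
FLOAT_COLUMNS = ["pwid", "evalue", "score"]


def read_hits(handle, has_header=True):
    """Read BLAST hits from CSV into dicts with numeric fields converted."""
    reader = csv.reader(handle)
    if has_header:
        next(reader, None)  # assumption: the first row holds column names
    hits = []
    for fields in reader:
        if not fields:
            continue  # skip blank lines
        record = dict(zip(COLUMNS, fields))
        for name in INT_COLUMNS:
            record[name] = int(record[name])
        for name in FLOAT_COLUMNS:
            record[name] = float(record[name])
        hits.append(record)
    return hits


# Made-up rows for illustration only.
example = (
    "ID,NCBI_ID,pwid,length,n_mismatches,n_gap,"
    "q_start,q_end,t_start,t_end,evalue,score\n"
    "seq1,XX000001.1,98.5,400,6,0,1,400,11,410,1e-150,720\n"
    "seq2,XX000002.1,71.2,120,30,2,5,124,40,159,0.002,45.1\n"
)
hits = read_hits(io.StringIO(example))
strong_hits = [hit for hit in hits if hit["evalue"] <= 1e-10]
print([hit["ID"] for hit in strong_hits])
```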
Data from: DBSubLoc - Database of protein Subcellular Localization
uri.interlex.org
DBSubLoc - Database of protein Subcellular Localization [Dataset]. http://identifiers.org/RRID:SCR_002339
Unique identifier
https://identifiers.org/RRID:SCR_002339
Description
A database of protein subcellular localization containing proteins from the primary protein databases SWISS-PROT and PIR. Subcellular localization annotations were collected, then classified and categorized by cross-references to taxonomies and the Gene Ontology database. Annotations were taken from primary protein databases, model organism genome projects and literature texts, and were then analyzed to extract the subcellular localization features of the proteins. The proteins are also classified into different categories. Based on sequence alignment, nonredundant subsets of the database have been built, which may provide useful information for subcellular localization prediction. The database now contains more than 60,000 protein sequences, including 30,000 protein sequences in the nonredundant data sets. Online download, a SOAP server, BLAST tools and prediction services are also available.

