Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Abstract Public databases are essential to the development of multi-omics resources. The amount of data created by biological technologies needs a systematic and organized form of storage, that can quickly be accessed, and managed. This is the objective of a biological database. Here, we present an overview of human databases with web applications. The databases and tools allow the search of biological sequences, genes and genomes, gene expression patterns, epigenetic variation, protein-protein interactions, variant frequency, regulatory elements, and comparative analysis between human and model organisms. Our goal is to provide an opportunity for exploring large datasets and analyzing the data for users with little or no programming skills. Public user-friendly web-based databases facilitate data mining and the search for information applicable to healthcare professionals. Besides, biological databases are essential to improve biomedical search sensitivity and efficiency and merge multiple datasets needed to share data and build global initiatives for the diagnosis, prognosis, and discovery of new treatments for genetic diseases. To show the databases at work, we present a a case study using ACE2 as example of a gene to be investigated. The analysis and the complete list of databases is available in the following website .
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
PROSITE is a database of protein families and domains. It consists of biologically significant sites, patterns and profiles that help to reliably identify to which known protein family a new sequence belongs. PROSITE is based at the Swiss Institute of Bioinformatics (SIB), Geneva, Switzerland.
THIS RESOURCE IS NO LONGER IN SERVICE, documented on 8/12/13. An expanded version of the Alternative Splicing Annotation Project (ASAP) database with a new interface and integration of comparative features using UCSC BLASTZ multiple alignments. It supports 9 vertebrate species, 4 insects, and nematodes, and provides with extensive alternative splicing analysis and their splicing variants. As for human alternative splicing data, newly added EST libraries were classified and included into previous tissue and cancer classification, and lists of tissue and cancer (normal) specific alternatively spliced genes are re-calculated and updated. They have created a novel orthologous exon and intron databases and their splice variants based on multiple alignment among several species. These orthologous exon and intron database can give more comprehensive homologous gene information than protein similarity based method. Furthermore, splice junction and exon identity among species can be valuable resources to elucidate species-specific genes. ASAP II database can be easily integrated with pygr (unpublished, the Python Graph Database Framework for Bioinformatics) and its powerful features such as graph query, multi-genome alignment query and etc. ASAP II can be searched by several different criteria such as gene symbol, gene name and ID (UniGene, GenBank etc.). The web interface provides 7 different kinds of views: (I) user query, UniGene annotation, orthologous genes and genome browsers; (II) genome alignment; (III) exons and orthologous exons; (IV) introns and orthologous introns; (V) alternative splicing; (IV) isoform and protein sequences; (VII) tissue and cancer vs. normal specificity. ASAP II shows genome alignments of isoforms, exons, and introns in UCSC-like genome browser. All alternative splicing relationships with supporting evidence information, types of alternative splicing patterns, and inclusion rate for skipped exons are listed in separate tables. Users can also search human data for tissue- and cancer-specific splice forms at the bottom of the gene summary page. The p-values for tissue-specificity as log-odds (LOD) scores, and highlight the results for LOD >= 3 and at least 3 EST sequences are all also reported.
Bio Resource for array genes is a free online resource for easy access to collective and integrated information from various public biological resources for human, mouse, rat, fly and c. elegans genes. The resource includes information about the genes that are represented in Unigene clusters. This resource provides interactive tools to selectively view, analyze and interpret gene expression patterns against the background of gene and protein functional information. Different query options are provided to mine the biological relationships represented in the underlying database. Search button will take you to the list of query tools available. This Bio resource is a platform designed as an online resource to assist researchers in analyzing results of microarray experiments and developing a biological interpretation of the results. This site is mainly to interpret the unique gene expression patterns found as biological changes that can lead to new diagnostic procedures and drug targets. This interactive site allows users to selectively view a variety of information about gene functions that is stored in an underlying database. Although there are other online resources that provide a comprehensive annotation and summary of genes, this resource differs from these by further enabling researchers to mine biological relationships amongst the genes captured in the database using new query tools. Thus providing a unique way of interpreting the microarray data results based on the knowledge provided for the cellular roles of genes and proteins. A total of six different query tools are provided and each offer different search features, analysis options and different forms of display and visualization of data. The data is collected in relational database from public resources: Unigene, Locus link, OMIM, NCBI dbEST, protein domains from NCBI CDD, Gene Ontology, Pathways (Kegg, Genmapp and Biocarta) and BIND (Protein interactions). Data is dynamically collected and compiled twice a week from public databases. Search options offer capability to organize and cluster genes based on their Interactions in biological pathways, their association with Gene Ontology terms, Tissue/organ specific expression or any other user-chosen functional grouping of genes. A color coding scheme is used to highlight differential gene expression patterns against a background of gene functional information. Concept hierarchies (Anatomy and Diseases) of MESH (Medical Subject Heading) terms are used to organize and display the data related to Tissue specific expression and Diseases. Sponsors: BioRag database is maintained by the Bioinformatics group at Arizona Cancer Center. The material presented here is compiled from different public databases. BioRag is hosted by the Biotechnology Computing Facility of the University of Arizona. 2002,2003 University of Arizona.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
SFLD (Structure-Function Linkage Database) is a hierarchical classification of enzymes that relates specific sequence-structure features to specific chemical capabilities.
Bioinformatics resource system including web server and web service for functional annotation and enrichment analyses of gene lists. Consists of comprehensive knowledgebase and set of functional analysis tools. Includes gene centered database integrating heterogeneous gene annotation resources to facilitate high throughput gene functional analysis.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The CATH-Gene3D database describes protein families and domain architectures in complete genomes. Protein families are formed using a Markov clustering algorithm, followed by multi-linkage clustering according to sequence identity. Mapping of predicted structure and sequence domains is undertaken using hidden Markov models libraries representing CATH and Pfam domains. CATH-Gene3D is based at University College, London, UK.
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The organizations that contribute to the longevity of 67 long-lived molecular biology databases published in Nucleic Acids Research (NAR) between 1991-2016 were identified to address two research questions 1) which organizations fund these databases? and 2) which organizations maintain these databases? Funders were determined by examining funding acknowledgements in each database's most recent NAR Database Issue update article published (prior to 2017) and organizations operating the databases were determine through review of database websites.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Agricultural crop breeding programs, particularly at the national level, typically consist of a core panel of elite breeding cultivars alongside a number of local landrace varieties (or other endemic cultivars) that provide additional sources of phenotypic and genomic variation or contribute as experimental materials (e.g., in GWAS studies). Three issues commonly arise. First, focusing primarily on core development accessions may mean that the potential contributions of landraces or other secondary accessions may be overlooked. Second, elite cultivars may accumulate deleterious alleles away from nontarget loci due to the strong effects of artificial selection. Finally, a tendency to focus solely on SNP-based methods may cause incomplete or erroneous identification of functional variants. In practice, integration of local breeding programs with findings from global database projects may be challenging. First, local GWAS experiments may only indicate useful functional variants according to the diversity of the experimental panel, while other potentially useful loci—identifiable at a global level—may remain undiscovered. Second, large-scale experiments such as GWAS may prove prohibitively costly or logistically challenging for some agencies. Here, we present a fully automated bioinformatics pipeline (riceExplorer) that can easily integrate local breeding program sequence data with international database resources, without relying on any phenotypic experimental procedure. It identifies associated functional haplotypes that may prove more robust in determining the genotypic determinants of desirable crop phenotypes. In brief, riceExplorer evaluates a global crop database (IRRI 3000 Rice Genomes) to identify haplotypes that are associated with extreme phenotypic variation at the global level and recorded in the database. It then examines which potentially useful variants are present in the local crop panel, before distinguishing between those that are already incorporated into the elite breeding accessions and those only found among secondary varieties (e.g., landraces). Results highlight the effectiveness of our pipeline, identifying potentially useful functional haplotypes across the genome that are absent from elite cultivars and found among landraces and other secondary varieties in our breeding program. riceExplorer can automatically conduct a full genome analysis and produces annotated graphical output of chromosomal maps, potential global diversity sources, and summary tables.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The current database was downloaded on 27.09.2019 and has the data fields (columns) as described below:# 1 Entry# 2 Entry name# 3 Status# 4 Protein names# 5 Gene names# 6 Organism# 7 Length# 8 Cross-reference (KO)# 9 Taxonomic lineage (PHYLUM)# 10 Taxonomic lineage (SPECIES) # This field carries current and old* taxonomic classifications.# 11 Taxonomic lineage (GENUS)# 12 Taxonomic lineage (KINGDOM)# 13 Taxonomic lineage (SUPERKINGDOM)# 14 Cross-reference (OrthoDB)# 15 Cross-reference (eggNOG)*Details about the classification used in UNIPROT can be found at the link: https://www.uniprot.org/help/taxonomy
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Comparisons of the accuracies (Acc), sensitivities (Sn) and positive predictive values (PPV) of FSA and other alignment methods on the BAliBASE 3 [24] and SABmark 1.65 [25] databases. Probalign has the highest accuracy on the commonly-used BAliBASE 3 dataset and FSA in default mode has superior accuracy on the BAliBASE 3+fp and SABmark 1.65 datasets (note that only FSA and AMAP explicitly attempt to maximize the expected accuracy). FSA has higher positive predictive values than any other program on all datasets and can additionally achieve high sensitivity when run in maximum-sensitivity mode. The BAliBASE 3+fp dataset, which mirrors BAliBASE 3 but includes a single non-homologous sequence in each alignment, was designed to test the robustness of alignment programs to incomplete homology. Traditional alignment programs, designed to maximize sensitivity, suffer greatly-increased mis-alignment when even a single non-homologous sequence is introduced; in contrast, FSA is robust to the non-homologous sequence and has an unchanged positive predictive value. Remarkably, FSA was the only tested program with a mis-alignment rate of
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Knowledge-based databases and the codes for collecting these databases are stored.
Expasy is the bioinformatics resource portal of the SIB Swiss Institute of Bioinformatics.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
SMART (a Simple Modular Architecture Research Tool) allows the identification and annotation of genetically mobile domains and the analysis of domain architectures. SMART is based at EMBL, Heidelberg, Germany.
Supplementary material A.3 for the paper 'IDPredictor: predict database links in biomedical database'. Abstract: Knowledge found in biomedical databases, in particular in Web information systems, is a major bioinformatics resource. In general, this biological knowledge is worldwide represented in a network of databases. These data are spread among thousands of databases, which overlap in content, but differ substantially with respect to content detail, interface, formats and data structure. To support a functional annotation of lab data, such as protein sequences, metabolites or DNA sequences as well as a semi-automated data exploration in information retrieval environments an integrated view to databases is essential. Search engines have the potential of assisting in data retrieval from these structured sources, but fall short of providing a comprehensive knowledge excerpt out of the interlinked databases. A prerequisit for supporting the concept of an integrated data view is the to acquiring insights into cross-references among database entities. But only a fraction of all possible cross-references are explicitely tagged in the particular biomedical informations systems. In this work, we investigate to what extend an automated construction of an integrated data network is possible. We propose a method that predict and extracts cross-references from multiple life science databases and thier possible referenced data targets. We study the retrieval quality of our method and the relationship between manually crafted relevance ranking and relevance ranking based on cross-references, and report on first, promising results.
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset was developed to create a census of sufficiently documented molecular biology databases to answer several preliminary research questions. Articles published in the annual Nucleic Acids Research (NAR) “Database Issues” were used to identify a population of databases for study. Namely, the questions addressed herein include: 1) what is the historical rate of database proliferation versus rate of database attrition?, 2) to what extent do citations indicate persistence?, and 3) are databases under active maintenance and does evidence of maintenance likewise correlate to citation? An overarching goal of this study is to provide the ability to identify subsets of databases for further analysis, both as presented within this study and through subsequent use of this openly released dataset.
MINT focuses on experimentally verified protein-protein interactions mined from the scientific literature by expert curators
Nearly all molecular sequence databases currently use gzip for data compression. Ongoing rapid accumulation of stored data calls for more efficient compression tools. Although numerous compressors exist, both specialized and general-purpose, choosing one of them was difficult because no comprehensive analysis of their comparative advantages for sequence compression was available.We systematically benchmarked 430 settings of 48 compressors (including 29 specialized sequence compressors and 19 general-purpose compressors) on representative FASTA-formatted datasets of DNA, RNA and protein sequences. Each compressor was evaluated on 17 performance measures, including compression strength, as well as time and memory required for compression and decompression. We used 27 test datasets including individual genomes of various sizes, DNA and RNA datasets, and standard protein datasets. We summarized the results as the Sequence Compression Benchmark database (SCB database) that allows building custom visualizations for selected subsets of benchmark results.We found that modern compressors offer a large improvement in compactness and speed compared to gzip. Our benchmark allows comparing compressors and their settings using a variety of performance measures, offering the opportunity to select the optimal compressor based on the data type and usage scenario specific to a particular application.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The 4th German Crop BioGreenformatics Network (GCBN, https://www.denbi.de/gcbn) user training provided a hands-on introduction to useful bioinformatics tools for biologists with little or no previous knowledge. The training enabled biologists to process their own small and large datasets using R and Linux based methods and is entirely computer-based with interspersed lectures. The first part started with an introduction into the use and basic administration (software installation) of the Linux distribution Ubuntu and demonstrated the first steps into the use the R software (trainer Andrea Bräutigam, folder AB). In part two the use of Blast+ in the command line version, of simple Linux commands like 'cut', of Perl scripts and the graphical user interface of the phylogeny tool 'seaview' were demonstrated (trainer Uwe Scholz, folder US). The third session introduced basic concepts and practical tools for processing biological datasets in Linux. In particular, 'awk' and 'sed' were used. Moreover, 'SAMtools' and 'BEDTools' were applied (trainer Martin Mascher, folder MM). In the fourth part 'Introduction to Databases' a quick start guide to use relational databases was presented. By providing easy examples, this lesson set the fundamentals to motivate to use relational database systems as daily bioinformatics tool to store, retrieve and even analyze -omics data in the big data age (trainer Matthias Lange, folder ML). The last part introduced basic statistics to biologist, teach some commonly used statistical methods and demonstrated the creation of graphical visualizations with software R (trainer Yusheng Zhao, folder YZ).
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is intended to hold the databases for Pharokka (https://github.com/gbouras13/pharokka)
It includes the PHROGs database, and mmseqs2 compatible versions of the CARD and VFDB databases.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Abstract Public databases are essential to the development of multi-omics resources. The amount of data created by biological technologies needs a systematic and organized form of storage, that can quickly be accessed, and managed. This is the objective of a biological database. Here, we present an overview of human databases with web applications. The databases and tools allow the search of biological sequences, genes and genomes, gene expression patterns, epigenetic variation, protein-protein interactions, variant frequency, regulatory elements, and comparative analysis between human and model organisms. Our goal is to provide an opportunity for exploring large datasets and analyzing the data for users with little or no programming skills. Public user-friendly web-based databases facilitate data mining and the search for information applicable to healthcare professionals. Besides, biological databases are essential to improve biomedical search sensitivity and efficiency and merge multiple datasets needed to share data and build global initiatives for the diagnosis, prognosis, and discovery of new treatments for genetic diseases. To show the databases at work, we present a a case study using ACE2 as example of a gene to be investigated. The analysis and the complete list of databases is available in the following website .