CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset was developed to create a census of sufficiently documented molecular biology databases to answer several preliminary research questions. Articles published in the annual Nucleic Acids Research (NAR) “Database Issues” were used to identify a population of databases for study. Namely, the questions addressed herein include: 1) what is the historical rate of database proliferation versus rate of database attrition?, 2) to what extent do citations indicate persistence?, and 3) are databases under active maintenance and does evidence of maintenance likewise correlate to citation? An overarching goal of this study is to provide the ability to identify subsets of databases for further analysis, both as presented within this study and through subsequent use of this openly released dataset.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Abstract: Public databases are essential to the development of multi-omics resources. The amount of data created by biological technologies needs a systematic and organized form of storage that can be quickly accessed and managed. This is the objective of a biological database. Here, we present an overview of human databases with web applications. The databases and tools allow the search of biological sequences, genes and genomes, gene expression patterns, epigenetic variation, protein-protein interactions, variant frequency, regulatory elements, and comparative analysis between human and model organisms. Our goal is to provide an opportunity for users with little or no programming skills to explore large datasets and analyze the data. Public, user-friendly, web-based databases facilitate data mining and the search for information applicable to healthcare professionals. In addition, biological databases are essential to improve biomedical search sensitivity and efficiency and to merge the multiple datasets needed to share data and build global initiatives for the diagnosis, prognosis, and discovery of new treatments for genetic diseases. To show the databases at work, we present a case study using ACE2 as an example of a gene to be investigated. The analysis and the complete list of databases are available on the accompanying website.
Database of curated links to molecular resources, tools and databases selected on the basis of recommendations from bioinformatics experts in the field. This resource relies on input from its community of bioinformatics users for suggestions. Since 2003, it has also listed all links contained in the NAR Webserver issue. The portal provides the following types of information:
* Computer Related: This category contains links to resources relating to programming languages often used in bioinformatics. Other tools of the trade, such as web development and database resources, are also included here.
* Sequence Comparison: Tools and resources for the comparison of sequences including sequence similarity searching, alignment tools, and general comparative genomics resources.
* DNA: This category contains links to useful resources for DNA sequence analyses such as tools for comparative sequence analysis and sequence assembly. Links to programs for sequence manipulation, primer design, and sequence retrieval and submission are also listed here.
* Education: Links to information about the techniques, materials, people, places, and events of the greater bioinformatics community. Included are current news headlines, literature sources, educational material and links to bioinformatics courses and workshops.
* Expression: Links to tools for predicting the expression, alternative splicing, and regulation of a gene sequence are found here. This section also contains links to databases, methods, and analysis tools for protein expression, SAGE, EST, and microarray data.
* Human Genome: This section contains links to draft annotations of the human genome in addition to resources for sequence polymorphisms and genomics. Also included are links related to ethical discussions surrounding the study of the human genome.
* Literature: Links to resources related to published literature, including tools to search for articles and through literature abstracts. Additional text mining resources, open access resources, and literature goldmines are also listed.
* Model Organisms: Included in this category are links to resources for various model organisms ranging from mammals to microbes. These include databases and tools for genome scale analyses.
* Other Molecules: Bioinformatics tools related to molecules other than DNA, RNA, and protein. This category includes resources for the bioinformatics of small molecules as well as for other biopolymers including carbohydrates and metabolites.
* Protein: This category contains links to useful resources for protein sequence and structure analyses. Resources for phylogenetic analyses, prediction of protein features, and analyses of interactions are also found here.
* RNA: Resources include links to sequence retrieval programs, structure prediction and visualization tools, motif search programs, and information on various functional RNAs.
THIS RESOURCE IS NO LONGER IN SERVICE, documented on 8/12/13. An expanded version of the Alternative Splicing Annotation Project (ASAP) database with a new interface and integration of comparative features using UCSC BLASTZ multiple alignments. It supports 9 vertebrate species, 4 insects, and nematodes, and provides extensive alternative splicing analysis together with the corresponding splice variants. For human alternative splicing data, newly added EST libraries were classified and incorporated into the previous tissue and cancer classification, and the lists of tissue-specific and cancer (vs. normal) specific alternatively spliced genes were recalculated and updated. The authors created novel orthologous exon and intron databases, with their splice variants, based on multiple alignments among several species. These orthologous exon and intron databases can give more comprehensive homologous gene information than protein-similarity-based methods. Furthermore, splice junction and exon identity among species can be valuable resources to elucidate species-specific genes. The ASAP II database can be easily integrated with pygr (unpublished; the Python Graph Database Framework for Bioinformatics) and its features such as graph query and multi-genome alignment query. ASAP II can be searched by several different criteria such as gene symbol, gene name and ID (UniGene, GenBank, etc.). The web interface provides 7 different kinds of views: (I) user query, UniGene annotation, orthologous genes and genome browsers; (II) genome alignment; (III) exons and orthologous exons; (IV) introns and orthologous introns; (V) alternative splicing; (VI) isoform and protein sequences; (VII) tissue and cancer vs. normal specificity. ASAP II shows genome alignments of isoforms, exons, and introns in a UCSC-like genome browser. All alternative splicing relationships with supporting evidence information, types of alternative splicing patterns, and inclusion rate for skipped exons are listed in separate tables.
Users can also search human data for tissue- and cancer-specific splice forms at the bottom of the gene summary page. P-values for tissue specificity are reported as log-odds (LOD) scores, and results with LOD >= 3 and at least 3 EST sequences are highlighted.
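The highlighting rule described here (LOD >= 3 and at least 3 EST sequences) amounts to a two-condition filter. The following is a minimal sketch of that rule only; the record type `SpliceFormScore`, its field names, and the function `highlighted` are hypothetical and do not reflect the ASAP II schema or code.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class SpliceFormScore:
    """Hypothetical record for one tissue-specificity result."""
    gene: str
    tissue: str
    lod: float       # log-odds score for tissue specificity
    est_count: int   # number of supporting EST sequences


def highlighted(rows: Iterable[SpliceFormScore],
                min_lod: float = 3.0,
                min_ests: int = 3) -> List[SpliceFormScore]:
    """Keep rows that meet both thresholds quoted in the description."""
    return [r for r in rows if r.lod >= min_lod and r.est_count >= min_ests]
```

Both thresholds are inclusive, matching the ">=" and "at least" wording in the text.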
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Agricultural crop breeding programs, particularly at the national level, typically consist of a core panel of elite breeding cultivars alongside a number of local landrace varieties (or other endemic cultivars) that provide additional sources of phenotypic and genomic variation or contribute as experimental materials (e.g., in GWAS studies). Three issues commonly arise. First, focusing primarily on core development accessions may mean that the potential contributions of landraces or other secondary accessions may be overlooked. Second, elite cultivars may accumulate deleterious alleles away from nontarget loci due to the strong effects of artificial selection. Finally, a tendency to focus solely on SNP-based methods may cause incomplete or erroneous identification of functional variants. In practice, integration of local breeding programs with findings from global database projects may be challenging. First, local GWAS experiments may only indicate useful functional variants according to the diversity of the experimental panel, while other potentially useful loci—identifiable at a global level—may remain undiscovered. Second, large-scale experiments such as GWAS may prove prohibitively costly or logistically challenging for some agencies. Here, we present a fully automated bioinformatics pipeline (riceExplorer) that can easily integrate local breeding program sequence data with international database resources, without relying on any phenotypic experimental procedure. It identifies associated functional haplotypes that may prove more robust in determining the genotypic determinants of desirable crop phenotypes. In brief, riceExplorer evaluates a global crop database (IRRI 3000 Rice Genomes) to identify haplotypes that are associated with extreme phenotypic variation at the global level and recorded in the database. 
It then examines which potentially useful variants are present in the local crop panel, before distinguishing between those that are already incorporated into the elite breeding accessions and those only found among secondary varieties (e.g., landraces). Results highlight the effectiveness of our pipeline, identifying potentially useful functional haplotypes across the genome that are absent from elite cultivars and found among landraces and other secondary varieties in our breeding program. riceExplorer can automatically conduct a full genome analysis and produces annotated graphical output of chromosomal maps, potential global diversity sources, and summary tables.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The organizations that contribute to the longevity of 67 long-lived molecular biology databases published in Nucleic Acids Research (NAR) between 1991 and 2016 were identified to address two research questions: 1) which organizations fund these databases? and 2) which organizations maintain these databases? Funders were determined by examining funding acknowledgements in each database's most recent NAR Database Issue update article (published prior to 2017), and the organizations operating the databases were determined through review of database websites.
ODC Public Domain Dedication and Licence (PDDL) v1.0: http://www.opendatacommons.org/licenses/pddl/1.0/
License information was derived automatically
This FASTA file is the NCBI Nt (Nucleotide) database (public domain) used for holistic metagenomic screening of ancient DNA data at the Department of Archaeogenetics at the Max Planck Institute for the Science of Human History. We offer here the FASTA file used to construct MALT databases (https://uni-tuebingen.de/fakultaeten/mathematisch-naturwissenschaftliche-fakultaet/fachbereiche/informatik/lehrstuehle/algorithms-in-bioinformatics/software/malt/), which are generally too large for uploading. Please see the relevant publications that use the database for MALT database construction commands.
NCBI does not retain older versions of this database, which is why it has been uploaded here. It was downloaded on 2017-10-26 12:39 from: ftp://ftp-trace.ncbi.nih.gov/blast/db/FASTA/nt.gz. The NCBI Nt database is released into the public domain as per https://www.ncbi.nlm.nih.gov/home/about/policies/.
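A gzipped FASTA file of this size is best inspected as a stream rather than loaded into memory. The sketch below is illustrative only: the function names `iter_fasta` and `count_records` are assumptions, it is not part of the deposited record, and it does not invoke MALT.

```python
import gzip
from typing import Iterator, TextIO, Tuple


def iter_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an open FASTA text handle."""
    header = None
    chunks = []
    for raw in handle:
        line = raw.rstrip()
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None and line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def count_records(path: str) -> Tuple[int, int]:
    """Return (number of records, total residues) for a gzipped FASTA file."""
    n_records = 0
    n_residues = 0
    with gzip.open(path, "rt") as handle:
        for _, seq in iter_fasta(handle):
            n_records += 1
            n_residues += len(seq)
    return n_records, n_residues
```

Calling `count_records` on a local copy of the downloaded file would give a quick sanity check before building a database from it.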
Bioinformatics resource system including a web server and web service for functional annotation and enrichment analyses of gene lists. Consists of a comprehensive knowledgebase and a set of functional analysis tools. Includes a gene-centered database integrating heterogeneous gene annotation resources to facilitate high-throughput gene functional analysis. THIS RESOURCE IS NO LONGER IN SERVICE. Documented on September 16, 2025.
Harvester is a Web-based tool that bulk-collects bioinformatic data on human proteins from various databases and prediction servers. It is a meta search engine for gene and protein information. It searches 16 major databases and prediction servers and combines the results on pregenerated HTML pages. In this way Harvester can provide comprehensive gene-protein information from different servers in a convenient and fast manner. As a full-text meta search engine, similar to Google™, Harvester allows screening of the whole genome proteome for current protein functions and predictions in a few seconds. With Harvester it is now possible to compare and check the quality of different database entries and prediction algorithms on a single page. Sponsors: This work has been supported by the BMBF with grants 01GR0101 and 01KW0013.
THIS RESOURCE IS NO LONGER IN SERVICE, documented August 29, 2016. Database containing structural annotations for the proteomes of just under 100 organisms. Using data derived from public databases of translated genomic sequences, representatives from the major branches of Life are included: Prokaryota, Eukaryota and Archaea. The annotations stored in the database may be accessed in a number of ways. The help page provides information on how to access the database. 3D-GENOMICS is now part of a larger project, called e-Protein. The project brings together similar databases at three sites: Imperial College London, University College London and the European Bioinformatics Institute. e-Protein's mission statement is: "To provide a fully automated distributed pipeline for large-scale structural and functional annotation of all major proteomes via the use of cutting-edge computer GRID technologies."
The following databases are incorporated: NRprot, SCOP, ASTRAL, PFAM, Prosite, taxonomy, COG.
The following eukaryotic genomes are incorporated:
* Anopheles gambiae, protein sequences from the mosquito genome
* Arabidopsis thaliana, protein sequences from the Arabidopsis genome
* Caenorhabditis briggsae, protein sequences from the C. briggsae genome
* Caenorhabditis elegans, protein sequences from the worm genome
* Ciona intestinalis, protein sequences from the sea squirt genome
* Danio rerio, protein sequences from the zebrafish genome
* Drosophila melanogaster, protein sequences from the fruitfly genome
* Encephalitozoon cuniculi, protein sequences from the E. cuniculi genome
* Fugu rubripes, protein sequences from the pufferfish genome
* Guillardia theta, protein sequences from the G. theta genome
* Homo sapiens, protein sequences from the human genome
* Mus musculus, protein sequences from the mouse genome
* Neurospora crassa, protein sequences from the N. crassa genome
* Oryza sativa, protein sequences from the rice genome
* Plasmodium falciparum, protein sequences from the P. falciparum genome
* Rattus norvegicus, protein sequences from the rat genome
* Saccharomyces cerevisiae, protein sequences from the yeast genome
* Schizosaccharomyces pombe, protein sequences from the yeast genome
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
List of bioinformatics tools and databases students used.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
"Synthetic protein dataset with sequences, physical properties, and functional classification for machine learning tasks."
This synthetic dataset was created to explore and develop machine learning models in bioinformatics. It contains 20,000 synthetic proteins, each with an amino acid sequence, calculated physicochemical properties, and a functional classification.
While this is a simulated dataset, it was inspired by patterns observed in real protein datasets, such as:
- UniProt: A comprehensive database of protein sequences and annotations.
- Kyte-Doolittle Scale: Calculations of hydrophobicity.
- Biopython: A tool for analyzing biological sequences.
This dataset is ideal for:
- Training classification models for proteins.
- Exploratory analysis of physicochemical properties of proteins.
- Building machine learning pipelines in bioinformatics.
The dataset is divided into two subsets:
- Training: 16,000 samples (proteinas_train.csv).
- Testing: 4,000 samples (proteinas_test.csv).
This dataset was inspired by real bioinformatics challenges and designed to help researchers and developers explore machine learning applications in protein analysis.
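The Kyte-Doolittle scale mentioned above assigns each amino acid a hydropathy value; averaging those values over a sequence gives the GRAVY (grand average of hydropathy) score, one example of a calculated physicochemical property. The sketch below is not the code used to generate this dataset, which is not described here. The function name `gravy` is an assumption for illustration, while the per-residue values are the published Kyte-Doolittle scale.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence: str) -> float:
    """Mean Kyte-Doolittle hydropathy of a protein sequence.

    Positive values indicate a more hydrophobic sequence overall.
    Raises ValueError for empty input or non-standard residues.
    """
    seq = sequence.upper()
    if not seq:
        raise ValueError("empty sequence")
    try:
        return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)
    except KeyError as exc:
        raise ValueError(f"unsupported residue: {exc.args[0]}") from None
```

A value computed this way could be compared against the hydrophobicity column of the CSV files, if one exists; the column names are not given in this description.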
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This record stores knowledge-based databases and the code used to collect them.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This research addresses the pressing issue of antibiotic resistance, a global health challenge that undermines the efficacy of treatments against infectious diseases. Focusing on Pseudomonas aeruginosa—a Gram-negative bacterium known for causing opportunistic infections—this study emphasizes its prioritization by the World Health Organization (WHO) as a critical-level pathogen requiring new therapeutic approaches.
To identify antibiotics associated with P. aeruginosa, the study employed text mining techniques on the SciELO database. The resulting dataset comprises 98 antibiotics, each documented with detailed textual information and referencing data. Additionally, the dataset includes structural files of the antibiotics in several formats suitable for computational modeling and simulations. These formats are Protein Data Bank, Partial Charge & Atom Type (PDBQT), Simplified Molecular Input Line Entry System (SMI), IUPAC International Chemical Identifier (INCHI), Tripos Mol2 (MOL2), Structure-Data File (SDF), Chemical Markup Language (CML), Cartesian Coordinates File (XYZ), Scalable Vector Graphics (SVG), MDL Molfile (MOL) and Protein Data Bank (PDB), with molecular models generated via Open Babel to facilitate advanced studies in drug development and resistance mechanisms.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Databases used for MyCodentifier, a Nextflow pipeline to identify Mycobacterium tuberculosis complex (MTBC) and nontuberculous mycobacteria (NTM) species from next-generation sequencing (NGS) data.
Short description:
The pipeline is built with Nextflow as the workflow manager and runs in a Docker container. It identifies MTBC/NTM species from positive Mycobacterial Growth Indicator Tube (MGIT) cultures. To do so, it uses an hsp65 database for fast identification, coupled with a metagenomic method using Centrifuge for genome-level identification. For TB, it can also identify subspecies. Results are presented in automated PDF and HTML reports.
| Name | Short Description |
| --- | --- |
| 20220726_ref.tar.gz | 7 major mycobacterial genomes as centrifuge classification database, used for reference-based mapping and genotype resistance prediction |
| 20220726_wgs_centrifuge_db_Radboudumc_MB.tar.gz | centrifuge classification database using Tortoli et al 2017 Mycobacterium strains + additional strains |
| genomes.tar.gz | 7 major mycobacterial genomes, annotation and Genbank files. Files are paired with 20220726_ref.tar.gz |
| snpEff.tar.gz | 7 major mycobacterial genomes annotation models for snpEff. |
| Tortoli_etal_hsp65.tar.gz | KMA database of hsp65 gene extractions of the Tortoli et al 2017 Mycobacterium strains. |

Used in the study:
Databases available via ftp://ftp.ccb.jhu.edu/pub/infphilo/centrifuge/data or https://ccb.jhu.edu/software/centrifuge/manual.shtml#custom-database
MyCodentifier GitHub:
https://jordycoolen.github.io/MyCodentifier/
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is the result of experiments conducted using Python and the RDKit library.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This zipped tarball (.tar.gz) contains a pre-built database for Bacannot (https://github.com/fmalmeida/bacannot).
Files are in the naming convention YEAR_MONTH_DAY.
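Because the file names encode a release date, the most recent build can be picked programmatically. The sketch below assumes only the YEAR_MONTH_DAY convention stated above; the function names `release_date` and `newest` are hypothetical, and any file names used with them here are invented examples rather than actual deposit names.

```python
import re
from datetime import date
from typing import Iterable, Optional

# Four-digit year, then month and day, separated by underscores.
_DATE_RE = re.compile(r"(?<!\d)(\d{4})_(\d{1,2})_(\d{1,2})(?!\d)")


def release_date(filename: str) -> Optional[date]:
    """Extract the YEAR_MONTH_DAY date from a file name, or None."""
    match = _DATE_RE.search(filename)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    try:
        return date(year, month, day)
    except ValueError:
        # Digits matched the pattern but are not a valid calendar date.
        return None


def newest(filenames: Iterable[str]) -> Optional[str]:
    """Return the file name carrying the latest valid date, if any."""
    dated = [(release_date(name), name) for name in filenames]
    dated = [(d, name) for d, name in dated if d is not None]
    return max(dated)[1] if dated else None
```

Parsing into `datetime.date` rather than comparing strings keeps the ordering correct even if months or days are not zero-padded.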
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
expam reference database used for benchmarking and comparison against metagenome profilers.
This is the HQSNP DB (high-quality SNP database) developed by the CHG bioinformatics group. A high-quality SNP is defined as a SNP having allele frequency or genotyping data. The majority of the HQSNPs come from HapMap; others come from JSNP (Japanese SNP database), TSC (The SNP Consortium), Affymetrix 120K SNP, and Perlegen SNP. There are four kinds of SNP search you can do:
* Get SNPs by dbSNP rs#: Choose this search if you have already selected a list of SNPs and you just want to get the SNP information. The program will generate an Excel file containing the SNP flanking sequence, variation, quality, function, etc. In the Excel file, there are 10 highlighted fields. You can send only the highlighted fields to Illumina to get a SNP pre-score. (The same fields are presented in other types of searches as well.)
* Get gene SNPs by gene names: Choose this search if you have a list of gene names and you want to get the SNP information in these genes. The gene name can be an official gene symbol, Ensembl gene ID, RefSeq accession ID, LocusLink number, etc.
* Get gene SNPs by genome regions: Choose this search if you have a list of genome regions and you want to get all gene SNP information in these regions. The software will find all the Ensembl genes in the regions and find SNPs associated with each Ensembl gene.
* Get genome scan SNPs by genome regions: Choose this search if you have a list of genome regions and you want to get evenly spaced SNPs in these regions.
A SNP selection tool (SNPselector) was built upon HQSNP. It took a SNP ID list, gene name list, or genome region list as input and searched SNPs for genome scan or gene association studies. It could take an optional ABI SNP file (exported from the ABI SNP search web page) as input for checking whether the candidate SNP is available from ABI. It could also take an optional Illumina SNP pre-score file as input to select SNPs for the Illumina SNP assay. It generated results sorted by tag SNP in LD block, SNP quality, SNP function, SNP regulatory potential, and SNP mutation risk. SNPselector is now retired from public use (as of September 30, 2010).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data repository contains the mandatory DB for Bakta (db.tar.gz).
Bakta is a tool for the rapid & standardized local annotation of bacterial genomes & plasmids. It provides dbxref-rich and sORF-including annotations in machine-readable JSON & bioinformatics standard file formats for automatic downstream analysis: https://github.com/oschwengers/bakta
This db provides protein sequence hash digests and lengths of UniProt's UniRef100/UniRef90 clusters for ultra-fast identification & lookups. It has been pre-annotated with several specialized db and enriched with Dbxrefs. All conducted pre-annotations are logged and provided in the db.log.gz file.
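The idea of identifying proteins through sequence hash digests and lengths can be illustrated with a small in-memory index. This is a sketch under stated assumptions, not Bakta's implementation: the choice of MD5, the dictionary layout, the function names, and the cluster identifiers used with it are all hypothetical.

```python
import hashlib
from typing import Dict, Optional, Tuple


def digest(sequence: str) -> str:
    """Hex digest of an upper-cased protein sequence (MD5 assumed here)."""
    return hashlib.md5(sequence.upper().encode("ascii")).hexdigest()


def build_index(clusters: Dict[str, str]) -> Dict[str, Tuple[int, str]]:
    """Map digest -> (sequence length, cluster id).

    `clusters` maps a cluster id to its representative sequence.
    """
    return {digest(seq): (len(seq), cid) for cid, seq in clusters.items()}


def lookup(index: Dict[str, Tuple[int, str]], sequence: str) -> Optional[str]:
    """Return the cluster id for an exact sequence match, else None."""
    hit = index.get(digest(sequence))
    if hit is None:
        return None
    length, cluster_id = hit
    # The stored length is a cheap extra check against digest collisions.
    return cluster_id if length == len(sequence) else None
```

Storing only a fixed-size digest and a length per sequence keeps the index compact and makes exact-match lookups a constant-time dictionary access, which is the kind of fast identification the description refers to.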
External DB versions: