Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is the result of experiments conducted using Python and the RDKit library.

Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
MicroRNAs (miRNAs) are small endogenous RNA molecules that regulate target gene expression at the post-transcriptional level. miRNA activity can in turn be controlled by a recently discovered regulatory mechanism called endogenous target mimicry (eTM). In target mimicry, eTMs bind to the corresponding miRNAs and block their binding to specific transcripts, which increases mRNA expression. Because miRNA-eTM-target-mRNA regulatory modules are involved in a wide range of biological processes, the need for a comprehensive eTM database has grown. Apart from miRSponge, which holds a limited number of Arabidopsis eTMs, no database or repository has yet been developed and released for plant eTMs. Here, we present an online plant eTM database, PeTMbase (http://petmbase.org), with a highly efficient search tool. To establish the repository, eTMs were identified from high-throughput RNA-sequencing data of 11 plant species. Each transcriptome library was first mapped to the corresponding plant genome, and long non-coding RNA (lncRNA) transcripts were then characterized. Additional lncRNAs retrieved from GREENC and PNRD were incorporated into the lncRNA catalog. Using these lncRNA and miRNA sources, a total of 2,728 eTMs were predicted. Our regularly updated database, PeTMbase, provides high-quality information on miRNA:eTM modules and will aid functional genomics studies, particularly those on miRNA regulatory networks.

Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
UniProt accession numbers are listed for homologues of the 29 proteins expressed in the murine loop of Henle structure (data provided by the GUDMAP Consortium via www.gudmap.org), as determined by BLAST (run via the uniprot.org website). The Drosophila proteins in parentheses are homologous to multiple mammalian proteins (n/a = not applicable).

This laboratory module, published on CourseSource, leads introductory biology students in the exploration of a basic set of bioinformatics concepts and tools.

BioCreAtIvE is a community-wide effort (Challenge) for evaluating text mining and information extraction systems applied to the biological domain. It focuses on the comparison of methods and the community assessment of scientific progress, rather than on purely competitive aspects. Constructing suitable gold standard data for training and testing new information extraction systems that handle life science literature is considerably difficult. The data sets derived from the BioCreAtIvE challenge have been examined by biological database curators and domain experts, so they serve as useful resources for developing new applications and for improving existing ones. Two main issues are addressed at BioCreAtIvE, both concerned with the extraction of biologically relevant and useful information from the literature. The first is the detection of biologically significant entities (names), such as gene and protein names, and their association with existing database entries. The second is the detection of entity-fact associations (e.g. protein-functional term associations).

Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A summary of the significantly enriched GO terms from the Ontologizer [28] and GO-Elite [27] analyses, which are relevant to kidney development, using the pre-annotation (2009; Tables S2–S5 in File S1) and post-annotation datasets (2012; Tables S6–S9, in File S1). Terms in italics indicate parent terms where the descendants are indicated directly underneath as follows: > descendant of term above in italics. Rank refers to the position of the term in the results of the enrichment analyses (see Tables S2–S9 in File S1) where significance of the enriched term has a p-value of

CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This is a table of results from comparison of the annotated proteins from nine different mycobacterial species. The goal of its creation was to suggest which proteins are likely to have identical functions between species. This table reports only those protein comparisons with greater than 75% amino acid identity. Each row is a different gene used to search for close matches, each column is the genome used for searching. In parentheses next to the name of each match is the percent identity between the sequences (query vs each match).
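As a rough illustration of how the match cells described above might be read, the following Python sketch splits a cell into a match name and a percent identity and keeps matches above a threshold. The cell layout "name (percent)" and the example names are assumptions for illustration; the actual table may format its cells differently.

```python
import re

# Assumed cell layout: "<match name> (<percent identity>)"; the real table may differ.
CELL_PATTERN = re.compile(r"^\s*(?P<name>.+?)\s*\((?P<identity>\d+(?:\.\d+)?)%?\)\s*$")


def parse_match(cell):
    """Split a cell such as 'geneA (91.2)' into (name, percent identity)."""
    m = CELL_PATTERN.match(cell)
    if m is None:
        return None  # empty cell or unexpected layout
    return m.group("name"), float(m.group("identity"))


def matches_above(cells, threshold=75.0):
    """Return parsed matches whose percent identity exceeds the threshold."""
    parsed = (parse_match(c) for c in cells)
    return [p for p in parsed if p is not None and p[1] > threshold]
```

Cells that do not fit the assumed layout are skipped rather than raising an error, which seems the safer choice when the exact export format is unknown.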

Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
PROSITE is a database of protein families and domains. It consists of biologically significant sites, patterns and profiles that help to reliably identify to which known protein family a new sequence belongs. PROSITE is based at the Swiss Institute of Bioinformatics (SIB), Geneva, Switzerland.

This comparison highlights the key aspects of several research studies focused on predicting drug interactions and drug-target associations using various machine learning techniques. The studies use datasets such as DrugBank, LINCS signatures, and biological databases, and employ algorithms such as convolutional neural networks, graph convolutional networks, and other deep learning methods. Evaluation metrics include accuracy, F-score, area under the curve (AUC), and precision-recall metrics, showing advancements in computational methods for pharmacological research.

Version 3 (22 November, 2021)
See https://doi.org/10.24072/pcjournal.173 for a detailed description of the database. See http://evocellbio.com/eukprot/ for a BLAST database, interactive plots of BUSCO scores and ‘The Comparative Set’ (TCS): a selected subset of EukProt for comparative genomics investigations. Protein sequence FASTA files of the TCS are available at https://doi.org/10.6084/m9.figshare.21586065. See https://github.com/beaplab/EukProt for utility scripts, annotations, and all the files necessary to build the tree in Figures 1 and 3 (from the DOI above). Scroll to the end of this page for changes since version 2. Are we missing anything? Please let us know!
EukProt is a database of published and publicly available predicted protein sets selected to represent the breadth of eukaryotic diversity, currently including 993 species from all major supergroups as well as orphan taxa. The goal of the database is to provide a single, convenient resource for gene-based research across the spectrum of eukaryotic life, such as phylogenomics and gene family evolution. Each species is placed within the UniEuk taxonomic framework in order to facilitate downstream analyses, and each data set is associated with a unique, persistent identifier to facilitate comparison and replication among analyses. The database is regularly updated, and all versions will be permanently stored and made available via FigShare. The current version has a number of updates, notably ‘The Comparative Set’ (TCS), a reduced taxonomic set with high estimated completeness while maintaining a substantial phylogenetic breadth, which comprises 196 predicted proteomes. A BLAST web server and graphical displays of data set completeness are available at http://evocellbio.com/eukprot/.
We invite the community to provide suggestions for new data sets and new annotation features to be included in subsequent versions, with the goal of building a collaborative resource that will promote research to understand eukaryotic diversity and diversification.
This release contains 5 files:
- EukProt_proteins.v03.2021_11_22.tgz: 993 protein data sets, for species with either a genome (375) or single-cell genome (56), a transcriptome (498), a single-cell transcriptome (47), or an EST assembly (17).
- EukProt_genome_annotations.v03.2021_11_22.tgz: gene annotations, in GFF format, as produced by EukMetaSanity (https://github.com/cjneely10/EukMetaSanity) for 40 genomes lacking publicly available protein annotations. The proteins predicted from these annotations are included in the proteins file.
- EukProt_included_data_sets.v03.2021_11_22.txt and EukProt_not_included_data_sets.v03.2021_11_22.txt: tables of information on data sets either included (993 data sets) or not included (163) in the database. Tab-delimited; multiple entries in the same cell are comma-delimited; missing data is represented with the “N/A” value.
These tables have the following columns:
- EukProt_ID: the unique identifier associated with the data set. This will not change among versions. If a new data set becomes available for the species, it will be assigned a new unique identifier.
- Name_to_Use: the name of the species for protein/genome annotation/assembled transcriptome files.
- Strain: the strain(s) of the species sequenced.
- Previous_Names: any previous names that this species was known by.
- Replaces_EukProt_ID/Replaced_by_EukProt_ID: if the data set changes with respect to an earlier version, the EukProt ID of the data set that it replaces (in the included table) or that it is replaced by (in the not_included table).
- Genus_UniEuk, Epithet_UniEuk, Supergroup_UniEuk, Taxogroup1_UniEuk, Taxogroup2_UniEuk: taxonomic identifiers at different levels of the UniEuk taxonomy (Berney et al. 2017, DOI: 10.1111/jeu.12414, based on Adl et al. 2019, DOI: 10.1111/jeu.12691).
- Taxonomy_UniEuk: the full lineage of the species in the UniEuk taxonomy (semicolon-delimited).
- Merged_Strains: whether multiple strains of the same species were merged to create the data set.
- Data_Source_URL: the URL(s) from which the data were downloaded.
- Data_Source_Name: the name of the data set (as assigned by the data source).
- Paper_DOI: the DOI(s) of the paper(s) that published the data set.
- Actions_Prior_to_Use: the action(s) that were taken to process the publicly available files in order to produce the data set in this database.
Actions taken (see our manuscript for more details):
- ‘assemble mRNA’: Trinity v. 2.8.4, http://trinityrnaseq.github.io/
- ‘CD-HIT’: v. 4.6, http://weizhongli-lab.org/cd-hit/
- ‘extractfeat’, ‘seqret’, ‘transeq’, ‘trimseq’: from EMBOSS package v. 6.6.0.0, http://emboss.sourceforge.net/
- ‘translate mRNA’: Transdecoder v. 5.3.0, http://transdecoder.github.io/
- ‘gffread’: v. 0.12.3, https://github.com/gpertea/gffread
- ‘predict genes’: EukMetaSanity, https://github.com/cjneely10/EukMetaSanity (cloned on 21 September, 2021)
All parameter values were default, unless otherwise specified.
- Data_Source_Type: the type o...
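The table conventions described above (tab-delimited, comma-delimited entries within a cell, “N/A” for missing data) could be handled along the lines of the following Python sketch. The helper names and the sample rows are made up for illustration and are not taken from the actual EukProt files; only the column names come from the description.

```python
import csv
import io


def parse_cell(value):
    """Return None for 'N/A', otherwise a list of comma-separated entries."""
    if value == "N/A":
        return None
    return [item.strip() for item in value.split(",")]


def read_table(handle):
    """Read a tab-delimited table with a header row into a list of dicts."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [{col: parse_cell(val) for col, val in row.items()} for row in reader]


# Made-up rows for illustration only; they are not taken from the real file.
sample = io.StringIO(
    "EukProt_ID\tName_to_Use\tPrevious_Names\n"
    "EP_X1\tGenus_speciesA\tN/A\n"
    "EP_X2\tGenus_speciesB\tOld_name1, Old_name2\n"
)
rows = read_table(sample)
```

Every non-missing cell becomes a list, even when it holds a single entry, so downstream code does not need to distinguish single-valued from multi-valued cells. To read a real file, an open file handle could be passed to `read_table` in place of the in-memory sample.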

Background: Viruses that infect prokaryotes (phages) constitute the most abundant group of biological agents, playing pivotal roles in microbial systems. They are known to impact microbial community dynamics, microbial ecology, and evolution. Efforts to document the diversity, host range, infection dynamics, and effects of bacteriophage infection on host cell metabolism are still at the surface level. Among phages, some adopt the lysogenic mode of infection, where the genome integrates into the host cell genome, forming a prophage. Prophages enable viral genome replication without host cell lysis and often contribute novel and beneficial traits to the host genome. Despite their importance, research on prophages is limited. Current phage research predominantly focuses on lytic phages, leaving a significant gap in knowledge regarding prophages, including their biology, diversity, and ecological roles. Results: To bridge this gap, the creation of Prophage-DB, a prophage database, aims to a...
# Prophage-DB: A comprehensive database to explore diversity, distribution, and ecology of prophages
https://doi.org/10.5061/dryad.3n5tb2rs5
This dataset contains prophage sequences (available as .fna files) identified from prokaryotic genomes from three public databases: the Genome Taxonomy Database (GTDB, release 207), the National Center for Biotechnology Information (NCBI) Reference Sequence (RefSeq) database (accessed March 2023), and the Searchable Planetary-scale mIcrobiome REsource (SPIRE). The downloaded prokaryotic genomes from these databases contained both archaeal and bacterial representative genomes (SPIRE also included data from unknown hosts).
Prophage identification from downloaded representative genomes was carried out using VIBRANT (v1.2.1). We ran VIBRANT with its default arguments (minimum scaffold length requirement = 1000 base pairs, minimum number of open reading frames (ORFs, or proteins) per scaffold requi...

This data collection contains all currently published nucleotide (DNA/RNA) and protein sequences from Australian Amphibolis antarctica, commonly known as Sea Nymph. Other information about this group:
The nucleotide (DNA/RNA) and protein sequences have been sourced through the European Nucleotide Archive (ENA) and the Universal Protein Resource (UniProt), databases that contain comprehensive sets of nucleotide (DNA/RNA) and protein sequences from all organisms that have been published by the international research community.
The identification of species in Amphibolis antarctica as Australian-dwelling organisms has been achieved by accessing the Australian Plant Census (APC) or Australian Faunal Directory (AFD) through the Atlas of Living Australia.

Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
NCBIfam is a collection of protein families, featuring curated multiple sequence alignments, hidden Markov models (HMMs) and annotation, which provides a tool for identifying functionally related proteins based on sequence homology. NCBIfam is maintained at the National Center for Biotechnology Information (Bethesda, MD). NCBIfam includes models from TIGRFAMs, another database of protein families developed at The Institute for Genomic Research, then at the J. Craig Venter Institute (Rockville, MD, US).

Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
The Saccharomyces Genome Database (SGD; http://www.yeastgenome.org) is an expertly curated database of literature-derived functional information for the model organism budding yeast, Saccharomyces cerevisiae. SGD constantly strives to synergize new types of experimental data and bioinformatics predictions with existing data, and to organize them into a comprehensive and up-to-date information resource. The primary mission of SGD is to facilitate research into the biology of yeast and to provide this wealth of information to advance, in many ways, research on other organisms, even those as evolutionarily distant as humans. To build such a bridge between biological kingdoms, SGD is curating data regarding yeast-human complementation, in which a human gene can successfully replace the function of a yeast gene, and/or vice versa. These data are manually curated from published literature, made available for download, and incorporated into a variety of analysis tools provided by SGD.

Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
6,698 records indicated the presence and abundance of animal species, including representatives across trophic groups and size classes documented at 254 sites throughout the world, encompassing a variety of habitats. We accessed peer-reviewed articles, government publications, and theses that were freely available with the Utah State University library subscription and were published in English. We extracted data from articles that reported species-level abundance for a control community and at least one manipulated community. The data here represent a single data point each for the control treatment and the manipulated treatment(s) in each study. Data came from a wide variety of sites including artificial experiments (i.e., caged exclosures, habitat modules, nutrient addition) and human-mediated “natural” experiments (e.g., wildfire or controlled burn, logging, grazed plots, pollution). Sites represent all continents except Antarctica, and widely varying terrestrial animal groups (arachnid, insect, herpetofauna [reptiles and amphibians], mammal, and bird).

The exposure of human DNA to genotoxic compounds induces the formation of covalent DNA adducts, which may contribute to the initiation of carcinogenesis. Liquid chromatography (LC) coupled with high-resolution mass spectrometry (HRMS) is a powerful tool for DNA adductomics, a new research field that aims to screen known and unknown DNA adducts in biological samples. The lack of databases and bioinformatics tools in this field limits the applicability of DNA adductomics. Establishing a comprehensive database will make the identification process faster and more efficient and will provide new insight into the occurrence of DNA modification from a wide range of genotoxicants. In this paper, we present a four-step approach used to compile and curate a database for the annotation of DNA adducts in biological samples. The first step included a literature search, selecting only DNA adducts that were unequivocally identified by either comparison with reference standards or with nuclear magnetic resonance (NMR), and tentatively identified by tandem HRMS/MS. The second step consisted of harmonizing structures, molecular formulas, and names to build a systematic database of 279 DNA adducts. The source, the study design and the technique used for DNA adduct identification were reported. The third step consisted of extending the database with 303 new potential DNA adducts derived from different combinations of genotoxicants with nucleobases, and reporting monoisotopic masses, chemical formulas, .cdxml files, .mol files, SMILES, InChI, InChIKey and IUPAC nomenclature. In the fourth step, a preliminary spectral library was built by acquiring experimental MS/MS spectra of 15 reference standards, generating in silico MS/MS fragments for all the adducts, and reporting both experimental and predicted fragments in interactive web datatables.
The database, including 582 entries, is publicly available (https://gitlab.com/nexs-metabolomics/projects/dna_adductomics_database). This database is a powerful tool for the annotation of DNA adducts measured by (HR)MS. The inclusion of metadata indicating the source of DNA adducts, the study design and the technique used allows prioritization of the DNA adducts of interest and/or enhancement of the annotation confidence. DNA adduct identification can be further improved by integrating the present database with the generation of authentic MS/MS spectra and with user-friendly bioinformatics tools.
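To illustrate what a monoisotopic mass entry represents, the following Python sketch sums the masses of the most abundant isotopes for a simple chemical formula. It is a minimal sketch and not the procedure used to build the database; the function name and the restriction to six elements and bracket-free, uncharged formulas are assumptions made for brevity.

```python
import re

# Monoisotopic masses (u) of the most abundant isotope of each element.
MONOISOTOPIC = {
    "C": 12.000000,
    "H": 1.007825,
    "N": 14.003074,
    "O": 15.994915,
    "P": 30.973762,
    "S": 31.972071,
}


def monoisotopic_mass(formula):
    """Sum isotope masses for a simple formula such as 'C5H5N5O'.

    Brackets, charges and elements outside MONOISOTOPIC are not supported;
    an unknown element raises KeyError.
    """
    total = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += MONOISOTOPIC[element] * (int(count) if count else 1)
    return total


# Unmodified guanine nucleobase, used here only as a familiar example.
guanine = monoisotopic_mass("C5H5N5O")
```

In an adductomics workflow, a mass computed this way would typically be compared with a measured value within a tolerance chosen for the instrument, which is outside the scope of this sketch.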

Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In this paper, we present KEGGscape, a pathway data integration and visualization app for Cytoscape (http://apps.cytoscape.org/apps/keggscape). KEGG is a comprehensive public biological database that contains a large collection of manually curated pathways. KEGGscape uses the database to reproduce the corresponding hand-drawn pathway diagrams in Cytoscape with as much detail as possible. It also lets users import pathway data sets, visualize them as biologist-friendly diagrams with the Cytoscape core visualization function (Visual Style), and perform pathway analysis with a variety of Cytoscape apps. From the analyzed data, users can create complex and interactive visualizations that cannot be produced in the KEGG PATHWAY web application. Experimental data from Affymetrix E. coli chips are used as an example to demonstrate how users can integrate pathways, annotations, and experimental data sets to create complex visualizations that clarify biological systems using KEGGscape and other Cytoscape apps.

Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
HAMAP stands for High-quality Automated and Manual Annotation of Proteins. HAMAP profiles are manually created by expert curators. They identify proteins that are part of well-conserved protein families or subfamilies. HAMAP is based at the SIB Swiss Institute of Bioinformatics, Geneva, Switzerland.

MIT License https://opensource.org/licenses/MIT
License information was derived automatically
"Synthetic protein dataset with sequences, physical properties, and functional classification for machine learning tasks."
This synthetic dataset was created to explore and develop machine learning models in bioinformatics. It contains 20,000 synthetic proteins, each with an amino acid sequence, calculated physicochemical properties, and a functional classification.
While this is a simulated dataset, it was inspired by patterns observed in real protein datasets, such as:
- UniProt: A comprehensive database of protein sequences and annotations.
- Kyte-Doolittle Scale: Calculations of hydrophobicity.
- Biopython: A tool for analyzing biological sequences.
This dataset is ideal for:
- Training classification models for proteins.
- Exploratory analysis of physicochemical properties of proteins.
- Building machine learning pipelines in bioinformatics.
The dataset is divided into two subsets:
- Training: 16,000 samples (proteinas_train.csv).
- Testing: 4,000 samples (proteinas_test.csv).
This dataset was inspired by real bioinformatics challenges and designed to help researchers and developers explore machine learning applications in protein analysis.
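As an example of the kind of physicochemical property mentioned above, the following Python sketch computes a mean Kyte-Doolittle hydropathy value (often called the GRAVY score) for an amino acid sequence. How the dataset's own properties were calculated is not stated, so this is only an assumed illustration; the function name is hypothetical.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence):
    """Mean Kyte-Doolittle hydropathy over all residues of a sequence.

    Raises ValueError for an empty sequence and KeyError for residues
    outside the 20 standard amino acids.
    """
    if not sequence:
        raise ValueError("sequence must not be empty")
    return sum(KYTE_DOOLITTLE[aa] for aa in sequence.upper()) / len(sequence)
```

Positive values indicate a more hydrophobic sequence overall and negative values a more hydrophilic one, so a score like this could serve as one numeric feature in a classification model.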