The Eukaryotic Promoter Database (EPD) is an annotated non-redundant collection of eukaryotic POL II promoters, for which the transcription start site has been determined experimentally. Access to promoter sequences is provided by pointers to positions in nucleotide sequence entries. The annotation part of an entry includes description of the initiation site mapping data, cross-references to other databases, and bibliographic references. EPD is structured in a way that facilitates dynamic extraction of biologically meaningful promoter subsets for comparative sequence analysis.
Collection of eukaryotic promoters derived from published articles. Annotated non-redundant collection of eukaryotic POL II promoters, for which the transcription start site has been determined experimentally. Access to promoter sequences is provided by pointers to positions in nucleotide sequence entries. EPD is structured in a way that facilitates dynamic extraction of biologically meaningful promoter subsets for comparative sequence analysis.
A promoter database of Saccharomyces cerevisiae. Users can explore the promoter regions of ~6000 genes and ORFs in the yeast genome, annotate putative regulatory sites of all genes and ORFs, locate intergenic regions, and retrieve the sequence of a promoter region. With regard to regulatory elements and transcription factors, users can provide information on transcriptionally related genes, browse matrix and consensus sequences, view the correlation between elements, observe binding affinity and expression, and look at genome-wide distribution. SCPD also provides some simple but useful tools for promoter sequence analysis. Gene, consensus and matrix records may be submitted.
Annotated, non-redundant database of proximal promoter sequences for RNA polymerase II with experimentally determined transcription start site(s) (TSS) from various plant species. It contains 578 unrelated entries including 151, 396 and 31 promoters with experimentally verified TSS from monocot, dicot and other plants, respectively (April 2014). This DB presents the published promoter sequences with TSS(s) determined by direct experimental approaches and therefore serves as the most accurate source for development of computational promoter prediction tools.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains the training, validation, and testing datasets used in our research for ProkBERT, optimized for microbiome studies.
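There are 4 different datasets: the ESKAPE genomic features dataset, the bacterial promoter dataset, the ESKAPE masking dataset, and the phage sequence dataset.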
The datasets support the development and evaluation of ProkBERT models. They include raw sequence data in compressed TSV format and tokenized datasets in compressed HDF format, using various k-mer sizes and shift values.
filename: eskape_genomic_features.tsv.bz2
This dataset includes genomic segments from ESKAPE pathogens, characterized by various genomic features such as coding sequences (CDS), intergenic regions, ncRNA, and pseudogenes. It was analyzed to understand the representations captured by models like ProkBERT-mini, ProkBERT-mini-c, and ProkBERT-mini-long.
contig_id: Identifier of the contig.
segment_id: Unique identifier for each genomic segment.
strand: DNA strand of the segment (+ or -).
seq_start: Starting position of the segment in the contig.
seq_end: Ending position of the segment in the contig.
segment_start: Starting position of the segment in the sequence.
segment_end: Ending position of the segment in the sequence.
label: Genomic feature category (e.g., CDS, intergenic).
segment_length: Length of the genomic segment.
segment: Genomic sequence of the segment.
For a more detailed description, please visit: https://huggingface.co/datasets/neuralbioinfo/ESKAPE-genomic-features
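As a rough illustration of how the compressed TSV could be read, the sketch below counts segments per genomic feature label using only the Python standard library. It assumes that the file starts with a header row whose names match the columns listed above; the function name `count_labels` is hypothetical and not part of the ProkBERT code.

```python
import bz2
import csv
from collections import Counter


def count_labels(path):
    """Count rows per value of the 'label' column in a bz2-compressed TSV.

    Assumption: the first row is a header containing a 'label' column.
    """
    counts = Counter()
    # bz2.open in text mode decompresses on the fly; newline="" is what csv expects.
    with bz2.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        for row in reader:
            counts[row["label"]] += 1
    return counts
```

Calling `count_labels("eskape_genomic_features.tsv.bz2")` on a local copy of the file would then give one count per feature category such as CDS or intergenic.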
filename: bacterial_promoter_db.tsv.bz2
segment_id: Unique identifier for each segment.
ppd_original_SpeciesName: Original species name from the PPD.
Strand: The strand of the DNA sequence.
segment: The DNA sequence of the promoter region.
label: The label indicating whether the sequence is a promoter or non-promoter.
L: Length of the DNA sequence.
prom_class: The class of the promoter.
y: Binary label indicating the presence of a promoter.
filename: eskape_masking_dataset.tsv.bz2
This dataset was used to evaluate different models on the masking exercise, measuring how well the different models can recover the original character.
The dataset is compiled from the RefSeq database and other sources, focusing on ESKAPE pathogens. The genomic features were sampled randomly, followed by contiguous segmentation. This dataset contains various segments with lengths: [128, 256, 512, 1024]. The segments were randomly selected, and one of the characters was replaced by '*' (masked_segment column) to create a masking task. The reference_segment contains the original, non-replaced nucleotides. We performed 10,000 maskings per set, with a maximum of 2,000 genomic features. Only the genomic features: 'CDS', 'intergenic', 'pseudogene', and 'ncRNA' were considered.
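The masking step described above can be sketched as follows. This is a minimal illustration, not the authors' sampling code: the function names are hypothetical, positions are assumed to be 0-based, and a seeded `random.Random` is used only to make the example reproducible.

```python
import random

MASK_CHAR = "*"


def mask_one_position(reference_segment, rng):
    """Replace one randomly chosen character with '*'.

    Returns the masked segment and the (assumed 0-based) masked position.
    """
    position = rng.randrange(len(reference_segment))
    masked = (
        reference_segment[:position] + MASK_CHAR + reference_segment[position + 1:]
    )
    return masked, position


def is_recovered(reference_segment, position, predicted_char):
    """Check whether a model's prediction matches the original nucleotide."""
    return reference_segment[position] == predicted_char


if __name__ == "__main__":
    rng = random.Random(0)
    masked_segment, position_to_mask = mask_one_position("ACGTACGTACGT", rng)
    print(masked_segment, position_to_mask)
```

A model is then scored by how often its prediction for the masked position equals the character in the reference segment.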
reference_segment_id: A mapping of segment identifiers to their respective reference IDs in the database.
masked_segment: The DNA sequence of the segment with certain positions masked (marked with '*') for prediction or testing purposes.
position_to_mask: The specific position(s) in the sequence that have been masked, indicated by index numbers.
masked_segment_id: Unique identifiers assigned to the masked segments (unique only with respect to length).
contig_id: Identifier of the contig to which the segment belongs.
segment_id: Unique identifier for each genomic segment (same as reference segment id).
strand: The DNA strand of the segment, indicated as '+' (positive) or '-' (negative).
seq_start: Starting position of the segment within the contig.
seq_end: Ending position of the segment within the contig.
segment_start: Starting position of the genomic segment in the sequence.
segment_end: Ending position of the genomic segment in the sequence.
label: Category label for the genomic segment (e.g., 'CDS', 'intergenic').
segment_length: The length of the genomic segment.
original_segment: The original genomic sequence without any masking.
We assembled a phage sequence database from RefSeq and other sources, refining it to reduce redundancy and ensure balance between phage and bacterial sequences. The final dataset targets important bacterial genera, aiding in understanding phage-host interactions and their implications for health.
For tokenized datasets:
RND_balanced_(test|val|train)_Ls(\d+)_k(\d+)s(\d+)\.h5\.bz2
For sampled raw data:
RND_balanced_(test|val|train)_Ls(\d+)\.tsv\.bz2
segment_id: Unique identifier for each genomic segment.
contig_id: Identifier for the contig from which the segment is derived.
segment_start: Start position of the segment in the contig.
segment_end: End position of the segment in the contig.
L: Length of the genomic segment (512, 1024, or 2048).
segment: The genomic sequence of the segment.
label: Classification label (e.g., 'phage').
y: Binary label (1 for phage, 0 for non-phage).
These datasets are for academic use. Reference our paper when using them.
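The two file-name patterns given for the phage dataset can be applied directly with Python's `re` module. The sketch below is illustrative only: `parse_filename` is a hypothetical helper, and reading the `Ls`, `k` and `s` groups as segment length, k-mer size and shift is an assumption based on the description of the tokenized files, not something the patterns themselves state.

```python
import re

# Patterns copied from the dataset description.
TOKENIZED_RE = re.compile(
    r"RND_balanced_(test|val|train)_Ls(\d+)_k(\d+)s(\d+)\.h5\.bz2"
)
RAW_RE = re.compile(r"RND_balanced_(test|val|train)_Ls(\d+)\.tsv\.bz2")


def parse_filename(name):
    """Return the fields encoded in a file name, or None if it matches neither pattern."""
    match = TOKENIZED_RE.fullmatch(name)
    if match:
        split, ls, k, s = match.groups()
        # Assumed meaning: Ls = segment length, k = k-mer size, s = shift.
        return {"kind": "tokenized", "split": split,
                "Ls": int(ls), "k": int(k), "s": int(s)}
    match = RAW_RE.fullmatch(name)
    if match:
        split, ls = match.groups()
        return {"kind": "raw", "split": split, "Ls": int(ls)}
    return None


if __name__ == "__main__":
    # Hypothetical file names, constructed only to satisfy the patterns.
    print(parse_filename("RND_balanced_train_Ls512_k6s1.h5.bz2"))
    print(parse_filename("RND_balanced_test_Ls1024.tsv.bz2"))
```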
For any questions, feedback, or contributions regarding the datasets or ProkBERT, please feel free to reach out:
We welcome your input and collaboration to improve our resources and research.
@Article{ProkBERT2024,
author = {Ligeti, Balázs and Szepesi-Nagy, István and Bodnár, Babett and Ligeti-Nagy, Noémi and Juhász, János},
journal = {Frontiers in Microbiology},
title = {{ProkBERT} family: genomic language models for microbiome applications},
year = {2024},
volume = {14},
URL = {https://www.frontiersin.org/articles/10.3389/fmicb.2023.1331233},
DOI = {10.3389/fmicb.2023.1331233}
}
DNA sequence and relationships for DB narG promoter (promoter)
THIS RESOURCE IS NO LONGER IN SERVICE, documented on October 30, 2012. A database that displays the observed frequencies of individual 5' end SAGE tags and previously unknown transcription start sites in the promoter regions, introns and intergenic regions of known genes. 5'SAGE will be useful for analyzing promoter regions and start site variation in different tissues, and is freely available.
As the number of sequenced bacterial genomes increases, the need for rapid and reliable tools for the annotation of functional elements (e.g., transcriptional regulatory elements) becomes more pressing. Promoters are the key regulatory elements, which recruit the transcriptional machinery through binding to a variety of regulatory proteins (known as sigma factors). The identification of the promoter regions is very challenging because these regions do not adhere to specific sequence patterns or motifs and are difficult to determine experimentally. Machine learning represents a promising and cost-effective approach for computational identification of prokaryotic promoter regions. However, the quality of the predictors depends on several factors including: i) training data; ii) data representation; iii) classification algorithms; iv) evaluation procedures. In this work, we create several variants of E. coli promoter data sets and utilize them to experimentally examine the effect of these factors on the predictive performance of E. coli σ70 promoter models. Our results suggest that under some combinations of the first three criteria, a prediction model might perform very well on cross-validation experiments while its performance on independent test data is very poor. This emphasizes the importance of evaluating promoter region predictors using independent test data, which corrects for the over-optimistic performance that might be estimated using the cross-validation procedure. Our analysis of the tested models shows that good prediction models often perform well regardless of how the non-promoter data was obtained. On the other hand, poor prediction models seem to be more sensitive to the choice of non-promoter sequences. Interestingly, the best performing sequence-based classifiers outperform the best performing structure-based classifiers on both cross-validation and independent test performance evaluation experiments.
Finally, we propose a meta-predictor method combining two top performing sequence-based and structure-based classifiers and compare its performance with some of the state-of-the-art E. coli σ70 promoter prediction methods.
DBTGR provides information on tunicate gene regulation, such as the location of expression, or the identified regulatory elements present in promoter sequences. The database also contains the promoters of homologous genes in multiple species to allow identification of conserved cis elements.
500 nt promoter sequences for the DoOP database, chordate section, v1.4.
FULL-malaria is a database for a full-length-enriched cDNA library from the human malaria parasite Plasmodium falciparum. Because of its medical importance, this organism is the first target for genome sequencing of a eukaryotic pathogen; the sequences of two of its 14 chromosomes have already been determined. However, for the full exploitation of this rapidly accumulating information, correct identification of the genes and study of their expression are essential. Using the oligo-capping method, this database has produced a full-length-enriched cDNA library from erythrocytic stage parasites and performed one-pass reading. The database consists of nucleotide sequences of 2490 random clones that include 390 (16%) known malaria genes according to BLASTN analysis of the nr-nt database in GenBank; these represent 98 genes, and the clones for 48 of these genes contain the complete protein-coding sequence (49%). On the other hand, comparisons with the complete chromosome 2 sequence revealed that 35 of 210 predicted genes are expressed, and in addition led to detection of three new gene candidates that were not previously known. In total, 19 of these 38 clones (50%) were full-length. From these observations, it is expected that the database contains approximately 1000 genes, including 500 full-length clones. It should be an invaluable resource for the development of vaccines and novel drugs. Full-malaria has been updated in at least three points. (i) 8934 sequences generated from new libraries were added, so that the database collection of 11,424 full-length cDNAs covers 1375 (25%) of the estimated 5409 parasite genes. (ii) All of its full-length cDNAs and GenBank EST sequences were mapped to genomic sequences together with publicly available annotated genes and other predictions. This precisely determined the gene structures and positions of the transcriptional start sites, which are indispensable for the identification of the promoter regions. (iii) A total of 4257 cDNA sequences were newly generated from murine malaria parasites, Plasmodium yoelii yoelii. The genome/cDNA sequences were compared at both nucleotide and amino acid levels with those of P. falciparum, and the sequence alignment for each gene is presented graphically. This part of the database serves as a versatile platform to elucidate the function(s) of malaria genes by a comparative genomic approach. It should also be noted that all of the cDNAs represented in this database are supported by physical cDNA clones, which are publicly and freely available, and should serve as indispensable resources to explore functional analyses of malaria genomes. Sponsors: This database has been constructed and maintained by a Grant-in-Aid for Publication of Scientific Research Results from the Japan Society for the Promotion of Science (JSPS). This work was also supported by Special Coordination Funds for Promoting Science and Technology from the Science and Technology Agency of Japan (STA) and a Grant-in-Aid for Scientific Research on Priority Areas from the Ministry of Education, Science, Sports and Culture of Japan.
DNA sequence and relationships for consensus elements (promoter)
Four different histones (H2A, H2B, H3, and H4; two subunits each) constitute a histone octamer, around which DNA wraps to form histone-DNA complexes called nucleosomes. Amino acid residues in each histone are occasionally modified, resulting in several biological effects, including differential regulation of transcription. Core promoters that encompass the transcription start site have well-conserved DNA motifs, including the initiator (Inr), TATA box, and DPE, which are collectively called the core promoter elements (CPEs). In this study, we systematically studied the associations between the CPEs and histone modifications by integrating the Drosophila Core Promoter Database and time-series ChIP-seq data for histone modifications (H3K4me3, H3K27ac, and H3K27me3) during development in Drosophila melanogaster via the modENCODE project. We classified 96 core promoters into four groups based on the presence or absence of the TATA box or DPE, and calculated the histone modification ratio at the core promoter region and the transcribed region for each core promoter. We found that the histone modifications in TATA-less groups were static during development and that the core promoters could be clearly divided into three types: i) core promoters with continuous active marks (H3K4me3 and H3K27ac), ii) core promoters with a continuous inactive mark (H3K27me3) and occasional active marks, and iii) core promoters with occasional histone modifications. Linear regression analysis and non-linear regression by random forest showed that the TATA-containing groups included core promoters without histone modifications, for which the measured RNA expression values could not be predicted accurately from the histone modification status. DPE-containing groups had a higher relative frequency of H3K27me3 in both the core promoter region and transcribed region.
In summary, our analysis showed that there was a systematic link between the existence of the CPEs and the dynamics, frequency and influence on transcriptional activity of histone modifications.
Public database of known binding sites identified in promoters of orthologous vertebrate genes that have been manually curated from bibliography. We have annotated 650 experimental binding sites from 68 transcription factors and 100 orthologous target genes in human, mouse, rat or chicken genome sequences. Computational predictions and promoter alignment information are also provided for each entry. For each gene, TFBSs conserved in orthologous sequences from at least two different species must be available. Promoter sequences as well as the original GenBank or RefSeq entries are additionally supplied in case of future identification conflicts. The final TSS annotation has been refined using the database dbTSS. Up to this release, the 500 bp upstream of the annotated transcription start site (TSS) according to RefSeq annotations have always been extracted to form the collection of promoter sequences from human, mouse, rat and chicken. For each regulatory site, the position, the motif and the sequence in which the site is present are available in a simple format. Cross-references to EntrezGene, PubMed and RefSeq are also provided for each annotation. Apart from the experimental promoter annotations, predictions by popular collections of weight matrices are also provided for each promoter sequence. In addition, global and local alignments and graphical dotplots are also available.
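Extracting a fixed-length upstream window such as the 500 bp described here can be sketched as below. This is not the database's own pipeline: the function names are hypothetical, the TSS is assumed to be a 0-based index into an in-memory chromosome string, and minus-strand promoters are returned as the reverse complement so that they read 5' to 3' towards the TSS.

```python
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def upstream_promoter(chrom_seq, tss, strand, length=500):
    """Return up to `length` bases immediately upstream of the TSS.

    Assumptions: `tss` is the 0-based index of the TSS base in `chrom_seq`;
    the TSS base itself is not included; windows are truncated at sequence ends.
    """
    if strand == "+":
        start = max(0, tss - length)
        return chrom_seq[start:tss]
    if strand == "-":
        end = min(len(chrom_seq), tss + 1 + length)
        return reverse_complement(chrom_seq[tss + 1:end])
    raise ValueError("strand must be '+' or '-'")


if __name__ == "__main__":
    toy_chromosome = "ACGTTGCAAG"  # toy sequence for illustration only
    print(upstream_promoter(toy_chromosome, tss=5, strand="+", length=3))
    print(upstream_promoter(toy_chromosome, tss=4, strand="-", length=3))
```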
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
1000 nt promoter sequences for the DoOP database, chordate section, v1.4.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Values were normalized such that the sum of all values was 1.
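The normalization mentioned here amounts to dividing each value by the total. A minimal sketch, assuming the values are non-negative numbers held in a plain list (the function name is hypothetical):

```python
def normalize_to_unit_sum(values):
    """Scale values so that they sum to 1."""
    total = sum(values)
    if total == 0:
        raise ValueError("cannot normalize values that sum to zero")
    return [value / total for value in values]


if __name__ == "__main__":
    print(normalize_to_unit_sum([2, 3, 5]))
```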
Engineering microorganisms into biological factories that convert renewable feedstocks into valuable materials is a major goal of synthetic biology; however, for many nonmodel organisms, we do not yet have the genetic tools, such as suites of strong promoters, necessary to effectively engineer them. In this work, we developed a computational framework that can leverage standard RNA-seq data sets to identify sets of constitutive, strongly expressed genes and predict strong promoter signals within their upstream regions. The framework was applied to a diverse collection of RNA-seq data measured for the methanotroph Methylotuvimicrobium buryatense 5GB1 and identified 25 genes that were constitutively, strongly expressed across 12 experimental conditions. For each gene, the framework predicted short (27–30 nucleotide) sequences as candidate promoters and derived −35 and −10 consensus promoter motifs (TTGACA and TATAAT, respectively) for strong expression in M. buryatense. This consensus closely matches the canonical E. coli sigma-70 motif and was found to be enriched in promoter regions of the genome. A subset of promoter predictions was experimentally validated in a XylE reporter assay, including the consensus promoter, which showed high expression. The pmoC, pqqA, and ssrA promoter predictions were additionally screened in an experiment that scrambled the −35 and −10 signal sequences, confirming that transcription initiation was disrupted when these specific regions of the predicted sequence were altered. These results indicate that the computational framework can make biologically meaningful promoter predictions and identify key pieces of regulatory systems that can serve as foundational tools for engineering diverse microorganisms for biomolecule production.
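To make the −35/−10 consensus concrete, the sketch below scans a sequence for a TTGACA-like hexamer followed, after a spacer, by a TATAAT-like hexamer. It is a simple pattern scan and not the framework described in the abstract, which derives its predictions from RNA-seq data. The spacer range of 15 to 18 bases is an assumption chosen so that candidates span 27 to 30 nucleotides, and the function names and mismatch threshold are likewise hypothetical.

```python
MINUS_35 = "TTGACA"
MINUS_10 = "TATAAT"
HEXAMER = 6


def count_mismatches(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_promoter_candidates(seq, spacer_range=(15, 18), max_mismatches=1):
    """Return (start, spacer, candidate) for each -35/-10 hexamer pair found.

    Assumption: both hexamers may deviate from the consensus by at most
    `max_mismatches` bases, and the spacer length lies within `spacer_range`.
    """
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - HEXAMER + 1):
        if count_mismatches(seq[i:i + HEXAMER], MINUS_35) > max_mismatches:
            continue
        for spacer in range(spacer_range[0], spacer_range[1] + 1):
            j = i + HEXAMER + spacer  # start of the putative -10 hexamer
            box10 = seq[j:j + HEXAMER]
            if len(box10) == HEXAMER and count_mismatches(box10, MINUS_10) <= max_mismatches:
                hits.append((i, spacer, seq[i:j + HEXAMER]))
    return hits


if __name__ == "__main__":
    # Synthetic test sequence, not taken from any genome.
    toy = "GG" + MINUS_35 + "A" * 17 + MINUS_10 + "CC"
    for hit in find_promoter_candidates(toy, max_mismatches=0):
        print(hit)
```

Allowing mismatches makes the scan more permissive at the cost of more false positives, so any hits would still need experimental confirmation of the kind the abstract describes.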
Database that annotates SNPs with known and predicted regulatory elements in intergenic regions of the H. sapiens genome. Known and predicted regulatory DNA elements include regions of DNase hypersensitivity, binding sites of transcription factors, and promoter regions that have been biochemically characterized to regulate transcription. Sources of these data include public datasets from GEO, the ENCODE project, and published literature.
Database for conserved sequence motifs identified by genome scale motif discovery, similarity, clustering, co-occurrence and coexpression calculations. Sequence inputs include low-coverage genome sequence data and ENCODE data. The database offers information on atomic motifs, motif groups and patterns. In promoter-based cisRED databases, sequence search regions for motif discovery extend from 1.5 kb upstream to 200 bp downstream of a transcription start site, net of most types of repeats and of coding exons. Many transcription factor binding sites are located in such regions. For each target gene's search region, a base set of probabilistic ab initio discovery tools is used, in parallel, to find over-represented atomic motifs. Discovery methods use comparative genomics with over 40 vertebrate input genomes. In ChIP-seq-based cisRED databases, sequence search regions for motif discovery correspond to significant peaks that represent genome-wide sites of protein-DNA binding. Because such peaks occur in a wide range of genic and intergenic locations, ChIP-seq and promoter-based databases are complementary. Currently, motif discovery for ChIP-seq data uses scan-based approaches that make more explicit use of sets of sequences known to be functional transcription factor binding sites, and that consider a wide range of levels of conservation. For the human STAT1 ChIP-seq database, search regions in the target species (human) were selected as +/- 300 bp around the ChIP-seq peak maximum. Repeats and coding regions were masked. Multiple sequence alignments were used to assemble orthologous input sequences from other species.
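The promoter-based search region (1.5 kb upstream to 200 bp downstream of a TSS) can be expressed as a strand-aware interval. The sketch below is an illustration under stated assumptions, not cisRED's implementation: coordinates are 0-based and half-open, the TSS base is counted as the first downstream base, the interval is clamped only at position 0, and the masking of repeats and coding exons is not modeled. The function name is hypothetical.

```python
def promoter_search_region(tss, strand, upstream=1500, downstream=200):
    """Return a half-open (start, end) interval around a TSS.

    Assumptions: `tss` is a 0-based coordinate on the forward strand and the
    TSS base is included in the downstream part of the window.
    """
    if strand == "+":
        start, end = tss - upstream, tss + downstream
    elif strand == "-":
        # Mirror the window: upstream lies at higher coordinates.
        start, end = tss - downstream + 1, tss + upstream + 1
    else:
        raise ValueError("strand must be '+' or '-'")
    return max(0, start), end


if __name__ == "__main__":
    print(promoter_search_region(10000, "+"))
    print(promoter_search_region(10000, "-"))
```

Both calls return intervals of 1700 bases; a caller would still need to clamp the end coordinate to the chromosome length.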