The Eukaryotic Promoter Database (EPD) is an annotated non-redundant collection of eukaryotic POL II promoters, for which the transcription start site has been determined experimentally. Access to promoter sequences is provided by pointers to positions in nucleotide sequence entries. The annotation part of an entry includes description of the initiation site mapping data, cross-references to other databases, and bibliographic references. EPD is structured in a way that facilitates dynamic extraction of biologically meaningful promoter subsets for comparative sequence analysis.
A promoter database of Saccharomyces cerevisiae (SCPD). Users can explore the promoter regions of ~6000 genes and ORFs in the yeast genome, annotate putative regulatory sites of all genes and ORFs, locate intergenic regions, and retrieve the sequence of a promoter region. With regard to regulatory elements and transcription factors, users can provide information on transcriptionally related genes, browse matrix and consensus sequences, view the correlation between elements, observe binding affinity and expression, and look at genome-wide distribution. SCPD also provides some simple but useful tools for promoter sequence analysis. Gene, consensus and matrix records may be submitted.
Public database of known binding sites identified in promoters of orthologous vertebrate genes, manually curated from the literature. We have annotated 650 experimental binding sites from 68 transcription factors and 100 orthologous target genes in human, mouse, rat or chicken genome sequences. Computational predictions and promoter alignment information are also provided for each entry. For each gene, TFBSs conserved in orthologous sequences from at least two different species must be available. Promoter sequences as well as the original GenBank or RefSeq entries are additionally supplied in case of future identification conflicts. The final TSS annotation has been refined using the database dbTSS. Up to this release, the 500 bp upstream of the annotated transcription start site (TSS), according to RefSeq annotations, have always been extracted to form the collection of promoter sequences from human, mouse, rat and chicken. For each regulatory site, the position, the motif and the sequence in which the site is present are available in a simple format. Cross-references to EntrezGene, PubMed and RefSeq are also provided for each annotation. Apart from the experimental promoter annotations, predictions by popular collections of weight matrices are also provided for each promoter sequence. In addition, global and local alignments and graphical dotplots are available.
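The extraction of a fixed-length upstream region described in this entry can be sketched as follows. This is a minimal illustration, not the database's actual pipeline: the function names, the 1-based TSS convention, and the choice to truncate at the sequence boundary are all assumptions made for the example.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string (A, C, G, T, N)."""
    table = str.maketrans("ACGTNacgtn", "TGCANtgcan")
    return seq.translate(table)[::-1]


def upstream_region(chrom_seq: str, tss: int, strand: str, length: int = 500) -> str:
    """Return up to `length` bases immediately upstream of the TSS.

    Assumptions (hypothetical, for illustration only):
    - `tss` is a 1-based coordinate on the forward strand of `chrom_seq`.
    - The TSS base itself is not included in the result.
    - The result is reported 5'->3' on the gene's own strand.
    - Regions that run off the sequence are truncated, not padded.
    """
    if strand == "+":
        start = max(0, tss - 1 - length)  # 0-based start of the window
        return chrom_seq[start:tss - 1]
    if strand == "-":
        end = min(len(chrom_seq), tss + length)
        # On the minus strand, "upstream" lies at higher forward coordinates.
        return reverse_complement(chrom_seq[tss:end])
    raise ValueError("strand must be '+' or '-'")
```

For a real collection, `length` would be 500 and `chrom_seq` a chromosome or contig sequence; the short strings in a quick check such as `upstream_region("ACGTTGCAAGGCTA", 6, "+", 3)` are only toy input.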
THIS RESOURCE IS NO LONGER IN SERVICE, documented on October 30, 2012. A database that displays the observed frequencies of individual 5' end SAGE tags and previously unknown transcription start sites in the promoter regions, introns and intergenic regions of known genes. 5'SAGE will be useful for analyzing promoter regions and start site variation in different tissues, and is freely available.
Annotated, non-redundant database of proximal promoter sequences for RNA polymerase II with experimentally determined transcription start site(s) (TSS) from various plant species. It contains 578 unrelated entries including 151, 396 and 31 promoters with experimentally verified TSS from monocot, dicot and other plants, respectively (April 2014). This DB presents the published promoter sequences with TSS(s) determined by direct experimental approaches and therefore serves as the most accurate source for development of computational promoter prediction tools.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The matrices were normalized such that their sum was 1.
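A minimal sketch of this kind of normalization is shown below. It assumes that "their sum" means the sum over all entries of a single matrix; if the dataset instead normalizes rows or columns, the divisor would differ. The function name is hypothetical.

```python
def normalize_to_unit_sum(matrix):
    """Scale a matrix (list of rows) so that all entries together sum to 1.

    Assumption: the normalization is over the whole matrix,
    not per row or per column.
    """
    total = sum(sum(row) for row in matrix)
    if total == 0:
        raise ValueError("cannot normalize a matrix whose entries sum to 0")
    return [[value / total for value in row] for row in matrix]
```

For example, `[[1, 3], [2, 2]]` has a total of 8, so each entry is divided by 8.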
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
Values were normalized such that the sum of all values was 1.
MIT License https://opensource.org/licenses/MIT
Protein-Protein, Genetic, and Chemical Interactions for Baumann DG (2017): A sequence-specific core promoter-binding transcription factor recruits TRF2 to coordinately transcribe ribosomal protein genes. Curated by BioGRID (https://thebiogrid.org). ABSTRACT: Ribosomal protein (RP) genes must be coordinately expressed for proper assembly of the ribosome yet the mechanisms that control expression of RP genes in metazoans are poorly understood. Recently, TATA-binding protein-related factor 2 (TRF2) rather than the TATA-binding protein (TBP) was found to function in transcription of RP genes in Drosophila. Unlike TBP, TRF2 lacks sequence-specific DNA binding activity, so the mechanism by which TRF2 is recruited to promoters is unclear. We show that the transcription factor M1BP, which associates with the core promoter region, activates transcription of RP genes. Moreover, M1BP directly interacts with TRF2 to recruit it to the RP gene promoter. High resolution ChIP-exo was used to analyze in vivo the association of M1BP, TRF2 and TFIID subunit, TAF1. Despite recent work suggesting that TFIID does not associate with RP genes in Drosophila, we find that TAF1 is present at RP gene promoters and that its interaction might also be directed by M1BP. Although M1BP associates with thousands of genes, its colocalization with TRF2 is largely restricted to RP genes, suggesting that this combination is key to coordinately regulating transcription of the majority of RP genes in Drosophila.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
1000 nt promoter alignments for the DoOP database, chordate section, v1.4.
DBTGR provides information on tunicate gene regulation, such as the location of expression, or the identified regulatory elements present in promoter sequences. The database also contains the promoters of homologous genes in multiple species to allow identification of conserved cis elements.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
500 nt promoter sequences, alignments and conserved motifs for the DoOP database, chordate section, v1.3.
The RNA polymerase II (Pol II) core promoter is the strategic site of convergence of the signals that lead to the initiation of DNA transcription, but the downstream core promoter in humans has been difficult to understand. Here we analyse the human Pol II core promoter and use machine learning to generate predictive models for the downstream core promoter region (DPR) and the TATA box. We developed a method termed HARPE (high-throughput analysis of randomized promoter elements) to create hundreds of thousands of DPR (or TATA box) variants, each with known transcriptional strength. We then analysed the HARPE data by support vector regression (SVR) to provide comprehensive models for the sequence motifs, and found that the SVR-based approach is more effective than a consensus-based method for predicting transcriptional activity. These results show that the DPR is a functionally important core promoter element that is widely used in human promoters. Notably, there appears to be a duality between the DPR and the TATA box, as many promoters contain one or the other element. More broadly, these findings show that functional DNA motifs can be identified by machine learning analysis of a comprehensive set of sequence variants. Analysis of human core promoters using HARPE, applied to the Downstream core Promoter Region and TATA-box.
A database of genome-wide annotations of regulatory sites. The predictions are based on Bayesian probabilistic analysis of a combination of input information, including experimentally determined binding sites reported in the literature, known sequence specificities of transcription factors, ChIP-chip and ChIP-seq data, and alignments of orthologous non-coding regions. Predictions were made using the PhyloGibbs, MotEvo, IRUS and ISMARA algorithms developed in their group, depending on the data available for each organism. Annotations can be viewed in a Gbrowse genome browser and can also be downloaded in flat file format.
Copy-number and point mutations form the basis for most evolutionary novelty through the process of gene duplication and divergence. While a plethora of genomic sequence data reveals the long-term fate of diverging coding sequences and their cis-regulatory elements, little is known about the early dynamics around the duplication event itself. In microorganisms, selection for increased gene expression often drives the expansion of gene copy-number mutations, which serves as a crude adaptation, prior to divergence through refining point mutations. Using a simple synthetic genetic system that allows us to distinguish copy-number and point mutations, we study their early and transient adaptive dynamics in real-time in Escherichia coli. We find two qualitatively different routes of adaptation depending on the level of functional improvement selected for: In conditions of high gene expression demand, the two types of mutations occur as a combination. Under low gene expression demand, negative...
MIT License https://opensource.org/licenses/MIT
Protein-Protein, Genetic, and Chemical Interactions for Meier D (2011): Fanconi anemia core complex gene promoters harbor conserved transcription regulatory elements. Curated by BioGRID (https://thebiogrid.org). ABSTRACT: The Fanconi anemia (FA) gene family is a recent addition to the complex network of proteins that respond to and repair certain types of DNA damage in the human genome. Since little is known about the regulation of this novel group of genes at the DNA level, we characterized the promoters of the eight genes (FANCA, B, C, E, F, G, L and M) that compose the FA core complex. The promoters of these genes show the characteristic attributes of housekeeping genes, such as a high GC content and CpG islands, a lack of TATA boxes and a low conservation. The promoters functioned in a monodirectional way and were, in their most active regions, comparable in strength to the SV40 promoter in our reporter plasmids. They were also marked by a distinctive transcriptional start site (TSS). In the 5' region of each promoter, we identified a region that was able to negatively regulate the promoter activity in HeLa and HEK 293 cells in isolation. The central and 3' regions of the promoter sequences harbor binding sites for several common and rare transcription factors, including STAT, SMAD, E2F, AP1 and YY1, which indicates that there may be cross-connections to several established regulatory pathways. Electrophoretic mobility shift assays and siRNA experiments confirmed the shared regulatory responses between the prominent members of the TGF-β and JAK/STAT pathways and members of the FA core complex. Although the promoters are not well conserved, they share region and sequence specific regulatory motifs and transcription factor binding sites (TFBSs), and we identified a bi-partite nature to these promoters. These results support a hypothesis based on the co-evolution of the FA core complex genes that was expanded to include their promoters.
https://spdx.org/licenses/CC0-1.0.html
Promoters regulate both the amplitude and pattern of gene expression, key factors needed for optimization of many synthetic biology applications. Previous work in Arabidopsis found that promoters that contain a TATA-box element tend to be expressed only under specific conditions or in particular tissues, while promoters which lack any known promoter elements, thus designated as Coreless, tend to be expressed more ubiquitously. To test whether this trend represents a conserved promoter design rule, we identified stably expressed genes across multiple angiosperm species using publicly available RNA-seq data. Comparisons between core promoter architectures and gene expression stability revealed differences in core promoter usage in monocots and eudicots. Furthermore, when tracing the evolution of a given promoter across species, we found that core promoter type was not a strong predictor of expression stability. Our analysis suggests that core promoter types are correlative rather than causative in promoter expression patterns and highlights the challenges in finding or building constitutive promoters that will work across diverse plant species.
Methods
RNA-seq dataset processing (Relevant files: 0_Slurm_Pipeline)
RNA-seq atlases were located in the NCBI Sequence Read Archive (SRA) database. The references for the datasets can be found in Supplemental Table S1. The individual datasets were retrieved using the sratoolkit-3.0.1 prefetch function followed by fasterq-dump. Fastqc-0.11.9 was used to generate a QC report for each dataset. Trimmomatic-0.39 was used for trimming adapters and low-quality ends with the following settings: ‘SLIDINGWINDOW:4:20 MINLEN:36’. For ILLUMINACLIP, the file TruSeq3-PE-2.fa was supplied for paired-end data and TruSeq3-SE.fa for single-end data.
Reference transcriptomes were downloaded from Ensembl Plants (http://plants.ensembl.org/index.html) for Arabidopsis thaliana, Camelina sativa, Cucumis melo, Glycine max, Phaseolus vulgaris, Pisum sativum, Vigna unguiculata, Sorghum bicolor, Zea mays, Solanum lycopersicum, Actinidia chinensis and Triticum aestivum, and from Phytozome (https://phytozome-next.jgi.doe.gov) for Arachis hypogaea, Cicer arietinum, and Solanum tuberosum (Cunningham et al., 2021; Goodstein et al., 2012). An index file was generated and the reads were aligned and counted using Kallisto-0.44.0 with ‘-o counts -b 500’. For single-end data, the fragment length and its standard deviation are required, but this information is difficult to locate, so default values of ‘-l 200 -s 20’ were used across the board. Fastqc was run again on the trimmed files, and a final MultiQC-1.13 run was performed on the entire folder encompassing all the log files that Fastqc, Trimmomatic, and Kallisto generated. The MultiQC report was inspected to ensure that the trimming step improved read quality and that there were no major warnings.
Normalizing counts, Calculating CV and Percent Ranking (Relevant files: 1_Metadata_from_RUNselector.Rmd, 2_MOR_Normalization.Rmd)
Using an R script, the raw counts for each species were normalized with the DESeq2 package, using a metadata file curated from the original study for each RNA-seq dataset. The coefficient of variation (CV) across all samples for a given atlas was used as a metric of stability for each gene, and the percentile ranking of each gene was calculated. The geometric mean for each gene was also calculated across all samples.
Extracting intergenic region and 5’UTR (Relevant files: 3_ExtractPromUTR(ALL_Transcripts).ipynb, 8_ExtractPromUTR(Orthologs).ipynb)
Gff3 annotation files and reference genomes were downloaded from Ensembl or Phytozome, depending on where the reference transcriptomes were retrieved from.
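As an illustration of the stability metric from the normalization step above (coefficient of variation per gene, followed by a percentile ranking), here is a minimal Python sketch. It is not the dataset's R script: the function names are hypothetical, the sample standard deviation is assumed, genes with a mean of zero are skipped because their CV is undefined, and the percentile is taken as the share of genes with a CV at or below the gene's own CV.

```python
from bisect import bisect_right
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Return stdev / mean, or None when the mean is 0 (CV undefined)."""
    m = mean(values)
    if m == 0:
        return None
    return stdev(values) / m


def stability_table(normalized_counts):
    """Map gene -> (CV, percentile rank of that CV among all genes).

    `normalized_counts` maps a gene ID to its normalized counts across
    all samples of one atlas. A low CV (and low percentile) means the
    gene is expressed more stably across samples.
    """
    cvs = {}
    for gene, values in normalized_counts.items():
        cv = coefficient_of_variation(values)
        if cv is not None:
            cvs[gene] = cv

    ordered = sorted(cvs.values())
    n = len(ordered)
    # bisect_right counts how many genes have a CV <= this gene's CV.
    return {
        gene: (cv, 100.0 * bisect_right(ordered, cv) / n)
        for gene, cv in cvs.items()
    }
```

Other percentile definitions (for example, ones that break ties differently) would shift the ranks slightly, so the exact values should not be expected to match those in the dataset.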
40% of transcripts were selected from the total transcriptome, and their intergenic regions and 5’UTRs were extracted from the Gff3 annotation. Intergenic regions and 5’UTRs of identified orthologs were extracted in a similar manner.
Labeling core promoter types (Relevant files: 4_Label_Promoters.Rmd, 9_Motif_Scan.Rmd, 10_Octamer_Scan.ipynb)
Motif Scan: Intergenic regions and 5’UTR sequences were trimmed to only the regions to be scanned for each core promoter type: TATA box (-100 to TSS), Y patch (-100 to +100), and Inr (-10 to +10). Intergenic regions shorter than 100 bp were excluded from analysis. Each region was scanned for its respective motifs using the motif files and methods outlined in Jores et al. (2021). A motif was considered to be present when its relative motif score was above 0.85.
Octamer Scan: Intergenic regions and 5’UTR sequences were trimmed based on the positions relative to the TSS outlined in Yamamoto et al. 2009 (TATA, −45 to −18; Y Patch, −50 to +50; CA, −35 to −1; GA, −35 to +75). Each region was scanned for the presence of octamer motifs from the TATA, Y patch, GA, and CA lists outlined in Yamamoto et al. 2009. If the specified region contained at least one motif for a given promoter type, it was labeled as positive.
Ortholog Analysis (Relevant files: 5_At_gene_ranking.Rmd, 6_Identifying_orthologs.Rmd, 7_Processing_orthologs.Rmd)
The Arabidopsis transcriptome was filtered to include only primary transcripts, and mitochondrial as well as chloroplast transcripts were removed. The top 5% and bottom 5% of genes ranked by stability (CV) and a random set of 1343 genes (5%) were selected. Using biomaRt in R, the Ensembl and Phytozome databases were queried for orthologs of the selected set of Arabidopsis genes in each species (Durinck et al., 2009). Orthologs from Arachis hypogaea, Cicer arietinum, and Solanum tuberosum were retrieved from Phytozome, and those of the remaining species from Ensembl.
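The octamer scan described above can be sketched roughly as follows. The window coordinates are the ones quoted in the text; everything else is an assumption made for illustration: the function names, the convention that the TSS is position +1 with no position 0, the clipping of windows at the sequence ends, and the example octamers in the usage note (the actual octamer lists from Yamamoto et al. 2009 are not reproduced here).

```python
# Scan windows relative to the TSS, as quoted in the text.
WINDOWS = {
    "TATA": (-45, -18),
    "Y_patch": (-50, 50),
    "CA": (-35, -1),
    "GA": (-35, 75),
}


def rel_to_index(pos, tss_index):
    """Convert a TSS-relative position to a 0-based string index.

    Assumes the TSS base is +1 and that there is no position 0.
    """
    if pos == 0:
        raise ValueError("position 0 is undefined when the TSS is +1")
    return tss_index + pos if pos < 0 else tss_index + pos - 1


def extract_window(seq, tss_index, start, end):
    """Return the inclusive window [start, end], clipped to the sequence."""
    lo = max(0, rel_to_index(start, tss_index))
    hi = min(len(seq) - 1, rel_to_index(end, tss_index))
    return seq[lo:hi + 1]


def contains_octamer(region, octamers):
    """True if any 8-mer in `region` is in the set `octamers`."""
    return any(region[i:i + 8] in octamers for i in range(len(region) - 7))


def label_promoter(seq, tss_index, octamer_lists):
    """Label a promoter as positive or negative for each promoter type.

    `octamer_lists` maps a type name from WINDOWS to a set of octamers.
    """
    seq = seq.upper()
    labels = {}
    for name, octamers in octamer_lists.items():
        start, end = WINDOWS[name]
        region = extract_window(seq, tss_index, start, end)
        labels[name] = contains_octamer(region, octamers)
    return labels
```

A call such as `label_promoter(seq, 60, {"TATA": {"TATAAATA"}})` uses a made-up single-octamer list; in practice each set would hold the full published list for that promoter type.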
For the analysis in Figure 3B, significance was tested by ANOVA followed by Tukey’s HSD. For each target gene that matched an Arabidopsis transcript, only the highest-expressing transcript was kept. If an Arabidopsis transcript retrieved more than one ortholog from a target species, these pairs of orthologs were removed from the analysis. We only kept orthologous gene groups that had a “change” in expression pattern, defined as crossing the 50th percentile CV, in two target species, and the remaining candidates were manually mapped onto the phylogenetic tree to identify gene groups whose changes in expression pattern are consistent with the tree, meaning that the changes are mostly found in the same clade. Gene trees were built for these candidates using blast-align-tree (https://github.com/steinbrennerlab/blast-align-tree), and the candidate lists were further trimmed based on the gene trees to ensure a 1:1 relationship between all members in the gene group. The dataset contains all the necessary scripts to transform the data as described in the manuscript and perform the analysis in the paper.
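Two of the ortholog filters described above can be sketched as follows. This is an illustrative reading, not the dataset's R code: the function names are hypothetical, and the phrase "in two target species" is interpreted here as "in at least `min_species` target species", with `min_species=2` as an assumed default.

```python
from collections import Counter


def keep_one_to_one(pairs):
    """Drop Arabidopsis transcripts with more than one ortholog.

    `pairs` is a list of (arabidopsis_id, target_id) tuples for a single
    target species. All pairs of a one-to-many transcript are removed.
    """
    counts = Counter(arabidopsis_id for arabidopsis_id, _ in pairs)
    return [(a, t) for a, t in pairs if counts[a] == 1]


def crosses_threshold(ref_percentile, target_percentile, threshold=50.0):
    """True if the two CV percentiles lie on opposite sides of the threshold."""
    return (ref_percentile < threshold) != (target_percentile < threshold)


def has_pattern_change(ref_percentile, target_percentiles, min_species=2):
    """True if enough target species cross the 50th percentile CV.

    `target_percentiles` maps species name -> CV percentile of the ortholog.
    """
    changed = sum(
        crosses_threshold(ref_percentile, pct)
        for pct in target_percentiles.values()
    )
    return changed >= min_species
```

The later steps in the text (manual mapping onto the phylogenetic tree and trimming by gene trees) involve judgment and external tools and are not covered by this sketch.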
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
3000 nt promoter alignments for the DoOP database, chordate section, v1.4.
FULL-malaria is a database for a full-length-enriched cDNA library from the human malaria parasite Plasmodium falciparum. Because of its medical importance, this organism is the first target for genome sequencing of a eukaryotic pathogen; the sequences of two of its 14 chromosomes have already been determined. However, for the full exploitation of this rapidly accumulating information, correct identification of the genes and study of their expression are essential. Using the oligo-capping method, this database has produced a full-length-enriched cDNA library from erythrocytic stage parasites and performed one-pass reading. The database consists of nucleotide sequences of 2490 random clones that include 390 (16%) known malaria genes according to BLASTN analysis of the nr-nt database in GenBank; these represent 98 genes, and the clones for 48 of these genes contain the complete protein-coding sequence (49%). On the other hand, comparisons with the complete chromosome 2 sequence revealed that 35 of 210 predicted genes are expressed, and in addition led to detection of three new gene candidates that were not previously known. In total, 19 of these 38 clones (50%) were full-length. From these observations, it is expected that the database contains approximately 1000 genes, including 500 full-length clones. It should be an invaluable resource for the development of vaccines and novel drugs. FULL-malaria has been updated in at least three respects. (i) 8934 sequences generated from new libraries were added, so that the database collection of 11,424 full-length cDNAs covers 1375 (25%) of the estimated 5409 parasite genes. (ii) All of its full-length cDNAs and GenBank EST sequences were mapped to genomic sequences together with publicly available annotated genes and other predictions. 
This precisely determined the gene structures and positions of the transcriptional start sites, which are indispensable for the identification of the promoter regions. (iii) A total of 4257 cDNA sequences were newly generated from the murine malaria parasite Plasmodium yoelii yoelii. The genome/cDNA sequences were compared with those of P. falciparum at both the nucleotide and amino acid levels, and the sequence alignment for each gene is presented graphically. This part of the database serves as a versatile platform to elucidate the function(s) of malaria genes by a comparative genomic approach. It should also be noted that all of the cDNAs represented in this database are supported by physical cDNA clones, which are publicly and freely available, and should serve as indispensable resources for functional analyses of malaria genomes. Sponsors: This database has been constructed and maintained with a Grant-in-Aid for Publication of Scientific Research Results from the Japan Society for the Promotion of Science (JSPS). This work was also supported by Special Coordination Funds for Promoting Science and Technology from the Science and Technology Agency of Japan (STA) and a Grant-in-Aid for Scientific Research on Priority Areas from the Ministry of Education, Science, Sports and Culture of Japan.
Database for conserved sequence motifs identified by genome scale motif discovery, similarity, clustering, co-occurrence and coexpression calculations. Sequence inputs include low-coverage genome sequence data and ENCODE data. The database offers information on atomic motifs, motif groups and patterns. In promoter-based cisRED databases, sequence search regions for motif discovery extend from 1.5 kb upstream to 200 bp downstream of a transcription start site, net of most types of repeats and of coding exons. Many transcription factor binding sites are located in such regions. For each target gene's search region, a base set of probabilistic ab initio discovery tools is used, in parallel, to find over-represented atomic motifs. Discovery methods use comparative genomics with over 40 vertebrate input genomes. In ChIP-seq-based cisRED databases, sequence search regions for motif discovery correspond to significant peaks that represent genome-wide sites of protein-DNA binding. Because such peaks occur in a wide range of genic and intergenic locations, ChIP-seq and promoter-based databases are complementary. Currently, motif discovery for ChIP-seq data uses scan-based approaches that make more explicit use of sets of sequences known to be functional transcription factor binding sites, and that consider a wide range of levels of conservation. For the human STAT1 ChIP-seq database, search regions in the target species (human) were selected as +/- 300 bp around the ChIP-seq peak maximum. Repeats and coding regions were masked. Multiple sequence alignments were used to assemble orthologous input sequences from other species.
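The two kinds of search region described in this entry can be expressed as simple coordinate calculations. The sketch below is only an illustration under stated assumptions: the function names are hypothetical, coordinates are 1-based and inclusive on the forward strand, the TSS base is not counted toward either flank, and windows are clipped at position 1. Masking of repeats and coding exons is not shown.

```python
def promoter_search_region(tss, strand, upstream=1500, downstream=200):
    """Return (start, end) of the promoter search region around a TSS.

    For a '+' gene, upstream means lower coordinates; for a '-' gene,
    upstream means higher coordinates.
    """
    if strand == "+":
        start, end = tss - upstream, tss + downstream
    elif strand == "-":
        start, end = tss - downstream, tss + upstream
    else:
        raise ValueError("strand must be '+' or '-'")
    return max(1, start), end


def peak_search_region(peak_max, flank=300):
    """Return (start, end) of the window of +/- `flank` bp around a peak maximum."""
    return max(1, peak_max - flank), peak_max + flank
```

Clipping at the end of a chromosome would additionally need the chromosome length, which is left out here to keep the example short.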