100+ datasets found

Eukaryotic Promoter Database
bioregistry.io
Updated Dec 12, 2021
Cite
(2021). Eukaryotic Promoter Database [Dataset]. https://bioregistry.io/epd
Explore at:
Dataset updated
Dec 12, 2021
Description
The Eukaryotic Promoter Database (EPD) is an annotated non-redundant collection of eukaryotic POL II promoters, for which the transcription start site has been determined experimentally. Access to promoter sequences is provided by pointers to positions in nucleotide sequence entries. The annotation part of an entry includes description of the initiation site mapping data, cross-references to other databases, and bibliographic references. EPD is structured in a way that facilitates dynamic extraction of biologically meaningful promoter subsets for comparative sequence analysis.
Eukaryotic Promoter Database
neuinfo.org
Updated Jan 29, 2022
Cite
(2022). Eukaryotic Promoter Database [Dataset]. http://identifiers.org/RRID:SCR_002132
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_002132
Dataset updated
Jan 29, 2022
Description
Collection of eukaryotic promoters derived from published articles. Annotated non-redundant collection of eukaryotic POL II promoters for which the transcription start site has been determined experimentally. Access to promoter sequences is provided by pointers to positions in nucleotide sequence entries. EPD is structured in a way that facilitates dynamic extraction of biologically meaningful promoter subsets for comparative sequence analysis.
SCPD - Saccharomyces cerevisiae promoter database
dknet.org
Updated Jan 29, 2022
Cite
(2022). SCPD - Saccharomyces cerevisiae promoter database [Dataset]. http://identifiers.org/RRID:SCR_004412
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_004412
Dataset updated
Jan 29, 2022
Description
A promoter database of Saccharomyces cerevisiae. Users can explore the promoter regions of ~6000 genes and ORFs in the yeast genome, annotate putative regulatory sites of all genes and ORFs, locate intergenic regions, and retrieve the sequence of a promoter region. With regard to regulatory elements and transcription factors, users can provide information on transcriptionally related genes, browse matrix and consensus sequences, view the correlation between elements, observe binding affinity and expression, and look at genome-wide distribution. SCPD also provides some simple but useful tools for promoter sequence analysis. Gene, consensus and matrix records may be submitted.
PlantProm DB
rrid.site
Updated Dec 22, 2019
Cite
(2019). PlantProm DB [Dataset]. http://identifiers.org/RRID:SCR_003359
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_003359
Dataset updated
Dec 22, 2019
Description
Annotated, non-redundant database of proximal promoter sequences for RNA polymerase II with experimentally determined transcription start site(s) (TSS) from various plant species. It contains 578 unrelated entries including 151, 396 and 31 promoters with experimentally verified TSS from monocot, dicot and other plants, respectively (April 2014). This DB presents the published promoter sequences with TSS(s) determined by direct experimental approaches and therefore serves as the most accurate source for development of computational promoter prediction tools.
ProkBERT datasets
zenodo.org
bin, bz2
Updated Oct 31, 2024
Cite
Balázs Ligeti (2024). ProkBERT datasets [Dataset]. http://doi.org/10.5281/zenodo.10057832
Explore at:
Available download formats: bz2, bin
Unique identifier
https://doi.org/10.5281/zenodo.10057832
Dataset updated
Oct 31, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Balázs Ligeti
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Datasets for ProkBERT

This repository contains the training, validation, and testing datasets used in our research for ProkBERT, optimized for microbiome studies.

There are 4 different datasets:

ESKAPE genomic features

Bacterial promoter database

Phage training, test and evaluation datasets

ESKAPE masked sequences dataset

Description

The datasets support the development and evaluation of ProkBERT models. They include raw sequence data in compressed TSV format and tokenized datasets in compressed HDF format, using various k-mer sizes and shift values.

ESKAPE genomic features

filename: eskape_genomic_features.tsv.bz2

This dataset includes genomic segments from ESKAPE pathogens, characterized by various genomic features such as coding sequences (CDS), intergenic regions, ncRNA, and pseudogenes. It was analyzed to understand the representations captured by models like ProkBERT-mini, ProkBERT-mini-c, and ProkBERT-mini-long.

Data Fields

contig_id: Identifier of the contig.

segment_id: Unique identifier for each genomic segment.

strand: DNA strand of the segment (+ or -).

seq_start: Starting position of the segment in the contig.

seq_end: Ending position of the segment in the contig.

segment_start: Starting position of the segment in the sequence.

segment_end: Ending position of the segment in the sequence.

label: Genomic feature category (e.g., CDS, intergenic).

segment_length: Length of the genomic segment.

segment: Genomic sequence of the segment.

For a more detailed description, please visit: https://huggingface.co/datasets/neuralbioinfo/ESKAPE-genomic-features
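The raw data are described as compressed TSV, so a file such as eskape_genomic_features.tsv.bz2 can probably be streamed without unpacking it first. The following is a minimal sketch using only the Python standard library. It assumes the file is tab-separated with a header row whose column names match the fields listed above; the helper names `read_bz2_tsv` and `count_labels` are illustrative and not part of the ProkBERT tooling.

```python
import bz2
import csv
from collections import Counter


def read_bz2_tsv(path):
    """Yield one dict per data row of a bz2-compressed, tab-separated file.

    Assumption: the first line is a header row naming the columns.
    """
    with bz2.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        yield from csv.DictReader(handle, delimiter="\t")


def count_labels(path, column="label"):
    """Count how many rows fall into each category of the given column."""
    return Counter(row[column] for row in read_bz2_tsv(path))
```

Calling `count_labels("eskape_genomic_features.tsv.bz2")` would then give a per-category row count (CDS, intergenic, and so on), provided the header assumption holds.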

PROMOTER dataset

filename: bacterial_promoter_db.tsv.bz2

Data collection and processing

Data source: The positive samples, known promoters, are primarily drawn from the Prokaryotic Promoter Database (PPD), containing experimentally validated promoter sequences from 75 organisms. Non-promoter sequences are obtained from the NCBI RefSeq database, sampled specifically from CDS regions.

Preprocessing: The dataset includes non-promoter sequences constructed via higher-order and zero-order Markov chains, which mirror compositional characteristics of known promoters. An independent test set based on E. coli sigma70 promoters is also included.
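To illustrate what a zero-order Markov background means here: each base is drawn independently, with probabilities equal to the base frequencies observed in the reference sequences. The sketch below is a generic illustration of that idea, not the authors' actual preprocessing code; the function name `zero_order_background` and the choice to ignore non-ACGT symbols are assumptions.

```python
import random
from collections import Counter


def zero_order_background(sequences, length, rng=None):
    """Sample a synthetic sequence from a zero-order Markov model.

    Each base is drawn independently, weighted by its frequency across
    the input sequences. Symbols other than A, C, G, T are ignored
    (an assumption made for this sketch).
    """
    rng = rng or random.Random()
    counts = Counter(
        base for seq in sequences for base in seq.upper() if base in "ACGT"
    )
    if not counts:
        raise ValueError("no A/C/G/T characters found in the input sequences")
    bases = sorted(counts)
    weights = [counts[base] for base in bases]
    return "".join(rng.choices(bases, weights=weights, k=length))
```

A higher-order model would condition each base on the preceding k bases instead of sampling independently; that extension is not shown.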

Dataset structure

Dataset splits: The dataset is systematically divided into training, validation, and test subsets.

Data fields:

segment_id: Unique identifier for each segment.

ppd_original_SpeciesName: Original species name from the PPD.

Strand: The strand of the DNA sequence.

segment: The DNA sequence of the promoter region.

label: The label indicating whether the sequence is a promoter or non-promoter.

L: Length of the DNA sequence.

prom_class: The class of the promoter.

y: Binary label indicating the presence of a promoter.

Dataset splits

Training set: Primary dataset used for model training.

Test set (Sigma70): Independent test set focusing on E. coli sigma70 promoters.

Multispecies set: Additional test set including various species, ensuring generalization across different organisms.

ESKAPE masked sequences dataset

filename: eskape_masking_dataset.tsv.bz2

Dataset description

This dataset was used to evaluate different models on the masking exercise, measuring how well the different models can recover the original character.

Dataset overview

The dataset is compiled from the RefSeq database and other sources, focusing on ESKAPE pathogens. The genomic features were sampled randomly, followed by contiguous segmentation. This dataset contains various segments with lengths: [128, 256, 512, 1024]. The segments were randomly selected, and one of the characters was replaced by '*' (masked_segment column) to create a masking task. The reference_segment contains the original, non-replaced nucleotides. We performed 10,000 maskings per set, with a maximum of 2,000 genomic features. Only the genomic features: 'CDS', 'intergenic', 'pseudogene', and 'ncRNA' were considered.
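The masking step described above can be pictured with a short sketch. It replaces one randomly chosen character with '*' and returns the masked string together with the chosen position. The function name `mask_one_position` and the use of a 0-based index are assumptions; the indexing convention of the dataset's position_to_mask column is not stated in the description.

```python
import random


def mask_one_position(segment, rng=None, mask_char="*"):
    """Replace one randomly chosen character of `segment` with `mask_char`.

    Returns (masked_segment, position), where position is a 0-based index
    (an assumption of this sketch).
    """
    if not segment:
        raise ValueError("segment must not be empty")
    rng = rng or random.Random()
    position = rng.randrange(len(segment))
    masked = segment[:position] + mask_char + segment[position + 1:]
    return masked, position
```

A model is then scored on whether it recovers `segment[position]` from the masked string.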

Dataset Structure

Data Fields:

reference_segment_id: A mapping of segment identifiers to their respective reference IDs in the database.

masked_segment: The DNA sequence of the segment with certain positions masked (marked with '*') for prediction or testing purposes.

position_to_mask: The specific position(s) in the sequence that have been masked, indicated by index numbers.

masked_segment_id: Unique identifiers assigned to the masked segments. (unique only with respect to length)

contig_id: Identifier of the contig to which the segment belongs.

segment_id: Unique identifier for each genomic segment (same as reference segment id).

strand: The DNA strand of the segment, indicated as '+' (positive) or '-' (negative).

seq_start: Starting position of the segment within the contig.

seq_end: Ending position of the segment within the contig.

segment_start: Starting position of the genomic segment in the sequence.

segment_end: Ending position of the genomic segment in the sequence.

label: Category label for the genomic segment (e.g., 'CDS', 'intergenic').

segment_length: The length of the genomic segment.

original_segment: The original genomic sequence without any masking.

PHAGE dataset description

We assembled a phage sequence database from RefSeq and other sources, refining it to reduce redundancy and ensure balance between phage and bacterial sequences. The final dataset targets important bacterial genera, aiding in understanding phage-host interactions and their implications for health.

Data file naming conventions

For tokenized datasets:

Pattern: RND_balanced_(test|val|train)_Ls(\d+)_k(\d+)s(\d+)\.h5\.bz2

Matches file names that encode the split (test, validation, or training), segment length, k-mer size, and shift value.

For sampled raw data:

Pattern: RND_balanced_(test|val|train)_Ls(\d+)\.tsv\.bz2

Matches file names that encode the split (test, validation, or training) and segment length.
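The two naming patterns above can be applied directly with Python's `re` module. This is a minimal sketch: the regular expressions are copied from the description, while the function name `parse_dataset_filename`, the returned dictionary keys, and the example file names in the usage note are illustrative assumptions.

```python
import re

# Patterns as given in the dataset description.
TOKENIZED = re.compile(r"RND_balanced_(test|val|train)_Ls(\d+)_k(\d+)s(\d+)\.h5\.bz2")
RAW = re.compile(r"RND_balanced_(test|val|train)_Ls(\d+)\.tsv\.bz2")


def parse_dataset_filename(filename):
    """Return the metadata encoded in a file name, or None if it does not match."""
    match = TOKENIZED.fullmatch(filename)
    if match:
        split, length, kmer, shift = match.groups()
        return {
            "kind": "tokenized",
            "split": split,
            "segment_length": int(length),
            "kmer": int(kmer),
            "shift": int(shift),
        }
    match = RAW.fullmatch(filename)
    if match:
        split, length = match.groups()
        return {"kind": "raw", "split": split, "segment_length": int(length)}
    return None
```

For a hypothetical name such as RND_balanced_train_Ls512_k6s1.h5.bz2, the function reports a tokenized training file with segment length 512, k-mer size 6, and shift 1.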

Data fields

segment_id: Unique identifier for each genomic segment.

contig_id: Identifier for the contig from which the segment is derived.

segment_start: Start position of the segment in the contig.

segment_end: End position of the segment in the contig.

L: Length of the genomic segment (512, 1024, or 2048).

segment: The genomic sequence of the segment.

label: Classification label (e.g., 'phage').

y: Binary label (1 for phage, 0 for non-phage).

Usage

These datasets are for academic use. Reference our paper when using them.

Contact information

For any questions, feedback, or contributions regarding the datasets or ProkBERT, please feel free to reach out:

Name: Balázs Ligeti

Email: obalasz@gmail.com

We welcome your input and collaboration to improve our resources and research.

Citation

@Article{ProkBERT2024,
  author  = {Ligeti, Balázs and Szepesi-Nagy, István and Bodnár, Babett and Ligeti-Nagy, Noémi and Juhász, János},
  journal = {Frontiers in Microbiology},
  title   = {{ProkBERT} family: genomic language models for microbiome applications},
  year    = {2024},
  volume  = {14},
  URL     = {https://www.frontiersin.org/articles/10.3389/fmicb.2023.1331233},
  DOI     = {10.3389/fmicb.2023.1331233}
}
DB narG promoter (promoter) Sequence Data
biocomplete.it
text/x-fasta
Updated Oct 24, 2025
Cite
(2025). DB narG promoter (promoter) Sequence Data [Dataset]. https://biocomplete.it/sequences/39272/sequence
Explore at:
Available download formats: text/x-fasta
Dataset updated
Oct 24, 2025
Measurement technique
DNA sequencing
Description
DNA sequence and relationships for DB narG promoter (promoter)
5 prime end Serial Analysis of Gene Expression Database
neuinfo.org
Updated Oct 11, 2025
Cite
(2025). 5 prime end Serial Analysis of Gene Expression Database [Dataset]. http://identifiers.org/RRID:SCR_001680
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_001680
Dataset updated
Oct 11, 2025
Description
THIS RESOURCE IS NO LONGER IN SERVICE, documented on October 30, 2012. A database that displays the observed frequencies of individual 5' end SAGE tags and previously unknown transcription start sites in the promoter regions, introns and intergenic regions of known genes. 5'SAGE will be useful for analyzing promoter regions and start site variation in different tissues, and is freely available.
Data from: Assessing the Effects of Data Selection and Representation on the...
datasetcatalog.nlm.nih.gov
Updated Mar 24, 2015
Cite
Abbas, Mostafa M.; EL-Manzalawy, Yasser; Mohie-Eldin, Mostafa M. (2015). Assessing the Effects of Data Selection and Representation on the Development of Reliable E. coli Sigma 70 Promoter Region Predictors [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001911563
Explore at:
Dataset updated
Mar 24, 2015
Authors
Abbas, Mostafa M.; EL-Manzalawy, Yasser; Mohie-Eldin, Mostafa M.
Description
As the number of sequenced bacterial genomes increases, the need for rapid and reliable tools for annotating functional elements (e.g., transcriptional regulatory elements) becomes more pressing. Promoters are the key regulatory elements, which recruit the transcriptional machinery through binding to a variety of regulatory proteins (known as sigma factors). Identifying promoter regions is very challenging because these regions do not adhere to specific sequence patterns or motifs and are difficult to determine experimentally. Machine learning represents a promising and cost-effective approach for computational identification of prokaryotic promoter regions. However, the quality of the predictors depends on several factors, including: i) training data; ii) data representation; iii) classification algorithms; iv) evaluation procedures. In this work, we create several variants of E. coli promoter data sets and use them to experimentally examine the effect of these factors on the predictive performance of E. coli σ70 promoter models. Our results suggest that under some combinations of the first three factors, a prediction model might perform very well in cross-validation experiments while performing drastically worse on independent test data. This emphasizes the importance of evaluating promoter region predictors using independent test data, which corrects for the over-optimistic performance that might be estimated using the cross-validation procedure. Our analysis of the tested models shows that good prediction models often perform well regardless of how the non-promoter data was obtained. On the other hand, poor prediction models seem to be more sensitive to the choice of non-promoter sequences. Interestingly, the best performing sequence-based classifiers outperform the best performing structure-based classifiers in both cross-validation and independent test performance evaluation experiments. Finally, we propose a meta-predictor method combining two top performing sequence-based and structure-based classifiers and compare its performance with some of the state-of-the-art E. coli σ70 promoter prediction methods.
DataBase of Tunicate Gene Regulation
neuinfo.org
Updated Jan 29, 2022
Cite
(2022). DataBase of Tunicate Gene Regulation [Dataset]. http://identifiers.org/RRID:SCR_007620
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_007620
Dataset updated
Jan 29, 2022
Description
DBTGR provides information on tunicate gene regulation, such as the location of expression, or the identified regulatory elements present in promoter sequences. The database also contains the promoters of homologous genes in multiple species to allow identification of conserved cis elements.
DoOP chordate v1.4 dataset, sequence data for 500 nt promoter regions
datasetcatalog.nlm.nih.gov
Updated Dec 2, 2015
Cite
Sebestyén, Endre; Tóth, Gábor; Barta, Endre; Pálfy, Tamás B; Nagy, Tibor (2015). DoOP chordate v1.4 dataset, sequence data for 500 nt promoter regions [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001928916
Explore at:
Dataset updated
Dec 2, 2015
Authors
Sebestyén, Endre; Tóth, Gábor; Barta, Endre; Pálfy, Tamás B; Nagy, Tibor
Description
500 nt promoter sequences for the DoOP database, chordate section, v1.4.
Full-Malaria: Malaria Full-Length cDNA Database
neuinfo.org
Updated Jun 28, 2024
Cite
(2024). Full-Malaria: Malaria Full-Length cDNA Database [Dataset]. http://identifiers.org/RRID:SCR_002348
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_002348
Dataset updated
Jun 28, 2024
Description
FULL-malaria is a database for a full-length-enriched cDNA library from the human malaria parasite Plasmodium falciparum. Because of its medical importance, this organism is the first target for genome sequencing of a eukaryotic pathogen; the sequences of two of its 14 chromosomes have already been determined. However, for the full exploitation of this rapidly accumulating information, correct identification of the genes and study of their expression are essential. Using the oligo-capping method, a full-length-enriched cDNA library was produced from erythrocytic-stage parasites and subjected to one-pass reading. The database consists of nucleotide sequences of 2490 random clones that include 390 (16%) known malaria genes according to BLASTN analysis of the nr-nt database in GenBank; these represent 98 genes, and the clones for 48 of these genes contain the complete protein-coding sequence (49%). On the other hand, comparisons with the complete chromosome 2 sequence revealed that 35 of 210 predicted genes are expressed, and in addition led to detection of three new gene candidates that were not previously known. In total, 19 of these 38 clones (50%) were full-length. From these observations, it is expected that the database contains approximately 1000 genes, including 500 full-length clones. It should be an invaluable resource for the development of vaccines and novel drugs.

Full-malaria has been updated in at least three respects. (i) 8934 sequences generated from new libraries were added, so that the database collection of 11,424 full-length cDNAs covers 1375 (25%) of the estimated 5409 parasite genes. (ii) All of its full-length cDNAs and GenBank EST sequences were mapped to genomic sequences together with publicly available annotated genes and other predictions. This precisely determined the gene structures and positions of the transcriptional start sites, which are indispensable for the identification of the promoter regions. (iii) A total of 4257 cDNA sequences were newly generated from the murine malaria parasite Plasmodium yoelii yoelii. The genome/cDNA sequences were compared at both nucleotide and amino acid levels with those of P. falciparum, and the sequence alignment for each gene is presented graphically. This part of the database serves as a versatile platform to elucidate the function(s) of malaria genes by a comparative genomic approach. It should also be noted that all of the cDNAs represented in this database are supported by physical cDNA clones, which are publicly and freely available, and should serve as indispensable resources for functional analyses of malaria genomes.

Sponsors: This database has been constructed and maintained with a Grant-in-Aid for Publication of Scientific Research Results from the Japan Society for the Promotion of Science (JSPS). This work was also supported by Special Coordination Funds for Promoting Science and Technology from the Science and Technology Agency of Japan (STA) and a Grant-in-Aid for Scientific Research on Priority Areas from the Ministry of Education, Science, Sports and Culture of Japan.
consensus elements (promoter) Sequence Data
biocomplete.it
text/x-fasta
Updated Oct 22, 2025
Cite
(2025). consensus elements (promoter) Sequence Data [Dataset]. https://biocomplete.it/sequences/77062/sequence
Explore at:
Available download formats: text/x-fasta
Dataset updated
Oct 22, 2025
Measurement technique
DNA sequencing
Description
DNA sequence and relationships for consensus elements (promoter)
Data from: Classification of Promoters Based on the Combination of Core...
datasetcatalog.nlm.nih.gov
Updated Mar 30, 2016
Cite
Mamitsuka, Hiroshi; Natsume-Kitatani, Yayoi (2016). Classification of Promoters Based on the Combination of Core Promoter Elements Exhibits Different Histone Modification Patterns [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001547836
Explore at:
Dataset updated
Mar 30, 2016
Authors
Mamitsuka, Hiroshi; Natsume-Kitatani, Yayoi
Description
Four different histones (H2A, H2B, H3, and H4; two subunits each) constitute a histone octamer, around which DNA wraps to form histone-DNA complexes called nucleosomes. Amino acid residues in each histone are occasionally modified, resulting in several biological effects, including differential regulation of transcription. Core promoters that encompass the transcription start site have well-conserved DNA motifs, including the initiator (Inr), TATA box, and DPE, which are collectively called the core promoter elements (CPEs). In this study, we systematically studied the associations between the CPEs and histone modifications by integrating the Drosophila Core Promoter Database and time-series ChIP-seq data for histone modifications (H3K4me3, H3K27ac, and H3K27me3) during development in Drosophila melanogaster via the modENCODE project. We classified 96 core promoters into four groups based on the presence or absence of the TATA box or DPE, calculated the histone modification ratio at the core promoter region, and transcribed region for each core promoter. We found that the histone modifications in TATA-less groups were static during development and that the core promoters could be clearly divided into three types: i) core promoters with continuous active marks (H3K4me3 and H3K27ac), ii) core promoters with a continuous inactive mark (H3K27me3) and occasional active marks, and iii) core promoters with occasional histone modifications. Linear regression analysis and non-linear regression by random forest showed that the TATA-containing groups included core promoters without histone modifications, for which the measured RNA expression values were not predictable accurately from the histone modification status. DPE-containing groups had a higher relative frequency of H3K27me3 in both the core promoter region and transcribed region. 
In summary, our analysis showed that there was a systematic link between the existence of the CPEs and the dynamics, frequency and influence on transcriptional activity of histone modifications.
Data from: ABS: A Database of Annotated Regulatory Binding Sites From...
dknet.org
Updated Jan 29, 2022
Cite
(2022). ABS: A Database of Annotated Regulatory Binding Sites From Orthologous Promoters [Dataset]. http://identifiers.org/RRID:SCR_002276
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_002276
Dataset updated
Jan 29, 2022
Description
Public database of known binding sites identified in promoters of orthologous vertebrate genes that have been manually curated from the literature. We have annotated 650 experimental binding sites from 68 transcription factors and 100 orthologous target genes in human, mouse, rat or chicken genome sequences. Computational predictions and promoter alignment information are also provided for each entry. For each gene, TFBSs conserved in orthologous sequences from at least two different species must be available. Promoter sequences as well as the original GenBank or RefSeq entries are additionally supplied in case of future identification conflicts. The final TSS annotation has been refined using the database dbTSS. Up to this release, the 500 bp upstream of the annotated transcription start site (TSS), according to RefSeq annotations, have always been extracted to form the collection of promoter sequences from human, mouse, rat and chicken. For each regulatory site, the position, the motif and the sequence in which the site is present are available in a simple format. Cross-references to EntrezGene, PubMed and RefSeq are also provided for each annotation. Apart from the experimental promoter annotations, predictions by popular collections of weight matrices are also provided for each promoter sequence. In addition, global and local alignments and graphical dotplots are also available.
DoOP chordate v1.4 dataset, sequence data for 1000 nt promoter regions
figshare.com
application/gzip
Updated Jan 20, 2016
Cite
Endre Sebestyén; Tamás B Pálfy; Tibor Nagy; Gábor Tóth; Endre Barta (2016). DoOP chordate v1.4 dataset, sequence data for 1000 nt promoter regions [Dataset]. http://doi.org/10.6084/m9.figshare.1615052.v1
Explore at:
Available download formats: application/gzip
Unique identifier
https://doi.org/10.6084/m9.figshare.1615052.v1
Dataset updated
Jan 20, 2016
Dataset provided by
Figshare (http://figshare.com/)
Authors
Endre Sebestyén; Tamás B Pálfy; Tibor Nagy; Gábor Tóth; Endre Barta
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
1000 nt promoter sequences for the DoOP database, chordate section, v1.4.
Full-Malaria: Malaria Full-Length cDNA Database
dknet.org
Updated Jul 3, 2024
Cite
(2024). Full-Malaria: Malaria Full-Length cDNA Database [Dataset]. http://identifiers.org/RRID:SCR_002348
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_002348
Dataset updated
Jul 3, 2024
The relative importance of each histone modification in determining RNA...
figshare.com
xls
Updated Jun 2, 2023
Cite
Yayoi Natsume-Kitatani; Hiroshi Mamitsuka (2023). The relative importance of each histone modification in determining RNA expression levels was calculated using the regression equations obtained by linear regression analysis using the LMG method [12]. [Dataset]. http://doi.org/10.1371/journal.pone.0151917.t001
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0151917.t001
Dataset updated
Jun 2, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Yayoi Natsume-Kitatani; Hiroshi Mamitsuka
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Values were normalized such that the sum of all values was 1.
A Computational Framework for Identifying Promoter Sequences in Nonmodel...
datasetcatalog.nlm.nih.gov
Updated May 14, 2021
Cite
Sarfatis, M. Claire; Wilson, Erin H.; Beck, David A. C.; Groom, Joseph D.; Lidstrom, Mary E.; Ford, Stephanie M. (2021). A Computational Framework for Identifying Promoter Sequences in Nonmodel Organisms Using RNA-seq Data Sets [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000919400
Explore at:
Dataset updated
May 14, 2021
Authors
Sarfatis, M. Claire; Wilson, Erin H.; Beck, David A. C.; Groom, Joseph D.; Lidstrom, Mary E.; Ford, Stephanie M.
Description
Engineering microorganisms into biological factories that convert renewable feedstocks into valuable materials is a major goal of synthetic biology; however, for many nonmodel organisms, we do not yet have the genetic tools, such as suites of strong promoters, necessary to effectively engineer them. In this work, we developed a computational framework that can leverage standard RNA-seq data sets to identify sets of constitutive, strongly expressed genes and predict strong promoter signals within their upstream regions. The framework was applied to a diverse collection of RNA-seq data measured for the methanotroph Methylotuvimicrobium buryatense 5GB1 and identified 25 genes that were constitutively, strongly expressed across 12 experimental conditions. For each gene, the framework predicted short (27–30 nucleotide) sequences as candidate promoters and derived −35 and −10 consensus promoter motifs (TTGACA and TATAAT, respectively) for strong expression in M. buryatense. This consensus closely matches the canonical E. coli sigma-70 motif and was found to be enriched in promoter regions of the genome. A subset of promoter predictions was experimentally validated in a XylE reporter assay, including the consensus promoter, which showed high expression. The pmoC, pqqA, and ssrA promoter predictions were additionally screened in an experiment that scrambled the −35 and −10 signal sequences, confirming that transcription initiation was disrupted when these specific regions of the predicted sequence were altered. These results indicate that the computational framework can make biologically meaningful promoter predictions and identify key pieces of regulatory systems that can serve as foundational tools for engineering diverse microorganisms for biomolecule production.
RegulomeDB
scicrunch.org
Updated Mar 8, 2023
Cite
(2023). RegulomeDB [Dataset]. http://identifiers.org/RRID:SCR_017905
Unique identifier
https://identifiers.org/RRID:SCR_017905
Dataset updated
Mar 8, 2023
Description
Database that annotates SNPs with known and predicted regulatory elements in intergenic regions of the H. sapiens genome. Known and predicted regulatory DNA elements include regions of DNase hypersensitivity, binding sites of transcription factors, and promoter regions that have been biochemically characterized to regulate transcription. Sources of these data include public datasets from GEO, the ENCODE project, and published literature.
cisRED: cis-regulatory element
rrid.site
Cite
cisRED: cis-regulatory element [Dataset]. http://identifiers.org/RRID:SCR_002098
Unique identifier
https://identifiers.org/RRID:SCR_002098
Description
Database for conserved sequence motifs identified by genome-scale motif discovery, similarity, clustering, co-occurrence and coexpression calculations. Sequence inputs include low-coverage genome sequence data and ENCODE data. The database offers information on atomic motifs, motif groups and patterns. In promoter-based cisRED databases, sequence search regions for motif discovery extend from 1.5 kb upstream to 200 bp downstream of a transcription start site, net of most types of repeats and of coding exons. Many transcription factor binding sites are located in such regions. For each target gene's search region, a base set of probabilistic ab initio discovery tools is used, in parallel, to find over-represented atomic motifs. Discovery methods use comparative genomics with over 40 vertebrate input genomes. In ChIP-seq-based cisRED databases, sequence search regions for motif discovery correspond to significant peaks that represent genome-wide sites of protein-DNA binding. Because such peaks occur in a wide range of genic and intergenic locations, ChIP-seq and promoter-based databases are complementary. Currently, motif discovery for ChIP-seq data uses scan-based approaches that make more explicit use of sets of sequences known to be functional transcription factor binding sites, and that consider a wide range of levels of conservation. For the human STAT1 ChIP-seq database, search regions in the target species (human) were selected as +/- 300 bp around the ChIP-seq peak maximum. Repeats and coding regions were masked. Multiple sequence alignments were used to assemble orthologous input sequences from other species.
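The search-region definitions above can be sketched as simple coordinate arithmetic. This is an illustrative sketch only, not cisRED code: the function names are hypothetical, coordinates are assumed to be 1-based and inclusive, and the masking of repeats and coding exons mentioned in the description is not modeled.

```python
def promoter_search_region(tss: int, strand: str,
                           upstream: int = 1500,
                           downstream: int = 200) -> tuple[int, int]:
    """Return (start, end) spanning 1.5 kb upstream to 200 bp downstream of a TSS.

    On the minus strand, "upstream" lies at higher genomic coordinates,
    so the two offsets swap sides.
    """
    if strand == "+":
        start, end = tss - upstream, tss + downstream
    elif strand == "-":
        start, end = tss - downstream, tss + upstream
    else:
        raise ValueError(f"strand must be '+' or '-', got {strand!r}")
    # Clamp at the first base of the sequence (assumed 1-based coordinates).
    return max(start, 1), end


def peak_search_region(peak_max: int, flank: int = 300) -> tuple[int, int]:
    """Return (start, end) covering +/- 300 bp around a ChIP-seq peak maximum."""
    return max(peak_max - flank, 1), peak_max + flank


if __name__ == "__main__":
    print(promoter_search_region(10_000, "+"))
    print(promoter_search_region(10_000, "-"))
    print(peak_search_region(5_000))
```

Keeping the strand handling in one place avoids the common mistake of treating "upstream" as "lower coordinate" for genes on the minus strand.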
