100+ datasets found
  1. Eukaryotic Promoter Database

    • bioregistry.io
    Updated Dec 12, 2021
    Cite
    (2021). Eukaryotic Promoter Database [Dataset]. https://bioregistry.io/epd
    Description

    The Eukaryotic Promoter Database (EPD) is an annotated non-redundant collection of eukaryotic POL II promoters, for which the transcription start site has been determined experimentally. Access to promoter sequences is provided by pointers to positions in nucleotide sequence entries. The annotation part of an entry includes description of the initiation site mapping data, cross-references to other databases, and bibliographic references. EPD is structured in a way that facilitates dynamic extraction of biologically meaningful promoter subsets for comparative sequence analysis.

  2. Eukaryotic Promoter Database

    • neuinfo.org
    Updated Jan 29, 2022
    Cite
    (2022). Eukaryotic Promoter Database [Dataset]. http://identifiers.org/RRID:SCR_002132
    Description

    Collection of eukaryotic promoters derived from published articles. Annotated non-redundant collection of eukaryotic POL II promoters, for which transcription start site has been determined experimentally. Access to promoter sequences is provided by pointers to positions in nucleotide sequence entries. EPD is structured in a way that facilitates dynamic extraction of biologically meaningful promoter subsets for comparative sequence analysis.

  3. SCPD - Saccharomyces cerevisiae promoter database

    • dknet.org
    Updated Jan 29, 2022
    Cite
    (2022). SCPD - Saccharomyces cerevisiae promoter database [Dataset]. http://identifiers.org/RRID:SCR_004412
    Description

    A promoter database of Saccharomyces cerevisiae. Users can explore the promoter regions of ~6000 genes and ORFs in the yeast genome, annotate putative regulatory sites of all genes and ORFs, locate intergenic regions, and retrieve the sequence of the promoter region. With regard to regulatory elements and transcription factors, users can provide information on transcriptionally related genes, browse matrix and consensus sequences, view the correlation between elements, observe binding affinity and expression, and look at genome-wide distribution. SCPD also provides some simple but useful tools for promoter sequence analysis. Gene, consensus and matrix records may be submitted.

  4. PlantProm DB

    • rrid.site
    Updated Dec 22, 2019
    Cite
    (2019). PlantProm DB [Dataset]. http://identifiers.org/RRID:SCR_003359
    Description

    Annotated, non-redundant database of proximal promoter sequences for RNA polymerase II with experimentally determined transcription start site(s) (TSS) from various plant species. It contains 578 unrelated entries including 151, 396 and 31 promoters with experimentally verified TSS from monocot, dicot and other plants, respectively (April 2014). This DB presents the published promoter sequences with TSS(s) determined by direct experimental approaches and therefore serves as the most accurate source for development of computational promoter prediction tools.

  5. ProkBERT datasets

    • zenodo.org
    bin, bz2
    Updated Oct 31, 2024
    Cite
    Balázs Ligeti (2024). ProkBERT datasets [Dataset]. http://doi.org/10.5281/zenodo.10057832
    Available download formats: bz2, bin
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Balázs Ligeti
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Datasets for ProkBERT

    This repository contains the training, validation, and testing datasets used in our research for ProkBERT, optimized for microbiome studies.

    There are 4 different datasets:

    1. ESKAPE genomic features
    2. Bacterial promoter database
    3. Phage training, test and evaluation datasets
    4. ESKAPE masked sequences dataset

    Description

    The datasets support the development and evaluation of ProkBERT models. They include raw sequence data in compressed TSV format and tokenized datasets in compressed HDF format, using various k-mer sizes and shift values.

    ESKAPE genomic features

    filename: eskape_genomic_features.tsv.bz2

    This dataset includes genomic segments from ESKAPE pathogens, characterized by various genomic features such as coding sequences (CDS), intergenic regions, ncRNA, and pseudogenes. It was analyzed to understand the representations captured by models like ProkBERT-mini, ProkBERT-mini-c, and ProkBERT-mini-long.

    Data Fields

    • contig_id: Identifier of the contig.
    • segment_id: Unique identifier for each genomic segment.
    • strand: DNA strand of the segment (+ or -).
    • seq_start: Starting position of the segment in the contig.
    • seq_end: Ending position of the segment in the contig.
    • segment_start: Starting position of the segment in the sequence.
    • segment_end: Ending position of the segment in the sequence.
    • label: Genomic feature category (e.g., CDS, intergenic).
    • segment_length: Length of the genomic segment.
    • segment: Genomic sequence of the segment.

    For a more detailed description, please visit: https://huggingface.co/datasets/neuralbioinfo/ESKAPE-genomic-features
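
As a rough illustration of working with the compressed TSV files, the sketch below reads a bz2-compressed, tab-separated file using only the Python standard library and reports rows where `segment_length` disagrees with the length of `segment`. The column names come from the field list above. The assumptions that the file has a header row with exactly these names and that `segment_length` equals the character length of `segment` are not confirmed by the dataset description, and the helper names are hypothetical.

```python
import bz2
import csv


def read_segments(path):
    """Yield one dict per row from a bz2-compressed TSV file with a header row."""
    # "rt" opens the compressed stream in text mode so csv can consume it.
    with bz2.open(path, "rt", newline="") as handle:
        yield from csv.DictReader(handle, delimiter="\t")


def length_mismatches(rows):
    """Return segment_ids whose segment_length differs from len(segment)."""
    return [
        row["segment_id"]
        for row in rows
        if int(row["segment_length"]) != len(row["segment"])
    ]
```

A call such as `length_mismatches(read_segments("eskape_genomic_features.tsv.bz2"))` would return an empty list if every row is self-consistent under these assumptions.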

    PROMOTER dataset

    filename: bacterial_promoter_db.tsv.bz2

    Data collection and processing

    • Data source: The positive samples, known promoters, are primarily drawn from the Prokaryotic Promoter Database (PPD), containing experimentally validated promoter sequences from 75 organisms. Non-promoter sequences are obtained from the NCBI RefSeq database, sampled specifically from CDS regions.
    • Preprocessing: The dataset includes non-promoter sequences constructed via higher-order and zero-order Markov chains, which mirror compositional characteristics of known promoters. An independent test set based on E. coli sigma70 promoters is also included.
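
To make the zero-order Markov chain idea concrete: a zero-order model draws each nucleotide independently according to overall base frequencies. The sketch below estimates those frequencies from a set of sequences and samples a synthetic sequence of a given length. It illustrates the general technique only and is not the ProkBERT authors' pipeline; the function names and the optional random-generator argument are assumptions.

```python
import random
from collections import Counter


def base_frequencies(sequences):
    """Estimate per-nucleotide frequencies pooled over all input sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(seq.upper())
    total = sum(counts.values())
    return {base: n / total for base, n in counts.items()}


def sample_zero_order(freqs, length, rng=None):
    """Draw `length` nucleotides independently from `freqs` (zero-order model)."""
    rng = rng or random.Random()
    bases = sorted(freqs)
    weights = [freqs[b] for b in bases]
    return "".join(rng.choices(bases, weights=weights, k=length))
```

A higher-order model would instead condition each draw on the preceding k nucleotides, which preserves short-range composition such as dinucleotide frequencies.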

    Dataset structure

    • Dataset splits: The dataset is systematically divided into training, validation, and test subsets.
    • Data fields:
      • segment_id: Unique identifier for each segment.
      • ppd_original_SpeciesName: Original species name from the PPD.
      • Strand: The strand of the DNA sequence.
      • segment: The DNA sequence of the promoter region.
      • label: The label indicating whether the sequence is a promoter or non-promoter.
      • L: Length of the DNA sequence.
      • prom_class: The class of the promoter.
      • y: Binary label indicating the presence of a promoter.

    Dataset splits

    • Training set: Primary dataset used for model training.
    • Test set (Sigma70): Independent test set focusing on E. coli sigma70 promoters.
    • Multispecies set: Additional test set including various species, ensuring generalization across different organisms.

    ESKAPE masked sequences dataset

    filename: eskape_masking_dataset.tsv.bz2

    Dataset description

    This dataset was used to evaluate models on a masking task, measuring how well each model recovers the original character.

    Dataset overview

    The dataset is compiled from the RefSeq database and other sources, focusing on ESKAPE pathogens. The genomic features were sampled randomly, followed by contiguous segmentation. This dataset contains various segments with lengths: [128, 256, 512, 1024]. The segments were randomly selected, and one of the characters was replaced by '*' (masked_segment column) to create a masking task. The reference_segment contains the original, non-replaced nucleotides. We performed 10,000 maskings per set, with a maximum of 2,000 genomic features. Only the genomic features: 'CDS', 'intergenic', 'pseudogene', and 'ncRNA' were considered.
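
The masking step described above can be sketched as follows: pick one position in a segment, replace it with '*', and keep the original sequence for scoring. The names `masked_segment` and `position_to_mask` follow the dataset's field list; the function names, the 0-based indexing, and the uniform random choice of position are assumptions made for illustration.

```python
import random


def mask_one_position(segment, rng=None, mask_char="*"):
    """Return (masked_segment, position_to_mask) with one character replaced."""
    if not segment:
        raise ValueError("segment must not be empty")
    rng = rng or random.Random()
    position = rng.randrange(len(segment))  # 0-based index (assumed convention)
    masked = segment[:position] + mask_char + segment[position + 1:]
    return masked, position


def is_recovered(reference_segment, position, predicted_base):
    """True if a predicted base equals the original at the masked position."""
    return reference_segment[position] == predicted_base
```

Averaging `is_recovered` over many masked segments gives a simple recovery accuracy for a model under these assumptions.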

    Dataset Structure

    • Data Fields:
      • reference_segment_id: A mapping of segment identifiers to their respective reference IDs in the database.
      • masked_segment: The DNA sequence of the segment with certain positions masked (marked with '*') for prediction or testing purposes.
      • position_to_mask: The specific position(s) in the sequence that have been masked, indicated by index numbers.
      • masked_segment_id: Unique identifiers assigned to the masked segments (unique only with respect to length).
      • contig_id: Identifier of the contig to which the segment belongs.
      • segment_id: Unique identifier for each genomic segment (same as reference segment id).
      • strand: The DNA strand of the segment, indicated as '+' (positive) or '-' (negative).
      • seq_start: Starting position of the segment within the contig.
      • seq_end: Ending position of the segment within the contig.
      • segment_start: Starting position of the genomic segment in the sequence.
      • segment_end: Ending position of the genomic segment in the sequence.
      • label: Category label for the genomic segment (e.g., 'CDS', 'intergenic').
      • segment_length: The length of the genomic segment.
      • original_segment: The original genomic sequence without any masking.

    PHAGE dataset description

    We assembled a phage sequence database from RefSeq and other sources, refining it to reduce redundancy and ensure balance between phage and bacterial sequences. The final dataset targets important bacterial genera, aiding in understanding phage-host interactions and their implications for health.

    Data file naming conventions

    For tokenized datasets:

    • Pattern: RND_balanced_(test|val|train)_Ls(\d+)_k(\d+)s(\d+)\.h5\.bz2
    • Matches files indicating type (test, validation, or training), segment length, k-mer size, and shift value.

    For sampled raw data:

    • Pattern: RND_balanced_(test|val|train)_Ls(\d+)\.tsv\.bz2
    • Matches files indicating type (test, validation, or training) and segment length.
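
The two naming patterns above can be applied directly with Python's `re` module. The sketch below parses a filename into its split, segment length and, for tokenized files, k-mer size and shift. The regular expressions are copied from the patterns above; the function name, the returned dictionary keys, and the example filenames in the usage note are hypothetical.

```python
import re

TOKENIZED = re.compile(r"RND_balanced_(test|val|train)_Ls(\d+)_k(\d+)s(\d+)\.h5\.bz2")
RAW = re.compile(r"RND_balanced_(test|val|train)_Ls(\d+)\.tsv\.bz2")


def parse_filename(name):
    """Return a dict describing the file, or None if neither pattern matches."""
    m = TOKENIZED.fullmatch(name)
    if m:
        split, length, kmer, shift = m.groups()
        return {"kind": "tokenized", "split": split, "segment_length": int(length),
                "kmer": int(kmer), "shift": int(shift)}
    m = RAW.fullmatch(name)
    if m:
        split, length = m.groups()
        return {"kind": "raw", "split": split, "segment_length": int(length)}
    return None
```

For a made-up name such as `RND_balanced_train_Ls512_k6s1.h5.bz2`, the function would report a tokenized training file with segment length 512, k-mer size 6 and shift 1.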

    Data fields

    • segment_id: Unique identifier for each genomic segment.
    • contig_id: Identifier for the contig from which the segment is derived.
    • segment_start: Start position of the segment in the contig.
    • segment_end: End position of the segment in the contig.
    • L: Length of the genomic segment (512, 1024, or 2048).
    • segment: The genomic sequence of the segment.
    • label: Classification label (e.g., 'phage').
    • y: Binary label (1 for phage, 0 for non-phage).

    Usage

    These datasets are for academic use. Reference our paper when using them.

    Contact information

    For any questions, feedback, or contributions regarding the datasets or ProkBERT, please feel free to reach out:

    We welcome your input and collaboration to improve our resources and research.

    Citation

    @Article{ProkBERT2024,
     author = {Ligeti, Balázs and Szepesi-Nagy, István and Bodnár, Babett and Ligeti-Nagy, Noémi and Juhász, János},
     journal = {Frontiers in Microbiology},
     title  = {{ProkBERT} family: genomic language models for microbiome applications},
     year  = {2024},
     volume = {14},
     URL   = {https://www.frontiersin.org/articles/10.3389/fmicb.2023.1331233},
     DOI   = {10.3389/fmicb.2023.1331233}
    }

  6. DB narG promoter (promoter) Sequence Data

    • biocomplete.it
    text/x-fasta
    Updated Oct 24, 2025
    Cite
    (2025). DB narG promoter (promoter) Sequence Data [Dataset]. https://biocomplete.it/sequences/39272/sequence
    Available download formats: text/x-fasta
    Measurement technique
    DNA sequencing
    Description

    DNA sequence and relationships for DB narG promoter (promoter)

  7. 5 prime end Serial Analysis of Gene Expression Database

    • neuinfo.org
    Updated Oct 11, 2025
    Cite
    (2025). 5 prime end Serial Analysis of Gene Expression Database [Dataset]. http://identifiers.org/RRID:SCR_001680
    Description

    THIS RESOURCE IS NO LONGER IN SERVICE, documented on October 30, 2012. A database that displays the observed frequencies of individual 5' end SAGE tags and previously unknown transcription start sites in the promoter regions, introns and intergenic regions of known genes. 5'SAGE will be useful for analyzing promoter regions and start site variation in different tissues, and is freely available.

  8. Data from: Assessing the Effects of Data Selection and Representation on the...

    • datasetcatalog.nlm.nih.gov
    Updated Mar 24, 2015
    Cite
    Abbas, Mostafa M.; EL-Manzalawy, Yasser; Mohie-Eldin, Mostafa M. (2015). Assessing the Effects of Data Selection and Representation on the Development of Reliable E. coli Sigma 70 Promoter Region Predictors [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001911563
    Authors
    Abbas, Mostafa M.; EL-Manzalawy, Yasser; Mohie-Eldin, Mostafa M.
    Description

    As the number of sequenced bacterial genomes increases, the need for rapid and reliable tools for annotating functional elements (e.g., transcriptional regulatory elements) becomes more pressing. Promoters are the key regulatory elements, which recruit the transcriptional machinery through binding to a variety of regulatory proteins (known as sigma factors). Identifying promoter regions is very challenging because these regions do not adhere to specific sequence patterns or motifs and are difficult to determine experimentally. Machine learning represents a promising and cost-effective approach for computational identification of prokaryotic promoter regions. However, the quality of the predictors depends on several factors, including: i) training data; ii) data representation; iii) classification algorithms; iv) evaluation procedures. In this work, we create several variants of E. coli promoter data sets and use them to experimentally examine the effect of these factors on the predictive performance of E. coli σ70 promoter models. Our results suggest that under some combinations of the first three factors, a prediction model might perform very well in cross-validation experiments while performing drastically worse on independent test data. This emphasizes the importance of evaluating promoter region predictors using independent test data, which corrects for the over-optimistic performance that might be estimated using the cross-validation procedure. Our analysis of the tested models shows that good prediction models often perform well regardless of how the non-promoter data was obtained. On the other hand, poor prediction models seem to be more sensitive to the choice of non-promoter sequences. Interestingly, the best performing sequence-based classifiers outperform the best performing structure-based classifiers in both cross-validation and independent test performance evaluation experiments. Finally, we propose a meta-predictor method combining two top performing sequence-based and structure-based classifiers and compare its performance with some of the state-of-the-art E. coli σ70 promoter prediction methods.

  9. DataBase of Tunicate Gene Regulation

    • neuinfo.org
    Updated Jan 29, 2022
    Cite
    (2022). DataBase of Tunicate Gene Regulation [Dataset]. http://identifiers.org/RRID:SCR_007620
    Description

    DBTGR provides information on tunicate gene regulation, such as the location of expression, or the identified regulatory elements present in promoter sequences. The database also contains the promoters of homologous genes in multiple species to allow identification of conserved cis elements.

  10. DoOP chordate v1.4 dataset, sequence data for 500 nt promoter regions

    • datasetcatalog.nlm.nih.gov
    Updated Dec 2, 2015
    Cite
    Sebestyén, Endre; Tóth, Gábor; Barta, Endre; Pálfy, Tamás B; Nagy, Tibor (2015). DoOP chordate v1.4 dataset, sequence data for 500 nt promoter regions [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001928916
    Authors
    Sebestyén, Endre; Tóth, Gábor; Barta, Endre; Pálfy, Tamás B; Nagy, Tibor
    Description

    500 nt promoter sequences for the DoOP database, chordate section, v1.4.

  11. Full-Malaria: Malaria Full-Length cDNA Database

    • neuinfo.org
    Updated Jun 28, 2024
    Cite
    (2024). Full-Malaria: Malaria Full-Length cDNA Database [Dataset]. http://identifiers.org/RRID:SCR_002348
    Description

    FULL-malaria is a database for a full-length-enriched cDNA library from the human malaria parasite Plasmodium falciparum. Because of its medical importance, this organism is the first target for genome sequencing of a eukaryotic pathogen; the sequences of two of its 14 chromosomes have already been determined. However, for the full exploitation of this rapidly accumulating information, correct identification of the genes and study of their expression are essential. Using the oligo-capping method, this database has produced a full-length-enriched cDNA library from erythrocytic stage parasites and performed one-pass reading. The database consists of nucleotide sequences of 2490 random clones that include 390 (16%) known malaria genes according to BLASTN analysis of the nr-nt database in GenBank; these represent 98 genes, and the clones for 48 of these genes contain the complete protein-coding sequence (49%). On the other hand, comparisons with the complete chromosome 2 sequence revealed that 35 of 210 predicted genes are expressed, and in addition led to detection of three new gene candidates that were not previously known. In total, 19 of these 38 clones (50%) were full-length. From these observations, it is expected that the database contains approximately 1000 genes, including 500 full-length clones. It should be an invaluable resource for the development of vaccines and novel drugs. Full-malaria has been updated in at least three points. (i) 8934 sequences generated from the addition of new libraries added so that the database collection of 11,424 full-length cDNAs covers 1375 (25%) of the estimated number of the entire 5409 parasite genes. (ii) All of its full-length cDNAs and GenBank EST sequences were mapped to genomic sequences together with publicly available annotated genes and other predictions. 
This precisely determined the gene structures and positions of the transcriptional start sites, which are indispensable for the identification of the promoter regions. (iii) A total of 4257 cDNA sequences were newly generated from murine malaria parasites, Plasmodium yoelii yoelii. The genome/cDNA sequences were compared at both nucleotide and amino acid levels, with those of P.falciparum, and the sequence alignment for each gene is presented graphically. This part of the database serves as a versatile platform to elucidate the function(s) of malaria genes by a comparative genomic approach. It should also be noted that all of the cDNAs represented in this database are supported by physical cDNA clones, which are publicly and freely available, and should serve as indispensable resources to explore functional analyses of malaria genomes. Sponsors: This database has been constructed and maintained by a Grant-in-Aid for Publication of Scientific Research Results from the Japan Society for the Promotion of Science (JSPS). This work was also supported by a Special Coordination Funds for Promoting Science and Technology from the Science and Technology Agency of Japan (STA) and a Grant-in-Aid for Scientific Research on Priority Areas from the Ministry of Education, Science, Sports and Culture of Japan.

  12. consensus elements (promoter) Sequence Data

    • biocomplete.it
    text/x-fasta
    Updated Oct 22, 2025
    Cite
    (2025). consensus elements (promoter) Sequence Data [Dataset]. https://biocomplete.it/sequences/77062/sequence
    Available download formats: text/x-fasta
    Measurement technique
    DNA sequencing
    Description

    DNA sequence and relationships for consensus elements (promoter)

  13. Data from: Classification of Promoters Based on the Combination of Core...

    • datasetcatalog.nlm.nih.gov
    Updated Mar 30, 2016
    Cite
    Mamitsuka, Hiroshi; Natsume-Kitatani, Yayoi (2016). Classification of Promoters Based on the Combination of Core Promoter Elements Exhibits Different Histone Modification Patterns [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001547836
    Authors
    Mamitsuka, Hiroshi; Natsume-Kitatani, Yayoi
    Description

    Four different histones (H2A, H2B, H3, and H4; two subunits each) constitute a histone octamer, around which DNA wraps to form histone-DNA complexes called nucleosomes. Amino acid residues in each histone are occasionally modified, resulting in several biological effects, including differential regulation of transcription. Core promoters that encompass the transcription start site have well-conserved DNA motifs, including the initiator (Inr), TATA box, and DPE, which are collectively called the core promoter elements (CPEs). In this study, we systematically studied the associations between the CPEs and histone modifications by integrating the Drosophila Core Promoter Database and time-series ChIP-seq data for histone modifications (H3K4me3, H3K27ac, and H3K27me3) during development in Drosophila melanogaster via the modENCODE project. We classified 96 core promoters into four groups based on the presence or absence of the TATA box or DPE, calculated the histone modification ratio at the core promoter region, and transcribed region for each core promoter. We found that the histone modifications in TATA-less groups were static during development and that the core promoters could be clearly divided into three types: i) core promoters with continuous active marks (H3K4me3 and H3K27ac), ii) core promoters with a continuous inactive mark (H3K27me3) and occasional active marks, and iii) core promoters with occasional histone modifications. Linear regression analysis and non-linear regression by random forest showed that the TATA-containing groups included core promoters without histone modifications, for which the measured RNA expression values were not predictable accurately from the histone modification status. DPE-containing groups had a higher relative frequency of H3K27me3 in both the core promoter region and transcribed region. 
In summary, our analysis showed that there was a systematic link between the existence of the CPEs and the dynamics, frequency and influence on transcriptional activity of histone modifications.

  14. Data from: ABS: A Database of Annotated Regulatory Binding Sites From...

    • dknet.org
    Updated Jan 29, 2022
    Cite
    (2022). ABS: A Database of Annotated Regulatory Binding Sites From Orthologous Promoters [Dataset]. http://identifiers.org/RRID:SCR_002276
    Description

    Public database of known binding sites identified in promoters of orthologous vertebrate genes that have been manually curated from the literature. We have annotated 650 experimental binding sites from 68 transcription factors and 100 orthologous target genes in human, mouse, rat or chicken genome sequences. Computational predictions and promoter alignment information are also provided for each entry. For each gene, TFBSs conserved in orthologous sequences from at least two different species must be available. Promoter sequences as well as the original GenBank or RefSeq entries are additionally supplied in case of future identification conflicts. The final TSS annotation has been refined using the database dbTSS. Up to this release, the 500 bp upstream of the annotated transcription start site (TSS), according to RefSeq annotations, have always been extracted to form the collection of promoter sequences from human, mouse, rat and chicken. For each regulatory site, the position, the motif and the sequence in which the site is present are available in a simple format. Cross-references to EntrezGene, PubMed and RefSeq are also provided for each annotation. Apart from the experimental promoter annotations, predictions by popular collections of weight matrices are also provided for each promoter sequence. In addition, global and local alignments and graphical dotplots are also available.
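
Extracting a fixed window upstream of a TSS depends on strand: on the forward strand, upstream means lower coordinates; on the reverse strand, it means higher coordinates, and the result must be reverse-complemented. The sketch below shows this for a 500 bp window. It assumes a 0-based TSS index into an in-memory chromosome string and excludes the TSS base itself. The coordinate conventions actually used by ABS are not stated in the description, so those choices and the function names are assumptions.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Reverse-complement a DNA string (A<->T, C<->G, N unchanged)."""
    return seq.translate(_COMPLEMENT)[::-1]


def upstream_region(chrom_seq, tss, strand, length=500):
    """Return up to `length` bases immediately upstream of a 0-based TSS index."""
    if strand == "+":
        start = max(0, tss - length)  # clip at the chromosome start
        return chrom_seq[start:tss]
    if strand == "-":
        end = min(len(chrom_seq), tss + 1 + length)  # clip at the chromosome end
        return reverse_complement(chrom_seq[tss + 1:end])
    raise ValueError("strand must be '+' or '-'")
```

Near a chromosome end the returned sequence is shorter than `length`, so callers that need fixed-width input would have to pad or discard such cases.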

  15. DoOP chordate v1.4 dataset, sequence data for 1000 nt promoter regions

    • figshare.com
    application/gzip
    Updated Jan 20, 2016
    Cite
    Endre Sebestyén; Tamás B Pálfy; Tibor Nagy; Gábor Tóth; Endre Barta (2016). DoOP chordate v1.4 dataset, sequence data for 1000 nt promoter regions [Dataset]. http://doi.org/10.6084/m9.figshare.1615052.v1
    Available download formats: application/gzip
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Endre Sebestyén; Tamás B Pálfy; Tibor Nagy; Gábor Tóth; Endre Barta
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    1000 nt promoter sequences for the DoOP database, chordate section, v1.4.

  16. Full-Malaria: Malaria Full-Length cDNA Database

    • dknet.org
    Updated Jul 3, 2024
    Cite
    (2024). Full-Malaria: Malaria Full-Length cDNA Database [Dataset]. http://identifiers.org/RRID:SCR_002348
    Description

    FULL-malaria is a database for a full-length-enriched cDNA library from the human malaria parasite Plasmodium falciparum. Because of its medical importance, this organism is the first target for genome sequencing of a eukaryotic pathogen; the sequences of two of its 14 chromosomes have already been determined. However, for the full exploitation of this rapidly accumulating information, correct identification of the genes and study of their expression are essential. Using the oligo-capping method, this database has produced a full-length-enriched cDNA library from erythrocytic stage parasites and performed one-pass reading. The database consists of nucleotide sequences of 2490 random clones that include 390 (16%) known malaria genes according to BLASTN analysis of the nr-nt database in GenBank; these represent 98 genes, and the clones for 48 of these genes contain the complete protein-coding sequence (49%). On the other hand, comparisons with the complete chromosome 2 sequence revealed that 35 of 210 predicted genes are expressed, and in addition led to detection of three new gene candidates that were not previously known. In total, 19 of these 38 clones (50%) were full-length. From these observations, it is expected that the database contains approximately 1000 genes, including 500 full-length clones. It should be an invaluable resource for the development of vaccines and novel drugs. Full-malaria has been updated in at least three points. (i) 8934 sequences generated from the addition of new libraries added so that the database collection of 11,424 full-length cDNAs covers 1375 (25%) of the estimated number of the entire 5409 parasite genes. (ii) All of its full-length cDNAs and GenBank EST sequences were mapped to genomic sequences together with publicly available annotated genes and other predictions. 
This precisely determined the gene structures and positions of the transcriptional start sites, which are indispensable for the identification of the promoter regions. (iii) A total of 4257 cDNA sequences were newly generated from murine malaria parasites, Plasmodium yoelii yoelii. The genome/cDNA sequences were compared at both nucleotide and amino acid levels, with those of P.falciparum, and the sequence alignment for each gene is presented graphically. This part of the database serves as a versatile platform to elucidate the function(s) of malaria genes by a comparative genomic approach. It should also be noted that all of the cDNAs represented in this database are supported by physical cDNA clones, which are publicly and freely available, and should serve as indispensable resources to explore functional analyses of malaria genomes. Sponsors: This database has been constructed and maintained by a Grant-in-Aid for Publication of Scientific Research Results from the Japan Society for the Promotion of Science (JSPS). This work was also supported by a Special Coordination Funds for Promoting Science and Technology from the Science and Technology Agency of Japan (STA) and a Grant-in-Aid for Scientific Research on Priority Areas from the Ministry of Education, Science, Sports and Culture of Japan.

  17. The relative importance of each histone modification in determining RNA...

    • figshare.com
    xls
    Updated Jun 2, 2023
    Cite
    Yayoi Natsume-Kitatani; Hiroshi Mamitsuka (2023). The relative importance of each histone modification in determining RNA expression levels was calculated using the regression equations obtained by linear regression analysis using the LMG method [12]. [Dataset]. http://doi.org/10.1371/journal.pone.0151917.t001
    Available download formats: xls
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Yayoi Natsume-Kitatani; Hiroshi Mamitsuka
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Values were normalized such that the sum of all values was 1.

  18. A Computational Framework for Identifying Promoter Sequences in Nonmodel...

    • datasetcatalog.nlm.nih.gov
    Updated May 14, 2021
    Cite
    Sarfatis, M. Claire; Wilson, Erin H.; Beck, David A. C.; Groom, Joseph D.; Lidstrom, Mary E.; Ford, Stephanie M. (2021). A Computational Framework for Identifying Promoter Sequences in Nonmodel Organisms Using RNA-seq Data Sets [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000919400
    Dataset updated
    May 14, 2021
    Authors
    Sarfatis, M. Claire; Wilson, Erin H.; Beck, David A. C.; Groom, Joseph D.; Lidstrom, Mary E.; Ford, Stephanie M.
    Description

    Engineering microorganisms into biological factories that convert renewable feedstocks into valuable materials is a major goal of synthetic biology; however, for many nonmodel organisms, we do not yet have the genetic tools, such as suites of strong promoters, necessary to effectively engineer them. In this work, we developed a computational framework that can leverage standard RNA-seq data sets to identify sets of constitutive, strongly expressed genes and predict strong promoter signals within their upstream regions. The framework was applied to a diverse collection of RNA-seq data measured for the methanotroph Methylotuvimicrobium buryatense 5GB1 and identified 25 genes that were constitutively, strongly expressed across 12 experimental conditions. For each gene, the framework predicted short (27–30 nucleotide) sequences as candidate promoters and derived −35 and −10 consensus promoter motifs (TTGACA and TATAAT, respectively) for strong expression in M. buryatense. This consensus closely matches the canonical E. coli sigma-70 motif and was found to be enriched in promoter regions of the genome. A subset of promoter predictions was experimentally validated in a XylE reporter assay, including the consensus promoter, which showed high expression. The pmoC, pqqA, and ssrA promoter predictions were additionally screened in an experiment that scrambled the −35 and −10 signal sequences, confirming that transcription initiation was disrupted when these specific regions of the predicted sequence were altered. These results indicate that the computational framework can make biologically meaningful promoter predictions and identify key pieces of regulatory systems that can serve as foundational tools for engineering diverse microorganisms for biomolecule production.
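The two steps described above (picking constitutive, strongly expressed genes, then looking for −35/−10 signals upstream of them) can be illustrated with a minimal sketch. This is not the authors' framework: the function names, the expression thresholds (`min_mean`, `max_cv`), and the mismatch-counting score are assumptions made for illustration. The spacer lengths of 15 to 18 nt are likewise an assumption, chosen so that the two hexamers plus spacer span 27 to 30 nt, the candidate length quoted in the description.

```python
from statistics import mean, pstdev


def constitutive_strong_genes(expr, min_mean=100.0, max_cv=0.3):
    """Return genes that are highly and stably expressed across conditions.

    expr maps gene name -> list of expression values, one per condition.
    min_mean and max_cv are illustrative thresholds, not values from the study.
    """
    selected = []
    for gene, values in expr.items():
        mu = mean(values)
        if mu < min_mean:
            continue  # not strongly expressed
        cv = pstdev(values) / mu  # coefficient of variation across conditions
        if cv <= max_cv:
            selected.append(gene)
    return sorted(selected)


def hamming(a, b):
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def best_promoter_hit(upstream, box35="TTGACA", box10="TATAAT",
                      spacers=range(15, 19)):
    """Find the -35/-10 pair with the fewest mismatches to the consensus.

    Returns (mismatches, start index of the -35 box, spacer length),
    or None if the sequence is too short for any spacer length.
    """
    best = None
    for spacer in spacers:
        span = len(box35) + spacer + len(box10)
        for i in range(len(upstream) - span + 1):
            j = i + len(box35) + spacer  # start of the -10 box
            score = (hamming(upstream[i:i + len(box35)], box35)
                     + hamming(upstream[j:j + len(box10)], box10))
            candidate = (score, i, spacer)
            if best is None or candidate < best:
                best = candidate
    return best
```

A real pipeline would also need normalized expression values and a background model to judge whether a low mismatch count is more than chance; both are left out here.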

  19. RegulomeDB

    • scicrunch.org
    Updated Mar 8, 2023
    Cite
    (2023). RegulomeDB [Dataset]. http://identifiers.org/RRID:SCR_017905
    Dataset updated
    Mar 8, 2023
    Description

    Database that annotates SNPs with known and predicted regulatory elements in intergenic regions of the H. sapiens genome. Known and predicted regulatory DNA elements include regions of DNase hypersensitivity, binding sites of transcription factors, and promoter regions that have been biochemically characterized to regulate transcription. Sources of these data include public datasets from GEO, the ENCODE project, and published literature.
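The core idea of annotating SNPs with regulatory elements can be sketched as an interval-overlap lookup. This is an illustration only and does not use or reproduce the RegulomeDB API: the function name, the tuple layouts, the 0-based half-open coordinate convention, and the sample identifiers are all assumptions.

```python
from collections import defaultdict


def annotate_snps(snps, regions):
    """Map each SNP to the labels of the regulatory regions that contain it.

    snps:    list of (snp_id, chrom, pos) tuples
    regions: list of (chrom, start, end, label) tuples, 0-based half-open
    Both layouts are hypothetical, chosen for this sketch.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end, label in regions:
        by_chrom[chrom].append((start, end, label))

    annotations = {}
    for snp_id, chrom, pos in snps:
        annotations[snp_id] = sorted(
            label
            for start, end, label in by_chrom.get(chrom, [])
            if start <= pos < end
        )
    return annotations
```

The linear scan per SNP is fine for small inputs; genome-scale data would normally call for an interval tree or sorted-interval search instead.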

  20. cisRED: cis-regulatory element

    • rrid.site
    Cite
    cisRED: cis-regulatory element [Dataset]. http://identifiers.org/RRID:SCR_002098
    Description

    Database for conserved sequence motifs identified by genome-scale motif discovery, similarity, clustering, co-occurrence and coexpression calculations. Sequence inputs include low-coverage genome sequence data and ENCODE data. The database offers information on atomic motifs, motif groups and patterns. In promoter-based cisRED databases, sequence search regions for motif discovery extend from 1.5 kb upstream to 200 bp downstream of a transcription start site, net of most types of repeats and of coding exons. Many transcription factor binding sites are located in such regions. For each target gene's search region, a base set of probabilistic ab initio discovery tools is used, in parallel, to find over-represented atomic motifs. Discovery methods use comparative genomics with over 40 vertebrate input genomes. In ChIP-seq-based cisRED databases, sequence search regions for motif discovery correspond to significant peaks that represent genome-wide sites of protein-DNA binding. Because such peaks occur in a wide range of genic and intergenic locations, ChIP-seq and promoter-based databases are complementary. Currently, motif discovery for ChIP-seq data uses scan-based approaches that make more explicit use of sets of sequences known to be functional transcription factor binding sites, and that consider a wide range of levels of conservation. For the human STAT1 ChIP-seq database, search regions in the target species (human) were selected as +/- 300 bp around the ChIP-seq peak maximum. Repeats and coding regions were masked. Multiple sequence alignments were used to assemble orthologous input sequences from other species.
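The two kinds of search region described above can be written down as simple coordinate arithmetic. The sketch below is an illustration under stated assumptions and is not cisRED code: the function names are hypothetical, coordinates are taken to be 1-based inclusive positions on the forward strand, and repeat or exon masking is not modeled.

```python
def promoter_search_region(tss, strand, upstream=1500, downstream=200):
    """Return (start, end) of a promoter search region around a TSS.

    The region runs from `upstream` bases before the TSS to `downstream`
    bases after it, in the direction of transcription. Coordinates are
    1-based inclusive on the forward strand (an assumption of this sketch).
    """
    if strand == "+":
        start, end = tss - upstream, tss + downstream
    elif strand == "-":
        # On the reverse strand, "upstream" lies at higher coordinates.
        start, end = tss - downstream, tss + upstream
    else:
        raise ValueError("strand must be '+' or '-'")
    return max(1, start), end  # clip at the start of the sequence


def peak_search_region(peak_max, flank=300):
    """Return (start, end) of a +/- `flank` bp window around a peak maximum."""
    return max(1, peak_max - flank), peak_max + flank
```

For a gene on the forward strand with a TSS at position 10000, the defaults give the window 8500 to 10200; the same TSS on the reverse strand gives 9800 to 11500.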
