100+ datasets found
  1. MPRA data of synthetic enhancers in hematopoiesis

    • figshare.com
    bin
    Updated Mar 15, 2025
    Cite
    Lars Velten; Robert Frömel (2025). MPRA data of synthetic enhancers in hematopoiesis [Dataset]. http://doi.org/10.6084/m9.figshare.25713519.v3
    Explore at:
    Available download formats: bin
    Dataset updated
    Mar 15, 2025
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Lars Velten; Robert Frömel
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Overview

    R Data Archive file containing MPRA data measuring the activity of synthetic DNA constructs in 7 cell states of murine primary hematopoietic stem and progenitor cells (HSPCs), and in K562 cells. See our manuscript, https://www.biorxiv.org/content/10.1101/2024.08.26.609645v1

    This file contains a main data object, mpra.data, a list over the different experiments:

    • HSPC.libA : Library A (38 factors, one TFBS per enhancer), HSPC experiment
    • HSPC.libB : Library B (10 factors, TFBS pairs), HSPC experiment
    • HSPC.libC : Library C (42 factors, TFBS pairs), HSPC experiment
    • HSPC.libC.aggregate : Library C, HSPC experiment, aggregated across cell states
    • HSPC.libD : Library D (automated enhancer design), HSPC experiment
    • HSPC.libF : Library F (Fli1-Spi1 and Gata2-Cebpa combinations), HSPC experiment
    • HSPC.libG : Library G (genomic sequences), HSPC experiment
    • HSPC.libH : Library H (complex synthetic sequences with 3-12 FBS)
    • K562.libA.minP.tra : Library A, K562 cell experiment
    • K562.libB.minP.tra : Library B, K562 cell experiment
    • K562.libC.minP.tra : Library C, K562 cell experiment
    • K562.libB.minCMV.tra : Library B, K562 cell experiment, measured in a vector with the minimal CMV promoter instead of the default minimal promoter
    • K562.libB.minP.int : Library B, K562 cell experiment, measured 7 days after infection instead of 4 days after infection

    Each list entry is a list of data frames, the exact format of which is explained below:

    • DATA : Data of main constructs
    • CONTROLS.GENERAL : Various controls, including random DNA measurements obtained as part of the same experiment
    • CONTROLS.TP53 : An identical set of sequences from library A that was included in each experiment
    • BACKGROUND : 90% confidence intervals of activity achieved by random DNA

    Detailed explanation of main data frames (DATA)

    Each row corresponds to a single gene regulatory element, measured in a single cell state. The following columns are present for all libraries:

    • clusterID : The cell state where the measurement was performed. To map the entries to labels, use the vector cellstate.map
    • CRS : The unique ID of the gene regulatory element
    • Library : The library (A, B or C)
    • Seq : The DNA sequence. Capital letters correspond to placed motifs, small letters correspond to background DNA.
    • RNA.1, RNA.2, DNA.1, DNA.2 : Molecule counts on DNA and RNA level in replicates 1 and 2
    • RNA.norm.1, RNA.norm.2, DNA.norm.1, DNA.norm.2 : Library-size normalized molecule counts (???)
    • norm.1.raw, norm.2.raw : Raw log2 of RNA/DNA counts in replicates 1 and 2
    • norm.1.adj, norm.2.adj : log2 of RNA/DNA counts in replicates 1 and 2, subtracting the median activity of random DNA
    • mean.norm.raw : Mean raw activity across replicates (log2 scale RNA/DNA)
    • mean.norm.adj : Mean activity across replicates, using a scale where 0 is the median activity of random DNA in that cell state
    • mean.scaled.final : Scaled activity measurement for visualization only, using a scale where 0 is the median activity of random DNA and 1 is the maximal activity achieved by Tp53 (not computed in K562 screens). Use mean.norm.adj for model training, statistics etc. The purpose of this is to account for different baselines and offsets achieved in different screens.

    The following columns regarding sequence design are only present in the single-factor library A:

    • TF : The transcription factor placed on the DNA
    • nrepeats : Number of placed motifs
    • affinitynum : Affinity quantile (on a scale where 1 is the most likely sequence given the PWM, and 0 is an extremely unlikely sequence)
    • sum.biophys.affinity : Sum of actual affinities across the sequence, computed using considerations from statistical thermodynamics
    • orientation : Orientation of the motif: forward (fwd), reverse (rev) or alternating fwd-rev (tandem)
    • spacer : Number of base pairs between motifs

    The following columns are only present in the dual-factor libraries B and C:

    • TF1.name : Name of the transcription factor whose motif appears first, coming from 5'
    • TF1.affinity : Corresponding affinity (on a scale from 0 to 1)
    • TF1.orientation : Corresponding orientation
    • TF2.name : Name of the transcription factor whose motif appears second, coming from 5'
    • TF2.affinity : Corresponding affinity (on a scale from 0 to 1)
    • TF2.orientation : Corresponding orientation
    • spacer : Spacing between sites
    • TFnumber : Number of sites for each factor
    • TForder : Arrangement of sites (Alternate or Block)

    The following columns are only present in the automated design library D:

    • SubLibrary : Whether the goal was to design enhancers with specific activation or repression
    • Task_MegEry, Task_Basophil, Task_Eosinophil, Task_Monocyte, Task_Neutrophil, Task_Immature : Task definition in the different cell states: -3 for repression, -0.2 / 0.6 for inactivity (depending on whether the task was activation or repression), 1 for activity
    • design_strategy : Whether the design was initialized with a random sequence or a random forest model was used to identify an optimal TFBS combination (model-guided)
    • design_search : Whether optimization was done with a local or global search

    The following columns are only present in the dual-factor library F:

    • spacer : Spacing between sites
    • nFli1, nSpi1, nCebpa, nGata2 : Number of Fli1/Spi1/Cebpa/Gata2 sites
    • Fli1_affinities_sum, Spi1_affinities_sum, Cebpa_affinities_sum, Gata2_affinities_sum : Sum of motif scores for Fli1, Spi1, Cebpa, Gata2

    The following columns are only present in the genomic library G:

    • chromosome, start_coordinate, end_coordinate : Genomic coordinates (mm10)

    Example code for working with main DATA frames

    Subsetting LibB/LibC data frames

    To extract all data belonging to a given pair of transcription factors from libraries B and C in a format that ignores the order of the sites (i.e. which TF comes first and which TF comes second), the RDA file contains a function getsubset.libBC. This function takes as arguments two transcription factors and a DATA frame, e.g.

        getsubset.libBC("Spi1", "Fli1", mpra.data$HSPC.libC$DATA)

    It returns a similar data frame, except that now TF1.name is always Spi1 (no matter whether Spi1 or Fli1 is placed first), and TF2.name is always Fli1. It also adds some convenient columns:

    • oricomb : Orientation of both factors
    • affnum : Number of weak and strong binding sites for both factors (e.g. 3W-3S means that the first factor has 3 weak binding sites, and the second factor has 3 strong binding sites)

    Casting data frames

    To convert the data frame into a format where one row is one sequence, and columns are measurements in different cell states, you can use:

        require(reshape2)
        casted.dataframe
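
    The relationship between the count columns and the activity columns described above (norm.*.raw, norm.*.adj, mean.norm.raw, mean.norm.adj) can be illustrated with a minimal Python sketch. This is not the authors' code: the function names and all numbers are hypothetical, and it assumes that the ratio is taken on the library-size normalized counts and that a single median over the random-DNA controls is subtracted, neither of which the description states explicitly.

```python
import math
from statistics import median

def log2_ratio(rna, dna):
    """Raw activity of one construct in one replicate: log2(RNA / DNA)."""
    return math.log2(rna / dna)

def adjusted_activity(rna, dna, random_dna_raw):
    """Raw activity minus the median raw activity of random-DNA controls."""
    return log2_ratio(rna, dna) - median(random_dna_raw)

# Hypothetical normalized counts (RNA, DNA) for one construct in two replicates.
replicates = [(80.0, 10.0), (64.0, 16.0)]
# Hypothetical raw activities of random-DNA controls in the same cell state.
random_controls = [0.5, 1.0, 1.5]

norm_raw = [log2_ratio(r, d) for r, d in replicates]
norm_adj = [adjusted_activity(r, d, random_controls) for r, d in replicates]

# Means across replicates, analogous to mean.norm.raw and mean.norm.adj.
mean_norm_raw = sum(norm_raw) / len(norm_raw)
mean_norm_adj = sum(norm_adj) / len(norm_adj)
```

    On this scale a value of 0 in mean_norm_adj corresponds to a construct that is exactly as active as the median random-DNA control.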

  2. LS-MPRA / d-MPRA Data Repository

    • dataverse.harvard.edu
    Updated May 26, 2025
    Cite
    Alastair Tulloch (2025). LS-MPRA / d-MPRA Data Repository [Dataset]. http://doi.org/10.7910/DVN/TW0ZQL
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    May 26, 2025
    Dataset provided by
    Harvard Dataverse
    Authors
    Alastair Tulloch
    License

    CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Resources used for the manuscript titled: "Massively parallel reporter assay for mapping gene-specific regulatory regions at single nucleotide resolution". The dataset includes scripts used to analyze data, raw sequencing files, and HOMER de novo motif analyses.

  3. Sequencing data for reporter assay in Jindal et al Dev Cell 2023 article

    • figshare.com
    application/gzip
    Updated Aug 24, 2023
    Cite
    Granton Jindal (2023). Sequencing data for reporter assay in Jindal et al Dev Cell 2023 article [Dataset]. http://doi.org/10.6084/m9.figshare.23834814.v1
    Explore at:
    Available download formats: application/gzip
    Dataset updated
    Aug 24, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Granton Jindal
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Heart-specific enhancers drive expression of genes specifically in heart tissues. We find that low-affinity ETS transcription factor binding sites are necessary for the FoxF enhancer in Ciona and the GATA4-G9 enhancer in mice. To determine if higher affinity sites would result in gain-of-function activity, we tested the human GATA4-G9 enhancer and 2 variants with optimized ETS sites in human iPSC-cardiomyocytes, using a reporter assay. We discovered that both variants with optimized ETS sites drove gain-of-function activity and that just a single nucleotide variant within a human GATA4 enhancer increases ETS binding affinity and causes gain-of-function enhancer activity. The prevalence of suboptimal-affinity sites within enhancers creates a vulnerability whereby affinity-optimizing SNVs can lead to gain-of-function gene expression, changes in cellular identity, and organismal-level phenotypes that could contribute to the evolution of novel traits or diseases.

  4. BIOGRID CURATED DATA FOR MPRA (Escherichia coli (K12/W3110))

    • thebiogrid.org
    zip
    Updated Feb 24, 2017
    Cite
    BioGRID Project (2017). BIOGRID CURATED DATA FOR MPRA (Escherichia coli (K12/W3110)) [Dataset]. https://thebiogrid.org/4262945/summary/escherichia-coli/mpra.html
    Explore at:
    Available download formats: zip
    Dataset updated
    Feb 24, 2017
    Dataset authored and provided by
    BioGRID Project
    License

    MIT License, https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Protein-Protein, Genetic, and Chemical Interactions for MPRA (Escherichia coli (K12/W3110)) curated by BioGRID (https://thebiogrid.org); DEFINITION: DNA-binding transcriptional regulator

  5. Supporting data for: Three-dimensional genome re-wiring in loci with Human...

    • search.dataone.org
    • data.niaid.nih.gov
    • +2 more
    Updated Nov 29, 2023
    Cite
    Kathleen Keough (2023). Supporting data for: Three-dimensional genome re-wiring in loci with Human Accelerated Regions [Dataset]. http://doi.org/10.7272/Q6057D5N
    Explore at:
    Dataset updated
    Nov 29, 2023
    Dataset provided by
    Dryad Digital Repository
    Authors
    Kathleen Keough
    Time period covered
    Jan 1, 2023
    Description

    Human Accelerated Regions (HARs) are conserved genomic loci that evolved at an accelerated rate in the human lineage and may underlie human-specific traits. We generated HARs and chimpanzee accelerated regions with an automated pipeline and an alignment of 241 mammalian genomes. Combining deep-learning with chromatin capture experiments in human and chimpanzee neural progenitor cells, we discovered a significant enrichment of HARs in topologically associating domains (TADs) containing human-specific genomic variants that change three-dimensional (3D) genome organization. Differential gene expression between humans and chimpanzees at these loci suggests rewiring of regulatory interactions between HARs and neurodevelopmental genes. Thus, comparative genomics together with models of 3D genome folding revealed enhancer hijacking as an explanation for the rapid evolution of HARs.

    Lentivirus-based massively parallel reporter assay (lentiMPRA) library design and synthesis: Tiles of 270 bp in length were generated from all 312 zooHARs. Multiple tiles were generated with a sliding window of 20 bp if the zooHAR was longer than 270 bp. In total, 549 oligos were designed to cover all zooHARs. We also included 143 oligos centered on active chromatin marks as positive controls. This oligo pool was synthesized by Twist Bioscience.

    Primary cortical cell culture for lentiMPRA: De-identified tissue samples were collected with consent in strict observance of legal and institutional ethical regulations. Protocols were approved by the Human Gamete, Embryo, and Stem Cell Research Committee (institutional review board) at the University of California, San Francisco. Gestational week 18 cortical tissue was dissociated into a single-cell suspension using papain (LK003150, Worthington Biochemical) and plated on 15 cm dishes coated with poly-O-lysine, laminin, and fibronectin. DMEM culture...
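
    The tiling scheme described above (270 bp tiles, 20 bp sliding window) can be sketched in Python as follows. This is an illustration only, not the authors' pipeline: the function names are hypothetical, and two behaviours are assumptions that the description does not specify, namely that a region no longer than one tile yields a single tile and that a final end-anchored tile is added when the step does not land exactly on the region end.

```python
def tile_starts(length, tile=270, step=20):
    """Start offsets of fixed-size tiles covering a region of `length` bp."""
    if length <= tile:
        # Assumption: a region that fits in one tile yields a single tile.
        return [0]
    starts = list(range(0, length - tile + 1, step))
    if starts[-1] != length - tile:
        # Assumption: add an end-anchored tile so the last bases are covered.
        starts.append(length - tile)
    return starts

def tiles(seq, tile=270, step=20):
    """Cut `seq` into overlapping tiles using the offsets from tile_starts."""
    return [seq[s:s + tile] for s in tile_starts(len(seq), tile, step)]

# Hypothetical 300 bp region: offsets 0 and 20 from the sliding window,
# plus an end-anchored tile at offset 30.
example_starts = tile_starts(300)
example_tiles = tiles("A" * 300)
```

    With these assumptions every base of a region is covered by at least one tile; a different treatment of short regions or of the last window would change the tile count.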

  6. Data from: Massively Parallel Reporter Assays for High-Throughput In Vivo...

    • zenodo.org
    application/gzip, bin +1
    Updated Mar 29, 2023
    Cite
    Nathan J. VanDusen (2023). Massively Parallel Reporter Assays for High-Throughput In Vivo Analysis of Cis-Regulatory Elements [Dataset]. http://doi.org/10.5281/zenodo.7779156
    Explore at:
    Available download formats: application/gzip, bin, xls
    Dataset updated
    Mar 29, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Nathan J. VanDusen
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A library of 50 enhancers, each tested in three different lengths and with two different promoters (300 combinations), was packaged into AAV9 and delivered to newborn mice. Enhancers were selected from the VISTA Enhancer Browser of transgenic reporter data, and included 25 candidates active in the embryonic myocardium and 25 negative control candidates active in embryonic endothelium but not in myocardium. In the heart, AAV9 selectively transduces cardiomyocytes. After collecting ventricles at P28, the reporter transcripts were sequenced, and the frequency of each barcode was compared to its frequency in the viral pool DNA.

    Here we provide fastq files for each sample, an Excel spreadsheet (MPRA-Metadata.xls) containing annotation, and an Excel spreadsheet (MPRA-counts.xlsx) containing extracted barcode counts for each enhancer, as well as additional annotation and calculated enhancer activity.
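
    The comparison described above, the frequency of each barcode among reporter transcripts relative to its frequency in the viral pool DNA, can be sketched in Python. This is a minimal illustration with hypothetical barcode names and counts; taking the log2 of the ratio and skipping barcodes with zero counts are assumptions, and the calculation actually used for MPRA-counts.xlsx may differ.

```python
import math

def frequencies(counts):
    """Convert raw barcode counts into fractions of the total."""
    total = sum(counts.values())
    return {barcode: n / total for barcode, n in counts.items()}

def barcode_activity(rna_counts, dna_counts):
    """log2 of (RNA frequency / viral-pool DNA frequency) per barcode."""
    rna_f = frequencies(rna_counts)
    dna_f = frequencies(dna_counts)
    activity = {}
    for barcode, f_dna in dna_f.items():
        f_rna = rna_f.get(barcode, 0.0)
        # Assumption: barcodes with zero RNA or DNA counts are skipped.
        if f_rna > 0 and f_dna > 0:
            activity[barcode] = math.log2(f_rna / f_dna)
    return activity

# Hypothetical counts for three barcodes.
rna = {"BC1": 50, "BC2": 25, "BC3": 25}
dna = {"BC1": 25, "BC2": 50, "BC3": 25}
activity = barcode_activity(rna, dna)
```

    Positive values mean a barcode is over-represented in the transcripts relative to the viral pool, negative values that it is under-represented.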

  7. RData file of estimate comparisons and primary MPRA data.

    • plos.figshare.com
    application/gzip
    Updated Jun 2, 2023
    Cite
    Andrew R. Ghazi; Xianguo Kong; Ed S. Chen; Leonard C. Edelstein; Chad A. Shaw (2023). RData file of estimate comparisons and primary MPRA data. [Dataset]. http://doi.org/10.1371/journal.pcbi.1007504.s005
    Explore at:
    Available download formats: application/gzip
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    PLOS Computational Biology
    Authors
    Andrew R. Ghazi; Xianguo Kong; Ed S. Chen; Leonard C. Edelstein; Chad A. Shaw
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    An RData file that contains three data frames: ulirsch_comparisons, primary_comparisons, and primary_mpra_data. The first two data frames are the data necessary to produce Fig 4. Each row corresponds to one variant, and each column corresponds to a given analysis method. The values in the table give the transcription shift estimates. The third data frame gives the barcode counts from our primary MPRA dataset with anonymized variant identifiers. (RDATA)

  8. Source Data for Supplementary Note Figures

    • figshare.com
    xlsx
    Updated Nov 9, 2023
    Cite
    Carmen Bravo Gonzalez-Blas; Stein Aerts (2023). Source Data for Supplementary Note Figures [Dataset]. http://doi.org/10.6084/m9.figshare.24532951.v2
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Nov 9, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Carmen Bravo Gonzalez-Blas; Stein Aerts
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Source Data for Supplementary Note Figures from Bravo et al. 2023.

  9. ENCSR186NQR

    • encodeproject.org
    Updated May 19, 2021
    + more versions
    Cite
    Jay Shendure (2021). ENCSR186NQR [Dataset]. www.encodeproject.org/functional-characterization-experiments/ENCSR186NQR/
    Explore at:
    Dataset updated
    May 19, 2021
    Dataset provided by
    The ENCODE Data Coordination Center
    Authors
    Jay Shendure
    License

    www.encodeproject.org/help/citing-encode/

    Measurement technique
    Control MPRA (OBI:0002675)
    Description

    Control MPRA - Homo sapiens K562 genetically modified (insertion) using transduction - ENCODE - UM1HG009408 - Nadav Ahituv, UCSF

  10. Data from: Systematic dissection and optimization of inducible enhancers in...

    • data.niaid.nih.gov
    Updated May 15, 2019
    Cite
    Melnikov A; Murugan A; Zhang X; Mikkelsen TS (2019). Systematic dissection and optimization of inducible enhancers in human cells using a massively parallel reporter assay [Dataset]. https://data.niaid.nih.gov/resources?id=gse31982
    Explore at:
    Dataset updated
    May 15, 2019
    Dataset provided by
    Broad Institute
    Authors
    Melnikov A; Murugan A; Zhang X; Mikkelsen TS
    Description

    We apply a massively parallel reporter assay (MPRA) that relies on mRNA and plasmid tag sequencing (Tag-Seq) to compare the regulatory activities of more than 27,000 distinct variants of two inducible enhancers in human cells: a synthetic cAMP-regulated enhancer and the virus-inducible interferon beta enhancer. The resulting data define accurate maps of functional transcription factor binding sites in both enhancers at single-nucleotide resolution and can be used to train quantitative sequence-activity models (QSAMs). Reporter Tag-Seq from HEK293 cells transfected with each of six MPRA plasmid pools, with and without stimulation (forskolin or Sendai virus). The reporter mRNAs contain unique 10-nucleotide tags that facilitate quantitation of their abundances. The same tags were also sequenced from each transfected plasmid pool to facilitate normalization to plasmid copy numbers. The reporter constructs were designed according to two different mutagenesis strategies: 'single-hit scanning' and 'multi-hit sampling'. The specific variants are included in the processed data files.

  11. Deciphering regulatory DNA sequences and noncoding genetic variants using...

    • plos.figshare.com
    pdf
    Updated Jun 3, 2023
    Cite
    Rajiv Movva; Peyton Greenside; Georgi K. Marinov; Surag Nair; Avanti Shrikumar; Anshul Kundaje (2023). Deciphering regulatory DNA sequences and noncoding genetic variants using neural network models of massively parallel reporter assays [Dataset]. http://doi.org/10.1371/journal.pone.0218073
    Explore at:
    Available download formats: pdf
    Dataset updated
    Jun 3, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Rajiv Movva; Peyton Greenside; Georgi K. Marinov; Surag Nair; Avanti Shrikumar; Anshul Kundaje
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The relationship between noncoding DNA sequence and gene expression is not well-understood. Massively parallel reporter assays (MPRAs), which quantify the regulatory activity of large libraries of DNA sequences in parallel, are a powerful approach to characterize this relationship. We present MPRA-DragoNN, a convolutional neural network (CNN)-based framework to predict and interpret the regulatory activity of DNA sequences as measured by MPRAs. While our method is generally applicable to a variety of MPRA designs, here we trained our model on the Sharpr-MPRA dataset that measures the activity of ∼500,000 constructs tiling 15,720 regulatory regions in human K562 and HepG2 cell lines. MPRA-DragoNN predictions were moderately correlated (Spearman ρ = 0.28) with measured activity and were within range of replicate concordance of the assay. State-of-the-art model interpretation methods revealed high-resolution predictive regulatory sequence features that overlapped transcription factor (TF) binding motifs. We used the model to investigate the cell type and chromatin state preferences of predictive TF motifs. We explored the ability of our model to predict the allelic effects of regulatory variants in an independent MPRA experiment and fine map putative functional SNPs in loci associated with lipid traits. Our results suggest that interpretable deep learning models trained on MPRA data have the potential to reveal meaningful patterns in regulatory DNA sequences and prioritize regulatory genetic variants, especially as larger, higher-quality datasets are produced.
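
    The Spearman ρ quoted above is the Pearson correlation of the ranks of predicted and measured activity. A small dependency-free Python sketch of that statistic follows; it is not taken from MPRA-DragoNN, the function names are hypothetical, and the example values are made up.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical predicted and measured activities for four constructs.
predicted = [0.1, 0.4, 0.3, 0.9]
measured = [1.0, 2.5, 2.0, 7.0]
rho = spearman(predicted, measured)
```

    Because only ranks enter the calculation, ρ is unchanged by any monotonic rescaling of the predictions or of the measurements.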

  12. Data from: Functional dissection of human cardiac enhancers and non-coding...

    • zenodo.org
    txt
    Updated Jul 20, 2023
    Cite
    Xiaoran Zhang (2023). Functional dissection of human cardiac enhancers and non-coding de novo variants in congenital heart disease [Dataset]. http://doi.org/10.5281/zenodo.8162058
    Explore at:
    Available download formats: txt
    Dataset updated
    Jul 20, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Xiaoran Zhang
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the CHD MPRA motif analysis input file. For details, see https://github.com/pulab/CHD_DNVs/tree/main/MPRA-Enhancer/CHD_MPRA_project/CHD_MPRA_library

  13. Distinct roles for motif affinity, chromatin state, and co-regulatory motifs...

    • data.niaid.nih.gov
    Updated May 15, 2019
    Cite
    Grossman SR (2019). Distinct roles for motif affinity, chromatin state, and co-regulatory motifs in PPARγ binding and enhancer activity [Dataset]. https://data.niaid.nih.gov/resources?id=gse84888
    Explore at:
    Dataset updated
    May 15, 2019
    Dataset provided by
    Broad Institute
    Authors
    Grossman SR
    Description

    Sequence-specific transcription factors (TFs) regulate gene expression by binding to cognate motifs in promoters and enhancers. However, predicting genomic TF binding events and their quantitative contribution to expression remains a major challenge. In principle, the binding and enhancer activity of specific sites in vivo might depend on: (i) latent properties of the motif instance, (ii) cooperative interactions with other TFs that bind in the immediate vicinity, and (iii) the chromatin state of the sites in the genome. Here, we used massively parallel reporter assays (MPRA) involving 32,115 natural and synthetic enhancers, together with high-throughput in vivo assays, to systematically dissect the contributions of motif affinity, cooperative interactions, and chromatin accessibility to the binding and regulatory activity of genomic sequences that contain motifs for PPARγ, a TF that serves as a key regulator of adipogenesis. We show that PPARγ binding and enhancer activity are governed by distinct features. Genomic PPARγ binding to motif sites is largely governed by larger-scale features, such as chromatin accessibility, whereas the degree to which a PPARγ motif site enhances transcriptional activity depends on the sequence immediately surrounding the motif. We detect and functionally validate a network of TFs comprised of multiple functional classes that collaborate with PPARγ to drive transcription. We extensively perturb this network, revealing functional cooperativity among classes of TFs that does not depend on precise positioning. Together, these results present a clear picture of how chromatin and TFs from distinct functional classes interact with PPARγ to determine binding and enhancer activity, and provide a paradigm for studying any TF.

    The study consisted of 7 MPRA experiments and 2 ChIP-seq experiments. Raw data for MPRA experiments are provided as Illumina reads of the 16 bp barcode from the RNA extracted 16 hours post transfection as well as from the plasmid library used for transfection. Raw data for ChIP-seq experiments are provided as paired-end Illumina reads for PPARγ ChIP DNA fragments extracted 16 hours post transfection as well as input DNA fragments. For pools 4-7, we have provided barcode/oligo combinations as paired-end Illumina reads covering the barcode and enhancer sequence. Processed count files are counts corresponding to each barcode (Pools 1-3) or counts summed across all barcodes for each oligo (Pools 4-7).
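
    The difference between the two processed formats mentioned above, per-barcode counts versus counts summed across all barcodes of an oligo, can be sketched in Python. The barcode and oligo names below are hypothetical, and ignoring barcodes without an oligo assignment is an assumption rather than something the description states.

```python
from collections import defaultdict

def sum_counts_per_oligo(barcode_counts, barcode_to_oligo):
    """Collapse per-barcode counts into one summed count per oligo."""
    totals = defaultdict(int)
    for barcode, count in barcode_counts.items():
        oligo = barcode_to_oligo.get(barcode)
        # Assumption: barcodes with no known oligo assignment are ignored.
        if oligo is not None:
            totals[oligo] += count
    return dict(totals)

# Hypothetical barcode counts and barcode-to-oligo assignments.
barcode_counts = {"ACGT": 12, "TTGA": 8, "GGCC": 5, "CATG": 3}
barcode_to_oligo = {"ACGT": "oligo_1", "TTGA": "oligo_1", "GGCC": "oligo_2"}
oligo_counts = sum_counts_per_oligo(barcode_counts, barcode_to_oligo)
```

    Summing per oligo gives up barcode-level replication in exchange for more stable counts, which is why the two formats are not interchangeable in downstream analysis.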

  14. Supplementary tables of "Synthetic enhancers reveal design principles of...

    • figshare.com
    zip
    Updated Sep 3, 2024
    Cite
    Lars Velten (2024). Supplementary tables of "Synthetic enhancers reveal design principles of cell state specific regulatory elements in hematopoiesis" [Dataset]. http://doi.org/10.6084/m9.figshare.26927866.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Sep 3, 2024
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Lars Velten
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
  15. Personalized genomes for DL models supporting data

    • zenodo.org
    tar
    Updated Nov 5, 2024
    Cite
    Adam He; Charles Danko; Nathan Palamuttam (2024). Personalized genomes for DL models supporting data [Dataset]. http://doi.org/10.5281/zenodo.14037356
    Explore at:
    Available download formats: tar
    Dataset updated
    Nov 5, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Adam He; Charles Danko; Nathan Palamuttam
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Archive of models and data associated with our manuscript "Training deep learning models on personalized genomic sequences improves variant effect prediction".

    Code for training and benchmarking LCL models is available at https://github.com/Danko-Lab/clipnet_ablation, whereas code for training and benchmarking K562 models is available at https://github.com/Danko-Lab/clipnet_k562/.

    Model files & metadata:

    • n{i}_run{j}.tar
      • CLIPNET LCL models trained on i individuals
    • subsample_individuals_ids.tar
      • text files containing lists of the individuals used to train the above models.
    • reference_models.tar
      • CLIPNET LCL model trained on data from 67 PRO-cap libraries, but using hg38 sequences instead of personal genomes.
    • clipnet_k562_reference.tar
      • hg38-trained model described above transfer learned to K562.

    Benchmark data:

  16. LOC127829729

    • rgd.mcw.edu
    Updated May 26, 2023
    + more versions
    Cite
    Rat Genome Database (2023). LOC127829729 [Dataset]. https://rgd.mcw.edu/rgdweb/report/gene/main.html?id=155751899
    Explore at:
    Dataset updated
    May 26, 2023
    Dataset authored and provided by
    Rat Genome Database
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This genomic region was validated as an active enhancer by the ChIP-STARR-seq massively parallel reporter assay (MPRA) in primed human embryonic stem cells, where it is marked by the H3K27ac histone modification. This locus also includes an accessible chromatin subregion that was validated as a silencer based on its ability to repress an origin of replication minimal core promoter by the ATAC-STARR-seq (assay for transposase-accessible chromatin with self-transcribing active regulatory region sequencing) MPRA in GM12878 lymphoblastoid cells. [provided by RefSeq, Jun 2023]

  17. APARENT2 Training Data and Models

    • data.niaid.nih.gov
    • zenodo.org
    Updated Nov 14, 2022
    Cite
    Linder, Johannes (2022). APARENT2 Training Data and Models [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7317445
    Explore at:
    Dataset updated
    Nov 14, 2022
    Dataset authored and provided by
    Linder, Johannes
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Processed training data for the APARENT2 model (measurements from the random MPRA and designed oligo pool originally published by Bogard et al., 2019; see https://doi.org/10.1016/j.cell.2019.04.046 for reference). This repository also contains the APARENT2 model file. For more information on the training procedure, see the Genome Biology article "Deciphering the impact of genetic variation on human polyadenylation using APARENT2" (https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02799-4). Two versions of the model are available:

    (a) aparent_all_libs_resnet_no_clinvar_wt_ep_5.h5: The originally trained APARENT2 model.
    (b) aparent_all_libs_resnet_no_clinvar_wt_ep_5_var_batch_size_inference_mode_no_drop.h5: Identical weights and predictions as model (a), but the normalization layers have been set to inference mode and the dropout layers have been removed (thus making it compatible with the scrambler pipeline).

  18. Data Sheet 1_An in vivo systemic massively parallel platform for deciphering...

    • frontiersin.figshare.com
    docx
    Updated Apr 9, 2025
    + more versions
    Cite
    Ashley R. Brown; Grant A. Fox; Irene M. Kaplow; Alyssa J. Lawler; BaDoi N. Phan; Lahari Gadey; Morgan E. Wirthlin; Easwaran Ramamurthy; Gemma E. May; Ziheng Chen; Qiao Su; C. Joel McManus; Robert van de Weerd; Andreas R. Pfenning (2025). Data Sheet 1_An in vivo systemic massively parallel platform for deciphering animal tissue-specific regulatory function.docx [Dataset]. http://doi.org/10.3389/fgene.2025.1533900.s011
    Explore at:
    Available download formats: docx
    Dataset updated
    Apr 9, 2025
    Dataset provided by
    Frontiers
    Authors
    Ashley R. Brown; Grant A. Fox; Irene M. Kaplow; Alyssa J. Lawler; BaDoi N. Phan; Lahari Gadey; Morgan E. Wirthlin; Easwaran Ramamurthy; Gemma E. May; Ziheng Chen; Qiao Su; C. Joel McManus; Robert van de Weerd; Andreas R. Pfenning
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Introduction: Transcriptional regulation is an important process wherein non-protein coding enhancer sequences play a key role in determining cell type identity and phenotypic diversity. In neural tissue, these gene regulatory processes are crucial for coordinating a plethora of interconnected and regionally specialized cell types, ensuring their synchronized activity in generating behavior. Recognizing the intricate interplay of gene regulatory processes in the brain is imperative, as mounting evidence links neurodevelopment and neurological disorders to non-coding genome regions. While genome-wide association studies are swiftly identifying non-coding human disease-associated loci, decoding regulatory mechanisms is challenging due to causal variant ambiguity and their specific tissue impacts.

    Methods: Massively parallel reporter assays (MPRAs) are widely used in cell culture to study the non-coding enhancer regions, linking genome sequence differences to tissue-specific regulatory function. However, widespread use in animals encounters significant challenges, including insufficient viral library delivery and library quantification, irregular viral transduction rates, and injection site inflammation disrupting gene expression. Here, we introduce a systemic MPRA (sysMPRA) to address these challenges through systemic intravenous AAV viral delivery.

    Results: We demonstrate successful transduction of the MPRA library into diverse mouse tissues, efficiently identifying tissue specificity in candidate enhancers and aligning well with predictions from machine learning models. We highlight that sysMPRA effectively uncovers regulatory effects stemming from the disruption of MEF2C transcription factor binding sites, single-nucleotide polymorphisms, and the consequences of genetic variations associated with late-onset Alzheimer's disease.

    Conclusion: SysMPRA is an effective library delivering method that simultaneously determines the transcriptional functions of hundreds of enhancers in vivo across multiple tissues.

  19. Gosai et al. (2024) Evaluator Container for Genomic API for Model Evaluation...

    • zenodo.org
    bin, zip
    Updated Feb 21, 2025
    Cite
    Ishika Luthra (2025). Gosai et al. (2024) Evaluator Container for Genomic API for Model Evaluation (GAME) [Dataset]. http://doi.org/10.5281/zenodo.14908238
    Explore at:
    Available download formats: bin, zip
    Dataset updated
    Feb 21, 2025
    Dataset provided by
    Zenodo http://zenodo.org/
    Authors
    Ishika Luthra
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Evaluator container for Gosai et al. 2024 MPRA sequences. A total of 776,474 sequences (200bp) were measured in 3 human cell lines.

    Gosai, S.J., Castro, R.I., Fuentes, N. et al. Machine-guided design of cell-type-targeting cis-regulatory elements. Nature 634, 1211–1220 (2024). https://doi.org/10.1038/s41586-024-08070-z

    gosai_evaluator.sif contains all dependencies and scripts required for the Evaluator container to read in the raw MPRA data, parse into the correct API format, and connect with any Predictor container via TCP sockets.

    test_gosai_predictor.sif contains all dependencies and scripts required for a test Predictor container that can be used with the Gosai Evaluator container.

    Additional information can be found here: https://github.com/de-Boer-Lab/Genomic-Model-Evaluation-API

  20. LOC112942286

    • rgd.mcw.edu
    Updated Feb 9, 2018
    + more versions
    Cite
    Rat Genome Database (2018). LOC112942286 [Dataset]. https://rgd.mcw.edu/rgdweb/report/gene/main.html?id=38616171
    Dataset updated
    Feb 9, 2018
    Dataset authored and provided by
    Rat Genome Database
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This genomic sequence was predicted to be a transcriptional regulatory region based on chromatin state analysis from the ENCODE (ENCyclopedia Of DNA Elements) project. It was validated as an active enhancer by the ChIP-STARR-seq massively parallel reporter assay (MPRA) in naive and primed human embryonic stem cells, where it is marked by the H3K27ac histone modification. A subregion was also validated as an enhancer by Sharpr-MPRA (Systematic high-resolution activation and repression profiling with reporter tiling using massively parallel reporter assays) in both HepG2 liver carcinoma cells (group: HepG2 Activating DNase unmatched - State 12:CtcfO, distal CTCF/candidate insulator with open chromatin) and K562 erythroleukemia cells (group: K562 Activating DNase matched - State 13:Ctcf, distal CTCF/candidate insulator without open chromatin). This locus also includes an accessible chromatin subregion that was validated as an enhancer based on its ability to activate an origin of replication minimal core promoter by the ATAC-STARR-seq (assay for transposase-accessible chromatin with self-transcribing active regulatory region sequencing) MPRA in GM12878 lymphoblastoid cells. [provided by RefSeq, May 2023]

Cite
Lars Velten; Robert Frömel (2025). MPRA data of synthetic enhancers in hematopoiesis [Dataset]. http://doi.org/10.6084/m9.figshare.25713519.v3

MPRA data of synthetic enhancers in hematopoiesis

Explore at:
Available download formats: bin
Dataset updated
Mar 15, 2025
Dataset provided by
Figshare http://figshare.com/
Authors
Lars Velten; Robert Frömel
License

Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Overview

R Data Archive file containing MPRA data measuring the activity of synthetic DNA constructs in 7 cell states of murine primary hematopoietic stem and progenitor cells (HSPCs), and K562 cells. See our manuscript, https://www.biorxiv.org/content/10.1101/2024.08.26.609645v1

This file contains a main data object, mpra.data, a list over the different experiments:

HSPC.libA : Library A (38 factors, one TFBS per enhancer), HSPC experiment
HSPC.libB : Library B (10 factors, TFBS pairs), HSPC experiment
HSPC.libC : Library C (42 factors, TFBS pairs), HSPC experiment
HSPC.libC.aggregate : Library C, HSPC experiment, aggregated across cell states
HSPC.libD : Library D (automated enhancer design), HSPC experiment
HSPC.libF : Library F (Fli1-Spi1 and Gata2-Cebpa combinations), HSPC experiment
HSPC.libG : Library G (Genomic sequences), HSPC experiment
HSPC.libH : Library H (complex synthetic sequences with 3-12 TFBS)
K562.libA.minP.tra : Library A, K562 cell experiment
K562.libB.minP.tra : Library B, K562 cell experiment
K562.libC.minP.tra : Library C, K562 cell experiment
K562.libB.minCMV.tra : Library B, K562 cell experiment, measured in a vector with the minimal CMV promoter instead of the default minimal promoter
K562.libB.minP.int : Library B, K562 cell experiment, measured 7 days after infection instead of 4 days after infection

Each list entry is a list of data frames, the exact format of which is explained below:

DATA : Data of main constructs
CONTROLS.GENERAL : Various controls, including random DNA measurements obtained as part of the same experiment
CONTROLS.TP53 : An identical set of sequences from library A that was included in each experiment
BACKGROUND : 90% confidence intervals of activity achieved by random DNA

Detailed explanation of main data frames (DATA)

Each row corresponds to a single gene regulatory element, measured in a single cell state. The following columns are present for all libraries:

clusterID : The cell state where the measurement was performed. To map the entries to labels, use the vector cellstate.map
CRS : The unique ID of the gene regulatory element
Library : The library (A, B or C)
Seq : The DNA sequence. Capital letters correspond to placed motifs, small letters correspond to background DNA.
RNA.1, RNA.2, DNA.1, DNA.2 : Molecule counts on DNA and RNA level in replicates 1 and 2
RNA.norm.1, RNA.norm.2, DNA.norm.1, DNA.norm.2 : Library-size normalized molecule counts (???)
norm.1.raw, norm.2.raw : Raw log2 of RNA/DNA counts in replicates 1 and 2
norm.1.adj, norm.2.adj : log2 of RNA/DNA counts in replicates 1 and 2, after subtracting the median activity of random DNA
mean.norm.raw : Mean raw activity across replicates (log2 scale RNA/DNA)
mean.norm.adj : Mean activity across replicates, using a scale where 0 is the median activity of random DNA in that cell state
mean.scaled.final : Scaled activity measurement for visualization only, using a scale where 0 is the median activity of random DNA and 1 is the maximal activity achieved by Tp53 (not computed in K562 screens). The purpose of this scaling is to account for different baselines and offsets achieved in different screens. Use mean.norm.adj for model training, statistics etc.

The following columns regarding sequence design are only present in the single-factor library A:

TF : The transcription factor placed on the DNA
nrepeats : Number of placed motifs
affinitynum : Affinity quantile (on a scale where 1 is the most likely sequence given the PWM, and 0 is an extremely unlikely sequence)
sum.biophys.affinity : Sum of actual affinities across the sequence, computed using considerations from statistical thermodynamics
orientation : Orientation of the motif: forward (fwd), reverse (rev) or alternating fwd-rev (tandem)
spacer : Number of base pairs of spacing between motifs

The following columns are only present in the dual-factor libraries B and C:

TF1.name : Name of the transcription factor whose motif appears first, coming from 5'
TF1.affinity : Corresponding affinity (on a scale from 0 to 1)
TF1.orientation : Corresponding orientation
TF2.name : Name of the transcription factor whose motif appears second, coming from 5'
TF2.affinity : Corresponding affinity (on a scale from 0 to 1)
TF2.orientation : Corresponding orientation
spacer : Spacing between sites
TFnumber : Number of sites for each factor
TForder : Arrangement of sites (Alternate or Block)

The following columns are only present in the automated design library D:

SubLibrary : Whether the goal was to design enhancers with specific activation or repression
Task_MegEry, Task_Basophil, Task_Eosinophil, Task_Monocyte, Task_Neutrophil, Task_Immature : Task definition in the different cell states. -3 for repression, -0.2 / 0.6 for inactivity (depending on whether the task was activation or repression), 1 for activity.
design_strategy : Whether the design was initialized with a random sequence or a random forest model was used to identify an optimal TFBS combination (model-guided)
design_search : Whether optimization was done with a local or global search

The following columns are only present in the dual-factor library F:

spacer : Spacing between sites
nFli1, nSpi1, nCebpa, nGata2 : Number of Fli1/Spi1/Cebpa/Gata2 sites
Fli1_affinities_sum, Spi1_affinities_sum, Cebpa_affinities_sum, Gata2_affinities_sum : Sum of motif scores for Fli1, Spi1, Cebpa, Gata2

The following columns are only present in the genomic library G:

chromosome, start_coordinate, end_coordinate : Genomic coordinates (mm10)

Example code for working with main DATA frames

Subsetting LibB/LibC data frames

To extract all data belonging to a given pair of transcription factors from libraries B and C in a format that ignores the order of the sites (i.e. which TF comes first and which TF comes second), the RDA file contains a function getsubset.libBC. This function takes as arguments two transcription factors and a DATA frame, e.g.

getsubset.libBC("Spi1", "Fli1", mpra.data$HSPC.libC$DATA)

It returns a similar data frame, except that now TF1.name is always Spi1 (no matter whether Spi1 or Fli1 is placed first), and TF2.name is always Fli1. It also adds some convenient columns:

oricomb : Orientation of both factors
affnum : Number of weak and strong binding sites for both factors (e.g. 3W-3S means that the first factor has 3 weak binding sites, and the second factor has 3 strong binding sites)

Casting data frames

To convert the data frame into a format where one row is one sequence, and columns are measurements in different cell states, you can use:

require(reshape2)
casted.dataframe
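The relationship between the activity columns described above (norm.*.raw, norm.*.adj and mean.scaled.final) can be illustrated with a small Python sketch. This is not the authors' code: the function names and the toy numbers are hypothetical, and it assumes that the counts are already library-size normalized and non-zero, and that the scaled value is obtained by dividing the adjusted activity by the maximal adjusted Tp53 activity.

```python
import math
from statistics import median


def log2_ratio(rna, dna):
    """Raw activity: log2 of the RNA/DNA count ratio (cf. norm.1.raw).

    Assumes both counts are positive; no pseudocount is added here.
    """
    return math.log2(rna / dna)


def adjust(raw, random_dna_raw):
    """Subtract the median raw activity of random DNA (cf. norm.1.adj)."""
    return raw - median(random_dna_raw)


def scale(adj, tp53_adj):
    """Rescale so that 0 is the random-DNA median and 1 is the maximal
    Tp53 activity (cf. mean.scaled.final). Assumed formula, for
    illustration only.
    """
    return adj / max(tp53_adj)


# Toy values, not taken from the dataset.
raw = log2_ratio(8.0, 2.0)           # log2(4) = 2.0
adj = adjust(raw, [0.0, 0.5, 1.0])   # 2.0 - 0.5 = 1.5
scaled = scale(adj, [1.0, 3.0])      # 1.5 / 3.0 = 0.5
```

As the description notes, the scaled value is intended for visualization only; the adjusted value (mean.norm.adj in the dataset) is the one to use for model training and statistics.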
