100+ datasets found

MPRA data of synthetic enhancers in hematopoiesis
figshare.com
bin
Updated Mar 15, 2025
Cite
Lars Velten; Robert Frömel (2025). MPRA data of synthetic enhancers in hematopoiesis [Dataset]. http://doi.org/10.6084/m9.figshare.25713519.v3
Available download formats: bin
Unique identifier
https://doi.org/10.6084/m9.figshare.25713519.v3
Dataset updated
Mar 15, 2025
Dataset provided by
Figshare (http://figshare.com/)
Authors
Lars Velten; Robert Frömel
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Overview
R Data Archive file containing MPRA data measuring the activity of synthetic DNA constructs in 7 cell states of murine primary hematopoietic stem and progenitor cells (HSPCs), and in K562 cells. See our manuscript, https://www.biorxiv.org/content/10.1101/2024.08.26.609645v1

This file contains a main data object, mpra.data, a list over the different experiments:
HSPC.libA : Library A (38 factors, one TFBS per enhancer), HSPC experiment
HSPC.libB : Library B (10 factors, TFBS pairs), HSPC experiment
HSPC.libC : Library C (42 factors, TFBS pairs), HSPC experiment
HSPC.libC.aggregate : Library C, HSPC experiment, aggregated across cell states
HSPC.libD : Library D (automated enhancer design), HSPC experiment
HSPC.libF : Library F (Fli1-Spi1 and Gata2-Cebpa combinations), HSPC experiment
HSPC.libG : Library G (genomic sequences), HSPC experiment
HSPC.libH : Library H (complex synthetic sequences with 3-12 TFBS)
K562.libA.minP.tra : Library A, K562 cell experiment
K562.libB.minP.tra : Library B, K562 cell experiment
K562.libC.minP.tra : Library C, K562 cell experiment
K562.libB.minCMV.tra : Library B, K562 cell experiment, measured in a vector with the minimal CMV promoter instead of the default minimal promoter
K562.libB.minP.int : Library B, K562 cell experiment, measured 7 days after infection instead of 4 days after infection

Each list entry is a list of data frames, the exact format of which is explained below:
DATA : Data of main constructs
CONTROLS.GENERAL : Various controls, including random DNA measurements obtained as part of the same experiment
CONTROLS.TP53 : An identical set of sequences from library A that was included in each experiment
BACKGROUND : 90% confidence intervals of activity achieved by random DNA

Detailed explanation of main data frames (DATA)
Each row corresponds to a single gene regulatory element, measured in a single cell state. The following columns are present for all libraries:
clusterID : The cell state where the measurement was performed. To map the entries to labels, use the vector cellstate.map
CRS : The unique ID of the gene regulatory element
Library : The library (A, B or C)
Seq : The DNA sequence. Capital letters correspond to placed motifs, small letters correspond to background DNA.
RNA.1 , RNA.2 , DNA.1 , DNA.2 : Molecule counts on DNA and RNA level in replicate 1 and 2
RNA.norm.1 , RNA.norm.2 , DNA.norm.1 , DNA.norm.2 : Library-size normalized molecule counts (???)
norm.1.raw , norm.2.raw : Raw log2 of RNA/DNA counts in replicate 1 and 2
norm.1.adj , norm.2.adj : log2 of RNA/DNA counts in replicate 1 and 2, subtracting the median activity of random DNA
mean.norm.raw : Mean raw activity across replicates (log2 scale RNA/DNA)
mean.norm.adj : Mean activity across replicates, using a scale where 0 is the median activity of random DNA in that cell state
mean.scaled.final : Scaled activity measurement for visualization only, using a scale where 0 is the median activity of random DNA and 1 is the maximal activity achieved by Tp53 (not computed in K562 screens). The purpose of this scaling is to account for different baselines and offsets achieved in different screens. Use mean.norm.adj for model training, statistics etc.

The following columns regarding sequence design are only present in the single-factor library A:
TF : The transcription factor placed on the DNA
nrepeats : Number of placed motifs
affinitynum : Affinity quantile (on a scale where 1 is the most likely sequence given the PWM, and 0 is an extremely unlikely sequence)
sum.biophys.affinity : Sum of actual affinities across the sequence, computed using considerations from statistical thermodynamics
orientation : Orientation of the motif: forward (fwd), reverse (rev) or alternating fwd-rev (tandem)
spacer : Number of base pairs of spacing between motifs

The following columns are only present in the dual-factor libraries B and C:
TF1.name : Name of the transcription factor whose motif appears first, coming from the 5′ end
TF1.affinity : Corresponding affinity (on a scale from 0 to 1)
TF1.orientation : Corresponding orientation
TF2.name : Name of the transcription factor whose motif appears second, coming from the 5′ end
TF2.affinity : Corresponding affinity (on a scale from 0 to 1)
TF2.orientation : Corresponding orientation
spacer : Spacing between sites
TFnumber : Number of sites for each factor
TForder : Arrangement of sites (Alternate or Block)

The following columns are only present in the automated design library D:
SubLibrary : Whether the goal was to design enhancers with specific activation or repression
Task_MegEry, Task_Basophil, Task_Eosinophil, Task_Monocyte, Task_Neutrophil, Task_Immature : Task definition in the different cell states. -3 for repression, -0.2 / 0.6 for inactivity (depending on whether the task was activation or repression), 1 for activity.
design_strategy : Whether the design was initialized with a random sequence or a random forest model was used to identify an optimal TFBS combination (model-guided)
design_search : Whether optimization was done with a local or global search

The following columns are only present in the dual-factor library F:
spacer : Spacing between sites
nFli1, nSpi1, nCebpa, nGata2 : Number of Fli1/Spi1/Cebpa/Gata2 sites
Fli1_affinities_sum, Spi1_affinities_sum, Cebpa_affinities_sum, Gata2_affinities_sum : Sum of motif scores for Fli1, Spi1, Cebpa, Gata2

The following columns are only present in the genomic library G:
chromosome, start_coordinate, end_coordinate : Genomic coordinates (mm10)

Example code for working with main DATA frames

Subsetting LibB/LibC data frames
To extract all data belonging to a given pair of transcription factors from library B and C in a format that ignores the order of the sites (i.e. which TF comes first and which TF comes second), the RDA file contains a function getsubset.libBC. This function takes as arguments two transcription factors and a DATA frame, e.g.
getsubset.libBC("Spi1", "Fli1", mpra.data$HSPC.libC$DATA)
It returns a similar data frame, except that now TF1.name is always Spi1 (no matter whether Spi1 or Fli1 is placed first) and TF2.name is always Fli1. It also adds some convenient columns:
oricomb : Orientation of both factors
affnum : Number of weak and strong binding sites for both factors (e.g. 3W-3S means that the first factor has 3 weak binding sites, and the second factor has 3 strong binding sites)

Casting data frames
To convert the data frame into a format where one row is one sequence, and columns are measurements in different cell states, you can use:
require(reshape2)
casted.dataframe
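
The relationship between the raw and adjusted activity columns described above (norm.1.raw, norm.1.adj) can be illustrated with a minimal Python sketch. It is not the authors' R code: the function names and toy numbers are hypothetical, and it assumes the raw activity is simply log2(RNA/DNA) with no pseudocount, and that the adjustment subtracts the median log2 ratio of the random-DNA controls measured in the same cell state.

```python
import math
from statistics import median


def log2_ratio(rna, dna):
    """Raw activity: log2 of RNA over DNA counts (assumed form, no pseudocount)."""
    return math.log2(rna / dna)


def adjust_to_background(raw_activities, random_dna_activities):
    """Subtract the median random-DNA activity so that 0 means background level."""
    baseline = median(random_dna_activities)
    return [a - baseline for a in raw_activities]


# Toy (RNA, DNA) counts, not taken from the dataset.
constructs = [(80.0, 10.0), (20.0, 10.0)]
random_dna = [(10.0, 10.0), (20.0, 10.0), (40.0, 10.0)]

raw = [log2_ratio(r, d) for r, d in constructs]
background = [log2_ratio(r, d) for r, d in random_dna]
adjusted = adjust_to_background(raw, background)

print(raw)       # raw log2 activities
print(adjusted)  # activities relative to the random-DNA median
```

Under these assumptions, a construct with an adjusted value of 0 is as active as the median random sequence, which matches the scale described for mean.norm.adj.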
LS-MPRA / d-MPRA Data Repository
dataverse.harvard.edu
Updated May 26, 2025
Cite
Alastair Tulloch (2025). LS-MPRA / d-MPRA Data Repository [Dataset]. http://doi.org/10.7910/DVN/TW0ZQL
Available formats: Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.7910/DVN/TW0ZQL
Dataset updated
May 26, 2025
Dataset provided by
Harvard Dataverse
Authors
Alastair Tulloch
License
CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Resources used for the manuscript titled: "Massively parallel reporter assay for mapping gene-specific regulatory regions at single nucleotide resolution". The dataset includes scripts used to analyze data, raw sequencing files, and HOMER de novo motif analyses.
Sequencing data for reporter assay in Jindal et al Dev Cell 2023 article
figshare.com
application/gzip
Updated Aug 24, 2023
Cite
Granton Jindal (2023). Sequencing data for reporter assay in Jindal et al Dev Cell 2023 article [Dataset]. http://doi.org/10.6084/m9.figshare.23834814.v1
Available download formats: application/gzip
Unique identifier
https://doi.org/10.6084/m9.figshare.23834814.v1
Dataset updated
Aug 24, 2023
Dataset provided by
Figshare (http://figshare.com/)
Authors
Granton Jindal
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Heart-specific enhancers drive expression of genes specifically in heart tissues. We find that low-affinity ETS transcription factor binding sites are necessary for the FoxF enhancer in Ciona and the GATA4-G9 enhancer in mice. To determine if higher affinity sites would result in gain-of-function activity, we tested the human GATA4-G9 enhancer and 2 variants with optimized ETS sites in human iPSC-cardiomyocytes, using a reporter assay. We discovered that both variants with optimized ETS sites drove gain-of-function activity and that just a single nucleotide variant within a human GATA4 enhancer increases ETS binding affinity and causes gain-of-function enhancer activity. The prevalence of suboptimal-affinity sites within enhancers creates a vulnerability whereby affinity-optimizing SNVs can lead to gain-of-function gene expression, changes in cellular identity, and organismal-level phenotypes that could contribute to the evolution of novel traits or diseases.
BIOGRID CURATED DATA FOR MPRA (Escherichia coli (K12/W3110))
thebiogrid.org
zip
Updated Feb 24, 2017
Cite
BioGRID Project (2017). BIOGRID CURATED DATA FOR MPRA (Escherichia coli (K12/W3110)) [Dataset]. https://thebiogrid.org/4262945/summary/escherichia-coli/mpra.html
Available download formats: zip
Dataset updated
Feb 24, 2017
Dataset authored and provided by
BioGRID Project
License
MIT License, https://opensource.org/licenses/MIT
License information was derived automatically
Description
Protein-Protein, Genetic, and Chemical Interactions for MPRA (Escherichia coli (K12/W3110)) curated by BioGRID (https://thebiogrid.org); DEFINITION: DNA-binding transcriptional regulator
Supporting data for: Three-dimensional genome re-wiring in loci with Human...
search.dataone.org
data.niaid.nih.gov
+2more
Updated Nov 29, 2023
Cite
Kathleen Keough (2023). Supporting data for: Three-dimensional genome re-wiring in loci with Human Accelerated Regions [Dataset]. http://doi.org/10.7272/Q6057D5N
Unique identifier
https://doi.org/10.7272/Q6057D5N
Dataset updated
Nov 29, 2023
Dataset provided by
Dryad Digital Repository
Authors
Kathleen Keough
Time period covered
Jan 1, 2023
Description
Human Accelerated Regions (HARs) are conserved genomic loci that evolved at an accelerated rate in the human lineage and may underlie human-specific traits. We generated HARs and chimpanzee accelerated regions with an automated pipeline and an alignment of 241 mammalian genomes. Combining deep-learning with chromatin capture experiments in human and chimpanzee neural progenitor cells, we discovered a significant enrichment of HARs in topologically associating domains (TADs) containing human-specific genomic variants that change three-dimensional (3D) genome organization. Differential gene expression between humans and chimpanzees at these loci suggests rewiring of regulatory interactions between HARs and neurodevelopmental genes. Thus, comparative genomics together with models of 3D genome folding revealed enhancer hijacking as an explanation for the rapid evolution of HARs.

Lentivirus-based massively parallel reporter assay (lentiMPRA) library design and synthesis
Tiles of 270bp in length were generated from all 312 zooHARs. Multiple tiles were generated with a sliding window of 20bp if the zooHAR was longer than 270bp. In total, 549 oligos were designed to cover all zooHARs. We also included 143 oligos centered on active chromatin marks as positive controls. This oligo pool was synthesized by Twist Bioscience.

Primary cortical cell culture for lentiMPRA
De-identified tissue samples were collected with consent in strict observance of legal and institutional ethical regulations. Protocols were approved by the Human Gamete, Embryo, and Stem Cell Research Committee (institutional review board) at the University of California, San Francisco. Gestational week 18 cortical tissue was dissociated into a single-cell suspension using papain (LK003150, Worthington Biochemical) and plated on 15cm dishes coated with poly-O-lysine, laminin, and fibronectin. DMEM culture...
Data from: Massively Parallel Reporter Assays for High-Throughput In Vivo...
zenodo.org
application/gzip, bin +1
Updated Mar 29, 2023
Cite
Nathan J. VanDusen (2023). Massively Parallel Reporter Assays for High-Throughput In Vivo Analysis of Cis-Regulatory Elements [Dataset]. http://doi.org/10.5281/zenodo.7779156
Available download formats: application/gzip, bin, xls
Unique identifier
https://doi.org/10.5281/zenodo.7779156
Dataset updated
Mar 29, 2023
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Nathan J. VanDusen
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
A library of 50 enhancers, each tested in three different lengths and with two different promoters (300 combinations), was packaged into AAV9 and delivered to newborn mice. Enhancers were selected from the VISTA Enhancer Browser of transgenic reporter data, and included 25 candidates active in the embryonic myocardium and 25 negative control candidates active in embryonic endothelium but not in myocardium. In the heart, AAV9 selectively transduces cardiomyocytes. After collecting ventricles at P28, the reporter transcripts were sequenced, and the frequency of each barcode was compared to its frequency in the viral pool DNA.

Here we provide fastq files for each sample, an Excel spreadsheet (MPRA-Metadata.xls) containing annotation, and an Excel spreadsheet (MPRA-counts.xlsx) containing extracted barcode counts for each enhancer, as well as additional annotation and calculated enhancer activity.
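
The comparison described above, between the frequency of each barcode among reporter transcripts and its frequency in the viral pool DNA, can be sketched in Python as follows. This is only an illustration under stated assumptions: the barcode strings, counts, and function names are hypothetical, and the actual calculation used for the "calculated enhancer activity" in MPRA-counts.xlsx may differ (for example in how barcodes are aggregated per enhancer).

```python
def frequencies(counts):
    """Convert raw barcode counts into fractions of the library total."""
    total = sum(counts.values())
    return {barcode: n / total for barcode, n in counts.items()}


def activity_ratios(rna_counts, dna_counts):
    """RNA frequency divided by viral-pool DNA frequency, per barcode.

    Barcodes absent from the RNA sample get a ratio of 0.0; barcodes absent
    from the DNA pool are skipped because their ratio is undefined.
    """
    rna_freq = frequencies(rna_counts)
    dna_freq = frequencies(dna_counts)
    return {
        barcode: rna_freq.get(barcode, 0.0) / dna_freq[barcode]
        for barcode in dna_freq
    }


# Toy counts for three hypothetical barcodes.
rna = {"AAAC": 60, "CCGT": 30, "GGTA": 10}
dna = {"AAAC": 25, "CCGT": 50, "GGTA": 25}

ratios = activity_ratios(rna, dna)
for barcode, ratio in sorted(ratios.items()):
    print(barcode, round(ratio, 3))
```

A ratio above 1 means the barcode is over-represented in transcripts relative to the delivered DNA, which is the usual reading of enhancer activity in this kind of assay.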
RData file of estimate comparisons and primary MPRA data.
plos.figshare.com
application/gzip
Updated Jun 2, 2023
Cite
Andrew R. Ghazi; Xianguo Kong; Ed S. Chen; Leonard C. Edelstein; Chad A. Shaw (2023). RData file of estimate comparisons and primary MPRA data. [Dataset]. http://doi.org/10.1371/journal.pcbi.1007504.s005
Available download formats: application/gzip
Unique identifier
https://doi.org/10.1371/journal.pcbi.1007504.s005
Dataset updated
Jun 2, 2023
Dataset provided by
PLOS Computational Biology
Authors
Andrew R. Ghazi; Xianguo Kong; Ed S. Chen; Leonard C. Edelstein; Chad A. Shaw
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
An RData file that contains three data frames: ulirsch_comparisons, primary_comparisons, and primary_mpra_data. The first two data frames are the data necessary to produce Fig 4. Each row corresponds to one variant, and each column corresponds to a given analysis method. The values in the table give the transcription shift estimates. The third data frame gives the barcode counts from our primary MPRA dataset with anonymized variant identifiers. (RDATA)
Source Data for Supplementary Note Figures
figshare.com
xlsx
Updated Nov 9, 2023
Cite
Carmen Bravo Gonzalez-Blas; Stein Aerts (2023). Source Data for Supplementary Note Figures [Dataset]. http://doi.org/10.6084/m9.figshare.24532951.v2
Available download formats: xlsx
Unique identifier
https://doi.org/10.6084/m9.figshare.24532951.v2
Dataset updated
Nov 9, 2023
Dataset provided by
Figshare (http://figshare.com/)
Authors
Carmen Bravo Gonzalez-Blas; Stein Aerts
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Source Data for Supplementary Note Figures from Bravo et al. 2023.
ENCSR186NQR
encodeproject.org
Updated May 19, 2021
+ more versions
Cite
Jay Shendure (2021). ENCSR186NQR [Dataset]. www.encodeproject.org/functional-characterization-experiments/ENCSR186NQR/
Dataset updated
May 19, 2021
Dataset provided by
The ENCODE Data Coordination Center
Authors
Jay Shendure
License
www.encodeproject.org/help/citing-encode/
Measurement technique
Control MPRA (OBI:0002675)
Description
Control MPRA - Homo sapiens K562 genetically modified (insertion) using transduction - ENCODE - UM1HG009408 - Nadav Ahituv, UCSF
Data from: Systematic dissection and optimization of inducible enhancers in...
data.niaid.nih.gov
Updated May 15, 2019
Cite
Melnikov A; Murugan A; Zhang X; Mikkelsen TS (2019). Systematic dissection and optimization of inducible enhancers in human cells using a massively parallel reporter assay [Dataset]. https://data.niaid.nih.gov/resources?id=gse31982
Dataset updated
May 15, 2019
Dataset provided by
Broad Institute
Authors
Melnikov A; Murugan A; Zhang X; Mikkelsen TS
Description
We apply a massively parallel reporter assay (MPRA) that relies on mRNA and plasmid tag sequencing (Tag-Seq) to compare the regulatory activities of more than 27,000 distinct variants of two inducible enhancers in human cells: a synthetic cAMP-regulated enhancer and the virus-inducible interferon beta enhancer. The resulting data define accurate maps of functional transcription factor binding sites in both enhancers at single-nucleotide resolution and can be used to train quantitative sequence-activity models (QSAMs). Reporter Tag-Seq from HEK293 cells transfected with each of six MPRA plasmid pools, with and without stimulation (forskolin or Sendai virus). The reporter mRNAs contain unique 10-nucleotide tags that facilitate quantitation of their abundances. The same tags were also sequenced from each transfected plasmid pool to facilitate normalization to plasmid copy numbers. The reporter constructs were designed according to two different mutagenesis strategies: 'single-hit scanning' and 'multi-hit sampling'. The specific variants are included in the processed data files.
Deciphering regulatory DNA sequences and noncoding genetic variants using...
plos.figshare.com
pdf
Updated Jun 3, 2023
Cite
Rajiv Movva; Peyton Greenside; Georgi K. Marinov; Surag Nair; Avanti Shrikumar; Anshul Kundaje (2023). Deciphering regulatory DNA sequences and noncoding genetic variants using neural network models of massively parallel reporter assays [Dataset]. http://doi.org/10.1371/journal.pone.0218073
Available download formats: pdf
Unique identifier
https://doi.org/10.1371/journal.pone.0218073
Dataset updated
Jun 3, 2023
Dataset provided by
PLOS ONE
Authors
Rajiv Movva; Peyton Greenside; Georgi K. Marinov; Surag Nair; Avanti Shrikumar; Anshul Kundaje
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The relationship between noncoding DNA sequence and gene expression is not well-understood. Massively parallel reporter assays (MPRAs), which quantify the regulatory activity of large libraries of DNA sequences in parallel, are a powerful approach to characterize this relationship. We present MPRA-DragoNN, a convolutional neural network (CNN)-based framework to predict and interpret the regulatory activity of DNA sequences as measured by MPRAs. While our method is generally applicable to a variety of MPRA designs, here we trained our model on the Sharpr-MPRA dataset that measures the activity of ∼500,000 constructs tiling 15,720 regulatory regions in human K562 and HepG2 cell lines. MPRA-DragoNN predictions were moderately correlated (Spearman ρ = 0.28) with measured activity and were within range of replicate concordance of the assay. State-of-the-art model interpretation methods revealed high-resolution predictive regulatory sequence features that overlapped transcription factor (TF) binding motifs. We used the model to investigate the cell type and chromatin state preferences of predictive TF motifs. We explored the ability of our model to predict the allelic effects of regulatory variants in an independent MPRA experiment and fine map putative functional SNPs in loci associated with lipid traits. Our results suggest that interpretable deep learning models trained on MPRA data have the potential to reveal meaningful patterns in regulatory DNA sequences and prioritize regulatory genetic variants, especially as larger, higher-quality datasets are produced.
Data from: Functional dissection of human cardiac enhancers and non-coding...
zenodo.org
txt
Updated Jul 20, 2023
Cite
Xiaoran Zhang (2023). Functional dissection of human cardiac enhancers and non-coding de novo variants in congenital heart disease [Dataset]. http://doi.org/10.5281/zenodo.8162058
Available download formats: txt
Unique identifier
https://doi.org/10.5281/zenodo.8162058
Dataset updated
Jul 20, 2023
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Xiaoran Zhang
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This is the input file for the CHD MPRA motif analysis. For details, see https://github.com/pulab/CHD_DNVs/tree/main/MPRA-Enhancer/CHD_MPRA_project/CHD_MPRA_library
Distinct roles for motif affinity, chromatin state, and co-regulatory motifs...
data.niaid.nih.gov
Updated May 15, 2019
Cite
Grossman SR (2019). Distinct roles for motif affinity, chromatin state, and co-regulatory motifs in PPARγ binding and enhancer activity [Dataset]. https://data.niaid.nih.gov/resources?id=gse84888
Dataset updated
May 15, 2019
Dataset provided by
Broad Institute
Authors
Grossman SR
Description
Sequence-specific transcription factors (TFs) regulate gene expression by binding to cognate motifs in promoters and enhancers. However, predicting genomic TF binding events and their quantitative contribution to expression remains a major challenge. In principle, the binding and enhancer activity of specific sites in vivo might depend on: (i) latent properties of the motif instance, (ii) cooperative interactions with other TFs that bind in the immediate vicinity, and (iii) the chromatin state of the sites in the genome. Here, we used massively parallel reporter assays (MPRA) involving 32,115 natural and synthetic enhancers, together with high-throughput in vivo assays, to systematically dissect the contributions of motif affinity, cooperative interactions, and chromatin accessibility to the binding and regulatory activity of genomic sequences that contain motifs for PPARγ, a TF that serves as a key regulator of adipogenesis. We show that PPARγ binding and enhancer activity are governed by distinct features. Genomic PPARγ binding to motif sites is largely governed by larger-scale features, such as chromatin accessibility, whereas the degree to which a PPARγ motif site enhances transcriptional activity depends on the sequence immediately surrounding the motif. We detect and functionally validate a network of TFs comprised of multiple functional classes that collaborate with PPARγ to drive transcription. We extensively perturb this network, revealing functional cooperativity among classes of TFs that does not depend on precise positioning. Together, these results present a clear picture of how chromatin and TFs from distinct functional classes interact with PPARγ to determine binding and enhancer activity, and provide a paradigm for studying any TF. The study consisted of 7 MPRA experiments and 2 ChIP-seq experiments. 
Raw data for MPRA experiments are provided as Illumina reads of the 16 bp barcode from the RNA extracted 16 hours post transfection as well as from the plasmid library used for transfection. Raw data for ChIP-seq experiments are provided as paired-end Illumina reads for PPARg ChIP DNA fragments extracted 16 hours post transfection as well as input DNA fragments. For pools 4-7, we have provided barcode/oligo combinations as paired-end Illumina reads covering the barcode and enhancer sequence. Processed count files are counts corresponding to each barcode (Pools 1-3) or counts summed across all barcodes for each oligo (Pools 4-7).
Supplementary tables of "Synthetic enhancers reveal design principles of...
figshare.com
zip
Updated Sep 3, 2024
Cite
Lars Velten (2024). Supplementary tables of "Synthetic enhancers reveal design principles of cell state specific regulatory elements in hematopoiesis" [Dataset]. http://doi.org/10.6084/m9.figshare.26927866.v1
Available download formats: zip
Unique identifier
https://doi.org/10.6084/m9.figshare.26927866.v1
Dataset updated
Sep 3, 2024
Dataset provided by
Figshare (http://figshare.com/)
Authors
Lars Velten
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
See https://www.biorxiv.org/content/10.1101/2024.08.26.609645v1
For complete MPRA data, see https://doi.org/10.6084/m9.figshare.25713519.v1
Personalized genomes for DL models supporting data
zenodo.org
tar
Updated Nov 5, 2024
Cite
Adam He; Charles Danko; Nathan Palamuttam (2024). Personalized genomes for DL models supporting data [Dataset]. http://doi.org/10.5281/zenodo.14037356
Available download formats: tar
Unique identifier
https://doi.org/10.5281/zenodo.14037356
Dataset updated
Nov 5, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Adam He; Charles Danko; Nathan Palamuttam
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Archive of models and data associated with our manuscript "Training deep learning models on personalized genomic sequences improves variant effect prediction".

Code for training and benchmarking LCL models is available at https://github.com/Danko-Lab/clipnet_ablation, whereas code for training and benchmarking K562 models is available at https://github.com/Danko-Lab/clipnet_k562/.

Model files & metadata:

n{i}_run{j}.tar

CLIPNET LCL models trained on i individuals

subsample_individuals_ids.tar

text files containing lists of the individuals used to train the above models.

reference_models.tar

CLIPNET LCL model trained on data from 67 PRO-cap libraries, but using hg38 sequences instead of personal genomes.

clipnet_k562_reference.tar

hg38-trained model described above transfer learned to K562.

Benchmark data:

across_loci_metrics.tar

benchmarks of LCL models at predicting transcription initiation at individual CREs within the genome

qtl_metrics.tar

benchmarks of LCL models at predicting differences in transcription initiation between individuals at initiation QTLs

k562_data.tar

benchmarks of the reference-trained K562 model and one transferred over from the personalized CLIPNET model on MPRA data from https://www.biorxiv.org/content/10.1101/2024.05.05.592437v1
LOC127829729
rgd.mcw.edu
Updated May 26, 2023
+ more versions
Cite
Rat Genome Database (2023). LOC127829729 [Dataset]. https://rgd.mcw.edu/rgdweb/report/gene/main.html?id=155751899
Dataset updated
May 26, 2023
Dataset authored and provided by
Rat Genome Database
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This genomic region was validated as an active enhancer by the ChIP-STARR-seq massively parallel reporter assay (MPRA) in primed human embryonic stem cells, where it is marked by the H3K27ac histone modification. This locus also includes an accessible chromatin subregion that was validated as a silencer based on its ability to repress an origin of replication minimal core promoter by the ATAC-STARR-seq (assay for transposase-accessible chromatin with self-transcribing active regulatory region sequencing) MPRA in GM12878 lymphoblastoid cells. [provided by RefSeq, Jun 2023]
APARENT2 Training Data and Models
data.niaid.nih.gov
zenodo.org
Updated Nov 14, 2022
Cite
Linder, Johannes (2022). APARENT2 Training Data and Models [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7317445
Dataset updated
Nov 14, 2022
Dataset authored and provided by
Linder, Johannes
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Processed training data for the APARENT2 model (measurements from the random MPRA and designed oligo pool originally published by Bogard et al., 2019; see https://doi.org/10.1016/j.cell.2019.04.046 for reference). This repository also contains the APARENT2 model file. For more information on the training procedure, see the Genome Biology article "Deciphering the impact of genetic variation on human polyadenylation using APARENT2" (https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02799-4). Two versions of the model are available:

(a) aparent_all_libs_resnet_no_clinvar_wt_ep_5.h5: The originally trained APARENT2 model. (b) aparent_all_libs_resnet_no_clinvar_wt_ep_5_var_batch_size_inference_mode_no_drop.h5: Identical weights and predictions as model (a), but the normalization layers have been set to inference mode and the dropout layers have been removed (thus making it compatible with the scrambler pipeline).
Data Sheet 1_An in vivo systemic massively parallel platform for deciphering...
frontiersin.figshare.com
docx
Updated Apr 9, 2025
+ more versions
Cite
Ashley R. Brown; Grant A. Fox; Irene M. Kaplow; Alyssa J. Lawler; BaDoi N. Phan; Lahari Gadey; Morgan E. Wirthlin; Easwaran Ramamurthy; Gemma E. May; Ziheng Chen; Qiao Su; C. Joel McManus; Robert van de Weerd; Andreas R. Pfenning (2025). Data Sheet 1_An in vivo systemic massively parallel platform for deciphering animal tissue-specific regulatory function.docx [Dataset]. http://doi.org/10.3389/fgene.2025.1533900.s011
Available download formats: docx
Unique identifier
https://doi.org/10.3389/fgene.2025.1533900.s011
Dataset updated
Apr 9, 2025
Dataset provided by
Frontiers
Authors
Ashley R. Brown; Grant A. Fox; Irene M. Kaplow; Alyssa J. Lawler; BaDoi N. Phan; Lahari Gadey; Morgan E. Wirthlin; Easwaran Ramamurthy; Gemma E. May; Ziheng Chen; Qiao Su; C. Joel McManus; Robert van de Weerd; Andreas R. Pfenning
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Introduction: Transcriptional regulation is an important process wherein non-protein coding enhancer sequences play a key role in determining cell type identity and phenotypic diversity. In neural tissue, these gene regulatory processes are crucial for coordinating a plethora of interconnected and regionally specialized cell types, ensuring their synchronized activity in generating behavior. Recognizing the intricate interplay of gene regulatory processes in the brain is imperative, as mounting evidence links neurodevelopment and neurological disorders to non-coding genome regions. While genome-wide association studies are swiftly identifying non-coding human disease-associated loci, decoding regulatory mechanisms is challenging due to causal variant ambiguity and their specific tissue impacts.

Methods: Massively parallel reporter assays (MPRAs) are widely used in cell culture to study the non-coding enhancer regions, linking genome sequence differences to tissue-specific regulatory function. However, widespread use in animals encounters significant challenges, including insufficient viral library delivery and library quantification, irregular viral transduction rates, and injection site inflammation disrupting gene expression. Here, we introduce a systemic MPRA (sysMPRA) to address these challenges through systemic intravenous AAV viral delivery.

Results: We demonstrate successful transduction of the MPRA library into diverse mouse tissues, efficiently identifying tissue specificity in candidate enhancers and aligning well with predictions from machine learning models. We highlight that sysMPRA effectively uncovers regulatory effects stemming from the disruption of MEF2C transcription factor binding sites, single-nucleotide polymorphisms, and the consequences of genetic variations associated with late-onset Alzheimer's disease.

Conclusion: SysMPRA is an effective library delivering method that simultaneously determines the transcriptional functions of hundreds of enhancers in vivo across multiple tissues.
Gosai et al. (2024) Evaluator Container for Genomic API for Model Evaluation...
zenodo.org
bin, zip
Updated Feb 21, 2025
Cite
Ishika Luthra (2025). Gosai et al. (2024) Evaluator Container for Genomic API for Model Evaluation (GAME) [Dataset]. http://doi.org/10.5281/zenodo.14908238
Available download formats: bin, zip
Unique identifier
https://doi.org/10.5281/zenodo.14908238
Dataset updated
Feb 21, 2025
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Ishika Luthra
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Evaluator container for the Gosai et al. (2024) MPRA sequences. A total of 776,474 sequences (200 bp) were measured in 3 human cell lines.

Gosai, S.J., Castro, R.I., Fuentes, N. et al. Machine-guided design of cell-type-targeting cis-regulatory elements. Nature 634, 1211–1220 (2024). https://doi.org/10.1038/s41586-024-08070-z

gosai_evaluator.sif contains all dependencies and scripts required for the Evaluator container to read in the raw MPRA data, parse into the correct API format, and connect with any Predictor container via TCP sockets.

test_gosai_predictor.sif contains all dependencies and scripts required for a test Predictor container that can be used with the Gosai Evaluator container.

Additional information can be found here: https://github.com/de-Boer-Lab/Genomic-Model-Evaluation-API
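
The Evaluator/Predictor exchange over TCP sockets described above can be illustrated with a minimal Python sketch. This is not the GAME API: the newline-delimited JSON framing, the "sequences" and "predictions" field names, and the GC-fraction "model" are all assumptions made up for illustration. Consult the linked repository for the actual message format.

```python
import json
import socket
import threading


def predictor(server_sock):
    """Hypothetical predictor: answers one request with the GC fraction per sequence."""
    conn, _ = server_sock.accept()
    with conn, conn.makefile("r", encoding="utf-8") as reader:
        request = json.loads(reader.readline())
        scores = {
            name: (seq.count("G") + seq.count("C")) / len(seq)
            for name, seq in request["sequences"].items()
        }
        # One JSON object per line (assumed framing, not the real API).
        conn.sendall((json.dumps({"predictions": scores}) + "\n").encode("utf-8"))


def evaluate(sequences):
    """Hypothetical evaluator: sends sequences over loopback TCP, returns predictions."""
    server_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server_sock.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    server_sock.listen(1)
    port = server_sock.getsockname()[1]

    worker = threading.Thread(target=predictor, args=(server_sock,))
    worker.start()

    with socket.create_connection(("127.0.0.1", port)) as client:
        client.sendall((json.dumps({"sequences": sequences}) + "\n").encode("utf-8"))
        with client.makefile("r", encoding="utf-8") as reader:
            reply = json.loads(reader.readline())

    worker.join()
    server_sock.close()
    return reply["predictions"]


result = evaluate({"seq1": "ACGT", "seq2": "GGCC"})
print(result)
```

In the real setup the two sides run in separate containers; here both run in one process only so that the sketch is self-contained.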
LOC112942286
rgd.mcw.edu
Updated Feb 9, 2018
Cite
Rat Genome Database (2018). LOC112942286 [Dataset]. https://rgd.mcw.edu/rgdweb/report/gene/main.html?id=38616171
Dataset updated
Feb 9, 2018
Dataset authored and provided by
Rat Genome Database
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This genomic sequence was predicted to be a transcriptional regulatory region based on chromatin state analysis from the ENCODE (ENCyclopedia Of DNA Elements) project. It was validated as an active enhancer by the ChIP-STARR-seq massively parallel reporter assay (MPRA) in naive and primed human embryonic stem cells, where it is marked by the H3K27ac histone modification. A subregion was also validated as an enhancer by Sharpr-MPRA (Systematic high-resolution activation and repression profiling with reporter tiling using massively parallel reporter assays) in both HepG2 liver carcinoma cells (group: HepG2 Activating DNase unmatched - State 12:CtcfO, distal CTCF/candidate insulator with open chromatin) and K562 erythroleukemia cells (group: K562 Activating DNase matched - State 13:Ctcf, distal CTCF/candidate insulator without open chromatin). This locus also includes an accessible chromatin subregion that was validated as an enhancer based on its ability to activate an origin of replication minimal core promoter by the ATAC-STARR-seq (assay for transposase-accessible chromatin with self-transcribing active regulatory region sequencing) MPRA in GM12878 lymphoblastoid cells. [provided by RefSeq, May 2023]

Cite

Lars Velten; Robert Frömel (2025). MPRA data of synthetic enhancers in hematopoiesis [Dataset]. http://doi.org/10.6084/m9.figshare.25713519.v3

MPRA data of synthetic enhancers in hematopoiesis

Available download formats: bin

Unique identifier

https://doi.org/10.6084/m9.figshare.25713519.v3

Dataset updated

Mar 15, 2025

Dataset provided by

Figshare (http://figshare.com/)

Authors

Lars Velten; Robert Frömel

License

Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Overview

R Data Archive file containing MPRA data measuring the activity of synthetic DNA constructs in 7 cell states of murine primary hematopoietic stem and progenitor cells (HSPCs), and in K562 cells. See our manuscript, https://www.biorxiv.org/content/10.1101/2024.08.26.609645v1

This file contains a main data object, mpra.data, a list over the different experiments:

HSPC.libA : Library A (38 factors, one TFBS per enhancer), HSPC experiment
HSPC.libB : Library B (10 factors, TFBS pairs), HSPC experiment
HSPC.libC : Library C (42 factors, TFBS pairs), HSPC experiment
HSPC.libC.aggregate : Library C, HSPC experiment, aggregated across cell states
HSPC.libD : Library D (automated enhancer design), HSPC experiment
HSPC.libF : Library F (Fli1-Spi1 and Gata2-Cebpa combinations), HSPC experiment
HSPC.libG : Library G (genomic sequences), HSPC experiment
HSPC.libH : Library H (complex synthetic sequences with 3-12 TFBS)
K562.libA.minP.tra : Library A, K562 cell experiment
K562.libB.minP.tra : Library B, K562 cell experiment
K562.libC.minP.tra : Library C, K562 cell experiment
K562.libB.minCMV.tra : Library B, K562 cell experiment, measured in a vector with the minimal CMV promoter instead of the default minimal promoter
K562.libB.minP.int : Library B, K562 cell experiment, measured 7 days after infection instead of 4 days after infection

Each list entry is a list of data frames, the exact format of which is explained below:

DATA : Data of main constructs
CONTROLS.GENERAL : Various controls, including random DNA measurements obtained as part of the same experiment
CONTROLS.TP53 : An identical set of sequences from library A that was included in each experiment
BACKGROUND : 90% confidence intervals of activity achieved by random DNA

Detailed explanation of main data frames (DATA)

Each row corresponds to a single gene regulatory element, measured in a single cell state. The following columns are present for all libraries:

clusterID : The cell state where the measurement was performed. To map the entries to labels, use the vector cellstate.map
CRS : The unique ID of the gene regulatory element
Library : The library (A, B or C)
Seq : The DNA sequence. Capital letters correspond to placed motifs, small letters correspond to background DNA.
RNA.1 , RNA.2 , DNA.1 , DNA.2 : Molecule counts on DNA and RNA level in replicate 1 and 2
RNA.norm.1 , RNA.norm.2 , DNA.norm.1 , DNA.norm.2 : Library-size normalized molecule counts (???)
norm.1.raw , norm.2.raw : Raw log2 of RNA/DNA counts in replicate 1 and 2
norm.1.adj , norm.2.adj : log2 of RNA/DNA counts in replicate 1 and 2, subtracting the median activity of random DNA
mean.norm.raw : Mean raw activity across replicates (log2 scale RNA/DNA)
mean.norm.adj : Mean activity across replicates, using a scale where 0 is the median activity of random DNA in that cell state
mean.scaled.final : Scaled activity measurement for visualization only, using a scale where 0 is the median activity of random DNA and 1 is the maximal activity achieved by Tp53 (not computed in K562 screens). The purpose of this is to account for different baselines and offsets achieved in different screens. Use mean.norm.adj for model training, statistics etc.

The following columns regarding sequence design are only present in the single-factor library A:

TF : The transcription factor placed on the DNA
nrepeats : Number of placed motifs
affinitynum : Affinity quantile (on a scale where 1 is the most likely sequence given the PWM, and 0 is an extremely unlikely sequence)
sum.biophys.affinity : Sum of actual affinities across the sequence, computed using considerations from statistical thermodynamics
orientation : Orientation of the motif: forward (fwd), reverse (rev) or alternating fwd-rev (tandem)
spacer : Number of base pairs between motifs

The following columns are only present in the dual-factor libraries B and C:

TF1.name : Name of the transcription factor whose motif appears first, coming from 5'
TF1.affinity : Corresponding affinity (on a scale from 0 to 1)
TF1.orientation : Corresponding orientation
TF2.name : Name of the transcription factor whose motif appears second, coming from 5'
TF2.affinity : Corresponding affinity (on a scale from 0 to 1)
TF2.orientation : Corresponding orientation
spacer : Spacing between sites
TFnumber : Number of sites for each factor
TForder : Arrangement of sites (Alternate or Block)

The following columns are only present in the automated design library D:

SubLibrary : Whether the goal was to design enhancers with specific activation or repression
Task_MegEry, Task_Basophil, Task_Eosinophil, Task_Monocyte, Task_Neutrophil, Task_Immature : Task definition in the different cell states. -3 for repression, -0.2 / 0.6 for inactivity (depending on whether the task was activation or repression), 1 for activity.
design_strategy : Whether the design was initialized with a random sequence or a random forest model was used to identify an optimal TFBS combination (model-guided)
design_search : Whether optimization was done with a local or global search

The following columns are only present in the dual-factor library F:

spacer : Spacing between sites
nFli1, nSpi1, nCebpa, nGata2 : Number of Fli1/Spi1/Cebpa/Gata2 sites
Fli1_affinities_sum, Spi1_affinities_sum, Cebpa_affinities_sum, Gata2_affinities_sum : Sum of motif scores for Fli1, Spi1, Cebpa, Gata2

The following columns are only present in the genomic library G:

chromosome, start_coordinate, end_coordinate : Genomic coordinates (mm10)

Example code for working with main DATA frames

Subsetting LibB/LibC data frames

To extract all data belonging to a given pair of transcription factors from library B and C in a format that ignores the order of the sites (i.e. which TF comes first and which TF comes second), the RDA file contains a function getsubset.libBC. This function takes as arguments a DATA frame and two transcription factors, e.g.

getsubset.libBC("Spi1", "Fli1", mpra.data$HSPC.libC$DATA)

It returns a similar data frame, except that now TF1.name is always Spi1 (no matter whether Spi1 or Fli1 is placed first), and TF2.name is always Fli1. It also adds some convenient columns:

oricomb : Orientation of both factors
affnum : Number of weak and strong binding sites for both factors (e.g. 3W-3S means that the first factor has 3 weak binding sites, and the second factor has 3 strong binding sites)

Casting data frames

To convert the data frame into a format where one row is one sequence, and columns are measurements in different cell states, you can use:

require(reshape2)
casted.dataframe
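
The R snippet is cut off in this listing, so its exact form is not known. As a rough illustration of the two operations described above (order-independent selection of a TF pair, then reshaping to one row per sequence), here is a plain-Python sketch. It is not the getsubset.libBC implementation from the RDA file and does not add the oricomb or affnum columns. The toy rows, their values, the cell-state labels "stateA"/"stateB", and the helper names subset_pair and cast_wide are all invented for illustration; only the column names come from the description.

```python
from collections import defaultdict

# Toy rows mimicking a few DATA columns (all values are made up).
rows = [
    {"CRS": "seq1", "clusterID": "stateA", "TF1.name": "Spi1", "TF2.name": "Fli1",
     "TF1.affinity": 1.0, "TF2.affinity": 0.3,
     "TF1.orientation": "fwd", "TF2.orientation": "rev", "mean.norm.adj": 1.2},
    {"CRS": "seq1", "clusterID": "stateB", "TF1.name": "Spi1", "TF2.name": "Fli1",
     "TF1.affinity": 1.0, "TF2.affinity": 0.3,
     "TF1.orientation": "fwd", "TF2.orientation": "rev", "mean.norm.adj": 0.4},
    {"CRS": "seq2", "clusterID": "stateA", "TF1.name": "Fli1", "TF2.name": "Spi1",
     "TF1.affinity": 0.3, "TF2.affinity": 1.0,
     "TF1.orientation": "rev", "TF2.orientation": "fwd", "mean.norm.adj": 0.9},
    {"CRS": "seq3", "clusterID": "stateA", "TF1.name": "Gata2", "TF2.name": "Cebpa",
     "TF1.affinity": 0.5, "TF2.affinity": 0.5,
     "TF1.orientation": "fwd", "TF2.orientation": "fwd", "mean.norm.adj": -0.1},
]

# Column pairs that must be exchanged together when the TF order is flipped.
PAIRED_COLUMNS = [
    ("TF1.name", "TF2.name"),
    ("TF1.affinity", "TF2.affinity"),
    ("TF1.orientation", "TF2.orientation"),
]


def subset_pair(data, tf_a, tf_b):
    """Keep rows containing the pair (in either order); report tf_a as TF1."""
    selected = []
    for row in data:
        names = (row["TF1.name"], row["TF2.name"])
        if names == (tf_a, tf_b):
            selected.append(dict(row))
        elif names == (tf_b, tf_a):
            swapped = dict(row)
            for first, second in PAIRED_COLUMNS:
                swapped[first], swapped[second] = row[second], row[first]
            selected.append(swapped)
    return selected


def cast_wide(data, value="mean.norm.adj"):
    """Reshape long rows into {CRS: {clusterID: value}}, one entry per sequence."""
    wide = defaultdict(dict)
    for row in data:
        wide[row["CRS"]][row["clusterID"]] = row[value]
    return dict(wide)


pair_rows = subset_pair(rows, "Spi1", "Fli1")
wide = cast_wide(pair_rows)
print(wide)
```

The value column defaults to mean.norm.adj because the description recommends it over mean.scaled.final for statistics and model training.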

MPRA data of synthetic enhancers in hematopoiesis

LS-MPRA / d-MPRA Data Repository

Sequencing data for reporter assay in Jindal et al Dev Cell 2023 article

BIOGRID CURATED DATA FOR MPRA (Escherichia coli (K12/W3110))

Supporting data for: Three-dimensional genome re-wiring in loci with Human...

Data from: Massively Parallel Reporter Assays for High-Throughput In Vivo...

RData file of estimate comparisons and primary MPRA data.

Source Data for Supplementary Note Figures

ENCSR186NQR

Data from: Systematic dissection and optimization of inducible enhancers in...

Deciphering regulatory DNA sequences and noncoding genetic variants using...

Data from: Functional dissection of human cardiac enhancers and non-coding...

Distinct roles for motif affinity, chromatin state, and co-regulatory motifs...

Supplementary tables of "Synthetic enhancers reveal design principles of...

Personalized genomes for DL models supporting data

LOC127829729

APARENT2 Training Data and Models

Data Sheet 1_An in vivo systemic massively parallel platform for deciphering...

Gosai et al. (2024) Evaluator Container for Genomic API for Model Evaluation...

LOC112942286

MPRA data of synthetic enhancers in hematopoiesis