Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Additional file 2.ATACseq_workflow.txt—Example machine-readable Fig. 4 workflow including stepwise unix and R commands for ATAC-seq data processing.
R scripts used to process and analyze multiome (joint GEX and ATAC) single-nucleus sequencing data from human neuroblastoma cell lines.
RNA-seq files:
RNA_count_matrix.txt.gz - raw read counts
RNA_cqn_matrix.txt.gz - read counts quantile normalised with the cqn R package
RNA_gene_metadata.txt.gz - information about the genes
RNA_sample_metadata.txt.gz - information about the samples
ATAC-seq files:
ATAC_count_matrix.txt.gz - raw read counts
ATAC_cqn_matrix.txt.gz - read counts quantile normalised with the cqn R package
ATAC_peak_metadata.txt.gz - peak coordinates and other metadata
ATAC_sample_metadata.txt.gz - sample metadata
Additional file 3 Results_of_RNASeq_data_analysis. Full list of the results of differential gene analysis with RNA-Seq data.
ChIP-seq (chromatin immunoprecipitation followed by sequencing) is commonly used to identify genome-wide protein-DNA interactions. However, ChIP-seq often gives a low yield, which is not ideal for quantitative outcomes. An alternative method to ChIP-seq is ChEC-seq (Chromatin endogenous cleavage with high-throughput sequencing). In this method, the endogenous TF (transcription factor) of interest is fused with MNase (micrococcal nuclease) that non-specifically cleaves DNA near binding sites. Compared to the original ChEC-seq method, the modified version requires far less amplification. Since MACS3 failed to identify peaks in data generated from the modified ChEC-seq method, a new peak finder has been developed specifically for it. There are three functions in the peak_finder/. callpeaks() is used to identify peaks from BAM files. goanalysis() is used to make GO (Gene Ontology) term plots from peaks. bedtomeme() is a wrapper function to perform MEME analysis in R after MEME Suite is inst...
****EXCERPTED FROM BIORXIV PREPRINT; SEE PREPRINT OR PUBLISHED PAPER FOR REFERENCES AND DETAILS****
Yeast strains: All yeast strains were derived from BY4741. A C-terminal micrococcal nuclease fusion was introduced to the protein of interest through transformation and homologous recombination of PCR-amplified DNA. Primers were designed with 50 bp of homology to the 3’ end of the coding sequence of interest. The 3xFLAG-MNase with a KanR marker was amplified from pGZ108 (Zentner et al., 2015) and transformed into BY4741 as previously described. Successful transformation was confirmed by immunoblotting and PCR, followed by sequencing.
Lyophilized DNA oligonucleotides were resuspended in molecular-grade water to a concentration of 100 µM. For ligation, the following pair of oligonucleotides was annealed to produce the Y-adapter: Tn5ME-A (5’-TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG-3’) and Y-Adapt-i5 R (5’-CTGTCTCTTATACACATCTTCATAGTAATCATC-3’).
For Tn5 Tagmentation, the following i7 oligonucle...
# DoubleChEC TF binding site finder
ChIP-seq (chromatin immunoprecipitation followed by sequencing) is commonly used to identify genome-wide protein-DNA interactions. However, ChIP-seq often gives a low yield, which is not ideal for quantitative outcomes. An alternative method to ChIP-seq is ChEC-seq (Chromatin endogenous cleavage with high-throughput sequencing). In this method, an endogenous TF (transcription factor) fused to MNase (micrococcal nuclease) cleaves DNA near binding sites. This package is designed to identify high-confidence binding sites from cleavage patterns from ChEC-seq2, a variant form of ChEC-seq.
There are three functions in the peak_finder/. callpeaks() is used to identify peaks from single-end mapped reads input as BAM files. goanalysis() is used to make GO (Gene Ontology) term plots from peaks. bedtomeme() is a wrapper function to perform MEME analysis in R **after [MEME Suite](https://meme-...
Single cell ATAC-seq (scATAC-seq) has become the most widely used method for profiling open chromatin landscape of heterogeneous cell populations at a single-cell resolution. Although numerous software tools and pipelines have been developed, an easy-to-use, scalable, reproducible, and comprehensive pipeline for scATAC-seq data analyses is still lacking. To fill this gap, we developed scATACpipe, a Nextflow pipeline, for performing comprehensive analyses of scATAC-seq data including extensive quality assessment, preprocessing, dimension reduction, clustering, peak calling, differential accessibility inference, integration with scRNA-seq data, transcription factor activity and footprinting analysis, co-accessibility inference, and cell trajectory prediction. scATACpipe enables users to perform the end-to-end analysis of scATAC-seq data with three sub-workflow options for preprocessing that leverage 10x Genomics Cell Ranger ATAC software, the ultra-fast Chromap procedures, and a set of custom scripts implementing current best practices for scATAC-seq data preprocessing. The pipeline extends the R package ArchR for downstream analysis with added support to any eukaryotic species with an annotated reference genome. Importantly, scATACpipe generates an all-in-one HTML report for the entire analysis and outputs cluster-specific BAM, BED, and BigWig files for visualization in a genome browser. scATACpipe eliminates the need for users to chain different tools together and facilitates reproducible and comprehensive analyses of scATAC-seq data from raw reads to various biological insights with minimal changes of configuration settings for different computing environments or species. By applying it to public datasets, we illustrated the utility, flexibility, versatility, and reliability of our pipeline, and demonstrated that our scATACpipe outperforms other workflows.
https://spdx.org/licenses/CC0-1.0.html
The mechanisms underlying ETS-driven prostate cancer initiation and progression remain poorly understood due to a lack of model systems that recapitulate this phenotype. We generated a genetically engineered mouse with prostate-specific expression of the ETS factor, ETV4, at lower and higher protein dosages through mutation of its degron. Lower-level expression of ETV4 caused mild luminal cell expansion without histologic abnormalities and higher-level expression of stabilized ETV4 caused prostatic intraepithelial neoplasia (mPIN) with 100% penetrance within 1 week. Tumor progression was limited by p53-mediated senescence and Trp53 deletion cooperated with stabilized ETV4. The neoplastic cells expressed differentiation markers such as Nkx3.1, recapitulating luminal gene expression features of untreated human prostate cancer. Single-cell and bulk RNA-sequencing showed stabilized ETV4 induced a novel luminal-derived expression cluster with signatures of the cell cycle, senescence, and epithelial to mesenchymal transition. These data suggest that ETS overexpression alone, at sufficient dosage, can initiate prostate neoplasia.
Methods
Mouse prostate digestion: Tamoxifen was administered to 8-week-old mice by intraperitoneal injection. Two weeks after tamoxifen treatment, the mouse prostate was digested for 1 hour with Collagenase/Hyaluronidase (STEMCELL, #07912) and then for 30 minutes with TrypLE™ Express Enzyme (Thermo Fisher, #12605028) at 37°C to isolate single prostate cells. The prostate cells were stained with PE/Cy7-conjugated anti-mouse CD326 (EpCAM) antibody (BioLegend, 118216), and CD326 and EYFP double-positive cells, which are luminal cells mainly from the anterior and dorsal prostate, were sorted by flow cytometry. mRNA or genomic DNA was extracted from these double-positive cells and used for ATAC-sequencing and RNA-sequencing analysis.
ATAC-seq and primary data processing: ATAC-seq was performed as previously described.
Primary data processing and peak calling were performed using the ENCODE ATAC-seq pipeline (https://github.com/kundajelab/atac_dnase_pipelines). Briefly, paired-end reads were trimmed, filtered, and aligned against mm9 using Bowtie2. PCR duplicates and reads mapped to the mitochondrial chromosome or repeated regions were removed. Mapped reads were shifted +4/-5 to correct for the Tn5 transposase insertion. Peak calling was performed using MACS2, with p-value < 0.01 as the cutoff. Reproducible peaks from two biological replicates were defined as peaks that overlapped by more than 50%. On average, 25 million uniquely mapped read pairs remained after filtering. The distribution of inserted fragment lengths shows a typical nucleosome banding pattern, and the TSS enrichment score (reads that are enriched around TSS against background) ranges between 28 and 33, suggesting the libraries are of high quality and were able to capture the majority of regions of interest.
Differential peak accessibility: Reads aligned to peak regions were counted using the R package GenomicAlignments_v1.12.2. Read count normalization and differential accessibility calling were performed with DESeq2_v1.16.1 in R 3.4.1. Differential peaks were defined as peaks with adjusted p-value < 0.01 and |log2(FC)| > 2. For visualization, coverage bigwig files were generated using the bamCoverage command from deepTools2, normalizing by the size factor generated by DESeq2. The differential ATAC-seq peak density plot was generated with deepTools2, using regions that were significantly more or less accessible in ETV4AAA samples relative to EYFP samples.
Motif analysis: Motif enrichment analysis was performed using MEME-ChIP 5.0.0 with differentially accessible regions in ETV4AAA relative to EYFP. ATAC-seq footprinting was performed using TOBIAS. First, ATACorrect was run to correct Tn5 bias, followed by ScoreBigwig to calculate footprint scores, and finally BINDetect to generate differential footprints across regions.
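As a rough illustration of the +4/-5 read shift described above, the sketch below moves the 5' end of a read interval according to its strand. The function name `shift_tn5` and the 0-based, half-open coordinate convention are assumptions made for this example; the pipeline's own implementation may differ in detail (for instance, some tools shift both ends of the interval).

```python
def shift_tn5(start, end, strand, plus_shift=4, minus_shift=5):
    """Shift the 5' end of a read to approximate the Tn5 insertion site.

    Coordinates are assumed to be 0-based and half-open (BED-like).
    Plus-strand reads have their start moved by +4 bp; minus-strand
    reads have their end moved by -5 bp.
    """
    if strand == "+":
        return start + plus_shift, end
    if strand == "-":
        return start, end - minus_shift
    raise ValueError(f"unexpected strand: {strand!r}")


# Hypothetical reads, for illustration only.
print(shift_tn5(100, 150, "+"))
print(shift_tn5(100, 150, "-"))
```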
RNA-seq analysis: The extracted RNA was processed for RNA-sequencing by the Integrated Genomics Core Facility at MSKCC. The libraries were sequenced on an Illumina HiSeq-2500 platform with 51 bp paired-end reads to obtain a minimum yield of 40 million reads per sample. The sequenced data were aligned using STAR v2.3 with GRCm38.p6 as annotation. DESeq2_v1.16.1 was subsequently applied to read counts for normalization and the identification of differentially expressed genes between ETV4AAA and EYFP groups, with an adjusted p-value < 0.05 as the threshold. Genes were ranked by sign(log2(FC)) * (-log(p-value)) as input for GSEA using ‘Run GSEA Pre-ranked’ with 1000 permutations (48). The custom gene sets used in GSEA are shown in Table S2.
Unsupervised hierarchical clustering: To get an overall sample clustering as part of QC, hierarchical clustering was performed using the pheatmap_v1.0.10 package in R on normalized ATAC-seq or RNA-seq data. It was done using all peaks or all genes, with Spearman or Pearson correlation as the distance metric. To give an overview of the differential gene expression from the RNA-seq data, unsupervised clustering was also performed on a matrix with all samples as columns and scaled normalized read counts of differentially expressed genes between ETV4AAA and EYFP as rows.
Integrative analysis of ATAC-seq, RNA-seq, and ChIP-seq data: ERG ChIP-seq peaks were called using MACS 2.1, with an FDR cutoff of q < 10^-3 and the removal of peaks mapped to blacklist regions. Reproducible peaks between two biological replicates were identified as ETV4AAA ATAC-seq peaks. ERG ChIP-seq peaks and ETV4AAA ATAC-seq peaks were considered overlapping if their peak summits were within 250 bp.
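The pre-ranked GSEA metric quoted above, sign(log2(FC)) * (-log(p-value)), can be sketched as follows. The text does not state the logarithm base, so base 10 is an assumption here, as are the function names and the small floor applied to p-values to avoid taking the log of zero.

```python
import math


def rank_metric(log2fc, pvalue, floor=1e-300):
    """Signed significance score: sign(log2FC) * -log10(p).

    The base-10 logarithm and the p-value floor are assumptions,
    not details taken from the methods text.
    """
    sign = (log2fc > 0) - (log2fc < 0)
    return sign * -math.log10(max(pvalue, floor))


def rank_genes(results):
    """Sort (gene, log2fc, pvalue) tuples from most up- to most down-regulated."""
    scored = [(gene, rank_metric(lfc, p)) for gene, lfc, p in results]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical results, for illustration only.
ranked = rank_genes([("GeneA", -1.0, 0.001), ("GeneB", 2.0, 0.01), ("GeneC", 0.5, 0.5)])
print([gene for gene, _ in ranked])
```

Genes with strong, significant up-regulation land at the top of the list and strongly down-regulated genes at the bottom, which is the ordering a pre-ranked enrichment analysis expects.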
To determine whether the overlap was significant, enrichment analysis was done using regioneR_v1.8.1 in R, which counted the number of overlapped peaks between a set of randomly selected regions in the genome (excluding blacklist regions) and the ERG-ChIP seq peaks or ETV4AAA ATAC-seq peaks. A null distribution was formed using 1000 permutation tests to compute the p-value and z-score of the original evaluation. To assign ATAC-seq peaks to genes, ChIPseeker_v1.12.1 in R was used. Each peak was unambiguously assigned to one gene with a TSS or 3’ end closest to that peak. Differential gene expression between ETV4AAA and EYFP was evaluated using log2(FC) calculated by DESeq2. p-values were estimated with Wilcoxon rank t-test and Student t-test. scRNA-sequencing: Tmprss2-CreERT2, EYFP; Tmprss2-CreERT2, ETV4WT; Tmprss2-CreERT2, ETV4AAA; and Tmprss2-CreERT2, ETV4AAA; Trp53L/L mice were euthanized 2 weeks or 4 months after tamoxifen treatment (n=3 mice for each genotype and time point). After euthanasia, the prostates were dissected out and minced with scalpel, and then processed for 1h digestion with collagenase/hyaluronidase (#07912, STEMCELL Technologies) and 30min digestion with TrypLE (#12605010, Gibco). Live single prostate cells were sorted out by flow cytometry as DAPI-. For each mouse, 5,000 cells were directly processed with 10X genomics Chromium Single Cell 3’ GEM, Library & Gel Bead Kit v3 according to manufacturer’s specifications. For each sample, 200 million reads were acquired on NovaSeq platform S4 flow cell. Reads obtained from the 10x Genomics scRNAseq platform were mapped to mouse genome (mm9) using the Cell Ranger package (10X Genomics). True cells are distinguished from empty droplets using scCB2 package. The levels of mitochondrial reads and numbers of unique molecular identifiers (UMIs) were similar among the samples, which indicates that there were no systematic biases in the libraries from mice with different genotypes. 
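The permutation-based enrichment test described above (overlap counts from randomized region sets forming a null distribution) can be summarized with an empirical p-value and a z-score. The sketch below assumes a one-sided "greater" alternative and the common (k + 1) / (n + 1) correction; neither detail is stated in the text, and the function name is hypothetical.

```python
import statistics


def permutation_summary(observed, null_counts):
    """Return (p_value, z_score) for an observed overlap count.

    null_counts holds the overlap counts from each permutation.
    Assumes a one-sided test for enrichment (observed larger than null).
    """
    if len(null_counts) < 2:
        raise ValueError("need at least two permutations")
    mean = statistics.fmean(null_counts)
    sd = statistics.stdev(null_counts)  # sample standard deviation
    z_score = (observed - mean) / sd if sd > 0 else float("nan")
    n_extreme = sum(1 for count in null_counts if count >= observed)
    p_value = (n_extreme + 1) / (len(null_counts) + 1)
    return p_value, z_score


# Toy null distribution, for illustration only.
p, z = permutation_summary(20, [10, 12, 11, 9, 13])
print(p, z)
```

With 1000 permutations, the smallest p-value this estimator can report is 1/1001, which is why the z-score is usually reported alongside it.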
Cells were removed if they expressed fewer than 600 unique genes, had fewer than 1,500 total counts, more than 50,000 total counts, or greater than 20% mitochondrial reads. Genes detected in fewer than 10 cells and all mitochondrial genes were removed for subsequent analyses. Putative doublets were removed using the Doublet Detection package. The average gene detection in each cell type was similar among the samples. Combining samples in the entire cohort yielded a filtered count matrix of 48,926 cells by 19,854 genes, with a median of 6,944 counts and a median of 1,973 genes per cell, and a median of 2,039 cells per sample. The count matrix was then normalized to CPM (counts per million) and log2(X+1) transformed for analysis of the combined dataset. The top 1,000 highly variable genes were found using SCANPY (version 1.6.1) (77). Principal Component Analysis (PCA) was performed on the 1,000 most variable genes, with the top 50 principal components (PCs) retained, explaining 29% of the variance. To visualize single cells of the global atlas, we used UMAP projections (https://arxiv.org/abs/1802.03426). We then performed Leiden clustering. Marker genes for each cluster were found with scanpy.tl.rank_genes_groups. Cell types were determined using the SCSA package, an automatic tool based on a score annotation model combining differentially expressed genes (DEGs) and confidence levels of cell markers from both known and user-defined information. Heat maps of single cells were generated from log-normalized and scaled expression values of marker genes curated from the literature or identified as highly differentially expressed. Differentially expressed genes between clusters were found using the MAST package and are shown in the heat map. The logFC from the MAST output was used for the ranked gene list in GSEA (48). The custom gene sets used in GSEA are shown in Table S2.
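The cell-level QC thresholds and the CPM / log2(X+1) normalization described above can be expressed compactly. This is a minimal sketch using plain Python lists; the function names are hypothetical, and a real analysis would apply the same logic to a sparse matrix (for example through SCANPY).

```python
import math


def passes_qc(n_genes, total_counts, pct_mito):
    """Keep a cell only if it meets all four thresholds quoted in the text."""
    return (
        n_genes >= 600
        and 1_500 <= total_counts <= 50_000
        and pct_mito <= 20.0
    )


def log_cpm(counts):
    """Normalize one cell's counts to counts per million, then log2(x + 1)."""
    total = sum(counts)
    if total == 0:
        return [0.0 for _ in counts]
    return [math.log2(count / total * 1e6 + 1) for count in counts]


# Hypothetical cells, for illustration only.
print(passes_qc(n_genes=1973, total_counts=6944, pct_mito=5.0))
print(passes_qc(n_genes=450, total_counts=6944, pct_mito=5.0))
print(log_cpm([0, 3, 7]))
```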
Gene imputation was performed using the MAGIC (Markov affinity-based graph imputation of cells) package, and imputed gene expression values were used in the heatmap.
Analysis of public human gene expression datasets: To analyze TP53 RNA expression in human prostate cancer samples, we obtained normalized RNA-seq data from prostate cancer TCGA (www.firebrowse.org) (3). To assess the role of TP53 loss on
Processed data and code for "Precise modulation of transcription factor levels reveals drivers of dosage sensitivity," Naqvi et al 2022.
Count/expression data
Metadata
Scripts
Intermediate/output files (some files are gzipped to save space, the Rscripts that output them won't gzip but they expect gzipped input when indicated)
chromatin_predictions.tar.gz (self-contained folder for chromatin-based predictions of gene expression change) contains:
BCG vaccination can boost innate immune responses via trained immunity (TI), resulting in an increased resistance to respiratory viral infections. Assay for transposase-accessible chromatin (ATAC), including tagmentation, library preparation and sequencing, was performed by Genewiz (Azenta Life Sciences, MA, USA) on PBMCs from two BCG-treated NMIBC patients at baseline and during BCG (mid) as well as from three healthy donors. This dataset includes the pipeline for preprocessing of raw data, including mapping to the hg38 reference genome using bowtie2 and peak calling by MACS2. Differentially accessible regions in proximity to annotated genes between the two time points (during BCG versus baseline) were identified using the R packages csaw and edgeR, and the resulting data files are provided.
Additional file 2 Results_of_intePareto. Full list of the results of integrative analysis using intePareto.
Additional file 6. csaw_workflow.R—Example R workflow for differential accessibility analysis with csaw as graphically displayed in Fig. 6. Describes process for both TMM and loess normalizations and either supplying MACS2 peak sets as query regions or identifying de novo locally enriched windows.
MNase-Seq and ChIP-Seq have evolved as popular techniques to study chromatin and histone modification. Although many tools have been developed to identify enriched regions, software tools for nucleosome positioning are still limited. We introduce a flexible and powerful open-source R package, PING 2.0, for nucleosome positioning using MNase-Seq data or MNase- or sonicated-ChIP-Seq data combined with either single-end or paired-end sequencing. PING uses a model-based approach, which enables nucleosome predictions even in the presence of low read counts. We illustrate PING using two paired-end datasets from Saccharomyces cerevisiae and compare its performance to nucleR and ChIPseqR. Identification of nucleosomes from two different mononucleosome datasets. A yeast strain (W303 background) with the HTZ1 gene expressed as a fusion with a myc epitope was used to map total and Htz1-containing nucleosomes by MNase-ChIP-Seq. Cells were grown to mid-log phase and mononucleosomes were generated using MNase treatment of isolated nuclei. For sample SC0017_61YDGAAXX_8_TCATTC specifically, the Htz1-containing nucleosomes were enriched by immunoprecipitation using an anti-Myc antibody (3E10). DNA from both total nucleosomes and Htz1-enriched nucleosomes was purified and sequenced on an Illumina GA IIx using the paired-end protocol.
Additional file 5. naiveOverlapBroad.sh: Bash script for calculating a naïve overlap broad peak set from 2 individual replicate peak sets and a pooled replicate peak set. Can be modified to accept more replicates as desired. See Fig. 4 for usage.
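The general idea of a naïve overlap peak set can be sketched as below: a pooled peak is kept only if it is supported by a peak in each replicate. This is not the referenced Bash script. The 50% overlap fraction follows a widely used convention and is an assumption here, as are the function names and the (chrom, start, end) tuple representation.

```python
def overlaps_enough(a, b, min_frac=0.5):
    """True if intervals a and b overlap by at least min_frac of either one.

    Intervals are (chrom, start, end) tuples with half-open coordinates.
    """
    if a[0] != b[0]:
        return False
    overlap = min(a[2], b[2]) - max(a[1], b[1])
    if overlap <= 0:
        return False
    return overlap / (a[2] - a[1]) >= min_frac or overlap / (b[2] - b[1]) >= min_frac


def naive_overlap(pooled, rep1, rep2, min_frac=0.5):
    """Keep pooled peaks that are supported by both replicate peak sets."""
    return [
        peak
        for peak in pooled
        if any(overlaps_enough(peak, r, min_frac) for r in rep1)
        and any(overlaps_enough(peak, r, min_frac) for r in rep2)
    ]


# Toy peak sets, for illustration only.
pooled = [("chr1", 100, 200), ("chr1", 500, 600)]
rep1 = [("chr1", 120, 220)]
rep2 = [("chr1", 90, 160), ("chr1", 590, 700)]
print(naive_overlap(pooled, rep1, rep2))
```

The nested `any` scans are quadratic and fine for a toy example; genome-scale peak sets are normally intersected with a sorted sweep or a tool such as bedtools. Supporting more replicates amounts to adding one more `any(...)` condition per replicate.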
Processed data and code for "Transfer learning reveals sequence determinants of the quantitative response to transcription factor dosage," Naqvi et al 2024.
Directory is organized into 4 subfolders, each tar'ed and gzipped:
data_analysis.tar.gz - Processed data for modulation of TWIST1 levels and calculation of RE responsiveness to TWIST1 dosage
baseline_models.tar.gz - Code and data for training baseline models to predict RE responsiveness to SOX9/TWIST1 dosage
chrombpnet_models.tar.gz - Remainder of code, data, and models for fine-tuning and interpreting ChromBPNet models to predict RE responsiveness to SOX9/TWIST1 dosage
modisco_reports.zip - TF-MoDIsCo reports from running on the fine-tuned ChromBPNet models
mirny_model.tar.gz - Code and data for analyzing and fitting Mirny model of TF-nucleosome competition to observed RE dosage response curves
Genomic sequencing of many thousands of tumors has revealed many genes associated with specific types of cancer. Similarly, large scale CRISPR functional genomics efforts have mapped genes required for proliferation or survival in hundreds of cancer cell lines. Despite this, for specific disease subtypes, such as metastatic prostate cancer, it is likely that there exist many undiscovered tumor specific genetic dependencies, such as prostate cancer specific drivers, that represent drug targets. To identify such genetic dependencies, we performed genome-scale CRISPRi screens in metastatic prostate cancer models. We then created a pipeline in which we integrated publicly available pan-cancer functional genomics data with our metastatic prostate cancer functional and clinical genomics data to identify genes that can drive aggressive prostate cancer phenotypes. Our integrative analysis of these data revealed two known prostate cancer specific driver genes, AR and HOXB13, as the top two hits and also nominated a number of unexpected genes. In this study we highlight the strength of an integrated clinical and functional genomics pipeline and focus on two hit genes, KIF4A and WDR62. We demonstrate that both KIF4A and WDR62 drive aggressive prostate cancer phenotypes in vitro and in vivo in multiple models, irrespective of AR-status, and are also associated with poor patient outcome. ATAC-seq was performed in KIF4A knockdown in LNCaP and C42B prostate cancer cells
This record contains analysis products for the paper "Transcription factor stoichiometry, motif affinity and syntax regulate single cell chromatin dynamics during fibroblast reprogramming to pluripotency" by Nair, Ameen et al. Please refer to the READMEs in the directories, which are summarized below.
The record contains the following files:
`clusters.tsv`: contains the cluster id, name and colour of clusters in the paper
scATAC.zip
Analysis products for the single-cell ATAC-seq data. Contains:
- `cells.tsv`: list of barcodes that pass QC. Columns include:
  - `barcode`
  - `sample`: (time point)
  - `umap1`
  - `umap2`
  - `cluster`
  - `dpt_pseudotime_fibr_root`: pseudotime values treating a fibroblast cell as root
  - `dpt_pseudotime_xOSK_root`: pseudotime values treating xOSK cell as root
- `peaks.bed`: list of peaks of 500bp across all cell states. 4th column contains the peak set label. Note that ~5000 peaks are not assigned to any peak set and are marked as NA.
- `features.tsv`: 50 dimensional representation of each cell
- `cell_x_peak.mtx.gz`: sparse matrix of fragment counts within peaks. Load using scipy.io.mmread in python or readMM in R. Columns correspond to cells from `cells.tsv` (combine sample + barcode). Rows correspond to peaks in `peaks.bed`
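The README recommends `scipy.io.mmread` (Python) or `readMM` (R) for the `.mtx.gz` files. For readers who only need to inspect a file without extra dependencies, the following is a minimal standard-library sketch of a Matrix Market coordinate-format reader. The function names are hypothetical, and it assumes a general real or integer coordinate matrix, which is an assumption about these files rather than something the README states.

```python
import gzip


def parse_mtx(lines):
    """Parse Matrix Market coordinate text into (n_rows, n_cols, entries).

    entries maps 0-based (row, col) pairs to values. For cell_x_peak.mtx.gz,
    rows would correspond to peaks and columns to cells.
    """
    shape = None
    entries = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("%"):
            continue  # skip the banner and comment lines
        fields = line.split()
        if shape is None:
            # First data line: number of rows, columns, and non-zero entries.
            shape = (int(fields[0]), int(fields[1]))
            continue
        row, col = int(fields[0]) - 1, int(fields[1]) - 1  # file is 1-based
        entries[(row, col)] = float(fields[2])
    if shape is None:
        raise ValueError("no size line found")
    return shape[0], shape[1], entries


def read_mtx_gz(path):
    """Read a gzipped Matrix Market file from disk."""
    with gzip.open(path, "rt") as handle:
        return parse_mtx(handle)


# Tiny in-memory example, for illustration only.
example = [
    "%%MatrixMarket matrix coordinate integer general",
    "% toy data",
    "3 2 2",
    "1 1 5",
    "3 2 7",
]
n_rows, n_cols, entries = parse_mtx(example)
print(n_rows, n_cols, entries)
```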
scATAC_clusters.zip
Analysis products corresponding to cluster pseudo-bulks of the single-cell ATAC-seq data.
- `clusters.tsv`: contains the cluster id, name and colour used in the paper
- `peaks`: contains `overlap_reproducibilty/overlap.optimal_peak` peaks called using ENCODE bulk ATAC-seq pipeline in the narrowPeak format.
- `fragments`: contains per cluster fragment files
scATAC_scRNA_integration.zip
Analysis products from the integration of scATAC with scRNA. Contains:
- `peak_gene_links_fdr1e-4.tsv`: file with peak gene links passing FDR 1e-4. For analyses in the paper, we filter to peaks with absolute correlation >0.45.
- `harmony.cca.30.feat.tsv`: 30 dimensional co-embedding for scATAC and scRNA cells obtained by CCA followed by applying Harmony over assay type.
- `harmony.cca.metadata.tsv`: UMAP coordinates for scATAC and scRNA cells derived from the Harmony CCA embedding. First column contains barcode.
scRNA.zip
Analysis products for the single-cell RNA-seq data. Contains:
- `seurat.rds`: seurat object that contains expression data (raw counts, normalized, and scaled), reductions (umap, pca), knn graphs, and all associated metadata. Note that the barcode suffix (1-9) corresponds to samples D0, D2, ..., D14, iPSC.
- `genes.txt`: list of all genes
- `cells.tsv`: list of barcodes that pass QC across samples. Contains:
  - `barcode_sample`: barcode with index of sample (1-9 corresponding to D0, D2, ..., D14, iPSC)
  - `sample`: sample name (D0, D2, .., D14, iPSC)
  - `umap1`
  - `umap2`
  - `nCount_RNA`
  - `nFeature_RNA`
  - `cluster`
  - `percent.mt`: percent of mitochondrial transcripts in cell
  - `percent.oskm`: percent of OSKM transcripts in cell
- `gene_x_cell.mtx.gz`: sparse matrix of gene counts. Load using scipy.io.mmread in python or readMM in R. Columns correspond to cells from `cells.tsv` (barcode suffix contains sample information). Rows correspond to genes in `genes.txt`
- `pca.tsv`: first 50 PC of each cell
- `oskm_endo_sendai.tsv`: estimated raw counts (cts, may not be integers) and log(1+ tp10k) normalized expression (norm) for endogenous and exogenous (Sendai derived) counts of POU5F1 (OCT4), SOX2, KLF4 and MYC genes. Rows are consistent with `seurat.rds` and `cells.tsv`
multiome.zip
multiome/snATAC:
These files are derived from the integration of nuclei from multiome (D1M and D2M), with cells from day 2 of scATAC-seq (labeled D2).
- `cells.tsv`: This is the list of nuclei barcodes that pass QC from multiome AND also cell barcodes from D2 of scATAC-seq. Includes:
  - `barcode`
  - `umap1`: These are the coordinates used for the figures involving multiome in the paper.
  - `umap2`: Same as for `umap1`.
  - `sample`: D1M and D2M correspond to multiome, D2 corresponds to day 2 of scATAC-seq
  - `cluster`: For multiome barcodes, these are labels transferred from scATAC-seq. For D2 scATAC-seq, these are the original cluster labels.
- `peaks.bed`: This is the same file as scATAC/peaks.bed. List of peaks of 500bp. 4th column contains the peak set label. Note that ~5000 peaks are not assigned to any peak set and are marked as NA.
- `cell_x_peak.mtx.gz`: sparse matrix of fragment counts within peaks. Load using scipy.io.mmread in python or readMM in R. Columns correspond to cells from `cells.tsv` (combine sample + barcode). Rows correspond to peaks in `peaks.bed`.
- `features.no.harmony.50d.tsv`: 50 dimensional representation of each cell prior to running Harmony (to correct for batch effect between D2 scATAC and D1M,D2M snMultiome). Rows correspond to cells from `cells.tsv`.
- `features.harmony.10d.tsv`: 10 dimensional representation of each cell after running Harmony. Rows correspond to cells from `cells.tsv`.
multiome/snRNA:
- `seurat.rds`: seurat object that contains expression data (raw counts, normalized, and scaled), reductions (umap, pca), and associated metadata. Note that the barcode suffix (1, 2) corresponds to samples D1M, D2M. Please use the UMAP/features from snATAC/ for consistency.
- `genes.txt`: list of all genes (this is different from the list in scRNA analysis)
- `cells.tsv`: list of barcodes that pass QC across samples. Contains:
  - `barcode_sample`: barcode with index of sample (1,2 corresponding to D1M, D2M respectively)
  - `sample`: sample name (D1M, D2M)
  - `nCount_RNA`
  - `nFeature_RNA`
  - `percent.oskm`: percent of OSKM genes in cell
- `gene_x_cell.mtx.gz`: sparse matrix of gene counts. Load using scipy.io.mmread in python or readMM in R. Columns correspond to cells from `cells.tsv` (barcode suffix contains sample information). Rows correspond to genes in `genes.txt`
Linking regulatory DNA elements to their target genes, which may be located hundreds of kilobases away, remains challenging. Here, we introduce Cicero, an algorithm that identifies co-accessible pairs of DNA elements using single-cell chromatin accessibility data and so connects regulatory elements to their putative target genes. We apply Cicero to investigate how dynamically accessible elements orchestrate gene regulation in differentiating myoblasts. Groups of Cicero-linked regulatory elements meet criteria of “chromatin hubs”—they are enriched for physical proximity, interact with a common set of transcription factors, and undergo coordinated changes in histone marks that are predictive of changes in gene expression. Pseudotemporal analysis revealed that most DNA elements remain in chromatin hubs throughout differentiation. A subset of elements bound by MYOD1 in myoblasts exhibit early opening in a PBX1- and MEIS1-dependent manner. Our strategy can be applied to dissect the architecture, sequence determinants, and mechanisms of cis-regulation on a genome-wide scale. sci-ATAC-seq data was collected on human skeletal muscle myoblasts (HSMM) in culture at four timepoints after serum switch to induce differentiation into myotubes, 0 hours, 24 hours, 48 hours and 72 hours. Libraries pooled for sequencing (Experiment 1). An additional experiment was collected using the same system, at 0 hours and 72 hours after serum switch (Experiment 2). Bulk ATAC-seq data was also collected for each of the four timepoints. In addition, sci-ATAC-seq data was collected on an artificial mixture of GM12878 and HL60 cells. Lastly, bulk ATAC-seq data was collected at day 0 and day 7 after serum switch in 54-1 immortalized myoblasts that were transduced with lentivirus carrying small guide RNAs targeting Pbx1, Meis1 or non-targeting controls using lentiCRISPRv2-blast. Cells were allowed time for editing post transduction before differentiation. See publication for details.
Data repository for the manuscript: Kuppe, Ibrahim et al. "Decoding myofibroblast origins in human kidney fibrosis", 2020. Please also consult the supplemental data in the paper, and the data availability statement in the manuscript for raw FASTQ files for mouse data.
For further data requests and questions, please contact Dr. Rafael Kramann (rkramann@ukaachen.de)
File Details:
- Human in vitro PDGFRb+ RNA-seq (bulk RNA-seq data for various NKD2 knock-out and knock-in clones)
* invitro_bulk_rnaseq.tar.gz: Salmon output for all samples. Please see the manuscript for further information.
- UUO Mouse FACS sorted PDGFRa+/b+ ATAC-Seq
* mouse_uuo_pdgfrab_atacseq.bw: BigWig Signal file for ATAC-Seq data, PDGFRa+/b+ FACS sorted cells from day 10 UUO mouse kidneys (average of two biological replicates)
* mouse_uuo_pdgfrab_motifs.meme: Motifs identified based on the ATAC-Seq data and further analyzed in the paper
- UUO and Sham Mouse FACS sorted PDGFRa+/b+ scRNA-seq (10x Genomics)
* Mouse_PDGFRab.tar.gz: contains the count data derived by Alevin/Salmon for the cells analyzed in the paper in matrix market format (.mtx). column data include cell cluster annotations.
- UUO and Sham Mouse FACS sorted PDGFRb+ scRNA-seq (SmartSeq2)
* Mouse_PDGFRa.tar.gz: contains the expression data for the cells analyzed in the paper in matrix market format (.mtx). column data include cell cluster annotations.
- Human FACS sorted CD10+ scRNA-seq (10x Genomics)
* Human_CD10plus.tar.gz: contains the count data derived by Alevin/Salmon for the cells analyzed in the paper in matrix market format (.mtx). column data include cell cluster annotations.
- Human FACS sorted CD10- scRNA-seq (10x Genomics)
* Human_CD10minus.tar.gz: contains the count data derived by Alevin/Salmon for the cells analyzed in the paper in matrix market format (.mtx). column data include cell cluster annotations.
- Human FACS sorted PDGFRb+ scRNA-seq (10x Genomics)
* Human_PDGFRb.tar.gz: contains the count data derived by Alevin/Salmon for the cells analyzed in the paper in matrix market format (.mtx). column data include cell cluster annotations.
* HumanPDGFRBpositive_Nkd2_grnboost2.csv: Gene Regulatory Network obtained by GRNboost2 on genes correlated with NKD2 in Fibroblast (Mesenchymal) cells. See manuscript for details.
* Human_PDGFRBplus_TFanalysis.tar.gz: TF analysis based on single cell RNA-seq for promoter and distal regions. See manuscript for details.
- github_files.tar.gz: RData Objects associated with the paper code repository (https://github.com/mahmoudibrahim/KidneyMap)
Einkorn (Triticum monococcum) is the first domesticated wheat species, being central to the birth of agriculture and the Neolithic Revolution in the Fertile Crescent ~10,000 years ago. Here, we generate and analyze 5.2-gigabase genome assemblies for wild and domesticated einkorn, including completely assembled centromeres. Einkorn centromeres are highly dynamic, showing evidence of ancient and recent centromere shifts caused by structural rearrangements. Whole-genome sequencing of a diversity panel uncovered the population structure and evolutionary history of einkorn, revealing complex patterns of hybridizations and introgressions following the dispersal of domesticated einkorn from the Fertile Crescent. We also discovered that around 1% of the modern bread wheat (Triticum aestivum) A subgenome originates from einkorn. These resources and findings highlight the history of einkorn evolution and provide a basis to accelerate the genomics-assisted improvement of einkorn and bread wheat.
Chromatin immunoprecipitation (ChIP) and sequencing (ChIP-seq): Chromatin immunoprecipitation (ChIP) was performed according to the method given by Nagaki et al., standardized with wheat CENH3 antibody. Nuclei were isolated from 2-week-old seedlings and digested with micrococcal nuclease (Sigma, MO) to liberate nucleosomes. The digested mixture was incubated overnight with 3 mg of wheat CENH3 antibody at 4°C. The chromatin-antibody complexes were captured using Dynabeads Protein G (Invitrogen, CA). Elution of the chromatin was done using 100 ml of preheated elution buffer (1% sodium dodecyl sulfate and 0.1 M NaHCO3) for 30 min at 65°C. DNA from the ChIP was isolated using ChIP DNA Clean and Concentrator Kit (Zymo Research, CA). ChIP-seq libraries were then constructed using the TruSeq ChIP Library Preparation Kit (Illumina, CA) according to the manufacturer’s instructions, and libraries were sequenced on a NovaSeq S4 with a 150-bp paired-end sequencing run.
CENH3 ChIP-seq data analysis: R...
The link contains the BED files and the BAM (mapped) files of CENH3 reads against the respective genome assembly.
Additional file 4.bedpeMinimalConvert.sh—Bash script for converting standard 10-column format BEDPE to the “minimal” format defined by MACS2. See Fig. 4 for usage.
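To make the conversion concrete, the sketch below turns one standard 10-column BEDPE record into a three-column fragment record (chromosome, leftmost coordinate, rightmost coordinate), which is how MACS2 documents its BEDPE input. It is not the referenced Bash script; the function name and the choice to skip pairs whose mates map to different chromosomes are assumptions for illustration.

```python
def bedpe_to_minimal(line):
    """Convert a 10-column BEDPE line to 'chrom<TAB>start<TAB>end'.

    Returns None for pairs whose mates are on different chromosomes
    (including unmapped mates, which BEDPE marks with '.').
    """
    fields = line.rstrip("\n").split("\t")
    chrom1, start1, end1 = fields[0], int(fields[1]), int(fields[2])
    chrom2, start2, end2 = fields[3], int(fields[4]), int(fields[5])
    if chrom1 != chrom2:
        return None
    # The fragment spans from the leftmost start to the rightmost end.
    return f"{chrom1}\t{min(start1, start2)}\t{max(end1, end2)}"


# Hypothetical read pair, for illustration only.
record = "chr1\t100\t150\tchr1\t250\t300\tread1\t60\t+\t-"
print(bedpe_to_minimal(record))
```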