Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Becker muscular dystrophy (BMD) is a rare X-linked recessive neuromuscular disorder, frequently caused by in-frame deletions in the DMD gene that result in the production of a truncated, yet functional, dystrophin protein. The consequences of BMD-causing in-frame deletions on the organism are difficult to predict, especially in regard to long-term prognosis. Here, we used CRISPR-Cas9 to generate a new Dmd Δ52-55 mouse model by deleting exons 52-55 in the Dmd gene, resulting in a BMD-like in-frame deletion. To delineate the long-term effects of this deletion, we studied these mice over 52 weeks by performing histology and echocardiography analyses and assessing motor functions. To further delineate the effects of the exons 52-55 in-frame deletion, we performed RNA-Seq pre- and post-exercise and identified several differentially expressed pathways that could explain the abnormal muscle phenotype observed at 52 weeks in the BMD model.
This dataset contains the results and raw data of the RNA-sequencing and transcriptomic analysis of 52-week-old exercised and non-exercised mice (4 BMD, 4 WT and 4 DMD, as indicated in each file name).
1. Due to size restrictions, this RNA-Seq dataset will be published on Zenodo in 3 parts. This first part contains the data for the exercised mice, including the fastq (R1 and R2) and associated (md5) files for the 4 BMD mice (15315-15318) and 2 DMD mice (15319 and 15320), all the raw gene counts (txt files), and all the differentially expressed genes (tsv files).
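The md5 files distributed with each FASTQ file can be used to confirm that a download is intact. A minimal sketch, assuming each md5 file holds the hexadecimal digest as its first whitespace-separated token (the usual `md5sum` layout, which this description does not confirm); the function names and the file name in the example are illustrative only:

```python
import hashlib
import io


def md5_hexdigest(stream, chunk_size=1 << 20):
    """Return the MD5 hex digest of a binary stream, read in chunks."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def matches_md5_line(stream, md5_line):
    """Compare a stream's digest with the first token of an md5sum-style line."""
    expected = md5_line.split()[0].lower()
    return md5_hexdigest(stream) == expected


# In-memory stand-in for a FASTQ file; the file name is hypothetical.
demo = io.BytesIO(b"hello")
print(matches_md5_line(demo, "5d41402abc4b2a76b9719d911017c592  sample_R1.fastq.gz"))
```

For a real file, the stream would be opened in binary mode, for example `open(path, "rb")`.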
Workflow (performed by TCAG at SickKids):
2. RNA-Seq Library and Reference Genome Information
Type of library: stranded, paired end
Genome reference sequence: GRCm39, M31 Gencode gene models.
3. Read Pre-processing, Alignment and Obtaining Gene Counts
3.1 Read Pre-processing
The sequencing data is in FASTQ format. The quality of the data is assessed using FastQC v.0.11.5 (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/).
Adaptors are trimmed using Trim Galore (http://www.bioinformatics.babraham.ac.uk/projects/trim_galore/) v. 0.5.0. Trim Galore is running Cutadapt (https://cutadapt.readthedocs.org/en/stable/) v. 1.10. Trim Galore is run with the following parameters:
-q 25 – the reads are trimmed from the 3' end base by base, trimming stops if the quality of the base is greater than 25;
--clip_R1 6, --clip_R2 6 – clip the first 6 nucleotides from the 5' ends of read 1 and read 2;
--stringency 5 – an overlap of at least 5 nucleotides with the Illumina primer sequence is required for trimming;
--length 40 – any read that is shorter than 40 nucleotides as a result of trimming is discarded;
--paired – only pairs of reads are retained (for paired-end reads only, not for single reads).
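The parameters listed above can be assembled into a single Trim Galore call. The sketch below only builds the command line and does not run it; the input file names and the output directory (passed with `--output_dir`) are assumptions added for illustration and are not stated in the workflow description:

```python
import shlex


def trim_galore_command(read1, read2, output_dir):
    """Assemble a Trim Galore call using the parameters listed in the workflow."""
    return [
        "trim_galore",
        "-q", "25",           # quality cutoff for 3' end trimming
        "--clip_R1", "6",     # clip 6 nt from the 5' end of read 1
        "--clip_R2", "6",     # clip 6 nt from the 5' end of read 2
        "--stringency", "5",  # minimum adapter overlap
        "--length", "40",     # discard reads shorter than 40 nt after trimming
        "--paired",           # keep only intact read pairs
        "--output_dir", output_dir,  # assumption: not given in the description
        read1,
        read2,
    ]


# Hypothetical input files.
cmd = trim_galore_command("sample_R1.fastq.gz", "sample_R2.fastq.gz", "trimmed")
print(shlex.join(cmd))
```

Keeping the command as a list makes it straightforward to pass to `subprocess.run` without shell quoting issues.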
The type of adaptor is automatically detected by screening the first 1 million sequences of the first specified file for the first 12/13 nucleotides of the standard Illumina or Nextera primers and the sequence from the start of the primer to the 3' end of the read is trimmed.
The quality of the trimmed reads is re-assessed with FastQC.
The trimmed reads are also screened for presence of rRNA and mtRNA sequences using FastQ-Screen v.0.10.0 (http://www.bioinformatics.babraham.ac.uk/projects/fastq_screen/).
To assess the read distribution, positional read duplication and to confirm the strandedness of the alignments we use the RSeQC package (http://rseqc.sourceforge.net/), v. 2.6.2. The distribution of reads across exonic, intronic and intergenic sequences is assessed by the read_distribution.py program, infer_experiment.py is used for confirming strandedness, and read_duplication.py is used to obtain the positional read duplication (percentage of reads mapping to exactly the same genomic location). A sufficient proportion of reads should map to exonic sequences (ideally > 70-80%). A large number of reads mapping to intronic sequences in a poly-A mRNA library suggests a significant presence of pre-mRNA or other issues with RNA preparation. For stranded RNA-seq experiments the majority of the reads should map exclusively to one strand, same or opposite to the transcript, depending on the library preparation method. For non-stranded experiments the reads should be equally distributed to both strands.
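The strandedness check described above comes down to comparing the fraction of reads on the same strand as the transcript with the fraction on the opposite strand. A minimal sketch of that decision, assuming the two fractions have already been read from the strandedness report; the 0.8 cutoff and the 0.2 balance tolerance are illustrative assumptions, not values from the workflow:

```python
def classify_strandedness(frac_same, frac_opposite, threshold=0.8, balance_tolerance=0.2):
    """Classify a library from the read fractions observed for each orientation.

    frac_same: fraction of reads on the same strand as the transcript.
    frac_opposite: fraction of reads on the opposite strand.
    threshold: assumed minimum fraction for calling a library stranded.
    balance_tolerance: assumed maximum difference for calling it non-stranded.
    """
    if frac_same >= threshold:
        return "stranded (same strand as transcript)"
    if frac_opposite >= threshold:
        return "stranded (opposite strand to transcript)"
    if abs(frac_same - frac_opposite) <= balance_tolerance:
        return "non-stranded"
    return "undetermined"


print(classify_strandedness(0.02, 0.95))
```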
3.2. Read Alignment
The trimmed reads are aligned to the reference genome using the STAR aligner, v.2.6.0c. (https://github.com/alexdobin/STAR, https://academic.oup.com/bioinformatics/article/29/1/15/272537). The alignments are contained in the .bam files. The “.bam” files together with the “.bai” files can be used for viewing the alignments in the Integrative Genomics Viewer (IGV, http://software.broadinstitute.org/software/igv/).
3.3. Obtaining Gene Counts
The filtered STAR alignments are processed to extract raw read counts for genes using htseq-count v.0.6.1p2 (HTSeq, http://www-huber.embl.de/users/anders/HTSeq/doc/overview.html). Reads are assigned to genes by htseq-count in the mode “intersection-nonempty”, i.e. if a read overlaps with two overlapping genes and the overlap to gene A is greater than the overlap to gene B, the read is counted towards gene A, while if a read overlaps equally with gene A and gene B, then it is not counted towards either gene. To avoid introducing bias in the expression results, htseq-count does not count reads with multiple alignments. Only uniquely mapping reads are counted.
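The assignment rule can be illustrated with a toy model. The sketch below is not htseq-count itself: it treats each gene as a single unstranded interval, ignores exon structure and read pairing, and follows the intersection-nonempty idea of intersecting the gene sets found at every read position that is covered by at least one gene. All names and coordinates are hypothetical:

```python
def assign_read(read_start, read_end, genes):
    """Assign one read to a gene under an intersection-nonempty style rule.

    read_start, read_end: half-open read interval [start, end).
    genes: dict mapping gene name to a half-open (start, end) interval.
    Returns a gene name, "no_feature", or "ambiguous".
    """
    common = None
    for pos in range(read_start, read_end):
        here = {name for name, (gs, ge) in genes.items() if gs <= pos < ge}
        if not here:
            continue  # positions covered by no gene are ignored
        common = here if common is None else common & here
    if not common:
        return "no_feature"
    if len(common) == 1:
        return next(iter(common))
    return "ambiguous"


# Two hypothetical overlapping genes.
genes = {"geneA": (0, 100), "geneB": (50, 150)}
print(assign_read(30, 70, genes))   # partly in geneA only, partly in both
print(assign_read(60, 90, genes))   # entirely inside the shared region
```

In this model a read that extends beyond the shared region into one gene is assigned to that gene, while a read lying entirely in the shared region is left uncounted as ambiguous, which matches the behaviour described above.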
4. Pre-processing, Alignment and Gene Counts QC
MultiQC (https://multiqc.info/) is a reporting tool that aggregates statistics generated by bioinformatics analyses across multiple samples. MultiQC v. 1.14 was used to generate a consolidated report from FastQC screening of both untrimmed and trimmed reads, and from RSeQC, FastQ Screen, STAR and htseq-count results. The MultiQC report is contained in MultiQC_Report_*.html file.
5. DGE Analysis with edgeR
Differential expression analysis was performed with the edgeR R package v.3.28.1, using R v.3.6.1 (http://www.bioconductor.org/packages/release/bioc/html/edgeR.html). The data set was filtered to retain only genes whose gene counts were >50 in at least 3 samples. This is intended to remove genes that are not expressed, or expressed at a very low level.
The method used for normalizing the data was TMM, implemented by the calcNormFactors(y) function. All samples were normalized and filtered together. The glmLRT functionality in edgeR was used for the differential expression tests, with sample group taken into account.
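The count filter described above (counts >50 in at least 3 samples) is simple to express outside R as well. A minimal Python sketch, assuming the raw counts are held in a dictionary keyed by gene ID; the gene IDs and counts in the example are made up:

```python
def filter_low_counts(counts, min_count=50, min_samples=3):
    """Keep genes whose raw count exceeds min_count in at least min_samples samples.

    counts: dict mapping gene ID to a list of raw counts, one per sample.
    """
    return {
        gene: values
        for gene, values in counts.items()
        if sum(1 for v in values if v > min_count) >= min_samples
    }


# Hypothetical counts for four samples.
example = {
    "gene1": [120, 80, 60, 10],  # 3 samples above 50: kept
    "gene2": [51, 50, 50, 500],  # only 2 samples above 50: dropped
    "gene3": [0, 0, 3, 1],       # not expressed: dropped
}
print(sorted(filter_low_counts(example)))
```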
EdgeR Results Legend:
· GeneID – Ensembl Gene ID;
· Chr.Start.End - gene coordinates;
· GeneName, GeneType, etc. – Gene attributes, derived from the genome annotation;
· logFC - Log2 Fold Change (use this column for selection of DEGs);
· logCPM - Log2 Counts Per Million, average for all libraries;
· LR – Statistic calculated by the LR-Test;
· PValue - Differential expression P value;
· FDR – Differential expression False Discovery Rate, calculated by the Benjamini-Hochberg method (use this column for selection of DEGs);
· (columns labeled with sample names) – Fragments Per Kilobase of transcript per Million mapped reads (FPKMs) for the given samples.
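The FDR column is described as a Benjamini-Hochberg adjustment of the PValue column. The sketch below shows that adjustment in plain Python for illustration; it is a generic implementation of the procedure, not edgeR's code, and the p-values are invented:

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so each adjusted value is the
    # minimum over its own rank and all larger ranks (keeps the result monotone).
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


fdr = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print([round(x, 4) for x in fdr])
```

Genes would then typically be selected by combining a cutoff on this adjusted value with a cutoff on logFC; the specific cutoffs are a choice for the analyst and are not stated here.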
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
RNA-Seq FASTQ files for RCC4-EV cells (DRR100656) and RCC4-VHL cells (DRR100657) were obtained from the Sequence Read Archive (https://trace.ddbj.nig.ac.jp/dra/index_e.html). The quality of the sequence data was evaluated with FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) after trimming with fastx_toolkit v 0.0.14 (http://hannonlab.cshl.edu/fastx_toolkit/). The human reference sequence file (hs37d5.fa) was downloaded from the 1000 Genomes FTP site (ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/), and the annotated general feature format (gff) file was downloaded from the Illumina iGenome FTP site (ftp://igenome:G3nom3s4u@ussd-ftp.illumina.com/Homo_sapiens/NCBI/build37.2/). The human genome index was constructed with bowtie-build in Bowtie v.2.2.9. The FASTQ files were aligned to the reference genomic sequence with TopHat v.2.1.1 using default parameters. Bowtie2 v2.2.9 and Samtools v.1.3.1 were used with the TopHat program. Transcript abundance was estimated, and the count values were normalized to the upper quartile of the fragments per kilobase of transcript per million mapped fragments (FPKM) using Cufflinks (cuffdiff) v2.1.1. The cuffdiff output (gene_exp.diff) is provided as gene_exp.diff.txt.
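For orientation, the basic FPKM quantity can be written out directly. This is a sketch of the plain definition only: it uses the annotated transcript length and the total number of mapped fragments, whereas Cufflinks applies its own length model and, as noted above, an upper-quartile normalization. The numbers in the example are invented:

```python
def fpkm(fragment_count, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    return fragment_count * 1e9 / (transcript_length_bp * total_mapped_fragments)


# 100 fragments on a 2 kb transcript in a library of 1 million mapped fragments.
print(fpkm(100, 2000, 1_000_000))
```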
https://spdx.org/licenses/CC0-1.0.html
The growth and development of root systems, essential for plant performance, is influenced by mechanical properties of the substrate in which the plants grow. Mechanical impedance, such as by compacted soil, can reduce root elongation and limit crop productivity.
To understand better the mechanisms involved in plant root responses to mechanical impedance stress, we investigated changes in the root transcriptome and hormone signalling responses of Arabidopsis to artificial root barrier systems in vitro.
We demonstrate that upon encountering a barrier, reduced Arabidopsis root growth and the characteristic 'step-like' growth pattern is due to a reduction in cell elongation associated with changes in signalling gene expression. Data from RNA-sequencing combined with reporter line and mutant studies identified essential roles for reactive oxygen species, ethylene and auxin signalling during the barrier response.
We propose a model in which early responses to mechanical impedance include reactive oxygen signalling integrated with ethylene and auxin responses to mediate root growth changes. Inhibition of ethylene responses allows improved growth in response to root impedance, an observation that may inform future crop breeding programmes.
Methods: 20 mg of tissue was ground in liquid nitrogen using TissueLyser II (QIAGEN, Manchester, UK) and RNA extracted using the Qiagen ReliaPrep™ RNA Tissue Miniprep System. RNA quality was determined using the NanoDrop ND-1000 spectrophotometer (ThermoFisher Scientific) and Agilent 2200 TapeStation. Libraries were constructed from 100 ng and 1 μg total RNA using the NEBNext Ultra™ Directional RNA Library Prep Kit for Illumina for use with the NEBNext Poly(A) mRNA Magnetic Isolation Module (NEB, Hitchin, UK). mRNA was isolated, fragmented and primed, cDNA was synthesised and end prep was performed. NEBNext Adaptor was ligated and the ligation reaction was purified using AMPure XP Beads. PCR enrichment of adaptor-ligated DNA was conducted using NEBNext Multiplex Oligos for Illumina (Set 1, NEB#E7335). The PCR reaction was purified using Agencourt AMPure XP Beads. Library quality was then assessed using a DNA analysis ScreenTape on the Agilent Technologies 2200 TapeStation. qPCR was used for sample quantification using the NEBNext® Library Quant Kit for Illumina (Quick Protocol). Samples were diluted to 10 nM. 7 μl of each 10 nM sample was pooled together and all were run on two lanes using an Illumina HiSeq2500 (DBS Genomics facility, Durham University). Approximately 30M unique paired-end 125 bp reads were generated per sample. Primers were designed using Primer-BLAST (http://www.ncbi.nlm.nih.gov/tools/primer-blast/) and synthesised by MWG Eurofins (http://www.eurofinsdna.com/). FastQC (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) was used to assess read quality and Trimmomatic (Bolger et al., 2014) was used to trim and remove low-quality reads. Salmon (Patro et al., 2017) was used for quasi-mapping of reads against the AtRTD2-QUASI (Brown et al., 2017; Zhang et al., 2017) transcriptome and to estimate transcript-level abundances.
The tximport R package (Soneson et al., 2016) was used to import transcript-level abundance, estimate counts and transcript lengths, and summarise into matrices for downstream analysis in R. Before differential expression analysis, low-abundance genes were filtered out of the data set: only genes with a count per million of at least 0.744 in 6 or more samples were retained. The DESeq2 (Love et al., 2014) R package was used to estimate variance-mean dependence in count data and test for differential expression (using the negative binomial distribution model). A padj-value of ≤0.05 and a log2 fold change of ≥0.5 were selected to identify differentially expressed genes (DEGs). The 3D RNA-Seq online App (Guo et al., 2019; Calixto et al., 2018) was used for independent verification of estimated DEGs and for differential alternative splicing analysis.
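The DEG thresholds can be applied to an exported results table as follows. This is a sketch under two assumptions that the description does not settle: that the log2 fold change cutoff is applied to the absolute value (so that down-regulated genes are also selected), and that rows with a missing padj are excluded. The column names follow the usual DESeq2 results table; the gene names and values are invented:

```python
def select_degs(results, padj_max=0.05, lfc_min=0.5):
    """Return gene IDs passing the padj and absolute log2 fold change cutoffs.

    results: list of dicts with keys 'gene', 'padj' and 'log2FoldChange';
    'padj' may be None where no adjusted p-value was reported.
    """
    return [
        row["gene"]
        for row in results
        if row["padj"] is not None
        and row["padj"] <= padj_max
        and abs(row["log2FoldChange"]) >= lfc_min
    ]


# Hypothetical results rows.
rows = [
    {"gene": "g1", "padj": 0.001, "log2FoldChange": 1.2},
    {"gene": "g2", "padj": 0.030, "log2FoldChange": -0.8},
    {"gene": "g3", "padj": 0.200, "log2FoldChange": 2.0},
    {"gene": "g4", "padj": 0.010, "log2FoldChange": 0.3},
    {"gene": "g5", "padj": None, "log2FoldChange": 3.0},
]
print(select_degs(rows))
```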
The C2H2 zinc finger is the most prevalent DNA-binding motif in the mammalian proteome, with DNA-binding domains usually containing more tandem fingers than are needed for stable sequence-specific DNA recognition. To examine the reason for the frequent presence of multiple zinc fingers, we generated mice lacking finger 1 or finger 4 of the 4-finger DNA-binding domain of Ikaros, a critical regulator of lymphopoiesis and leukemogenesis. Each mutant strain exhibited a specific subset of the phenotypes observed with Ikaros null mice. Of particular relevance, fingers 1 and 4 contributed to distinct stages of B- and T-cell development and finger 4 was selectively required for tumor suppression in thymocytes and in a new model of BCR-ABL+ acute lymphoblastic leukemia. These results, combined with transcriptome profiling (this GEO submission: RNA-Seq of whole thymus from wt and the two ZnF mutants), reveal that different subsets of fingers within multi-finger transcription factors can regulate distinct target genes and biological functions, and they demonstrate that selective mutagenesis can facilitate efforts to elucidate the functions and mechanisms of action of this prevalent class of factors.
Overall design: RNA-Seq from whole thymus comparing wt (3 replicates), Ikaros-ZnF1-/- mutant (2 replicates) and Ikaros-ZnF4-/- mutant (2 replicates). RPKM_Thymocytes.txt (linked below as a supplementary file) reports the relative mRNA expression levels (RPKM) values for all annotated Refseq genes that had at least one read in at least one of the samples, with duplicates for the same gene (different transcripts for same gene) filtered out. RPKM (Mortazavi et al., 2008) were calculated based on exonic reads obtained by using the software SeqMonk (Babraham Bioinformatics) and reference genome annotations from NCBI (mm9).
Aim: Shaped by both climate change and sea-level rise, tidal salt marshes represent ephemeral systems that are home to only a few, highly specialized species. The dynamic ecological histories and spatial complexities of these habitats, however, render it challenging to reconstruct the complete biogeographic histories of their endemic taxa. Here, we leverage three species of North American Ammospiza sparrows that inhabit tidal marshes (Ammospiza caudacuta, A. maritima, and A. n. subvirgatus) and closely related freshwater species to demonstrate the utility of whole-genome data in resolving demographic and evolutionary history as it relates to divergence and dispersal events in ephemeral ecosystems. We employ a combination of demographic and biogeographic reconstructions to shed new light on the colonization history of freshwater-saline environments in this system.
Location: North America
Taxon: Ammospiza Sparrows
Methods: We sequenced whole genomes from Ammospiza sparrows to address...
https://www.bco-dmo.org/dataset/813173/license
Supplementary Table 4C: Metatranscriptome data summary for cellular activities presented and statistics on sequencing and removal of potential contaminant sequences: Statistics of reads retained through bioinformatic processing of iTAG data for the 11 samples and control samples and metatranscriptome data. Samples were taken on board the R/V JOIDES Resolution between November 30, 2015 and January 30, 2016.
access_formats=.htmlTable,.csv,.json,.mat,.nc,.tsv,.esriCsv,.geoJson
acquisition_description=Rock material was crushed while still frozen in a Progressive Exploration Jaw Crusher (Model 150) whose surfaces were sterilized with 70% ethanol and RNase AWAY (Thermo Fisher Scientific, USA) inside a laminar flow hood. Powdered rock material was returned to the -80°C freezer until extraction.
DNA was extracted from 20, 30, or 40 grams of powdered rock material, depending on the quantity of rock available. A DNeasy PowerMax Soil Kit (Qiagen, USA) was used following the manufacturer’s protocol modified to include three freeze/thaw treatments prior to the addition of Soil Kit solution C1. Each treatment consisted of 1 minute in liquid nitrogen followed by 5 minutes at 65 °C. DNA extracts were concentrated by isopropanol precipitation overnight at 4°C.
The low biomass in our samples required whole genome amplification (WGA) prior to PCR amplification of marker genes. Genomic DNA was amplified by Multiple Displacement Amplification (MDA) using the REPLI-g Single Cell Kit (Qiagen) as directed. MDA bias was minimized by splitting each WGA sample into triplicate 16 μL reactions after 1 hr of amplification and then resuming amplification for the manufacturer-specified 7 hrs (8 hrs total).
DNA was also recovered from samples of drilling mud and drilling fluid (surface water collected during the coring process) for negative controls, as well as two “kit control” samples, in which no sample was added, to account for any contaminants originating from either the DNeasy PowerMax Soil Kit or the REPLI-g Single Cell Kit.
Bacterial SSU rRNA gene fragments were PCR amplified from MDA samples and sequenced at Georgia Genomics and Bioinformatics Core (Univ. of Georgia). The primers used were Bac515-Y and Bac926R. Dual-indexed libraries were prepared with (HT) iTruS (Kapa Biosystems) chemistry and sequencing was performed on an Illumina MiSeq 2 x 300 bp system with all samples combined equally on a single flow cell.
Raw sequence reads were processed through Trim Galore [http://www.bioinformatics.babraham.ac.uk/projects/trim_galore/], FLASH (ccb.jhu.edu/software/FLASH/) and FASTX Toolkit [http://hannonlab.cshl.edu/fastx_toolkit/] for trimming and removal of low quality/short reads.
Quality filtering included requiring a minimum average quality of 25 and rejection of paired reads less than 250 nucleotides.
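The two quality criteria above (minimum average quality of 25, minimum length of 250 nucleotides) can be sketched as a per-read check. This assumes Phred+33 encoded quality strings, which is the usual encoding for recent Illumina output but is not stated in the description; the function names are illustrative and this is not the code used by the tools cited above:

```python
def mean_phred(quality_string, offset=33):
    """Mean Phred score of a FASTQ quality string (Phred+33 assumed)."""
    return sum(ord(ch) - offset for ch in quality_string) / len(quality_string)


def passes_filter(sequence, quality_string, min_mean_quality=25, min_length=250):
    """True if a read meets both the length and the mean-quality criteria."""
    return (
        len(sequence) >= min_length
        and mean_phred(quality_string) >= min_mean_quality
    )


# 'I' encodes Q40 and '5' encodes Q20 under Phred+33.
print(passes_filter("A" * 250, "I" * 250))
print(passes_filter("A" * 250, "5" * 250))
```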
Operational Taxonomic Unit (OTU) clusters were constructed at 99% similarity
with the script pick_otus.py within the Quantitative Insights Into Microbial
Ecology (QIIME) v.1.9.1 software and ‘uclust’. Any OTU that matched
an OTU in one of our control samples (drilling fluids, drilling mud,
extraction and WGA controls) was removed (using filter_otus_from_otu_table.py)
along with any sequences of land plants and human pathogens that may have
survived the control filtering due to clustering at 99%
(filter_taxa_from_otu_table.py). As an additional quality control measure,
genera that are commonly identified as PCR contaminants were removed.
Unclassified OTUs were queried using BLAST against the GenBank nr database and
further information about these OTUs is provided in the Supplementary
Discussion text under the section “Taxonomic diversity information from
iTAGs.” OTUs that could not be assigned to Bacteria or Archaea were
removed from further analysis. For downstream analyses, any OTUs not
representing more than 0.01% of relative abundance of sequences overall were
removed as those are unlikely to contribute significantly to in situ
communities. The OTU data table was transformed to a presence/absence table
and the Jaccard method was used to generate a distance matrix using the
dist.binary() function in the R package ade4.
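The presence/absence transformation and Jaccard comparison described above can be illustrated in a few lines of Python. This is a sketch of the plain Jaccard dissimilarity, 1 - (shared OTUs / OTUs present in either sample), and not a reimplementation of the R function: to our understanding the ade4 documentation defines its binary distances in the form sqrt(1 - s), so values returned by dist.binary() may be the square root of the values computed here. The abundance vectors are invented:

```python
def to_presence_absence(abundances):
    """Convert OTU abundances to 1 (present) or 0 (absent)."""
    return [1 if value > 0 else 0 for value in abundances]


def jaccard_distance(sample_a, sample_b):
    """Plain Jaccard dissimilarity between two presence/absence vectors."""
    shared = sum(1 for a, b in zip(sample_a, sample_b) if a and b)
    either = sum(1 for a, b in zip(sample_a, sample_b) if a or b)
    if either == 0:
        return 0.0  # two empty samples are treated as identical
    return 1.0 - shared / either


# Hypothetical OTU abundances for two samples.
s1 = to_presence_absence([5, 0, 2, 0])
s2 = to_presence_absence([1, 3, 0, 0])
print(round(jaccard_distance(s1, s2), 4))
```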
awards_0_award_nid=709555
awards_0_award_number=OCE-1658031
awards_0_data_url=http://www.nsf.gov/awardsearch/showAward.do?AwardNumber=1658031
awards_0_funder_name=NSF Division of Ocean Sciences
awards_0_funding_acronym=NSF OCE
awards_0_funding_source_nid=355
awards_0_program_manager=David L. Garrison
awards_0_program_manager_nid=50534
cdm_data_type=Other
comment=Supplementary Table 4C: iTAG
PI: Virginia Edgcomb
Data Version 1: 2020-05-28
Conventions=COARDS, CF-1.6, ACDD-1.3
data_source=extract_data_as_tsv version 2.3 19 Dec 2019
dataset_current_state=Final and no updates
defaultDataQuery=&time<now
doi=10.26008/1912/bco-dmo.813173.1
Easternmost_Easting=57.278183
geospatial_lat_max=-32.70567
geospatial_lat_min=-32.70567
geospatial_lat_units=degrees_north
geospatial_lon_max=57.278183
geospatial_lon_min=57.278183
geospatial_lon_units=degrees_east
geospatial_vertical_max=747.7
geospatial_vertical_min=10.7
geospatial_vertical_positive=down
geospatial_vertical_units=m
infoUrl=https://www.bco-dmo.org/dataset/813173
institution=BCO-DMO
instruments_0_acronym=Automated Sequencer
instruments_0_dataset_instrument_description=DNA sequencing performed using the Illumina MiSeq 2 x 300 bp platform (Univ. of Georgia)
instruments_0_dataset_instrument_nid=813183
instruments_0_description=General term for a laboratory instrument used for deciphering the order of bases in a strand of DNA. Sanger sequencers detect fluorescence from different dyes that are used to identify the A, C, G, and T extension reactions. Contemporary or Pyrosequencer methods are based on detecting the activity of DNA polymerase (a DNA synthesizing enzyme) with another chemoluminescent enzyme. Essentially, the method allows sequencing of a single strand of DNA by synthesizing the complementary strand along it, one base pair at a time, and detecting which base was actually added at each step.
instruments_0_instrument_name=Automated DNA Sequencer
instruments_0_instrument_nid=649
instruments_0_supplied_name=Illumina MiSeq 2 x 300 bp platform
metadata_source=https://www.bco-dmo.org/api/dataset/813173
Northernmost_Northing=-32.70567
param_mapping={'813173': {'Latitude': 'flag - latitude', 'Depth': 'flag - depth', 'Longitude': 'flag - longitude'}}
parameter_source=https://www.bco-dmo.org/mapserver/dataset/813173/parameters
people_0_affiliation=Woods Hole Oceanographic Institution
people_0_affiliation_acronym=WHOI
people_0_person_name=Virginia P. Edgcomb
people_0_person_nid=51284
people_0_role=Principal Investigator
people_0_role_type=originator
people_1_affiliation=Woods Hole Oceanographic Institution
people_1_affiliation_acronym=WHOI
people_1_person_name=Virginia P. Edgcomb
people_1_person_nid=51284
people_1_role=Contact
people_1_role_type=related
people_2_affiliation=Woods Hole Oceanographic Institution
people_2_affiliation_acronym=WHOI BCO-DMO
people_2_person_name=Karen Soenen
people_2_person_nid=748773
people_2_role=BCO-DMO Data Manager
people_2_role_type=related
project=Subseafloor Lower Crust Microbiology
projects_0_acronym=Subseafloor Lower Crust Microbiology
projects_0_description=NSF abstract:
The lower ocean crust has remained largely unexplored and represents one of the last frontiers for biological exploration on Earth. Preliminary data indicate an active subsurface biosphere in samples of the lower oceanic crust collected from Atlantis Bank in the SW Indian Ocean as deep as 790 m below the seafloor. Even if life exists in only a fraction of the habitable volume where temperatures permit and fluid flow can deliver carbon and energy sources, an active lower oceanic crust biosphere would have implications for deep carbon budgets and yield insights into microbiota that may have existed on early Earth. This is all of great interest to other research disciplines, educators, and students alike. A K-12 education program will capitalize on groundwork laid by outreach collaborator, A. Martinez, a 7th grade teacher in Eagle Pass, TX, who sailed as outreach expert on Drilling Expedition 360. Martinez works at a Title 1 school with ~98% Hispanic and ~2% Native American students and a high number of English Language Learners and migrants. Annual school visits occur during which the project investigators present hands-on activities introducing students to microbiology, and talks on marine microbiology, the project, and how to pursue science-related careers. In addition, monthly Skype meetings with students and PIs update them on project progress. Students travel to the University of Texas Marine Science Institute annually, where they get a campus tour and a 3-hour cruise on the R/V Katy, during which they learn about and help with different oceanographic sampling approaches. The project partially supports two graduate students, a Woods Hole undergraduate summer student, the participation of multiple Texas A&M undergraduate students, and 3 principal investigators at two institutions, including one early career researcher who has not previously received NSF support of his own.
Given the dearth of knowledge of the lower oceanic crust, this project is poised to transform our understanding of life in this vast environment. The project assesses metabolic functions within all three domains of life in this crustal biosphere, with a focus on nutrient cycling and evaluation of connections to other deep marine microbial habitats. The lower ocean crust represents a potentially vast biosphere whose microbial constituents and the biogeochemical cycles they mediate are likely linked to deep ocean processes through faulting and subsurface fluid flow. Atlantis Bank represents a tectonic
All sample information for individuals included in this VCF can be found in Supporting Information Table S1. This is the filtered VCF used.
A large dataset of replicated transcriptomes was developed to accelerate Theobroma cacao genomics research with the long-term goal of progressing breeding towards developing high-yielding elite varieties of cacao. RNAs were extracted and transcriptomes were sequenced from 123 different tissues and stages of development representing major organs and developmental stages of the cacao lifecycle. In addition, several experimental treatments and time courses were performed to measure gene expression in tissues responding to biotic and abiotic stressors. Samples were collected in replicates (3-5) to enable statistical analysis of gene expression levels for a total of 390 transcriptomes. We describe the creation of the atlas and its global characterization, and define sets of genes co-regulated in highly organ- and temporally-specific manners. To promote wider use of these data, all raw sequencing data, expression read mapping matrices, scripts, and other information used to create the resourc...
RNA was extracted from about 400 different tissues/treatments and replicates. Transcriptome sequencing was performed by QuantSeq (Lexogen). Raw QuantSeq reads were first examined with FASTQC (v0.11.9 https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) to assess the overall data quality before processing. Reads were then processed using bbduk (BBMap tools v37.76; https://jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/bbduk-guide/) to trim the adapter sequences, poly-A tails, and low-quality bases and to discard fragments less than 20 bp in length after trimming. Trimmed reads were mapped to the CCN-51 and SCA6 Theobroma cacao genotype reference genomes using the STAR Aligner version 2.7.5b (Dobin et al. 2013). Expression quantification was performed with featureCounts from the Subread package version 2.0.1 (Liao et al. 2013) in a fractional read-counting mode to proportionally distribute multi-mapping reads among features using gene annotation GFF3 files modified wit...
Excel or any text editor or spreadsheet program.
# The cacao gene atlas: A transcriptome developmental atlas reveals highly tissue-specific and dynamically-regulated gene networks in Theobroma cacao L
The reconstruction of relationships within recently radiated groups is challenging even when massive amounts of sequencing data are available. The use of restriction site-associated DNA sequencing (RAD-Seq) to this end is promising. Here, we assessed the performance of RAD-Seq to infer the species-level phylogeny of the rapidly radiating genus Cereus (Cactaceae). To examine how the amount of genomic data affects resolution in this group, we used distinct datasets and implemented different analyses. We sampled 52 individuals of Cereus, representing 18 of the 25 species currently recognized, plus members of the closely allied genera Cipocereus and Praecereus, and 11 other Cactaceae genera as outgroups. Three scenarios of permissiveness to missing data were carried out in iPyRAD, assembling datasets with 30% (333 loci), 45% (1440 loci), and 70% (6141 loci) of missing data. For each dataset, Maximum Likelihood (ML) trees were generated using two supermatrices, i.e., only SNPs and SNPs plus invariant sites. Accuracy and resolution were improved when the dataset with the highest number of loci was used (6141 loci), despite the high percentage of missing data included (70%). Coalescent trees estimated using SVDQuartets and ASTRAL are similar to those obtained by the ML reconstructions. Overall, we reconstruct a well-supported phylogeny of Cereus, which is resolved as monophyletic and composed of four main clades with high support in their internal relationships. Our findings also provide insights into the impact of missing data for phylogeny reconstruction using RAD loci.
Sampling: Our dataset includes 63 samples spanning 52 ingroups of Cereus and 11 outgroups (Table 1).
ddRAD library preparation and sequencing: Genomic DNA was extracted from root tissues using the DNeasy Plant Mini Kit (Qiagen). ddRAD libraries were prepared using high fidelity EcoRI and HpaII restriction enzymes following Campos et al. (2017) and Khan et al. (2019). Details of library preparation and sequencing are shown in Supplementary material.
Bioinformatics analyses: Raw data were trimmed for adapters and quality filtered before SNP calling. The quality of sequencing data was checked with FastQC 0.11.2 (http://www.bioinformatics.babraham.ac.uk/projects/fastqc), visualized in MultiQC 1.0 (https://github.com/ewels/MultiQC), and filtered with SeqyClean 1.9.12 (Zhbannikov et al., 2017) using the following settings: minimum quality (Phred Score 20), minimum size (>65 bp), and Illumina contaminants (UniVec.fas). We used the iPyRAD pipeline (available at http://github.com/dereneaton/ipyrad) to identify homology among reads, make SNP calls, and format output files. The following parameter settings were implemented: mindepth_majrule = 6 (minimum depth for majority-rule base calling), clust_threshold = 0.85 (clustering threshold for de novo assembly), filter_adapters = 2 (strict filter), max_Hs_consens = 6 (maximum heterozygotes in consensus), min_samples_locus (minimum percentage of samples per locus for output). For the latter, values varied in three distinct scenarios concerning the permissiveness to missing data. These scenarios considered that the final set of loci should have at least 39 samples (scenario 1, approximately 30% of missing data), 26 samples (scenario 2, approximately 45% of missing data), or 13 samples (scenario 3, approximately 70% of missing data). After SNP calling, CD-HIT (Li and Godzik, 2006; Fu et al., 2012) was used to identify reverse-complement duplicates in the loci recovered by iPyRAD.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We integrated a total of 149 publicly available RNA-seq libraries from 7 international studies (see atlas_info). These transcriptomes were generated from 10 different pea varieties and covered a wide range of biological conditions, including a comprehensive collection of plant organs, various modalities of abiotic stress (mineral nutrition, water supply and temperature) and biotic interactions (nodule). The raw expression data from the source RNA-seq libraries were re-assembled to the reference genome (Kreplak et al., 2019) and the mean expression value was computed between biological replicates produced in individual studies (see atlas_info), thus providing a transcriptomic atlas of the 44.756 genes in the pea genome across 81 biological conditions.
Method: RNA-seq data (sequenced reads and fastq files) generated from P. sativum were downloaded from the Sequence Read Archive publicly available at NCBI (Bioproject numbers listed in table info) using SRAtools v3.0.1 (SRA toolkit). A fastp v0.22.0 (Chen et al., 2018) analysis was performed to trim the adapters and filter out reads with a low quality score, followed by a quality assessment performed using FastQC v0.12.1 (Babraham Bioinformatics). The RNA-seq reads were mapped to the P. sativum reference genome v1a (Kreplak et al., 2019) using STAR v2.7.10b (Dobin et al., 2012). Gene count tables were generated using featureCounts v2.0.1 (Liao et al., 2013) and normalized by the median-of-ratios method using the DESeq2 R package. The transcriptomic atlas comprises expression data for the 44,756 genes of the pea genome across 81 biological conditions.
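To illustrate the median-of-ratios normalization mentioned above, here is a minimal pure-Python sketch of the general technique. It is not the DESeq2 implementation; the function names and the toy count matrix are assumptions made for illustration only.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    `counts` is a list of gene rows, each a list of per-sample raw counts.
    Genes with a zero count in any sample are skipped, because their
    geometric mean across samples is zero.
    """
    usable = [row for row in counts if all(c > 0 for c in row)]
    if not usable:
        raise ValueError("no gene has non-zero counts in every sample")
    # Log of the per-gene geometric mean across samples.
    log_geo_means = [sum(math.log(c) for c in row) / len(row) for row in usable]
    factors = []
    for j in range(len(counts[0])):
        log_ratios = [
            math.log(row[j]) - lg for row, lg in zip(usable, log_geo_means)
        ]
        factors.append(math.exp(median(log_ratios)))
    return factors


def normalize(counts):
    """Divide every count by the size factor of its sample."""
    factors = size_factors(counts)
    return [[c / f for c, f in zip(row, factors)] for row in counts]


if __name__ == "__main__":
    # Toy matrix: sample 2 was sequenced twice as deeply as sample 1.
    toy = [[10, 20], [100, 200], [5, 10]]
    print(size_factors(toy))
    print(normalize(toy))
```

In the toy matrix the second sample has exactly twice the counts of the first, so after normalization both columns carry the same values.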
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset
This dataset comprises the genome assemblies of the first isolate collected from each clinical case (A1 and B1), together with the respective annotation.
File “Cje_metadata.xlsx” contains the genome assembly statistics for each isolate, including the European Nucleotide Archive (ENA) accession numbers, genotyping and antibiotic resistance profiles.
The directory “Assemblies/” contains the genome assembly (.fasta and .gbk formats) of each isolate presented in the metadata file.
Genome assembly and annotation
Read quality control and improvement, species confirmation (using the 8 GB database available at https://ccb.jhu.edu/software/kraken/) and de novo assembly were performed using the INNUca v4.2.2 pipeline (https://github.com/B-UMMI/INNUca). Briefly, after read quality analysis using FastQC v0.11.5 (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) and cleaning with Trimmomatic v0.38 (http://www.usadellab.org/cms/?page=trimmomatic), genomes were de novo assembled with SPAdes 3.14.0 (http://bioinf.spbau.ru/spades) with a mean depth of coverage above 160x, and subsequently improved using Pilon v1.23. Multi-Locus Sequence Typing (MLST) was performed using the mlst v2.18.1 software (https://github.com/tseemann/mlst). Genome annotation was performed with the RAST server v2.0 (http://rast.nmpdr.org/).
The raw sequence reads of each isolate were deposited at ENA under the study accession numbers PRJEB42628 and PRJNA505131.
Funding
This work was supported by GenomePT (ref. POCI-01-0145-FEDER-022184) from Fundação para a Ciência e Tecnologia, Portugal.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Visualization for RNA transcript quality control and comparison of per base quality score Q. The images were taken before (A) and after (B) the quality trimming procedure (which removes reads with Q ≤ 20) to estimate the effect of trimming. The quality score Q is plotted against read position using the FastQC package in Galaxy (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/). The background color indicates the quality of the read: red for low quality, orange for medium quality, green for good quality. The red line shows the median of the measured values (yellow boxes are the inter-quartile range) and the blue line represents the mean quality. (ZIP 81 kb)
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Methylome of the freshwater snail Biomphalaria glabrata. DNA was extracted from the feet of 10 individuals of B. glabrata originally isolated from Brazil. These snails have been cultivated in the laboratory since 1960. Tissues were ground at 4°C and incubated in 1 ml of lysis buffer (20 mM TRIS pH 8; 1 mM EDTA; 100 mM NaCl; 0.5% SDS) with 0.3 mg of proteinase K at 55°C overnight. Afterwards, the lysate was purified with phenol-chloroform and the DNA was precipitated with isopropanol. The extracted DNA (around 138 ng/µL) was pooled in equivalent amounts and Whole Genome Bisulfite Sequencing (WGBS) was done by GATC-biotech (www.gatc-biotech.com). The principle of this treatment is to convert non-methylated cytosines of gDNA into deoxy-uracil, whereas methylated cytosines remain intact. WGBS was done according to the Lister protocol (sequence 2 forward strands only). The reference genome (Biomphalaria-glabrata-BB02_SCAFFOLDS_BglaB1.fa) and annotation (Biomphalaria-glabrata-BB02_BASEFEATURES_BglaB1.3.gff3) used in this project are available on VectorBase (https://www.vectorbase.org/). To align our short reads, we compared two dedicated bisulfite mapping tools, BSMAP 1.0.0 (https://code.google.com/p/bsmap/) and Bismark 0.10.2 (www.bioinformatics.babraham.ac.uk/projects/bismark/), for efficiency and convenience, in order to work with the more suitable one on our datasets. IGV (Integrative Genomics Viewer, https://www.broadinstitute.org/igv/) was used to visualize the final alignments.
BSMAP performed better than Bismark and was used for downstream analyses. Without default parameters alignment efficiency for BSMAP is 47.1%; allowing for 2 mismatches increases it to 55.6%. Methylation occurs predominantly in CpGs (C methylated in CpG context: 12.4%, C methylated in CHG context: 0.5%, C methylated in CHH context: 0.5%). The major part of CpG sites, 95.7%, were unmethylated; of the remaining 4.3% of CpG sites, around 3.8% had low methylation and 0.5% were completely methylated. Methylation is of the mosaic type. Methylation is relatively low, with 1.2% of total cytosines. Our analyses suggested that conserved genes and genes with stable expression are localized in highly methylated regions of the genome. Finally, we see that repetitive sequences were predominantly situated in regions of low methylation in B. glabrata.
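The CpG, CHG and CHH contexts reported above are defined by the one or two bases that follow a cytosine (H stands for A, C or T). The sketch below shows how forward-strand cytosines could be assigned to these contexts; it is an illustration only, not part of BSMAP or Bismark, and the function names and example sequence are hypothetical.

```python
def cytosine_context(seq, i):
    """Classify the cytosine at index `i` of `seq` (forward strand only).

    Returns "CpG", "CHG", "CHH", or None when the base is not a C or the
    context cannot be determined (sequence end or ambiguous base).
    """
    seq = seq.upper()
    if seq[i] != "C":
        return None
    following = seq[i + 1:i + 3]
    if following[:1] == "G":
        return "CpG"
    # CHG and CHH need two unambiguous downstream bases.
    if len(following) < 2 or any(base not in "ACGT" for base in following):
        return None
    return "CHG" if following[1] == "G" else "CHH"


def count_contexts(seq):
    """Count forward-strand cytosines per context."""
    totals = {"CpG": 0, "CHG": 0, "CHH": 0}
    for i in range(len(seq)):
        context = cytosine_context(seq, i)
        if context is not None:
            totals[context] += 1
    return totals


if __name__ == "__main__":
    print(count_contexts("ACGTCAGCTTC"))
```

A full analysis would also scan the reverse strand and combine these counts with per-site methylation calls from the aligner.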
Wiggle files were generated for CpG pairs only.
Produced at IHPE (http://ihpe.univ-perp.fr/)
VCF files
VCF files contain raw unfiltered genotypes from 66 olive baboons (Papio anubis) from the Southwest National Primate Research Center (SNPRC). Genomes are aligned to the Panubis1.0 reference genome (GCA_008728515.1, Batra et al., 2020 (https://doi.org/10.1093/gigascience/giaa134)). Sequencing was performed with HiSeq 4000 and X machines (450 bp mean insert size, 150 bp x 150 bp paired-end sequencing) using DNA extracted from blood samples. Sequences generated for this study (n=23) were combined with previously generated sequence data from Robinson et al., 2019 (https://doi.org/10.1101/gr.247122.118) and Wu et al., 2020 (https://doi.org/10.1371/journal.pbio.3000838). All raw sequence data are available from the Sequence Read Archive under BioProject PRJNA433868. Median depth of coverage across samples is 35.6X. Briefly, reads were trimmed with TrimGalore v0.6.4 (https://www.bioinformatics.babraham.ac.uk/projects/trim_galore) using the following options: -q 20 --stringency 1 --len...
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Unmapped paired-end sequences from an Illumina HiSeq4000 sequencer were assessed by FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/). Sequence adapters were removed, and reads were quality trimmed using Trimmomatic_0.36. The reads were mapped against the reference mouse genome (mm10) and counts per gene were calculated using annotation from GENCODE M25 (http://www.gencodegenes.org/) using STAR_2.7.2b.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
RNA-seq reads were processed using RSEM after adapter trimming with Trim Galore! (version 0.4.1), a wrapper script that automates quality and adapter trimming as well as quality control (https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/). As the reference mapping tool, Bowtie2 (version 2.3.2) was run from within RSEM (version 1.2.31), following a short tutorial (https://github.com/bli25ucb/RSEM_tutorial).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Experimental procedures for deep enzymology reactions with randomized substrates: For analysis of flanking sequence preferences of the TET enzymes, a similar approach as described for DNMTs (Emperle et al., 2019; Gao et al., 2020; Adam et al., 2020; Dukatz et al., 2020) was used. Briefly, the following single-stranded oligonucleotides containing a methylated or hydroxymethylated CpG or CpH site flanked by 10 randomized nucleotides on either side were obtained from IDT and primer extension was performed to obtain the double stranded DNA substrates. A CpN substrate was prepared as a mixture of CpG and CpH in a 1:3 ratio. For the randomized hydroxymethylated substrate, the single-stranded oligo was purchased coupled to Desthiobiotin-TEG. Primer extension was conducted and the substrate was purified via Streptavidin beads (Dynabeads M-280, ThermoFisher Scientific) and eluted with a biotin solution.
HM rand. GAGTGTGACTAGGCTCTCACTGCCNNNNNNNNNN mC GNNNNNNNNNNGAGAGGAGACCTAGTGAGAAG
OH rand. GAGTGTGACTAGGCTCTCACTGCCNNNNNNNNNN hmC GNNNNNNNNNNGAGAGGAGACCTAGTGAGAAG
CH rand. GAGTGTGACTAGGCTCTCACTGCCNNNNNNNNNN mC HNNNNNNNNNNGAGAGGAGACCTAGTGAGAAG
The randomized double stranded substrates were incubated with the TET enzyme at 37 °C for 45 min (CN context) or 1 h (CG context) using mixtures containing 1x reaction buffer (50 mM HEPES pH 6.8, 100 mM NaCl, 1 mM DTT, 1 mM alpha-ketoglutarate and 2 mM ascorbic acid), 100 µM ammonium iron(II) sulfate, using different enzyme concentrations and variable amounts of dialysis buffer to keep a fixed salt and glycerol concentration. Reactions were stopped by freezing in liquid nitrogen. Afterwards, Proteinase K (NEB) treatment was used for enzyme inactivation for 1 h at 50 °C, followed by purification with a PCR clean-up kit (MACHEREY-NAGEL). Hairpin ligation and bisulfite conversion was performed using EZ DNA Methylation-Lightning kit (ZYMO).
Library preparation for Illumina Next Generation Sequencing was conducted using a two-step PCR approach as described in Gao et al. (2020). Unique combinations of barcode and index sequences were introduced to distinguish different samples and experiments. For bioinformatic analysis of the NGS datasets, a local instance of a Galaxy server (Afgan et al., 2018) was used. Sequence reads were trimmed with Trim Galore! (Galaxy Version 0.4.3.1, https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/), keeping only the sequences with a quality score above 20 for further analysis, and filtered according to the expected DNA size using the Filter FASTQ tool (Blankenberg et al., 2010). The data in this entry contain the Fastq sequence files and extracted DNA sequences obtained with the hemimethylated CpG substrate (HM CG), hemimethylated CpN substrate mixture (HM CN) and hemihydroxymethylated CpG substrate (OH CG). Enzyme kinetics were conducted with TET1 and two versions of TET2 (V1 and V2) as described in the accompanying paper. Individual repeats of experiments are indicated with R1-R5 as appropriate. Control reactions refer to samples treated identically but without enzyme. The cited references are listed in the accompanying publication to this dataset.
Linux platform
Purpose: The goals of this study are to compare NGS-derived transcriptome profiling (RNA-seq) from the SEEDSTICK mutant in Arabidopsis with a wild type, to unveil the role of this transcription factor in seed development and decipher the impact of this factor in PAs metabolism. Methods: Total RNA was extracted from two biological replicates from both wild-type and stk mutant inflorescences and siliques until 5 DAP with the Qiagen Kit according to the manufacturer's instructions. DNA contaminations were removed using the PROMEGA RQ1 RNase-Free DNase according to the manufacturer's instructions. RNA integrity was analyzed by gel electrophoresis and was validated on a Bioanalyzer 2100 (Agilent, Santa Clara, CA); RNA Integrity Number (RIN) values were greater than 7 for all the samples. In order to confirm that STK was not expressed in stk mutant samples, STK expression was checked by real time PCR with primers RT 780 (5'-TGCGATGCAGAAGTTGCGCTC-3') and RT 781 (5'-AGTACGCGGCATTGATTTCTTG-3'). Sequencing libraries were prepared according to the manufacturer's instructions with the TruSeq RNA Sample Prep kit (Illumina Inc.) and sequenced on Illumina HiSeq2000 in one lane, single-read 50 bp. The processing of fluorescent images into sequences, base-calling and quality value calculations were performed using the Illumina data processing pipeline (version 1.8). Raw reads were filtered to obtain high-quality reads by removing low-quality reads containing more than 30% bases with Q < 20. Finally, a quality control of the raw sequence data was performed using FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/). Results: A total of 102,278,242 reads passed a quality filter and 85% were mapped back to the Arabidopsis TAIR10 genome. Approximately 90% mapped uniquely to only one location and could be assigned to a single annotated TAIR10 gene. Normalization of expression values was performed using RPKM values.
All other parameters were kept at default levels. The CLC Genomics Workbench was further used to determine all differentially expressed transcripts found in each cDNA library. Baggerly's test and an FDR correction were used for statistical analysis of samples. Our analysis revealed that 156 genes were upregulated, whereas for 90 genes a reduction in their mRNA level was observed in the stk mutant when compared to wild type. Data analysis revealed a significant enrichment for terms related to the phenylpropanoid metabolic process, flavonoid biosynthetic process as well as cellular amino acid derivative metabolic process. Conclusion: Our genome-wide transcriptomic analysis suggests that the ovule identity factor STK is involved in the regulation of several metabolic processes, providing a strong connection between cell fate determination, development and metabolism. In particular, we characterize, through phenotypic, genetic, biochemical and transcriptomic approaches, the role of STK in PAs biosynthesis. Our results indicate that STK exerts this role through the direct regulation of the gene encoding BANYULS/ANTHOCYANIDIN REDUCTASE (BAN/ANR), which converts anthocyanidins into their corresponding 2,3-cis-flavan-3-ols. Our study also demonstrates that the levels of H3K9ac chromatin modification directly correlate with the active state of BAN in an STK-dependent way. This supports the theory that MADS-domain proteins control the expression of their target genes through the modification of chromatin states. STK might recruit or negatively regulate histone modifying factors to control their activity.
Moreover, we show that STK controls, through a complex regulatory network, not only BAN directly but also other regulators of this key gene in tannin production.
mRNA profiles from both Arabidopsis wild-type and stk mutant inflorescences and siliques until 5 DAP were generated by deep sequencing, in duplicate. Libraries were prepared according to the manufacturer's instructions with the TruSeq RNA Sample Prep kit (Illumina Inc.) and sequenced on Illumina HiSeq2000 in one lane, single-read 50 bp. In order to confirm that STK was not expressed in stk mutant samples, STK expression was checked by real time PCR with primers RT 780 (5'-TGCGATGCAGAAGTTGCGCTC-3') and RT 781 (5'-AGTACGCGGCATTGATTTCTTG-3').
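The read filter used in this study (discard reads in which more than 30% of bases have Q < 20) and the RPKM normalization can be sketched as follows. This is a hedged illustration rather than the pipeline actually used: the Phred+33 quality encoding and all function names are assumptions.

```python
def low_quality_fraction(quality_string, threshold=20, offset=33):
    """Fraction of bases whose Phred score is below `threshold`.

    Assumes Phred+33 encoded quality strings (an assumption, not stated
    in the dataset description).
    """
    scores = [ord(ch) - offset for ch in quality_string]
    return sum(score < threshold for score in scores) / len(scores)


def passes_filter(quality_string, max_fraction=0.30):
    """Keep a read unless more than `max_fraction` of its bases have Q < 20."""
    if not quality_string:
        return False
    return low_quality_fraction(quality_string) <= max_fraction


def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads Per Kilobase of transcript per Million mapped reads."""
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


if __name__ == "__main__":
    print(passes_filter("IIIIIIIIII"))   # all bases Q40
    print(passes_filter("#####IIIII"))   # half of the bases Q2
    print(rpkm(100, 1000, 1_000_000))
```

RPKM divides by gene length and sequencing depth, so values are comparable between genes of different lengths within one library.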
We collected ddRADseq data for 84 individuals following the protocol described in Peterson et al. (2012) and following parameters specified in Streicher et al. (2014). Our final library was analyzed on one Illumina HiSeq2500 lane (150 bp single end reads) at the Genomic Sequencing and Analysis Facility (GSAF) at The University of Texas (https://www.wikis.utexas.edu/display/GSAF). The workflow for data processing, filtering, and formatting was automated using scripts available from Portik et al. 2017 (https://github.com/dportik/Stacks_pipeline). In brief, the raw Illumina reads were demultiplexed using stacks v1.35 (Catchen, Hohenlohe, Bassham, Amores, & Cresko, 2013), the restriction site overhangs were removed using the fastx_trimmer module of the fastx-toolkit (www.hannonlab.cshl.edu/fastx_toolkit), and the sequencing quality was examined on a per sample basis using fastqc v0.10.1 (www.bioinformatics.babraham.ac.uk/projects/fastqc). Loci were created, catalogued, and identified us...
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Becker muscular dystrophy (BMD) is a rare X-linked recessive neuromuscular disorder, frequently caused by in-frame deletions in the DMD gene that result in the production of a truncated, yet functional, dystrophin protein. The consequences of BMD-causing in-frame deletions on the organism are difficult to predict, especially in regard to long-term prognosis. Here, we used CRISPR-Cas9 to generate a new Dmd Δ52-55 mouse model by deleting exons 52-55 in the Dmd gene, resulting in a BMD-like in-frame deletion. To delineate the long-term effects of this deletion, we studied these mice over 52 weeks by performing histology and echocardiography analyses and assessing motor functions. To further delineate the effects of the exons 52-55 in-frame deletion, we performed RNA-Seq pre- and post-exercise and identified several differentially expressed pathways that could explain the abnormal muscle phenotype observed at 52 weeks in the BMD model.
This dataset shows the results and raw data of the RNA-sequencing and transcriptomic analysis for 52-week-old exercised and non-exercised mice (4 BMD, 4 WT and 4 DMD, as mentioned on the names of each file).
1. Due to size restrictions, this RNA-Seq dataset will be published on Zenodo in 3 parts. This first part contains the data for the exercised mice, including the fastq (R1 and R2) and associated (md5) files for the 4 BMD mice (15315-15318) and 2 DMD mice (15319 and 15320), all the raw gene counts (txt files), and all the differentially expressed genes (tsv files).
Workflow (performed by TCAG at SickKids):
2. RNA-Seq Library and Reference Genome Information
Type of library: stranded, paired end
Genome reference sequence: GRCm39, M31 Gencode gene models.
3. Read Pre-processing, Alignment and Obtaining Gene Counts
3.1 Read Pre-processing
The sequencing data is in FASTQ format. The quality of the data is assessed using FastQC v.0.11.5 (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/).
Adaptors are trimmed using Trim Galore (http://www.bioinformatics.babraham.ac.uk/projects/trim_galore/) v. 0.5.0. Trim Galore is running Cutadapt (https://cutadapt.readthedocs.org/en/stable/) v. 1.10. Trim Galore is run with the following parameters:
-q 25 – the reads are trimmed from the 3' end base by base, trimming stops if the quality of the base is greater than 25;
--clip_R1 6, --clip_R2 6 – clip the first 6 nucleotides from the 5' ends of read 1 and read 2;
--stringency 5 – at least 5 nucleotides overlap with the Illumina primer sequence are needed for trimming;
--length 40 – any read that is shorter than 40 nucleotides as a result of trimming is discarded;
--paired – only pairs of reads are retained (for paired-end reads only, not for single reads).
The type of adaptor is automatically detected by screening the first 1 million sequences of the first specified file for the first 12/13 nucleotides of the standard Illumina or Nextera primers and the sequence from the start of the primer to the 3' end of the read is trimmed.
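The Trim Galore parameters listed above can be assembled into a single command line. The sketch below only builds and prints the command; it does not run it. The option values come from the list above, while the function name and the FASTQ file names are hypothetical placeholders.

```python
import shlex


def trim_galore_command(read1, read2):
    """Build the Trim Galore argument list for one paired-end sample."""
    return [
        "trim_galore",
        "-q", "25",            # quality trimming threshold at the 3' end
        "--clip_R1", "6",      # clip 6 nt from the 5' end of read 1
        "--clip_R2", "6",      # clip 6 nt from the 5' end of read 2
        "--stringency", "5",   # minimum adaptor overlap required for trimming
        "--length", "40",      # discard reads shorter than 40 nt after trimming
        "--paired",            # keep only intact read pairs
        read1,
        read2,
    ]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    command = trim_galore_command("sample_R1.fastq.gz", "sample_R2.fastq.gz")
    print(shlex.join(command))
    # To execute it, pass the list to subprocess.run(command, check=True).
```

Keeping the arguments as a list avoids shell quoting problems if file names contain spaces.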
The quality of the trimmed reads is re-assessed with FastQC.
The trimmed reads are also screened for presence of rRNA and mtRNA sequences using FastQ-Screen v.0.10.0 (http://www.bioinformatics.babraham.ac.uk/projects/fastq_screen/).
To assess the read distribution, positional read duplication and to confirm the strandedness of the alignments we use the RSeQC package (http://rseqc.sourceforge.net/), v. 2.6.2. The distribution of reads across exonic, intronic and intergenic sequences is assessed by the read_distribution.py program, infer_experiment.py is used for confirming strandedness, and read_duplication.py is used to obtain the positional read duplication (percentage of reads mapping to exactly the same genomic location). A sufficient proportion of reads should map to exonic sequences (ideally > 70-80%). A large number of reads mapping to intronic sequences in a poly-A mRNA library suggests a significant presence of pre-mRNA or other issues with RNA preparation. For stranded RNA-seq experiments the majority of the reads should map exclusively to one strand, the same as or opposite to the transcript, depending on the library preparation method. For non-stranded experiments the reads should be equally distributed between both strands.
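The strandedness check described above comes down to comparing the fraction of reads on the transcript strand with the fraction on the opposite strand. A minimal sketch of that decision follows; it is not part of RSeQC, and the function name and the 0.8 cutoff are assumptions chosen for illustration.

```python
def classify_strandedness(frac_same, frac_opposite, min_fraction=0.8):
    """Interpret strand fractions such as those reported by a strandedness check.

    `frac_same` is the fraction of reads on the transcript strand and
    `frac_opposite` the fraction on the opposite strand. The 0.8 cutoff is
    a hypothetical choice, not a value from the workflow.
    """
    if frac_same >= min_fraction:
        return "stranded, same strand as transcript"
    if frac_opposite >= min_fraction:
        return "stranded, opposite strand to transcript"
    return "non-stranded or ambiguous"


if __name__ == "__main__":
    print(classify_strandedness(0.03, 0.95))
    print(classify_strandedness(0.49, 0.50))
```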
3.2. Read Alignment
The raw trimmed reads are aligned to the reference genome using the STAR aligner, v.2.6.0c. (https://github.com/alexdobin/STAR, https://academic.oup.com/bioinformatics/article/29/1/15/272537). The alignments are contained in the .bam files. The “.bam” together with the “.bai” files can be used for viewing of the alignments in the Integrative Genomics Viewer (IGV, http://software.broadinstitute.org/software/igv/).
3.3. Obtaining Gene Counts
The filtered STAR alignments are processed to extract raw read counts for genes using htseq-count v.0.6.1p2 (HTSeq, http://www-huber.embl.de/users/anders/HTSeq/doc/overview.html). Reads are assigned to genes by htseq-count in the “intersection-nonempty” mode, i.e. if a read overlaps with two overlapping genes and the overlap to gene A is greater than the overlap to gene B, the read is counted towards gene A, while if a read overlaps equally with gene A and gene B, then it is not counted towards either gene. To avoid introducing bias in the expression results, htseq-count does not count reads with multiple alignments. Only uniquely mapping reads are counted.
4. Pre-processing, Alignment and Gene Counts QC
MultiQC (https://multiqc.info/) is a reporting tool that aggregates statistics generated by bioinformatics analyses across multiple samples. MultiQC v. 1.14 was used to generate a consolidated report from FastQC screening of both untrimmed and trimmed reads, and from RSeQC, FastQ Screen, STAR and htseq-count results. The MultiQC report is contained in MultiQC_Report_*.html file.
5. DGE Analysis with edgeR
Differential expression analysis was done with the edgeR R package v.3.28.1, using R v.3.6.1 (http://www.bioconductor.org/packages/release/bioc/html/edgeR.html). The data set was filtered to retain only genes whose gene counts were >50 in at least 3 samples. This is intended to remove genes that are not expressed, or expressed at a very low level.
The method used for normalizing the data was TMM, implemented by the calcNormFactors(y) function. All samples were normalized and filtered together. The glmLRT functionality in edgeR was used for the differential expression tests, with sample group taken into account.
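The expression filter described above (gene counts >50 in at least 3 samples) can be sketched in a few lines. This is an illustration in Python rather than the R code used with edgeR, and it does not reproduce the TMM normalization; the function name and toy count table are hypothetical.

```python
def filter_genes(count_table, min_count=50, min_samples=3):
    """Keep genes with a raw count above `min_count` in at least `min_samples` samples.

    `count_table` maps a gene ID to its list of per-sample raw counts.
    """
    return {
        gene: counts
        for gene, counts in count_table.items()
        if sum(count > min_count for count in counts) >= min_samples
    }


if __name__ == "__main__":
    # Made-up counts for three genes across four samples.
    toy = {
        "gene_1": [60, 70, 80, 0],   # kept: three samples above 50
        "gene_2": [60, 70, 50, 0],   # dropped: 50 is not above 50
        "gene_3": [0, 0, 0, 0],      # dropped: not expressed
    }
    print(sorted(filter_genes(toy)))
```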
EdgeR Results Legend:
· GeneID – Ensembl Gene ID;
· Chr.Start.End - gene coordinates;
· GeneName, GeneType, etc. – Gene attributes, derived from the genome annotation;
· logFC - Log2 Fold Change (use this column for selection of DEGs);
· logCPM - Log2 Counts Per Million, average for all libraries;
· LR – Statistic calculated by the LR-Test;
· PValue - Differential expression P value;
· FDR – Differential expression False Discovery Rate, calculated by the Benjamini-Hochberg method (use this column for selection of DEGs);
· (columns labeled with sample names) – Fragments Per Kilobase of transcript per Million mapped reads (FPKMs) for the given samples.
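The FDR column is described as a Benjamini-Hochberg adjustment of the PValue column. For readers who want to see what that adjustment does, here is a minimal pure-Python sketch of the standard procedure; it is not the edgeR code, and the function name and example p-values are made up.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


if __name__ == "__main__":
    # Made-up p-values for four genes.
    print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

Each p-value is scaled by the number of tests divided by its rank, and the running minimum ensures that a gene with a smaller p-value never receives a larger FDR than one with a larger p-value.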