https://spdx.org/licenses/CC0-1.0.html
The pentatricopeptide repeat protein GENOMES UNCOUPLED1 (GUN1) is required for chloroplast-to-nucleus signalling in response to plastid stress during chloroplast development in Arabidopsis thaliana, but its exact molecular function remains unknown. Current data on GUN1 function are limited to Arabidopsis, so we set out to investigate the origin and evolution of the land plant GUN1 proteins. We retrieved GUN1 sequences from 76 phylogenetically diverse land plants and developed a GUN1 sequence profile using hmmbuild (http://hmmer.org). We then used this profile to systematically analyse the presence/absence of GUN1 sequences in transcriptomes from land plants and streptophyte algae. This dataset includes the GUN1 profile we developed, the code we used to analyse the results of screening over 500,000 PPR protein sequences with the profile, and an alignment of the 893 GUN1 sequences that we obtained. We used these data to show that GUN1 is highly conserved across land plants but missing from the Rafflesiaceae, which lack chloroplast genomes. Our findings suggest that GUN1 is an ancient protein that evolved within the streptophyte algal ancestors of land plants before the first plants colonised land more than 470 million years ago. This dataset also includes transcript count data from an RNA-seq experiment looking at gene expression in liverwort Marchantia polymorpha wild type and Mpgun1 mutant spore samples grown in the presence or absence of spectinomycin. We used these data to show that GUN1 does not act significantly in chloroplast retrograde signalling in the liverwort M. polymorpha. Its primary role is likely to be in chloroplast gene expression, and its role in chloroplast retrograde signalling probably evolved more recently.
Methods
Dataset 1
Arabidopsis and Marchantia GUN1 sequences were retrieved from TAIR (https://www.arabidopsis.org/) and MarpolBase (https://marchantia.info/), respectively.
Full-length GUN1 sequences were obtained from a representative set of land plants by protein BLAST searches (https://blast.ncbi.nlm.nih.gov/Blast.cgi) using the Arabidopsis sequence to search GenBank. A set of 76 phylogenetically diverse GUN1 sequences (including representatives from algae, bryophytes, lycophytes, ferns, gymnosperms, and angiosperms) was aligned using the G-INS-i algorithm in MAFFT v7 (Katoh & Standley, 2013). The most highly conserved region of this alignment (876 positions) was used to generate a GUN1 sequence profile with hmmbuild from the HMMER package (v3.3.1) (http://hmmer.org; Eddy, 2011). This profile was in turn used to search for GUN1 sequences (using hmmsearch with default parameters) in translations of various transcriptome datasets, most notably the putative PPR protein sequences compiled by Gutmann et al. (2020) from the 1KP data set (Carpenter et al., 2019). The 1KP transcriptomes were filtered to remove those encoding fewer than 10,000 distinct proteins (to avoid trivial false negatives due to low coverage) and those from organisms other than green algae and land plants. This resulted in 1128 analysable samples from 894 plant species. Specific searches were also made in data sets of particular interest (whole genome shotgun or transcriptome shotgun assemblies selected via the NCBI Sequence Set Browser, https://www.ncbi.nlm.nih.gov/Traces/wgs/). These additional data sets included genomes or transcriptomes where GUN1 could not be found in the corresponding 1KP samples, and also whole genome shotgun data from Sapria himalayana (Cai et al., 2021) and whole transcriptome data from Rafflesia cantleyi (Lee et al., 2016), both holoparasites from the Rafflesiaceae.
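The coverage filter described above (dropping transcriptomes that encode fewer than 10,000 distinct proteins) can be illustrated with a minimal Python sketch. This is not the code distributed with the dataset; the function names and the (sample code, protein sequence) input layout are assumptions made for illustration only.

```python
from collections import defaultdict

# Threshold quoted in the text; samples below it are treated as low coverage.
MIN_DISTINCT_PROTEINS = 10_000


def count_distinct_proteins(records):
    """Map each sample code to its number of distinct protein sequences.

    `records` is an iterable of (sample_code, protein_sequence) pairs
    (a hypothetical layout, chosen for this example).
    """
    seen = defaultdict(set)
    for sample, seq in records:
        seen[sample].add(seq)
    return {sample: len(seqs) for sample, seqs in seen.items()}


def analysable_samples(records, min_proteins=MIN_DISTINCT_PROTEINS):
    """Return the sorted sample codes that meet the coverage threshold."""
    counts = count_distinct_proteins(records)
    return sorted(s for s, n in counts.items() if n >= min_proteins)
```

A sample in which GUN1 is not found is only informative if it passes this filter; otherwise the absence may simply reflect shallow sequencing.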
Katoh K, Standley DM. 2013. MAFFT Multiple Sequence Alignment Software Version 7: Improvements in Performance and Usability. Molecular biology and evolution 30: 772–780. Eddy SR. 2011. Accelerated Profile HMM Searches. PLoS computational biology 7: e1002195. Gutmann B, Royan S, Schallenberg-Rüdinger M, Lenz H, Castleden IR, McDowell R, Vacher MA, Tonti-Filippini J, Bond CS, Knoop V, et al. 2020. The Expansion and Diversification of Pentatricopeptide Repeat RNA-Editing Factors in Plants. Molecular plant 13: 215–230. Carpenter EJ, Matasci N, Ayyampalayam S, Wu S, Sun J, Yu J, Jimenez Vieira FR, Bowler C, Dorrell RG, Gitzendanner MA, et al. 2019. Access to RNA-sequencing data from 1,173 plant species: The 1000 Plant transcriptomes initiative (1KP). GigaScience 8. Cai L, Arnold BJ, Xi Z, Khost DE, Patel N, Hartmann CB, Manickam S, Sasirat S, Nikolov LA, Mathews S, et al. 2021. Deeply Altered Genome Architecture in the Endoparasitic Flowering Plant Sapria himalayana Griff. (Rafflesiaceae). Current biology: CB 31: 1002-1011.e9. Lee X-W, Mat-Isa M-N, Mohd-Elias N-A, Aizat-Juhari MA, Goh H-H, Dear PH, Chow K-S, Haji Adam J, Mohamed R, Firdaus-Raih M, et al. 2016. Perigone Lobe Transcriptome Analysis Provides Insights into Rafflesia cantleyi Flower Development. PloS one 11: e0167958.
Dataset 2
Dataset 2 is derived from the NCBI SRA BioProject PRJNA800059, which contains paired-end, random-primed, rRNA-depleted, strand-specific RNA-seq reads from 12 liverwort Marchantia polymorpha wild type (accession Takaragaike) or Mpgun1 mutant spore samples grown in the presence or absence of spectinomycin. The raw read data can be obtained from NCBI SRA.
M. polymorpha spores were sterilised and plated on ½ Gamborg’s medium (Duchefa Biochemie) supplemented with 1.2 % agar and 500 μg⋅ml⁻¹ spectinomycin (an inhibitor of plastid translation). The spores were germinated under long day conditions for 48 hours, after which they were resuspended in 1 ml of sterile water, transferred into a microcentrifuge tube, and spun down at 6,000 rpm for 1 minute. Water was removed, and the spore pellet was flash-frozen in liquid nitrogen. RNA was extracted from spores using the Direct-Zol RNA MINIprep kit (Zymo Research) and its quality was estimated on an Agilent 4200 TapeStation (Agilent). Three independent biological replicates were extracted for each genotype/condition. RNA was quantified using a NanoDrop spectrophotometer (Thermo Fisher) and DNase treated using Turbo DNase (Ambion). Transcriptome libraries were prepared using the TruSeq Stranded Total RNA kit with Ribo-Zero Plant (Illumina). The libraries were sequenced on an Illumina HiSeq 4000 platform (150 nt paired-end reads) at Novogene, Hong Kong.
Optical duplicate reads were first removed with clumpify (parameters: dedupe optical dist=40) from the bbmap package (https://sourceforge.net/projects/bbmap/) and adapters were trimmed with bbduk (parameters: ktrim=r k=23 mink=11 hdist=1 tpe tbo ftm=5). The reads were then assigned to transcripts using Salmon v1.3.0 (Patro et al., 2017) (parameters: -l A --validateMappings) against an index prepared with the M. polymorpha MpTak_v6.1 reference genome and cDNA assemblies (https://marchantia.info/). Differential expression analyses were carried out using DESeq2 (Love et al., 2014). Functional annotations for the MpTak_v5.1 genome release were used to identify M. polymorpha photosynthesis-associated nuclear genes.
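As a quick check on the experimental layout, two genotypes crossed with two treatments and three biological replicates give the 12 samples in the BioProject. The sketch below enumerates that design in Python; the labels are hypothetical and do not correspond to the sample names used in the SRA records.

```python
from itertools import product

# Labels are illustrative only; they are not the SRA sample names.
GENOTYPES = ["wild type", "Mpgun1"]
TREATMENTS = ["without spectinomycin", "with spectinomycin"]
REPLICATES = [1, 2, 3]


def build_design():
    """Enumerate every genotype x treatment x replicate combination."""
    return [
        {"genotype": g, "treatment": t, "replicate": r}
        for g, t, r in product(GENOTYPES, TREATMENTS, REPLICATES)
    ]
```

A table of this shape is the kind of sample sheet a differential expression tool would typically need alongside the transcript counts.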
Patro R, Duggal G, Love MI, Irizarry RA, Kingsford C. 2017. Salmon provides fast and bias-aware quantification of transcript expression. Nature methods 14: 417–419. Love MI, Huber W, Anders S. 2014. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome biology 15: 550.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Performance metrics of 18 models fitted and the selected model.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset used in the Article
The programs used were Augustus, HMMER and Gromacs.
The files are:
-Augustusresults (results obtained from Augustus)
-BirdsTree (the data to create the bird tree)
-Commands (the commands used for Augustus and HMMER)
-Hmmer.Finding (the findings from HMMER)
-Hmmer Results (results from the searches)
-HmmProfiles (the profiles used to search in Augustus)
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Performance metrics of rbf-SVM68 model on real data.
https://spdx.org/licenses/CC0-1.0.html
We developed genetic resources for two North American frogs, Lithobates clamitans and Pseudacris regilla, widespread native amphibians that are potential indicator species of environmental health. For both species, mRNA from multiple tissues was sequenced using 454 technology. De novo assemblies with Mira3 resulted in 50,238 contigs (N50 = 687 bp) and 48,213 contigs (N50 = 686 bp) for L. clamitans and P. regilla, respectively, after clustering with CD-Hit-EST and purging contigs below 200 bp. We performed BLASTX similarity searches against the Xenopus tropicalis proteome and, for predicted ORFs, HMMER similarity searches against the Pfam-A database. Because there is broad interest in amphibian immune factors, we manually annotated putative antimicrobial peptides. To identify conserved regions suitable for amplicon re-sequencing across a broad taxonomic range, we performed an additional assembly of public short-read transcriptome data derived from two species of the genus Rana and identified reciprocal best TBLASTX matches among all assemblies. Although P. regilla, a hylid frog, is substantially more diverged from the ranid species, we identified 56 genes that were sufficiently conserved to allow non-degenerate primer design with Primer3. In addition to providing a foundation for comparative genomics and quantitative gene expression analysis, our results enable quick development of nuclear sequence-based markers for phylogenetics or population genetics.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Nuclear receptor sequences were identified in the Danio rerio and Gadus morhua proteomes (downloaded from ENSEMBL) by performing an HMM search (hmmer) with the Pfam profile for the ligand-binding domain of nuclear hormone receptor (PF00104). The identified sequences were aligned using ClustalX 2.1 with default parameters.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Top 3 k-mer feature sets, ranked by the feature importance assigned by our feature-selection flowchart.
To study the impact of wheat streak mosaic virus on global gene expression in wheat curl mite, we generated a de novo transcriptome assembly using 50 x 50 paired-end reads from the Illumina HiSeq 2500. Reads were assembled using Trinity (version 2.0.6) and contigs greater than 200 nt were retained. All assembled transcripts were annotated using the Trinotate pipeline, with blastp and blastx searches against the Swiss-Prot/UniProt database, HMM searches against the Pfam-A database, blastp searches against the non-redundant protein database, and SignalP and TMHMM predictions. To reduce noise from low-abundance transcripts not well supported by the data, we filtered the assembly to retain only those transcripts with TPM values >= 0.5.
Resources in this dataset:
Resource Title: Raw Trinity Assembly. File Name: Trinity.fasta.txt. Resource Description: Raw Trinity assembly obtained from wheat curl mite using 50 x 50 Illumina paired-end reads from the HiSeq 2500. Resource Software Recommended: Notepad++, url: https://notepad-plus-plus.org/; TextWrangler, url: https://itunes.apple.com/us/app/textwrangler/id404010395?mt=12
Resource Title: Trinotate annotations for raw Trinity assembly. File Name: trinotate_annotations_report.xls. Resource Description: Trinotate results for raw wheat curl mite transcriptome assembly. Resource Software Recommended: Excel, url: https://products.office.com/en-us/excel; LibreOffice Calc, url: https://www.libreoffice.org/discover/calc/
Resource Title: Blastp results versus non-redundant protein database. File Name: wheat_curl_mite_blastp_nr.txt (also listed as wheat_curl_mite_blastpnr.txt). Resource Description: Blastp results for protein coding unigenes from raw Trinity transcriptome assembly (wheat curl mite). Output format is default. Resource Software Recommended: Notepad++, url: https://notepad-plus-plus.org/; TextWrangler, url: https://itunes.apple.com/us/app/textwrangler/id404010395?mt=12
Resource Title: Protein predictions for raw Trinity transcriptome assembly (wheat curl mite). File Name: transcriptome.all.cds.pep.fasta.txt. Resource Description: Putative coding regions were predicted using Transdecoder. Default parameters were used in conjunction with Pfam-A searches to identify putative open reading frames (ORFs).
Resource Title: Protein predictions for final transcriptome assembly (wheat curl mite). File Name: transcriptome.all.cds.pep.fasta.txt. Resource Description: Protein coding regions were predicted using Transdecoder. ORFs were identified using default parameters in conjunction with Pfam-A searches. Resource Software Recommended: Notepad++, url: https://notepad-plus-plus.org/; TextWrangler, url: https://itunes.apple.com/us/app/textwrangler/id404010395?mt=12
Resource Title: Final Trinity transcriptome assembly for wheat curl mite. File Name: Trinity.mite.fasta.txt. Resource Description: Transcripts less than 200 nt and transcripts with TPM values less than 0.5 were removed from the assembly. In addition, transcripts whose coding sequences had highest scoring blastp matches to microbes were also removed from the assembly.
Resource Title: Nucleotide coding regions for final transcriptome assembly for wheat curl mite. File Name: transcriptome.mite.cds.fasta.txt. Resource Description: Nucleotide sequences corresponding to coding regions from the final transcriptome assembly for wheat curl mite. Open reading frames (ORFs) were predicted using Transdecoder. Default parameters, with the addition of the identification of Pfam-A domains, were used for ORF identification.
Resource Title: Trinotate annotations for final Trinity assembly (wheat curl mite). File Name: trinotate.mite.xls. Resource Description: Trinotate results for final wheat curl mite transcriptome assembly. Blastp and blastx searches against Swiss-Prot/UniProt were performed along with Pfam-A searches using HMMER. Signal peptides and transmembrane domains were also identified. Resource Software Recommended: Excel, url: https://products.office.com/en-us/excel; LibreOffice Calc, url: https://www.libreoffice.org/discover/calc/
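The length and abundance filters used to derive the final assembly can be expressed as a simple predicate. The Python sketch below is illustrative only: the function names and the dictionary layout are assumptions, the treatment of transcripts of exactly 200 nt is a guess (the description says both "greater than 200 nt were retained" and "less than 200 nt ... were removed"), and the removal of transcripts with microbial best hits is not modelled.

```python
def keep_transcript(length_nt, tpm, min_length_nt=200, min_tpm=0.5):
    """Return True if a transcript passes the length and TPM thresholds.

    Assumption: a transcript of exactly `min_length_nt` is kept.
    """
    return length_nt >= min_length_nt and tpm >= min_tpm


def filter_transcripts(transcripts):
    """Filter a {transcript_id: (length_nt, tpm)} mapping (hypothetical layout)."""
    return {
        tid: (length, tpm)
        for tid, (length, tpm) in transcripts.items()
        if keep_transcript(length, tpm)
    }
```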
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Raw TBLOUT-format output from HMMER analysis (using hmmsearch).
Query: HMMs from TIGRFAMs and Pfam, representing protein families involved in the synthesis and transport of the bacterial cell envelope, including lipid biosynthesis, peptidoglycan biosynthesis, translocation, etc. Also includes HMMs representing protein families involved in translation, including ribosomal proteins, aminoacyl-tRNA synthetases, etc.
Target: A select set of 85 genomes, representing a diversity of organisms with genomes smaller than 1 Mb (mostly obligate intracellular parasites and symbionts).
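For readers who want to work with the TBLOUT files directly, the following Python sketch reads the whitespace-delimited table written by hmmsearch with --tblout (18 fixed columns followed by a free-text description, with comment lines starting with "#"). The function names and the E-value cutoff are assumptions for illustration, not part of the dataset.

```python
def parse_tblout(lines):
    """Parse hmmsearch --tblout lines into a list of hit dictionaries."""
    hits = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header/footer comments
        fields = line.split(None, 18)  # the 19th field is the description
        if len(fields) < 18:
            continue  # malformed row
        hits.append({
            "target": fields[0],           # sequence that was hit
            "query": fields[2],            # profile HMM name
            "query_accession": fields[3],
            "evalue": float(fields[4]),    # full-sequence E-value
            "score": float(fields[5]),     # full-sequence bit score
            "description": fields[18].strip() if len(fields) > 18 else "",
        })
    return hits


def families_present(hits, max_evalue=1e-5):
    """Return the set of query HMMs with at least one hit below the cutoff.

    The default cutoff is a hypothetical choice for this example.
    """
    return {h["query"] for h in hits if h["evalue"] <= max_evalue}
```

Running `families_present` on the hits for each genome gives a presence/absence profile of the queried protein families.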
https://spdx.org/licenses/CC0-1.0.html
The RNA-binding pentatricopeptide repeat (PPR) family comprises hundreds to thousands of genes in most plants, but only a few dozen in algae, evidence of massive gene expansions during land plant evolution. The nature and timing of these expansions has not been well-defined due to the sparse sequence data available from early-diverging land plant lineages. We exploit the comprehensive OneKP dataset of over 1,000 transcriptomes from diverse plants and algae to establish a clear picture of the evolution of this massive gene family, focusing on the proteins typically associated with RNA editing, which show the most spectacular variation in numbers and domain composition across the plant kingdom. We characterise over 2,250,000 PPR motifs in over 400,000 proteins. In lycophytes, polypod ferns and hornworts, nearly 10% of expressed protein-coding genes encode putative PPR editing factors, whereas they are absent from algae and complex-thalloid liverworts. We show that rather than a single expansion, most land plant lineages with high numbers of editing factors have continued to generate novel sequence diversity. We identify sequence variation that implies functional differences between PPR proteins in seed plants versus non-seed plants and which we propose to be linked to seed-plant-specific editing cofactors. Finally, using the sequence variation across the dataset, we develop a structural model of the catalytic DYW domain associated with C-to-U editing and identify a clade of unique DYW variants that are strong candidates as U-to-C RNA editing factors, given their phylogenetic distribution and sequence characteristics.
Methods
Sources of data
Nucleotide sequences from the 1000 plants initiative (oneKP) (Matasci et al., 2014) were downloaded from http://web.corral.tacc.utexas.edu/OneKP/. The files used for this study are the SOAPdenovo-Trans (Xie et al., 2014) assemblies (file names of the form ‘xxxx-SOAPdenovo-Trans-assembly.fa’, where xxxx is the four-letter sample code).
Prediction of PPR motif arrays
Assemblies were translated in all six frames, retaining open reading frames of at least 31 amino acids (the length of an S motif, the shortest of the PPR motif variants). Hmmsearch from the HMMER 3.2.1 package (Eddy, 2018) was used to search for PPR motifs, initially using the motif profiles developed by Cheng et al. (2016). We subsequently re-defined the DYW motif and defined the new DYW:KP variant using oneKP sequences, and used these profiles in place of the DYW profile of Cheng et al. (2016). Hmmsearch settings were ‘--domtblout --noali -E 0.1’. Our own in-house code (PPRfinder) was used to select the best-scoring non-overlapping chain of motifs from the domain table output of hmmsearch, with a motif score threshold of 0 for all motif types, except SS motifs (score >= 10) and DYW motifs (score >= 30). All motifs on the same strand, whatever frame they were in, were considered in this process. Amino acid sequences from different frames were joined where necessary to connect PPR motif arrays, with ‘X’ residues inserted to maintain the approximate length and to indicate the junction. For all the studies reported here, the protein sequences were filtered such that only sequences with either at least two adjacent PPR motifs and a sum of motif scores >= 40, or a DYW motif, were retained. These are stringent thresholds that we believe eliminate almost all non-PPR sequences.
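Selecting a best-scoring non-overlapping chain of motif hits is an instance of weighted interval scheduling. The Python sketch below shows one way this could be done; it is not the PPRfinder code, and the tuple layout, the function names, and the definition of "adjacent" (a gap of at most `max_gap` residues) are assumptions made for illustration. The score thresholds are those quoted above.

```python
from bisect import bisect_right

# Thresholds quoted in the text; every other motif type uses 0.
SCORE_THRESHOLDS = {"SS": 10.0, "DYW": 30.0}


def passes_threshold(motif_type, score):
    return score >= SCORE_THRESHOLDS.get(motif_type, 0.0)


def best_chain(hits):
    """Pick the highest-scoring set of non-overlapping hits.

    Each hit is (start, end, motif_type, score) with inclusive coordinates.
    Returns (chain, total_score) with the chain ordered by position.
    """
    hits = sorted((h for h in hits if passes_threshold(h[2], h[3])),
                  key=lambda h: h[1])
    ends = [h[1] for h in hits]
    best = [0.0] * (len(hits) + 1)  # best[i]: best total over the first i hits
    take = [False] * len(hits)
    prev = [0] * len(hits)
    for i, (start, _end, _type, score) in enumerate(hits):
        # Number of earlier hits that end before this one starts.
        prev[i] = bisect_right(ends, start - 1, 0, i)
        with_hit = best[prev[i]] + score
        take[i] = with_hit > best[i]
        best[i + 1] = with_hit if take[i] else best[i]
    chain, i = [], len(hits)
    while i > 0:  # walk back through the recorded decisions
        if take[i - 1]:
            chain.append(hits[i - 1])
            i = prev[i - 1]
        else:
            i -= 1
    return chain[::-1], best[-1]


def retain_protein(chain, min_total=40.0, max_gap=0):
    """Apply the retention rule: a DYW motif, or two adjacent motifs and total >= 40."""
    if any(h[2] == "DYW" for h in chain):
        return True
    adjacent = any(b[0] - a[1] - 1 <= max_gap for a, b in zip(chain, chain[1:]))
    return adjacent and sum(h[3] for h in chain) >= min_total
```

The dynamic programme runs in O(n log n) time for n hits, which matters when hundreds of thousands of candidate proteins are screened.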
Alignment of PPR motifs
For most large alignments based on single motifs, sequences were aligned with hmmalign from the HMMER 3.2.1 package (Eddy, 2018). Hmmalign tends to give terminal gaps and unaligned tails where the sequence is a relatively poor match to the profile HMM and there is no surrounding sequence context provided. As all of the aligned motifs were originally identified using hmmsearch with the same profile HMM, we know that in its native sequence context each sequence does align to the HMM. Thus we wrote a script to replace terminal gaps at each end of the motif by the corresponding unaligned tails, as long as the gap and the tail were of equal length and thus there is no possible ambiguity in the placement of the amino acid residues. This procedure avoided overly sparse positions at the motif termini and a bias towards residues that best matched the HMM. For generating consensus sequences and structural models, the alignments were filtered to remove positions at which at least 50% of the sequences contained gaps. The alignments were then filtered a second time to remove sequences containing at least 5% gaps. For DYW alignments, the alignments were also filtered to remove sequences lacking the cytidine deaminase signature (HxEx25CxxC), to avoid including degenerate, non-functional sequences.
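The two gap filters and the cytidine deaminase signature check can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' script: sequences are assumed to be equal-length aligned strings, both "-" and "." are treated as gap characters, and the signature HxEx25CxxC is tested on the ungapped sequence using the regular expression H.E.{25}C..C.

```python
import re

GAP_CHARS = "-."
# HxEx25CxxC: H, any residue, E, any 25 residues, C, any two residues, C.
DEAMINASE_SIGNATURE = re.compile(r"H.E.{25}C..C")


def drop_gappy_columns(seqs, max_gap_fraction=0.5):
    """Remove columns in which at least `max_gap_fraction` of sequences have a gap."""
    if not seqs:
        return []
    n = len(seqs)
    keep = [
        j for j in range(len(seqs[0]))
        if sum(s[j] in GAP_CHARS for s in seqs) / n < max_gap_fraction
    ]
    return ["".join(s[j] for j in keep) for s in seqs]


def drop_gappy_sequences(seqs, max_gap_fraction=0.05):
    """Remove sequences in which at least `max_gap_fraction` of positions are gaps."""
    return [
        s for s in seqs
        if s and sum(c in GAP_CHARS for c in s) / len(s) < max_gap_fraction
    ]


def has_deaminase_signature(seq):
    """Check for the HxEx25CxxC signature after stripping gap characters."""
    ungapped = "".join(c for c in seq if c not in GAP_CHARS)
    return DEAMINASE_SIGNATURE.search(ungapped) is not None
```

Applying the column filter before the sequence filter, as in the text, means that sequences are judged only on the positions that survive into the final alignment.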
Construction of phylogenetic trees
Large trees (thousands of sequences) were constructed using the approximate maximum likelihood approach of FastTree 2 (v. 2.1.11 double precision) (Price et al., 2010) using the JTT model of amino acid evolution (Jones et al., 1992), pseudocounts (-pseudo option) and a discrete gamma model with 20 rate categories (-gamma option). For local support values, FastTree uses the Shimodaira-Hasegawa test on alternate nearest-neighbour interchanges for each split in the tree.
Modelling the DYW domain
Contact-derived ab initio modelling was performed using the RaptorX contact web server, available at http://raptorx.uchicago.edu/ContactMap/, using a trimmed alignment of 15,177 DYW sequences as input for the contact prediction tool, and a consensus of the DYW domain as the subject for modelling (Ma et al., 2015; Wang et al., 2016; Wang et al., 2017). Models were visualised using the PyMOL Molecular Graphics System, Version 2.0 (Schrödinger, LLC), with zinc and 3-deazacytidine modelled into the best prediction based on the crystal structure of E. coli cytidine deaminase complexed to 3-deazacytidine, PDB ID 1ALN (Xiang et al., 1996).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction:
This sequence database (MARMICRODB) was introduced in the publication JW Becker, SL Hogle, K Rosendo, and SW Chisholm. 2019. Co-culture and biogeography of Prochlorococcus and SAR11. ISME J. doi:10.1038/s41396-019-0365-4. Please see the original publication and its associated supplementary material for the original description of this resource.
Motivation:
We needed a reference database to annotate shotgun metagenomes from the Tara Oceans project [1], the GEOTRACES cruises GA02, GA03, GA10, and GP13, and the HOT and BATS time series [2]. Our interests are primarily in quantifying and annotating the free-living, oligotrophic bacterial groups Prochlorococcus, Pelagibacterales/SAR11, SAR116, and SAR86 from these samples using the protein classifier tool Kaiju [3]. Kaiju’s sensitivity and classification accuracy depend on the composition of the reference database, and highest sensitivity is achieved when the reference database contains a comprehensive representation of expected taxa from an environment/sample of interest. However, the speed of the algorithm decreases as database size increases. Therefore, we aimed to create a reference database that maximized the representation of sequences from marine bacteria, archaea, and microbial eukaryotes, while minimizing (but not excluding) the sequences from clinical, industrial, and terrestrial host-associated samples.
Results/Description:
MARMICRODB consists of 56 million non-redundant protein sequences from 18769 bacterial/archaeal/eukaryote genome and transcriptome bins and 7492 viral genomes, optimized for use with the protein homology classifier Kaiju [3]. To ensure maximum representation of marine bacteria, archaea, and microbial eukaryotes, we included translated genes/transcripts from 5397 representative “specI” species clusters from the proGenomes database [4]; 113 transcriptomes from the Marine Microbial Eukaryote Transcriptome Sequencing Project (MMETSP) [5]; 10509 metagenome assembled genomes from the Tara Oceans expedition [6,7], the Red Sea [8], the Baltic Sea [9], and other aquatic and terrestrial sources [10]; 994 isolate genomes from the Genomic Encyclopedia of Bacteria and Archaea [11]; 7492 viral genomes from NCBI RefSeq [12]; 786 bacterial and archaeal genomes from MarRef [13]; and 677 marine single cell genomes [14]. In order to annotate metagenomic reads at the clade/ecotype level (subspecies) for the focal taxa Prochlorococcus, Synechococcus, SAR11/Pelagibacterales, SAR86, and SAR116, we generated custom MARMICRODB taxonomies based on curated genome phylogenies for each group. The curated phylogenies, Kaiju formatted Burrows-Wheeler index, translated genes, the custom taxonomy hierarchy, an interactive kronaplot of the taxonomic composition, and scripts and instructions for how to use or rebuild the resource are available from 10.5281/zenodo.3520509.
Methods:
The curation and quality control of MARMICRODB single cell, metagenome assembled, and isolate genomes was performed as described in [15]. Briefly, we downloaded all MARMICRODB genomes as raw nucleotide assemblies from NCBI. We determined an initial genome taxonomy for these assemblies using checkM with the default lineage workflow [16]. All genome bins met the completion/contamination thresholds outlined in prior studies [7,17]. For single cell and metagenome assembled genomes, especially those from Tara Oceans Mediterranean Sea samples [18], we used the GTDB-Tk classification workflow [19] to verify the taxonomic fidelity of each genome bin. We then selected genomes with a checkM taxonomic assignment of Prochlorococcus, Synechococcus, SAR11/Pelagibacterales, SAR86, and SAR116 for further analysis and confirmed taxonomic assignment using blast matches to known Prochlorococcus/Synechococcus ITS sequences and by matching 16S sequences to the SILVA database [20]. To refine our estimates of completeness/contamination of Prochlorococcus genome bins, we created a custom set of 730 single copy protein families (available from 10.5281/zenodo.3719132) from closed, isolate Prochlorococcus genomes [21] for quality assessments with checkM. For Synechococcus we used the checkM taxonomic-specific workflow with the genus Synechococcus. After the custom checkM quality control, we excluded from downstream analysis any genome bins that had an estimated quality < 30, defined as %completeness – 5x %contamination, resulting in 18769 genome/transcriptome bins. We predicted genes in the resulting genome bins using prodigal [22] and excluded protein sequences with lengths less than 20 and greater than 20000 amino acids, removed non-standard amino acid residues, and condensed redundant protein sequences to a single representative sequence to which we assigned a lowest common ancestor (LCA) taxonomy identifier from the NCBI taxonomy database [23]. 
The resulting protein sequences were compiled and used to build a Kaiju [3] search database.
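The quality score and protein filters described above reduce to a few simple rules. The Python sketch below illustrates them under stated assumptions: it is not the MARMICRODB build script, the function names are hypothetical, the order of operations follows the order in the text, and the lowest common ancestor assignment for condensed sequences is not modelled.

```python
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")


def genome_quality(completeness, contamination):
    """Estimated quality: %completeness - 5 x %contamination."""
    return completeness - 5.0 * contamination


def keep_genome(completeness, contamination, min_quality=30.0):
    """Genome bins with estimated quality below 30 are excluded."""
    return genome_quality(completeness, contamination) >= min_quality


def clean_proteins(proteins, min_len=20, max_len=20000):
    """Length-filter, strip non-standard residues, and collapse exact duplicates."""
    seen = set()
    kept = []
    for seq in proteins:
        if not (min_len <= len(seq) <= max_len):
            continue
        seq = "".join(aa for aa in seq.upper() if aa in STANDARD_AA)
        if seq and seq not in seen:
            seen.add(seq)
            kept.append(seq)
    return kept
```

For example, a bin that is 50% complete with 4% contamination scores exactly 30 and is kept, whereas the same bin with 4.5% contamination scores 27.5 and is excluded.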
The above filtering criteria resulted in 605 Prochlorococcus, 96 Synechococcus, 186 SAR11/Pelagibacterales, 60 SAR86, and 59 SAR116 high-quality genome bins. We constructed a high quality fixed reference phylogenetic tree for each taxonomic group based on genomes manually selected for completeness and phylogenetic diversity. For example, the Prochlorococcus and Synechococcus genomes for the fixed reference phylogeny are estimated to be > 90% complete, and SAR11 genomes are estimated to be > 70% complete. We created multiple sequence alignments of phylogenetically conserved genes from these genomes using the GTDB-Tk pipeline [19] with default settings. The pipeline identifies conserved proteins (120 bacterial proteins) and generates concatenated multi-protein alignments [17] from the genome assemblies using hmmalign from the HMMER software suite. We further filtered the resulting alignment columns using the bacterial and archaeal alignment masks from [17] (http://gtdb.ecogenomic.org/downloads). We removed columns represented by fewer than 50% of all taxa and/or columns with no single amino acid residue occurring at a frequency greater than 25%. We trimmed the alignments using trimal [24] with the automated -gappyout option to trim columns based on their gap distribution. We inferred reference phylogenies using multithreaded RAxML [25] with the GAMMA model of rate heterogeneity, empirically determined base frequencies, and the LG substitution model [26] (PROTGAMMALGF). Branch support is based on 250 resampled bootstrap trees. This tree was then pruned to only allow a maximum average distance to the closest leaf (ADCL) of 0.003 to reduce the phylogenetic redundancy in the tree [27]. We then “placed” genomes that either did not pass the completeness threshold or were considered phylogenetically redundant by ADCL within the fixed reference phylogeny for each group using pplacer [28], representing each placed genome as a pendant edge in the final tree. 
We then examined the resulting tree and manually selected clade/ecotype cutoffs to be as consistent as possible with clade definitions previously outlined for these groups [29–32]. We then gave clades from each taxonomic group custom taxonomic identifiers and we added these identifiers to the MARMICRODB Kaiju taxonomic hierarchy.
Software/databases used:
checkM v1.0.11 [16]
HMMER v3.1b2 (http://hmmer.org/)
prodigal v2.6.3 [22]
trimAl v1.4.rev22 [24]
AliView v1.18.1 [33] [34]
Phyx v0.1 [35]
RAxML v8.2.12 [36]
Pplacer v1.1alpha [28]
GTDB-Tk v0.1.3 [19]
Kaiju v1.6.0 [34]
GTDB RS83 (https://data.ace.uq.edu.au/public/gtdb/data/releases/release83/83.0/)
NCBI Taxonomy (accessed 2018-07-02) [23]
TIGRFAM v14.0 [37]
PFAM v31.0 [38]
Discussion/Caveats:
MARMICRODB is optimized for metagenomic samples from the marine environment, in particular planktonic microbes from the pelagic euphotic zone. We expect this database may also be useful for classifying other types of marine metagenomic samples (for example mesopelagic, bathypelagic, or even benthic or marine host-associated), but it has not been tested as such. The original purpose of this database was to quantify clades/ecotypes of Prochlorococcus, Synechococcus, SAR11/Pelagibacterales, SAR86, and SAR116 in metagenomes from the Tara Oceans Expedition and the GEOTRACES project. We carefully annotated and quality controlled genomes from these five groups, but the processing of the other marine taxa was largely automated and unsupervised. Taxonomy for other groups was copied over from the Genome Taxonomy Database (GTDB) [19,39] and NCBI Taxonomy [23], so any inconsistencies in those databases will be propagated to MARMICRODB. For most use cases MARMICRODB can probably be used unmodified, but if the user’s goal is to focus on a particular organism/clade that we did not curate in the database, then the user may wish to spend some time curating those genomes (i.e. checking for contamination, dereplicating, building a genome phylogeny for custom taxonomy node assignment). Currently the custom taxonomy is hardcoded in the MARMICRODB.fmi index, but if users wish to modify MARMICRODB by adding or removing genomes, or reconfiguring taxonomic ranks, the names.dmp and nodes.dmp files can easily be modified, as well as the fasta file of protein sequences. However, the Kaiju index will need to be rebuilt, and user will require a high
Bioinformatics Resource Center for invertebrate vectors. Provides web-based resources to scientific community conducting basic and applied research on organisms considered potential agents of biowarfare or bioterrorism or causing emerging or re-emerging diseases.
Restorer-of-fertility (Rf) genes have practical applications in hybrid seed production as a means to control self-pollination. They encode pentatricopeptide repeat (PPR) proteins that are targeted to mitochondria where they specifically bind to transcripts that induce cytoplasmic male sterility and repress their expression. We have identified a unique domain, RfCTD (Restorer-of-fertility C-terminal domain), which discriminates Restorer-of-fertility-like (RFL) proteins from hundreds of PPR proteins encoded in plant genomes. Using the sequence of this domain from hundreds of plant species, we have constructed a sequence profile that can quickly and accurately identify RfCTD sequences in plant genomes or transcriptomes. This data set contains PPR genes identified in 213 plant genomes (as summarised in the accompanying table). The PPR genes were identified in the genome sequences using the PPRfinder approach (Cheng et al., 2016, Plant J. 85:532-47. doi: 10.1111/tpj.13121). Briefly, the genomic sequences were screened for open reading frames (ORFs) in six-frame translations with the getorf program of the EMBOSS 6.6.0 package (Rice et al., 2000, Trends Genet. 16:276-7. doi: 10.1016/s0168-9525(00)02024-2). Predicted ORFs were screened for the presence of P- and PLS-class PPR motifs using hmmsearch from the HMMER 3.2.1 package (Eddy 2011, PLoS Comput Biol. 7:e1002195. doi: 10.1371/journal.pcbi.1002195) (http://hmmer.org) and hidden Markov models defined by hmmbuild (Cheng et al., 2016, Plant J. 85:532-47. doi: 10.1111/tpj.13121). Hmmbuild was used to create the RfCTD profile from an alignment of 1,486 non-redundant sequences. The RfCTD profile was incorporated into PPRfinder code (Cheng et al., 2016, Plant J. 85:532-47. doi: 10.1111/tpj.13121) and the screen of the genome sequences was repeated. Final filtering p... The data files are simple fasta files or text files that can be opened in any text editor.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Results from HMMER 3.3 with an E-value threshold of 1E-3 used to search for Pfam domains in proteins of 12 back yeast genomes.
Genomic Distribution of structural Superfamilies identifies and classifies evolutionary related proteins at the superfamily level in whole genome databases. GenDiS has been curated in direct correspondence with SCOP and represents 4001 highly resolved domains in 1194 structural superfamilies across protein sequence databases. Sequences showing reliable homology to entries in SCOP and PASS2 databases have been obtained from the non-redundant protein sequence database and aligned. Similar alignments of the superfamily members are provided in the genome level. GenDiS provides a platform for cross genome comparison at the superfamily level. GenDis relates proteins sequence information across all strata of taxonomy. One may navigate through the database to obtain structural homologues across different levels in taxonomic classification. The nomenclature of the various genomes and their hierarchy is in direct correspondence with the taxonomy database maintained at the NCBI. Sequence homologues for the various structural members are obtained from the non-redundant protein sequence database employing sensitive sequence search methods. Multiple approaches such as PSI-BLAST, HMMsearch of the HMMer suite and an interacting motif constrained PHI-BLAST have been employed to identify homologues in the sequence databases.
Apple fruit mealiness is one of the most important textural problems that results from an undesirable ripening process during storage. This phenotype is characterized by textural deterioration described as soft, grainy and dry fruit. Despite several studies, little is known about mealiness development and the associated molecular events. In this study, we integrated phenotypic, microscopic, transcriptomic and biochemical analyses to gain insights into the molecular basis of mealiness development.
Results
Instrumental texture characterization allowed the refinement of the definition of apple mealiness. In parallel, a new and simple quantitative test to assess this phenotype was developed. Six individuals with contrasting mealiness were selected among a progeny and used to perform a global transcriptome analysis during fruit development and cold storage. Potential candidate genes associated with the initiation of mealiness were identified. Amongst these, the expression profile of an early do...
A rice genome automated annotation system. This system integrates programs for prediction and analysis of protein-coding gene structure. Integrated softwares are coding region prediction programs ( GENSCAN, RiceHMM, FGENESH, MZEF ), splice site prediction programs (SplicePredictor ), homology search analysis programs ( Blast, HMMER, ProfileScan, MOTIF ), tRNA gene prediction program ( tRNAscan-SE ), repetitive DNA analysis programs ( RepeatMasker, Printrepeats ), signal scan search program ( Signal Scan ), protein localization site prediction program ( PSORT ), and program of classification and secondary structure prediction of membrane proteins ( SOSUI ). Blast against full-length cDNA sequences of japonica rice is integrated. The full-length rice cDNA sequence is provided by KOME database. Interpretation of the coding region is fully automated and gene prediction is accomplished without manual evaluation and modification. Therefore some differences exist between the predicted genes by the system and the manually predicted genes included in the GenBank entries. Please see "comparison table of gene prediction", http://RiceGAAS.dna.affrc.go.jp/rga-bin/col_accur.pl in detail. Further, a unique function is automatically assigned for predicted gene by GFSelector based on the protein homology of the gene. Additionally, the keyword search from the functions predicted by GFSelector is provided.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data continues with the development of the unprocessed NPEGC Trinity de novo metatranscriptome assemblies, uploaded to this Zenodo repository for raw assemblies: The North Pacific Eukaryotic Gene Catalog: Raw assemblies from Gradients 1, 2 and 3
A full description of this data is published in Scientific Data, available here: The North Pacific Eukaryotic Gene Catalog of metatranscriptome assemblies and annotations. Please cite this publication if your research uses this data:
Groussman, R. D., Coesel, S. N., Durham, B. P., Schatz, M. J., & Armbrust, E. V. (2024). The North Pacific Eukaryotic Gene Catalog of metatranscriptome assemblies and annotations. Scientific Data, 11(1), 1161.
Excerpts of key processing steps are sampled below with links to the detailed code on the main github code repository: https://github.com/armbrustlab/NPac_euk_gene_catalog
Processing and annotation of protein-level NPEGC metatranscripts is done in 6 primary steps:
1. Six-frame translation into protein sequences
2. Frame-selection of protein-coding translation frames
3. Clustering of protein sequences at 99% sequence identity
4. Taxonomic annotation against MarFERReT v1.1 + MARMICRODB v1.0 multi-kingdom marine reference protein sequence library with DIAMOND
5. Functional annotation against Pfam 35.0 protein family HMM profiles using HMMER3
6. Functional annotation against KOfam HMM profiles (KEGG release 104.0) using KofamScan v1.3.0
# Define local NPEGC base directory here:
NPEGC_DIR="/mnt/nfs/projects/armbrust-metat"
# Raw assemblies are located in the /assemblies/raw/ directory
# for each of the metatranscriptome projects
PROJECT_LIST="D1PA G1PA G2PA G3PA G3PA_diel"
# raw Trinity assemblies:
RAW_ASSEMBLY_DIR="${NPEGC_DIR}/${PROJECT}/assemblies/raw"
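The RAW_ASSEMBLY_DIR line above refers to ${PROJECT}, which the excerpt never sets. A minimal sketch, assuming the variable is meant to take each value of PROJECT_LIST in turn; the helper name raw_assembly_dir is hypothetical and not part of the NPEGC repository:

```shell
# Hypothetical helper (an assumption, not from the NPEGC repository):
# build the raw assembly path for one project under a base directory.
raw_assembly_dir() {
  # $1 = base directory, $2 = project name
  printf '%s/%s/assemblies/raw\n' "$1" "$2"
}

# Usage sketch: iterate over the projects named in the excerpt above.
# NPEGC_DIR="/mnt/nfs/projects/armbrust-metat"
# for PROJECT in D1PA G1PA G2PA G3PA G3PA_diel; do
#   RAW_ASSEMBLY_DIR=$(raw_assembly_dir "$NPEGC_DIR" "$PROJECT")
#   echo "$RAW_ASSEMBLY_DIR"
# done
```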
Translation
We began processing the raw metatranscriptome assemblies by six-frame translation from nucleotide transcripts into three forward and three reverse reading frame translations, using the transeq function in the EMBOSS package. We add a cruise and sample prefix to the sequence IDs to ensure unique identification downstream (ex, `>TRINITY_DN2064353_c0_g1_i1_1` to `>G1PA_S09C1_3um_TRINITY_DN2064353_c0_g1_i1_1` for the S09C1_3um sample in the G1PA assemblies). See NPEGC.6tr_frame_selection_clustering.sh for full code description.
Example of six-frame translation using transeq:
transeq -auto -sformat pearson -frame 6 -sequence 6tr/${PREFIX}.Trinity.fasta -outseq 6tr/${PREFIX}.Trinity.6tr.fasta
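The ID-prefixing step described above can be sketched with awk. This is an illustration only: the function name add_fasta_prefix is hypothetical, and the repository script may implement the renaming differently.

```shell
# Hypothetical helper (an assumption, not the repository code):
# prepend a cruise/sample prefix to every FASTA header read from stdin.
add_fasta_prefix() {
  # $1 = prefix, e.g. "G1PA_S09C1_3um"
  awk -v prefix="$1" '
    /^>/ { print ">" prefix "_" substr($0, 2); next }  # header lines
         { print }                                      # sequence lines pass through
  '
}

# Usage sketch:
# add_fasta_prefix G1PA_S09C1_3um < in.6tr.fasta > out.6tr.fasta
```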
Frame selection
We use a custom frame-selection python script keep_longest_frame.py to determine the longest coding length in each open reading frame and retain this sequence (or multiple sequences if there is a tie) for downstream analyses. See NPEGC.6tr_frame_selection_clustering.sh for full code description.
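A rough sketch of the frame-selection idea, written in awk rather than Python. It is not the authors' keep_longest_frame.py. It assumes that "coding length" means the longest stretch of residues between stop codons (`*`), and that the six frames of a transcript share an ID differing only in the trailing `_1` to `_6` suffix that transeq appends.

```shell
# Hypothetical sketch, NOT the authors' keep_longest_frame.py.
# Reads a six-frame translated FASTA on stdin and prints, for each transcript,
# the frame(s) containing the longest stop-free stretch (ties are all kept).
keep_longest_frame_sketch() {
  awk '
    function flush_record(   n, i, parts, best, base) {
      if (id == "") return
      # longest run of residues between stop codons
      n = split(seq, parts, "[*]")
      best = 0
      for (i = 1; i <= n; i++)
        if (length(parts[i]) > best) best = length(parts[i])
      # transcript ID = frame ID without the trailing _1.._6 suffix
      base = id
      sub(/_[1-6]$/, "", base)
      count++
      rec_id[count] = id; rec_seq[count] = seq
      rec_base[count] = base; rec_len[count] = best
      if (best > max_len[base]) max_len[base] = best
    }
    /^>/ { flush_record(); id = substr($1, 2); seq = ""; next }
         { seq = seq $0 }   # join wrapped sequence lines
    END {
      flush_record()
      for (k = 1; k <= count; k++)
        if (rec_len[k] == max_len[rec_base[k]])
          printf(">%s\n%s\n", rec_id[k], rec_seq[k])
    }
  '
}
```

All records are held in memory until the end of input, which keeps the sketch short but would need rethinking for assemblies with many millions of frames.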
Clustering by sequence identity
To reduce sequence redundancy and near-identical sequences, we cluster protein sequences at the 99% sequence identity level and retain the sequence cluster representative in a reduced-size FASTA output file. See NPEGC.6tr_frame_selection_clustering.sh for full code description of linclust/mmseqs clustering.
Sample of linclust clustering script: core mmseqs function
function NPEGC_linclust {
# make an index of the fasta file:
$MMSEQS_DIR/mmseqs createdb $FASTA_PATH/$FASTA_FILE NPac.$STUDY.bf100.db
# cluster sequences at $MIN_SEQ_ID
$MMSEQS_DIR/mmseqs linclust NPac.${STUDY}.bf100.db NPac.${STUDY}.clusters.db NPac_tmp --min-seq-id ${MIN_SEQ_ID}
# retrieve cluster representatives:
$MMSEQS_DIR/mmseqs result2repseq NPac.${STUDY}.bf100.db NPac.${STUDY}.clusters.db NPac.${STUDY}.clusters.rep
# generate flat FASTA output with cluster reps
$MMSEQS_DIR/mmseqs result2flat NPac.${STUDY}.bf100.db NPac.${STUDY}.bf100.db NPac.${STUDY}.clusters.rep NPac.${STUDY}.bf100.id99.fasta --use-fasta-header
}
Corresponding files uploaded to this repository: Gzip-compressed FASTA files after translation, frame-selection, and clustering at 99% sequence identity (.bf100.id99.aa.fasta.gz)
NPac.G1PA.bf100.id99.aa.fasta.gz
NPac.G2PA.bf100.id99.aa.fasta.gz
NPac.G3PA.bf100.id99.aa.fasta.gz
NPac.G3PA_diel.bf100.id99.aa.fasta.gz
NPac.D1PA.bf100.id99.aa.fasta.gz
MarFERReT + MARMICRODB taxonomic annotation with DIAMOND
Taxonomy was inferred for the NPEGC metatranscripts with the DIAMOND fast read alignment software against the MarFERReT v1.1 + MARMICRODB v1.0 multi-kingdom marine reference protein sequence library (v1.1), a combined database of the MarFERReT v1.1 marine microbial eukaryote sequence library and MARMICRODB v1.0 prokaryote-focused marine genome database. See NPEGC.diamond_taxonomy.log.sh for full description of DIAMOND annotation.
Excerpt of core DIAMOND function:
function NPEGC_diamond {
# FASTA filename for $STUDY
FASTER_FASTA="NPac.${STUDY}.bf100.id99.aa.fasta"
# Output filename for LCA results in lca.tab file:
LCA_TAB="NPac.${STUDY}.MarFERReT_v1.1_MMDB.lca.tab"
echo "Beginning ${STUDY}"
singularity exec --no-home --bind ${DATA_DIR} \
"${CONTAINER_DIR}/diamond.sif" diamond blastp \
-c 4 --threads $N_THREADS \
--db $MFT_MMDB_DMND_DB -e $EVALUE --top 10 -f 102 \
--memory-limit 110 \
--query ${FASTER_FASTA} -o ${LCA_TAB} >> "${STUDY}.MarFERReT_v1.1_MMDB.log" 2>&1
}
Corresponding files uploaded to this repository: Gzip-compressed DIAMOND lowest common ancestor predictions with NCBI Taxonomy against a combined MarFERReT + MARMICRODB taxonomic library (*.MarFERReT_v1.1_MMDB.lca.tab.gz)
NPac.G1PA.MarFERReT_v1.1_MMDB.lca.tab.gz
NPac.G2PA.MarFERReT_v1.1_MMDB.lca.tab.gz
NPac.G3PA.MarFERReT_v1.1_MMDB.lca.tab.gz
NPac.G3PA_diel.MarFERReT_v1.1_MMDB.lca.tab.gz
NPac.D1PA.MarFERReT_v1.1_MMDB.lca.tab.gz
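As a hedged illustration of working with these files: DIAMOND output format 102 (the `-f 102` in the excerpt above) is a three-column table of query ID, NCBI taxonomy ID (0 if unclassified), and E-value. The sketch below tallies classified queries per taxonomy ID; the function name lca_taxid_counts is hypothetical.

```shell
# Hypothetical helper: count classified queries per NCBI taxID from a
# DIAMOND -f 102 table on stdin (columns: query ID, taxID, E-value).
lca_taxid_counts() {
  awk -F '\t' '
    NF >= 2 && $2 != 0 { n[$2]++ }      # taxID 0 means unclassified
    END { for (t in n) printf("%s\t%d\n", t, n[t]) }
  ' | sort -t "$(printf '\t')" -k2,2nr -k1,1n   # most frequent taxID first
}

# Usage sketch:
# gzip -dc NPac.G1PA.MarFERReT_v1.1_MMDB.lca.tab.gz | lca_taxid_counts | head
```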
Pfam 35.0 functional annotation using HMMER3
Clustered protein sequences were annotated against the Pfam 35.0 collection of 19,179 protein family Hidden Markov Models (HMMs) using HMMER 3.3. Pfam annotation code is documented here: NPEGC.hmmer_function.sh
Excerpt of core hmmsearch function:
function NPEGC_hmmer {
# Define input FASTA
INPUT_FASTA="NPac.${STUDY}.bf100.id99.aa.fasta"
# hmmsearch call:
hmmsearch --cut_tc --cpu $NCORES --domtblout $ANNOTATION_DIR/${STUDY}.Pfam35.domtblout.tab $HMM_PROFILE ${INPUT_FASTA}
# compress output file:
gzip $ANNOTATION_DIR/${STUDY}.Pfam35.domtblout.tab
}
Corresponding files uploaded to this repository: Gzip-compressed hmmsearch domain table files for Pfam35 queries (*.Pfam35.domtblout.tab.gz)
G1PA.Pfam35.domtblout.tab.gz
G2PA.Pfam35.domtblout.tab.gz
G3PA.Pfam35.domtblout.tab.gz
G3PA_diel.Pfam35.domtblout.tab.gz
D1PA.Pfam35.domtblout.tab.gz
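One way to summarise a domain table, offered as a sketch rather than the authors' downstream processing: keep the Pfam family with the highest full-sequence bit score for each protein. It relies on the standard hmmsearch --domtblout layout (column 1 target name, column 4 query name, column 5 query accession, column 8 full-sequence score, comment lines starting with `#`). The function name best_pfam_hit is hypothetical.

```shell
# Hypothetical helper: best Pfam family per protein by full-sequence bit score.
# Input: hmmsearch --domtblout text on stdin.
# Output: protein ID, Pfam name, Pfam accession, score (tab-separated, sorted).
best_pfam_hit() {
  awk '
    /^#/ || NF < 22 { next }                     # skip comments and short lines
    !($1 in score) || (($8 + 0) > score[$1]) {
      score[$1] = $8 + 0                         # numeric copy for comparisons
      raw[$1] = $8; fam[$1] = $4; acc[$1] = $5   # keep the original text
    }
    END {
      for (s in score)
        printf("%s\t%s\t%s\t%s\n", s, fam[s], acc[s], raw[s])
    }
  ' | sort
}

# Usage sketch:
# gzip -dc G1PA.Pfam35.domtblout.tab.gz | best_pfam_hit > G1PA.best_pfam.tsv
```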
KEGG functional annotation using KofamScan v1.3.0
Clustered protein sequences were annotated against the KEGG collection (release 104.0) of 20,819 protein family Hidden Markov Models (HMMs) using KofamScan and KofamKOALA. Kofam annotation code is documented here: NPEGC.kofamscan_function.sh
Excerpt of core NPEGC_kofam function:
# Core function to perform KofamScan annotation
function NPEGC_kofam {
# Define input FASTA
local INPUT_FASTA="NPac.${STUDY}.bf100.id99.aa.fasta"
# KofamScan call
${KOFAM_DIR}/kofam_scan-1.3.0/exec_annotation -f detail-tsv -E ${EVALUE} -o ${ANNOTATION_DIR}/NPac.${STUDY}.bf100.id99.aa.tsv ${FASTA_DIR}/${INPUT_FASTA}
# Keep best hit
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Performance of rbf-SVM in classifying positive and negative sense RNA viruses.
Reference transcriptome and associated annotations for Emiliania huxleyi (UNC1419). A culture was grown into late exponential phase for filtration. Total RNA was extracted using TRIzol reagent (Invitrogen, Carlsbad, CA, USA) according to the manufacturer’s protocol except for an initial bead beating step and two instead of one chloroform steps to separate proteins and DNA. RNA libraries were created with the KAPA Stranded mRNA-Seq kit for Illumina platforms. The library was sequenced on an Illumina MiSeq (300 bp, paired-end reads) and an Illumina HiSeq 2500 with one lane in high output mode (100 bp, paired-end reads) and another lane in rapid run mode (150 bp, paired-end reads). Raw reads were trimmed for quality with Trimmomatic v0.36 then assembled de novo with Trinity v2.5.1 with the default parameters for paired-reads and a minimum contig length of 90 bp. Contigs were clustered based on 99% similarity using CD-HIT-EST v4.7 and then protein sequences were predicted with GeneMark S-T. Protein sequences were annotated by best-homology (lowest E-value) with the KEGG (Release 86.0), UniProt (Release 2018_03), and PhyloDB (v1.076) databases via BLASTP v2.7.1 (E-value ≤ 10⁻⁵) and with Pfam 31.0 via HMMER v3.1b2 (Dataset S2). KEGG Ortholog (KO) annotations were assigned from the top hit with a KO annotation from the top 10 hits (https://github.com/ctberthiaume/keggannot). Provided here are predicted proteins as nucleotides and peptides. Raw reads are deposited in SRA (SRP234650).
https://spdx.org/licenses/CC0-1.0.html
The pentatricopeptide repeat protein GENOMES UNCOUPLED1 (GUN1) is required for chloroplast-to-nucleus signalling in response to plastid stress during chloroplast development in Arabidopsis thaliana but its exact molecular function remains unknown. Current data on GUN1 function is limited to Arabidopsis, so we set out to investigate the origin and evolution of the land plant GUN1 proteins. We retrieved GUN1 sequences from 76 phylogenetically diverse land plants and developed a GUN1 sequence profile using hmmbuild (http://hmmer.org). We then used this profile to systematically analyse the presence/absence of GUN1 sequences in transcriptomes from land plants and streptophyte algae. This dataset includes the GUN1 profile we developed, the code we used to analyse the results of screening over 500,000 PPR protein sequences with the profile, and an alignment of the 893 GUN1 sequences that we obtained. We used this data to show that GUN1 is an ancient protein that is highly conserved across land plants but missing from the Rafflesiaceae that lack chloroplast genomes. Our findings suggest that GUN1 is an ancient protein that evolved within the streptophyte algal ancestors of land plants before the first plants colonised land more than 470 million years ago. This dataset also includes transcript count data from an RNA-seq experiment looking at gene expression in liverwort Marchantia polymorpha wild type and Mpgun1 mutant spore samples grown in the presence or absence of spectinomycin. We used this data to show that GUN1 does not act significantly in chloroplast retrograde signalling in the liverwort M. polymorpha. Its primary role is likely to be in chloroplast gene expression and its role in chloroplast retrograde signalling probably evolved more recently.
Methods
Dataset 1
Arabidopsis and Marchantia GUN1 sequences were retrieved from TAIR (https://www.arabidopsis.org/) and MarpoIBase (https://marchantia.info/), respectively.
Full-length GUN1 sequences were obtained from a representative set of land plants by protein BLAST searches (https://blast.ncbi.nlm.nih.gov/Blast.cgi) using the Arabidopsis sequence to search GenBank. A set of 76 phylogenetically diverse GUN1 sequences (including representatives from algae, bryophytes, lycophytes, ferns, gymnosperms, and angiosperms) was aligned using the G-INS-i algorithm in MAFFT v7 (Katoh & Standley, 2013). The most highly conserved region of this alignment (876 positions) was used to generate a GUN1 sequence profile with hmmbuild from the HMMER package (v3.3.1) (http://hmmer.org; Eddy, 2011), which in turn was used to search for GUN1 sequences (using hmmsearch with default parameters) in translations of various transcriptome datasets, most notably putative PPR protein sequences compiled by Gutmann et al. (2020) from the 1KP data set (Carpenter et al., 2019). The 1KP transcriptomes were filtered to remove those encoding fewer than 10000 distinct proteins (to avoid trivial false negatives due to low coverage) and those from organisms other than green algae and land plants. This resulted in 1128 analysable samples from 894 plant species. Specific searches were also made in data sets of particular interest (whole genome shotgun or transcriptome shotgun assemblies selected via the NCBI Sequence Set Browser, https://www.ncbi.nlm.nih.gov/Traces/wgs/). These additional data sets included genomes or transcriptomes where GUN1 could not be found in the corresponding 1KP samples, as well as whole genome shotgun data from Sapria himalayana (Cai et al., 2021) and whole transcriptome data from Rafflesia cantleyi (Lee et al., 2016), both holo-parasites from the Rafflesiaceae.
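The coverage filter described above (dropping samples that encode fewer than 10000 distinct proteins) could be approximated as follows. This is a sketch under stated assumptions, not the authors' code: it takes "distinct" to mean unique amino-acid strings in a per-sample protein FASTA, and both function names are hypothetical.

```shell
# Hypothetical helper: count distinct protein sequences in a FASTA on stdin.
# Assumption: "distinct" means unique amino-acid strings, whatever the IDs.
count_distinct_proteins() {
  awk '
    function store() {
      if (have && !(seq in seen)) { seen[seq] = 1; n++ }
    }
    /^>/ { store(); have = 1; seq = ""; next }
         { seq = seq $0 }              # join wrapped sequence lines
    END  { store(); print n + 0 }
  '
}

# Hypothetical helper: succeed if a count meets the minimum
# (the text uses 10000 as the cut-off).
passes_protein_threshold() {
  # $1 = distinct protein count, $2 = minimum required
  [ "$1" -ge "$2" ]
}

# Usage sketch (sample.faa is a placeholder file name):
# n=$(count_distinct_proteins < sample.faa)
# passes_protein_threshold "$n" 10000 && echo "keep sample"
```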
Katoh K, Standley DM. 2013. MAFFT Multiple Sequence Alignment Software Version 7: Improvements in Performance and Usability. Molecular biology and evolution 30: 772–780.
Eddy SR. 2011. Accelerated Profile HMM Searches. PLoS computational biology 7: e1002195.
Gutmann B, Royan S, Schallenberg-Rüdinger M, Lenz H, Castleden IR, McDowell R, Vacher MA, Tonti-Filippini J, Bond CS, Knoop V, et al. 2020. The Expansion and Diversification of Pentatricopeptide Repeat RNA-Editing Factors in Plants. Molecular plant 13: 215–230.
Carpenter EJ, Matasci N, Ayyampalayam S, Wu S, Sun J, Yu J, Jimenez Vieira FR, Bowler C, Dorrell RG, Gitzendanner MA, et al. 2019. Access to RNA-sequencing data from 1,173 plant species: The 1000 Plant transcriptomes initiative (1KP). GigaScience 8.
Cai L, Arnold BJ, Xi Z, Khost DE, Patel N, Hartmann CB, Manickam S, Sasirat S, Nikolov LA, Mathews S, et al. 2021. Deeply Altered Genome Architecture in the Endoparasitic Flowering Plant Sapria himalayana Griff. (Rafflesiaceae). Current biology: CB 31: 1002-1011.e9.
Lee X-W, Mat-Isa M-N, Mohd-Elias N-A, Aizat-Juhari MA, Goh H-H, Dear PH, Chow K-S, Haji Adam J, Mohamed R, Firdaus-Raih M, et al. 2016. Perigone Lobe Transcriptome Analysis Provides Insights into Rafflesia cantleyi Flower Development. PloS one 11: e0167958.
Dataset 2
Dataset 2 is derived from the NCBI SRA BioProject PRJNA800059, which contains paired-end, random-primed, rRNA-depleted, strand-specific RNA-seq reads from 12 liverwort Marchantia polymorpha wild type (accession Takaragaike) or Mpgun1 mutant spore samples grown in the presence or absence of spectinomycin. The raw read data can be obtained from NCBI SRA.
M. polymorpha spores were sterilised and plated on ½ Gamborg’s medium (Duchefa Biochemie) supplemented with 1.2% agar and 500 μg⋅ml⁻¹ spectinomycin (an inhibitor of plastid translation). The spores were germinated under long day conditions for 48 hours, after which they were resuspended in 1 ml of sterile water, transferred into a microcentrifuge tube, and spun down at 6,000 rpm for 1 minute. Water was removed, and the spore pellet was flash-frozen in liquid nitrogen. RNA was extracted from spores using the Direct-Zol RNA MINIprep kit (Zymo Research) and its quality was estimated on an Agilent 4200 tape station (Agilent). Three independent biological replicates were extracted for each genotype/condition. RNA was quantified using a NanoDrop spectrophotometer (Thermo Fisher) and DNase treated using Turbo DNase (Ambion). Transcriptome libraries were prepared using the TruSeq Stranded Total RNA kit with Ribo-Zero Plant (Illumina). The libraries were sequenced on an Illumina HiSeq 4000 platform (150 nt paired-end reads) at Novogene, Hong Kong. Optical duplicate reads were first removed with clumpify (parameters: dedupe optical dist = 40) from the bbmap package (https://sourceforge.net/projects/bbmap/) and adapters were trimmed with bbduk (parameters: ktrim=r k=23 mink=11 hdist=1 tpe tbo ftm=5). The reads were then assigned to transcripts using Salmon v1.3.0 (Patro et al., 2017) (parameters: -l A --validateMappings) against an index prepared with the M. polymorpha MpTak_v6.1 reference genome and cDNA assemblies (https://marchantia.info/). Differential expression analyses were carried out using DESeq2 (Love et al., 2014). Functional annotations for MpTak_v5.1 genome release were used to identify M. polymorpha photosynthesis-associated nuclear genes.
Patro R, Duggal G, Love MI, Irizarry RA, Kingsford C. 2017. Salmon provides fast and bias-aware quantification of transcript expression. Nature methods 14: 417–419.
Love MI, Huber W, Anders S. 2014. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome biology 15: 550.