Accurate quantification of gene and transcript-specific expression, with the underlying knowledge of precise transcript isoforms, is crucial to understanding many biological processes. Analysis of RNA sequencing data has benefited from the development of alignment-free algorithms which enhance the precision and speed of expression analysis. However, such algorithms require a reference transcriptome. Here we present a reference transcript dataset (LsRTDv1) for lettuce, combining long- and short-read sequencing with publicly available transcriptome annotations, and filtering to keep only transcripts with high-confidence splice junctions and transcriptional start and end sites. LsRTDv1 is a valuable resource for the investigation of transcriptional and alternative splicing regulation in lettuce.
We generated a lettuce Reference Transcript Dataset (LsRTDv1) by integrating transcript assemblies from short- and long-read RNA sequencing data with existing lettuce genome annotations. RNA sequencing data was generated from 23 different lettuce samples capturing different tissues, ages of plant and treatments. The 23 samples, all from Lactuca sativa cv. Saladin (synonymous with cv. Salinas), were combined equally into 7 samples prior to sequencing.
Short-read assembly
The RNA-seq reads of the seven pooled samples were pre-processed with Fastp (Chen et al., 2018) to remove adapters and filter low-quality reads (quality score <20, length <30). Trimmed reads were mapped to the latest lettuce reference genome assembly in NCBI (Lsat_Salinas_v11) using STAR aligner in the 2-pass mode to increase the mapping sensitivity at splice junctions (SJs) (Dobin and Gingeras, 2015). Mismatch was set to 1 with minimum and maximum intron sizes of 60 and 15,000 bp respectively. Two transcript assembl...
# LsRTDv1: A reference transcript dataset for accurate transcript-specific expression analysis in lettuce
The genome assembly of cultivated lettuce was published in 2017 (Reyes-Chin-Wo et al., 2017) with an updated genome version (version 11) available on NCBI (https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_002870075.4/). Here, we introduce the first lettuce reference transcript dataset (LsRTDv1) integrating long-read Iso-seq and short-read RNA-seq of diverse tissue and treatment samples from lettuce with the GenBank and RefSeq transcript annotations, using stringent quality measures. The final LsRTDv1 includes 179,404 non-redundant transcripts encoded by 65,724 genes, greatly expanding the existing lettuce transcriptome and increasing the number of transcripts per gene from 1.4 to 2.7. LsRTDv1 identifies 3,696 novel gene models, predominantly long non-coding RNAs, absent in both GenBank and RefSeq annotations.
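As a quick check on the figures above, the transcripts-per-gene ratio can be recomputed from the reported totals. This is a minimal sketch; the function name is illustrative and not part of the dataset.

```python
def transcripts_per_gene(n_transcripts: int, n_genes: int) -> float:
    """Return the mean number of transcripts per gene."""
    if n_genes <= 0:
        raise ValueError("n_genes must be positive")
    return n_transcripts / n_genes


# Totals reported for LsRTDv1 in the description above.
ratio = transcripts_per_gene(179_404, 65_724)
print(f"{ratio:.1f} transcripts per gene")
```

The result rounds to the 2.7 transcripts per gene quoted in the description.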
We provide two files for the LsRTDv1:
https://spdx.org/licenses/CC0-1.0.html
Microalgae are integral primary producers for global ecosystems whose genomes can be mined for ecological insights, but representative genome sequences are lacking for many phyla. We cultured and sequenced 107 microalgae species from 11 different phyla indigenous to varied geographies and climates. This genome collection was used to resolve genomic differences between saltwater and freshwater microalgae. Freshwater species showed domain-centric ontology enrichment for nuclear and nuclear membrane functions, while saltwater species were enriched in organellar and cellular membrane functions. Marine species contained significantly more viral families in their genomes (p-value = 8 x 10(-4)). Viral sequences were identified from Chlorovirus, Coccolithovirus, Pandoravirus, Marseillevirus, Tupanvirus, and others integrated into algal genomes. Algal, viral-origin sequences were found to be expressed and to code for a wide variety of functions. Our results clarify the poorly characterized occurrences of viral elements in algal genomes and define a unified adaptive strategy for algal halotolerance.
Methods
Microalgal strain selection and cultivation
Cultivation, DNA extraction, and sequencing of isogenic microalgae were done at several international culture collections and sequencing centers: UTEX (Austin, TX, USA), Bigelow Laboratories (NCMA culture collection center, East Boothbay, ME, USA), New York University Abu Dhabi Center for Genomics and Systems Biology (Abu Dhabi, UAE), Admera Health LLC (South Plainfield, NJ, USA), and Novogene (HK).
The UTEX strains were grown on slants using one of the following media as appropriate: BG11 Medium, Bristol Medium, Cyanidium Medium, f/2 Medium, Modified Artificial Seawater Medium, Modified Bold's 3N Medium, Porphyridium Medium, Proteose Medium, Soil Extract Medium, Trebouxia Medium, or Volvox Medium, each with 1.5% agar as described on the UTEX website (https://utex.org/pages/algal-culture-media-recipes). Cultures were grown under cool white fluorescent lights at 20 °C on a 12-hour light cycle. For species isolated and cultured at NYUAD, f/2 medium (Lananan et al., 2013) or Tris-minimal medium (https://chlamycollection.org/) was used (https://utex.org/pages/algal-culture-media-recipes).
The species chosen for sequencing were intended to represent as many microalgae phyla and as many different environments as possible. We sequenced representatives from 11 phyla (see Table S1). Most of the species were from the Chlorophyta or the Ochrophyta phyla. The project designations were algallCODE phase II (n=107, this manuscript), algallCODE phase I (n=22), NCBI-hosted (n=43), and Phytozome-hosted (n=2). Individual strain cultures were selected as representative species for their lineages or as standards to confirm workflow reproducibility (see Table S1; Dataset S1).
We emphasized maximizing the sample size of each saltwater and freshwater species (Fig. 1; Table S1). Of our initial effort to culture >150 species, 24 failed to produce sufficient biomass, six were contaminated, and three yielded reads inadequate for an assembly matching the expected size (Dataset S1). The de novo assemblies from the final batch of 107 sequenced species were combined with publicly available algal genomes for downstream analyses, including coding sequence (CDS) predictions (Dataset S2), hidden Markov model (HMM)-based functional predictions, including viral and protein family domain identification, hierarchical bi-clustering (Fig. 1), enrichment analyses (Fig. 5), principal component analyses (Fig. 5), and ternary graph-based analyses (Fig. S10, Dataset S12).
The natural habitats of these microalgae include a range of diverse geographic locations and all climatic zones, with various temperatures, wind speeds, precipitation, and solar radiation. To allow the study of their evolution, we included species from different types of environments (from the arctic to the tropics) and both salt- and fresh-water habitats (Fig. 1A; Table S1). The freshwater species sequenced in this project included members of the Chlorophyta and Ochrophyta; most of the Haptophytes, Rhodophytes, and Myzozoa we sequenced were saltwater species. Most of the UTEX accessions were freshwater species (28/40); most of the NCMA accessions were saltwater species (50/57). Alexandrium andersonii, a mixotrophic dinoflagellate (1.7 Gb), Heterocapsa arctica, an arctic dinoflagellate (1.3 Gb), Lingulodinium polyedra, a red-tide dinoflagellate (1.2 Gb), Amphidinium gibbosum (1.1 Gb), and Karenia brevis (1.0 Gb) were the largest de novo-assembled genomes in this work (Table S2).
Long-read assemblies, including those from other studies (i.e., Chromochloris zofingiensis (Roth et al., 2017), Thalassiosira pseudonana (Armbrust et al., 2004), and Chlamydomonas reinhardtii (Merchant et al., 2007)), were used to validate our high-throughput short-read assembly process. Four subtropical axenic isolates (from the United Arab Emirates) were sequenced for this study using long-read technologies, including 10x Genomics (Pleasanton, CA, USA) linked-reads, and Pacific Biosciences (Menlo Park, CA, USA) Sequel long reads. Long reads were used to validate assemblies and viral element insertions and to resolve repeat-containing regions (Ummat and Bashir, 2014; Vondrak et al., 2019). Our results indicated that the CDSs that provided the foundational information for the comparative analyses in this manuscript were reliably determined using short reads (Illumina HiSeq X or NovaSeq 6000). For example, the Chromochloris zofingiensis genome is the highest quality algal genome published to date (Roth et al., 2017) and has 33,513 exons; our short-read assembly for Chromochloris zofingiensis had 33,910 exons. Other reference microalgae used in this study as resequencing standards included Thalassiosira pseudonana (Armbrust et al., 2004), Chlamydomonas reinhardtii (Merchant et al., 2007), Scenedesmus sp., Guillardia theta (Curtis et al., 2012), Fragilariopsis cylindrus (Mock et al., 2017), Coccomyxa subellipsoidea (Blanc et al., 2012), and Bigelowiella natans (Curtis et al., 2012). A comparison of the assemblies generated from the monoculture and sequencing performed in this study and previous whole-genome sequencing projects is presented in Fig. S3, and QUAST and BUSCO assembly metrics are in Table S2. Hidden Markov models were used to predict structure and function from the whole-genome sequences (Fig. 1, B–D; Tables S3,5; Datasets S6, S7).
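To make the exon-count comparison for Chromochloris zofingiensis concrete, the relative difference between the published genome and the short-read assembly can be computed directly. The helper below is an illustrative sketch, not part of the published workflow.

```python
def percent_difference(reference: int, observed: int) -> float:
    """Signed difference of observed relative to reference, in percent."""
    if reference == 0:
        raise ValueError("reference must be non-zero")
    return 100.0 * (observed - reference) / reference


# Exon counts quoted above: published reference vs. short-read assembly.
diff = percent_difference(33_513, 33_910)
print(f"short-read assembly has {diff:.2f}% more exons than the reference")
```

The two exon counts differ by about 1.2%.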
The results for functional characterization using Enzyme Commission (EC) codes (Alborzi et al., 2017; Ryu et al., 2019), Kyoto Encyclopedia of Genes and Genomes (KEGG) designations (Porollo, 2014), and Gene Ontology (GO) terms (Hayes and Mamano, 2018; Teng et al., 2017) are in Tables S6,8.
DNA extraction
DNA was extracted from mature cultures with QIAGEN DNeasy Plant Maxi kits for HiSeqX 150x2 paired-end (short read) sequencing or QIAGEN MagAttract High Molecular Weight DNA Kits (48) for long-read sequencing. DNA was quantified and assayed for integrity as per the kit manufacturer's protocol. For HMW DNA extraction, briefly, DNA concentration was measured using a Qubit Fluorometer and checked for size by pulsed-field electrophoresis. A length-weighted mean of 50-70 kb was obtained, or the sample was rejected for sequencing. See Dataset S2 for FastQC reports and Fig. S1 for the gel images showing DNA integrity. Extracted DNA with low integrity was not included in library preparations. More than 30 cultures were grown whose DNA did not meet the quality threshold; in these instances, substitute strains were chosen. The final, sequenced strains are listed in Table S1.
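The high-molecular-weight acceptance check can be sketched as follows. The source does not say how the length-weighted mean was derived from the pulsed-field gel, so this sketch assumes a list of estimated fragment lengths and uses the usual definition (each fragment weighted by its own length). It also treats 50 kb as a minimum and assumes samples above 70 kb are not rejected; both function names are hypothetical.

```python
from typing import Sequence


def length_weighted_mean(lengths_kb: Sequence[float]) -> float:
    """Length-weighted mean: sum(L^2) / sum(L)."""
    total = sum(lengths_kb)
    if total <= 0:
        raise ValueError("fragment lengths must sum to a positive value")
    return sum(length * length for length in lengths_kb) / total


def passes_hmw_check(lengths_kb: Sequence[float], min_kb: float = 50.0) -> bool:
    """Assumed rule: accept when the length-weighted mean reaches min_kb."""
    return length_weighted_mean(lengths_kb) >= min_kb


# Hypothetical fragment size estimates in kb.
print(passes_hmw_check([40, 60, 80]))
print(passes_hmw_check([10, 20, 30]))
```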
Sequencing
Genomic DNA sequencing was performed with Illumina paired-end (Illumina, San Diego, CA, USA), PacBio Sequel (Pacific Biosciences, Menlo Park, CA, USA), and, where indicated, 10x Genomics linked-read (10x Genomics, Pleasanton, CA, USA) technologies to enable reliable coverage, contig assembly, and de novo genomic sequence assembly. For Illumina paired-end sequencing, Nextera 2x150 bp libraries (Illumina, San Diego, CA, USA) with approximately 72 million reads per sample passing quality filters (Dataset S2) were sequenced on a HiSeqX (https://emea.illumina.com/systems/sequencing-platforms/hiseq-x.html). All reads were uploaded to the National Center for Biotechnology Information (NCBI) under BioProject accession PRJNA517804. The target coverage was 100x on a 100 Mbp genome. Quality control for Illumina library preparation was done with Qubit, TapeStation (Fig. S1), and qPCR. Combining these technologies assisted the validation of VFAM placement within selected genomes and ensured reliable assembly.
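The relationship between read count, read length, and the stated coverage target can be checked with the standard estimate: coverage = reads × read length / genome size. The sketch below assumes the quoted 72 million counts individual 150 bp reads; if it counts read pairs, the estimate doubles.

```python
def expected_coverage(n_reads: int, read_length_bp: int, genome_size_bp: int) -> float:
    """Mean sequencing depth, assuming all reads map uniformly to the genome."""
    if genome_size_bp <= 0:
        raise ValueError("genome_size_bp must be positive")
    return n_reads * read_length_bp / genome_size_bp


# Values quoted above: ~72 million reads of 150 bp on a 100 Mbp genome.
depth = expected_coverage(72_000_000, 150, 100_000_000)
print(f"expected coverage: {depth:.0f}x")
```

Under that assumption the estimate is 108x, which is consistent with the stated 100x target.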
De novo genome assembly
De novo assembly can produce variable output depending on the source species and software used; we used both the ABySS 2.0 (Jackman et al., 2017) and Platanus (Kajitani et al., 2019) pipelines for each species sequenced with short reads (Illumina HiSeqX (Illumina, San Diego, CA, USA)). The ABySS 2.0 command was: 'unset SLURM_NTASKS && mkdir -p $READFILE && TMPDIR=/tmp ABySS-pe -j 18 lib=pe1 k=64 name=$READFILE pe1='$READFILE R1_001.fastq.gz $READFILE R2_001.fastq.gz' --directory=/data/analysis/ABySS_pe/$READFILE'. The Platanus command was: '/platanus assemble -o $READ.OUT -f $READ-1.trimmed $READ-2.trimmed -t 4 -m 72 2>assemble.$READ.log'. Details of all YML workflows used are in Dataset S3, and the Key Resources Table lists all essential software used in the creation and analysis of these genomes.
The output with the most single-copy, universally conserved orthologs according to BUSCO, which scores assemblies against evolutionarily informed expectations of the gene content of near-universal single-copy orthologs, was chosen for subsequent analyses (Table S2, Dataset S2). This step introduces some bias, as ABySS produced assemblies that were much closer in size to the estimated genome sizes from close relatives.
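The selection rule described above (keep whichever assembler's output has the most complete single-copy BUSCOs) can be sketched as a simple maximum over per-assembler counts. The counts below are hypothetical placeholders; the real values are in Table S2 and are not reproduced here.

```python
from typing import Dict


def pick_best_assembly(single_copy_buscos: Dict[str, int]) -> str:
    """Return the assembler name with the highest single-copy BUSCO count."""
    if not single_copy_buscos:
        raise ValueError("no assemblies to compare")
    return max(single_copy_buscos, key=single_copy_buscos.get)


# Hypothetical counts for one species (not taken from the study).
counts = {"abyss": 1320, "platanus": 1275}
print(pick_best_assembly(counts))
```

Ties resolve to the first entry in the dictionary; the source does not say how ties were handled.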
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Next-generation sequencing read statistics and sequencing coverage for the sample datasets.
https://doi.org/10.5061/dryad.xwdbrv1m8
The genome assembly of cultivated lettuce was published in 2017 (Reyes-Chin-Wo et al., 2017) with an updated genome version (version 11) available on NCBI (https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_002870075.4/). Here, we introduce the first lettuce reference transcript dataset (LsRTDv1) integrating long-read Iso-seq and short-read RNA-seq of diverse tissue and treatment samples from lettuce with the GenBank and RefSeq transcript annotations, using stringent quality measures. The final LsRTDv1 includes 179,404 non-redundant transcripts encoded by 65,724 genes, greatly expanding the existing lettuce transcriptome and increasing the number of transcripts per gene from 1.4 to 2.7. LsRTDv1 identifies 3696 novel gene models, predomin...
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
Example files generated in the TRITEX assembly pipeline. These files were generated using the input datasets of maize B73 downloaded from NCBI SRA with accession numbers SRR11606869 and PRJNA391551. The reference guide map was generated using the RefGen_v5 genome of maize B73 (accession number GCA_000005005.1). The marker guide map was generated using the set of markers from the linkage map of the Intermated B73 x Mo17 (IBM) population (doi:10.1371/journal.pone.0028334). Some files are provided in RDS format (serialized R object), which can be loaded in R; others are CSV files. More details are given in README.txt and in the TRITEX long-read paper.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
The ont-open-data registry provides reference sequencing data from Oxford Nanopore Technologies to support: 1) exploration of the characteristics of nanopore sequence data; 2) assessment and reproduction of performance benchmarks; and 3) development of tools and methods. The deposited data showcase DNA sequences from a representative subset of sequencing chemistries. The datasets correspond to publicly available reference samples (e.g. Genome In A Bottle reference cell lines). Raw data are provided with metadata and scripts to describe sample and data provenance.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
This resource is a de novo genome assembly of the cowpea (Vigna unguiculata (L.) Walp) cultivar IT86D-1010. This cultivar has become the preferred germplasm for developing transgenic cowpea plants, either genetically modified or gene edited. For the production and characterization of genetically modified cowpeas, it is crucial to have a high-quality genomic sequence of the cultivar being transformed. Although there are publicly accessible genomic sequences for cowpea, the reference cultivar is IT97K-499-35 (Lonardi et al., 2019). There is also a previously published assembly of IT86D-1010 based on Illumina short reads (Spriggs et al., 2018); however, that assembly is too fragmented. The submitted dataset is a new, high-quality, assembled genomic sequence for the cowpea cultivar IT86D-1010, generated with Oxford Nanopore Technology (ONT) long-read sequencing and corrected using the already published Illumina short reads (Spriggs et al., 2018). The resulting genome consisted of 505 contigs, with a total assembly length of 537,341,206 bp, and was comparable in quality to the IT97K-499-35 reference.
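For orientation, the two assembly figures quoted above give a mean contig length. This is a minimal sketch with an illustrative function name. The mean is not the N50, which cannot be derived from the contig count and total length alone.

```python
def mean_contig_length(total_length_bp: int, n_contigs: int) -> float:
    """Average contig length in base pairs."""
    if n_contigs <= 0:
        raise ValueError("n_contigs must be positive")
    return total_length_bp / n_contigs


# Figures quoted above for the IT86D-1010 assembly.
mean_bp = mean_contig_length(537_341_206, 505)
print(f"mean contig length: {mean_bp / 1e6:.2f} Mb")
```

The mean contig length comes out at roughly 1.06 Mb.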
Lineage: Plant material and tissue sample
Cowpea (Vigna unguiculata) cultivar IT86D-1010 was originally sourced from the International Institute of Tropical Agriculture (IITA). Lines have been maintained at CSIRO for more than 10 generations (not through single seed descent). Young unexpanded leaves were collected for DNA extraction.
DNA Isolation and Sequencing
For Illumina short read sequencing, DNA extraction was carried out using a Qiagen maxi DNA kit as per the manufacturer’s instructions. Illumina library preparation and sequencing of DNA was undertaken by the Australian Genome Research Facility (AGRF) with 2 × 100 bp standard insert paired-end sequencing using a HiSeq 2500 system, as described in Spriggs et al. (2018).
For Oxford Nanopore Technology (ONT) long read sequencing, DNA was isolated using the QIAGEN Genomic-tip as per the manufacturer’s instructions. DNA was extracted from a transformed IT86D-1010 line. Shorter DNA fragments (<5 kb) were depleted from the DNA sample using the Circulomics SRE XS protocol. The size-selected genomic DNA was then prepared for sequencing using the ONT native barcoding workflow and then sequenced on an ONT PromethION Sequencer at the Biomolecular Resource Facility (BRF) at the John Curtin School of Medical Research, Australian National University (Canberra, ACT).
Genome sequencing
Adapter sequences were removed from the short-read sequencing data using TrimGalore v0.6.6. The long-read sequences were trimmed using Porechop v0.2.4. Long reads were assembled using Flye v4.0.8 under default settings. The resulting genome assembly was then subjected to three rounds of polishing using long reads and three rounds using short reads with Racon v1.4.22 under default settings. This was followed by six rounds of short-read-based polishing using POLCA (MaSuRCA v4.0.7) under default settings. The resulting genome sequence was sequentially decontaminated using the NCBI Foreign Contamination Screen (FCS) tool (fcs-adaptor and fcs-gx) to obtain the final contig assembly. The sequences corresponding to the T-DNA insertion were identified and removed. The assembly was assessed using BUSCO v5.2.2 against the Lepidoptera (lepidoptera_odb10) and Insecta (insecta_odb10) lineage gene sets.
Genome annotation
The genome was annotated with Liftoff (V16.3) using the published cowpea cultivar IT97K-499-35 (GCF_004118075.2) as a reference (Lonardi et al., 2019).
References:
Spriggs A, Henderson ST, Hand ML, Johnson SD, Taylor JM, Koltunow A. Assembled genomic and tissue-specific transcriptomic data resources for two genetically distinct lines of Cowpea (Vigna unguiculata (L.) Walp). Gates Open Res. 2018, 18;2:7. doi: 10.12688/gatesopenres.12777.2.
Lonardi S, Muñoz-Amatriaín M, Liang Q, Shu S, Wanamaker SI, Lo S, Tanskanen J, Schulman AH, Zhu T, Luo MC, Alhakami H, Ounit R, Hasan AM, Verdier J, Roberts PA, Santos JRP, Ndeve A, Doležel J, Vrána J, Hokin SA, Farmer AD, Cannon SB, Close TJ. The genome of cowpea (Vigna unguiculata [L.] Walp.). Plant J. 2019, 98:767-782. doi: 10.1111/tpj.14349.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
Third-generation sequencing has seen little uptake in metagenomics because of its high error rate and because assembly has depended on bioinformatics tools designed for short reads. However, second-generation sequencing metagenomics (mostly Illumina) suffers from limitations, particularly in the assembly of microbes with high microdiversity and retrieval of the flexible (adaptive) fraction of prokaryotic genomes. Here, we have used a third-generation technique to study the metagenome of a well-known marine sample from the mixed epipelagic water column of the winter Mediterranean. We have compared PacBio Sequel II with the classical approach using Illumina Nextseq short reads followed by assembly to study the metagenome. Long reads allow for efficient direct retrieval of complete genes, avoiding the bias of the assembly step. Moreover, the application of long reads to metagenomic assembly allows for the reconstruction of much more complete metagenome-assembled genomes (MAGs), particularly from microbes with high microdiversity such as Pelagibacterales. The flexible genome of reconstructed MAGs was much more complete, containing many adaptive genes (some with biotechnological potential). PacBio Sequel II CCS appears particularly suitable for cellular metagenomics due to its low error rate. For most applications of metagenomics, from community structure analysis to ecosystem functioning, long reads should be applied whenever possible. Specifically, for in silico screening of biotechnologically useful genes, or population genomics, long-read metagenomics presently appears to be a very fruitful approach, and the data can be analyzed from raw reads before a computationally demanding (and potentially artifactual) assembly step.
For any genome-based research, a robust genome assembly is required. De novo assembly strategies have evolved with changes in DNA sequencing technologies and have been through at least three phases: i) short-read only, ii) short- and long-read hybrid, and iii) long-read only assemblies. Each of the phases has its own error model. We hypothesized that hidden scaffolding errors in short-read assembly and erroneous long-read contigs degrade the quality of short- and long-read hybrid assemblies. We assembled the genome of T. borchgrevinki from data generated during each of the three phases and assessed the quality problems we encountered. We developed strategies such as k-mer-assembled region replacement, parameter optimization, and long-read sampling to address the error models. We demonstrated that a k-mer-based strategy improved short-read assemblies as measured by BUSCO while mate-pair libraries introduced hidden scaffolding errors and perturbed BUSCO scores. Further, we found that al...
https://spdx.org/licenses/CC0-1.0.html
Sequencing of a kidney sample (KW2013002) from a stranded Megaptera novaeangliae (humpback whale) calf yielded the first chromosome-level reference genome for this species. The calf, a 457 cm and 2,500 lbs male, was found stranded in Hawai’i Kai, HI, in 2013 and was marked as abandoned/orphaned. In 2023, 1 g of kidney was sequenced with PacBio long-read DNA sequencing, chromatin conformation capture (Hi-C), RNA sequencing, and mitochondrial sequencing to comprehensively characterize the genome and transcriptome of M. novaeangliae. The reference genome was compared to the preexisting M. novaeangliae scaffold to determine assembly improvements. Data validation includes a synteny analysis, mitochondrial annotation, and a comparison of BUSCO scores (scaffold v. reference genome and Balaenoptera musculus (blue whale) v. M. novaeangliae). BUSCO analysis was performed on an M. novaeangliae scaffold-level assembly to assess the genomic completeness of the reference genome, with a scaffold BUSCO score of 91.2% versus 95.4% for the reference genome (Table I). Synteny analysis was performed using the B. musculus genome as comparison to determine chromosome-level coverage and structure. Further, a time-based phylogenetic tree was constructed using the sequenced data and publicly available genomes. This dataset also contains the results of de novo repeat identification and gene annotation for the humpback whale (Megaptera novaeangliae) genome. The repeat families were identified and classified using RepeatModeler, and gene prediction was conducted using AUGUSTUS and SNAP, incorporating coding sequences from related cetaceans. The resulting gene models were further refined using the MAKER pipeline, with protein evidence from Swiss-Prot and related species. tRNA genes were identified with tRNAscan-SE.
The dataset includes the transcript sequences (GIU3625_Humpback_whale.transcript.fasta.gz), annotation file (GIU3625_Humpback_whale.annotation.gff.gz), and a methods file (methods.txt) detailing the bioinformatic processes.
Methods
Sample Information
A kidney sample (KW2013002) was collected from a M. novaeangliae calf on January 15, 2013, in Hawai’i Kai, HI, and deposited at the National Institute of Standards and Technology (NIST). The sample was not collected by the authors, so information regarding collection is limited to that presented herein. The calf, a 457 cm and 2,500 lbs male at the time of necropsy, was first observed on January 14, 2013, in shallow water and died between January 14 and January 15, 2013, via stranding. The calf was marked as abandoned/orphaned. In 2023, 1 g of KW2013002 was sampled for sequencing by Cantata Bio.
PacBio long reads DNA sequencing
Quantification of DNA samples was performed using the Qubit 2.0 Fluorometer. For the construction of the PacBio SMRTbell library, targeting an insert size of approximately 20 kb, the SMRTbell Express Template Prep Kit 2.0 was employed following the manufacturer's recommended protocol and default settings. The library was subsequently prepared for sequencing by binding to polymerase using the Sequel II Binding Kit 2.0 (PacBio) and loaded onto the PacBio Sequel II system. Sequencing was executed using PacBio Sequel II 8M SMRT cells to ensure comprehensive coverage and high-quality reads. Quality control of the extracted DNA was performed using NanoDrop and gel. The Omni-C library quality control was done using the Hifiasm draft assembly and showed a high amount of long-range linkage reads. The Omni-C sequencing data was also quality controlled to examine Q30%, and the quality score matched the Illumina standard. The scaffolding algorithm HiRise also has a built-in quality control that uses only reads with a map score of over 40.
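The mapping-quality cutoff mentioned above can be illustrated with a small filter over SAM-format alignment lines, where MAPQ is the fifth tab-separated column. This is only a stand-in sketch: HiRise's internal filtering is not described in the source, and the function name and strict "greater than 40" comparison are assumptions based on the wording "over 40".

```python
from typing import Iterable, Iterator


def filter_by_mapq(sam_lines: Iterable[str], min_mapq: int = 40) -> Iterator[str]:
    """Yield alignment lines whose MAPQ (SAM column 5) is greater than min_mapq."""
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip SAM header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        if int(fields[4]) > min_mapq:
            yield line


# Hypothetical alignment records with MAPQ 60, 40, and 41.
records = [
    "@HD\tVN:1.6",
    "r1\t0\tchr1\t100\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
    "r2\t0\tchr1\t200\t40\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
    "r3\t0\tchr1\t300\t41\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
]
kept = [line.split("\t")[0] for line in filter_by_mapq(records)]
print(kept)
```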
Chromatin was fixed in situ within the nucleus using formaldehyde, followed by digestion with DNase I. The processed chromatin had its ends repaired and was then ligated to a biotinylated bridge adapter, facilitating proximity ligation of adapter-containing ends. Post-proximity ligation, the crosslinks were reversed, and the DNA was purified. In a critical step, the purified DNA was then treated to eliminate any non-internal biotin. The sequencing libraries were prepared using NEBNext Ultra enzymes and Illumina-compatible adapters, with biotin-containing fragments isolated using streptavidin beads before PCR enrichment. Sequencing was performed on an Illumina HiSeqX platform to achieve approximately 30x coverage.
Contig assembling and scaffolding
The de novo assembly process utilized PacBio CCS reads and Omni-C reads as input for HiC-Hifiasm, employing default parameters. This approach facilitated the generation of a separate de novo assembly for each haplotype, enhancing the accuracy and integrity of the genomic reconstruction. The scaffolding phase involved the integration of the de novo assembly with Dovetail Omni-C library reads through HiRise, a software pipeline tailored for scaffolding genome assemblies using proximity ligation data. Alignment of Omni-C library sequences to the draft assembly was achieved using bwa, with the mapped read pairs analyzed by HiRise to construct a likelihood model for genomic distance (see Figure S1). This model, along with additional information from the synteny analysis (see below), informed the identification and correction of misjoins, the scoring of potential joins, and the execution of joins exceeding a defined confidence threshold.
Synteny analysis
The newly assembled M. novaeangliae scaffolds were mapped to the B. musculus whole genome (GenBank GCA_009873245.3) in order to map the synteny between the two species [9,10]. A synteny analysis was performed using JupiterPlot 1.0 [11], a software tool that uses circos-based consistency plots to map a given set of scaffolds with a reference genome.
RNA sequencing
Total RNA was extracted employing the QIAGEN RNeasy Plus Kit, adhering to the manufacturer's instructions. Quantification of RNA involved the Qubit RNA Assay and the TapeStation 4200 system. Before library preparation, DNase treatment was applied, followed by AMPure bead cleanup and rRNA depletion using QIAGEN FastSelect -HMR. The NEBNext Ultra II RNA Library Prep Kit was used for library preparation per the manufacturer's protocols. Sequencing of the prepared libraries was conducted on the NovaSeq 6000 platform, utilizing a 2 x 150 bp configuration to ensure comprehensive transcriptome coverage.
Repeat Analysis
This dataset was derived from a humpback whale (Megaptera novaeangliae) genome assembly. The repeat families found in the genome were identified de novo using RepeatModeler (v2.0.1), which relies on RECON (v1.08) and RepeatScout (v1.0.6). The custom repeat library generated from RepeatModeler was then used to discover, identify, and mask the repeats in the assembly using RepeatMasker (v4.1.0). Gene prediction was performed using the AUGUSTUS software (v2.5.5) with six rounds of optimization. Coding sequences from related cetacean species, including Balaenoptera acutorostrata, Balaenoptera musculus, Balaenoptera ricei, Megaptera novaeangliae, and Orcinus orca, were used to train the ab initio models for gene prediction. Additionally, the SNAP software (v2006-07-28) was trained using the same coding sequences to build a separate gene prediction model. RNA-seq reads were mapped to the genome using the STAR aligner (v2.7), and intron hints were generated using the bam2hints tool within AUGUSTUS.
MAKER was then employed to integrate the predictions from AUGUSTUS and SNAP, combining this information with peptide evidence from the UniProt database and protein sequences from related cetacean species. Only gene models predicted by both AUGUSTUS and SNAP were retained in the final dataset. Annotation Edit Distance (AED) scores were generated for each predicted gene as part of the MAKER pipeline to assess the accuracy of the predictions. Finally, tRNA genes were identified using the tRNAscan-SE software (v2.05).
Acknowledgments
The specimens used in this study were collected by Kristi West, University of Hawaii, and provided by the National Marine Mammal Tissue Bank (NMMTB), which is maintained by the National Institute of Standards and Technology (NIST) at the NIST Biorepository, Hollings Marine Laboratory, Charleston, SC. The NMMTB is operated under the direction of the National Oceanic and Atmospheric Administration/National Marine Fisheries Service (NOAA Fisheries) with the collaboration of the U.S. Geological Survey, U.S. Fish and Wildlife Service, the (former) Minerals Management Service, and NIST, through the Marine Mammal Health and Stranding Response Program.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset contains three files:
Sequencing was done on genomic DNA of E. coli strain C-1 obtained from Yale Stock Center.
https://creativecommons.org/publicdomain/zero/1.0/
Cannabis is a genus of flowering plants in the family Cannabaceae.
Source: https://en.wikipedia.org/wiki/Cannabis
In October 2016, Phylos Bioscience released an open genomic dataset of approximately 850 strains of Cannabis via the Open Cannabis Project. In combination with other genomics datasets made available by Courtagen Life Sciences, Michigan State University, NCBI, Sunrise Medicinal, University of Calgary, University of Toronto, and Yunnan Academy of Agricultural Sciences, the total amount of publicly available data exceeds 1,000 samples taken from nearly as many unique strains.
These data were retrieved from the National Center for Biotechnology Information’s Sequence Read Archive (NCBI SRA), processed using the BWA aligner and FreeBayes variant caller, indexed with the Google Genomics API, and exported to BigQuery for analysis. Data are available directly from Google Cloud Storage at gs://gcs-public-data--genomics/cannabis, via the Google Genomics API as dataset ID 918853309083001239 (with an additional duplicated subset of only transcriptome data as dataset ID 94241232795910911), and in the BigQuery dataset bigquery-public-data:genomics_cannabis.
All tables in the Cannabis Genomes Project dataset have a suffix like _201703. The suffix is referred to as [BUILD_DATE] in the descriptions below. The dataset is updated frequently as new releases become available.
The following tables are included in the Cannabis Genomes Project dataset:
Sample_info contains fields extracted for each SRA sample, including the SRA sample ID and other fields that describe the sample, such as strain, library prep method, and sequencing technology. See SRP008673 for an example of upstream sample data. SRP008673 is the University of Toronto sequencing of Cannabis sativa subspecies Purple Kush.
MNPR01_reference_[BUILD_DATE] contains reference sequence names and lengths (contig identifiers and their lengths) for the draft assembly of Cannabis sativa subspecies Cannatonic produced by Phylos Bioscience.
MNPR01_[BUILD_DATE] contains variant calls for all included samples and types (genomic, transcriptomic) aligned to the MNPR01_reference_[BUILD_DATE] table. Samples can be found in the sample_info table. The MNPR01_[BUILD_DATE] table is exported using the Google Genomics BigQuery variants schema. This table is useful for general analysis of the Cannabis genome.
MNPR01_transcriptome_[BUILD_DATE] is similar to the MNPR01_[BUILD_DATE] table, but it includes only the subset of transcriptomic samples. This table is useful for transcribed gene-level analysis of the Cannabis genome.
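The naming convention described above can be sketched as a small helper that expands [BUILD_DATE] into a fully qualified table name. This is only an illustration: the helper name `table_name`, the short `kind` keys, the six-digit form of the suffix (inferred from the `_201703` example), and the dot-separated dataset path used in standard SQL are all assumptions, not part of the dataset documentation.

```python
import re

# Assumption: dot-separated form of "bigquery-public-data:genomics_cannabis".
DATASET = "bigquery-public-data.genomics_cannabis"

# Table prefixes as listed in the dataset description; the keys are hypothetical.
TABLE_PREFIXES = {
    "samples": "sample_info",
    "reference": "MNPR01_reference",
    "variants": "MNPR01",
    "transcriptome": "MNPR01_transcriptome",
}


def table_name(kind: str, build_date: str) -> str:
    """Return the fully qualified table name for one build of the dataset."""
    # Assumption: the suffix is six digits, as in the "_201703" example.
    if not re.fullmatch(r"\d{6}", build_date):
        raise ValueError(f"unexpected build date: {build_date!r}")
    return f"{DATASET}.{TABLE_PREFIXES[kind]}_{build_date}"


if __name__ == "__main__":
    for kind in TABLE_PREFIXES:
        print(table_name(kind, "201703"))
```

Because the dataset is updated as new releases appear, the build date is kept as a parameter rather than hard-coded.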
Dataset Source: http://opencannabisproject.org/
Category: Genomics
Use: This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source - https://www.ncbi.nlm.nih.gov/home/about/policies.shtml - and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.
Update frequency: As additional data are released to GenBank
View in BigQuery: https://bigquery.cloud.google.com/dataset/bigquery-public-data:genomics_cannabis
View in Google Cloud Storage: gs://gcs-public-data--genomics/cannabis
Which Cannabis samples are included in the variants table?
Which contigs in the MNPR01_reference_[BUILD_DATE] table have the highest density of variants?
How many variants does each sample have at the THC Synthase gene (THCA1) locus?
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
For genome assembly, we generated short and long reads from a DNA sample of an adult male Nanchukmacdon. We then constructed a genome assembly from the sequencing reads using a reference-guided approach. The 80.14x raw PacBio subreads were assembled and polished to generate 1,942 high-quality contigs, each supported by at least 50 PacBio subreads. To generate a chromosome-level assembly, the high-quality polished contigs were further assembled with an improved version of RACA that can utilize both the genome information of related species and diverse types of sequencing data. The resulting assembly was polished once more with short reads to produce the final assembly.
For annotating protein-coding genes, RNA samples were prepared and sequenced from 24 different tissues of the Nanchukmacdon individual which was used for whole genome sequencing. Using a combination of ab initio and homology-based prediction approaches with the RNA sequencing data, a total of 20,588 protein-coding genes with an average length of 47.06 Kbp were annotated in the NCMD assembly. Non-coding genes for diverse types of RNAs, including rRNA, snRNA, and miRNA, were annotated by using the Rfam database and Infernal (v.1.1.3). The tRNAscan-SE (v.2.0.5) and RNAmmer (v.1.2) were used to annotate non-coding genes for tRNA and rRNA, respectively.
The sequencing read data for genome assembly and annotation can be obtained from NCBI SRA under BioProject PRJNA967127.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset
This dataset comprises the genome assemblies and respective 8,558-loci whole-genome (wg) Multiple Locus Sequence Type (MLST) profiles [INNUENDO schema (Llarena et al. 2018) available in chewie-NS (Mamede et al. 2022)] of a final set of 1,434 Salmonella enterica samples selected among the Whole-Genome Sequencing (WGS) data publicly available in the European Nucleotide Archive (ENA) or in the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) at the beginning of the analysis (November 2021). This set of samples was carefully selected to cover a wide genetic diversity (assessed in terms of serotype). In total, 125 different serotypes are represented in this dataset, with Typhimurium (including monophasic), Enteritidis and Infantis being the most represented ones and, together, corresponding to 56.2% of the dataset.
File “Se_metadata.xlsx” contains metadata information for each isolate, including ENA/SRA accession number, BioProject and in-silico MLST ST and serotype.
The directory “assemblies/” contains all the genome assemblies (.fasta format) of each isolate presented in the metadata file.
The file “profiles/Se_profiles_wgMLST.tsv” is a tab-separated file with the 8,558-loci wgMLST profiles of each isolate presented in the metadata file. The files “profiles/Se_profiles_cgMLST_95.tsv”, “profiles/Se_profiles_cgMLST_98.tsv” and “profiles/Se_profiles_cgMLST_100.tsv” contain the 3,261-loci, 3,179-loci and 874-loci cgMLST profiles of each isolate presented in the metadata file, respectively. These profiles were determined as explained below.
Dataset selection and curation
With the objective of creating a diverse dataset of S. enterica genome assemblies, we collected information about the genetic diversity (serotype) of the isolates available in the Enterobase database at the beginning of this analysis (November 2021) and in other previous works. Based on this information, we selected an initial dataset comprising 1,779 samples associated with four BioProjects (PRJEB16326, PRJEB20997, PRJEB30335 and PRJEB39988). Their WGS data was downloaded from ENA/SRA with fastq-dl v1.0.6. Read quality control, trimming and assembly were performed with Aquamis v1.3.9 (Deneke et al. 2021) using default parameters. Assembly quality control (QC), including contamination assessment, as well as MLST ST determination were performed with the same pipeline. All genome assemblies passing the QC were included in the final dataset. Among the others, we noticed that a considerable proportion of assemblies was flagged as “QC fail” exclusively due to the “NumContamSNVs” parameter, suggesting that this setting might have been too strict. After manual inspection of a random subset, assemblies for which the percentage of reads corresponding to the correct species was >98% were recovered and integrated in the final dataset (those samples are labeled in the Metadata file). In total, 1,434 isolates passed this curation step and were included in the final dataset. In-silico serotyping was performed with SeqSero2 v1.2.1 (Zhang et al. 2019). wgMLST profiles of each of these isolates were determined with chewBBACA v2.8.5 (Silva et al. 2018), using the 8,558-loci INNUENDO schema available in chewie-NS (Llarena et al. 2018; Mamede et al. 2022) and downloaded on May 31st, 2022. Three cgMLST schemas were obtained with ReporTree v1.0.0 (Mixão et al. 2022) using the 8,558-loci wgMLST profiles of the 1,434 isolates as input and setting distinct “--site-inclusion” thresholds: 0.95, 0.98 and 1.0 (i.e., keeping schema loci called in at least 95%, 98% and 100% of the samples, resulting in 3,261-loci, 3,179-loci and 874-loci allelic matrices, respectively).
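The site-inclusion rule described above (keep a locus only if it is called in at least a given fraction of samples) can be illustrated with a minimal sketch on a toy allelic matrix. This is not ReporTree's implementation: the function name `core_loci`, the nested-dictionary layout, and the use of "0" as the missing-call code are assumptions made for the example.

```python
def core_loci(profiles, threshold, missing="0"):
    """Return loci called in at least `threshold` (fraction) of the samples.

    profiles: dict mapping sample name -> dict mapping locus -> allele call.
    missing:  value that marks an uncalled locus (assumed to be "0" here).
    """
    samples = list(profiles)
    loci = list(profiles[samples[0]])
    kept = []
    for locus in loci:
        called = sum(1 for s in samples if profiles[s][locus] != missing)
        if called / len(samples) >= threshold:
            kept.append(locus)
    return kept


if __name__ == "__main__":
    # Toy wgMLST matrix: locusA is called in 4/4 samples, locusB in 3/4, locusC in 2/4.
    toy = {
        "s1": {"locusA": "1", "locusB": "2", "locusC": "5"},
        "s2": {"locusA": "1", "locusB": "3", "locusC": "0"},
        "s3": {"locusA": "4", "locusB": "0", "locusC": "0"},
        "s4": {"locusA": "2", "locusB": "2", "locusC": "7"},
    }
    for t in (1.0, 0.75, 0.5):
        print(t, core_loci(toy, t))
```

Raising the threshold shrinks the schema, which matches the trend reported above: the 100% threshold yields far fewer loci (874) than the 95% threshold (3,261).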
This restriction site associated DNA sequencing (RAD-seq) dataset for Antarctic krill (Euphausia superba) includes raw sequence data and summaries for 148 krill from 5 Southern Ocean sites. A detailed README.pdf file is provided to describe components of the dataset. DNA library preparation was carried out in two separate batches by Floragenex (Eugene, Oregon, USA). RAD fragment libraries (SbfI) were sequenced on an Illumina HiSeq 2000 using single-end 100 bp chemistry. As there is no reference genome for Antarctic krill, a set of unique 90 bp sequences (RAD tags) was assembled from 17.3 million single-end reads from an individual krill. We obtained over a billion raw reads from the 148 krill in our study (a mean of 6.8 million reads per sample). The reference assembly contained 239,441 distinct RAD tags. The core genotype dataset exported for downstream data filtering included just those SNPs with genotype calls in at least 80% of the krill samples and contained 12,114 SNPs on 816 RAD tags.
Sample collection table (comma separated):
Southern Ocean Location, Sample Size, Austral Summer, Latitude, Longitude, ID
East Antarctica (Casey), 21, 2010/2011, 64S, 100E, Cas
East Antarctica (Mawson), 22, 2011/2012, 66S, 70E, Maw
Lazarev Sea, 38, 2004/2005 and 2007/2008, 66S, 0E, Laz
Western Antarctic Peninsula, 16, 2010/2011, 69S, 76W, WAP
Ross Sea, 23, 2012/2013, 68S, 178E, Ross
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset
This dataset comprises the genome assemblies of 1,426 Listeria monocytogenes samples collected by the BeONE Consortium on behalf of the One Health European Joint Programme “BeONE: Building Integrative Tools for One Health Surveillance” (https://onehealthejp.eu/jrp-beone/). Additionally, a complementary dataset is also made available (https://zenodo.org/record/7116878), comprising genome assemblies of 1,874 L. monocytogenes samples selected among the Whole-Genome Sequencing (WGS) data publicly available in the European Nucleotide Archive (ENA) or in the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA).
File “BeONE_Lm_metadata.xlsx” contains the genome assembly statistics for each isolate, including European Nucleotide Archive accession numbers and in-silico Multi Locus Sequence Type.
The archive “BeONE_Lm_assemblies.zip” contains all the genome assemblies (.fasta format) of each isolate presented in the metadata file.
Dataset selection and curation
This anonymized dataset of L. monocytogenes genome assemblies was generated using Next Generation Sequencing data collected within the BeONE Consortium available at the European Nucleotide Archive under BioProject Accession Number PRJEB57166. Read quality control, trimming and assembly were performed with Aquamis v1.3.9 (Deneke et al. 2021) using default parameters. Assembly quality control (QC), including contamination assessment, as well as MLST ST determination were performed with the same pipeline. All genome assemblies passing the QC were included in the final dataset. Among the others, we noticed that a considerable proportion of assemblies was flagged as “QC fail” exclusively due to the “NumContamSNVs” parameter, suggesting that this setting might have been too strict. After manual inspection of a random subset, assemblies for which the percentage of reads corresponding to the correct species was >98% were recovered and integrated in the final dataset (those samples are labeled in the Metadata file). In total, 1,426 isolates passed the dataset curation step and were included in the final dataset.
Funding
This work was supported by funding from the European Union’s Horizon 2020 Research and Innovation programme under grant agreement No 773830: One Health European Joint Programme.
https://rightsstatements.org/vocab/UND/1.0/
Proper interactions between the nucleus and cytoplasmic organelles (mitochondria and plastids) are essential to eukaryotic cellular function. To improve our understanding of the role of organellar genomes and nuclear-cytoplasmic interactions in plant development and stress response, our first aim is to survey organellar genome diversity in wheat and across the broader Triticum-Aegilops complex. This will be followed by work to assess genome dynamics across developmental stages as well as during abiotic and biotic stress response. The results of this work will be important for improving crop traits. To accomplish our goals, it was critical to first establish improved methods for the isolation, sequencing, and assembly of organellar genomes from limited starting material without whole genome amplification. As a proof of concept, we optimized our methods using Triticum aestivum cv. Chinese Spring, for which previous sequencing data are available. The mitochondrial and chloroplast genomes have large repeats (up to 10 kb and 20 kb in length, respectively). Previous studies have performed whole genome amplification and have manually stitched contigs to force a single master circle configuration of the organellar genomes, which may or may not reflect the true native state of the wheat organellar genomes. To resolve the long repeats and perform de novo assemblies without whole genome amplification and manual stitching of contigs, we utilized low-input PacBio 20 kb library preparations to generate long sequencing reads. In total, we sequenced 20 organellar-enriched samples with PacBio, including 13 diverse wild species, T. durum, T. aestivum cv. Chinese Spring, and three wheat alloplasmic lines. In addition, we generated Illumina short-read sequences for many additional cultivars, wild species, and alloplasmic lines. This project includes data for one of these samples (Aegilops uniaristata). Raw sequencing reads are deposited here.
Assemblies and annotations will be included once available.
https://spdx.org/licenses/etalab-2.0.html
Dataset overview
The MIMIC2 dataset provides a non-redundant high-quality catalog of 5.0 million genes; 6,967 Metagenome-Assembled Genomes (MAGs); and 1,252 Metagenomic Species Pangenomes (MSPs). This dataset can be used to analyze shotgun sequencing data of the murine gut microbiota.
How to use this dataset
Create a gene abundance table by aligning reads from each sample against the catalog. For this purpose, you can use Meteor or NGLess. Then, normalize raw counts by gene length.
Taxonomic profiling: the abundance of each species can be estimated as the average abundance of its first 100 core genes. To reduce the false positive rate, consider a species present only if at least 10 of these 100 marker genes are detected.
Methods
Data sources
The MIMIC2 dataset was constructed using two different data sources:
Source 1: the Mouse Gastrointestinal Bacterial Catalogue (MGBC), which is a compilation of 276 genomes from cultured isolates and 45,218 metagenome-assembled genomes (MAGs) from 1,960 publicly available mouse metagenomes.
Source 2: 68 samples from Messaoudene et al. (PRJNA783624) and 85 deeply sequenced samples from bioproject CNP0000619 published by Xiao et al.
Metagenomic assembly
De novo metagenomic assembly was performed on the 153 samples from data Source 2. First, sequencing adapter removal and read trimming were performed with fastp. Reads mapped to the host genome (GCF_000001635.27) with bowtie2 were removed with samtools. Finally, metagenomic assembly was performed with metaSPAdes. Contigs shorter than 1500 bp were removed.
MAGs recovery
Reads of each sample from data Source 2 were aligned to their respective assembly with bowtie2, and the results were stored in sorted, indexed BAM files with samtools. Then, contig coverage was computed in each sample with jgi_summarize_bam_contig_depths. MAGs were generated with MetaBAT 2, and MAG quality was assessed with checkM. MAGs with completeness < 70% or contamination > 5% or N50 < 8Kb were discarded.
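The MAG quality thresholds stated above can be written as a simple predicate. The sketch below is illustrative only: the `MagStats` record and its field names are hypothetical, and interpreting "8Kb" as 8,000 bp is an assumption.

```python
from dataclasses import dataclass


@dataclass
class MagStats:
    """Hypothetical per-MAG summary (e.g. parsed from a checkM report)."""
    name: str
    completeness: float   # percent
    contamination: float  # percent
    n50: int              # base pairs


def passes_qc(mag: MagStats) -> bool:
    """Keep a MAG unless completeness < 70%, contamination > 5% or N50 < 8 kb."""
    # Assumption: "8Kb" is read as 8,000 bp.
    return mag.completeness >= 70.0 and mag.contamination <= 5.0 and mag.n50 >= 8000


if __name__ == "__main__":
    mags = [
        MagStats("bin.1", 92.3, 1.2, 45000),
        MagStats("bin.2", 65.0, 0.5, 30000),   # too incomplete
        MagStats("bin.3", 88.0, 7.9, 12000),   # too contaminated
        MagStats("bin.4", 75.0, 4.0, 6000),    # too fragmented
    ]
    print([m.name for m in mags if passes_qc(m)])
```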
Non-redundant gene catalog
Genes were predicted on all contigs from the data Source 2 with Prodigal (parameters : -m -p meta ). Likewise, genes were predicted on all genomes from the data Source 1 (MGBC) with Prodigal (parameters : -m -p single ). Genes from the two data sources were pooled and those shorter than 90 bp or incomplete were discarded. Finally, genes were clustered with cd-hit-est (parameters -c 0.95 -aS 0.90 -G 0 -d 0 -M 0 -T 0 ) by choosing those from the longest contigs as representatives.
MSPs recovery
Samples from 19 cohorts (see below) were aligned against the non-redundant gene catalog with the Meteor software suite to produce a raw gene abundance table (5M genes quantified in 1374 samples). Then, co-abundant genes were binned in 1,252 Metagenomic Species Pan-genomes (MSPs, i.e. clusters of > 500 co-abundant genes that likely belong to the same microbial species) using MSPminer. The 19 cohorts used to recover the MSPs are: PRJNA783624, CNP0000619, PRJEB15095, PRJEB22007, PRJEB22710, PRJEB31298, PRJEB32790, PRJEB32890, PRJEB3374, PRJEB36943, PRJEB44286, PRJEB7759, PRJNA293255, PRJNA390686, PRJNA397886, PRJNA515074, PRJNA540893, PRJNA549182, PRJEB40719.
MSPs taxonomic annotation
Representative genomes of the MMGC collection were annotated with GTDB-Tk based on GTDB r202. Then, taxonomic annotation of MMGC genomes was propagated to the corresponding MSPs. For the MSPs without any corresponding MAG, taxonomic annotation was performed by alignment of all core and accessory genes against representative genomes of the GTDB database (release r202) using blastn (version 2.7.1, task = megablast, word_size = 16). A species-level assignment was given if > 50% of the genes matched the representative genome of a given species, with a mean nucleotide identity ≥ 95% and mean gene length coverage ≥ 90%. The remaining MSPs were assigned to a higher taxonomic level (genus to superkingdom), if more than 50% of their genes had the same annotation.
Construction of the phylogenetic tree
39 universal phylogenetic marker genes were extracted from the 1,252 MSPs (or the corresponding MAGs if available) with fetchMGs. Then, the markers were separately aligned with MUSCLE. The 40 alignments were merged and trimmed with trimAl (parameters: -automated1). Finally, the phylogenetic tree was computed with FastTreeMP (parameters: -gamma -pseudo -spr -mlacc 3 -slownni).
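The usage recipe given earlier for this dataset (normalize raw counts by gene length, average the first 100 core genes of a species, and require at least 10 detected markers) can be sketched as follows. This is a minimal illustration, not the Meteor or NGLess implementation: the function names, the dictionary-based inputs, and the choice to average over all 100 markers including undetected ones are assumptions.

```python
def normalize_by_length(raw_counts, gene_lengths):
    """Divide each gene's raw read count by its length in bp."""
    return {gene: raw_counts[gene] / gene_lengths[gene] for gene in raw_counts}


def species_abundance(norm_abundance, core_genes, n_markers=100, min_detected=10):
    """Estimate one species' abundance in one sample.

    core_genes is assumed to be ordered so that the first n_markers entries
    are the marker genes. Returns 0.0 when fewer than min_detected markers
    have a non-zero abundance (the species is then treated as absent).
    """
    markers = core_genes[:n_markers]
    values = [norm_abundance.get(gene, 0.0) for gene in markers]
    detected = sum(1 for v in values if v > 0)
    if detected < min_detected:
        return 0.0
    # Assumption: undetected markers contribute zeros to the average.
    return sum(values) / len(markers)


if __name__ == "__main__":
    genes = [f"g{i}" for i in range(120)]
    lengths = {g: 1000 for g in genes}
    counts = {g: 0 for g in genes}
    for g in genes[:10]:          # only 10 marker genes are detected
        counts[g] = 2000
    norm = normalize_by_length(counts, lengths)
    print(species_abundance(norm, genes))
```

Averaging over a fixed marker set keeps abundances comparable between species whose pangenomes differ in size, and the detection threshold suppresses species supported by only a few spurious alignments.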
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This zip file contains scripts, initial fastq files, assembled genomes (public and from this study) as well as bioinformatics intermediate files used for this study.
File Structure and Descriptions
├── 01.Filter.sh : Script to perform read filtering from raw fastq.gz
├── 02.RefGenome_Assembly.sh : Script to perform reference-mapping genome assembly from the trimmed fastq file
├── 03.Cleanup.sh : Script to reorganize folder and clean-up intermediate files
├── 04.GapAnalysis.sh : Script to extract individual viral segment and perform QUAST analysis to calculate number of gaps
├── 05.Phylogenetic.sh : Script to combine TiLV genome from public database (available in Phylogenetic folder) and generate Maximum likelihood tree
├── backup : Raw Fastq (basecalled with Guppy super accuracy mode)
├── Consensus : Assembled genome in fasta format
├── Coverage: Contig coverage information
├── Filtered_Fastq: Quality and length-filtered fastq used for generating the genome assembly
├── Filtered_Segment.fasta: Viral segments from samples that have been filtered for high completeness ( > 80% genome without gap)
├── GapAnalysis.tsv: Table containing the gap information for each assembled viral segment of each sample
├── Normalised_Bam: BAM alignment file used for variant calling
├── Phylogenetic: Contains crucial whole genome sequences of publicly available virus downloaded from NCBI
├── primer-schemes: Contains reference sequence used for reference-based mapping
├── Raw_Bam: Raw BAM alignment file prior to normalisation. Used to estimate the read depth observed for each sample and each viral segment
├── ReadDepth.tsv: Table file showing the read depth of each sample and its viral segment
├── Sequencing_Stat.tsv: Sequencing statistics of samples before and after length/quality filter with NanoFilt
└── TILV.tre: Newick file containing the maximum likelihood tree generated from fasttree (-gtr -nt)
8 directories, 10 files
https://spdx.org/licenses/CC0-1.0.html
Many lepidopteran caterpillars produce silk, cocoons, feeding tubes, or nests for protection from predators and parasites. Yet, the number of lepidopteran species whose silk composition has been studied in detail is very small, because the genes encoding the major structural silk proteins tend to be large and repetitive, making their assembly and sequence analysis difficult. Here we have analyzed the silk of Yponomeuta cagnagella, which represents one of the early diverging lineages of the ditrysian Lepidoptera, thus improving the coverage of the order. To obtain a comprehensive list of the Y. cagnagella silk genes, we sequenced, assembled, and annotated the draft genome using Oxford Nanopore and Illumina technologies. The 626 Mb assembly with N50 of 96.5 kb contained 96.9% insect orthologs recovered by BUSCO and 30,003 predicted gene models. We then used a silk-gland transcriptome and a silk proteome to identify major silk components and verified the tissue specificity of the expression of individual genes.
Methods
To assemble the genome of Y. cagnagella, Oxford Nanopore reads were sequenced on the Nanopore PromethION platform. In addition, an Illumina library with a 700 bp insert size was prepared and sequenced on the Illumina HiSeq 2500 with 250 bp paired-end reads. The raw reads were deposited in NCBI under SRA accession numbers SRR15714088 and SRR15714089. For genome annotation, RNA from heads, thoraces, and gonads of three male and female imagoes was extracted with TRI-Reagent. Biological replicates were pooled prior to isolation, resulting in three tissue-specific samples per sex. First, adaptor sequences and low-quality bases were filtered out of the Illumina data using Trimmomatic. Similarly, Nanopore reads shorter than 500 bp and with a quality score lower than 7 were removed from the dataset with NanoFilt. Next, the FM-index Long Read Corrector (FMLRC) was used with default settings to correct the long reads using the filtered Illumina sequences.
As recommended, ropebwt2 and fmlrc-convert were used to construct the multi-string BWT data structure required by the FMLRC pipeline. The preprocessed long reads were then assembled with Flye. To eliminate the haplotypic duplications from the primary assembly, purge_dups pipeline was applied, followed by polishing using POLCA. Repeat composition and average GC content were analysed with RepeatModeler and RepeatMasker software packages. To achieve more accurate masking, major satellites were identified with TAREAN from Illumina data subsampled to 0.25× genome coverage. A custom repeat library built from the genome sequence with RepeatModeler with added satellite dimers was used in RepeatMasker pipeline to survey the landscape of repetitive elements and generate a masked version of the Y. cagnagella assembly. To annotate the assembly, all RNA-seq data were concatenated into a single dataset, including the silk gland RNA-seq (see below) and the quality of the sequencing was verified using FastQC. The resulting data were aligned to the masked genome assembly using STAR. The genome index was generated with the following parameter scaled down to the size of Y. cagnagella genome: “--genomeSAindexNbases 13”. Genes were predicted with BRAKER and annotated using BLASTp with NCBI RefSeq invertebrate protein database, all implemented in the GenSAS platform. For analysis of silk genes, total RNA from the last larval instar silk glands was isolated using TRIzol reagent, followed by isolation of mRNA using Dynabeads Oligo (dT)25 mRNA Purification Kit, and cDNA was prepared using the NEXTflex Rapid RNA-Seq Kit. The cDNA library was sequenced on Illumina platform 2×150 bp (paired-end reads) with MiSeq. 150 bp paired-end Illumina reads were visually inspected for quality using FastQC. Adaptor sequences removal and trimming were performed using BBDUK. 
A further rRNA decontamination step was conducted using BBDUK with the associated ribokmers.fa file to eliminate rRNA remaining after the mRNA enrichment step of library preparation. Cleaned reads were assembled into a transcriptome using the multi k-mer rnaSPAdes assembler. K-mer sizes of 25, 35, 45, 55, 65, and 75 were chosen for de novo assembly to increase the likelihood of maximum transcript recovery.
Accurate quantification of gene and transcript-specific expression, with the underlying knowledge of precise transcript isoforms, is crucial to understanding many biological processes. Analysis of RNA sequencing data has benefited from the development of alignment-free algorithms which enhance the precision and speed of expression analysis. However, such algorithms require a reference transcriptome. Here we present a reference transcript dataset (LsRTDv1) for lettuce, combining long- and short-read sequencing with publicly available transcriptome annotations, and filtering to keep only transcripts with high-confidence splice junctions and transcriptional start and end sites. LsRTDv1 is a valuable resource for the investigation of transcriptional and alternative splicing regulation in lettuce.
We generated a lettuce Reference Transcript Dataset (LsRTDv1) by integrating transcript assemblies from short- and long-read RNA sequencing data with existing lettuce genome annotations. RNA sequencing data was generated from 23 different lettuce samples capturing different tissues, ages of plant and treatments. The 23 samples, all from Lactuca sativa cv. Saladin (synonymous with cv. Salinas), were combined equally into 7 samples prior to sequencing.
Short-read assembly
The RNA-seq reads of the seven pooled samples were pre-processed with Fastp (Chen et al., 2018) to remove adapters and filter low-quality reads (quality score <20, length <30). Trimmed reads were mapped to the latest lettuce reference genome assembly in NCBI (Lsat_Salinas_v11) using STAR aligner in the 2-pass mode to increase the mapping sensitivity at splice junctions (SJs) (Dobin and Gingeras, 2015). Mismatch was set to 1 with minimum and maximum intron sizes of 60 and 15,000 bp respectively. Two transcript assembl...
# LsRTDv1: A reference transcript dataset for accurate transcript-specific expression analysis in lettuce
The genome assembly of cultivated lettuce was published in 2017 (Reyes-Chin-Wo et al., 2017), with an updated genome version (version 11) available on NCBI. Here, we introduce the first lettuce reference transcript dataset (LsRTDv1), integrating long-read Iso-seq and short-read RNA-seq of diverse tissue and treatment samples from lettuce with the GenBank and RefSeq transcript annotations, using stringent quality measures. The final LsRTDv1 includes 179,404 non-redundant transcripts encoded by 65,724 genes, greatly expanding the existing lettuce transcriptome and increasing the number of transcripts per gene from 1.4 to 2.7. LsRTDv1 identifies 3,696 novel gene models, predominantly long non-coding RNAs, absent in both GenBank and RefSeq annotations.
We provide two files for the LsRTDv1: