93 datasets found
  1. Data from: LsRTDv1: A reference transcript dataset for accurate...

    • data.niaid.nih.gov
    • datadryad.org
    zip
    Updated May 29, 2024
    Cite
    Katherine Denby; Mehmet Fatih Kara; Wenbin Guo; Runxuan Zhang (2024). LsRTDv1: A reference transcript dataset for accurate transcript-specific expression analysis in lettuce [Dataset]. http://doi.org/10.5061/dryad.xwdbrv1m8
    Available download formats: zip
    Dataset updated
    May 29, 2024
    Dataset provided by
    University of York
    James Hutton Institute
    Authors
    Katherine Denby; Mehmet Fatih Kara; Wenbin Guo; Runxuan Zhang
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Accurate quantification of gene and transcript-specific expression, with the underlying knowledge of precise transcript isoforms, is crucial to understanding many biological processes. Analysis of RNA sequencing data has benefited from the development of alignment-free algorithms which enhance the precision and speed of expression analysis. However, such algorithms require a reference transcriptome. Here we present a reference transcript dataset (LsRTDv1) for lettuce, combining long- and short-read sequencing with publicly available transcriptome annotations, and filtering to keep only transcripts with high-confidence splice junctions and transcriptional start and end sites. LsRTDv1 is a valuable resource for the investigation of transcriptional and alternative splicing regulation in lettuce.

    Methods

    We generated a lettuce Reference Transcript Dataset (LsRTDv1) by integrating transcript assemblies from short- and long-read RNA sequencing data with existing lettuce genome annotations. RNA sequencing data was generated from 23 different lettuce samples capturing different tissues, ages of plant and treatments. The 23 samples, all from Lactuca sativa cv. Saladin (synonymous with cv. Salinas), were combined equally into 7 samples prior to sequencing.

    Short-read assembly

    The RNA-seq reads of the seven pooled samples were pre-processed with Fastp (Chen et al., 2018) to remove adapters and filter low-quality reads (quality score <20, length <30). Trimmed reads were mapped to the latest lettuce reference genome assembly in NCBI (Lsat_Salinas_v11) using STAR aligner in the 2-pass mode to increase the mapping sensitivity at splice junctions (SJs) (Dobin and Gingeras, 2015). Mismatch was set to 1 with minimum and maximum intron sizes of 60 and 15,000 bp respectively. Two transcript assemblers, StringTie (Pertea et al., 2015) and Scallop (Shao and Kingsford, 2017), were used to assemble transcripts for each sample. The assemblies were then merged and refined using RTDmaker (https://github.com/anonconda/RTDmaker) to remove low-quality transcripts, including redundant transcripts with identical intron combinations to longer transcripts, fragmented transcripts with length <70% of gene length, transcripts with non-canonical SJs, transcripts with SJs only supported by <5 spliced reads in <2 samples and low expressed transcripts with <1 transcript per million reads (TPM) in <2 samples.

    Long-read assembly

    We employed the IsoSeq pipeline (https://github.com/PacificBiosciences/IsoSeq) to pre-process the Iso-seq data from the seven samples. The CCS method was used to generate circular consensus sequences (CCS) from raw subreads and reads with minimum predicted accuracy <90% were discarded (--min-rq=0.9). Barcodes associated with the CCS reads were eliminated using the lima method. To further refine the reads, Isoseq3 was applied to trim poly(A) tails and identify and remove concatemers. The output of full-length, non-concatemer (FLNC) reads was mapped to the reference genome using Minimap2 (Li, 2018). TAMA-collapse was used to collapse redundant transcript models in each sample with variation at the 5' and 3' ends and at SJs not allowed (-a = 0, -m = 0 and -z = 0) to ensure high accuracy of boundaries. Reads with errors within the 10 bp up- or down-stream of a SJ were removed. TAMA-merge was used to merge transcript models from the seven samples (Kuo et al., 2020). To improve the quality of the assembly, we implemented well-established methods for SJ and transcript start site (TSS) and end site (TES) analyses previously used for Arabidopsis AtRTD3 and barley BaRTv2 (Zhang et al., 2022b; Coulter et al., 2022). We removed low-quality transcripts that exhibited non-canonical SJs and low quality SJs unless they were also present in the short-read assembly. We applied a binomial test to distinguish high-confidence TSS and TES with a false discovery rate (FDR) <0.05. For genes with limited read support, statistical testing becomes challenging, hence we also kept TSS/TES if they were supported by at least 2 Iso-seq reads. Redundancy merge was applied to transcripts if they only differed ±50 nucleotides at their TSS/TES. In addition, transcripts only supported by a single Iso-seq read were removed from the final dataset.

    Integration of multiple annotations

    We integrated four transcript annotations: the long-read assembly, short-read assembly and two versions of Lsat_Salinas_v11 genome annotations, GenBank (GCA_002870075.4) and RefSeq (GCF_002870075.4). The Iso-seq long-read assembly served as the reliable backbone, while the other three annotations were incorporated in a step-wise manner to improve the RTD completeness. Firstly, the transcripts in the short-read assembly that introduce novel SJs and/or novel gene loci were integrated into the long-read assembly. Subsequently, we added transcripts from GenBank and RefSeq annotations that contributed novel SJs or gene loci to build the lettuce RTD (LsRTDv1). In cases where two transcripts from GenBank and RefSeq had identical SJ combinations or were mono-exonic transcripts with overlapping regions exceeding 30% of both transcripts, we collapsed them to a single transcript, and the longest TSS and TES were used as the start and end point of the collapsed transcript. In LsRTDv1, the overlapped transcripts were assigned the same gene ID. However, if a set of overlapped transcripts entirely resided within the intron region of other transcripts, they were treated as intronic transcripts and assigned a different gene ID. Where the overlapped transcripts can be divided into multiple groups and the adjacent groups overlapped less than 5% of the group lengths, they were assigned separate gene IDs.
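
The collapse rule for mono-exonic GenBank and RefSeq transcripts described above can be illustrated with a minimal Python sketch. It is not the authors' code: the `Transcript` type and the function names are hypothetical, strand and chromosome are ignored, and "longest TSS and TES" is assumed to mean the outermost start and end coordinates.

```python
from typing import NamedTuple, Tuple


class Transcript(NamedTuple):
    """Hypothetical mono-exonic transcript with 1-based inclusive coordinates."""
    start: int
    end: int


def overlap_fractions(a: Transcript, b: Transcript) -> Tuple[float, float]:
    """Return the overlap length as a fraction of each transcript's length."""
    overlap = max(0, min(a.end, b.end) - max(a.start, b.start) + 1)
    return overlap / (a.end - a.start + 1), overlap / (b.end - b.start + 1)


def should_collapse(a: Transcript, b: Transcript, threshold: float = 0.30) -> bool:
    """True when the overlap exceeds the threshold for BOTH transcripts."""
    frac_a, frac_b = overlap_fractions(a, b)
    return frac_a > threshold and frac_b > threshold


def collapse(a: Transcript, b: Transcript) -> Transcript:
    """Merge two transcripts, keeping the outermost start and end (assumption)."""
    return Transcript(min(a.start, b.start), max(a.end, b.end))


t1 = Transcript(100, 1000)
t2 = Transcript(500, 1200)
if should_collapse(t1, t2):
    merged = collapse(t1, t2)
```

Requiring the threshold on both transcripts keeps a short transcript nested near the end of a much longer one from being merged on the strength of its own overlap fraction alone.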

  2. Data from: Adaptively integrated sequencing and assembly of near-complete...

    • search.dataone.org
    Updated Jul 22, 2025
    Cite
    Hasindu Gamaarachchi; Igor Stevanovski; Jillian M. Hammond; Andre L.M. Reis; Melissa Rapadas; Kavindu Jayasooriya; Tonia Russell; Dennis Yeow; Yvonne Hort; Andrew J. Mallett; Elaine Stackpoole; Lauren Roman; Luke W. Silver; Carolyn J. Hogg; Lou Streeting; Ozren Bogdanovic; Renata Rodrigues; Luis Nascimento; Adauto Lima Cardoso; Arthur Georges; Haoyu Cheng; Hardip R. Patel; Kishore R. Kumar; Amali C. Mallawaarachchi; Ira W. Deveson (2025). Adaptively integrated sequencing and assembly of near-complete genomes [Dataset]. http://doi.org/10.5061/dryad.kkwh70sfr
    Dataset updated
    Jul 22, 2025
    Dataset provided by
    Dryad Digital Repository
    Authors
    Hasindu Gamaarachchi; Igor Stevanovski; Jillian M. Hammond; Andre L.M. Reis; Melissa Rapadas; Kavindu Jayasooriya; Tonia Russell; Dennis Yeow; Yvonne Hort; Andrew J. Mallett; Elaine Stackpoole; Lauren Roman; Luke W. Silver; Carolyn J. Hogg; Lou Streeting; Ozren Bogdanovic; Renata Rodrigues; Luis Nascimento; Adauto Lima Cardoso; Arthur Georges; Haoyu Cheng; Hardip R. Patel; Kishore R. Kumar; Amali C. Mallawaarachchi; Ira W. Deveson
    Description

    Recent advances in long-read sequencing (LRS) and assembly algorithms have made it possible to create highly complete genome assemblies for humans, animals, plants, and other eukaryotes. However, there is a need for ongoing development to improve accessibility and affordability of the required data, increase the range of usable sample types, and reliably resolve the most challenging, repetitive genome regions. 'Cornetto' is a new experimental paradigm in which the genome assembly process is adaptively integrated with programmable selective nanopore sequencing, with target regions being iteratively updated to focus LRS data production onto the unsolved regions of a nascent assembly. This improves assembly quality and streamlines the process, both for human individuals and diverse non-human vertebrates, including endemic Australian endangered species, tested here. Cornetto enables us to generate highly complete diploid human genome assemblies using only a single LRS platform, surpassing t...

    Assemblies from the Cornetto Adaptive sampling method

    Dataset DOI: 10.5061/dryad.kkwh70sfr

    Description of the data and file structure

    These assemblies were generated using the cornetto adaptive sampling method described in our manuscript. The documentation and source code used for the process are available at https://github.com/hasindu2008/cornetto.

    Files and variables

    There are two files, namely cornetto-hg002-asm.tar.gz and cornetto-animal-asm.tar.gz that are described below.

    File: cornetto-hg002-asm.tar.gz

    Description: This tarball contains all the FASTA assemblies for the hg002 sample. The files below are found inside the cornetto-hg002-asm directory created when you download and extract cornetto-hg002-asm.tar.gz.

    | Assembly name | Assembly Description | Assembly File ...,

  3. Data from polishCLR: Example input genome assemblies

    • catalog.data.gov
    • agdatacommons.nal.usda.gov
    Updated Dec 2, 2025
    Cite
    Agricultural Research Service (2025). Data from polishCLR: Example input genome assemblies [Dataset]. https://catalog.data.gov/dataset/data-from-polishclr-example-input-genome-assemblies-93d8e
    Dataset updated
    Dec 2, 2025
    Dataset provided by
    Agricultural Research Service
    Description

    [ NOTE - Data files added 2022-11-01: Test long reads - test.1.filtered.bam_.gz Test short reads R1 - testpolish_R1.fastq Test short reads R2 - testpolish_R2.fastq Chromosome 30 of H. zea - GCF_022581195.2_ilHelZeax1.1_chr30.fasta ]

    In order to produce the best possible de novo, chromosome-scale genome assembly from error-prone Pacific Biosciences continuous long reads (CLR), we developed a publicly available, flexible and reproducible workflow, called polishCLR, that is containerized so it can be run on any conventional HPC. This dataset provides example input primary contig assemblies to test and reproduce the demonstrated utility of our workflow. The polishCLR workflow can be easily initiated from three input cases:

    Case 1: An unresolved primary assembly with associated contigs, the output of FALCON 2-asm: p_ctg.fasta and a_ctg.fasta
    Case 2: A haplotype-resolved but unpolished set, the output of FALCON-Unzip 3-unzip: all_p_ctg.fasta and all_h_ctg.fasta
    Case 3: A haplotype-resolved, CLR long-read, Arrow-polished set of primary and alternate contigs, the output of FALCON-Unzip 4-polish: cns_p_ctg.fasta and cns_h_ctg.fasta

    These example data are the input contig assemblies for the pest Helicoverpa zea. These contigs are built from 49.89 Gb of raw Pacific Biosciences (PacBio) CLR data generated from a single H. zea HzStark_Cry1AcR strain male. Adult H. zea were collected near the USDA-ARS Genetics and Sustainability Agricultural Research Unit, Starkville, MS, USA in 2011, and transported to and maintained in a colony at the USDA Southern Insect Management Unit (SIMRU), Stoneville, MS, USA as described previously. Larvae were selected on a diagnostic dose of 2.0 μg ml-1 purified Cry1Ac, and survivors were used to create the strain, HzStark_Cry1AcR. HzStark_Cry1AcR was back-crossed every 5 generations to a susceptible line maintained at USDA-ARS SIMRU. A single male pupa (homogametic, ZZ sex chromosome) from HzStark_Cry1AcR was dissected laterally into eight ~20 μg sections. High molecular weight DNA was extracted. PacBio libraries were generated from unsheared DNA using a SMRTbell Express Template Prep Kit 2.0 (Pacific Biosciences, Menlo Park, CA, USA), and 20 hour run time movies were generated on a single SMRT Cell 1M v3 using the Sequel I system (Pacific Biosciences). The raw continuous long read (CLR) subread bam files were converted to fastq format using bamtools v. 2.5.1 (Barnett et al. 2011), then used as input for the Falcon assembler (Chin et al. 2016) using the pb-assembly conda environment v. 0.0.8.1 (Pacific Biosciences; default parameters). Falcon-Unzip created primary and alternate contigs with one round of haplotype-aware polishing by Arrow (Pacific Biosciences).

    Resources in this dataset:

    Resource Title: Associated assembly contigs output from FALCON/2-asm-falcon. File Name: a_ctg_all.fasta
    Resource Title: Primary assembly contigs output from FALCON/2-asm-falcon. File Name: p_ctg.fasta
    Resource Title: Alternate haplotype assembly contigs output from FALCON Unzip 3-unzip. File Name: all_h_ctg.fasta
    Resource Title: Primary assembly contigs output from FALCON Unzip 3-unzip. File Name: all_p_ctg.fasta
    Resource Title: Alternate assembly contigs output from FALCON Unzip 4-polish. File Name: cns_h_ctg.fasta
    Resource Title: Primary assembly contigs output from FALCON Unzip 4-polish. File Name: cns_pctg.fasta
    Resource Title: Test long reads. File Name: test.1.filtered.bam.gz. Resource Description: For testing the pipeline, long reads that map to H. zea chromosome 30
    Resource Title: Test short reads R1. File Name: testpolish_R1.fastq. Resource Description: Short reads aligned to Chromosome 30 of H. zea
    Resource Title: Test short reads R2. File Name: testpolish_R2.fastq. Resource Description: Reverse pair (R2) short reads aligned to Chromosome 30 of H. zea
    Resource Title: Chromosome 30 of H. zea. File Name: GCF_022581195.2_ilHelZeax1.1_chr30.fasta
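
The three input cases in the description pair each starting point with two FASTA files. The Python sketch below shows one way a user might check which case a directory listing corresponds to. It is a hypothetical helper, not part of polishCLR: the function name, and the choice to prefer the most processed case when several are present, are assumptions.

```python
from typing import Iterable, Optional

# File pairs as named in the three input cases of the description.
INPUT_CASES = {
    1: ("p_ctg.fasta", "a_ctg.fasta"),          # FALCON 2-asm
    2: ("all_p_ctg.fasta", "all_h_ctg.fasta"),  # FALCON-Unzip 3-unzip
    3: ("cns_p_ctg.fasta", "cns_h_ctg.fasta"),  # FALCON-Unzip 4-polish
}


def detect_input_case(filenames: Iterable[str]) -> Optional[int]:
    """Return the case number whose two files are both present, else None.

    When several cases match, the highest (most processed) one is returned;
    that tie-break is an assumption made for this sketch.
    """
    present = set(filenames)
    for case in sorted(INPUT_CASES, reverse=True):
        if all(name in present for name in INPUT_CASES[case]):
            return case
    return None
```

Note that some file names in the resource list (for example a_ctg_all.fasta) differ from the names in the case descriptions, so exact-name matching like this may need adjusting to the files actually downloaded.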

  4. Next-generation sequencing read statistics and sequencing coverage for the...

    • plos.figshare.com
    • datasetcatalog.nlm.nih.gov
    xls
    Updated May 31, 2023
    Cite
    Juhana I. Kammonen; Olli-Pekka Smolander; Lars Paulin; Pedro A. B. Pereira; Pia Laine; Patrik Koskinen; Jukka Jernvall; Petri Auvinen (2023). Next-generation sequencing read statistics and sequencing coverage for the sample datasets. [Dataset]. http://doi.org/10.1371/journal.pone.0216885.t001
    Available download formats: xls
    Dataset updated
    May 31, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Juhana I. Kammonen; Olli-Pekka Smolander; Lars Paulin; Pedro A. B. Pereira; Pia Laine; Patrik Koskinen; Jukka Jernvall; Petri Auvinen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Next-generation sequencing read statistics and sequencing coverage for the sample datasets.

  5. Data_Sheet_1_Completing Circular Bacterial Genomes With Assembly Complexity...

    • datasetcatalog.nlm.nih.gov
    • figshare.com
    Updated Sep 4, 2019
    Cite
    Kuo, Shu-Chen; Wu, Han-Chieh; Chen, Feng-Jui; Cheng, Hung-Wei; Lauderdale, Tsai-Ling Yang; Liao, Yu-Chieh (2019). Data_Sheet_1_Completing Circular Bacterial Genomes With Assembly Complexity by Using a Sampling Strategy From a Single MinION Run With Barcoding.PDF [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000092025
    Dataset updated
    Sep 4, 2019
    Authors
    Kuo, Shu-Chen; Wu, Han-Chieh; Chen, Feng-Jui; Cheng, Hung-Wei; Lauderdale, Tsai-Ling Yang; Liao, Yu-Chieh
    Description

    The Oxford Nanopore MinION is an affordable and portable DNA sequencer that can produce very long reads (tens of kilobase pairs), which enable de novo bacterial genome assembly. Although many algorithms and tools have been developed for base calling, read mapping, de novo assembly, and polishing, an automated pipeline is not available for one-stop analysis for circular bacterial genome reconstruction. In this paper, we present the pipeline CCBGpipe for completing circular bacterial genomes. Raw current signals are demultiplexed and base called to generate sequencing data. Sequencing reads are de novo assembled several times by using a sampling strategy to produce circular contigs that have a sequence in common between their start and end. The circular contigs are polished by using raw signals and sequencing reads; then, duplicated sequences are removed to form a linear representation of circular sequences. The circularized contigs are finally rearranged to start at the start position of dnaA/repA or a replication origin based on the GC skew. CCBGpipe implemented in Python is available at https://github.com/jade-nhri/CCBGpipe. Using sequencing data produced from a single MinION run, we obtained 48 circular sequences, comprising 12 chromosomes and 36 plasmids of 12 bacteria, including Acinetobacter nosocomialis, Acinetobacter pittii, and Staphylococcus aureus. With adequate quantities of sequencing reads (80×), CCBGpipe can provide a complete and automated assembly of circular bacterial genomes.
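
The description notes that circular contigs share a sequence between their start and end, and that the duplicated sequence is removed to give a linear representation. A minimal Python sketch of that idea follows. It is not CCBGpipe's implementation: the function names are hypothetical, and it assumes an exact match between the two ends, whereas real long-read contigs contain errors and would need an alignment-based comparison.

```python
def end_overlap_length(seq: str, min_overlap: int = 3) -> int:
    """Length of the longest prefix of seq that is also its suffix.

    Only overlaps of at most half the sequence length are considered.
    Returns 0 when no overlap of at least min_overlap bases is found.
    """
    for k in range(len(seq) // 2, min_overlap - 1, -1):
        if seq[:k] == seq[-k:]:
            return k
    return 0


def linearize(seq: str, min_overlap: int = 3) -> str:
    """Drop the duplicated end so each base of the circle appears once."""
    k = end_overlap_length(seq, min_overlap)
    return seq[:-k] if k else seq


contig = "ACGTT" + "GCAAGGC" + "ACGTT"  # toy contig whose ends repeat
unit = linearize(contig)
```

A contig for which `end_overlap_length` returns 0 would, under this simplified view, not be treated as circular.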

  6. Data from: Diaphorina citri genome assembly Diaci 1.1

    • agdatacommons.nal.usda.gov
    • search.datacite.org
    application/gzip
    Updated Feb 8, 2024
    Cite
    Nan Leng; Adam English; Shannon Johnson; Stephen Richards; Wayne B. Hunter; Surya Saha (2024). Diaphorina citri genome assembly Diaci 1.1 [Dataset]. http://doi.org/10.15482/USDA.ADC/1342728
    Available download formats: application/gzip
    Dataset updated
    Feb 8, 2024
    Dataset provided by
    Ag Data Commons
    Authors
    Nan Leng; Adam English; Shannon Johnson; Stephen Richards; Wayne B. Hunter; Surya Saha
    License

    U.S. Government Works: https://www.usa.gov/government-works
    License information was derived automatically

    Description

    The International Psyllid Genome Consortium has generated a genome assembly Diaci 1.1 of the Asian citrus psyllid (Diaphorina citri). This assembly is also available and accessioned at the National Center for Biotechnology Information: https://www.ncbi.nlm.nih.gov/assembly/GCF_000475195.1/.

    DNA extraction and library preparation

    High-molecular weight DNA was extracted using the BioRad AquaPure Genomic DNA isolation kit from fresh intact D. citri collected from a citrus grove in Ft. Pierce, FL and reared at the USDA, ARS, U.S. Horticultural Research Laboratory, Ft. Pierce, FL. To generate PacBio libraries, DNA was sheared using a Covaris g-Tube and SMRT-bell library was prepared using the 10Kb protocol (PacBio DNA template prep kit 2.0; 3-10Kb), cat #001-540-835.

    Genome sequencing and assembly

    Samples were prepared for sequencing using the TruSeq DNA library preparation kits for paired-end as well as long-insert mate-pair libraries. All were sequenced on the Illumina HiSeq2000 using 100bp or longer reads. Seven libraries were sequenced, with inserts ranging from "short" (ca. 275bp) to 10Kb. These are available in NCBI SRA and included 99.7 million paired-end reads (NCBI SRA: SRX057205), 35.1 million 2kb mate-pair reads (NCBI SRA: SRX057204), 30 million 5kb mate-pair reads (NCBI SRA: SRX058250) and 30 million 10kb mate-pair reads (NCBI SRA: SRX216330). A second round of DNA sequencing was done with PacBio at 12X coverage (NCBI SRA: SRX218985) for scaffolding the Diaci1.0 Illumina assembly to create the Diaci1.1 version of the D. citri genome. Thirty-nine SMRTcells of the library were sequenced, all with 2×45 minute movies. A total of 2,750,690 post-filter reads were generated, with an average of 70,530 reads per SMRTcell. The post-filter mean read length was 2,504 bp with an error rate of 15%. Velvet was used with kmer 59 for generating the original assembly. PacBio long reads were mapped to the draft assembly using blasr with the following parameters: -minMatch 8 -minPctIdentity 70 -bestn 5 -nCandidates 30 -maxScore -500 -nproc 8 -noSplitSubreads. These alignments were parsed using PBJelly with default parameters.

    Resources in this dataset:

    Resource Title: Diaphorina citri genome assembly Diaci 1.1. File Name: 121845_ref_Diaci_psyllid_genome_assembly_version_1.1_chrUn.mfa_.gz. Resource Description: This resource contains a fasta file of the Diaphorina citri genome assembly Diaci 1.1.
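
The PacBio read statistics quoted in the description can be cross-checked with a few lines of arithmetic. The Python sketch below uses only the numbers stated there; the variable names are illustrative, and because the mean read length is itself a rounded figure, the total yield is an approximation.

```python
# Figures quoted in the description above.
total_reads = 2_750_690      # post-filter reads
smrt_cells = 39              # SMRTcells sequenced
mean_read_length_bp = 2_504  # post-filter mean read length

# Whole reads per SMRTcell (integer division drops the fractional part).
reads_per_cell = total_reads // smrt_cells

# Approximate total post-filter yield in bases.
total_bases = total_reads * mean_read_length_bp

print(f"{reads_per_cell:,} reads per SMRTcell")
print(f"{total_bases / 1e9:.2f} Gb of post-filter sequence")
```

The per-cell figure agrees with the 70,530 reads per SMRTcell reported in the description.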

  7. Data from: A high-quality genome assembly and annotation of the gray...

    • data.niaid.nih.gov
    • search.dataone.org
    zip
    Updated Feb 17, 2022
    Cite
    Guillermo Friis; Joel Vizueta; David R. Nelson; Basel Khraiwesh; Enas Qudeimat; Kourosh Salehi-Ashtiani; Alejandra Ortega; Alyssa Marshell; Carlos M. Duarte; Edward Smith (2022). A high-quality genome assembly and annotation of the gray mangrove, Avicennia marina [Dataset]. http://doi.org/10.5061/dryad.3j9kd51f5
    Available download formats: zip
    Dataset updated
    Feb 17, 2022
    Dataset provided by
    Sultan Qaboos University
    King Abdullah University of Science and Technology
    New York University Abu Dhabi
    Universitat de Barcelona
    Authors
    Guillermo Friis; Joel Vizueta; David R. Nelson; Basel Khraiwesh; Enas Qudeimat; Kourosh Salehi-Ashtiani; Alejandra Ortega; Alyssa Marshell; Carlos M. Duarte; Edward Smith
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    The gray mangrove [Avicennia marina (Forsk.) Vierh.] is the most widely distributed mangrove species, ranging throughout the Indo-West Pacific. It presents remarkable levels of geographic variation both in phenotypic traits and habitat, often occupying extreme environments at the edges of its distribution. However, subspecific evolutionary relationships and adaptive mechanisms remain understudied, especially across populations of the West Indian Ocean. High-quality genomic resources accounting for such variability are also sparse. Here we report the first chromosome-level assembly of the genome of A. marina. We used a previously released draft assembly and the proximity ligation libraries Chicago and Dovetail HiC for scaffolding, producing a 456,526,188 bp long genome. The largest 32 scaffolds (22.4 Mb to 10.5 Mb) accounted for 98% of the genome assembly, with the remaining 2% distributed among 3,759 much shorter scaffolds (62.4 Kb to 1 Kb). We annotated 45,032 protein-coding genes using tissue-specific RNA-seq data in combination with de novo gene prediction, of which 34,442 were associated with GO terms. The genome assembly and the annotated gene set yield completeness scores of 96.7% and 95.1%, respectively, when compared with the eudicots BUSCO dataset. Furthermore, an FST survey based on resequencing data successfully identified a set of candidate genes potentially involved in local adaptation, and revealed patterns of adaptive variability correlating with a temperature gradient in Arabian mangrove populations. Our A. marina genomic assembly provides a highly valuable resource for genome evolution analysis, as well as for identifying functional genes involved in adaptive processes and speciation.

    Methods

    Genome sequencing and assembly

    The sequenced sample was leaf tissue obtained from an individual located at Ras Ghurab Island in the Arabian Gulf (Abu Dhabi, United Arab Emirates; 24.601°N, 4.566 °E), corresponding to the A. m. marina variety. A high-quality genome was produced using proximity ligation libraries and the software pipeline HiRise at Dovetail Genomics, LLC. Briefly, for Chicago and the Dovetail HiC library preparation, chromatin was fixed with formaldehyde. Fixed chromatin was then digested with DpnII and free blunt ends were ligated. Crosslinks were reversed, and the DNA was purified from protein and then sheared to ~350 bp mean fragment size. Libraries were generated using NEBNext Ultra enzymes and Illumina-compatible adapters, and sequencing was carried out on an Illumina HiSeq X platform. Chicago and Dovetail HiC library reads were then used as input data for HiRise, a software pipeline designed specifically for using proximity ligation data to scaffold genome assemblies. A previously reported draft genome of Avicennia marina (GenBank accession: GCA_900003535.1) was used in the assembly pipeline, excluding scaffolds shorter than 1Kb since HiRise does not assemble them.

    The mitochondrial genome was assembled using NOVOplasty2.7.2 and resequencing data based on Illumina paired-end 150 bp libraries from a conspecific individual (See below; Supplementary Information). The maturase (matR) mitochondrial gene available in NCBI (GenBank accession no. AY289666.1) was used for the input seed sequence.

    Genome annotation

    We performed the annotation of the A. marina genome using mRNA data from a set of tissues of conspecific individuals in combination with de novo gene prediction using BRAKER2 v2.1.5 (Hoff et al. 2016). Samples were collected on the coast of the Eastern Central Red Sea north of Jeddah in the Kingdom of Saudi Arabia (22.324 °N, 39.100 °E; Figure 1A). Total RNA was isolated from root, stem, leaf, flower, and seed using TRIzol reagent (Invitrogen, USA). RNA-seq libraries were prepared using TruSeq RNA sample prep kit (Illumina, Inc.), with inserts that range in size from approximately 100-400 bp. Library quality control and quantification were performed with a Bioanalyzer Chip DNA 1000 series II (Agilent), and libraries were sequenced on a HiSeq2000 platform (Illumina, Inc.). First, repetitive regions were modelled ab initio using RepeatModeler v2.0.1 (Flynn et al. 2019) in all scaffolds longer than 100 Kb with default options. The resulting repeat library was used to annotate and soft-mask repeats in the genome assembly with RepeatMasker 4.0.9 (Smit et al. 2015). Next, messenger RNA reads were mapped against the soft-masked genome assembly with HISAT2 (Kim et al. 2015). Gene prediction was conducted with BRAKER2 using both the RNA-seq data and the conserved orthologous genes from BUSCO Eudicots_odb10 as proteins from short evolutionary distance to provide hints and train GeneMark-ETP and Augustus (--etpmode; Hoff et al. 2019; Bruna et al. 2020; Lomsadze et al. 2005; Buchfink et al. 2015; Gotoh 2008; Iwata and Gotoh 2012; Li et al. 2009; Barnett et al. 2011; Lomsadze et al. 2014; Stanke et al. 2008; Stanke et al. 2006). The obtained gene annotation gff3 file was validated and used to generate the reported gene annotation statistics with GenomeTools (Gremme et al. 2013) and in-house Perl scripts. Finally, we conducted a similarity-based approach to assist the functional annotation of the predicted proteins. We integrated InterProScan v5.31 (Jones et al. 2014) and BLAST (Tatusova and Madden 1999) searches using the UniProt Swiss-Prot database and the annotated proteins from the Arabidopsis thaliana genome (UniProt Consortium 2019) to generate a final set of annotated functional genes.

    Variant calling from resequencing data

    Whole genome resequencing was carried out for the 60 individuals from 6 populations around the Arabian Peninsula at Novogene facilities. Illumina paired-end 150 bp libraries with insert size equal to 350 bp were prepared and sequenced on a Novaseq platform. A total of 2.4G reads were produced, resulting in a mean coverage per site and sample of 85X before filtering. Read quality was evaluated using FASTQC after sorting reads by individual with AXE. Trimming and quality filtering were conducted using Trim Galore, resulting in a set of reads ranging between 90 and 138 bp long. Reads were then mapped against the A. marina reference genome using the mem algorithm in the Burrows-Wheeler Aligner. Read groups were assigned and BAM files generated with Picard Tools version 1.126. We used the HaplotypeCaller + GenotypeGVCFs tools from the Genome Analysis Toolkit version 3.6-0 to produce a set of single nucleotide polymorphisms (SNPs) in the variant call format (vcf). Genotype quality and missing data filters for downstream analyses were implemented with vcftools. Samples with less than 25% of the sites genotyped were discarded. Then, a SNP matrix was constructed excluding those out of a range of coverage between 4 and 50 or with a genotyping phred quality score below 40. Positions for which one or more samples were not genotyped were removed, along with those presenting a minor allele count (MAC) below 3. Only the SNPs from the 32 major scaffolds were retained. A threshold for SNPs showing highly significant deviations from Hardy-Weinberg equilibrium (HWE) with a p-value of 10^-4 was also implemented to filter out false variants arising from the alignment of paralogous loci. The final dataset consisted of 56 samples and 538,185 SNPs.
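
The per-site SNP filters described above (coverage between 4 and 50, genotype quality of at least 40, no missing genotypes, minor allele count of at least 3) can be sketched in Python as below. The authors applied these filters with vcftools, so this is only an illustration of the logic: the `Call` type and function names are hypothetical, sites are assumed biallelic, and a genotype that fails the depth or quality threshold is assumed to count as not genotyped.

```python
from collections import Counter
from typing import NamedTuple, Optional, Sequence, Tuple


class Call(NamedTuple):
    """Hypothetical diploid genotype call for one sample at one site."""
    alleles: Optional[Tuple[int, int]]  # None means not genotyped
    depth: int                          # read depth at the site
    gq: int                             # phred-scaled genotype quality


def call_passes(call: Call, min_dp: int = 4, max_dp: int = 50, min_gq: int = 40) -> bool:
    """True when the call is genotyped and within the depth and quality limits."""
    return (call.alleles is not None
            and min_dp <= call.depth <= max_dp
            and call.gq >= min_gq)


def minor_allele_count(calls: Sequence[Call]) -> int:
    """Count of the rarer allele across all samples (0 for a monomorphic site)."""
    counts = Counter(a for c in calls for a in c.alleles)
    return min(counts.values()) if len(counts) > 1 else 0


def keep_site(calls: Sequence[Call], min_mac: int = 3) -> bool:
    """Keep a site only if every sample passes and the MAC threshold is met."""
    if not all(call_passes(c) for c in calls):
        return False
    return minor_allele_count(calls) >= min_mac


site = [Call((0, 1), 10, 60), Call((1, 1), 20, 50), Call((0, 0), 8, 45)]
kept = keep_site(site)
```

The scaffold and Hardy-Weinberg filters mentioned in the description operate on other information (scaffold identity, genotype frequencies) and are left out of this sketch.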

  8. Data from: Assignment of virus and antimicrobial resistance genes to...

    • agdatacommons.nal.usda.gov
    • datasetcatalog.nlm.nih.gov
    bin
    Updated Feb 13, 2024
    Cite
    Derek M. Bickhart; Mick Watson; Sergey Koren; Kevin Panke-Buisse; Laura M. Cersosimo; Maximilian O. Press; Curtis P. Van Tassell; Jo Ann S. Van Kessel; Bradd J. Haley; Seon Woo Kim; Cheryl Heiner; Garret Suen; Kiranmayee Bakshy; Ivan Liachko; Shawn T. Sullivan; Phillip R. Myer; Jay Ghurye; Mihai Pop; Paul J. Weimer; Adam M. Phillippy; Timothy P. L. Smith (2024). Data from: Assignment of virus and antimicrobial resistance genes to microbial hosts in a complex microbial community by combined long-read assembly and proximity ligation [Dataset]. https://agdatacommons.nal.usda.gov/articles/dataset/Data_from_Assignment_of_virus_and_antimicrobial_resistance_genes_to_microbial_hosts_in_a_complex_microbial_community_by_combined_long-read_assembly_and_proximity_ligation/24853515
    Available download formats: bin
    Dataset updated
    Feb 13, 2024
    Dataset provided by
    Genome Biology
    Authors
    Derek M. Bickhart; Mick Watson; Sergey Koren; Kevin Panke-Buisse; Laura M. Cersosimo; Maximilian O. Press; Curtis P. Van Tassell; Jo Ann S. Van Kessel; Bradd J. Haley; Seon Woo Kim; Cheryl Heiner; Garret Suen; Kiranmayee Bakshy; Ivan Liachko; Shawn T. Sullivan; Phillip R. Myer; Jay Ghurye; Mihai Pop; Paul J. Weimer; Adam M. Phillippy; Timothy P. L. Smith
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We describe a method that adds long-read sequencing to a mix of technologies used to assemble a highly complex cattle rumen microbial community, and provide a comparison to short read-based methods. Long-read alignments and Hi-C linkage between contigs support the identification of 188 novel virus-host associations and the determination of phage life cycle states in the rumen microbial community. The long-read assembly also identifies 94 antimicrobial resistance genes, compared to only seven alleles in the short-read assembly. We demonstrate novel techniques that work synergistically to improve characterization of biological features in a highly complex rumen microbial community. We demonstrate the benefits of using multiple sequencing technologies and proximity ligation in identifying unique biological facets of the cattle rumen metagenome, and we present data that suggests that each has a unique niche in downstream analysis. Our comparison identified biases in the sampling of different portions of the community by each sequencing technology, suggesting that a single DNA sequencing technology is insufficient to characterize complex metagenomic samples. Using a combination of long-read alignments and proximity ligation, we identified putative hosts for assembled bacteriophage at a resolution previously unreported in other rumen surveys. These host-phage assignments support previous work that revealed increased viral predation of sulfur-metabolizing bacterial species; however, we were able to provide a higher resolution of this association, identify potential auxiliary metabolic genes related to sulfur metabolism, and identify phage that may target a diverse range of different bacterial species. Furthermore, we found evidence to support that these viruses have a lytic life cycle due to a higher proportion of Hi-C intercontig link association data in our analysis. 
Finally, it appears that there may be a high degree of mobile DNA that was heretofore uncharacterized in the rumen and that this mobile DNA may be shuttling antimicrobial resistance gene alleles among distantly related species. These unique characteristics of the rumen microbial community would be difficult to detect without the use of several different methods and techniques that we have refined in this study, and we recommend that future surveys incorporate these techniques to further characterize complex metagenomic communities. Datasets generated and/or analyzed during the current study are available in the NCBI SRA repository under Bioproject: PRJNA507739. Assemblies, bins, and ORF predictions are available on Figshare. A description of commands, scripts, and other materials used to analyze the data in this project is available in the GitHub repository: https://github.com/njdbickhart/RumenLongReadASM and also on Zenodo. Resources in this dataset: Resource Title: Availability of data and materials. File Name: Web Page, url: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1760-x#availability-of-data-and-materials

  9. Sample datasets for E. coli C-1 genome assembly tutorial

    • zenodo.org
    application/gzip
    Updated Jan 24, 2020
    + more versions
    Cite
    Han Mai; Mallory Freeberg; James Taylor; Anton Nekrutenko (2020). Sample datasets for E. coli C-1 genome assembly tutorial [Dataset]. http://doi.org/10.5281/zenodo.931765
    Explore at:
    Available download formats: application/gzip
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo http://zenodo.org/
    Authors
    Han Mai; Mallory Freeberg; James Taylor; Anton Nekrutenko
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset contains three files:

    1. Illumina_f.fq.gz - forward reads from MiSeq run
    2. Illumina_r_fq.gz - reverse reads from MiSeq run
    3. minion_2d.fq.gz - Oxford Nanopore 2d reads

    Sequencing was done on genomic DNA of E. coli strain C-1 obtained from Yale Stock Center.
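
    A quick sanity check after downloading the three files is to count the reads in each one. The sketch below is a minimal example rather than part of the tutorial itself; it assumes standard four-line FASTQ records, and the function name `count_fastq_reads` is chosen here for illustration.

```python
import gzip


def count_fastq_reads(path):
    """Count records in a gzipped FASTQ file.

    Assumes the conventional layout of exactly four lines per record
    (header, sequence, '+' separator, qualities) with no blank lines.
    """
    with gzip.open(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        raise ValueError(f"{path}: {n_lines} lines is not a multiple of 4")
    return n_lines // 4


# Example call (the file must already be in the working directory):
# print(count_fastq_reads("minion_2d.fq.gz"))
```

    For paired-end data, the forward and reverse files would normally be expected to report the same count.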

  10. Data from: A High-Quality Genome Assembly from a Single, Field-collected...

    • agdatacommons.nal.usda.gov
    • catalog.data.gov
    zip
    Updated Nov 21, 2025
    + more versions
    Cite
    Sarah Kingan; Julie Urban; Christine Lambert; Primo Baybayan; Anna Childers; Brad Coates; Brian Scheffler; Kevin Hackett; Jonas Korlach; Scott M. Geib (2025). Data from: A High-Quality Genome Assembly from a Single, Field-collected Spotted Lanternfly (Lycorma delicatula) using the PacBio Sequel II System [Dataset]. http://doi.org/10.15482/USDA.ADC/1503745
    Explore at:
    Available download formats: zip
    Dataset updated
    Nov 21, 2025
    Dataset provided by
    Ag Data Commons
    Authors
    Sarah Kingan; Julie Urban; Christine Lambert; Primo Baybayan; Anna Childers; Brad Coates; Brian Scheffler; Kevin Hackett; Jonas Korlach; Scott M. Geib
    License

    U.S. Government Works https://www.usa.gov/government-works
    License information was derived automatically

    Description

    A high-quality reference genome is an essential tool for applied and basic research on arthropods. Long-read sequencing technologies may be used to generate more complete and contiguous genome assemblies than alternative technologies; however, long-read methods have historically had greater input DNA requirements and higher costs than next-generation sequencing, which are barriers to their use on many samples. Here, we present a 2.3 Gb de novo genome assembly of a field-collected adult female Spotted Lanternfly (Lycorma delicatula) using a single PacBio SMRT Cell. The Spotted Lanternfly is an invasive species recently discovered in the northeastern United States, threatening to damage economically important crop plants in the region. The DNA from one individual female specimen collected in Reading, Berks County, Pennsylvania was used to make one standard, size-selected library with an average DNA fragment size of ~20 kb. The library was run on one Sequel II SMRT Cell 8M, generating a total of 132 Gb of long-read sequences, of which 82 Gb were from unique library molecules, representing approximately 38x coverage of the genome. The assembly had high contiguity (contig N50 length = 1.5 Mb), completeness, and sequence level accuracy as estimated by conserved gene set analysis (96.8% of conserved genes both complete and without frame shift errors). Further, it was possible to segregate more than half of the diploid genome into the two separate haplotypes. The assembly also recovered two microbial symbiont genomes known to be associated with L. delicatula, each microbial genome being assembled into a single contig. We demonstrate that field-collected arthropods can be used for the rapid generation of high-quality genome assemblies, an attractive approach for projects on emerging invasive species, disease vectors, or conservation efforts of endangered species. 
Supporting files for the manuscript "A High-Quality Genome Assembly from a Single, Field-collected Spotted Lanternfly (Lycorma delicatula) using the PacBio Sequel II System" include several intermediate versions of the assembly (raw output from Falcon, raw output from Falcon unzip, etc.) as well as the final assembly primary contigs and haplotigs (for the regions of the genome that were phased). Resources in this dataset:

    1. Resource Title: Final Assembly file. File Name: FinalAssembly.zip. Resource Description: Primary contigs and haplotigs in FASTA format. The file slf.8M.final.primary.fasta contains the primary contigs, and slf.8M.final.haplotigs.fasta contains the haplotigs.
    2. Resource Title: Falcon Raw assembly, polished with arrow. File Name: FalconAssembly.zip. Resource Description: Raw primary contig assembly prior to Falcon unzip. Contigs were polished with all subreads using the arrow polishing tool.
    3. Resource Title: Fasta file of contig assemblies of the two symbiont genomes. File Name: Symbiont.zip. Resource Description: Contains contig FASTA files for the Sulcia (Sulciamuelleri.fa) and Vidania (vidania.fa) symbiont genomes recovered from the de novo assembly.
    4. Resource Title: Haplotig placement file in PAF format. File Name: slf.haplotigPlacement.paf.zip. Resource Description: Final assembly placement file, describing the placement of haplotigs on the primary contig assembly.
    5. Resource Title: Falcon Unzip assembly polished with arrow. File Name: FalconUnzipAssembly.zip. Resource Description: Falcon unzip assembly, both the primary contigs and haplotigs, unfiltered.
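
    The contig N50 quoted in the description can be recomputed from any of the FASTA files in this dataset. The sketch below is an illustrative implementation of the usual N50 definition (the length of the contig at which the cumulative sum of lengths, taken from longest to shortest, first reaches half the assembly size); it is not the authors' script, and the helper names `fasta_lengths` and `n50` are chosen here for the example.

```python
def fasta_lengths(lines):
    """Return the sequence lengths from an iterable of FASTA lines."""
    lengths = []
    current = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Length of the contig that brings the running total to >= 50% of the sum."""
    if not lengths:
        raise ValueError("no contig lengths given")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


# Example call on an unpacked file from FinalAssembly.zip:
# with open("slf.8M.final.primary.fasta") as handle:
#     print(n50(fasta_lengths(handle)))
```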

  11. Novel Megaptera novaeangliae (Humpback whale) haplotype reference genome

    • data.niaid.nih.gov
    • search.dataone.org
    • +1more
    zip
    Updated Mar 3, 2025
    Cite
    Maria-Vittoria Carminati; Lonnie Vlonjat; Ruiqi Li; Daniel Klee; Sara Padula; Ajay Patel; Andy Tan; Jacqueline Mattos; Nolan Kane (2025). Novel Megaptera novaeangliae (Humpback whale) haplotype reference genome [Dataset]. http://doi.org/10.5061/dryad.dv41ns271
    Explore at:
    Available download formats: zip
    Dataset updated
    Mar 3, 2025
    Dataset provided by
    University of Colorado Boulder
    Authors
    Maria-Vittoria Carminati; Lonnie Vlonjat; Ruiqi Li; Daniel Klee; Sara Padula; Ajay Patel; Andy Tan; Jacqueline Mattos; Nolan Kane
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    The sequencing of a kidney sample (KW2013002) from a stranded Megaptera novaeangliae (Humpback whale) calf is the first chromosome level reference genome for this species. The calf, a 457 cm and 2,500 lbs male, was found stranded in Hawai’i Kai, HI, in 2013 and was marked as abandoned/orphaned. In 2023, 1g of kidney was sequenced with PacBio long-read DNA sequencing, chromatin conformation capture (Hi-C), RNA sequencing, and mitochondrial sequencing to comprehensively characterize the genome and transcriptome of M. novaeangliae. The reference genome was compared to the preexisting M. novaeangliae scaffold to determine assembly improvements. Data validation includes a synteny analysis, mitochondrial annotation, and a comparison of BUSCO scores (scaffold v. reference genome and Balaenoptera musculus (Blue whale) v. M. novaeangliae). BUSCO analysis was performed on an M. novaeangliae scaffold-level assembly to determine genomic completeness of the reference genome, with a scaffold BUSCO score of 91.2% versus a score of 95.4% (Table I). Synteny analysis was performed using the B. musculus genome as comparison to determine chromosome level coverage and structure. Further, a time-based phylogenetic tree was constructed using the sequenced data and publicly available genomes. This dataset also contains the results of de novo repeat identification and gene annotation for the Humpback whale (Megaptera novaeangliae) genome. The repeat families were identified and classified using RepeatModeler, and gene prediction was conducted using AUGUSTUS and SNAP, incorporating coding sequences from related cetaceans. The resulting gene models were further refined using the MAKER pipeline, with protein evidence from Swiss-Prot and related species. tRNA genes were identified with tRNAscan-SE. 
The dataset includes the transcript sequences (GIU3625_Humpback_whale.transcript.fasta.gz), annotation file (GIU3625_Humpback_whale.annotation.gff.gz), and a methods file (methods.txt) detailing the bioinformatic processes. Methods Sample Information A kidney sample (KW2013002) was collected from a M. novaeangliae calf on January 15, 2013, in Hawai’i Kai, HI, and deposited at the National Institute of Standards and Technology (NIST). The sample was not collected by the authors so information regarding collection is limited to that presented herein. The calf, a 457 cm and 2,500 lbs male at the time of necropsy, was first observed on January 14, 2013, in shallow water and died between January 14 and January 15, 2013, via stranding. The calf was marked as abandoned/orphaned. In 2023, 1g of KW2013002 was sampled for sequencing by Cantata Bio. PacBio long reads DNA sequencing Quantification of DNA samples was performed using the Qubit 2.0 Fluorometer. For the construction of the PacBio SMRTbell library, targeting an insert size of approximately 20kb, the SMRTbell Express Template Prep Kit 2.0 was employed following the manufacturer's recommended protocol and default settings. The library was subsequently prepared for sequencing by binding to polymerase using the Sequel II Binding Kit 2.0 (PacBio) and loaded onto the PacBio Sequel II system. Sequencing was executed using PacBio Sequel II 8M SMRT cells to ensure comprehensive coverage and high-quality reads. Quality control of the extracted DNA was performed using nanodrop and gel. The OmniC library quality control was done using the Hifiasm draft assembly and showed a high amount of long-range linkage reads. The OmniC sequencing data was also quality controlled to examine Q30%, and the quality score matched the Illumina standard. The scaffolding algorithm HiRise also has a built-in quality control that uses only reads with a map score of over 40. 
Chromatin was fixed in situ within the nucleus using formaldehyde, followed by digestion with DNase I. The processed chromatin had its ends repaired and was then ligated to a biotinylated bridge adapter, facilitating proximity ligation of adapter-containing ends. Post-proximity ligation, the crosslinks were reversed, and the DNA was purified—a critical step involved treating the purified DNA to eliminate any non-internal biotin. The sequencing libraries were prepared using NEBNext Ultra enzymes and Illumina-compatible adapters, with biotin-containing fragments isolated using streptavidin beads before PCR enrichment. Sequencing was performed on an Illumina HiSeqX platform to achieve approximately 30x coverage. Contig assembling and scaffolding The de novo assembly process utilized PacBio CCS reads and Omni-C reads as input for HiC-Hifiasm, employing default parameters. This approach facilitated the generation of a separate de novo assembly for each haplotype, enhancing the accuracy and integrity of the genomic reconstruction. The scaffolding phase involved the integration of the de novo assembly with Dovetail Omni-C library reads through HiRise, a software pipeline tailored for scaffolding genome assemblies using proximity ligation data. Alignment of Omni-C library sequences to the draft assembly was achieved using bwa, with the mapped read pairs analyzed by HiRise to construct a likelihood model for genomic distance (See Figure S1). This model, along with additional information from the synteny analysis (see below), informed the identification and correction of misjoins, the scoring of potential joins, and the execution of joins exceeding a defined confidence threshold. Synteny analysis The M. novaeangliae newly-assembled scaffolds were mapped to the B. 
musculus whole genome (GenBank GCA_009873245.3) in order to map the synteny between the two species.9,10 A synteny analysis was performed using JupiterPlot 1.0,11 a software tool that uses circos-based consistency plots to map a given set of scaffolds with a reference genome. RNA sequencing Total RNA was extracted employing the QIAGEN RNeasy Plus Kit, adhering to the manufacturer's instructions. Quantification of RNA involved the Qubit RNA Assay and the TapeStation 4200 system. Before library preparation, DNase treatment was applied, followed by AMPure bead cleanup and rRNA depletion using QIAGEN FastSelect -HMR. The NEBNext Ultra II RNA Library Prep Kit was used for library preparation per the manufacturer's protocols. Sequencing of the prepared libraries was conducted on the NovaSeq 6000 platform, utilizing a 2 x 150 bp configuration to ensure comprehensive transcriptome coverage. Repeat Analysis This dataset was derived from a Humpback whale (Megaptera novaeangliae) genome assembly. The repeat families found in the genome were identified de novo using RepeatModeler (v2.0.1), which relies on RECON (v1.08) and RepeatScout (v1.0.6). The custom repeat library generated from RepeatModeler was then used to discover, identify, and mask the repeats in the assembly using RepeatMasker (v4.1.0). Gene prediction was performed using the AUGUSTUS software (v2.5.5) with six rounds of optimization. Coding sequences from related cetacean species, including Balaenoptera acutorostrata, Balaenoptera musculus, Balaenoptera ricei, Megaptera novaeangliae, and Orcinus orca, were used to train the ab initio models for gene prediction. Additionally, the SNAP software (v2006-07-28) was trained using the same coding sequences to build a separate gene prediction model. RNA-seq reads were mapped to the genome using the STAR aligner (v2.7), and intron hints were generated using the bam2hints tool within AUGUSTUS. 
MAKER was then employed to integrate the predictions from AUGUSTUS and SNAP, combining this information with peptide evidence from the UniProt database and protein sequences from related cetacean species. Only gene models predicted by both AUGUSTUS and SNAP were retained in the final dataset. Annotation Edit Distance (AED) scores were generated for each predicted gene as part of the MAKER pipeline to assess the accuracy of the predictions. Finally, tRNA genes were identified using the tRNAscan-SE software (v2.05). Acknowledgments The specimens used in this study were collected by Kristi West, University of Hawaii, and provided by the National Marine Mammal Tissue Bank (NMMTB), which is maintained by the National Institute of Standards and Technology (NIST) at the NIST Biorepository, Hollings Marine Laboratory, Charleston, SC. The NMMTB is operated under the direction of the National Oceanic and Atmospheric Administration/National Marine Fisheries Service (NOAA Fisheries) with the collaboration of the U.S. Geological Survey, U.S. Fish and Wildlife Service, the (former) Minerals Management Service, and NIST, through the Marine Mammal Health and Stranding Response Program.
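
    As a first look at the annotation file named in the description, one can tally how many features of each type it contains. The sketch below assumes the file follows the standard nine-column, tab-separated GFF layout, in which the third column holds the feature type; this has not been checked against the actual file, and the function name `count_feature_types` is illustrative.

```python
import gzip
from collections import Counter


def count_feature_types(lines):
    """Tally the feature type (third column) of each GFF record.

    Comment/directive lines starting with '#' and any line that does not
    have at least nine tab-separated fields are skipped.
    """
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue
        counts[fields[2]] += 1
    return counts


# Example call on the downloaded annotation file:
# with gzip.open("GIU3625_Humpback_whale.annotation.gff.gz", "rt") as handle:
#     for feature_type, n in count_feature_types(handle).most_common():
#         print(feature_type, n)
```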

  12. NCMD assembly and gene annotation

    • figshare.com
    application/gzip
    Updated Oct 12, 2023
    Cite
    Bioinfo lab. (2023). NCMD assembly and gene annotation [Dataset]. http://doi.org/10.6084/m9.figshare.23708352.v3
    Explore at:
    Available download formats: application/gzip
    Dataset updated
    Oct 12, 2023
    Dataset provided by
    Figshare http://figshare.com/
    figshare
    Authors
    Bioinfo lab.
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    For genome assembly, we generated short and long reads from the DNA of an adult male Nanchukmacdon and constructed the assembly with a reference-guided approach. Raw PacBio subreads at 80.14x coverage were assembled and polished to generate 1,942 high-quality contigs, each supported by at least 50 PacBio subreads. To generate a chromosome-level assembly, the polished contigs were then further assembled by an improved version of RACA that can utilize both the genome information of related species and diverse types of sequencing data. The resulting assembly was polished once more with short reads to produce the final assembly.

    For annotating protein-coding genes, RNA samples were prepared and sequenced from 24 different tissues of the same Nanchukmacdon individual that was used for whole genome sequencing. Using a combination of ab initio and homology-based prediction approaches with the RNA sequencing data, a total of 20,588 protein-coding genes with an average length of 47.06 Kbp were annotated in the NCMD assembly. Non-coding genes for diverse types of RNAs, including rRNA, snRNA, and miRNA, were annotated using the Rfam database and Infernal (v.1.1.3). tRNAscan-SE (v.2.0.5) and RNAmmer (v.1.2) were used to annotate non-coding genes for tRNA and rRNA, respectively.

    The sequencing read data for genome assembly and annotation can be obtained from the NCBI SRA under BioProject PRJNA967127.

  13. Data from: A Sequence Distance Graph framework for genome assembly and...

    • ckan.grassroots.tools
    pdf, xml
    Updated Sep 15, 2022
    Cite
    Earlham Institute (2022). A Sequence Distance Graph framework for genome assembly and analysis [Dataset]. https://ckan.grassroots.tools/ar/dataset/7dcb7e5c-27d8-4697-8d67-fb9900dcd6bd
    Explore at:
    Available download formats: xml, pdf
    Dataset updated
    Sep 15, 2022
    Dataset provided by
    Earlham Institute
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Sequence Distance Graph (SDG) framework works with genome assembly graphs and raw data from paired, linked and long reads. It includes a simple deBruijn graph module, and can import graphs using the graphical fragment assembly (GFA) format. It also maps raw reads onto graphs, and provides a Python application programming interface (API) to navigate the graph, access the mapped and raw data and perform interactive or scripted analyses. Its complete workspace can be dumped to and loaded from disk, decoupling mapping from analysis and supporting multi-stage pipelines. We present the design and implementation of the framework, and example analyses scaffolding a short read graph with long reads, and navigating paths in a heterozygous graph for a simulated parent-offspring trio dataset. SDG is freely available under the MIT license at

  14. Table2_Whole Genome Assembly of Human Papillomavirus by Nanopore Long-Read...

    • figshare.com
    xlsx
    Updated Jun 15, 2023
    Cite
    Shuaibing Yang; Qianqian Zhao; Lihua Tang; Zejia Chen; Zhaoting Wu; Kaixin Li; Ruoru Lin; Yang Chen; Danlin Ou; Li Zhou; Jianzhen Xu; Qingsong Qin (2023). Table2_Whole Genome Assembly of Human Papillomavirus by Nanopore Long-Read Sequencing.XLSX [Dataset]. http://doi.org/10.3389/fgene.2021.798608.s002
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Jun 15, 2023
    Dataset provided by
    Frontiers
    Authors
    Shuaibing Yang; Qianqian Zhao; Lihua Tang; Zejia Chen; Zhaoting Wu; Kaixin Li; Ruoru Lin; Yang Chen; Danlin Ou; Li Zhou; Jianzhen Xu; Qingsong Qin
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Human papillomavirus (HPV) is a causal agent for most cervical cancers. The physical status of the HPV genome in these cancers could be episomal, integrated, or both. HPV integration could serve as a biomarker for clinical diagnosis, treatment, and prognosis. Although whole-genome sequencing by next-generation sequencing (NGS) technologies, such as the Illumina sequencing platform, has been used for detecting integrated HPV genomes in cervical cancer, it faces challenges in analyzing long repeats and translocated sequences. In contrast, Oxford Nanopore sequencing technology can generate ultra-long reads, which could be a very useful tool for determining the HPV genome sequence and its physical status in cervical cancer. As a proof of concept, in this study, we completed whole genome sequencing of a cervical cancer tissue and a CaSki cell line with Oxford Nanopore Technologies. From the cervical cancer tissue, a 7,894 bp-long HPV35 genomic sequence was assembled from 678 reads at 97-fold coverage of the HPV genome, sharing 99.96% identity with the HPV sequence obtained by Sanger sequencing. A 7,904 bp-long HPV16 genomic sequence was assembled from data generated from the CaSki cell line at 3,857-fold coverage, sharing 99.99% identity with the reference genome (NCBI: U89348). Intriguingly, long reads generated by nanopore sequencing directly revealed chimeric cellular–viral sequences and concatemeric genomic sequences, leading to the discovery of 448 unique integration breakpoints in the CaSki cell line and 60 breakpoints in the cervical cancer sample. Taken together, nanopore sequencing is a unique tool to identify HPV sequences and would shed light on the physical status of the HPV genome in its associated cancers.
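
    The fold-coverage figures in this description can be related to read counts with simple arithmetic. The sketch below assumes that coverage means total aligned read bases divided by genome length, which the description does not state explicitly; under that assumption, 678 reads giving 97-fold coverage of a 7,894 bp genome would imply an average of roughly 1.1 kb of HPV sequence per read. The function names are illustrative.

```python
def fold_coverage(total_read_bases, genome_length):
    """Fold coverage, taken here as total read bases over genome length."""
    return total_read_bases / genome_length


def implied_mean_read_length(coverage, genome_length, n_reads):
    """Mean bases per read implied by a coverage figure and a read count."""
    return coverage * genome_length / n_reads


# HPV35 figures quoted in the description: 97-fold, 7,894 bp, 678 reads.
hpv35_mean = implied_mean_read_length(97, 7894, 678)
print(f"implied mean HPV bases per read: {hpv35_mean:.0f}")
```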

  15. Data from: Origin of minicircular mitochondrial genomes in red algae

    • nde-dev.biothings.io
    • datasetcatalog.nlm.nih.gov
    • +4more
    zip
    Updated May 24, 2023
    Cite
    Yongsung Lee; Chung Hyun Cho; Chanyoung Noh; Ji Hyun Yang; Seung In Park; Yu Min Lee; John A. West; Debashish Bhattacharya; Kyubong Jo; Hwan Su Yoon (2023). Origin of minicircular mitochondrial genomes in red algae [Dataset]. http://doi.org/10.5061/dryad.tqjq2bw0w
    Explore at:
    Available download formats: zip
    Dataset updated
    May 24, 2023
    Dataset provided by
    Sungkyunkwan University
    Sogang University
    Rutgers, The State University of New Jersey
    The University of Melbourne
    Authors
    Yongsung Lee; Chung Hyun Cho; Chanyoung Noh; Ji Hyun Yang; Seung In Park; Yu Min Lee; John A. West; Debashish Bhattacharya; Kyubong Jo; Hwan Su Yoon
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Eukaryotic organelle genomes are generally of conserved size and gene content within phylogenetic groups. However, significant variation in genome structure may occur. Here, we report that the Stylonematophyceae red algae contain multipartite circular mitochondrial genomes (i.e., minicircles) which encode one or two genes bounded by a specific cassette and a conserved constant region. These minicircles are visualized using Fluorescence Microscope and Scanning Electron Microscope, proving the circularity. Mitochondrial gene sets are reduced in these highly divergent mitogenomes. Newly generated chromosome-level nuclear genome assembly of Rhodosorus marinus reveals that most mitochondrial ribosomal subunit genes are transferred to the nuclear genome. Hetero-concatemers that resulted from recombination between minicircles and unique gene inventory that is responsible for mitochondrial genome stability may explain how the transition from typical mitochondrial genome to minicircles occurs. Our results offer inspiration on minicircular organelle genome formation and highlight an extreme case of mitochondrial gene inventory reduction. Methods Sample preparation Culture strains of Tsunamia transpacifica JAW4874, Rufusia pilicola O7031, Stylonema alsidii JAW4424, Chroodactylon ornatum JAW4256, Chroothece mobilis SAG104.79, Rhodosorus marinus CCMP1338, and Bangiopsis subsimplex UTEX LB2854 were obtained from J.A. West (School of Biosciences 2, University of Melbourne, Parkville, Victoria 3010, Australia), F.D. Ott (905 NE Hilltop Drive, Topeka, Kansas 66617, USA), The Culture Collection of Algae at Göttingen University, Germany (SAG), The National Center for Marine Algae and Microbiota (NCMA), and the Culture Collection of Algae at The University of Texas at Austin, USA (UTEX), respectively. DY-V medium (added sea salt to 5 ppt) was used for culturing Rufusia pilicola. Chroothece mobilis and Chroodactylon ornatum were cultured in L1+DY-V (1:1 ratio) medium. 
The other samples were cultured in L1 medium. Culture flasks were kept under a white LED lamp (7.76 µmol photon m-2 s-1) at 20°C in a 12:12 light-dark cycle. DNA and RNA extraction Samples were either collected from 10 µm membrane filters or by centrifugation (30 min, 7830 rpm). Harvested cells were frozen in liquid nitrogen before grinding. Genomic DNAs for Illumina short-read sequencing were extracted using the Exgene Plant SV Kit (General Biosystems, Seoul, Korea) and cleaned up using DNeasy® PowerClean® Pro Cleanup Kit (QIAGEN, Hilden, Germany). For long-read sequencing, genomic DNA was extracted using the manual CTAB protocol with a customized lysis buffer 1. Harvested samples were placed in a 2 ml tube with bullet and frozen in liquid nitrogen. Then samples were machine ground. After grinding, samples were resuspended by adding 600 µl of CTAB isolation buffer (1% 2-mercaptoethanol added right before usage) and incubated at 65°C for 20 min. When samples were completely thawed, bullets were removed from the tubes and 6 µl of RNase A was added. After incubation, we centrifuged tubes at 14,000 rpm for 20 min. While not disturbing the pellets, samples were placed into another 2 ml tube and mixed with one volume of phenol:chloroform:isoamyl alcohol (25:24:1, v/v) before centrifugation at 14,000 rpm for 20 min. The aqueous phase was then mixed with one volume of chloroform in a new 2 ml tube and centrifuged at 14,000 rpm for 15 min. After centrifugation, one volume of 100% isopropanol was added and incubated at -20°C for 30 min. Samples were then centrifuged at 14,000 rpm for 20 min. Precipitated DNA was washed with 70% ethanol and centrifuged again to remove ethanol. Finally, DNA was air-dried and dissolved in 50 µl AE buffer from Exgene Plant SV Kit. Total RNA of R. marinus was extracted using RNeasy® Plant Mini Kit (QIAGEN, Hilden, Germany). 
Whole genome sequencing and genome assembly Library preparation and whole genome sequencing for both short-read and long-read sequencing were carried out by DNA Link Inc. (Seoul, Korea). For short-read sequencing, libraries were prepared using the Truseq Nano DNA Prep Kit (550 bp Protocol) and sequencing was done with the Illumina HiSeq2500 platform according to the protocol using 100 bp paired-end reagents. Long-read sequencing was carried out with Oxford Nanopore platform (ONT GridION) for R. marinus (6 kb size selection) and the Pacific Biosciences (PacBio) High-Fidelity (HiFi) sequencing platform for C. ornatum (no size selection). RNA-seq for R. marinus was done with the Illumina NovaSeq 6000 platform. The raw data from short-read sequencing were assembled using SPAdes 3.14.1 2 with the ‘--careful’ pipeline option and those from long-read sequencing were assembled using NextDenovo 2.5.0 (https://github.com/Nextomics/NextDenovo) for nuclear genome of R. marinus. Assembled NextDenovo contigs were polished 3 times with Pilon 1.22 3 using short-read mapping data generated by bowtie2 2.3.5.1 4. For mitogenome assemblies using long-read data, reads that have BLAST hits to mitochondrial CDS were used. The program miniasm 0.3 (r179) 5 was used to identify the R. marinus mitogenome and IPA 1.3.1 (https://github.com/PacificBiosciences/pbipa) was used for C. ornatum. In addition, reads that had BLAST hits to the NCR were used to search for “empty” minicircle reads that do not contain a CDS; however, no contigs were assembled, meaning the collected reads are just fragments of CDS-containing reads. Because minicircles share long conserved regions that short-reads cannot discriminate, we used long-read data and NextPolish 1.4.0 (https://github.com/Nextomics/NextPolish) to polish the miniasm-derived contigs. We did not perform polishing on IPA contigs, because HiFi sequencing generates extremely accurate reads. 
The remaining SNPs and ambiguities were manually corrected using mapping data of long-reads containing CDS. For C. ornatum, each sequence from step 10 (10-assemble/p_ctg.fasta) was considered as a minicircle sequence, because the following step of the IPA assembler (polish and purge dups) did not function correctly. For the short-read data, sorted and verified mitochondrial genes (see below) were used as seeds for NOVOplasty 4.2 6. Using Geneious (Biomatters, Auckland, New Zealand), generated NOVOplasty contigs were then de novo assembled. Assembled contig that codes any of mitochondrial genes was considered as part of mitochondrial genome. Those contigs were polished (-SNP & Indel) with Pilon 1.22 3, using short-read mapping data generated by bowtie2 2.3.5.1 4. Trinity 2.11.0 7 was used to assemble RNA sequencing data. Sorting and verifying mitochondrial contigs BLAST 2.2.31+ 8 was used to search for mitochondrial genes. Because mitochondrial gene sequences of the Stylonematophyceae were absent in the National Center for Biotechnology Information (NCBI) database, mitochondrial protein sequences from several red algae species were searched against assembled SPAdes contigs with e-value 1e-05 using tBLASTn. All the matched sequences were translated (Genetic code 4 9) and aligned against NCBI protein database (nr). Sequences that have eukaryotic taxa in the top 100 matches were considered as candidate genes. Those that only had prokaryotic taxa in the top 100 matches with significantly low identity or query coverage were also selected as possible candidates. To exclude bacterial contigs from possible mitochondrial contigs, genomic features such as GC content, read coverage, and tBLASTn result (top match and identity) of the contig were used as criteria for selection. CDSs of each contig were compared against NCBI protein database (nr) using default parameters. These candidate contigs were verified manually using phylogenetic analysis. 
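
GC content is one of the contig features named above for separating bacterial from mitochondrial candidates. The sketch below shows one common way to compute it; the description gives no thresholds or tooling for this step, so this is an illustration rather than the authors' procedure. Ambiguous bases such as N are excluded from the denominator, which is a choice made here.

```python
def gc_content(seq):
    """Fraction of G and C among the unambiguous (A/C/G/T) bases of a sequence."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    if acgt == 0:
        return 0.0  # no unambiguous bases to measure
    return gc / acgt


# Example: compare two hypothetical contigs by GC fraction.
for name, contig in [("contig_a", "ATATATGCAT"), ("contig_b", "GCGCGGCCAT")]:
    print(name, round(gc_content(contig), 2))
```
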
Using translated CDSs in candidate contigs as queries, protein sequences from the nr database were searched with MMseqs2 [10] (version 330ea3684fd3f985d0127ffe8ca5b3f13053c619) with maximum sensitivity and an e-value of 1e-05.

Nuclear gene prediction

RNA-seq reads were mapped against the assembled nuclear genome of R. marinus using hisat2 2.2.1 [11] and STAR 2.7.7a [12] (--outFilterScoreMinOverLread 0.45 --outFilterMatchNminOverLread 0.45). The mapping information was used as the training set for ab initio gene models, which were generated using BRAKER 2.1.5 [13]. Completeness was measured using BUSCO 3.0.2 with the ‘eukaryote_odb9’ database [14], following Cho, et al. (2023). RAD52 was not found in the C. crispus proteome and contaminant assemblies were found in the transcriptome assembly of C. ornatum. Therefore, we chose to generate a transcriptome assembly and perform gene modeling using the available RNA-seq data (see Supplementary Table S1). We used Trinity 2.11.0 [7] to obtain the transcriptome assembly. cd-hit 4.8.1 [16, 17] was used to cluster sequences with similarity over 95% and predicted proteins were generated using Transdecoder 5.5.0 (https://github.com/TransDecoder/TransDecoder). BUSCO [14] values were: C. crispus, C:97.4% [S:27.4%, D:70.0%], F:2.6%, M:-0.0%, n:303; and C. ornatum, C:96.7% [S:20.5%, D:76.2%], F:1.3%, M:2.0%, n:303. Consequently, we predicted several novel genes that are not present in existing red algal data.

Comparative analysis of CDSs

The mitochondrial genomes of 23 red algae representing Cyanidiophyceae, Compsopogonophyceae, Porphyridiophyceae, Rhodellophyceae, Bangiophyceae, and Florideophyceae were downloaded from the NCBI nucleotide database (nt) and used for the comparison (Supplementary Table S1). Translated sequences of 11 CDSs (atp6, atp9, cob, cox1, cox2, cox3, nad1, nad2, nad4, nad5-f and nad5-s) were aligned with MAFFT 7.310 [18] and concatenated. 
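The concatenation step can be illustrated with a minimal sketch. It assumes each gene alignment is a mapping from taxon name to aligned sequence and that a taxon missing a gene is padded with gaps; the function name, taxon labels, and toy sequences are invented for illustration and are not taken from the dataset.

```python
from typing import Dict, List


def concatenate_alignments(gene_alignments: List[Dict[str, str]]) -> Dict[str, str]:
    """Join per-gene alignments into one concatenated row per taxon."""
    taxa = sorted({taxon for aln in gene_alignments for taxon in aln})
    rows = {taxon: [] for taxon in taxa}
    for aln in gene_alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences in one alignment must have equal length")
        width = lengths.pop()
        for taxon in taxa:
            # A taxon lacking this gene is padded with gap characters.
            rows[taxon].append(aln.get(taxon, "-" * width))
    return {taxon: "".join(parts) for taxon, parts in rows.items()}


# Toy aligned protein fragments (hypothetical, for illustration only).
atp6 = {"taxonA": "MK-L", "taxonB": "MKAL"}
cob = {"taxonA": "GHW", "taxonB": "GHW", "taxonC": "G-W"}

supermatrix = concatenate_alignments([atp6, cob])
for taxon, seq in supermatrix.items():
    print(taxon, seq)
```

Padding with gaps keeps every row the same length, which downstream similarity and tree-building tools generally require.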
From the concatenated alignment, amino acid similarity was calculated in Geneious 10.2.3 using the BLOSUM62 matrix [19] with threshold 1, along with nucleotide identity. A maximum-likelihood phylogenetic tree was built using IQ-TREE 1.6.8 [20]. Optimal evolutionary models were chosen automatically after model selection [21]. Pairwise dN/dS calculation was

  16. Data from: Asian giant hornet, Vespa mandarinia, genome assembly

    • catalog.data.gov
    • datasets.ai
    Updated Dec 2, 2025
    Cite
    Agricultural Research Service (2025). Asian giant hornet, Vespa mandarinia, genome assembly [Dataset]. https://catalog.data.gov/dataset/asian-giant-hornet-vespa-mandarinia-genome-assembly-1f860
    Explore at:
    Dataset updated
    Dec 2, 2025
    Dataset provided by
    Agricultural Research Service
    Description

    The Asian giant hornet, Vespa mandarinia, has a native range that extends from northern India to East Asia. In 2019, the hornet was confirmed for the first time in North America, posing an invasive threat to honey bees and human health. In September 2019, local beekeepers tracked down a nest in a park in Nanaimo on Vancouver Island, British Columbia, Canada, and exterminated it. The specimen we used for genome sequencing was obtained from that nest, the first one found in North America. DNA was extracted from the thorax for PacBio HiFi sequencing on two cells and data were assembled using IPA to yield a contig assembly of 248 Mb with a 3.14 Mb N50. The assembly was generated by the Agricultural Research Service's Ag100Pest Initiative in collaboration with Pacific Biosciences. This high-quality genome assembly is being released prior to publication in scientific journals as a public service to the research community. The Primary and Haplotig assemblies, along with the HiFi reads, have been archived at NCBI. Relevant accessions include: SRA: SRR12366675 - PacBio HiFi reads for both cells; BioProject: PRJNA649644, BioSample: SAMN15675875, GenBank: JACHAV000000000 - Primary contig assembly and mitochondrial genome; BioProject: PRJNA649643, BioSample: SAMN15675875, GenBank: JACHAW000000000 - Alternate (Haplotigs) contig assembly.

    Resources in this dataset:

    Resource Title: IPA contigs purged from haplotigs. File Name: ihVesMand1_IPA_purged_from_htig.fasta. Resource Description: IPA contigs purged from the haplotigs contig set by purge_dups. Fasta format.

    Resource Title: Mitochondrial PacBio HiFi read set. File Name: ihVesMand1_mt_reads.fasta. Resource Description: Mitochondrial reads from the PacBio HiFi read set. Fasta format.

    Resource Title: All mitochondrial genome VNTR variants. File Name: ihVesMand1_mtgenome_all_VNTR_variants.fasta. Resource Description: Multiple contigs of the mitochondrial genome were obtained due to the presence of an extended variable number tandem repeat (VNTR) region corresponding to the control region, with different copy numbers (ranging from 5 to 9) of an 823 bp repeat unit. We designated the most abundant mitochondrial genome variant (6 repeat copies) as the mitochondrial genome sequence and included it with the primary assembly deposited in GenBank.

    Resource Title: Vespa mandarinia sequencing and assembly methods. File Name: Vespa_mandarinia_Sequencing_and_Assembly_Methods.docx
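The N50 quoted for the contig assembly is a length-weighted summary statistic. As a minimal sketch under the usual definition (the contig length at which the longest contigs together reach at least half of the total assembly size), it could be computed as follows; the toy contig lengths are invented and are not the hornet assembly.

```python
def n50(lengths):
    """Return the N50: the length L such that contigs of length >= L
    together cover at least half of the total assembly size."""
    if not lengths:
        raise ValueError("no contig lengths given")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical contig lengths (arbitrary units), for illustration only.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))
```

In practice the lengths would be read from the assembly FASTA rather than typed in by hand.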

  17. Supplementary datasets for: Large-scale genome sequencing reveals the...

    • data-staging.niaid.nih.gov
    • search.dataone.org
    zip
    Updated Oct 21, 2020
    Cite
    David Nelson (2020). Supplementary datasets for: Large-scale genome sequencing reveals the driving forces of viruses in microalgal evolution [Dataset]. http://doi.org/10.5061/dryad.7wm37pvnv
    Explore at:
    Available download formats: zip
    Dataset updated
    Oct 21, 2020
    Dataset provided by
    New York University Abu Dhabi
    Authors
    David Nelson
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Microalgae are integral primary producers for global ecosystems whose genomes can be mined for ecological insights, but representative genome sequences are lacking for many phyla. We cultured and sequenced 107 microalgae species from 11 different phyla indigenous to varied geographies and climates. This genome collection was used to resolve genomic differences between saltwater and freshwater microalgae. Freshwater species showed domain-centric ontology enrichment for nuclear and nuclear membrane functions, while saltwater species were enriched in organellar and cellular membrane functions. Marine species contained significantly more viral families in their genomes (p-value = 8 × 10⁻⁴). Viral sequences were identified from Chlorovirus, Coccolithovirus, Pandoravirus, Marseillevirus, Tupanvirus, and others integrated into algal genomes. Algal, viral-origin sequences were found to be expressed and to code for a wide variety of functions. Our results clarify the poorly characterized occurrences of viral elements in algal genomes and define a unified adaptive strategy for algal halotolerance.

    Methods METHODS DETAILS

    Microalgal strain selection and cultivation

    Cultivation, DNA extraction, and sequencing of isogenic microalgae were carried out at several international culture collections and sequencing centers: UTEX (Austin, TX, USA), Bigelow Laboratory (NCMA culture collection center, East Boothbay, ME, USA), New York University Abu Dhabi Center for Genomics and Systems Biology (Abu Dhabi, UAE), Admera Health LLC (South Plainfield, NJ, USA), and Novogene (HK).

    The UTEX strains were grown on slants using one of the following media as appropriate: BG11 Medium, Bristol Medium, Cyanidium Medium, f/2 Medium, Modified Artificial Seawater Medium, Modified Bold's 3N Medium, Porphyridium Medium, Proteose Medium, Soil Extract Medium, Trebouxia Medium, or Volvox Medium with 1.5% agar, as described on the UTEX website (https://utex.org/pages/algal-culture-media-recipes). Cultures were grown under cool white fluorescent lights at 20 °C on a 12-hour light cycle. For species isolated and cultured at NYUAD, f/2 medium (Lananan et al., 2013) or Tris-minimal medium (https://chlamycollection.org/) was used (https://utex.org/pages/algal-culture-media-recipes).

    The species chosen for sequencing were intended to represent as many microalgae phyla and as many different environments as possible. We sequenced representatives from 11 phyla (see Table S1). Most of the species were from the Chlorophyta or the Ochrophyta phyla. The project designations were algallCODE phase II (n=107, this manuscript), algallCODE phase I (n=22), NCBI-hosted (n=43), and Phytozome-hosted (n=2). Individual strain cultures were selected as representative species for their lineages or as standards to confirm workflow reproducibility (see Table S1; Dataset S1).

    We emphasized maximizing the sample size of each saltwater and freshwater species (Fig. 1; Table S1). Of our initial effort to culture >150 species, 24 failed to produce sufficient biomass, six were contaminated, and three yielded reads inadequate for an assembly matching the expected size (Dataset S1). The de novo assemblies from the final batch of 107 sequenced species were combined with publicly available algal genomes for downstream analyses, including coding sequence (CDS) predictions (Dataset S2), hidden Markov model (HMM)-based functional predictions, including viral and protein family domain identification, hierarchical bi-clustering (Fig. 1), enrichment analyses (Fig. 5), principal component analyses (Fig. 5), and ternary graph-based analyses (Fig. S10, Dataset S12).

    The natural habitats of these microalgae include a range of diverse geographic locations and all climatic zones, with various temperatures, wind speeds, precipitation, and solar radiation. To allow the study of their evolution, we included species from different types of environments (from the arctic to the tropics) and both salt- and fresh-water habitats (Fig. 1A; Table S1). The freshwater species sequenced in this project included members of the Chlorophyta and Ochrophyta; most of the Haptophytes, Rhodophytes, and Myzozoa we sequenced were saltwater species. Most of the UTEX accessions were freshwater species (28/40); most of the NCMA accessions were saltwater species (50/57). Alexandrium andersonii, a mixotrophic dinoflagellate (1.7 Gb), Heterocapsa arctica, an arctic dinoflagellate (1.3 Gb), Lingulodinium polyedra, a red-tide dinoflagellate (1.2 Gb), Amphidinium gibbosum (1.1 Gb), and Karenia brevis (1.0 Gb) were the largest de novo-assembled genomes in this work (Table S2).

    Long-read assemblies, including those from other studies (i.e., Chromochloris zofingiensis (Roth et al., 2017), Thalassiosira pseudonana (Armbrust et al., 2004), and Chlamydomonas reinhardtii (Merchant et al., 2007)), were used to validate our high-throughput short-read assembly process. Four subtropical axenic isolates (from the United Arab Emirates) were sequenced for this study using long-read technologies, including 10x Genomics (Pleasanton, CA, USA) linked-reads and Pacific Biosciences (Menlo Park, CA, USA) Sequel long reads. Long reads were used to validate assemblies and viral element insertions and to resolve repeat-containing regions (Ummat and Bashir, 2014; Vondrak et al., 2019). Our results indicated that the CDSs that provided the foundational information for the comparative analyses in this manuscript were reliably determined using short reads (Illumina HiSeq X or NovaSeq 6000). For example, the Chromochloris zofingiensis genome is the highest quality algal genome published to date (Roth et al., 2017) and has 33,513 exons; our short-read assembly for Chromochloris zofingiensis had 33,910 exons. Other reference microalgae used in this study as resequencing standards included Thalassiosira pseudonana (Armbrust et al., 2004), Chlamydomonas reinhardtii (Merchant et al., 2007), Scenedesmus sp., Guillardia theta (Curtis et al., 2012), Fragilariopsis cylindrus (Mock et al., 2017), Coccomyxa subellipsoidea (Blanc et al., 2012), and Bigelowiella natans (Curtis et al., 2012). A comparison of the assemblies generated from the monoculture and sequencing performed in this study and previous whole-genome sequencing projects is presented in Fig. S3, and QUAST and BUSCO assembly metrics are in Table S2. Hidden Markov models were used to predict structure and function from the whole-genome sequences (Fig. 1, B–D; Tables S3,5; Datasets S6, S7). 
The results for functional characterization using Enzyme Commission (EC) codes (Alborzi et al., 2017; Ryu et al., 2019), Kyoto Encyclopedia of Genes and Genomes (KEGG) designations (Porollo, 2014), and Gene Ontology (GO) terms (Hayes and Mamano, 2018; Teng et al., 2017) are in Tables S6,8.

    DNA extraction

    DNA was extracted from mature cultures with QIAGEN DNeasy Plant Maxi kits for HiSeqX 150x2 paired-end (short read) sequencing or QIAGEN MagAttract High Molecular Weight DNA Kits (48) for long-read sequencing. DNA was quantified and assayed for integrity as per the kit manufacturer's protocol. For HMW DNA extraction, briefly, DNA concentration was measured using a Qubit Fluorometer and checked for size by pulsed-field electrophoresis. A length-weighted mean of 50-70 kb was obtained, or the sample was rejected for sequencing. See Dataset S2 for FastQC reports and Fig. S1 for the gel images showing DNA integrity. Extracted DNA with low integrity was not included in library preparations. More than 30 cultures were grown whose DNA did not meet the quality threshold; in these instances, substitute strains were chosen. The final, sequenced strains are listed in Table S1.
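The size check on high-molecular-weight DNA can be made concrete with a short sketch. It assumes that "length-weighted mean" means each fragment is weighted by its own length (sum of squared lengths divided by the sum of lengths) and that 50 kb acts as the lower acceptance bound; both are interpretations rather than details stated in the text, and the fragment sizes are invented.

```python
def length_weighted_mean(lengths_kb):
    """Mean fragment length with each fragment weighted by its own length."""
    total = sum(lengths_kb)
    if total <= 0:
        raise ValueError("fragment lengths must sum to a positive value")
    return sum(length * length for length in lengths_kb) / total


def passes_hmw_check(lengths_kb, minimum_kb=50.0):
    """Accept a sample whose length-weighted mean reaches the assumed minimum."""
    return length_weighted_mean(lengths_kb) >= minimum_kb


# Hypothetical fragment sizes in kb, for illustration only.
fragments_kb = [20, 40, 60, 80]
lwm = length_weighted_mean(fragments_kb)
print(lwm, passes_hmw_check(fragments_kb))
```

Weighting by length pulls the mean toward the long fragments that carry most of the DNA mass, so it sits above the plain arithmetic mean for the same sample.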

    Sequencing

    Genomic DNA sequencing was performed with Illumina paired-end (Illumina, San Diego, CA, USA), PacBio Sequel (Pacific Biosciences, Menlo Park, CA, USA), and, where indicated, 10x Genomics linked-reads (10x Genomics, Pleasanton, CA, USA) to enable reliable coverage, contig assembly, and de novo genomic sequence assembly. For Illumina paired-end sequencing, Nextera 2x150 bp libraries (Illumina, San Diego, CA, USA) with approximately 72 million reads per sample passing quality filters (Dataset S2) were used for sequencing with a HiSeqX (https://emea.illumina.com/systems/sequencing-platforms/hiseq-x.html). All reads have been uploaded to the National Center for Biotechnology Information (NCBI) under BioProject accession PRJNA517804. The target coverage was 100x on a 100 Mbp genome. Quality control for library preparation for Illumina sequencing was done with Qubit, Tapestation (Fig. S1), and qPCR. Combining these technologies assisted the validation of VFAM placement within selected genomes and ensured reliable assembly.
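The relationship between read count and the 100x target can be checked with simple arithmetic. The sketch below assumes that the "72 million reads" figure counts individual 150 bp reads rather than read pairs, which the text does not specify; if it counts pairs, the resulting coverage doubles.

```python
import math


def expected_coverage(n_reads, read_length_bp, genome_size_bp):
    """Average depth = total sequenced bases / genome size."""
    return n_reads * read_length_bp / genome_size_bp


def reads_needed(target_coverage, read_length_bp, genome_size_bp):
    """Smallest number of reads that reaches the target average depth."""
    return math.ceil(target_coverage * genome_size_bp / read_length_bp)


# 72 million reads of 150 bp on a 100 Mbp genome (reads assumed unpaired).
depth = expected_coverage(72_000_000, 150, 100_000_000)
minimum_reads = reads_needed(100, 150, 100_000_000)
print(depth, minimum_reads)
```

Under that assumption the stated read count gives about 108x on a 100 Mbp genome, slightly above the 100x target.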

    De novo genome assembly

    De novo assembly can produce variable output depending on the source species and software used; we used both the ABySS 2.0 (Jackman et al., 2017) and Platanus (Kajitani et al., 2019) pipelines for each species sequenced with short reads (Illumina HiSeqX (Illumina, San Diego, CA, USA)). The ABySS 2.0 command was: 'unset SLURM_NTASKS && mkdir -p $READFILE && TMPDIR=/tmp ABySS-pe -j 18 lib=pe1 k=64 name=$READFILE pe1='$READFILE R1_001.fastq.gz $READFILE R2_001.fastq.gz' --directory=/data/analysis/ABySS_pe/$READFILE'. The Platanus command was: '/platanus assemble -o $READ.OUT -f $READ-1.trimmed $READ-2.trimmed -t 4 -m 72 2>assemble.$READ.log'. Details of all YML workflows used are in Dataset S3, and the Key Resources Table lists all essential software used in the creation and analysis of these genomes.

    The output with the most single-copy, universally conserved orthologs according to BUSCO, which assesses completeness based on evolutionarily informed expectations of the gene content of near-universal single-copy orthologs, was chosen for subsequent analyses (Table S2, Dataset S2). This step introduces some bias, as ABySS produced assemblies that were much closer in size to the estimated genome sizes from close relatives.
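The selection rule can be sketched as follows. This is an illustration, not the authors' workflow: it assumes each assembler's result is available as a BUSCO short-summary string of the form `C:..%[S:..%,D:..%],F:..%,M:..%,n:..` and that "most single-copy orthologs" is read as the highest S value. The candidate names and scores are invented.

```python
import re

# Matches a BUSCO one-line summary, tolerating optional spaces.
BUSCO_PATTERN = re.compile(
    r"C:(?P<C>[\d.]+)%\s*\[S:(?P<S>[\d.]+)%,\s*D:(?P<D>[\d.]+)%\],\s*"
    r"F:(?P<F>[\d.]+)%,\s*M:(?P<M>[\d.]+)%,\s*n:(?P<n>\d+)"
)


def parse_busco(summary):
    """Turn a BUSCO summary string into a dict of percentages plus n."""
    match = BUSCO_PATTERN.search(summary)
    if match is None:
        raise ValueError(f"unrecognised BUSCO summary: {summary!r}")
    fields = {key: float(value) for key, value in match.groupdict().items()}
    fields["n"] = int(fields["n"])
    return fields


def pick_assembly(summaries):
    """Return the name of the assembly with the highest single-copy score."""
    return max(summaries, key=lambda name: parse_busco(summaries[name])["S"])


# Hypothetical summaries for two assemblies of the same sample.
candidates = {
    "abyss": "C:85.1%[S:83.2%,D:1.9%],F:6.3%,M:8.6%,n:303",
    "platanus": "C:80.5%[S:79.9%,D:0.6%],F:8.2%,M:11.3%,n:303",
}
print(pick_assembly(candidates))
```

Ranking on S rather than C avoids rewarding assemblies that inflate completeness through duplicated orthologs.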

  18. Data from: Budding yeasts in the subphylum Saccharomycotina Genome...

    • agdatacommons.nal.usda.gov
    bin
    Updated Mar 12, 2025
    Cite
    University of Wisconsin-Madison; Y1000+ Project (2025). Budding yeasts in the subphylum Saccharomycotina Genome sequencing and assembly [Dataset]. https://agdatacommons.nal.usda.gov/articles/dataset/Budding_yeasts_in_the_subphylum_Saccharomycotina_Genome_sequencing_and_assembly/25091411
    Explore at:
    Available download formats: bin
    Dataset updated
    Mar 12, 2025
    Dataset provided by
    National Center for Biotechnology Information (http://www.ncbi.nlm.nih.gov/)
    Authors
    University of Wisconsin-Madison; Y1000+ Project
    License

    https://rightsstatements.org/vocab/UND/1.0/

    Description

    Eukaryotic life depends on the functional elements encoded by both the nuclear genome and organellar genomes, such as those contained within the mitochondria. The content, size, and structure of the mitochondrial genome varies across organisms with potentially large implications for phenotypic variance and resulting evolutionary trajectories. Among yeasts in the subphylum Saccharomycotina, extensive differences have been observed in various species relative to the model yeast Saccharomyces cerevisiae, but mitochondrial genome sampling across many groups has been scarce, even as hundreds of nuclear genomes have become available. By extracting mitochondrial reads from existing short-read genome sequence datasets, we have greatly expanded both the number of available genomes and the coverage across sparsely sampled clades. Comparison of 353 yeast mitochondrial genomes revealed that, while size and GC content were fairly consistent across species, those in the genera Metschnikowia and Saccharomyces trended larger, while several species in the order Saccharomycetales exhibited lower GC content. Extreme examples for both size and GC content were scattered throughout the subphylum. All mitochondrial genomes shared a core set of protein-coding genes for Complexes III, IV, and V, but they varied in the presence or absence of mitochondrially-encoded canonical Complex I genes. We traced the loss of Complex I genes to a major event in the ancestor of the orders Saccharomycetales and Saccharomycodales, but we also observed several independent losses in the orders Phaffomycetales, Pichiales, and Dipodascales. In contrast to prior hypotheses based on smaller-scale datasets, comparison of evolutionary rates in protein-coding genes showed no bias towards elevated rates among aerobically fermenting (Crabtree/Warburg-positive) yeasts. Mitochondrial introns were widely distributed, but highly enriched in some groups. 
The majority of mitochondrial introns were poorly conserved within groups, but several were shared within groups, between groups, and even across taxonomic orders, which is consistent with horizontal gene transfer, likely involving homing endonucleases acting as selfish elements. As the number of available fungal nuclear genomes continues to expand, the methods described here to retrieve mitochondrial genome sequences from these datasets will prove invaluable to ensuring that studies of fungal mitochondrial genomes keep pace with their nuclear counterparts.

  19. Data from: Evaluating Illumina-, Nanopore-, and PacBio-based genome assembly...

    • datadryad.org
    • search.dataone.org
    zip
    Updated Aug 19, 2022
    Cite
    Niraj Rayamajhi; Chi-Hing Christina Cheng; Julian Catchen (2022). Evaluating Illumina-, Nanopore-, and PacBio-based genome assembly strategies with the bald notothen, Trematomus borchgrevinki [Dataset]. http://doi.org/10.5061/dryad.ghx3ffbs3
    Explore at:
    Available download formats: zip
    Dataset updated
    Aug 19, 2022
    Dataset provided by
    Dryad
    Authors
    Niraj Rayamajhi; Chi-Hing Christina Cheng; Julian Catchen
    Time period covered
    Jul 28, 2022
    Description

    Evaluating Illumina-, Nanopore-, and PacBio-based genome assembly strategies with the bald notothen, Trematomus borchgrevinki

    https://doi.org/10.5061/dryad.ghx3ffbs3

    Illumina-based short-read-only de novo genome assembly built with k-mer size 51 using Meraculous (v2.2.2.5, Chapman et al. 2011), named k51; the file name is k51.fasta

    Illumina-based short-read-only de novo genome assembly built with k-mer size 61 using Meraculous (v2.2.2.5, Chapman et al. 2011), named k61; the file name is k61.fasta

    Illumina-based short-read-only de novo genome assembly built with k-mer size 71 using Meraculous (v2.2.2.5, Chapman et al. 2011), named k71; the file name is k71.fasta

    Illumina-based short-read-only de novo genome assembly built with k-mer size 81 using Meraculous (v2.2.2.5, Chapman et al. 2011), named k81; the file name is k81.fasta

    Illumina-based short-read-only de novo genome assembly built with k-mer size 91 using Meracu...

  20. Helicobacter pylori SAMN06173313 reads

    • figshare.unimelb.edu.au
    • datasetcatalog.nlm.nih.gov
    application/gzip
    Updated Jun 1, 2023
    Cite
    RYAN WICK (2023). Helicobacter pylori SAMN06173313 reads [Dataset]. http://doi.org/10.4225/49/5959d352ba0ab
    Explore at:
    Available download formats: application/gzip
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    The University of Melbourne
    Authors
    RYAN WICK
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is a sample read set for use in Unicycler. It is linked to in the Unicycler README so that people who install Unicycler can get a small read set to make sure the program works. The files contain Illumina and PacBio reads from Helicobacter pylori biosample SAMN06173313. The SRA run accessions are SRR5413256 and SRR5413257. These files do not contain the full set of reads from these runs. Rather, they have been subsampled to create smaller files that are easier to download. The PacBio reads were subsampled based on quality and are a high-quality subset of the original reads. The Helicobacter pylori genome is small and simple. It has only two copies of the RNA operon and no other large repeats, making it very easy to assemble compared to most bacterial genomes.
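Quality-based subsampling of the kind described can be sketched in a few lines. The exact procedure used for this read set is not stated, so the approach below (rank reads by mean Phred score, keep the top N) is an assumed stand-in; the parser handles only simple four-line FASTQ records, and the toy reads are invented.

```python
def parse_fastq(text):
    """Yield (name, sequence, quality) from simple four-line FASTQ text."""
    lines = text.strip().splitlines()
    for i in range(0, len(lines), 4):
        header, seq, _plus, qual = lines[i:i + 4]
        yield header[1:], seq, qual


def mean_phred(quality_line, offset=33):
    """Arithmetic mean of Phred scores (Phred+33 encoding assumed)."""
    if not quality_line:
        raise ValueError("empty quality string")
    return sum(ord(char) - offset for char in quality_line) / len(quality_line)


def top_reads_by_quality(records, keep):
    """Keep the `keep` reads with the highest mean Phred score."""
    ranked = sorted(records, key=lambda rec: mean_phred(rec[2]), reverse=True)
    return ranked[:keep]


# Three toy reads with mean qualities of 40, 0 and 20.
toy_fastq = """@read1
ACGT
+
IIII
@read2
ACGT
+
!!!!
@read3
ACGT
+
5555
"""
best = top_reads_by_quality(list(parse_fastq(toy_fastq)), keep=2)
print([name for name, _seq, _qual in best])
```

Averaging Phred scores directly is a simplification; averaging the underlying error probabilities is stricter, and dedicated read-filtering tools may also weigh read length.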

Cite
Katherine Denby; Mehmet Fatih Kara; Wenbin Guo; Runxuan Zhang (2024). LsRTDv1: A reference transcript dataset for accurate transcript-specific expression analysis in lettuce [Dataset]. http://doi.org/10.5061/dryad.xwdbrv1m8

Data from: LsRTDv1: A reference transcript dataset for accurate transcript-specific expression analysis in lettuce

Related Article
Explore at:
Available download formats: zip
Dataset updated
May 29, 2024
Dataset provided by
University of York
James Hutton Institute
Authors
Katherine Denby; Mehmet Fatih Kara; Wenbin Guo; Runxuan Zhang
License

https://spdx.org/licenses/CC0-1.0.html

Description

Accurate quantification of gene and transcript-specific expression, with the underlying knowledge of precise transcript isoforms, is crucial to understanding many biological processes. Analysis of RNA sequencing data has benefited from the development of alignment-free algorithms which enhance the precision and speed of expression analysis. However, such algorithms require a reference transcriptome. Here we present a reference transcript dataset (LsRTDv1) for lettuce, combining long- and short-read sequencing with publicly available transcriptome annotations, and filtering to keep only transcripts with high-confidence splice junctions and transcriptional start and end sites. LsRTDv1 is a valuable resource for the investigation of transcriptional and alternative splicing regulation in lettuce. Methods We generated a lettuce Reference Transcript Dataset (LsRTDv1) by integrating transcript assemblies from short- and long-read RNA sequencing data with existing lettuce genome annotations. RNA sequencing data was generated from 23 different lettuce samples capturing different tissues, ages of plant and treatments. The 23 samples, all from Lactuca sativa cv. Saladin (synonymous with cv. Salinas) were combined equally into 7 samples prior to sequencing. Short-read assembly The RNA-seq reads of the seven pooled samples were pre-processed with Fastp (Chen et al., 2018) to remove adapters and filter low-quality reads (quality score <20, length <30). Trimmed reads were mapped to the latest lettuce reference genome assembly in NCBI (Lsat_Salinas_v11) using STAR aligner in the 2-pass mode to increase the mapping sensitivity at splice junctions (SJs)(Dobin and Gingeras, 2015). Mismatch was set to 1 with minimum and maximum intron sizes of 60 and 15,000 bp respectively. Two transcript assemblers, StringTie (Pertea et al., 2015) and Scallop (Shao and Kingsford, 2017), were used to assemble transcripts for each sample. 
The assemblies were then merged and refined using RTDmaker (https://github.com/anonconda/RTDmaker) to remove low-quality transcripts, including redundant transcripts with identical intron combinations to longer transcripts, fragmented transcripts with length <70% of gene length, transcripts with non-canonical SJs, transcripts with SJs only supported by <5 spliced reads in <2 samples and low expressed transcripts with <1 transcript per million reads (TPM) in <2 samples. Long-read assembly We employed the IsoSeq pipeline (https://github.com/PacificBiosciences/IsoSeq) to pre-process the Iso-seq data from the seven samples. The CCS method was used to generate circular consensus sequences (CCS) from raw subreads and reads with minimum predicted accuracy <90% were discarded (--min-rq=0.9). Barcodes associated with the CCS reads were eliminated using the lima method. To further refine the reads, Isoseq3 was applied to trim poly(A) tails and identify and remove concatemers. The output of full-length, non-concatemer (FLNC) reads was mapped to the reference genome using Minimap2 (Li, 2018). TAMA-collapse was used to collapse redundant transcript models in each sample with variation at the 5’ and 3’ ends and at SJs not allowed (-a = 0, -m = 0 and -z = 0) to ensure high accuracy of boundaries. Reads with errors within the 10 bp up- or down-stream of a SJ were removed. TAMA-merge was used to merge transcript models from the seven samples (Kuo et al., 2020). To improve the quality of the assembly, we implemented well-established methods for SJ and transcript start site (TSS) and end site (TES) analyses previously used for Arabidopsis AtRTD3 and barley BaRTv2 (Zhang et al., 2022b; Coulter et al., 2022). We removed low-quality transcripts that exhibited non-canonical SJs and low quality SJs unless they were also present in the short-read assembly. We applied a binomial test to distinguish high-confidence TSS and TES with a false discovery rate (FDR) <0.05. 
For genes with limited read support, statistical testing becomes challenging, hence we also kept TSS/TES if they were supported by at least 2 Iso-seq reads. Redundancy merge was applied to transcripts if they only differed ±50 nucleotides at their TSS/TES. In addition, transcripts only supported by a single Iso-seq read were removed from the final dataset. Integration of multiple annotations We integrated four transcript annotations: the long-read assembly, short-read assembly and two versions of Lsat_Salinas_v11 genome annotations GenBank (GCA_002870075.4) and RefSeq (GCF_002870075.4). The Iso-seq long-read assembly served as the reliable backbone, while the other three annotations were incorporated in a step-wise manner to improve the RTD completeness. Firstly, the transcripts in the short-read assembly that introduce novel SJs and/or novel gene loci were integrated into the long-read assembly. Subsequently, we added transcripts from GenBank and RefSeq annotations that contributed novel SJs or gene loci to build the lettuce RTD (LsRTDv1). In cases where two transcripts from GenBank and RefSeq had identical SJ combinations or were mono-exonic transcripts with overlapping regions exceeding 30% of both transcripts, we collapsed them to a single transcript, and the longest TSS and TES were used as the start and end point of the collapsed transcript. In LsRTDv1, the overlapped transcripts were assigned the same gene ID. However, if a set of overlapped transcripts entirely resided within the intron region of other transcripts, they were treated as intronic transcripts and assigned with a different gene ID. Where the overlapped transcripts can be divided into multiple groups and the adjacent groups overlapped less than 5% of the group lengths, they were assigned separate gene IDs.
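The collapsing rule for mono-exonic transcripts can be illustrated with a minimal sketch. It assumes transcripts are half-open `(start, end)` intervals on the same strand and chromosome, reads "exceeding 30% of both transcripts" as a strict reciprocal threshold, and takes the outermost start and end as the collapsed boundaries; the function names and coordinates are hypothetical, and this is not the RTDmaker implementation.

```python
from typing import Optional, Tuple

Interval = Tuple[int, int]  # half-open (start, end), same strand assumed


def reciprocal_overlap(a: Interval, b: Interval) -> Tuple[float, float]:
    """Return the overlap as a fraction of each transcript's length."""
    overlap = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    return overlap / (a[1] - a[0]), overlap / (b[1] - b[0])


def collapse_mono_exonic(
    a: Interval, b: Interval, min_fraction: float = 0.30
) -> Optional[Interval]:
    """Merge two mono-exonic transcripts if the overlap exceeds the
    threshold for both; otherwise return None."""
    frac_a, frac_b = reciprocal_overlap(a, b)
    if frac_a > min_fraction and frac_b > min_fraction:
        # The collapsed model spans the outermost start and end sites.
        return (min(a[0], b[0]), max(a[1], b[1]))
    return None


# Hypothetical coordinates, for illustration only.
genbank_tx = (100, 600)
refseq_tx = (400, 1000)
distant_tx = (550, 2000)
print(collapse_mono_exonic(genbank_tx, refseq_tx))
print(collapse_mono_exonic(genbank_tx, distant_tx))
```

Requiring the threshold on both transcripts prevents a short transcript that sits inside a much longer one from being absorbed on the basis of one-sided overlap alone.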
