Archive database for output data generated by next-generation sequencing machines including Roche 454 GS System, Illumina Genome Analyzer, Applied Biosystems SOLiD System, and others. DRA is a member of the International Nucleotide Sequence Database Collaboration (INSDC) and archives the data in close collaboration with the NCBI Sequence Read Archive (SRA) and the EBI Sequence Read Archive (ERA). Please submit trace data from conventional capillary sequencers to the DDBJ Trace Archive.
Public archive providing a comprehensive record of the world's nucleotide sequencing information, covering raw sequencing data, sequence assembly information and functional annotation. All submitted data, once public, will be exchanged with the NCBI and DDBJ as part of the INSDC data exchange agreement. The European Nucleotide Archive (ENA) captures and presents information relating to experimental workflows that are based around nucleotide sequencing. A typical workflow includes the isolation and preparation of material for sequencing, a run of a sequencing machine in which sequencing data are produced and a subsequent bioinformatic analysis pipeline. ENA records this information in a data model that covers input information (sample, experimental setup, machine configuration), output machine data (sequence traces, reads and quality scores) and interpreted information (assembly, mapping, functional annotation). Data arrive at ENA from a variety of sources including submissions of raw data, assembled sequences and annotation from small-scale sequencing efforts, data provision from the major European sequencing centers and routine and comprehensive exchange with its partners in the International Nucleotide Sequence Database Collaboration (INSDC). Provision of nucleotide sequence data to ENA or its INSDC partners has become a central and mandatory step in the dissemination of research findings to the scientific community. ENA works with publishers of scientific literature and funding bodies to ensure compliance with these principles and to provide optimal submission systems and data access tools that work seamlessly with the published literature. ENA is made up of a number of distinct databases that include the EMBL Nucleotide Sequence Database (EMBL-Bank), the newly established Sequence Read Archive (SRA) and the Trace Archive. The main tool for downloading ENA data is the ENA Browser, which is available through REST URLs for easy programmatic use.
All ENA data are available through the ENA Browser. Note: EMBL Nucleotide Sequence Database (EMBL-Bank) is entirely included within this resource.
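As a rough illustration of the programmatic access mentioned above, the sketch below only builds REST-style URLs and sends no request. The base URLs, the `fasta` endpoint and the `filereport` parameters are assumptions based on the public ENA Browser and Portal APIs and should be checked against the current ENA documentation; the function names and the example accession are hypothetical.

```python
from urllib.parse import urlencode

# Assumed base URLs of the public ENA services (verify against ENA documentation).
ENA_BROWSER_API = "https://www.ebi.ac.uk/ena/browser/api"
ENA_PORTAL_API = "https://www.ebi.ac.uk/ena/portal/api"


def browser_record_url(accession: str, fmt: str = "fasta") -> str:
    """Build a URL for retrieving one record in the given format."""
    return f"{ENA_BROWSER_API}/{fmt}/{accession}"


def filereport_url(accession: str, fields=("run_accession", "fastq_ftp")) -> str:
    """Build a URL listing the read files attached to a study or sample."""
    query = urlencode(
        {
            "accession": accession,
            "result": "read_run",
            "fields": ",".join(fields),
            "format": "tsv",
        },
        safe=",",  # keep the comma-separated field list readable
    )
    return f"{ENA_PORTAL_API}/filereport?{query}"


# "PRJEB00000" is a placeholder accession, not a real project.
print(browser_record_url("PRJEB00000", fmt="xml"))
print(filereport_url("PRJEB00000"))
```

The resulting strings can be passed to any HTTP client; keeping URL construction separate from the request makes the parameters easy to inspect before anything is downloaded.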
The present data set provides an Excel file in a zip archive. The file lists 334 samples of size-fractionated eukaryotic plankton communities with a suite of associated metadata (Database W1). Note that while most samples represented the piconano- (0.8-5 µm, 73 samples), nano- (5-20 µm, 74 samples), micro- (20-180 µm, 70 samples), and meso- (180-2000 µm, 76 samples) planktonic size fractions, some represented different organismal size fractions: 0.2-3 µm (1 sample), 0.8-20 µm (6 samples), 0.8 µm - infinity (33 samples), and 3-20 µm (1 sample). The table contains the following fields: a unique sample sequence identifier; the sampling station identifier; the Tara Oceans sample identifier (TARA_xxxxxxxxxx); an INSDC accession number for retrieving raw sequence data from the major nucleotide databases (short read archives at EBI, NCBI or DDBJ); the depth of sampling (Subsurface - SUR or Deep Chlorophyll Maximum - DCM); the targeted size range; the sequence template (either DNA or WGA/DNA if DNA extracted from the filters was Whole Genome Amplified); the latitude of the sampling event (decimal degrees); the longitude of the sampling event (decimal degrees); the time and date of the sampling event; the device used to collect the sample; the logsheet event corresponding to the sampling event; the volume of water sampled (liters). Then follows information on the bioinformatic cleaning pipeline shown in Figure W2 of the supplementary literature publication: the number of merged pairs present in the raw sequence file; the number of those sequences matching both primers; the number of sequences after quality-check filtering; the number of sequences after chimera removal; and finally the number of sequences after selecting only barcodes present in at least three copies in total and in at least two samples.
Finally, the following are given for each sequence sample: the number of distinct sequences (metabarcodes); the number of OTUs; the average number of barcodes per OTU; the Shannon diversity index based on barcodes for each sample (URL of W4 dataset in PANGAEA); and the Shannon diversity index based on each OTU (URL of W5 dataset in PANGAEA).
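The last filtering step and the Shannon index described above can be sketched as follows. This is an illustrative reading of the description, not the pipeline used to build the table: the function names and input layout are hypothetical, and the natural logarithm is an assumption because the description does not state the log base.

```python
import math
from collections import Counter


def filter_barcodes(counts_by_sample, min_total=3, min_samples=2):
    """Return barcodes seen in >= min_total copies overall and in >= min_samples samples.

    counts_by_sample maps sample name -> {barcode: copy number}.
    """
    total = Counter()
    presence = Counter()
    for counts in counts_by_sample.values():
        for barcode, copies in counts.items():
            if copies > 0:
                total[barcode] += copies
                presence[barcode] += 1
    return {b for b in total if total[b] >= min_total and presence[b] >= min_samples}


def shannon_index(abundances):
    """Shannon diversity H = -sum(p_i * ln(p_i)) over non-zero abundances."""
    total = sum(abundances)
    return -sum((n / total) * math.log(n / total) for n in abundances if n > 0)


# Toy data: barcode "A" passes both thresholds, "B" and "C" do not.
samples = {"s1": {"A": 2, "B": 3}, "s2": {"A": 1, "C": 1}}
print(filter_barcodes(samples))
print(shannon_index([5, 5]))
```

Passing barcode counts to `shannon_index` gives a barcode-based index, and passing per-OTU counts gives an OTU-based one.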
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Library preparation: Total RNA from pre-aestivation (5-day-old), aestivation (30-day-old), and post-aestivation (55-day-old) female beetles was extracted using the ZYMO Quick-RNA Tissue/Insect Kit (ZYMO Research, Irvine, CA, USA) and cleaned using the TURBO DNA-free™ kit (Thermo Fisher Scientific, Langenselbold, Germany) according to the manufacturer’s instructions. We opted to sample only the females to eliminate sex-related variations. RNA quantity was determined using a Nanodrop ND-1000 UV/Vis spectrophotometer (Thermo Fisher Scientific). The integrity of the RNA samples was determined using the Agilent 2100 Bioanalyzer and an RNA 6000 Nano Kit (Agilent Technologies, Santa Clara, CA, USA). RIN values ≥ 7.0 were considered appropriate for mRNA library preparation. In total, 10 libraries (4, 3, and 3 libraries for the pre-aestivation, aestivation, and post-aestivation stages, respectively) were prepared using the NEBNext® Poly(A) mRNA Magnetic Isolation Module kit (NEB E7490, New England Biolabs) according to the manufacturer’s instructions. The quality of the libraries was checked via RNA fragment analysis conducted on the Agilent 2100 Bioanalyzer using the Agilent DNF-935 Reagent Kit (Agilent Technologies). The libraries were pooled based on their concentration, and an overall concentration of 3.4 ng/µL was obtained. The sequencing service was provided by BGI Genomics Tech Solutions Co. Ltd (Hong Kong) on a DNBSEQ-T7 platform. The ten raw read files were deposited in the Sequence Read Archive (SRA) database of NCBI under the accessions SAMN33022552 - SAMN33022561.
De novo assembly and functional annotation: Erroneous k-mers from paired read ends were removed using Rcorrector (v1.0.5) with the default options (Song & Florea, 2015), and the unfixable reads were discarded using the “FilterUncorrectabledPEfastq.py” function in Transcriptome Assembly Tools (Song & Florea, 2015).
The adaptor sequences from the reads were removed, and the reads having a quality score above 30 were retained using TrimGalore! (v0.6.7). The cleaned reads (n = 3 per three adult phases) were de novo assembled using Trinity with default options. In total, 224 million bases covering 341,670 transcripts, including putative isoforms, were successfully assembled. The de novo assembly had an N50 value of 1532 and a BUSCO (v5.4.2) completeness score of 96.7% when compared against the endopterygota lineage (BUSCO.v4 datasets). Furthermore, the putative isoforms were combined to obtain a supertranscriptome that contained 189,229 transcripts in total. The supertranscriptome was deposited at GenBank as a Transcriptome Shotgun Assembly (TSA) under the accession GKIH00000000.1. The transcriptome (including isoforms) was annotated using Trinotate (v3.2.2), which combines the outputs of NCBI BLAST+ (v2.13.0; nucleotide and predicted protein BLAST), TransDecoder (v5.5.0; coding region prediction), SignalP (v4.0; signal peptide prediction), TMHMM (v2.0; transmembrane domain prediction), and HMMER (v3.3.2; homology search) packages into an SQLite annotation database. The latest uniprot_sprot (04/2022) and Pfam-A (11/2015) databases were downloaded using Trinotate, and the default E-value thresholds were used during the searches with BLAST+ and HMMER, respectively. The obtained annotation database was used to extract gene ontology (GO) terms associated with individual genes using the “extract_GO_assignments_from_Trinotate_xls.pl” script, whereas the SignalP and TMHMM outputs were manually extracted using Excel spreadsheets. The longest protein-coding regions in the supertranscript data predicted by TransDecoder were subjected to Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway annotation via GhostKOALA v2.2 (https://www.kegg.jp/ghostkoala/). The annotation database was made available publicly on Figshare (https://doi.org/10.6084/m9.figshare.21922938).
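For readers unfamiliar with the N50 statistic reported above, a minimal sketch of how it is commonly computed from a list of transcript (or contig) lengths is given below. The function name is hypothetical and this is not the code used by the authors or by Trinity.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L together
    contain at least half of all assembled bases.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no sequence lengths given")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy example: 20 bases in total; the two longest sequences (6 + 5) reach half.
print(n50([2, 3, 4, 5, 6]))
```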
Differentially expressed genes: The read counts per putative gene were calculated using Salmon (v1.9) by mapping the cleaned reads onto our de novo transcriptome. Genes with fewer than 15 read counts across all samples were filtered out, and the R package “DESeq2” (v4.2) was used to identify the differentially expressed genes in the following comparisons: aestivation vs. pre-aestivation, aestivation vs. post-aestivation, and pre-aestivation vs. post-aestivation (DESeq2 was also allowed to conduct the default filtering). For each comparison, the genes having adjusted P values (testing the null hypothesis that the log2 fold change, LFC, was 0) below 0.05 in addition to LFC values below -1 and above 1 were accepted as significantly down- and up-regulated, respectively.
Enrichment analyses: The “enricher” function in the R package “clusterProfiler” was used to analyze the enrichment status of GO terms and KEGG pathways associated with the differentially expressed genes in the three pair-wise comparisons. All the genes that had passed the filtration before the DESeq2 analysis served as the background. Importantly, we did not distinguish between up- and down-regulation during the enrichment analyses due to the ambiguous nature of the term and pathway annotations. We selected the top 14 most significantly enriched GO terms and the top 3 most significantly enriched KEGG pathways to be shown in the bubble plots (full enrichment results were provided in Fig. S). The dataset was also investigated in terms of the number of genes predicted to have signal peptides, transmembrane domains, both, or neither. The number of genes belonging to each category was determined by manually investigating the SQLite annotation database, and Chi-squared tests were performed to compare the proportion of each category among differentially expressed genes with that among the background genes.
Here, the upregulated and downregulated genes were separately analyzed, and Bonferroni correction was applied (P < .05/18 = .002). The gene hits from significantly enriched GO terms of interest were selected to visualize their expression at the three adult stages. A custom R script was used to Z-normalize the expression of each gene across the three adult stages, and GraphPad Prism v10.0 was used to construct the heat maps. The names of the genes were extracted from the annotation database constructed in this study.
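The significance thresholds and the Z-normalization described above can be illustrated with a short sketch. It is not the authors' R script: the function names are hypothetical, the cut-offs are the ones quoted in the text (adjusted P < 0.05, LFC below -1 or above 1), and the use of the sample standard deviation is an assumption.

```python
from statistics import mean, stdev


def classify_gene(lfc, padj, alpha=0.05, lfc_cutoff=1.0):
    """Label a gene as 'up', 'down' or 'ns' (not significant)."""
    if padj is None or padj >= alpha:  # None stands for a missing adjusted P value
        return "ns"
    if lfc > lfc_cutoff:
        return "up"
    if lfc < -lfc_cutoff:
        return "down"
    return "ns"


def z_normalize(values):
    """Center one gene's expression values on 0 and scale to unit sample SD."""
    centre = mean(values)
    spread = stdev(values)
    if spread == 0:
        return [0.0 for _ in values]
    return [(v - centre) / spread for v in values]


print(classify_gene(lfc=1.5, padj=0.01))
print(z_normalize([1, 2, 3]))
```

Z-normalizing each gene separately puts genes with very different absolute expression levels on a common scale, which is what makes a heat map across stages readable.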
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Supplementary Material for:
Emerling C.A., Springer M.S., Gatesy J., Jones Z., Hamilton D., Xia-Zhu D., Collin M.A., and Delsuc F. (2021). Genomic evidence for the parallel regression of melatonin synthesis and signaling pathways in placental mammals. Open Research Europe.
Supplementary File Legends:
- Supplementary_Figure_S1.pdf: RAxML AANAT gene tree.
- Supplementary_Figure_S2.pdf: RAxML ASMT gene tree.
- Supplementary_Figure_S3.pdf: RAxML MTNR1A+MTNR1B tree.
- Supplementary_Figure_S4.pdf: PAML AANAT results, model 1 (see Supplementary Table S7).
- Supplementary_Figure_S5.pdf: PAML ASMT results, model 2 (see Supplementary Table S8).
- Supplementary_Figure_S6.pdf: PAML MTNR1A results, model 1 (see Supplementary Table S9).
- Supplementary_Figure_S7.pdf: PAML MTNR1B results, model 1 (see Supplementary Table S10).
- Supplementary_Table_S1.xlsx: List of species examined in this study and the sources of the genes. Source key: WGS: sequences derived from NCBI's Whole Genome Shotgun database; Whole Genome Sequencing of Short Reads: whole genomes were sequenced using short-read technologies. The methodologies varied for the species, and will be published with other projects, so please contact the author(s) for information on the specific methodology and samples used; SRA: sequences derived from NCBI's Sequence Read Archive; GenBank: sequences derived from NCBI's nucleotide collection; Bowhead Whale Genome Resource: sequences derived from http://www.bowhead-whale.org; Ensembl: sequences derived from Ensembl genome browser (www.ensembl.org); Discovar de novo: sequences derived from genomes assembled via Discovar de novo (https://software.broadinstitute.org/software/discovar/blog/).
- Supplementary_Table_S2.xlsx: Accession numbers and functionality of AANAT in species examined. Parentheses after an accession number indicate the coordinates of the sequence on the contig / scaffold. Exon colors code for the following: green = putatively functional; yellow = missing; pink = one or more inactivating mutations found. Abbreviations for mutations are as follows: del = deletion; ins = insertion; start = start codon mutation; stop = premature stop codon; ? = ambiguity whether the mutation is shared among all members of the clade. Abbreviations in brackets following an inactivating mutation indicate a shared inactivating mutation. Key for each abbreviation follows: Bacu = Balaenoptera acutorostrata; BALA = Balaenidae; BALAEN = Balaenopteridae; Bbon = Balaenoptera bonaerensis; CAB = Cabassous; Ccap = Cebus capucinus; CETA = Cetacea; CHLAM = Chlamyphoridae; CHOL = Choloepus; Cjac = Callithrix jacchus; CING = Cingulata; DASY = Dasypodidae; DELP = Delphinidae; DERM = Dermoptera; Erob = Eschrichtius robustus; INIA = Inia; FOLI = Folivora; GALE = Galeopterus; LIPO = Lipotes; Lobl = Lagenorhynchus obliquidens; MANI = Manidae; MONO = Monodontidae; MYRM = Myrmecophagidae; MYST = Mysticeti; NPP = Not present in Platanista or Physeteroidea, but present in other Odontocetes; NPZ = Not present in Ziphiidae, but present in other Odontocetes; Oorc = Orcinus orca; PEUT = Tolypeutinae; PHOC = Phocoenidae; PHOL = Pholidota; PHOR = Chlamyphorinae; PILO = Pilosa; PHYS = Physeteroidea; PONT = Pontoporia; Schi = Sousa chinensis; SIRE = Sirenia; Tadu = Tursiops aduncus; TOLY = Tolypeutes; VERM = Vermilingua; XEN = Xenarthra.
- Supplementary_Table_S3.xlsx: Accession numbers and functionality of ASMT in species examined. See Table S2 caption for details.
- Supplementary_Table_S4.xlsx: Accession numbers and functionality of MTNR1A in species examined. See Table S2 caption for details.
- Supplementary_Table_S5.xlsx: Accession numbers and functionality of MTNR1B in species examined. See Table S2 caption for details.
- Supplementary_Table_S6.xlsx: Codon frequency model selection. These are the results from one ratio dN/dS analyses using different codon frequency models.
- Supplementary_Table_S7.xlsx: Results of AANAT PAML dN/dS analyses. Model: BG = branch(es) grouped with background; fixed 1 = branch(es) fixed at 1. p-value: specific p-value only shown if lower than 0.05. Model Comparison: if model comparison yields statistically significant differences (p < 0.05), model comparison bolded and given green background. For most models, w only shown for branch(es) of interest.
- Supplementary_Table_S8.xlsx: Results of ASMT PAML dN/dS analyses. Refer to Table S7 caption for additional details.
- Supplementary_Table_S9.xlsx: Results of MTNR1A PAML dN/dS analyses. Refer to Table S7 caption for additional details.
- Supplementary_Table_S10.xlsx: Results of MTNR1B PAML dN/dS analyses. Refer to Table S7 caption for additional details.
- Supplementary_Table_S11.xlsx: Results of BLASTing and mapping short reads from Alligator mississippiensis RNA sequencing experiments.
- Supplementary_Dataset_S1_all_ali_fasta.txt: Genomic alignments in fasta format used to determine the pseudogene/functional status of the different genes in different taxonomic groups.
- Supplementary_Dataset_S2_AANAT_RAxML_ali.phy: Alignment of AANAT in phylip format used in maximum likelihood phylogenetic reconstruction with RAxML.
- Supplementary_Dataset_S3_ASMT_RAxML_ali.phy: Alignment of ASMT in phylip format used in maximum likelihood phylogenetic reconstruction with RAxML.
- Supplementary_Dataset_S4_MTNR1A_MTNR1B_RAxML_ali.phy: Alignment of MTNR1A and MTNR1B in phylip format used in maximum likelihood phylogenetic reconstruction with RAxML.
- Supplementary_Dataset_S5_AANAT_PAML_alig.fasta: Codon alignment of AANAT in fasta format used in selection pressure analyses with PAML.
- Supplementary_Dataset_S6_ASMT_PAML_ali.fasta: Codon alignment of ASMT in fasta format used in selection pressure analyses with PAML.
- Supplementary_Dataset_S7_MTNR1A_PAML_ali.fasta: Codon alignment of MTNR1A in fasta format used in selection pressure analyses with PAML.
- Supplementary_Dataset_S8_MTNR1B_PAML_ali.fasta: Codon alignment of MTNR1B in fasta format used in selection pressure analyses with PAML.
- Supplementary_Dataset_S9_PAML_topology.tre: Tree topology in newick format used in selection pressure analyses with PAML.
https://spdx.org/licenses/CC0-1.0.html
In the last few years, the bed bug Cimex lectularius has been an increasing problem world-wide, mainly due to the development of insecticide resistance to pyrethroids. The characterization of resistance alleles is a prerequisite to improve surveillance and resistance management. To identify genomic variants associated with pyrethroid resistance in Cimex lectularius, we compared the genetic composition of two recent and resistant populations with that of two ancient susceptible strains using a genome-wide pool-seq design. We identified a large 6 Mb "superlocus" showing particularly high genetic differentiation and association with the resistance phenotype. This superlocus contained several clustered resistance genes, and was also characterized by a high density of structural variants (inversions, duplications). The possibility that this superlocus constitutes a resistance "supergene" that evolved after the clustering of alleles adapted to insecticide and after reduction in recombination is discussed.
Methods: The four strains used in this study were provided by CimexStore Ltd (Chepstow, United Kingdom). Two of these strains were susceptible to pyrethroids (S), as they were collected before their massive use and have been maintained under laboratory conditions without insecticide exposure for more than 40 years: German Lab (GL, collected in Monheim, Germany) and London Lab (LL, collected in London, Great Britain). The other two resistant (R) populations were London Field (LF, collected in 2008 in London), moderately resistant to pyrethroids, and Sweden Field (SF, collected in 2015 in Malmö, Sweden), with a moderate-to-high resistance level. For each strain, genomic DNA was extracted from 30 individual females (except for London Lab, which had only 28) using the NucleoSpin 96 Tissue Kit (Macherey Nagel, Hoerdt, France) and eluted in 100 μL of BE buffer.
DNA concentration of these samples was measured using the Quant-iT PicoGreen Kit (ThermoFisher, Waltham, MA, USA) according to the manufacturer’s instructions. Samples were then gathered with an equal DNA quantity into pools. DNA purification was performed for each pool with 1.8 times the sample volume in AMPure XP beads (Beckman Coulter, Fullerton, CA, USA). Purified DNA was retrieved in 100 μL of ultrapure water. Pool concentrations were measured with Qubit using the DNA HS Kit (Agilent, Santa Clara, CA, USA). Final pool concentrations were as follows: 38.5 ng/μL for London Lab, 41.6 ng/μL for London Field, 40.3 ng/μL for German Lab and 38 ng/μL for Sweden Field. Sequencing was performed by Genotoul (Castanet-Tolosan, France) using the TruSeq Nano Kit (Illumina, San Diego, CA, USA) to produce paired-end reads of 2 x 150 bp length and a coverage of 25 X for London Lab, 32 X for London Field, 39.5 X for German Lab and 25.4 X for Sweden Field. The whole pipeline with the detail of parameters used is available on GitHub (https://github.com/chaberko-lbbe/clec-poolseq). Quality control analysis of reads obtained from each line was performed using FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc). The raw data have been submitted to the Sequence Read Archive (SRA) database of NCBI under BioProject PRJNA826750. Sequencing reads were filtered using Trimmomatic software v0.39 (Bolger et al., 2014), which removes adaptors. FastUniq v1.1 was then used to remove PCR duplicates (Xu et al., 2012). Reads were mapped on the C. lectularius reference genome (Clec_2.1 assembly, Harlan strain) produced as part of the i5K project (Poelchau et al., 2015), with an estimated size of 510.83 Mb. Mapping was performed using BWA mem v0.7.4 (Li and Durbin, 2009). Sam files were converted to bam format using samtools v1.9, and cleaned of unmapped reads (Li et al., 2009). The 1573 nuclear scaffolds were kept in this analysis, while the mitochondrial scaffold was not considered.
Bam files corresponding to the four populations were converted into mpileup format with samtools v1.9. The mpileup file was then converted to sync format by PoPoolation2 version 1201 (Kofler et al., 2011). A total of 8.03 million (M) SNPs were detected in this sync file using the R/poolfstat package v2.0.0 (Hivert et al., 2018) with the following parameter: coverage per pool between 10 and 50. Fixation indexes (FST) were computed with R/poolfstat for each pairwise population comparison of each SNP. The global SNP pool was then trimmed on a minor allele frequency (MAF) of 0.2 (computed as MAF = 0.5 − |p − 0.5|, with p being the average frequency across all four populations). This relatively high MAF value was chosen in order to remove loci for which we have very limited power to detect any association with the resistance phenotype in the BayPass analysis. BayPass v2.3 (Olazcuaga et al., 2020) was used with default parameters. The final dataset was thus reduced to 2.92M SNPs located on 990 scaffolds.
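The MAF trimming step can be written out as a short sketch. It follows the formula given above, MAF = 0.5 − |p − 0.5|, where p is the mean allele frequency over the four pools; the function names are hypothetical, and keeping SNPs whose MAF is greater than or equal to the threshold (rather than strictly greater) is an assumption. The authors' own code is in the GitHub repository cited above.

```python
def minor_allele_frequency(p):
    """Fold an allele frequency p (0..1) into a minor allele frequency (0..0.5)."""
    return 0.5 - abs(p - 0.5)


def passes_maf_filter(pool_frequencies, threshold=0.2):
    """Keep a SNP if the MAF of its mean frequency across pools reaches the threshold."""
    mean_frequency = sum(pool_frequencies) / len(pool_frequencies)
    return minor_allele_frequency(mean_frequency) >= threshold


# One SNP, allele frequencies in four pools (toy values).
print(passes_maf_filter([0.30, 0.35, 0.25, 0.30]))  # mean 0.30 -> MAF 0.30
print(passes_maf_filter([0.95, 0.90, 1.00, 0.85]))  # mean 0.925 -> MAF 0.075
```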
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
a Based on the Agroforestree Database (www.worldagroforestry.org/resources/databases/agroforestree), an open access resource of ICRAF that provides data on >650 trees.
b The seed source of material for NGS varied and included natural stands, seed orchards and landraces. The numerical reference is the ICRAF accession number.
c Current data from NGS; complete information is available at the tropiTree portal (http://bioinf.hutton.ac.uk/tropiTree). In () is the number of perfect SSRs identified. In [] is the percentage of the corresponding transcripts that have TAIR hits (for all SSRs).
d Data from National Center for Biotechnology Information of the USA (NCBI) searches were included to illustrate previous sequencing work. Searches were undertaken on 14 April 2014 via the Entrez search system (www.ncbi.nlm.nih.gov/sites/gquery). Species names for NCBI searches were checked as correct against current nomenclature using the Agroforestry Species Switchboard (www.worldagroforestry.org/products/switchboard/), an open access resource of ICRAF that provides links to information on >20,000 plants. Current names were set as ‘organism’ in NCBI searches. In () is the number of ESTs listed in NCBI nucleotide citations (if any). In [] is the number of NGS studies cited in NCBI’s Sequence Read Archive (if any).
e As well as being of importance to small-scale farmers, Acacia mangium and Jatropha curcas have wide commercial interests (see text), explaining the high NCBI citations.
f Species were subject to primer validation (see text).
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Loblolly pine (Pinus taeda L.) is one of the most widely planted and commercially important forest tree species in the USA and worldwide, and is an object of intense genomic research. However, whole genome resequencing in loblolly pine is hampered by the large size and complexity of its genome and the lack of a good reference. As a valid and more feasible alternative, entire exome sequencing was hence employed to identify the gene-associated single nucleotide polymorphisms (SNPs) and to genotype the sampled trees. Resources in this dataset: Resource Title: Availability of supporting data. File Name: Web Page, url: https://doi.org/10.1186/s12864-016-3081-8 The data sets supporting the results of this article are included within the article and additional files. The raw SNP data and Illumina HiSeq short read sequences are deposited in the NCBI Single Nucleotide Polymorphism Database (dbSNP) (accession numbers ss1995911273-ss1996900602; http://www.ncbi.nlm.nih.gov/SNP) and Sequence Read Archive (SRA) (accession number SRP075763; http://www.ncbi.nlm.nih.gov/sra).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
JRP24-FBZSH9-BEONE WP1 deliverable 1.2. WP Leader: Vítor Borges (INSA). Other contributors: Verónica Mixão (INSA), Miguel Pinto (INSA), Holger Brendebach (BfR), Simon Tausch (BfR), Carlus Deneke (BfR), Karin Lagesen (NVI). In order to contribute to the accomplishment of specific objectives of the BeOne project, WP1-T2 compiled an anonymized dataset (including sequencing reads and respective metadata) aiming to capture the genomic diversity within the populations of Listeria monocytogenes, Salmonella enterica, Escherichia coli (STEC) and Campylobacter jejuni. This dataset includes data shared by the BeOne partners and comprises a total of 3,884 isolates, for which the anonymized sequencing reads were released in the European Nucleotide Archive (ENA) and the anonymized genome assemblies in the Zenodo repository [1,426 L. monocytogenes (accession: PRJEB57166 and 10.5281/zenodo.7267486); 1,540 S. enterica (accession: PRJEB57179 and 10.5281/zenodo.7267785); 308 E. coli (accession: PRJEB57098 and 10.5281/zenodo.7267844); 610 C. jejuni (accession: PRJEB57119 and 10.5281/zenodo.7267879)]. As a complement to the BeOne dataset, additional samples were carefully selected among the WGS data publicly available at the beginning of the analysis (November 2021) in ENA or the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA), in order to ensure the representativeness of the genomic diversity within public databases (assessed in terms of sequence type or serotype, depending on the species). In the end, a so-called “public dataset” with the 8,383 samples that passed the curation step was released in the Zenodo repository [1,874 L. monocytogenes (accession: 10.5281/zenodo.7116878); 1,434 S. enterica (accession: 10.5281/zenodo.7119735); 1,999 E. coli (accession: 10.5281/zenodo.7120057); 3,076 C. jejuni (accession: 10.5281/zenodo.7120166)].
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains the gene annotation data for three species of Blastobotrys yeasts: B. mokoenaii, B. illinoisensis, and B. malaysiensis.
The genome assemblies for B. mokoenaii (NRRL Y-27120) and B. malaysiensis (NRRL Y-6417) were publicly available on the National Center for Biotechnology Information (NCBI) under accessions GCA_003705765.3 and GCA_030558815.1, respectively.
The genome assembly for B. illinoisensis (NRRL YB-1343) was generated by SciLifeLab's National Genomics Infrastructure (NGI) using PacBio long-read data and deposited in the European Nucleotide Archive (ENA) under accession GCA_965113335.1.
File description
- bmokoenaii_annotation.gff: This file contains the gene models predicted for B. mokoenaii (GCA_003705765.3).
- billinoisensis_annotation.gff: This file contains the gene models predicted for B. illinoisensis (GCA_965113335.1).
- bmalaysiensis_annotation.gff: This file contains the gene models predicted for B. malaysiensis (GCA_030558815.1).
Gene annotation methods
Repeat Masking
Prior to annotation, a repeat library was built for each species using RepeatModeler2 v2.0.2 and the genomes were soft-masked using RepeatMasker v4.1.5.
$ RepeatModeler -database ${DB} -engine ncbi -pa 16
$ RepeatMasker -dir . -gff -u -no_is -xsmall -e ncbi -lib ${LIBRARY} -pa 16 genome.fasta
Structural Annotation
Structural annotation was performed on the soft-masked genomes using Braker3 v3.0.3 incorporating external evidence in the form of all fungal proteins from OrthoDB v11 (available at https://bioinf.uni-greifswald.de/bioinf/partitioned_odb11).
$ braker.pl --genome="$genome" \
--prot_seq=${protein} --workingdir=${PWD} \
--gff3 --threads=16 --verbosity=3 \
--nocleanup --species=${i}
Functional Annotation
The predicted genes were functionally annotated using the National Bioinformatics Infrastructure Sweden (NBIS) functional_annotation nextflow pipeline v2.0.0 (https://github.com/NBISweden/pipelines-nextflow). Briefly, this pipeline performs similarity searches between the annotated proteins and the UniProtKB/Swiss-Prot database (downloaded on 2023-12) using the Basic Local Alignment Search Tool (BLAST). It then uses InterProScan to query the proteins against InterPro v59-91 databases, and merges the results using AGAT v1.2.0.
tRNAs and rRNAs
Transfer RNA (tRNA) and ribosomal RNA (rRNA) genes were annotated using tRNAscan-SE v2.0.12 and barrnap v0.9, respectively. Other ncRNAs, such as SRP RNA, RNase P RNA, spliceosomal ncRNAs, etc., have not been predicted.
$ tRNAscan-SE -E --gff ${output}_trnas.gff --thread 16 ${genome}.fasta
$ barrnap --kingdom euk --threads 6 ${genome}.fasta > ${output}_rrna.gff
Annotation integration
Finally, the functionally annotated protein-coding genes, tRNAs, and rRNAs were combined into a single GFF file using AGAT v1.2.0.
$ agat_sp_complement_annotations.pl --ref ${protein_coding} --add ${trna} --add ${rrna} --out full_annotation.gff
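One simple way to sanity-check a merged annotation file is to count how many features of each type it contains. The sketch below is not part of the authors' pipeline; it assumes only the standard GFF layout of nine tab-separated columns with the feature type in the third column, and the function name is hypothetical.

```python
from collections import Counter


def count_feature_types(gff_lines):
    """Count GFF features by type (third column), skipping comments and non-feature lines."""
    counts = Counter()
    for line in gff_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        columns = line.split("\t")
        if len(columns) == 9:  # ignore anything that is not a feature row
            counts[columns[2]] += 1
    return counts


# Toy input with made-up coordinates, standing in for a real GFF file.
example = [
    "##gff-version 3",
    "chr1\tsource\tgene\t1\t900\t.\t+\t.\tID=gene1",
    "chr1\tsource\tmRNA\t1\t900\t.\t+\t.\tID=mrna1;Parent=gene1",
    "chr1\tsource\ttRNA\t1500\t1572\t.\t-\t.\tID=trna1",
]
print(count_feature_types(example))
```

For a file on disk, an open file handle can be passed in place of the list, for example the full_annotation.gff written by the command above.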
This release contains 9 tar.gz folders. Each of them contains two folders, Arthropoda and Mollusca, which in turn contain the files for the species that belong to each phylum.
- MATEdb_trinity_assemblies.tar.gz: Contains the assembled transcriptomes. The raw data were downloaded from the Sequence Read Archive (SRA) from NCBI using the SRA Toolkit version 2.10.7 whenever possible or manually from other repositories. Next, the raw data were filtered using fastp version 0.20.1. Finally, they were assembled using Trinity version 2.11.0.
- MATEdb_transdecoder_cds.tar.gz: Contains the ORFs, in nucleotides, for all transcripts annotated using TransDecoder 5.5.0. TransDecoder.LongOrfs was run to extract all ORFs with a minimum length of 100 amino acids. Next, TransDecoder.Predict was run, training with the top 25% longest ORFs.
- MATEdb_transdecoder_pep.tar.gz: Same as MATEdb_transdecoder_cds.tar.gz but the sequences are in amino acids.
- MATEdb_blobtools_filtered_cds.tar.gz: Contains the files from MATEdb_transdecoder_cds filtered by removing the transcripts with a non-metazoan origin using BlobTools v2.3.3. The files in MATEdb_blobtools_filtered_pep were mapped against the nr (non-redundant protein) database using diamond blastp version 2.0.8. This file was used to create a database, using the create command from BlobTools, which was used to obtain the list of contaminants using the extract_phyla_for_blobtools.py custom script. Next, the filter command from BlobTools was used to remove non-metazoan transcripts.
- MATEdb_blobtools_filtered_pep.tar.gz: Same as MATEdb_blobtools_filtered_cds.tar.gz but the sequences are in amino acids.
- MATEdb_longest_pep.tar.gz: Contains the files that include only the longest isoform for each gene in amino acids. These were obtained using the custom script fetch_longest_iso.py.
- MATEdb_genomes_cds.tar.gz: Contains the sequences in nucleotides corresponding to the longest isoform for every coding gene. The sequences were downloaded from different sources for each species and the longest isoform was obtained using the custom script fetch_longest_iso.py.
- MATEdb_genomes_pep.tar.gz: Same as MATEdb_genomes_cds.tar.gz but the sequences are in amino acids.
- MATEdb_eggnog_annotation.tar.gz: Contains the eggNOG annotation for the longest isoform for every species in the database. EggNOG-mapper version 2.1.6 was used.
For more details see: https://github.com/MetazoaPhylogenomicsLab/MATEdb
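The idea of keeping only the longest isoform per gene can be sketched as below. This is not the fetch_longest_iso.py script from the MATEdb repository: the function name is hypothetical, and it assumes plain Trinity-style transcript identifiers in which the gene identifier is everything before the trailing `_i<number>` isoform suffix.

```python
import re

# Trinity-style isoform suffix, e.g. "TRINITY_DN1_c0_g1_i2" -> gene "TRINITY_DN1_c0_g1".
ISOFORM_SUFFIX = re.compile(r"_i\d+$")


def longest_isoforms(sequences):
    """Map each gene id to the (transcript id, sequence) of its longest isoform.

    sequences maps transcript id -> sequence string; ties keep the first seen.
    """
    best = {}
    for transcript_id, sequence in sequences.items():
        gene_id = ISOFORM_SUFFIX.sub("", transcript_id)
        if gene_id not in best or len(sequence) > len(best[gene_id][1]):
            best[gene_id] = (transcript_id, sequence)
    return best


# Toy peptide sequences for two genes, one of which has two isoforms.
toy = {
    "TRINITY_DN1_c0_g1_i1": "MKT",
    "TRINITY_DN1_c0_g1_i2": "MKTLLV",
    "TRINITY_DN2_c0_g1_i1": "MA",
}
for gene, (transcript, seq) in longest_isoforms(toy).items():
    print(gene, transcript, len(seq))
```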
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In order to contribute to the accomplishment of specific objectives of the BeOne project, WP1-T2 compiled an anonymized dataset (including sequencing reads and respective metadata) aiming to capture the genomic diversity within the populations of Listeria monocytogenes, Salmonella enterica, Escherichia coli (STEC) and Campylobacter jejuni. This dataset includes data shared by the BeOne partners and comprises a total of 3,884 isolates, for which the anonymized sequencing reads were released in the European Nucleotide Archive (ENA) and the anonymized genome assemblies in the Zenodo repository [1,426 L. monocytogenes (accession: PRJEB57166 and 10.5281/zenodo.7267486); 1,540 S. enterica (accession: PRJEB57179 and 10.5281/zenodo.7267785); 308 E. coli (accession: PRJEB57098 and 10.5281/zenodo.7267844); 610 C. jejuni (accession: PRJEB57119 and 10.5281/zenodo.7267879)]. As a complement to the BeOne dataset, additional samples were carefully selected among the WGS data publicly available at the beginning of the analysis (November 2021) in ENA or the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA), in order to ensure the representativeness of the genomic diversity within public databases (assessed in terms of sequence type or serotype, depending on the species). In the end, a so-called “public dataset” with the 8,383 samples that passed the curation step was released in the Zenodo repository [1,874 L. monocytogenes (accession: 10.5281/zenodo.7116878); 1,434 S. enterica (accession: 10.5281/zenodo.7119735); 1,999 E. coli (accession: 10.5281/zenodo.7120057); 3,076 C. jejuni (accession: 10.5281/zenodo.7120166)].
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
“Dual SAGE” refers to the simultaneous analysis of host and pathogen by Serial Analysis of Gene Expression (SAGE), and “Multi RNA-seq” refers to a metatranscriptomic analysis of bacterial species constituting the airway microbiota in conjunction with nasal epithelial host cells. “M,” million; “TPM,” transcripts per million; “RPKMO,” reads per kilobase pairs of a gene per million reads aligning to annotated ORFs. Databases containing raw sequencing data: NCBI (National Center for Biotechnology Information), ENA (European Nucleotide Archive), GEO (Gene Expression Omnibus).