Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Contaminating sequences in public genome databases are a pervasive issue with potentially far-reaching consequences. This problem has attracted much attention in the recent literature, and many different tools are now available to detect contaminants. Although these methods are based on diverse algorithms that can sometimes produce widely different estimates of the contamination level, the majority of genomic studies rely on a single method of detection, which represents a risk of systematic error. In this work, we used two orthogonal methods to assess the level of contamination among National Center for Biotechnology Information Reference Sequence Database (RefSeq) bacterial genomes. First, we applied the most popular solution, CheckM, which is based on gene markers. We then complemented this approach with a genome-wide method, termed Physeter, which now implements a k-folds algorithm to avoid inaccurate detection due to potential contamination of the reference database. We demonstrate that CheckM cannot currently be applied to all available genomes and bacterial groups. While it performed well on the majority of RefSeq genomes, it produced dubious results for 12,326 organisms. Among those, Physeter identified 239 contaminated genomes that had been missed by CheckM. In conclusion, we emphasize the importance of using multiple methods of detection while providing an upgrade of our own detection tool, Physeter, which minimizes incorrect contamination estimates in the context of unavoidably contaminated reference databases.
The GenBank non-redundant protein sequence database (NRDB) is a component of the NCBI BLAST databases and contains entries from GenPept, Swiss-Prot, PIR, PRF, PDB, and NCBI RefSeq.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
MaizeMine is the data mining resource of the Maize Genetics and Genome Database (MaizeGDB; http://maizemine.maizegdb.org). It enables researchers to create and export customized annotation datasets that can be merged with their own research data for use in downstream analyses. MaizeMine uses the InterMine data warehousing system to integrate genomic sequences and gene annotations from the Zea mays B73 RefGen_v3 and B73 RefGen_v4 genome assemblies, Gene Ontology annotations, single nucleotide polymorphisms, protein annotations, homologs, pathways, and precomputed gene expression levels based on RNA-seq data from the Z. mays B73 Gene Expression Atlas. MaizeMine also provides database cross references between genes of alternative gene sets from Gramene and NCBI RefSeq. MaizeMine includes several search tools, including a keyword search, built-in template queries with intuitive search menus, and a QueryBuilder tool for creating custom queries. The Genomic Regions search tool executes queries based on lists of genome coordinates, and supports both the B73 RefGen_v3 and B73 RefGen_v4 assemblies. The List tool allows you to upload identifiers to create custom lists, perform set operations such as unions and intersections, and execute template queries with lists. When used with gene identifiers, the List tool automatically provides gene set enrichment for Gene Ontology (GO) and pathways, with a choice of statistical parameters and background gene sets. With the ability to save query outputs as lists that can be input to new queries, MaizeMine provides limitless possibilities for data integration and meta-analysis.
Purpose of experiments:
Sequence data were obtained to determine the community structure of pack sea-ice microbial communities and whether it is affected by exposure to elevated CO2 levels.
Summary of Methods:
Cells in sea-ice brines were filtered onto 0.2 micron filters and material extracted using the MoBio Water DNA extraction kit. The DNA was analysed by Research and Testing Laboratories Inc. (Lubbock, Texas, USA) via 454 pyrosequencing. The bacteria were analysed using primer set 10F-519R, which targets 16S rRNA genes. 16S rRNA genes associated with chloroplasts and mitochondria are included in this dataset but represent a minority of sequences in most samples. Eukaryotes were analysed using primer set 550F-1055R, which targets 18S rRNA genes. The 454 pyrosequencing analysis with the Titanium GS FLX+ kit generates on average 3000 reads incorporating custom pyrotags for later stages of the data analysis. The specific steps used for subsequent data analysis are described in the attached PDF file (Data_Analysis_Methodology.PDF). This output was further refined by first determining consensus sequences at the 98% similarity level using Weizhong Li’s online software site CD-HIT (http://weizhongli-lab.org/cd-hit/). Reference: Niu B, Fu L, Sun S, Li W. 2010. Artificial and natural duplicates in pyrosequencing reads of metagenomic data. BMC Bioinformatics 11:187. doi:10.1186/1471-2105-11-187. The consensus sequences were then checked for errors, manually curated, and aligned against the closest matching sequences obtained from the NCBI database (www.ncbi.nlm.nih.gov) to obtain a final list of consensus operational taxonomic entities and the number of reads obtained for each sample analysed.
File: SIPEXII_DNA_Sample_information.xlsx provides sampling and analysis information for the detailed results in the other two files.
File: SCIPEXII_sea_ice_bacteria_OTUs.xlsx contains information on the number of 16S rRNA reads in bacteria: Phylum/Class and OTUs.
File: SCIPEXII_sea_ice_brines_eukaryote_community_OTU_data.xlsx contains information on the number of 18S rRNA reads in eukaryotic microbes: Phylum/Order/Closest taxon and OTUs.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
Abstract
Recent global surveys of marine biodiversity have revealed that a group of organisms known as “marine diplonemids” constitutes one of the most abundant and diverse planktonic lineages [1]. Though discovered over a decade ago [2, 3], their potential importance was unrecognized, and our knowledge remains restricted to a single gene amplified from environmental DNA, the 18S rRNA gene (small subunit [SSU]). Here, we use single-cell genomics (SCG) and microscopy to characterize ten marine diplonemids, isolated from a range of depths in the eastern North Pacific Ocean. Phylogenetic analysis confirms that the isolates reflect the entire range of marine diplonemid diversity, and comparisons to environmental SSU surveys show that sequences from the isolates range from rare to superabundant, including the single most common marine diplonemid known. SCG generated a total of ∼915 Mbp of assembled sequence across all ten cells and ∼4,000 protein-coding genes with homologs in the Kyoto Encyclopedia of Genes and Genomes (KEGG) orthology database, distributed across categories expected for heterotrophic protists. Models of highly conserved genes indicate a high density of non-canonical introns, lacking conventional GT-AG splice sites. Mapping metagenomic datasets [4] to SCG assemblies reveals virtually no overlap, suggesting that nuclear genomic diversity is too great for representative SCG data to provide meaningful phylogenetic context to metagenomic datasets. This work provides an entry point to the future identification, isolation, and cultivation of these elusive yet ecologically important cells. The high density of nonconventional introns, however, also portends difficulty in generating accurate gene models and highlights the need for the establishment of stable cultures and transcriptomic analyses.
Usage notes
Single-cell genomic scaffolds from 10 'wild-caught' marine diplonemids
FASTA format single-cell genomic scaffolds of 10 marine diplonemid (protist) cells are presented. Scaffolds were generated with the SPAdes assembler; contaminating sequences were removed, as described in the publication. Each FASTA file is derived from a single cell. Cells are referred to by the numbers used in the publication (i.e., cells 3, 13, 21, 27, 37, 47, 1sb, 4sb, 9sb, 21sb) as no species names exist.
marine_diplonemid_SAGs.zip
Figure S1 (related to Figure 1). Taxon-annotated GC plots demonstrate the effectiveness of our decontamination procedure.
Plots were generated using blobtools (https://github.com/DRL/blobtools) for each SCG assembly before and after decontamination using the megablast/blastx protocol described in Experimental Procedures. Plots are based on megablast queries of the NCBI nt database according to taxonomic Order.
FigS1.pdf
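Because each FASTA file in the archive corresponds to one cell, a quick per-file summary (scaffold count and total assembled length) is a natural first check. The following is a minimal sketch only, not the authors' workflow; the function names and the two-scaffold example are hypothetical, and the real scaffold headers may look different.

```python
from typing import Dict, Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def summarize_assembly(lines: Iterable[str]) -> Dict[str, int]:
    """Return the scaffold count and total assembled length in bp."""
    lengths = [len(seq) for _, seq in read_fasta(lines)]
    return {"scaffolds": len(lengths), "total_bp": sum(lengths)}


# Hypothetical two-scaffold input standing in for one single-cell file.
example = [">scaffold_1", "ACGTAC", "GT", ">scaffold_2", "GGCC"]
summary = summarize_assembly(example)
print(summary)
```

For a real file, an open text handle can be passed instead of the list, for example `summarize_assembly(open(path, encoding="utf-8"))`, where `path` points at one of the extracted FASTA files.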
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
Main data deposit for "Dominant contribution of Asgard archaea to eukaryogenesis".
Victor Tobiasson, Jacob Luo, Yuri I Wolf, Eugene V Koonin
Computational Biology Branch, Division of Intramural Research, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
The origin of eukaryotes is one of the key problems in evolutionary biology. The demonstration that the Last Eukaryotic Common Ancestor (LECA) already contained the mitochondrion, an endosymbiotic organelle derived from an alphaproteobacterium, and the discovery of Asgard archaea, the closest archaeal relatives of eukaryotes, inform and constrain evolutionary scenarios of eukaryogenesis. We undertook a comprehensive analysis of the origins of the core eukaryotic genes tracing to the LECA within a rigorous statistical framework centered around evolutionary hypothesis testing using constrained phylogenetic trees. The results reveal dominant contributions of Asgard archaea to the origin of most of the conserved eukaryotic functional systems and pathways. A limited contribution from Alphaproteobacteria was identified, primarily relating to the energy transformation systems and Fe-S cluster biogenesis, whereas ancestry from other bacterial phyla was scattered across the eukaryotic functional landscape, without consistent trends. These findings suggest a model of eukaryogenesis in which key features of eukaryotic cell organization evolved in the Asgard ancestor, followed by the capture of the alphaproteobacterial endosymbiont, and augmented by numerous but sporadic horizontal acquisitions of genes from other bacteria both before and after endosymbiosis.
Version 0.3, updated 180325
Main data repository for:
Dominant contribution of Asgard archaea to eukaryogenesis (2024)
Tobiasson, V., Koonin, E.
Contains all final parsed data from the main Eukaryogenesis project
investigating the evolutionary ancestries of eukaryotic protein families.
Currently (non-static) available at:
https://www.biorxiv.org/content/10.1101/2024.10.14.618318v2
https://assets-eu.researchsquare.com/files/rs-5352492/v1/2f9c68ae-cf3e-420a-8d29-867b6fb1a878.pdf
All code used to generate the data present within this repository is available at:
https://github.com/VictorTobiasson/eukgen
To identify associations between prokaryotic and eukaryotic protein families, separate
hidden Markov model (HMM) databases for prokaryotes and eukaryotes were constructed
using a custom, cascaded, sequence-to-profile clustering pipeline, implemented using
mmseqs2, followed by a multistep data-reduction and multiple sequence alignment (MSA)
procedure to generate HMM profiles using hhsuite.
A prokaryotic database of 37 million protein sequences was curated from prokaryotic
genomes obtained from the NCBI GenBank in November 2023 and supplemented with proteins
extracted from 146 Asgard genome assemblies. To avoid inclusion of genes present only
within a narrow subset of species, possibly resulting from horizontal transfer from
eukaryotes post LECA, we reconstructed the “soft-core” pangenome for each of the 26
curated prokaryotic taxonomic classes. These pangenomes include only those genes that
are present in at least 67% of the families within each class of Bacteria and Archaea.
The initial eukaryotic database consisted of 30 million protein sequences from 993
species taken from EukprotV3 and cleaned using mmseqs2 to remove likely prokaryotic
contaminants.
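The soft-core pangenome criterion described above (a gene is kept if it occurs in at least 67% of the families of a class) can be illustrated with a short sketch. This is not the repository's code; the function name, the input layout (a mapping from family to the set of gene clusters observed in it) and the cluster names are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, Set


def soft_core(presence: Dict[str, Set[str]], min_fraction: float = 0.67) -> Set[str]:
    """Return gene clusters found in at least min_fraction of the families."""
    n_families = len(presence)
    if n_families == 0:
        return set()
    # Count in how many families each gene cluster occurs.
    counts = Counter(gene for genes in presence.values() for gene in genes)
    return {gene for gene, n in counts.items() if n / n_families >= min_fraction}


# Hypothetical class with four families; cluster names are illustrative only.
families = {
    "family_A": {"clu1", "clu2", "clu3"},
    "family_B": {"clu1", "clu2", "clu3"},
    "family_C": {"clu1", "clu2"},
    "family_D": {"clu1", "clu4"},
}
core = soft_core(families)
print(sorted(core))
```

Here `clu1` (4 of 4 families) and `clu2` (3 of 4) pass the 67% threshold, while `clu3` (2 of 4) and `clu4` (1 of 4) do not. The `min_fraction` argument can be changed to explore stricter or looser definitions.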
Both databases were clustered, MSAs were constructed for all non-singleton clusters,
and HMM profiles were created. The resulting eukaryotic HMM dataset was queried against
the prokaryotic dataset using hhblits to identify sets of homologous protein sequences.
Each eukaryotic cluster and all its significant prokaryotic hits constituted an individual
sequence set, hereinafter referred to as a Eukaryotic/Prokaryotic Orthologous Cluster
(EPOC). The EPOCs constitute groups of homologous proteins from eukaryotes and prokaryotes
(each EPOC contains a unique set of eukaryotic proteins, but some clusters of prokaryotic
proteins can be present in multiple EPOCs) that were used for phylogenetic tree
construction, annotation, and evolutionary hypothesis testing.
To infer the most likely prokaryotic ancestry of the eukaryotic proteins in each EPOC,
rather than relying on the tree topology directly, we employed a probabilistic approach
for evolutionary hypothesis testing using constraint trees. We exhaustively sampled all
arrangements of likely sister clades and obtained Expected Likelihood Weights (ELW) for
the set of possible sister clade models. As the ELW metric is analogous to model selection
confidence, here we take it to be proportional to the probability of a sampled prokaryotic
clade to be the true sister group of the given eukaryotic clade among a set of competing
sister clades. For each EPOC, our analysis dynamically accounts for long branch outliers
and is robust to phylogenetically non-homogeneous clades. This analysis is further capable
of resolving eukaryotic paraphyly, treating each eukaryotic clade within an EPOC as a
single datapoint for downstream analysis. Our resulting data contains EPOCs annotated
using profiles generated from KEGG Orthology Groups (KOGs), each with an MSA generated
using muscle5, a maximum likelihood tree inferred using IQtree2 and associated ELW values
for all candidate prokaryotic sister phyla. The analysis of prokaryotic ancestry was
performed only for those eukaryotic clades that included more than 5 distinct taxonomic
labels, with at least one coming from Amorphea and one from Diaphoretickes, the two
expansive eukaryotic clades considered to represent either the first or the second
bifurcation in the evolution of eukaryotes. Thus, these clades likely represent genes
mapping back to the LECA.
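For readers unfamiliar with Expected Likelihood Weights, the standard definition averages normalized likelihood weights over bootstrap replicates: in each replicate, every competing model receives the weight exp(lnL_i) / sum_j exp(lnL_j), and the ELW of a model is the mean of its weights across replicates. The sketch below illustrates only that arithmetic. It is not the pipeline used to produce this dataset, the log-likelihood values are invented, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def likelihood_weights(log_liks: Sequence[float]) -> List[float]:
    """Normalize log-likelihoods into weights that sum to 1."""
    top = max(log_liks)  # subtract the maximum for numerical stability
    scaled = [math.exp(v - top) for v in log_liks]
    total = sum(scaled)
    return [s / total for s in scaled]


def expected_likelihood_weights(replicates: Sequence[Sequence[float]]) -> List[float]:
    """Average the per-replicate weights over all bootstrap replicates."""
    n_models = len(replicates[0])
    sums = [0.0] * n_models
    for log_liks in replicates:
        for i, weight in enumerate(likelihood_weights(log_liks)):
            sums[i] += weight
    return [s / len(replicates) for s in sums]


# Invented values: rows are bootstrap replicates, columns are three
# competing sister-clade constraint trees.
replicates = [
    [-1000.0, -1002.0, -1010.0],
    [-998.5, -999.0, -1007.0],
    [-1001.0, -1000.5, -1012.0],
]
elw = expected_likelihood_weights(replicates)
print([round(w, 3) for w in elw])
```

The weights sum to 1 across the competing models, which is what allows them to be read as relative support for each candidate sister clade.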
For further details please see main publication or contact
victor.tobiasson@nih.gov
eugene.koonin@nih.gov
Unless otherwise stated, all files are tab-separated and UTF-8 encoded,
with the first row containing header information.
All data entries encoding lists are “|” (pipe) separated.
Fields without data values are filled with string entries of “none”.
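The conventions above (tab-separated columns, a header row, pipe-separated lists and the literal string “none” for missing values) can be handled with the standard library. The following is a minimal sketch; the column names `epoc_id`, `members` and `annotation` are hypothetical placeholders and not the actual headers of any file in this deposit.

```python
import csv
import io
from typing import Dict, List, Optional, Union

Value = Optional[Union[str, List[str]]]


def parse_field(raw: Optional[str]) -> Value:
    """Map 'none' to None and split pipe-separated lists."""
    if raw is None or raw == "none":
        return None
    if "|" in raw:
        return raw.split("|")
    return raw


def read_table(handle) -> List[Dict[str, Value]]:
    """Read a tab-separated table with a header row into a list of dicts."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [{key: parse_field(val) for key, val in row.items()} for row in reader]


# Hypothetical content; real column names differ from file to file.
example_text = (
    "epoc_id\tmembers\tannotation\n"
    "E1\tP1|P2|P3\tnone\n"
    "E2\tP4\tK00001\n"
)
rows = read_table(io.StringIO(example_text))
print(rows)
```

A real file would be opened with `open(path, encoding="utf-8", newline="")`. Note that a list with a single entry contains no pipe and is therefore returned as a plain string; callers that need lists throughout should wrap such values themselves.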
--- Databases ---
euk72_ep.tar.gz
prok2311_as.tar.gz
Prok2311As_final_clusters.tsv
Euk72Ep_final_clusters.tsv
prok2311_as.hmmDB.tar.gz
euk72_ep.hmmDB.tar.gz
--- Annotation and Curation ---
NCBI_taxonomy_species_addendum.tsv
NCBI_taxonomy_class_addendum.tsv
Euk72Ep_Prok2311As_final_classes.tsv
Euk72Ep_Prok2311As_final_classes.GTDB.tsv
KEGG_category_mapping.tsv
KEGG_metadata.tsv
--- EPOC data ---
EPOC_data.tar.gz
EPOC_annotation_KEGG.tsv
EPOC_data.tsv
EPOC_data.pangenomes_s10.tsv
EPOC_data.pangenomes_s25.tsv
EPOC_data.pangenomes_s67.tsv
EPOC_data.GTDB.tsv
euk72_ep.tar.gz
Gzipped .tar archive containing a single directory with 10 files
constituting the initial eukaryotic mmseqs2 database with taxonomy annotation.
Constructed from a pre-selected list of 72 eukaryotic proteomes downloaded from
NCBI as well as a “clean” version of Eukprot, lacking highly prokaryotic-like
contaminant sequences.
prok2311_as.tar.gz
Gzipped .tar archive containing a single directory with 10 files constituting the
initial prokaryotic mmseqs2 database with taxonomy annotation. Constructed from
47545 complete genomes retrieved from NCBI in November 2023.
prok2311_as.hmmDB.tar.gz
Gzipped .tar archive containing 6 files. Comprises an HHSuite database formatted
from prok2311_as non-singleton clusters; contains 26286 profiles.
euk72_ep.hmmDB.tar.gz
Gzipped .tar archive containing 6 files. Comprises an HHSuite database formatted
from euk72_ep non-singleton clusters; contains 1631704 profiles.
NCBI_taxonomy_species_addendum.tsv
Taxonomy mapping file with manually curated ‘class’ level annotation for poorly
annotated species.
taxid: NCBI taxid
proposed_class_id: Manually assigned NCBI taxid
proposed_class_label: NCBI class name
org_name: NCBI organism name
NCBI_taxonomy_class_addendum.tsv
Class revision file mapping poorly populated class level entries to higher order
manually curated labels. Also includes information for small classes with shallow
taxonomy which are deleted from the EPOC analysis at the level of tree construction.
taxid: NCBI taxid
ncbi_class: NCBI taxid of rank corresponding to ‘class’ following manual
amendment as per NCBI_taxonomy_species_addendum.tsv
revised_class_id: Manually assigned NCBI taxid of rank corresponding to ‘class’
revised_class_label: Proposed cleartext name of manually revised revised_class_id
Euk72Ep_Prok2311As_final_classes.tsv
Final taxonomy at NCBI rank ‘class’ following revisions for all sequences in Euk72Ep or
Prok2311As. These taxonomic labels are used for EPOC tree annotation.
acc: mmseqs database header in either prok2311_as or euk72_ep databases
taxid: NCBI taxid for organism
superkingdom: Top level NCBI taxonomy classification Bacteria, Archaea or Eukarya,
used to define Eukaryotic outgroups in EPOC analysis
class: Cleartext name of manually revised NCBI rank ‘class’ identifier for annotation
Euk72Ep_Prok2311As_final_classes.GTDB.tsv
Final taxonomy at GTDB rank ‘phylum’ transferred using marker genes from GTDB release 220
acc: mmseqs database header in either prok2311_as or euk72_ep databases
taxid: NCBI taxid for organism
superkingdom: Top level NCBI taxonomy classification Bacteria, Archaea or Eukarya,
used to define Eukaryotic outgroups in EPOC analysis
class: Cleartext name of assigned GTDB phylum
Prok2311As_final_clusters.tsv
Cluster mapping file for accessions within the initial Prok2311As database to the
final clusters used for HMM creation
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
R1: Establishment and purification of neuropeptide sequences
The LW, APGW, RPCH, AKH, CRZ, and GnRH neuropeptide families were searched in the GenBank database using 10 keywords: the neuropeptide name, the precursor abbreviation, the full name of the precursor, the full name of the precursor with the word “prepropeptide,” and the combinations of these terms. The candidate sequences were downloaded in FASTA format using the appropriate commands in the GenBank database. The AKH neuropeptide family was classified according to the groups published in the literature, as well as the amino acid number and sequence. Furthermore, the ACP hybrid family was identified in the GenBank database using BLAST alignments.
C00: Neuropeptide Precursor. Eight folders were named with the initials of each neuropeptide family. The AKH family folder was the only one containing four subfolders. All of the folders contained the same type of files: three text files named after the neuropeptide initials and the obtained result. The files identified with the words “with codes” contained the sequences with the codes generated for this study, whereas the documents with the word “Full” contained the GenBank database search results obtained with the 10 aforementioned keywords. These files were located in a folder named “Fasta Keywords.” Each file contained the results from each respective keyword. The files with the words “selected EA” contained the sequences that were selected for evolutionary analyses.
C01: BLAST ACP. The text file named “00 BLAST ACP” contains the BLAST alignment results obtained from the NCBI database generated with the Adipokinetic Hormone/Corazonin-related peptide from the transcriptome of Callinectes toxotes. The file named “01 ACP Selected” contains the precursors selected for this study. All sequences were in FASTA format and contained the codes summarized in Supplementary Material 3 “Database Sequences.”
The file named “02 ACP selected EA” contains the ACP precursors of other species, which were used for the evolutionary analyses of C. toxotes ACP. The PDF file titled “03 ACP ProP 1.0 Serv” contains the results of the proteolytic cleavage sites of the precursors indicated in the file named “02 ACP selected EA,” which were generated using the aforementioned software.
C02: BLAST VP. The folder contains the results of the BLAST alignment against the NCBI database, which were generated with the virtual peptide sequences reported by Martinez-Perez et al. (2007). This folder contains seven text files. The name of each file corresponds to the precursor and species in which it was identified. Moreover, the PDF document named “Virtual peptides ProP 1.0 Serv” contains the results of the proteolytic cleavage sites generated with the aforementioned software.
C03: Debugging sequences with software. This folder contains three subfolders with the results obtained from each software tool used in this study for the detection of each of the neuropeptide sequences using the appropriate keywords.
The folder named “BioDataToolKit” contains six subfolders with the abbreviated name of each neuropeptide. Additionally, there is a file containing the sequences downloaded from the GenBank database, as well as a Microsoft Excel file containing the details generated by the software. The name of each file corresponds to the keywords used for each search. The software used in this study can be found in the following repository: https://github.com/rduarte24/BiodataToolkit.
The folder named “Pro1.0Server” was organized in the same way as the results derived for the “BioDataToolKit” for each neuropeptide family. However, each of the neuropeptide folders contained a file with the pertinent sequences whereas another file contained the endoproteolytic cleavage sites of the neuropeptide precursors obtained with the software.
The folder named “Proteios” contains seven files. The file names indicate the precursor analyzed with the software and the identified sequences in FASTA format. The Proteios software is available at the following website: https://github.com/Martin-Munive/Proteios.
C04: Neuropeptide precursors for evolutionary analysis. Files with the sequences of the neuropeptide precursors used for the generation of the phylogenetic trees in Supplementary Materials 4 and 7. The name of each file corresponds to the name of each of the analyzed neuropeptides.
R2: Transcriptome BLAST
Microsoft Excel file containing the BLAST alignments conducted using the sequences of the AKH/CRZ-related peptide (ACP) from C. toxotes and Corazonin (CRZ) from C. arcuatus. The following information is summarized in the spreadsheets named C. toxotes and C. arcuatus: Column A, neuropeptide name; Column B, species name; Columns C–G, BLAST alignment results; Column H, GenBank protein accession number; Column I, precursor sequence.
R3: Construction of neuropeptide database
Microsoft Excel file with information pertaining to the database and a detailed description of each of the neuropeptide precursors analyzed in this study. The Excel file contains seven spreadsheet tabs. Each of the tabs contains the following columns:
Neuropeptides. Column A, sequence numbering in descending order; Column B, neuropeptide name; Column C, identification code used in this study; Column D, accession number; Columns E–G, species taxonomy; Columns H–L, GenBank sequence description; Columns M–N, literature reference and link.
Taxonomy. Taxonomic description of each of the examined species derived from the NCBI database.
Sequences evolutionary anal. This tab contains the code developed for this work in Column C; the GenBank accession codes of each neuropeptide are summarized in Column D and species taxonomy details are summarized in Columns E and F.
Table of differences. Column B shows the codes of identical sequences and Column C shows the code of the sequence selected for this study.
Codes deleted. This tab contains the accession codes of the species and the species name but contains no details on the properties of the neuropeptide precursors.
Sequences Paper. Neuropeptide sequences reported in previous studies that were later reported in the GenBank database. The sequences marked with asterisks have not been previously reported in public databases. The codes used in this study to designate the sequences are also included.
Keywords. Keywords used to conduct the GenBank database searches to obtain the members of each neuropeptide family.
R4: In silico validation, alignments, and phylogenetic relationships
Generated phylogenetic trees and results obtained from individual runs for each of the neuropeptide families with the DNA-LM and Kalign parameters using the IQ-TREE software.
The folder named “RUN” contains the “DNALM and kalign 2.0 default parameters” subfolder. Both folders contain 11 subfolders with the names of each of the neuropeptide families, as well as the results obtained with the IQ-TREE software. The folder named “Trees” contains the folder “DNALM and kalign 2.0 default parameters” containing the phylogenetic trees for each of the neuropeptide families, which were created with the iTOL software.
R5: BLAST alignment of the virtual peptide precursors
Results of the BLAST alignment of the virtual peptides described by Martinez-Perez et al. (2007) with respect to the sequences in the GenBank database. The files follow the same nomenclature as in the folder named “Carpeta 02 BLAST VP” in Repository 1.
R6: Alignment of neuropeptide precursors
“DNALM and Kalign 2.0 default parameter” folders. Each of these folders contains the alignments of the examined neuropeptide precursors from each family and each folder is named after the corresponding neuropeptide. The remaining files contain the alignments in ascending order in the evolutionary scale and are appropriately named after the corresponding neuropeptide. The file named “All Sequence FASTA” contains the sequences used in our study in FASTA format.
R7: Phylogenetic clustering of the precursors
“DNALM and Kalign 2.0 default parameter” folders. Both folders contain the phylogenetic tree clustering results from Supplementary Material 6, which were obtained using the DNA-LM and Kalign parameters and the IQ-TREE software. All analyses were conducted using the GUANE-1 supercomputer (Universidad Industrial de Santander). The phylogenetic clustering results of all of the precursors are contained in the folders with the respective precursor name. The folder also contains Figure 6, which was included in our main manuscript.
Additionally, a folder entitled "Orthofinder and Robinson-Foulds" is included, which contains the analyses carried out with the Robinson-Foulds metric and the Orthofinder software.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
Supplementary Material for:
Emerling C.A., Springer M.S., Gatesy J., Jones Z., Hamilton D., Xia-Zhu D., Collin M.A., and Delsuc F. (2021). Genomic evidence for the parallel regression of melatonin synthesis and signaling pathways in placental mammals. Open Research Europe.
Supplementary File Legends:
- Supplementary_Figure_S1.pdf: RAxML AANAT gene tree.
- Supplementary_Figure_S2.pdf: RAxML ASMT gene tree.
- Supplementary_Figure_S3.pdf: RAxML MTNR1A+MTNR1B tree.
- Supplementary_Figure_S4.pdf: PAML AANAT results, model 1 (see Supplementary Table S7).
- Supplementary_Figure_S5.pdf: PAML ASMT results, model 2 (see Supplementary Table S8).
- Supplementary_Figure_S6.pdf: PAML MTNR1A results, model 1 (see Supplementary Table S9).
- Supplementary_Figure_S7.pdf: PAML MTNR1B results, model 1 (see Supplementary Table S10).
- Supplementary_Table_S1.xlsx: List of species examined in this study and the sources of the genes. Source key: WGS: sequences derived from NCBI's Whole Genome Shotgun database; Whole Genome Sequencing of Short Reads: whole genomes were sequenced using short-read technologies. The methodologies varied for the species, and will be published with other projects, so please contact the author(s) for information on the specific methodology and samples used; SRA: sequences derived from NCBI's Sequence Read Archive; GenBank: sequences derived from NCBI's nucleotide collection; Bowhead Whale Genome Resource: sequences derived from http://www.bowhead-whale.org; Ensembl: sequences derived from Ensembl genome browser (www.ensembl.org); Discovar de novo: sequences derived from genomes assembled via Discovar de novo (https://software.broadinstitute.org/software/discovar/blog/).
- Supplementary_Table_S2.xlsx: Accession numbers and functionality of AANAT in species examined. Parentheses after the accession number indicate coordinates for the sequence on the contig / scaffold. Exon colors code for the following: green = putatively functional; yellow = missing; pink = one or more inactivating mutations found. Abbreviations for mutations are as follows: del = deletion; ins = insertion; start = start codon mutation; stop = premature stop codon; ? = ambiguity whether the mutation is shared among all members of the clade. Abbreviations in brackets following an inactivating mutation indicate a shared inactivating mutation. Key for each abbreviation follows: Bacu = Balaenoptera acutorostrata; BALA = Balaenidae; BALAEN = Balaenopteridae; Bbon = Balaenoptera bonaerensis; CAB = Cabassous; Ccap = Cebus capucinus; CETA = Cetacea; CHLAM = Chlamyphoridae; CHOL = Choloepus; Cjac = Callithrix jacchus; CING = Cingulata; DASY = Dasypodidae; DELP = Delphinidae; DERM = Dermoptera; Erob = Eschrichtius robustus; INIA = Inia; FOLI = Folivora; GALE = Galeopterus; LIPO = Lipotes; Lobl = Lagenorhynchus obliquidens; MANI = Manidae; MONO = Monodontidae; MYRM = Myrmecophagidae; MYST = Mysticeti; NPP = Not present in Platanista or Physeteroidea, but present in other Odontocetes; NPZ = Not present in Ziphiidae, but present in other Odontocetes; Oorc = Orcinus orca; PEUT = Tolypeutinae; PHOC = Phocoenidae; PHOL = Pholidota; PHOR = Chlamyphorinae; PILO = Pilosa; PHYS = Physeteroidea; PONT = Pontoporia; Schi = Sousa chinensis; SIRE = Sirenia; Tadu = Tursiops aduncus; TOLY = Tolypeutes; VERM = Vermilingua; XEN = Xenarthra.
- Supplementary_Table_S3.xlsx: Accession numbers and functionality of ASMT in species examined. See Table S2 caption for details.
- Supplementary_Table_S4.xlsx: Accession numbers and functionality of MTNR1A in species examined. See Table S2 caption for details.
- Supplementary_Table_S5.xlsx: Accession numbers and functionality of MTNR1B in species examined. See Table S2 caption for details.
- Supplementary_Table_S6.xlsx: Codon frequency model selection. These are the results from one ratio dN/dS analyses using different codon frequency models.
- Supplementary_Table_S7.xlsx: Results of AANAT PAML dN/dS analyses. Model: BG = branch(es) grouped with background; fixed 1 = branch(es) fixed at 1. p-value: specific p-value only shown if lower than 0.05. Model Comparison: if model comparison yields statistically significant differences (p < 0.05), model comparison bolded and given green background. For most models, w only shown for branch(es) of interest.
- Supplementary_Table_S8.xlsx: Results of ASMT PAML dN/dS analyses. Refer to Table S7 caption for additional details.
- Supplementary_Table_S9.xlsx: Results of MTNR1A PAML dN/dS analyses. Refer to Table S7 caption for additional details.
- Supplementary_Table_S10.xlsx: Results of MTNR1B PAML dN/dS analyses. Refer to Table S7 caption for additional details.
- Supplementary_Table_S11.xlsx: Results of BLASTing and mapping short reads from Alligator mississippiensis RNA sequencing experiments.
- Supplementary_Dataset_S1_all_ali_fasta.txt: Genomic alignments in fasta format used to determine the pseudogene/functional status of the different genes in different taxonomic groups.
- Supplementary_Dataset_S2_AANAT_RAxML_ali.phy: Alignment of AANAT in phylip format used in maximum likelihood phylogenetic reconstruction with RAxML.
- Supplementary_Dataset_S3_ASMT_RAxML_ali.phy: Alignment of ASMT in phylip format used in maximum likelihood phylogenetic reconstruction with RAxML.
- Supplementary_Dataset_S4_MTNR1A_MTNR1B_RAxML_ali.phy: Alignment of MTNR1A and MTNR1B in phylip format used in maximum likelihood phylogenetic reconstruction with RAxML.
- Supplementary_Dataset_S5_AANAT_PAML_alig.fasta: Codon alignment of AANAT in fasta format used in selection pressure analyses with PAML.
- Supplementary_Dataset_S6_ASMT_PAML_ali.fasta: Codon alignment of ASMT in fasta format used in selection pressure analyses with PAML.
- Supplementary_Dataset_S7_MTNR1A_PAML_ali.fasta: Codon alignment of MTNR1A in fasta format used in selection pressure analyses with PAML.
- Supplementary_Dataset_S8_MTNR1B_PAML_ali.fasta: Codon alignment of MTNR1B in fasta format used in selection pressure analyses with PAML.
- Supplementary_Dataset_S9_PAML_topology.tre: Tree topology in newick format used in selection pressure analyses with PAML.
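The fasta-format codon alignments listed above can be sanity-checked before being passed to a selection analysis. The following is a minimal sketch, not part of the original dataset: the function names `read_fasta` and `check_codon_alignment` and the two example sequences are hypothetical, and it assumes only that every sequence in an alignment should have the same length and that a codon alignment should have a length divisible by three.

```python
from io import StringIO


def read_fasta(handle):
    """Return a dict mapping each FASTA header (without '>') to its sequence."""
    records = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def check_codon_alignment(records):
    """Return (aligned, in_frame).

    aligned  -- True if all sequences have the same length.
    in_frame -- True if aligned and that length is a multiple of 3.
    """
    lengths = {len(seq) for seq in records.values()}
    aligned = len(lengths) == 1
    in_frame = aligned and next(iter(lengths)) % 3 == 0
    return aligned, in_frame


# Hypothetical two-taxon example; a real file would be opened with open(path).
example = StringIO(">taxon_A\nATGGCC---TGA\n>taxon_B\nATGGCCAAATGA\n")
records = read_fasta(example)
print(check_codon_alignment(records))
```

To apply the same check to one of the files above, the `StringIO` object would be replaced by an opened file handle, for example for Supplementary_Dataset_S6_ASMT_PAML_ali.fasta.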
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data File Descriptions and Methods
References
Gu, C., 2020. FindFur: A Tool for Predicting Furin Cleavage Sites of Viral Envelope Substrates. Master’s Thesis, San Jose State University, CA, USA. doi: 10.31979/etd.4ahv-9jya
Nakai, K., Horton, P., 1999. PSORT: a program for detecting sorting signals in proteins and predicting their subcellular localization. Trends Biochem Sci 24, 34–36. doi: 10.1016/s0968-0004(98)01336-x
Steentoft, C., Vakhrushev, S.Y., Joshi, H.J., Kong, Y., Vester-Christensen, M.B., Schjoldager, K.T.-B.G., Lavrsen, K., Dabelsteen, S., Pedersen, N.B., Marcos-Silva, L., Gupta, R., Bennett, E.P., Mandel, U., Brunak, S., Wandall, H.H., Levery, S.B., Clausen, H., 2013. Precision mapping of the human O-GalNAc glycoproteome through SimpleCell technology. EMBO J 32, 1478–1488. doi: 10.1038/emboj.2013.79
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The recent incorporation of bacterial whole-genome sequencing (WGS) into public health laboratories has enhanced foodborne outbreak detection and source attribution. As a result, large volumes of publicly available datasets can be used to study the biology of foodborne pathogen populations at an unprecedented scale. To demonstrate the application of a heuristic and agnostic hierarchical population structure guided pan-genome enrichment analysis (PANGEA), we used populations of S. enterica lineage I to achieve two main objectives: (i) show how hierarchical population inquiry at different scales of resolution can enhance ecological and epidemiological inquiries; and (ii) identify population-specific inferable traits that could provide selective advantages in food production environments. Publicly available WGS data were obtained from the NCBI database for three serovars of Salmonella enterica subsp. enterica lineage I (S. Typhimurium, S. Newport, and S. Infantis). Using the hierarchical genotypic classifications (Serovar, BAPS1, ST, cgMLST), datasets from each of the three serovars showed varying degrees of clonal structuring. When the accessory genome (PANGEA) was mapped onto these hierarchical structures, accessory loci could be linked with specific genotypes. A large heavy-metal resistance mobile element was found in the Monophasic ST34 lineage of S. Typhimurium, and laboratory testing showed that Monophasic isolates have on average a higher degree of copper resistance than the Biphasic ones. In S. Newport, an extra sugE gene copy was found among most isolates of the ST45 lineage, and laboratory testing of multiple isolates confirmed that isolates of S. Newport ST45 were on average less sensitive to the disinfectant cetylpyridinium chloride than non-ST45 isolates. Lastly, data-mining of the accessory genomic content of S. Infantis revealed two cryptic Ecotypes with distinct accessory genomic content and distinct ecological patterns. 
Poultry appears to be the major reservoir for Ecotype 1, and temporal analysis further suggested a recent ecological succession, with Ecotype 2 apparently being displaced by Ecotype 1. Altogether, the use of a heuristic hierarchical-based population structure analysis that includes bacterial pan-genomes (core and accessory genomes) can (1) improve genomic resolution for mapping populations and accessing epidemiological patterns; and (2) define lineage-specific informative loci that may be associated with survival in the food chain.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
16S rRNA gene sequencing has been used for routine species identification and phylogenetic studies of bacteria. However, the high sequence similarity between some species and the heterogeneity among copies at the intragenomic level could limit its discriminatory ability. In this study, we aimed to compare 16S rRNA gene sequences and genome-based analysis (core SNPs and ANI) for identification of non-pathogenic Yersinia. We used complete and draft genomes of 373 Yersinia strains from the NCBI Genome database. The taxonomic affiliations of 34 genomes based on core SNPs and the ANI results did not match those specified in the GenBank database (NCBI). The intragenic homology of the 16S rRNA gene copies exceeded 99.5% in complete genomes, but more than 50% of genomes have four or more variants of the 16S rRNA gene. Among 327 draft genomes of non-pathogenic Yersinia, 11% did not have a full-length 16S rRNA gene. Most draft genomes have a single copy of the gene, so their intragenomic heterogeneity cannot be determined. The average homology of the 16S rRNA gene was 98.76%, and the maximum variability was 2.85%. A low degree of genetic heterogeneity of the gene (0.36%) was found in the group Y. pekkanenii/Y. proxima/Y. aldovae/Y. intermedia/Y. kristensenii/Y. rochesterensis. Identical gene sequences were found in the genomes of the Y. intermedia and Y. rochesterensis strains identified using ANI and core SNP analyses. The phylogenetic tree based on 16S rRNA genes differed from the tree based on core SNPs of the genomes and did not represent the phylogenetic relationships between the Yersinia species. These findings will help to fill the data gaps in the genome characteristics of poorly studied non-pathogenic Yersinia.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Gastric cancer (GC) is a common malignant tumor of the digestive system. Recent studies revealed that high gamma-glutamyl-transferase 5 (GGT5) expression was associated with a poor prognosis of gastric cancer patients. In the present study, we aimed to confirm the expression and prognostic value of GGT5 and its correlation with immune cell infiltration in gastric cancer. First, we compared the differential expression of GGT5 between gastric cancer tissues and normal gastric mucosa in The Cancer Genome Atlas (TCGA) and NCBI GEO databases using the most widely available data. Then, the Kaplan-Meier method, Cox regression, and univariate logistic regression were applied to explore the relationships between GGT5 and clinical characteristics. We also investigated the correlation of GGT5 with immune cell infiltration, immune-related genes, and immune checkpoint genes. Finally, we estimated enrichment of gene ontology categories and relevant signaling pathways using GO annotations, KEGG, and GSEA pathway data. The results showed that GGT5 was upregulated in gastric cancer tissues compared to normal tissues. High GGT5 expression was significantly associated with T stage, histological type, and histologic grade (p < 0.05). Moreover, gastric cancer patients with high GGT5 expression showed worse 10-year overall survival (p = 0.008) and progression-free intervals (p = 0.006) than those with low GGT5 expression. Multivariate analysis suggested that high expression of GGT5 was an independent risk factor related to the worse overall survival of gastric cancer patients. A nomogram model for predicting the overall survival of GC was constructed and computationally validated. GGT5 expression was positively correlated with the infiltration of natural killer cells, macrophages, and dendritic cells but negatively correlated with Th17 infiltration. Additionally, we found that GGT5 was positively co-expressed with immune-related genes and immune checkpoint genes. 
Functional analysis revealed that differentially expressed genes relative to GGT5 were mainly involved in the biological processes of immune and inflammatory responses. In conclusion, GGT5 may serve as a promising prognostic biomarker and a potential immunological therapeutic target for GC, since it is associated with immune cell infiltration in the tumor microenvironment.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Example data for the Python package oggmap/orthomap.
OrthoFinder: Ensembl release-105 (-S diamond_ultra_sens)
Includes OrthoFinder results (-S diamond_ultra_sens) for all translated coding sequences (CDS) from Ensembl release-105 (keeping only longest isoforms) and Xtropicalisv9.0.Named.primaryTrs.pep.fa from www.xenbase.org:
Includes a table specifying the OrthoFinder species file names and their corresponding NCBI taxonomic IDs:
Includes NCBI taxonomic tree for Ensembl release-105 species analysed:
OrthoFinder: Ensembl release-110 (-S last)
Includes OrthoFinder results (-S last) for all translated coding sequences (CDS) from Ensembl release-110 (keeping only longest isoforms) and Xtropicalisv9.0.Named.primaryTrs.pep.fa from www.xenbase.org:
Includes a table specifying the OrthoFinder species file names and their corresponding NCBI taxonomic IDs:
Includes NCBI taxonomic tree for Ensembl release-110 species analysed:
OrthoFinder: Ensembl release-111 (-S last)
Includes OrthoFinder results (-S last) for all translated coding sequences (CDS) from Ensembl release-111 (keeping only longest isoforms) and Xtropicalisv9.0.Named.primaryTrs.pep.fa from www.xenbase.org:
Includes a table specifying the OrthoFinder species file names and their corresponding NCBI taxonomic IDs:
OrthoFinder: WormBase release-WS288 + WormBase ParaSite release-WBPS18 (-S last)
Includes OrthoFinder results (-S last) for all translated coding sequences (CDS) from WormBase release-WS288, WormBase ParaSite release-WBPS18 (keeping only longest isoforms) and dd_Smed_v6.pcf.contigs.fasta (transdecoder and miniprothint peptides) from https://planmine.mpibpc.mpg.de:
Includes a table specifying the OrthoFinder species file names and their corresponding NCBI taxonomic IDs:
Includes NCBI taxonomic tree for WormBase release-WS288 and WormBase ParaSite release-WBPS18 species analysed:
Pre-calculated orthomaps:
Includes pre-calculated gene age assignments for C. elegans (Sun et al. 2021), H. vulgaris (Cazet et al. 2022) and D. rerio (Ensembl-105; Ensembl-110):
Pre-calculated evolutionary indices:
Includes pre-calculated TajimaD, NormalizedPi, FayWu, Fst for C. elegans (Ma et al. 2021):
eggNOG database version 6.0 orthomaps:
Includes extracted orthomaps for all Eukaryota from eggNOG database version 6.0 (Hernández-Plaza et al. 2022):
myTAI example data:
Includes example data from the myTAI R package (Drost et al. 2018)
PLAZA database version 5.0 orthomaps:
Includes extracted orthomaps for either HOMFAM or ORTHOFAM groups of plants from PLAZA database version 5.0 (Van Bel et al. 2022):
Mouse synonyms:
Table of Mus musculus gene synonyms obtained from https://github.com/mustafapir/geneName/blob/master/data/mouse_synonyms1.rda and converted into a table.
We report the results of chromatin immunoprecipitation followed by high-throughput tag sequencing (ChIP-Seq) using the GA II platform from Illumina for the human transcription factor STAT1 in HeLa S3 cells. The STAT1 ChIP was performed using HeLa S3 cells stimulated with gamma-interferon. We have also generated a sequenced input DNA dataset for gamma-interferon stimulated HeLa S3 cells. Raw data for this study are available for download from the Short Read Archive database at: http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?study=SRP000703. For data usage terms and conditions, please refer to http://www.genome.gov/27528022 and http://www.genome.gov/Pages/Research/ENCODE/ENCODEDataReleasePolicyFinal2008.pdf Examination of the STAT1 transcription factor in Human HeLa S3.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Yokenella regensburgei, a member of the family Enterobacteriaceae, is usually isolated from environmental samples and generally resistant to early generations of cephalosporins. To characterize the resistance mechanism of Y. regensburgei strain W13 isolated from the sewage of an animal farm, whole genome sequencing, comparative genomics analysis and molecular cloning were performed. The results showed that a novel chromosomally encoded class C β-lactamase gene with the ability to confer resistance to β-lactam antibiotics, designated blaYOC–1, was identified in the genome of Y. regensburgei W13. Kinetic analysis revealed that the β-lactamase YOC-1 has a broad spectrum of substrates, including penicillins, cefazolin, cefoxitin and cefotaxime. The two functionally characterized β-lactamases with the highest amino acid identities to YOC-1 were CDA-1 (71.69%) and CMY-2 (70.65%). The genetic context of the blaYOC–1-ampR-encoding region was unique compared with the sequences in the NCBI nucleotide database. The plasmid pRYW13-125 of Y. regensburgei W13 harbored 11 resistance genes (blaOXA–10, blaLAP–2, dfrA14, tetA, tetR, cmlA5, floR, sul2, ant(3″)-IIa, arr-2 and qnrS1) within an ∼34 kb multidrug resistance region; these genes were all related to mobile genetic elements. The multidrug resistance region of pYRW13-125 shared the highest identities with those of two plasmids from clinical Klebsiella pneumoniae isolates, indicating the possibility of horizontal transfer of these resistance genes between bacteria of various origins.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Streptococcus dysgalactiae subsp. dysgalactiae (SDSD) has been considered a strict animal pathogen. Nevertheless, recent reports of human infections suggest a niche expansion for this subspecies, which may be a consequence of virulence gene acquisition that increases its pathogenicity. Previous studies reported the presence of virulence genes of Streptococcus pyogenes phages among bovine SDSD (collected in 2002–2003); however, the identity of these mobile genetic elements remains to be clarified. Thus, this study aimed to characterize the SDSD isolates collected in 2011–2013 and compare them with SDSD isolates collected in 2002–2003 and pyogenic streptococcus genomes available at the National Center for Biotechnology Information (NCBI) database, including human SDSD and S. dysgalactiae subsp. equisimilis (SDSE) strains, to track temporal shifts in bovine SDSD genotypes. The very close genetic relationships between human SDSD and SDSE were evident from the analysis of housekeeping genes, while bovine SDSD isolates seem more divergent. The results showed that all bovine SDSD harbor a Clustered Regularly Interspaced Short Palindromic Repeats (CRISPR)/Cas IIA system. The widespread presence of this system among bovine SDSD isolates, the high conservation of repeat sequences, and the polymorphism observed in spacers can be considered indicators of the system's activity. Overall, comparative analysis shows that bovine SDSD isolates carry speK, speC, speL, speM, spd1, and sdn virulence genes of S. pyogenes prophages. Our data suggest that these genes are maintained over time and seem to be exclusively a property of bovine SDSD strains. Although the bovine SDSD genomes characterized in the present study were not sequenced, the data set, including the high homology of superantigen (SAg) genes between bovine SDSD and S. pyogenes strains, may indicate that events of horizontal genetic transfer occurred before habitat separation. 
All bovine SDSD isolates were negative for genes of the operon encoding streptolysin S, except for the sagA gene, while the presence of this operon was detected in all SDSE and human SDSD strains. The data set of this study suggests that the separation between the subspecies “dysgalactiae” and “equisimilis” should be reconsidered. However, a study including the most comprehensive collection of strains from different environments would be required for definitive conclusions regarding the two taxa.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction: Burkholderia cepacia complex (Bcc) clonal complex (CC) 31, the predominant lineage causing devastating outbreaks globally, has been a growing concern of infections in non-cystic fibrosis (NCF) patients in India. B. cenocepacia is very challenging to treat owing to its virulence determinants and antibiotic resistance. Improving the management of these infections requires a better knowledge of their resistance patterns and mechanisms.
Methods: Whole-genome sequences of 35 CC31 isolates obtained from patient samples were analyzed against the 210 CC31 genomes available in the NCBI database to glean details of resistance, virulence, mobile elements, and phylogenetic markers to study the genomic diversity and evolution of the CC31 lineage in India.
Results: Genomic analysis revealed that 35 isolates belonging to CC31 were categorized into 11 sequence types (ST), of which five STs were reported exclusively from India. Phylogenetic analysis classified 245 CC31 isolates into eight distinct clades (I-VIII) and unveiled that NCF isolates are evolving independently from the global cystic fibrosis (CF) isolates, forming a distinct clade. The detection rate of seven classes of antibiotic-related genes in 35 isolates was 35 (100%) for tetracyclines, aminoglycosides, and fluoroquinolones; 26 (74.2%) for sulphonamides and phenicols; 7 (20%) for beta-lactamases; and 1 (2.8%) for trimethoprim resistance genes. Additionally, 3 (8.5%) NCF isolates were resistant to disinfecting agents and antiseptics. Antimicrobial susceptibility testing revealed that the majority of NCF isolates were resistant to chloramphenicol (77%) and levofloxacin (34%). NCF isolates have a comparable number of virulence genes to CF isolates. A well-studied pathogenicity island of B. cenocepacia, GI11, is present in ST628 and ST709 isolates from the Indian Bcc population. In contrast, genomic island GI15 (highly similar to the island found in B. pseudomallei strain EY1) is exclusively reported in ST839 and ST824 isolates from two different locations in India. Horizontal acquisition of lytic phage ST79 of pathogenic B. pseudomallei is demonstrated in ST628 isolates Bcc1463, Bcc29163, and BccR4654 within the CC31 lineage.
Discussion: The study reveals a high diversity of CC31 lineages among B. cenocepacia isolates from India. The extensive information from this study will facilitate the development of rapid diagnostic and novel therapeutic approaches to manage B. cenocepacia infections.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Oliveria decumbens Vent. is a wild, rare, annual medicinal plant endemic to Iran whose metabolites (mostly terpenes) make it a precious plant in Persian Traditional Medicine and a potential chemotherapeutic agent. The lack of genetic resources has slowed the discovery of genes involved in the terpene biosynthesis pathway. It is a wild relative of Daucus carota. In this research, we examined the transcriptomic differences between two samples, flower and root of Oliveria decumbens, and analyzed the expression of the genes involved in terpenoid biosynthesis by RNA-seq; the phytochemicals of its essential oil were analyzed by GC/MS. In total, 136,031,188 reads were produced from the two samples of flower and root. The results show that the MEP pathway is mostly active in the flower and the MVA pathway in the root. Three genes of GPP, FPPS, and GGPP, which are the precursors in the synthesis of mono, di, and triterpenes, are upregulated in the root, and 23 key genes involved in the biosynthesis of terpenes were identified. Three genes had the highest upregulation in the root, whereas another three genes were expressed only in the flower. Meanwhile, 191 and 185 upregulated genes in the flower and root of the plant, respectively, were selected for gene ontology analysis and reconstruction of co-expression networks. The current research is the first of its kind on the Oliveria decumbens transcriptome and discusses 67 genes that have been deposited into the NCBI database. Collectively, the information obtained in this study provides new insights into the genetic blueprint of Oliveria decumbens Vent., which paves the way for future applications in medical/plant biotechnology and the pharmaceutical industry.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The human pathogen Acinetobacter baumannii has emerged as a frequent cause of hospital-acquired infections, but infection of animals has rarely been observed. Here we analyzed an outbreak of epidemic pneumonia killing hundreds of sheep on a farm in Pakistan and identified A. baumannii as the infecting agent. A pure culture of strain AbPK1 isolated from lungs of sick animals was inoculated into healthy sheep, which subsequently developed similar disease symptoms. Bacteria re-isolated from the infected animals were shown to be identical to the inoculum, fulfilling Koch’s postulates. Comparison of the AbPK1 genome against 2283 A. baumannii genomes from the NCBI database revealed that AbPK1 carries genes for unusual surface structures, including a unique composition of iron acquisition genes, genes for O-antigen synthesis and sialic acid-specific acetylases of cell-surface carbohydrates that could enable immune evasion. Several of these unusual and otherwise rarely present genes were also identified in genomes of phylogenetically unrelated A. baumannii isolates from combat-wounded US military from Afghanistan, indicating a common gene pool in this geographical region. Based on core genome MLST, this virulent isolate represents a newly emerging lineage of Global Clone 2, suggesting a human source for this disease outbreak. The observed epidemic, direct transmission from sheep to sheep, which is highly unusual for A. baumannii, has important consequences for human and animal health. First, direct animal-to-animal transmission facilitates fast spread of the pathogen and disease in the flock. Second, it may establish a stable ecological niche and subsequent spread in a new host. And third, it constitutes a serious risk of transmission of this hyper-virulent clone from sheep back to humans, which may result in the emergence of contagious disease amongst humans.