17 datasets found

Data from: MarFERReT: an open-source, version-controlled reference library...
data.niaid.nih.gov
zenodo.org
Updated Jan 22, 2025
Cite
Blaskowski, Stephen (2025). MarFERReT: an open-source, version-controlled reference library of marine microbial eukaryote functional genes [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7055911
Dataset updated
Jan 22, 2025
Dataset provided by
Blaskowski, Stephen
Coesel, Sacha
Armbrust, E. Virginia
Groussman, Mora J
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Metatranscriptomics generates large volumes of sequence data about transcribed genes in natural environments. Taxonomic annotation of these datasets depends on availability of curated reference sequences. For marine microbial eukaryotes, current reference libraries are limited by gaps in sequenced organism diversity and barriers to updating libraries with new sequence data, resulting in taxonomic annotation of only about half of eukaryotic environmental transcripts. Here, we introduce version 1.0 of the Marine Functional EukaRyotic Reference Taxa (MarFERReT), an updated marine microbial eukaryotic sequence library with a version-controlled framework designed for taxonomic annotation of eukaryotic metatranscriptomes. We gathered 902 marine eukaryote genomes and transcriptomes from multiple sources and assessed these candidate entries for sequence quality and cross-contamination issues, selecting 800 validated entries for inclusion in the library. MarFERReT v1 contains reference sequences from 800 marine eukaryotic genomes and transcriptomes, covering 453 species- and strain-level taxa, totaling nearly 28 million protein sequences with associated NCBI and PR2 Taxonomy identifiers and Pfam functional annotations. An accompanying MarFERReT project repository hosts containerized build scripts, documentation on installation and use case examples, and information on new versions of MarFERReT: https://github.com/armbrustlab/marferret

The raw source data for the 902 candidate entries considered for MarFERReT v1.1.1, including the 800 accepted entries, are available for download from their respective online locations. The source URL for each of the entries is listed here in MarFERReT.v1.1.1.entry_curation.csv, and detailed instructions and code for downloading the raw sequence data from source are available in the MarFERReT code repository (link).

This repository release contains MarFERReT database files from the v1.1.1 MarFERReT release using the following MarFERReT library build scripts: assemble_marferret.sh, pfam_annotate.sh, and build_diamond_db.sh

The following MarFERReT data products are available in this repository:

MarFERReT.v1.1.1.metadata.csv
This CSV file contains descriptors of each of the 902 database entries, including data source, taxonomy, and sequence descriptors. Data fields are as follows:

entry_id: Unique MarFERReT sequence entry identifier.

accepted: Acceptance into the final MarFERReT build (Y/N). The Y/N values can be adjusted to customize the final build output according to user-specific needs.

marferret_name: A human- and machine-friendly string derived from the NCBI Taxonomy organism name, maintaining strain-level designation wherever possible.

tax_id: The NCBI Taxonomy ID (taxID).

pr2_accession: Best-matching PR2 accession ID associated with entry

pr2_rank: The lowest shared rank between the entry and the pr2_accession

pr2_taxonomy: PR2 Taxonomy classification scheme of the pr2_accession

data_type: Type of sequence data; transcriptome shotgun assemblies (TSA), gene models from assembled genomes (genome), and single-cell amplified genomes (SAG) or transcriptomes (SAT).

data_source: Online location of sequence data; the Zenodo data repository (Zenodo), the datadryad.org repository (datadryad.org), MMETSP re-assemblies on Zenodo (MMETSP), NCBI GenBank (NCBI), JGI Phycocosm (JGI-Phycocosm), the TARA Oceans portal on Genoscope (TARA), or entries from the Roscoff Culture Collection through the METdb database repository (METdb).

source_link: URL where the original sequence data and/or metadata was collected.

pub_year: Year of data release or publication of linked reference.

ref_link: PubMed URL of the published reference for the entry, if available.

ref_doi: DOI of entry data from source, if available.

source_filename: Name of the original sequence file name from the data source.

seq_type: Entry sequence data retrieved in nucleotide (nt) or amino acid (aa) alphabets.

n_seqs_raw: Number of sequences in the original sequence file.

source_name: Full organism name from entry source

original_taxID: Original NCBI taxID from entry data source metadata, if available

alias: Additional identifiers for the entry, if available
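
As a minimal sketch of how the 'accepted' field might be used to select entries for a custom build, the Python snippet below filters rows of a metadata table. The sample rows, organism names, and taxIDs are made up for illustration, and only a few of the documented columns are shown; the helper name accepted_entries is hypothetical and not part of the MarFERReT scripts.

```python
import csv
import io


def accepted_entries(csv_text):
    """Return the metadata rows whose 'accepted' field is 'Y'."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if row["accepted"].strip().upper() == "Y"]


# Made-up sample rows; the real file has 902 rows and many more columns.
SAMPLE = """entry_id,accepted,marferret_name,tax_id,data_type
1,Y,Example_organism_A,1111,TSA
2,N,Example_organism_B,2222,genome
3,Y,Example_organism_C,3333,SAG
"""

kept = accepted_entries(SAMPLE)
print([row["entry_id"] for row in kept])
```

To read the real file instead, the text of MarFERReT.v1.1.1.metadata.csv could be passed to the same function after opening it with the standard library.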

MarFERReT.v1.1.1.curation.csv
This CSV file contains curation and quality-control information on the 902 candidate entries considered for incorporation into MarFERReT v1, including curated NCBI Taxonomy IDs and entry validation statistics. Data fields are as follows:

entry_id: Unique MarFERReT sequence entry identifier

marferret_name: Organism name in human and machine friendly format, including additional NCBI taxonomy strain identifiers if available.

tax_id: Verified NCBI taxID used in MarFERReT

taxID_status: Status of the final NCBI taxID (Assigned, Updated, or Unchanged)

taxID_notes: Notes on the original_taxID

n_seqs_raw: Number of sequences in the original sequence file

n_pfams: Number of Pfam domains identified in protein sequences

qc_flag: Early validation quality control flags: LOW_SEQS (fewer than 1,200 raw sequences) and LOW_PFAMS (fewer than 500 Pfam domain annotations).

flag_Lasek: Flag notes from Lasek-Nesselquist and Johnson (2019); contains the flag 'FLAG_LASEK' indicating ciliate samples reported as contaminated in this study.

VV_contam_pct: Estimated contamination reported for MMETSP entries in Van Vlierberghe et al., (2021).

flag_VanVlierberghe: Flag for a high level of estimated contamination, from 'VV_contam_pct' values over 50%: FLAG_VV.

rp63_npfams: Number of ribosomal protein Pfam domains out of 63 total.

rp63_contam_pct: Percent of total ribosomal protein sequences with an inferred taxonomic identity in any lineage other than the recorded identity, as described in the Technical Validation section from analysis of 63 Pfam ribosomal protein domains.

flag_rp63: Flag for a high level of estimated contamination, from 'rp63_contam_pct' values over 50%: FLAG_RP63.

flag_sum: Count of the flag columns (qc_flag, flag_Lasek, flag_VanVlierberghe, and flag_rp63) that carry a flag. All entries with one or more flags are nominally rejected ('accepted' = N); entries without any flags are validated and accepted ('accepted' = Y).

accepted: Acceptance into the final MarFERReT build (Y or N).
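
The flagging rules listed above can be sketched in Python as follows. This is an illustration of the documented thresholds, not the curation code used by the authors: the function name curate_entry and its arguments are hypothetical, the contamination flags are assumed to be driven by the VV_contam_pct and rp63_contam_pct percentages, and flag_sum is interpreted as the number of flag columns that are non-empty.

```python
def curate_entry(n_seqs_raw, n_pfams, lasek_flagged=False,
                 vv_contam_pct=None, rp63_contam_pct=None):
    """Derive flag columns, flag_sum, and acceptance for one candidate entry."""
    qc = []
    if n_seqs_raw < 1200:   # LOW_SEQS: fewer than 1,200 raw sequences
        qc.append("LOW_SEQS")
    if n_pfams < 500:       # LOW_PFAMS: fewer than 500 Pfam annotations
        qc.append("LOW_PFAMS")

    vv_high = vv_contam_pct is not None and vv_contam_pct > 50
    rp63_high = rp63_contam_pct is not None and rp63_contam_pct > 50
    columns = {
        "qc_flag": ";".join(qc),
        "flag_Lasek": "FLAG_LASEK" if lasek_flagged else "",
        "flag_VanVlierberghe": "FLAG_VV" if vv_high else "",
        "flag_rp63": "FLAG_RP63" if rp63_high else "",
    }
    # Assumption: flag_sum counts flag columns holding at least one flag.
    flag_sum = sum(1 for value in columns.values() if value)
    columns["flag_sum"] = flag_sum
    columns["accepted"] = "Y" if flag_sum == 0 else "N"
    return columns


print(curate_entry(5000, 2000))
print(curate_entry(800, 300, vv_contam_pct=60.0))
```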

MarFERReT.v1.1.1.proteins.faa.gz
This Gzip-compressed FASTA file contains the 27,951,013 final translated and clustered protein sequences for all 800 accepted MarFERReT entries. The sequence defline contains the unique identifier for the sequence and its reference (mftX, where 'X' is a ten-digit integer value).

MarFERReT.v1.1.1.taxonomies.tab.gz
This Gzip-compressed tab-separated file is formatted for interoperability with the DIAMOND protein alignment tool commonly used for downstream analyses and contains some columns without any data. Each row contains an entry for one of the MarFERReT protein sequences in MarFERReT.v1.1.1.proteins.faa.gz. Note that 'accession.version' and 'taxid' are populated columns while 'accession' and 'gi' have NA values; the latter columns are required for backward compatibility as input for the DIAMOND alignment software and LCA analysis.

The columns in this file contain the following information:

accession: (NA)

accession.version: The unique MarFERReT sequence identifier ('mftX').

taxid: The NCBI Taxonomy ID associated with this reference sequence.

gi: (NA).
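
A short Python sketch of reading this four-column layout into a sequence-to-taxID lookup is given below. It assumes the file starts with a header row naming the four columns in the order listed above; the sample identifiers and taxIDs are made up, and load_taxon_map is a hypothetical helper name.

```python
import csv
import io


def load_taxon_map(handle):
    """Map each 'accession.version' (mftX identifier) to its integer taxid."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["accession.version"]: int(row["taxid"]) for row in reader}


# Made-up rows in the documented column order; 'accession' and 'gi' are NA.
SAMPLE = (
    "accession\taccession.version\ttaxid\tgi\n"
    "NA\tmft0000000001\t1111\tNA\n"
    "NA\tmft0000000002\t2222\tNA\n"
)

taxon_map = load_taxon_map(io.StringIO(SAMPLE))
print(taxon_map["mft0000000002"])
```

For the compressed file itself, a text-mode handle from gzip.open(path, "rt", newline="") could be passed in place of the io.StringIO object.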

MarFERReT.v1.1.1.proteins_info.tab.gz
This Gzip-compressed tab-separated file contains a row for each final MarFERReT protein sequence with the following columns:

aa_id: the unique identifier for each MarFERReT protein sequence.

entry_id: The unique numeric identifier for each MarFERReT entry.

source_defline: The original, unformatted sequence identifier

MarFERReT.v1.1.1.best_pfam_annotations.csv.gz
This Gzip-compressed CSV file contains the best-scoring Pfam annotation for intra-species clustered protein sequences from the 800 validated MarFERReT entries, derived from the hmmsearch annotations against Pfam 34.0 functional domains. This file contains the following fields:

aa_id: The unique MarFERReT protein sequence ID ('mftX').

pfam_name: The shorthand Pfam protein family name.

pfam_id: The Pfam identifier.

pfam_eval: hmm profile match e-value score

pfam_score: hmm profile match bitscore
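
To illustrate what "best-scoring" means for these fields, the sketch below keeps, for each protein, the hit with the highest bitscore. It is an assumption about the selection rule rather than the logic of pfam_annotate.sh, and the hits, names, and scores are made up for illustration.

```python
def best_annotations(hits):
    """Keep the highest-bitscore Pfam hit for each aa_id."""
    best = {}
    for hit in hits:
        current = best.get(hit["aa_id"])
        if current is None or hit["pfam_score"] > current["pfam_score"]:
            best[hit["aa_id"]] = hit
    return best


# Made-up hits using the documented field names.
HITS = [
    {"aa_id": "mft0000000001", "pfam_name": "family_a", "pfam_score": 45.2},
    {"aa_id": "mft0000000001", "pfam_name": "family_b", "pfam_score": 120.7},
    {"aa_id": "mft0000000002", "pfam_name": "family_c", "pfam_score": 30.0},
]

for aa_id, hit in best_annotations(HITS).items():
    print(aa_id, hit["pfam_name"], hit["pfam_score"])
```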

MarFERReT.v1.1.1.dmnd
This binary file is the indexed database of the MarFERReT protein library with embedded NCBI taxonomic information generated by the DIAMOND makedb tool using the build_diamond_db.sh script from the MarFERReT /scripts/ library. This can be used as the reference DIAMOND database for annotating environmental sequences from eukaryotic metatranscriptomes.
Data from: TempO-seq and RNA-seq gene expression levels are highly...
catalog.data.gov
Updated Jun 8, 2025
Cite
U.S. EPA Office of Research and Development (ORD) (2025). TempO-seq and RNA-seq gene expression levels are highly correlated for most genes: A comparison using 39 human cell lines [Dataset]. https://catalog.data.gov/dataset/tempo-seq-and-rna-seq-gene-expression-levels-are-highly-correlated-for-most-genes-a-compar
Dataset updated
Jun 8, 2025
Dataset provided by
United States Environmental Protection Agency http://www.epa.gov/
Description
Journal article published in PLOS One, Vol 20, Issue 5, e0320862, 2025; DOI: https://doi.org/10.1371/journal.pone.0320862; PMC12064016. The datasets generated and analyzed during the current study are provided in Supplemental S1 File. The RNA-seq data is Protein Atlas Version 23 from the Human Protein Atlas website (https://www.proteinatlas.org/about/download, “RNA HPA cell line gene data” released 2023.06.19). All FASTQ files and aligned counts for the U.S. EPA TempO-seq data have been deposited into NCBI Gene Expression Omnibus under the accession number GSE288929 and are publicly available at: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE288929. The R code is available through FigShare at: https://doi.org/10.23645/epacomptox.27341970.v1. This dataset is associated with the following publication: Word, L., C. Willis, R. Judson, L. Everett, S. Davidson-Fritz, D. Haggard, B. Chambers, J. Rogers, J. Bundy, I. Shah, N. Sipes, and J. Harrill. TempO-seq and RNA-seq Gene Expression Levels are Highly Correlated for Most Genes: A Comparison Using 39 Human Cell Lines. PLOS ONE. Public Library of Science, San Francisco, CA, USA, 20(5): e0320862, (2025).
Perlegen/NIEHS National Toxicology: Mouse Genome Resequencing Project
dknet.org
scicrunch.org
Updated Oct 16, 2019
Cite
(2019). Perlegen/NIEHS National Toxicology: Mouse Genome Resequencing Project [Dataset]. http://identifiers.org/RRID:SCR_000726
Unique identifier
https://identifiers.org/RRID:SCR_000726
Dataset updated
Oct 16, 2019
Description
THIS RESOURCE IS NO LONGER IN SERVICE. Documented on August 12, 2014. Data, grouped by chromosome, available as flat files for download, of identified DNA polymorphisms (SNPs) in 15 commonly used strains of inbred laboratory mice. Perlegen's SNP, genotype (empirical and imputed), haplotype, trace, and PCR primer data have been compiled with NCBI Mouse Build information to produce data files for public use. Using high-density oligonucleotide array technology, the study identified over 8 million SNPs and other genetic differences between these strains and the previously sequenced C57BL/6J reference strain (Phase 1). By leveraging data provided by Mark Daly's research team at the Broad Institute, genotypes were also predicted for 40 other common strains (Phase 2). Under an extension to the contract, Eleazar Eskin's group at UCLA has used this data to evaluate SNP associations with phenotypes from the Mouse Phenome Project (the Mouse Phenome Database), and to construct haplotype maps for a total of 94 inbred strains (the Mouse HapMap Project). SNP and genotype positions have been mapped from their original reference coordinates to NCBI Mouse Build 37 coordinates. Note that the C57BL/6J strain was not selected for re-sequencing as this data would have been almost entirely redundant with the NCBI reference sequence. Since we did not actually determine genotypes for C57BL/6J, we did not submit genotypes for this strain to dbSNP. However, implicit genotypes for C57BL/6J can be obtained from the reference sequence at each SNP position (the reference allele is the first allele in the ALLELES column). The data is available for download in two different compressed file formats. The files are saved as both PC .zip files and Unix compressed .gz files.

At this website, you can:
* Learn more about the goals of the Perlegen mouse resequencing project.
* Learn more about the array-based resequencing technology used in the project.
* Download the SNPs, genotypes, and other data generated by the project, plus sequences of the long-range PCR primers used for SNP discovery.
* Browse the mouse genome for SNPs.
* View the haplotype blocks within the mouse genome.

Mouse Genome Browser
The Mouse Genome Browser can be used to visualize genes and the SNPs discovered in this study of genome-wide DNA variation in 15 commonly used, genetically diverse strains of inbred laboratory mice. The reference genome is the C57BL/6J strain NCBI build 37 mouse sequence. In addition to the experimentally derived genotypes for the original 15 strains, the imputed genotypes for 40 additional inbred mouse strains can also be accessed.

Mouse Haplotype Analysis
The sequences of 16 commonly used, genetically diverse strains of inbred laboratory mice were analyzed to determine their haplotype structure. The Ancestry Browser shows which ancestral sequence each inbred strain most resembles, along with statistics on the pairwise similarity between the ancestral strains. The Haplotype Viewer shows the haplotype block boundaries and the pairwise similarity for all 56 strains: the 15 used for SNP discovery, the reference strain (C57BL/6J), and the 40 additional strains for which the genotypes were imputed.
The Chironomus tentans draft genome annotation
figshare.scilifelab.se
researchdata.se
application/gzip
Updated Jan 15, 2025
Cite
Alexey Kutsenko; Thomas Svensson; Björn Nystedt; Joakim Lundeberg; Petra Björk; Erik Sonnhammer; Stefania Giacomello; Neus Visa; Lars Wieslander (2025). The Chironomus tentans draft genome annotation [Dataset]. http://doi.org/10.17044/scilifelab.23532288.v1
Available download formats: application/gzip
Unique identifier
https://doi.org/10.17044/scilifelab.23532288.v1
Dataset updated
Jan 15, 2025
Dataset provided by
Stockholm university
Authors
Alexey Kutsenko; Thomas Svensson; Björn Nystedt; Joakim Lundeberg; Petra Björk; Erik Sonnhammer; Stefania Giacomello; Neus Visa; Lars Wieslander
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
If you use this data, please cite: Kutsenko, A., Svensson, T., Nystedt, B. et al. The Chironomus tentans genome sequence and the organization of the Balbiani ring genes. BMC Genomics 15, 819 (2014). https://doi.org/10.1186/1471-2164-15-819

The dipteran Chironomus tentans (C. tentans) and its Balbiani ring (BR) genes serve as a model system for eukaryotic gene expression studies. Kutsenko et al. (2014) report the first draft genome of C. tentans, characterizing its gene expression machinery and the genomic architecture of its BR genes.

In brief, genomic DNA was extracted and sequenced, resulting in an assembly size of 213 Mb, which was likely an overestimate due to allelic variants. The estimated genome size is around 200 Mb, with low GC content (31%) and repeat fraction (15%) compared to other dipterans. Phylogenetic analysis places it as a sister clade to mosquitoes, diverging 150-250 million years ago. The assembled genome was relatively fragmented (scaffold NG50=65 Kbp), but was still found to be reasonably complete regarding gene content, with 97% of 248 highly conserved core eukaryotic genes being represented.
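
For readers unfamiliar with the NG50 statistic quoted above, the Python sketch below shows the usual definition: the length of the shortest scaffold in the smallest set of longest scaffolds that together cover at least half of the estimated genome size. The scaffold lengths in the example are made up and are not taken from the C. tentans assembly.

```python
def ng50(scaffold_lengths, genome_size):
    """Return the NG50, or None if the scaffolds cover under half the genome."""
    covered = 0
    for length in sorted(scaffold_lengths, reverse=True):
        covered += length
        # Stop at the first scaffold that pushes coverage to half the genome.
        if covered >= genome_size / 2:
            return length
    return None


# Made-up scaffold lengths against an estimated genome size of 200 units.
print(ng50([50, 40, 30, 20, 10], genome_size=200))
```

Unlike N50, which uses the total assembly length as the denominator, NG50 uses the estimated genome size, so an assembly inflated by allelic variants is not rewarded for its extra sequence.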

For transcriptome sequencing and genome annotation, poly(A)+ RNA was extracted from various tissues and developmental stages. This data was used as evidence for ab initio predictions of gene models and alternative splice variants, resulting in a draft annotation of 15,120 predicted genes. The C. tentans draft genome assembly can be downloaded here or from NCBI: GenBank accession number: CBTT000000000.1 https://www.ncbi.nlm.nih.gov/assembly/GCA_000786525.1/

The draft genome annotation and the corresponding longest predicted proteins for each gene locus are provided here for download. Note that these preliminary annotations are provided as is, and incomplete, missing, or incorrect gene models are to be expected to some extent.

Acknowledgements We acknowledge the Science for Life Laboratory and the National Genomics Infrastructure (NGI) for sequencing service. Computations were mainly performed on resources provided by SNIC through Uppsala Multidisciplinary Center for Advanced Computational Science (UPPMAX). Microscopy was performed at IFSU, Stockholm University. Ann-Charlotte Sonnhammer at BILS is acknowledged for assistance concerning the initial bioinformatics analysis. We thank Magnus Bjursell for initial support in the project. This work was financed by grants from The Knut and Alice Wallenberg Foundation through The Center for Metagenomic Sequence analysis (CMS), The Granholm’s Foundation, The Carl Trygger’s Foundation and The Swedish Research Council (VR).
Code for Predicting MIEs from Gene Expression and Chemical Target Labels...
s.cnmilf.com
datasets.ai
Updated Apr 21, 2022
Cite
U.S. EPA Office of Research and Development (ORD) (2022). Code for Predicting MIEs from Gene Expression and Chemical Target Labels with Machine Learning (MIEML) [Dataset]. https://s.cnmilf.com/user74170196/https/catalog.data.gov/dataset/code-for-predicting-mies-from-gene-expression-and-chemical-target-labels-with-machine-lear
Dataset updated
Apr 21, 2022
Dataset provided by
United States Environmental Protection Agency http://www.epa.gov/
Description
Modeling data and analysis scripts generated during the current study are available in the GitHub repository: https://github.com/USEPA/CompTox-MIEML. RefChemDB is available for download as supplemental material from its original publication (PMID: 30570668). LINCS gene expression data are publicly available and accessible through the Gene Expression Omnibus (GSE92742 and GSE70138) at https://www.ncbi.nlm.nih.gov/geo/. This dataset is associated with the following publication: Bundy, J., R. Judson, A. Williams, C. Grulke, I. Shah, and L. Everett. Predicting Molecular Initiating Events Using Chemical Target Annotations and Gene Expression. BioData Mining. BioMed Central Ltd, London, UK, issue: 7, (2022).
CaspBase
rrid.site
neuinfo.org
Updated Jul 19, 2025
Cite
(2025). CaspBase [Dataset]. http://identifiers.org/RRID:SCR_018975
Unique identifier
https://identifiers.org/RRID:SCR_018975
Dataset updated
Jul 19, 2025
Description
Database for evolutionary biochemical studies of caspase functional divergence and ancestral sequence inference. Tool to rapidly disseminate organized caspase sequence data. Includes all animal species with currently available annotated genomes in the NCBI genome database. Both manually curated and uncurated sequences are available for download.
MARMICRODB database for taxonomic classification of (marine) metagenomes
zenodo.org
explore.openaire.eu
application/gzip, bin +3
Updated Mar 20, 2020
Cite
Shane L Hogle (2020). MARMICRODB database for taxonomic classification of (marine) metagenomes [Dataset]. http://doi.org/10.5281/zenodo.3520509
Available download formats: bin, application/gzip, tsv, html, bz2
Unique identifier
https://doi.org/10.5281/zenodo.3520509
Dataset updated
Mar 20, 2020
Dataset provided by
Zenodo http://zenodo.org/
Authors
Shane L Hogle
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Introduction:
This sequence database (MARMICRODB) was introduced in the publication JW Becker, SL Hogle, K Rosendo, and SW Chisholm. 2019. Co-culture and biogeography of Prochlorococcus and SAR11. ISME J. doi:10.1038/s41396-019-0365-4. Please see the original publication and its associated supplementary material for the original description of this resource.

Motivation:
We needed a reference database to annotate shotgun metagenomes from the Tara Oceans project [1], the GEOTRACES cruises GA02, GA03, GA10, and GP13, and the HOT and BATS time series [2]. Our interests are primarily in quantifying and annotating the free-living, oligotrophic bacterial groups Prochlorococcus, Pelagibacterales/SAR11, SAR116, and SAR86 from these samples using the protein classifier tool Kaiju [3]. Kaiju’s sensitivity and classification accuracy depend on the composition of the reference database, and highest sensitivity is achieved when the reference database contains a comprehensive representation of expected taxa from an environment/sample of interest. However, the speed of the algorithm decreases as database size increases. Therefore, we aimed to create a reference database that maximized the representation of sequences from marine bacteria, archaea, and microbial eukaryotes, while minimizing (but not excluding) the sequences from clinical, industrial, and terrestrial host-associated samples.

Results/Description:
MARMICRODB consists of 56 million non-redundant protein sequences from 18769 bacterial/archaeal/eukaryote genome and transcriptome bins and 7492 viral genomes optimized for use with the protein homology classifier Kaiju [3]. To ensure maximum representation of marine bacteria, archaea, and microbial eukaryotes, we included translated genes/transcripts from 5397 representative “specI” species clusters from the proGenomes database [4]; 113 transcriptomes from the Marine Microbial Eukaryote Transcriptome Sequencing Project (MMETSP) [5]; 10509 metagenome assembled genomes from the Tara Oceans expedition [6,7], the Red Sea [8], the Baltic Sea [9], and other aquatic and terrestrial sources [10]; 994 isolate genomes from the Genomic Encyclopedia of Bacteria and Archaea [11]; 7492 viral genomes from NCBI RefSeq [12]; 786 bacterial and archaeal genomes from MarRef [13]; and 677 marine single cell genomes [14]. In order to annotate metagenomic reads at the clade/ecotype level (subspecies) for the focal taxa Prochlorococcus, Synechococcus, SAR11/Pelagibacterales, SAR86, and SAR116, we generated custom MARMICRODB taxonomies based on curated genome phylogenies for each group. The curated phylogenies, Kaiju formatted Burrows-Wheeler index, translated genes, the custom taxonomy hierarchy, an interactive kronaplot of the taxonomic composition, and scripts and instructions for how to use or rebuild the resource are available from 10.5281/zenodo.3520509.

Methods:
The curation and quality control of MARMICRODB single cell, metagenome assembled, and isolate genomes was performed as described in [15]. Briefly, we downloaded all MARMICRODB genomes as raw nucleotide assemblies from NCBI. We determined an initial genome taxonomy for these assemblies using checkM with the default lineage workflow [16]. All genome bins met the completion/contamination thresholds outlined in prior studies [7,17]. For single cell and metagenome assembled genomes, especially those from Tara Oceans Mediterranean Sea samples [18], we used the GTDB-Tk classification workflow [19] to verify the taxonomic fidelity of each genome bin. We then selected genomes with a checkM taxonomic assignment of Prochlorococcus, Synechococcus, SAR11/Pelagibacterales, SAR86, and SAR116 for further analysis and confirmed taxonomic assignment using blast matches to known Prochlorococcus/Synechococcus ITS sequences and by matching 16S sequences to the SILVA database [20]. To refine our estimates of completeness/contamination of Prochlorococcus genome bins, we created a custom set of 730 single copy protein families (available from 10.5281/zenodo.3719132) from closed, isolate Prochlorococcus genomes [21] for quality assessments with checkM. For Synechococcus we used the CheckM taxonomic-specific workflow with the genus Synechococcus. After the custom CheckM quality control, we excluded any genome bins from downstream analysis that had an estimated quality < 30, defined as %completeness − 5 × %contamination, resulting in 18769 genome/transcriptome bins. We predicted genes in the resulting genome bins using prodigal [22] and excluded protein sequences with lengths less than 20 or greater than 20000 amino acids, removed non-standard amino acid residues, and condensed redundant protein sequences to a single representative sequence to which we assigned a lowest common ancestor (LCA) taxonomy identifier from the NCBI taxonomy database [23]. 
The resulting protein sequences were compiled and used to build a Kaiju [3] search database.
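
The two numeric filters described above (bin quality and protein length) can be written as a short Python sketch. The function names are hypothetical and the example values are made up; only the formula and the thresholds come from the text.

```python
def bin_quality(completeness_pct, contamination_pct):
    """Estimated quality: %completeness minus 5 times %contamination."""
    return completeness_pct - 5 * contamination_pct


def keep_bin(completeness_pct, contamination_pct, min_quality=30):
    """Bins with an estimated quality below 30 are excluded."""
    return bin_quality(completeness_pct, contamination_pct) >= min_quality


def keep_protein(sequence, min_len=20, max_len=20000):
    """Proteins shorter than 20 or longer than 20000 residues are excluded."""
    return min_len <= len(sequence) <= max_len


# Made-up completeness/contamination pairs.
print(bin_quality(90.0, 5.0), keep_bin(90.0, 5.0))
print(bin_quality(50.0, 5.0), keep_bin(50.0, 5.0))
print(keep_protein("M" * 19), keep_protein("M" * 20))
```

The factor of 5 penalizes contamination much more heavily than incompleteness: a bin that is 50% complete with 5% contamination scores 25 and is dropped, while a bin that is 40% complete with 2% contamination scores exactly 30 and is retained.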

The above filtering criteria resulted in 605 Prochlorococcus, 96 Synechococcus, 186 SAR11/Pelagibacterales, 60 SAR86, and 59 SAR116 high-quality genome bins. We constructed a high quality fixed reference phylogenetic tree for each taxonomic group based on genomes manually selected for completeness and phylogenetic diversity. For example, the Prochlorococcus and Synechococcus genomes for the fixed reference phylogeny are estimated > 90% complete, and SAR11 genomes are estimated > 70% complete. We created multiple sequence alignments of phylogenetically conserved genes from these genomes using the GTDB-Tk pipeline [19] with default settings. The pipeline identifies conserved proteins (120 bacterial proteins) and generates concatenated multi-protein alignments [17] from the genome assemblies using hmmalign from the hmmer software suite. We further filtered the resulting alignment columns using the bacterial and archaeal alignment masks from [17] (http://gtdb.ecogenomic.org/downloads). We removed columns represented by fewer than 50% of all taxa and/or columns with no single amino acid residue occurring at a frequency greater than 25%. We trimmed the alignments using trimal [24] with the automated -gappyout option to trim columns based on their gap distribution. We inferred reference phylogenies using multithreaded RAxML [25] with the GAMMA model of rate heterogeneity, empirically determined base frequencies, and the LG substitution model [26] (PROTGAMMALGF). Branch support is based on 250 resampled bootstrap trees. This tree was then pruned to only allow a maximum average distance to the closest leaf (ADCL) of 0.003 to reduce the phylogenetic redundancy in the tree [27]. We then “placed” genomes that either did not pass the completeness threshold or were considered phylogenetically redundant by ADCL within the fixed reference phylogeny for each group using pplacer [28], representing each placed genome as a pendant edge in the final tree. 
We then examined the resulting tree and manually selected clade/ecotype cutoffs to be as consistent as possible with clade definitions previously outlined for these groups [29–32]. We then gave clades from each taxonomic group custom taxonomic identifiers and we added these identifiers to the MARMICRODB Kaiju taxonomic hierarchy.
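
The alignment column filter described above (dropping columns present in fewer than 50% of taxa, or with no residue above 25% frequency) can be sketched in Python as follows. This is a toy illustration, not the GTDB-Tk implementation: the gap characters and the choice to compute residue frequency over all taxa rather than over non-gap positions are assumptions, and the four-sequence alignment is made up.

```python
from collections import Counter

GAP_CHARS = {"-", "."}  # assumed gap symbols


def keep_column(column, min_taxa_frac=0.5, min_top_freq=0.25):
    """Decide whether one alignment column passes both filters."""
    residues = [char for char in column if char not in GAP_CHARS]
    if not residues or len(residues) < min_taxa_frac * len(column):
        return False  # represented by fewer than 50% of taxa
    top_count = Counter(residues).most_common(1)[0][1]
    # Assumption: frequency is measured against all taxa in the column.
    return top_count / len(column) > min_top_freq


def filter_alignment(sequences):
    """Return the aligned sequences restricted to the retained columns."""
    kept = [i for i, col in enumerate(zip(*sequences)) if keep_column(col)]
    return ["".join(seq[i] for i in kept) for seq in sequences]


# Made-up alignment: column 2 has no majority residue, column 3 is all gaps.
ALIGNMENT = ["AC-G", "AD-G", "AE-T", "AF--"]
print(filter_alignment(ALIGNMENT))
```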

Software/databases used:
checkM v1.0.11 [16]
HMMER v3.1b2 (http://hmmer.org/)
prodigal v2.6.3 [22]
trimAl v1.4.rev22 [24]
AliView v1.18.1 [33] [34]
Phyx v0.1 [35]
RAxML v8.2.12 [36]
Pplacer v1.1alpha [28]
GTDB-Tk v0.1.3 [19]
Kaiju v1.6.0 [34]
GTDB RS83 (https://data.ace.uq.edu.au/public/gtdb/data/releases/release83/83.0/)
NCBI Taxonomy (accessed 2018-07-02) [23]
TIGRFAM v14.0 [37]
PFAM v31.0 [38]

Discussion/Caveats:
MARMICRODB is optimized for metagenomic samples from the marine environment, in particular planktonic microbes from the pelagic euphotic zone. We expect this database may also be useful for classifying other types of marine metagenomic samples (for example mesopelagic, bathypelagic, or even benthic or marine host-associated), but it has not been tested as such. The original purpose of this database was to quantify clades/ecotypes of Prochlorococcus, Synechococcus, SAR11/Pelagibacterales, SAR86, and SAR116 in metagenomes from the Tara Oceans Expedition and the GEOTRACES project. We carefully annotated and quality controlled genomes from these five groups, but the processing of the other marine taxa was largely automated and unsupervised. Taxonomy for other groups was copied over from the Genome Taxonomy Database (GTDB) [19,39] and NCBI Taxonomy [23], so any inconsistencies in those databases will be propagated to MARMICRODB. For most use cases MARMICRODB can probably be used unmodified, but if the user’s goal is to focus on a particular organism/clade that we did not curate in the database, then the user may wish to spend some time curating those genomes (i.e., checking for contamination, dereplicating, building a genome phylogeny for custom taxonomy node assignment). Currently the custom taxonomy is hardcoded in the MARMICRODB.fmi index, but if users wish to modify MARMICRODB by adding or removing genomes, or reconfiguring taxonomic ranks, the names.dmp and nodes.dmp files can easily be modified as well as the fasta file of protein sequences. However, the Kaiju index will need to be rebuilt, and the user will require a high
Data from: PrimerMiner: an R package for development and in silico...
data.niaid.nih.gov
datadryad.org
zip
Updated Oct 12, 2017
Cite
Vasco Elbrecht; Florian Leese (2017). PrimerMiner: an R package for development and in silico validation of DNA metabarcoding primers [Dataset]. http://doi.org/10.5061/dryad.4c3g9
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.4c3g9
Dataset updated
Oct 12, 2017
Dataset provided by
University of Duisburg-Essen
Authors
Vasco Elbrecht; Florian Leese
License
https://spdx.org/licenses/CC0-1.0.html
Description
DNA metabarcoding is a powerful tool to assess biodiversity by amplifying and sequencing a standardized gene marker region. Its success is often limited due to variable binding sites that introduce amplification biases. Thus the development of optimized primers for communities or taxa under study in a certain geographic region and/or ecosystems is of critical importance. However, no tool for obtaining and processing of reference sequence data in bulk that can serve as a backbone for primer design is currently available.

We developed the R package PrimerMiner, which batch downloads DNA barcode gene sequences from BOLD and NCBI databases for specified target taxonomic groups and then applies sequence clustering into operational taxonomic units (OTUs) to reduce biases introduced by the different number of available sequences per species. Additionally, PrimerMiner offers functionalities to evaluate primers in silico, which are in our opinion more realistic than the strategy employed in another available software for that purpose, ecoPCR.

We used PrimerMiner to download cytochrome c oxidase subunit I (COI) sequences for 15 important freshwater invertebrate groups, relevant for ecosystem assessment. By processing COI markers from both databases, we were able to increase the amount of reference data 249-fold on average, compared to using complete mitochondrial genomes alone. Furthermore, we visualized the generated OTU sequence alignments and describe how to evaluate primers in silico using PrimerMiner.

With PrimerMiner we provide a useful tool to obtain relevant sequence data for targeted primer development and evaluation. The OTU based reference alignments generated with PrimerMiner can be used for manual primer design, or processed with bioinformatic tools for primer development.
Hackathon - TF-TG literature triage additional data
zenodo.org
zip
Updated Jan 24, 2020
GREEKC; GREEKC (2020). Hackathon - TF-TG literature triage additional data [Dataset]. http://doi.org/10.5281/zenodo.2562967
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.2562967
Dataset updated
Jan 24, 2020
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
GREEKC; GREEKC
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
TF = transcription factor; TG = target gene

Jamboree: manual triage/TF-TG relation annotation exercise

Hackathon: automatic triage/TF-TG relation annotation exercise

This link contains three files:

Exp_methods: A list of experimental methods used in the study of transcription factor and target gene interactions. These method names might be useful as semantic features or to index records that experimentally characterize TF-TG relations.

Name: GREEKC_Hackathon_Exp_methods.zip

Example: TRE_EM_prl_2 Electric Mobility Shift

Format: tab-separated columns (method ID, method name/alias)

Hs_tf_hgnc_ncbigeneid: A list of human transcription factor HUGO gene symbols and their respective NCBI Entrez Gene identifiers. This gene list might be useful as features for literature triage systems.

Name: GREEKC_Hackathon_hs_tf_hgnc_ncbigeneid.zip

Example: ADNP 23394

Format: tab-separated columns (HUGO gene symbol, Entrez Gene ID)

Sentences_methods: A list of sentences that contain automatically labeled mentions of experimental methods that can be used to characterize TF-TG interactions. This dataset might be useful as a more fine-grained additional training set, as a considerable number of sentences describe TF-TG interactions together with the experimental technique used to characterize them.

Name: GREEKC_Hackathon_sentences_methods.zip

Example: 23675312 A 7 We used chromatin immunoprecipitation (ChIP) experiments to show that Bck2 localizes to the promoters of M/G1-specific genes, in a manner dependent on functional ECB elements, as well as to the promoters of G1/S and G2/M genes.

Format: tab-separated columns (PubMed ID (PMID), text type (A = abstract, T = title), sentence number, sentence text string)
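The four-column layout of the Sentences_methods file can be read with a few lines of Python. This is a minimal sketch, assuming that each line holds exactly the four documented columns separated by tabs and that the file has no header row; the record type and function name are hypothetical, and the example sentence is shortened.

```python
from typing import NamedTuple


class SentenceRecord(NamedTuple):
    """One row of the Sentences_methods file (field names are illustrative)."""
    pmid: str
    text_type: str      # "A" = abstract, "T" = title
    sentence_no: int
    text: str


def parse_sentence_line(line: str) -> SentenceRecord:
    """Split one tab-separated line into the four documented columns."""
    # maxsplit=3 keeps any stray tab inside the sentence text intact.
    pmid, text_type, sentence_no, text = line.rstrip("\n").split("\t", 3)
    return SentenceRecord(pmid, text_type, int(sentence_no), text)


# Shortened version of the example row given in the description.
example = "23675312\tA\t7\tWe used chromatin immunoprecipitation (ChIP) experiments.\n"
record = parse_sentence_line(example)
print(record.pmid, record.text_type, record.sentence_no)
```

The same split-on-tab approach should apply to the two-column Exp_methods and Hs_tf_hgnc_ncbigeneid files, with `maxsplit=1`.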
List of species and accessions for RNA-seq data used in this study.
figshare.com
txt
Updated Jun 14, 2023
Alexander L. Cope; Premal Shah (2023). List of species and accessions for RNA-seq data used in this study. [Dataset]. http://doi.org/10.1371/journal.pgen.1010256.s021
Explore at:
Available download formats: txt
Unique identifier
https://doi.org/10.1371/journal.pgen.1010256.s021
Dataset updated
Jun 14, 2023
Dataset provided by
PLOS Genetics
Authors
Alexander L. Cope; Premal Shah
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
All data were downloaded from NCBI’s Sequence Read Archive, except for L. kluyveri (European Nucleotide Archive). (TSV)
Data from: Supporting data for: Late Quaternary dynamics of Arctic biota...
search.dataone.org
dataverse.no
+2 more
Updated Jan 5, 2024
Wang, Yucheng; Pedersen, Mikkel Winther; Alsos, Inger Greve; De Sanctis, Bianca; Racimo, Fernando; Prohaska, Ana; Coissac, Eric; Owens, Hannah Lois; Merkel, Marie Kristine Føreid; Fernandez-Guerra, Antonio; Rouillard, Alexandra; Lammers, Youri; Alberti, Adriana; Denoeud, France; Money, Daniel; Ruter, Anthony H.; McColl, Hugh; Larsen, Nicolaj Krog; Cherezova, Anna A.; Edwards, Mary E.; Fedorov, Grigory B.; Haile, James; Orlando, Ludovic; Vinner, Lasse; Korneliussen, Thorfinn Sand; Beilman, David W.; Bjørk, Anders A.; Cao, Jialu; Dockter, Christoph; Esdale, Julie; Gusarova, Galina; Kjeldsen, Kristian K.; Mangerud, Jan; Rasic, Jeffrey T.; Skadhauge, Birgitte; Svendsen, John-Inge; Tikhonov, Alexei; Wincker, Patrick; Xing, Yingchun; Zhang, Yubin; Froese, Duane G.; Rahbek, Carsten; Bravo Nogues, David; Holden, Philip B.; Edwards, Neil R.; Durbin, Richard; Meltzer, David J.; Kjær, Kurt H.; Möller, Per; Willerslev, Eske (2024). Supporting data for: Late Quaternary dynamics of Arctic biota revealed by ancient environmental metagenomics [Dataset]. http://doi.org/10.18710/3CVQAG
Explore at:
Unique identifier
https://doi.org/10.18710/3CVQAG
Dataset updated
Jan 5, 2024
Dataset provided by
DataverseNO
Authors
Wang, Yucheng; Pedersen, Mikkel Winther; Alsos, Inger Greve; De Sanctis, Bianca; Racimo, Fernando; Prohaska, Ana; Coissac, Eric; Owens, Hannah Lois; Merkel, Marie Kristine Føreid; Fernandez-Guerra, Antonio; Rouillard, Alexandra; Lammers, Youri; Alberti, Adriana; Denoeud, France; Money, Daniel; Ruter, Anthony H.; McColl, Hugh; Larsen, Nicolaj Krog; Cherezova, Anna A.; Edwards, Mary E.; Fedorov, Grigory B.; Haile, James; Orlando, Ludovic; Vinner, Lasse; Korneliussen, Thorfinn Sand; Beilman, David W.; Bjørk, Anders A.; Cao, Jialu; Dockter, Christoph; Esdale, Julie; Gusarova, Galina; Kjeldsen, Kristian K.; Mangerud, Jan; Rasic, Jeffrey T.; Skadhauge, Birgitte; Svendsen, John-Inge; Tikhonov, Alexei; Wincker, Patrick; Xing, Yingchun; Zhang, Yubin; Froese, Duane G.; Rahbek, Carsten; Bravo Nogues, David; Holden, Philip B.; Edwards, Neil R.; Durbin, Richard; Meltzer, David J.; Kjær, Kurt H.; Möller, Per; Willerslev, Eske
Area covered
Arctic
Description
An alternative download from FileSender is linked from the full description.

[Dataset abstract] This dataset contains the assembled genome contigs (whole genome level) of the PhyloNorway plant database used in Wang et al. 2021, Late Quaternary Dynamics of Arctic Biota Revealed by Ancient Environmental Metagenomics. Methods for generating this database can be found in the paper. The 7 fasta files are the database. The file PhyloNorway_com_acc2TaxaID.txt supplies an NCBI-format acc2TaxaID file matching accession ID to NCBI TaxaID. Additional information about the database can be found in Alsos et al. 2020.

[Article abstract Wang et al. submitted] During the last glacial-interglacial cycle, arctic biota experienced drastic climatic changes, yet the nature, extent and rate of their responses are not fully understood. Here we report the first large-scale environmental DNA metagenomic study of ancient plant and mammal communities using 535 permafrost and lake sediment samples from across the Arctic spanning the last 50,000 years. Additionally, we present 1,541 contemporary plant genome assemblies generated as reference sequences. Our study provides several novel insights into the long-term dynamics of the arctic biota at circumpolar and regional scales. Key findings include: (i) a relatively homogeneous steppe-tundra flora dominated the Arctic during the Last Glacial Maximum, followed by regional divergence of vegetation in the Holocene; (ii) certain grazing animals consistently co-occurred in space and time; (iii) humans appear to have been a minor factor in driving animal distributions; (iv) higher effective precipitation, and an increase in the proportion of wetland plants, show negative effects on animal diversity; (v) the persistence of the steppe-tundra vegetation in northern Siberia allowed the late survival of several now-extinct megafauna species, including woolly mammoth to 3.9±0.2 ka (kilo annum Before Present) and woolly rhinoceros to 9.8±0.2 ka; and (vi) phylogenetic analysis of mammoth eDNA reveals a previously unsampled mitochondrial lineage. Our findings highlight the power of ancient environmental metagenomics to advance understanding of population histories and long-term ecological dynamics.

[Article abstract Alsos et al. 2020] Genome skimming has the potential for generating large data sets for DNA barcoding and wider biodiversity genomic studies, particularly via the assembly and annotation of full chloroplast (cpDNA) and nuclear ribosomal DNA (nrDNA) sequences. We compare the success of genome skims of 2051 herbarium specimens from Norway/Polar regions with 4604 freshly collected, silica gel dried specimens mainly from the European Alps and the Carpathians. Overall, we were able to assemble the full chloroplast genome for 67% of the samples and the full nrDNA cluster for 86%. Average insert length, cover and full cpDNA and rDNA assembly were considerably higher for silica gel dried than herbarium-preserved material. However, complete plastid genomes were still assembled for 54% of herbarium samples compared to 70% of silica dried samples. Moreover, there was comparable recovery of coding genes from both tissue sources (121 for silica gel dried and 118 for herbarium material) and only minor differences in assembly success of standard barcodes between silica dried (89% ITS2, 96% matK and rbcL) and herbarium material (87% ITS2, 98% matK and rbcL). The success rate was > 90% for all three markers in 1034 of 1036 genera in 160 families, and only Boraginaceae worked poorly, with 7 genera failing. Our study shows that large-scale genome skims are feasible and work well across most of the land plant families and genera we tested, independently of material type. It is therefore an efficient method for increasing the availability of plant biodiversity genomic data to support a multitude of downstream applications.
STAT1 transcription factor in Human HeLa S3
omicsdi.org
xml
Joel Rozowsky, Mark Gerstein, Michael P Wilson, Ghia Euskirchen, Michael Snyder. STAT1 transcription factor in Human HeLa S3 [Dataset]. https://www.omicsdi.org/dataset/arrayexpress-repository/E-GEOD-12782
Explore at:
Available download formats: xml
Authors
Joel Rozowsky, Mark Gerstein, Michael P Wilson, Ghia Euskirchen, Michael Snyder
Variables measured
Genomics
Description
We report the results of chromatin immunoprecipitation followed by high-throughput tag sequencing (ChIP-Seq) using the GA II platform from Illumina for the human transcription factor STAT1 in HeLa S3 cells. The STAT1 ChIP was performed using HeLa S3 cells that were stimulated using gamma-interferon. We have also generated a sequenced input DNA dataset for gamma-interferon stimulated HeLa S3 cells. Raw data for this study are available for download from the Short Read Archive database at: http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?study=SRP000703. For data usage terms and conditions, please refer to http://www.genome.gov/27528022 and http://www.genome.gov/Pages/Research/ENCODE/ENCODEDataReleasePolicyFinal2008.pdf Examination of the STAT1 transcription factor in Human HeLa S3.
Table S1 from Cornet et al. Consensus assessment of the contamination level...
figshare.com
bin
Updated Apr 15, 2018
Luc Cornet; Denis BAURAIN (2018). Table S1 from Cornet et al. Consensus assessment of the contamination level of publicly available cyanobacterial genomes. [Dataset]. http://doi.org/10.6084/m9.figshare.6143072.v1
Explore at:
Available download formats: bin
Unique identifier
https://doi.org/10.6084/m9.figshare.6143072.v1
Dataset updated
Apr 15, 2018
Dataset provided by
figshare
Authors
Luc Cornet; Denis BAURAIN
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This Excel file is Table S1: List of the public cyanobacterial genome assemblies used in this study. The table gives the accession, the name (binomial incl. strain), the taxonomy (order, NCBI lineage), the ecology (morphology and habitat), and the assembly properties (as determined by QUAST) of the 440 assemblies surveyed. Assemblies are sorted by NCBI lineage. (*) indicates assemblies for which raw read data are in principle available for download from NCBI SRA; (˚) indicates assemblies that are devoid of SSU rRNA (16S) classified as Cyanobacteria; (+) indicates assemblies that are too large (>15,000 kbp); (-) indicates assemblies that are too small (
Bovine Genome Project
dknet.org
scicrunch.org
Updated Sep 1, 2024
(2024). Bovine Genome Project [Dataset]. http://identifiers.org/RRID:SCR_008370
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_008370
Dataset updated
Sep 1, 2024
Description
Downloadable files of the Bos taurus genome. Draft assemblies are available for download as contigs or linearized scaffolds of the genomic sequence of cow, Bos taurus, including the final draft assembly (7.1 coverage) and the two previous assemblies. The genome is sequenced to 6- to 8-fold sequence depth, with high-quality finished sequence in some areas. Accompanying EST and SNP analyses are also included. The bovine genome assembly and analysis and the study of cattle genetic history were published in the April 24, 2009 issue of Science. The Human Genome Sequencing Center provides BLAST searches of the genome assemblies, either as contigs or as linearized chromosome sequences. The WGS sequence enriched BAC assemblies and the unassembled reads (sequencing reads that did not end up in the genome assembly) can also be searched by BLAST. Traces are available from the NCBI Trace Archive by using the link in the sidebar or by using NCBI MegaBLAST with a same species or cross species query.
Fertility decline in Aedes aegypti mosquitoes is associated with reduced...
search.dataone.org
data.niaid.nih.gov
+2 more
Updated Nov 15, 2024
Olayinka David; Matthew DeGennaro (2024). Fertility decline in Aedes aegypti mosquitoes is associated with reduced maternal transcript deposition and does not depend on female age [Dataset]. http://doi.org/10.5061/dryad.mkkwh717f
Explore at:
Unique identifier
https://doi.org/10.5061/dryad.mkkwh717f
Dataset updated
Nov 15, 2024
Dataset provided by
Dryad Digital Repository
Authors
Olayinka David; Matthew DeGennaro
Description
Female mosquitoes undergo multiple rounds of reproduction known as gonotrophic cycles. A gonotrophic cycle spans the period from blood meal intake to egg laying. Nutrients from vertebrate host blood are necessary for completing egg development. During oogenesis, a female pre-packages mRNA into her oocytes, and these maternal transcripts drive the first two hours of embryonic development before zygotic genome activation. In this study, we profiled transcriptional changes in 1-2 hour-old Aedes aegypti embryos across two gonotrophic cycles. We found that homeotic genes which are regulators of embryogenesis are downregulated in embryos from the second gonotrophic cycle. Interestingly, embryos produced by Ae. aegypti females progressively reduced their ability to hatch as the number of gonotrophic cycles increased. We show that this fertility decline is due to increased reproductive output and not the mosquitoes’ age. Moreover, we found a similar decline in fertility and fecundity across thr...

Transcriptomic data processing and analysis: RNA sequencing (RNA-seq) data for 1-2 hr old wild-type Ae. aegypti embryos were retrieved from the transcriptome reported in our previous study [1]. Each library read was trimmed to remove the adapter sequence and then mapped to gene models from the AaegL5.0 genome. Upon mapping, transcript level abundance was determined from mapped reads using tximport. Principal component and differential gene expression analyses were performed on read counts using R’s Shiny DEBrowser. Differential expression was calculated using DESeq2 at a corrected FDR of α < 0.01. Gene ontology (GO) enrichment was performed on differentially expressed genes at α < 0.01 using the bioinformatics resources on vectorbase.org, and redundant or obsolete GO terms were removed.

Fecundity and fertility assays: To determine fecundity, the number of eggs deposited per female, 5–7-day-old mated females were starved on water overnight and then provided with a blood meal to re...

Fertility decline in Aedes aegypti mosquitoes is associated with reduced maternal transcript deposition and does not depend on female age

https://doi.org/10.5061/dryad.mkkwh717f

The dataset includes transcriptome analysis of Aedes aegypti early embryos. The raw sequencing reads of this dataset are available for download at the NCBI Sequence Read Archive (SRA) and are associated with BioProject ID PRJNA957289. The dataset also contains information on Aedes aegypti and Aedes albopictus mosquito egg deposition rate and egg hatch rate over three reproductive cycles (gonotrophic cycles). The data are presented in an Excel spreadsheet with multiple tabs:

RNA-seq sample metadata (data related to Figure 2)

RNA-seq read counts (data related to Figure 2)

Downregulated genes (data related to Figure 2)

Upregulated genes (data related to Figure 2)

Figure 1b (raw data presented in figure 1b, percent eggs hatched/female for *Ae...
All published genomes used in this study, including links to the assemblies...
plos.figshare.com
xlsx
Updated Jun 6, 2023
Dan Vanderpool; Bui Quang Minh; Robert Lanfear; Daniel Hughes; Shwetha Murali; R. Alan Harris; Muthuswamy Raveendran; Donna M. Muzny; Mark S. Hibbins; Robert J. Williamson; Richard A. Gibbs; Kim C. Worley; Jeffrey Rogers; Matthew W. Hahn (2023). All published genomes used in this study, including links to the assemblies and NCBI BioProjects. [Dataset]. http://doi.org/10.1371/journal.pbio.3000954.s007
Explore at:
Available download formats: xlsx
Unique identifier
https://doi.org/10.1371/journal.pbio.3000954.s007
Dataset updated
Jun 6, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Dan Vanderpool; Bui Quang Minh; Robert Lanfear; Daniel Hughes; Shwetha Murali; R. Alan Harris; Muthuswamy Raveendran; Donna M. Muzny; Mark S. Hibbins; Robert J. Williamson; Richard A. Gibbs; Kim C. Worley; Jeffrey Rogers; Matthew W. Hahn
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Annotation information is included for each genome at the time of download. (XLSX)
Data from: Cold-water coral microbiomes (Lophelia pertusa) from Gulf of...
datadiscoverystudio.org
zip
Updated May 21, 2018
(2018). Cold-water coral microbiomes (Lophelia pertusa) from Gulf of Mexico and Atlantic Ocean: raw data. [Dataset]. http://datadiscoverystudio.org/geoportal/rest/metadata/item/01a7d94b557d42c38eaad7d3830a73b8/html
Explore at:
Available download formats: zip
Dataset updated
May 21, 2018
Area covered
Atlantic Ocean, Gulf of Mexico (Gulf of America)
Description
The files in this data release are the raw deoxyribonucleic acid (DNA) sequence files referenced in the submitted journal article by Christina A. Kellogg, Dawn B. Goldsmith and Michael A. Gray entitled "Biogeographic comparison of Lophelia-associated bacterial communities in the western Atlantic reveals conserved core microbiome". They represent a 16S ribosomal ribonucleic acid (rRNA) gene amplicon survey of the coral's microbiomes completed using Roche 454 pyrosequencing with Titanium series reagents. Samples from the Gulf of Mexico were collected in 2009 and 2010. Samples from the Atlantic Ocean were collected in 2009. The raw data files associated with this study have also been submitted to the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) under Bioproject number PRJNA305617. Minimum information about a marker gene (MIMARKS) compliant metadata is provided in "Lophelia metadata", which is included in the data download file. For more information, please contact Christina Kellogg at the U.S. Geological Survey (USGS) St. Petersburg Coastal and Marine Science Center, 600 4th Street South, St. Petersburg, Florida, USA, 33701; Telephone: (727) 502-8128; email: ckellogg@usgs.gov.
Blaskowski, Stephen (2025). MarFERReT: an open-source, version-controlled reference library of marine microbial eukaryote functional genes [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7055911

Data from: MarFERReT: an open-source, version-controlled reference library of marine microbial eukaryote functional genes

Explore at:

Dataset updated

Jan 22, 2025

Dataset provided by

Blaskowski, Stephen
Coesel, Sacha
Armbrust, E. Virginia
Groussman, Mora J

License

Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Metatranscriptomics generates large volumes of sequence data about transcribed genes in natural environments. Taxonomic annotation of these datasets depends on availability of curated reference sequences. For marine microbial eukaryotes, current reference libraries are limited by gaps in sequenced organism diversity and barriers to updating libraries with new sequence data, resulting in taxonomic annotation of only about half of eukaryotic environmental transcripts. Here, we introduce version 1.0 of the Marine Functional EukaRyotic Reference Taxa (MarFERReT), an updated marine microbial eukaryotic sequence library with a version-controlled framework designed for taxonomic annotation of eukaryotic metatranscriptomes. We gathered 902 marine eukaryote genomes and transcriptomes from multiple sources and assessed these candidate entries for sequence quality and cross-contamination issues, selecting 800 validated entries for inclusion in the library. MarFERReT v1 contains reference sequences from 800 marine eukaryotic genomes and transcriptomes, covering 453 species- and strain-level taxa, totaling nearly 28 million protein sequences with associated NCBI and PR2 Taxonomy identifiers and Pfam functional annotations. MarFERReT is linked to a code repository hosting containerized build scripts, documentation on installation and use case examples, and information on new versions of MarFERReT here: https://github.com/armbrustlab/marferret

The raw source data for the 902 candidate entries considered for MarFERReT v1.1.1, including the 800 accepted entries, are available for download from their respective online locations. The source URL for each of the entries is listed here in MarFERReT.v1.1.1.entry_curation.csv, and detailed instructions and code for downloading the raw sequence data from source are available in the MarFERReT code repository (link).

This repository release contains MarFERReT database files from the v1.1.1 MarFERReT release using the following MarFERReT library build scripts: assemble_marferret.sh, pfam_annotate.sh, and build_diamond_db.sh. The following MarFERReT data products are available in this repository:

MarFERReT.v1.1.1.metadata.csv: This CSV file contains descriptors of each of the 902 database entries, including data source, taxonomy, and sequence descriptors. Data fields are as follows:

entry_id: Unique MarFERReT sequence entry identifier.

accepted: Acceptance into the final MarFERReT build (Y/N). The Y/N values can be adjusted to customize the final build output according to user-specific needs.

marferret_name: A human and machine friendly string derived from the NCBI Taxonomy organism name; maintaining strain-level designation wherever possible.

tax_id: The NCBI Taxonomy ID (taxID).

pr2_accession: Best-matching PR2 accession ID associated with entry

pr2_rank: The lowest shared rank between the entry and the pr2_accession

pr2_taxonomy: PR2 Taxonomy classification scheme of the pr2_accession

data_type: Type of sequence data; transcriptome shotgun assemblies (TSA), gene models from assembled genomes (genome), and single-cell amplified genomes (SAG) or transcriptomes (SAT).

data_source: Online location of sequence data; the Zenodo data repository (Zenodo), the datadryad.org repository (datadryad.org), MMETSP re-assemblies on Zenodo (MMETSP) [17], NCBI GenBank (NCBI), JGI Phycocosm (JGI-Phycocosm), the TARA Oceans portal on Genoscope (TARA), or entries from the Roscoff Culture Collection through the METdb database repository (METdb).

source_link: URL where the original sequence data and/or metadata was collected.

pub_year: Year of data release or publication of linked reference.

ref_link: PubMed URL of the published reference for the entry, if available.

ref_doi: DOI of entry data from source, if available.

source_filename: Name of the original sequence file name from the data source.

seq_type: Entry sequence data retrieved in nucleotide (nt) or amino acid (aa) alphabets.

n_seqs_raw: Number of sequences in the original sequence file.

source_name: Full organism name from entry source

original_taxID: Original NCBI taxID from entry data source metadata, if available

alias: Additional identifiers for the entry, if available
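As an illustration of how the `accepted` field can drive a custom selection, the sketch below filters a metadata table and tallies accepted entries by `data_type`. It assumes the CSV has a header row using the field names listed above; the three sample rows, their names and their taxIDs are invented placeholders, not real MarFERReT entries, and only a subset of the columns is shown.

```python
import csv
import io
from collections import Counter

# Invented stand-in for MarFERReT.v1.1.1.metadata.csv (subset of columns).
sample = io.StringIO(
    "entry_id,accepted,marferret_name,tax_id,data_type\n"
    "1,Y,Example_taxon_A,1111,TSA\n"
    "2,N,Example_taxon_B,2222,genome\n"
    "3,Y,Example_taxon_C,3333,TSA\n"
)


def accepted_entries(handle):
    """Return the rows whose 'accepted' column is 'Y'."""
    return [row for row in csv.DictReader(handle) if row["accepted"] == "Y"]


rows = accepted_entries(sample)
counts = Counter(row["data_type"] for row in rows)
print(len(rows), dict(counts))
```

For the real file, the `io.StringIO` object would be replaced by an open file handle, for example `open("MarFERReT.v1.1.1.metadata.csv", newline="")`.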

MarFERReT.v1.1.1.curation.csv: This CSV file contains curation and quality-control information on the 902 candidate entries considered for incorporation into MarFERReT v1, including curated NCBI Taxonomy IDs and entry validation statistics. Data fields are as follows:

entry_id: Unique MarFERReT sequence entry identifier

marferret_name: Organism name in human and machine friendly format, including additional NCBI taxonomy strain identifiers if available.

tax_id: Verified NCBI taxID used in MarFERReT

taxID_status: Status of the final NCBI taxID (Assigned, Updated, or Unchanged)

taxID_notes: Notes on the original_taxID

n_seqs_raw: Number of sequences in the original sequence file

n_pfams: Number of Pfam domains identified in protein sequences

qc_flag: Early validation quality control flags: LOW_SEQS (fewer than 1,200 raw sequences) and LOW_PFAMS (fewer than 500 Pfam domain annotations).

flag_Lasek: Flag notes from Lasek-Nesselquist and Johnson (2019); contains the flag 'FLAG_LASEK' indicating ciliate samples reported as contaminated in this study.

VV_contam_pct: Estimated contamination reported for MMETSP entries in Van Vlierberghe et al., (2021).

flag_VanVlierberghe: Flag for a high level of estimated contamination, from 'VV_contam_pct' values over 50%: FLAG_VV.

rp63_npfams: Number of ribosomal protein Pfam domains out of 63 total.

rp63_contam_pct: Percent of total ribosomal protein sequences with an inferred taxonomic identity in any lineage other than the recorded identity, as described in the Technical Validation section from analysis of 63 Pfam ribosomal protein domains.

flag_rp63: Flag for a high level of estimated contamination, from 'rp63_contam_pct' values over 50%: FLAG_RP63.

flag_sum: Number of flag columns (qc_flag, flag_Lasek, flag_VanVlierberghe, and flag_rp63) in which the entry is flagged. All entries with one or more flags are nominally rejected ('accepted' = N); entries without any flags are validated and accepted ('accepted' = Y).

accepted: Acceptance into the final MarFERReT build (Y or N).
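The acceptance rule described by these fields can be written out as a small function. This is only a sketch of the stated thresholds, not the project's build code: it assumes that `qc_flag` counts once whether one or both of LOW_SEQS and LOW_PFAMS apply, that the Van Vlierberghe flag is driven by `VV_contam_pct`, and that a missing contamination estimate raises no flag. The function and constant names are hypothetical.

```python
from typing import Optional

MIN_RAW_SEQS = 1200     # fewer than this raises LOW_SEQS
MIN_PFAMS = 500         # fewer than this raises LOW_PFAMS
MAX_CONTAM_PCT = 50.0   # above this raises FLAG_VV or FLAG_RP63


def count_flags(n_seqs_raw: int, n_pfams: int, flagged_lasek: bool,
                vv_contam_pct: Optional[float],
                rp63_contam_pct: Optional[float]) -> int:
    """Count how many of the four flag columns are set for one entry."""
    flags = [
        n_seqs_raw < MIN_RAW_SEQS or n_pfams < MIN_PFAMS,                   # qc_flag
        flagged_lasek,                                                      # flag_Lasek
        vv_contam_pct is not None and vv_contam_pct > MAX_CONTAM_PCT,       # flag_VanVlierberghe
        rp63_contam_pct is not None and rp63_contam_pct > MAX_CONTAM_PCT,   # flag_rp63
    ]
    return sum(flags)


def accepted(flag_sum: int) -> str:
    """Entries with no flags are accepted ('Y'); any flag rejects ('N')."""
    return "Y" if flag_sum == 0 else "N"


clean = count_flags(5000, 2000, False, None, 10.0)
flagged = count_flags(1000, 100, False, 60.0, 55.0)
print(clean, accepted(clean), flagged, accepted(flagged))
```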

MarFERReT.v1.1.1.proteins.faa.gz: This Gzip-compressed FASTA file contains the 27,951,013 final translated and clustered protein sequences for all 800 accepted MarFERReT entries. The sequence defline contains the unique identifier for the sequence and its reference (mftX, where 'X' is a ten-digit integer value).

MarFERReT.v1.1.1.taxonomies.tab.gz: This Gzip-compressed tab-separated file is formatted for interoperability with the DIAMOND protein alignment tool commonly used for downstream analyses and contains some columns without any data. Each row contains an entry for one of the MarFERReT protein sequences in MarFERReT.v1.1.1.proteins.faa.gz. Note that 'accession.version' and 'taxid' are populated columns while 'accession' and 'gi' have NA values; the latter columns are required for back-compatibility as input for the DIAMOND alignment software and LCA analysis.

The columns in this file contain the following information:

accession: (NA)

accession.version: The unique MarFERReT sequence identifier ('mftX').

taxid: The NCBI Taxonomy ID associated with this reference sequence.

gi: (NA).
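A sequence-to-taxon lookup can be built from the two populated columns. The sketch below assumes the file carries a header row with the four column names in the order listed above; the two sample rows and their taxIDs are invented for illustration, and the function name is hypothetical.

```python
import csv
import io

# Invented stand-in for the decompressed taxonomies table.
sample = io.StringIO(
    "accession\taccession.version\ttaxid\tgi\n"
    "NA\tmft0000000001\t1111\tNA\n"
    "NA\tmft0000000002\t2222\tNA\n"
)


def load_taxid_map(handle):
    """Map each MarFERReT sequence ID ('mftX') to its NCBI Taxonomy ID."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["accession.version"]: int(row["taxid"]) for row in reader}


taxid_by_seq = load_taxid_map(sample)
print(taxid_by_seq["mft0000000001"])
```

For the compressed file itself, the handle could come from `gzip.open("MarFERReT.v1.1.1.taxonomies.tab.gz", "rt", newline="")`. With roughly 28 million rows, a plain dictionary may use several gigabytes of memory, so streaming the rows may be preferable.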

MarFERReT.v1.1.1.proteins_info.tab.gz: This Gzip-compressed tab-separated file contains a row for each final MarFERReT protein sequence with the following columns:

aa_id: the unique identifier for each MarFERReT protein sequence.

entry_id: The unique numeric identifier for each MarFERReT entry.

source_defline: The original, unformatted sequence identifier

MarFERReT.v1.1.1.best_pfam_annotations.csv.gz: This Gzip-compressed CSV file contains the best-scoring Pfam annotation for intra-species clustered protein sequences from the 800 validated MarFERReT entries, derived from the hmmsearch annotations against Pfam 34.0 functional domains. This file contains the following fields:

aa_id: The unique MarFERReT protein sequence ID ('mftX').

pfam_name: The shorthand Pfam protein family name.

pfam_id: The Pfam identifier.

pfam_eval: hmm profile match e-value score

pfam_score: hmm profile match bitscore

MarFERReT.v1.1.1.dmnd: This binary file is the indexed database of the MarFERReT protein library with embedded NCBI taxonomic information, generated by the DIAMOND makedb tool using the build_diamond_db.sh script from the MarFERReT /scripts/ library. It can be used as the reference DIAMOND database for annotating environmental sequences from eukaryotic metatranscriptomes.
