100+ datasets found
  1. Data from: MarFERReT: an open-source, version-controlled reference library...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 22, 2025
    Cite
    Blaskowski, Stephen (2025). MarFERReT: an open-source, version-controlled reference library of marine microbial eukaryote functional genes [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7055911
    Dataset provided by
    Coesel, Sacha
    Armbrust, E. Virginia
    Blaskowski, Stephen
    Groussman, Mora J
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Metatranscriptomics generates large volumes of sequence data about transcribed genes in natural environments. Taxonomic annotation of these datasets depends on availability of curated reference sequences. For marine microbial eukaryotes, current reference libraries are limited by gaps in sequenced organism diversity and barriers to updating libraries with new sequence data, resulting in taxonomic annotation of only about half of eukaryotic environmental transcripts. Here, we introduce version 1.0 of the Marine Functional EukaRyotic Reference Taxa (MarFERReT), an updated marine microbial eukaryotic sequence library with a version-controlled framework designed for taxonomic annotation of eukaryotic metatranscriptomes. We gathered 902 marine eukaryote genomes and transcriptomes from multiple sources and assessed these candidate entries for sequence quality and cross-contamination issues, selecting 800 validated entries for inclusion in the library. MarFERReT v1 contains reference sequences from 800 marine eukaryotic genomes and transcriptomes, covering 453 species- and strain-level taxa, totaling nearly 28 million protein sequences with associated NCBI and PR2 Taxonomy identifiers and Pfam functional annotations. An accompanying MarFERReT code repository hosts containerized build scripts, documentation on installation and use case examples, and information on new versions of MarFERReT: https://github.com/armbrustlab/marferret

    The raw source data for the 902 candidate entries considered for MarFERReT v1.1.1, including the 800 accepted entries, are available for download from their respective online locations. The source URL for each of the entries is listed here in MarFERReT.v1.1.1.entry_curation.csv, and detailed instructions and code for downloading the raw sequence data from source are available in the MarFERReT code repository (link).

    This repository release contains MarFERReT database files from the v1.1.1 MarFERReT release, built using the following MarFERReT library build scripts: assemble_marferret.sh, pfam_annotate.sh, and build_diamond_db.sh. The following MarFERReT data products are available in this repository:

    MarFERReT.v1.1.1.metadata.csv: This CSV file contains descriptors of each of the 902 database entries, including data source, taxonomy, and sequence descriptors. Data fields are as follows:

    entry_id: Unique MarFERReT sequence entry identifier.

    accepted: Acceptance into the final MarFERReT build (Y/N). The Y/N values can be adjusted to customize the final build output according to user-specific needs.

    marferret_name: A human and machine friendly string derived from the NCBI Taxonomy organism name; maintaining strain-level designation wherever possible.

    tax_id: The NCBI Taxonomy ID (taxID).

    pr2_accession: Best-matching PR2 accession ID associated with entry

    pr2_rank: The lowest shared rank between the entry and the pr2_accession

    pr2_taxonomy: PR2 Taxonomy classification scheme of the pr2_accession

    data_type: Type of sequence data; transcriptome shotgun assemblies (TSA), gene models from assembled genomes (genome), and single-cell amplified genomes (SAG) or transcriptomes (SAT).

    data_source: Online location of sequence data; the Zenodo data repository (Zenodo), the datadryad.org repository (datadryad.org), MMETSP re-assemblies on Zenodo (MMETSP), NCBI GenBank (NCBI), JGI Phycocosm (JGI-Phycocosm), the TARA Oceans portal on Genoscope (TARA), or entries from the Roscoff Culture Collection through the METdb database repository (METdb).

    source_link: URL where the original sequence data and/or metadata was collected.

    pub_year: Year of data release or publication of linked reference.

    ref_link: PubMed URL of the published reference for the entry, if available.

    ref_doi: DOI of entry data from source, if available.

    source_filename: Name of the original sequence file name from the data source.

    seq_type: Entry sequence data retrieved in nucleotide (nt) or amino acid (aa) alphabets.

    n_seqs_raw: Number of sequences in the original sequence file.

    source_name: Full organism name from entry source

    original_taxID: Original NCBI taxID from entry data source metadata, if available

    alias: Additional identifiers for the entry, if available

    MarFERReT.v1.1.1.curation.csv: This CSV file contains curation and quality-control information on the 902 candidate entries considered for incorporation into MarFERReT v1, including curated NCBI Taxonomy IDs and entry validation statistics. Data fields are as follows:

    entry_id: Unique MarFERReT sequence entry identifier

    marferret_name: Organism name in human and machine friendly format, including additional NCBI taxonomy strain identifiers if available.

    tax_id: Verified NCBI taxID used in MarFERReT

    taxID_status: Status of the final NCBI taxID (Assigned, Updated, or Unchanged)

    taxID_notes: Notes on the original_taxID

    n_seqs_raw: Number of sequences in the original sequence file

    n_pfams: Number of Pfam domains identified in protein sequences

    qc_flag: Early validation quality control flags: LOW_SEQS (fewer than 1,200 raw sequences) and LOW_PFAMS (fewer than 500 Pfam domain annotations).

    flag_Lasek: Flag notes from Lasek-Nesselquist and Johnson (2019); contains the flag 'FLAG_LASEK' indicating ciliate samples reported as contaminated in this study.

    VV_contam_pct: Estimated contamination reported for MMETSP entries in Van Vlierberghe et al., (2021).

    flag_VanVlierberghe: Flag for a high level of estimated contamination, from 'VV_contam_pct' values over 50%: FLAG_VV.

    rp63_npfams: Number of ribosomal protein Pfam domains out of 63 total.

    rp63_contam_pct: Percent of total ribosomal protein sequences with an inferred taxonomic identity in any lineage other than the recorded identity, as described in the Technical Validation section from analysis of 63 Pfam ribosomal protein domains.

    flag_rp63: Flag for a high level of estimated contamination, from 'rp63_contam_pct' values over 50%: FLAG_RP63.

    flag_sum: Number of flag columns (qc_flag, flag_Lasek, flag_VanVlierberghe, and flag_rp63) in which a flag is set. All entries with one or more flags are nominally rejected ('accepted' = N); entries without any flags are validated and accepted ('accepted' = Y).

    accepted: Acceptance into the final MarFERReT build (Y or N).
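
    The relationship between the flag columns, flag_sum, and accepted described above can be sketched as follows. This is a minimal illustration, not the MarFERReT build code: it assumes an unflagged cell is blank (the real CSV may encode missing values differently), and the function name and example rows are hypothetical.

```python
# Flag columns named in the curation file description.
FLAG_COLUMNS = ("qc_flag", "flag_Lasek", "flag_VanVlierberghe", "flag_rp63")


def summarize_flags(row):
    """Return (flag_sum, accepted) for one curation row given as a dict.

    Assumption: a column counts as flagged when it holds any non-blank text.
    """
    flag_sum = sum(1 for col in FLAG_COLUMNS if (row.get(col) or "").strip())
    accepted = "Y" if flag_sum == 0 else "N"
    return flag_sum, accepted


# Hypothetical rows for illustration only.
clean = {"qc_flag": "", "flag_Lasek": "", "flag_VanVlierberghe": "", "flag_rp63": ""}
flagged = {
    "qc_flag": "LOW_PFAMS",
    "flag_Lasek": "",
    "flag_VanVlierberghe": "FLAG_VV",
    "flag_rp63": "",
}

print(summarize_flags(clean))
print(summarize_flags(flagged))
```

    Counting columns rather than individual flag strings follows the flag_sum description; a qc_flag cell carrying both LOW_SEQS and LOW_PFAMS would still count once under this reading.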

    MarFERReT.v1.1.1.proteins.faa.gz: This Gzip-compressed FASTA file contains the 27,951,013 final translated and clustered protein sequences for all 800 accepted MarFERReT entries. The sequence defline contains the unique identifier for the sequence and its reference (mftX, where 'X' is a ten-digit integer value).

    MarFERReT.v1.1.1.taxonomies.tab.gz: This Gzip-compressed tab-separated file is formatted for interoperability with the DIAMOND protein alignment tool commonly used for downstream analyses and contains some columns without any data. Each row contains an entry for one of the MarFERReT protein sequences in MarFERReT.v1.1.1.proteins.faa.gz. Note that 'accession.version' and 'taxid' are populated columns while 'accession' and 'gi' have NA values; the latter columns are required for backward compatibility as input for the DIAMOND alignment software and LCA analysis.

    The columns in this file contain the following information:

    accession: (NA)

    accession.version: The unique MarFERReT sequence identifier ('mftX').

    taxid: The NCBI Taxonomy ID associated with this reference sequence.

    gi: (NA).
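
    As a rough sketch of how the four-column layout above could be read, the snippet below builds a protein-to-taxID lookup from an in-memory excerpt. It assumes the file carries a header row with exactly these column names; the sample identifiers and taxIDs are made up for illustration, and load_taxid_map is a hypothetical helper.

```python
import csv
import io

# Hypothetical excerpt in the described layout; values are illustrative only.
SAMPLE = (
    "accession\taccession.version\ttaxid\tgi\n"
    "NA\tmft0000000001\t2864\tNA\n"
    "NA\tmft0000000002\t35128\tNA\n"
)


def load_taxid_map(handle):
    """Map each MarFERReT protein ID ('mftX') to its NCBI taxID.

    Assumption: the first line is a header naming the four columns.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["accession.version"]: int(row["taxid"]) for row in reader}


taxid_by_protein = load_taxid_map(io.StringIO(SAMPLE))
print(taxid_by_protein["mft0000000001"])
```

    For the compressed file itself, the same function could be given a text-mode handle from gzip.open(path, "rt").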

    MarFERReT.v1.1.1.proteins_info.tab.gz: This Gzip-compressed tab-separated file contains a row for each final MarFERReT protein sequence with the following columns:

    aa_id: the unique identifier for each MarFERReT protein sequence.

    entry_id: The unique numeric identifier for each MarFERReT entry.

    source_defline: The original, unformatted sequence identifier

    MarFERReT.v1.1.1.best_pfam_annotations.csv.gz: This Gzip-compressed CSV file contains the best-scoring Pfam annotation for intra-species clustered protein sequences from the 800 validated MarFERReT entries, derived from the hmmsearch annotations against Pfam 34.0 functional domains. This file contains the following fields:

    aa_id: The unique MarFERReT protein sequence ID ('mftX').

    pfam_name: The shorthand Pfam protein family name.

    pfam_id: The Pfam identifier.

    pfam_eval: hmm profile match e-value score

    pfam_score: hmm profile match bitscore
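
    To illustrate what a "best-scoring" annotation per protein could look like, the sketch below keeps the hit with the highest bitscore for each aa_id. The description does not state the exact selection rule used to build this file, so ranking by pfam_score is an assumption; the function name, Pfam names, and identifiers are placeholders.

```python
def best_annotation_per_protein(hits):
    """Keep one hit per aa_id, choosing the highest pfam_score (bitscore).

    Assumption: 'best-scoring' means the largest bitscore; ties keep the
    first hit encountered.
    """
    best = {}
    for hit in hits:
        current = best.get(hit["aa_id"])
        if current is None or hit["pfam_score"] > current["pfam_score"]:
            best[hit["aa_id"]] = hit
    return best


# Placeholder hits for illustration only.
hits = [
    {"aa_id": "mft0000000001", "pfam_name": "example_A", "pfam_id": "PF_EXAMPLE_1",
     "pfam_eval": 1e-20, "pfam_score": 75.2},
    {"aa_id": "mft0000000001", "pfam_name": "example_B", "pfam_id": "PF_EXAMPLE_2",
     "pfam_eval": 1e-35, "pfam_score": 120.4},
    {"aa_id": "mft0000000002", "pfam_name": "example_A", "pfam_id": "PF_EXAMPLE_1",
     "pfam_eval": 1e-10, "pfam_score": 40.0},
]

best = best_annotation_per_protein(hits)
for aa_id, hit in sorted(best.items()):
    print(aa_id, hit["pfam_id"], hit["pfam_score"])
```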

    MarFERReT.v1.1.1.dmnd: This binary file is the indexed database of the MarFERReT protein library with embedded NCBI taxonomic information, generated by the DIAMOND makedb tool using the build_diamond_db.sh script from the MarFERReT /scripts/ library. It can be used as the reference DIAMOND database for annotating environmental sequences from eukaryotic metatranscriptomes.

  2. NCBIFAM

    • ebi.ac.uk
    Updated Dec 16, 2024
    Cite
    (2024). NCBIFAM [Dataset]. https://www.ebi.ac.uk/interpro/
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    NCBIfam is a collection of protein families, featuring curated multiple sequence alignments, hidden Markov models (HMMs) and annotation, which provides a tool for identifying functionally related proteins based on sequence homology. NCBIfam is maintained at the National Center for Biotechnology Information (Bethesda, MD). NCBIfam includes models from TIGRFAMs, another database of protein families developed at The Institute for Genomic Research, then at the J. Craig Venter Institute (Rockville, MD, US).

  3. Entrez Gene Extract

    • data.wu.ac.at
    Updated May 15, 2014
    Cite
    (2014). Entrez Gene Extract [Dataset]. https://data.wu.ac.at/schema/linkeddatacatalog_dws_informatik_uni-mannheim_de/YzE1YTAyZWQtNDViYS00ZjVhLWJhYTEtNWU3MjcwYjU2YmUy
    Description

    Data exposed: Entrez Gene Extract from [ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info.gz]
    Size of dump and data set: 5.6 MB
    Notes: NCBI Copyright and Disclaimers

  4. Data from: NCBI Taxonomy

    • gbif.org
    • demo.gbif-test.org
    Updated Feb 19, 2015
    Cite
    GBIF (2015). NCBI Taxonomy [Dataset]. http://doi.org/10.15468/rhydar
    Dataset provided by
    Global Biodiversity Information Facility https://www.gbif.org/
    National Center for Biotechnology Information http://www.ncbi.nlm.nih.gov/
    Description

    The NCBI taxonomy database is not a primary source for taxonomic or phylogenetic information. Furthermore, the database does not follow a single taxonomic treatise but rather attempts to incorporate phylogenetic and taxonomic knowledge from a variety of sources, including the published literature, web-based databases, and the advice of sequence submitters and outside taxonomy experts. Consequently, the NCBI taxonomy database is not a phylogenetic or taxonomic authority and should not be cited as such.

  5. UniSTS

    • neuinfo.org
    • scicrunch.org
    • +1more
    Updated Aug 5, 2024
    Cite
    (2024). UniSTS [Dataset]. http://identifiers.org/RRID:SCR_006843
    Description

    THIS RESOURCE IS NO LONGER IN SERVICE, documented August 22, 2016. Database of sequence tagged sites (STSs) derived from STS-based maps and other experiments. STSs are defined by PCR primer pairs and are associated with additional information such as genomic position, genes, and sequences. Chromosome maps are labeled by name of the originating organism, the map title, total markers, total UniSTSs and links to view maps as well as research documents available through PubMed, another NCBI database. The search functions within UniSTS allow the user to search by gene marker, chromosome, gene symbol and gene description terms to locate markers on specified genes. A representation of the UniSTS datasets is available by ftp. NOTE: All data from this resource have been moved to the Probe database, http://www.ncbi.nlm.nih.gov/probe. You can retrieve all UniSTS records by searching the Probe database using the search term unists(properties) (use brackets instead of parentheses). Additionally, legacy data remain on the NCBI FTP site in the UniSTS repository (ftp://ftp.ncbi.nih.gov/pub/ProbeDB/legacy_unists).

  6. Entrez Gene

    • data.wu.ac.at
    example/rdf+xml, rdf
    Updated May 15, 2014
    Cite
    (2014). Entrez Gene [Dataset]. https://data.wu.ac.at/odso/linkeddatacatalog_dws_informatik_uni-mannheim_de/OTU4NDE5YTEtMzczMS00N2RiLWFmZmMtMGQ2MWJkOThmMjVl
    License

    U.S. Government Works https://www.usa.gov/government-works
    License information was derived automatically

    Description

    About

    Data exposed: Select fields from Entrez Gene records
    Size of dump and data set: 7.7 MB
    Notes: NCBI Copyright and Disclaimers

    Openness

    Data appears to be in public domain. Disclaimer says:

    Information that is created by or for the US government on this site is within the public domain. Public domain information on the National Library of Medicine (NLM) Web pages may be freely distributed and copied. However, it is requested that in any subsequent use of this work, NLM be given appropriate acknowledgment.

  7. Data from: Bradysia coprophila genome annotations Bcop_v1.0

    • agdatacommons.nal.usda.gov
    application/x-gzip
    Updated May 6, 2025
    Cite
    John Urban (2025). Bradysia coprophila genome annotations Bcop_v1.0 [Dataset]. http://doi.org/10.15482/USDA.ADC/1522618
    Dataset provided by
    Ag Data Commons
    Authors
    John Urban
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    This dataset presents the Bradysia coprophila genome annotations Bcop_v1.0. It will be used as a starting point to manually improve annotations. The annotations were generated using Maker2. Highly detailed bioinformatic methods information can be found in the supplemental material of our preprint titled "Single-molecule sequencing of long DNA molecules allows high contiguity de novo genome assembly for the fungus fly, Sciara coprophila" (doi: https://doi.org/10.1101/2020.02.24.963009); see the Table of Contents therein. A far briefer description is below. Note that Sciara coprophila is synonymous with Bradysia coprophila, and was used in the title of our publication for historical reasons.

    Repeat library used for masking: Species-specific repeat libraries were built using RepeatModeler. A more comprehensive repeat library was created by adding previously known repeat sequences from Bradysia coprophila and all Arthropod repeats in the RepeatMasker Combined Database: Dfam_Consensus-20181026, RepBase-20181026. The comprehensive repeat library was used with RepeatMasker as part of the Maker2 pipeline.

    Automated gene finding: To predict protein-coding genes, Maker2 was used to take advantage of three sources of evidence: RNA-seq expression evidence, homology, and gene prediction. RNA-seq data from both male and female embryos, larvae, pupae, and adults were combined to create transcriptome assemblies using Trinity (de novo) and HiSat2 followed by StringTie (genome-guided). The transcriptome assemblies were used as EST evidence in Maker2. Transcript and protein sequences from related species were used for homology evidence. Three gene predictors were used: Augustus, SNAP, and GeneMark-ES. See the supplemental materials in our preprint for more information on iterative Maker2 rounds, training each gene predictor, RNA-seq methods, and transcriptome assembly generation. The Maker2 gene annotations of the final round were evaluated using annotation edit distances, BUSCO, RSEM-Eval, and TransRate.

    Functional information: InterProScan was used to identify Pfam domains and GO terms from predicted protein sequences, and BLASTp was used to find best matches to curated proteins in the UniProtKB/Swiss-Prot database.

    Resources in this dataset:

    Resource Title: Bradysia coprophila genome annotations Bcop_v1.0.
    File Name: bradysia_coprophila.bcop_v1.0.tar.gz

    Resource Description:

    Primary file:

    • Bradysia_coprophila.Bcop_v1.0_gene_set.gff - Contains automated annotations from Maker2 (described in https://doi.org/10.1101/2020.02.24.963009). This is the main file in this tar archive. The reference genome fasta is available from GenBank: https://www.ncbi.nlm.nih.gov/assembly/GCA_014529535.1. The Seqid in Column 1 of this gff3 file corresponds to the 'Sequence name' in the GenBank assembly report: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/014/529/535/GCA_014529535.1_BU_Bcop_v1/GCA_014529535.1_BU_Bcop_v1_assembly_report.txt

    Supplementary files:

    • Bradysia_coprophila.Bcop_v1.0_evidence.rnd3.gff - Contains aligned evidence Maker2 used.

    • Bradysia_coprophila.Bcop_v1.0_masked_genome.rnd3.gff - Contains coordinates for masked regions of the genome as seen by Maker2.

    • Bradysia_coprophila.Bcop_v1.0_proteins_with_putative_function.fasta - Contains predicted protein sequences

    • Bradysia_coprophila.Bcop_v1.0_transcripts_with_putative_function.fasta - Contains predicted transcript sequences

  8. Distinguishing between canonical and non-canonical tRNA genes reveals that...

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip, bin +3
    Updated Jul 5, 2022
    Cite
    Peter T.S. van der Gulik; Martijn Egas; Ken Kraaijeveld; Nina Dombrowski; Astrid T. Groot; Anja Spang; Jenna Gallie (2022). Distinguishing between canonical and non-canonical tRNA genes reveals that Thermococcaceae adhere to the standard archaeal tRNA gene set [Dataset]. http://doi.org/10.5281/zenodo.6782366
    Available download formats: txt, bin, html, zip, application/gzip
    Dataset provided by
    Zenodo http://zenodo.org/
    Authors
    Peter T.S. van der Gulik; Martijn Egas; Ken Kraaijeveld; Nina Dombrowski; Astrid T. Groot; Anja Spang; Jenna Gallie
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Abstract

    Automated genome annotation is an essential tool for extracting biological information from sequence data. The identification and annotation of tRNA genes is frequently performed by the software package tRNAscan-SE, the output of which is listed – for selected genomes – in the Genomic tRNA database (GtRNAdb). Given the central role of tRNA in molecular biology, the accuracy and proper application of tRNAscan-SE is important for both interpretation of the output, and continued improvement of the software. Here, we report a manual annotation of the predicted tRNA gene sets for 20 complete genomes from the archaeal taxon Thermococcaceae. According to GtRNAdb, these 20 genomes contain a number of putative deviations from the standard set of canonical tRNA genes in Archaea. However, manual annotation reveals that only one represents a true divergence; the other instances are either (i) non-canonical tRNA genes resulting from the integration of horizontally transferred genetic elements, or CRISPR-Cas activity, or (ii) attributable to errors in the input DNA sequence. To distinguish between canonical and non-canonical archaeal tRNA genes, we recommend using a combination of automated pseudogene detection by tRNAscan-SE and the tRNAscan-SE isotype score, greatly reducing manual annotation efforts and leading to improved predictions of tRNA gene sets in Archaea.

    Repository contents

    01_workflow_tRNAscanSE_predictions_210archaea.html contains the workflow and graphical output for tRNA gene set predictions in 20 Thermococcaceae genomes and 210 archaeal genomes. Files 03 to 06 below are the files quoted in this workflow.

    02_workflow_tRNAscanSE_predictions_210archaea.Rmd contains the markdown file associated with 01_workflow_tRNAscanSE_predictions_210archaea.html above.

    03_thermo_trnas_GtRNAdb.txt contains the predicted tRNA gene sets of 20 Thermococcaceae genomes as listed on GtRNAdb (Data Release 19 (June 2021)).

    04_Archaea_genome_list.txt contains the details of all 217 archaeal genomes listed on GtRNAdb (Data Release 19 (June 2021)). The seven genomes for which the NCBI genome sequences were no longer available are indicated by #### preceding the name.

    05_thermo_tRNAs_genome.txt contains the predicted tRNA gene sets of 20 Thermococcaceae genomes as predicted by locally run tRNAscan-SE (version 2.0.6), with standard settings for Archaea (option -A). To display the output, options -H and --detail were added. We note that pseudogene detection is active under these conditions.

    06_Archaea_210_GtRNAdb_tRNAs.txt contains the predicted tRNA gene sets of the 210 archaeal genomes as listed on GtRNAdb (Data Release 19 (June 2021)).

    07_Archaea_210genomes_tRNAs.txt contains the predicted tRNA gene sets of the 210 archaeal genomes as predicted by locally run tRNAscan-SE (version 2.0.6), with standard settings for Archaea (option -A). To display the output, options -H and --detail were added. We note that pseudogene detection is active under these conditions.

    08_NCBI_genomes.zip contains the NCBI GenBank genome sequence files used in this study. These include the 20 Thermococcaceae genomes, the wider 210 archaeal genomes, and several others of interest.

    09_phylogeny.tar.zip contains the data used to draw a phylogenetic tree for the 20 Thermococcaceae organisms. The folder includes a file listing the details of all data in the folder (Readme.md), a workflow file (workflow_UndinMarkers_v2.md), and data folders.

    Notes

    The extended TIGRFAM database referred to in the phylogenetic tree construction process can be found at https://zenodo.org/record/3839790#.YjByaVzMI3g

    The perl script used during phylogenetic tree construction, catfasta2phyml.pl, is available in the GitHub repository https://github.com/nylander/catfasta2phyml

    tRNAscan-SE is a freely available resource available online (http://lowelab.ucsc.edu/tRNAscan-SE/)

    GtRNAdb is a publicly accessible resource available online (http://gtrnadb.ucsc.edu/)

    NCBI is a publicly accessible resource available online (https://www.ncbi.nlm.nih.gov/)

    rrnDB is a publicly accessible resource available online (https://rrndb.umms.med.umich.edu/)

    BLAST is a publicly accessible resource available online (https://blast.ncbi.nlm.nih.gov/Blast.cgi)

  9. Curated Phage Database (CPD) fasta file

    • data.niaid.nih.gov
    Updated Oct 7, 2022
    Cite
    Barkal, Layla (2022). Curated Phage Database (CPD) fasta file [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7154235
    Dataset provided by
    Barkal, Layla
    Haddock, Naomi
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the fasta file of phage genomes used to generate the Curated Phage Database (CPD) used in our manuscript under review, "The circulating phageome reflects bacterial infections". The corresponding phage characteristic data will be provided in the manuscript as a supplemental file and can be used to connect a GenBank ID to the identified bacteriophage host and phage taxonomic information, if known.

    Please note that this database is built from phage sequences in the NCBI nucleotide repository. Because the field is biased towards sequencing human disease-related bacteria and their phages, the database reflects this bias: it is most representative of bacteriophages associated with human pathogens and underrepresents environmental phages. Keep this limitation in mind when using the database to interpret potential phage sequences.

  10. CDD

    • ebi.ac.uk
    Updated Apr 18, 2024
    Cite
    (2024). CDD [Dataset]. https://www.ebi.ac.uk/interpro/
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    CDD is a protein annotation resource that consists of a collection of annotated multiple sequence alignment models for ancient domains and full-length proteins. These are available as position-specific score matrices (PSSMs) for fast identification of conserved domains in protein sequences via RPS-BLAST. CDD content includes NCBI-curated domain models, which use 3D-structure information to explicitly define domain boundaries and provide insights into sequence/structure/function relationships, as well as domain models imported from a number of external source databases.

  11. ncbi_disease

    • huggingface.co
    Updated Sep 2, 2023
    Cite
    NLM/DIR BioNLP Group (2023). ncbi_disease [Dataset]. https://huggingface.co/datasets/ncbi/ncbi_disease
    Dataset authored and provided by
    NLM/DIR BioNLP Group
    License

    https://choosealicense.com/licenses/unknown/

    Description

    This paper presents the disease name and concept annotations of the NCBI disease corpus, a collection of 793 PubMed abstracts fully annotated at the mention and concept level to serve as a research resource for the biomedical natural language processing community. Each PubMed abstract was manually annotated by two annotators with disease mentions and their corresponding concepts in Medical Subject Headings (MeSH®) or Online Mendelian Inheritance in Man (OMIM®). Manual curation was performed using PubTator, which allowed the use of pre-annotations as a pre-step to manual annotations. Fourteen annotators were randomly paired and differing annotations were discussed for reaching a consensus in two annotation phases. In this setting, a high inter-annotator agreement was observed. Finally, all results were checked against annotations of the rest of the corpus to assure corpus-wide consistency.

    For more details, see: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3951655/

    The original dataset can be downloaded from: https://www.ncbi.nlm.nih.gov/CBBresearch/Dogan/DISEASE/NCBI_corpus.zip This dataset has been converted to CoNLL format for NER using the following tool: https://github.com/spyysalo/standoff2conll Note: there is a duplicate document (PMID 8528200) in the original data, and the duplicate is recreated in the converted data.

  12. [Metabarcoding zooplankton at station ALOHA: NCBI SRA accession numbers] -...

    • erddap.bco-dmo.org
    Updated Aug 2, 2017
    Cite
    BCO-DMO (2017). [Metabarcoding zooplankton at station ALOHA: NCBI SRA accession numbers] - NCBI Sequence Read Archive (SRA) accession numbers for fastq sequence files for each zooplankton community sample (Plankton Population Genetics project) (Basin-scale genetics of marine zooplankton) [Dataset]. https://erddap.bco-dmo.org/erddap/info/bcodmo_dataset_700961/index.html
    Dataset provided by
    Biological and Chemical Oceanographic Data Management Office (BCO-DMO)
    Authors
    BCO-DMO
    License

    https://www.bco-dmo.org/dataset/700961/license

    Variables measured
    title, latitude, platform, longitude, library_ID, analysis_name, biosample_link, library_layout, library_source, bioproject_link, and 6 more
    Description

    These data include sample information and accession links to sequence data at The National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA).

    This data submission consists of metabarcoding data for the zooplankton community in the epipelagic, mesopelagic and upper bathypelagic zones (0-1500m) of the North Pacific Subtropical Gyre. The goal of this study was to assess the hidden diversity present in zooplankton assemblages in midwaters, and detect vertical gradients in species richness, depth distributions, and community composition of the full zooplankton assemblage. Samples were collected in June 2014 from Station ALOHA (22.75, -158) using a 1 meter square Multiple Opening and Closing Nets and Environmental Sampling System (MOCNESS, 200um mesh), on R/V Falkor cruise FK140613. Next generation sequence data (Illumina MiSeq, V3 chemistry, 300-bp paired-end) of the zooplankton assemblage derive from amplicons of the V1-V2 region of 18S rRNA (primers described in Fonseca et al. 2010). The data includes sequences and read count abundance information for molecular OTUs from both holoplanktonic and meroplanktonic taxa.

    Related dataset containing OTU tables and fasta sequences (representative / most abundant read for each OTU):
    Metabarcoding zooplankton at station ALOHA: OTU tables and fasta files
    access_formats=.htmlTable,.csv,.json,.mat,.nc,.tsv,.esriCsv,.geoJson
    acquisition_description=SAMPLE INFORMATION

    Sample identifiers include the following codes.

    MOCNESS tow
    FA3: Night sampling
    FA4: Day sampling

    Depth range:
    N1: 1500-1000m
    N2: 1000-700m
    N3: 700-500m
    N4: 500-300m
    N5: 300-200m
    N6: 200-150m
    N7: 150-100m
    N8: 100-50m
    N9: 50m-0m

    Wet-sieved zooplankton size fractions
    SF1: 0.2-0.5 mm
    SF2: 0.5-1.0 mm
    SF3: 1.0-2.0 mm awards_0_award_nid=473046 awards_0_award_number=OCE-1255697 awards_0_data_url=http://www.nsf.gov/awardsearch/showAward?AWD_ID=1255697 awards_0_funder_name=NSF Division of Ocean Sciences awards_0_funding_acronym=NSF OCE awards_0_funding_source_nid=355 awards_0_program_manager=David L. Garrison awards_0_program_manager_nid=50534 awards_1_award_nid=537990 awards_1_award_number=OCE-1338959 awards_1_data_url=http://www.nsf.gov/awardsearch/showAward.do?AwardNumber=1338959 awards_1_funder_name=NSF Division of Ocean Sciences awards_1_funding_acronym=NSF OCE awards_1_funding_source_nid=355 awards_1_program_manager=David L. Garrison awards_1_program_manager_nid=50534 awards_2_award_nid=539716 awards_2_award_number=OCE-1029478 awards_2_data_url=http://www.nsf.gov/awardsearch/showAward.do?AwardNumber=1029478 awards_2_funder_name=NSF Division of Ocean Sciences awards_2_funding_acronym=NSF OCE awards_2_funding_source_nid=355 awards_2_program_manager=David L. Garrison awards_2_program_manager_nid=50534 cdm_data_type=Other comment=ALOHA Zooplankton metabarcoding: SRA PI: Erica Goetze data version: 07 Jun 2017 Conventions=COARDS, CF-1.6, ACDD-1.3 data_source=extract_data_as_tsv version 2.3 19 Dec 2019 defaultDataQuery=&time<now doi=10.1575/1912/bco-dmo.704665 Easternmost_Easting=-158.0 geospatial_lat_max=22.75 geospatial_lat_min=22.75 geospatial_lat_units=degrees_north geospatial_lon_max=-158.0 geospatial_lon_min=-158.0 geospatial_lon_units=degrees_east infoUrl=https://www.bco-dmo.org/dataset/700961 institution=BCO-DMO instruments_0_acronym=Automated Sequencer instruments_0_dataset_instrument_description=Illumina MiSeq using V3 chemistry (300-bp, paired-end) instruments_0_dataset_instrument_nid=700970 instruments_0_description=General term for a laboratory instrument used for deciphering the order of bases in a strand of DNA. Sanger sequencers detect fluorescence from different dyes that are used to identify the A, C, G, and T extension reactions. 
Contemporary or Pyrosequencer methods are based on detecting the activity of DNA polymerase (a DNA synthesizing enzyme) with another chemoluminescent enzyme. Essentially, the method allows sequencing of a single strand of DNA by synthesizing the complementary strand along it, one base pair at a time, and detecting which base was actually added at each step. instruments_0_instrument_name=Automated DNA Sequencer instruments_0_instrument_nid=649 instruments_0_supplied_name=Illumina MiSeq instruments_1_acronym=Thermal Cycler instruments_1_dataset_instrument_nid=700972 instruments_1_description=General term for a laboratory apparatus commonly used for performing polymerase chain reaction (PCR). The device has a thermal block with holes where tubes with the PCR reaction mixtures can be inserted. The cycler then raises and lowers the temperature of the block in discrete, pre-programmed steps.

    (adapted from http://serc.carleton.edu/microbelife/research_methods/genomics/pcr.html) instruments_1_instrument_name=PCR Thermal Cycler instruments_1_instrument_nid=471582 instruments_1_supplied_name=quantitative PCR by the Evolutionary Genetics Core Facility (Hawaii Institute of Marine Biology) instruments_2_acronym=Bioanalyzer instruments_2_dataset_instrument_nid=700971 instruments_2_description=A Bioanalyzer is a laboratory instrument that provides the sizing and quantification of DNA, RNA, and proteins. One example is the Agilent Bioanalyzer 2100. instruments_2_instrument_name=Bioanalyzer instruments_2_instrument_nid=626182 instruments_2_supplied_name=Agilent 2100 Bioanalyzer metadata_source=https://www.bco-dmo.org/api/dataset/700961 Northernmost_Northing=22.75 param_mapping={'700961': {'lat': 'master - latitude', 'lon': 'master - longitude'}} parameter_source=https://www.bco-dmo.org/mapserver/dataset/700961/parameters people_0_affiliation=University of Hawaii people_0_person_name=Erica Goetze people_0_person_nid=473048 people_0_role=Principal Investigator people_0_role_type=originator people_1_affiliation=University of Hawaii people_1_person_name=Erica Goetze people_1_person_nid=473048 people_1_role=Contact people_1_role_type=related people_2_affiliation=Woods Hole Oceanographic Institution people_2_affiliation_acronym=WHOI BCO-DMO people_2_person_name=Amber York people_2_person_nid=643627 people_2_role=BCO-DMO Data Manager people_2_role_type=related project=Plankton Population Genetics projects_0_acronym=Plankton Population Genetics projects_0_description=Description from NSF award abstract: Marine zooplankton show strong ecological responses to climate change, but little is known about their capacity for evolutionary response. Many authors have assumed that the evolutionary potential of zooplankton is limited. 
However, recent studies provide circumstantial evidence for the idea that selection is a dominant evolutionary force acting on these species, and that genetic isolation can be achieved at regional spatial scales in pelagic habitats. This RAPID project will take advantage of a unique opportunity for basin-scale transect sampling through participation in the Atlantic Meridional Transect (AMT) cruise in 2014. The cruise will traverse more than 90 degrees of latitude in the Atlantic Ocean and include boreal-temperate, subtropical and tropical waters. Zooplankton samples will be collected along the transect, and mitochondrial and microsatellite markers will be used to identify the geographic location of strong genetic breaks within three copepod species. Bayesian and coalescent analytical techniques will test if these regions act as dispersal barriers. The physiological condition of animals collected in distinct ocean habitats will be assessed by measurements of egg production (at sea) as well as body size (condition index), dry weight, and carbon and nitrogen content. The PI will test the prediction that ocean regions that serve as dispersal barriers for marine holoplankton are areas of poor-quality habitat for the target species, and that this is a dominant mechanism driving population genetic structure in oceanic zooplankton. Note: This project is funded by an NSF RAPID award. This RAPID grant supported the shiptime costs, and all the sampling reported in the AMT24 zooplankton ecology cruise report (PDF). 
Online science outreach blog at: https://atlanticplankton.wordpress.com projects_0_end_date=2015-11 projects_0_geolocation=Atlantic Ocean, 46 N - 46 S projects_0_name=Basin-scale genetics of marine zooplankton projects_0_project_nid=537991 projects_0_start_date=2013-12 sourceUrl=(local files) Southernmost_Northing=22.75 standard_name_vocabulary=CF Standard Name Table v55 subsetVariables=latitude,longitude,bioproject_accession,title,library_strategy,library_source,library_selection,library_layout,platform,instrument_model,design_description,bioproject_link version=1 Westernmost_Easting=-158.0 xml_source=osprey2erddap.update_xml() v1.3
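The sample identifier codes listed in the sample information above (MOCNESS tow, depth range, size fraction) can be turned into a small lookup. The following is a minimal sketch, not the project's own tooling: the code tables are copied from the description, but the underscore-separated `TOW_NET_FRACTION` layout of a sample ID, the `Sample` type, and the `decode_sample_id` helper are assumptions made for illustration.

```python
from typing import NamedTuple, Tuple

# Code tables copied from the sample information in the dataset description.
TOWS = {"FA3": "night", "FA4": "day"}
DEPTH_RANGES_M = {  # (shallow, deep) bounds in meters
    "N1": (1000, 1500),
    "N2": (700, 1000),
    "N3": (500, 700),
    "N4": (300, 500),
    "N5": (200, 300),
    "N6": (150, 200),
    "N7": (100, 150),
    "N8": (50, 100),
    "N9": (0, 50),
}
SIZE_FRACTIONS_MM = {"SF1": (0.2, 0.5), "SF2": (0.5, 1.0), "SF3": (1.0, 2.0)}


class Sample(NamedTuple):
    tow_time: str
    depth_m: Tuple[int, int]
    size_mm: Tuple[float, float]


def decode_sample_id(sample_id: str, sep: str = "_") -> Sample:
    """Decode a sample ID. The TOW_NET_FRACTION layout is an assumption."""
    tow, net, fraction = sample_id.split(sep)
    return Sample(TOWS[tow], DEPTH_RANGES_M[net], SIZE_FRACTIONS_MM[fraction])


# Hypothetical ID: night tow, 0-50 m net, 0.2-0.5 mm size fraction.
print(decode_sample_id("FA3_N9_SF1"))
```

If the real identifiers use a different separator or field order, only `decode_sample_id` needs to change; the code tables stay as described.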

  13. ProSynTaxDB: Prochlorococcus and Synechococcus Taxonomy Database

    • zenodo.org
    bin, csv, tsv
    Updated Mar 18, 2025
    Cite
    Allison Coe; James I. Mullet; Nhi N. Vo; Maya Anjur-Dietrich; Paul M. Berube; Eli Salcedo; Sierra Parker; Konnor VonEmster; Christina Bliem; Aldo Arellano; Kurt Castro; Jamie Becker; Sallie Chisholm (2025). ProSynTaxDB: Prochlorococcus and Synechococcus Taxonomy Database [Dataset]. http://doi.org/10.5281/zenodo.14889681
    Explore at:
    Available download formats: tsv, bin, csv
    Dataset updated
    Mar 18, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Allison Coe; James I. Mullet; Nhi N. Vo; Maya Anjur-Dietrich; Paul M. Berube; Eli Salcedo; Sierra Parker; Konnor VonEmster; Christina Bliem; Aldo Arellano; Kurt Castro; Jamie Becker; Sallie Chisholm
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    INTRODUCTION:

    Understanding the distribution of the abundant and closely related picocyanobacteria, Prochlorococcus and Synechococcus, is essential for understanding marine ecosystems. These organisms are highly diverse, making the accurate classification of clusters/clades/grades within each genus challenging. As a result, Prochlorococcus and Synechococcus populations are often characterized as a single strain, concealing the well-documented fine-scale niche partitioning within these groups (Johnson, et al., 2006, Hunter-Cevera, et al., 2016, Larkin, et al., 2016, Kent, et al., 2019, Thompson, et al., 2021, Ustick et al., 2023). Here, we introduce ProSynTaxDB and its associated workflow, designed to significantly improve metagenomic classification of Prochlorococcus and Synechococcus using a substantial amount of high-quality genomic reference data collected over the past decade and from this study. ProSynTaxDB includes 1,260 single-cell amplified genomes, high-quality draft cultured genomes, and unpublished closed genomes, featuring new closed circular assemblies for 40 Prochlorococcus, 12 Synechococcus, and 10 marine heterotrophic bacterial strains. This includes 21 Prochlorococcus genomes that were previously partially assembled (Biller et al., 2014) and 16 genomes from unpublished isolates. Additionally, the database includes 27,799 genomes of marine heterotrophic bacteria, archaea, and viruses to assess communities surrounding Prochlorococcus and Synechococcus. ProSynTaxDB and the accompanying workflow can accurately identify clades in metagenomic samples containing at least 0.60% Prochlorococcus reads or 0.09% Synechococcus reads, thereby improving our understanding of these picocyanobacteria in low-abundance regions.

    GitHub repository for the associated workflow: https://github.com/jamesm224/ProSynTaxDB-workflow

    FILE DESCRIPTION:

    ProSynTaxDB_genomes.tsv

    Table of genomes included in the ProSynTaxDB and their associated metadata (Data Citation 1). Data fields are as follows:

    organism: The name of the organism recorded in NCBI when available. For genomes/organisms obtained from sources other than NCBI, the organism name is provided in NCBI format

    genome_short_name: The genome name used in the ProSynTaxDB

    domain: Bacteria, Archaea, Eukarya, or Virus

    genus: The genus of the organism in NCBI

    clade: The major cluster/clade/grade of Prochlorococcus or Synechococcus based on phylogenetic reconstruction using a concatenated alignment of proteins encoded by single-copy core genes

    NCBI_BioProject: The NCBI BioProject accession number associated with the organism, when available

    NCBI_BioSample: The NCBI BioSample accession number associated with the organism, when available

    NCBI_GenBank: The NCBI GenBank accession number associated with the genome sequence data, when available

    IMG_Genome_ID: The IMG Genome ID accession number, when available, associated with the genome/organism in the Joint Genome Institute’s (JGI) Integrated Microbial Genomes (IMG) repository. The IMG Genome ID is synonymous with the IMG Taxon ID

    ProSynTaxDB_names.dmp

    Names taxonomy file for use with the ProSynTaxDB.

    ProSynTaxDB_nodes.dmp

    Nodes taxonomy file for use with the ProSynTaxDB.

    ProSynTaxDB.fmi

    Index file containing contents of ProSynTaxDB_v1.faa for use with ProSynTaxDB.

    CyCOG6.dmnd

    Database containing orthologous groups of proteins used in the cluster/clade/grade normalization step.

    ProSynTaxDB.faa

    File containing protein sequences used by Kaiju for classification of reads. Each protein sequence contains a header starting with “>”.

    average_cycog_length.csv

    Comma separated file containing the average length for each protein sequence used in the normalization step. Data fields are as follows:
    cycog: name for single-copy core gene

    mean_AA_length: the average length, in amino acids, of the protein sequence of the gene

    ProSynTaxDB-workflow_benchmarking_genomes.tsv

    This tab-delimited file contains a list of subsetted genomes used in each benchmarking experiment done in the Technical Validation section. Data fields are as follows:

    Experiment Name: name of benchmarking experiment conducted

    Subset ID: unique ID from the random genome subsetting

    Genome Name: name of genome used in benchmarking experiment

    ProSynTaxDB-workflow_benchmarking_composition.tsv

    This tab-delimited file contains the taxon composition of all samples used in each benchmarking experiment done in the Technical Validation section. Data fields are as follows:

    Experiment Name: name of benchmarking experiment conducted

    Sample Name: unique sample name

    Percent Prochlorococcus: percent of reads in simulated sample originating from Prochlorococcus genomes

    Percent Synechococcus: percent of reads in simulated sample originating from Synechococcus genomes

    Percent Heterotroph: percent of reads in simulated sample originating from marine heterotrophic bacterial genomes

    Notes: additional information about the simulated sample
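As a rough illustration of working with the genome table described above (ProSynTaxDB_genomes.tsv), the sketch below counts genomes per clade for one genus. It assumes the column headers in the file match the field names listed in the description (`genus`, `clade`, and so on); the rows in `SAMPLE_TSV` and the `clade_counts` helper are hypothetical and stand in for the real file.

```python
import csv
import io
from collections import Counter

# Hypothetical rows for illustration only; real values come from
# ProSynTaxDB_genomes.tsv. Header names are assumed to match the listed fields.
SAMPLE_TSV = (
    "organism\tgenome_short_name\tdomain\tgenus\tclade\n"
    "Example organism A\tgenomeA\tBacteria\tProchlorococcus\tcladeX\n"
    "Example organism B\tgenomeB\tBacteria\tProchlorococcus\tcladeX\n"
    "Example organism C\tgenomeC\tBacteria\tProchlorococcus\tcladeY\n"
    "Example organism D\tgenomeD\tBacteria\tSynechococcus\tcladeZ\n"
)


def clade_counts(handle, genus):
    """Count genomes per clade for one genus in a tab-delimited genome table."""
    reader = csv.DictReader(handle, delimiter="\t")
    return Counter(row["clade"] for row in reader if row["genus"] == genus)


counts = clade_counts(io.StringIO(SAMPLE_TSV), "Prochlorococcus")
print(counts)
```

To run this against the downloaded table, pass an open file handle (for example `open("ProSynTaxDB_genomes.tsv", newline="")`) in place of the in-memory sample.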

  14. Germinal centre-driven maturation of B cell response to SARS-CoV-2 mRNA...

    • zenodo.org
    • explore.openaire.eu
    • +1more
    application/gzip, bin +1
    Updated Feb 22, 2022
    + more versions
    Cite
    Julian Q Zhou (2022). Germinal centre-driven maturation of B cell response to SARS-CoV-2 mRNA vaccination [Dataset]. http://doi.org/10.5281/zenodo.5895181
    Explore at:
    Available download formats: application/gzip, tsv, bin
    Dataset updated
    Feb 22, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Julian Q Zhou
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    These are the processed BCR repertoire and transcriptomics data described in Kim & Zhou et al., Nature, 2022. The raw sequencing data new to this study are available on SRA under BioProject PRJNA777934. This study also used BCR repertoire data from Turner & O'Halloran et al., Nature, 2021 (PRJNA731610) and Schmitz, Turner & Liu et al., Immunity, 2021 (PRJNA741267).

    Code

    Code along with Docker containers for reproducing the NGS data-based figures and analyses in the published paper can be found on GitHub.

    Metadata

    File: WU368_kim_et_al_nature_2022_meta.tsv

    Notes:

    • Sample breakdown by `sequence_type` (132 total)
    • Sample collection time was originally recorded in days in the `timepoint` column. Timepoints were referenced in weeks in the manuscript, as shown in the `timepoint_ms` column.
    • `bio_rep` and `tech_rep` = biological replicate and technical replicate respectively.

    Abbreviations:

    • LN = lymph node
    • BM = bone marrow
    • PB = plasmablast
    • GC = germinal centre
    • LLPC = long-lived plasma cell
    • NS = no sorting
    • mAb = monoclonal antibody

    Information on the 2099 recombinant mAbs generated in this study

    File: WU368_kim_et_al_nature_2022_mabs.tsv

    Notes on columns:

    • `h_sequence_id` and `l_sequence_id`: Sequence IDs of the heavy and light chains respectively.
    • `elisa`: ELISA results for binding to SARS-CoV-2 S (`TRUE` = positive).

    Processed BCR data - heavy chains

    File: WU368_kim_et_al_nature_2022_bcr_heavy.tsv

    Analysis was based on heavy chain-based clonal inference.

    Notes on columns:

    The columns largely follow the AIRR-C Rearrangement format. The main deviation is that CDR3s were used, as opposed to IMGT-defined "junctions". Nonetheless, junction-related columns are included here as some repositories such as iReceptor use these. Non-standard columns are noted below.

    • `cell_id`: Only sequences from single-cell samples and the 37 mAbs from Turner & O'Halloran et al., Nature, 2021 have cell IDs following the format `[donor]_[sample]@[id]`. `NA` for bulk sequences.
    • `sequence_id`: Sequence IDs follow the format `[donor]_[sample]@[id]`.
    • `v_call_genotyped`: V gene annotation reassigned after individualized genotyping by TIgGER.
    • `germline_[vdj]_call`: Clonal consensus germline calls after the corresponding clonal consensus sequences were reconstructed via `CreateGermlines.py --cloned` from Change-O.
    • `isotype`: IGH[ADEGM].
    • `cdr3`: CDR3 nucleotide sequence.
    • `cdr3_length`: CDR3 nucleotide sequence length.
    • `cdr3_aa`: CDR3 amino acid sequence.
    • `collapse_count`: Number of duplicate IMGT-aligned V(D)J sequences that were collapsed by `alakazam::collapseDuplicates`.
    • `donor`, `timepoint`, `tissue`, `sorting`, `seq_type`: Propagated as is from the metadata file.
      • In `seq_type`, `tgx` corresponds to 10x Genomics data; `mab` corresponds specifically to the 37 S-binding mAbs from Turner & O'Halloran et al., Nature, 2021.
    • `timepoint_2`: Same as `timepoint`, except that `d28+d35` and `d201+d208` were treated as `d28` (week 4) and `d201` (week 29) respectively as described in Materials & Methods.
    • `gex_anno`: Cell type identity annotation based on transcriptomic profiles. Mapped from `anno_leiden_0.18` from WU368_kim_et_al_nature_2022_gex_b_cells.h5ad.
    • `compartment`: B cell compartment.
      • ABC = activated B cell. LNPC = lymph node plasma cell. RMB = resting memory B cell.
      • Minor differences in terminology
        • The manuscript refers to the memory compartment as MBCs, whereas the terminology used in the data is RMB. As described in Materials & Methods, analysis involving the memory compartment used specifically d201 bulk-sequenced memory sorts from blood. To get these sequences, subset `s_pos_clone`, `seq_type`, `compartment`, and `timepoint_2` to, respectively, `TRUE`, `bulk`, `RMB`, and `d201`.
        • The manuscript uses the term BMPC (bone marrow plasma cell), whereas the data uses the term LLPC.
    • `clone_id`: B cell clonal lineage IDs follow the format `[donor]@[id]`.
    • `s_pos_clone`: `TRUE` if a sequence belonged to a B cell clone that was designated as S-binding by virtue of containing one of the recombinant mAbs that tested positive via ELISA or one of the S-binding mAbs from Turner & O'Halloran et al., Nature, 2021.
    • `expressed_id`: mAb IDs for the 2099 recombinant mAbs generated in this study (mapped from `mab_id` from WU368_kim_et_al_nature_2022_mabs.tsv) and the 37 mAbs from Turner & O'Halloran et al., Nature, 2021. `NA` for everything else.
    • `elisa`: ELISA results for binding of recombinant mAbs to SARS-CoV-2 S. `TRUE` if positive. `NA` if not tested.
    • `nuc_RS_19_312`: number of replacement and silent mutations between IMGT-numbered nucleotide positions 19-312 along IGHV sequences, calculated by `shazam::calcObservedMutations`.
    • `nuc_denom_19_312`: number of informative nucleotide positions for counting mutations, excluding non-A/T/G/C positions (such as "N", "-", ".").
    • `nuc_RS_freq_19_312`: nucleotide-level mutation frequency (= nuc_RS_19_312 / nuc_denom_19_312).
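The subsetting rule for the memory compartment and the mutation-frequency definition in the column notes above can be sketched together. This is a minimal example under stated assumptions, not the authors' pipeline: it assumes logical values are stored as the literal text `TRUE`/`FALSE` in the TSV, the rows in `SAMPLE_TSV` are invented, and `memory_mutation_freqs` is a hypothetical helper name.

```python
import csv
import io

# Invented rows for illustration; the real table is
# WU368_kim_et_al_nature_2022_bcr_heavy.tsv, which has many more columns.
SAMPLE_TSV = (
    "sequence_id\ts_pos_clone\tseq_type\tcompartment\ttimepoint_2"
    "\tnuc_RS_19_312\tnuc_denom_19_312\n"
    "d1_s1@1\tTRUE\tbulk\tRMB\td201\t12\t294\n"
    "d1_s1@2\tTRUE\tbulk\tABC\td201\t3\t294\n"
    "d1_s2@3\tFALSE\tbulk\tRMB\td201\t0\t290\n"
    "d1_s2@4\tTRUE\tbulk\tRMB\td201\t29\t290\n"
)

# Subset described in the notes: S-binding clones, bulk-sequenced,
# resting memory B cells, day 201.
WANTED = {
    "s_pos_clone": "TRUE",  # assumes logicals are written as "TRUE"/"FALSE"
    "seq_type": "bulk",
    "compartment": "RMB",
    "timepoint_2": "d201",
}


def memory_mutation_freqs(handle):
    """Return {sequence_id: nuc_RS_19_312 / nuc_denom_19_312} for the subset."""
    freqs = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        if all(row[key] == value for key, value in WANTED.items()):
            denom = int(row["nuc_denom_19_312"])
            if denom > 0:  # skip rows with no informative positions
                freqs[row["sequence_id"]] = int(row["nuc_RS_19_312"]) / denom
    return freqs


freqs = memory_mutation_freqs(io.StringIO(SAMPLE_TSV))
print(freqs)
```

The recomputed ratio should agree with the `nuc_RS_freq_19_312` column in the real file, which makes it a convenient sanity check after loading.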

    Processed BCR data - light chains

    File: WU368_kim_et_al_nature_2022_bcr_light.tsv

    Light chains were not used for heavy chain-based clonal inference or analysis.

    Processed transcriptomics data

    Files:

    • WU368_kim_et_al_nature_2022_gex_all_cells.h5ad (clustering all cells)
    • WU368_kim_et_al_nature_2022_gex_b_cells.h5ad (re-clustering only the B cells)

    Notes:

    • The `h5ad` files can be imported into Scanpy as an AnnData object.
    • Each `AnnData` object has 3 `.layers`, each representing a version of the count matrix.
      • `raw_counts`: Imported from `cellranger aggr` output by `scanpy.read_10x_mtx`.
      • `log_norm`: Log-normalized expression values outputted by `scanpy.pp.normalize_total` followed by `scanpy.pp.log1p`.
      • `scaled`: The `log_norm` layer scaled to unit variance and zero mean by `scanpy.pp.scale`.
    • The `gene_name` and `biotype` columns in `.var` were extracted from GENCODE v32 GTF.
    • Columns in `.obs` (each row corresponds to a cell)
      • `n_feature`: The `n_genes_by_counts` column produced by `scanpy.pp.calculate_qc_metrics`, renamed. The number of genes expressed. This is before subsetting the genes.
      • `n_umi`: The `total_counts` column produced by `scanpy.pp.calculate_qc_metrics`, renamed. The total UMI counts in a cell.
      • `pct_mt`: The `pct_counts_mt` column produced by `scanpy.pp.calculate_qc_metrics`, renamed. The percentage of counts in mitochondrial genes.
      • `n_hkg`: The number of housekeeping genes for which expression was detected.
      • `n_gene_expressed`: The total number of genes for which expression was detected. This is after subsetting the genes.
      • `pre_qc_bcr`: `TRUE` if a cell also had paired BCR data available. Produced by cross-referencing the cellular barcodes in `cell_barcodes.json` outputted by `cellranger vdj`. At this point the BCR data had not gone through the QC process in the BCR processing pipeline (hence `pre_qc`).
      • `leiden_[resolution]`: Cluster assignment by `scanpy.tl.leiden`.
      • `anno_leiden_[resolution]`: Cell type identity annotations based on transcriptomic profiles. This was mapped onto the `gex_anno` column in the processed heavy chain BCR
      
  15. Data from: Halyomorpha halys Official Gene Set v1.2

    • catalog.data.gov
    • agdatacommons.nal.usda.gov
    Updated Apr 21, 2025
    Cite
    Agricultural Research Service (2025). Halyomorpha halys Official Gene Set v1.2 [Dataset]. https://catalog.data.gov/dataset/halyomorpha-halys-official-gene-set-v1-2-c0a0b
    Explore at:
    Dataset updated
    Apr 21, 2025
    Dataset provided by
    Agricultural Research Service
    Description

    This dataset presents the Halyomorpha halys Official Gene Set (OGS) v1.2. OGSv1.2 is an update of Halyomorpha halys OGSv1.1 (https://doi.org/10.15482/USDA.ADC/1504240) to the coordinates of genome assembly GCA_000696795.3 (https://www.ncbi.nlm.nih.gov/assembly/GCA_000696795.3) using https://github.com/NAL-i5K/coordinates_conversion/.

    The original OGSv1.0 is an integration of automatic gene predictions from NCBI's eukaryotic annotation pipeline, NCBI Halyomorpha halys Annotation Release 100 (https://www.ncbi.nlm.nih.gov/genome/annotation_euk/Halyomorpha_halys/100/; ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/696/795/GCF_000696795.1_Hhal_1.0), with manual annotations by the research community (performed via the Apollo manual curation software, http://genomearchitect.org/). Manual annotations performed by the community were downloaded from Apollo, QC'd, and merged with NCBI Halyomorpha halys Annotation Release 100 using the GFF3toolkit software (https://github.com/NAL-i5K/GFF3toolkit/releases/tag/v1.4.4). The resulting merged dataset was formatted for ingest into the i5k Workspace and GenBank databases, resulting in Halyomorpha halys Official Gene Set (OGS) v1.0.

    Halyomorpha halys Official Gene Set halhal_OGSv1.1 is a minor update of halhal_OGSv1.0: Alias attributes were added to all manually annotated cathepsin models; six models from contaminated scaffolds were removed; and notes were added to 3 models located on possibly contaminated scaffolds.

    Resources in this dataset:

    Resource Title: Halyomorpha halys Official Gene Set OGSv1.2.
    File Name: halhal_OGSv1.2.tar.gz
    Resource Description: The attached tar.gz archive (halhal_OGSv1.2.tar.gz) contains the following files:

    • halhal_OGSv1.2.gff: GFF3 of all gene predictions of Halyomorpha halys genome annotations OGSv1.2
    • halhal_OGSv1.2_CDS.fa: CDS sequences of Halyomorpha halys genome annotations OGSv1.2
    • halhal_OGSv1.2_pep.fa: Amino acid sequences of Halyomorpha halys genome annotations OGSv1.2
    • halhal_OGSv1.2_trans.fa: Transcript sequences of Halyomorpha halys genome annotations OGSv1.2
    • readme: Readme file describing Halyomorpha halys genome annotations OGSv1.2

  16. Perlegen/NIEHS National Toxicology: Mouse Genome Resequencing Project

    • neuinfo.org
    • scicrunch.org
    • +1more
    Cite
    Perlegen/NIEHS National Toxicology: Mouse Genome Resequencing Project [Dataset]. http://identifiers.org/RRID:SCR_000726
    Explore at:
    Description

    THIS RESOURCE IS NO LONGER IN SERVICE, Documented on August 12, 2014.

    Data, grouped by chromosome, available as flat files for download, of identified DNA polymorphisms (SNPs) in 15 commonly used strains of inbred laboratory mice. Perlegen's SNP, genotype (empirical and imputed), haplotype, trace, and PCR primer data has been compiled with NCBI Mouse Build information to produce data files for public use. Using high-density oligonucleotide array technology, the study identified over 8 million SNPs and other genetic differences between these strains and the previously sequenced C57BL/6J reference strain (Phase 1). By leveraging data provided by Mark Daly's research team at the Broad Institute, genotypes were also predicted for 40 other common strains (Phase 2). Under an extension to the contract, Eleazar Eskin's group at UCLA has used this data to evaluate SNP associations with phenotypes from the Mouse Phenome Project (the Mouse Phenome Database), and to construct haplotype maps for a total of 94 inbred strains (the Mouse HapMap Project). SNP and genotype positions have been mapped from their original reference coordinates to NCBI Mouse Build 37 coordinates.

    Note that the C57BL/6J strain was not selected for re-sequencing as this data would have been almost entirely redundant with the NCBI reference sequence. Since we did not actually determine genotypes for C57BL/6J, we did not submit genotypes for this strain to dbSNP. However, implicit genotypes for C57BL/6J can be obtained from the reference sequence at each SNP position (the reference allele is the first allele in the ALLELES column). The data is available for download in two different compressed file formats. The files are saved as both PC .zip files and Unix compressed .gz files.

    At this website, you can:

    • Learn more about the goals of the Perlegen mouse resequencing project.
    • Learn more about the array-based resequencing technology used in the project.
    • Download the SNPs, genotypes, and other data generated by the project, plus sequences of the long-range PCR primers used for SNP discovery.
    • Browse the mouse genome for SNPs.
    • View the haplotype blocks within the mouse genome.

    Mouse Genome Browser: The Mouse Genome Browser can be used to visualize genes and the SNPs discovered in this study of genome-wide DNA variation in 15 commonly used, genetically diverse strains of inbred laboratory mice. The reference genome is the C57BL/6J strain NCBI build 37 mouse sequence. In addition to the experimentally-derived genotypes for the original 15 strains, the imputed genotypes for 40 additional inbred mouse strains can also be accessed.

    Mouse Haplotype Analysis: The sequences of 16 commonly used, genetically diverse strains of inbred laboratory mice were analyzed to determine their haplotype structure. The Ancestry Browser shows which ancestral sequence each inbred strain most resembles, along with statistics on the pairwise similarity between the ancestral strains. The Haplotype Viewer shows the haplotype block boundaries and the pairwise similarity for all 56 strains: the 15 used for SNP discovery, the reference strain (C57BL/6J), and the 40 additional strains for which the genotypes were imputed.

  17. PROSITE profiles

    • ebi.ac.uk
    Updated Feb 5, 2025
    Cite
    (2025). PROSITE profiles [Dataset]. https://www.ebi.ac.uk/interpro/
    Explore at:
    Dataset updated
    Feb 5, 2025
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    PROSITE is a database of protein families and domains. It consists of biologically significant sites, patterns and profiles that help to reliably identify to which known protein family a new sequence belongs. PROSITE is based at the Swiss Institute of Bioinformatics (SIB), Geneva, Switzerland.

  18. Pre-processed PubMed data for a study of coauthorship

    • zenodo.org
    • data.niaid.nih.gov
    bin
    Updated Jan 24, 2020
    Cite
    Cory Brunson; Xiaoyan Wang (2020). Pre-processed PubMed data for a study of coauthorship [Dataset]. http://doi.org/10.5281/zenodo.345934
    Explore at:
    Available download formats: bin
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Cory Brunson; Xiaoyan Wang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset was collected from the PubMed portal to MEDLINE and other repositories of biomedical research (https://www.ncbi.nlm.nih.gov/pubmed/). Analysis of the dataset led to the paper "Effects of research complexity and competition on the incidence and growth of coauthorship in biomedicine", published in PLOS One (http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0173444). The raw data were pre-processed using the script "clean.r" in the project directory on GitHub (https://github.com/corybrunson/coauthor) to obtain the file presented here.

    The dataset is formatted as a data table (https://cran.r-project.org/web/packages/data.table/index.html), a class of data frame in R, and saved as a .RData file, which can be loaded into an R session via `load("path/to/dataset/pmDat.RData")`. The fields are as follows:

    • `pmid` - the unique publication identifier (PMID) used by PubMed
    • `jid` - the unique journal identifier used by PubMed
    • `issn` - the (print) ISSN of the journal
    • `ym` - the month and year of publication
    • `nau` - the number of authors credited by the publication (up to any limits imposed by PubMed, and counting each author collective as a single author)
    • `cau` - whether any corporate author was credited
    • `rev` - whether the publication was tagged as a review
    • `trial` - whether the publication was tagged as a clinical trial
    • `npmt` - the number of MeSH terms assigned to the publication that were flagged as "major" topics
    • `nmh` - the number of top-level MeSH headings assigned to the publication
    • `supp` - whether the publication was tagged as having received financial support
    • `ng` - the number of grants acknowledged by the publication
    • `co` - the country in which the journal was published

    Note that the field values for any publication can be validated by searching for the PMID in PubMed.
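The validation step in the note above can be scripted. The following is a minimal sketch, assuming that appending a PMID to the PubMed base URL quoted in the description resolves to the matching record; the function name `pubmed_url` and the example PMID are hypothetical.

```python
# Base URL as given in the dataset description.
PUBMED_BASE = "https://www.ncbi.nlm.nih.gov/pubmed/"


def pubmed_url(pmid):
    """Return a lookup URL for a PMID given as an integer or numeric string.

    Assumption: PubMed resolves <base URL><PMID> to the publication record.
    """
    pmid_text = str(pmid).strip()
    if not pmid_text.isdigit():
        raise ValueError(f"not a valid PMID: {pmid!r}")
    return PUBMED_BASE + pmid_text


# Hypothetical PMID, for illustration only.
example_url = pubmed_url(12345678)
```

Opening the returned URL in a browser allows a row's `ym`, `nau`, or `rev` values to be compared with the PubMed record by eye.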

  19. e

    CATH-Gene3D

    • ebi.ac.uk
    Updated Oct 21, 2020
    Cite
    (2020). CATH-Gene3D [Dataset]. https://www.ebi.ac.uk/interpro/
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The CATH-Gene3D database describes protein families and domain architectures in complete genomes. Protein families are formed using a Markov clustering algorithm, followed by multi-linkage clustering according to sequence identity. Mapping of predicted structure and sequence domains is undertaken using hidden Markov model libraries representing CATH and Pfam domains. CATH-Gene3D is based at University College London, UK.

  20. f

    Phylogenetic analyses of the somatostatin receptor gene family

    • figshare.com
    txt
    Updated Jan 19, 2016
    Cite
    Daniel Ocampo Daza; Dan Larhammar; Gorel Sundstrom; Christina A Bergqvist (2016). Phylogenetic analyses of the somatostatin receptor gene family [Dataset]. http://doi.org/10.6084/m9.figshare.94213.v1
    Dataset provided by
    figshare
    Authors
    Daniel Ocampo Daza; Dan Larhammar; Gorel Sundstrom; Christina A Bergqvist
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Sequence-based phylogenetic analyses of the somatostatin receptor gene family using amino acid sequences predicted from the Ensembl genome browser (http://www.ensembl.org) and the Lepisosteus oculatus (spotted gar) genome assembly LepOcu1 (http://www.ncbi.nlm.nih.gov/genome/assembly/327908/). Database identifiers, location data, genome assembly information and annotation notes for all identified sequences are included in Supplemental Table 1.xlsx (Excel spreadsheet).

    File information: Alignment files are included in FASTA format: 'SSTR_alignment.fasta' and 'SSTR_additional_alignment.fasta'. This file format can be opened by most sequence analysis applications as well as text editors. The second alignment file includes additional teleost fish SSTR sequences from the NCBI Reference Sequence database, as detailed in 'Supplemental Table 1.xlsx'. Alignments were created using the ClustalWS sequence alignment program with standard settings (Gonnet weight matrix, gap opening penalty 10.0 and gap extension penalty 0.20) through the JABAWS 2 tool in Jalview 2.7 (http://www.jalview.org/). The alignments were edited manually in Jalview in order to curate short, incomplete or highly divergent amino acid sequence predictions from the genome database. In this way erroneous automatic exon predictions and exons that had not been predicted could be rectified.

    Phylogenetic tree files are included in Phylip/Newick format with the extension '.phb'. This file format can be opened by freely available phylogenetic tree viewers such as FigTree (http://tree.bio.ed.ac.uk/software/figtree/) and TreeView (http://darwin.zoology.gla.ac.uk/~rpage/treeviewx/). The phylogenetic analyses were carried out based on the included alignments using both neighbor joining (NJ) and phylogenetic maximum likelihood (PhyML) methods. Phylogenetic trees are rooted with the human kisspeptin receptor (KISS1R/GPR54) amino acid sequence.

    The NJ trees are supported by a non-parametric bootstrap analysis with 1000 replicates, applied through ClustalX 2.0 (http://www.clustal.org/clustal2/) with standard settings: 'SSTR_NJ_rooted.phb' and 'SSTR_NJ_additional_rooted.phb'. The second phylogenetic tree file includes additional teleost fish SSTR sequences from the NCBI Reference Sequence database, as detailed above.

    The PhyML trees are supported by both a non-parametric bootstrap analysis with 100 replicates ('SSTR_PhyML_bootstrapped_rooted.phb') and an SH-like approximate likelihood ratio test ('SSTR_PhyML_aLRT_rooted.phb'). Both trees were made using the PhyML 3.0 algorithm (http://www.atgc-montpellier.fr/phyml/) with the following settings: amino acid frequencies (equilibrium frequencies), proportion of invariable sites (with optimised p-invar) and gamma-shape parameters were estimated from the datasets; the number of substitution rate categories was set to 8; BIONJ was chosen to create the starting tree and both the nearest neighbor interchange (NNI) and subtree pruning and regrafting (SPR) tree improvement methods were used to estimate the best topology; both tree topology and branch length optimization were chosen. The JTT model of amino acid substitution was chosen using ProtTest 3.0 (https://bitbucket.org/diegodl/prottest3/downloads).

    Species abbreviations are applied as follows: Homo sapiens (Hsa, human), Mus musculus (Mmu, mouse), Canis familiaris (Cfa, dog), Monodelphis domestica (Mdo, grey short-tailed opossum), Gallus gallus (Gga, chicken), Anolis carolinensis (Aca, Carolina anole lizard), Silurana (Xenopus) tropicalis (Xtr, Western clawed frog), Latimeria chalumnae (Lch, Comoran coelacanth), Lepisosteus oculatus (Loc, spotted gar), Danio rerio (Dre, zebrafish), Oryzias latipes (Ola, medaka), Gasterosteus aculeatus (Gac, three-spined stickleback), Tetraodon nigroviridis (Tni, green spotted pufferfish), Takifugu rubripes (Tru, fugu) and Drosophila melanogaster (Dme, fruit fly).
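For readers who want to inspect the '.phb' tree files without a tree viewer, the sketch below pulls leaf labels out of a Newick string and maps them to species using the abbreviations listed above. It assumes unquoted labels that begin with the three-letter abbreviation (for example `Hsa_SSTR1`); that label layout, the example tree, and the function names are hypothetical and not taken from the dataset files.

```python
import re

# Abbreviations as listed in the dataset description.
SPECIES = {
    "Hsa": "Homo sapiens", "Mmu": "Mus musculus", "Cfa": "Canis familiaris",
    "Mdo": "Monodelphis domestica", "Gga": "Gallus gallus",
    "Aca": "Anolis carolinensis", "Xtr": "Silurana (Xenopus) tropicalis",
    "Lch": "Latimeria chalumnae", "Loc": "Lepisosteus oculatus",
    "Dre": "Danio rerio", "Ola": "Oryzias latipes",
    "Gac": "Gasterosteus aculeatus", "Tni": "Tetraodon nigroviridis",
    "Tru": "Takifugu rubripes", "Dme": "Drosophila melanogaster",
}


def leaf_labels(newick):
    """Return leaf labels from a Newick string (unquoted labels only).

    Leaves directly follow '(' or ','; internal node labels and bootstrap
    values follow ')' and are therefore skipped.
    """
    return re.findall(r"(?<=[(,])\s*([^\s():,;]+)", newick)


def species_of(label):
    """Look up the species from the first three characters of a label."""
    return SPECIES.get(label[:3], "unknown")


# Hypothetical tree, for illustration only.
example_tree = "((Hsa_SSTR1:0.1,Mmu_SSTR1:0.2)95:0.05,Hsa_KISS1R:0.9);"
labels = leaf_labels(example_tree)
species = [species_of(label) for label in labels]
```

A dedicated Newick parser is the safer choice for real analyses, since this regular expression does not handle quoted labels or embedded comments.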

Cite
Blaskowski, Stephen (2025). MarFERReT: an open-source, version-controlled reference library of marine microbial eukaryote functional genes [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7055911

Data from: MarFERReT: an open-source, version-controlled reference library of marine microbial eukaryote functional genes

Dataset updated
Jan 22, 2025
Dataset provided by
Coesel, Sacha
Armbrust, E. Virginia
Blaskowski, Stephen
Groussman, Mora J
License

Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Metatranscriptomics generates large volumes of sequence data about transcribed genes in natural environments. Taxonomic annotation of these datasets depends on availability of curated reference sequences. For marine microbial eukaryotes, current reference libraries are limited by gaps in sequenced organism diversity and barriers to updating libraries with new sequence data, resulting in taxonomic annotation of only about half of eukaryotic environmental transcripts. Here, we introduce version 1.0 of the Marine Functional EukaRyotic Reference Taxa (MarFERReT), an updated marine microbial eukaryotic sequence library with a version-controlled framework designed for taxonomic annotation of eukaryotic metatranscriptomes. We gathered 902 marine eukaryote genomes and transcriptomes from multiple sources and assessed these candidate entries for sequence quality and cross-contamination issues, selecting 800 validated entries for inclusion in the library. MarFERReT v1 contains reference sequences from 800 marine eukaryotic genomes and transcriptomes, covering 453 species- and strain-level taxa, totaling nearly 28 million protein sequences with associated NCBI and PR2 Taxonomy identifiers and Pfam functional annotations. An accompanying MarFERReT code repository hosts containerized build scripts, documentation on installation and use case examples, and information on new versions of MarFERReT: https://github.com/armbrustlab/marferret

The raw source data for the 902 candidate entries considered for MarFERReT v1.1.1, including the 800 accepted entries, are available for download from their respective online locations. The source URL for each of the entries is listed here in MarFERReT.v1.1.1.entry_curation.csv, and detailed instructions and code for downloading the raw sequence data from source are available in the MarFERReT code repository (https://github.com/armbrustlab/marferret).

This repository release contains MarFERReT database files from the v1.1.1 MarFERReT release using the following MarFERReT library build scripts: assemble_marferret.sh, pfam_annotate.sh, and build_diamond_db.sh.

The following MarFERReT data products are available in this repository:

MarFERReT.v1.1.1.metadata.csv

This CSV file contains descriptors of each of the 902 database entries, including data source, taxonomy, and sequence descriptors. Data fields are as follows:

entry_id: Unique MarFERReT sequence entry identifier.

accepted: Acceptance into the final MarFERReT build (Y/N). The Y/N values can be adjusted to customize the final build output according to user-specific needs.

marferret_name: A human and machine friendly string derived from the NCBI Taxonomy organism name; maintaining strain-level designation wherever possible.

tax_id: The NCBI Taxonomy ID (taxID).

pr2_accession: Best-matching PR2 accession ID associated with entry

pr2_rank: The lowest shared rank between the entry and the pr2_accession

pr2_taxonomy: PR2 Taxonomy classification scheme of the pr2_accession

data_type: Type of sequence data; transcriptome shotgun assemblies (TSA), gene models from assembled genomes (genome), and single-cell amplified genomes (SAG) or transcriptomes (SAT).

data_source: Online location of sequence data; the Zenodo data repository (Zenodo), the datadryad.org repository (datadryad.org), MMETSP re-assemblies on Zenodo (MMETSP), NCBI GenBank (NCBI), JGI Phycocosm (JGI-Phycocosm), the TARA Oceans portal on Genoscope (TARA), or entries from the Roscoff Culture Collection through the METdb database repository (METdb).

source_link: URL where the original sequence data and/or metadata was collected.

pub_year: Year of data release or publication of linked reference.

ref_link: PubMed URL of the published reference for the entry, if available.

ref_doi: DOI of entry data from source, if available.

source_filename: Name of the original sequence file name from the data source.

seq_type: Entry sequence data retrieved in nucleotide (nt) or amino acid (aa) alphabets.

n_seqs_raw: Number of sequences in the original sequence file.

source_name: Full organism name from entry source

original_taxID: Original NCBI taxID from entry data source metadata, if available

alias: Additional identifiers for the entry, if available
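As a rough illustration of how the metadata fields above might be used, the sketch below keeps only the rows marked as accepted. It assumes the CSV has a header row whose column names match the field names listed above; the sample rows and the function name `accepted_entries` are hypothetical.

```python
import csv
import io


def accepted_entries(csv_file):
    """Yield metadata rows whose 'accepted' field is 'Y'."""
    for row in csv.DictReader(csv_file):
        if row["accepted"].strip().upper() == "Y":
            yield row


# Hypothetical rows with a subset of the columns, for illustration only.
sample = io.StringIO(
    "entry_id,accepted,marferret_name,data_type\n"
    "1,Y,Example_organism_A,TSA\n"
    "2,N,Example_organism_B,genome\n"
)
accepted = list(accepted_entries(sample))
accepted_ids = [row["entry_id"] for row in accepted]
```

For the real file, the same function can be given an open handle, for example `open("MarFERReT.v1.1.1.metadata.csv", newline="")`.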

MarFERReT.v1.1.1.curation.csv

This CSV file contains curation and quality-control information on the 902 candidate entries considered for incorporation into MarFERReT v1, including curated NCBI Taxonomy IDs and entry validation statistics. Data fields are as follows:

entry_id: Unique MarFERReT sequence entry identifier

marferret_name: Organism name in human and machine friendly format, including additional NCBI taxonomy strain identifiers if available.

tax_id: Verified NCBI taxID used in MarFERReT

taxID_status: Status of the final NCBI taxID (Assigned, Updated, or Unchanged)

taxID_notes: Notes on the original_taxID

n_seqs_raw: Number of sequences in the original sequence file

n_pfams: Number of Pfam domains identified in protein sequences

qc_flag: Early validation quality control flags: LOW_SEQS (fewer than 1,200 raw sequences) and LOW_PFAMS (fewer than 500 Pfam domain annotations).

flag_Lasek: Flag notes from Lasek-Nesselquist and Johnson (2019); contains the flag 'FLAG_LASEK' indicating ciliate samples reported as contaminated in this study.

VV_contam_pct: Estimated contamination reported for MMETSP entries in Van Vlierberghe et al. (2021).

flag_VanVlierberghe: Flag for a high level of estimated contamination, from 'VV_contam_pct' values over 50%: FLAG_VV.

rp63_npfams: Number of ribosomal protein Pfam domains out of 63 total.

rp63_contam_pct: Percent of total ribosomal protein sequences with an inferred taxonomic identity in any lineage other than the recorded identity, as described in the Technical Validation section from analysis of 63 Pfam ribosomal protein domains.

flag_rp63: Flag for a high level of estimated contamination, from 'rp63_contam_pct' values over 50%: FLAG_RP63.

flag_sum: Count of the flag columns (qc_flag, flag_Lasek, flag_VanVlierberghe, and flag_rp63) that contain a flag. All entries with one or more flags are nominally rejected ('accepted' = N); entries without any flags are validated and accepted ('accepted' = Y).

accepted: Acceptance into the final MarFERReT build (Y or N).
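The acceptance rule described for flag_sum can be sketched in a few lines. This is only an illustration of the stated rule, assuming that an unflagged column is stored as an empty string (the actual encoding of "no flag" in the CSV is not stated here); the function names and sample rows are hypothetical.

```python
# The four flag columns named in the flag_sum description.
FLAG_COLUMNS = ("qc_flag", "flag_Lasek", "flag_VanVlierberghe", "flag_rp63")


def flag_sum(row):
    """Count the flag columns that hold a non-empty value.

    Assumption: an empty string (or a missing key) means "no flag".
    """
    return sum(1 for column in FLAG_COLUMNS if (row.get(column) or "").strip())


def acceptance(row):
    """Return 'Y' when no flag column is set, otherwise 'N'."""
    return "Y" if flag_sum(row) == 0 else "N"


# Hypothetical rows, for illustration only.
clean_row = {"qc_flag": "", "flag_Lasek": "", "flag_VanVlierberghe": "", "flag_rp63": ""}
flagged_row = {"qc_flag": "LOW_SEQS", "flag_Lasek": "", "flag_VanVlierberghe": "", "flag_rp63": "FLAG_RP63"}
clean_result = (flag_sum(clean_row), acceptance(clean_row))
flagged_result = (flag_sum(flagged_row), acceptance(flagged_row))
```

Because the count is per column, a row whose qc_flag holds both LOW_SEQS and LOW_PFAMS would still contribute one to the sum under this reading.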

MarFERReT.v1.1.1.proteins.faa.gz

This Gzip-compressed FASTA file contains the 27,951,013 final translated and clustered protein sequences for all 800 accepted MarFERReT entries. The sequence defline contains the unique identifier for the sequence and its reference (mftX, where 'X' is a ten-digit integer value).
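A file of this size is best read as a stream rather than decompressed to disk. The sketch below collects sequence identifiers from FASTA deflines, assuming each identifier appears literally as "mft" followed by ten digits; the exact defline layout, the demo records, and the function names are hypothetical.

```python
import gzip
import re

# Assumption: each ID appears in the defline as "mft" plus ten digits.
MFT_ID = re.compile(r"mft\d{10}")


def sequence_ids(lines):
    """Yield the MarFERReT ID found on each FASTA defline in `lines`."""
    for line in lines:
        if line.startswith(">"):
            match = MFT_ID.search(line)
            if match:
                yield match.group(0)


def sequence_ids_from_gz(path):
    """Stream IDs from a Gzip-compressed FASTA file without unpacking it."""
    with gzip.open(path, "rt") as handle:
        yield from sequence_ids(handle)


# Hypothetical records, for illustration only.
demo = [">mft0000000001\n", "MKT\n", ">mft0000000002\n", "MSA\n"]
ids = list(sequence_ids(demo))
```

Calling `sequence_ids_from_gz("MarFERReT.v1.1.1.proteins.faa.gz")` would iterate over the real file one line at a time.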

MarFERReT.v1.1.1.taxonomies.tab.gz

This Gzip-compressed tab-separated file is formatted for interoperability with the DIAMOND protein alignment tool commonly used for downstream analyses and contains some columns without any data. Each row contains an entry for one of the MarFERReT protein sequences in MarFERReT.v1.1.1.proteins.faa.gz. Note that 'accession.version' and 'taxid' are populated columns while 'accession' and 'gi' have NA values; the latter columns are required for backward compatibility as input for the DIAMOND alignment software and LCA analysis.

The columns in this file contain the following information:

accession: (NA)

accession.version: The unique MarFERReT sequence identifier ('mftX').

taxid: The NCBI Taxonomy ID associated with this reference sequence.

gi: (NA).
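The four-column layout above can be turned into a lookup from sequence identifier to NCBI taxID. This is a minimal sketch, assuming the file begins with a header row naming the four columns exactly as listed; the sample rows, taxID values, and function name are hypothetical.

```python
import csv
import io


def load_taxid_map(tab_file):
    """Map each 'accession.version' value to its integer 'taxid'.

    Assumption: the first row is a header with the four column names.
    The 'accession' and 'gi' columns hold NA and are ignored.
    """
    reader = csv.DictReader(tab_file, delimiter="\t")
    return {row["accession.version"]: int(row["taxid"]) for row in reader}


# Hypothetical rows, for illustration only.
sample = io.StringIO(
    "accession\taccession.version\ttaxid\tgi\n"
    "NA\tmft0000000001\t1111\tNA\n"
    "NA\tmft0000000002\t2222\tNA\n"
)
taxid_map = load_taxid_map(sample)
```

For the compressed file, a text handle from `gzip.open(path, "rt", newline="")` can be passed in place of the in-memory sample.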

MarFERReT.v1.1.1.proteins_info.tab.gz

This Gzip-compressed tab-separated file contains a row for each final MarFERReT protein sequence with the following columns:

aa_id: The unique identifier for each MarFERReT protein sequence.

entry_id: The unique numeric identifier for each MarFERReT entry.

source_defline: The original, unformatted sequence identifier

MarFERReT.v1.1.1.best_pfam_annotations.csv.gz

This Gzip-compressed CSV file contains the best-scoring Pfam annotation for intra-species clustered protein sequences from the 800 validated MarFERReT entries, derived from the hmmsearch annotations against Pfam 34.0 functional domains. This file contains the following fields:

aa_id: The unique MarFERReT protein sequence ID ('mftX').

pfam_name: The shorthand Pfam protein family name.

pfam_id: The Pfam identifier.

pfam_eval: E-value of the HMM profile match.

pfam_score: Bit score of the HMM profile match.
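One plausible use of these fields is to keep only annotations below an E-value cutoff. The sketch below assumes a header row with the five field names listed above; the cutoff of 1e-5, the sample rows, and the function name are hypothetical choices for illustration, not values recommended by the dataset authors.

```python
import csv
import io


def significant_annotations(csv_file, max_evalue=1e-5):
    """Yield (aa_id, pfam_id) pairs whose pfam_eval is at or below the cutoff."""
    for row in csv.DictReader(csv_file):
        if float(row["pfam_eval"]) <= max_evalue:
            yield row["aa_id"], row["pfam_id"]


# Hypothetical rows, for illustration only.
sample = io.StringIO(
    "aa_id,pfam_name,pfam_id,pfam_eval,pfam_score\n"
    "mft0000000001,Example_fam_A,PF00000,1e-30,105.2\n"
    "mft0000000002,Example_fam_B,PF99999,0.002,18.4\n"
)
hits = list(significant_annotations(sample))
```

Since the file already holds one best-scoring annotation per protein, this filter only removes weak matches and never has to choose between competing hits.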

MarFERReT.v1.1.1.dmnd

This binary file is the indexed database of the MarFERReT protein library with embedded NCBI taxonomic information, generated by the DIAMOND makedb tool using the build_diamond_db.sh script from the MarFERReT /scripts/ library. It can be used as the reference DIAMOND database for annotating environmental sequences from eukaryotic metatranscriptomes.
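As a hedged sketch of how the database might be used, the function below assembles a DIAMOND command line for taxonomic classification of nucleotide reads. It assumes a translated search (`blastx`) and DIAMOND's output format 102 (taxonomic classification); the file names and the function name are hypothetical, and the parameters used by the MarFERReT authors may differ (see the code repository for their use case examples).

```python
def diamond_blastx_command(db_path, query_path, out_path):
    """Build an argument list for a DIAMOND translated search.

    Assumption: --outfmt 102 (taxonomic classification) is the desired
    output, which relies on the taxonomy embedded in the .dmnd file.
    """
    return [
        "diamond", "blastx",
        "--db", db_path,        # reference database (.dmnd)
        "--query", query_path,  # nucleotide sequences to annotate
        "--out", out_path,      # tab-separated results
        "--outfmt", "102",
    ]


# Hypothetical file names, for illustration only.
command = diamond_blastx_command(
    "MarFERReT.v1.1.1.dmnd", "metatranscriptome.fasta", "lca_assignments.tab"
)
```

The list can be passed to `subprocess.run(command, check=True)` on a machine where DIAMOND is installed; nothing is executed by the sketch itself.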
