65 datasets found
  1. Data from: COInr a comprehensive, non-redundant COI database from NCBI-nt...

    • data.niaid.nih.gov
    • explore.openaire.eu
    • +1 more
    Updated May 6, 2024
    Cite
    Meglecz, Emese (2024). COInr a comprehensive, non-redundant COI database from NCBI-nt and BOLD [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6555984
    Dataset updated
    May 6, 2024
    Dataset authored and provided by
    Meglecz, Emese
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    COInr is a non-redundant, comprehensive database of COI sequences extracted from NCBI-nt and BOLD. It is not limited to a taxon, a gene region, or a taxonomic resolution. Sequences are dereplicated between databases and within taxa.

    Each taxon has a unique taxonomic identifier (taxID), which is fundamental to avoiding ambiguous associations of homonyms and synonyms in the source database. TaxIDs form a coherent hierarchical system fully compatible with the NCBI taxIDs, which allows their full or ranked lineages to be created.

    COInr is a good starting point to create custom databases according to the users’ needs using mkCOInr scripts available at https://github.com/meglecz/mkCOInr
    It is possible to select or eliminate sequences for a list of taxa, select a specific gene region, select for a minimum taxonomic resolution, add new custom sequences, and format the database for BLAST, QIIME, and RDP classifiers.

  2. COI reference sequences from BOLD DB

    • figshare.scilifelab.se
    • researchdata.se
    application/gzip
    Updated Apr 14, 2025
    Cite
    John Sundh (2025). COI reference sequences from BOLD DB [Dataset]. http://doi.org/10.17044/scilifelab.20514192.v4
    Available download formats: application/gzip
    Dataset updated
    Apr 14, 2025
    Dataset provided by
    Swedish Museum of Natural History
    Authors
    John Sundh
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset description

    This item contains COI (mitochondrial cytochrome oxidase subunit I) sequences collected from the BOLD database. The FASTA file bold_clustered_cleaned.fasta.gz has record IDs that can be queried in the Public Data Portal, and each FASTA header contains the taxonomic ranks plus the BIN ID assigned to the record. The taxonomic information for each record is also given in the tab-separated file bold_info_filtered.tsv.gz.

    The file bold_clustered.sintax.fasta.gz is directly compatible with the SINTAX algorithm in vsearch, while the files bold_clustered.assignTaxonomy.fasta.gz and bold_clustered.addSpecies.fasta.gz are directly compatible with the assignTaxonomy and addSpecies functions from DADA2, respectively. The dataset was last created on December 16, 2022.

    NOTE: We have noticed that the gzipped files in this upload have been compressed twice for some reason. A quick fix is to unzip any file with a ".gz" extension, rename the unzipped file by adding the ".gz" extension back, and then unzip it once again. Sorry for the inconvenience.

    Methods

    The code used to generate this dataset consists of a snakemake workflow wrapped into a Python package that can be installed with conda (conda install -c bioconda coidb). First, sequence and taxonomic information for records in the BOLD database is downloaded from the GBIF Hosted Datasets. This data is then filtered to keep only records annotated as 'COI-5P' and assigned to a BIN ID. The taxonomic information is parsed in order to assign species names and resolve higher-level ranks for each BIN ID. Sequences are processed to remove gap characters and leading and trailing Ns. After this, any sequences with remaining non-standard characters are removed. Sequences are then clustered at 100% identity using vsearch (Rognes et al. 2016). This clustering is done separately for sequences assigned to each BIN ID.

    For more information, see https://github.com/biodiversitydata-se/coidb
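
    The double-compression workaround and the sequence clean-up step described above can be sketched in a few lines of Python. This is only an illustration, not the coidb code: the function names are hypothetical, the gap character is assumed to be "-", and "non-standard characters" is assumed to mean anything other than A, C, G, or T.

```python
import gzip
import re

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def gunzip_fully(data: bytes, max_layers: int = 5) -> bytes:
    """Decompress repeatedly while the payload still looks like gzip."""
    layers = 0
    while data[:2] == GZIP_MAGIC and layers < max_layers:
        data = gzip.decompress(data)
        layers += 1
    return data


def clean_sequence(seq: str):
    """Remove gaps, strip leading/trailing Ns, reject other odd characters.

    Returns the cleaned sequence, or None if it should be discarded.
    Assumption: only A, C, G and T count as standard characters.
    """
    seq = seq.upper().replace("-", "").strip("N")
    if not seq or re.search(r"[^ACGT]", seq):
        return None
    return seq


# A FASTA record that was gzipped twice, as described in the NOTE.
twice = gzip.compress(gzip.compress(b">rec1\nNN-AC-GTNN\n"))
print(gunzip_fully(twice).decode())
print(clean_sequence("NN-AC-GTNN"))
```

    Checking the gzip magic bytes rather than the file extension means the same function handles files that were compressed once, twice, or not at all.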

  3. COins database

    • figshare.com
    zip
    Updated Aug 29, 2024
    Cite
    Giulia Magoga (2024). COins database [Dataset]. http://doi.org/10.6084/m9.figshare.19130465.v4
    Available download formats: zip
    Dataset updated
    Aug 29, 2024
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Giulia Magoga
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    COins is a database of COI-5P sequences of insects that includes over 532,000 representative sequences of more than 106,000 species, specifically formatted for the QIIME2 software platform. It was developed through a combination of automated and manually curated steps, starting from the insect COI sequences available in the Barcode of Life Data System and selecting sequences that comply with several standards, including a species-level identification.

    • seq-degapped.qza --> reference sequences
    • taxonomy.qza --> sequence taxonomy
    • SklearnClassifier_COins_QIIME2_v2024.5.qza (NEW!) --> naïve Bayes taxonomic classifier trained on COins (QIIME2 version 2024.5)
    • SklearnClassifier_COins_QIIME2_v2023.5.qza --> naïve Bayes taxonomic classifier trained on COins (QIIME2 version 2023.5)
    • SklearnClassifier_COins_QIIME2_v2022.2.qza --> naïve Bayes taxonomic classifier trained on COins (QIIME2 version 2022.2)
    • Sequences_metadata1.tsv --> Identification procedure of the voucher specimens from which reference sequences were developed. The identification procedure is reported for each sequence included in COins (BOLD id reported in the BOLDid reference column) and for all identical sequences within haplotypes that were removed at Step 5 of COins curation (those for which a BOLD id is not available in the BOLDid reference column). The haplotype to which each sequence belongs is reported in the Haplotype column (haplotypes of each species are labeled with increasing numbers). Identification procedure information is derived from the sequence-associated metadata provided by the BOLD system.
    • Sequences_metadata2.tsv --> Identical sequences belonging to different species present within COins. Each row represents a cluster of identical sequences associated with different species; sequences included in the cluster are labeled with species name and BOLD id.

  4. BOLD mtCOI database 2023-03-31 reformatted to SINTAX format for MB_Pipeline

    • zenodo.org
    Updated Sep 23, 2024
    Cite
    Brandon Seah (2024). BOLD mtCOI database 2023-03-31 reformatted to SINTAX format for MB_Pipeline [Dataset]. http://doi.org/10.5281/zenodo.13828767
    Dataset updated
    Sep 23, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Brandon Seah
    License

    Attribution-NonCommercial-ShareAlike 2.0 (CC BY-NC-SA 2.0): https://creativecommons.org/licenses/by-nc-sa/2.0/
    License information was derived automatically

    Description

    Publicly available barcode records for the mitochondrial COI gene in the BOLD database (https://www.boldsystems.org/), release of 2023-03-31, reformatted to SINTAX format for the MB_Pipeline metabarcoding pipeline. The original data were released under a CC-BY-NC-SA license.

    The conversion was performed with scripts from reformat-barcode-db (https://github.com/monagrland/reformat-barcode-db). See log files for full command line and options used to filter sequences and prepare the HMMs. Different clustering, genetic code, and length parameters are required for each marker gene.

    The Fasta files should be indexed to UDB format with Vsearch before use.
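
    As a rough sketch of the indexing step, the snippet below builds a vsearch command that uses the --makeudb_usearch option to write a UDB file. The helper name and the output file name are illustrative assumptions, not part of the dataset; if your vsearch build cannot read gzip-compressed input, decompress the FASTA file first.

```python
import shlex


def makeudb_command(fasta_path: str, udb_path: str) -> list:
    """Build the vsearch call that converts a FASTA file to a UDB index."""
    return ["vsearch", "--makeudb_usearch", fasta_path, "--output", udb_path]


# The output name below is a hypothetical choice for this example.
cmd = makeudb_command(
    "bold.2023-03-31.coi-5p.all.fasta.gz",
    "bold.2023-03-31.coi-5p.all.udb",
)
print(shlex.join(cmd))

# To actually run it (requires vsearch on the PATH):
# import subprocess
# subprocess.run(cmd, check=True)
```

    Passing the arguments as a list rather than a shell string avoids quoting problems with file names.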

    Description of files

    All records:

    • bold.2023-03-31.coi-5p.all.fasta.gz - sequences in Fasta format, gzip compressed
    • bold.2023-03-31.coi-5p.all.log - log file for bold2sintax script
    • bold.2023-03-31.coi-5p.all.acc2bin - table of BIN identifiers for each accession, if available (else None)

    European records only:

    • bold.2023-03-31.coi-5p.eur.fasta.gz - sequences in Fasta format, gzip compressed
    • bold.2023-03-31.coi-5p.eur.log - log file for bold2sintax script
    • european_countries.txt - List of country names used for subsetting
    • bold.2023-03-31.coi-5p.eur.acc2bin - table of BIN identifiers for each accession, if available (else None)

    HMMer model for arthropod sequences:

    • bold.2023-03-31.coi-5p.arthropoda.hmm - HMM file
    • bold.2023-03-31.coi-5p.arthropoda.log - log file for bold2hmm script
  5. BOLD Sequence and Metadata

    • figshare.com
    bin
    Updated Apr 4, 2023
    Cite
    Connor French (2023). BOLD Sequence and Metadata [Dataset]. http://doi.org/10.6084/m9.figshare.22557511.v1
    Available download formats: bin
    Dataset updated
    Apr 4, 2023
    Dataset provided by
    figshare
    Authors
    Connor French
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Raw data from the Barcode of Life Database which was downloaded on 2023-11-19. It is formatted as a CSV file. It contains raw COI sequences, barcode identifiers, spatial coordinates, and other metadata reported by BOLD.

    Necessary code for processing the raw data is available on https://github.com/connor-french/global-insect-macrogenetics, specifically the notebook step-1_seq-filter-align-test.ipynb.

  6. NEON (National Ecological Observatory Network) Ground beetle sequences DNA...

    • data.neonscience.org
    zip
    Updated Jun 30, 2025
    + more versions
    Cite
    (2025). NEON (National Ecological Observatory Network) Ground beetle sequences DNA barcode (DP1.10020.001), RELEASE-2025 [Dataset]. http://doi.org/10.48443/c71s-n628
    Available download formats: zip
    Dataset updated
    Jun 30, 2025
    License

    https://www.neonscience.org/data-samples/data-policies-citation

    Time period covered
    Jul 2013 - Nov 2022
    Area covered
    NIWO, BART, CPER, PUUM, GUAN, KONA, RMNP, WOOD, UKFS, OAES
    Description

    COI DNA sequences from select ground beetles

  7. Assessment of BOLD and GenBank – Their accuracy and reliability for the...

    • plos.figshare.com
    • figshare.com
    xlsx
    Updated Jun 2, 2023
    Cite
    Kelly A. Meiklejohn; Natalie Damaso; James M. Robertson (2023). Assessment of BOLD and GenBank – Their accuracy and reliability for the identification of biological materials [Dataset]. http://doi.org/10.1371/journal.pone.0217084
    Available download formats: xlsx
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Kelly A. Meiklejohn; Natalie Damaso; James M. Robertson
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Taxonomic identification of biological materials can be achieved through DNA barcoding, where an unknown “barcode” sequence is compared to a reference database. In many disciplines, obtaining accurate taxonomic identifications can be imperative (e.g., evolutionary biology, food regulatory compliance, forensics). The Barcode of Life DataSystems (BOLD) and GenBank are the main public repositories of DNA barcode sequences. In this study, an assessment of the accuracy and reliability of sequences in these databases was performed. To achieve this, 1) curated reference materials for plants, macro-fungi and insects were obtained from national collections, 2) relevant barcode sequences (rbcL, matK, trnH-psbA, ITS and COI) from these reference samples were generated and used for searching against both databases, and 3) optimal search parameters were determined that ensure the best match to the known species in either database. While GenBank outperformed BOLD for species-level identification of insect taxa (53% and 35%, respectively), both databases performed comparably for plants and macro-fungi (~81% and ~57%, respectively). Results illustrated that using a multi-locus barcode approach increased identification success. This study outlines the utility of the BLAST search tool in GenBank and the BOLD identification engine for taxonomic identifications and identifies some precautions needed when using public sequence repositories in applied scientific disciplines.

  8. NEON (National Ecological Observatory Network) Mosquito sequences DNA...

    • data.neonscience.org
    zip
    Updated Oct 15, 2023
    Cite
    (2019). NEON (National Ecological Observatory Network) Mosquito sequences DNA barcode (DP1.10038.001) [Dataset]. https://data.neonscience.org/data-products/DP1.10038.001
    Available download formats: zip
    Dataset updated
    Oct 15, 2023
    License

    https://www.neonscience.org/data-samples/data-policies-citation

    Time period covered
    May 2014 - Oct 2023
    Area covered
    ORNL, ABBY, WOOD, LAJA, JERC, SCBI, UKFS, KONZ, RMNP, DSNY
    Description

    COI DNA sequences from select mosquitoes

  9. BOLD Insecta and Araneae data files for "Mining biodiversity databases...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Oct 24, 2023
    Cite
    Matthew R Moore (2023). BOLD Insecta and Araneae data files for "Mining biodiversity databases establishes a global baseline of cosmopolitan Insecta mOTUs: a case study on Platygastroidea (Hymenoptera) with consequences for biological control programs" [Dataset]. https://data.niaid.nih.gov/resources?id=ZENODO_7930406
    Dataset updated
    Oct 24, 2023
    Dataset provided by
    Zach Lahey
    Cheryl G Roberts
    Jonathan S Bremer
    Elijah J Talamas
    James C Fulton
    Matthew R Moore
    Jessica Awad
    Natalie McGathey
    Lynn A Combee
    Description

    These are the BOLD Insecta and Araneae DWC files for use with the automated scripting procedure from "Mining biodiversity databases establishes a global baseline of cosmopolitan Insecta mOTUs: a case study on Platygastroidea (Hymenoptera) with consequences for biological control programs".

  10. NEON (National Ecological Observatory Network) Fish sequences DNA barcode...

    • data.neonscience.org
    zip
    Updated Jun 19, 2025
    + more versions
    Cite
    (2025). NEON (National Ecological Observatory Network) Fish sequences DNA barcode (DP1.20105.001), RELEASE-2025 [Dataset]. http://doi.org/10.48443/90q6-en82
    Available download formats: zip
    Dataset updated
    Jun 19, 2025
    License

    https://www.neonscience.org/data-samples/data-policies-citation

    Time period covered
    Nov 2017 - Dec 2023
    Area covered
    LIRO, TOOK, LEWI, MAYF, REDB, BLDE, OKSR, ARIK, KING, HOPB
    Description

    COI DNA sequences from select fish in lakes and wadeable streams

  11. Data_Sheet_1_Sixteen Years of DNA Barcoding in China: What Has Been Done?...

    • frontiersin.figshare.com
    zip
    Updated Jun 6, 2023
    Cite
    Cai-qing Yang; Qing Lv; Ai-bing Zhang (2023). Data_Sheet_1_Sixteen Years of DNA Barcoding in China: What Has Been Done? What Can Be Done?.zip [Dataset]. http://doi.org/10.3389/fevo.2020.00057.s001
    Available download formats: zip
    Dataset updated
    Jun 6, 2023
    Dataset provided by
    Frontiers
    Authors
    Cai-qing Yang; Qing Lv; Ai-bing Zhang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    China
    Description

    Over the past 16 years, more than half (59.68%) of research papers in China on DNA barcoding have been published in Chinese rather than English. Using the records in the BOLD (Barcode of Life Data) system, we found Chinese scientists have contributed nearly 120,000 DNA barcodes for more than 16,000 species as of September 2019, with barcoded species distributed throughout China. Based on 2,624 articles and 494 dissertations published during the last 16 years, we reviewed the basic statistics of these studies as well as the type of articles contributed by Chinese scientists, the preference of taxonomic groups, the characteristic of barcoding studies in China, the current limitations, and potential future directions as well. We found that most barcode data pertain primarily to plants and animals. Most work in China has focused on verification of the authenticity of species used in traditional Chinese medicine, while other applications have paid more attention to food safety, inspection and quarantine, and the control of pests and invasive species. In methodology and technology, a number of new DNA barcoding methods have been developed by Chinese scientists. However, there are several significant limitations to research into DNA barcoding in China in general, such as the lack of leadership in pioneering international projects, the absence of an open bioinformatics infrastructure, and the fact that some Chinese journals do not clearly require data transparency and availability for DNA barcodes, impeding the further development of barcode libraries and research in China. In the future, Chinese scientists should build authoritative online libraries, while aiming for theoretical innovations for both concepts and methodology of DNA barcoding.

  12. Extended COI alignment and detailed dataset information (603 bp, 1946...

    • figshare.com
    xlsx
    Updated Feb 22, 2025
    + more versions
    Cite
    Patrik Macko (2025). Extended COI alignment and detailed dataset information (603 bp, 1946 sequences) of aquatic beetles, including 65 of the 68 species recorded in this study and public sequences retrieved from the BOLD database. [Dataset]. http://doi.org/10.6084/m9.figshare.28463552.v1
    Available download formats: xlsx
    Dataset updated
    Feb 22, 2025
    Dataset provided by
    figshare
    Authors
    Patrik Macko
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    High altitudes as reservoirs of unique genetic diversity: a case study on aquatic beetles in glacial lakes of Tatra Mountains

    Patrik Macko (1), Fedor Čiampor Jr (2), Michaela Šamulková (1,2), Ondrej Vargovčík (1), Kornélia Tuhrinová (1,2), Zuzana Čiamporová-Zaťovičová (1,2)

    (1) Department of Ecology, Faculty of Natural Sciences, Comenius University in Bratislava, Slovak Republic
    (2) ZooLab, Department of Biodiversity and Ecology, Plant Science and Biodiversity Centre, Slovak Academy of Sciences, Slovak Republic

  13. Identification success using the Kimura-2 parameter distances with three...

    • plos.figshare.com
    xls
    Updated Jun 7, 2023
    Cite
    Stanislas Talaga; Céline Leroy; Amandine Guidez; Isabelle Dusfour; Romain Girod; Alain Dejean; Jérôme Murienne (2023). Identification success using the Kimura-2 parameter distances with three different criteria: ‘Nearest-neighbour’, ‘best close match’ and ‘BOLD ID’. [Dataset]. http://doi.org/10.1371/journal.pone.0176993.t001
    Available download formats: xls
    Dataset updated
    Jun 7, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Stanislas Talaga; Céline Leroy; Amandine Guidez; Isabelle Dusfour; Romain Girod; Alain Dejean; Jérôme Murienne
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Identification success using the Kimura-2 parameter distances with three different criteria: ‘Nearest-neighbour’, ‘best close match’ and ‘BOLD ID’.
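
    For readers unfamiliar with the distance measure named here, the Kimura 2-parameter (K2P) distance between two aligned sequences is d = -1/2 ln((1 - 2P - Q) sqrt(1 - 2Q)), where P is the proportion of sites that differ by a transition and Q the proportion that differ by a transversion. The sketch below is a minimal implementation of that formula only; it does not reproduce the three identification criteria of the study, and skipping sites with gaps or ambiguous bases is an assumption made for this example.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq1: str, seq2: str) -> float:
    """Kimura 2-parameter distance between two aligned DNA sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = compared = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        # Assumption: sites with gaps or ambiguity codes are ignored.
        if a not in "ACGT" or b not in "ACGT":
            continue
        compared += 1
        if a == b:
            continue
        pair = {a, b}
        if pair <= PURINES or pair <= PYRIMIDINES:
            transitions += 1      # A<->G or C<->T
        else:
            transversions += 1    # purine <-> pyrimidine
    if compared == 0:
        raise ValueError("no comparable sites")
    p = transitions / compared
    q = transversions / compared
    # math.log raises ValueError if the sequences are too divergent
    # for the correction to be defined.
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


# One transition in ten sites: slightly larger than the raw p-distance 0.1.
print(k2p_distance("AAAAAAAAAA", "GAAAAAAAAA"))
```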

  14. Data from: Gaps in DNA sequence libraries for Macaronesian marine...

    • zenodo.org
    • data.niaid.nih.gov
    • +1 more
    bin
    Updated Jun 4, 2022
    Cite
    Pedro Vieira; Ana Sofia Lavrador; Manuela Parente; Paola Parretti; Ana Costa; Filipe Costa; Sofia Duarte (2022). Gaps in DNA sequence libraries for Macaronesian marine macroinvertebrates imply decades till completion and robust monitoring [Dataset]. http://doi.org/10.5061/dryad.sf7m0cg63
    Available download formats: bin
    Dataset updated
    Jun 4, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Pedro Vieira; Ana Sofia Lavrador; Manuela Parente; Paola Parretti; Ana Costa; Filipe Costa; Sofia Duarte
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Macaronesia
    Description

    Aim:

    DNA metabarcoding has great potential to improve biomonitoring in island marine ecosystems, which are highly vulnerable to global change and non-indigenous species (NIS) introductions. However, the depth and accuracy of the taxonomic identifications are largely dependent on reference libraries containing representative and reliable sequences for the targeted species. In this study, we evaluated the gaps in the availability of DNA sequences, and their accuracy, for macroinvertebrates inhabiting Macaronesia's shallow marine habitats.

    Location:

    Macaronesia (Azores, Madeira, Selvagens, Canaries).

    Methods:

    Checklists of marine invertebrates occurring above 50m depth were compiled using public databases and published checklists. The availability of cytochrome c oxidase subunit I (COI) and 18S rRNA (18S) gene sequences was verified in BOLD and GenBank. Finally, COI data was audited to check the congruence between morphospecies and Barcode Index Numbers (BINs).

    Results:

    The taxonomic coverage of different phyla was greater for COI but unbalanced and variable among archipelagos. NIS were better represented in genetic databases (up to 73% and 59%, for COI and 18S, respectively) than native species (up to 47% and 31%, for COI and 18S, respectively). NIS displayed a higher number of discordant records, while native species showed more cases of multiple BINs. Notably, DNA sequences generated from specimens collected from Macaronesia were found in less than 10% of the species. Analysis of the rates of accretion of DNA sequences suggests that decades will be needed to complete these reference libraries.

    Main conclusions:

    The level of completion of reference libraries for Macaronesia's marine macroinvertebrates is generally poor. Without a strong effort to speed up the production of sequence data (i.e., generate more DNA barcodes), the ability to employ DNA-based biomonitoring of such vulnerable fauna is compromised. The high levels of suspected hidden diversity reported here further deepen the expected gaps and reinforce the vulnerability of this endemism-rich fauna.

  15. Data from: Integrating UCE phylogenomics with traditional taxonomy reveals a...

    • datadryad.org
    • data.niaid.nih.gov
    • +2 more
    zip
    Updated Dec 29, 2021
    Cite
    Michael G. Branstetter; John T. Longino (2021). Integrating UCE phylogenomics with traditional taxonomy reveals a trove of New World Syscia species (Formicidae, Dorylinae) [Dataset]. http://doi.org/10.5061/dryad.08kprr50s
    Available download formats: zip
    Dataset updated
    Dec 29, 2021
    Dataset provided by
    Dryad
    Authors
    Michael G. Branstetter; John T. Longino
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Time period covered
    2020
    Description

    We have deposited data and results files that support the molecular phylogenetic analyses presented in the study. Raw Illumina reads and contigs representing UCE loci have been deposited at the NCBI Sequence Read Archive and GenBank, respectively (BioProject# PRJNA615631). All newly generated COI sequences have been deposited at GenBank (MT267540-MT267668). Here we have deposited the concatenated UCE matrix, the COI matrix, all Trinity contigs, all tree files, unfiltered alignment files, and additional data analysis files (partitioning schemes, log files). The methods used to generate these data are described below and in the accompanying paper.

    DNA sequence generation: We selected 130 specimens for inclusion in molecular phylogenomic analysis (Table S1): 128 Syscia and two outgroup specimens from the genus Ooceraea. All sequence data were newly generated for this study, except for 5 samples, for which data were extracted from Oxley et al. (2014; Genome), Branstetter et al. (2017), and Borowiec (2019) (see Table S1). Vouchers were designated for each extraction and may be the same specimen (non-destructive DNA extraction) or, with varying degrees of subjectivity, a specimen from the same nest, collection series, or, rarely, population. Full voucher specimen details are in Supplementary Material, Table S2.

    To examine species boundaries and phylogenetic relationships among species and populations, we employed the UCE approach to phylogenomics (Faircloth et al. 2012, Faircloth et al. 2015, Branstetter et al. 2017), a method that combines targeted enrichment of ultraconserved elements (UCEs) with multiplexed, next-generation sequencing. All UCE molecular work was performed following the UCE methodology described in Branstetter et al. (2017). Briefly, the process involves DNA extraction, sample QC, DNA fragmentation (400-600 bp), library preparation, library pooling (equimolar pools of 10 or 11 samples), UCE enrichment, qPCR quantification, final pooling (up to 102 samples per sequencing pool), and sequencing. All sequencing was performed on an Illumina HiSeq 2500 instrument (2x125 bp v4 chemistry; Illumina Inc., San Diego, CA) by the University of Utah genomics core facility. To enrich UCE loci, we used an ant-customized bait set (“ant-specific hym-v2”) that includes 9,898 baits (120 mer) targeting 2,524 UCE loci shared across Hymenoptera and a set of legacy markers (data not used) (Branstetter et al. 2017). The ability of this bait set to successfully enrich UCE loci and resolve relationships in ants has been demonstrated in several studies (Branstetter et al. 2017, Pierce et al. 2017, Ward and Branstetter 2017, Blaimer et al. 2018, Branstetter and Longino 2019, Longino and Branstetter 2020).

    UCE matrix assembly: After sequencing, the University of Utah bioinformatics core demultiplexed the data using bcl2fastq v1.8 (Illumina, 2013) and made the data available for download. Once received, the sequence data were cleaned, assembled, and aligned using PHYLUCE v1.6 (Faircloth 2016), which includes a set of wrapper scripts that facilitates batch processing of large numbers of samples. Within the PHYLUCE environment, we used the programs ILLUMIPROCESSOR v2.0 (Faircloth 2013), which incorporates TRIMMOMATIC (Bolger et al. 2014), for quality trimming raw reads, TRINITY v2013-02-25 (Grabherr et al. 2011) for de novo assembly of reads into contigs, and LASTZ v1.0 (Harris 2007) for identifying UCE contigs from all contigs. All optional PHYLUCE settings were left at default values for these steps. For the bait sequences file needed to identify and extract UCE contigs, we used the ant-specific hym-v2 bait file. To calculate assembly statistics, including sequencing coverage, we used scripts from the PHYLUCE package (phyluce_assembly_get_trinity_coverage and phyluce_assembly_get_trinity_coverage_for_uce_loci) that call the programs BWA v0.7.7 (Li and Durbin 2010) and GATK v3.8 (McKenna et al. 2010).

    After extracting UCE contigs, we aligned each UCE locus using a stand-alone version of the program MAFFT v7.130b (Katoh and Standley 2013) and the L-INS-i algorithm. We then used a PHYLUCE wrapper to trim flanking regions and poorly aligned internal regions using the program GBLOCKS (Talavera and Castresana 2007). The program was run with reduced stringency parameters (b1:0.5, b2:0.5, b3:12, b4:7). We then used another PHYLUCE script to filter the initial set of alignments so that each alignment was required to include data for ≥ 90% of taxa. This resulted in a final set of 1,388 alignments and 1,035,633 bp of sequence data for analysis. To calculate summary statistics for the final data matrix, we used a script from the PHYLUCE package (phyluce_align_get_align_summary_data). Information related to UCE sequencing and assembly results can be found in Supplemental Material, Table S3. All steps, including the phylogenetic analyses described below, were performed on a multicore Linux workstation (40 CPUs and 512 Gb of memory).
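
    The occupancy filter described above (keep only alignments with data for ≥ 90% of taxa) can be illustrated with a short sketch. This is not the PHYLUCE script itself; the function name and the toy locus names are hypothetical, and each alignment is represented simply as the set of taxa it contains.

```python
def filter_by_occupancy(alignments: dict, all_taxa: set, min_fraction: float = 0.9) -> dict:
    """Keep alignments that contain at least min_fraction of all taxa.

    alignments maps a locus name to the set of taxa present in it.
    """
    total = len(all_taxa)
    if total == 0:
        raise ValueError("all_taxa must not be empty")
    return {
        locus: taxa
        for locus, taxa in alignments.items()
        if len(taxa & all_taxa) / total >= min_fraction
    }


# Toy data: ten taxa and two loci with 9 and 8 taxa present.
taxa = {f"taxon{i}" for i in range(10)}
loci = {
    "uce-1": taxa - {"taxon0"},             # 9/10 taxa, kept
    "uce-2": taxa - {"taxon0", "taxon1"},   # 8/10 taxa, dropped
}
print(sorted(filter_by_occupancy(loci, taxa)))
```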

    Phylogenomic analysis: To partition the UCE data for phylogenetic analysis, we used the Sliding-Window Site Characteristics based on entropy method (SWSC-EN; Tagliacollo and Lanfear 2018), which breaks UCE loci into three regions, corresponding to the right flank, core, and left flank. The theoretical underpinning of the approach comes from the observation that UCE core regions are conserved, while the flanking regions become increasingly more variable (Faircloth et al. 2012). After running the SWSC-EN algorithm, the resulting data subsets were analyzed using PARTITIONFINDER2 (Lanfear et al. 2012, Lanfear et al. 2017). For this analysis we used the rclusterf algorithm, AICc model selection criterion, and the GTR+G model of sequence evolution. The resulting best-fit partitioning scheme included 1,126 data subsets and had a significantly better log likelihood than alternative partitioning schemes (SWSC-EN: -5,608,249.502; By Locus: -5,639,169.680; Unpartitioned: -5,731,679.666).

    Using the SWSC-EN partitioning scheme, we inferred phylogenetic relationships of Syscia with the likelihood-based program IQ-TREE v1.5.5 (Nguyen et al. 2015). For the analysis we selected the “-spp” option for partitioning (linked branch lengths but allowing each partition to have its own evolutionary rate) and the GTR+F+G4 model of sequence evolution. To assess branch support, we performed 1,000 replicates of the ultrafast bootstrap approximation (UFB) (Minh et al. 2013, Hoang et al. 2018) and 1,000 replicates of the branch-based, SH-like approximate likelihood ratio test (Guindon et al. 2010). For these support measures, values ≥ 95% and ≥ 80%, respectively, signal that a clade is supported.
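
    The two support thresholds can be applied to branch labels with a small helper. The sketch assumes each label is written as "SH-aLRT/UFBoot" (for example "98.5/100"), which is how IQ-TREE is understood to report the two values when both tests are run together; check the label order in your own tree files before relying on it. The function name is hypothetical.

```python
def is_supported(label: str, min_ufb: float = 95.0, min_alrt: float = 80.0) -> bool:
    """Return True if a branch passes both support thresholds.

    Assumption: label has the form "<SH-aLRT>/<UFBoot>", both in percent.
    """
    alrt_text, ufb_text = label.split("/")
    return float(ufb_text) >= min_ufb and float(alrt_text) >= min_alrt


for label in ["98.5/100", "85/90", "70/99"]:
    print(label, is_supported(label))
```

    Requiring both values to pass is the stricter reading of the thresholds; a clade that passes only one of them would be treated as unsupported here.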

    COI barcode analysis: Due to the high abundance of mitochondrial DNA in samples and the less-than-perfect efficiency of target enrichment methods, Cytochrome Oxidase I (COI) sequence data, and sometimes entire mitochondrial genomes (see Ströher et al. 2016), are often generated as a byproduct of the UCE sequencing process. To provide a separate assessment of species identities, possibly with more samples included, we extracted COI sequences from our UCE enriched samples and combined them with Syscia COI sequences downloaded from the BOLD database (Ratnasingham and Hebert 2007) (accessed 16 May 2019). To extract COI from UCE data, we downloaded a complete 658 bp barcode sequence of a Costa Rican Syscia specimen from BOLD (Process ID ACGAE095-10, identified by us as S. benevidesae, one of the new species in this work) and used this as the bait input sequence for a PHYLUCE program (phyluce_assembly_match_contigs_to_barcodes) that extracts COI sequence from bulk sets of contigs.

    After extracting COI sequence from UCE sample data, we downloaded accessible barcode sequences from BOLD following a series of steps. First, using the BOLD workbench interface, we searched for all records matching the taxonomy search term “Syscia” or “Cerapachys”. We then copied all of the resulting Barcode Index Numbers (BINs) and performed a second search using these numbers in the identifiers field. This approach recovers taxonomically mislabeled samples because BINs group sequences into units by sequence similarity, not name (Ratnasingham and Hebert 2013). All returned sequences were downloaded, examined, and subsequently filtered to remove Old World specimens and entries with no sequence data. We also removed a misidentified sample from Madagascar and a sequence mined from GenBank that had no accompanying specimen data. Because some of the remaining sequences included private, unpublished data, we contacted data owners for permission to use the private sequences in our analyses.

    We combined the final set of BOLD sequences with the successfully extracted COI sequences from UCE samples and aligned the data using MAFFT. We visually inspected the resulting alignment for signs of pseudogenes/numts (e.g. presence of stop codons, indels, or highly divergent sequence) or other anomalies using MESQUITE v3.51 (Maddison and Maddison 2018). The final matrix was partitioned by codon position and analyzed with IQ-TREE using GTR+F+G4, 1,000 ultrafast bootstrap replicates, and 1,000 SH-like replicates. Following a preliminary analysis of all samples, we discovered that a set of 79 putative “Cerapachys” samples actually belonged to the phylogenetically distinct genus Neocerapachys. Consequently, we removed these samples from our data set and updated determinations in BOLD. Sample information for the final set of 86 BOLD specimens included in our analysis is available in Supplemental Material, Table S4.
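    One of the pseudogene checks mentioned above, the presence of stop codons, can be automated before visual inspection. The sketch below is not the authors' procedure (they inspected the alignment in MESQUITE); it assumes the invertebrate mitochondrial genetic code (NCBI translation table 5), in which TAA and TAG are the stop codons, and it scans all three forward frames because the reading frame of a given barcode fragment is not stated in the source.

```python
from typing import List

# Stop codons under the invertebrate mitochondrial code (translation table 5).
INVERT_MITO_STOPS = {"TAA", "TAG"}


def stop_codon_positions(seq: str, frame: int) -> List[int]:
    """0-based start positions of stop codons in the given forward frame.

    Alignment gaps are removed first, so positions refer to the ungapped
    sequence.
    """
    ungapped = seq.upper().replace("-", "")
    return [
        i
        for i in range(frame, len(ungapped) - 2, 3)
        if ungapped[i : i + 3] in INVERT_MITO_STOPS
    ]


def open_frames(seq: str) -> List[int]:
    """Forward frames (0, 1, 2) that contain no stop codon."""
    return [f for f in range(3) if not stop_codon_positions(seq, f)]


def looks_like_numt(seq: str) -> bool:
    """Flag a sequence for manual review if every forward frame has a stop."""
    return len(open_frames(seq)) == 0


# Hypothetical toy sequence: frame 0 reads ATG TAA GGC.
toy = "ATGTAAGGC"
frames_without_stops = open_frames(toy)
```

    A sequence flagged this way is only a candidate for closer inspection; indels and unusually divergent sequence still need to be checked in the alignment itself.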

  16. IMS-METU DNA barcodes dataset

    • edmed.seadatanet.org
    • bodc.ac.uk
    nc
    Updated May 19, 2015
    Cite
    Institute of Marine Sciences, Middle East Technical University (2015). IMS-METU DNA barcodes dataset [Dataset]. https://edmed.seadatanet.org/report/6168/
    Explore at:
    Available download formats: nc
    Dataset updated
    May 19, 2015
    Dataset authored and provided by
    Institute of Marine Sciences, Middle East Technical University
    License

    https://vocab.nerc.ac.uk/collection/L08/current/MO/

    Time period covered
    Jan 1, 2012 - Present
    Area covered
    Mediterranean Sea, Aegean Sea, Black Sea, Sea of Marmara
    Description

    The IMS-METU DNA barcodes dataset contains data on sequences of barcode genes (COI, ITS, RbcL and MatK) of mainly marine organisms from the Eastern Mediterranean, Aegean, Marmara and Black Sea and some other regions of the World Ocean. The DNA barcode data are supplemented with pictures of samples and sampling details. The DNA barcode data are periodically deposited to the Barcode of Life Data Systems BOLD (http://www.boldsystems.org/). BOLD is a web platform that provides an integrated environment for the assembly and use of DNA barcode data. It delivers an online database for the collection and management of specimen, distributional, and molecular data as well as analytical tools to support their validation. Since its launch in 2005, BOLD has been extended to provide a range of functionality including data organisation, validation, visualisation and publication. BOLD shares a tightly integrated data exchange pipeline with NCBI (GenBank). GenBank puts a default 1-year privacy period on records submitted through BOLD, where the records are deposited in GenBank but are still inaccessible to the public. This privacy period allows BOLD users to gain accessions early in the manuscript writing process and removes the need for rushing to gain accessions once the manuscript is in its final stages of acceptance by a journal. Then the data will be freely available. (http://www.boldsystems.org/index.php/resources/handbook?chapter=6_managingdata.html&section=publication).

  17. GenBank + BOLD CO1 Eukaryotic representative sequence set

    • data.niaid.nih.gov
    Updated Feb 27, 2020
    Cite
    Morien, Evan (2020). GenBank + BOLD CO1 Eukaryotic representative sequence set [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3688867
    Explore at:
    Dataset updated
    Feb 27, 2020
    Dataset authored and provided by
    Morien, Evan
    License

    https://github.com/copyleft-next/copyleft-next/blob/master/Releases/copyleft-next-0.3.1

    Description

    This is a representative sequence set for cytochrome oxidase subunit 1 (CO1 or COI) combining all available eukaryotic CO1 sequences from GenBank and BOLD, clustered at 99% similarity.

    TODO:

    generate and add 7-level taxonomies for each sequence in this rep set.

  18. Evaluating the genetic variation of the COI gene of Insecta: Implications...

    • zenodo.org
    • datadryad.org
    zip
    Updated Dec 16, 2022
    Cite
    Haiguang Zhang; Wenjun Bu (2022). Evaluating the genetic variation of the COI gene of Insecta: Implications for DNA barcoding, metabarcoding and species delimitation studies [Dataset]. http://doi.org/10.5061/dryad.qnk98sff2
    Explore at:
    Available download formats: zip
    Dataset updated
    Dec 16, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Haiguang Zhang; Wenjun Bu
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    The genetic variation of the COI gene has a great effect on the final results of the species delimitation studies. However, little research has comprehensively investigated the genetic divergence in COI among Insecta. The fast-growing COI data in BOLD provide an opportunity for comprehensively appraising the genetic variation in COI among Insecta. We calculated the K2P distance of 64,414 insect species downloaded from BOLD. The match ratios of the clustering analysis based on different thresholds were compared among 4,288 genera (35,068 species). Besides, we also compared the match ratios obtained from two species delimitation methods: the clustering analysis (distance-based method) and the bPTP analysis (tree-based method). Furthermore, the effectiveness of two different results of the bPTP analysis: bPTP_h and bPTP_ml was also tested. Approximately one-quarter of the species of Insecta showed high intraspecific genetic variation (> 3%), and a conservative estimate of this value is 12.05-22.58%. The application of empirical thresholds (e.g., 2% and 3%) in the clustering analysis may result in the overestimation of species diversity. In metabarcoding studies, a threshold of 3% can only be used to estimate the insect diversity roughly. As for the clustering analysis, the "threshOpt" or "localMinima" algorithms can provide a priori value for the researcher. Nevertheless, if the minimum interspecific genetic distance of congeneric species was greater than or equal to 2%, it is possible to avoid overestimating the species diversity based on the empirical thresholds. Besides, the match ratios of the bPTP_ml results were higher than those of the bPTP_h results. As for the bPTP analysis, the bPTP_ml results were recommended. If a proper threshold was selected, the clustering analysis may outperform the bPTP analysis.
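    The K2P (Kimura two-parameter) distance used above has a closed form, d = -1/2 ln((1 - 2P - Q) sqrt(1 - 2Q)), where P and Q are the proportions of transitions and transversions between two aligned sequences. The Python sketch below is an illustration of that formula, not the dataset authors' code; ignoring sites where either sequence has a gap or ambiguous base is an assumption.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq_a: str, seq_b: str) -> float:
    """Kimura two-parameter distance between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    sites = transitions = transversions = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguous bases (assumption)
        sites += 1
        if a == b:
            continue
        same_class = ({a, b} <= PURINES) or ({a, b} <= PYRIMIDINES)
        if same_class:
            transitions += 1
        else:
            transversions += 1
    if sites == 0:
        raise ValueError("no comparable sites")
    p = transitions / sites
    q = transversions / sites
    inner = (1.0 - 2.0 * p - q) * math.sqrt(1.0 - 2.0 * q)
    if inner <= 0.0:
        raise ValueError("sequences too divergent for the K2P formula")
    return -0.5 * math.log(inner)


def exceeds_threshold(seq_a: str, seq_b: str, threshold: float = 0.03) -> bool:
    """True when the K2P distance is above the threshold (3% by default)."""
    return k2p_distance(seq_a, seq_b) > threshold


# Hypothetical toy pair with a single A<->G transition in 10 sites.
d = k2p_distance("AAAAAAAAAA", "GAAAAAAAAA")
```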

  19. Insektmobilen - National citizen science and DNA metabarcoding survey of...

    • gbif.org
    Updated Aug 22, 2024
    Cite
    Cecilie Svenningsen; Anders P. Tøttrup (2024). Insektmobilen - National citizen science and DNA metabarcoding survey of flying insects in June 2018 and 2019 [Dataset]. http://doi.org/10.15468/m48e3e
    Explore at:
    Dataset updated
    Aug 22, 2024
    Dataset provided by
    Global Biodiversity Information Facility (https://www.gbif.org/)
    Natural History Museum of Denmark
    Authors
    Cecilie Svenningsen; Anders P. Tøttrup
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Time period covered
    Jun 1, 2018 - Jun 30, 2018
    Description

    The Insectmobile (Insektmobilen) is a research project at the Natural History Museum of Denmark, University of Copenhagen, with the goal of investigating the diversity of flying insects in Denmark. In the summers of 2018, 2019 and 2020, almost 400 volunteers collected flying insects using large custom-made insect nets mounted on the roofs of their cars. The bulk insect samples were processed with a non-destructive DNA extraction and DNA metabarcoding protocol (dx.doi.org/10.17504/protocols.io.bmunk6ve), and sequences were assigned taxonomy by importing the FASTA file into GBIF's sequence ID tool (https://www.gbif.org/tools/sequence-id). The sequences were queried against a 99% clustered version of the BOLD Public Database v2024-01-06 (COI-5P sequences).

    The dataset contains unidentified sequences and potential errors and contaminants. For example, even though the primers used were developed as universal primers targeting freshwater insects, you will find other phyla, classes etc. We share these sequences and associated data for the data to be as open as possible, but please do reannotate sequences and filter the data for your specific needs prior to using the data for analysis. Please be aware that the samples may contain gut content of sampled insects and eDNA.

    Sequence identification certainty is captured in the identificationRemarks field. The bit score is a log2-scaled and normalized raw score; it indicates the required size of a sequence database in which the current match could be found just by chance. Each increase by one doubles the required database size (2^bit-score). The expect value describes the number of hits one can "expect" to see by chance when searching a database of a particular size. It decreases exponentially as the score of the match increases, so a lower expect value is better. Query coverage is how much of the query (input) sequence aligns with the match in the reference database, in percent. Badges represent different identity thresholds (match types):

    • Blast exact match: identity >= 99% and queryCoverage >= 80%. This is within the threshold of the OTU.
    • Blast ambiguous match: identity >= 99% and queryCoverage >= 80%, but there is at least one more match with similar identity.
    • Blast close match: identity < 99% but > 90% and queryCoverage >= 80%. It is something close to the OTU, maybe the same genus.
    • Blast weak match: there is a match, but with identity < 90% and/or queryCoverage < 80%. Depending on the quality of the sequence, bit score, identity and expect value, a higher taxon could be inferred from this.
    • Blast no match: no match to the reference database.
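    The match-type rules above can be written as a small decision function, which is useful when re-filtering the data. The Python sketch below is an illustration and not GBIF's implementation; the function and argument names are hypothetical. The stated rules leave an identity of exactly 90% unassigned, so the sketch treats it as a weak match, which is an assumption.

```python
from typing import Optional


def required_database_size(bit_score: float) -> float:
    """Database size at which a match this good is expected by chance."""
    return 2.0 ** bit_score


def classify_match(
    identity: Optional[float],
    query_coverage: Optional[float],
    has_similar_alternative: bool = False,
) -> str:
    """Assign a match type from percent identity and percent query coverage.

    Pass None for both values when the search returned no hit.
    """
    if identity is None or query_coverage is None:
        return "no match"
    if identity >= 99.0 and query_coverage >= 80.0:
        # Same thresholds; "ambiguous" if another hit has similar identity.
        return "ambiguous match" if has_similar_alternative else "exact match"
    if 90.0 < identity < 99.0 and query_coverage >= 80.0:
        return "close match"
    # Everything else with a hit, including identity == 90 (assumption).
    return "weak match"


# Hypothetical records: (identity %, query coverage %, similar alternative?)
records = [(99.4, 95.0, False), (99.4, 95.0, True), (95.0, 85.0, False), (95.0, 70.0, False)]
badges = [classify_match(i, c, alt) for i, c, alt in records]
```

    Keeping the thresholds in one function makes it straightforward to apply stricter cut-offs when filtering the data for a specific analysis.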

  20. Data from: DNA Barcode database of Marine Species and Freshwater Fish and...

    • data.gov.au
    x-httpd-php
    Updated Sep 13, 2011
    Cite
    CSIRO Oceans & Atmosphere - Hobart (2011). DNA Barcode database of Marine Species and Freshwater Fish and Associated Samples [Dataset]. https://data.gov.au/dataset/ds-marlin-daa99604-3949-48df-bff3-448acacc23cc
    Explore at:
    Available download formats: x-httpd-php
    Dataset updated
    Sep 13, 2011
    Dataset provided by
    CSIRO Oceans & Atmosphere - Hobart
    Description

    The Barcode of Life Data System (BOLD) is an informatics workbench aiding the acquisition, storage, analysis and publication of DNA barcode records. CSIRO Marine and Atmospheric Research (CMAR) contributes to this database; as of May 2008, it has contributed about 1000 species of fish, mostly from multiple samples, along with ~100 species of decapods and ~100 species of echinoderms (marine invertebrates). There is DNA data for a specific gene (COI). The collected data include GPS location, date, depth, who collected and identified the sample, and, for some records, photos. The samples used in providing the information to the database from CMAR are housed at the Marine Laboratories in Hobart.
