Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
COInr is a non-redundant, comprehensive database of COI sequences extracted from NCBI-nt and BOLD. It is not limited to a taxon, a gene region, or a taxonomic resolution. Sequences are dereplicated between databases and within taxa.
Each taxon has a unique taxonomic identifier (taxID), which is essential for avoiding ambiguous associations of homonyms and synonyms in the source database. TaxIDs form a coherent hierarchical system fully compatible with the NCBI taxIDs, which makes it possible to build full or ranked lineages.
COInr is a good starting point for creating custom databases according to the users’ needs, using the mkCOInr scripts available at https://github.com/meglecz/mkCOInr.
It is possible to select or eliminate sequences for a list of taxa, select a specific gene region, select for a minimum taxonomic resolution, add new custom sequences, and format the database for the BLAST, QIIME, and RDP classifiers.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset description
This item contains COI (mitochondrial cytochrome oxidase subunit I) sequences collected from the BOLD database. The fasta file bold_clustered_cleaned.fasta.gz has record ids that can be queried in the Public Data Portal, and each fasta header contains the taxonomic ranks plus the BIN ID assigned to the record. The taxonomic information for each record is also given in the tab-separated file bold_info_filtered.tsv.gz.
The file bold_clustered.sintax.fasta.gz is directly compatible with the SINTAX algorithm in vsearch, while the files bold_clustered.assignTaxonomy.fasta.gz and bold_clustered.addSpecies.fasta.gz are directly compatible with the assignTaxonomy and addSpecies functions from DADA2, respectively. The dataset was last created on December 16, 2022.
NOTE: We have noticed that the gzipped files in this upload have been compressed twice for some reason. A quick fix is to unzip any file with a ".gz" extension, rename the unzipped file by adding the ".gz" extension back, and then unzip it once more. Sorry for the inconvenience.
Methods
The code used to generate this dataset consists of a snakemake workflow wrapped into a python package that can be installed with conda (conda install -c bioconda coidb). First, sequence and taxonomic information for records in the BOLD database is downloaded from the GBIF Hosted Datasets. This data is then filtered to keep only records annotated as 'COI-5P' and assigned to a BIN ID. The taxonomic information is parsed in order to assign species names and resolve higher level ranks for each BIN ID. Sequences are processed to remove gap characters and leading and trailing Ns. After this, any sequences with remaining non-standard characters are removed. Sequences are then clustered at 100% identity using vsearch (Rognes et al. 2016). This clustering is done separately for the sequences assigned to each BIN ID.
For more information, see https://github.com/biodiversitydata-se/coidb
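The double-compression workaround and the sequence cleaning steps described for this dataset can be sketched in Python as follows. This is an illustrative sketch, not the coidb code: the function names are hypothetical, "-" is assumed to be the gap character, and "non-standard characters" is assumed to mean anything other than A, C, G and T.

```python
import gzip
import re
from typing import Optional

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any gzip stream


def gunzip_fully(data: bytes, max_layers: int = 5) -> bytes:
    """Strip gzip layers until the payload no longer looks gzipped."""
    layers = 0
    while data[:2] == GZIP_MAGIC and layers < max_layers:
        data = gzip.decompress(data)
        layers += 1
    return data


def clean_sequence(seq: str) -> Optional[str]:
    """Remove gaps and flanking Ns; return None if other characters remain."""
    seq = seq.upper().replace("-", "")  # assumption: "-" is the gap character
    seq = seq.strip("N")                # leading and trailing Ns only
    # Assumption: only A, C, G and T count as standard characters.
    if re.fullmatch(r"[ACGT]+", seq) is None:
        return None
    return seq
```

For a file on disk, the bytes read with `open(path, "rb").read()` can be passed to `gunzip_fully`, which also leaves data that was compressed only once, or not at all, in a usable state.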
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
COins is a database of COI-5P sequences of insects that includes over 532,000 representative sequences of more than 106,000 species, specifically formatted for the QIIME2 software platform. It was developed through a combination of automated and manually curated steps, starting from insect COI sequences available in the Barcode of Life Data System and selecting sequences that comply with several standards, including a species-level identification.
seq-degapped.qza --> reference sequences
taxonomy.qza --> sequence taxonomy
SklearnClassifier_COins_QIIME2_v2024.5.qza (NEW!) --> naïve Bayes taxonomic classifier trained on COins (QIIME2 version 2024.5)
SklearnClassifier_COins_QIIME2_v2023.5.qza --> naïve Bayes taxonomic classifier trained on COins (QIIME2 version 2023.5)
SklearnClassifier_COins_QIIME2_v2022.2.qza --> naïve Bayes taxonomic classifier trained on COins (QIIME2 version 2022.2)
Sequences_metadata1.tsv --> Identification procedure of the voucher specimens from which the reference sequences were developed. The identification procedure is reported for each sequence included in COins (BOLD id reported in the BOLDid reference column) and for all identical sequences within haplotypes that were removed at Step 5 of COins curation (those for which the BOLD id is not available in the BOLDid reference column). The haplotype to which each sequence belongs is reported in the Haplotype column (haplotypes of each species are labeled with increasing numbers). Identification procedure information is derived from the sequence-associated metadata provided by the BOLD system.
Sequences_metadata2.tsv --> Identical sequences belonging to different species present within COins. Each row represents a cluster of identical sequences associated with different species; the sequences included in the cluster are labeled with species name and BOLD id.
Attribution-NonCommercial-ShareAlike 2.0 (CC BY-NC-SA 2.0) https://creativecommons.org/licenses/by-nc-sa/2.0/
License information was derived automatically
Publicly available barcode records for the mitochondrial COI gene in the BOLD database (https://www.boldsystems.org/), release of 2023-03-31, reformatted to SINTAX format for the MB_Pipeline metabarcoding pipeline. The original data were released under a CC-BY-NC-SA license.
The conversion was performed with scripts from reformat-barcode-db (https://github.com/monagrland/reformat-barcode-db). See log files for full command line and options used to filter sequences and prepare the HMMs. Different clustering, genetic code, and length parameters are required for each marker gene.
The Fasta files should be indexed to UDB format with Vsearch before use.
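As a rough illustration of the indexing step, the sketch below assembles a vsearch call that converts a FASTA file to UDB format with the `--makeudb_usearch` command. The helper name and the file names are hypothetical, and vsearch must be installed separately.

```python
import shlex
from typing import List


def makeudb_command(fasta_path: str, udb_path: str) -> List[str]:
    """Build the argument list for indexing a FASTA file into UDB format."""
    return ["vsearch", "--makeudb_usearch", fasta_path, "--output", udb_path]


# Hypothetical file names, used only for illustration.
cmd = makeudb_command("bold_coi.sintax.fasta", "bold_coi.sintax.udb")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` to perform the conversion.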
All records:
European records only:
HMMer model for arthropod sequences:
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Raw data from the Barcode of Life Database which was downloaded on 2023-11-19. It is formatted as a CSV file. It contains raw COI sequences, barcode identifiers, spatial coordinates, and other metadata reported by BOLD.
Necessary code for processing the raw data is available on https://github.com/connor-french/global-insect-macrogenetics, specifically the notebook step-1_seq-filter-align-test.ipynb.
https://www.neonscience.org/data-samples/data-policies-citation
COI DNA sequences from select ground beetles
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Taxonomic identification of biological materials can be achieved through DNA barcoding, where an unknown “barcode” sequence is compared to a reference database. In many disciplines, obtaining accurate taxonomic identifications can be imperative (e.g., evolutionary biology, food regulatory compliance, forensics). The Barcode of Life DataSystems (BOLD) and GenBank are the main public repositories of DNA barcode sequences. In this study, an assessment of the accuracy and reliability of sequences in these databases was performed. To achieve this, 1) curated reference materials for plants, macro-fungi and insects were obtained from national collections, 2) relevant barcode sequences (rbcL, matK, trnH-psbA, ITS and COI) from these reference samples were generated and used for searching against both databases, and 3) optimal search parameters were determined that ensure the best match to the known species in either database. While GenBank outperformed BOLD for species-level identification of insect taxa (53% and 35%, respectively), both databases performed comparably for plants and macro-fungi (~81% and ~57%, respectively). Results illustrated that using a multi-locus barcode approach increased identification success. This study outlines the utility of the BLAST search tool in GenBank and the BOLD identification engine for taxonomic identifications and identifies some precautions needed when using public sequence repositories in applied scientific disciplines.
https://www.neonscience.org/data-samples/data-policies-citation
COI DNA sequences from select mosquitoes
These are the BOLD Insecta and Araneae DWC files for use with the automated scripting procedure from "Mining biodiversity databases establishes a global baseline of cosmopolitan Insecta mOTUs: a case study on Platygastroidea (Hymenoptera) with consequences for biological control programs".
https://www.neonscience.org/data-samples/data-policies-citation
COI DNA sequences from select fish in lakes and wadeable streams
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Over the past 16 years, more than half (59.68%) of research papers in China on DNA barcoding have been published in Chinese rather than English. Using the records in the BOLD (Barcode of Life Data) system, we found Chinese scientists have contributed nearly 120,000 DNA barcodes for more than 16,000 species as of September 2019, with barcoded species distributed throughout China. Based on 2,624 articles and 494 dissertations published during the last 16 years, we reviewed the basic statistics of these studies as well as the type of articles contributed by Chinese scientists, the preference of taxonomic groups, the characteristic of barcoding studies in China, the current limitations, and potential future directions as well. We found that most barcode data pertain primarily to plants and animals. Most work in China has focused on verification of the authenticity of species used in traditional Chinese medicine, while other applications have paid more attention to food safety, inspection and quarantine, and the control of pests and invasive species. In methodology and technology, a number of new DNA barcoding methods have been developed by Chinese scientists. However, there are several significant limitations to research into DNA barcoding in China in general, such as the lack of leadership in pioneering international projects, the absence of an open bioinformatics infrastructure, and the fact that some Chinese journals do not clearly require data transparency and availability for DNA barcodes, impeding the further development of barcode libraries and research in China. In the future, Chinese scientists should build authoritative online libraries, while aiming for theoretical innovations for both concepts and methodology of DNA barcoding.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
High altitudes as reservoirs of unique genetic diversity: a case study on aquatic beetles in glacial lakes of Tatra Mountains
Patrik Macko1, Fedor Čiampor Jr2, Michaela Šamulková1,2, Ondrej Vargovčík1, Kornélia Tuhrinová1,2, Zuzana Čiamporová-Zaťovičová1,2
1 Department of Ecology, Faculty of Natural Sciences, Comenius University in Bratislava, Slovak Republic
2 ZooLab, Department of Biodiversity and Ecology, Plant Science and Biodiversity Centre, Slovak Academy of Sciences, Slovak Republic
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Identification success using the Kimura-2 parameter distances with three different criteria: ‘Nearest-neighbour’, ‘best close match’ and ‘BOLD ID’.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Aim:
DNA metabarcoding has great potential to improve biomonitoring in island marine ecosystems, which are highly vulnerable to global change and non-indigenous species (NIS) introductions. However, the depth and accuracy of the taxonomic identifications are largely dependent on reference libraries containing representative and reliable sequences for the targeted species. In this study, we evaluated the gaps in the availability of DNA sequences, and their accuracy, for macroinvertebrates inhabiting Macaronesia's shallow marine habitats.
Location:
Macaronesia (Azores, Madeira, Selvagens, Canaries).
Methods:
Checklists of marine invertebrates occurring above 50 m depth were compiled using public databases and published checklists. The availability of cytochrome c oxidase subunit I (COI) and 18S rRNA (18S) gene sequences was verified in BOLD and GenBank. Finally, COI data were audited to check the congruence between morphospecies and Barcode Index Numbers (BINs).
Results:
The taxonomic coverage of different phyla was greater for COI but unbalanced and variable among archipelagos. NIS were better represented in genetic databases (up to 73% and 59%, for COI and 18S, respectively) than native species (up to 47% and 31%, for COI and 18S, respectively). NIS displayed a higher number of discordant records, while native species showed more cases of multiple BINs. Notably, DNA sequences generated from specimens collected in Macaronesia were found for less than 10% of the species. Analysis of the rates of accretion of DNA sequences suggests that decades will be needed to complete these reference libraries.
Main conclusions:
The level of completion of reference libraries for Macaronesia's marine macroinvertebrates is generally poor. Without a strong effort to speed up the production of sequence data (i.e., generate more DNA barcodes), the ability to employ DNA-based biomonitoring of such vulnerable fauna is compromised. The high levels of suspected hidden diversity reported here further deepen the expected gaps and reinforce the vulnerability of this endemism-rich fauna.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
We have deposited data and results files that support the molecular phylogenetic analyses presented in the study. Raw Illumina reads and contigs representing UCE loci have been deposited at the NCBI Sequence Read Archive and GenBank, respectively (BioProject# PRJNA615631). All newly generated COI sequences have been deposited at GenBank (MT267540-MT267668). Here we have deposited the concatenated UCE matrix, the COI matrix, all Trinity contigs, all tree files, unfiltered alignment files, and additional data analysis files (partitioning schemes, log files). The methods used to generate these data are described below and in the accompanying paper.
DNA sequence generation: We selected 130 specimens for inclusion in molecular phylogenomic analysis (Table S1): 128 Syscia and two outgroup specimens from the genus Ooceraea. All sequence data were newly generated for this study, except for 5 samples, for which data were extracted from Oxley et al. (2014; Genome), Branstetter et al. (2017), and Borowiec (2019) (see Table S1). Vouchers were designated for each extraction and may be the same specimen (non-destructive DNA extraction) or with varying degrees of subjectivity from the same nest, collection series, or rarely, population. Full voucher specimen details are in Supplementary Material, Table S2.
To examine species boundaries and phylogenetic relationships among species and populations, we employed the UCE approach to phylogenomics (Faircloth et al. 2012, Faircloth et al. 2015, Branstetter et al. 2017), a method that combines targeted enrichment of ultraconserved elements (UCEs) with multiplexed, next-generation sequencing. All UCE molecular work was performed following the UCE methodology described in Branstetter et al. (2017). Briefly, the process involves DNA extraction, sample QC, DNA fragmentation (400-600 bp), library preparation, library pooling (equimolar pools of 10 or 11 samples), UCE enrichment, qPCR quantification, final pooling (up to 102 samples per sequencing pool), and sequencing. All sequencing was performed on an Illumina HiSeq 2500 instrument (2x125 bp v4 chemistry; Illumina Inc., San Diego, CA) by the University of Utah genomics core facility. To enrich UCE loci, we used an ant-customized bait set (“ant-specific hym-v2”) that includes 9,898 baits (120 mer) targeting 2,524 UCE loci shared across Hymenoptera and a set of legacy markers (data not used) (Branstetter et al. 2017). The ability of this bait set to successfully enrich UCE loci and resolve relationships in ants has been demonstrated in several studies (Branstetter et al. 2017, Pierce et al. 2017, Ward and Branstetter 2017, Blaimer et al. 2018, Branstetter and Longino 2019, Longino and Branstetter 2020).
UCE matrix assembly: After sequencing, the University of Utah bioinformatics core demultiplexed the data using bcl2fastq v1.8 (Illumina, 2013) and made the data available for download. Once received, the sequence data were cleaned, assembled and aligned using PHYLUCE v1.6 (Faircloth 2016), which includes a set of wrapper scripts that facilitates batch processing of large numbers of samples. Within the PHYLUCE environment, we used the programs ILLUMIPROCESSOR v2.0 (Faircloth 2013), which incorporates TRIMMOMATIC (Bolger et al. 2014), for quality trimming raw reads, TRINITY v2013-02-25 (Grabherr et al. 2011) for de novo assembly of reads into contigs, and LASTZ v1.0 (Harris 2007) for identifying UCE contigs from all contigs. All optional PHYLUCE settings were left at default values for these steps. For the bait sequences file needed to identify and extract UCE contigs, we used the ant-specific hym-v2 bait file. To calculate assembly statistics, including sequencing coverage, we used scripts from the PHYLUCE package (phyluce_assembly_get_trinity_coverage and phyluce_assembly_get_trinity_coverage_for_uce_loci) that call the programs BWA v0.7.7 (Li and Durbin 2010) and GATK v3.8 (McKenna et al. 2010).
After extracting UCE contigs, we aligned each UCE locus using a stand-alone version of the program MAFFT v7.130b (Katoh and Standley 2013) and the L-INS-i algorithm. We then used a PHYLUCE wrapper to trim flanking regions and poorly aligned internal regions using the program GBLOCKS (Talavera and Castresana 2007). The program was run with reduced stringency parameters (b1:0.5, b2:0.5, b3:12, b4:7). We then used another PHYLUCE script to filter the initial set of alignments so that each alignment was required to include data for ≥ 90% of taxa. This resulted in a final set of 1,388 alignments and 1,035,633 bp of sequence data for analysis. To calculate summary statistics for the final data matrix, we used a script from the PHYLUCE package (phyluce_align_get_align_summary_data). Information related to UCE sequencing and assembly results can be found in Supplemental Material, Table S3. All steps, including the phylogenetic analyses described below, were performed on a multicore Linux workstation (40 CPUs and 512 Gb of memory).
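The taxon-occupancy filter described above (each alignment must include data for ≥ 90% of taxa) can be illustrated with a minimal Python sketch. This is not the PHYLUCE script used in the study. The function name and the input structure (a mapping from locus name to the set of taxa present) are assumptions made for illustration.

```python
from typing import Dict, List, Set


def filter_by_occupancy(
    alignments: Dict[str, Set[str]],
    n_taxa: int,
    min_percent: int = 90,
) -> List[str]:
    """Return the loci whose alignments contain at least min_percent of taxa."""
    kept = []
    for locus, taxa_present in alignments.items():
        # Integer arithmetic avoids floating-point edge cases at the cut-off.
        if len(taxa_present) * 100 >= min_percent * n_taxa:
            kept.append(locus)
    return sorted(kept)
```

With 130 specimens, for example, a 90% threshold corresponds to a minimum of 117 taxa per alignment.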
Phylogenomic analysis: To partition the UCE data for phylogenetic analysis, we used the Sliding-Window Site Characteristics based on entropy method (SWSC-EN; Tagliacollo and Lanfear 2018), which breaks UCE loci into three regions, corresponding to the right flank, core, and left flank. The theoretical underpinning of the approach comes from the observation that UCE core regions are conserved, while the flanking regions become increasingly more variable (Faircloth et al. 2012). After running the SWSC-EN algorithm, the resulting data subsets were analyzed using PARTITIONFINDER2 (Lanfear et al. 2012, Lanfear et al. 2017). For this analysis we used the rclusterf algorithm, AICc model selection criterion, and the GTR+G model of sequence evolution. The resulting best-fit partitioning scheme included 1,126 data subsets and had a significantly better log likelihood than alternative partitioning schemes (SWSC-EN: -5,608,249.502; By Locus: -5,639,169.680; Unpartitioned: -5,731,679.666).
Using the SWSC-EN partitioning scheme, we inferred phylogenetic relationships of Syscia with the likelihood-based program IQ-TREE v1.5.5 (Nguyen et al. 2015). For the analysis we selected the “-spp” option for partitioning (linked branch lengths but allowing each partition to have its own evolutionary rate) and the GTR+F+G4 model of sequence evolution. To assess branch support, we performed 1,000 replicates of the ultrafast bootstrap approximation (UFB) (Minh et al. 2013, Hoang et al. 2018) and 1,000 replicates of the branch-based, SH-like approximate likelihood ratio test (Guindon et al. 2010). For these support measures, values ≥ 95% and ≥ 80%, respectively, signal that a clade is supported.
COI barcode analysis: Due to the high abundance of mitochondrial DNA in samples and the less-than-perfect efficiency of target enrichment methods, Cytochrome Oxidase I (COI) sequence data, and sometimes entire mitochondrial genomes (see Ströher et al. 2016) are often generated as a byproduct of the UCE sequencing process. To provide a separate assessment of species identities, possibly with more samples included, we extracted COI sequences from our UCE enriched samples and combined them with Syscia COI sequences downloaded from the BOLD database (Ratnasingham and Hebert 2007) (Accessed 16 May 2019). To extract COI from UCE data, we downloaded a complete 658 bp barcode sequence of a Costa Rican Syscia specimen from BOLD (Process ID ACGAE095-10, identified by us as S. benevidesae, one of the new species in this work) and used this as the bait input sequence for a PHYLUCE program (phyluce_assembly_match_contigs_to_barcodes) that extracts COI sequence from bulk sets of contigs.
After extracting COI sequence from UCE sample data, we downloaded accessible barcode sequences from BOLD following a series of steps. First, using the BOLD workbench interface, we searched for all records matching the taxonomy search term “Syscia” or “Cerapachys”. We then copied all of the resulting Barcode Index Numbers (BINs) and performed a second search using these numbers in the identifiers field. This approach recovers taxonomically mislabeled samples because BINs group sequences into units by sequence similarity, not name (Ratnasingham & Hebert 2013). All returned sequences were downloaded, examined, and subsequently filtered to remove Old World specimens and entries with no sequence data. We also removed a misidentified sample from Madagascar and a sequence mined from GenBank that had no accompanying specimen data. Because some of the remaining sequences included private, unpublished data, we contacted data owners for permission to use the private sequences in our analyses.
We combined the final set of BOLD sequences with the successfully extracted COI sequences from UCE samples and aligned the data using MAFFT. We visually inspected the resulting alignment for signs of pseudogenes/numts (e.g. presence of stop codons, indels, or highly divergent sequence) or other anomalies using MESQUITE v3.51 (Maddison and Maddison 2018). The final matrix was partitioned by codon position and analyzed with IQ-TREE using GTR+F+G4, 1,000 ultrafast bootstrap replicates, and 1,000 SH-like replicates. Following a preliminary analysis of all samples, we discovered that a set of 79 putative “Cerapachys” samples actually belonged to the phylogenetically distinct genus Neocerapachys. Consequently, we removed these samples from our data set and updated determinations in BOLD. Sample information for the final set of 86 BOLD specimens included in our analysis is available in Supplemental Material, Table S4.
https://vocab.nerc.ac.uk/collection/L08/current/MO/
The IMS-METU DNA barcodes dataset contains data on sequences of barcode genes (COI, ITS, RbcL and MatK) of mainly marine organisms from the Eastern Mediterranean, Aegean, Marmara and Black Sea and some other regions of the World Ocean. The DNA barcode data are supplemented with pictures of samples and sampling details. The DNA barcode data are periodically deposited in the Barcode of Life Data Systems, BOLD (http://www.boldsystems.org/). BOLD is a web platform that provides an integrated environment for the assembly and use of DNA barcode data. It delivers an online database for the collection and management of specimen, distributional, and molecular data as well as analytical tools to support their validation. Since its launch in 2005, BOLD has been extended to provide a range of functionality including data organisation, validation, visualisation and publication. BOLD shares a tightly integrated data exchange pipeline with NCBI (GenBank). GenBank puts a default 1-year privacy period on records submitted through BOLD, during which the records are deposited in GenBank but are still inaccessible to the public. This privacy period allows BOLD users to obtain accessions early in the manuscript writing process and removes the need to rush to obtain accessions once the manuscript is in its final stages of acceptance by a journal. After this period, the data become freely available (http://www.boldsystems.org/index.php/resources/handbook?chapter=6_managingdata.html&section=publication).
https://github.com/copyleft-next/copyleft-next/blob/master/Releases/copyleft-next-0.3.1
This is a representative sequence set for cytochrome oxidase subunit 1 (CO1 or COI) combining all available eukaryotic CO1 sequences from GenBank and BOLD, clustered at 99% similarity.
TODO:
generate and add 7-level taxonomies for each sequence in this rep set.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The genetic variation of the COI gene has a great effect on the final results of the species delimitation studies. However, little research has comprehensively investigated the genetic divergence in COI among Insecta. The fast-growing COI data in BOLD provide an opportunity for comprehensively appraising the genetic variation in COI among Insecta. We calculated the K2P distance of 64,414 insect species downloaded from BOLD. The match ratios of the clustering analysis based on different thresholds were compared among 4,288 genera (35,068 species). Besides, we also compared the match ratios obtained from two species delimitation methods: the clustering analysis (distance-based method) and the bPTP analysis (tree-based method). Furthermore, the effectiveness of two different results of the bPTP analysis: bPTP_h and bPTP_ml was also tested. Approximately one-quarter of the species of Insecta showed high intraspecific genetic variation (> 3%), and a conservative estimate of this value is 12.05-22.58%. The application of empirical thresholds (e.g., 2% and 3%) in the clustering analysis may result in the overestimation of species diversity. In metabarcoding studies, a threshold of 3% can only be used to estimate the insect diversity roughly. As for the clustering analysis, the "threshOpt" or "localMinima" algorithms can provide a priori value for the researcher. Nevertheless, if the minimum interspecific genetic distance of congeneric species was greater than or equal to 2%, it is possible to avoid overestimating the species diversity based on the empirical thresholds. Besides, the match ratios of the bPTP_ml results were higher than those of the bPTP_h results. As for the bPTP analysis, the bPTP_ml results were recommended. If a proper threshold was selected, the clustering analysis may outperform the bPTP analysis.
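For readers unfamiliar with the K2P (Kimura two-parameter) distance used above, it is computed from the proportion of transitions P and the proportion of transversions Q between two aligned sequences as d = -1/2 ln((1 - 2P - Q) sqrt(1 - 2Q)). The Python sketch below is an illustration only and is not the code used in the study. Skipping sites with gaps or ambiguous bases is an assumption made here.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq1: str, seq2: str) -> float:
    """Kimura two-parameter distance between two aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    sites = transitions = transversions = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        # Assumption: sites with gaps or ambiguous bases are skipped.
        if a not in "ACGT" or b not in "ACGT":
            continue
        sites += 1
        if a == b:
            continue
        pair = {a, b}
        if pair <= PURINES or pair <= PYRIMIDINES:
            transitions += 1    # A<->G or C<->T
        else:
            transversions += 1  # purine <-> pyrimidine
    if sites == 0:
        raise ValueError("no comparable sites")
    p = transitions / sites
    q = transversions / sites
    # math.log raises ValueError for saturated pairs (argument <= 0).
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))
```

A single transition across ten sites gives a distance of about 0.112, slightly above the uncorrected 0.10, because the model corrects for multiple substitutions at the same site.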
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Insectmobile (Insektmobilen) is a research project at the Natural History Museum of Denmark, University of Copenhagen, with the goal of investigating the diversity of flying insects in Denmark. In the summers of 2018, 2019 and 2020, almost 400 volunteers collected flying insects using large custom-made insect nets mounted on the roofs of their cars. The bulk insect samples were processed with a non-destructive DNA extraction and DNA metabarcoding protocol (dx.doi.org/10.17504/protocols.io.bmunk6ve), and sequences were assigned taxonomy by importing the fasta file into GBIF's sequence ID tool (https://www.gbif.org/tools/sequence-id). The sequences were queried against a 99% clustered version of the BOLD Public Database v2024-01-06 public data (COI-5P sequences).
The dataset contains unidentified sequences and potential errors and contaminants. For example, even though the primers used were developed as universal primers targeting freshwater insects, you will find other phyla, classes etc. We share these sequences and associated data for the data to be as open as possible, but please do reannotate sequences and filter the data for your specific needs prior to using the data for analysis. Please be aware that the samples may contain gut content of sampled insects and eDNA.
Sequence identification certainty is captured in the identificationRemarks field, which reports the following values:
Bit score: the required size of a sequence database in which the current match could be found just by chance. The bit score is a log2-scaled and normalized raw score. Each increase by one doubles the required database size (2^bit-score).
Expect value: a parameter that describes the number of hits one can "expect" to see by chance when searching a database of a particular size. It decreases exponentially as the score of the match increases. Hence, a low expect value is better.
Query coverage: how much of the query (input) sequence aligns with the match in the reference database, in percent.
Match type: badges representing different identity thresholds.
Blast exact match = identity >= 99% and queryCoverage >= 80%. This is within the threshold of the OTU.
Blast ambiguous match = identity >= 99% and queryCoverage >= 80%, but there is at least one more match with similar identity.
Blast close match = identity < 99% but > 90% and queryCoverage >= 80%. It is something close to the OTU, maybe the same genus.
Blast weak match = there is a match, but with identity < 90% and/or queryCoverage < 80%. Depending on the quality of the sequence, bit score, identity and expect value, a higher taxon could be inferred from this.
Blast no match = no match to the reference database.
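The match-type thresholds above can be expressed as a small Python function, which may help when re-filtering the data. This is a sketch based only on the thresholds stated in this description, not the code of the GBIF sequence ID tool. The function and parameter names are hypothetical, and treating an identity of exactly 90% as a weak match is an assumption, since the description leaves that boundary undefined.

```python
from typing import Optional


def match_type(
    identity: Optional[float],
    query_coverage: Optional[float],
    has_similar_alternative: bool = False,
) -> str:
    """Classify a BLAST hit using the identity and coverage thresholds above."""
    if identity is None or query_coverage is None:
        return "Blast no match"
    if query_coverage >= 80:
        if identity >= 99:
            if has_similar_alternative:
                return "Blast ambiguous match"
            return "Blast exact match"
        if identity > 90:
            return "Blast close match"
    # Assumption: identity of exactly 90% falls through to the weak category.
    return "Blast weak match"


def database_size_for_bit_score(bit_score: float) -> float:
    """Database size at which a match with this bit score is expected by chance."""
    return 2.0 ** bit_score
```

For example, a hit with 95% identity and 85% query coverage is classified as a close match, while the same identity with 70% coverage is classified as a weak match.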
The Barcode of Life Data System (BOLD) is an informatics workbench aiding the acquisition, storage, analysis and publication of DNA barcode records. CSIRO Marine and Atmospheric Research (CMAR) contributes to this database; as of May 2008, it has contributed about 1000 species of fish, mostly from multiple samples, along with ~100 species of decapods and ~100 species of echinoderms (marine invertebrates). There is DNA data for a specific gene (COI). The collection data include GPS location, date, depth, and who collected and identified the sample, and some records have photos. The samples that CMAR used to provide this information to the database are housed at the Marine Laboratories in Hobart.