41 datasets found
  1. EMBL European Bioinformatics Institute - Nucleotide Sequencing Data

    • geonetwork.soosmap.aq
    Updated Apr 21, 2025
    Cite
    (2025). EMBL European Bioinformatics Institute - Nucleotide Sequencing Data [Dataset]. https://geonetwork.soosmap.aq/geonetwork/srv/search
    Explore at:
    Dataset updated
    Apr 21, 2025
    Description

    The European Molecular Biology Laboratory European Bioinformatics Institute (EMBL-EBI) is international, innovative and interdisciplinary, and a champion of open data in the life sciences. The EMBL-EBI captures and presents globally comprehensive sequence data as part of the International Nucleotide Sequence Database Collaboration. Data provided to GBIF include geotagged environmental sequences with user-provided taxonomic identifications.

    This dataset contains INSDC sequences associated with environmental sample identifiers. The dataset is prepared periodically using the public ENA API (https://www.ebi.ac.uk/ena/portal/api/) by querying data with the search parameters: `environmental_sample=True & host=""`

    EMBL-EBI also publishes other records in separate datasets (https://www.gbif.org/publisher/ada9d123-ddb4-467d-8891-806ea8d94230).

    The data was then processed as follows:

    1. Human sequences were excluded.

    2. For non-CONTIG records, the sample accession number (when available) along with the scientific name were used to identify sequence records corresponding to the same individuals (or group of organisms of the same species in the same sample). Only one record was kept for each scientific name/sample accession number.

    3. Contigs and whole genome shotgun (WGS) records were added individually.

    4. The records that were missing some information were excluded. Only records associated with a specimen voucher or records containing both a location AND a date were kept.

    5. The records associated with the same vouchers were aggregated together.

    6. Many of the remaining records corresponded to individual sequences or reads from the same organisms. In practice, these were "duplicate" occurrence records that weren't filtered out in step 2 because the sample accession was missing. To identify those potential duplicates, we grouped all the remaining records by scientific_name, collection_date, location, country, identified_by, collected_by and sample_accession (when available). Then we excluded the groups that contained more than 50 records. The rationale behind the choice of threshold is explained here: Deduplication v2 gbif/embl-adapter#10 (comment)

    7. To improve the matching of the EBI scientific name to the GBIF backbone taxonomy, we incorporated the ENA taxonomic information. The kingdom, phylum, class, order, family, and genus were obtained from the ENA taxonomy checklist available here: http://ftp.ebi.ac.uk/pub/databases/ena/taxonomy/sdwca.zip

    More information available here: https://github.com/gbif/embl-adapter#readme

    You can find the mapping used to format the EMBL data to Darwin Core Archive here: https://github.com/gbif/embl-adapter/blob/master/DATAMAPPING.md

  2. SMART

    • ebi.ac.uk
    Updated Feb 14, 2020
    Cite
    (2020). SMART [Dataset]. https://www.ebi.ac.uk/interpro/
    Explore at:
    Dataset updated
    Feb 14, 2020
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    SMART (a Simple Modular Architecture Research Tool) allows the identification and annotation of genetically mobile domains and the analysis of domain architectures. SMART is based at EMBL, Heidelberg, Germany.

  3. INSDC Environment Sample Sequences

    • gbif.org
    • researchdata.edu.au
    Updated Aug 2, 2025
    + more versions
    Cite
    European Bioinformatics Institute (EMBL-EBI); European Bioinformatics Institute (EMBL-EBI) (2025). INSDC Environment Sample Sequences [Dataset]. http://doi.org/10.15468/mcmd5g
    Explore at:
    Dataset updated
    Aug 2, 2025
    Dataset provided by
    Global Biodiversity Information Facility (https://www.gbif.org/)
    European Bioinformatics Institute (http://www.ebi.ac.uk/)
    Authors
    European Bioinformatics Institute (EMBL-EBI); European Bioinformatics Institute (EMBL-EBI)
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Description

    This dataset contains INSDC sequences associated with environmental sample identifiers. The dataset is prepared periodically using the public ENA API (https://www.ebi.ac.uk/ena/portal/api/) by querying data with the search parameters: `environmental_sample=True & host=""`

    EMBL-EBI also publishes other records in separate datasets (https://www.gbif.org/publisher/ada9d123-ddb4-467d-8891-806ea8d94230).

    The data was then processed as follows:

    1. Human sequences were excluded.

    2. For non-CONTIG records, the sample accession number (when available) along with the scientific name were used to identify sequence records corresponding to the same individuals (or group of organisms of the same species in the same sample). Only one record was kept for each scientific name/sample accession number.

    3. Contigs and whole genome shotgun (WGS) records were added individually.

    4. The records that were missing some information were excluded. Only records associated with a specimen voucher or records containing both a location AND a date were kept.

    5. The records associated with the same vouchers were aggregated together.

    6. Many of the remaining records corresponded to individual sequences or reads from the same organisms. In practice, these were "duplicate" occurrence records that weren't filtered out in step 2 because the sample accession was missing. To identify those potential duplicates, we grouped all the remaining records by `scientific_name`, `collection_date`, `location`, `country`, `identified_by`, `collected_by` and `sample_accession` (when available). Then we excluded the groups that contained more than 50 records. The rationale behind the choice of threshold is explained here: https://github.com/gbif/embl-adapter/issues/10#issuecomment-855757978

    7. To improve the matching of the EBI scientific name to the GBIF backbone taxonomy, we incorporated the ENA taxonomic information. The kingdom, phylum, class, order, family, and genus were obtained from the ENA taxonomy checklist available here: http://ftp.ebi.ac.uk/pub/databases/ena/taxonomy/sdwca.zip

    More information available here: https://github.com/gbif/embl-adapter#readme

    You can find the mapping used to format the EMBL data to Darwin Core Archive here: https://github.com/gbif/embl-adapter/blob/master/DATAMAPPING.md
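
The grouping rule in step 6 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the embl-adapter implementation: records are assumed to be plain dictionaries keyed by the field names listed in step 6, and the function name `drop_oversized_groups` is hypothetical.

```python
from collections import defaultdict

# Field names taken from step 6; a missing field is treated as None.
GROUP_KEYS = (
    "scientific_name",
    "collection_date",
    "location",
    "country",
    "identified_by",
    "collected_by",
    "sample_accession",
)
MAX_GROUP_SIZE = 50  # threshold stated in the description


def drop_oversized_groups(records, max_group_size=MAX_GROUP_SIZE):
    """Return the records whose group holds at most max_group_size members."""
    groups = defaultdict(list)
    for record in records:
        key = tuple(record.get(field) for field in GROUP_KEYS)
        groups[key].append(record)

    kept = []
    for members in groups.values():
        # Groups with more than max_group_size records are excluded entirely.
        if len(members) <= max_group_size:
            kept.extend(members)
    return kept
```

A group of exactly 50 records is kept in this sketch, because the description excludes only groups with more than 50.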

  4. SFLD

    • ebi.ac.uk
    Updated Sep 7, 2018
    Cite
    (2018). SFLD [Dataset]. https://www.ebi.ac.uk/interpro/
    Explore at:
    Dataset updated
    Sep 7, 2018
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    SFLD (Structure-Function Linkage Database) is a hierarchical classification of enzymes that relates specific sequence-structure features to specific chemical capabilities.

  5. Peptide Sequence Database

    • dknet.org
    • scicrunch.org
    • +1more
    Updated Jan 29, 2022
    + more versions
    Cite
    (2022). Peptide Sequence Database [Dataset]. http://identifiers.org/RRID:SCR_005764
    Explore at:
    Dataset updated
    Jan 29, 2022
    Description

    The Peptide Sequence Database contains putative peptide sequences from human, mouse, rat, and zebrafish. Compressed to eliminate redundancy, these are about 40-fold smaller than a brute force enumeration. Current and old releases are available for download. Each species' peptide sequence database comprises peptide sequence data from relevant species-specific UniGene and IPI clusters, plus all sequences from their constituent EST, mRNA and protein sequence databases, namely RefSeq proteins and mRNAs, UniProt's SwissProt and TrEMBL, GenBank mRNA, ESTs, and high-throughput cDNAs, HInv-DB, VEGA, EMBL, IPI protein sequences, plus the enumeration of all combinations of UniProt sequence variants, Met loss PTM, and signal peptide cleavages. The README file contains some information about the non-amino-acid symbols O (digest site corresponding to a protein N- or C-terminus) and J (no digest sequence join) used in these peptide sequence databases and information about how to configure various search engines to use them. Some search engines handle (very) long sequences badly and in some cases must be patched to use these peptide sequence databases. All search engines supported by the PepArML meta-search engine can (or can be patched to) successfully search these peptide sequence databases.

  6. INSDC Host Organism Sequences

    • gbif.org
    • researchdata.edu.au
    • +2more
    Updated Aug 2, 2025
    Cite
    European Bioinformatics Institute (EMBL-EBI); European Bioinformatics Institute (EMBL-EBI) (2025). INSDC Host Organism Sequences [Dataset]. http://doi.org/10.15468/e97kmy
    Explore at:
    Dataset updated
    Aug 2, 2025
    Dataset provided by
    Global Biodiversity Information Facility (https://www.gbif.org/)
    European Bioinformatics Institute (http://www.ebi.ac.uk/)
    Authors
    European Bioinformatics Institute (EMBL-EBI); European Bioinformatics Institute (EMBL-EBI)
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Description

    This dataset contains INSDC sequences associated with host organisms. The dataset is prepared periodically using the public ENA API (https://www.ebi.ac.uk/ena/portal/api/) using the methods described below.

    EMBL-EBI also publishes other records in separate datasets (https://www.gbif.org/publisher/ada9d123-ddb4-467d-8891-806ea8d94230).

    The data was then processed as follows:

    1. Human sequences were excluded.

    2. For non-CONTIG records, the sample accession number (when available) along with the scientific name were used to identify sequence records corresponding to the same individuals (or group of organisms of the same species in the same sample). Only one record was kept for each scientific name/sample accession number.

    3. Contigs and whole genome shotgun (WGS) records were added individually.

    4. The records that were missing some information were excluded. Only records associated with a specimen voucher or records containing both a location AND a date were kept.

    5. The records associated with the same vouchers were aggregated together.

    6. Many of the remaining records corresponded to individual sequences or reads from the same organisms. In practice, these were "duplicate" occurrence records that weren't filtered out in step 2 because the sample accession was missing. To identify those potential duplicates, we grouped all the remaining records by `scientific_name`, `collection_date`, `location`, `country`, `identified_by`, `collected_by` and `sample_accession` (when available). Then we excluded the groups that contained more than 50 records. The rationale behind the choice of threshold is explained here: https://github.com/gbif/embl-adapter/issues/10#issuecomment-855757978

    7. To improve the matching of the EBI scientific name to the GBIF backbone taxonomy, we incorporated the ENA taxonomic information. The kingdom, phylum, class, order, family, and genus were obtained from the ENA taxonomy checklist available here: http://ftp.ebi.ac.uk/pub/databases/ena/taxonomy/sdwca.zip

    More information available here: https://github.com/gbif/embl-adapter#readme

    You can find the mapping used to format the EMBL data to Darwin Core Archive here: https://github.com/gbif/embl-adapter/blob/master/DATAMAPPING.md
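
Steps 2 and 4 of the processing list can be illustrated with a short Python sketch. It is a hedged approximation rather than the adapter's actual code: records are assumed to be dictionaries, the field name `specimen_voucher` and both function names are hypothetical, keeping the first record seen is an arbitrary choice, and the separate handling of CONTIG and WGS records is left out.

```python
def keep_one_per_sample(records):
    """Step 2 (sketch): keep one record per (scientific name, sample accession)."""
    seen = set()
    kept = []
    for record in records:
        accession = record.get("sample_accession")
        if not accession:
            # Without an accession the record cannot be matched at this step.
            kept.append(record)
            continue
        key = (record.get("scientific_name"), accession)
        if key not in seen:
            seen.add(key)
            kept.append(record)  # assumption: the first record seen is kept
    return kept


def has_required_fields(record):
    """Step 4 (sketch): a voucher, or both a location AND a date."""
    has_voucher = bool(record.get("specimen_voucher"))
    has_place_and_date = bool(record.get("location")) and bool(
        record.get("collection_date")
    )
    return has_voucher or has_place_and_date
```

Records that lack a sample accession pass through step 2 unchanged in this sketch, which is why the description needs the later grouping step to catch the remaining duplicates.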

  7. PROSITE profiles

    • ebi.ac.uk
    Updated Feb 5, 2025
    Cite
    (2025). PROSITE profiles [Dataset]. https://www.ebi.ac.uk/interpro/
    Explore at:
    Dataset updated
    Feb 5, 2025
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    PROSITE is a database of protein families and domains. It consists of biologically significant sites, patterns and profiles that help to reliably identify to which known protein family a new sequence belongs. PROSITE is based at the Swiss Institute of Bioinformatics (SIB), Geneva, Switzerland.

  8. EMBL-EBI

    • cinergi.sdsc.edu
    resource url
    Updated 1994
    Cite
    (1994). EMBL-EBI [Dataset]. http://cinergi.sdsc.edu/geoportal/rest/metadata/item/a142d65045534cf4a4fe53a997707070/html
    Explore at:
    Available download formats: resource url
    Dataset updated
    1994
    Area covered
    Description

    Link Function: information

  9. ISFinder

    • neuinfo.org
    • scicrunch.org
    • +1more
    Updated Feb 1, 2001
    Cite
    (2001). ISFinder [Dataset]. http://identifiers.org/RRID:SCR_003020
    Explore at:
    Dataset updated
    Feb 1, 2001
    Description

    Database of insertion sequences isolated from eubacteria and archaea. It is organized into individual files containing their general features (name, size, origin, family, etc.) as well as their DNA and potential protein sequences. Although most of the entries have been identified as individual elements, a growing number are included from their description in sequenced bacterial genomes. The search engine permits the retrieval and display of individual ISs and groups of ISs based on a combination of their general features. Two levels of search are available. The simple search option enables the user to sort elements using a limited number of basic items, whereas the extensive search offers an additional set of possibilities such as comparisons of the sequences of terminal inverted repeats and a variety of different layout displays. Built-in links are provided to the EMBL sequence database, the NCBI taxonomy database and the ESF plasmid database. At present, sequences can only be downloaded one by one for comparison. An online BLAST facility is available, and in future versions direct access to additional analytical tools will be provided online. Direct submission of ISs is encouraged using the online form provided.

  10. CATH-Gene3D

    • ebi.ac.uk
    Updated Oct 21, 2020
    Cite
    (2020). CATH-Gene3D [Dataset]. https://www.ebi.ac.uk/interpro/
    Explore at:
    Dataset updated
    Oct 21, 2020
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The CATH-Gene3D database describes protein families and domain architectures in complete genomes. Protein families are formed using a Markov clustering algorithm, followed by multi-linkage clustering according to sequence identity. Mapping of predicted structure and sequence domains is undertaken using hidden Markov model libraries representing CATH and Pfam domains. CATH-Gene3D is based at University College London, UK.

  11. Histone Database

    • scicrunch.org
    • neuinfo.org
    • +1more
    Updated Jul 22, 2025
    Cite
    (2025). Histone Database [Dataset]. http://identifiers.org/RRID:SCR_007711
    Explore at:
    Dataset updated
    Jul 22, 2025
    Description

    Histone Database is a database of histones and their corresponding sequences. Sequence- and text-based searches were performed on NCBI's redundant and non-redundant (nr) peptide sequence databases. These databases are derived from GenBank, EMBL, and DDBJ translated DNA coding regions, plus protein sequences from the PDB (Protein Data Bank), SWISS-PROT, the PIR (Protein Information Resource), and the PRF (Protein Research Foundation). Users can search by keyword, sequence fragment, category, organism, and redundancy of the set.

  12. MetaGraph Sequence Indexes

    • registry.opendata.aws
    Updated May 10, 2025
    Cite
    Biomedical Informatics Lab, ETH Zurich, Switzerland (2025). MetaGraph Sequence Indexes [Dataset]. https://registry.opendata.aws/metagraph/
    Explore at:
    Dataset updated
    May 10, 2025
    Dataset provided by
    Biomedical Informatics Lab, ETH Zurich, Switzerland (https://bmi.inf.ethz.ch)
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    The MetaGraph Sequence Indexes dataset comprises full-text searchable index files for raw sequencing data hosted in major public repositories. These include the European Nucleotide Archive (ENA) managed by the European Bioinformatics Institute (EMBL-EBI), the Sequence Read Archive (SRA) maintained by the National Center for Biotechnology Information (NCBI), and the DNA Data Bank of Japan (DDBJ) Sequence Read Archive (DRA). All index files can be used with the MetaGraph framework for sequence search. Indexes can be used jointly for aggregated search in the cloud or downloaded individually for search on local hardware.

  13. PIRSF

    • ebi.ac.uk
    Updated Apr 7, 2020
    Cite
    (2020). PIRSF [Dataset]. https://www.ebi.ac.uk/interpro/
    Explore at:
    Dataset updated
    Apr 7, 2020
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    PIRSF protein classification system is a network with multiple levels of sequence diversity from superfamilies to subfamilies that reflects the evolutionary relationship of full-length proteins and domains. PIRSF is based at the Protein Information Resource, Georgetown University Medical Centre, Washington DC, US.

  14. GenBank Database

    • cmr.earthdata.nasa.gov
    Updated Apr 20, 2017
    Cite
    (2017). GenBank Database [Dataset]. https://cmr.earthdata.nasa.gov/search/concepts/C1214138025-SCIOPS.html
    Explore at:
    Dataset updated
    Apr 20, 2017
    Time period covered
    Jan 1, 1970 - Present
    Area covered
    Description

    GenBank is the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences. GenBank (at NCBI), together with the DNA DataBank of Japan (DDBJ) and the European Molecular Biology Laboratory (EMBL) comprise the International Nucleotide Sequence Database Collaboration. These three organizations exchange data on a daily basis.

    GenBank grows at an exponential rate, with the number of nucleotide bases doubling approximately every 14 months. Currently, GenBank contains more than 13 billion bases from over 100,000 species.
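
The growth claim above amounts to simple exponential arithmetic. The following Python sketch takes the 14-month doubling time and the figure of 13 billion bases exactly as stated in the description (both may be out of date) and assumes the doubling time stays constant; the function name `projected_bases` is hypothetical.

```python
def projected_bases(current_bases, months_ahead, doubling_months=14.0):
    """Project a base count forward, assuming a constant doubling time."""
    return current_bases * 2 ** (months_ahead / doubling_months)


# Starting from 13 billion bases, two doubling periods (28 months)
# give four times the starting count.
after_28_months = projected_bases(13e9, 28)
```

Under these assumptions the count doubles after 14 months and quadruples after 28 months.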

  15. High Throughput Genomic Sequences Division

    • rrid.site
    • dknet.org
    Updated Jun 26, 2025
    Cite
    (2025). High Throughput Genomic Sequences Division [Dataset]. http://identifiers.org/RRID:SCR_002150
    Explore at:
    Dataset updated
    Jun 26, 2025
    Description

    Database of high-throughput genome sequences from large-scale genome sequencing centers, including unfinished and finished sequences. It was created to accommodate a growing need to make unfinished genomic sequence data rapidly available to the scientific community in a coordinated effort among the International Nucleotide Sequence databases, DDBJ, EMBL, and GenBank. Sequences are prepared for submission by using NCBI's software tools Sequin or tbl2asn. Each center has an FTP directory into which new or updated sequence files are placed. Sequence data in this division are available for BLAST homology searches against either the htgs database or the month database, which includes all new submissions for the prior month. Unfinished HTG sequences containing contigs greater than 2 kb are assigned an accession number and deposited in the HTG division. A typical HTG record might consist of all the first-pass sequence data generated from a single cosmid, BAC, YAC, or P1 clone, which together make up more than 2 kb and contain one or more gaps. A single accession number is assigned to this collection of sequences, and each record includes a clear indication of the status (phase 1 or 2) plus a prominent warning that the sequence data are unfinished and may contain errors. The accession number does not change as sequence records are updated; only the most recent version of a HTG record remains in GenBank.

  16. ITS1-5.8s_ITS2 Fungal Sequences and Search Results .xlsx

    • auckland.figshare.com
    • figshare.com
    xlsx
    Updated May 17, 2019
    Cite
    Elizabeth McKenzie (2019). ITS1-5.8s_ITS2 Fungal Sequences and Search Results .xlsx [Dataset]. http://doi.org/10.17608/k6.auckland.8142947.v1
    Explore at:
    Available download formats: xlsx
    Dataset updated
    May 17, 2019
    Dataset provided by
    The University of Auckland
    Authors
    Elizabeth McKenzie
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    DNA sequences used to identify fungi cultured from human faeces. The ITS1‑5.8s‑ITS2 region of the extracted rDNA of fungal isolates was chosen to be amplified based on its success in identifying a wide range of fungal species [53]. For DNA amplification, 10.0 µL of REDExtract-N-Amp™ PCR Ready Mix; 7.8 µL of PCR-grade H2O; 0.8 µL of 10 µM forward primer (ITS1, sequence TCCGTAGGTGAACCTGCGG); 0.8 µL of 10 µM reverse primer (ITS4, sequence TCCTCCGCTTATTGATATGC); and 1.0 µL of extracted fungal DNA sample were added to a 200 µL Eppendorf PCR tube. The same method was used to prepare the negative control. PCR amplification was performed with a preliminary step of polymerase activation at 94 °C for 2 minutes; 35 cycles of denaturation at 94 °C for 30 seconds, annealing at 51 °C for 20 seconds, and extension at 77 °C for 1 minute; and a final extension step at 72 °C for 8 minutes, using the Eppendorf Vapo.Protect™ Mastercycler® Pro S.

    To confirm a successful fungal DNA extraction and amplification, 4 µL of the amplified fungal rDNA product of the PCR reaction was loaded onto a 1 % (w/v) agarose gel in a 1x Tris/Borate/EDTA (TBE) buffer, and 1 µL of the cyanine dye SYBR® DNA gel stain was added for visualisation purposes. One kilobase (1kb) plus DNA ladder (5 µL) and 5 µL of the negative control were also loaded onto the agarose gel. Following the completion of gel electrophoresis, PCR products were visualised with the GelDoc™ XR Plus System (BIO‑RAD, USA). The 1kb plus DNA ladder was used to determine the size of the amplified fungal DNA fragments using the Gelanalyzer 2010a quantification programme. The fungal rDNA fragments of the ITS1‑5.8s‑ITS2 region obtained from PCR were then transferred to the Centre of Genomics, Proteomics and Metabolomics DNA sequencing facility for sequencing.

    Capillary Electrophoresis DNA Sequencing (Sanger Sequencing) was used to obtain the DNA sequences of the amplified ITS1‑5.8s‑ITS2 region. Two reactions were performed for each sample containing fungal DNA template, one for each primer, and each was mixed with the ABI PRISM™ BIG DYE Terminator Sequencing Kit version 3.1 (ThermoFisher Scientific) containing DNA polymerase enzyme, a buffer, four DNA nucleotides and four chain-terminating dideoxy nucleotides with fluorescent dyes. The samples were then subjected to cycle sequencing on the thermal cycler Applied Biosystems GeneAmp® PCR System 9700 using standard cycling conditions: a preliminary step of polymerase activation at 96 °C for 1 minute; 25 cycles of denaturation at 96 °C for 10 seconds, annealing at 50 °C for 5 seconds, and extension at 60 °C for 4 minutes. Following the cycle sequencing, the samples were purified using Agencourt® CleanSEQ® magnetic beads in order to remove the excess fluorescent dyes, nucleotides, salts and other contaminants. The remaining purified DNA samples were then separated by size by capillary electrophoresis with the ABI PRISM™ 3130XL Genetic Analyzer using 50 cm capillaries and POP7 polymer. The final data output of the ITS1‑5.8s‑ITS2 region DNA sequences was based on the detection of the attached fluorescent dyes excited by a laser.

    Geneious programme version 11.1.5 (www.geneious.com) was used to analyse the raw data [54]. The data included both forward and reverse rDNA sequences for each fungal isolate. These sequences were aligned and ends showing poor quality reads were trimmed, to obtain a consensus sequence. A tool within the Geneious programme, BLAST (Basic Local Alignment Search Tool) developed by Altschul et al. [55], optimised for fast and high similarity search (MegaBLAST version), was used to compare the consensus query sequence with known DNA sequences in GenBank (NCBI genetic sequence database), EMBL (European Molecular Biology Laboratory), DDBJ (DNA DataBank of Japan) and PDB (Protein Data Bank, Worldwide). The search results included: grade percentage score showing combinatorial results of the query input sequence coverage, expectation-value (e-value) and identity value for each hit against the database; identities match and percentage score indicating the extent to which the query DNA sequence matched the database nucleotide sequence; and bit-score showing the quality of alignment and measuring sequence similarity [56]. The higher the score of each result, the higher the certainty of identification of the fungal species. Grade percentage score of >98 % was considered as correct genomic identification.
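
As a rough check on the two thermal-cycling programmes described above, their programmed hold times can be totalled in Python. The step durations are taken from the description; the helper name `program_seconds` is hypothetical, and ramp times between temperatures are ignored, so real run times would be longer.

```python
def program_seconds(initial_s, cycles, cycle_steps_s, final_s=0):
    """Total programmed hold time of a thermal-cycling run, in seconds."""
    return initial_s + cycles * sum(cycle_steps_s) + final_s


# PCR amplification: 2 min activation, 35 x (30 s + 20 s + 60 s), 8 min final extension.
pcr_total = program_seconds(120, 35, (30, 20, 60), 480)  # 4450 s, about 74 minutes

# Cycle sequencing: 1 min activation, 25 x (10 s + 5 s + 4 min), no final step.
sequencing_total = program_seconds(60, 25, (10, 5, 240))  # 6435 s, about 107 minutes
```

The cycle sequencing run is the longer of the two despite having fewer cycles, because of its 4-minute extension step.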

  17. EMPIAR

    • covid19dataportal.org
    • ebi.ac.uk
    Updated Apr 21, 2020
    Cite
    EMPIAR/EMBL-EBI (2020). EMPIAR [Dataset]. https://www.covid19dataportal.org/proteins
    Explore at:
    Dataset updated
    Apr 21, 2020
    Dataset provided by
    European Bioinformatics Institute (http://www.ebi.ac.uk/)
    Authors
    EMPIAR/EMBL-EBI
    License

    CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    EMPIAR, the Electron Microscopy Public Image Archive, is a public resource for raw, 2D electron microscopy images.

  18. euHCVdb: The European HCV database

    • rrid.site
    • scicrunch.org
    Updated Jun 17, 2025
    Cite
    (2025). euHCVdb: The European HCV database [Dataset]. http://identifiers.org/RRID:SCR_007645/resolver?q=*&i=rrid
    Explore at:
    Dataset updated
    Jun 17, 2025
    Description

    THIS RESOURCE IS NO LONGER IN SERVICE, documented May 10, 2017. A pilot effort that has developed a centralized, web-based biospecimen locator that presents biospecimens collected and stored at participating Arizona hospitals and biospecimen banks, which are available for acquisition and use by researchers. Researchers may use this site to browse, search and request biospecimens to use in qualified studies. The development of the ABL was guided by the Arizona Biospecimen Consortium (ABC), a consortium of hospitals and medical centers in the Phoenix area, and is now being piloted by this Consortium under the direction of ABRC. You may browse by type (cells, fluid, molecular, tissue) or disease. Common data elements decided by the ABC Standards Committee, based on data elements on the National Cancer Institute's (NCI's) Common Biorepository Model (CBM), are displayed. These describe the minimum set of data elements that the NCI determined were most important for a researcher to see about a biospecimen. The ABL currently does not display information on whether or not clinical data is available to accompany the biospecimens. However, a requester has the ability to solicit clinical data in the request. Once a request is approved, the biospecimen provider will contact the requester to discuss the request (and the requester's questions) before finalizing the invoice and shipment. The ABL is available to the public to browse. In order to request biospecimens from the ABL, the researcher will be required to submit the requested required information. Upon submission of the information, shipment of the requested biospecimen(s) will be dependent on the scientific and institutional review approval. Account required. Registration is open to everyone, documented August 23, 2016. The euHCVdb is oriented towards protein sequence, structure, function analysis and structural biology of the Hepatitis C Virus. It is updated monthly from the EMBL Nucleotide sequence database and maintained in a relational database management system (PostgreSQL). Programs for parsing the EMBL database flat files, annotating HCV entries, and populating and querying the database use the SQL and Java programming languages. Great efforts have been made to develop a fully automatic annotation procedure based on a reference set of complete, annotated, well-characterized HCV genomes of various genotypes. This automatic procedure ensures standardization of nomenclature for all entries and provides genomic regions/proteins present in the entry, bibliographic reference, genotype, interesting sites (e.g. HVR1) or domains (e.g. NS3 helicase), source of the sequence (e.g. isolate) and structural data that are available as protein 3D models. The euHCVdb is funded as part of the HepCVax cluster (EC grant QLK2-CT-2002-01329) and viRgil network of excellence (EC grant LSHM-CT-2004-503359).

  19. Protein-Protein Interaction Database

    • scicrunch.org
    • dknet.org
    • +1more
    Updated Jun 24, 2025
    Cite
    (2025). Protein-Protein Interaction Database [Dataset]. http://identifiers.org/RRID:SCR_007288
    Explore at:
    Dataset updated
    Jun 24, 2025
    Description

    Mammalian protein-protein interaction database focusing on synaptic proteins. The Protein-Protein Interaction Database was originally a single person's attempt to integrate a gamut of biological/bibliographical/molecular data and build a framework which might help in understanding how cells orchestrate their protein content in order to become what they are: machines with a purpose. This is based on the simple paradigm that functional units such as signal cascades are held together in a close space, thereby allowing specific events to occur without the necessity of passive diffusion and random events. The PPID database arose from the need to interpret proteomic datasets, which were generated by analysing the NMDA-receptor complex (see H. Husi, M. A. Ward, J. S. Choudhary, W. P. Blackstock and S. G. Grant (2000). Proteomic analysis of NMDA receptor-adhesion protein signaling complexes. Nat Neurosci 3, 661-669.). Studying these clusters of proteins unavoidably requires the handling of large datasets, which PPID is generally aimed at and tailored for. This database unifies molecular entries across three species, namely human, rat and mouse, and is based on sequence databases such as SwissProt, EMBL, TrEMBL (translated EMBL sequences) and Unigene and the literature database PubMed. A typical entry in PPID holds up to three general entries for the three species, all protein and gene accession numbers associated with them (assembled from Blast2 searches of the databases) and the OMIM entry as maintained by Johns Hopkins University. Furthermore, protein sequence information is also included, together with known and novel splice variants of each molecule as found by ClustalW sequence alignments. Entry points also include protein-binding information together with the literature reference. The whole database is curated manually to ensure accuracy and quality. Querying the database will be possible by online browsing and batch submission for large datasets holding accession number information, as can be generated using software like Mascot for mass spectrometry. Cluster analysis of the submitted datasets in the form of a graphical output will be developed, as well as an easy-to-use web interface. An interface is currently being built in collaboration with the Department of Informatics (T. Theodosiou and D. Armstrong) and will be deployed soon. The current team of people collating and deploying the database are H. Husi (database mining and information gathering) and T. Theodosiou (web interface and deployment). Please note that this database is not funded financially, and cannot survive without sponsorship.

  20. e

    PRINTS

    • ebi.ac.uk
    Updated Jun 14, 2012
    Cite
    (2012). PRINTS [Dataset]. https://www.ebi.ac.uk/interpro/
    Explore at:
    Dataset updated
    Jun 14, 2012
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    PRINTS is a compendium of protein fingerprints. A fingerprint is a group of conserved motifs used to characterise a protein family or domain. PRINTS is based at the University of Manchester, UK.
