The European Molecular Biology Laboratory's European Bioinformatics Institute (EMBL-EBI) is international, innovative and interdisciplinary, and a champion of open data in the life sciences. The EMBL-EBI captures and presents globally comprehensive sequence data as part of the International Nucleotide Sequence Database Collaboration. Data provided to GBIF include geotagged environmental sequences with user-provided taxonomic identifications.
This dataset contains INSDC sequences associated with environmental sample identifiers. The dataset is prepared periodically using the public ENA API (https://www.ebi.ac.uk/ena/portal/api/) by querying data with the search parameters: environmental_sample=True & host=""
EMBL-EBI also publishes other records in separate datasets (https://www.gbif.org/publisher/ada9d123-ddb4-467d-8891-806ea8d94230).
The data was then processed as follows:
1. Human sequences were excluded.
2. For non-CONTIG records, the sample accession number (when available) and the scientific name were used to identify sequence records corresponding to the same individual (or group of organisms of the same species in the same sample). Only one record was kept for each scientific name/sample accession number pair.
3. Contigs and whole genome shotgun (WGS) records were added individually.
4. Records missing key information were excluded: only records associated with a specimen voucher, or records containing both a location AND a date, were kept.
5. Records associated with the same voucher were aggregated together.
6. Many of the remaining records corresponded to individual sequences or reads from the same organisms. In practice, these were "duplicate" occurrence records that weren't filtered out in step 2 because the sample accession number was missing. To identify those potential duplicates, we grouped all the remaining records by scientific_name, collection_date, location, country, identified_by, collected_by and sample_accession (when available). Then we excluded the groups that contained more than 50 records. The rationale behind the choice of threshold is explained here: Deduplication v2 gbif/embl-adapter#10 (comment)
7. To improve the matching of the EBI scientific name to the GBIF backbone taxonomy, we incorporated the ENA taxonomic information. The kingdom, phylum, class, order, family, and genus were obtained from the ENA taxonomy checklist available here: http://ftp.ebi.ac.uk/pub/databases/ena/taxonomy/sdwca.zip
More information available here: https://github.com/gbif/embl-adapter#readme
You can find the mapping used to format the EMBL data to Darwin Core Archive here: https://github.com/gbif/embl-adapter/blob/master/DATAMAPPING.md
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
SMART (a Simple Modular Architecture Research Tool) allows the identification and annotation of genetically mobile domains and the analysis of domain architectures. SMART is based at EMBL, Heidelberg, Germany.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains INSDC sequences associated with environmental sample identifiers. The dataset is prepared periodically using the public ENA API (https://www.ebi.ac.uk/ena/portal/api/) by querying data with the search parameters: `environmental_sample=True & host=""`
EMBL-EBI also publishes other records in separate datasets (https://www.gbif.org/publisher/ada9d123-ddb4-467d-8891-806ea8d94230).
The data was then processed as follows:
1. Human sequences were excluded.
2. For non-CONTIG records, the sample accession number (when available) and the scientific name were used to identify sequence records corresponding to the same individual (or group of organisms of the same species in the same sample). Only one record was kept for each scientific name/sample accession number pair.
3. Contigs and whole genome shotgun (WGS) records were added individually.
4. Records missing key information were excluded: only records associated with a specimen voucher, or records containing both a location AND a date, were kept.
5. Records associated with the same voucher were aggregated together.
6. Many of the remaining records corresponded to individual sequences or reads from the same organisms. In practice, these were "duplicate" occurrence records that weren't filtered out in step 2 because the sample accession number was missing. To identify those potential duplicates, we grouped all the remaining records by `scientific_name`, `collection_date`, `location`, `country`, `identified_by`, `collected_by` and `sample_accession` (when available). Then we excluded the groups that contained more than 50 records. The rationale behind the choice of threshold is explained here: https://github.com/gbif/embl-adapter/issues/10#issuecomment-855757978
7. To improve the matching of the EBI scientific name to the GBIF backbone taxonomy, we incorporated the ENA taxonomic information. The kingdom, phylum, class, order, family, and genus were obtained from the ENA taxonomy checklist available here: http://ftp.ebi.ac.uk/pub/databases/ena/taxonomy/sdwca.zip
More information available here: https://github.com/gbif/embl-adapter#readme
You can find the mapping used to format the EMBL data to Darwin Core Archive here: https://github.com/gbif/embl-adapter/blob/master/DATAMAPPING.md
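The grouping rule in step 6 can be sketched as follows. This is a minimal illustration, not the adapter's actual code: it assumes each record is a plain dictionary keyed by the field names listed in step 6, and the function name `drop_oversized_groups` is hypothetical.

```python
from collections import defaultdict

# Fields used to group potential duplicates, as listed in step 6.
GROUP_KEYS = (
    "scientific_name", "collection_date", "location", "country",
    "identified_by", "collected_by", "sample_accession",
)
MAX_GROUP_SIZE = 50  # threshold quoted in step 6


def drop_oversized_groups(records, max_size=MAX_GROUP_SIZE):
    """Return the records whose group has at most `max_size` members.

    A missing field is treated as None, so records lacking a sample
    accession still group together on the remaining fields.
    """
    groups = defaultdict(list)
    for rec in records:
        key = tuple(rec.get(field) for field in GROUP_KEYS)
        groups[key].append(rec)

    kept = []
    for members in groups.values():
        if len(members) <= max_size:  # groups of more than 50 are excluded
            kept.extend(members)
    return kept
```

Records come back ordered by the first appearance of their group rather than in strict input order, which is usually acceptable for a filtering pass like this one.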
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
SFLD (Structure-Function Linkage Database) is a hierarchical classification of enzymes that relates specific sequence-structure features to specific chemical capabilities.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
PROSITE is a database of protein families and domains. It consists of biologically significant sites, patterns and profiles that help to reliably identify to which known protein family a new sequence belongs. PROSITE is based at the Swiss Institute of Bioinformatics (SIB), Geneva, Switzerland.
Dbfetch is an acronym for database fetch. Dbfetch provides an easy way to retrieve entries from various databases at the EBI in a consistent manner and allows you to retrieve up to 50 entries at a time from various up-to-date biological databases. It can be used from any browser as well as within a web-aware scripting tool that uses wget, lynx or similar. From the browser, follow these instructions:
* Select a database: If you are using the first form to paste your search items, choose a database name from this form. If you are using the second form to upload your search items, the database name is included at the beginning of each line of the upload file, followed by a colon.
* Enter search terms: These MUST BE in the appropriate database format; up to 200 search items can be queried in one run. If you are using the first form, separate search items with a comma or space. If you are using the second form, separate search items with a new line.
* Choose an output format: Here you can choose the simpler fasta format, or the default format for the chosen database.
* Style: You can get your results as text or html.
* Retrieve: You are now ready to fetch your results by pressing the Retrieve button.
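For scripted use, the upload-file convention described above (one `database:identifier` pair per line) can be parsed and turned into request URLs along these lines. This is a sketch under stated assumptions: the endpoint URL, the query parameter names `db`, `id`, `format` and `style`, and the `raw` style value are not given in the text above and should be checked against the current Dbfetch documentation. The helper names are hypothetical, and no network request is made.

```python
from collections import defaultdict
from urllib.parse import urlencode

# Assumed endpoint and parameter names; verify before relying on them.
DBFETCH_URL = "https://www.ebi.ac.uk/Tools/dbfetch/dbfetch"
MAX_ITEMS = 200  # per-run limit on search items quoted in the text


def parse_upload_lines(text):
    """Group 'database:identifier' lines by database name."""
    by_db = defaultdict(list)
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        db, sep, entry_id = line.partition(":")
        if not sep or not db.strip() or not entry_id.strip():
            raise ValueError(f"expected 'database:identifier', got {line!r}")
        by_db[db.strip()].append(entry_id.strip())
    return dict(by_db)


def build_url(db, ids, fmt="fasta", style="raw"):
    """Build one request URL for a single database and a list of identifiers."""
    if len(ids) > MAX_ITEMS:
        raise ValueError(f"at most {MAX_ITEMS} search items per run")
    query = urlencode(
        {"db": db, "id": ",".join(ids), "format": fmt, "style": style}
    )
    return f"{DBFETCH_URL}?{query}"
```

The resulting URL could then be passed to wget or any HTTP client. Identifiers are joined with commas, matching the separator the first form accepts.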
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains INSDC sequences associated with host organisms. The dataset is prepared periodically using the public ENA API (https://www.ebi.ac.uk/ena/portal/api/) using the methods described below.
EMBL-EBI also publishes other records in separate datasets (https://www.gbif.org/publisher/ada9d123-ddb4-467d-8891-806ea8d94230).
The data was then processed as follows:
1. Human sequences were excluded.
2. For non-CONTIG records, the sample accession number (when available) and the scientific name were used to identify sequence records corresponding to the same individual (or group of organisms of the same species in the same sample). Only one record was kept for each scientific name/sample accession number pair.
3. Contigs and whole genome shotgun (WGS) records were added individually.
4. Records missing key information were excluded: only records associated with a specimen voucher, or records containing both a location AND a date, were kept.
5. Records associated with the same voucher were aggregated together.
6. Many of the remaining records corresponded to individual sequences or reads from the same organisms. In practice, these were "duplicate" occurrence records that weren't filtered out in step 2 because the sample accession number was missing. To identify those potential duplicates, we grouped all the remaining records by `scientific_name`, `collection_date`, `location`, `country`, `identified_by`, `collected_by` and `sample_accession` (when available). Then we excluded the groups that contained more than 50 records. The rationale behind the choice of threshold is explained here: https://github.com/gbif/embl-adapter/issues/10#issuecomment-855757978
7. To improve the matching of the EBI scientific name to the GBIF backbone taxonomy, we incorporated the ENA taxonomic information. The kingdom, phylum, class, order, family, and genus were obtained from the ENA taxonomy checklist available here: http://ftp.ebi.ac.uk/pub/databases/ena/taxonomy/sdwca.zip
More information available here: https://github.com/gbif/embl-adapter#readme
You can find the mapping used to format the EMBL data to Darwin Core Archive here: https://github.com/gbif/embl-adapter/blob/master/DATAMAPPING.md
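Steps 2 and 4 of the list above amount to a keyed de-duplication followed by a completeness filter. The sketch below illustrates that logic only. It assumes records are dictionaries, the field name `specimen_voucher` and both function names are hypothetical, and keeping the first record seen per key is an assumption, since the description does not say which record is retained.

```python
def one_per_name_and_sample(records):
    """Step 2: keep one record per (scientific name, sample accession).

    Records without a sample accession cannot be matched at this step,
    so they pass through unchanged.
    """
    seen = set()
    kept = []
    for rec in records:
        sample = rec.get("sample_accession")
        if not sample:
            kept.append(rec)
            continue
        key = (rec.get("scientific_name"), sample)
        if key not in seen:
            seen.add(key)
            kept.append(rec)  # assumption: the first record seen is kept
    return kept


def has_required_fields(rec):
    """Step 4: a voucher, or both a location AND a date, must be present."""
    has_voucher = bool(rec.get("specimen_voucher"))
    has_place_and_date = bool(rec.get("location")) and bool(rec.get("collection_date"))
    return has_voucher or has_place_and_date
```

A typical call would chain the two: `[r for r in one_per_name_and_sample(records) if has_required_fields(r)]`.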
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
PRINTS is a compendium of protein fingerprints. A fingerprint is a group of conserved motifs used to characterise a protein family or domain. PRINTS is based at the University of Manchester, UK.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
PIRSF protein classification system is a network with multiple levels of sequence diversity from superfamilies to subfamilies that reflects the evolutionary relationship of full-length proteins and domains. PIRSF is based at the Protein Information Resource, Georgetown University Medical Centre, Washington DC, US.
Database of insertion sequences (ISs) isolated from eubacteria and archaea. It is organized into individual files containing their general features (name, size, origin, family, ...) as well as their DNA and potential protein sequences. Although most of the entries have been identified as individual elements, a growing number are included from their description in sequenced bacterial genomes. The search engine permits the retrieval and display of individual ISs and groups of ISs based on a combination of their general features. Two levels of search are available. The simple search option enables the user to sort elements using a limited number of basic items, whereas the extensive search offers an additional set of possibilities such as comparisons of the sequences of terminal inverted repeats and a variety of different layout displays. Built-in links are provided to the EMBL sequence database, the NCBI taxonomy database and the ESF plasmid database. At present, sequences can only be downloaded one at a time for comparison. An online BLAST facility is available, and future versions will provide direct online access to additional analytical tools. Direct submission of ISs is encouraged using the online form provided.
THIS RESOURCE IS NO LONGER IN SERVICE, documented May 10, 2017. A pilot effort that has developed a centralized, web-based biospecimen locator that presents biospecimens collected and stored at participating Arizona hospitals and biospecimen banks, which are available for acquisition and use by researchers. Researchers may use this site to browse, search and request biospecimens to use in qualified studies. The development of the ABL was guided by the Arizona Biospecimen Consortium (ABC), a consortium of hospitals and medical centers in the Phoenix area, and is now being piloted by this Consortium under the direction of ABRC. You may browse by type (cells, fluid, molecular, tissue) or disease. Common data elements decided by the ABC Standards Committee, based on data elements in the National Cancer Institute's (NCI's) Common Biorepository Model (CBM), are displayed. These describe the minimum set of data elements that the NCI determined were most important for a researcher to see about a biospecimen. The ABL currently does not display information on whether or not clinical data is available to accompany the biospecimens. However, a requester can solicit clinical data in the request. Once a request is approved, the biospecimen provider will contact the requester to discuss the request (and the requester's questions) before finalizing the invoice and shipment. The ABL is available to the public to browse. To request biospecimens from the ABL, the researcher must submit the required information. After submission, shipment of the requested biospecimen(s) depends on scientific and institutional review approval. Account required. Registration is open to everyone., documented August 23, 2016.
The euHCVdb is oriented towards protein sequence, structure, function analysis and structural biology of the Hepatitis C Virus. It is updated monthly from the EMBL Nucleotide sequence database and maintained in a relational database management system (PostgreSQL). The programs for parsing the EMBL database flat files, annotating HCV entries, and populating and querying the database use the SQL and Java programming languages. Great efforts have been made to develop a fully automatic annotation procedure based on a reference set of complete, annotated, well-characterized HCV genomes of various genotypes. This automatic procedure ensures standardization of nomenclature for all entries and provides the genomic regions/proteins present in the entry, bibliographic reference, genotype, interesting sites (e.g. HVR1) or domains (e.g. NS3 helicase), source of the sequence (e.g. isolate) and structural data that are available as protein 3D models. The euHCVdb is funded as part of the HepCVax cluster (EC grant QLK2-CT-2002-01329) and viRgil network of excellence (EC grant LSHM-CT-2004-503359).
This lesson provides an interactive introduction to some of EMBL-EBI's openly accessible life sciences data resources. One presentation document provides challenges for learners that require them to interact with some online data resources. The second presentation has hints for solving the challenges and the correct answers. An interactive self-paced version of this lesson is also available as part of EMBL-EBI's on-demand training in A journey through bioinformatics: Explore resources from EMBL-EBI.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
SUPERFAMILY is a library of profile hidden Markov models that represent all proteins of known structure. The library is based on the SCOP classification of proteins: each model corresponds to a SCOP domain and aims to represent the entire SCOP superfamily that the domain belongs to. SUPERFAMILY is based at the University of Bristol, UK.
Histone Database is a database of histones and their corresponding sequences. Sequence- and text-based searches were performed on NCBI's redundant and non-redundant (nr) peptide sequence databases. These databases are derived from GenBank, EMBL, and DDBJ translated DNA coding regions, plus protein sequences from the PDB (Protein Data Bank), SWISS-PROT, the PIR (Protein Information Resource), and the PRF (Protein Research Foundation). Users can search by keyword, sequence fragment, category, organism, and redundancy of the set.
Mammalian protein-protein interaction database focusing on synaptic proteins. The Protein-Protein Interaction Database was originally a single person's attempt to integrate a gamut of biological, bibliographical and molecular data and build a framework that might help in understanding how cells orchestrate their protein content in order to become what they are: machines with a purpose. This is based on the simple paradigm that functionality such as signal cascades is held together in a close space, thereby allowing specific events to occur without the necessity of passive diffusion and random events. The PPID database arose from the need to interpret proteomic datasets generated by analysing the NMDA-receptor complex (see H. Husi, M. A. Ward, J. S. Choudhary, W. P. Blackstock and S. G. Grant (2000). Proteomic analysis of NMDA receptor-adhesion protein signaling complexes. Nat Neurosci 3, 661-669.). Studying these clusters of proteins unavoidably requires handling large datasets, which is what PPID is aimed at and tailored for. This database unifies molecular entries across three species (human, rat and mouse) and is founded on sequence databases such as SwissProt, EMBL, TrEMBL (translated EMBL sequences) and Unigene, and on the literature database PubMed. A typical entry in PPID holds up to three general entries for the three species, all protein and gene accession numbers associated with them (assembled from Blast2 searches of the databases) and the OMIM entry as maintained by Johns Hopkins University. Protein sequence information is also included, together with known and novel splice variants of each molecule as found by ClustalW sequence alignments. Entry points also include protein-binding information together with the literature reference. The whole database is curated manually to ensure accuracy and quality.
Querying the database will be possible by online browsing and by batch submission of large datasets holding accession number information, as can be generated using software like Mascot for mass spectrometry. Cluster analysis of the submitted datasets in the form of a graphical output will be developed, as well as an easy-to-use web interface. An interface is currently being built in collaboration with the Department of Informatics (T. Theodosiou and D. Armstrong) and will be deployed soon. The current team collating and deploying the database consists of H. Husi (database mining and information gathering) and T. Theodosiou (web interface and deployment). Please note that this database receives no funding and cannot survive without sponsorship.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
EMPIAR, the Electron Microscopy Public Image Archive, is a public resource for raw, 2D electron microscopy images.
GenBank is the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences. GenBank (at NCBI), together with the DNA DataBank of Japan (DDBJ) and the European Molecular Biology Laboratory (EMBL) comprise the International Nucleotide Sequence Database Collaboration. These three organizations exchange data on a daily basis.
GenBank grows at an exponential rate, with the number of nucleotide bases doubling approximately every 14 months. Currently, GenBank contains more than 13 billion bases from over 100,000 species.
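The doubling-time statement corresponds to the formula N(t) = N0 * 2^(t / 14), with t in months. The short sketch below works through that arithmetic. It takes the quoted figure of 13 billion bases as a starting point (the text says "more than", so this is a lower bound) and assumes the 14-month doubling time holds unchanged, which is an idealisation. Function names are hypothetical.

```python
import math


def projected_bases(months_ahead, current_bases=13e9, doubling_months=14.0):
    """Bases expected after `months_ahead` months of steady doubling."""
    return current_bases * 2 ** (months_ahead / doubling_months)


def months_to_reach(target_bases, current_bases=13e9, doubling_months=14.0):
    """Months of steady doubling needed to grow from current to target."""
    return doubling_months * math.log2(target_bases / current_bases)
```

For example, `projected_bases(28)` covers two doubling periods and so returns four times the starting figure.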
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The CATH-Gene3D database describes protein families and domain architectures in complete genomes. Protein families are formed using a Markov clustering algorithm, followed by multi-linkage clustering according to sequence identity. Mapping of predicted structure and sequence domains is undertaken using hidden Markov models libraries representing CATH and Pfam domains. CATH-Gene3D is based at University College, London, UK.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
DNA sequences used to identify fungi cultured from human faeces. The ITS1‑5.8s‑ITS2 region of the extracted rDNA of fungal isolates was chosen for amplification based on its success in identifying a wide range of fungal species [53]. For DNA amplification, 10.0 µL of REDExtract-N-Amp™ PCR Ready Mix; 7.8 µL of PCR-grade H2O; 0.8 µL of 10 µM forward primer (ITS1, sequence TCCGTAGGTGAACCTGCGG); 0.8 µL of 10 µM reverse primer (ITS4, sequence TCCTCCGCTTATTGATATGC); and 1.0 µL of extracted fungal DNA sample were added to a 200 µL Eppendorf PCR tube. The same method was used to prepare the negative control. PCR amplification was performed with a preliminary step of polymerase activation at 94 °C for 2 minutes; 35 cycles of denaturation at 94 °C for 30 seconds, annealing at 51 °C for 20 seconds, and extension at 77 °C for 1 minute; and a final extension step at 72 °C for 8 minutes, using the Eppendorf Vapo.Protect™ Mastercycler® Pro S.
To confirm successful fungal DNA extraction and amplification, 4 µL of the amplified fungal rDNA product of the PCR reaction was loaded onto a 1 % (w/v) agarose gel in 1x Tris/Borate/EDTA (TBE) buffer, and 1 µL of cyanine dye SYBR® DNA gel stain was added for visualisation. One kilobase (1kb) plus DNA ladder (5 µL) and 5 µL of the negative control were also loaded onto the agarose gel. Following gel electrophoresis, PCR products were visualised with the GelDoc™ XR Plus System (BIO‑RAD, USA). The 1kb plus DNA ladder was used to determine the size of the amplified fungal DNA fragments using the Gelanalyzer 2010a quantification programme. The fungal rDNA fragments of the ITS1‑5.8s‑ITS2 region obtained from PCR were then transferred to the Centre of Genomics, Proteomics and Metabolomics DNA sequencing facility for sequencing.
Capillary Electrophoresis DNA Sequencing (Sanger Sequencing) was used to obtain the DNA sequences of the amplified ITS1‑5.8s‑ITS2 region. Two reactions were performed for each sample containing fungal DNA template, one per primer, and each was mixed with the ABI PRISM™ BIG DYE Terminator Sequencing Kit version 3.1 (ThermoFisher Scientific), which contains DNA polymerase enzyme, a buffer, four DNA nucleotides and four chain-terminating dideoxy nucleotides with fluorescent dyes. The samples were then subjected to cycle sequencing on the Applied Biosystems GeneAmp® PCR System 9700 thermal cycler using standard cycling conditions: a preliminary step of polymerase activation at 96 °C for 1 minute; 25 cycles of denaturation at 96 °C for 10 seconds, annealing at 50 °C for 5 seconds, and extension at 60 °C for 4 minutes. Following cycle sequencing, the samples were purified using Agencourt® CleanSEQ® magnetic beads to remove excess fluorescent dyes, nucleotides, salts and other contaminants. The purified DNA samples were then separated by size by capillary electrophoresis with the ABI PRISM™ 3130XL Genetic Analyzer using 50 cm capillaries and POP7 polymer. The final data output of the ITS1‑5.8s‑ITS2 region DNA sequences was based on the detection of the attached fluorescent dyes excited by a laser.
Geneious programme version 11.1.5 (www.geneious.com) was used to analyse the raw data [54]. The data included both forward and reverse rDNA sequences for each fungal isolate. These sequences were aligned and ends showing poor quality reads were trimmed, to obtain a consensus sequence. A tool within the Geneious programme, BLAST (Basic Local Alignment Search Tool) developed by Altschul et al. [55], optimised for fast and high similarity search (MegaBLAST version), was used to compare the consensus query sequence with known DNA sequences in GenBank (NCBI genetic sequence database), EMBL (European Molecular Biology Laboratory), DDBJ (DNA DataBank of Japan) and PDB (Protein Data Bank, Worldwide). The search results included: grade percentage score showing combinatorial results of the query input sequence coverage, expectation-value (e-value) and identity value for each hit against the database; identities match and percentage score indicating the extent to which the query DNA sequence matched the database nucleotide sequence; and bit-score showing the quality of alignment and measuring sequence similarity [56]. The higher the score of each result, the higher the certainty of identification of the fungal species. Grade percentage score of >98 % was considered as correct genomic identification.