The European Molecular Biology Laboratory European Bioinformatics Institute (EMBL-EBI) is international, innovative and interdisciplinary, and a champion of open data in the life sciences. The EMBL-EBI captures and presents globally comprehensive sequence data as part of the International Nucleotide Sequence Database Collaboration. Data provided to GBIF include geotagged environmental sequences with user-provided taxonomic identifications. This dataset contains INSDC sequences associated with environmental sample identifiers. The dataset is prepared periodically using the public ENA API (https://www.ebi.ac.uk/ena/portal/api/) by querying data with the search parameters: environmental_sample=True & host="" EMBL-EBI also publishes other records in separate datasets (https://www.gbif.org/publisher/ada9d123-ddb4-467d-8891-806ea8d94230). The data was then processed as follows: 1. Human sequences were excluded. 2. For non-CONTIG records, the sample accession number (when available) along with the scientific name were used to identify sequence records corresponding to the same individuals (or group of organism of the same species in the same sample). Only one record was kept for each scientific name/sample accession number. 3. Contigs and whole genome shotgun (WGS) records were added individually. 4. The records that were missing some information were excluded. Only records associated with a specimen voucher or records containing both a location AND a date were kept. 5. The records associated with the same vouchers are aggregated together. 6. A lot of records left corresponded to individual sequences or reads corresponding to the same organisms. In practise, these were "duplicate" occurrence records that weren't filtered out in STEP 2 because the sample accession sample was missing. To identify those potential duplicates, we grouped all the remaining records by scientific_name, collection_date, location, country, identified_by, collected_by and sample_accession (when available). Then we excluded the groups that contained more than 50 records. The rationale behind the choice of threshold is explained here: Deduplication v2 gbif/embl-adapter#10 (comment) 7. To improve the matching of the EBI scientific name to the GBIF backbone taxonomy, we incorporated the ENA taxonomic information. The kingdom, Phylum, Class, Order, Family, and genus were obtained from the ENA taxonomy checklist available here: http://ftp.ebi.ac.uk/pub/databases/ena/taxonomy/sdwca.zip More information available here: https://github.com/gbif/embl-adapter#readme You can find the mapping used to format the EMBL data to Darwin Core Archive here: https://github.com/gbif/embl-adapter/blob/master/DATAMAPPING.md
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
SFLD (Structure-Function Linkage Database) is a hierarchical classification of enzymes that relates specific sequence-structure features to specific chemical capabilities.
Dbfetch is an acronym for database fetch. Dbfetch provides an easy way to retrieve entries from various databases at the EBI in a consistent manner and allows you to retrieve up to 50 entries at a time from various up-to-date biological databases. It can be used from any browser as well as well as within a web-aware scripting tool that uses wget, lynx or similar. From the browser, follow these instructions... * Select a database: If you are using the first form to paste your search items: choose a database name from this form. If you are using the second form to upload your search items: the database name is included at the beginning of each line line of the upload file followed by a colon. * Enter search terms: These MUST BE in the appropriate database format, up to 200 search items can be queried in one run. If you are using the first form: separate search items with a comma or space. If you are using the second form: separate search items with a new line. * Choose an output format: Here you can choose the simpler fasta format, or the databases'''' default format for the chosen database. * Style: You can get your results as text or html. * Retrieve! - You are now ready to fetch your results, by pressing the Retrieve button.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains INSDC sequences associated with environmental sample identifiers. The dataset is prepared periodically using the public ENA API (https://www.ebi.ac.uk/ena/portal/api/) by querying data with the search parameters: `environmental_sample=True & host=""`
EMBL-EBI also publishes other records in separate datasets (https://www.gbif.org/publisher/ada9d123-ddb4-467d-8891-806ea8d94230).
The data was then processed as follows:
1. Human sequences were excluded.
2. For non-CONTIG records, the sample accession number (when available) along with the scientific name were used to identify sequence records corresponding to the same individuals (or group of organism of the same species in the same sample). Only one record was kept for each scientific name/sample accession number.
3. Contigs and whole genome shotgun (WGS) records were added individually.
4. The records that were missing some information were excluded. Only records associated with a specimen voucher or records containing both a location AND a date were kept.
5. The records associated with the same vouchers are aggregated together.
6. A lot of records left corresponded to individual sequences or reads corresponding to the same organisms. In practise, these were "duplicate" occurrence records that weren't filtered out in STEP 2 because the sample accession sample was missing. To identify those potential duplicates, we grouped all the remaining records by `scientific_name`, `collection_date`, `location`, `country`, `identified_by`, `collected_by` and `sample_accession` (when available). Then we excluded the groups that contained more than 50 records. The rationale behind the choice of threshold is explained here: https://github.com/gbif/embl-adapter/issues/10#issuecomment-855757978
7. To improve the matching of the EBI scientific name to the GBIF backbone taxonomy, we incorporated the ENA taxonomic information. The kingdom, Phylum, Class, Order, Family, and genus were obtained from the ENA taxonomy checklist available here: http://ftp.ebi.ac.uk/pub/databases/ena/taxonomy/sdwca.zip
More information available here: https://github.com/gbif/embl-adapter#readme
You can find the mapping used to format the EMBL data to Darwin Core Archive here: https://github.com/gbif/embl-adapter/blob/master/DATAMAPPING.md
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Accession numbers of the profilin cDNA sequences obtained after RT-PCR from pollen of five plant species: Olea europaea, Betula pendula, Corylus avellana, Phleum pratense and Zea mays.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains INSDC sequences associated with host organisms. The dataset is prepared periodically using the public ENA API (https://www.ebi.ac.uk/ena/portal/api/) using the methods described below.
EMBL-EBI also publishes other records in separate datasets (https://www.gbif.org/publisher/ada9d123-ddb4-467d-8891-806ea8d94230).
The data was then processed as follows:
1. Human sequences were excluded.
2. For non-CONTIG records, the sample accession number (when available) along with the scientific name were used to identify sequence records corresponding to the same individuals (or group of organism of the same species in the same sample). Only one record was kept for each scientific name/sample accession number.
3. Contigs and whole genome shotgun (WGS) records were added individually.
4. The records that were missing some information were excluded. Only records associated with a specimen voucher or records containing both a location AND a date were kept.
5. The records associated with the same vouchers are aggregated together.
6. A lot of records left corresponded to individual sequences or reads corresponding to the same organisms. In practise, these were "duplicate" occurrence records that weren't filtered out in STEP 2 because the sample accession sample was missing. To identify those potential duplicates, we grouped all the remaining records by `scientific_name`, `collection_date`, `location`, `country`, `identified_by`, `collected_by` and `sample_accession` (when available). Then we excluded the groups that contained more than 50 records. The rationale behind the choice of threshold is explained here: https://github.com/gbif/embl-adapter/issues/10#issuecomment-855757978
7. To improve the matching of the EBI scientific name to the GBIF backbone taxonomy, we incorporated the ENA taxonomic information. The kingdom, Phylum, Class, Order, Family, and genus were obtained from the ENA taxonomy checklist available here: http://ftp.ebi.ac.uk/pub/databases/ena/taxonomy/sdwca.zip
More information available here: https://github.com/gbif/embl-adapter#readme
You can find the mapping used to format the EMBL data to Darwin Core Archive here: https://github.com/gbif/embl-adapter/blob/master/DATAMAPPING.md
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
SMART (a Simple Modular Architecture Research Tool) allows the identification and annotation of genetically mobile domains and the analysis of domain architectures. SMART is based at EMBL, Heidelberg, Germany.
The BioStudies database holds descriptions of biological studies, links to data from these studies in other databases at EMBL-EBI or outside, as well as data that do not fit in the structured archives at EMBL-EBI. The database can accept a wide range of types of studies described via a simple format. It also enables manuscript authors to submit supplementary information and link to it from the publication.
EMBL-EBI portal to enable researchers to upload, access and analyse COVID-19 related reference data and specialist datasets submitted to EMBL-EBI and other major centers for biomedical data. Used to facilitate data sharing and analysis to accelerate coronavirus research.
The aim of the COVID-19 Data Portal is to facilitate data sharing and analysis, and to accelerate coronavirus research. EMBL-EBI and partners have set up the COVID-19 Data Portal, which will bring together relevant datasets submitted to EMBL-EBI and other major centres for biomedical data. The aim is to facilitate data sharing and analysis, and to accelerate coronavirus research. The COVID-19 Data Portal will enable researchers to upload, access and analyse COVID-19 related reference data and specialist datasets. The COVID-19 Data Portal will be the primary entry point into the functions of a wider project, the European COVID-19 Data Platform.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The 20 mammals from the EMBL database used in the phylogenetic experiments. The accession number is given on the last column.
https://dl.dropboxusercontent.com/u/120673642/BRAEMBL.jpg" alt="" />
The EMBL Australia Bioinformatics Resource, located at The University of Queensland, provides an Australian-based entry to many of the data services of the European Bioinformatics Institute (EBI). The EBI is an institute within the European Molecular Biology Laboratory (EMBL) and is the world's premier life sciences data resource.
Five collections of nucleotide and protein sequences derived from Australian dwelling plants and animals (identified through the Atlas of Living Australia's instances of the Australian Faunal Directory and the Australian Plant Census) are available from within the EMBL-EBI database.
These data collections contain all current internationally published nucleotide and protein sequences derived from Australian dwelling (native and common/significant introduced (e.g. crop/weed/feral)) organisms and are structured using a taxonomical hierarchy to facilitate searching by species.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The CATH-Gene3D database describes protein families and domain architectures in complete genomes. Protein families are formed using a Markov clustering algorithm, followed by multi-linkage clustering according to sequence identity. Mapping of predicted structure and sequence domains is undertaken using hidden Markov models libraries representing CATH and Pfam domains. CATH-Gene3D is based at University College, London, UK.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
(ET-1) Endothelin-1; (ETAR) Endothelin A-receptor; (ETBR) Endothelin B-receptor; (ECE-1) Endothelin-Converting-Enzyme-1; (SDH) Succinate Dehydrogenase.*GeneBank accession number at www.ncbi.nlm.nih.gov.
Public archive providing a comprehensive record of the world''''s nucleotide sequencing information, covering raw sequencing data, sequence assembly information and functional annotation. All submitted data, once public, will be exchanged with the NCBI and DDBJ as part of the INSDC data exchange agreement. The European Nucleotide Archive (ENA) captures and presents information relating to experimental workflows that are based around nucleotide sequencing. A typical workflow includes the isolation and preparation of material for sequencing, a run of a sequencing machine in which sequencing data are produced and a subsequent bioinformatic analysis pipeline. ENA records this information in a data model that covers input information (sample, experimental setup, machine configuration), output machine data (sequence traces, reads and quality scores) and interpreted information (assembly, mapping, functional annotation). Data arrive at ENA from a variety of sources including submissions of raw data, assembled sequences and annotation from small-scale sequencing efforts, data provision from the major European sequencing centers and routine and comprehensive exchange with their partners in the International Nucleotide Sequence Database Collaboration (INSDC). Provision of nucleotide sequence data to ENA or its INSDC partners has become a central and mandatory step in the dissemination of research findings to the scientific community. ENA works with publishers of scientific literature and funding bodies to ensure compliance with these principles and to provide optimal submission systems and data access tools that work seamlessly with the published literature. ENA is made up of a number of distinct databases that includes the EMBL Nucleotide Sequence Database (Embl-Bank), the newly established Sequence Read Archive (SRA) and the Trace Archive. The main tool for downloading ENA data is the ENA Browser, which is available through REST URLs for easy programmatic use. All ENA data are available through the ENA Browser. Note: EMBL Nucleotide Sequence Database (EMBL-Bank) is entirely included within this resource.
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
BioModels Database is a reliable repository of computational models of biological processes. It hosts models described in peer-reviewed scientific literature and models generated automatically from pathway resources (Path2Models). A large number of models collected from literature are manually curated and semantically enriched with cross-references from external data resources.
BioModels Linked Dataset captures the content of the models in BioModels Database (primarilly encoded in the SBML format) in RDF.
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The Proteomics Identifications Database (PRIDE, http://www.ebi.ac.uk/pride) is one of the main repositories of MS derived proteomics data.
Database of protein structure predictions by AlphaFold that are freely and openly available to global scientific community. Included are nearly all catalogued proteins known to science. Provides programmatic access to and interactive visualization of predicted atomic coordinates, per residue and pairwise model confidence estimates and predicted aligned errors.
THIS RESOURCE IS NO LONGER IN SERVICE, documented August 23, 2016. The euHCVdb is a Hepatitis C Virus database oriented towards protein sequence, structure and function analyses and structural biology of HCV. In order to make the existing HCV databases as complementary as possible, the current developments are coordinated with the other databases (Japan and Los Alamos) as part of an international collaborative effort. It is monthly updated from the EMBL Nucleotide sequence database and maintained in a relational database management system. Programs for parsing the EMBL database flat files, annotating HCV entries, filling up and querying the database used SQL and Java programming languages. Great efforts have been made to develop a fully automatic annotation procedure thanks to a reference set of HCV complete annotated well-characterized genomes of various genotypes. This automatic procedure ensures standardization of nomenclature for all entries and provides genomic regions/proteins present in the entry, bibliographic reference, genotype, interesting sites or domains, source of the sequence and structural data that are available as protein 3D models. Hepatitis C, Hepatitis C Virus, Hepatitis C Virus protein .
http://www.opendefinition.org/licenses/cc-by-sahttp://www.opendefinition.org/licenses/cc-by-sa
ChEMBL is a database of bioactive drug-like small molecules, it contains 2-D structures, calculated properties (e.g. logP, Molecular Weight, Lipinski Parameters, etc.) and abstracted bioactivities (e.g. binding constants, pharmacology and ADMET data). The data is abstracted and curated from the primary scientific literature, and cover a significant fraction of the SAR and discovery of modern drugs.
It is available in RDF form through EMBL-EBI's RDF Platform.
The European Molecular Biology Laboratory European Bioinformatics Institute (EMBL-EBI) is international, innovative and interdisciplinary, and a champion of open data in the life sciences. The EMBL-EBI captures and presents globally comprehensive sequence data as part of the International Nucleotide Sequence Database Collaboration. Data provided to GBIF include geotagged environmental sequences with user-provided taxonomic identifications. This dataset contains INSDC sequences associated with environmental sample identifiers. The dataset is prepared periodically using the public ENA API (https://www.ebi.ac.uk/ena/portal/api/) by querying data with the search parameters: environmental_sample=True & host="" EMBL-EBI also publishes other records in separate datasets (https://www.gbif.org/publisher/ada9d123-ddb4-467d-8891-806ea8d94230). The data was then processed as follows: 1. Human sequences were excluded. 2. For non-CONTIG records, the sample accession number (when available) along with the scientific name were used to identify sequence records corresponding to the same individuals (or group of organism of the same species in the same sample). Only one record was kept for each scientific name/sample accession number. 3. Contigs and whole genome shotgun (WGS) records were added individually. 4. The records that were missing some information were excluded. Only records associated with a specimen voucher or records containing both a location AND a date were kept. 5. The records associated with the same vouchers are aggregated together. 6. A lot of records left corresponded to individual sequences or reads corresponding to the same organisms. In practise, these were "duplicate" occurrence records that weren't filtered out in STEP 2 because the sample accession sample was missing. To identify those potential duplicates, we grouped all the remaining records by scientific_name, collection_date, location, country, identified_by, collected_by and sample_accession (when available). Then we excluded the groups that contained more than 50 records. The rationale behind the choice of threshold is explained here: Deduplication v2 gbif/embl-adapter#10 (comment) 7. To improve the matching of the EBI scientific name to the GBIF backbone taxonomy, we incorporated the ENA taxonomic information. The kingdom, Phylum, Class, Order, Family, and genus were obtained from the ENA taxonomy checklist available here: http://ftp.ebi.ac.uk/pub/databases/ena/taxonomy/sdwca.zip More information available here: https://github.com/gbif/embl-adapter#readme You can find the mapping used to format the EMBL data to Darwin Core Archive here: https://github.com/gbif/embl-adapter/blob/master/DATAMAPPING.md