100+ datasets found
  1. EMBL European Bioinformatics Institute - Nucleotide Sequencing Data

    • geonetwork.soosmap.aq
    Updated Apr 21, 2025
    Cite
    (2025). EMBL European Bioinformatics Institute - Nucleotide Sequencing Data [Dataset]. https://geonetwork.soosmap.aq/geonetwork/srv/search
    Explore at:
    Dataset updated
    Apr 21, 2025
    Description

    The European Molecular Biology Laboratory European Bioinformatics Institute (EMBL-EBI) is international, innovative and interdisciplinary, and a champion of open data in the life sciences. The EMBL-EBI captures and presents globally comprehensive sequence data as part of the International Nucleotide Sequence Database Collaboration. Data provided to GBIF include geotagged environmental sequences with user-provided taxonomic identifications.

    This dataset contains INSDC sequences associated with environmental sample identifiers. The dataset is prepared periodically using the public ENA API (https://www.ebi.ac.uk/ena/portal/api/) by querying data with the search parameters: `environmental_sample=True & host=""`

    EMBL-EBI also publishes other records in separate datasets (https://www.gbif.org/publisher/ada9d123-ddb4-467d-8891-806ea8d94230).

    The data was then processed as follows:

    1. Human sequences were excluded.

    2. For non-CONTIG records, the sample accession number (when available) along with the scientific name were used to identify sequence records corresponding to the same individuals (or group of organisms of the same species in the same sample). Only one record was kept for each scientific name/sample accession number.

    3. Contigs and whole genome shotgun (WGS) records were added individually.

    4. The records that were missing some information were excluded. Only records associated with a specimen voucher or records containing both a location AND a date were kept.

    5. The records associated with the same vouchers were aggregated together.

    6. Many of the remaining records corresponded to individual sequences or reads from the same organisms. In practice, these were "duplicate" occurrence records that weren't filtered out in step 2 because the sample accession was missing. To identify those potential duplicates, we grouped all the remaining records by `scientific_name`, `collection_date`, `location`, `country`, `identified_by`, `collected_by` and `sample_accession` (when available). Then we excluded the groups that contained more than 50 records. The rationale behind the choice of threshold is explained here: Deduplication v2, gbif/embl-adapter#10 (comment)

    7. To improve the matching of the EBI scientific name to the GBIF backbone taxonomy, we incorporated the ENA taxonomic information. The kingdom, phylum, class, order, family, and genus were obtained from the ENA taxonomy checklist available here: http://ftp.ebi.ac.uk/pub/databases/ena/taxonomy/sdwca.zip

    More information available here: https://github.com/gbif/embl-adapter#readme

    You can find the mapping used to format the EMBL data to Darwin Core Archive here: https://github.com/gbif/embl-adapter/blob/master/DATAMAPPING.md

  2. SFLD

    • ebi.ac.uk
    Updated Sep 7, 2018
    Cite
    (2018). SFLD [Dataset]. https://www.ebi.ac.uk/interpro/
    Explore at:
    Dataset updated
    Sep 7, 2018
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    SFLD (Structure-Function Linkage Database) is a hierarchical classification of enzymes that relates specific sequence-structure features to specific chemical capabilities.

  3. EBI Dbfetch

    • scicrunch.org
    • dknet.org
    • +2 more
    Updated Jun 28, 2025
    Cite
    (2025). EBI Dbfetch [Dataset]. http://identifiers.org/RRID:SCR_004393
    Explore at:
    Dataset updated
    Jun 28, 2025
    Description

    Dbfetch is an acronym for database fetch. Dbfetch provides an easy way to retrieve entries from various databases at the EBI in a consistent manner and allows you to retrieve up to 50 entries at a time from various up-to-date biological databases. It can be used from any browser as well as within a web-aware scripting tool that uses wget, lynx or similar. From the browser, follow these instructions:

    * Select a database: If you are using the first form to paste your search items, choose a database name from this form. If you are using the second form to upload your search items, the database name is included at the beginning of each line of the upload file, followed by a colon.
    * Enter search terms: These MUST BE in the appropriate database format; up to 200 search items can be queried in one run. If you are using the first form, separate search items with a comma or space. If you are using the second form, separate search items with a new line.
    * Choose an output format: Here you can choose the simpler fasta format, or the default format for the chosen database.
    * Style: You can get your results as text or html.
    * Retrieve! You are now ready to fetch your results by pressing the Retrieve button.
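
The upload-file layout described for the second form (database name, a colon, then the search item, one item per line, at most 200 items per run) could be prepared with a small helper along these lines. This is only a sketch: the function name `build_upload_batches` and the database label `exampledb` are hypothetical, and the code only formats text; it does not contact Dbfetch.

```python
from typing import Iterable, List

# Per-run limit quoted in the description above.
MAX_ITEMS_PER_RUN = 200


def build_upload_batches(db: str, ids: Iterable[str],
                         batch_size: int = MAX_ITEMS_PER_RUN) -> List[str]:
    """Return upload-file bodies with one 'db:id' search item per line.

    Blank identifiers are skipped and surrounding whitespace is trimmed.
    Each returned string holds at most `batch_size` items.
    """
    cleaned = [item.strip() for item in ids if item.strip()]
    batches = []
    for start in range(0, len(cleaned), batch_size):
        chunk = cleaned[start:start + batch_size]
        batches.append("\n".join(f"{db}:{item}" for item in chunk))
    return batches


if __name__ == "__main__":
    # 'exampledb' and the identifiers are placeholders, not real entries.
    for body in build_upload_batches("exampledb", ["ID1", "ID2", "ID3"], batch_size=2):
        print(body)
        print("---")
```

Splitting into batches up front keeps each upload within the stated per-run limit instead of relying on the service to reject an oversized file.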

  4. INSDC Environment Sample Sequences

    • gbif.org
    • researchdata.edu.au
    Updated Jun 28, 2025
    + more versions
    Cite
    European Bioinformatics Institute (EMBL-EBI); European Bioinformatics Institute (EMBL-EBI) (2025). INSDC Environment Sample Sequences [Dataset]. http://doi.org/10.15468/mcmd5g
    Explore at:
    Dataset updated
    Jun 28, 2025
    Dataset provided by
    European Bioinformatics Institute (http://www.ebi.ac.uk/)
    Global Biodiversity Information Facility (https://www.gbif.org/)
    Authors
    European Bioinformatics Institute (EMBL-EBI); European Bioinformatics Institute (EMBL-EBI)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains INSDC sequences associated with environmental sample identifiers. The dataset is prepared periodically using the public ENA API (https://www.ebi.ac.uk/ena/portal/api/) by querying data with the search parameters: `environmental_sample=True & host=""`

    EMBL-EBI also publishes other records in separate datasets (https://www.gbif.org/publisher/ada9d123-ddb4-467d-8891-806ea8d94230).

    The data was then processed as follows:

    1. Human sequences were excluded.

    2. For non-CONTIG records, the sample accession number (when available) along with the scientific name were used to identify sequence records corresponding to the same individuals (or group of organisms of the same species in the same sample). Only one record was kept for each scientific name/sample accession number.

    3. Contigs and whole genome shotgun (WGS) records were added individually.

    4. The records that were missing some information were excluded. Only records associated with a specimen voucher or records containing both a location AND a date were kept.

    5. The records associated with the same vouchers were aggregated together.

    6. Many of the remaining records corresponded to individual sequences or reads from the same organisms. In practice, these were "duplicate" occurrence records that weren't filtered out in step 2 because the sample accession was missing. To identify those potential duplicates, we grouped all the remaining records by `scientific_name`, `collection_date`, `location`, `country`, `identified_by`, `collected_by` and `sample_accession` (when available). Then we excluded the groups that contained more than 50 records. The rationale behind the choice of threshold is explained here: https://github.com/gbif/embl-adapter/issues/10#issuecomment-855757978

    7. To improve the matching of the EBI scientific name to the GBIF backbone taxonomy, we incorporated the ENA taxonomic information. The kingdom, phylum, class, order, family, and genus were obtained from the ENA taxonomy checklist available here: http://ftp.ebi.ac.uk/pub/databases/ena/taxonomy/sdwca.zip

    More information available here: https://github.com/gbif/embl-adapter#readme

    You can find the mapping used to format the EMBL data to Darwin Core Archive here: https://github.com/gbif/embl-adapter/blob/master/DATAMAPPING.md
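
The grouping rule in step 6 can be illustrated with a short sketch. It assumes each record is a plain dictionary keyed by the field names listed in that step; the function name `drop_oversized_groups` is hypothetical, and this is not the code used by the embl-adapter project.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

# Grouping fields and threshold as listed in step 6 of the description.
GROUP_FIELDS = ("scientific_name", "collection_date", "location", "country",
                "identified_by", "collected_by", "sample_accession")
MAX_GROUP_SIZE = 50

Record = Dict[str, Optional[str]]


def drop_oversized_groups(records: List[Record],
                          max_size: int = MAX_GROUP_SIZE) -> List[Record]:
    """Exclude every group of records that contains more than `max_size` members.

    Records are grouped on the tuple of GROUP_FIELDS values. A missing field
    (for example an absent sample_accession) is treated as None, so records
    that lack it still group together on the remaining fields.
    """
    groups: Dict[Tuple[Optional[str], ...], List[Record]] = defaultdict(list)
    for rec in records:
        key = tuple(rec.get(field) for field in GROUP_FIELDS)
        groups[key].append(rec)

    kept: List[Record] = []
    for members in groups.values():
        if len(members) <= max_size:
            kept.extend(members)
    return kept
```

Note that the rule removes the whole oversized group rather than collapsing it to a single record, which is how step 6 is worded.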

  5. GenBank™/EMBL Database entries.

    • plos.figshare.com
    xls
    Updated Jun 2, 2023
    Cite
    Jose C. Jimenez-Lopez; Sonia Morales; Antonio J. Castro; Dieter Volkmann; María I. Rodríguez-García; Juan de D. Alché (2023). GenBank™/EMBL Database entries. [Dataset]. http://doi.org/10.1371/journal.pone.0030878.t001
    Available download formats: xls
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Jose C. Jimenez-Lopez; Sonia Morales; Antonio J. Castro; Dieter Volkmann; María I. Rodríguez-García; Juan de D. Alché
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Accession numbers of the profilin cDNA sequences obtained after RT-PCR from pollen of five plant species: Olea europaea, Betula pendula, Corylus avellana, Phleum pratense and Zea mays.

  6. EMBL-EBI

    • cinergi.sdsc.edu
    resource url
    Updated 1994
    Cite
    (1994). EMBL-EBI [Dataset]. http://cinergi.sdsc.edu/geoportal/rest/metadata/item/a142d65045534cf4a4fe53a997707070/html
    Available download formats: resource url
    Dataset updated
    1994
    Description

    Link Function: information

  7. INSDC Host Organism Sequences

    • gbif.org
    • researchdata.edu.au
    Updated Jun 28, 2025
    Cite
    European Bioinformatics Institute (EMBL-EBI); European Bioinformatics Institute (EMBL-EBI) (2025). INSDC Host Organism Sequences [Dataset]. http://doi.org/10.15468/e97kmy
    Explore at:
    Dataset updated
    Jun 28, 2025
    Dataset provided by
    Global Biodiversity Information Facility (https://www.gbif.org/)
    European Bioinformatics Institute (http://www.ebi.ac.uk/)
    Authors
    European Bioinformatics Institute (EMBL-EBI); European Bioinformatics Institute (EMBL-EBI)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains INSDC sequences associated with host organisms. The dataset is prepared periodically from the public ENA API (https://www.ebi.ac.uk/ena/portal/api/) using the methods described below.

    EMBL-EBI also publishes other records in separate datasets (https://www.gbif.org/publisher/ada9d123-ddb4-467d-8891-806ea8d94230).

    The data was then processed as follows:

    1. Human sequences were excluded.

    2. For non-CONTIG records, the sample accession number (when available) along with the scientific name were used to identify sequence records corresponding to the same individuals (or group of organisms of the same species in the same sample). Only one record was kept for each scientific name/sample accession number.

    3. Contigs and whole genome shotgun (WGS) records were added individually.

    4. The records that were missing some information were excluded. Only records associated with a specimen voucher or records containing both a location AND a date were kept.

    5. The records associated with the same vouchers were aggregated together.

    6. Many of the remaining records corresponded to individual sequences or reads from the same organisms. In practice, these were "duplicate" occurrence records that weren't filtered out in step 2 because the sample accession was missing. To identify those potential duplicates, we grouped all the remaining records by `scientific_name`, `collection_date`, `location`, `country`, `identified_by`, `collected_by` and `sample_accession` (when available). Then we excluded the groups that contained more than 50 records. The rationale behind the choice of threshold is explained here: https://github.com/gbif/embl-adapter/issues/10#issuecomment-855757978

    7. To improve the matching of the EBI scientific name to the GBIF backbone taxonomy, we incorporated the ENA taxonomic information. The kingdom, phylum, class, order, family, and genus were obtained from the ENA taxonomy checklist available here: http://ftp.ebi.ac.uk/pub/databases/ena/taxonomy/sdwca.zip

    More information available here: https://github.com/gbif/embl-adapter#readme

    You can find the mapping used to format the EMBL data to Darwin Core Archive here: https://github.com/gbif/embl-adapter/blob/master/DATAMAPPING.md
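
Steps 2 and 4 of the processing can be read as two simple filters, sketched below. The field names `specimen_voucher`, `location` and `collection_date` on each record dictionary, the helper names, and the choice to keep the first record seen for each pair are all assumptions made for illustration. The sketch also ignores the CONTIG/non-CONTIG distinction from steps 2 and 3, and it is not the embl-adapter implementation.

```python
from typing import Dict, List, Optional, Set, Tuple

Record = Dict[str, Optional[str]]


def is_complete(rec: Record) -> bool:
    """Step 4: keep a record with a specimen voucher, or with both a location and a date."""
    has_voucher = bool(rec.get("specimen_voucher"))
    has_place_and_date = bool(rec.get("location")) and bool(rec.get("collection_date"))
    return has_voucher or has_place_and_date


def one_per_sample(records: List[Record]) -> List[Record]:
    """Step 2: keep one record per (scientific name, sample accession) pair.

    Records without a sample accession cannot be matched at this step and are
    passed through unchanged; the first record seen for a pair is the one kept.
    """
    seen: Set[Tuple[Optional[str], str]] = set()
    kept: List[Record] = []
    for rec in records:
        accession = rec.get("sample_accession")
        if not accession:
            kept.append(rec)
            continue
        key = (rec.get("scientific_name"), accession)
        if key not in seen:
            seen.add(key)
            kept.append(rec)
    return kept
```

Records that pass through `one_per_sample` without an accession are the ones step 6 later tries to catch with the group-size threshold.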

  8. SMART

    • ebi.ac.uk
    Updated Feb 14, 2020
    Cite
    (2020). SMART [Dataset]. https://www.ebi.ac.uk/interpro/
    Explore at:
    Dataset updated
    Feb 14, 2020
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    SMART (a Simple Modular Architecture Research Tool) allows the identification and annotation of genetically mobile domains and the analysis of domain architectures. SMART is based at EMBL, Heidelberg, Germany.

  9. BioStudies database

    • bioregistry.io
    • registry.identifiers.org
    Updated Apr 28, 2021
    Cite
    (2021). BioStudies database [Dataset]. http://identifiers.org/re3data:r3d100012627
    Explore at:
    Dataset updated
    Apr 28, 2021
    Description

    The BioStudies database holds descriptions of biological studies, links to data from these studies in other databases at EMBL-EBI or outside, as well as data that do not fit in the structured archives at EMBL-EBI. The database can accept a wide range of types of studies described via a simple format. It also enables manuscript authors to submit supplementary information and link to it from the publication.

  10. EMBL-EBI COVID-19 Portal

    • scicrunch.org
    Updated Jan 25, 2016
    Cite
    (2016). EMBL-EBI COVID-19 Portal [Dataset]. http://identifiers.org/RRID:SCR_018337
    Explore at:
    Dataset updated
    Jan 25, 2016
    Description

    EMBL-EBI portal that enables researchers to upload, access and analyse COVID-19 related reference data and specialist datasets submitted to EMBL-EBI and other major centers for biomedical data. Used to facilitate data sharing and analysis to accelerate coronavirus research.

    EMBL-EBI and partners have set up the COVID-19 Data Portal, which will bring together relevant datasets submitted to EMBL-EBI and other major centres for biomedical data. The aim is to facilitate data sharing and analysis, and to accelerate coronavirus research. The COVID-19 Data Portal will be the primary entry point into the functions of a wider project, the European COVID-19 Data Platform.

  11. The 20 mammals from the EMBL database used in the phylogenetic experiments....

    • plos.figshare.com
    xls
    Updated Jun 8, 2023
    Cite
    Liviu P. Dinu; Radu Tudor Ionescu; Alexandru I. Tomescu (2023). The 20 mammals from the EMBL database used in the phylogenetic experiments. The accession number is given on the last column. [Dataset]. http://doi.org/10.1371/journal.pone.0104006.t001
    Available download formats: xls
    Dataset updated
    Jun 8, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Liviu P. Dinu; Radu Tudor Ionescu; Alexandru I. Tomescu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The 20 mammals from the EMBL database used in the phylogenetic experiments. The accession number is given on the last column.

  12. Australian Collections in the EMBL Australia Bioinformatics Resource

    • researchdata.edu.au
    Updated Jun 13, 2013
    Cite
    QFAB (2013). Australian Collections in the EMBL Australia Bioinformatics Resource [Dataset]. https://researchdata.edu.au/australian-collections-embl-bioinformatics-resource/124829
    Explore at:
    Dataset updated
    Jun 13, 2013
    Dataset provided by
    QFAB
    Description

    The EMBL Australia Bioinformatics Resource, located at The University of Queensland, provides an Australian-based entry to many of the data services of the European Bioinformatics Institute (EBI). The EBI is an institute within the European Molecular Biology Laboratory (EMBL) and is the world's premier life sciences data resource.

    Five collections of nucleotide and protein sequences derived from Australian dwelling plants and animals (identified through the Atlas of Living Australia's instances of the Australian Faunal Directory and the Australian Plant Census) are available from within the EMBL-EBI database.

    These data collections contain all current internationally published nucleotide and protein sequences derived from Australian dwelling (native and common/significant introduced (e.g. crop/weed/feral)) organisms and are structured using a taxonomical hierarchy to facilitate searching by species.

  13. CATH-Gene3D

    • ebi.ac.uk
    Updated Oct 21, 2020
    Cite
    (2020). CATH-Gene3D [Dataset]. https://www.ebi.ac.uk/interpro/
    Explore at:
    Dataset updated
    Oct 21, 2020
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The CATH-Gene3D database describes protein families and domain architectures in complete genomes. Protein families are formed using a Markov clustering algorithm, followed by multi-linkage clustering according to sequence identity. Mapping of predicted structure and sequence domains is undertaken using hidden Markov model libraries representing CATH and Pfam domains. CATH-Gene3D is based at University College London, UK.

  14. Porcine primer sequences: forward (For.) and reverse (Rev.), with transcript...

    • plos.figshare.com
    xls
    Updated Jun 3, 2023
    Cite
    Frank Zoerner; Lars Wiklund; Adriana Miclescu; Cecile Martijn (2023). Porcine primer sequences: forward (For.) and reverse (Rev.), with transcript (RT-PCR product) length and EMBL database accession number. [Dataset]. http://doi.org/10.1371/journal.pone.0064792.t001
    Available download formats: xls
    Dataset updated
    Jun 3, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Frank Zoerner; Lars Wiklund; Adriana Miclescu; Cecile Martijn
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    (ET-1) Endothelin-1; (ETAR) Endothelin A-receptor; (ETBR) Endothelin B-receptor; (ECE-1) Endothelin-Converting-Enzyme-1; (SDH) Succinate Dehydrogenase. *GenBank accession number at www.ncbi.nlm.nih.gov.

  15. European Nucleotide Archive (ENA)

    • dknet.org
    • scicrunch.org
    Updated Jan 29, 2022
    + more versions
    Cite
    (2022). European Nucleotide Archive (ENA) [Dataset]. http://identifiers.org/RRID:SCR_006515
    Explore at:
    Dataset updated
    Jan 29, 2022
    Description

    Public archive providing a comprehensive record of the world's nucleotide sequencing information, covering raw sequencing data, sequence assembly information and functional annotation. All submitted data, once public, will be exchanged with the NCBI and DDBJ as part of the INSDC data exchange agreement.

    The European Nucleotide Archive (ENA) captures and presents information relating to experimental workflows that are based around nucleotide sequencing. A typical workflow includes the isolation and preparation of material for sequencing, a run of a sequencing machine in which sequencing data are produced and a subsequent bioinformatic analysis pipeline. ENA records this information in a data model that covers input information (sample, experimental setup, machine configuration), output machine data (sequence traces, reads and quality scores) and interpreted information (assembly, mapping, functional annotation).

    Data arrive at ENA from a variety of sources including submissions of raw data, assembled sequences and annotation from small-scale sequencing efforts, data provision from the major European sequencing centers and routine and comprehensive exchange with its partners in the International Nucleotide Sequence Database Collaboration (INSDC). Provision of nucleotide sequence data to ENA or its INSDC partners has become a central and mandatory step in the dissemination of research findings to the scientific community. ENA works with publishers of scientific literature and funding bodies to ensure compliance with these principles and to provide optimal submission systems and data access tools that work seamlessly with the published literature.

    ENA is made up of a number of distinct databases that include the EMBL Nucleotide Sequence Database (EMBL-Bank), the newly established Sequence Read Archive (SRA) and the Trace Archive. The main tool for downloading ENA data is the ENA Browser, which is available through REST URLs for easy programmatic use. All ENA data are available through the ENA Browser. Note: EMBL Nucleotide Sequence Database (EMBL-Bank) is entirely included within this resource.

  16. BioModels RDF

    • data.wu.ac.at
    api/sparql, meta/void +1
    Updated Jul 30, 2016
    Cite
    EMBL European Bioinformatics Institute (2016). BioModels RDF [Dataset]. https://data.wu.ac.at/odso/datahub_io/OTM1NzU5NTEtYWFhMy00NWYwLThkZmMtNmJkOWI2OWM2MWIz
    Available download formats: api/sparql, meta/void, rdf
    Dataset updated
    Jul 30, 2016
    Dataset provided by
    European Molecular Biology Laboratory (http://www.embl.org/)
    European Bioinformatics Institute (http://www.ebi.ac.uk/)
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    BioModels Database is a reliable repository of computational models of biological processes. It hosts models described in peer-reviewed scientific literature and models generated automatically from pathway resources (Path2Models). A large number of models collected from literature are manually curated and semantically enriched with cross-references from external data resources.

    BioModels Linked Dataset captures the content of the models in BioModels Database (primarily encoded in the SBML format) in RDF.

  17. Proteomics IDEntification Database (PRIDE)

    • covid19dataportal.org
    Updated Jan 24, 2025
    Cite
    EMBL-EBI (2025). Proteomics IDEntification Database (PRIDE) [Dataset]. https://www.covid19dataportal.org/expression
    Explore at:
    Dataset updated
    Jan 24, 2025
    Dataset provided by
    European Bioinformatics Institute (http://www.ebi.ac.uk/)
    Authors
    EMBL-EBI
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    The Proteomics Identifications Database (PRIDE, http://www.ebi.ac.uk/pride) is one of the main repositories of MS derived proteomics data.

  18. AlphaFold Protein Structure Database

    • scicrunch.org
    Updated Nov 19, 2021
    Cite
    (2021). AlphaFold Protein Structure Database [Dataset]. http://identifiers.org/RRID:SCR_023662
    Explore at:
    Dataset updated
    Nov 19, 2021
    Description

    Database of protein structure predictions by AlphaFold that are freely and openly available to the global scientific community. Included are nearly all catalogued proteins known to science. Provides programmatic access to and interactive visualization of predicted atomic coordinates, per-residue and pairwise model confidence estimates, and predicted aligned errors.

  19. HCVDB - Hepatitis C Virus Database

    • neuinfo.org
    • dknet.org
    Updated Jan 29, 2022
    Cite
    (2022). HCVDB - Hepatitis C Virus Database [Dataset]. http://identifiers.org/RRID:SCR_007703
    Explore at:
    Dataset updated
    Jan 29, 2022
    Description

    THIS RESOURCE IS NO LONGER IN SERVICE, documented August 23, 2016. The euHCVdb is a Hepatitis C Virus database oriented towards protein sequence, structure and function analyses and structural biology of HCV. To make the existing HCV databases as complementary as possible, current developments are coordinated with the other databases (Japan and Los Alamos) as part of an international collaborative effort. It is updated monthly from the EMBL Nucleotide sequence database and maintained in a relational database management system. The programs for parsing the EMBL database flat files, annotating HCV entries, and filling and querying the database use the SQL and Java programming languages. Great efforts have been made to develop a fully automatic annotation procedure thanks to a reference set of complete, annotated, well-characterized HCV genomes of various genotypes. This automatic procedure ensures standardization of nomenclature for all entries and provides the genomic regions/proteins present in the entry, bibliographic reference, genotype, interesting sites or domains, source of the sequence and structural data that are available as protein 3D models. Keywords: Hepatitis C, Hepatitis C Virus, Hepatitis C Virus protein.

  20. ChEMBL RDF

    • data.wu.ac.at
    api/sparql, meta/void +1
    Updated Jul 30, 2016
    + more versions
    Cite
    EMBL European Bioinformatics Institute (2016). ChEMBL RDF [Dataset]. https://data.wu.ac.at/odso/datahub_io/YzA0YTM3MTItN2NlZC00ODk1LTk3YzUtZDcxYjgwZDcyZTUw
    Available download formats: ttl, api/sparql, meta/void
    Dataset updated
    Jul 30, 2016
    Dataset provided by
    European Molecular Biology Laboratory (http://www.embl.org/)
    European Bioinformatics Institute (http://www.ebi.ac.uk/)
    License

    http://www.opendefinition.org/licenses/cc-by-sa

    Description

    ChEMBL is a database of bioactive drug-like small molecules. It contains 2-D structures, calculated properties (e.g. logP, Molecular Weight, Lipinski Parameters, etc.) and abstracted bioactivities (e.g. binding constants, pharmacology and ADMET data). The data is abstracted and curated from the primary scientific literature, and covers a significant fraction of the SAR and discovery of modern drugs.

    It is available in RDF form through EMBL-EBI's RDF Platform.
