25 datasets found
  1. PROSITE profiles

    • ebi.ac.uk
    Updated Feb 5, 2025
    Cite
    (2025). PROSITE profiles [Dataset]. https://www.ebi.ac.uk/interpro/
    Explore at:
    Dataset updated
    Feb 5, 2025
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    PROSITE is a database of protein families and domains. It consists of biologically significant sites, patterns and profiles that help to reliably identify to which known protein family a new sequence belongs. PROSITE is based at the Swiss Institute of Bioinformatics (SIB), Geneva, Switzerland.

  2. Bioinformatics Goes to School—New Avenues for Teaching Contemporary Biology

    • plos.figshare.com
    doc
    Updated Jun 4, 2023
    Cite
    Louisa Wood; Philipp Gebhardt (2023). Bioinformatics Goes to School—New Avenues for Teaching Contemporary Biology [Dataset]. http://doi.org/10.1371/journal.pcbi.1003089
    Explore at:
    Available download formats: doc
    Dataset updated
    Jun 4, 2023
    Dataset provided by
    PLOS Computational Biology
    Authors
    Louisa Wood; Philipp Gebhardt
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Since 2010, the European Molecular Biology Laboratory's (EMBL) Heidelberg laboratory and the European Bioinformatics Institute (EMBL-EBI) have jointly run bioinformatics training courses developed specifically for secondary school science teachers within Europe and EMBL member states. These courses focus on introducing bioinformatics, databases, and data-intensive biology, allowing participants to explore resources and providing classroom-ready materials to support them in sharing this new knowledge with their students. In this article, we chart our progress in creating and running three bioinformatics training courses, including how the course resources are received by participants and how these, and bioinformatics in general, are subsequently used in the classroom. We assess the strengths and challenges of our approach, and share what we have learned through our interactions with European science teachers.

  3. CWL run of Alignment Workflow (CWLProv 0.6.0 Research Object)

    • data.mendeley.com
    Updated Dec 4, 2018
    Cite
    Farah Zaib Khan (2018). CWL run of Alignment Workflow (CWLProv 0.6.0 Research Object) [Dataset]. http://doi.org/10.17632/6wtpgr3kbj.1
    Explore at:
    Dataset updated
    Dec 4, 2018
    Authors
    Farah Zaib Khan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The CWL alignment workflow included in this case study was designed by Data Biosphere. It adapts the alignment pipeline originally developed at the Abecasis Lab, University of Michigan. This workflow is part of the NIH Data Commons initiative and comprises four stages. The first step, "Pre-align", accepts a Compressed Alignment Map (CRAM) file (a compressed format for BAM files developed by the European Bioinformatics Institute (EBI)) and a human genome reference sequence as input and, using SAMtools utilities such as view, sort and fixmate, returns a list of FASTQ files that serve as input for the next step. The next step, "Align", also accepts the human reference genome as input, along with the output files from "Pre-align", and uses BWA-MEM to generate aligned reads as BAM files. SAMBLASTER is used to mark duplicate reads and SAMtools view to convert read files from SAM to BAM format. The BAM files generated after "Align" are sorted with SAMtools sort. Finally, in the "Post-align" step, these sorted alignment files are merged with SAMtools merge to produce a single sorted BAM file.

    This dataset folder is a CWLProv Research Object that captures the Common Workflow Language execution provenance, see https://w3id.org/cwl/prov/0.6.0 or use https://pypi.org/project/cwlprov/ to explore
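
    The stages described above can be sketched as a dry run that only builds shell command strings. This is a minimal illustration, not the workflow's actual CWL definition: the function name, file names, the four stage labels and the simplified flags are assumptions, and nothing is executed.

```python
from typing import Dict


def sketch_alignment_stages(cram: str, reference: str, prefix: str) -> Dict[str, str]:
    """Build one illustrative shell command per stage; nothing is run."""
    fastq = f"{prefix}.fastq"            # hypothetical intermediate file names
    aligned = f"{prefix}.aligned.bam"
    sorted_bam = f"{prefix}.sorted.bam"
    merged = f"{prefix}.merged.bam"
    return {
        # Pre-align: decode the CRAM, name-sort, fix mate information, emit FASTQ
        "pre-align": (
            f"samtools view -u -T {reference} {cram} | samtools sort -n - "
            f"| samtools fixmate - - | samtools fastq - > {fastq}"
        ),
        # Align: BWA-MEM on interleaved reads, mark duplicates, convert SAM to BAM
        "align": (
            f"bwa mem -p {reference} {fastq} | samblaster "
            f"| samtools view -b -o {aligned} -"
        ),
        # Sort the aligned BAM by coordinate
        "sort": f"samtools sort -o {sorted_bam} {aligned}",
        # Post-align: merge sorted BAM files into a single output
        "post-align": f"samtools merge {merged} {sorted_bam}",
    }


if __name__ == "__main__":
    for stage, command in sketch_alignment_stages("sample.cram", "ref.fa", "sample").items():
        print(f"{stage}: {command}")
```

    In the real workflow each stage may produce several files per sample, so the merge step would list all sorted BAM files rather than the single one shown here.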

  4. SFLD

    • ebi.ac.uk
    Updated Sep 7, 2018
    Cite
    (2018). SFLD [Dataset]. https://www.ebi.ac.uk/interpro/
    Explore at:
    Dataset updated
    Sep 7, 2018
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    SFLD (Structure-Function Linkage Database) is a hierarchical classification of enzymes that relates specific sequence-structure features to specific chemical capabilities.

  5. Reanalysis of dataset PXD005780: “Relating human gut metagenome and...

    • data.niaid.nih.gov
    • ebi.ac.uk
    xml
    Updated May 4, 2022
    Cite
    Shengbo Wang; Juan Antonio Vizcaino (2022). Reanalysis of dataset PXD005780: “Relating human gut metagenome and metaproteome” [Dataset]. https://data.niaid.nih.gov/resources?id=pxd032303
    Explore at:
    Available download formats: xml
    Dataset updated
    May 4, 2022
    Dataset provided by
    European Bioinformatics Institute (http://www.ebi.ac.uk/)
    Authors
    Shengbo Wang; Juan Antonio Vizcaino
    Variables measured
    Proteomics
    Description

    This project contains raw data, intermediate files and results from a reanalysis of the publicly available PRIDE dataset PXD005780. The RAW files were processed using ThermoRawFileParser, SearchGUI and PeptideShaker through standard settings (see ‘Data Processing Protocol’). This reanalysis work is part of the MetaPUF (MetaProteomics with Unknown Function) project, which is a collaboration between EMBL-EBI and the University of Luxembourg. The dataset was selected with the following conditions: 1. It has been made publicly available in PRIDE and focuses on metaproteomics of the human gut; 2. The corresponding metagenomics assemblies were also available from ENA (European Nucleotide Archive) or MGnify. The processed peptide reports for each sample are available to view at the contig level on the MGnify website. In total, the reanalysis identified 15,417 unique proteins from 15 samples.

  6. Data_Sheet_7_Integrated Analysis of Microarray and RNA-Seq Data for the...

    • frontiersin.figshare.com
    xlsx
    Updated Jun 10, 2023
    + more versions
    Cite
    Maryum Nisar; Rehan Zafar Paracha; Iqra Arshad; Sidra Adil; Sabaoon Zeb; Rumeza Hanif; Mehak Rafiq; Zamir Hussain (2023). Data_Sheet_7_Integrated Analysis of Microarray and RNA-Seq Data for the Identification of Hub Genes and Networks Involved in the Pancreatic Cancer.xlsx [Dataset]. http://doi.org/10.3389/fgene.2021.663787.s007
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Jun 10, 2023
    Dataset provided by
    Frontiers
    Authors
    Maryum Nisar; Rehan Zafar Paracha; Iqra Arshad; Sidra Adil; Sabaoon Zeb; Rumeza Hanif; Mehak Rafiq; Zamir Hussain
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Pancreatic cancer (PaCa) is the seventh most fatal malignancy, with more than 90% mortality rate within the first year of diagnosis. Its treatment can be improved by the identification of specific therapeutic targets and their relevant pathways. Therefore, the objective of this study is to identify cancer-specific biomarkers, therapeutic targets, and their associated pathways involved in PaCa progression. RNA-seq and microarray datasets were obtained from public repositories such as the European Bioinformatics Institute (EBI) and Gene Expression Omnibus (GEO) databases. Differential gene expression (DE) analysis was performed to identify significant differentially expressed genes (DEGs) in PaCa cells in comparison to normal cells. Gene co-expression network analysis was performed to identify the modules of co-expressed genes that are strongly associated with PaCa, as well as the hub genes in those modules. The key underlying pathways were obtained from the enrichment analysis of hub genes and studied in the context of PaCa progression. The significant pathways, hub genes, and their expression profiles were validated against The Cancer Genome Atlas (TCGA) data, and key biomarkers and therapeutic targets with hub genes were determined. Important hub genes identified included ITGA1, ITGA2, ITGB1, ITGB3, MET, LAMB1, VEGFA, PTK2, and TGFβ1. Enrichment analysis characterizes the involvement of hub genes in multiple pathways, the most important being the ECM–receptor interaction and focal adhesion pathways. The interaction of overexpressed surface proteins of these pathways with extracellular molecules initiates multiple signaling cascades, including stress fiber and lamellipodia formation and the PI3K-Akt, MAPK, JAK/STAT, and Wnt signaling pathways. Identified biomarkers may have a strong influence on early-stage PaCa development and progression. Further analysis of these pathways and hub genes can help in the identification of putative therapeutic targets and the development of effective therapies for PaCa.

  7. HAVEN Datasets

    • zenodo.org
    txt, zip
    Updated May 28, 2025
    Cite
    Blessy Antony; Maryam Haghani; Adam Lauring; Anuj Karpatne; T. M. Murali (2025). HAVEN Datasets [Dataset]. http://doi.org/10.5281/zenodo.15540019
    Explore at:
    Available download formats: txt, zip
    Dataset updated
    May 28, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Blessy Antony; Maryam Haghani; Adam Lauring; Anuj Karpatne; T. M. Murali
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This folder contains the datasets used for the pre-training, fine-tuning, and evaluation of HAVEN: Hierarchical Attention for Viral protEin-based host iNference.

    HAVEN is a viral protein language model pre-trained on 1.2 million protein sequences belonging to Viridae (viruses). It is fine-tuned to predict the hosts from which the viral protein sequence was sampled ("virus-host").

    The viral protein sequences were downloaded from UniRef90. The hosts for the viral sequences are from the European Nucleotide Archive (ENA), maintained by the European Molecular Biology Laboratory-European Bioinformatics Institute (EMBL-EBI).

  8. Data from: INNUENDO whole genome and core genome MLST schemas and datasets...

    • zenodo.org
    • explore.openaire.eu
    • +1more
    zip
    Updated Jan 24, 2020
    Cite
    Mirko Rossi; Mickael Santos Da Silva; Bruno Filipe Ribeiro-Gonçalves; Diogo Nuno Silva; Miguel Paulo Machado; Mónica Oleastro; Vítor Borges; Joana Isidro; Luis Viera; Jani Halkilahti; Anniina Jaakkonen; Federica Palma; Saara Salmenlinna; Marjaana Hakkinen; Javier Garaizar; Joseba Bikandi; Friederike Hilbert; João André Carriço (2020). INNUENDO whole genome and core genome MLST schemas and datasets for Salmonella enterica [Dataset]. http://doi.org/10.5281/zenodo.1323684
    Explore at:
    Available download formats: zip
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Mirko Rossi; Mickael Santos Da Silva; Bruno Filipe Ribeiro-Gonçalves; Diogo Nuno Silva; Miguel Paulo Machado; Mónica Oleastro; Vítor Borges; Joana Isidro; Luis Viera; Jani Halkilahti; Anniina Jaakkonen; Federica Palma; Saara Salmenlinna; Marjaana Hakkinen; Javier Garaizar; Joseba Bikandi; Friederike Hilbert; João André Carriço
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset

    As a reference dataset, 4,307 publicly available draft or complete genome assemblies of Salmonella enterica, with available metadata, were downloaded from public repositories (i.e. EnteroBase, the National Center for Biotechnology Information (NCBI) and the European Bioinformatics Institute (EMBL-EBI); accessed April 2017). The collection includes 1,465 S. Enteritidis, 2,442 S. Typhimurium, and 400 genomes of other serovars frequently isolated in Europe. The dataset also includes 153 S. Typhimurium variant 4,[5],12:i:- genomes collected from different Italian regions between 2012 and 2014 during a surveillance study and 129 S. Enteritidis genomes belonging to the INNUENDO sequence dataset (PRJEB27020). The 282 additional genomes were assembled using INNUca v3.1.

    File 'Metadata/Senterica_metadata.txt' contains metadata information for each strain including source classification, host taxa, year and country of isolation, serotype, classical pubMLST 7 genes ST classification, and source/method of the assembly.

    The directory 'Genomes' contains all the 4,589 assemblies of the strains listed in 'Metadata/Senterica_metadata.txt'. Please note that genomes marked as 'Enterobase' have been downloaded from Enterobase webpage http://enterobase.warwick.ac.uk.
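
    The genome counts quoted in this description are internally consistent, which a short check makes explicit. The dictionary labels below are illustrative groupings, not column names from the metadata file.

```python
# Counts as stated in the dataset description
reference = {"S. Enteritidis": 1465, "S. Typhimurium": 2442, "other serovars": 400}
additional = {"S. Typhimurium variant 4,[5],12:i:-": 153, "S. Enteritidis (PRJEB27020)": 129}

reference_total = sum(reference.values())      # public reference assemblies
additional_total = sum(additional.values())    # genomes assembled with INNUca
grand_total = reference_total + additional_total

print(reference_total, additional_total, grand_total)  # 4307 282 4589
```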

    Schema creation and validation

    The wgMLST schema from EnteroBase was downloaded and curated using chewBBACA AutoAlleleCDSCuration to remove all alleles that are not coding sequences (CDS). The quality of the remaining loci was assessed using chewBBACA Schema Evaluation, and three groups of loci were removed: loci with single alleles, loci with high length variability (i.e. more than 1 allele outside the mode +/- 0.05 size) and loci present in less than 0.5% of the Salmonella genomes in EnteroBase at the date of the analysis (April 2017). The wgMLST schema was further curated by excluding all loci detected as “Repeated Loci” and loci annotated as “non-informative paralogous hit (NIPH/ NIPHEM)” or “Allele Larger/ Smaller than length mode (ALM/ ASM)” by the chewBBACA Allele Calling engine in more than 1% of a dataset composed of 4,589 Salmonella genomes.
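
    The three locus-level removal criteria can be sketched as a single predicate. This is not chewBBACA code: the function name and arguments are hypothetical, and "mode +/- 0.05 size" is read here as "within 5% of the modal allele length", which is an assumption about the description's wording.

```python
from collections import Counter
from typing import Sequence


def keep_locus(
    allele_lengths: Sequence[int],
    genomes_with_locus: int,
    total_genomes: int,
    tolerance: float = 0.05,      # assumed meaning: +/- 5% of the modal length
    max_outliers: int = 1,        # "more than 1 allele" outside the range is rejected
    min_presence: float = 0.005,  # present in at least 0.5% of genomes
) -> bool:
    """Return True if a locus passes all three quality criteria."""
    # Criterion 1: loci with a single allele are removed
    if len(allele_lengths) <= 1:
        return False
    # Criterion 2: high length variability around the modal allele length
    mode = Counter(allele_lengths).most_common(1)[0][0]
    low, high = mode * (1 - tolerance), mode * (1 + tolerance)
    outliers = sum(1 for length in allele_lengths if length < low or length > high)
    if outliers > max_outliers:
        return False
    # Criterion 3: locus must be present in enough genomes
    return genomes_with_locus / total_genomes >= min_presence


if __name__ == "__main__":
    print(keep_locus([300, 300, 303, 297], genomes_with_locus=4000, total_genomes=4589))
```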

    File 'Schemas/Senterica_wgMLST_ 8558_schema.tar.gz' contains the wgMLST schema formatted for chewBBACA and includes a total of 8,558 loci.

    File 'Schemas/Senterica_cgMLST_ 3255_listGenes.txt' contains the list of genes from the wgMLST schema which defines the cgMLST schema. The cgMLST schema consists of 3,255 loci and has been defined as the loci present in at least the 99% of the 4,589 Salmonella genomes. Genomes have no more than 2% of missing loci.

    File 'Allele_Profles/Senterica_wgMLST_alleleProfiles.tsv' contains the wgMLST allelic profile of the 4,589 Salmonella genomes of the dataset. Please note that missing loci follow the annotation of chewBBACA Allele Calling software.

    File 'Allele_Profles/Senterica_cgMLST_alleleProfiles.tsv' contains the cgMLST allelic profile of the 4,589 Salmonella genomes of the dataset. Please note that missing loci are indicated with a zero.
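
    The cgMLST definition above (loci present in at least 99% of genomes, with missing loci coded as zero) can be illustrated on a toy allelic profile table. The function and the toy data are hypothetical; only the zero-for-missing convention and the 99% threshold come from the description, and the zero convention applies to the cgMLST profile file, not the wgMLST one.

```python
from typing import Dict, List


def core_loci(profiles: Dict[str, Dict[str, int]], min_presence: float = 0.99) -> List[str]:
    """Return loci whose allele is called (non-zero) in at least min_presence of genomes."""
    genomes = list(profiles)
    loci = list(next(iter(profiles.values())))  # assumes every genome lists the same loci
    core = []
    for locus in loci:
        present = sum(1 for genome in genomes if profiles[genome][locus] != 0)
        if present / len(genomes) >= min_presence:
            core.append(locus)
    return core


if __name__ == "__main__":
    toy = {
        "genome1": {"locA": 1, "locB": 2, "locC": 0},
        "genome2": {"locA": 1, "locB": 0, "locC": 0},
        "genome3": {"locA": 3, "locB": 5, "locC": 4},
    }
    print(core_loci(toy))                     # ['locA']
    print(core_loci(toy, min_presence=0.6))   # ['locA', 'locB']
```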

    Additional citations

    The schemas are prepared to be used with chewBBACA. When using the schemas in this repository, please also cite:

    Silva M, Machado M, Silva D, Rossi M, Moran-Gilad J, Santos S, Ramirez M, Carriço J. chewBBACA: A complete suite for gene-by-gene schema creation and strain identification. 15/03/2018. M Gen 4(3): doi:10.1099/mgen.0.000166 http://mgen.microbiologyresearch.org/content/journal/mgen/10.1099/mgen.0.000166

    The Salmonella enterica schema is derived from the EnteroBase Salmonella wgMLST schema. When using the schema in this repository, please also cite:

    Alikhan N-F, Zhou Z, Sergeant MJ, Achtman M (2018) A genomic overview of the population structure of Salmonella. PLoS Genet 14 (4):e1007261. https://doi.org/10.1371/journal.pgen.1007261

  9. Data from: Saccharomyces genome database informs human biology

    • ckan.grassroots.tools
    pdf
    Updated Aug 7, 2019
    Cite
    European Bioinformatics Institute (2019). Saccharomyces genome database informs human biology [Dataset]. https://ckan.grassroots.tools/dataset/a474c44c-efd7-48cc-98b2-fe0f0c209bd5
    Explore at:
    Available download formats: pdf
    Dataset updated
    Aug 7, 2019
    Dataset provided by
    European Bioinformatics Institute (http://www.ebi.ac.uk/)
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    The Saccharomyces Genome Database (SGD; http://www.yeastgenome.org) is an expertly curated database of literature-derived functional information for the model organism budding yeast, Saccharomyces cerevisiae. SGD constantly strives to synergize new types of experimental data and bioinformatics predictions with existing data, and to organize them into a comprehensive and up-to-date information resource. The primary mission of SGD is to facilitate research into the biology of yeast and to provide this wealth of information to advance, in many ways, research on other organisms, even those as evolutionarily distant as humans. To build such a bridge between biological kingdoms, SGD is curating data regarding yeast-human complementation, in which a human gene can successfully replace the function of a yeast gene, and/or vice versa. These data are manually curated from published literature, made available for download, and incorporated into a variety of analysis tools provided by SGD.

  10. Reanalysis of dataset PXD003791: “MuSt multiomics - Integrated multi-omics...

    • data.niaid.nih.gov
    xml
    Updated Jun 17, 2022
    Cite
    Shengbo Wang; Juan Antonio Vizcaino (2022). Reanalysis of dataset PXD003791: “MuSt multiomics - Integrated multi-omics of the human gut microbiome in a case study of familial type 1 diabetes” [Dataset]. https://data.niaid.nih.gov/resources?id=pxd034617
    Explore at:
    Available download formats: xml
    Dataset updated
    Jun 17, 2022
    Dataset provided by
    European Bioinformatics Institute (http://www.ebi.ac.uk/)
    Authors
    Shengbo Wang; Juan Antonio Vizcaino
    Variables measured
    Proteomics
    Description

    This project contains raw data, intermediate files and results from a reanalysis of the publicly available PRIDE dataset PXD003791. The RAW files were processed using ThermoRawFileParser, SearchGUI and PeptideShaker through standard settings (see ‘Data Processing Protocol’). This reanalysis work is part of the MetaPUF (MetaProteomics with Unknown Function) project, which is a collaboration between EMBL-EBI and the University of Luxembourg. The dataset was selected with the following conditions: 1. It has been made publicly available in PRIDE and focuses on metaproteomics of the human gut; 2. The corresponding metagenomics assemblies were also available from ENA (European Nucleotide Archive) or MGnify. The processed peptide reports for each sample are available to view at the contig level on the MGnify website. In total, the reanalysis identified 59,613 unique proteins from 36 samples.

  11. IMGT/HLA

    • scicrunch.org
    • dknet.org
    • +2more
    Updated Nov 28, 2022
    Cite
    (2022). IMGT/HLA [Dataset]. http://identifiers.org/RRID:SCR_002971
    Explore at:
    Dataset updated
    Nov 28, 2022
    Description

    Database for sequences of the human major histocompatibility complex (HLA), including the official sequences for the WHO Nomenclature Committee for Factors of the HLA System. It currently contains 9,310 allele sequences (2013), along with detailed information concerning the material from which each sequence was derived and data on the validation of the sequences. It is established procedure for authors to submit sequences directly to the IMGT/HLA Database for checking and assignment of an official name prior to publication; this avoids the problems associated with renaming published sequences and the confusion of multiple names for the same sequence. The need for reasonably rapid publication of new HLA allele sequences has necessitated an annual meeting of the WHO Nomenclature Committee for Factors of the HLA System. Additionally, they now publish monthly HLA nomenclature updates both in journals and online to provide quick and easy access to new sequence information. The IMGT/HLA Database is part of the international ImMunoGeneTics project. In collaboration with the Imperial Cancer Research Fund (ICRF) and the European Bioinformatics Institute (EBI), they have developed an Oracle database to house the HLA sequences in a way that allows users to submit complex queries about the sequence, sequence features, references, contacts and allele designations via a graphical user interface over the web. The IMGT/HLA Database Submission Tool allows direct submission of sequences to the WHO Nomenclature Committee for Factors of the HLA System. The IMGT/HLA Database provides an FTP site for the retrieval of sequences in a number of pre-formatted files.

  12. SMART

    • ebi.ac.uk
    Updated Feb 14, 2020
    Cite
    (2020). SMART [Dataset]. https://www.ebi.ac.uk/interpro/
    Explore at:
    Dataset updated
    Feb 14, 2020
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    SMART (a Simple Modular Architecture Research Tool) allows the identification and annotation of genetically mobile domains and the analysis of domain architectures. SMART is based at EMBL, Heidelberg, Germany.

  13. Data from: The genomic landscape of ribosomal peptides containing thiazole...

    • data.niaid.nih.gov
    • datadryad.org
    zip
    Updated Sep 11, 2016
    Cite
    Courtney L. Cox; James R. Doroghazi; Douglas A. Mitchell (2016). The genomic landscape of ribosomal peptides containing thiazole and oxazole heterocycles [Dataset]. http://doi.org/10.5061/dryad.7q830
    Explore at:
    Available download formats: zip
    Dataset updated
    Sep 11, 2016
    Dataset provided by
    University of Illinois Urbana-Champaign
    Authors
    Courtney L. Cox; James R. Doroghazi; Douglas A. Mitchell
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Background: Ribosomally synthesized and post-translationally modified peptides (RiPPs) are a burgeoning class of natural products with diverse activity that share a similar origin and common features in their biosynthetic pathways. The precursor peptides of these natural products are ribosomally produced, upon which a combination of modification enzymes installs diverse functional groups. This genetically encoded peptide-based strategy allows for rapid diversification of these natural products by mutation in the precursor genes merged with unique combinations of modification enzymes. Thiazole/oxazole-modified microcins (TOMMs) are a class of RiPPs defined by the presence of heterocycles derived from cysteine, serine, and threonine residues in the precursor peptide. TOMMs encompass a number of different families, including but not limited to the linear azol(in)e-containing peptides (streptolysin S, microcin B17, and plantazolicin), cyanobactins, thiopeptides, and bottromycins. Although many TOMMs have been explored, the increased availability of genome sequences has illuminated several unexplored TOMM producers.

    Methods: All YcaO domain-containing proteins (D protein) and the surrounding genomic regions were obtained from the European Molecular Biology Laboratory (EMBL) and the European Bioinformatics Institute (EBI). MultiGeneBlast was used to group gene clusters containing a D protein. A number of techniques were used to identify TOMM biosynthetic gene clusters from the D protein-containing gene clusters. Precursor peptides from these gene clusters were also identified. Both sequence similarity and phylogenetic analysis were used to classify the 20 diverse TOMM clusters identified.

    Results: Given the remarkable structural and functional diversity displayed by known TOMMs, a comprehensive bioinformatic study to catalog and classify the entire RiPP class was undertaken. Here we report the bioinformatic characterization of nearly 1,500 TOMM gene clusters from genomes in the European Molecular Biology Laboratory (EMBL) and the European Bioinformatics Institute (EBI) sequence repository. Genome mining suggests a complex diversification of modification enzymes and precursor peptides to create more than 20 distinct families of TOMMs, nine of which have not heretofore been described. Many of the identified TOMM families have an abundance of diverse precursor peptide sequences as well as unfamiliar combinations of modification enzymes, signifying a potential wealth of novel natural products on known and unknown biosynthetic scaffolds. Phylogenetic analysis suggests a widespread distribution of TOMMs across multiple phyla; however, producers of similar TOMMs are generally found in the same phylum with few exceptions.

    Conclusions: The comprehensive genome mining study described herein has uncovered a myriad of unique TOMM biosynthetic clusters and provides an atlas to guide future discovery efforts. These biosynthetic gene clusters are predicted to produce diverse final products, and the identification of additional combinations of modification enzymes could expand the potential of combinatorial natural product biosynthesis.

  14. BioSamples RDF

    • data.wu.ac.at
    api/sparql, meta/void +1
    Updated Jul 30, 2016
    + more versions
    Cite
    EMBL European Bioinformatics Institute (2016). BioSamples RDF [Dataset]. https://data.wu.ac.at/odso/datahub_io/OWU5ZjQwN2ItN2E3Mi00OGMyLWE0OTYtZGI1MTkzMjI5YjQw
    Explore at:
    Available download formats: api/sparql, meta/void, ttl
    Dataset updated
    Jul 30, 2016
    Dataset provided by
    European Molecular Biology Laboratory (http://www.embl.org/)
    European Bioinformatics Institute (http://www.ebi.ac.uk/)
    Description

    The BioSamples database aggregates sample information for reference samples (e.g. Coriell Cell lines) and samples for which data exist in one of the EBI's assay databases, such as ArrayExpress, the European Nucleotide Archive or the PRoteomics IDEntifications database (PRIDE). It provides links to assays and specific samples, and accepts direct submissions of sample information.

  15. The human phosphoproteome map based on PRIDE data

    • data.niaid.nih.gov
    • ebi.ac.uk
    xml
    Updated Feb 13, 2019
    Cite
    Andrew Jarnuczak; Juan Antonio Vizcaino (2019). The human phosphoproteome map based on PRIDE data [Dataset]. https://data.niaid.nih.gov/resources?id=pxd012174
    Explore at:
    Available download formats: xml
    Dataset updated
    Feb 13, 2019
    Dataset provided by
    European Bioinformatics Institute (http://www.ebi.ac.uk/)
    EBI
    Authors
    Andrew Jarnuczak; Juan Antonio Vizcaino
    Variables measured
    Proteomics
    Description

    This project contains raw data, intermediate files and results used to create the PRIDE human phosphoproteome map. The map is based on joint reanalysis of 110 publicly available human datasets. All relevant datasets were retrieved from the PRIDE database, and after manual curation, only assays that employed dedicated phospho-enrichment sample preparation strategies (e.g. metal oxide affinity chromatography, anti-P-Tyr antibodies, etc.) were included. Raw files were jointly processed with the MaxQuant computational platform using standard settings (see Data Processing Protocol). In total, the joint analysis allowed identification of 252,189 phosphosites at 1% peptide spectrum match false discovery rate (PSM FDR) (MQ search results available in ‘txt-100PTM’ folder), of which 121,896 passed the additional 1% site localization FDR threshold (MQ search results available in ‘txt-001PTM’ folder).

  16. CATH-Gene3D

    • ebi.ac.uk
    Updated Oct 21, 2020
    Cite
    (2020). CATH-Gene3D [Dataset]. https://www.ebi.ac.uk/interpro/
    Explore at:
    Dataset updated
    Oct 21, 2020
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The CATH-Gene3D database describes protein families and domain architectures in complete genomes. Protein families are formed using a Markov clustering algorithm, followed by multi-linkage clustering according to sequence identity. Mapping of predicted structure and sequence domains is undertaken using hidden Markov model libraries representing CATH and Pfam domains. CATH-Gene3D is based at University College London, UK.

  17. Data from: Ensembl Genomes 2018: an integrated omics infrastructure for...

    • ckan.grassroots.tools
    pdf
    Updated Aug 7, 2019
    Cite
    European Bioinformatics Institute (2019). Ensembl Genomes 2018: an integrated omics infrastructure for non-vertebrate species [Dataset]. https://ckan.grassroots.tools/ar/dataset/25cd369b-485b-48e7-9a2c-c2e168b47f45
    Explore at:
    Available download formats: pdf
    Dataset updated
    Aug 7, 2019
    Dataset provided by
    European Bioinformatics Institute (http://www.ebi.ac.uk/)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Ensembl Genomes (http://www.ensemblgenomes.org) is an integrating resource for genome-scale data from non-vertebrate species, complementing the resources for vertebrate genomics developed in the Ensembl project (http://www.ensembl.org). Together, the two resources provide a consistent set of programmatic and interactive interfaces to a rich range of data including genome sequence, gene models, transcript sequence, genetic variation, and comparative analysis. This paper provides an update to the previous publications about the resource, with a focus on recent developments and expansions. These include the incorporation of almost 20 000 additional genome sequences and over 35 000 tracks of RNA-Seq data, which have been aligned to genomic sequence and made available for visualization. Other advances since 2015 include the release of the database in Resource Description Framework (RDF) format, a large increase in community-derived curation, a new high-performance protein sequence search, additional cross-references, improved annotation of non-protein-coding genes, and the launch of pre-release and archival sites. Collectively, these changes are part of a continuing response to the increasing quantity of publicly-available genome-scale data, and the consequent need to archive, integrate, annotate and disseminate these using automated, scalable methods.

  18. Data from: CWL run of Alignment Workflow (CWLProv 0.6.0 Research Object)

    • data.niaid.nih.gov
    Updated Jan 24, 2020
    Cite
    Farah Zaib Khan (2020). CWL run of Alignment Workflow (CWLProv 0.6.0 Research Object) [Dataset]. https://data.niaid.nih.gov/resources?id=ZENODO_2632836
    Explore at:
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Stian Soiland-Reyes
    Farah Zaib Khan
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset folder is a CWLProv Research Object that captures the provenance of a Common Workflow Language execution. See CWLProv 0.6.0, or use the cwlprov Python tool to explore it.

    The CWL alignment workflow included in this case study was designed by Data Biosphere. It adapts the alignment pipeline originally developed at the Abecasis Lab, University of Michigan. The workflow is part of the NIH Data Commons initiative and comprises four stages.

    The first step, Pre-align, accepts a Compressed Alignment Map (CRAM) file (a compressed format for BAM files developed by the European Bioinformatics Institute (EBI)) and the human genome reference sequence as input. Using SAMtools utilities such as view, sort and fixmate, it returns a list of fastq files that serve as input for the next step.

    The next step, Align, also accepts the human reference genome as input, along with the output files from Pre-align, and uses BWA-mem to generate aligned reads as BAM files. SAMBLASTER is used to mark duplicate reads, and SAMtools view converts the read files from SAM to BAM format.

    The BAM files generated by Align are then sorted with SAMtools sort.

    Finally, in the Post-align step, these sorted alignment files are merged with SAMtools merge to produce a single sorted BAM file.
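The ordering of the four stages can be pictured as a small data-flow model. The sketch below is only an illustration of the description above, not part of the workflow itself. The `Stage` class, the `check_order` helper, the artefact labels and the stage name "Sort" (the description does not name the third stage) are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    tools: tuple      # tools the description names for this stage
    consumes: tuple   # artefact labels the stage reads
    produces: tuple   # artefact labels the stage writes


# Artefact labels are illustrative only; they are not file names used by
# the real workflow. "Sort" is a hypothetical name for the unnamed third stage.
STAGES = (
    Stage("Pre-align", ("SAMtools view", "SAMtools sort", "SAMtools fixmate"),
          ("cram", "reference"), ("fastq",)),
    Stage("Align", ("BWA-mem", "SAMBLASTER", "SAMtools view"),
          ("fastq", "reference"), ("bam",)),
    Stage("Sort", ("SAMtools sort",), ("bam",), ("sorted_bam",)),
    Stage("Post-align", ("SAMtools merge",), ("sorted_bam",), ("merged_bam",)),
)


def check_order(stages, initial_inputs):
    """Return every artefact available at the end, or raise if a stage
    would run before its inputs exist."""
    available = set(initial_inputs)
    for stage in stages:
        missing = set(stage.consumes) - available
        if missing:
            raise ValueError(f"{stage.name} is missing {sorted(missing)}")
        available.update(stage.produces)
    return available


if __name__ == "__main__":
    print(sorted(check_order(STAGES, ["cram", "reference"])))
```

Running the stages in any other order makes `check_order` raise, which mirrors the fact that each stage depends on the output of the one before it.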

    Steps to reproduce

    This analysis was run on a 16-core Linux cloud instance with 64 GB of RAM and Docker pre-installed.

    Install gsutil:

    export CLOUD_SDK_REPO="cloud-sdk-$(lsb_release -c -s)"

    echo "deb http://packages.cloud.google.com/apt $CLOUD_SDK_REPO main" |
    sudo tee -a /etc/apt/sources.list.d/google-cloud-sdk.list

    curl https://packages.cloud.google.com/apt/doc/apt-key.gpg |
    sudo apt-key add -

    sudo apt-get update && sudo apt-get install google-cloud-sdk

    Get the data and make the analysis environment ready:

    git clone https://github.com/FarahZKhan/topmed-workflows.git

    cd topmed-workflows

    git checkout cwlprov_testing

    cd aligner/sbg-alignment-cwl

    The following custom script downloads the Google bucket files referenced in a JSON file and creates a local JSON file. It requires gsutil to be installed.

    git clone https://github.com/DailyDreaming/fetch_gs_frm_json.git

    Be patient: this step should download about 18 GB.

    python2.7 fetch_gs_frm_json/dl_gsfiles_frm_json.py topmed-alignment.sample.json
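The idea behind this step, finding bucket references in a job file and producing a local copy of that file, can be sketched as follows. This is not the code of dl_gsfiles_frm_json.py. It is a minimal Python 3 illustration that assumes bucket files appear as strings starting with `gs://`. The function name `localise` and the sample job below are hypothetical. The sketch only rewrites paths; the download itself is left to gsutil.

```python
import json
import posixpath


def localise(node, found):
    """Return a copy of `node` in which every gs:// string is replaced by
    its base name; the original URLs are appended to `found`."""
    if isinstance(node, dict):
        return {key: localise(value, found) for key, value in node.items()}
    if isinstance(node, list):
        return [localise(value, found) for value in node]
    if isinstance(node, str) and node.startswith("gs://"):
        found.append(node)  # each URL collected here would be fetched with gsutil
        return posixpath.basename(node)
    return node


if __name__ == "__main__":
    # Hypothetical job description; the real sample JSON may look different.
    job = {
        "input_file": {"class": "File",
                       "path": "gs://example-bucket/data/sample.cram"},
        "threads": 4,
    }
    urls = []
    local_job = localise(job, urls)
    print(urls)
    print(json.dumps(local_job, indent=2))
```

Keeping the traversal recursive means nested lists and objects are handled without knowing the layout of the job file in advance.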

    Run the following commands to create the CWLProv Research Object:

    time cwltool --no-match-user --provenance alignmnentwf0.6.0 --tmp-outdir-prefix=/CWLProv_workflow_testing/intermediate_temp/temp --tmpdir-prefix=/CWLProv_workflow_testing/intermediate_temp/temp topmed-alignment.cwl topmed-alignment.sample.json.new

    zip -r alignment_0.6.0_linux.zip alignment_0.6.0_linux

    sha256sum alignment_0.6.0_linux.zip > alignment_0.6.0_linux.zip.sha25
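A recipient of the archive can recompute the digest to confirm that the download is intact. The sketch below shows one way to do that in Python with the standard `hashlib` module. The function names `sha256_of` and `checksum_line` are made up for this example, and the file name in the usage line is simply the archive name from the command above.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 hex digest of a file, reading it in chunks so
    that large archives do not have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_line(path):
    """Format the digest the way sha256sum prints it by default:
    hex digest, two spaces, file name."""
    return f"{sha256_of(path)}  {path}"


if __name__ == "__main__":
    print(checksum_line("alignment_0.6.0_linux.zip"))
```

Comparing the printed line with the contents of the stored checksum file shows whether the archive has changed since it was created.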

  19. PRINTS

    • ebi.ac.uk
    Updated Jun 14, 2012
    Cite
    (2012). PRINTS [Dataset]. https://www.ebi.ac.uk/interpro/
    Explore at:
    Dataset updated
    Jun 14, 2012
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    PRINTS is a compendium of protein fingerprints. A fingerprint is a group of conserved motifs used to characterise a protein family or domain. PRINTS is based at the University of Manchester, UK.

  20. Data from: Ensembl Genomes 2020—enabling non-vertebrate genomic research

    • ckan.grassroots.tools
    pdf
    Updated Sep 15, 2022
    Cite
    European Bioinformatics Institute (2022). Ensembl Genomes 2020—enabling non-vertebrate genomic research [Dataset]. https://ckan.grassroots.tools/dataset/6463ab50-ba71-44b6-85b0-f2aa9e67fef2
    Explore at:
    Available download formats: pdf
    Dataset updated
    Sep 15, 2022
    Dataset provided by
    European Bioinformatics Institute (http://www.ebi.ac.uk/)
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Ensembl Genomes (http://www.ensemblgenomes.org) is an integrating resource for genome-scale data from non-vertebrate species, complementing the resources for vertebrate genomics developed in the context of the Ensembl project (http://www.ensembl.org). Together, the two resources provide a consistent set of interfaces to genomic data across the tree of life, including reference genome sequence, gene models, transcriptional data, genetic variation and comparative analysis. Data may be accessed via our website, online tools platform and programmatic interfaces, with updates made four times per year (in synchrony with Ensembl). Here, we provide an overview of Ensembl Genomes, with a focus on recent developments. These include the continued growth, more robust and reproducible sets of orthologues and paralogues, and enriched views of gene expression and gene function in plants. Finally, we report on our continued deeper integration with the Ensembl project, which forms a key part of our future strategy for dealing with the increasing quantity of available genome-scale data across the tree of life.
