25 datasets found

PROSITE profiles
ebi.ac.uk
Updated Feb 5, 2025
Cite
(2025). PROSITE profiles [Dataset]. https://www.ebi.ac.uk/interpro/
Dataset updated
Feb 5, 2025
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
PROSITE is a database of protein families and domains. It consists of biologically significant sites, patterns and profiles that help to reliably identify to which known protein family a new sequence belongs. PROSITE is based at the Swiss Institute of Bioinformatics (SIB), Geneva, Switzerland.
Bioinformatics Goes to School—New Avenues for Teaching Contemporary Biology
plos.figshare.com
doc
Updated Jun 4, 2023
Cite
Louisa Wood; Philipp Gebhardt (2023). Bioinformatics Goes to School—New Avenues for Teaching Contemporary Biology [Dataset]. http://doi.org/10.1371/journal.pcbi.1003089
Available download formats: doc
Unique identifier
https://doi.org/10.1371/journal.pcbi.1003089
Dataset updated
Jun 4, 2023
Dataset provided by
PLOS Computational Biology
Authors
Louisa Wood; Philipp Gebhardt
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Since 2010, the European Molecular Biology Laboratory's (EMBL) Heidelberg laboratory and the European Bioinformatics Institute (EMBL-EBI) have jointly run bioinformatics training courses developed specifically for secondary school science teachers within Europe and EMBL member states. These courses focus on introducing bioinformatics, databases, and data-intensive biology, allowing participants to explore resources and providing classroom-ready materials to support them in sharing this new knowledge with their students. In this article, we chart our progress in creating and running three bioinformatics training courses, including how the course resources are received by participants and how these, and bioinformatics in general, are subsequently used in the classroom. We assess the strengths and challenges of our approach, and share what we have learned through our interactions with European science teachers.
CWL run of Alignment Workflow (CWLProv 0.6.0 Research Object)
data.mendeley.com
Updated Dec 4, 2018
Cite
Farah Zaib Khan (2018). CWL run of Alignment Workflow (CWLProv 0.6.0 Research Object) [Dataset]. http://doi.org/10.17632/6wtpgr3kbj.1
Unique identifier
https://doi.org/10.17632/6wtpgr3kbj.1
Dataset updated
Dec 4, 2018
Authors
Farah Zaib Khan
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The CWL alignment workflow included in this case study was designed by Data Biosphere. It adapts the alignment pipeline originally developed at the Abecasis Lab, University of Michigan. This workflow is part of the NIH Data Commons initiative and comprises four stages. The first step, "Pre-align", accepts a Compressed Alignment Map (CRAM) file (a compressed format for BAM files developed by the European Bioinformatics Institute (EBI)) and the human genome reference sequence as input and, using SAMtools utilities such as view, sort and fixmate, returns a list of FASTQ files that can be used as input for the next step. The next step, "Align", also accepts the human reference genome as input, along with the output files from "Pre-align", and uses BWA-MEM to generate aligned reads as BAM files. SAMBLASTER is used to mark duplicate reads and SAMtools view to convert read files from SAM to BAM format. The BAM files generated after "Align" are sorted with SAMtools sort. Finally, in the "Post-align" step, these sorted alignment files are merged into a single sorted BAM file using SAMtools merge.

This dataset folder is a CWLProv Research Object that captures the Common Workflow Language execution provenance, see https://w3id.org/cwl/prov/0.6.0 or use https://pypi.org/project/cwlprov/ to explore
SFLD
ebi.ac.uk
Updated Sep 7, 2018
Cite
(2018). SFLD [Dataset]. https://www.ebi.ac.uk/interpro/
Dataset updated
Sep 7, 2018
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
SFLD (Structure-Function Linkage Database) is a hierarchical classification of enzymes that relates specific sequence-structure features to specific chemical capabilities.
Reanalysis of dataset PXD005780: “Relating human gut metagenome and...
data.niaid.nih.gov
ebi.ac.uk
xml
Updated May 4, 2022
Cite
Shengbo Wang; Juan Antonio Vizcaino (2022). Reanalysis of dataset PXD005780: “Relating human gut metagenome and metaproteome” [Dataset]. https://data.niaid.nih.gov/resources?id=pxd032303
Available download formats: xml
Dataset updated
May 4, 2022
Dataset provided by
European Bioinformatics Institute http://www.ebi.ac.uk/
Authors
Shengbo Wang; Juan Antonio Vizcaino
Variables measured
Proteomics
Description
This project, which contains raw data, intermediate files and results, is a re-analysis of the publicly available PRIDE dataset PXD005780. The RAW files were processed using ThermoRawFileParser, SearchGUI and PeptideShaker with standard settings (see ‘Data Processing Protocol’). This reanalysis work is part of the MetaPUF (MetaProteomics with Unknown Function) project, which is a collaboration between EMBL-EBI and the University of Luxembourg. The dataset was selected under the following conditions: 1. It has been made publicly available in PRIDE and focuses on metaproteomics of the human gut; 2. The corresponding metagenomics assemblies were also available from ENA (European Nucleotide Archive) or MGnify. The processed peptide reports for each sample are available to view at the contig level on the MGnify website. In total, the reanalysis identified 15,417 unique proteins from 15 samples.
Data_Sheet_7_Integrated Analysis of Microarray and RNA-Seq Data for the...
frontiersin.figshare.com
xlsx
Updated Jun 10, 2023
Cite
Maryum Nisar; Rehan Zafar Paracha; Iqra Arshad; Sidra Adil; Sabaoon Zeb; Rumeza Hanif; Mehak Rafiq; Zamir Hussain (2023). Data_Sheet_7_Integrated Analysis of Microarray and RNA-Seq Data for the Identification of Hub Genes and Networks Involved in the Pancreatic Cancer.xlsx [Dataset]. http://doi.org/10.3389/fgene.2021.663787.s007
Available download formats: xlsx
Unique identifier
https://doi.org/10.3389/fgene.2021.663787.s007
Dataset updated
Jun 10, 2023
Dataset provided by
Frontiers
Authors
Maryum Nisar; Rehan Zafar Paracha; Iqra Arshad; Sidra Adil; Sabaoon Zeb; Rumeza Hanif; Mehak Rafiq; Zamir Hussain
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Pancreatic cancer (PaCa) is the seventh most fatal malignancy, with a mortality rate of more than 90% within the first year of diagnosis. Its treatment can be improved by the identification of specific therapeutic targets and their relevant pathways. Therefore, the objective of this study is to identify cancer-specific biomarkers, therapeutic targets, and their associated pathways involved in PaCa progression. RNA-seq and microarray datasets were obtained from public repositories such as the European Bioinformatics Institute (EBI) and Gene Expression Omnibus (GEO) databases. Differential gene expression (DE) analysis was performed to identify significantly differentially expressed genes (DEGs) in PaCa cells in comparison to normal cells. Gene co-expression network analysis was performed to identify the modules of co-expressed genes that are strongly associated with PaCa, as well as the hub genes in those modules. The key underlying pathways were obtained from the enrichment analysis of hub genes and studied in the context of PaCa progression. The significant pathways, hub genes, and their expression profiles were validated against The Cancer Genome Atlas (TCGA) data, and key biomarkers and therapeutic targets with hub genes were determined. Important hub genes identified included ITGA1, ITGA2, ITGB1, ITGB3, MET, LAMB1, VEGFA, PTK2, and TGFβ1. Enrichment analysis characterizes the involvement of hub genes in multiple pathways, the most important being the ECM–receptor interaction and focal adhesion pathways. The interaction of overexpressed surface proteins of these pathways with extracellular molecules initiates multiple signaling cascades including stress fiber and lamellipodia formation, PI3K-Akt, MAPK, JAK/STAT, and Wnt signaling pathways. Identified biomarkers may have a strong influence on early-stage PaCa development and progression. Further analysis of these pathways and hub genes can help in the identification of putative therapeutic targets and the development of effective therapies for PaCa.
HAVEN Datasets
zenodo.org
txt, zip
Updated May 28, 2025
Cite
Blessy Antony; Maryam Haghani; Adam Lauring; Anuj Karpatne; T. M. Murali (2025). HAVEN Datasets [Dataset]. http://doi.org/10.5281/zenodo.15540019
Available download formats: txt, zip
Unique identifier
https://doi.org/10.5281/zenodo.15540019
Dataset updated
May 28, 2025
Dataset provided by
Zenodo http://zenodo.org/
Authors
Blessy Antony; Maryam Haghani; Adam Lauring; Anuj Karpatne; T. M. Murali
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This folder contains the datasets used for the pre-training, fine-tuning, and evaluation of HAVEN: Hierarchical Attention for Viral protEin-based host iNference.

HAVEN is a viral protein language model pre-trained on 1.2 million protein sequences belonging to Viridae (viruses). It is fine-tuned to predict the hosts from which the viral protein sequence was sampled ("virus-host").

The viral protein sequences were downloaded from UniRef90. The hosts for the viral sequences are taken from the European Nucleotide Archive (ENA), maintained by the European Molecular Biology Laboratory-European Bioinformatics Institute (EMBL-EBI).
Data from: INNUENDO whole genome and core genome MLST schemas and datasets...
zenodo.org
explore.openaire.eu
+1more
zip
Updated Jan 24, 2020
Cite
Mirko Rossi; Mickael Santos Da Silva; Bruno Filipe Ribeiro-Gonçalves; Diogo Nuno Silva; Miguel Paulo Machado; Mónica Oleastro; Vítor Borges; Joana Isidro; Luis Viera; Jani Halkilahti; Anniina Jaakkonen; Federica Palma; Saara Salmenlinna; Marjaana Hakkinen; Javier Garaizar; Joseba Bikandi; Friederike Hilbert; João André Carriço (2020). INNUENDO whole genome and core genome MLST schemas and datasets for Salmonella enterica [Dataset]. http://doi.org/10.5281/zenodo.1323684
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.1323684
Dataset updated
Jan 24, 2020
Dataset provided by
Zenodo http://zenodo.org/
Authors
Mirko Rossi; Mickael Santos Da Silva; Bruno Filipe Ribeiro-Gonçalves; Diogo Nuno Silva; Miguel Paulo Machado; Mónica Oleastro; Vítor Borges; Joana Isidro; Luis Viera; Jani Halkilahti; Anniina Jaakkonen; Federica Palma; Saara Salmenlinna; Marjaana Hakkinen; Javier Garaizar; Joseba Bikandi; Friederike Hilbert; João André Carriço
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Dataset

As a reference dataset, 4,307 publicly available draft or complete genome assemblies of Salmonella enterica, with available metadata, were downloaded from public repositories (i.e. EnteroBase, the National Center for Biotechnology Information (NCBI) and the European Bioinformatics Institute (EMBL-EBI); accessed April 2017). The collection includes 1,465 S. Enteritidis, 2,442 S. Typhimurium, and 400 genomes of other serovars frequently isolated in Europe. The dataset also includes 153 S. Typhimurium variant 4,[5],12:i:- genomes collected from different Italian regions between 2012 and 2014 during a surveillance study and 129 S. Enteritidis genomes belonging to the INNUENDO sequence dataset (PRJEB27020). The 282 additional genomes were assembled using INNUca v3.1.

File 'Metadata/Senterica_metadata.txt' contains metadata information for each strain including source classification, host taxa, year and country of isolation, serotype, classical pubMLST 7 genes ST classification, and source/method of the assembly.

The directory 'Genomes' contains all the 4,589 assemblies of the strains listed in 'Metadata/Senterica_metadata.txt'. Please note that genomes marked as 'Enterobase' have been downloaded from Enterobase webpage http://enterobase.warwick.ac.uk.

Schema creation and validation

The wgMLST schema from EnteroBase was downloaded and curated using chewBBACA AutoAlleleCDSCuration to remove all alleles that are not coding sequences (CDS). The quality of the remaining loci was assessed using chewBBACA Schema Evaluation, and loci with single alleles, those with high length variability (i.e. if more than 1 allele is outside the mode +/- 0.05 size) and those present in less than 0.5% of the Salmonella genomes in EnteroBase at the date of the analysis (April 2017) were removed. The wgMLST schema was further curated by excluding all loci detected as “Repeated Loci” and loci annotated as “non-informative paralogous hit (NIPH/ NIPHEM)” or “Allele Larger/ Smaller than length mode (ALM/ ASM)” by the chewBBACA Allele Calling engine in more than 1% of a dataset composed of 4,589 Salmonella genomes.
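
The length-variability criterion can be illustrated with a small awk sketch. This is not the chewBBACA Schema Evaluation code. It assumes that "mode +/- 0.05 size" means allele lengths within 5% of the modal length, reads one allele length per line from standard input, and uses the hypothetical function name `alleles_outside_mode`. Resolving ties for the mode toward the shorter length is also an assumption.

```shell
# Hypothetical helper: count alleles whose length falls outside
# mode * (1 +/- tol). Reads one allele length (bp) per line on stdin.
alleles_outside_mode() {
    awk -v tol="${1:-0.05}" '
        NF {
            len[++n] = $1 + 0        # store lengths as numbers
            count[$1 + 0]++          # frequency of each length
        }
        END {
            # Find the modal length; ties go to the shorter length (assumption).
            best = 0
            for (l in count) {
                if (count[l] > best || (count[l] == best && l + 0 < mode)) {
                    best = count[l]
                    mode = l + 0
                }
            }
            outside = 0
            for (i = 1; i <= n; i++)
                if (len[i] < mode * (1 - tol) || len[i] > mode * (1 + tol))
                    outside++
            print outside
        }
    '
}

# Toy locus: modal length 300 bp, so the accepted range is 285 to 315 bp.
printf '300\n300\n300\n303\n240\n450\n' | alleles_outside_mode
```

Under the rule quoted above, a locus would be removed when this count is greater than 1. The toy locus has two such alleles (240 and 450 bp), so it would be flagged.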

File 'Schemas/Senterica_wgMLST_ 8558_schema.tar.gz' contains the wgMLST schema formatted for chewBBACA and includes a total of 8,558 loci.

File 'Schemas/Senterica_cgMLST_ 3255_listGenes.txt' contains the list of genes from the wgMLST schema that define the cgMLST schema. The cgMLST schema consists of 3,255 loci, defined as the loci present in at least 99% of the 4,589 Salmonella genomes. Genomes have no more than 2% of missing loci.

File 'Allele_Profles/Senterica_wgMLST_alleleProfiles.tsv' contains the wgMLST allelic profile of the 4,589 Salmonella genomes of the dataset. Please note that missing loci follow the annotation of chewBBACA Allele Calling software.

File 'Allele_Profles/Senterica_cgMLST_alleleProfiles.tsv' contains the cgMLST allelic profile of the 4,589 Salmonella genomes of the dataset. Please note that missing loci are indicated with a zero.
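
The 99% presence rule used to define the cgMLST schema can be sketched with awk. This is a minimal illustration, not the INNUENDO or chewBBACA implementation. It assumes a tab-separated profile table with a header row of locus names, one genome per row with the genome identifier in the first column, and a zero marking a missing locus. The function name `core_loci` and the toy table are hypothetical.

```shell
# Hypothetical helper: print loci that are present (non-zero) in at least a
# given fraction of genomes. Assumes a TSV with a header row (first column is
# the genome ID, remaining columns are loci) and 0 marking a missing locus.
core_loci() {
    profiles="$1"      # path to the allele profile table
    threshold="$2"     # minimum fraction of genomes, e.g. 0.99
    awk -F '\t' -v t="$threshold" '
        NR == 1 { for (i = 2; i <= NF; i++) name[i] = $i; ncol = NF; next }
        {
            genomes++
            for (i = 2; i <= ncol; i++) if ($i != "0") present[i]++
        }
        END {
            for (i = 2; i <= ncol; i++)
                if (genomes > 0 && present[i] / genomes >= t) print name[i]
        }
    ' "$profiles"
}

# Toy table with four genomes and three loci (values are allele numbers).
demo=$(mktemp)
printf 'FILE\tlocusA\tlocusB\tlocusC\n' >  "$demo"
printf 'g1\t1\t2\t0\n'                  >> "$demo"
printf 'g2\t1\t0\t0\n'                  >> "$demo"
printf 'g3\t3\t2\t5\n'                  >> "$demo"
printf 'g4\t1\t4\t0\n'                  >> "$demo"

core_loci "$demo" 0.75
```

With the toy table and a 0.75 threshold, locusA (4 of 4 genomes) and locusB (3 of 4) are kept, while locusC (1 of 4) is dropped. For the real profile file the threshold would be 0.99.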

Additional citations

The schemas are prepared to be used with chewBBACA. When using the schemas in this repository, please also cite:

Silva M, Machado M, Silva D, Rossi M, Moran-Gilad J, Santos S, Ramirez M, Carriço J. chewBBACA: A complete suite for gene-by-gene schema creation and strain identification. 15/03/2018. M Gen 4(3): doi:10.1099/mgen.0.000166 http://mgen.microbiologyresearch.org/content/journal/mgen/10.1099/mgen.0.000166

The Salmonella enterica schema is derived from the EnteroBase Salmonella wgMLST schema. When using the schema in this repository, please also cite:

Alikhan N-F, Zhou Z, Sergeant MJ, Achtman M (2018) A genomic overview of the population structure of Salmonella. PLoS Genet 14 (4):e1007261. https://doi.org/10.1371/journal.pgen.1007261
Data from: Saccharomyces genome database informs human biology
ckan.grassroots.tools
pdf
Updated Aug 7, 2019
Cite
European Bioinformatics Institute (2019). Saccharomyces genome database informs human biology [Dataset]. https://ckan.grassroots.tools/dataset/a474c44c-efd7-48cc-98b2-fe0f0c209bd5
Available download formats: pdf
Dataset updated
Aug 7, 2019
Dataset provided by
European Bioinformatics Institute http://www.ebi.ac.uk/
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
The Saccharomyces Genome Database (SGD; http://www.yeastgenome.org) is an expertly curated database of literature-derived functional information for the model organism budding yeast, Saccharomyces cerevisiae. SGD constantly strives to synergize new types of experimental data and bioinformatics predictions with existing data, and to organize them into a comprehensive and up-to-date information resource. The primary mission of SGD is to facilitate research into the biology of yeast and to provide this wealth of information to advance, in many ways, research on other organisms, even those as evolutionarily distant as humans. To build such a bridge between biological kingdoms, SGD is curating data regarding yeast-human complementation, in which a human gene can successfully replace the function of a yeast gene, and/or vice versa. These data are manually curated from published literature, made available for download, and incorporated into a variety of analysis tools provided by SGD.
Reanalysis of dataset PXD003791: “MuSt multiomics - Integrated multi-omics...
data.niaid.nih.gov
xml
Updated Jun 17, 2022
Cite
Shengbo Wang; Juan Antonio Vizcaino (2022). Reanalysis of dataset PXD003791: “MuSt multiomics - Integrated multi-omics of the human gut microbiome in a case study of familial type 1 diabetes” [Dataset]. https://data.niaid.nih.gov/resources?id=pxd034617
Available download formats: xml
Dataset updated
Jun 17, 2022
Dataset provided by
European Bioinformatics Institute http://www.ebi.ac.uk/
Authors
Shengbo Wang; Juan Antonio Vizcaino
Variables measured
Proteomics
Description
This project, which contains raw data, intermediate files and results, is a re-analysis of the publicly available PRIDE dataset PXD003791. The RAW files were processed using ThermoRawFileParser, SearchGUI and PeptideShaker with standard settings (see ‘Data Processing Protocol’). This reanalysis work is part of the MetaPUF (MetaProteomics with Unknown Function) project, which is a collaboration between EMBL-EBI and the University of Luxembourg. The dataset was selected under the following conditions: 1. It has been made publicly available in PRIDE and focuses on metaproteomics of the human gut; 2. The corresponding metagenomics assemblies were also available from ENA (European Nucleotide Archive) or MGnify. The processed peptide reports for each sample are available to view at the contig level on the MGnify website. In total, the reanalysis identified 59,613 unique proteins from 36 samples.
IMGT/HLA
scicrunch.org
dknet.org
+2more
Updated Nov 28, 2022
Cite
(2022). IMGT/HLA [Dataset]. http://identifiers.org/RRID:SCR_002971
Unique identifier
https://identifiers.org/RRID:SCR_002971
Dataset updated
Nov 28, 2022
Description
Database for sequences of the human major histocompatibility complex (HLA), including the official sequences for the WHO Nomenclature Committee for Factors of the HLA System. It currently contains 9,310 allele sequences (2013) along with detailed information concerning the material from which each sequence was derived and data on the validation of the sequences. It is established procedure for authors to submit sequences directly to the IMGT/HLA Database for checking and assignment of an official name prior to publication; this avoids the problems associated with renaming published sequences and the confusion of multiple names for the same sequence. The need for reasonably rapid publication of new HLA allele sequences has necessitated an annual meeting of the WHO Nomenclature Committee for Factors of the HLA System. Additionally, they now publish monthly HLA nomenclature updates both in journals and online to provide quick and easy access to new sequence information. The IMGT/HLA Database is part of the international ImMunoGeneTics project. In collaboration with the Imperial Cancer Research Fund (ICRF) and the European Bioinformatics Institute (EBI), they have developed an Oracle database to house the HLA sequences in such a way that users can pose complex queries about the sequences, sequence features, references, contacts and allele designations via a graphical user interface over the web. The IMGT/HLA Database Submission Tool allows direct submission of sequences to the WHO HLA Nomenclature Committee for Factors of the HLA System. The IMGT/HLA Database provides an FTP site for the retrieval of sequences in a number of pre-formatted files.
SMART
ebi.ac.uk
Updated Feb 14, 2020
Cite
(2020). SMART [Dataset]. https://www.ebi.ac.uk/interpro/
Dataset updated
Feb 14, 2020
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
SMART (a Simple Modular Architecture Research Tool) allows the identification and annotation of genetically mobile domains and the analysis of domain architectures. SMART is based at EMBL, Heidelberg, Germany.
Data from: The genomic landscape of ribosomal peptides containing thiazole...
data.niaid.nih.gov
datadryad.org
zip
Updated Sep 11, 2016
Cite
Courtney L. Cox; James R. Doroghazi; Douglas A. Mitchell (2016). The genomic landscape of ribosomal peptides containing thiazole and oxazole heterocycles [Dataset]. http://doi.org/10.5061/dryad.7q830
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.7q830
Dataset updated
Sep 11, 2016
Dataset provided by
University of Illinois Urbana-Champaign
Authors
Courtney L. Cox; James R. Doroghazi; Douglas A. Mitchell
License
https://spdx.org/licenses/CC0-1.0.html
Description
Background: Ribosomally synthesized and post-translationally modified peptides (RiPPs) are a burgeoning class of natural products with diverse activity that share a similar origin and common features in their biosynthetic pathways. The precursor peptides of these natural products are ribosomally produced, upon which a combination of modification enzymes installs diverse functional groups. This genetically encoded peptide-based strategy allows for rapid diversification of these natural products by mutation in the precursor genes merged with unique combinations of modification enzymes. Thiazole/oxazole-modified microcins (TOMMs) are a class of RiPPs defined by the presence of heterocycles derived from cysteine, serine, and threonine residues in the precursor peptide. TOMMs encompass a number of different families, including but not limited to the linear azol(in)e-containing peptides (streptolysin S, microcin B17, and plantazolicin), cyanobactins, thiopeptides, and bottromycins. Although many TOMMs have been explored, the increased availability of genome sequences has illuminated several unexplored TOMM producers. Methods: All YcaO domain-containing proteins (D protein) and the surrounding genomic regions were obtained from the European Molecular Biology Laboratory (EMBL) and the European Bioinformatics Institute (EBI). MultiGeneBlast was used to group gene clusters containing a D protein. A number of techniques were used to identify TOMM biosynthetic gene clusters from the D protein-containing gene clusters. Precursor peptides from these gene clusters were also identified. Both sequence similarity and phylogenetic analysis were used to classify the 20 diverse TOMM clusters identified. Results: Given the remarkable structural and functional diversity displayed by known TOMMs, a comprehensive bioinformatic study to catalog and classify the entire RiPP class was undertaken. 
Here we report the bioinformatic characterization of nearly 1,500 TOMM gene clusters from genomes in the European Molecular Biology Laboratory (EMBL) and the European Bioinformatics Institute (EBI) sequence repository. Genome mining suggests a complex diversification of modification enzymes and precursor peptides to create more than 20 distinct families of TOMMs, nine of which have not heretofore been described. Many of the identified TOMM families have an abundance of diverse precursor peptide sequences as well as unfamiliar combinations of modification enzymes, signifying a potential wealth of novel natural products on known and unknown biosynthetic scaffolds. Phylogenetic analysis suggests a widespread distribution of TOMMs across multiple phyla; however, producers of similar TOMMs are generally found in the same phylum with few exceptions. Conclusions: The comprehensive genome mining study described herein has uncovered a myriad of unique TOMM biosynthetic clusters and provides an atlas to guide future discovery efforts. These biosynthetic gene clusters are predicted to produce diverse final products, and the identification of additional combinations of modification enzymes could expand the potential of combinatorial natural product biosynthesis.
BioSamples RDF
data.wu.ac.at
api/sparql, meta/void +1
Updated Jul 30, 2016
Cite
EMBL European Bioinformatics Institute (2016). BioSamples RDF [Dataset]. https://data.wu.ac.at/odso/datahub_io/OWU5ZjQwN2ItN2E3Mi00OGMyLWE0OTYtZGI1MTkzMjI5YjQw
Available download formats: api/sparql, meta/void, ttl
Dataset updated
Jul 30, 2016
Dataset provided by
European Molecular Biology Laboratory http://www.embl.org/
European Bioinformatics Institute http://www.ebi.ac.uk/
Description
The BioSamples database aggregates sample information for reference samples (e.g. Coriell Cell lines) and samples for which data exist in one of the EBI's assay databases such as ArrayExpress, the European Nucleotide Archive or the PRoteomics IDEntifications database (PRIDE). It provides links to assays and specific samples, and accepts direct submissions of sample information.
The human phosphoproteome map based on PRIDE data
data.niaid.nih.gov
ebi.ac.uk
xml
Updated Feb 13, 2019
Cite
Andrew Jarnuczak; Juan Antonio Vizcaino (2019). The human phosphoproteome map based on PRIDE data [Dataset]. https://data.niaid.nih.gov/resources?id=pxd012174
Available download formats: xml
Dataset updated
Feb 13, 2019
Dataset provided by
European Bioinformatics Institute http://www.ebi.ac.uk/
EBI
Authors
Andrew Jarnuczak; Juan Antonio Vizcaino
Variables measured
Proteomics
Description
This project contains raw data, intermediate files and results used to create the PRIDE human phosphoproteome map. The map is based on joint reanalysis of 110 publicly available human datasets. All relevant datasets were retrieved from the PRIDE database, and after manual curation, only assays that employed dedicated phospho-enrichment sample preparation strategies (e.g. metal oxide affinity chromatography, anti-P-Tyr antibodies, etc.) were included. Raw files were jointly processed with the MaxQuant computational platform using standard settings (see Data Processing Protocol). In total, the joint analysis allowed identification of 252,189 phosphosites at 1% peptide spectrum match false discovery rate (PSM FDR) (MQ search results available in the ‘txt-100PTM’ folder), of which 121,896 passed the additional 1% site localization FDR threshold (MQ search results available in the ‘txt-001PTM’ folder).
CATH-Gene3D
ebi.ac.uk
Updated Oct 21, 2020
Cite
(2020). CATH-Gene3D [Dataset]. https://www.ebi.ac.uk/interpro/
Dataset updated
Oct 21, 2020
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The CATH-Gene3D database describes protein families and domain architectures in complete genomes. Protein families are formed using a Markov clustering algorithm, followed by multi-linkage clustering according to sequence identity. Mapping of predicted structure and sequence domains is undertaken using hidden Markov model libraries representing CATH and Pfam domains. CATH-Gene3D is based at University College London, UK.
Data from: Ensembl Genomes 2018: an integrated omics infrastructure for...
ckan.grassroots.tools
pdf
Updated Aug 7, 2019
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
European Bioinformatics Institute (2019). Ensembl Genomes 2018: an integrated omics infrastructure for non-vertebrate species [Dataset]. https://ckan.grassroots.tools/ar/dataset/25cd369b-485b-48e7-9a2c-c2e168b47f45
Explore at:
pdfAvailable download formats
Dataset updated
Aug 7, 2019
Dataset provided by
European Bioinformatics Institute http://www.ebi.ac.uk/
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Ensembl Genomes (http://www.ensemblgenomes.org) is an integrating resource for genome-scale data from non-vertebrate species, complementing the resources for vertebrate genomics developed in the Ensembl project (http://www.ensembl.org). Together, the two resources provide a consistent set of programmatic and interactive interfaces to a rich range of data including genome sequence, gene models, transcript sequence, genetic variation, and comparative analysis. This paper provides an update to the previous publications about the resource, with a focus on recent developments and expansions. These include the incorporation of almost 20 000 additional genome sequences and over 35 000 tracks of RNA-Seq data, which have been aligned to genomic sequence and made available for visualization. Other advances since 2015 include the release of the database in Resource Description Framework (RDF) format, a large increase in community-derived curation, a new high-performance protein sequence search, additional cross-references, improved annotation of non-protein-coding genes, and the launch of pre-release and archival sites. Collectively, these changes are part of a continuing response to the increasing quantity of publicly-available genome-scale data, and the consequent need to archive, integrate, annotate and disseminate these using automated, scalable methods.
Data from: CWL run of Alignment Workflow (CWLProv 0.6.0 Research Object)
data.niaid.nih.gov
Updated Jan 24, 2020
Cite
Farah Zaib Khan (2020). CWL run of Alignment Workflow (CWLProv 0.6.0 Research Object) [Dataset]. https://data.niaid.nih.gov/resources?id=ZENODO_2632836
Dataset updated
Jan 24, 2020
Dataset provided by
Stian Soiland-Reyes
Farah Zaib Khan
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset folder is a CWLProv Research Object that captures the Common Workflow Language execution provenance, see CWLProv 0.6.0 or use the cwlprov Python tool to explore.

The CWL alignment workflow included in this case study was designed by Data Biosphere. It adapts the alignment pipeline originally developed at the Abecasis Lab, University of Michigan. This workflow is part of the NIH Data Commons initiative and comprises four stages.

The first step, Pre-align, accepts a Compressed Alignment Map (CRAM) file (a compressed format for BAM files developed by the European Bioinformatics Institute (EBI)) and the human genome reference sequence as input. Using SAMtools utilities such as view, sort and fixmate, it returns a list of fastq files that can be used as input for the next step.

The next step, Align, also accepts the human reference genome as input, along with the output files from Pre-align, and uses BWA-MEM to generate aligned reads as BAM files. SAMBLASTER marks duplicate reads, and SAMtools view converts the read files from SAM to BAM format.

The BAM files generated after Align are sorted with SAMtools sort.

Finally, in the Post-align step, these sorted alignment files are merged into a single sorted BAM file using SAMtools merge.
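
The four stages described above can be sketched as a dry run that only prints one illustrative command line per stage. This is a rough sketch, not the workflow's actual tool invocations: the function name plan_alignment, the file names and the exact flags are assumptions chosen for illustration, and the real CWL steps may use different options.

```shell
# Dry run: print one illustrative command line per stage; nothing is executed.
# Arguments (all hypothetical): CRAM file, reference FASTA, output prefix.
plan_alignment() {
  local cram="$1" ref="$2" out="$3"

  # Pre-align: decode the CRAM, name-sort, fix mate fields, write fastq files.
  printf '%s\n' "samtools view -u -T $ref $cram | samtools sort -n - | samtools fixmate - - | samtools fastq -1 ${out}_1.fq -2 ${out}_2.fq -"

  # Align: map reads with BWA-MEM, mark duplicates, convert SAM to BAM.
  printf '%s\n' "bwa mem $ref ${out}_1.fq ${out}_2.fq | samblaster | samtools view -b -o ${out}.bam -"

  # Sort: coordinate-sort the aligned BAM file.
  printf '%s\n' "samtools sort -o ${out}.sorted.bam ${out}.bam"

  # Post-align: merge the sorted BAM files into a single output file.
  printf '%s\n' "samtools merge ${out}.merged.bam <sorted BAM files>"
}
```

Calling `plan_alignment sample.cram reference.fa sample` prints the four command lines in stage order, which makes the hand-off of files between stages easier to follow before running the real workflow.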

Steps to reproduce

This analysis was run on a 16-core Linux cloud instance with 64 GB RAM and Docker pre-installed.

Install gsutil:

export CLOUD_SDK_REPO="cloud-sdk-$(lsb_release -c -s)"

echo "deb http://packages.cloud.google.com/apt $CLOUD_SDK_REPO main" |
sudo tee -a /etc/apt/sources.list.d/google-cloud-sdk.list

curl https://packages.cloud.google.com/apt/doc/apt-key.gpg |
sudo apt-key add -

sudo apt-get update && sudo apt-get install google-cloud-sdk
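
Because the later steps depend on several command-line tools, it may help to confirm they are on the PATH before continuing. The helper below is a minimal sketch; the name check_tools is hypothetical, and the tool list in the usage note is an assumption based on the commands that appear in these steps.

```shell
# Print "missing: NAME" for every command that is not found on PATH.
# Returns 0 when all commands are found and 1 otherwise.
check_tools() {
  local missing=0 tool
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}
```

For example, `check_tools gsutil docker git python2.7 cwltool` lists whichever of those commands still need to be installed.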

Get the data and make the analysis environment ready:

git clone https://github.com/FarahZKhan/topmed-workflows.git
cd topmed-workflows
git checkout cwlprov_testing
cd aligner/sbg-alignment-cwl

The following custom script downloads the Google bucket files listed in a JSON file and creates a local JSON file.

It requires gsutil to be installed.

git clone https://github.com/DailyDreaming/fetch_gs_frm_json.git

Wait for the download to finish; it should fetch ~18 GB.

python2.7 fetch_gs_frm_json/dl_gsfiles_frm_json.py topmed-alignment.sample.json

Run the following commands to create the CWLProv Research Object:

time cwltool --no-match-user --provenance alignmnentwf0.6.0 --tmp-outdir-prefix=/CWLProv_workflow_testing/intermediate_temp/temp --tmpdir-prefix=/CWLProv_workflow_testing/intermediate_temp/temp topmed-alignment.cwl topmed-alignment.sample.json.new

zip -r alignment_0.6.0_linux.zip alignment_0.6.0_linux

sha256sum alignment_0.6.0_linux.zip > alignment_0.6.0_linux.zip.sha25
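
A checksum file written this way can later be used to confirm that the archive was not altered or corrupted in transfer. The sketch below assumes GNU coreutils sha256sum; the function name verify_checksum is hypothetical, and the checksum file is passed as an argument rather than assumed to have any particular name.

```shell
# Verify files against a checksum file produced by `sha256sum FILE > SUMFILE`.
# Argument: path to the checksum file.
verify_checksum() {
  local sum_file="$1"
  # sha256sum -c resolves the recorded file names relative to the current
  # directory, so run it from the directory that holds the checksum file.
  ( cd "$(dirname "$sum_file")" && sha256sum -c "$(basename "$sum_file")" )
}
```

The function returns a non-zero status when any listed file is missing or its digest does not match.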
PRINTS
ebi.ac.uk
Updated Jun 14, 2012
Cite
(2012). PRINTS [Dataset]. https://www.ebi.ac.uk/interpro/
Dataset updated
Jun 14, 2012
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
PRINTS is a compendium of protein fingerprints. A fingerprint is a group of conserved motifs used to characterise a protein family or domain. PRINTS is based at the University of Manchester, UK.
Data from: Ensembl Genomes 2020—enabling non-vertebrate genomic research
ckan.grassroots.tools
pdf
Updated Sep 15, 2022
Cite
European Bioinformatics Institute (2022). Ensembl Genomes 2020—enabling non-vertebrate genomic research [Dataset]. https://ckan.grassroots.tools/dataset/6463ab50-ba71-44b6-85b0-f2aa9e67fef2
Available download formats: pdf
Dataset updated
Sep 15, 2022
Dataset provided by
European Bioinformatics Institute (http://www.ebi.ac.uk/)
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Ensembl Genomes (http://www.ensemblgenomes.org) is an integrating resource for genome-scale data from non-vertebrate species, complementing the resources for vertebrate genomics developed in the context of the Ensembl project (http://www.ensembl.org). Together, the two resources provide a consistent set of interfaces to genomic data across the tree of life, including reference genome sequence, gene models, transcriptional data, genetic variation and comparative analysis. Data may be accessed via our website, online tools platform and programmatic interfaces, with updates made four times per year (in synchrony with Ensembl). Here, we provide an overview of Ensembl Genomes, with a focus on recent developments. These include the continued growth, more robust and reproducible sets of orthologues and paralogues, and enriched views of gene expression and gene function in plants. Finally, we report on our continued deeper integration with the Ensembl project, which forms a key part of our future strategy for dealing with the increasing quantity of available genome-scale data across the tree of life.

