Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
PROSITE is a database of protein families and domains. It consists of biologically significant sites, patterns and profiles that help to reliably identify to which known protein family a new sequence belongs. PROSITE is based at the Swiss Institute of Bioinformatics (SIB), Geneva, Switzerland.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Since 2010, the European Molecular Biology Laboratory's (EMBL) Heidelberg laboratory and the European Bioinformatics Institute (EMBL-EBI) have jointly run bioinformatics training courses developed specifically for secondary school science teachers within Europe and EMBL member states. These courses focus on introducing bioinformatics, databases, and data-intensive biology, allowing participants to explore resources and providing classroom-ready materials to support them in sharing this new knowledge with their students.In this article, we chart our progress made in creating and running three bioinformatics training courses, including how the course resources are received by participants and how these, and bioinformatics in general, are subsequently used in the classroom. We assess the strengths and challenges of our approach, and share what we have learned through our interactions with European science teachers.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The CWL alignment workflow included in this case study is designed by Data Biosphere. It adapts the alignment pipeline originally developed at Abecasis Lab, The University of Michigan. This workflow is part of NIH Data Commons initiative and comprises of four stages. First step, "Pre-align'' accepts a Compressed Alignment Map (CRAM) file (a compressed format for BAM files developed by European Bioinformatics Institute (EBI)) and human genome reference sequence as input and using underlying software utilities of SAMtools such as view, sort and fixmate returns a list of fastq files which can be used as input for the next step. The next step "Align'' also accepts the human reference genome as input along with the output files from "Pre-align'' and uses BWA-mem to generate aligned reads as BAM files. SAMBLASTER is used to mark duplicate reads and SAMtools view to convert read files from SAM to BAM format. The BAM files generated after "Align'' are sorted with "SAMtool sort''. Finally, these sorted alignment files are merged to produce single sorted BAM file using SAMtools merge in "Post-align'' step.
This dataset folder is a CWLProv Research Object that captures the Common Workflow Language execution provenance, see https://w3id.org/cwl/prov/0.6.0 or use https://pypi.org/project/cwlprov/ to explore
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
SFLD (Structure-Function Linkage Database) is a hierarchical classification of enzymes that relates specific sequence-structure features to specific chemical capabilities.
This project contains raw data, intermediate files and results is a re-analysis of the publicly available dataset from the PRIDE dataset PXD005780. The RAW files were processed using ThermoRawFileParser, SearchGUI and PeptideShaker through standard settings (see ‘Data Processing Protocol’). This reanalysis work is part of the MetaPUF (MetaProteomics with Unknown Function) project, which is a collaboration between EMBL-EBI and the University of Luxembourg. The dataset was selected with the following conditions: 1. It has been made publicly available in PRIDE and focuses on metaproteomics of the human gut; 2. The corresponding metagenomics assemblies were also available from ENA (European Nucleotide Archive) or MGnify. The processed peptide reports for each sample are available to view at the contig level on the MGnify website. In total, the reanalysis identified 15,417 unique proteins from 15 samples.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Pancreatic cancer (PaCa) is the seventh most fatal malignancy, with more than 90% mortality rate within the first year of diagnosis. Its treatment can be improved the identification of specific therapeutic targets and their relevant pathways. Therefore, the objective of this study is to identify cancer specific biomarkers, therapeutic targets, and their associated pathways involved in the PaCa progression. RNA-seq and microarray datasets were obtained from public repositories such as the European Bioinformatics Institute (EBI) and Gene Expression Omnibus (GEO) databases. Differential gene expression (DE) analysis of data was performed to identify significant differentially expressed genes (DEGs) in PaCa cells in comparison to the normal cells. Gene co-expression network analysis was performed to identify the modules co-expressed genes, which are strongly associated with PaCa and as well as the identification of hub genes in the modules. The key underlaying pathways were obtained from the enrichment analysis of hub genes and studied in the context of PaCa progression. The significant pathways, hub genes, and their expression profile were validated against The Cancer Genome Atlas (TCGA) data, and key biomarkers and therapeutic targets with hub genes were determined. Important hub genes identified included ITGA1, ITGA2, ITGB1, ITGB3, MET, LAMB1, VEGFA, PTK2, and TGFβ1. Enrichment analysis characterizes the involvement of hub genes in multiple pathways. Important ones that are determined are ECM–receptor interaction and focal adhesion pathways. The interaction of overexpressed surface proteins of these pathways with extracellular molecules initiates multiple signaling cascades including stress fiber and lamellipodia formation, PI3K-Akt, MAPK, JAK/STAT, and Wnt signaling pathways. Identified biomarkers may have a strong influence on the PaCa early stage development and progression. Further, analysis of these pathways and hub genes can help in the identification of putative therapeutic targets and development of effective therapies for PaCa.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This folder contains the datasets used for the pre-training, fine-tuning, and evaluation of HAVEN: Hierarchical Attention for Viral protEin-based host iNference.
HAVEN is a viral protein language model pre-trained on 1.2 million protein sequences belonging to Viridae (viruses). It is fine-tuned to predict the hosts from which the viral protein sequence was sampled ("virus-host").
The viral protein sequences are downloaded from UniRef90. The host for the viral sequences are from the European Nucleotide Archive (ENA) maintained by European Molecular Biology Laboratory-European Bioinformatics Institute (EMBL-EBI).
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset
As reference dataset, 4,307 public available draft or complete genome assemblies and available metadata of Salmonella enterica have been downloaded from public repositories (i.e. EnteroBase, National Center for Biotechnology Information NCBIand The European Bioinformatics Institute EMBL-EBI; accessed April 2017). The collection includes 1,465 S. Enteritidis, 2,442 S.Typhimurium, and 400 of other frequently isolated serovars in Europe. The dataset includes also 153 S.Typhimurium variant 4,[5],12:i:- collected from different Italian regions between 2012 and 2014 during a surveillance study and 129 S. Enteritidis belonging to the INNUENDO sequence dataset (PRJEB27020). The 282 additional genomes were assembled using INNUca v3.1.
File 'Metadata/Senterica_metadata.txt' contains metadata information for each strain including source classification, host taxa, year and country of isolation, serotype, classical pubMLST 7 genes ST classification, and source/method of the assembly.
The directory 'Genomes' contains all the 4,589 assemblies of the strains listed in 'Metadata/Senterica_metadata.txt'. Please note that genomes marked as 'Enterobase' have been downloaded from Enterobase webpage http://enterobase.warwick.ac.uk.
Schema creation and validation
The wgMLST schema from EnteroBase have been downloaded and curated using chewBBACA AutoAlleleCDSCuration for removing all alleles that are not coding sequences (CDS). The quality of the remain loci have been assessed using chewBBACA Schema Evaluation and loci with single alleles, those with high length variability (i.e. if more than 1 allele is outside the mode +/- 0.05 size) and those present in less than 0.5% of the Salmonella genomes in EnteroBase at the date of the analysis (April 2017) have been removed. The wgMLST schema have been further curated, excluding all those loci detected as “Repeated Loci” and loci annotated as “non-informative paralogous hit (NIPH/ NIPHEM)” or “Allele Larger/ Smaller than length mode (ALM/ ASM)” by the chewBBACA Allele Calling engine in more than 1% of a dataset composed by 4,589 Salmonella genomes.
File 'Schemas/Senterica_wgMLST_ 8558_schema.tar.gz' contains the wgMLST schema formatted for chewBBACA and includes a total of 8,558 loci.
File 'Schemas/Senterica_cgMLST_ 3255_listGenes.txt' contains the list of genes from the wgMLST schema which defines the cgMLST schema. The cgMLST schema consists of 3,255 loci and has been defined as the loci present in at least the 99% of the 4,589 Salmonella genomes. Genomes have no more than 2% of missing loci.
File 'Allele_Profles/Senterica_wgMLST_alleleProfiles.tsv' contains the wgMLST allelic profile of the 4,589 Salmonella genomes of the dataset. Please note that missing loci follow the annotation of chewBBACA Allele Calling software.
File 'Allele_Profles/Senterica_cgMLST_alleleProfiles.tsv' contains the cgMLST allelic profile of the 4,589 Salmonella genomes of the dataset. Please note that missing loci are indicated with a zero.
Additional citations
The schema are prepared to be used with chewBBACA. When using the schema in this repository please cite also:
Silva M, Machado M, Silva D, Rossi M, Moran-Gilad J, Santos S, Ramirez M, Carriço J. chewBBACA: A complete suite for gene-by-gene schema creation and strain identification. 15/03/2018. M Gen 4(3): doi:10.1099/mgen.0.000166 http://mgen.microbiologyresearch.org/content/journal/mgen/10.1099/mgen.0.000166
Salmonella enterica schema is a derivation of EnteroBase Salmonella EnteroBase wgMLST schema. When using the schema in this repository please cite also:
Alikhan N-F, Zhou Z, Sergeant MJ, Achtman M (2018) A genomic overview of the population structure of Salmonella. PLoS Genet 14 (4):e1007261. https://doi.org/10.1371/journal.pgen.1007261
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
The Saccharomyces Genome Database (SGD; http://www.yeastgenome.org) is an expertly curated database of literature-derived functional information for the model organism budding yeast, Saccharomyces cerevisiae. SGD constantly strives to synergize new types of experimental data and bioinformatics predictions with existing data, and to organize them into a comprehensive and up-to-date information resource. The primary mission of SGD is to facilitate research into the biology of yeast and to provide this wealth of information to advance, in many ways, research on other organisms, even those as evolutionarily distant as humans. To build such a bridge between biological kingdoms, SGD is curating data regarding yeast-human complementation, in which a human gene can successfully replace the function of a yeast gene, and/or vice versa. These data are manually curated from published literature, made available for download, and incorporated into a variety of analysis tools provided by SGD.
This project contains raw data, intermediate files and results is a re-analysis of the publicly available dataset from the PRIDE dataset PXD003791. The RAW files were processed using ThermoRawFileParser, SearchGUI and PeptideShaker through standard settings (see ‘Data Processing Protocol’). This reanalysis work is part of the MetaPUF (MetaProteomics with Unknown Function) project, which is a collaboration between EMBL-EBI and the University of Luxembourg. The dataset was selected with the following conditions: 1. It has been made publicly available in PRIDE and focuses on metaproteomics of the human gut; 2. The corresponding metagenomics assemblies were also available from ENA (European Nucleotide Archive) or MGnify. The processed peptide reports for each sample are available to view at the contig level on the MGnify website. In total, the reanalysis identified 59,613 unique proteins from 36 samples.
Database for sequences of the human major histocompatibility complex (HLA) and includes the official sequences for the WHO Nomenclature Committee For Factors of the HLA System. It currently contains 9,310 allele sequences (2013) along with detailed information concerning the material from which the sequence was derived and data on the validation of the sequences. It is established procedure for authors to submit the sequences directly to the IMGT/HLA Database for checking and assignment of an official name prior to publication, this avoids the problems associated with renaming published sequences and the confusion of multiple names for the same sequence. The need for reasonably rapid publication of new HLA allele sequences has necessitated an annual meeting of the WHO Nomenclature Committee for Factors of the HLA System. Additionally they now publish monthly HLA nomenclature updates both in journals and online to provide quick and easy access to new sequence information. The IMGT/HLA Database is part of the international ImMunoGeneTics project. In collaboration with the Imperial Cancer Research Fund (ICRF) and European Bioinformatics Institute (EBI) they have developed an Oracle database to house the HLA sequences in such a way as to allow users to present complex queries about the sequence, sequence features, references, contacts and allele designations to the database via a graphical user interface over the web. The IMGT/HLA Database Submission Tool allows direct submission of sequences to the WHO HLA Nomenclature Committee for Factors of the HLA System. The IMGT/HLA Database provides an FTP site for the retrieval of sequences in a number of pre-formatted files.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
SMART (a Simple Modular Architecture Research Tool) allows the identification and annotation of genetically mobile domains and the analysis of domain architectures. SMART is based at EMBL, Heidelberg, Germany.
https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html
Background: Ribosomally synthesized and post-translationally modified peptides (RiPPs) are a burgeoning class of natural products with diverse activity that share a similar origin and common features in their biosynthetic pathways. The precursor peptides of these natural products are ribosomally produced, upon which a combination of modification enzymes installs diverse functional groups. This genetically encoded peptide-based strategy allows for rapid diversification of these natural products by mutation in the precursor genes merged with unique combinations of modification enzymes. Thiazole/oxazole-modified microcins (TOMMs) are a class of RiPPs defined by the presence of heterocycles derived from cysteine, serine, and threonine residues in the precursor peptide. TOMMs encompass a number of different families, including but not limited to the linear azol(in)e-containing peptides (streptolysin S, microcin B17, and plantazolicin), cyanobactins, thiopeptides, and bottromycins. Although many TOMMs have been explored, the increased availability of genome sequences has illuminated several unexplored TOMM producers. Methods: All YcaO domain-containing proteins (D protein) and the surrounding genomic regions were were obtained from the European Molecular Biology Laboratory (EMBL) and the European Bioinformatics Institute (EBI). MultiGeneBlast was used to group gene clusters contain a D protein. A number of techniques were used to identify TOMM biosynthetic gene clusters from the D protein containing gene clusters. Precursor peptides from these gene clusters were also identified. Both sequence similarity and phylogenetic analysis were used to classify the 20 diverse TOMM clusters identified. Results: Given the remarkable structural and functional diversity displayed by known TOMMs, a comprehensive bioinformatic study to catalog and classify the entire RiPP class was undertaken. Here we report the bioinformatic characterization of nearly 1,500 TOMM gene clusters from genomes in the European Molecular Biology Laboratory (EMBL) and the European Bioinformatics Institute (EBI) sequence repository. Genome mining suggests a complex diversification of modification enzymes and precursor peptides to create more than 20 distinct families of TOMMs, nine of which have not heretofore been described. Many of the identified TOMM families have an abundance of diverse precursor peptide sequences as well as unfamiliar combinations of modification enzymes, signifying a potential wealth of novel natural products on known and unknown biosynthetic scaffolds. Phylogenetic analysis suggests a widespread distribution of TOMMs across multiple phyla; however, producers of similar TOMMs are generally found in the same phylum with few exceptions. Conclusions: The comprehensive genome mining study described herein has uncovered a myriad of unique TOMM biosynthetic clusters and provides an atlas to guide future discovery efforts. These biosynthetic gene clusters are predicted to produce diverse final products, and the identification of additional combinations of modification enzymes could expand the potential of combinatorial natural product biosynthesis.
The BioSamples database aggregates sample information for reference samples (e.g. Coriell Cell lines) and samples for which data exist in one of the EBI's assay databases such as ArrayExpress, the European Nucleotide Archive or PRoteomics Identificates DatabasE. It provides links to assays an specific samples, and accepts direct submissions of sample information.
This project contains raw data, intermediate files and results used to create the PRIDE human phosphoproteome map. The map is based on joint reanalysis of 110 publicly available human datasets. All relevant datasets were retrieved from the PRIDE database, and after manual curation, only assays that employed dedicated phospho-enrichment sample preparation strategies (e. g. metal oxide affinity chromatography, anti-P-Tyr antibodies, etc.) were included. Raw files were jointly processed with MaxQuant computational platform using standard settings (see Data Processing Protocol). In total, the joint analysis allowed identification of 252,189 phosphosites at 1% peptide spectrum match false discovery rate (PSM FDR) (MQ search results available in ‘txt-100PTM’ folder), of which 121,896 passed the additional 1% site localization FDR threshold (MQ search results available in ‘txt-001PTM’ folder).
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The CATH-Gene3D database describes protein families and domain architectures in complete genomes. Protein families are formed using a Markov clustering algorithm, followed by multi-linkage clustering according to sequence identity. Mapping of predicted structure and sequence domains is undertaken using hidden Markov models libraries representing CATH and Pfam domains. CATH-Gene3D is based at University College, London, UK.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Ensembl Genomes (http://www.ensemblgenomes.org) is an integrating resource for genome-scale data from non-vertebrate species, complementing the resources for vertebrate genomics developed in the Ensembl project (http://www.ensembl.org). Together, the two resources provide a consistent set of programmatic and interactive interfaces to a rich range of data including genome sequence, gene models, transcript sequence, genetic variation, and comparative analysis. This paper provides an update to the previous publications about the resource, with a focus on recent developments and expansions. These include the incorporation of almost 20 000 additional genome sequences and over 35 000 tracks of RNA-Seq data, which have been aligned to genomic sequence and made available for visualization. Other advances since 2015 include the release of the database in Resource Description Framework (RDF) format, a large increase in community-derived curation, a new high-performance protein sequence search, additional cross-references, improved annotation of non-protein-coding genes, and the launch of pre-release and archival sites. Collectively, these changes are part of a continuing response to the increasing quantity of publicly-available genome-scale data, and the consequent need to archive, integrate, annotate and disseminate these using automated, scalable methods.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset folder is a CWLProv Research Object that captures the Common Workflow Language execution provenance, see CWLProv 0.6.0 or use the cwlprov Python tool to explore.
The CWL alignment workflow included in this case study is designed by Data Biosphere. It adapts the alignment pipeline originally developed at Abecasis Lab, The University of Michigan. This workflow is part of NIH Data Commons initiative and comprises of four stages.
First step, Pre-align, accepts a Compressed Alignment Map (CRAM) file (a compressed format for BAM files developed by European Bioinformatics Institute (EBI)) and human genome reference sequence as input and using underlying software utilities of SAMtools such as view, sort and fixmate returns a list of fastq files which can be used as input for the next step.
The next step Align also accepts the human reference genome as input along with the output files from Pre-align and uses BWA-mem to generate aligned reads as BAM files. SAMBLASTER is used to mark duplicate reads and SAMtools view to convert read files from SAM to BAM format.
The BAM files generated after lign are sorted with SAMtool sort'.
Finally, these sorted alignment files are merged to produce single sorted BAM file using SAMtools merge in Post-align step.
Steps to reproduce
This analysis was run using a 16-core Linux cloud instance with 64GB RAM and pre-installed docker.
Install gsutils
export CLOUD_SDK_REPO="cloud-sdk-$(lsb_release -c -s)"
echo "deb http://packages.cloud.google.com/apt $CLOUD_SDK_REPO main" |
sudo tee -a /etc/apt/sources.list.d/google-cloud-sdk.list
curl https://packages.cloud.google.com/apt/doc/apt-key.gpg |
sudo apt-key add -
sudo apt-get update && sudo apt-get install google-cloud-sdk
Get the data and make the analysis environment ready:
git clone https://github.com/FarahZKhan/topmed-workflows.git cd topmed-workflows git checkout cwlprov_testing cd aligner/sbg-alignment-cwl
git clone https://github.com/DailyDreaming/fetch_gs_frm_json.git
python2.7 fetch_gs_frm_json/dl_gsfiles_frm_json.py topmed-alignment.sample.json
Run the following commands to create the CWLProv Research Object:
time cwltool --no-match-user --provenance alignmnentwf0.6.0 --tmp-outdir-prefix=/CWLProv_workflow_testing/intermediate_temp/temp --tmpdir-prefix=/CWLProv_workflow_testing/intermediate_temp/temp topmed-alignment.cwl topmed-alignment.sample.json.new
zip -r alignment_0.6.0_linux.zip alignment_0.6.0_linux
sha256sum alignment_0.6.0_linux.zip > alignment_0.6.0_linux.zip.sha25
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
PRINTS is a compendium of protein fingerprints. A fingerprint is a group of conserved motifs used to characterise a protein family or domain. PRINTS is based at the University of Manchester, UK.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
jats:titleAbstract/jats:title jats:pEnsembl Genomes (http://www.ensemblgenomes.org) is an integrating resource for genome-scale data from non-vertebrate species, complementing the resources for vertebrate genomics developed in the context of the Ensembl project (http://www.ensembl.org). Together, the two resources provide a consistent set of interfaces to genomic data across the tree of life, including reference genome sequence, gene models, transcriptional data, genetic variation and comparative analysis. Data may be accessed via our website, online tools platform and programmatic interfaces, with updates made four times per year (in synchrony with Ensembl). Here, we provide an overview of Ensembl Genomes, with a focus on recent developments. These include the continued growth, more robust and reproducible sets of orthologues and paralogues, and enriched views of gene expression and gene function in plants. Finally, we report on our continued deeper integration with the Ensembl project, which forms a key part of our future strategy for dealing with the increasing quantity of available genome-scale data across the tree of life./jats:p
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
PROSITE is a database of protein families and domains. It consists of biologically significant sites, patterns and profiles that help to reliably identify to which known protein family a new sequence belongs. PROSITE is based at the Swiss Institute of Bioinformatics (SIB), Geneva, Switzerland.