Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Datasets in fastqsanger.gz format representing re-sequencing of human mitochondria
With NGS technologies, the life sciences face a raw data deluge. Classical analyses of such data often begin with an assembly step, which needs large amounts of computing resources and can remove or modify parts of the biological information contained in the data. Our approach instead focuses directly on biological questions by considering raw, unassembled NGS data through a suite of six command-line tools. Dedicated to assembly-free treatment of whole-genome data, the Colibread tool suite uses optimized algorithms for various analyses of NGS datasets, such as variant calling or read set comparisons. Because they are based on de Bruijn graphs and Bloom filters, such analyses can be performed in a few hours using small amounts of memory. Applications on real data demonstrate the good accuracy of these tools compared to classical approaches. To facilitate data analysis and tool dissemination, we developed Galaxy tools and tool shed repositories. With the Colibread Galaxy tool suite, we enable a broad range of life scientists to analyze raw NGS data. More importantly, our approach retains as much of the biological information in the data as possible while keeping a very low memory footprint.
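To make the idea of combining a de Bruijn graph with a Bloom filter more concrete, the following is a minimal Python sketch, not the Colibread implementation. It assumes a simple SHA-256-based Bloom filter; the class `BloomFilter`, the helpers `kmers` and `successors`, and all parameter values are hypothetical names chosen for illustration. The k-mers of the reads are stored in the filter, and graph edges are recovered on demand by testing the four possible one-base extensions of a k-mer.

```python
import hashlib


class BloomFilter:
    """Bit array with several hash functions: false positives are possible,
    false negatives are not. Sizes here are illustrative assumptions."""

    def __init__(self, num_bits=1 << 16, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive one bit position per hash function from a seeded digest.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def kmers(read, k):
    """All overlapping substrings of length k in a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def successors(kmer, bloom):
    """Candidate de Bruijn neighbours: k-mers sharing a (k-1)-overlap
    with `kmer` that the filter reports as present."""
    return [kmer[1:] + base for base in "ACGT" if kmer[1:] + base in bloom]


# Toy reads, made up for this example.
reads = ["ACGTAC", "CGTACG"]
bloom = BloomFilter()
for read in reads:
    for kmer in kmers(read, 4):
        bloom.add(kmer)
```

Because the graph is never stored explicitly, memory use depends on the size of the bit array rather than on the number of edges, at the cost of occasional false-positive neighbours.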
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The data provided here are part of a Galaxy Training Network tutorial for genome annotation with Maker.
It is based on data used in another Maker tutorial.
The full genome was downloaded from NCBI, and the mitochondrial sequence was removed from it for simplicity.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Reference and custom annotation data expected as input by Galaxy SARS-CoV-2 variation analysis workflows developed by covid19.galaxyproject.org
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Proteogenomics combines large-scale genomic and transcriptomic data with mass-spectrometry-based proteomic data to discover novel protein sequence variants and improve genome annotation. In contrast with conventional proteomic applications, proteogenomic analysis requires a number of additional data processing steps. Ideally, these required steps would be integrated and automated via a single software platform offering accessibility for wet-bench researchers as well as flexibility for user-specific customization and integration of new software tools as they emerge. Toward this end, we have extended the Galaxy bioinformatics framework to facilitate proteogenomic analysis. Using analysis of whole human saliva as an example, we demonstrate Galaxy’s flexibility through the creation of a modular workflow incorporating both established and customized software tools that improve depth and quality of proteogenomic results. Our customized Galaxy-based software includes automated, batch-mode BLASTP searching and a Peptide Sequence Match Evaluator tool, both useful for evaluating the veracity of putative novel peptide identifications. Our complex workflow (approximately 140 steps) can be easily shared using built-in Galaxy functions, enabling its use and customization by others. Our results provide a blueprint for the establishment of the Galaxy framework as an ideal solution for the emerging field of proteogenomics.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Galaxy is an open source, web-based platform for data-intensive biomedical research. It makes bioinformatics applications accessible to users lacking programming skills, enabling them to easily build analysis workflows for NGS data.
The course "Exome analysis using Galaxy" is aimed at PhD students, biologists, clinicians and researchers who are analysing, or will need to analyse in the near future, high-throughput exome sequencing data. The aim of the course is to familiarise participants with the Galaxy platform and prepare them to work independently, using state-of-the-art tools for the analysis of exome sequencing data.
The course will be delivered using a mixture of lectures and computer-based hands-on practical sessions. Lectures will provide an up-to-date overview of the strategies for the analysis of next-generation exome sequencing experiments, starting from the raw sequence data. Analyses include sequence quality control, alignment to a reference genome, refinement of aligned sequences, variant calling, annotation and interpretation, and tools for visual inspection of results. Participants will apply the knowledge gained during the course to the analysis of real Illumina exome datasets, and implement workflows to reproduce the complete analysis. After the course, participants will be able to create pipelines for their individual analyses.
These are the datasets needed for this course.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The table summarizes the report generated by Metavisitor from a batch of 40 sequence datasets (S14 File). Metadata associated with each indicated sequence dataset as well as the ability of Metavisitor to detect HIV in datasets and patients are indicated.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The data provided here are part of a Galaxy Training Network tutorial for genome annotation with funannotate.
Genome was assembled following the GTN Flye assembly tutorial, then masked with RepeatMasker.
RNASeq data: SRR8534859 reads were mapped to the genome using STAR (toolshed.g2.bx.psu.edu/repos/iuc/rgrnastar/rna_star/2.7.8a+galaxy0), then the bam was downsampled (10% with toolshed.g2.bx.psu.edu/repos/devteam/picard/picard_DownsampleSam/2.18.2.1) to reduce the size of the dataset. Fastq files were then extracted from the resulting bam file (toolshed.g2.bx.psu.edu/repos/devteam/picard/picard_SamToFastq/2.18.2.1).
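As an illustration of the downsampling step, here is a minimal Python sketch of keeping roughly 10% of read pairs. It is not Picard's algorithm; it only assumes that the keep/drop decision is derived from a hash of the read name, so that both mates of a pair share the same fate. The function names `keep_read` and `downsample`, the seed, and the record layout are hypothetical.

```python
import hashlib


def keep_read(read_name, fraction=0.10, seed=42):
    """Deterministically decide whether to keep a read, based on its name.

    Hashing the name (rather than drawing a random number per record)
    means both mates of a pair receive the same decision."""
    digest = hashlib.md5(f"{seed}:{read_name}".encode()).digest()
    value = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return value < fraction


def downsample(records, fraction=0.10, seed=42):
    """Filter (name, mate, sequence) tuples; the layout is an assumption."""
    return [rec for rec in records if keep_read(rec[0], fraction, seed)]


# Toy paired records, made up for this example.
records = []
for i in range(10000):
    records.append((f"read{i}", 1, "ACGT"))
    records.append((f"read{i}", 2, "TGCA"))
subset = downsample(records)
```

The retained fraction is only approximately 10%, since each read name is kept or dropped independently.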
SwissProt_subset.fasta is a subset of SwissProt proteins that are known to have some similarity with the genome (found using Diamond against the genome, then extracting sequences matching with e-value < 0.0001).
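The e-value filtering described above could be sketched as follows. This is a hedged example, not the exact procedure used to build the file: it assumes Diamond was run with its default BLAST-style tabular output (12 tab-separated columns, e-value in the 11th), and which column holds the protein identifier depends on whether the proteins were the query or the subject. The function names and the toy hits are hypothetical.

```python
def ids_below_evalue(tabular_lines, max_evalue=1e-4, id_column=1, evalue_column=10):
    """Collect sequence IDs from tabular hits whose e-value is below the cutoff.

    Column indices are 0-based; the defaults assume the protein is the
    subject (second column) of a 12-column BLAST-style table."""
    selected = set()
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if float(fields[evalue_column]) < max_evalue:
            selected.add(fields[id_column])
    return selected


def subset_fasta(fasta_lines, wanted_ids):
    """Keep only FASTA records whose ID (first token after '>') is wanted."""
    kept, keep = [], False
    for line in fasta_lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            keep = bool(tokens) and tokens[0] in wanted_ids
        if keep:
            kept.append(line.rstrip("\n"))
    return kept


# Toy data, made up for this example.
hits = [
    "contig1\tsp|P1|A\t90.0\t100\t10\t0\t1\t300\t1\t100\t1e-50\t200",
    "contig1\tsp|P2|B\t30.0\t50\t35\t0\t1\t150\t1\t50\t0.5\t30",
]
fasta = [">sp|P1|A example protein", "MKV", ">sp|P2|B other protein", "MAA"]
wanted = ids_below_evalue(hits)
subset = subset_fasta(fasta, wanted)
```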
We have developed a multi-step viral genome assembly pipeline named VirAmp that combines existing tools and techniques and presents them to end users via a web-enabled Galaxy interface. Our pipeline allows users to assemble, analyze and interpret high-coverage viral sequencing data with an ease and efficiency that was not possible previously. Our software makes a large number of genome assembly and related tools available to life scientists and automates the currently recommended best practices into a single, easy-to-use interface. We tested our pipeline with three different datasets from human herpes simplex virus (HSV). VirAmp provides a user-friendly interface and a complete pipeline for viral genome analysis. We make our software available via an Amazon Elastic Cloud disk image that can be easily launched by anyone with an Amazon Web Services account. A demonstration version of our system can be found at http://viramp.com. We also maintain detailed documentation on each tool and methodology at http://docs.viramp.com. Here in GigaDB you will find an archived version of the tools as they were published.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This replication package contains all materials used to evaluate how Large Language Models (LLMs) support scientific workflow development in Galaxy and Nextflow. It includes the full set of prompts, LLM responses, and generated workflows analyzed in the study. The package provides six PDF files. The first two cover (1) LLMs’ understanding of fundamental scientific workflow and workflow-system concepts and (2) their domain knowledge of the Galaxy and Nextflow platforms, including architecture, key features, and reproducibility mechanisms. Two further files contain workflow-specific background questions for both systems, covering domain tasks such as SNP-rich exon detection, peak-to-gene association, methylation analysis, and QC pipelines.
The package further provides the complete workflows generated by GPT-4o, Gemini 2.5 Flash, and DeepSeek-V3 for a set of benchmark tasks, detailing tool selections, execution steps, file transformations, and workflow structure. Together, these artifacts enable full transparency and reproducibility of our multi-dimensional assessment of LLMs’ conceptual reasoning, domain understanding, and workflow-generation capabilities across two major scientific workflow systems.
The first two files provide foundational insights. The first file, Table-2 Fundamental_Concepts_Of_Scientific_Workflow_and_SWS, includes LLM-generated responses to conceptual questions about scientific workflows and workflow systems, evaluating the understanding of GPT-4o, Gemini 2.5 Flash, and DeepSeek-V3. The second file, Table-3 LLMs Understanding of Galaxy and Nextflow, further explores LLMs’ domain-specific knowledge by addressing background questions about the Galaxy and Nextflow platforms, including their architecture, tools, reproducibility, and key features such as Galaxy’s ToolShed or Nextflow’s DSL concepts and nf-core integration.
The next two files, Table-4 and Table-5, contain workflow-specific background questions designed to assess LLM comprehension of domain-specific tasks within Galaxy and Nextflow, respectively. These include tasks such as identifying SNP-rich exons, associating peaks with genes, or understanding methylation data processing. The final two files, LLMs Generated workflows using Galaxy Workflow System and LLMs generated workflows using Nextflow Workflow System, showcase the actual workflows generated by LLMs in response to structured prompts. Each file presents detailed, step-by-step workflows for different tasks, comparing how each LLM structures, sequences, and explains the analyses using real-world tools and formats (e.g., FastQC, BEDTools, MultiQC). Together, these documents form a multi-dimensional assessment of LLMs’ capability in generating, reasoning about, and structuring scientific workflows.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The data provided here are part of a Galaxy Training Network tutorial for genome annotation with funannotate.
Genome was assembled following the GTN Flye assembly tutorial, then masked with RepeatMasker.
RNASeq data: SRR12951075 and SRR8534859 reads were mapped to the genome using STAR (toolshed.g2.bx.psu.edu/repos/iuc/rgrnastar/rna_star/2.7.8a+galaxy0), then the BAM files were merged (toolshed.g2.bx.psu.edu/repos/devteam/picard/picard_MergeSamFiles/2.18.2.1) and downsampled (10% with toolshed.g2.bx.psu.edu/repos/devteam/picard/picard_DownsampleSam/2.18.2.1) to reduce the size of the dataset. FASTQ files were then extracted from the resulting BAM file (toolshed.g2.bx.psu.edu/repos/devteam/picard/picard_SamToFastq/2.18.2.1).
SwissProt_subset.fasta is a subset of SwissProt proteins that are known to have some similarity with the genome (found using Diamond against the genome, then extracting sequences matching with e-value < 0.0001).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This upload supports a Galaxy tutorial on the Lumpy Skin Disease virus (LSDV) genome prepared by the Defend2020 project. To access the data deposited in GenBank and the Sequence Read Archive (SRA), please refer to the following deposits.
The LSDV isolate Kubash/KAZ/16 sequence has been deposited in GenBank under accession number MN642592, and raw data have been submitted to the SRA under BioProject number PRJNA587601.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data needed for the 'Long non-coding RNAs (lncRNAs) annotation with FEELnc' tutorial (Galaxy Training Material). The assembly was generated following the 'Genome assembly using PacBio data' tutorial. The annotation was generated following the 'Genome annotation with Funannotate' tutorial.
The BAM file contains the RNASeq reads from SRR8534859_1.fastq.gz and SRR8534859_2.fastq.gz mapped to the genome assembly.
Metaproteomics characterizes proteins expressed by microorganism communities (microbiome) present in environmental samples or a host organism (e.g. human), revealing insights into the molecular functions conferred by these communities. Compared to conventional proteomics, metaproteomics presents unique data analysis challenges, including the use of large protein databases derived from hundreds of organisms, as well as numerous processing steps to ensure data quality. This data analysis complexity limits the use of metaproteomics for many researchers. In response, we have developed an accessible and flexible metaproteomics workflow within the Galaxy bioinformatics framework. Via analysis of human oral tissue exudate samples, we have established a modular Galaxy-based workflow that automates a reduction method for searching large sequence databases, enabling comprehensive identification of host proteins (human) as well as meta-proteins from the non-host organisms. Downstream, automated processing steps enable BLASTP analysis and evaluation/visualization of peptide sequence match quality, maximizing confidence in results. The output is compatible with tools for taxonomic and functional characterization (e.g. Unipept, MEGAN5). Galaxy also allows for the sharing of complete workflows with others, promoting reproducibility and providing a template for further modification and improvement. Our results provide a blueprint for establishing Galaxy as a solution for metaproteomic data analysis.
Analyzing high-throughput DNA sequence data is a fundamental skill in modern biology. However, real and perceived barriers such as massive file sizes, substantial computational requirements, and lack of instructor background knowledge can discourage faculty from incorporating high-throughput sequence data into their courses. We developed a straightforward and detailed tutorial that guides students through the analysis of RNA sequencing (RNA-seq) data using Galaxy, a public web-based bioinformatics platform. The tutorial stretches over three laboratory periods (~8 hours) and is appropriate for undergraduate molecular biology and genetics courses. Sequence files are imported into a student's Galaxy user account directly from the National Center for Biotechnology Information Sequence Read Archive (NCBI SRA), eliminating the need for on-site file storage. Using Galaxy's graphical user interface and a defined set of analysis tools, students perform sequence quality assessment and trimming, map individual sequence reads to a genome, generate a counts table, and carry out differential gene expression analysis. All of these steps are performed "in the cloud," using offsite computational infrastructure. The provided tutorial utilizes RNA-seq data from a published study focused on nematode infection of Arabidopsis thaliana. Based on their analysis of the data, students are challenged to develop new hypotheses about how plants respond to nematode parasitism. However, the workflow is flexible and can accommodate alternative data sets from NCBI SRA or the instructor. Overall, this resource provides a simple introduction to the analysis of "big data" in the undergraduate classroom, with limited prior background and infrastructure required for successful implementation.
The input data of this tutorial is from an RNA-seq experiment looking for differentially expressed genes in D. melanogaster (fruit fly) between two experimental conditions. Please use the ‘fastqsanger’ File Format.
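For readers unfamiliar with the format, the sketch below shows how quality strings in Sanger-encoded FASTQ can be decoded: each quality character maps to a Phred score of its ASCII code minus 33. The reader assumes simple four-line records and gzip-compressed input (as in fastqsanger.gz); the function names and the toy record are hypothetical and not part of any Galaxy API.

```python
import gzip
import io


def phred_scores(quality_string):
    """Decode Sanger (Phred+33) quality characters into integer scores."""
    return [ord(ch) - 33 for ch in quality_string]


def read_fastq(handle):
    """Yield (name, sequence, scores) from a text handle.

    Assumes four lines per record, with no wrapped sequence lines."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line
        quality = handle.readline().rstrip("\n")
        yield header[1:], sequence, phred_scores(quality)


# Toy gzip-compressed record held in memory, made up for this example.
compressed = gzip.compress(b"@r1\nACGT\n+\nIIII\n")
with gzip.open(io.BytesIO(compressed), "rt") as handle:
    parsed = list(read_fastq(handle))
```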
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
ReadMe. This file gives instructions concerning the prerequisites and the installation of sRNAPipe. (TXT 3 kb)
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Implementation of genomic variant calling from NGS data as installable Galaxy workflows. The repository contains two separate sets of simulated Ebola test data: one for SNP and INDEL calling and another for structural variant calling.
This record includes training materials associated with the Australian BioCommons webinar ‘Here’s one we prepared earlier: (re)creating bioinformatics methods and workflows with Galaxy Australia’. This webinar took place on 26 October 2022.
Event description
Have you discovered a brilliant bioinformatics workflow but you’re not quite sure how to use it? In this webinar we will introduce the power of Galaxy for construction and (re)use of reproducible workflows, whether building workflows from scratch, recreating them from published descriptions and/or extracting from Galaxy histories. Using an established bioinformatics method, we’ll show you how to:
Use the workflows creator in Galaxy Australia
Build a workflow based on a published method
Annotate workflows so that you (and others) can understand them
Make workflows findable and citable (important and very easy to do!)
Materials are shared under a Creative Commons Attribution 4.0 International agreement unless otherwise specified and were current at the time of the event.
Files and materials included in this record:
Event metadata (PDF): Information about the event including description, event URL, learning objectives, prerequisites, technical requirements etc.
Index of training materials (PDF): List and description of all materials associated with this event including the name, format, location and a brief description of each file.
GalaxyWorkflows_Slides (PDF): A PDF copy of the slides presented during the webinar.
Materials shared elsewhere:
A recording of this webinar is available on the Australian BioCommons YouTube Channel: https://youtu.be/IMkl6p7hkho
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The data provided here are part of a Galaxy Training Network tutorial for manual curation of eukaryotic genome annotation using Apollo.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Datasets in fastqsanger.gz format representing re-sequencing of human mitochondria