Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The top row includes problems about RNA secondary structure prediction and the middle row includes problems about alignment of biological sequences. Note that the estimators in the same column correspond to each other.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Datasets, conda environments and software for the course "Population Genomics" taught by Prof. Kasper Munch. This course material is maintained by the Health Data Science Sandbox. This webpage shows the latest version of the course material.
The data is connected to the following repository: https://github.com/hds-sandbox/Popgen_course_aarhus. The original course material from Prof Kasper Munch is at https://github.com/kaspermunch/PopulationGenomicsCourse.
Description
After the course, participants will have detailed knowledge of the methods and applications required to perform a typical population genomic study.
The participants must at the end of the course be able to:
The course introduces key concepts in population genomics, from the generation of population genetic data sets to the most common population genetic analyses and association studies. The first part of the course focuses on the generation of population genetic data sets. The second part introduces the most common population genetic analyses and their theoretical background. Topics here include analysis of demography, population structure, recombination and selection. The last part of the course focuses on applications of population genetic data sets for association studies in relation to human health.
Curriculum
The curriculum for each week is listed below. "Coop" refers to a set of lecture notes by Graham Coop that we will use throughout the course.
Course plan
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Accompanying note for supplementary materials
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Microsatellites, also known as SSRs or STRs, are polymorphic DNA regions with tandem repetitions of a nucleotide motif of 1–6 base pairs. They have a broad range of applications in many fields, such as comparative genomics, molecular biology, and forensics. However, the majority of researchers do not have computational training and struggle while running command-line tools or very limited web tools for their SSR research, spending a considerable amount of time learning how to execute the software and conducting the post-processing data tabulation in other tools or manually, time that could be used directly in data analysis. We present EasySSR, a user-friendly web tool with full command-line functionality, designed for practical use in batch identifying and comparing SSRs in sequences, draft genomes, or complete genomes, and requiring no previous bioinformatic skills to run. EasySSR requires only a FASTA file and an optional GenBank file of one or more genomes to identify and compare SSRs. The tool can automatically analyze and compare SSRs in whole genomes, convert GenBank to PTT files, identify perfect and imperfect SSRs and coding and non-coding regions, and compare their frequencies, abundance, motifs, flanking sequences, and iterations. It produces many outputs ready for download, such as PTT files, interactive charts, and Excel tables, giving the user data ready for further analysis in minutes. EasySSR was implemented as a web application, which can be executed from any browser and is available for free at https://computationalbiology.ufpa.br/easyssr/. Tutorials, usage notes, and download links to the source code can be found at https://github.com/engbiopct/EasySSR.
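To make the definition above concrete, here is a minimal sketch of how perfect SSRs (tandem repeats of a 1–6 bp motif) could be located in a sequence with a regular expression. This is not EasySSR's implementation, which the description does not detail. The function name `find_perfect_ssrs` and the minimum repeat count of 3 are assumptions chosen for illustration, and imperfect SSRs are not handled.

```python
import re


def find_perfect_ssrs(seq, min_repeats=3, max_motif=6):
    """Return (start, motif, repeat_count) for each perfect tandem repeat.

    Illustrative only: min_repeats is a hypothetical threshold, and real
    tools usually apply different thresholds per motif length.
    """
    seq = seq.upper()
    # The lazy quantifier tries the shortest motif first, so "AAAAAA" is
    # reported as (A)6 rather than (AA)3.
    pattern = re.compile(
        r"([ACGT]{1,%d}?)\1{%d,}" % (max_motif, min_repeats - 1)
    )
    hits = []
    for match in pattern.finditer(seq):
        motif = match.group(1)
        repeat_count = len(match.group(0)) // len(motif)
        hits.append((match.start(), motif, repeat_count))
    return hits


if __name__ == "__main__":
    for start, motif, count in find_perfect_ssrs("GGCACACACACTT"):
        print(f"position {start}: ({motif}){count}")
```

Positions are 0-based offsets into the input string. A FASTA file would first need to be read and its header lines removed before the sequence is passed in.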
Inference of species trees plays a crucial role in advancing our understanding of evolutionary relationships and has immense significance for diverse biological and medical applications. Extensive genome sequencing efforts are currently in progress across a broad spectrum of life forms, holding the potential to unravel the intricate branching patterns within the tree of life. However, estimating species trees starting from raw genome sequences is quite challenging, and the current cutting-edge methodologies require a series of error-prone steps that are neither entirely automated nor standardized. ROADIES leverages recent advances in phylogenetic inference to allow multi-copy genes, eliminating the need to detect orthology. ROADIES also employs a paradigm-shifting strategy that randomly selects segments of the input genomes to generate gene trees. This eliminates the need to pre-select loci such as functional genes using cumbersome annotation steps, align whole genomes, or choose a sing...
Accurate, scalable, and fully automated inference of species trees from raw genome assemblies using ROADIES
https://doi.org/10.5061/dryad.tht76hf73
ROADIES is a novel pipeline designed for phylogenetic tree inference of the species directly from their raw genomic assemblies.
For details on how to run ROADIES, please refer to our Wiki: https://turakhia.ucsd.edu/ROADIES/
This repository contains the output files generated by ROADIES (v0.1.0) (https://github.com/TurakhiaLab/ROADIES/releases/tag/v0.1.0) for estimating the species tree for the following datasets (in accurate mode of operation):
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Subjective data models dataset
This dataset is comprised of data collected from study participants, for a study into how people working with biological data perceive data, and whether or not this perception of data aligns with a person's experiential and educational background. We call the concept of what data looks like to an individual a "subjective data model".
Todo: link paper/preprint once published.
Computational python analysis code: https://doi.org/10.5281/zenodo.7022789 and https://github.com/yochannah/subjective-data-models-analysis
Files
Transcripts of the recorded sessions are attached and have been verified by a second researcher. These files are all in plain text .txt format. Note that participant 3 did not agree to sharing the transcript of their interview.
Interview paper files
This folder has digital and photographed versions of the files shown to the participants for the file mapping task. Note that the original files are from the NCBI and from FlyBase.
Videos and stills from the recordings have been deleted in line with the Data Management Plan and Ethical Review.
anonymous_participant_list.csv
shows which files have associated transcripts (not all participants agreed to share transcripts), the order of Tasks A and B, the date of interview, and which entities participants added to the set provided (if any). See the paper methods for more information about why entities were added to the set.
cards.txt
is a full list of the cards presented in the tasks.
background survey
and background manual annotations
are the selected survey data about participant background, with manual additions where necessary, e.g. to interpret free text.
codes.csv
shows the qualitative codes used within the transcripts.
entry_point.csv
is a record of participants' identified entry points into the data.
file_mapping_responses
shows a record of responses to the file mapping task.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Each of the tar.gz compressed directories corresponds to prepTG databases (for the zol suite) featuring distinct, representative genomes for one of the six genera containing ESKAPE pathogens. Representative genomes for each genus/taxon were selected using skDER v1.0.7 in greedy mode with 99% ANI and 90% AF cutoffs.
The compressed folders also contain an extra file, corresponding to a species tree of the representative genomes constructed using GToTree with Universal markers (ribosomal proteins) from Hug et al. 2016 and in best-hits mode. Note, GToTree was modified to always use -super5 mode for SCG alignments for computational efficiency. Also, note, because genomes can be dropped by GToTree prior to phylogeny inference (e.g. if they lack enough SCGs), not all genomes in the database might be represented in the phylogenies.
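The representative-genome selection described above (greedy mode with 99% ANI and 90% AF cutoffs) can be illustrated with a generic greedy dereplication sketch. This is not skDER's actual algorithm or code. The function name, the priority ordering of the input list, and the shape of the `similarity` dictionary are all assumptions made for illustration.

```python
def greedy_representatives(genomes, similarity, ani_cutoff=99.0, af_cutoff=90.0):
    """Pick representatives greedily from a priority-ordered genome list.

    genomes: genome IDs, highest priority first (the ordering criterion
        is hypothetical here).
    similarity: dict mapping (representative, other) to a tuple
        (ANI percent, aligned fraction percent); missing pairs count as
        dissimilar.
    """
    representatives = []
    covered = set()
    for genome in genomes:
        if genome in covered:
            continue  # already represented by an earlier pick
        representatives.append(genome)
        covered.add(genome)
        for other in genomes:
            if other in covered:
                continue
            ani, af = similarity.get((genome, other), (0.0, 0.0))
            # A genome is considered redundant only if both cutoffs are met.
            if ani >= ani_cutoff and af >= af_cutoff:
                covered.add(other)
    return representatives


if __name__ == "__main__":
    toy_similarity = {
        ("g1", "g2"): (99.5, 95.0),
        ("g1", "g3"): (97.0, 92.0),
        ("g2", "g3"): (99.9, 99.0),
    }
    print(greedy_representatives(["g1", "g2", "g3"], toy_similarity))
```

In the toy input, g2 is absorbed by g1, while g3 falls below the ANI cutoff relative to g1 and is kept as a second representative.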
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Projects in chemo- and bioinformatics often consist of scattered data of various types that are difficult to access in a meaningful way for efficient data analysis. Data is usually too diverse to even be manipulated effectively. Sdfconf is data manipulation and analysis software that addresses this problem in a logical and robust manner. Other software commonly used for such tasks is either not designed with molecular and/or conformational data in mind or provides only a narrow set of tasks to be accomplished. Furthermore, many tools are only available within commercial software packages. Sdfconf is a flexible, robust, and free-of-charge tool for linking data from various sources for meaningful and efficient manipulation and analysis of molecule data sets. Sdfconf packages molecular structures and metadata into a complete ensemble, from which one can access both the whole data set and individual molecules and/or conformations. In this software note, we offer some practical examples of the utilization of sdfconf.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
On this Zenodo link, we share the data that is required to reproduce all the analyses from our publication "satuRn: Scalable Analysis of differential Transcript Usage for bulk and single-cell RNA-sequencing applications".
This repository includes input transcript-level expression matrices and metadata for all datasets, as well as intermediate results and final outputs of the respective DTU analyses. For a more elaborate description of the data, we refer to the companion GitHub repository for our publication: https://github.com/statOmics/satuRnPaper. Note that this is version 1.0.0 of the data (uploaded on 2021-01-14). If any changes are made to the datasets in the future, they will also be communicated on our companion GitHub page.
https://www.scilifelab.se/data/restricted-access/
Dataset description
The data consist of CRAM files from capture-based gene panel sequencing (Twist Bioscience) of 252 genes selected for their relevance in lymphoid malignancies. The panel also included genome-wide backbone probes for copy-number analysis. The prepared libraries were subsequently sequenced in paired-end mode (2x150 bp) on the Illumina NovaSeq 6000 (Illumina Inc.). BALSAMIC was used to analyze the FASTQ files and align them to the reference genome. Trimmed reads were mapped to the reference genome hg19 using BWA MEM v0.7.15. The resulting SAM files were converted to BAM files and sorted using samtools v1.6. Duplicate reads were marked using Picard tools MarkDuplicates v2.17.0, and the resulting BAM files were finally converted to CRAM files using samtools v1.6.
Note: CRAM is a sequencing read file format that is highly space efficient by using reference-based compression of sequence data and offers both lossless and lossy modes of compression: https://www.ebi.ac.uk/ena/cram/
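The processing steps described above (BWA MEM mapping, sorting with samtools, duplicate marking with Picard, conversion to CRAM) can be sketched as a sequence of command lines. The sketch below only builds the commands and does not run them. It is not the BALSAMIC workflow itself: the exact options BALSAMIC used are not given here, the read-trimming step is omitted, and all file names and the `picard.jar` path are hypothetical.

```python
def build_pipeline_commands(sample, reference, read1, read2, picard_jar="picard.jar"):
    """Return (argv, stdout_path) pairs for the four processing steps.

    stdout_path is None unless the tool writes its result to standard
    output. All output file names are hypothetical.
    """
    sam = f"{sample}.sam"
    sorted_bam = f"{sample}.sorted.bam"
    marked_bam = f"{sample}.markdup.bam"
    metrics = f"{sample}.markdup_metrics.txt"
    cram = f"{sample}.cram"
    return [
        # 1. Map paired-end reads; bwa mem writes SAM to standard output.
        (["bwa", "mem", reference, read1, read2], sam),
        # 2. Sort alignments and write BAM.
        (["samtools", "sort", "-o", sorted_bam, sam], None),
        # 3. Mark duplicate reads (legacy Picard KEY=VALUE argument syntax).
        (["java", "-jar", picard_jar, "MarkDuplicates",
          f"I={sorted_bam}", f"O={marked_bam}", f"M={metrics}"], None),
        # 4. Convert BAM to CRAM using reference-based compression.
        (["samtools", "view", "-C", "-T", reference, "-o", cram, marked_bam], None),
    ]


if __name__ == "__main__":
    steps = build_pipeline_commands(
        "sample1", "hg19.fa", "sample1_R1.fastq.gz", "sample1_R2.fastq.gz"
    )
    for argv, stdout_path in steps:
        line = " ".join(argv)
        if stdout_path:
            line += f" > {stdout_path}"
        print(line)
```

Because CRAM compression is reference-based, the same reference FASTA passed to `-T` is needed again when the CRAM files are decoded.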
Data Access Statement
The data is under restricted access and can be accessed upon request through the email address below. The targeted sequence datasets are only to be used for research aimed at advancing the understanding of genetic factors in chronic lymphocytic leukemia. Applications aimed at method development, including bioinformatics, will not be considered acceptable uses of this dataset.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Each of the tar.gz compressed directories corresponds to prepTG databases (for the zol suite) featuring distinct, representative genomes for one of the ten well studied taxa/genera. Representative genomes for each genus/taxon were selected using skDER (https://zenodo.org/deposit/8273156).
The compressed folders also contain an extra file, corresponding to a species tree of the representative genomes constructed using GToTree. Note, GToTree was modified to always use -super5 mode for SCG alignments for computational efficiency, but otherwise run with default parameters. Also, note, because genomes can be dropped by GToTree prior to phylogeny inference (e.g. if they lack enough SCGs), not all genomes in the database might be represented in the phylogenies. The phylum level SCGs from GToTree corresponding to each taxon were used (e.g. Actinobacteria SCGs were used for Cutibacterium).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains replication data for the paper titled "Geometric Transformers for Protein Interface Contact Prediction". The dataset consists of pickled Python dictionaries containing pairs of DGLGraphs that can be used to train and validate protein interface contact prediction models. It also contains our best model checkpoints saved as PyTorch LightningModules. Our GitHub repository, DeepInteract, linked in the "Additional notes" metadata section below provides more details on how we use these files as examples for cross-validation.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
HAMAP stands for High-quality Automated and Manual Annotation of Proteins. HAMAP profiles are manually created by expert curators. They identify proteins that are part of well-conserved protein families or subfamilies. HAMAP is based at the SIB Swiss Institute of Bioinformatics, Geneva, Switzerland.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Phylogenetic re-analyses of Insulin-like Growth Factor Binding Proteins (IGFBPs) based on amino acid sequences. The sequences and alignment described in Ocampo Daza et al. (2011) Endocrinology 152(6):2278-89 (link below) were used to analyze additional IGFBP sequences identified in the genome databases of Anolis carolinensis (anole lizard), Latimeria chalumnae (coelacanth) and Lepisosteus oculatus (spotted gar). Phylogenetic trees were made using neighbor joining (NJ) and phylogenetic maximum likelihood (PhyML) methods, both supported by bootstrap analyses (details below). Figures (PDF files) of the finished trees are included in the files IGFBP_NJ_figure.pdf and IGFBP_PhyML_figure.pdf. Branch colors are based on chromosomal locations and follow the trees published in Ocampo Daza et al. (2011) (link below).
Species abbreviations
Homo sapiens (Hsa, human), Mus musculus (Mmu, mouse), Canis familiaris (Cfa, dog), Monodelphis domestica (Mdo, opossum), Gallus gallus (Gga, chicken), Taeniopygia guttata (Tgu, zebra finch), Anolis carolinensis (Aca, anole lizard), Latimeria chalumnae (Lch, coelacanth), Lepisosteus oculatus (Loc, spotted gar), Danio rerio (Dre, zebrafish), Oryzias latipes (Ola, medaka), Gasterosteus aculeatus (Gac, stickleback), Tetraodon nigroviridis (Tni, green-spotted pufferfish), Takifugu rubripes (Tru, fugu), Ciona intestinalis (Cin, vase tunicate), Ciona savignyi (Csa, Pacific transparent tunicate) and Branchiostoma floridae (Bfl, Florida lancelet).
Sequences used
Detailed information about all sequences that were used is included in the file Sequence_info_Tab1.xlsx (MS Excel spreadsheet). This includes database identifiers and chromosome/linkage group locations as well as notes on the manual curation/annotation of the sequences.
Alignment
The full amino acid sequence alignment used for the phylogenetic analyses is included in an interleaved format (.aln) and a sequential format (.fasta) in the files IGFBP_alignment_interleaved.aln and IGFBP_alignment_sequential.fasta. The alignment was made using the ClustalW algorithm and edited manually as described in Ocampo Daza et al. (2011) Endocrinology 152(6):2278-89 (link below). Anole lizard, coelacanth and spotted gar sequences marked with asterisks are fragments and do not span the full length of the alignment (details in the file Sequence_info_Tab1.xlsx).
Phylogenetic analysis, NJ method
The Neighbor Joining tree was made in ClustalX 2.0, with settings as described in Ocampo Daza et al. (2011) (link below). The tree is supported by a bootstrap analysis with 1000 bootstrap replicates. The raw output is included in the file IGFBP_NJ.txt and the final tree, rooted with the lancelet IGFBP sequence, is included in the file IGFBP_NJ_rooted.phb. Both files are in the Newick/Phylip data format.
Phylogenetic trees, PhyML method
The Phylogenetic Maximum Likelihood tree was made using the PhyML3.0 algorithm implemented through the web-based interface available at http://www.atgc-montpellier.fr/phyml/. The following settings were used:
Amino acid subst. model: LG
Proportion of invariable sites: estimated
Number of subst. rate categs: 8
Gamma distribution parameter: estimated
'Middle' of each rate class: mean
Amino acid equilibrium frequencies: empirical
Optimise tree topology: yes
Tree topology search: NNIs
Starting tree: BioNJ
Add random input tree: no
Optimise branch lengths: yes
Optimise substitution model parameters: yes
The tree is supported by a bootstrap analysis with 100 bootstrap replicates. The final tree, rooted with the lancelet IGFBP sequence, is included in the file IGFBP_PhyML.phb (Newick/Phylip format). The raw output files of the PhyML analysis are included in the following files:
igfbp_ml_121119_phy_stdout.txt
igfbp_ml_121119_phy_phyml_tree.txt
igfbp_ml_121119_phy_phyml_stats.txt
igfbp_ml_121119_phy_phyml_boot_trees.txt
igfbp_ml_121119_phy_phyml_boot_stats
File formats
All phylogenetic data is included in the Newick/Phylip format. For more information on the PhyML output files and data formats, see http://www.atgc-montpellier.fr/download/papers/phyml_manual_2009.pdf.
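As a small illustration of the Newick format used by the tree files above, the following sketch lists the leaf (sequence) names in a Newick string using only the Python standard library. It assumes simple, unquoted labels. The example tree and its labels are made up for illustration and are not taken from the deposited files. For real analyses a dedicated phylogenetics library would be a safer choice.

```python
import re


def newick_leaf_names(newick):
    """Return leaf labels from a Newick string with unquoted labels.

    A leaf label directly follows "(" or ",". Labels that follow ")"
    are internal node labels (for example bootstrap values) and are
    skipped.
    """
    return re.findall(r"[(,]\s*([^():,;\s]+)", newick)


if __name__ == "__main__":
    # Hypothetical toy tree; 90 is an internal (bootstrap-style) label.
    toy_tree = "((Hsa_X:0.1,Mmu_X:0.2)90:0.3,Bfl_X:0.4);"
    print(newick_leaf_names(toy_tree))
```

To apply this to a tree file, its contents would be read into a string first, for example with `open(path).read()`.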
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The following repository holds all of the analysis performed for the manuscript titled "X-ray Lightsheet Microscopy: 3-Dimensional Localization of Non-Blinking Scintillating Nanoparticles with Nanometer Accuracy". The results were acquired with the "Multi-Orientation MAXWELL" software, hosted on the following GitHub repository: https://github.com/SierraD/Multi-Orientation-Maxwell
The files include ThunderSTORM [1] results and protocol files obtained from analyzing the TIFF images acquired via MAXWELL imaging for the sample "Slide 2-8" of microspheres coated with scintillating nanoparticles. The results files are CSV files, output directly by the ThunderSTORM software used to acquire the super-resolution localizations. The ThunderSTORM parameters are found in the protocol text files. The parameters were left at their defaults, except for the pixel size, which was set to 1 to deal with the anisotropic pixel size, and the fitting method, which was set to Maximum Likelihood Estimation to maximize the accuracy.
[1] Ovesný, M. et al. ThunderSTORM: a comprehensive ImageJ plug-in for PALM and STORM data analysis and super-resolution imaging. Bioinformatics, Applications Note, 30 (16), 2389-2390 (2014).
Please see the following GitHub repository for software and documentation for the Multi-Orientation MAXWELL analysis method: https://github.com/SierraD/Multi-Orientation-Maxwell/
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We review behavioural change models (BCM) for infectious disease transmission in humans. Following the Cochrane collaboration guidelines and the PRISMA statement, our systematic search and selection yielded 178 papers covering the period 2010–2015. We observe an increasing trend in published BCMs, frequently coupled to (re)emergence events, and propose a categorization by distinguishing how information translates into preventive actions. Behaviour is usually captured by introducing information as a dynamic parameter (76/178) or by introducing an economic objective function, either with (26/178) or without (37/178) imitation. Approaches using information thresholds (29/178) and exogenous behaviour formation (16/178) are also popular. We further classify according to disease, prevention measure, transmission model (with 81/178 population, 6/178 metapopulation and 91/178 individual-level models) and the way prevention impacts transmission. We highlight the minority (15%) of studies that use any real-life data for parametrization or validation and note that BCMs increasingly use social media data and generally incorporate multiple sources of information (16/178), multiple types of information (17/178) or both (9/178). We conclude that individual-level models are increasingly used and useful to model behaviour changes. Despite recent advancements, we remain concerned that most models are purely theoretical and lack representative data and a validation process.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
June 2023 Version
This archive contains materials (datasets, exercises, slides, etc.) used for the Introduction to bulk RNAseq analysis workshop taught at the University of Copenhagen by the Center for Health Data Science (HeaDS). The course repo can be found on GitHub:
Assignments.zip contains exercises for the preprocessing part of the course, such as FastQC and MultiQC examples of bulk RNAseq experiments.
Data.zip contains count matrices (both traditional counts and salmon pseudocounts), as well as sample metadata (samplesheet.csv) and backup results from the preprocessing pipeline.
Notes.zip contains supplementary materials such as extra pdfs for more information on bulk RNAseq technology.
Slides.zip contains all the slides used in the workshop.
Raw_reads.zip contains the raw reads from the bulk RNAseq experiment (10.1016/j.celrep.2014.10.054) used in this course.
Theoretical work suggests that sexual conflict should promote the maintenance of genetic diversity through the opposing directions of selection on males and females. If such conflict is pervasive, it could potentially lead to genomic heterogeneity in levels of genetic diversity, an idea that so far has not been empirically tested on a genome-wide scale. We used large-scale population genomic and transcriptomic data from the collared flycatcher (Ficedula albicollis) to analyse how sexual conflict, for which we use sex-biased gene expression as a proxy, relates to genetic variability. Here, we demonstrate that the extent of sex-biased gene expression of both male-biased and female-biased genes is significantly correlated with levels of nucleotide diversity in gene sequences and that this correlation also extends to diversity levels in intergenic DNA and introns. We find signatures of balancing selection in sex-biased genes but also note that relaxed purifying selection could potential...
GlycoSuiteDB is a curated database of carbohydrate (glycan) structures sourced from published material. This entry corresponds to a catalogue of structures reported in the publication: A new O-glycosidically linked tri-hexosamine core structure in sheep gastric mucin: a preliminary note.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
SFLD (Structure-Function Linkage Database) is a hierarchical classification of enzymes that relates specific sequence-structure features to specific chemical capabilities.